Message ID | 20240608172147.2779890-1-howardchu95@gmail.com (mailing list archive) |
---|---|
State | Handled Elsewhere |
Headers | show |
Series | perf trace: Fix syscall untraceable bug | expand |
> This is a bug found when implementing pretty-printing for the > landlock_add_rule system call, I decided to send this patch separately > because this is a serious bug that should be fixed fast. … Would you like to add the tag “Fixes” accordingly? Regards, Markus
On Sat, Jun 8, 2024 at 10:21 AM Howard Chu <howardchu95@gmail.com> wrote: > > This is a bug found when implementing pretty-printing for the > landlock_add_rule system call, I decided to send this patch separately > because this is a serious bug that should be fixed fast. > > I wrote a test program to do landlock_add_rule syscall in a loop, > yet perf trace -e landlock_add_rule freezes, giving no output. > > This bug is introduced by the false understanding of the variable "key" > below: > ``` > for (key = 0; key < trace->sctbl->syscalls.nr_entries; ++key) { > struct syscall *sc = trace__syscall_info(trace, NULL, key); > ... > } > ``` > The code above seems right at the beginning, but when looking at > syscalltbl.c, I found these lines: > > ``` > for (i = 0; i <= syscalltbl_native_max_id; ++i) > if (syscalltbl_native[i]) > ++nr_entries; > > entries = tbl->syscalls.entries = malloc(sizeof(struct syscall) * nr_entries); > ... > > for (i = 0, j = 0; i <= syscalltbl_native_max_id; ++i) { > if (syscalltbl_native[i]) { > entries[j].name = syscalltbl_native[i]; > entries[j].id = i; > ++j; > } > } > ``` > > meaning the key is merely an index to traverse the syscall table, > instead of the actual syscall id for this particular syscall. Thanks Howard, I'm not following this. Doesn't it make sense to use the syscall number as its id? > > > So if one uses key to do trace__syscall_info(trace, NULL, key), because > key only goes up to trace->sctbl->syscalls.nr_entries, for example, on > my X86_64 machine, this number is 373, it will end up neglecting all > the rest of the syscall, in my case, everything after `rseq`, because > the traversal will stop at 373, and `rseq` is the last syscall whose id > is lower than 373 > > in tools/perf/arch/x86/include/generated/asm/syscalls_64.c: > ``` > ... > [334] = "rseq", > [424] = "pidfd_send_signal", > ... > ``` > > The reason why the key is scrambled but perf trace works well is that > key is used in trace__syscall_info(trace, NULL, key) to do > trace->syscalls.table[id], this makes sure that the struct syscall returned > actually has an id the same value as key, making the later bpf_prog > matching all correct. Could we create a test for this? We have tests that list all perf events and then running a perf command on them. It wouldn't be possible to guarantee output. > > After fixing this bug, I can do perf trace on 38 more syscalls, and > because more syscalls are visible, we get 8 more syscalls that can be > augmented. > > before: > > perf $ perf trace -vv --max-events=1 |& grep Reusing > Reusing "open" BPF sys_enter augmenter for "stat" > Reusing "open" BPF sys_enter augmenter for "lstat" > Reusing "open" BPF sys_enter augmenter for "access" > Reusing "connect" BPF sys_enter augmenter for "accept" > Reusing "sendto" BPF sys_enter augmenter for "recvfrom" > Reusing "connect" BPF sys_enter augmenter for "bind" > Reusing "connect" BPF sys_enter augmenter for "getsockname" > Reusing "connect" BPF sys_enter augmenter for "getpeername" > Reusing "open" BPF sys_enter augmenter for "execve" > Reusing "open" BPF sys_enter augmenter for "truncate" > Reusing "open" BPF sys_enter augmenter for "chdir" > Reusing "open" BPF sys_enter augmenter for "mkdir" > Reusing "open" BPF sys_enter augmenter for "rmdir" > Reusing "open" BPF sys_enter augmenter for "creat" > Reusing "open" BPF sys_enter augmenter for "link" > Reusing "open" BPF sys_enter augmenter for "unlink" > Reusing "open" BPF sys_enter augmenter for "symlink" > Reusing "open" BPF sys_enter augmenter for "readlink" > Reusing "open" BPF sys_enter augmenter for "chmod" > Reusing "open" BPF sys_enter augmenter for "chown" > Reusing "open" BPF sys_enter augmenter for "lchown" > Reusing "open" BPF sys_enter augmenter for "mknod" > Reusing "open" BPF sys_enter augmenter for "statfs" > Reusing "open" BPF sys_enter augmenter for "pivot_root" > Reusing "open" BPF sys_enter augmenter for "chroot" > Reusing "open" BPF sys_enter augmenter for "acct" > Reusing "open" BPF sys_enter augmenter for "swapon" > Reusing "open" BPF sys_enter augmenter for "swapoff" > Reusing "open" BPF sys_enter augmenter for "delete_module" > Reusing "open" BPF sys_enter augmenter for "setxattr" > Reusing "open" BPF sys_enter augmenter for "lsetxattr" > Reusing "openat" BPF sys_enter augmenter for "fsetxattr" > Reusing "open" BPF sys_enter augmenter for "getxattr" > Reusing "open" BPF sys_enter augmenter for "lgetxattr" > Reusing "openat" BPF sys_enter augmenter for "fgetxattr" > Reusing "open" BPF sys_enter augmenter for "listxattr" > Reusing "open" BPF sys_enter augmenter for "llistxattr" > Reusing "open" BPF sys_enter augmenter for "removexattr" > Reusing "open" BPF sys_enter augmenter for "lremovexattr" > Reusing "fsetxattr" BPF sys_enter augmenter for "fremovexattr" > Reusing "open" BPF sys_enter augmenter for "mq_open" > Reusing "open" BPF sys_enter augmenter for "mq_unlink" > Reusing "fsetxattr" BPF sys_enter augmenter for "add_key" > Reusing "fremovexattr" BPF sys_enter augmenter for "request_key" > Reusing "fremovexattr" BPF sys_enter augmenter for "inotify_add_watch" > Reusing "fremovexattr" BPF sys_enter augmenter for "mkdirat" > Reusing "fremovexattr" BPF sys_enter augmenter for "mknodat" > Reusing "fremovexattr" BPF sys_enter augmenter for "fchownat" > Reusing "fremovexattr" BPF sys_enter augmenter for "futimesat" > Reusing "fremovexattr" BPF sys_enter augmenter for "newfstatat" > Reusing "fremovexattr" BPF sys_enter augmenter for "unlinkat" > Reusing "fremovexattr" BPF sys_enter augmenter for "linkat" > Reusing "open" BPF sys_enter augmenter for "symlinkat" > Reusing "fremovexattr" BPF sys_enter augmenter for "readlinkat" > Reusing "fremovexattr" BPF sys_enter augmenter for "fchmodat" > Reusing "fremovexattr" BPF sys_enter augmenter for "faccessat" > Reusing "fremovexattr" BPF sys_enter augmenter for "utimensat" > Reusing "connect" BPF sys_enter augmenter for "accept4" > Reusing "fremovexattr" BPF sys_enter augmenter for "name_to_handle_at" > Reusing "fremovexattr" BPF sys_enter augmenter for "renameat2" > Reusing "open" BPF sys_enter augmenter for "memfd_create" > Reusing "fremovexattr" BPF sys_enter augmenter for "execveat" > Reusing "fremovexattr" BPF sys_enter augmenter for "statx" > > after > > perf $ perf trace -vv --max-events=1 |& grep Reusing > Reusing "open" BPF sys_enter augmenter for "stat" > Reusing "open" BPF sys_enter augmenter for "lstat" > Reusing "open" BPF sys_enter augmenter for "access" > Reusing "connect" BPF sys_enter augmenter for "accept" > Reusing "sendto" BPF sys_enter augmenter for "recvfrom" > Reusing "connect" BPF sys_enter augmenter for "bind" > Reusing "connect" BPF sys_enter augmenter for "getsockname" > Reusing "connect" BPF sys_enter augmenter for "getpeername" > Reusing "open" BPF sys_enter augmenter for "execve" > Reusing "open" BPF sys_enter augmenter for "truncate" > Reusing "open" BPF sys_enter augmenter for "chdir" > Reusing "open" BPF sys_enter augmenter for "mkdir" > Reusing "open" BPF sys_enter augmenter for "rmdir" > Reusing "open" BPF sys_enter augmenter for "creat" > Reusing "open" BPF sys_enter augmenter for "link" > Reusing "open" BPF sys_enter augmenter for "unlink" > Reusing "open" BPF sys_enter augmenter for "symlink" > Reusing "open" BPF sys_enter augmenter for "readlink" > Reusing "open" BPF sys_enter augmenter for "chmod" > Reusing "open" BPF sys_enter augmenter for "chown" > Reusing "open" BPF sys_enter augmenter for "lchown" > Reusing "open" BPF sys_enter augmenter for "mknod" > Reusing "open" BPF sys_enter augmenter for "statfs" > Reusing "open" BPF sys_enter augmenter for "pivot_root" > Reusing "open" BPF sys_enter augmenter for "chroot" > Reusing "open" BPF sys_enter augmenter for "acct" > Reusing "open" BPF sys_enter augmenter for "swapon" > Reusing "open" BPF sys_enter augmenter for "swapoff" > Reusing "open" BPF sys_enter augmenter for "delete_module" > Reusing "open" BPF sys_enter augmenter for "setxattr" > Reusing "open" BPF sys_enter augmenter for "lsetxattr" > Reusing "openat" BPF sys_enter augmenter for "fsetxattr" > Reusing "open" BPF sys_enter augmenter for "getxattr" > Reusing "open" BPF sys_enter augmenter for "lgetxattr" > Reusing "openat" BPF sys_enter augmenter for "fgetxattr" > Reusing "open" BPF sys_enter augmenter for "listxattr" > Reusing "open" BPF sys_enter augmenter for "llistxattr" > Reusing "open" BPF sys_enter augmenter for "removexattr" > Reusing "open" BPF sys_enter augmenter for "lremovexattr" > Reusing "fsetxattr" BPF sys_enter augmenter for "fremovexattr" > Reusing "open" BPF sys_enter augmenter for "mq_open" > Reusing "open" BPF sys_enter augmenter for "mq_unlink" > Reusing "fsetxattr" BPF sys_enter augmenter for "add_key" > Reusing "fremovexattr" BPF sys_enter augmenter for "request_key" > Reusing "fremovexattr" BPF sys_enter augmenter for "inotify_add_watch" > Reusing "fremovexattr" BPF sys_enter augmenter for "mkdirat" > Reusing "fremovexattr" BPF sys_enter augmenter for "mknodat" > Reusing "fremovexattr" BPF sys_enter augmenter for "fchownat" > Reusing "fremovexattr" BPF sys_enter augmenter for "futimesat" > Reusing "fremovexattr" BPF sys_enter augmenter for "newfstatat" > Reusing "fremovexattr" BPF sys_enter augmenter for "unlinkat" > Reusing "fremovexattr" BPF sys_enter augmenter for "linkat" > Reusing "open" BPF sys_enter augmenter for "symlinkat" > Reusing "fremovexattr" BPF sys_enter augmenter for "readlinkat" > Reusing "fremovexattr" BPF sys_enter augmenter for "fchmodat" > Reusing "fremovexattr" BPF sys_enter augmenter for "faccessat" > Reusing "fremovexattr" BPF sys_enter augmenter for "utimensat" > Reusing "connect" BPF sys_enter augmenter for "accept4" > Reusing "fremovexattr" BPF sys_enter augmenter for "name_to_handle_at" > Reusing "fremovexattr" BPF sys_enter augmenter for "renameat2" > Reusing "open" BPF sys_enter augmenter for "memfd_create" > Reusing "fremovexattr" BPF sys_enter augmenter for "execveat" > Reusing "fremovexattr" BPF sys_enter augmenter for "statx" > > TL;DR: > > These are the new syscalls that can be augmented > Reusing "openat" BPF sys_enter augmenter for "open_tree" > Reusing "openat" BPF sys_enter augmenter for "openat2" > Reusing "openat" BPF sys_enter augmenter for "mount_setattr" > Reusing "openat" BPF sys_enter augmenter for "move_mount" > Reusing "open" BPF sys_enter augmenter for "fsopen" > Reusing "openat" BPF sys_enter augmenter for "fspick" > Reusing "openat" BPF sys_enter augmenter for "faccessat2" > Reusing "openat" BPF sys_enter augmenter for "fchmodat2" > > as for the perf trace output: > > before > > perf $ perf trace -e faccessat2 --max-events=1 > [no output] > > after > > perf $ ./perf trace -e faccessat2 --max-events=1 > 0.000 ( 0.037 ms): waybar/958 faccessat2(dfd: 40, filename: "uevent") = 0 > > P.S. The reason why this bug was not found in the past five years is > probably because it only happens to the newer syscalls whose id is > greater, for instance, faccessat2 of id 439, which not a lot of people > care about when using perf trace. > > Signed-off-by: Howard Chu <howardchu95@gmail.com> > --- > tools/perf/builtin-trace.c | 32 +++++++++++++++++++++----------- > tools/perf/util/syscalltbl.c | 21 +++++++++------------ > tools/perf/util/syscalltbl.h | 5 +++++ > 3 files changed, 35 insertions(+), 23 deletions(-) > > diff --git a/tools/perf/builtin-trace.c b/tools/perf/builtin-trace.c > index c42bc608954e..5cbe1748911d 100644 > --- a/tools/perf/builtin-trace.c > +++ b/tools/perf/builtin-trace.c > @@ -3354,7 +3354,8 @@ static int trace__bpf_prog_sys_exit_fd(struct trace *trace, int id) > static struct bpf_program *trace__find_usable_bpf_prog_entry(struct trace *trace, struct syscall *sc) > { > struct tep_format_field *field, *candidate_field; > - int id; > + struct __syscall *scs = trace->sctbl->syscalls.entries; > + int id, _id; > > /* > * We're only interested in syscalls that have a pointer: > @@ -3368,10 +3369,13 @@ static struct bpf_program *trace__find_usable_bpf_prog_entry(struct trace *trace > > try_to_find_pair: > for (id = 0; id < trace->sctbl->syscalls.nr_entries; ++id) { > - struct syscall *pair = trace__syscall_info(trace, NULL, id); > + struct syscall *pair; > struct bpf_program *pair_prog; > bool is_candidate = false; > > + _id = scs[id].id; > + pair = trace__syscall_info(trace, NULL, _id); > + > if (pair == NULL || pair == sc || > pair->bpf_prog.sys_enter == trace->skel->progs.syscall_unaugmented) > continue; > @@ -3456,23 +3460,26 @@ static int trace__init_syscalls_bpf_prog_array_maps(struct trace *trace) > { > int map_enter_fd = bpf_map__fd(trace->skel->maps.syscalls_sys_enter); > int map_exit_fd = bpf_map__fd(trace->skel->maps.syscalls_sys_exit); > - int err = 0, key; > + int err = 0, key, id; > + struct __syscall *scs = trace->sctbl->syscalls.entries; > > for (key = 0; key < trace->sctbl->syscalls.nr_entries; ++key) { > int prog_fd; > > - if (!trace__syscall_enabled(trace, key)) > + id = scs[key].id; > + > + if (!trace__syscall_enabled(trace, id)) > continue; > > - trace__init_syscall_bpf_progs(trace, key); > + trace__init_syscall_bpf_progs(trace, id); > > // It'll get at least the "!raw_syscalls:unaugmented" > - prog_fd = trace__bpf_prog_sys_enter_fd(trace, key); > - err = bpf_map_update_elem(map_enter_fd, &key, &prog_fd, BPF_ANY); > + prog_fd = trace__bpf_prog_sys_enter_fd(trace, id); > + err = bpf_map_update_elem(map_enter_fd, &id, &prog_fd, BPF_ANY); > if (err) > break; > - prog_fd = trace__bpf_prog_sys_exit_fd(trace, key); > - err = bpf_map_update_elem(map_exit_fd, &key, &prog_fd, BPF_ANY); > + prog_fd = trace__bpf_prog_sys_exit_fd(trace, id); > + err = bpf_map_update_elem(map_exit_fd, &id, &prog_fd, BPF_ANY); > if (err) > break; > } > @@ -3506,10 +3513,13 @@ static int trace__init_syscalls_bpf_prog_array_maps(struct trace *trace) > * array tail call, then that one will be used. > */ > for (key = 0; key < trace->sctbl->syscalls.nr_entries; ++key) { > - struct syscall *sc = trace__syscall_info(trace, NULL, key); > + struct syscall *sc; > struct bpf_program *pair_prog; > int prog_fd; > > + id = scs[key].id; > + sc = trace__syscall_info(trace, NULL, id); > + > if (sc == NULL || sc->bpf_prog.sys_enter == NULL) > continue; > > @@ -3535,7 +3545,7 @@ static int trace__init_syscalls_bpf_prog_array_maps(struct trace *trace) > * with the fd for the program we're reusing: > */ > prog_fd = bpf_program__fd(sc->bpf_prog.sys_enter); > - err = bpf_map_update_elem(map_enter_fd, &key, &prog_fd, BPF_ANY); > + err = bpf_map_update_elem(map_enter_fd, &id, &prog_fd, BPF_ANY); > if (err) > break; > } > diff --git a/tools/perf/util/syscalltbl.c b/tools/perf/util/syscalltbl.c > index 63be7b58761d..16aa886c40f0 100644 > --- a/tools/perf/util/syscalltbl.c > +++ b/tools/perf/util/syscalltbl.c > @@ -44,22 +44,17 @@ const int syscalltbl_native_max_id = SYSCALLTBL_LOONGARCH_MAX_ID; > static const char *const *syscalltbl_native = syscalltbl_loongarch; > #endif > > -struct syscall { > - int id; > - const char *name; > -}; > - > static int syscallcmpname(const void *vkey, const void *ventry) > { > const char *key = vkey; > - const struct syscall *entry = ventry; > + const struct __syscall *entry = ventry; > > return strcmp(key, entry->name); > } > > static int syscallcmp(const void *va, const void *vb) > { > - const struct syscall *a = va, *b = vb; > + const struct __syscall *a = va, *b = vb; > > return strcmp(a->name, b->name); > } > @@ -67,13 +62,14 @@ static int syscallcmp(const void *va, const void *vb) > static int syscalltbl__init_native(struct syscalltbl *tbl) > { > int nr_entries = 0, i, j; > - struct syscall *entries; > + struct __syscall *entries; > > for (i = 0; i <= syscalltbl_native_max_id; ++i) > if (syscalltbl_native[i]) > ++nr_entries; > > - entries = tbl->syscalls.entries = malloc(sizeof(struct syscall) * nr_entries); > + entries = tbl->syscalls.entries = malloc(sizeof(struct __syscall) * > + nr_entries); > if (tbl->syscalls.entries == NULL) > return -1; > > @@ -85,7 +81,8 @@ static int syscalltbl__init_native(struct syscalltbl *tbl) > } > } > > - qsort(tbl->syscalls.entries, nr_entries, sizeof(struct syscall), syscallcmp); > + qsort(tbl->syscalls.entries, nr_entries, sizeof(struct __syscall), > + syscallcmp); > tbl->syscalls.nr_entries = nr_entries; > tbl->syscalls.max_id = syscalltbl_native_max_id; > return 0; > @@ -116,7 +113,7 @@ const char *syscalltbl__name(const struct syscalltbl *tbl __maybe_unused, int id > > int syscalltbl__id(struct syscalltbl *tbl, const char *name) > { > - struct syscall *sc = bsearch(name, tbl->syscalls.entries, > + struct __syscall *sc = bsearch(name, tbl->syscalls.entries, > tbl->syscalls.nr_entries, sizeof(*sc), > syscallcmpname); > > @@ -126,7 +123,7 @@ int syscalltbl__id(struct syscalltbl *tbl, const char *name) > int syscalltbl__strglobmatch_next(struct syscalltbl *tbl, const char *syscall_glob, int *idx) > { > int i; > - struct syscall *syscalls = tbl->syscalls.entries; > + struct __syscall *syscalls = tbl->syscalls.entries; > > for (i = *idx + 1; i < tbl->syscalls.nr_entries; ++i) { > if (strglobmatch(syscalls[i].name, syscall_glob)) { > diff --git a/tools/perf/util/syscalltbl.h b/tools/perf/util/syscalltbl.h > index a41d2ca9e4ae..6e93a0874c40 100644 > --- a/tools/perf/util/syscalltbl.h > +++ b/tools/perf/util/syscalltbl.h > @@ -2,6 +2,11 @@ > #ifndef __PERF_SYSCALLTBL_H > #define __PERF_SYSCALLTBL_H > It'd be nice to document the struct with examples that explain the confusion that's happened and is fixed here. Thanks, Ian > +struct __syscall { > + int id; > + const char *name; > +}; > + > struct syscalltbl { > int audit_machine; > struct { > -- > 2.45.2 >
Hello Ian, Thanks for reviewing this patch. On Tue, Jun 11, 2024 at 5:33 AM Ian Rogers <irogers@google.com> wrote: > > On Sat, Jun 8, 2024 at 10:21 AM Howard Chu <howardchu95@gmail.com> wrote: > > > > This is a bug found when implementing pretty-printing for the > > landlock_add_rule system call, I decided to send this patch separately > > because this is a serious bug that should be fixed fast. > > > > I wrote a test program to do landlock_add_rule syscall in a loop, > > yet perf trace -e landlock_add_rule freezes, giving no output. > > > > This bug is introduced by the false understanding of the variable "key" > > below: > > ``` > > ``` > > The code above seems right at the beginning, but when looking at > > syscalltbl.c, I found these lines: > > > > ``` > > > > entries = tbl->syscalls.entries = malloc(sizeof(struct syscall) * nr_entries); > > ... > > > > ``` > > > > meaning the key is merely an index to traverse the syscall table, > > instead of the actual syscall id for this particular syscall. > > > Thanks Howard, I'm not following this. Doesn't it make sense to use > the syscall number as its id? Yes, it makes perfect sense to use the syscall number as its id. But here: > > for (key = 0; key < trace->sctbl->syscalls.nr_entries; ++key) { > > struct syscall *sc = trace__syscall_info(trace, NULL, key); > > ... > > } sctbl->syscalls.nr_entries is not the upper bound of those syscall ids, it is how many syscalls that are native to the system, that we collected. We collect them this way(in util/syscalltbl.c): > > for (i = 0; i <= syscalltbl_native_max_id; ++i) > > if (syscalltbl_native[i]) > > ++nr_entries; > > for (i = 0, j = 0; i <= syscalltbl_native_max_id; ++i) { > > if (syscalltbl_native[i]) { > > entries[j].name = syscalltbl_native[i]; > > entries[j].id = i; > > ++j; > > } > > } As shown above, we only collect syscalls that are native to the architecture. ``` #if defined(__x86_64__) #include <asm/syscalls_64.c> const int syscalltbl_native_max_id = SYSCALLTBL_x86_64_MAX_ID; static const char *const *syscalltbl_native = syscalltbl_x86_64; ``` So when collecting these entries of system calls, we neglect some of the system calls that are not available on the system. How many system calls are available? nr_entries of them are available. But the real upper bound of the syscall id is syscalls.max_id. ``` tbl->syscalls.max_id = syscalltbl_native_max_id; ``` here is one example: (think about s1 as syscall of id1, syscall s2 is not available on the system) s1 s2 s3 s4 (max_id: 4) s1 __ s3 s4 (nr_entries: 3) If one uses nr_entries as the upper bound, we'll end up traversing only the ids of s1, s2, s3 (of course, s2 doesn't exist so if you do `perf trace -v` you will see the error message), and then we will lose the precious and fully traceable s4. What will happen is, in trace__init_syscalls_bpf_prog_array_maps(), if you stop at nr_entries when traversing, you won't register bpf's sys_enter program for some system calls, causing the bpf_tail_call to fail. And if you fail, you fall through and return a 0, well, I have to admit that I am still trying to understand this, but returning 0 means no sample, but if the bpf_tail_call succeeds, syscall_unaugmented() is called, it returns a 1, and there will be samples. ``` SEC("tp/raw_syscalls/sys_enter") int syscall_unaugmented(struct syscall_enter_args *args) { return 1; } ``` ``` bpf_tail_call(args, &syscalls_sys_enter, augmented_args->args.syscall_nr); // If not found on the PROG_ARRAY syscalls map, then we're filtering it: return 0; ``` > > > > > > > So if one uses key to do trace__syscall_info(trace, NULL, key), because > > key only goes up to trace->sctbl->syscalls.nr_entries, for example, on > > my X86_64 machine, this number is 373, it will end up neglecting all > > the rest of the syscall, in my case, everything after `rseq`, because > > the traversal will stop at 373, and `rseq` is the last syscall whose id > > is lower than 373 > > > > in tools/perf/arch/x86/include/generated/asm/syscalls_64.c: > > ``` > > ... > > [334] = "rseq", > > [424] = "pidfd_send_signal", > > ... > > ``` > > > > The reason why the key is scrambled but perf trace works well is that > > key is used in trace__syscall_info(trace, NULL, key) to do > > trace->syscalls.table[id], this makes sure that the struct syscall returned > > actually has an id the same value as key, making the later bpf_prog > > matching all correct. > > > Could we create a test for this? We have tests that list all perf > events and then running a perf command on them. It wouldn't be > possible to guarantee output. Sure, you can't guarantee the output because syscalls are architecture and kernel version specific. I am running perf on kernel version 6.9.3, and my processor is X86_64. I tried to write the test script in perl and shell script but failed, so here is a quick test in c if you want to test it out. This is just doing 3 system calls though... ``` #define _GNU_SOURCE #include <unistd.h> #include <linux/landlock.h> #include <sys/syscall.h> #include <stdio.h> #include <stdlib.h> void ll() { syscall(SYS_landlock_add_rule, 0, LANDLOCK_RULE_NET_PORT, NULL, 0); syscall(SYS_landlock_add_rule, 0, LANDLOCK_RULE_PATH_BENEATH, NULL, 0); } void opat2() { syscall(SYS_openat2, 114, "openat2 path", NULL, 0); } void faat2() { syscall(SYS_faccessat2, 0, "faccessat2 path", 0, 0); } int main() { printf("testing syscalls..."); while (1) { ll(); opat2(); faat2(); sleep(1); } return 0; } ``` Please save it as test.c, compile it with ``` gcc test.c -o test ``` Please compile the patched and original perf with BUILD_BPF_SKEL=1. The behaviour of BUILD_BPF_SKEL=0 is correct. Now please run it with this command below: ``` perf trace -e faccessat2,openat2,landlock_add_rule -- ./test ``` Or just run ./test, and open up a second shell, do `perf trace -e faccessat2,openat2,landlock_add_rule` system wide, or do a `ps aux | grep test` and specify -p <pid-of-the-test-program> If I run the test program, trace it with the original perf, there is not a single trace message on all of these three system calls. Because in the syscall table on my machine: ``` [437] = "openat2", [439] = "faccessat2", [445] = "landlock_add_rule", ``` these system calls' ids are bigger than the syscalls.nr_entries(on my machine, 373), they never got registered. I'm not 100% sure that you will get the same output as mine, so please tell me if there's any difference when testing. I'll write up a test covering all system calls soon. Here's another aspect of this, the simplest fix is just changing: > > for (key = 0; key < trace->sctbl->syscalls.nr_entries; ++key) { to > > for (key = 0; key < trace->sctbl->syscalls.max_id; ++key) { no problem. But when you run `perf trace -e landlock_add_rule -v`, you get: ``` perf $ ./perf trace -e landlock_add_rule -v Using CPUID GenuineIntel-6-3D-4 Problems reading syscall 134: 2 (No such file or directory)(uselib) information Problems reading syscall 156: 2 (No such file or directory)(_sysctl) information Problems reading syscall 174: 2 (No such file or directory)(create_module) information Problems reading syscall 177: 2 (No such file or directory)(get_kernel_syms) information Problems reading syscall 178: 2 (No such file or directory)(query_module) information Problems reading syscall 180: 2 (No such file or directory)(nfsservctl) information Problems reading syscall 181: 2 (No such file or directory)(getpmsg) information Problems reading syscall 182: 2 (No such file or directory)(putpmsg) information Problems reading syscall 183: 2 (No such file or directory)(afs_syscall) information Problems reading syscall 184: 2 (No such file or directory)(tuxcall) information Problems reading syscall 185: 2 (No such file or directory)(security) information Problems reading syscall 205: 2 (No such file or directory)(set_thread_area) information Problems reading syscall 211: 2 (No such file or directory)(get_thread_area) information Problems reading syscall 212: 2 (No such file or directory)(lookup_dcookie) information Problems reading syscall 214: 2 (No such file or directory)(epoll_ctl_old) information Problems reading syscall 215: 2 (No such file or directory)(epoll_wait_old) information Problems reading syscall 236: 2 (No such file or directory)(vserver) information Problems reading syscall 335: 17 (File exists) information Problems reading syscall 336: 17 (File exists) information Problems reading syscall 337: 17 (File exists) information Problems reading syscall 338: 17 (File exists) information Problems reading syscall 339: 17 (File exists) information Problems reading syscall 340: 17 (File exists) information Problems reading syscall 341: 17 (File exists) information Problems reading syscall 342: 17 (File exists) information Problems reading syscall 343: 17 (File exists) information Problems reading syscall 344: 17 (File exists) information Problems reading syscall 345: 17 (File exists) information Problems reading syscall 346: 17 (File exists) information Problems reading syscall 347: 17 (File exists) information Problems reading syscall 348: 17 (File exists) information Problems reading syscall 349: 17 (File exists) information Problems reading syscall 350: 17 (File exists) information Problems reading syscall 351: 17 (File exists) information Problems reading syscall 352: 17 (File exists) information Problems reading syscall 353: 17 (File exists) information Problems reading syscall 354: 17 (File exists) information Problems reading syscall 355: 17 (File exists) information Problems reading syscall 356: 17 (File exists) information Problems reading syscall 357: 17 (File exists) information Problems reading syscall 358: 17 (File exists) information Problems reading syscall 359: 17 (File exists) information Problems reading syscall 360: 17 (File exists) information Problems reading syscall 361: 17 (File exists) information Problems reading syscall 362: 17 (File exists) information Problems reading syscall 363: 17 (File exists) information Problems reading syscall 364: 17 (File exists) information Problems reading syscall 365: 17 (File exists) information Problems reading syscall 366: 17 (File exists) information Problems reading syscall 367: 17 (File exists) information Problems reading syscall 368: 17 (File exists) information Problems reading syscall 369: 17 (File exists) information Problems reading syscall 370: 17 (File exists) information Problems reading syscall 371: 17 (File exists) information Problems reading syscall 372: 17 (File exists) information event qualifier tracepoint filter: (common_pid != 1115781) && (id == 445) mmap size 528384B ``` The registration of syscall from 335 to 372 is totally unnecessary. reminder: ``` [334] = "rseq", [424] = "pidfd_send_signal", ``` For example, syscall of id 335 is not available on my X86_64 machine, why even bother reading this syscall's info using trace__syscall_info(). So with this patch, the verbose message will become: ``` perf $ ./perf trace -e landlock_add_rule -v Using CPUID GenuineIntel-6-3D-4 vmlinux BTF loaded Problems reading syscall 156: 2 (No such file or directory)(_sysctl) information Problems reading syscall 183: 2 (No such file or directory)(afs_syscall) information Problems reading syscall 174: 2 (No such file or directory)(create_module) information Problems reading syscall 214: 2 (No such file or directory)(epoll_ctl_old) information Problems reading syscall 215: 2 (No such file or directory)(epoll_wait_old) information Problems reading syscall 177: 2 (No such file or directory)(get_kernel_syms) information Problems reading syscall 211: 2 (No such file or directory)(get_thread_area) information Problems reading syscall 181: 2 (No such file or directory)(getpmsg) information Problems reading syscall 156: 22 (Invalid argument)(_sysctl) information Problems reading syscall 183: 22 (Invalid argument)(afs_syscall) information Problems reading syscall 174: 22 (Invalid argument)(create_module) information Problems reading syscall 214: 22 (Invalid argument)(epoll_ctl_old) information Problems reading syscall 215: 22 (Invalid argument)(epoll_wait_old) information Problems reading syscall 177: 22 (Invalid argument)(get_kernel_syms) information Problems reading syscall 211: 22 (Invalid argument)(get_thread_area) information Problems reading syscall 181: 22 (Invalid argument)(getpmsg) information Problems reading syscall 212: 2 (No such file or directory)(lookup_dcookie) information Problems reading syscall 180: 2 (No such file or directory)(nfsservctl) information Problems reading syscall 182: 2 (No such file or directory)(putpmsg) information Problems reading syscall 178: 2 (No such file or directory)(query_module) information Problems reading syscall 185: 2 (No such file or directory)(security) information Problems reading syscall 205: 2 (No such file or directory)(set_thread_area) information Problems reading syscall 184: 2 (No such file or directory)(tuxcall) information Problems reading syscall 134: 2 (No such file or directory)(uselib) information Problems reading syscall 236: 2 (No such file or directory)(vserver) information Problems reading syscall 212: 22 (Invalid argument)(lookup_dcookie) information Problems reading syscall 180: 22 (Invalid argument)(nfsservctl) information Problems reading syscall 182: 22 (Invalid argument)(putpmsg) information Problems reading syscall 178: 22 (Invalid argument)(query_module) information Problems reading syscall 185: 22 (Invalid argument)(security) information Problems reading syscall 205: 22 (Invalid argument)(set_thread_area) information Problems reading syscall 184: 22 (Invalid argument)(tuxcall) information Problems reading syscall 134: 22 (Invalid argument)(uselib) information Problems reading syscall 236: 22 (Invalid argument)(vserver) information event qualifier tracepoint filter: (common_pid != 1129314) && (id == 445) mmap size 528384B ``` Now only when reading native syscalls' information, if there's a problem, there is an error message. I think the desired behavior in trace__init_syscalls_bpf_prog_array_maps() is that, when a system call is not augmentable, you assign syscall_unaugmented() to it(via bpf prog array) But what happens is, when tracing landlock_add_rule, this program never runs. Because there is never a bpf program registration for landlock_add_rule, the key never reaches that high. With this patch, it does. So in util/bpf_skel/augmented_raw_syscalls.bpf.c, bpf_tail_call() falls through, bpf program returns 0, there will be no samples. But if it successfully calls syscall_unaugmented() and returns 1, there will be samples. But then again, if you build perf with BUILD_BPF_SKEL=0, using no bpf functionalities, everything is fine, but if you do want to use bpf augmentation, you have to register the syscall, the key has to go up to max_id, stopping at nr_entries is probably not what we want. > > > > > After fixing this bug, I can do perf trace on 38 more syscalls, and > > because more syscalls are visible, we get 8 more syscalls that can be > > augmented. > > > > before: > > > > perf $ perf trace -vv --max-events=1 |& grep Reusing > > Reusing "open" BPF sys_enter augmenter for "stat" > > Reusing "open" BPF sys_enter augmenter for "lstat" > > Reusing "open" BPF sys_enter augmenter for "access" > > Reusing "connect" BPF sys_enter augmenter for "accept" > > Reusing "sendto" BPF sys_enter augmenter for "recvfrom" > > Reusing "connect" BPF sys_enter augmenter for "bind" > > Reusing "connect" BPF sys_enter augmenter for "getsockname" > > Reusing "connect" BPF sys_enter augmenter for "getpeername" > > Reusing "open" BPF sys_enter augmenter for "execve" > > Reusing "open" BPF sys_enter augmenter for "truncate" > > Reusing "open" BPF sys_enter augmenter for "chdir" > > Reusing "open" BPF sys_enter augmenter for "mkdir" > > Reusing "open" BPF sys_enter augmenter for "rmdir" > > Reusing "open" BPF sys_enter augmenter for "creat" > > Reusing "open" BPF sys_enter augmenter for "link" > > Reusing "open" BPF sys_enter augmenter for "unlink" > > Reusing "open" BPF sys_enter augmenter for "symlink" > > Reusing "open" BPF sys_enter augmenter for "readlink" > > Reusing "open" BPF sys_enter augmenter for "chmod" > > Reusing "open" BPF sys_enter augmenter for "chown" > > Reusing "open" BPF sys_enter augmenter for "lchown" > > Reusing "open" BPF sys_enter augmenter for "mknod" > > Reusing "open" BPF sys_enter augmenter for "statfs" > > Reusing "open" BPF sys_enter augmenter for "pivot_root" > > Reusing "open" BPF sys_enter augmenter for "chroot" > > Reusing "open" BPF sys_enter augmenter for "acct" > > Reusing "open" BPF sys_enter augmenter for "swapon" > > Reusing "open" BPF sys_enter augmenter for "swapoff" > > Reusing "open" BPF sys_enter augmenter for "delete_module" > > Reusing "open" BPF sys_enter augmenter for "setxattr" > > Reusing "open" BPF sys_enter augmenter for "lsetxattr" > > Reusing "openat" BPF sys_enter augmenter for "fsetxattr" > > Reusing "open" BPF sys_enter augmenter for "getxattr" > > Reusing "open" BPF sys_enter augmenter for "lgetxattr" > > Reusing "openat" BPF sys_enter augmenter for "fgetxattr" > > Reusing "open" BPF sys_enter augmenter for "listxattr" > > Reusing "open" BPF sys_enter augmenter for "llistxattr" > > Reusing "open" BPF sys_enter augmenter for "removexattr" > > Reusing "open" BPF sys_enter augmenter for "lremovexattr" > > Reusing "fsetxattr" BPF sys_enter augmenter for "fremovexattr" > > Reusing "open" BPF sys_enter augmenter for "mq_open" > > Reusing "open" BPF sys_enter augmenter for "mq_unlink" > > Reusing "fsetxattr" BPF sys_enter augmenter for "add_key" > > Reusing "fremovexattr" BPF sys_enter augmenter for "request_key" > > Reusing "fremovexattr" BPF sys_enter augmenter for "inotify_add_watch" > > Reusing "fremovexattr" BPF sys_enter augmenter for "mkdirat" > > Reusing "fremovexattr" BPF sys_enter augmenter for "mknodat" > > Reusing "fremovexattr" BPF sys_enter augmenter for "fchownat" > > Reusing "fremovexattr" BPF sys_enter augmenter for "futimesat" > > Reusing "fremovexattr" BPF sys_enter augmenter for "newfstatat" > > Reusing "fremovexattr" BPF sys_enter augmenter for "unlinkat" > > Reusing "fremovexattr" BPF sys_enter augmenter for "linkat" > > Reusing "open" BPF sys_enter augmenter for "symlinkat" > > Reusing "fremovexattr" BPF sys_enter augmenter for "readlinkat" > > Reusing "fremovexattr" BPF sys_enter augmenter for "fchmodat" > > Reusing "fremovexattr" BPF sys_enter augmenter for "faccessat" > > Reusing "fremovexattr" BPF sys_enter augmenter for "utimensat" > > Reusing "connect" BPF sys_enter augmenter for "accept4" > > Reusing "fremovexattr" BPF sys_enter augmenter for "name_to_handle_at" > > Reusing "fremovexattr" BPF sys_enter augmenter for "renameat2" > > Reusing "open" BPF sys_enter augmenter for "memfd_create" > > Reusing "fremovexattr" BPF sys_enter augmenter for "execveat" > > Reusing "fremovexattr" BPF sys_enter augmenter for "statx" > > > > after > > > > perf $ perf trace -vv --max-events=1 |& grep Reusing > > Reusing "open" BPF sys_enter augmenter for "stat" > > Reusing "open" BPF sys_enter augmenter for "lstat" > > Reusing "open" BPF sys_enter augmenter for "access" > > Reusing "connect" BPF sys_enter augmenter for "accept" > > Reusing "sendto" BPF sys_enter augmenter for "recvfrom" > > Reusing "connect" BPF sys_enter augmenter for "bind" > > Reusing "connect" BPF sys_enter augmenter for "getsockname" > > Reusing "connect" BPF sys_enter augmenter for "getpeername" > > Reusing "open" BPF sys_enter augmenter for "execve" > > Reusing "open" BPF sys_enter augmenter for "truncate" > > Reusing "open" BPF sys_enter augmenter for "chdir" > > Reusing "open" BPF sys_enter augmenter for "mkdir" > > Reusing "open" BPF sys_enter augmenter for "rmdir" > > Reusing "open" BPF sys_enter augmenter for "creat" > > Reusing "open" BPF sys_enter augmenter for "link" > > Reusing "open" BPF sys_enter augmenter for "unlink" > > Reusing "open" BPF sys_enter augmenter for "symlink" > > Reusing "open" BPF sys_enter augmenter for "readlink" > > Reusing "open" BPF sys_enter augmenter for "chmod" > > Reusing "open" BPF sys_enter augmenter for "chown" > > Reusing "open" BPF sys_enter augmenter for "lchown" > > Reusing "open" BPF sys_enter augmenter for "mknod" > > Reusing "open" BPF sys_enter augmenter for "statfs" > > Reusing "open" BPF sys_enter augmenter for "pivot_root" > > Reusing "open" BPF sys_enter augmenter for "chroot" > > Reusing "open" BPF sys_enter augmenter for "acct" > > Reusing "open" BPF sys_enter augmenter for "swapon" > > Reusing "open" BPF sys_enter augmenter for "swapoff" > > Reusing "open" BPF sys_enter augmenter for "delete_module" > > Reusing "open" BPF sys_enter augmenter for "setxattr" > > Reusing "open" BPF sys_enter augmenter for "lsetxattr" > > Reusing "openat" BPF sys_enter augmenter for "fsetxattr" > > Reusing "open" BPF sys_enter augmenter for "getxattr" > > Reusing "open" BPF sys_enter augmenter for "lgetxattr" > > Reusing "openat" BPF sys_enter augmenter for "fgetxattr" > > Reusing "open" BPF sys_enter augmenter for "listxattr" > > Reusing "open" BPF sys_enter augmenter for "llistxattr" > > Reusing "open" BPF sys_enter augmenter for "removexattr" > > Reusing "open" BPF sys_enter augmenter for "lremovexattr" > > Reusing "fsetxattr" BPF sys_enter augmenter for "fremovexattr" > > Reusing "open" BPF sys_enter augmenter for "mq_open" > > Reusing "open" BPF sys_enter augmenter for "mq_unlink" > > Reusing "fsetxattr" BPF sys_enter augmenter for "add_key" > > Reusing "fremovexattr" BPF sys_enter augmenter for "request_key" > > Reusing "fremovexattr" BPF sys_enter augmenter for "inotify_add_watch" > > Reusing "fremovexattr" BPF sys_enter augmenter for "mkdirat" > > Reusing "fremovexattr" BPF sys_enter augmenter for "mknodat" > > Reusing "fremovexattr" BPF sys_enter augmenter for "fchownat" > > Reusing "fremovexattr" BPF sys_enter augmenter for "futimesat" > > Reusing "fremovexattr" BPF sys_enter augmenter for "newfstatat" > > Reusing "fremovexattr" BPF sys_enter augmenter for "unlinkat" > > Reusing "fremovexattr" BPF sys_enter augmenter for "linkat" > > Reusing "open" BPF sys_enter augmenter for "symlinkat" > > Reusing "fremovexattr" BPF sys_enter augmenter for "readlinkat" > > Reusing "fremovexattr" BPF sys_enter augmenter for "fchmodat" > > Reusing "fremovexattr" BPF sys_enter augmenter for "faccessat" > > Reusing "fremovexattr" BPF sys_enter augmenter for "utimensat" > > Reusing "connect" BPF sys_enter augmenter for "accept4" > > Reusing "fremovexattr" BPF sys_enter augmenter for "name_to_handle_at" > > Reusing "fremovexattr" BPF sys_enter augmenter for "renameat2" > > Reusing "open" BPF sys_enter augmenter for "memfd_create" > > Reusing "fremovexattr" BPF sys_enter augmenter for "execveat" > > Reusing "fremovexattr" BPF sys_enter augmenter for "statx" > > > > TL;DR: > > > > These are the new syscalls that can be augmented > > Reusing "openat" BPF sys_enter augmenter for "open_tree" > > Reusing "openat" BPF sys_enter augmenter for "openat2" > > Reusing "openat" BPF sys_enter augmenter for "mount_setattr" > > Reusing "openat" BPF sys_enter augmenter for "move_mount" > > Reusing "open" BPF sys_enter augmenter for "fsopen" > > Reusing "openat" BPF sys_enter augmenter for "fspick" > > Reusing "openat" BPF sys_enter augmenter for "faccessat2" > > Reusing "openat" BPF sys_enter augmenter for "fchmodat2" > > > > as for the perf trace output: > > > > before > > > > perf $ perf trace -e faccessat2 --max-events=1 > > [no output] > > > > after > > > > perf $ ./perf trace -e faccessat2 --max-events=1 > > 0.000 ( 0.037 ms): waybar/958 faccessat2(dfd: 40, filename: "uevent") = 0 > > > > P.S. The reason why this bug was not found in the past five years is > > probably because it only happens to the newer syscalls whose id is > > greater, for instance, faccessat2 of id 439, which not a lot of people > > care about when using perf trace. > > > > Signed-off-by: Howard Chu <howardchu95@gmail.com> > > --- > > tools/perf/builtin-trace.c | 32 +++++++++++++++++++++----------- > > tools/perf/util/syscalltbl.c | 21 +++++++++------------ > > tools/perf/util/syscalltbl.h | 5 +++++ > > 3 files changed, 35 insertions(+), 23 deletions(-) > > > > diff --git a/tools/perf/builtin-trace.c b/tools/perf/builtin-trace.c > > index c42bc608954e..5cbe1748911d 100644 > > --- a/tools/perf/builtin-trace.c > > +++ b/tools/perf/builtin-trace.c > > @@ -3354,7 +3354,8 @@ static int trace__bpf_prog_sys_exit_fd(struct trace *trace, int id) > > static struct bpf_program *trace__find_usable_bpf_prog_entry(struct trace *trace, struct syscall *sc) > > { > > struct tep_format_field *field, *candidate_field; > > - int id; > > + struct __syscall *scs = trace->sctbl->syscalls.entries; > > + int id, _id; > > > > /* > > * We're only interested in syscalls that have a pointer: > > @@ -3368,10 +3369,13 @@ static struct bpf_program *trace__find_usable_bpf_prog_entry(struct trace *trace > > > > try_to_find_pair: > > for (id = 0; id < trace->sctbl->syscalls.nr_entries; ++id) { > > - struct syscall *pair = trace__syscall_info(trace, NULL, id); > > + struct syscall *pair; > > struct bpf_program *pair_prog; > > bool is_candidate = false; > > > > + _id = scs[id].id; > > + pair = trace__syscall_info(trace, NULL, _id); > > + > > if (pair == NULL || pair == sc || > > pair->bpf_prog.sys_enter == trace->skel->progs.syscall_unaugmented) > > continue; > > @@ -3456,23 +3460,26 @@ static int trace__init_syscalls_bpf_prog_array_maps(struct trace *trace) > > { > > int map_enter_fd = bpf_map__fd(trace->skel->maps.syscalls_sys_enter); > > int map_exit_fd = bpf_map__fd(trace->skel->maps.syscalls_sys_exit); > > - int err = 0, key; > > + int err = 0, key, id; > > + struct __syscall *scs = trace->sctbl->syscalls.entries; > > > > for (key = 0; key < trace->sctbl->syscalls.nr_entries; ++key) { > > int prog_fd; > > > > - if (!trace__syscall_enabled(trace, key)) > > + id = scs[key].id; > > + > > + if (!trace__syscall_enabled(trace, id)) > > continue; > > > > - trace__init_syscall_bpf_progs(trace, key); > > + trace__init_syscall_bpf_progs(trace, id); > > > > // It'll get at least the "!raw_syscalls:unaugmented" > > - prog_fd = trace__bpf_prog_sys_enter_fd(trace, key); > > - err = bpf_map_update_elem(map_enter_fd, &key, &prog_fd, BPF_ANY); > > + prog_fd = trace__bpf_prog_sys_enter_fd(trace, id); > > + err = bpf_map_update_elem(map_enter_fd, &id, &prog_fd, BPF_ANY); > > if (err) > > break; > > - prog_fd = trace__bpf_prog_sys_exit_fd(trace, key); > > - err = bpf_map_update_elem(map_exit_fd, &key, &prog_fd, BPF_ANY); > > + prog_fd = trace__bpf_prog_sys_exit_fd(trace, id); > > + err = bpf_map_update_elem(map_exit_fd, &id, &prog_fd, BPF_ANY); > > if (err) > > break; > > } > > @@ -3506,10 +3513,13 @@ static int trace__init_syscalls_bpf_prog_array_maps(struct trace *trace) > > * array tail call, then that one will be used. > > */ > > for (key = 0; key < trace->sctbl->syscalls.nr_entries; ++key) { > > - struct syscall *sc = trace__syscall_info(trace, NULL, key); > > + struct syscall *sc; > > struct bpf_program *pair_prog; > > int prog_fd; > > > > + id = scs[key].id; > > + sc = trace__syscall_info(trace, NULL, id); > > + > > if (sc == NULL || sc->bpf_prog.sys_enter == NULL) > > continue; > > > > @@ -3535,7 +3545,7 @@ static int trace__init_syscalls_bpf_prog_array_maps(struct trace *trace) > > * with the fd for the program we're reusing: > > */ > > prog_fd = bpf_program__fd(sc->bpf_prog.sys_enter); > > - err = bpf_map_update_elem(map_enter_fd, &key, &prog_fd, BPF_ANY); > > + err = bpf_map_update_elem(map_enter_fd, &id, &prog_fd, BPF_ANY); > > if (err) > > break; > > } > > diff --git a/tools/perf/util/syscalltbl.c b/tools/perf/util/syscalltbl.c > > index 63be7b58761d..16aa886c40f0 100644 > > --- a/tools/perf/util/syscalltbl.c > > +++ b/tools/perf/util/syscalltbl.c > > @@ -44,22 +44,17 @@ const int syscalltbl_native_max_id = SYSCALLTBL_LOONGARCH_MAX_ID; > > static const char *const *syscalltbl_native = syscalltbl_loongarch; > > #endif > > > > -struct syscall { > > - int id; > > - const char *name; > > -}; > > - > > static int syscallcmpname(const void *vkey, const void *ventry) > > { > > const char *key = vkey; > > - const struct syscall *entry = ventry; > > + const struct __syscall *entry = ventry; > > > > return strcmp(key, entry->name); > > } > > > > static int syscallcmp(const void *va, const void *vb) > > { > > - const struct syscall *a = va, *b = vb; > > + const struct __syscall *a = va, *b = vb; > > > > return strcmp(a->name, b->name); > > } > > @@ -67,13 +62,14 @@ static int syscallcmp(const void *va, const void *vb) > > static int syscalltbl__init_native(struct syscalltbl *tbl) > > { > > int nr_entries = 0, i, j; > > - struct syscall *entries; > > + struct __syscall *entries; > > > > for (i = 0; i <= syscalltbl_native_max_id; ++i) > > if (syscalltbl_native[i]) > > ++nr_entries; > > > > - entries = tbl->syscalls.entries = malloc(sizeof(struct syscall) * nr_entries); > > + entries = tbl->syscalls.entries = malloc(sizeof(struct __syscall) * > > + nr_entries); > > if (tbl->syscalls.entries == NULL) > > return -1; > > > > @@ -85,7 +81,8 @@ static int syscalltbl__init_native(struct syscalltbl *tbl) > > } > > } > > > > - qsort(tbl->syscalls.entries, nr_entries, sizeof(struct syscall), syscallcmp); > > + qsort(tbl->syscalls.entries, nr_entries, sizeof(struct __syscall), > > + syscallcmp); > > tbl->syscalls.nr_entries = nr_entries; > > tbl->syscalls.max_id = syscalltbl_native_max_id; > > return 0; > > @@ -116,7 +113,7 @@ const char *syscalltbl__name(const struct syscalltbl *tbl __maybe_unused, int id > > > > int syscalltbl__id(struct syscalltbl *tbl, const char *name) > > { > > - struct syscall *sc = bsearch(name, tbl->syscalls.entries, > > + struct __syscall *sc = bsearch(name, tbl->syscalls.entries, > > tbl->syscalls.nr_entries, sizeof(*sc), > > syscallcmpname); > > > > @@ -126,7 +123,7 @@ int syscalltbl__id(struct syscalltbl *tbl, const char *name) > > int syscalltbl__strglobmatch_next(struct syscalltbl *tbl, const char *syscall_glob, int *idx) > > { > > int i; > > - struct syscall *syscalls = tbl->syscalls.entries; > > + struct __syscall *syscalls = tbl->syscalls.entries; > > > > for (i = *idx + 1; i < tbl->syscalls.nr_entries; ++i) { > > if (strglobmatch(syscalls[i].name, syscall_glob)) { > > diff --git a/tools/perf/util/syscalltbl.h b/tools/perf/util/syscalltbl.h > > index a41d2ca9e4ae..6e93a0874c40 100644 > > --- a/tools/perf/util/syscalltbl.h > > +++ b/tools/perf/util/syscalltbl.h > > @@ -2,6 +2,11 @@ > > #ifndef __PERF_SYSCALLTBL_H > > #define __PERF_SYSCALLTBL_H > > > > It'd be nice to document the struct with examples that explain the > confusion that's happened and is fixed here. You mean the "struct __syscall"? That's just me trying to expose the already existed "struct syscall" it in syscalltbl.h(originally in syscalltbl.c, so it's private), so that I can use it in builtin-trace.c. It was originally named "struct syscall", but we already had one "struct syscall" in builtin-trace.c, so I renamed it to "struct __syscall". Please tell me if this is inappropriate. Thanks, Howard > > Thanks, > Ian > > > +struct __syscall { > > + int id; > > + const char *name; > > +}; > > + > > struct syscalltbl { > > int audit_machine; > > struct { > > -- > > 2.45.2 > > On Tue, Jun 11, 2024 at 5:33 AM Ian Rogers <irogers@google.com> wrote: > > On Sat, Jun 8, 2024 at 10:21 AM Howard Chu <howardchu95@gmail.com> wrote: > > > > This is a bug found when implementing pretty-printing for the > > landlock_add_rule system call, I decided to send this patch separately > > because this is a serious bug that should be fixed fast. > > > > I wrote a test program to do landlock_add_rule syscall in a loop, > > yet perf trace -e landlock_add_rule freezes, giving no output. > > > > This bug is introduced by the false understanding of the variable "key" > > below: > > ``` > > for (key = 0; key < trace->sctbl->syscalls.nr_entries; ++key) { > > struct syscall *sc = trace__syscall_info(trace, NULL, key); > > ... > > } > > ``` > > The code above seems right at the beginning, but when looking at > > syscalltbl.c, I found these lines: > > > > ``` > > for (i = 0; i <= syscalltbl_native_max_id; ++i) > > if (syscalltbl_native[i]) > > ++nr_entries; > > > > entries = tbl->syscalls.entries = malloc(sizeof(struct syscall) * nr_entries); > > ... > > > > for (i = 0, j = 0; i <= syscalltbl_native_max_id; ++i) { > > if (syscalltbl_native[i]) { > > entries[j].name = syscalltbl_native[i]; > > entries[j].id = i; > > ++j; > > } > > } > > ``` > > > > meaning the key is merely an index to traverse the syscall table, > > instead of the actual syscall id for this particular syscall. > > > Thanks Howard, I'm not following this. Doesn't it make sense to use > the syscall number as its id? > > > > > > > So if one uses key to do trace__syscall_info(trace, NULL, key), because > > key only goes up to trace->sctbl->syscalls.nr_entries, for example, on > > my X86_64 machine, this number is 373, it will end up neglecting all > > the rest of the syscall, in my case, everything after `rseq`, because > > the traversal will stop at 373, and `rseq` is the last syscall whose id > > is lower than 373 > > > > in tools/perf/arch/x86/include/generated/asm/syscalls_64.c: > > ``` > > ... > > [334] = "rseq", > > [424] = "pidfd_send_signal", > > ... > > ``` > > > > The reason why the key is scrambled but perf trace works well is that > > key is used in trace__syscall_info(trace, NULL, key) to do > > trace->syscalls.table[id], this makes sure that the struct syscall returned > > actually has an id the same value as key, making the later bpf_prog > > matching all correct. > > > Could we create a test for this? We have tests that list all perf > events and then running a perf command on them. It wouldn't be > possible to guarantee output. > > > > > After fixing this bug, I can do perf trace on 38 more syscalls, and > > because more syscalls are visible, we get 8 more syscalls that can be > > augmented. > > > > before: > > > > perf $ perf trace -vv --max-events=1 |& grep Reusing > > Reusing "open" BPF sys_enter augmenter for "stat" > > Reusing "open" BPF sys_enter augmenter for "lstat" > > Reusing "open" BPF sys_enter augmenter for "access" > > Reusing "connect" BPF sys_enter augmenter for "accept" > > Reusing "sendto" BPF sys_enter augmenter for "recvfrom" > > Reusing "connect" BPF sys_enter augmenter for "bind" > > Reusing "connect" BPF sys_enter augmenter for "getsockname" > > Reusing "connect" BPF sys_enter augmenter for "getpeername" > > Reusing "open" BPF sys_enter augmenter for "execve" > > Reusing "open" BPF sys_enter augmenter for "truncate" > > Reusing "open" BPF sys_enter augmenter for "chdir" > > Reusing "open" BPF sys_enter augmenter for "mkdir" > > Reusing "open" BPF sys_enter augmenter for "rmdir" > > Reusing "open" BPF sys_enter augmenter for "creat" > > Reusing "open" BPF sys_enter augmenter for "link" > > Reusing "open" BPF sys_enter augmenter for "unlink" > > Reusing "open" BPF sys_enter augmenter for "symlink" > > Reusing "open" BPF sys_enter augmenter for "readlink" > > Reusing "open" BPF sys_enter augmenter for "chmod" > > Reusing "open" BPF sys_enter augmenter for "chown" > > Reusing "open" BPF sys_enter augmenter for "lchown" > > Reusing "open" BPF sys_enter augmenter for "mknod" > > Reusing "open" BPF sys_enter augmenter for "statfs" > > Reusing "open" BPF sys_enter augmenter for "pivot_root" > > Reusing "open" BPF sys_enter augmenter for "chroot" > > Reusing "open" BPF sys_enter augmenter for "acct" > > Reusing "open" BPF sys_enter augmenter for "swapon" > > Reusing "open" BPF sys_enter augmenter for "swapoff" > > Reusing "open" BPF sys_enter augmenter for "delete_module" > > Reusing "open" BPF sys_enter augmenter for "setxattr" > > Reusing "open" BPF sys_enter augmenter for "lsetxattr" > > Reusing "openat" BPF sys_enter augmenter for "fsetxattr" > > Reusing "open" BPF sys_enter augmenter for "getxattr" > > Reusing "open" BPF sys_enter augmenter for "lgetxattr" > > Reusing "openat" BPF sys_enter augmenter for "fgetxattr" > > Reusing "open" BPF sys_enter augmenter for "listxattr" > > Reusing "open" BPF sys_enter augmenter for "llistxattr" > > Reusing "open" BPF sys_enter augmenter for "removexattr" > > Reusing "open" BPF sys_enter augmenter for "lremovexattr" > > Reusing "fsetxattr" BPF sys_enter augmenter for "fremovexattr" > > Reusing "open" BPF sys_enter augmenter for "mq_open" > > Reusing "open" BPF sys_enter augmenter for "mq_unlink" > > Reusing "fsetxattr" BPF sys_enter augmenter for "add_key" > > Reusing "fremovexattr" BPF sys_enter augmenter for "request_key" > > Reusing "fremovexattr" BPF sys_enter augmenter for "inotify_add_watch" > > Reusing "fremovexattr" BPF sys_enter augmenter for "mkdirat" > > Reusing "fremovexattr" BPF sys_enter augmenter for "mknodat" > > Reusing "fremovexattr" BPF sys_enter augmenter for "fchownat" > > Reusing "fremovexattr" BPF sys_enter augmenter for "futimesat" > > Reusing "fremovexattr" BPF sys_enter augmenter for "newfstatat" > > Reusing "fremovexattr" BPF sys_enter augmenter for "unlinkat" > > Reusing "fremovexattr" BPF sys_enter augmenter for "linkat" > > Reusing "open" BPF sys_enter augmenter for "symlinkat" > > Reusing "fremovexattr" BPF sys_enter augmenter for "readlinkat" > > Reusing "fremovexattr" BPF sys_enter augmenter for "fchmodat" > > Reusing "fremovexattr" BPF sys_enter augmenter for "faccessat" > > Reusing "fremovexattr" BPF sys_enter augmenter for "utimensat" > > Reusing "connect" BPF sys_enter augmenter for "accept4" > > Reusing "fremovexattr" BPF sys_enter augmenter for "name_to_handle_at" > > Reusing "fremovexattr" BPF sys_enter augmenter for "renameat2" > > Reusing "open" BPF sys_enter augmenter for "memfd_create" > > Reusing "fremovexattr" BPF sys_enter augmenter for "execveat" > > Reusing "fremovexattr" BPF sys_enter augmenter for "statx" > > > > after > > > > perf $ perf trace -vv --max-events=1 |& grep Reusing > > Reusing "open" BPF sys_enter augmenter for "stat" > > Reusing "open" BPF sys_enter augmenter for "lstat" > > Reusing "open" BPF sys_enter augmenter for "access" > > Reusing "connect" BPF sys_enter augmenter for "accept" > > Reusing "sendto" BPF sys_enter augmenter for "recvfrom" > > Reusing "connect" BPF sys_enter augmenter for "bind" > > Reusing "connect" BPF sys_enter augmenter for "getsockname" > > Reusing "connect" BPF sys_enter augmenter for "getpeername" > > Reusing "open" BPF sys_enter augmenter for "execve" > > Reusing "open" BPF sys_enter augmenter for "truncate" > > Reusing "open" BPF sys_enter augmenter for "chdir" > > Reusing "open" BPF sys_enter augmenter for "mkdir" > > Reusing "open" BPF sys_enter augmenter for "rmdir" > > Reusing "open" BPF sys_enter augmenter for "creat" > > Reusing "open" BPF sys_enter augmenter for "link" > > Reusing "open" BPF sys_enter augmenter for "unlink" > > Reusing "open" BPF sys_enter augmenter for "symlink" > > Reusing "open" BPF sys_enter augmenter for "readlink" > > Reusing "open" BPF sys_enter augmenter for "chmod" > > Reusing "open" BPF sys_enter augmenter for "chown" > > Reusing "open" BPF sys_enter augmenter for "lchown" > > Reusing "open" BPF sys_enter augmenter for "mknod" > > Reusing "open" BPF sys_enter augmenter for "statfs" > > Reusing "open" BPF sys_enter augmenter for "pivot_root" > > Reusing "open" BPF sys_enter augmenter for "chroot" > > Reusing "open" BPF sys_enter augmenter for "acct" > > Reusing "open" BPF sys_enter augmenter for "swapon" > > Reusing "open" BPF sys_enter augmenter for "swapoff" > > Reusing "open" BPF sys_enter augmenter for "delete_module" > > Reusing "open" BPF sys_enter augmenter for "setxattr" > > Reusing "open" BPF sys_enter augmenter for "lsetxattr" > > Reusing "openat" BPF sys_enter augmenter for "fsetxattr" > > Reusing "open" BPF sys_enter augmenter for "getxattr" > > Reusing "open" BPF sys_enter augmenter for "lgetxattr" > > Reusing "openat" BPF sys_enter augmenter for "fgetxattr" > > Reusing "open" BPF sys_enter augmenter for "listxattr" > > Reusing "open" BPF sys_enter augmenter for "llistxattr" > > Reusing "open" BPF sys_enter augmenter for "removexattr" > > Reusing "open" BPF sys_enter augmenter for "lremovexattr" > > Reusing "fsetxattr" BPF sys_enter augmenter for "fremovexattr" > > Reusing "open" BPF sys_enter augmenter for "mq_open" > > Reusing "open" BPF sys_enter augmenter for "mq_unlink" > > Reusing "fsetxattr" BPF sys_enter augmenter for "add_key" > > Reusing "fremovexattr" BPF sys_enter augmenter for "request_key" > > Reusing "fremovexattr" BPF sys_enter augmenter for "inotify_add_watch" > > Reusing "fremovexattr" BPF sys_enter augmenter for "mkdirat" > > Reusing "fremovexattr" BPF sys_enter augmenter for "mknodat" > > Reusing "fremovexattr" BPF sys_enter augmenter for "fchownat" > > Reusing "fremovexattr" BPF sys_enter augmenter for "futimesat" > > Reusing "fremovexattr" BPF sys_enter augmenter for "newfstatat" > > Reusing "fremovexattr" BPF sys_enter augmenter for "unlinkat" > > Reusing "fremovexattr" BPF sys_enter augmenter for "linkat" > > Reusing "open" BPF sys_enter augmenter for "symlinkat" > > Reusing "fremovexattr" BPF sys_enter augmenter for "readlinkat" > > Reusing "fremovexattr" BPF sys_enter augmenter for "fchmodat" > > Reusing "fremovexattr" BPF sys_enter augmenter for "faccessat" > > Reusing "fremovexattr" BPF sys_enter augmenter for "utimensat" > > Reusing "connect" BPF sys_enter augmenter for "accept4" > > Reusing "fremovexattr" BPF sys_enter augmenter for "name_to_handle_at" > > Reusing "fremovexattr" BPF sys_enter augmenter for "renameat2" > > Reusing "open" BPF sys_enter augmenter for "memfd_create" > > Reusing "fremovexattr" BPF sys_enter augmenter for "execveat" > > Reusing "fremovexattr" BPF sys_enter augmenter for "statx" > > > > TL;DR: > > > > These are the new syscalls that can be augmented > > Reusing "openat" BPF sys_enter augmenter for "open_tree" > > Reusing "openat" BPF sys_enter augmenter for "openat2" > > Reusing "openat" BPF sys_enter augmenter for "mount_setattr" > > Reusing "openat" BPF sys_enter augmenter for "move_mount" > > Reusing "open" BPF sys_enter augmenter for "fsopen" > > Reusing "openat" BPF sys_enter augmenter for "fspick" > > Reusing "openat" BPF sys_enter augmenter for "faccessat2" > > Reusing "openat" BPF sys_enter augmenter for "fchmodat2" > > > > as for the perf trace output: > > > > before > > > > perf $ perf trace -e faccessat2 --max-events=1 > > [no output] > > > > after > > > > perf $ ./perf trace -e faccessat2 --max-events=1 > > 0.000 ( 0.037 ms): waybar/958 faccessat2(dfd: 40, filename: "uevent") = 0 > > > > P.S. The reason why this bug was not found in the past five years is > > probably because it only happens to the newer syscalls whose id is > > greater, for instance, faccessat2 of id 439, which not a lot of people > > care about when using perf trace. > > > > Signed-off-by: Howard Chu <howardchu95@gmail.com> > > --- > > tools/perf/builtin-trace.c | 32 +++++++++++++++++++++----------- > > tools/perf/util/syscalltbl.c | 21 +++++++++------------ > > tools/perf/util/syscalltbl.h | 5 +++++ > > 3 files changed, 35 insertions(+), 23 deletions(-) > > > > diff --git a/tools/perf/builtin-trace.c b/tools/perf/builtin-trace.c > > index c42bc608954e..5cbe1748911d 100644 > > --- a/tools/perf/builtin-trace.c > > +++ b/tools/perf/builtin-trace.c > > @@ -3354,7 +3354,8 @@ static int trace__bpf_prog_sys_exit_fd(struct trace *trace, int id) > > static struct bpf_program *trace__find_usable_bpf_prog_entry(struct trace *trace, struct syscall *sc) > > { > > struct tep_format_field *field, *candidate_field; > > - int id; > > + struct __syscall *scs = trace->sctbl->syscalls.entries; > > + int id, _id; > > > > /* > > * We're only interested in syscalls that have a pointer: > > @@ -3368,10 +3369,13 @@ static struct bpf_program *trace__find_usable_bpf_prog_entry(struct trace *trace > > > > try_to_find_pair: > > for (id = 0; id < trace->sctbl->syscalls.nr_entries; ++id) { > > - struct syscall *pair = trace__syscall_info(trace, NULL, id); > > + struct syscall *pair; > > struct bpf_program *pair_prog; > > bool is_candidate = false; > > > > + _id = scs[id].id; > > + pair = trace__syscall_info(trace, NULL, _id); > > + > > if (pair == NULL || pair == sc || > > pair->bpf_prog.sys_enter == trace->skel->progs.syscall_unaugmented) > > continue; > > @@ -3456,23 +3460,26 @@ static int trace__init_syscalls_bpf_prog_array_maps(struct trace *trace) > > { > > int map_enter_fd = bpf_map__fd(trace->skel->maps.syscalls_sys_enter); > > int map_exit_fd = bpf_map__fd(trace->skel->maps.syscalls_sys_exit); > > - int err = 0, key; > > + int err = 0, key, id; > > + struct __syscall *scs = trace->sctbl->syscalls.entries; > > > > for (key = 0; key < trace->sctbl->syscalls.nr_entries; ++key) { > > int prog_fd; > > > > - if (!trace__syscall_enabled(trace, key)) > > + id = scs[key].id; > > + > > + if (!trace__syscall_enabled(trace, id)) > > continue; > > > > - trace__init_syscall_bpf_progs(trace, key); > > + trace__init_syscall_bpf_progs(trace, id); > > > > // It'll get at least the "!raw_syscalls:unaugmented" > > - prog_fd = trace__bpf_prog_sys_enter_fd(trace, key); > > - err = bpf_map_update_elem(map_enter_fd, &key, &prog_fd, BPF_ANY); > > + prog_fd = trace__bpf_prog_sys_enter_fd(trace, id); > > + err = bpf_map_update_elem(map_enter_fd, &id, &prog_fd, BPF_ANY); > > if (err) > > break; > > - prog_fd = trace__bpf_prog_sys_exit_fd(trace, key); > > - err = bpf_map_update_elem(map_exit_fd, &key, &prog_fd, BPF_ANY); > > + prog_fd = trace__bpf_prog_sys_exit_fd(trace, id); > > + err = bpf_map_update_elem(map_exit_fd, &id, &prog_fd, BPF_ANY); > > if (err) > > break; > > } > > @@ -3506,10 +3513,13 @@ static int trace__init_syscalls_bpf_prog_array_maps(struct trace *trace) > > * array tail call, then that one will be used. > > */ > > for (key = 0; key < trace->sctbl->syscalls.nr_entries; ++key) { > > - struct syscall *sc = trace__syscall_info(trace, NULL, key); > > + struct syscall *sc; > > struct bpf_program *pair_prog; > > int prog_fd; > > > > + id = scs[key].id; > > + sc = trace__syscall_info(trace, NULL, id); > > + > > if (sc == NULL || sc->bpf_prog.sys_enter == NULL) > > continue; > > > > @@ -3535,7 +3545,7 @@ static int trace__init_syscalls_bpf_prog_array_maps(struct trace *trace) > > * with the fd for the program we're reusing: > > */ > > prog_fd = bpf_program__fd(sc->bpf_prog.sys_enter); > > - err = bpf_map_update_elem(map_enter_fd, &key, &prog_fd, BPF_ANY); > > + err = bpf_map_update_elem(map_enter_fd, &id, &prog_fd, BPF_ANY); > > if (err) > > break; > > } > > diff --git a/tools/perf/util/syscalltbl.c b/tools/perf/util/syscalltbl.c > > index 63be7b58761d..16aa886c40f0 100644 > > --- a/tools/perf/util/syscalltbl.c > > +++ b/tools/perf/util/syscalltbl.c > > @@ -44,22 +44,17 @@ const int syscalltbl_native_max_id = SYSCALLTBL_LOONGARCH_MAX_ID; > > static const char *const *syscalltbl_native = syscalltbl_loongarch; > > #endif > > > > -struct syscall { > > - int id; > > - const char *name; > > -}; > > - > > static int syscallcmpname(const void *vkey, const void *ventry) > > { > > const char *key = vkey; > > - const struct syscall *entry = ventry; > > + const struct __syscall *entry = ventry; > > > > return strcmp(key, entry->name); > > } > > > > static int syscallcmp(const void *va, const void *vb) > > { > > - const struct syscall *a = va, *b = vb; > > + const struct __syscall *a = va, *b = vb; > > > > return strcmp(a->name, b->name); > > } > > @@ -67,13 +62,14 @@ static int syscallcmp(const void *va, const void *vb) > > static int syscalltbl__init_native(struct syscalltbl *tbl) > > { > > int nr_entries = 0, i, j; > > - struct syscall *entries; > > + struct __syscall *entries; > > > > for (i = 0; i <= syscalltbl_native_max_id; ++i) > > if (syscalltbl_native[i]) > > ++nr_entries; > > > > - entries = tbl->syscalls.entries = malloc(sizeof(struct syscall) * nr_entries); > > + entries = tbl->syscalls.entries = malloc(sizeof(struct __syscall) * > > + nr_entries); > > if (tbl->syscalls.entries == NULL) > > return -1; > > > > @@ -85,7 +81,8 @@ static int syscalltbl__init_native(struct syscalltbl *tbl) > > } > > } > > > > - qsort(tbl->syscalls.entries, nr_entries, sizeof(struct syscall), syscallcmp); > > + qsort(tbl->syscalls.entries, nr_entries, sizeof(struct __syscall), > > + syscallcmp); > > tbl->syscalls.nr_entries = nr_entries; > > tbl->syscalls.max_id = syscalltbl_native_max_id; > > return 0; > > @@ -116,7 +113,7 @@ const char *syscalltbl__name(const struct syscalltbl *tbl __maybe_unused, int id > > > > int syscalltbl__id(struct syscalltbl *tbl, const char *name) > > { > > - struct syscall *sc = bsearch(name, tbl->syscalls.entries, > > + struct __syscall *sc = bsearch(name, tbl->syscalls.entries, > > tbl->syscalls.nr_entries, sizeof(*sc), > > syscallcmpname); > > > > @@ -126,7 +123,7 @@ int syscalltbl__id(struct syscalltbl *tbl, const char *name) > > int syscalltbl__strglobmatch_next(struct syscalltbl *tbl, const char *syscall_glob, int *idx) > > { > > int i; > > - struct syscall *syscalls = tbl->syscalls.entries; > > + struct __syscall *syscalls = tbl->syscalls.entries; > > > > for (i = *idx + 1; i < tbl->syscalls.nr_entries; ++i) { > > if (strglobmatch(syscalls[i].name, syscall_glob)) { > > diff --git a/tools/perf/util/syscalltbl.h b/tools/perf/util/syscalltbl.h > > index a41d2ca9e4ae..6e93a0874c40 100644 > > --- a/tools/perf/util/syscalltbl.h > > +++ b/tools/perf/util/syscalltbl.h > > @@ -2,6 +2,11 @@ > > #ifndef __PERF_SYSCALLTBL_H > > #define __PERF_SYSCALLTBL_H > > > > It'd be nice to document the struct with examples that explain the > confusion that's happened and is fixed here. > > Thanks, > Ian > > > +struct __syscall { > > + int id; > > + const char *name; > > +}; > > + > > struct syscalltbl { > > int audit_machine; > > struct { > > -- > > 2.45.2 > >
On Sun, Jun 09, 2024 at 01:21:46AM +0800, Howard Chu wrote: > This is a bug found when implementing pretty-printing for the > landlock_add_rule system call, I decided to send this patch separately > because this is a serious bug that should be fixed fast. > I wrote a test program to do landlock_add_rule syscall in a loop, > yet perf trace -e landlock_add_rule freezes, giving no output. > This bug is introduced by the false understanding of the variable "key" > below: > ``` > for (key = 0; key < trace->sctbl->syscalls.nr_entries; ++key) { > struct syscall *sc = trace__syscall_info(trace, NULL, key); > ... > } > ``` > The code above seems right at the beginning, but when looking at > syscalltbl.c, I found these lines: > > ``` > for (i = 0; i <= syscalltbl_native_max_id; ++i) > if (syscalltbl_native[i]) > ++nr_entries; > > entries = tbl->syscalls.entries = malloc(sizeof(struct syscall) * nr_entries); > ... > > for (i = 0, j = 0; i <= syscalltbl_native_max_id; ++i) { > if (syscalltbl_native[i]) { > entries[j].name = syscalltbl_native[i]; > entries[j].id = i; > ++j; > } > } > ``` > > meaning the key is merely an index to traverse the syscall table, > instead of the actual syscall id for this particular syscall. > So if one uses key to do trace__syscall_info(trace, NULL, key), because > key only goes up to trace->sctbl->syscalls.nr_entries, for example, on > my X86_64 machine, this number is 373, it will end up neglecting all > the rest of the syscall, in my case, everything after `rseq`, because > the traversal will stop at 373, and `rseq` is the last syscall whose id > is lower than 373 > in tools/perf/arch/x86/include/generated/asm/syscalls_64.c: > ``` > ... > [334] = "rseq", > [424] = "pidfd_send_signal", > ... > ``` > > The reason why the key is scrambled but perf trace works well is that > key is used in trace__syscall_info(trace, NULL, key) to do > trace->syscalls.table[id], this makes sure that the struct syscall returned > actually has an id the same value as key, making the later bpf_prog > matching all correct. Right, trace->syscalls.table holds info read from tracefs, while trace->sctbl holds info created from per-arch syscall tables, usually from: ⬢[acme@toolbox perf-tools-next]$ find arch -name "*.tbl" arch/alpha/kernel/syscalls/syscall.tbl arch/arm/tools/syscall.tbl arch/m68k/kernel/syscalls/syscall.tbl arch/microblaze/kernel/syscalls/syscall.tbl arch/mips/kernel/syscalls/syscall_n32.tbl arch/mips/kernel/syscalls/syscall_n64.tbl arch/mips/kernel/syscalls/syscall_o32.tbl arch/parisc/kernel/syscalls/syscall.tbl arch/powerpc/kernel/syscalls/syscall.tbl arch/s390/kernel/syscalls/syscall.tbl arch/sh/kernel/syscalls/syscall.tbl arch/sparc/kernel/syscalls/syscall.tbl arch/x86/entry/syscalls/syscall_32.tbl arch/x86/entry/syscalls/syscall_64.tbl arch/xtensa/kernel/syscalls/syscall.tbl ⬢[acme@toolbox perf-tools-next]$ trace->sctbl->syscalls.entries is sorted by name and has { syscall_id, name } as contents. But it is loaded at the time the reuse of BPF programs is attempted, because initially we were using libaudit for getting id, that we get from the tracepoint payload: root@number:~# cat /sys/kernel/tracing/events/raw_syscalls/sys_enter/format name: sys_enter ID: 361 format: field:unsigned short common_type; offset:0; size:2; signed:0; field:unsigned char common_flags; offset:2; size:1; signed:0; field:unsigned char common_preempt_count; offset:3; size:1; signed:0; field:int common_pid; offset:4; size:4; signed:1; field:long id; offset:8; size:8; signed:1; ^^ ^^ ^^ ^^ field:unsigned long args[6]; offset:16; size:48; signed:0; print fmt: "NR %ld (%lx, %lx, %lx, %lx, %lx, %lx)", REC->id, REC->args[0], REC->args[1], REC->args[2], REC->args[3], REC->args[4], REC->args[5] root@number:~# To syscall name. Your analysis is perfect, great! Please take a look at the attached patch, that I'm testing now that is a simplification of your patch, that avoids exposing that __syscall struct, reducing patch size by introducing this instead: +++ b/tools/perf/util/syscalltbl.c @@ -123,6 +123,13 @@ int syscalltbl__id(struct syscalltbl *tbl, const char *name) return sc ? sc->id : -1; } +int syscalltbl__id_at_idx(struct syscalltbl *tbl, int idx) +{ + struct syscall *syscalls = tbl->syscalls.entries; + + return idx < tbl->syscalls.nr_entries ? syscalls[idx].id : -1; +} And then using it to go from the index to the syscall id in the loops traversing the sorted trace->sctbl-> > After fixing this bug, I can do perf trace on 38 more syscalls, and > because more syscalls are visible, we get 8 more syscalls that can be > augmented. > > before: > > perf $ perf trace -vv --max-events=1 |& grep Reusing > Reusing "open" BPF sys_enter augmenter for "stat" > Reusing "open" BPF sys_enter augmenter for "lstat" > Reusing "fremovexattr" BPF sys_enter augmenter for "execveat" > Reusing "fremovexattr" BPF sys_enter augmenter for "statx" > > TL;DR: I did the above and got: root@number:~# wc -l before after 63 before 71 after 134 total root@number:~# > These are the new syscalls that can be augmented Which matches the 8 more reused bpf programs. > Reusing "openat" BPF sys_enter augmenter for "open_tree" > Reusing "openat" BPF sys_enter augmenter for "openat2" > Reusing "openat" BPF sys_enter augmenter for "mount_setattr" > Reusing "openat" BPF sys_enter augmenter for "move_mount" > Reusing "open" BPF sys_enter augmenter for "fsopen" > Reusing "openat" BPF sys_enter augmenter for "fspick" > Reusing "openat" BPF sys_enter augmenter for "faccessat2" > Reusing "openat" BPF sys_enter augmenter for "fchmodat2" But interestingly I get this: root@number:~# grep -E 'for "(open_tree|openat2|mount_setattr|move_mount|fsopen|fspick|faccessat2|fchmodat2)' after Reusing "faccessat" BPF sys_enter augmenter for "faccessat2" Reusing "faccessat" BPF sys_enter augmenter for "fchmodat2" Reusing "access" BPF sys_enter augmenter for "fsopen" Reusing "faccessat" BPF sys_enter augmenter for "fspick" Reusing "faccessat" BPF sys_enter augmenter for "mount_setattr" Reusing "faccessat" BPF sys_enter augmenter for "move_mount" Reusing "faccessat" BPF sys_enter augmenter for "open_tree" Reusing "faccessat" BPF sys_enter augmenter for "openat2" root@number:~# Which matches my expectations as the array we're traversing, trace->sctbl->syscalls->entries is sorted by name. > as for the perf trace output: > before > > perf $ perf trace -e faccessat2 --max-events=1 > [no output] > > after > > perf $ ./perf trace -e faccessat2 --max-events=1 > 0.000 ( 0.037 ms): waybar/958 faccessat2(dfd: 40, filename: "uevent") = 0 > > P.S. The reason why this bug was not found in the past five years is > probably because it only happens to the newer syscalls whose id is > greater, for instance, faccessat2 of id 439, which not a lot of people > care about when using perf trace. That and the fact that the BPF code was hidden before having to use -e, that got changed kinda recently when we switched to using BPF skels for augmenting syscalls in 'perf trace': ⬢[acme@toolbox perf-tools-next]$ git log --oneline tools/perf/util/bpf_skel/augmented_raw_syscalls.bpf.c a9f4c6c999008c92 perf trace: Collect sys_nanosleep first argument 29d16de26df17e94 perf augmented_raw_syscalls.bpf: Move 'struct timespec64' to vmlinux.h 5069211e2f0b47e7 perf trace: Use the right bpf_probe_read(_str) variant for reading user data 33b725ce7b988756 perf trace: Avoid compile error wrt redefining bool 7d9642311b6d9d31 perf bpf augmented_raw_syscalls: Add an assert to make sure sizeof(augmented_arg->value) is a power of two. 262b54b6c9396823 perf bpf augmented_raw_syscalls: Add an assert to make sure sizeof(saddr) is a power of two. 1836480429d173c0 perf bpf_skel augmented_raw_syscalls: Cap the socklen parameter using &= sizeof(saddr) cd2cece61ac5f900 perf trace: Tidy comments related to BPF + syscall augmentation 5e6da6be3082f77b perf trace: Migrate BPF augmentation to use a skeleton ⬢[acme@toolbox perf-tools-next]$ ⬢[acme@toolbox perf-tools-next]$ git show --oneline --pretty=reference 5e6da6be3082f77b | head -1 5e6da6be3082f77b (perf trace: Migrate BPF augmentation to use a skeleton, 2023-08-10) ⬢[acme@toolbox perf-tools-next]$ I.e. from August, 2023. One had as well to ask for BUILD_BPF_SKEL=1, which now is default if all it needs is available on the system. I'll test it together with your btf_enum work. - Arnaldo diff --git a/tools/perf/builtin-trace.c b/tools/perf/builtin-trace.c index c42bc608954ee08d..c4fa8191253d4880 100644 --- a/tools/perf/builtin-trace.c +++ b/tools/perf/builtin-trace.c @@ -3354,8 +3354,6 @@ static int trace__bpf_prog_sys_exit_fd(struct trace *trace, int id) static struct bpf_program *trace__find_usable_bpf_prog_entry(struct trace *trace, struct syscall *sc) { struct tep_format_field *field, *candidate_field; - int id; - /* * We're only interested in syscalls that have a pointer: */ @@ -3367,7 +3365,8 @@ static struct bpf_program *trace__find_usable_bpf_prog_entry(struct trace *trace return NULL; try_to_find_pair: - for (id = 0; id < trace->sctbl->syscalls.nr_entries; ++id) { + for (int i = 0; i < trace->sctbl->syscalls.nr_entries; ++i) { + int id = syscalltbl__id_at_idx(trace->sctbl, i); struct syscall *pair = trace__syscall_info(trace, NULL, id); struct bpf_program *pair_prog; bool is_candidate = false; @@ -3456,10 +3455,10 @@ static int trace__init_syscalls_bpf_prog_array_maps(struct trace *trace) { int map_enter_fd = bpf_map__fd(trace->skel->maps.syscalls_sys_enter); int map_exit_fd = bpf_map__fd(trace->skel->maps.syscalls_sys_exit); - int err = 0, key; + int err = 0; - for (key = 0; key < trace->sctbl->syscalls.nr_entries; ++key) { - int prog_fd; + for (int i = 0; i < trace->sctbl->syscalls.nr_entries; ++i) { + int prog_fd, key = syscalltbl__id_at_idx(trace->sctbl, i); if (!trace__syscall_enabled(trace, key)) continue; @@ -3505,7 +3504,8 @@ static int trace__init_syscalls_bpf_prog_array_maps(struct trace *trace) * first and second arg (this one on the raw_syscalls:sys_exit prog * array tail call, then that one will be used. */ - for (key = 0; key < trace->sctbl->syscalls.nr_entries; ++key) { + for (int i = 0; i < trace->sctbl->syscalls.nr_entries; ++i) { + int key = syscalltbl__id_at_idx(trace->sctbl, i); struct syscall *sc = trace__syscall_info(trace, NULL, key); struct bpf_program *pair_prog; int prog_fd; diff --git a/tools/perf/util/syscalltbl.c b/tools/perf/util/syscalltbl.c index 63be7b58761d2f33..0dd26b991b3fb513 100644 --- a/tools/perf/util/syscalltbl.c +++ b/tools/perf/util/syscalltbl.c @@ -123,6 +123,13 @@ int syscalltbl__id(struct syscalltbl *tbl, const char *name) return sc ? sc->id : -1; } +int syscalltbl__id_at_idx(struct syscalltbl *tbl, int idx) +{ + struct syscall *syscalls = tbl->syscalls.entries; + + return idx < tbl->syscalls.nr_entries ? syscalls[idx].id : -1; +} + int syscalltbl__strglobmatch_next(struct syscalltbl *tbl, const char *syscall_glob, int *idx) { int i; diff --git a/tools/perf/util/syscalltbl.h b/tools/perf/util/syscalltbl.h index a41d2ca9e4aebad5..2b53b7ed25a6affe 100644 --- a/tools/perf/util/syscalltbl.h +++ b/tools/perf/util/syscalltbl.h @@ -16,6 +16,7 @@ void syscalltbl__delete(struct syscalltbl *tbl); const char *syscalltbl__name(const struct syscalltbl *tbl, int id); int syscalltbl__id(struct syscalltbl *tbl, const char *name); +int syscalltbl__id_at_idx(struct syscalltbl *tbl, int idx); int syscalltbl__strglobmatch_first(struct syscalltbl *tbl, const char *syscall_glob, int *idx); int syscalltbl__strglobmatch_next(struct syscalltbl *tbl, const char *syscall_glob, int *idx);
On Sun, Jun 09, 2024 at 01:21:46AM +0800, Howard Chu wrote: > as for the perf trace output: > > before > > perf $ perf trace -e faccessat2 --max-events=1 > [no output] > > after > > perf $ ./perf trace -e faccessat2 --max-events=1 > 0.000 ( 0.037 ms): waybar/958 faccessat2(dfd: 40, filename: "uevent") = 0 Yeah, before there is no output, after, with the following test case: ⬢[acme@toolbox c]$ cat faccessat2.c #include <fcntl.h> /* Definition of AT_* constants */ #include <sys/syscall.h> /* Definition of SYS_* constants */ #include <unistd.h> #include <stdio.h> /* Provide own perf_event_open stub because glibc doesn't */ __attribute__((weak)) int faccessat2(int dirfd, const char *pathname, int mode, int flags) { return syscall(SYS_faccessat2, dirfd, pathname, mode, flags); } int main(int argc, char *argv[]) { int err = faccessat2(123, argv[1], X_OK, AT_EACCESS | AT_SYMLINK_NOFOLLOW); printf("faccessat2(123, %s, X_OK, AT_EACCESS | AT_SYMLINK_NOFOLLOW) = %d\n", argv[1], err); return err; } ⬢[acme@toolbox c]$ make faccessat2 cc faccessat2.c -o faccessat2 ⬢[acme@toolbox c]$ ./faccessat2 bla faccessat2(123, bla, X_OK, AT_EACCESS | AT_SYMLINK_NOFOLLOW) = -1 ⬢[acme@toolbox c]$ In the other terminal, as root: root@number:~# perf trace --call-graph dwarf -e faccessat2 --max-events=1 0.000 ( 0.034 ms): bash/62004 faccessat2(dfd: 123, filename: "bla", mode: X, flags: EACCESS|SYMLINK_NOFOLLOW) = -1 EBADF (Bad file descriptor) syscall (/usr/lib64/libc.so.6) faccessat2 (/home/acme/c/faccessat2) main (/home/acme/c/faccessat2) __libc_start_call_main (/usr/lib64/libc.so.6) __libc_start_main@@GLIBC_2.34 (/usr/lib64/libc.so.6) _start (/home/acme/c/faccessat2) root@number:~# Now to write another test case, this time for the landlock syscall, to test your btf_enum patch. In the future please add the test case so that one can quickly reproduce your testing steps. - Arnaldo
Hello Arnaldo, On Tue, Jun 11, 2024 at 11:25 PM Arnaldo Carvalho de Melo <acme@kernel.org> wrote: > > On Sun, Jun 09, 2024 at 01:21:46AM +0800, Howard Chu wrote: > > as for the perf trace output: > > > > before > > > > perf $ perf trace -e faccessat2 --max-events=1 > > [no output] > > > > after > > > > perf $ ./perf trace -e faccessat2 --max-events=1 > > 0.000 ( 0.037 ms): waybar/958 faccessat2(dfd: 40, filename: "uevent") = 0 > > Yeah, before there is no output, after, with the following test case: > > ⬢[acme@toolbox c]$ cat faccessat2.c > #include <fcntl.h> /* Definition of AT_* constants */ > #include <sys/syscall.h> /* Definition of SYS_* constants */ > #include <unistd.h> > #include <stdio.h> > > /* Provide own perf_event_open stub because glibc doesn't */ > __attribute__((weak)) > int faccessat2(int dirfd, const char *pathname, int mode, int flags) > { > return syscall(SYS_faccessat2, dirfd, pathname, mode, flags); > } > > int main(int argc, char *argv[]) > { > int err = faccessat2(123, argv[1], X_OK, AT_EACCESS | AT_SYMLINK_NOFOLLOW); > > printf("faccessat2(123, %s, X_OK, AT_EACCESS | AT_SYMLINK_NOFOLLOW) = %d\n", argv[1], err); > return err; > } > ⬢[acme@toolbox c]$ make faccessat2 > cc faccessat2.c -o faccessat2 > ⬢[acme@toolbox c]$ ./faccessat2 bla > faccessat2(123, bla, X_OK, AT_EACCESS | AT_SYMLINK_NOFOLLOW) = -1 > ⬢[acme@toolbox c]$ > > In the other terminal, as root: > > root@number:~# perf trace --call-graph dwarf -e faccessat2 --max-events=1 > 0.000 ( 0.034 ms): bash/62004 faccessat2(dfd: 123, filename: "bla", mode: X, flags: EACCESS|SYMLINK_NOFOLLOW) = -1 EBADF (Bad file descriptor) > syscall (/usr/lib64/libc.so.6) > faccessat2 (/home/acme/c/faccessat2) > main (/home/acme/c/faccessat2) > __libc_start_call_main (/usr/lib64/libc.so.6) > __libc_start_main@@GLIBC_2.34 (/usr/lib64/libc.so.6) > _start (/home/acme/c/faccessat2) > root@number:~# > > Now to write another test case, this time for the landlock syscall, to > test your btf_enum patch. > > In the future please add the test case so that one can quickly reproduce > your testing steps. Thank you for testing this patch, sorry for the inconvenience, I will attach tests to speed up reproduction in the future. Also, your simplification looks good, thank you so much. Thanks, Howard > > - Arnaldo
diff --git a/tools/perf/builtin-trace.c b/tools/perf/builtin-trace.c index c42bc608954e..5cbe1748911d 100644 --- a/tools/perf/builtin-trace.c +++ b/tools/perf/builtin-trace.c @@ -3354,7 +3354,8 @@ static int trace__bpf_prog_sys_exit_fd(struct trace *trace, int id) static struct bpf_program *trace__find_usable_bpf_prog_entry(struct trace *trace, struct syscall *sc) { struct tep_format_field *field, *candidate_field; - int id; + struct __syscall *scs = trace->sctbl->syscalls.entries; + int id, _id; /* * We're only interested in syscalls that have a pointer: @@ -3368,10 +3369,13 @@ static struct bpf_program *trace__find_usable_bpf_prog_entry(struct trace *trace try_to_find_pair: for (id = 0; id < trace->sctbl->syscalls.nr_entries; ++id) { - struct syscall *pair = trace__syscall_info(trace, NULL, id); + struct syscall *pair; struct bpf_program *pair_prog; bool is_candidate = false; + _id = scs[id].id; + pair = trace__syscall_info(trace, NULL, _id); + if (pair == NULL || pair == sc || pair->bpf_prog.sys_enter == trace->skel->progs.syscall_unaugmented) continue; @@ -3456,23 +3460,26 @@ static int trace__init_syscalls_bpf_prog_array_maps(struct trace *trace) { int map_enter_fd = bpf_map__fd(trace->skel->maps.syscalls_sys_enter); int map_exit_fd = bpf_map__fd(trace->skel->maps.syscalls_sys_exit); - int err = 0, key; + int err = 0, key, id; + struct __syscall *scs = trace->sctbl->syscalls.entries; for (key = 0; key < trace->sctbl->syscalls.nr_entries; ++key) { int prog_fd; - if (!trace__syscall_enabled(trace, key)) + id = scs[key].id; + + if (!trace__syscall_enabled(trace, id)) continue; - trace__init_syscall_bpf_progs(trace, key); + trace__init_syscall_bpf_progs(trace, id); // It'll get at least the "!raw_syscalls:unaugmented" - prog_fd = trace__bpf_prog_sys_enter_fd(trace, key); - err = bpf_map_update_elem(map_enter_fd, &key, &prog_fd, BPF_ANY); + prog_fd = trace__bpf_prog_sys_enter_fd(trace, id); + err = bpf_map_update_elem(map_enter_fd, &id, &prog_fd, BPF_ANY); if (err) break; - prog_fd = trace__bpf_prog_sys_exit_fd(trace, key); - err = bpf_map_update_elem(map_exit_fd, &key, &prog_fd, BPF_ANY); + prog_fd = trace__bpf_prog_sys_exit_fd(trace, id); + err = bpf_map_update_elem(map_exit_fd, &id, &prog_fd, BPF_ANY); if (err) break; } @@ -3506,10 +3513,13 @@ static int trace__init_syscalls_bpf_prog_array_maps(struct trace *trace) * array tail call, then that one will be used. */ for (key = 0; key < trace->sctbl->syscalls.nr_entries; ++key) { - struct syscall *sc = trace__syscall_info(trace, NULL, key); + struct syscall *sc; struct bpf_program *pair_prog; int prog_fd; + id = scs[key].id; + sc = trace__syscall_info(trace, NULL, id); + if (sc == NULL || sc->bpf_prog.sys_enter == NULL) continue; @@ -3535,7 +3545,7 @@ static int trace__init_syscalls_bpf_prog_array_maps(struct trace *trace) * with the fd for the program we're reusing: */ prog_fd = bpf_program__fd(sc->bpf_prog.sys_enter); - err = bpf_map_update_elem(map_enter_fd, &key, &prog_fd, BPF_ANY); + err = bpf_map_update_elem(map_enter_fd, &id, &prog_fd, BPF_ANY); if (err) break; } diff --git a/tools/perf/util/syscalltbl.c b/tools/perf/util/syscalltbl.c index 63be7b58761d..16aa886c40f0 100644 --- a/tools/perf/util/syscalltbl.c +++ b/tools/perf/util/syscalltbl.c @@ -44,22 +44,17 @@ const int syscalltbl_native_max_id = SYSCALLTBL_LOONGARCH_MAX_ID; static const char *const *syscalltbl_native = syscalltbl_loongarch; #endif -struct syscall { - int id; - const char *name; -}; - static int syscallcmpname(const void *vkey, const void *ventry) { const char *key = vkey; - const struct syscall *entry = ventry; + const struct __syscall *entry = ventry; return strcmp(key, entry->name); } static int syscallcmp(const void *va, const void *vb) { - const struct syscall *a = va, *b = vb; + const struct __syscall *a = va, *b = vb; return strcmp(a->name, b->name); } @@ -67,13 +62,14 @@ static int syscallcmp(const void *va, const void *vb) static int syscalltbl__init_native(struct syscalltbl *tbl) { int nr_entries = 0, i, j; - struct syscall *entries; + struct __syscall *entries; for (i = 0; i <= syscalltbl_native_max_id; ++i) if (syscalltbl_native[i]) ++nr_entries; - entries = tbl->syscalls.entries = malloc(sizeof(struct syscall) * nr_entries); + entries = tbl->syscalls.entries = malloc(sizeof(struct __syscall) * + nr_entries); if (tbl->syscalls.entries == NULL) return -1; @@ -85,7 +81,8 @@ static int syscalltbl__init_native(struct syscalltbl *tbl) } } - qsort(tbl->syscalls.entries, nr_entries, sizeof(struct syscall), syscallcmp); + qsort(tbl->syscalls.entries, nr_entries, sizeof(struct __syscall), + syscallcmp); tbl->syscalls.nr_entries = nr_entries; tbl->syscalls.max_id = syscalltbl_native_max_id; return 0; @@ -116,7 +113,7 @@ const char *syscalltbl__name(const struct syscalltbl *tbl __maybe_unused, int id int syscalltbl__id(struct syscalltbl *tbl, const char *name) { - struct syscall *sc = bsearch(name, tbl->syscalls.entries, + struct __syscall *sc = bsearch(name, tbl->syscalls.entries, tbl->syscalls.nr_entries, sizeof(*sc), syscallcmpname); @@ -126,7 +123,7 @@ int syscalltbl__id(struct syscalltbl *tbl, const char *name) int syscalltbl__strglobmatch_next(struct syscalltbl *tbl, const char *syscall_glob, int *idx) { int i; - struct syscall *syscalls = tbl->syscalls.entries; + struct __syscall *syscalls = tbl->syscalls.entries; for (i = *idx + 1; i < tbl->syscalls.nr_entries; ++i) { if (strglobmatch(syscalls[i].name, syscall_glob)) { diff --git a/tools/perf/util/syscalltbl.h b/tools/perf/util/syscalltbl.h index a41d2ca9e4ae..6e93a0874c40 100644 --- a/tools/perf/util/syscalltbl.h +++ b/tools/perf/util/syscalltbl.h @@ -2,6 +2,11 @@ #ifndef __PERF_SYSCALLTBL_H #define __PERF_SYSCALLTBL_H +struct __syscall { + int id; + const char *name; +}; + struct syscalltbl { int audit_machine; struct {
This is a bug found when implementing pretty-printing for the landlock_add_rule system call, I decided to send this patch separately because this is a serious bug that should be fixed fast. I wrote a test program to do landlock_add_rule syscall in a loop, yet perf trace -e landlock_add_rule freezes, giving no output. This bug is introduced by the false understanding of the variable "key" below: ``` for (key = 0; key < trace->sctbl->syscalls.nr_entries; ++key) { struct syscall *sc = trace__syscall_info(trace, NULL, key); ... } ``` The code above seems right at the beginning, but when looking at syscalltbl.c, I found these lines: ``` for (i = 0; i <= syscalltbl_native_max_id; ++i) if (syscalltbl_native[i]) ++nr_entries; entries = tbl->syscalls.entries = malloc(sizeof(struct syscall) * nr_entries); ... for (i = 0, j = 0; i <= syscalltbl_native_max_id; ++i) { if (syscalltbl_native[i]) { entries[j].name = syscalltbl_native[i]; entries[j].id = i; ++j; } } ``` meaning the key is merely an index to traverse the syscall table, instead of the actual syscall id for this particular syscall. So if one uses key to do trace__syscall_info(trace, NULL, key), because key only goes up to trace->sctbl->syscalls.nr_entries, for example, on my X86_64 machine, this number is 373, it will end up neglecting all the rest of the syscall, in my case, everything after `rseq`, because the traversal will stop at 373, and `rseq` is the last syscall whose id is lower than 373 in tools/perf/arch/x86/include/generated/asm/syscalls_64.c: ``` ... [334] = "rseq", [424] = "pidfd_send_signal", ... ``` The reason why the key is scrambled but perf trace works well is that key is used in trace__syscall_info(trace, NULL, key) to do trace->syscalls.table[id], this makes sure that the struct syscall returned actually has an id the same value as key, making the later bpf_prog matching all correct. After fixing this bug, I can do perf trace on 38 more syscalls, and because more syscalls are visible, we get 8 more syscalls that can be augmented. before: perf $ perf trace -vv --max-events=1 |& grep Reusing Reusing "open" BPF sys_enter augmenter for "stat" Reusing "open" BPF sys_enter augmenter for "lstat" Reusing "open" BPF sys_enter augmenter for "access" Reusing "connect" BPF sys_enter augmenter for "accept" Reusing "sendto" BPF sys_enter augmenter for "recvfrom" Reusing "connect" BPF sys_enter augmenter for "bind" Reusing "connect" BPF sys_enter augmenter for "getsockname" Reusing "connect" BPF sys_enter augmenter for "getpeername" Reusing "open" BPF sys_enter augmenter for "execve" Reusing "open" BPF sys_enter augmenter for "truncate" Reusing "open" BPF sys_enter augmenter for "chdir" Reusing "open" BPF sys_enter augmenter for "mkdir" Reusing "open" BPF sys_enter augmenter for "rmdir" Reusing "open" BPF sys_enter augmenter for "creat" Reusing "open" BPF sys_enter augmenter for "link" Reusing "open" BPF sys_enter augmenter for "unlink" Reusing "open" BPF sys_enter augmenter for "symlink" Reusing "open" BPF sys_enter augmenter for "readlink" Reusing "open" BPF sys_enter augmenter for "chmod" Reusing "open" BPF sys_enter augmenter for "chown" Reusing "open" BPF sys_enter augmenter for "lchown" Reusing "open" BPF sys_enter augmenter for "mknod" Reusing "open" BPF sys_enter augmenter for "statfs" Reusing "open" BPF sys_enter augmenter for "pivot_root" Reusing "open" BPF sys_enter augmenter for "chroot" Reusing "open" BPF sys_enter augmenter for "acct" Reusing "open" BPF sys_enter augmenter for "swapon" Reusing "open" BPF sys_enter augmenter for "swapoff" Reusing "open" BPF sys_enter augmenter for "delete_module" Reusing "open" BPF sys_enter augmenter for "setxattr" Reusing "open" BPF sys_enter augmenter for "lsetxattr" Reusing "openat" BPF sys_enter augmenter for "fsetxattr" Reusing "open" BPF sys_enter augmenter for "getxattr" Reusing "open" BPF sys_enter augmenter for "lgetxattr" Reusing "openat" BPF sys_enter augmenter for "fgetxattr" Reusing "open" BPF sys_enter augmenter for "listxattr" Reusing "open" BPF sys_enter augmenter for "llistxattr" Reusing "open" BPF sys_enter augmenter for "removexattr" Reusing "open" BPF sys_enter augmenter for "lremovexattr" Reusing "fsetxattr" BPF sys_enter augmenter for "fremovexattr" Reusing "open" BPF sys_enter augmenter for "mq_open" Reusing "open" BPF sys_enter augmenter for "mq_unlink" Reusing "fsetxattr" BPF sys_enter augmenter for "add_key" Reusing "fremovexattr" BPF sys_enter augmenter for "request_key" Reusing "fremovexattr" BPF sys_enter augmenter for "inotify_add_watch" Reusing "fremovexattr" BPF sys_enter augmenter for "mkdirat" Reusing "fremovexattr" BPF sys_enter augmenter for "mknodat" Reusing "fremovexattr" BPF sys_enter augmenter for "fchownat" Reusing "fremovexattr" BPF sys_enter augmenter for "futimesat" Reusing "fremovexattr" BPF sys_enter augmenter for "newfstatat" Reusing "fremovexattr" BPF sys_enter augmenter for "unlinkat" Reusing "fremovexattr" BPF sys_enter augmenter for "linkat" Reusing "open" BPF sys_enter augmenter for "symlinkat" Reusing "fremovexattr" BPF sys_enter augmenter for "readlinkat" Reusing "fremovexattr" BPF sys_enter augmenter for "fchmodat" Reusing "fremovexattr" BPF sys_enter augmenter for "faccessat" Reusing "fremovexattr" BPF sys_enter augmenter for "utimensat" Reusing "connect" BPF sys_enter augmenter for "accept4" Reusing "fremovexattr" BPF sys_enter augmenter for "name_to_handle_at" Reusing "fremovexattr" BPF sys_enter augmenter for "renameat2" Reusing "open" BPF sys_enter augmenter for "memfd_create" Reusing "fremovexattr" BPF sys_enter augmenter for "execveat" Reusing "fremovexattr" BPF sys_enter augmenter for "statx" after perf $ perf trace -vv --max-events=1 |& grep Reusing Reusing "open" BPF sys_enter augmenter for "stat" Reusing "open" BPF sys_enter augmenter for "lstat" Reusing "open" BPF sys_enter augmenter for "access" Reusing "connect" BPF sys_enter augmenter for "accept" Reusing "sendto" BPF sys_enter augmenter for "recvfrom" Reusing "connect" BPF sys_enter augmenter for "bind" Reusing "connect" BPF sys_enter augmenter for "getsockname" Reusing "connect" BPF sys_enter augmenter for "getpeername" Reusing "open" BPF sys_enter augmenter for "execve" Reusing "open" BPF sys_enter augmenter for "truncate" Reusing "open" BPF sys_enter augmenter for "chdir" Reusing "open" BPF sys_enter augmenter for "mkdir" Reusing "open" BPF sys_enter augmenter for "rmdir" Reusing "open" BPF sys_enter augmenter for "creat" Reusing "open" BPF sys_enter augmenter for "link" Reusing "open" BPF sys_enter augmenter for "unlink" Reusing "open" BPF sys_enter augmenter for "symlink" Reusing "open" BPF sys_enter augmenter for "readlink" Reusing "open" BPF sys_enter augmenter for "chmod" Reusing "open" BPF sys_enter augmenter for "chown" Reusing "open" BPF sys_enter augmenter for "lchown" Reusing "open" BPF sys_enter augmenter for "mknod" Reusing "open" BPF sys_enter augmenter for "statfs" Reusing "open" BPF sys_enter augmenter for "pivot_root" Reusing "open" BPF sys_enter augmenter for "chroot" Reusing "open" BPF sys_enter augmenter for "acct" Reusing "open" BPF sys_enter augmenter for "swapon" Reusing "open" BPF sys_enter augmenter for "swapoff" Reusing "open" BPF sys_enter augmenter for "delete_module" Reusing "open" BPF sys_enter augmenter for "setxattr" Reusing "open" BPF sys_enter augmenter for "lsetxattr" Reusing "openat" BPF sys_enter augmenter for "fsetxattr" Reusing "open" BPF sys_enter augmenter for "getxattr" Reusing "open" BPF sys_enter augmenter for "lgetxattr" Reusing "openat" BPF sys_enter augmenter for "fgetxattr" Reusing "open" BPF sys_enter augmenter for "listxattr" Reusing "open" BPF sys_enter augmenter for "llistxattr" Reusing "open" BPF sys_enter augmenter for "removexattr" Reusing "open" BPF sys_enter augmenter for "lremovexattr" Reusing "fsetxattr" BPF sys_enter augmenter for "fremovexattr" Reusing "open" BPF sys_enter augmenter for "mq_open" Reusing "open" BPF sys_enter augmenter for "mq_unlink" Reusing "fsetxattr" BPF sys_enter augmenter for "add_key" Reusing "fremovexattr" BPF sys_enter augmenter for "request_key" Reusing "fremovexattr" BPF sys_enter augmenter for "inotify_add_watch" Reusing "fremovexattr" BPF sys_enter augmenter for "mkdirat" Reusing "fremovexattr" BPF sys_enter augmenter for "mknodat" Reusing "fremovexattr" BPF sys_enter augmenter for "fchownat" Reusing "fremovexattr" BPF sys_enter augmenter for "futimesat" Reusing "fremovexattr" BPF sys_enter augmenter for "newfstatat" Reusing "fremovexattr" BPF sys_enter augmenter for "unlinkat" Reusing "fremovexattr" BPF sys_enter augmenter for "linkat" Reusing "open" BPF sys_enter augmenter for "symlinkat" Reusing "fremovexattr" BPF sys_enter augmenter for "readlinkat" Reusing "fremovexattr" BPF sys_enter augmenter for "fchmodat" Reusing "fremovexattr" BPF sys_enter augmenter for "faccessat" Reusing "fremovexattr" BPF sys_enter augmenter for "utimensat" Reusing "connect" BPF sys_enter augmenter for "accept4" Reusing "fremovexattr" BPF sys_enter augmenter for "name_to_handle_at" Reusing "fremovexattr" BPF sys_enter augmenter for "renameat2" Reusing "open" BPF sys_enter augmenter for "memfd_create" Reusing "fremovexattr" BPF sys_enter augmenter for "execveat" Reusing "fremovexattr" BPF sys_enter augmenter for "statx" TL;DR: These are the new syscalls that can be augmented Reusing "openat" BPF sys_enter augmenter for "open_tree" Reusing "openat" BPF sys_enter augmenter for "openat2" Reusing "openat" BPF sys_enter augmenter for "mount_setattr" Reusing "openat" BPF sys_enter augmenter for "move_mount" Reusing "open" BPF sys_enter augmenter for "fsopen" Reusing "openat" BPF sys_enter augmenter for "fspick" Reusing "openat" BPF sys_enter augmenter for "faccessat2" Reusing "openat" BPF sys_enter augmenter for "fchmodat2" as for the perf trace output: before perf $ perf trace -e faccessat2 --max-events=1 [no output] after perf $ ./perf trace -e faccessat2 --max-events=1 0.000 ( 0.037 ms): waybar/958 faccessat2(dfd: 40, filename: "uevent") = 0 P.S. The reason why this bug was not found in the past five years is probably because it only happens to the newer syscalls whose id is greater, for instance, faccessat2 of id 439, which not a lot of people care about when using perf trace. Signed-off-by: Howard Chu <howardchu95@gmail.com> --- tools/perf/builtin-trace.c | 32 +++++++++++++++++++++----------- tools/perf/util/syscalltbl.c | 21 +++++++++------------ tools/perf/util/syscalltbl.h | 5 +++++ 3 files changed, 35 insertions(+), 23 deletions(-)