Message ID | 20240207153550.856536-1-jolsa@kernel.org (mailing list archive) |
---|---|
Headers | show |
Series | bpf: Add support to attach return prog in kprobe multi | expand |
On Wed, Feb 7, 2024 at 7:35 AM Jiri Olsa <jolsa@kernel.org> wrote: > > hi, > adding support to attach both entry and return bpf program on single > kprobe multi link. > > Having entry together with return probe for given function is common > use case for tetragon, bpftrace and most likely for others. > > At the moment if we want both entry and return probe to execute bpf > program we need to create two (entry and return probe) links. The link > for return probe creates extra entry probe to setup the return probe. > The extra entry probe execution could be omitted if we had a way to > use just single link for both entry and exit probe. > > In addition it's possible to control the execution of the return probe > with the return value of the entry bpf program. If the entry program > returns 0 the return probe is installed and executed, otherwise it's > skip. > In general, I think this is a very useful ability to have a combined entry/return program. But the way you implement it with extra flag and extra fd parameter makes it harder to have a nice high-level support in libbpf (and presumably other BPF loader libraries) for this. When I was thinking about doing something like this, I was considering adding a new program type, actually. That way it's possible to define this "let's skip return probe" protocol without backwards compatibility concerns. It's easier to use it declaratively in libbpf. You just declare SEC("kprobe.wrap/...") (or whatever the name, something to designate that it's both entry and exit probe) as one program and in the code there would be some way to determine whether we are in entry mode or exit mode (helper or field in the custom context type, the latter being faster and more usable, but it's probably not critical). Another frequently requested feature and a very common use case is to measure duration of the function, so if we have a custom type, we can have a field to record entry timestamp and let user calculate duration, if necessary. Without this users have to pay a bunch of extra overhead to record timestamp and put it into hashmap (keyed by thread id) or thread local storage (even if they have no other use for thread local storage). Also, consider that a similar concept is applicable to uprobes and we should do that as well, in similar fashion. And the above approach works for both kprobe/kretprobe and uprobe/uretprobe cases, because they have the same pt_regs data structure as a context (even if for exit probes most of the values of pt_regs are not that meaningful). So anyways, great feature, but let's discuss end-to-end usability story before we finalize the implementation? > I'm still working on the tetragon change, so I'll be sending non-RFC > version once that's ready, meanwhile any ideas and feedback on the > approach would be great. > > The change in bpftrace [1] using the new interface shows speed increase > with tracing perf bench messaging: > > # perf bench sched messaging -l 100000 > > having system wide bpftrace: > > # bpftrace -e 'kprobe:ksys_write { }, kretprobe:ksys_write { }' > > without bpftrace: > > # Running 'sched/messaging' benchmark: > # 20 sender and receiver processes per group > # 10 groups == 400 processes run > > Total time: 119.595 [sec] > > Performance counter stats for 'perf bench sched messaging -l 100000': > > 102,419,967,282 cycles:u > 5,652,444,107,001 cycles:k > 5,782,645,019,612 cycles > 22,187,151,206 instructions:u # 0.22 insn per cycle > 2,979,040,498,455 instructions:k # 0.53 insn per cycle > > 119.671169829 seconds time elapsed > > 94.959198000 seconds user > 1815.371616000 seconds sys > > with current bpftrace: > > # Running 'sched/messaging' benchmark: > # 20 sender and receiver processes per group > # 10 groups == 400 processes run > > Total time: 221.153 [sec] > > Performance counter stats for 'perf bench sched messaging -l 100000': > > 125,292,164,504 cycles:u btw, why +25% in user space?... this looks weird > 10,315,020,393,735 cycles:k > 10,501,379,274,042 cycles > 22,187,583,545 instructions:u # 0.18 insn per cycle > 4,856,893,111,303 instructions:k # 0.47 insn per cycle > > 221.229234283 seconds time elapsed > > 103.792498000 seconds user > 3432.643302000 seconds sys > > with bpftrace using the new interface: > > # Running 'sched/messaging' benchmark: > # 20 sender and receiver processes per group > # 10 groups == 400 processes run > > Total time: 157.825 [sec] > > Performance counter stats for 'perf bench sched messaging -l 100000': > > 102,423,112,279 cycles:u > 7,450,856,354,744 cycles:k > 7,584,769,726,693 cycles > 22,187,270,661 instructions:u # 0.22 insn per cycle > 3,985,522,383,425 instructions:k # 0.53 insn per cycle > > 157.900787760 seconds time elapsed > > 97.953898000 seconds user > 2425.314753000 seconds sys > > thanks, > jirka > > > [1] https://github.com/bpftrace/bpftrace/pull/2984 > --- > Jiri Olsa (4): > fprobe: Add entry/exit callbacks types > bpf: Add return prog to kprobe multi > libbpf: Add return_prog_fd to kprobe multi opts > selftests/bpf: Add kprobe multi return prog test > > include/linux/fprobe.h | 18 ++++++++++------ > include/uapi/linux/bpf.h | 4 +++- > kernel/trace/bpf_trace.c | 50 ++++++++++++++++++++++++++++++++----------- > tools/include/uapi/linux/bpf.h | 4 +++- > tools/lib/bpf/bpf.c | 1 + > tools/lib/bpf/bpf.h | 1 + > tools/lib/bpf/libbpf.c | 5 +++++ > tools/lib/bpf/libbpf.h | 6 +++++- > tools/testing/selftests/bpf/prog_tests/kprobe_multi_test.c | 53 +++++++++++++++++++++++++++++++++++++++++++++ > tools/testing/selftests/bpf/progs/kprobe_multi_return_prog.c | 105 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ > 10 files changed, 226 insertions(+), 21 deletions(-) > create mode 100644 tools/testing/selftests/bpf/progs/kprobe_multi_return_prog.c
On Thu, Feb 08, 2024 at 11:35:18AM -0800, Andrii Nakryiko wrote: > On Wed, Feb 7, 2024 at 7:35 AM Jiri Olsa <jolsa@kernel.org> wrote: > > > > hi, > > adding support to attach both entry and return bpf program on single > > kprobe multi link. > > > > Having entry together with return probe for given function is common > > use case for tetragon, bpftrace and most likely for others. > > > > At the moment if we want both entry and return probe to execute bpf > > program we need to create two (entry and return probe) links. The link > > for return probe creates extra entry probe to setup the return probe. > > The extra entry probe execution could be omitted if we had a way to > > use just single link for both entry and exit probe. > > > > In addition it's possible to control the execution of the return probe > > with the return value of the entry bpf program. If the entry program > > returns 0 the return probe is installed and executed, otherwise it's > > skip. > > > > In general, I think this is a very useful ability to have a combined > entry/return program. great, thanks > > But the way you implement it with extra flag and extra fd parameter > makes it harder to have a nice high-level support in libbpf (and > presumably other BPF loader libraries) for this. > > When I was thinking about doing something like this, I was considering > adding a new program type, actually. That way it's possible to define > this "let's skip return probe" protocol without backwards > compatibility concerns. It's easier to use it declaratively in libbpf. ok, that seems cleaner.. but we need to use current kprobe programs, so not sure at the moment how would that fit in.. did you mean new link type? > You just declare SEC("kprobe.wrap/...") (or whatever the name, > something to designate that it's both entry and exit probe) as one > program and in the code there would be some way to determine whether > we are in entry mode or exit mode (helper or field in the custom > context type, the latter being faster and more usable, but it's > probably not critical). hum, so the single program would be for both entry and exit probe, I'll need to check how bad it'd be for us, but it'd probably mean just one extra tail call, so it's likely ok > > Another frequently requested feature and a very common use case is to > measure duration of the function, so if we have a custom type, we can > have a field to record entry timestamp and let user calculate you mean bpf program context right? > duration, if necessary. Without this users have to pay a bunch of > extra overhead to record timestamp and put it into hashmap (keyed by > thread id) or thread local storage (even if they have no other use for > thread local storage). sounds good > > Also, consider that a similar concept is applicable to uprobes and we > should do that as well, in similar fashion. And the above approach > works for both kprobe/kretprobe and uprobe/uretprobe cases, because > they have the same pt_regs data structure as a context (even if for > exit probes most of the values of pt_regs are not that meaningful). ok, that seems useful for uprobes as well btw I have another unrelated change in stash that that let you choose the bpf_prog_active or prog->active re-entry check before program is executed in krobe_multi link.. I also added extra flag, and it still seems like good idea to me, thoughts? ;-) > > So anyways, great feature, but let's discuss end-to-end usability > story before we finalize the implementation? yep, it does say RFC in the subject ;-) > > > > I'm still working on the tetragon change, so I'll be sending non-RFC > > version once that's ready, meanwhile any ideas and feedback on the > > approach would be great. > > > > The change in bpftrace [1] using the new interface shows speed increase > > with tracing perf bench messaging: > > > > # perf bench sched messaging -l 100000 > > > > having system wide bpftrace: > > > > # bpftrace -e 'kprobe:ksys_write { }, kretprobe:ksys_write { }' > > > > without bpftrace: > > > > # Running 'sched/messaging' benchmark: > > # 20 sender and receiver processes per group > > # 10 groups == 400 processes run > > > > Total time: 119.595 [sec] > > > > Performance counter stats for 'perf bench sched messaging -l 100000': > > > > 102,419,967,282 cycles:u > > 5,652,444,107,001 cycles:k > > 5,782,645,019,612 cycles > > 22,187,151,206 instructions:u # 0.22 insn per cycle > > 2,979,040,498,455 instructions:k # 0.53 insn per cycle > > > > 119.671169829 seconds time elapsed > > > > 94.959198000 seconds user > > 1815.371616000 seconds sys > > > > with current bpftrace: > > > > # Running 'sched/messaging' benchmark: > > # 20 sender and receiver processes per group > > # 10 groups == 400 processes run > > > > Total time: 221.153 [sec] > > > > Performance counter stats for 'perf bench sched messaging -l 100000': > > > > 125,292,164,504 cycles:u > > btw, why +25% in user space?... this looks weird hmm right.. maybe there's some bpftrace timer code that got triggered/executed more times because it takes long to finish the workload? I'll check thanks, jirka > > > > 10,315,020,393,735 cycles:k > > 10,501,379,274,042 cycles > > 22,187,583,545 instructions:u # 0.18 insn per cycle > > 4,856,893,111,303 instructions:k # 0.47 insn per cycle > > > > 221.229234283 seconds time elapsed > > > > 103.792498000 seconds user > > 3432.643302000 seconds sys > > > > with bpftrace using the new interface: > > > > # Running 'sched/messaging' benchmark: > > # 20 sender and receiver processes per group > > # 10 groups == 400 processes run > > > > Total time: 157.825 [sec] > > > > Performance counter stats for 'perf bench sched messaging -l 100000': > > > > 102,423,112,279 cycles:u > > 7,450,856,354,744 cycles:k > > 7,584,769,726,693 cycles > > 22,187,270,661 instructions:u # 0.22 insn per cycle > > 3,985,522,383,425 instructions:k # 0.53 insn per cycle > > > > 157.900787760 seconds time elapsed > > > > 97.953898000 seconds user > > 2425.314753000 seconds sys > > > > thanks, > > jirka > > > > > > [1] https://github.com/bpftrace/bpftrace/pull/2984 > > --- > > Jiri Olsa (4): > > fprobe: Add entry/exit callbacks types > > bpf: Add return prog to kprobe multi > > libbpf: Add return_prog_fd to kprobe multi opts > > selftests/bpf: Add kprobe multi return prog test > > > > include/linux/fprobe.h | 18 ++++++++++------ > > include/uapi/linux/bpf.h | 4 +++- > > kernel/trace/bpf_trace.c | 50 ++++++++++++++++++++++++++++++++----------- > > tools/include/uapi/linux/bpf.h | 4 +++- > > tools/lib/bpf/bpf.c | 1 + > > tools/lib/bpf/bpf.h | 1 + > > tools/lib/bpf/libbpf.c | 5 +++++ > > tools/lib/bpf/libbpf.h | 6 +++++- > > tools/testing/selftests/bpf/prog_tests/kprobe_multi_test.c | 53 +++++++++++++++++++++++++++++++++++++++++++++ > > tools/testing/selftests/bpf/progs/kprobe_multi_return_prog.c | 105 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ > > 10 files changed, 226 insertions(+), 21 deletions(-) > > create mode 100644 tools/testing/selftests/bpf/progs/kprobe_multi_return_prog.c
On Sat, Feb 10, 2024 at 7:31 AM Jiri Olsa <olsajiri@gmail.com> wrote: > > On Thu, Feb 08, 2024 at 11:35:18AM -0800, Andrii Nakryiko wrote: > > On Wed, Feb 7, 2024 at 7:35 AM Jiri Olsa <jolsa@kernel.org> wrote: > > > > > > hi, > > > adding support to attach both entry and return bpf program on single > > > kprobe multi link. > > > > > > Having entry together with return probe for given function is common > > > use case for tetragon, bpftrace and most likely for others. > > > > > > At the moment if we want both entry and return probe to execute bpf > > > program we need to create two (entry and return probe) links. The link > > > for return probe creates extra entry probe to setup the return probe. > > > The extra entry probe execution could be omitted if we had a way to > > > use just single link for both entry and exit probe. > > > > > > In addition it's possible to control the execution of the return probe > > > with the return value of the entry bpf program. If the entry program > > > returns 0 the return probe is installed and executed, otherwise it's > > > skip. > > > > > > > In general, I think this is a very useful ability to have a combined > > entry/return program. > > great, thanks > > > > > But the way you implement it with extra flag and extra fd parameter > > makes it harder to have a nice high-level support in libbpf (and > > presumably other BPF loader libraries) for this. > > > > When I was thinking about doing something like this, I was considering > > adding a new program type, actually. That way it's possible to define > > this "let's skip return probe" protocol without backwards > > compatibility concerns. It's easier to use it declaratively in libbpf. > > ok, that seems cleaner.. but we need to use current kprobe programs, > so not sure at the moment how would that fit in.. did you mean new > link type? It's kind of a less important detail, actually. New program type would allow us to have an entirely different context type, but I think we can make do with the existing kprobe program type. We can have a separate attach_type and link type, just like multi-kprobe and multi-uprobe are still kprobe programs. > > > You just declare SEC("kprobe.wrap/...") (or whatever the name, > > something to designate that it's both entry and exit probe) as one > > program and in the code there would be some way to determine whether > > we are in entry mode or exit mode (helper or field in the custom > > context type, the latter being faster and more usable, but it's > > probably not critical). > > hum, so the single program would be for both entry and exit probe, > I'll need to check how bad it'd be for us, but it'd probably mean > just one extra tail call, so it's likely ok I guess, I don't know what you are doing there :) I'd recommend looking at utilizing BPF global subprogs instead of tail calls, if your kernel allows for that, as that's a saner way to scale BPF verification. > > > > > Another frequently requested feature and a very common use case is to > > measure duration of the function, so if we have a custom type, we can > > have a field to record entry timestamp and let user calculate > > you mean bpf program context right? Yes, when I was writing that I was imagining a new context type with new fields. But thinking about this a bit more, I like a slightly different approach now. Instead of having a new context type, kernel capturing timestamps, etc, what if we introduce a concept of "a session". This paired kprobe+kretprobe (and same for uprobe+uretprobe) share a common session. So, say, if we are tracing kernel function abc() and we have our kwrapper (let's call it that for now, kwrapper and uwrapper?) program attached to abc, we have something like this: 1) abc() is called 2) we intercept it at its entry, initialize "session" 3) call our kwrapper program in entry mode 4) abc() proceeds, until return 5) our kwrapper program is called in exit mode 6) at this point the session has ended If abc() is called again or in parallel on another CPU, each such abc() invocation (with kwrapper prog) has its own session state. Now, it all sounds scary, but a session could be really just a 8 byte value that has to last for the duration of a single kprobe+kretprobe execution. Imagine we introduce just one new kfunc that would be usable from kwrapper and uwrapper programs: u64 *bpf_get_session_cookie(void); It returns a pointer to an 8-byte slot which initially is set to zero by kernel (before kprobe/entry part). And then the program can put anything into it. Either from entry (kprobe/uprobe) or exit (kretprobe/uretprobe) executions. Whatever value was stored in entry mode will still be there in exit mode as well. This gives a lot of flexibility without requiring the kernel to do anything expensive. E.g., it's easy to detect if kwrapper program is an entry call or exit call: u64 *cookie = bpf_get_session_cookie(); if (*cookie == 0) { /* entry mode */ *cookie = 1; } else { /* exit mode */ } Or if we wanted to measure function duration/latency: u64 *cookie = bpf_get_session_cookie(); if (*cookie == 0) { *cookie = bpf_get_ktime_ns(); } else { u64 duration = bpf_get_ktime_ns() - *cookie; /* output duration somehow */ } Yes, I realize special-casing zero might be a bit inconvenient, but I think simplicity trumps a potential for zero to be a valid value (and there are always ways to work around zero as a meaningful value). Now, in more complicated cases 8 bytes of temporary session state isn't enough, just like BPF cookie being 8 byte (read-only) value might not be enough. But the solution is the same as with the BPF cookie. You just use those 8 bytes as a key into ARRAY/HASHMAP/whatnot storage. It's simple and fast enough for pretty much any case. Now, how do we implement this storage session? I recently looked into uprobe/uretprobe implementation, and I know that for uretprobes there is a dynamically allocated uretprobe_instance (or whatever the name) allocated for each runtime invocation of uretprobe. We can use that. We'll need to allocate this instance a bit earlier, before uprobe part of uwrapper BPF program has to be executed, initialize session cookie to zero, and store pointer to this memory in bpf_run_ctx (just like we do for BPF cookie). I bet there is something similar in the kretprobe case, where we can carve out 8 bytes and pass it to both entry and exit parts of kwrapper program. So anyways, might need some internal refactoring, but seems very doable and will give a lot of power and flexibility for tracing use cases. It would be cool to also have a similar fentry/fexit combo with "BPF session cookie", but that's something we can think about separately, probably. > > > duration, if necessary. Without this users have to pay a bunch of > > extra overhead to record timestamp and put it into hashmap (keyed by > > thread id) or thread local storage (even if they have no other use for > > thread local storage). > > sounds good > > > > > Also, consider that a similar concept is applicable to uprobes and we > > should do that as well, in similar fashion. And the above approach > > works for both kprobe/kretprobe and uprobe/uretprobe cases, because > > they have the same pt_regs data structure as a context (even if for > > exit probes most of the values of pt_regs are not that meaningful). > > ok, that seems useful for uprobes as well yes, absolutely > > btw I have another unrelated change in stash that that let you choose > the bpf_prog_active or prog->active re-entry check before program is > executed in krobe_multi link.. I also added extra flag, and it still > seems like good idea to me, thoughts? ;-) > no specific opinion on this from my side, I guess > > > > So anyways, great feature, but let's discuss end-to-end usability > > story before we finalize the implementation? > > yep, it does say RFC in the subject ;-) > > > > > [...]
On Mon, Feb 12, 2024 at 08:06:06PM -0800, Andrii Nakryiko wrote: SNIP > > > But the way you implement it with extra flag and extra fd parameter > > > makes it harder to have a nice high-level support in libbpf (and > > > presumably other BPF loader libraries) for this. > > > > > > When I was thinking about doing something like this, I was considering > > > adding a new program type, actually. That way it's possible to define > > > this "let's skip return probe" protocol without backwards > > > compatibility concerns. It's easier to use it declaratively in libbpf. > > > > ok, that seems cleaner.. but we need to use current kprobe programs, > > so not sure at the moment how would that fit in.. did you mean new > > link type? > > It's kind of a less important detail, actually. New program type would > allow us to have an entirely different context type, but I think we > can make do with the existing kprobe program type. We can have a > separate attach_type and link type, just like multi-kprobe and > multi-uprobe are still kprobe programs. ok, having new attach type on top of kprobe_multi link makes sense > > > > > > You just declare SEC("kprobe.wrap/...") (or whatever the name, > > > something to designate that it's both entry and exit probe) as one > > > program and in the code there would be some way to determine whether > > > we are in entry mode or exit mode (helper or field in the custom > > > context type, the latter being faster and more usable, but it's > > > probably not critical). > > > > hum, so the single program would be for both entry and exit probe, > > I'll need to check how bad it'd be for us, but it'd probably mean > > just one extra tail call, so it's likely ok > > I guess, I don't know what you are doing there :) I'd recommend > looking at utilizing BPF global subprogs instead of tail calls, if > your kernel allows for that, as that's a saner way to scale BPF > verification. ok, we should probably do that.. given this enhancement will be available on latest kernel anyway, we could use global subprogs as well the related bpftrace might be bit more challenging.. will have to generate program calling entry or return program now, but seems doable of course > > > > > > > > > Another frequently requested feature and a very common use case is to > > > measure duration of the function, so if we have a custom type, we can > > > have a field to record entry timestamp and let user calculate > > > > you mean bpf program context right? > > Yes, when I was writing that I was imagining a new context type with new fields. > > But thinking about this a bit more, I like a slightly different > approach now. Instead of having a new context type, kernel capturing > timestamps, etc, what if we introduce a concept of "a session". This > paired kprobe+kretprobe (and same for uprobe+uretprobe) share a common > session. So, say, if we are tracing kernel function abc() and we have > our kwrapper (let's call it that for now, kwrapper and uwrapper?) > program attached to abc, we have something like this: > > 1) abc() is called > 2) we intercept it at its entry, initialize "session" > 3) call our kwrapper program in entry mode > 4) abc() proceeds, until return > 5) our kwrapper program is called in exit mode > 6) at this point the session has ended > > If abc() is called again or in parallel on another CPU, each such > abc() invocation (with kwrapper prog) has its own session state. > > Now, it all sounds scary, but a session could be really just a 8 byte > value that has to last for the duration of a single kprobe+kretprobe > execution. > > Imagine we introduce just one new kfunc that would be usable from > kwrapper and uwrapper programs: > > u64 *bpf_get_session_cookie(void); > > It returns a pointer to an 8-byte slot which initially is set to zero > by kernel (before kprobe/entry part). And then the program can put > anything into it. Either from entry (kprobe/uprobe) or exit > (kretprobe/uretprobe) executions. Whatever value was stored in entry > mode will still be there in exit mode as well. > > This gives a lot of flexibility without requiring the kernel to do > anything expensive. > > E.g., it's easy to detect if kwrapper program is an entry call or exit call: > > u64 *cookie = bpf_get_session_cookie(); > if (*cookie == 0) { > /* entry mode */ > *cookie = 1; > } else { > /* exit mode */ > } > > Or if we wanted to measure function duration/latency: > > u64 *cookie = bpf_get_session_cookie(); > if (*cookie == 0) { > *cookie = bpf_get_ktime_ns(); > } else { > u64 duration = bpf_get_ktime_ns() - *cookie; > /* output duration somehow */ > } > > Yes, I realize special-casing zero might be a bit inconvenient, but I > think simplicity trumps a potential for zero to be a valid value (and > there are always ways to work around zero as a meaningful value). > > Now, in more complicated cases 8 bytes of temporary session state > isn't enough, just like BPF cookie being 8 byte (read-only) value > might not be enough. But the solution is the same as with the BPF > cookie. You just use those 8 bytes as a key into ARRAY/HASHMAP/whatnot > storage. It's simple and fast enough for pretty much any case. I was recently asked for a way to have function arguments available in the return kprobe as it is in fexit programs (which was not an option to use, because we don't have fast multi attach for it) using the hash map to store arguments and storing its key in the session data might be solution for this > > > Now, how do we implement this storage session? I recently looked into > uprobe/uretprobe implementation, and I know that for uretprobes there > is a dynamically allocated uretprobe_instance (or whatever the name) > allocated for each runtime invocation of uretprobe. We can use that. > We'll need to allocate this instance a bit earlier, before uprobe part > of uwrapper BPF program has to be executed, initialize session cookie > to zero, and store pointer to this memory in bpf_run_ctx (just like we > do for BPF cookie). I think you refer to the return_instance and it seems like we could use it, I'll check on that > > I bet there is something similar in the kretprobe case, where we can > carve out 8 bytes and pass it to both entry and exit parts of kwrapper > program. for kprobes.. both kprobe and kprobe_multi/fprobe use rethook to invoke return probes, so I guess we could use it and store that shared data in there btw Masami is in process of removing rethook from kprobe_multi/fprobe, as part of migrating fprobe on top of ftrace [0] but instead the rethook I think there'll be some sort of shadow stack/data area accessible from both entry and return probes, that we could use jirka [0] https://lore.kernel.org/bpf/170723204881.502590.11906735097521170661.stgit@devnote2/ > > So anyways, might need some internal refactoring, but seems very > doable and will give a lot of power and flexibility for tracing use > cases. > > It would be cool to also have a similar fentry/fexit combo with "BPF > session cookie", but that's something we can think about separately, > probably. > > > > > > > duration, if necessary. Without this users have to pay a bunch of > > > extra overhead to record timestamp and put it into hashmap (keyed by > > > thread id) or thread local storage (even if they have no other use for > > > thread local storage). > > > > sounds good > > > > > > > > Also, consider that a similar concept is applicable to uprobes and we > > > should do that as well, in similar fashion. And the above approach > > > works for both kprobe/kretprobe and uprobe/uretprobe cases, because > > > they have the same pt_regs data structure as a context (even if for > > > exit probes most of the values of pt_regs are not that meaningful). > > > > ok, that seems useful for uprobes as well > > yes, absolutely > > > > > btw I have another unrelated change in stash that that let you choose > > the bpf_prog_active or prog->active re-entry check before program is > > executed in krobe_multi link.. I also added extra flag, and it still > > seems like good idea to me, thoughts? ;-) > > > > no specific opinion on this from my side, I guess > > > > > > > So anyways, great feature, but let's discuss end-to-end usability > > > story before we finalize the implementation? > > > > yep, it does say RFC in the subject ;-) > > > > > > > > > > [...]
On Tue, Feb 13, 2024 at 4:09 AM Jiri Olsa <olsajiri@gmail.com> wrote: > > On Mon, Feb 12, 2024 at 08:06:06PM -0800, Andrii Nakryiko wrote: > > SNIP > > > > > But the way you implement it with extra flag and extra fd parameter > > > > makes it harder to have a nice high-level support in libbpf (and > > > > presumably other BPF loader libraries) for this. > > > > > > > > When I was thinking about doing something like this, I was considering > > > > adding a new program type, actually. That way it's possible to define > > > > this "let's skip return probe" protocol without backwards > > > > compatibility concerns. It's easier to use it declaratively in libbpf. > > > > > > ok, that seems cleaner.. but we need to use current kprobe programs, > > > so not sure at the moment how would that fit in.. did you mean new > > > link type? > > > > It's kind of a less important detail, actually. New program type would > > allow us to have an entirely different context type, but I think we > > can make do with the existing kprobe program type. We can have a > > separate attach_type and link type, just like multi-kprobe and > > multi-uprobe are still kprobe programs. > > ok, having new attach type on top of kprobe_multi link makes sense > > > > > > > > > > You just declare SEC("kprobe.wrap/...") (or whatever the name, > > > > something to designate that it's both entry and exit probe) as one > > > > program and in the code there would be some way to determine whether > > > > we are in entry mode or exit mode (helper or field in the custom > > > > context type, the latter being faster and more usable, but it's > > > > probably not critical). > > > > > > hum, so the single program would be for both entry and exit probe, > > > I'll need to check how bad it'd be for us, but it'd probably mean > > > just one extra tail call, so it's likely ok > > > > I guess, I don't know what you are doing there :) I'd recommend > > looking at utilizing BPF global subprogs instead of tail calls, if > > your kernel allows for that, as that's a saner way to scale BPF > > verification. > > ok, we should probably do that.. given this enhancement will be > available on latest kernel anyway, we could use global subprogs > as well > > the related bpftrace might be bit more challenging.. will have to > generate program calling entry or return program now, but seems > doable of course So you want users to still have separate kprobe and kretprobe in bpftrace, but combine them into this kwrapper transparently? It does seem doable, but hopefully we'll be able to write kwrapper programs in bpftrace directly as well. > > > > > > > > > > > > > > Another frequently requested feature and a very common use case is to > > > > measure duration of the function, so if we have a custom type, we can > > > > have a field to record entry timestamp and let user calculate > > > > > > you mean bpf program context right? > > > > Yes, when I was writing that I was imagining a new context type with new fields. > > > > But thinking about this a bit more, I like a slightly different > > approach now. Instead of having a new context type, kernel capturing > > timestamps, etc, what if we introduce a concept of "a session". This > > paired kprobe+kretprobe (and same for uprobe+uretprobe) share a common > > session. So, say, if we are tracing kernel function abc() and we have > > our kwrapper (let's call it that for now, kwrapper and uwrapper?) > > program attached to abc, we have something like this: > > > > 1) abc() is called > > 2) we intercept it at its entry, initialize "session" > > 3) call our kwrapper program in entry mode > > 4) abc() proceeds, until return > > 5) our kwrapper program is called in exit mode > > 6) at this point the session has ended > > > > If abc() is called again or in parallel on another CPU, each such > > abc() invocation (with kwrapper prog) has its own session state. > > > > Now, it all sounds scary, but a session could be really just a 8 byte > > value that has to last for the duration of a single kprobe+kretprobe > > execution. > > > > Imagine we introduce just one new kfunc that would be usable from > > kwrapper and uwrapper programs: > > > > u64 *bpf_get_session_cookie(void); > > > > It returns a pointer to an 8-byte slot which initially is set to zero > > by kernel (before kprobe/entry part). And then the program can put > > anything into it. Either from entry (kprobe/uprobe) or exit > > (kretprobe/uretprobe) executions. Whatever value was stored in entry > > mode will still be there in exit mode as well. > > > > This gives a lot of flexibility without requiring the kernel to do > > anything expensive. > > > > E.g., it's easy to detect if kwrapper program is an entry call or exit call: > > > > u64 *cookie = bpf_get_session_cookie(); > > if (*cookie == 0) { > > /* entry mode */ > > *cookie = 1; > > } else { > > /* exit mode */ > > } > > > > Or if we wanted to measure function duration/latency: > > > > u64 *cookie = bpf_get_session_cookie(); > > if (*cookie == 0) { > > *cookie = bpf_get_ktime_ns(); > > } else { > > u64 duration = bpf_get_ktime_ns() - *cookie; > > /* output duration somehow */ > > } > > > > Yes, I realize special-casing zero might be a bit inconvenient, but I > > think simplicity trumps a potential for zero to be a valid value (and > > there are always ways to work around zero as a meaningful value). > > > > Now, in more complicated cases 8 bytes of temporary session state > > isn't enough, just like BPF cookie being 8 byte (read-only) value > > might not be enough. But the solution is the same as with the BPF > > cookie. You just use those 8 bytes as a key into ARRAY/HASHMAP/whatnot > > storage. It's simple and fast enough for pretty much any case. > > I was recently asked for a way to have function arguments available > in the return kprobe as it is in fexit programs (which was not an > option to use, because we don't have fast multi attach for it) > > using the hash map to store arguments and storing its key in the > session data might be solution for this if you are ok using hashmap keyed by tid, you can do it today without any kernel changes. With session cookie you'll be able to utilize faster ARRAY map (by building a simple ID allocator to get a free slot in ARRAY map). > > > > > > > Now, how do we implement this storage session? I recently looked into > > uprobe/uretprobe implementation, and I know that for uretprobes there > > is a dynamically allocated uretprobe_instance (or whatever the name) > > allocated for each runtime invocation of uretprobe. We can use that. > > We'll need to allocate this instance a bit earlier, before uprobe part > > of uwrapper BPF program has to be executed, initialize session cookie > > to zero, and store pointer to this memory in bpf_run_ctx (just like we > > do for BPF cookie). > > I think you refer to the return_instance and it seems like we could use it, > I'll check on that > yep, probably > > > > I bet there is something similar in the kretprobe case, where we can > > carve out 8 bytes and pass it to both entry and exit parts of kwrapper > > program. > > for kprobes.. both kprobe and kprobe_multi/fprobe use rethook to invoke > return probes, so I guess we could use it and store that shared data > in there > > btw Masami is in process of removing rethook from kprobe_multi/fprobe, > as part of migrating fprobe on top of ftrace [0] > > but instead the rethook I think there'll be some sort of shadow stack/data > area accessible from both entry and return probes, that we could use ok, cool. We also need to be careful to not share session cookie between unrelated programs. E.g., if two independent kwrapper programs are attached to the same function, they should each have their own cookie. Otherwise it's not clear how to build anything reliable on top of that, tbh. This might be a problem, though, right? > > jirka > > > [0] https://lore.kernel.org/bpf/170723204881.502590.11906735097521170661.stgit@devnote2/ > > > > > So anyways, might need some internal refactoring, but seems very > > doable and will give a lot of power and flexibility for tracing use > > cases. > > > > It would be cool to also have a similar fentry/fexit combo with "BPF > > session cookie", but that's something we can think about separately, > > probably. > > > > > > > > > > > duration, if necessary. Without this users have to pay a bunch of > > > > extra overhead to record timestamp and put it into hashmap (keyed by > > > > thread id) or thread local storage (even if they have no other use for > > > > thread local storage). > > > > > > sounds good > > > > > > > > > > > Also, consider that a similar concept is applicable to uprobes and we > > > > should do that as well, in similar fashion. And the above approach > > > > works for both kprobe/kretprobe and uprobe/uretprobe cases, because > > > > they have the same pt_regs data structure as a context (even if for > > > > exit probes most of the values of pt_regs are not that meaningful). > > > > > > ok, that seems useful for uprobes as well > > > > yes, absolutely > > > > > > > > btw I have another unrelated change in stash that that let you choose > > > the bpf_prog_active or prog->active re-entry check before program is > > > executed in krobe_multi link.. I also added extra flag, and it still > > > seems like good idea to me, thoughts? ;-) > > > > > > > no specific opinion on this from my side, I guess > > > > > > > > > > So anyways, great feature, but let's discuss end-to-end usability > > > > story before we finalize the implementation? > > > > > > yep, it does say RFC in the subject ;-) > > > > > > > > > > > > > > > [...]
On Tue, Feb 13, 2024 at 10:20:46AM -0800, Andrii Nakryiko wrote: > On Tue, Feb 13, 2024 at 4:09 AM Jiri Olsa <olsajiri@gmail.com> wrote: > > > > On Mon, Feb 12, 2024 at 08:06:06PM -0800, Andrii Nakryiko wrote: > > > > SNIP > > > > > > > But the way you implement it with extra flag and extra fd parameter > > > > > makes it harder to have a nice high-level support in libbpf (and > > > > > presumably other BPF loader libraries) for this. > > > > > > > > > > When I was thinking about doing something like this, I was considering > > > > > adding a new program type, actually. That way it's possible to define > > > > > this "let's skip return probe" protocol without backwards > > > > > compatibility concerns. It's easier to use it declaratively in libbpf. > > > > > > > > ok, that seems cleaner.. but we need to use current kprobe programs, > > > > so not sure at the moment how would that fit in.. did you mean new > > > > link type? > > > > > > It's kind of a less important detail, actually. New program type would > > > allow us to have an entirely different context type, but I think we > > > can make do with the existing kprobe program type. We can have a > > > separate attach_type and link type, just like multi-kprobe and > > > multi-uprobe are still kprobe programs. > > > > ok, having new attach type on top of kprobe_multi link makes sense > > > > > > > > > > > > > > You just declare SEC("kprobe.wrap/...") (or whatever the name, > > > > > something to designate that it's both entry and exit probe) as one > > > > > program and in the code there would be some way to determine whether > > > > > we are in entry mode or exit mode (helper or field in the custom > > > > > context type, the latter being faster and more usable, but it's > > > > > probably not critical). > > > > > > > > hum, so the single program would be for both entry and exit probe, > > > > I'll need to check how bad it'd be for us, but it'd probably mean > > > > just one extra tail call, so it's likely ok > > > > > > I guess, I don't know what you are doing there :) I'd recommend > > > looking at utilizing BPF global subprogs instead of tail calls, if > > > your kernel allows for that, as that's a saner way to scale BPF > > > verification. > > > > ok, we should probably do that.. given this enhancement will be > > available on latest kernel anyway, we could use global subprogs > > as well > > > > the related bpftrace might be bit more challenging.. will have to > > generate program calling entry or return program now, but seems > > doable of course > > So you want users to still have separate kprobe and kretprobe in > bpftrace, but combine them into this kwrapper transparently? It does no I meant I'd need to generate the wrapper program for the new interface.. which is extra compared to current bpftrace changes > seem doable, but hopefully we'll be able to write kwrapper programs in > bpftrace directly as well. yes, it should be fine SNIP > > > > > > Yes, I realize special-casing zero might be a bit inconvenient, but I > > > think simplicity trumps a potential for zero to be a valid value (and > > > there are always ways to work around zero as a meaningful value). > > > > > > Now, in more complicated cases 8 bytes of temporary session state > > > isn't enough, just like BPF cookie being 8 byte (read-only) value > > > might not be enough. But the solution is the same as with the BPF > > > cookie. You just use those 8 bytes as a key into ARRAY/HASHMAP/whatnot > > > storage. It's simple and fast enough for pretty much any case. > > > > I was recently asked for a way to have function arguments available > > in the return kprobe as it is in fexit programs (which was not an > > option to use, because we don't have fast multi attach for it) > > > > using the hash map to store arguments and storing its key in the > > session data might be solution for this > > if you are ok using hashmap keyed by tid, you can do it today without > any kernel changes. With session cookie you'll be able to utilize > faster ARRAY map (by building a simple ID allocator to get a free slot > in ARRAY map). ok SNIP > > > I bet there is something similar in the kretprobe case, where we can > > > carve out 8 bytes and pass it to both entry and exit parts of kwrapper > > > program. > > > > for kprobes.. both kprobe and kprobe_multi/fprobe use rethook to invoke > > return probes, so I guess we could use it and store that shared data > > in there > > > > btw Masami is in process of removing rethook from kprobe_multi/fprobe, > > as part of migrating fprobe on top of ftrace [0] > > > > but instead the rethook I think there'll be some sort of shadow stack/data > > area accessible from both entry and return probes, that we could use > > ok, cool. We also need to be careful to not share session cookie > between unrelated programs. E.g., if two independent kwrapper programs > are attached to the same function, they should each have their own > cookie. Otherwise it's not clear how to build anything reliable on top > of that, tbh. This might be a problem, though, right? IIRC it's tracer specific data, the shadow stack data should be unique for tracer and its called function, but I'll double check on that jirka
On Tue, Feb 13, 2024 at 10:09:09PM +0100, Jiri Olsa wrote: > On Tue, Feb 13, 2024 at 10:20:46AM -0800, Andrii Nakryiko wrote: > > On Tue, Feb 13, 2024 at 4:09 AM Jiri Olsa <olsajiri@gmail.com> wrote: > > > > > > On Mon, Feb 12, 2024 at 08:06:06PM -0800, Andrii Nakryiko wrote: > > > > > > SNIP > > > > > > > > > But the way you implement it with extra flag and extra fd parameter > > > > > > makes it harder to have a nice high-level support in libbpf (and > > > > > > presumably other BPF loader libraries) for this. > > > > > > > > > > > > When I was thinking about doing something like this, I was considering > > > > > > adding a new program type, actually. That way it's possible to define > > > > > > this "let's skip return probe" protocol without backwards > > > > > > compatibility concerns. It's easier to use it declaratively in libbpf. > > > > > > > > > > ok, that seems cleaner.. but we need to use current kprobe programs, > > > > > so not sure at the moment how would that fit in.. did you mean new > > > > > link type? > > > > > > > > It's kind of a less important detail, actually. New program type would > > > > allow us to have an entirely different context type, but I think we > > > > can make do with the existing kprobe program type. We can have a > > > > separate attach_type and link type, just like multi-kprobe and > > > > multi-uprobe are still kprobe programs. > > > > > > ok, having new attach type on top of kprobe_multi link makes sense > > > > > > > > > > > > > > > > > > You just declare SEC("kprobe.wrap/...") (or whatever the name, > > > > > > something to designate that it's both entry and exit probe) as one > > > > > > program and in the code there would be some way to determine whether > > > > > > we are in entry mode or exit mode (helper or field in the custom > > > > > > context type, the latter being faster and more usable, but it's > > > > > > probably not critical). > > > > > > > > > > hum, so the single program would be for both entry and exit probe, > > > > > I'll need to check how bad it'd be for us, but it'd probably mean > > > > > just one extra tail call, so it's likely ok > > > > > > > > I guess, I don't know what you are doing there :) I'd recommend > > > > looking at utilizing BPF global subprogs instead of tail calls, if > > > > your kernel allows for that, as that's a saner way to scale BPF > > > > verification. > > > > > > ok, we should probably do that.. given this enhancement will be > > > available on latest kernel anyway, we could use global subprogs > > > as well > > > > > > the related bpftrace might be bit more challenging.. will have to > > > generate program calling entry or return program now, but seems > > > doable of course > > > > So you want users to still have separate kprobe and kretprobe in > > bpftrace, but combine them into this kwrapper transparently? It does > > no I meant I'd need to generate the wrapper program for the new > interface.. which is extra compared to current bpftrace changes > > > seem doable, but hopefully we'll be able to write kwrapper programs in > > bpftrace directly as well. > > yes, it should be fine > > SNIP > > > > > > > > > Yes, I realize special-casing zero might be a bit inconvenient, but I > > > > think simplicity trumps a potential for zero to be a valid value (and > > > > there are always ways to work around zero as a meaningful value). > > > > > > > > Now, in more complicated cases 8 bytes of temporary session state > > > > isn't enough, just like BPF cookie being 8 byte (read-only) value > > > > might not be enough. But the solution is the same as with the BPF > > > > cookie. You just use those 8 bytes as a key into ARRAY/HASHMAP/whatnot > > > > storage. It's simple and fast enough for pretty much any case. > > > > > > I was recently asked for a way to have function arguments available > > > in the return kprobe as it is in fexit programs (which was not an > > > option to use, because we don't have fast multi attach for it) > > > > > > using the hash map to store arguments and storing its key in the > > > session data might be solution for this > > > > if you are ok using hashmap keyed by tid, you can do it today without > > any kernel changes. With session cookie you'll be able to utilize > > faster ARRAY map (by building a simple ID allocator to get a free slot > > in ARRAY map). > > ok > > SNIP > > > > > I bet there is something similar in the kretprobe case, where we can > > > > carve out 8 bytes and pass it to both entry and exit parts of kwrapper > > > > program. > > > > > > for kprobes.. both kprobe and kprobe_multi/fprobe use rethook to invoke > > > return probes, so I guess we could use it and store that shared data > > > in there > > > > > > btw Masami is in process of removing rethook from kprobe_multi/fprobe, > > > as part of migrating fprobe on top of ftrace [0] > > > > > > but instead the rethook I think there'll be some sort of shadow stack/data > > > area accessible from both entry and return probes, that we could use > > > > ok, cool. We also need to be careful to not share session cookie > > between unrelated programs. E.g., if two independent kwrapper programs > > are attached to the same function, they should each have their own > > cookie. Otherwise it's not clear how to build anything reliable on top > > of that, tbh. This might be a problem, though, right? > > IIRC it's tracer specific data, the shadow stack data should be unique > for tracer and its called function, but I'll double check on that Masami, we recently discussed the possibility to store data between entry/return probe, IIUC your current patchset [0] allows that, but it seems to be shared across all the tracers for the given function (__ftrace_return_to_handler).. is the plan to make the shadow stack per tracer and function? /cc Steven thanks, jirka [0] https://lore.kernel.org/bpf/170723204881.502590.11906735097521170661.stgit@devnote2/
On Wed, 14 Feb 2024 21:55:03 +0100 Jiri Olsa <olsajiri@gmail.com> wrote: > Masami, > we recently discussed the possibility to store data between entry/return probe, > IIUC your current patchset [0] allows that, but it seems to be shared across all > the tracers for the given function (__ftrace_return_to_handler).. is the plan to > make the shadow stack per tracer and function? /cc Steven The shadow stack is per task, but it saves unique data per tracer per function. It allows up to 16 different tracers attached to function graph tracing at a time, as there is a limited shadow stack size, which all the attached tracers use. Now you can create your own shadow stack per user (bpf program?) and have a single registered function graph user that will demux the data coming in and out of the function. -- Steve
On Thu, Feb 15, 2024 at 11:27:22AM -0500, Steven Rostedt wrote: > On Wed, 14 Feb 2024 21:55:03 +0100 > Jiri Olsa <olsajiri@gmail.com> wrote: > > > > Masami, > > we recently discussed the possibility to store data between entry/return probe, > > IIUC your current patchset [0] allows that, but it seems to be shared across all > > the tracers for the given function (__ftrace_return_to_handler).. is the plan to > > make the shadow stack per tracer and function? /cc Steven > > The shadow stack is per task, but it saves unique data per tracer per function. > > It allows up to 16 different tracers attached to function graph tracing at > a time, as there is a limited shadow stack size, which all the attached > tracers use. Now you can create your own shadow stack per user (bpf > program?) and have a single registered function graph user that will demux > the data coming in and out of the function. ok, I guess that's fgraph_reserve_data/fgraph_retrieve_data api we'd like to share data between bpf programs executed from entry and exit handlers, that should do it thanks, jirka
On 2/13/24 22:09, Jiri Olsa wrote: > On Tue, Feb 13, 2024 at 10:20:46AM -0800, Andrii Nakryiko wrote: >> On Tue, Feb 13, 2024 at 4:09 AM Jiri Olsa <olsajiri@gmail.com> wrote: >>> >>> On Mon, Feb 12, 2024 at 08:06:06PM -0800, Andrii Nakryiko wrote: >>> >>> SNIP >>> >>>>>> But the way you implement it with extra flag and extra fd parameter >>>>>> makes it harder to have a nice high-level support in libbpf (and >>>>>> presumably other BPF loader libraries) for this. >>>>>> >>>>>> When I was thinking about doing something like this, I was considering >>>>>> adding a new program type, actually. That way it's possible to define >>>>>> this "let's skip return probe" protocol without backwards >>>>>> compatibility concerns. It's easier to use it declaratively in libbpf. >>>>> >>>>> ok, that seems cleaner.. but we need to use current kprobe programs, >>>>> so not sure at the moment how would that fit in.. did you mean new >>>>> link type? >>>> >>>> It's kind of a less important detail, actually. New program type would >>>> allow us to have an entirely different context type, but I think we >>>> can make do with the existing kprobe program type. We can have a >>>> separate attach_type and link type, just like multi-kprobe and >>>> multi-uprobe are still kprobe programs. >>> >>> ok, having new attach type on top of kprobe_multi link makes sense >>> >>>> >>>>> >>>>>> You just declare SEC("kprobe.wrap/...") (or whatever the name, >>>>>> something to designate that it's both entry and exit probe) as one >>>>>> program and in the code there would be some way to determine whether >>>>>> we are in entry mode or exit mode (helper or field in the custom >>>>>> context type, the latter being faster and more usable, but it's >>>>>> probably not critical). >>>>> >>>>> hum, so the single program would be for both entry and exit probe, >>>>> I'll need to check how bad it'd be for us, but it'd probably mean >>>>> just one extra tail call, so it's likely ok >>>> >>>> I guess, I don't know what you are doing there :) I'd recommend >>>> looking at utilizing BPF global subprogs instead of tail calls, if >>>> your kernel allows for that, as that's a saner way to scale BPF >>>> verification. >>> >>> ok, we should probably do that.. given this enhancement will be >>> available on latest kernel anyway, we could use global subprogs >>> as well >>> >>> the related bpftrace might be bit more challenging.. will have to >>> generate program calling entry or return program now, but seems >>> doable of course >> >> So you want users to still have separate kprobe and kretprobe in >> bpftrace, but combine them into this kwrapper transparently? It does > > no I meant I'd need to generate the wrapper program for the new > interface.. which is extra compared to current bpftrace changes If you end up introducing this new kwrapper program type in libbpf, I think that it'll make sense to have something similar in bpftrace, too. Allowing users to write separate kprobe and kretprobe programs and transparently combining them into kwrapper doesn't seem to bring much value to me. Jirka, if you need help with implementing bpftrace support for this, let me know. I'm very interested in having this capability there. Viktor > >> seem doable, but hopefully we'll be able to write kwrapper programs in >> bpftrace directly as well. > > yes, it should be fine > > SNIP > >>>> >>>> Yes, I realize special-casing zero might be a bit inconvenient, but I >>>> think simplicity trumps a potential for zero to be a valid value (and >>>> there are always ways to work around zero as a meaningful value). >>>> >>>> Now, in more complicated cases 8 bytes of temporary session state >>>> isn't enough, just like BPF cookie being 8 byte (read-only) value >>>> might not be enough. But the solution is the same as with the BPF >>>> cookie. You just use those 8 bytes as a key into ARRAY/HASHMAP/whatnot >>>> storage. It's simple and fast enough for pretty much any case. >>> >>> I was recently asked for a way to have function arguments available >>> in the return kprobe as it is in fexit programs (which was not an >>> option to use, because we don't have fast multi attach for it) >>> >>> using the hash map to store arguments and storing its key in the >>> session data might be solution for this >> >> if you are ok using hashmap keyed by tid, you can do it today without >> any kernel changes. With session cookie you'll be able to utilize >> faster ARRAY map (by building a simple ID allocator to get a free slot >> in ARRAY map). > > ok > > SNIP > >>>> I bet there is something similar in the kretprobe case, where we can >>>> carve out 8 bytes and pass it to both entry and exit parts of kwrapper >>>> program. >>> >>> for kprobes.. both kprobe and kprobe_multi/fprobe use rethook to invoke >>> return probes, so I guess we could use it and store that shared data >>> in there >>> >>> btw Masami is in process of removing rethook from kprobe_multi/fprobe, >>> as part of migrating fprobe on top of ftrace [0] >>> >>> but instead the rethook I think there'll be some sort of shadow stack/data >>> area accessible from both entry and return probes, that we could use >> >> ok, cool. We also need to be careful to not share session cookie >> between unrelated programs. E.g., if two independent kwrapper programs >> are attached to the same function, they should each have their own >> cookie. Otherwise it's not clear how to build anything reliable on top >> of that, tbh. This might be a problem, though, right? > > IIRC it's tracer specific data, the shadow stack data should be unique > for tracer and its called function, but I'll double check on that > > jirka >
On Mon, Feb 19, 2024 at 12:20:08PM +0100, Viktor Malik wrote: > On 2/13/24 22:09, Jiri Olsa wrote: > > On Tue, Feb 13, 2024 at 10:20:46AM -0800, Andrii Nakryiko wrote: > >> On Tue, Feb 13, 2024 at 4:09 AM Jiri Olsa <olsajiri@gmail.com> wrote: > >>> > >>> On Mon, Feb 12, 2024 at 08:06:06PM -0800, Andrii Nakryiko wrote: > >>> > >>> SNIP > >>> > >>>>>> But the way you implement it with extra flag and extra fd parameter > >>>>>> makes it harder to have a nice high-level support in libbpf (and > >>>>>> presumably other BPF loader libraries) for this. > >>>>>> > >>>>>> When I was thinking about doing something like this, I was considering > >>>>>> adding a new program type, actually. That way it's possible to define > >>>>>> this "let's skip return probe" protocol without backwards > >>>>>> compatibility concerns. It's easier to use it declaratively in libbpf. > >>>>> > >>>>> ok, that seems cleaner.. but we need to use current kprobe programs, > >>>>> so not sure at the moment how would that fit in.. did you mean new > >>>>> link type? > >>>> > >>>> It's kind of a less important detail, actually. New program type would > >>>> allow us to have an entirely different context type, but I think we > >>>> can make do with the existing kprobe program type. We can have a > >>>> separate attach_type and link type, just like multi-kprobe and > >>>> multi-uprobe are still kprobe programs. > >>> > >>> ok, having new attach type on top of kprobe_multi link makes sense > >>> > >>>> > >>>>> > >>>>>> You just declare SEC("kprobe.wrap/...") (or whatever the name, > >>>>>> something to designate that it's both entry and exit probe) as one > >>>>>> program and in the code there would be some way to determine whether > >>>>>> we are in entry mode or exit mode (helper or field in the custom > >>>>>> context type, the latter being faster and more usable, but it's > >>>>>> probably not critical). > >>>>> > >>>>> hum, so the single program would be for both entry and exit probe, > >>>>> I'll need to check how bad it'd be for us, but it'd probably mean > >>>>> just one extra tail call, so it's likely ok > >>>> > >>>> I guess, I don't know what you are doing there :) I'd recommend > >>>> looking at utilizing BPF global subprogs instead of tail calls, if > >>>> your kernel allows for that, as that's a saner way to scale BPF > >>>> verification. > >>> > >>> ok, we should probably do that.. given this enhancement will be > >>> available on latest kernel anyway, we could use global subprogs > >>> as well > >>> > >>> the related bpftrace might be bit more challenging.. will have to > >>> generate program calling entry or return program now, but seems > >>> doable of course > >> > >> So you want users to still have separate kprobe and kretprobe in > >> bpftrace, but combine them into this kwrapper transparently? It does > > > > no I meant I'd need to generate the wrapper program for the new > > interface.. which is extra compared to current bpftrace changes > > If you end up introducing this new kwrapper program type in libbpf, I > think that it'll make sense to have something similar in bpftrace, too. > Allowing users to write separate kprobe and kretprobe programs and > transparently combining them into kwrapper doesn't seem to bring much > value to me. I kind of liked the idea of not introducing new probe type and silently making things faster ;-) but having wrapper probe type makes also sense > > Jirka, if you need help with implementing bpftrace support for this, let > me know. I'm very interested in having this capability there. that'd be great, I'll send new version of kernel changes soon thanks, jirka
On Tue, Feb 13, 2024 at 01:09:54PM +0100, Jiri Olsa wrote: > On Mon, Feb 12, 2024 at 08:06:06PM -0800, Andrii Nakryiko wrote: > > SNIP > > > > > But the way you implement it with extra flag and extra fd parameter > > > > makes it harder to have a nice high-level support in libbpf (and > > > > presumably other BPF loader libraries) for this. > > > > > > > > When I was thinking about doing something like this, I was considering > > > > adding a new program type, actually. That way it's possible to define > > > > this "let's skip return probe" protocol without backwards > > > > compatibility concerns. It's easier to use it declaratively in libbpf. > > > > > > ok, that seems cleaner.. but we need to use current kprobe programs, > > > so not sure at the moment how would that fit in.. did you mean new > > > link type? > > > > It's kind of a less important detail, actually. New program type would > > allow us to have an entirely different context type, but I think we > > can make do with the existing kprobe program type. We can have a > > separate attach_type and link type, just like multi-kprobe and > > multi-uprobe are still kprobe programs. > > ok, having new attach type on top of kprobe_multi link makes sense hi, I have further question in here.. ;-) so I implemented the new behaviour on top of new attach_type, but I keep thinking that it's the 'same/similar logic' as if I added extra flag to attr.link_create.kprobe_multi.flags, which results in much simpler change is following logic breaking backward compatibility in any way? - having new flag in attr.link_create.kprobe_multi.flags - that force the attach of the bpf program to entry/exit function probes - and the kprobe entry program return value now controls invocation on the exit probe so the new flag changes the semantics of the entry program return value, which seems fine to me, because it depends on the new flag.. do I miss any other problem with that? thanks, jirka > > > > > > > > > > You just declare SEC("kprobe.wrap/...") (or whatever the name, > > > > something to designate that it's both entry and exit probe) as one > > > > program and in the code there would be some way to determine whether > > > > we are in entry mode or exit mode (helper or field in the custom > > > > context type, the latter being faster and more usable, but it's > > > > probably not critical). > > > > > > hum, so the single program would be for both entry and exit probe, > > > I'll need to check how bad it'd be for us, but it'd probably mean > > > just one extra tail call, so it's likely ok > > > > I guess, I don't know what you are doing there :) I'd recommend > > looking at utilizing BPF global subprogs instead of tail calls, if > > your kernel allows for that, as that's a saner way to scale BPF > > verification. > > ok, we should probably do that.. given this enhancement will be > available on latest kernel anyway, we could use global subprogs > as well > > the related bpftrace might be bit more challenging.. will have to > generate program calling entry or return program now, but seems > doable of course > > > > > > > > > > > > > > Another frequently requested feature and a very common use case is to > > > > measure duration of the function, so if we have a custom type, we can > > > > have a field to record entry timestamp and let user calculate > > > > > > you mean bpf program context right? > > > > Yes, when I was writing that I was imagining a new context type with new fields. > > > > But thinking about this a bit more, I like a slightly different > > approach now. Instead of having a new context type, kernel capturing > > timestamps, etc, what if we introduce a concept of "a session". This > > paired kprobe+kretprobe (and same for uprobe+uretprobe) share a common > > session. So, say, if we are tracing kernel function abc() and we have > > our kwrapper (let's call it that for now, kwrapper and uwrapper?) > > program attached to abc, we have something like this: > > > > 1) abc() is called > > 2) we intercept it at its entry, initialize "session" > > 3) call our kwrapper program in entry mode > > 4) abc() proceeds, until return > > 5) our kwrapper program is called in exit mode > > 6) at this point the session has ended > > > > If abc() is called again or in parallel on another CPU, each such > > abc() invocation (with kwrapper prog) has its own session state. > > > > Now, it all sounds scary, but a session could be really just a 8 byte > > value that has to last for the duration of a single kprobe+kretprobe > > execution. > > > > Imagine we introduce just one new kfunc that would be usable from > > kwrapper and uwrapper programs: > > > > u64 *bpf_get_session_cookie(void); > > > > It returns a pointer to an 8-byte slot which initially is set to zero > > by kernel (before kprobe/entry part). And then the program can put > > anything into it. Either from entry (kprobe/uprobe) or exit > > (kretprobe/uretprobe) executions. Whatever value was stored in entry > > mode will still be there in exit mode as well. > > > > This gives a lot of flexibility without requiring the kernel to do > > anything expensive. > > > > E.g., it's easy to detect if kwrapper program is an entry call or exit call: > > > > u64 *cookie = bpf_get_session_cookie(); > > if (*cookie == 0) { > > /* entry mode */ > > *cookie = 1; > > } else { > > /* exit mode */ > > } > > > > Or if we wanted to measure function duration/latency: > > > > u64 *cookie = bpf_get_session_cookie(); > > if (*cookie == 0) { > > *cookie = bpf_get_ktime_ns(); > > } else { > > u64 duration = bpf_get_ktime_ns() - *cookie; > > /* output duration somehow */ > > } > > > > Yes, I realize special-casing zero might be a bit inconvenient, but I > > think simplicity trumps a potential for zero to be a valid value (and > > there are always ways to work around zero as a meaningful value). > > > > Now, in more complicated cases 8 bytes of temporary session state > > isn't enough, just like BPF cookie being 8 byte (read-only) value > > might not be enough. But the solution is the same as with the BPF > > cookie. You just use those 8 bytes as a key into ARRAY/HASHMAP/whatnot > > storage. It's simple and fast enough for pretty much any case. > > I was recently asked for a way to have function arguments available > in the return kprobe as it is in fexit programs (which was not an > option to use, because we don't have fast multi attach for it) > > using the hash map to store arguments and storing its key in the > session data might be solution for this > > > > > > > Now, how do we implement this storage session? I recently looked into > > uprobe/uretprobe implementation, and I know that for uretprobes there > > is a dynamically allocated uretprobe_instance (or whatever the name) > > allocated for each runtime invocation of uretprobe. We can use that. > > We'll need to allocate this instance a bit earlier, before uprobe part > > of uwrapper BPF program has to be executed, initialize session cookie > > to zero, and store pointer to this memory in bpf_run_ctx (just like we > > do for BPF cookie). > > I think you refer to the return_instance and it seems like we could use it, > I'll check on that > > > > > I bet there is something similar in the kretprobe case, where we can > > carve out 8 bytes and pass it to both entry and exit parts of kwrapper > > program. > > for kprobes.. both kprobe and kprobe_multi/fprobe use rethook to invoke > return probes, so I guess we could use it and store that shared data > in there > > btw Masami is in process of removing rethook from kprobe_multi/fprobe, > as part of migrating fprobe on top of ftrace [0] > > but instead the rethook I think there'll be some sort of shadow stack/data > area accessible from both entry and return probes, that we could use > > jirka > > > [0] https://lore.kernel.org/bpf/170723204881.502590.11906735097521170661.stgit@devnote2/ > > > > > So anyways, might need some internal refactoring, but seems very > > doable and will give a lot of power and flexibility for tracing use > > cases. > > > > It would be cool to also have a similar fentry/fexit combo with "BPF > > session cookie", but that's something we can think about separately, > > probably. > > > > > > > > > > > duration, if necessary. Without this users have to pay a bunch of > > > > extra overhead to record timestamp and put it into hashmap (keyed by > > > > thread id) or thread local storage (even if they have no other use for > > > > thread local storage). > > > > > > sounds good > > > > > > > > > > > Also, consider that a similar concept is applicable to uprobes and we > > > > should do that as well, in similar fashion. And the above approach > > > > works for both kprobe/kretprobe and uprobe/uretprobe cases, because > > > > they have the same pt_regs data structure as a context (even if for > > > > exit probes most of the values of pt_regs are not that meaningful). > > > > > > ok, that seems useful for uprobes as well > > > > yes, absolutely > > > > > > > > btw I have another unrelated change in stash that that let you choose > > > the bpf_prog_active or prog->active re-entry check before program is > > > executed in krobe_multi link.. I also added extra flag, and it still > > > seems like good idea to me, thoughts? ;-) > > > > > > > no specific opinion on this from my side, I guess > > > > > > > > > > So anyways, great feature, but let's discuss end-to-end usability > > > > story before we finalize the implementation? > > > > > > yep, it does say RFC in the subject ;-) > > > > > > > > > > > > > > > [...]
On Fri, Feb 23, 2024 at 1:32 AM Jiri Olsa <olsajiri@gmail.com> wrote: > > On Tue, Feb 13, 2024 at 01:09:54PM +0100, Jiri Olsa wrote: > > On Mon, Feb 12, 2024 at 08:06:06PM -0800, Andrii Nakryiko wrote: > > > > SNIP > > > > > > > But the way you implement it with extra flag and extra fd parameter > > > > > makes it harder to have a nice high-level support in libbpf (and > > > > > presumably other BPF loader libraries) for this. > > > > > > > > > > When I was thinking about doing something like this, I was considering > > > > > adding a new program type, actually. That way it's possible to define > > > > > this "let's skip return probe" protocol without backwards > > > > > compatibility concerns. It's easier to use it declaratively in libbpf. > > > > > > > > ok, that seems cleaner.. but we need to use current kprobe programs, > > > > so not sure at the moment how would that fit in.. did you mean new > > > > link type? > > > > > > It's kind of a less important detail, actually. New program type would > > > allow us to have an entirely different context type, but I think we > > > can make do with the existing kprobe program type. We can have a > > > separate attach_type and link type, just like multi-kprobe and > > > multi-uprobe are still kprobe programs. > > > > ok, having new attach type on top of kprobe_multi link makes sense > > hi, > I have further question in here.. ;-) > > so I implemented the new behaviour on top of new attach_type, but I keep > thinking that it's the 'same/similar logic' as if I added extra flag to > attr.link_create.kprobe_multi.flags, which results in much simpler change > > is following logic breaking backward compatibility in any way? Even if it doesn't break compatibility, isn't new attach_type better to explicitly show that semantics is completely different (it's two BPF program executions, before and after, for each specified kprobe), isn't that different enough? We can also have a different set of helpers for this kwrapper/uwrapper program type (probably could do it through flags as well, but attach_type again seems a better fit here..)? What makes it harder if you are doing it as a new attach_type? > > - having new flag in attr.link_create.kprobe_multi.flags > - that force the attach of the bpf program to entry/exit function probes > - and the kprobe entry program return value now controls invocation > on the exit probe > > so the new flag changes the semantics of the entry program return value, > which seems fine to me, because it depends on the new flag.. do I miss > any other problem with that? > > thanks, > jirka > > > > > > > > > > > > > > > You just declare SEC("kprobe.wrap/...") (or whatever the name, > > > > > something to designate that it's both entry and exit probe) as one > > > > > program and in the code there would be some way to determine whether > > > > > we are in entry mode or exit mode (helper or field in the custom > > > > > context type, the latter being faster and more usable, but it's > > > > > probably not critical). > > > > > > > > hum, so the single program would be for both entry and exit probe, > > > > I'll need to check how bad it'd be for us, but it'd probably mean > > > > just one extra tail call, so it's likely ok > > > > > > I guess, I don't know what you are doing there :) I'd recommend > > > looking at utilizing BPF global subprogs instead of tail calls, if > > > your kernel allows for that, as that's a saner way to scale BPF > > > verification. > > [...]
On Wed, Feb 28, 2024 at 4:43 PM Andrii Nakryiko <andrii.nakryiko@gmail.com> wrote: > > On Fri, Feb 23, 2024 at 1:32 AM Jiri Olsa <olsajiri@gmail.com> wrote: > > > > On Tue, Feb 13, 2024 at 01:09:54PM +0100, Jiri Olsa wrote: > > > On Mon, Feb 12, 2024 at 08:06:06PM -0800, Andrii Nakryiko wrote: > > > > > > SNIP > > > > > > > > > But the way you implement it with extra flag and extra fd parameter > > > > > > makes it harder to have a nice high-level support in libbpf (and > > > > > > presumably other BPF loader libraries) for this. > > > > > > > > > > > > When I was thinking about doing something like this, I was considering > > > > > > adding a new program type, actually. That way it's possible to define > > > > > > this "let's skip return probe" protocol without backwards > > > > > > compatibility concerns. It's easier to use it declaratively in libbpf. > > > > > > > > > > ok, that seems cleaner.. but we need to use current kprobe programs, > > > > > so not sure at the moment how would that fit in.. did you mean new > > > > > link type? > > > > > > > > It's kind of a less important detail, actually. New program type would > > > > allow us to have an entirely different context type, but I think we > > > > can make do with the existing kprobe program type. We can have a > > > > separate attach_type and link type, just like multi-kprobe and > > > > multi-uprobe are still kprobe programs. > > > > > > ok, having new attach type on top of kprobe_multi link makes sense > > > > hi, > > I have further question in here.. ;-) > > > > so I implemented the new behaviour on top of new attach_type, but I keep > > thinking that it's the 'same/similar logic' as if I added extra flag to > > attr.link_create.kprobe_multi.flags, which results in much simpler change > > > > is following logic breaking backward compatibility in any way? > > Even if it doesn't break compatibility, isn't new attach_type better > to explicitly show that semantics is completely different (it's two > BPF program executions, before and after, for each specified kprobe), > isn't that different enough? We can also have a different set of > helpers for this kwrapper/uwrapper program type (probably could do it > through flags as well, but attach_type again seems a better fit > here..)? > > What makes it harder if you are doing it as a new attach_type? I just looked at the RFC patches you sent, and I guess it looks fine to me with the flag as well, given kretprobe.multi is also specified through flags. > > > > > - having new flag in attr.link_create.kprobe_multi.flags > > - that force the attach of the bpf program to entry/exit function probes > > - and the kprobe entry program return value now controls invocation > > on the exit probe > > > > so the new flag changes the semantics of the entry program return value, > > which seems fine to me, because it depends on the new flag.. do I miss > > any other problem with that? > > > > thanks, > > jirka > > > > > > > > > > > > > > > > > > > > You just declare SEC("kprobe.wrap/...") (or whatever the name, > > > > > > something to designate that it's both entry and exit probe) as one > > > > > > program and in the code there would be some way to determine whether > > > > > > we are in entry mode or exit mode (helper or field in the custom > > > > > > context type, the latter being faster and more usable, but it's > > > > > > probably not critical). > > > > > > > > > > hum, so the single program would be for both entry and exit probe, > > > > > I'll need to check how bad it'd be for us, but it'd probably mean > > > > > just one extra tail call, so it's likely ok > > > > > > > > I guess, I don't know what you are doing there :) I'd recommend > > > > looking at utilizing BPF global subprogs instead of tail calls, if > > > > your kernel allows for that, as that's a saner way to scale BPF > > > > verification. > > > > > [...]