mbox series

[RFC,bpf-next,0/4] bpf: Add support to attach return prog in kprobe multi

Message ID 20240207153550.856536-1-jolsa@kernel.org (mailing list archive)
Headers show
Series bpf: Add support to attach return prog in kprobe multi | expand

Message

Jiri Olsa Feb. 7, 2024, 3:35 p.m. UTC
hi,
adding support to attach both entry and return bpf program on single
kprobe multi link.

Having entry together with return probe for given function is common
use case for tetragon, bpftrace and most likely for others.

At the moment if we want both entry and return probe to execute bpf
program we need to create two (entry and return probe) links. The link
for return probe creates extra entry probe to setup the return probe.
The extra entry probe execution could be omitted if we had a way to
use just single link for both entry and exit probe.

In addition it's possible to control the execution of the return probe
with the return value of the entry bpf program. If the entry program
returns 0 the return probe is installed and executed, otherwise it's
skip.

I'm still working on the tetragon change, so I'll be sending non-RFC
version once that's ready, meanwhile any ideas and feedback on the
approach would be great.

The change in bpftrace [1] using the new interface shows speed increase
with tracing perf bench messaging:

  # perf bench sched messaging -l 100000

having system wide bpftrace:

  # bpftrace -e 'kprobe:ksys_write { }, kretprobe:ksys_write { }'

without bpftrace:

  # Running 'sched/messaging' benchmark:
  # 20 sender and receiver processes per group
  # 10 groups == 400 processes run

       Total time: 119.595 [sec]

   Performance counter stats for 'perf bench sched messaging -l 100000':

     102,419,967,282      cycles:u
   5,652,444,107,001      cycles:k
   5,782,645,019,612      cycles
      22,187,151,206      instructions:u                   #    0.22  insn per cycle
   2,979,040,498,455      instructions:k                   #    0.53  insn per cycle

       119.671169829 seconds time elapsed

        94.959198000 seconds user
      1815.371616000 seconds sys

with current bpftrace:

  # Running 'sched/messaging' benchmark:
  # 20 sender and receiver processes per group
  # 10 groups == 400 processes run

       Total time: 221.153 [sec]

   Performance counter stats for 'perf bench sched messaging -l 100000':

     125,292,164,504      cycles:u
  10,315,020,393,735      cycles:k
  10,501,379,274,042      cycles
      22,187,583,545      instructions:u                   #    0.18  insn per cycle
   4,856,893,111,303      instructions:k                   #    0.47  insn per cycle

       221.229234283 seconds time elapsed

       103.792498000 seconds user
      3432.643302000 seconds sys

with bpftrace using the new interface:

  # Running 'sched/messaging' benchmark:
  # 20 sender and receiver processes per group
  # 10 groups == 400 processes run

       Total time: 157.825 [sec]

   Performance counter stats for 'perf bench sched messaging -l 100000':

     102,423,112,279      cycles:u
   7,450,856,354,744      cycles:k
   7,584,769,726,693      cycles
      22,187,270,661      instructions:u                   #    0.22  insn per cycle
   3,985,522,383,425      instructions:k                   #    0.53  insn per cycle

       157.900787760 seconds time elapsed

        97.953898000 seconds user
      2425.314753000 seconds sys

thanks,
jirka


[1] https://github.com/bpftrace/bpftrace/pull/2984
---
Jiri Olsa (4):
      fprobe: Add entry/exit callbacks types
      bpf: Add return prog to kprobe multi
      libbpf: Add return_prog_fd to kprobe multi opts
      selftests/bpf: Add kprobe multi return prog test

 include/linux/fprobe.h                                       |  18 ++++++++++------
 include/uapi/linux/bpf.h                                     |   4 +++-
 kernel/trace/bpf_trace.c                                     |  50 ++++++++++++++++++++++++++++++++-----------
 tools/include/uapi/linux/bpf.h                               |   4 +++-
 tools/lib/bpf/bpf.c                                          |   1 +
 tools/lib/bpf/bpf.h                                          |   1 +
 tools/lib/bpf/libbpf.c                                       |   5 +++++
 tools/lib/bpf/libbpf.h                                       |   6 +++++-
 tools/testing/selftests/bpf/prog_tests/kprobe_multi_test.c   |  53 +++++++++++++++++++++++++++++++++++++++++++++
 tools/testing/selftests/bpf/progs/kprobe_multi_return_prog.c | 105 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 10 files changed, 226 insertions(+), 21 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/progs/kprobe_multi_return_prog.c

Comments

Andrii Nakryiko Feb. 8, 2024, 7:35 p.m. UTC | #1
On Wed, Feb 7, 2024 at 7:35 AM Jiri Olsa <jolsa@kernel.org> wrote:
>
> hi,
> adding support to attach both entry and return bpf program on single
> kprobe multi link.
>
> Having entry together with return probe for given function is common
> use case for tetragon, bpftrace and most likely for others.
>
> At the moment if we want both entry and return probe to execute bpf
> program we need to create two (entry and return probe) links. The link
> for return probe creates extra entry probe to setup the return probe.
> The extra entry probe execution could be omitted if we had a way to
> use just single link for both entry and exit probe.
>
> In addition it's possible to control the execution of the return probe
> with the return value of the entry bpf program. If the entry program
> returns 0 the return probe is installed and executed, otherwise it's
> skip.
>

In general, I think this is a very useful ability to have a combined
entry/return program.

But the way you implement it with extra flag and extra fd parameter
makes it harder to have a nice high-level support in libbpf (and
presumably other BPF loader libraries) for this.

When I was thinking about doing something like this, I was considering
adding a new program type, actually. That way it's possible to define
this "let's skip return probe" protocol without backwards
compatibility concerns. It's easier to use it declaratively in libbpf.
You just declare SEC("kprobe.wrap/...") (or whatever the name,
something to designate that it's both entry and exit probe) as one
program and in the code there would be some way to determine whether
we are in entry mode or exit mode (helper or field in the custom
context type, the latter being faster and more usable, but it's
probably not critical).

Another frequently requested feature and a very common use case is to
measure duration of the function, so if we have a custom type, we can
have a field to record entry timestamp and let user calculate
duration, if necessary. Without this users have to pay a bunch of
extra overhead to record timestamp and put it into hashmap (keyed by
thread id) or thread local storage (even if they have no other use for
thread local storage).

Also, consider that a similar concept is applicable to uprobes and we
should do that as well, in similar fashion. And the above approach
works for both kprobe/kretprobe and uprobe/uretprobe cases, because
they have the same pt_regs data structure as a context (even if for
exit probes most of the values of pt_regs are not that meaningful).

So anyways, great feature, but let's discuss end-to-end usability
story before we finalize the implementation?


> I'm still working on the tetragon change, so I'll be sending non-RFC
> version once that's ready, meanwhile any ideas and feedback on the
> approach would be great.
>
> The change in bpftrace [1] using the new interface shows speed increase
> with tracing perf bench messaging:
>
>   # perf bench sched messaging -l 100000
>
> having system wide bpftrace:
>
>   # bpftrace -e 'kprobe:ksys_write { }, kretprobe:ksys_write { }'
>
> without bpftrace:
>
>   # Running 'sched/messaging' benchmark:
>   # 20 sender and receiver processes per group
>   # 10 groups == 400 processes run
>
>        Total time: 119.595 [sec]
>
>    Performance counter stats for 'perf bench sched messaging -l 100000':
>
>      102,419,967,282      cycles:u
>    5,652,444,107,001      cycles:k
>    5,782,645,019,612      cycles
>       22,187,151,206      instructions:u                   #    0.22  insn per cycle
>    2,979,040,498,455      instructions:k                   #    0.53  insn per cycle
>
>        119.671169829 seconds time elapsed
>
>         94.959198000 seconds user
>       1815.371616000 seconds sys
>
> with current bpftrace:
>
>   # Running 'sched/messaging' benchmark:
>   # 20 sender and receiver processes per group
>   # 10 groups == 400 processes run
>
>        Total time: 221.153 [sec]
>
>    Performance counter stats for 'perf bench sched messaging -l 100000':
>
>      125,292,164,504      cycles:u

btw, why +25% in user space?... this looks weird


>   10,315,020,393,735      cycles:k
>   10,501,379,274,042      cycles
>       22,187,583,545      instructions:u                   #    0.18  insn per cycle
>    4,856,893,111,303      instructions:k                   #    0.47  insn per cycle
>
>        221.229234283 seconds time elapsed
>
>        103.792498000 seconds user
>       3432.643302000 seconds sys
>
> with bpftrace using the new interface:
>
>   # Running 'sched/messaging' benchmark:
>   # 20 sender and receiver processes per group
>   # 10 groups == 400 processes run
>
>        Total time: 157.825 [sec]
>
>    Performance counter stats for 'perf bench sched messaging -l 100000':
>
>      102,423,112,279      cycles:u
>    7,450,856,354,744      cycles:k
>    7,584,769,726,693      cycles
>       22,187,270,661      instructions:u                   #    0.22  insn per cycle
>    3,985,522,383,425      instructions:k                   #    0.53  insn per cycle
>
>        157.900787760 seconds time elapsed
>
>         97.953898000 seconds user
>       2425.314753000 seconds sys
>
> thanks,
> jirka
>
>
> [1] https://github.com/bpftrace/bpftrace/pull/2984
> ---
> Jiri Olsa (4):
>       fprobe: Add entry/exit callbacks types
>       bpf: Add return prog to kprobe multi
>       libbpf: Add return_prog_fd to kprobe multi opts
>       selftests/bpf: Add kprobe multi return prog test
>
>  include/linux/fprobe.h                                       |  18 ++++++++++------
>  include/uapi/linux/bpf.h                                     |   4 +++-
>  kernel/trace/bpf_trace.c                                     |  50 ++++++++++++++++++++++++++++++++-----------
>  tools/include/uapi/linux/bpf.h                               |   4 +++-
>  tools/lib/bpf/bpf.c                                          |   1 +
>  tools/lib/bpf/bpf.h                                          |   1 +
>  tools/lib/bpf/libbpf.c                                       |   5 +++++
>  tools/lib/bpf/libbpf.h                                       |   6 +++++-
>  tools/testing/selftests/bpf/prog_tests/kprobe_multi_test.c   |  53 +++++++++++++++++++++++++++++++++++++++++++++
>  tools/testing/selftests/bpf/progs/kprobe_multi_return_prog.c | 105 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  10 files changed, 226 insertions(+), 21 deletions(-)
>  create mode 100644 tools/testing/selftests/bpf/progs/kprobe_multi_return_prog.c
Jiri Olsa Feb. 10, 2024, 3:31 p.m. UTC | #2
On Thu, Feb 08, 2024 at 11:35:18AM -0800, Andrii Nakryiko wrote:
> On Wed, Feb 7, 2024 at 7:35 AM Jiri Olsa <jolsa@kernel.org> wrote:
> >
> > hi,
> > adding support to attach both entry and return bpf program on single
> > kprobe multi link.
> >
> > Having entry together with return probe for given function is common
> > use case for tetragon, bpftrace and most likely for others.
> >
> > At the moment if we want both entry and return probe to execute bpf
> > program we need to create two (entry and return probe) links. The link
> > for return probe creates extra entry probe to setup the return probe.
> > The extra entry probe execution could be omitted if we had a way to
> > use just single link for both entry and exit probe.
> >
> > In addition it's possible to control the execution of the return probe
> > with the return value of the entry bpf program. If the entry program
> > returns 0 the return probe is installed and executed, otherwise it's
> > skip.
> >
> 
> In general, I think this is a very useful ability to have a combined
> entry/return program.

great, thanks

> 
> But the way you implement it with extra flag and extra fd parameter
> makes it harder to have a nice high-level support in libbpf (and
> presumably other BPF loader libraries) for this.
> 
> When I was thinking about doing something like this, I was considering
> adding a new program type, actually. That way it's possible to define
> this "let's skip return probe" protocol without backwards
> compatibility concerns. It's easier to use it declaratively in libbpf.

ok, that seems cleaner.. but we need to use current kprobe programs,
so not sure at the moment how would that fit in.. did you mean new
link type?

> You just declare SEC("kprobe.wrap/...") (or whatever the name,
> something to designate that it's both entry and exit probe) as one
> program and in the code there would be some way to determine whether
> we are in entry mode or exit mode (helper or field in the custom
> context type, the latter being faster and more usable, but it's
> probably not critical).

hum, so the single program would be for both entry and exit probe,
I'll need to check how bad it'd be for us, but it'd probably mean
just one extra tail call, so it's likely ok

> 
> Another frequently requested feature and a very common use case is to
> measure duration of the function, so if we have a custom type, we can
> have a field to record entry timestamp and let user calculate

you mean bpf program context right?

> duration, if necessary. Without this users have to pay a bunch of
> extra overhead to record timestamp and put it into hashmap (keyed by
> thread id) or thread local storage (even if they have no other use for
> thread local storage).

sounds good

> 
> Also, consider that a similar concept is applicable to uprobes and we
> should do that as well, in similar fashion. And the above approach
> works for both kprobe/kretprobe and uprobe/uretprobe cases, because
> they have the same pt_regs data structure as a context (even if for
> exit probes most of the values of pt_regs are not that meaningful).

ok, that seems useful for uprobes as well

btw I have another unrelated change in stash that that let you choose
the bpf_prog_active or prog->active re-entry check before program is
executed in krobe_multi link.. I also added extra flag, and it still
seems like good idea to me, thoughts? ;-)

> 
> So anyways, great feature, but let's discuss end-to-end usability
> story before we finalize the implementation?

yep, it does say RFC in the subject ;-)

> 
> 
> > I'm still working on the tetragon change, so I'll be sending non-RFC
> > version once that's ready, meanwhile any ideas and feedback on the
> > approach would be great.
> >
> > The change in bpftrace [1] using the new interface shows speed increase
> > with tracing perf bench messaging:
> >
> >   # perf bench sched messaging -l 100000
> >
> > having system wide bpftrace:
> >
> >   # bpftrace -e 'kprobe:ksys_write { }, kretprobe:ksys_write { }'
> >
> > without bpftrace:
> >
> >   # Running 'sched/messaging' benchmark:
> >   # 20 sender and receiver processes per group
> >   # 10 groups == 400 processes run
> >
> >        Total time: 119.595 [sec]
> >
> >    Performance counter stats for 'perf bench sched messaging -l 100000':
> >
> >      102,419,967,282      cycles:u
> >    5,652,444,107,001      cycles:k
> >    5,782,645,019,612      cycles
> >       22,187,151,206      instructions:u                   #    0.22  insn per cycle
> >    2,979,040,498,455      instructions:k                   #    0.53  insn per cycle
> >
> >        119.671169829 seconds time elapsed
> >
> >         94.959198000 seconds user
> >       1815.371616000 seconds sys
> >
> > with current bpftrace:
> >
> >   # Running 'sched/messaging' benchmark:
> >   # 20 sender and receiver processes per group
> >   # 10 groups == 400 processes run
> >
> >        Total time: 221.153 [sec]
> >
> >    Performance counter stats for 'perf bench sched messaging -l 100000':
> >
> >      125,292,164,504      cycles:u
> 
> btw, why +25% in user space?... this looks weird

hmm right.. maybe there's some bpftrace timer code that got triggered/executed
more times because it takes long to finish the workload? I'll check

thanks,
jirka

> 
> 
> >   10,315,020,393,735      cycles:k
> >   10,501,379,274,042      cycles
> >       22,187,583,545      instructions:u                   #    0.18  insn per cycle
> >    4,856,893,111,303      instructions:k                   #    0.47  insn per cycle
> >
> >        221.229234283 seconds time elapsed
> >
> >        103.792498000 seconds user
> >       3432.643302000 seconds sys
> >
> > with bpftrace using the new interface:
> >
> >   # Running 'sched/messaging' benchmark:
> >   # 20 sender and receiver processes per group
> >   # 10 groups == 400 processes run
> >
> >        Total time: 157.825 [sec]
> >
> >    Performance counter stats for 'perf bench sched messaging -l 100000':
> >
> >      102,423,112,279      cycles:u
> >    7,450,856,354,744      cycles:k
> >    7,584,769,726,693      cycles
> >       22,187,270,661      instructions:u                   #    0.22  insn per cycle
> >    3,985,522,383,425      instructions:k                   #    0.53  insn per cycle
> >
> >        157.900787760 seconds time elapsed
> >
> >         97.953898000 seconds user
> >       2425.314753000 seconds sys
> >
> > thanks,
> > jirka
> >
> >
> > [1] https://github.com/bpftrace/bpftrace/pull/2984
> > ---
> > Jiri Olsa (4):
> >       fprobe: Add entry/exit callbacks types
> >       bpf: Add return prog to kprobe multi
> >       libbpf: Add return_prog_fd to kprobe multi opts
> >       selftests/bpf: Add kprobe multi return prog test
> >
> >  include/linux/fprobe.h                                       |  18 ++++++++++------
> >  include/uapi/linux/bpf.h                                     |   4 +++-
> >  kernel/trace/bpf_trace.c                                     |  50 ++++++++++++++++++++++++++++++++-----------
> >  tools/include/uapi/linux/bpf.h                               |   4 +++-
> >  tools/lib/bpf/bpf.c                                          |   1 +
> >  tools/lib/bpf/bpf.h                                          |   1 +
> >  tools/lib/bpf/libbpf.c                                       |   5 +++++
> >  tools/lib/bpf/libbpf.h                                       |   6 +++++-
> >  tools/testing/selftests/bpf/prog_tests/kprobe_multi_test.c   |  53 +++++++++++++++++++++++++++++++++++++++++++++
> >  tools/testing/selftests/bpf/progs/kprobe_multi_return_prog.c | 105 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> >  10 files changed, 226 insertions(+), 21 deletions(-)
> >  create mode 100644 tools/testing/selftests/bpf/progs/kprobe_multi_return_prog.c
Andrii Nakryiko Feb. 13, 2024, 4:06 a.m. UTC | #3
On Sat, Feb 10, 2024 at 7:31 AM Jiri Olsa <olsajiri@gmail.com> wrote:
>
> On Thu, Feb 08, 2024 at 11:35:18AM -0800, Andrii Nakryiko wrote:
> > On Wed, Feb 7, 2024 at 7:35 AM Jiri Olsa <jolsa@kernel.org> wrote:
> > >
> > > hi,
> > > adding support to attach both entry and return bpf program on single
> > > kprobe multi link.
> > >
> > > Having entry together with return probe for given function is common
> > > use case for tetragon, bpftrace and most likely for others.
> > >
> > > At the moment if we want both entry and return probe to execute bpf
> > > program we need to create two (entry and return probe) links. The link
> > > for return probe creates extra entry probe to setup the return probe.
> > > The extra entry probe execution could be omitted if we had a way to
> > > use just single link for both entry and exit probe.
> > >
> > > In addition it's possible to control the execution of the return probe
> > > with the return value of the entry bpf program. If the entry program
> > > returns 0 the return probe is installed and executed, otherwise it's
> > > skip.
> > >
> >
> > In general, I think this is a very useful ability to have a combined
> > entry/return program.
>
> great, thanks
>
> >
> > But the way you implement it with extra flag and extra fd parameter
> > makes it harder to have a nice high-level support in libbpf (and
> > presumably other BPF loader libraries) for this.
> >
> > When I was thinking about doing something like this, I was considering
> > adding a new program type, actually. That way it's possible to define
> > this "let's skip return probe" protocol without backwards
> > compatibility concerns. It's easier to use it declaratively in libbpf.
>
> ok, that seems cleaner.. but we need to use current kprobe programs,
> so not sure at the moment how would that fit in.. did you mean new
> link type?

It's kind of a less important detail, actually. New program type would
allow us to have an entirely different context type, but I think we
can make do with the existing kprobe program type. We can have a
separate attach_type and link type, just like multi-kprobe and
multi-uprobe are still kprobe programs.

>
> > You just declare SEC("kprobe.wrap/...") (or whatever the name,
> > something to designate that it's both entry and exit probe) as one
> > program and in the code there would be some way to determine whether
> > we are in entry mode or exit mode (helper or field in the custom
> > context type, the latter being faster and more usable, but it's
> > probably not critical).
>
> hum, so the single program would be for both entry and exit probe,
> I'll need to check how bad it'd be for us, but it'd probably mean
> just one extra tail call, so it's likely ok

I guess, I don't know what you are doing there :) I'd recommend
looking at utilizing BPF global subprogs instead of tail calls, if
your kernel allows for that, as that's a saner way to scale BPF
verification.

>
> >
> > Another frequently requested feature and a very common use case is to
> > measure duration of the function, so if we have a custom type, we can
> > have a field to record entry timestamp and let user calculate
>
> you mean bpf program context right?

Yes, when I was writing that I was imagining a new context type with new fields.

But thinking about this a bit more, I like a slightly different
approach now. Instead of having a new context type, kernel capturing
timestamps, etc, what if we introduce a concept of "a session". This
paired kprobe+kretprobe (and same for uprobe+uretprobe) share a common
session. So, say, if we are tracing kernel function abc() and we have
our kwrapper (let's call it that for now, kwrapper and uwrapper?)
program attached to abc, we have something like this:

1) abc() is called
2) we intercept it at its entry, initialize "session"
3) call our kwrapper program in entry mode
4) abc() proceeds, until return
5) our kwrapper program is called in exit mode
6) at this point the session has ended

If abc() is called again or in parallel on another CPU, each such
abc() invocation (with kwrapper prog) has its own session state.

Now, it all sounds scary, but a session could be really just a 8 byte
value that has to last for the duration of a single kprobe+kretprobe
execution.

Imagine we introduce just one new kfunc that would be usable from
kwrapper and uwrapper programs:

u64 *bpf_get_session_cookie(void);

It returns a pointer to an 8-byte slot which initially is set to zero
by kernel (before kprobe/entry part). And then the program can put
anything into it. Either from entry (kprobe/uprobe) or exit
(kretprobe/uretprobe) executions. Whatever value was stored in entry
mode will still be there in exit mode as well.

This gives a lot of flexibility without requiring the kernel to do
anything expensive.

E.g., it's easy to detect if kwrapper program is an entry call or exit call:

u64 *cookie = bpf_get_session_cookie();
if (*cookie == 0) {
  /* entry mode */
  *cookie = 1;
} else {
  /* exit mode */
}

Or if we wanted to measure function duration/latency:

u64 *cookie = bpf_get_session_cookie();
if (*cookie == 0) {
  *cookie = bpf_get_ktime_ns();
} else {
  u64 duration = bpf_get_ktime_ns() - *cookie;
  /* output duration somehow */
}

Yes, I realize special-casing zero might be a bit inconvenient, but I
think simplicity trumps a potential for zero to be a valid value (and
there are always ways to work around zero as a meaningful value).

Now, in more complicated cases 8 bytes of temporary session state
isn't enough, just like BPF cookie being 8 byte (read-only) value
might not be enough. But the solution is the same as with the BPF
cookie. You just use those 8 bytes as a key into ARRAY/HASHMAP/whatnot
storage. It's simple and fast enough for pretty much any case.


Now, how do we implement this storage session? I recently looked into
uprobe/uretprobe implementation, and I know that for uretprobes there
is a dynamically allocated uretprobe_instance (or whatever the name)
allocated for each runtime invocation of uretprobe. We can use that.
We'll need to allocate this instance a bit earlier, before uprobe part
of uwrapper BPF program has to be executed, initialize session cookie
to zero, and store pointer to this memory in bpf_run_ctx (just like we
do for BPF cookie).

I bet there is something similar in the kretprobe case, where we can
carve out 8 bytes and pass it to both entry and exit parts of kwrapper
program.

So anyways, might need some internal refactoring, but seems very
doable and will give a lot of power and flexibility for tracing use
cases.

It would be cool to also have a similar fentry/fexit combo with "BPF
session cookie", but that's something we can think about separately,
probably.


>
> > duration, if necessary. Without this users have to pay a bunch of
> > extra overhead to record timestamp and put it into hashmap (keyed by
> > thread id) or thread local storage (even if they have no other use for
> > thread local storage).
>
> sounds good
>
> >
> > Also, consider that a similar concept is applicable to uprobes and we
> > should do that as well, in similar fashion. And the above approach
> > works for both kprobe/kretprobe and uprobe/uretprobe cases, because
> > they have the same pt_regs data structure as a context (even if for
> > exit probes most of the values of pt_regs are not that meaningful).
>
> ok, that seems useful for uprobes as well

yes, absolutely

>
> btw I have another unrelated change in stash that that let you choose
> the bpf_prog_active or prog->active re-entry check before program is
> executed in krobe_multi link.. I also added extra flag, and it still
> seems like good idea to me, thoughts? ;-)
>

no specific opinion on this from my side, I guess

> >
> > So anyways, great feature, but let's discuss end-to-end usability
> > story before we finalize the implementation?
>
> yep, it does say RFC in the subject ;-)
>
> >
> >

[...]
Jiri Olsa Feb. 13, 2024, 12:09 p.m. UTC | #4
On Mon, Feb 12, 2024 at 08:06:06PM -0800, Andrii Nakryiko wrote:

SNIP

> > > But the way you implement it with extra flag and extra fd parameter
> > > makes it harder to have a nice high-level support in libbpf (and
> > > presumably other BPF loader libraries) for this.
> > >
> > > When I was thinking about doing something like this, I was considering
> > > adding a new program type, actually. That way it's possible to define
> > > this "let's skip return probe" protocol without backwards
> > > compatibility concerns. It's easier to use it declaratively in libbpf.
> >
> > ok, that seems cleaner.. but we need to use current kprobe programs,
> > so not sure at the moment how would that fit in.. did you mean new
> > link type?
> 
> It's kind of a less important detail, actually. New program type would
> allow us to have an entirely different context type, but I think we
> can make do with the existing kprobe program type. We can have a
> separate attach_type and link type, just like multi-kprobe and
> multi-uprobe are still kprobe programs.

ok, having new attach type on top of kprobe_multi link makes sense

> 
> >
> > > You just declare SEC("kprobe.wrap/...") (or whatever the name,
> > > something to designate that it's both entry and exit probe) as one
> > > program and in the code there would be some way to determine whether
> > > we are in entry mode or exit mode (helper or field in the custom
> > > context type, the latter being faster and more usable, but it's
> > > probably not critical).
> >
> > hum, so the single program would be for both entry and exit probe,
> > I'll need to check how bad it'd be for us, but it'd probably mean
> > just one extra tail call, so it's likely ok
> 
> I guess, I don't know what you are doing there :) I'd recommend
> looking at utilizing BPF global subprogs instead of tail calls, if
> your kernel allows for that, as that's a saner way to scale BPF
> verification.

ok, we should probably do that.. given this enhancement will be
available on latest kernel anyway, we could use global subprogs
as well

the related bpftrace might be bit more challenging.. will have to
generate program calling entry or return program now, but seems
doable of course

> 
> >
> > >
> > > Another frequently requested feature and a very common use case is to
> > > measure duration of the function, so if we have a custom type, we can
> > > have a field to record entry timestamp and let user calculate
> >
> > you mean bpf program context right?
> 
> Yes, when I was writing that I was imagining a new context type with new fields.
> 
> But thinking about this a bit more, I like a slightly different
> approach now. Instead of having a new context type, kernel capturing
> timestamps, etc, what if we introduce a concept of "a session". This
> paired kprobe+kretprobe (and same for uprobe+uretprobe) share a common
> session. So, say, if we are tracing kernel function abc() and we have
> our kwrapper (let's call it that for now, kwrapper and uwrapper?)
> program attached to abc, we have something like this:
> 
> 1) abc() is called
> 2) we intercept it at its entry, initialize "session"
> 3) call our kwrapper program in entry mode
> 4) abc() proceeds, until return
> 5) our kwrapper program is called in exit mode
> 6) at this point the session has ended
> 
> If abc() is called again or in parallel on another CPU, each such
> abc() invocation (with kwrapper prog) has its own session state.
> 
> Now, it all sounds scary, but a session could be really just a 8 byte
> value that has to last for the duration of a single kprobe+kretprobe
> execution.
> 
> Imagine we introduce just one new kfunc that would be usable from
> kwrapper and uwrapper programs:
> 
> u64 *bpf_get_session_cookie(void);
> 
> It returns a pointer to an 8-byte slot which initially is set to zero
> by kernel (before kprobe/entry part). And then the program can put
> anything into it. Either from entry (kprobe/uprobe) or exit
> (kretprobe/uretprobe) executions. Whatever value was stored in entry
> mode will still be there in exit mode as well.
> 
> This gives a lot of flexibility without requiring the kernel to do
> anything expensive.
> 
> E.g., it's easy to detect if kwrapper program is an entry call or exit call:
> 
> u64 *cookie = bpf_get_session_cookie();
> if (*cookie == 0) {
>   /* entry mode */
>   *cookie = 1;
> } else {
>   /* exit mode */
> }
> 
> Or if we wanted to measure function duration/latency:
> 
> u64 *cookie = bpf_get_session_cookie();
> if (*cookie == 0) {
>   *cookie = bpf_get_ktime_ns();
> } else {
>   u64 duration = bpf_get_ktime_ns() - *cookie;
>   /* output duration somehow */
> }
> 
> Yes, I realize special-casing zero might be a bit inconvenient, but I
> think simplicity trumps a potential for zero to be a valid value (and
> there are always ways to work around zero as a meaningful value).
> 
> Now, in more complicated cases 8 bytes of temporary session state
> isn't enough, just like BPF cookie being 8 byte (read-only) value
> might not be enough. But the solution is the same as with the BPF
> cookie. You just use those 8 bytes as a key into ARRAY/HASHMAP/whatnot
> storage. It's simple and fast enough for pretty much any case.

I was recently asked for a way to have function arguments available
in the return kprobe as it is in fexit programs (which was not an
option to use, because we don't have fast multi attach for it)

using the hash map to store arguments and storing its key in the
session data might be solution for this

> 
> 
> Now, how do we implement this storage session? I recently looked into
> uprobe/uretprobe implementation, and I know that for uretprobes there
> is a dynamically allocated uretprobe_instance (or whatever the name)
> allocated for each runtime invocation of uretprobe. We can use that.
> We'll need to allocate this instance a bit earlier, before uprobe part
> of uwrapper BPF program has to be executed, initialize session cookie
> to zero, and store pointer to this memory in bpf_run_ctx (just like we
> do for BPF cookie).

I think you refer to the return_instance and it seems like we could use it,
I'll check on that

> 
> I bet there is something similar in the kretprobe case, where we can
> carve out 8 bytes and pass it to both entry and exit parts of kwrapper
> program.

for kprobes.. both kprobe and kprobe_multi/fprobe use rethook to invoke
return probes, so I guess we could use it and store that shared data
in there

btw Masami is in process of removing rethook from kprobe_multi/fprobe,
as part of migrating fprobe on top of ftrace [0]

but instead the rethook I think there'll be some sort of shadow stack/data
area accessible from both entry and return probes, that we could use

jirka


[0] https://lore.kernel.org/bpf/170723204881.502590.11906735097521170661.stgit@devnote2/

> 
> So anyways, might need some internal refactoring, but seems very
> doable and will give a lot of power and flexibility for tracing use
> cases.
> 
> It would be cool to also have a similar fentry/fexit combo with "BPF
> session cookie", but that's something we can think about separately,
> probably.
> 
> 
> >
> > > duration, if necessary. Without this users have to pay a bunch of
> > > extra overhead to record timestamp and put it into hashmap (keyed by
> > > thread id) or thread local storage (even if they have no other use for
> > > thread local storage).
> >
> > sounds good
> >
> > >
> > > Also, consider that a similar concept is applicable to uprobes and we
> > > should do that as well, in similar fashion. And the above approach
> > > works for both kprobe/kretprobe and uprobe/uretprobe cases, because
> > > they have the same pt_regs data structure as a context (even if for
> > > exit probes most of the values of pt_regs are not that meaningful).
> >
> > ok, that seems useful for uprobes as well
> 
> yes, absolutely
> 
> >
> > btw I have another unrelated change in stash that that let you choose
> > the bpf_prog_active or prog->active re-entry check before program is
> > executed in krobe_multi link.. I also added extra flag, and it still
> > seems like good idea to me, thoughts? ;-)
> >
> 
> no specific opinion on this from my side, I guess
> 
> > >
> > > So anyways, great feature, but let's discuss end-to-end usability
> > > story before we finalize the implementation?
> >
> > yep, it does say RFC in the subject ;-)
> >
> > >
> > >
> 
> [...]
Andrii Nakryiko Feb. 13, 2024, 6:20 p.m. UTC | #5
On Tue, Feb 13, 2024 at 4:09 AM Jiri Olsa <olsajiri@gmail.com> wrote:
>
> On Mon, Feb 12, 2024 at 08:06:06PM -0800, Andrii Nakryiko wrote:
>
> SNIP
>
> > > > But the way you implement it with extra flag and extra fd parameter
> > > > makes it harder to have a nice high-level support in libbpf (and
> > > > presumably other BPF loader libraries) for this.
> > > >
> > > > When I was thinking about doing something like this, I was considering
> > > > adding a new program type, actually. That way it's possible to define
> > > > this "let's skip return probe" protocol without backwards
> > > > compatibility concerns. It's easier to use it declaratively in libbpf.
> > >
> > > ok, that seems cleaner.. but we need to use current kprobe programs,
> > > so not sure at the moment how would that fit in.. did you mean new
> > > link type?
> >
> > It's kind of a less important detail, actually. New program type would
> > allow us to have an entirely different context type, but I think we
> > can make do with the existing kprobe program type. We can have a
> > separate attach_type and link type, just like multi-kprobe and
> > multi-uprobe are still kprobe programs.
>
> ok, having new attach type on top of kprobe_multi link makes sense
>
> >
> > >
> > > > You just declare SEC("kprobe.wrap/...") (or whatever the name,
> > > > something to designate that it's both entry and exit probe) as one
> > > > program and in the code there would be some way to determine whether
> > > > we are in entry mode or exit mode (helper or field in the custom
> > > > context type, the latter being faster and more usable, but it's
> > > > probably not critical).
> > >
> > > hum, so the single program would be for both entry and exit probe,
> > > I'll need to check how bad it'd be for us, but it'd probably mean
> > > just one extra tail call, so it's likely ok
> >
> > I guess, I don't know what you are doing there :) I'd recommend
> > looking at utilizing BPF global subprogs instead of tail calls, if
> > your kernel allows for that, as that's a saner way to scale BPF
> > verification.
>
> ok, we should probably do that.. given this enhancement will be
> available on latest kernel anyway, we could use global subprogs
> as well
>
> the related bpftrace might be bit more challenging.. will have to
> generate program calling entry or return program now, but seems
> doable of course

So you want users to still have separate kprobe and kretprobe in
bpftrace, but combine them into this kwrapper transparently? It does
seem doable, but hopefully we'll be able to write kwrapper programs in
bpftrace directly as well.

>
> >
> > >
> > > >
> > > > Another frequently requested feature and a very common use case is to
> > > > measure duration of the function, so if we have a custom type, we can
> > > > have a field to record entry timestamp and let user calculate
> > >
> > > you mean bpf program context right?
> >
> > Yes, when I was writing that I was imagining a new context type with new fields.
> >
> > But thinking about this a bit more, I like a slightly different
> > approach now. Instead of having a new context type, kernel capturing
> > timestamps, etc, what if we introduce a concept of "a session". This
> > paired kprobe+kretprobe (and same for uprobe+uretprobe) share a common
> > session. So, say, if we are tracing kernel function abc() and we have
> > our kwrapper (let's call it that for now, kwrapper and uwrapper?)
> > program attached to abc, we have something like this:
> >
> > 1) abc() is called
> > 2) we intercept it at its entry, initialize "session"
> > 3) call our kwrapper program in entry mode
> > 4) abc() proceeds, until return
> > 5) our kwrapper program is called in exit mode
> > 6) at this point the session has ended
> >
> > If abc() is called again or in parallel on another CPU, each such
> > abc() invocation (with kwrapper prog) has its own session state.
> >
> > Now, it all sounds scary, but a session could be really just a 8 byte
> > value that has to last for the duration of a single kprobe+kretprobe
> > execution.
> >
> > Imagine we introduce just one new kfunc that would be usable from
> > kwrapper and uwrapper programs:
> >
> > u64 *bpf_get_session_cookie(void);
> >
> > It returns a pointer to an 8-byte slot which initially is set to zero
> > by kernel (before kprobe/entry part). And then the program can put
> > anything into it. Either from entry (kprobe/uprobe) or exit
> > (kretprobe/uretprobe) executions. Whatever value was stored in entry
> > mode will still be there in exit mode as well.
> >
> > This gives a lot of flexibility without requiring the kernel to do
> > anything expensive.
> >
> > E.g., it's easy to detect if kwrapper program is an entry call or exit call:
> >
> > u64 *cookie = bpf_get_session_cookie();
> > if (*cookie == 0) {
> >   /* entry mode */
> >   *cookie = 1;
> > } else {
> >   /* exit mode */
> > }
> >
> > Or if we wanted to measure function duration/latency:
> >
> > u64 *cookie = bpf_get_session_cookie();
> > if (*cookie == 0) {
> >   *cookie = bpf_get_ktime_ns();
> > } else {
> >   u64 duration = bpf_get_ktime_ns() - *cookie;
> >   /* output duration somehow */
> > }
> >
> > Yes, I realize special-casing zero might be a bit inconvenient, but I
> > think simplicity trumps a potential for zero to be a valid value (and
> > there are always ways to work around zero as a meaningful value).
> >
> > Now, in more complicated cases 8 bytes of temporary session state
> > isn't enough, just like BPF cookie being 8 byte (read-only) value
> > might not be enough. But the solution is the same as with the BPF
> > cookie. You just use those 8 bytes as a key into ARRAY/HASHMAP/whatnot
> > storage. It's simple and fast enough for pretty much any case.
>
> I was recently asked for a way to have function arguments available
> in the return kprobe as it is in fexit programs (which was not an
> option to use, because we don't have fast multi attach for it)
>
> using the hash map to store arguments and storing its key in the
> session data might be solution for this

if you are ok using hashmap keyed by tid, you can do it today without
any kernel changes. With session cookie you'll be able to utilize
faster ARRAY map (by building a simple ID allocator to get a free slot
in ARRAY map).

>
> >
> >
> > Now, how do we implement this storage session? I recently looked into
> > uprobe/uretprobe implementation, and I know that for uretprobes there
> > is a dynamically allocated uretprobe_instance (or whatever the name)
> > allocated for each runtime invocation of uretprobe. We can use that.
> > We'll need to allocate this instance a bit earlier, before uprobe part
> > of uwrapper BPF program has to be executed, initialize session cookie
> > to zero, and store pointer to this memory in bpf_run_ctx (just like we
> > do for BPF cookie).
>
> I think you refer to the return_instance and it seems like we could use it,
> I'll check on that
>

yep, probably

> >
> > I bet there is something similar in the kretprobe case, where we can
> > carve out 8 bytes and pass it to both entry and exit parts of kwrapper
> > program.
>
> for kprobes.. both kprobe and kprobe_multi/fprobe use rethook to invoke
> return probes, so I guess we could use it and store that shared data
> in there
>
> btw Masami is in process of removing rethook from kprobe_multi/fprobe,
> as part of migrating fprobe on top of ftrace [0]
>
> but instead the rethook I think there'll be some sort of shadow stack/data
> area accessible from both entry and return probes, that we could use

ok, cool. We also need to be careful to not share session cookie
between unrelated programs. E.g., if two independent kwrapper programs
are attached to the same function, they should each have their own
cookie. Otherwise it's not clear how to build anything reliable on top
of that, tbh. This might be a problem, though, right?

>
> jirka
>
>
> [0] https://lore.kernel.org/bpf/170723204881.502590.11906735097521170661.stgit@devnote2/
>
> >
> > So anyways, might need some internal refactoring, but seems very
> > doable and will give a lot of power and flexibility for tracing use
> > cases.
> >
> > It would be cool to also have a similar fentry/fexit combo with "BPF
> > session cookie", but that's something we can think about separately,
> > probably.
> >
> >
> > >
> > > > duration, if necessary. Without this users have to pay a bunch of
> > > > extra overhead to record timestamp and put it into hashmap (keyed by
> > > > thread id) or thread local storage (even if they have no other use for
> > > > thread local storage).
> > >
> > > sounds good
> > >
> > > >
> > > > Also, consider that a similar concept is applicable to uprobes and we
> > > > should do that as well, in similar fashion. And the above approach
> > > > works for both kprobe/kretprobe and uprobe/uretprobe cases, because
> > > > they have the same pt_regs data structure as a context (even if for
> > > > exit probes most of the values of pt_regs are not that meaningful).
> > >
> > > ok, that seems useful for uprobes as well
> >
> > yes, absolutely
> >
> > >
> > > btw I have another unrelated change in stash that that let you choose
> > > the bpf_prog_active or prog->active re-entry check before program is
> > > executed in krobe_multi link.. I also added extra flag, and it still
> > > seems like good idea to me, thoughts? ;-)
> > >
> >
> > no specific opinion on this from my side, I guess
> >
> > > >
> > > > So anyways, great feature, but let's discuss end-to-end usability
> > > > story before we finalize the implementation?
> > >
> > > yep, it does say RFC in the subject ;-)
> > >
> > > >
> > > >
> >
> > [...]
Jiri Olsa Feb. 13, 2024, 9:09 p.m. UTC | #6
On Tue, Feb 13, 2024 at 10:20:46AM -0800, Andrii Nakryiko wrote:
> On Tue, Feb 13, 2024 at 4:09 AM Jiri Olsa <olsajiri@gmail.com> wrote:
> >
> > On Mon, Feb 12, 2024 at 08:06:06PM -0800, Andrii Nakryiko wrote:
> >
> > SNIP
> >
> > > > > But the way you implement it with extra flag and extra fd parameter
> > > > > makes it harder to have a nice high-level support in libbpf (and
> > > > > presumably other BPF loader libraries) for this.
> > > > >
> > > > > When I was thinking about doing something like this, I was considering
> > > > > adding a new program type, actually. That way it's possible to define
> > > > > this "let's skip return probe" protocol without backwards
> > > > > compatibility concerns. It's easier to use it declaratively in libbpf.
> > > >
> > > > ok, that seems cleaner.. but we need to use current kprobe programs,
> > > > so not sure at the moment how would that fit in.. did you mean new
> > > > link type?
> > >
> > > It's kind of a less important detail, actually. New program type would
> > > allow us to have an entirely different context type, but I think we
> > > can make do with the existing kprobe program type. We can have a
> > > separate attach_type and link type, just like multi-kprobe and
> > > multi-uprobe are still kprobe programs.
> >
> > ok, having new attach type on top of kprobe_multi link makes sense
> >
> > >
> > > >
> > > > > You just declare SEC("kprobe.wrap/...") (or whatever the name,
> > > > > something to designate that it's both entry and exit probe) as one
> > > > > program and in the code there would be some way to determine whether
> > > > > we are in entry mode or exit mode (helper or field in the custom
> > > > > context type, the latter being faster and more usable, but it's
> > > > > probably not critical).
> > > >
> > > > hum, so the single program would be for both entry and exit probe,
> > > > I'll need to check how bad it'd be for us, but it'd probably mean
> > > > just one extra tail call, so it's likely ok
> > >
> > > I guess, I don't know what you are doing there :) I'd recommend
> > > looking at utilizing BPF global subprogs instead of tail calls, if
> > > your kernel allows for that, as that's a saner way to scale BPF
> > > verification.
> >
> > ok, we should probably do that.. given this enhancement will be
> > available on latest kernel anyway, we could use global subprogs
> > as well
> >
> > the related bpftrace might be bit more challenging.. will have to
> > generate program calling entry or return program now, but seems
> > doable of course
> 
> So you want users to still have separate kprobe and kretprobe in
> bpftrace, but combine them into this kwrapper transparently? It does

no I meant I'd need to generate the wrapper program for the new
interface.. which is extra compared to current bpftrace changes

> seem doable, but hopefully we'll be able to write kwrapper programs in
> bpftrace directly as well.

yes, it should be fine

SNIP

> > >
> > > Yes, I realize special-casing zero might be a bit inconvenient, but I
> > > think simplicity trumps a potential for zero to be a valid value (and
> > > there are always ways to work around zero as a meaningful value).
> > >
> > > Now, in more complicated cases 8 bytes of temporary session state
> > > isn't enough, just like BPF cookie being 8 byte (read-only) value
> > > might not be enough. But the solution is the same as with the BPF
> > > cookie. You just use those 8 bytes as a key into ARRAY/HASHMAP/whatnot
> > > storage. It's simple and fast enough for pretty much any case.
> >
> > I was recently asked for a way to have function arguments available
> > in the return kprobe as it is in fexit programs (which was not an
> > option to use, because we don't have fast multi attach for it)
> >
> > using the hash map to store arguments and storing its key in the
> > session data might be solution for this
> 
> if you are ok using hashmap keyed by tid, you can do it today without
> any kernel changes. With session cookie you'll be able to utilize
> faster ARRAY map (by building a simple ID allocator to get a free slot
> in ARRAY map).

ok

SNIP

> > > I bet there is something similar in the kretprobe case, where we can
> > > carve out 8 bytes and pass it to both entry and exit parts of kwrapper
> > > program.
> >
> > for kprobes.. both kprobe and kprobe_multi/fprobe use rethook to invoke
> > return probes, so I guess we could use it and store that shared data
> > in there
> >
> > btw Masami is in process of removing rethook from kprobe_multi/fprobe,
> > as part of migrating fprobe on top of ftrace [0]
> >
> > but instead the rethook I think there'll be some sort of shadow stack/data
> > area accessible from both entry and return probes, that we could use
> 
> ok, cool. We also need to be careful to not share session cookie
> between unrelated programs. E.g., if two independent kwrapper programs
> are attached to the same function, they should each have their own
> cookie. Otherwise it's not clear how to build anything reliable on top
> of that, tbh. This might be a problem, though, right?

IIRC it's tracer specific data, the shadow stack data should be unique
for tracer and its called function, but I'll double check on that

jirka
Jiri Olsa Feb. 14, 2024, 8:55 p.m. UTC | #7
On Tue, Feb 13, 2024 at 10:09:09PM +0100, Jiri Olsa wrote:
> On Tue, Feb 13, 2024 at 10:20:46AM -0800, Andrii Nakryiko wrote:
> > On Tue, Feb 13, 2024 at 4:09 AM Jiri Olsa <olsajiri@gmail.com> wrote:
> > >
> > > On Mon, Feb 12, 2024 at 08:06:06PM -0800, Andrii Nakryiko wrote:
> > >
> > > SNIP
> > >
> > > > > > But the way you implement it with extra flag and extra fd parameter
> > > > > > makes it harder to have a nice high-level support in libbpf (and
> > > > > > presumably other BPF loader libraries) for this.
> > > > > >
> > > > > > When I was thinking about doing something like this, I was considering
> > > > > > adding a new program type, actually. That way it's possible to define
> > > > > > this "let's skip return probe" protocol without backwards
> > > > > > compatibility concerns. It's easier to use it declaratively in libbpf.
> > > > >
> > > > > ok, that seems cleaner.. but we need to use current kprobe programs,
> > > > > so not sure at the moment how would that fit in.. did you mean new
> > > > > link type?
> > > >
> > > > It's kind of a less important detail, actually. New program type would
> > > > allow us to have an entirely different context type, but I think we
> > > > can make do with the existing kprobe program type. We can have a
> > > > separate attach_type and link type, just like multi-kprobe and
> > > > multi-uprobe are still kprobe programs.
> > >
> > > ok, having new attach type on top of kprobe_multi link makes sense
> > >
> > > >
> > > > >
> > > > > > You just declare SEC("kprobe.wrap/...") (or whatever the name,
> > > > > > something to designate that it's both entry and exit probe) as one
> > > > > > program and in the code there would be some way to determine whether
> > > > > > we are in entry mode or exit mode (helper or field in the custom
> > > > > > context type, the latter being faster and more usable, but it's
> > > > > > probably not critical).
> > > > >
> > > > > hum, so the single program would be for both entry and exit probe,
> > > > > I'll need to check how bad it'd be for us, but it'd probably mean
> > > > > just one extra tail call, so it's likely ok
> > > >
> > > > I guess, I don't know what you are doing there :) I'd recommend
> > > > looking at utilizing BPF global subprogs instead of tail calls, if
> > > > your kernel allows for that, as that's a saner way to scale BPF
> > > > verification.
> > >
> > > ok, we should probably do that.. given this enhancement will be
> > > available on latest kernel anyway, we could use global subprogs
> > > as well
> > >
> > > the related bpftrace might be bit more challenging.. will have to
> > > generate program calling entry or return program now, but seems
> > > doable of course
> > 
> > So you want users to still have separate kprobe and kretprobe in
> > bpftrace, but combine them into this kwrapper transparently? It does
> 
> no I meant I'd need to generate the wrapper program for the new
> interface.. which is extra compared to current bpftrace changes
> 
> > seem doable, but hopefully we'll be able to write kwrapper programs in
> > bpftrace directly as well.
> 
> yes, it should be fine
> 
> SNIP
> 
> > > >
> > > > Yes, I realize special-casing zero might be a bit inconvenient, but I
> > > > think simplicity trumps a potential for zero to be a valid value (and
> > > > there are always ways to work around zero as a meaningful value).
> > > >
> > > > Now, in more complicated cases 8 bytes of temporary session state
> > > > isn't enough, just like BPF cookie being 8 byte (read-only) value
> > > > might not be enough. But the solution is the same as with the BPF
> > > > cookie. You just use those 8 bytes as a key into ARRAY/HASHMAP/whatnot
> > > > storage. It's simple and fast enough for pretty much any case.
> > >
> > > I was recently asked for a way to have function arguments available
> > > in the return kprobe as it is in fexit programs (which was not an
> > > option to use, because we don't have fast multi attach for it)
> > >
> > > using the hash map to store arguments and storing its key in the
> > > session data might be solution for this
> > 
> > if you are ok using hashmap keyed by tid, you can do it today without
> > any kernel changes. With session cookie you'll be able to utilize
> > faster ARRAY map (by building a simple ID allocator to get a free slot
> > in ARRAY map).
> 
> ok
> 
> SNIP
> 
> > > > I bet there is something similar in the kretprobe case, where we can
> > > > carve out 8 bytes and pass it to both entry and exit parts of kwrapper
> > > > program.
> > >
> > > for kprobes.. both kprobe and kprobe_multi/fprobe use rethook to invoke
> > > return probes, so I guess we could use it and store that shared data
> > > in there
> > >
> > > btw Masami is in process of removing rethook from kprobe_multi/fprobe,
> > > as part of migrating fprobe on top of ftrace [0]
> > >
> > > but instead the rethook I think there'll be some sort of shadow stack/data
> > > area accessible from both entry and return probes, that we could use
> > 
> > ok, cool. We also need to be careful to not share session cookie
> > between unrelated programs. E.g., if two independent kwrapper programs
> > are attached to the same function, they should each have their own
> > cookie. Otherwise it's not clear how to build anything reliable on top
> > of that, tbh. This might be a problem, though, right?
> 
> IIRC it's tracer specific data, the shadow stack data should be unique
> for tracer and its called function, but I'll double check on that

Masami,
we recently discussed the possibility to store data between entry/return probe,
IIUC your current patchset [0] allows that, but it seems to be shared across all
the tracers for the given function (__ftrace_return_to_handler).. is the plan to
make the shadow stack per tracer and function? /cc Steven

thanks,
jirka


[0] https://lore.kernel.org/bpf/170723204881.502590.11906735097521170661.stgit@devnote2/
Steven Rostedt Feb. 15, 2024, 4:27 p.m. UTC | #8
On Wed, 14 Feb 2024 21:55:03 +0100
Jiri Olsa <olsajiri@gmail.com> wrote:


> Masami,
> we recently discussed the possibility to store data between entry/return probe,
> IIUC your current patchset [0] allows that, but it seems to be shared across all
> the tracers for the given function (__ftrace_return_to_handler).. is the plan to
> make the shadow stack per tracer and function? /cc Steven

The shadow stack is per task, but it saves unique data per tracer per function.

It allows up to 16 different tracers attached to function graph tracing at
a time, as there is a limited shadow stack size, which all the attached
tracers use. Now you can create your own shadow stack per user (bpf
program?) and have a single registered function graph user that will demux
the data coming in and out of the function.

-- Steve
Jiri Olsa Feb. 16, 2024, 3:03 p.m. UTC | #9
On Thu, Feb 15, 2024 at 11:27:22AM -0500, Steven Rostedt wrote:
> On Wed, 14 Feb 2024 21:55:03 +0100
> Jiri Olsa <olsajiri@gmail.com> wrote:
> 
> 
> > Masami,
> > we recently discussed the possibility to store data between entry/return probe,
> > IIUC your current patchset [0] allows that, but it seems to be shared across all
> > the tracers for the given function (__ftrace_return_to_handler).. is the plan to
> > make the shadow stack per tracer and function? /cc Steven
> 
> The shadow stack is per task, but it saves unique data per tracer per function.
> 
> It allows up to 16 different tracers attached to function graph tracing at
> a time, as there is a limited shadow stack size, which all the attached
> tracers use. Now you can create your own shadow stack per user (bpf
> program?) and have a single registered function graph user that will demux
> the data coming in and out of the function.

ok, I guess that's fgraph_reserve_data/fgraph_retrieve_data api

we'd like to share data between bpf programs executed from entry
and exit handlers, that should do it

thanks,
jirka
Viktor Malik Feb. 19, 2024, 11:20 a.m. UTC | #10
On 2/13/24 22:09, Jiri Olsa wrote:
> On Tue, Feb 13, 2024 at 10:20:46AM -0800, Andrii Nakryiko wrote:
>> On Tue, Feb 13, 2024 at 4:09 AM Jiri Olsa <olsajiri@gmail.com> wrote:
>>>
>>> On Mon, Feb 12, 2024 at 08:06:06PM -0800, Andrii Nakryiko wrote:
>>>
>>> SNIP
>>>
>>>>>> But the way you implement it with extra flag and extra fd parameter
>>>>>> makes it harder to have a nice high-level support in libbpf (and
>>>>>> presumably other BPF loader libraries) for this.
>>>>>>
>>>>>> When I was thinking about doing something like this, I was considering
>>>>>> adding a new program type, actually. That way it's possible to define
>>>>>> this "let's skip return probe" protocol without backwards
>>>>>> compatibility concerns. It's easier to use it declaratively in libbpf.
>>>>>
>>>>> ok, that seems cleaner.. but we need to use current kprobe programs,
>>>>> so not sure at the moment how would that fit in.. did you mean new
>>>>> link type?
>>>>
>>>> It's kind of a less important detail, actually. New program type would
>>>> allow us to have an entirely different context type, but I think we
>>>> can make do with the existing kprobe program type. We can have a
>>>> separate attach_type and link type, just like multi-kprobe and
>>>> multi-uprobe are still kprobe programs.
>>>
>>> ok, having new attach type on top of kprobe_multi link makes sense
>>>
>>>>
>>>>>
>>>>>> You just declare SEC("kprobe.wrap/...") (or whatever the name,
>>>>>> something to designate that it's both entry and exit probe) as one
>>>>>> program and in the code there would be some way to determine whether
>>>>>> we are in entry mode or exit mode (helper or field in the custom
>>>>>> context type, the latter being faster and more usable, but it's
>>>>>> probably not critical).
>>>>>
>>>>> hum, so the single program would be for both entry and exit probe,
>>>>> I'll need to check how bad it'd be for us, but it'd probably mean
>>>>> just one extra tail call, so it's likely ok
>>>>
>>>> I guess, I don't know what you are doing there :) I'd recommend
>>>> looking at utilizing BPF global subprogs instead of tail calls, if
>>>> your kernel allows for that, as that's a saner way to scale BPF
>>>> verification.
>>>
>>> ok, we should probably do that.. given this enhancement will be
>>> available on latest kernel anyway, we could use global subprogs
>>> as well
>>>
>>> the related bpftrace might be bit more challenging.. will have to
>>> generate program calling entry or return program now, but seems
>>> doable of course
>>
>> So you want users to still have separate kprobe and kretprobe in
>> bpftrace, but combine them into this kwrapper transparently? It does
> 
> no I meant I'd need to generate the wrapper program for the new
> interface.. which is extra compared to current bpftrace changes

If you end up introducing this new kwrapper program type in libbpf, I
think that it'll make sense to have something similar in bpftrace, too.
Allowing users to write separate kprobe and kretprobe programs and
transparently combining them into kwrapper doesn't seem to bring much
value to me.

Jirka, if you need help with implementing bpftrace support for this, let
me know. I'm very interested in having this capability there.

Viktor

> 
>> seem doable, but hopefully we'll be able to write kwrapper programs in
>> bpftrace directly as well.
> 
> yes, it should be fine
> 
> SNIP
> 
>>>>
>>>> Yes, I realize special-casing zero might be a bit inconvenient, but I
>>>> think simplicity trumps a potential for zero to be a valid value (and
>>>> there are always ways to work around zero as a meaningful value).
>>>>
>>>> Now, in more complicated cases 8 bytes of temporary session state
>>>> isn't enough, just like BPF cookie being 8 byte (read-only) value
>>>> might not be enough. But the solution is the same as with the BPF
>>>> cookie. You just use those 8 bytes as a key into ARRAY/HASHMAP/whatnot
>>>> storage. It's simple and fast enough for pretty much any case.
>>>
>>> I was recently asked for a way to have function arguments available
>>> in the return kprobe as it is in fexit programs (which was not an
>>> option to use, because we don't have fast multi attach for it)
>>>
>>> using the hash map to store arguments and storing its key in the
>>> session data might be solution for this
>>
>> if you are ok using hashmap keyed by tid, you can do it today without
>> any kernel changes. With session cookie you'll be able to utilize
>> faster ARRAY map (by building a simple ID allocator to get a free slot
>> in ARRAY map).
> 
> ok
> 
> SNIP
> 
>>>> I bet there is something similar in the kretprobe case, where we can
>>>> carve out 8 bytes and pass it to both entry and exit parts of kwrapper
>>>> program.
>>>
>>> for kprobes.. both kprobe and kprobe_multi/fprobe use rethook to invoke
>>> return probes, so I guess we could use it and store that shared data
>>> in there
>>>
>>> btw Masami is in process of removing rethook from kprobe_multi/fprobe,
>>> as part of migrating fprobe on top of ftrace [0]
>>>
>>> but instead the rethook I think there'll be some sort of shadow stack/data
>>> area accessible from both entry and return probes, that we could use
>>
>> ok, cool. We also need to be careful to not share session cookie
>> between unrelated programs. E.g., if two independent kwrapper programs
>> are attached to the same function, they should each have their own
>> cookie. Otherwise it's not clear how to build anything reliable on top
>> of that, tbh. This might be a problem, though, right?
> 
> IIRC it's tracer specific data, the shadow stack data should be unique
> for tracer and its called function, but I'll double check on that
> 
> jirka
>
Jiri Olsa Feb. 19, 2024, 12:43 p.m. UTC | #11
On Mon, Feb 19, 2024 at 12:20:08PM +0100, Viktor Malik wrote:
> On 2/13/24 22:09, Jiri Olsa wrote:
> > On Tue, Feb 13, 2024 at 10:20:46AM -0800, Andrii Nakryiko wrote:
> >> On Tue, Feb 13, 2024 at 4:09 AM Jiri Olsa <olsajiri@gmail.com> wrote:
> >>>
> >>> On Mon, Feb 12, 2024 at 08:06:06PM -0800, Andrii Nakryiko wrote:
> >>>
> >>> SNIP
> >>>
> >>>>>> But the way you implement it with extra flag and extra fd parameter
> >>>>>> makes it harder to have a nice high-level support in libbpf (and
> >>>>>> presumably other BPF loader libraries) for this.
> >>>>>>
> >>>>>> When I was thinking about doing something like this, I was considering
> >>>>>> adding a new program type, actually. That way it's possible to define
> >>>>>> this "let's skip return probe" protocol without backwards
> >>>>>> compatibility concerns. It's easier to use it declaratively in libbpf.
> >>>>>
> >>>>> ok, that seems cleaner.. but we need to use current kprobe programs,
> >>>>> so not sure at the moment how would that fit in.. did you mean new
> >>>>> link type?
> >>>>
> >>>> It's kind of a less important detail, actually. New program type would
> >>>> allow us to have an entirely different context type, but I think we
> >>>> can make do with the existing kprobe program type. We can have a
> >>>> separate attach_type and link type, just like multi-kprobe and
> >>>> multi-uprobe are still kprobe programs.
> >>>
> >>> ok, having new attach type on top of kprobe_multi link makes sense
> >>>
> >>>>
> >>>>>
> >>>>>> You just declare SEC("kprobe.wrap/...") (or whatever the name,
> >>>>>> something to designate that it's both entry and exit probe) as one
> >>>>>> program and in the code there would be some way to determine whether
> >>>>>> we are in entry mode or exit mode (helper or field in the custom
> >>>>>> context type, the latter being faster and more usable, but it's
> >>>>>> probably not critical).
> >>>>>
> >>>>> hum, so the single program would be for both entry and exit probe,
> >>>>> I'll need to check how bad it'd be for us, but it'd probably mean
> >>>>> just one extra tail call, so it's likely ok
> >>>>
> >>>> I guess, I don't know what you are doing there :) I'd recommend
> >>>> looking at utilizing BPF global subprogs instead of tail calls, if
> >>>> your kernel allows for that, as that's a saner way to scale BPF
> >>>> verification.
> >>>
> >>> ok, we should probably do that.. given this enhancement will be
> >>> available on latest kernel anyway, we could use global subprogs
> >>> as well
> >>>
> >>> the related bpftrace might be bit more challenging.. will have to
> >>> generate program calling entry or return program now, but seems
> >>> doable of course
> >>
> >> So you want users to still have separate kprobe and kretprobe in
> >> bpftrace, but combine them into this kwrapper transparently? It does
> > 
> > no I meant I'd need to generate the wrapper program for the new
> > interface.. which is extra compared to current bpftrace changes
> 
> If you end up introducing this new kwrapper program type in libbpf, I
> think that it'll make sense to have something similar in bpftrace, too.
> Allowing users to write separate kprobe and kretprobe programs and
> transparently combining them into kwrapper doesn't seem to bring much
> value to me.

I kind of liked the idea of not introducing new probe type and silently
making things faster ;-) but having wrapper probe type makes also sense

> 
> Jirka, if you need help with implementing bpftrace support for this, let
> me know. I'm very interested in having this capability there.

that'd be great, I'll send new version of kernel changes soon

thanks,
jirka
Jiri Olsa Feb. 23, 2024, 9:32 a.m. UTC | #12
On Tue, Feb 13, 2024 at 01:09:54PM +0100, Jiri Olsa wrote:
> On Mon, Feb 12, 2024 at 08:06:06PM -0800, Andrii Nakryiko wrote:
> 
> SNIP
> 
> > > > But the way you implement it with extra flag and extra fd parameter
> > > > makes it harder to have a nice high-level support in libbpf (and
> > > > presumably other BPF loader libraries) for this.
> > > >
> > > > When I was thinking about doing something like this, I was considering
> > > > adding a new program type, actually. That way it's possible to define
> > > > this "let's skip return probe" protocol without backwards
> > > > compatibility concerns. It's easier to use it declaratively in libbpf.
> > >
> > > ok, that seems cleaner.. but we need to use current kprobe programs,
> > > so not sure at the moment how would that fit in.. did you mean new
> > > link type?
> > 
> > It's kind of a less important detail, actually. New program type would
> > allow us to have an entirely different context type, but I think we
> > can make do with the existing kprobe program type. We can have a
> > separate attach_type and link type, just like multi-kprobe and
> > multi-uprobe are still kprobe programs.
> 
> ok, having new attach type on top of kprobe_multi link makes sense

hi,
I have further question in here.. ;-)

so I implemented the new behaviour on top of new attach_type, but I keep
thinking that it's the 'same/similar logic' as if I added extra flag to
attr.link_create.kprobe_multi.flags, which results in much simpler change

is following logic breaking backward compatibility in any way?

  - having new flag in attr.link_create.kprobe_multi.flags
  - that force the attach of the bpf program to entry/exit function probes
  - and the kprobe entry program return value now controls invocation
    on the exit probe

so the new flag changes the semantics of the entry program return value,
which seems fine to me, because it depends on the new flag.. do I miss
any other problem with that?

thanks,
jirka

> 
> > 
> > >
> > > > You just declare SEC("kprobe.wrap/...") (or whatever the name,
> > > > something to designate that it's both entry and exit probe) as one
> > > > program and in the code there would be some way to determine whether
> > > > we are in entry mode or exit mode (helper or field in the custom
> > > > context type, the latter being faster and more usable, but it's
> > > > probably not critical).
> > >
> > > hum, so the single program would be for both entry and exit probe,
> > > I'll need to check how bad it'd be for us, but it'd probably mean
> > > just one extra tail call, so it's likely ok
> > 
> > I guess, I don't know what you are doing there :) I'd recommend
> > looking at utilizing BPF global subprogs instead of tail calls, if
> > your kernel allows for that, as that's a saner way to scale BPF
> > verification.
> 
> ok, we should probably do that.. given this enhancement will be
> available on latest kernel anyway, we could use global subprogs
> as well
> 
> the related bpftrace might be bit more challenging.. will have to
> generate program calling entry or return program now, but seems
> doable of course
> 
> > 
> > >
> > > >
> > > > Another frequently requested feature and a very common use case is to
> > > > measure duration of the function, so if we have a custom type, we can
> > > > have a field to record entry timestamp and let user calculate
> > >
> > > you mean bpf program context right?
> > 
> > Yes, when I was writing that I was imagining a new context type with new fields.
> > 
> > But thinking about this a bit more, I like a slightly different
> > approach now. Instead of having a new context type, kernel capturing
> > timestamps, etc, what if we introduce a concept of "a session". This
> > paired kprobe+kretprobe (and same for uprobe+uretprobe) share a common
> > session. So, say, if we are tracing kernel function abc() and we have
> > our kwrapper (let's call it that for now, kwrapper and uwrapper?)
> > program attached to abc, we have something like this:
> > 
> > 1) abc() is called
> > 2) we intercept it at its entry, initialize "session"
> > 3) call our kwrapper program in entry mode
> > 4) abc() proceeds, until return
> > 5) our kwrapper program is called in exit mode
> > 6) at this point the session has ended
> > 
> > If abc() is called again or in parallel on another CPU, each such
> > abc() invocation (with kwrapper prog) has its own session state.
> > 
> > Now, it all sounds scary, but a session could be really just a 8 byte
> > value that has to last for the duration of a single kprobe+kretprobe
> > execution.
> > 
> > Imagine we introduce just one new kfunc that would be usable from
> > kwrapper and uwrapper programs:
> > 
> > u64 *bpf_get_session_cookie(void);
> > 
> > It returns a pointer to an 8-byte slot which initially is set to zero
> > by kernel (before kprobe/entry part). And then the program can put
> > anything into it. Either from entry (kprobe/uprobe) or exit
> > (kretprobe/uretprobe) executions. Whatever value was stored in entry
> > mode will still be there in exit mode as well.
> > 
> > This gives a lot of flexibility without requiring the kernel to do
> > anything expensive.
> > 
> > E.g., it's easy to detect if kwrapper program is an entry call or exit call:
> > 
> > u64 *cookie = bpf_get_session_cookie();
> > if (*cookie == 0) {
> >   /* entry mode */
> >   *cookie = 1;
> > } else {
> >   /* exit mode */
> > }
> > 
> > Or if we wanted to measure function duration/latency:
> > 
> > u64 *cookie = bpf_get_session_cookie();
> > if (*cookie == 0) {
> >   *cookie = bpf_get_ktime_ns();
> > } else {
> >   u64 duration = bpf_get_ktime_ns() - *cookie;
> >   /* output duration somehow */
> > }
> > 
> > Yes, I realize special-casing zero might be a bit inconvenient, but I
> > think simplicity trumps a potential for zero to be a valid value (and
> > there are always ways to work around zero as a meaningful value).
> > 
> > Now, in more complicated cases 8 bytes of temporary session state
> > isn't enough, just like BPF cookie being 8 byte (read-only) value
> > might not be enough. But the solution is the same as with the BPF
> > cookie. You just use those 8 bytes as a key into ARRAY/HASHMAP/whatnot
> > storage. It's simple and fast enough for pretty much any case.
> 
> I was recently asked for a way to have function arguments available
> in the return kprobe as it is in fexit programs (which was not an
> option to use, because we don't have fast multi attach for it)
> 
> using the hash map to store arguments and storing its key in the
> session data might be solution for this
> 
> > 
> > 
> > Now, how do we implement this storage session? I recently looked into
> > uprobe/uretprobe implementation, and I know that for uretprobes there
> > is a dynamically allocated uretprobe_instance (or whatever the name)
> > allocated for each runtime invocation of uretprobe. We can use that.
> > We'll need to allocate this instance a bit earlier, before uprobe part
> > of uwrapper BPF program has to be executed, initialize session cookie
> > to zero, and store pointer to this memory in bpf_run_ctx (just like we
> > do for BPF cookie).
> 
> I think you refer to the return_instance and it seems like we could use it,
> I'll check on that
> 
> > 
> > I bet there is something similar in the kretprobe case, where we can
> > carve out 8 bytes and pass it to both entry and exit parts of kwrapper
> > program.
> 
> for kprobes.. both kprobe and kprobe_multi/fprobe use rethook to invoke
> return probes, so I guess we could use it and store that shared data
> in there
> 
> btw Masami is in process of removing rethook from kprobe_multi/fprobe,
> as part of migrating fprobe on top of ftrace [0]
> 
> but instead the rethook I think there'll be some sort of shadow stack/data
> area accessible from both entry and return probes, that we could use
> 
> jirka
> 
> 
> [0] https://lore.kernel.org/bpf/170723204881.502590.11906735097521170661.stgit@devnote2/
> 
> > 
> > So anyways, might need some internal refactoring, but seems very
> > doable and will give a lot of power and flexibility for tracing use
> > cases.
> > 
> > It would be cool to also have a similar fentry/fexit combo with "BPF
> > session cookie", but that's something we can think about separately,
> > probably.
> > 
> > 
> > >
> > > > duration, if necessary. Without this users have to pay a bunch of
> > > > extra overhead to record timestamp and put it into hashmap (keyed by
> > > > thread id) or thread local storage (even if they have no other use for
> > > > thread local storage).
> > >
> > > sounds good
> > >
> > > >
> > > > Also, consider that a similar concept is applicable to uprobes and we
> > > > should do that as well, in similar fashion. And the above approach
> > > > works for both kprobe/kretprobe and uprobe/uretprobe cases, because
> > > > they have the same pt_regs data structure as a context (even if for
> > > > exit probes most of the values of pt_regs are not that meaningful).
> > >
> > > ok, that seems useful for uprobes as well
> > 
> > yes, absolutely
> > 
> > >
> > > btw I have another unrelated change in stash that that let you choose
> > > the bpf_prog_active or prog->active re-entry check before program is
> > > executed in krobe_multi link.. I also added extra flag, and it still
> > > seems like good idea to me, thoughts? ;-)
> > >
> > 
> > no specific opinion on this from my side, I guess
> > 
> > > >
> > > > So anyways, great feature, but let's discuss end-to-end usability
> > > > story before we finalize the implementation?
> > >
> > > yep, it does say RFC in the subject ;-)
> > >
> > > >
> > > >
> > 
> > [...]
Andrii Nakryiko Feb. 29, 2024, 12:43 a.m. UTC | #13
On Fri, Feb 23, 2024 at 1:32 AM Jiri Olsa <olsajiri@gmail.com> wrote:
>
> On Tue, Feb 13, 2024 at 01:09:54PM +0100, Jiri Olsa wrote:
> > On Mon, Feb 12, 2024 at 08:06:06PM -0800, Andrii Nakryiko wrote:
> >
> > SNIP
> >
> > > > > But the way you implement it with extra flag and extra fd parameter
> > > > > makes it harder to have a nice high-level support in libbpf (and
> > > > > presumably other BPF loader libraries) for this.
> > > > >
> > > > > When I was thinking about doing something like this, I was considering
> > > > > adding a new program type, actually. That way it's possible to define
> > > > > this "let's skip return probe" protocol without backwards
> > > > > compatibility concerns. It's easier to use it declaratively in libbpf.
> > > >
> > > > ok, that seems cleaner.. but we need to use current kprobe programs,
> > > > so not sure at the moment how would that fit in.. did you mean new
> > > > link type?
> > >
> > > It's kind of a less important detail, actually. New program type would
> > > allow us to have an entirely different context type, but I think we
> > > can make do with the existing kprobe program type. We can have a
> > > separate attach_type and link type, just like multi-kprobe and
> > > multi-uprobe are still kprobe programs.
> >
> > ok, having new attach type on top of kprobe_multi link makes sense
>
> hi,
> I have further question in here.. ;-)
>
> so I implemented the new behaviour on top of new attach_type, but I keep
> thinking that it's the 'same/similar logic' as if I added extra flag to
> attr.link_create.kprobe_multi.flags, which results in much simpler change
>
> is following logic breaking backward compatibility in any way?

Even if it doesn't break compatibility, isn't new attach_type better
to explicitly show that semantics is completely different (it's two
BPF program executions, before and after, for each specified kprobe),
isn't that different enough? We can also have a different set of
helpers for this kwrapper/uwrapper program type (probably could do it
through flags as well, but attach_type again seems a better fit
here..)?

What makes it harder if you are doing it as a new attach_type?

>
>   - having new flag in attr.link_create.kprobe_multi.flags
>   - that force the attach of the bpf program to entry/exit function probes
>   - and the kprobe entry program return value now controls invocation
>     on the exit probe
>
> so the new flag changes the semantics of the entry program return value,
> which seems fine to me, because it depends on the new flag.. do I miss
> any other problem with that?
>
> thanks,
> jirka
>
> >
> > >
> > > >
> > > > > You just declare SEC("kprobe.wrap/...") (or whatever the name,
> > > > > something to designate that it's both entry and exit probe) as one
> > > > > program and in the code there would be some way to determine whether
> > > > > we are in entry mode or exit mode (helper or field in the custom
> > > > > context type, the latter being faster and more usable, but it's
> > > > > probably not critical).
> > > >
> > > > hum, so the single program would be for both entry and exit probe,
> > > > I'll need to check how bad it'd be for us, but it'd probably mean
> > > > just one extra tail call, so it's likely ok
> > >
> > > I guess, I don't know what you are doing there :) I'd recommend
> > > looking at utilizing BPF global subprogs instead of tail calls, if
> > > your kernel allows for that, as that's a saner way to scale BPF
> > > verification.
> >

[...]
Andrii Nakryiko Feb. 29, 2024, 1:25 a.m. UTC | #14
On Wed, Feb 28, 2024 at 4:43 PM Andrii Nakryiko
<andrii.nakryiko@gmail.com> wrote:
>
> On Fri, Feb 23, 2024 at 1:32 AM Jiri Olsa <olsajiri@gmail.com> wrote:
> >
> > On Tue, Feb 13, 2024 at 01:09:54PM +0100, Jiri Olsa wrote:
> > > On Mon, Feb 12, 2024 at 08:06:06PM -0800, Andrii Nakryiko wrote:
> > >
> > > SNIP
> > >
> > > > > > But the way you implement it with extra flag and extra fd parameter
> > > > > > makes it harder to have a nice high-level support in libbpf (and
> > > > > > presumably other BPF loader libraries) for this.
> > > > > >
> > > > > > When I was thinking about doing something like this, I was considering
> > > > > > adding a new program type, actually. That way it's possible to define
> > > > > > this "let's skip return probe" protocol without backwards
> > > > > > compatibility concerns. It's easier to use it declaratively in libbpf.
> > > > >
> > > > > ok, that seems cleaner.. but we need to use current kprobe programs,
> > > > > so not sure at the moment how would that fit in.. did you mean new
> > > > > link type?
> > > >
> > > > It's kind of a less important detail, actually. New program type would
> > > > allow us to have an entirely different context type, but I think we
> > > > can make do with the existing kprobe program type. We can have a
> > > > separate attach_type and link type, just like multi-kprobe and
> > > > multi-uprobe are still kprobe programs.
> > >
> > > ok, having new attach type on top of kprobe_multi link makes sense
> >
> > hi,
> > I have further question in here.. ;-)
> >
> > so I implemented the new behaviour on top of new attach_type, but I keep
> > thinking that it's the 'same/similar logic' as if I added extra flag to
> > attr.link_create.kprobe_multi.flags, which results in much simpler change
> >
> > is following logic breaking backward compatibility in any way?
>
> Even if it doesn't break compatibility, isn't new attach_type better
> to explicitly show that semantics is completely different (it's two
> BPF program executions, before and after, for each specified kprobe),
> isn't that different enough? We can also have a different set of
> helpers for this kwrapper/uwrapper program type (probably could do it
> through flags as well, but attach_type again seems a better fit
> here..)?
>
> What makes it harder if you are doing it as a new attach_type?

I just looked at the RFC patches you sent, and I guess it looks fine
to me with the flag as well, given kretprobe.multi is also specified
through flags.

>
> >
> >   - having new flag in attr.link_create.kprobe_multi.flags
> >   - that force the attach of the bpf program to entry/exit function probes
> >   - and the kprobe entry program return value now controls invocation
> >     on the exit probe
> >
> > so the new flag changes the semantics of the entry program return value,
> > which seems fine to me, because it depends on the new flag.. do I miss
> > any other problem with that?
> >
> > thanks,
> > jirka
> >
> > >
> > > >
> > > > >
> > > > > > You just declare SEC("kprobe.wrap/...") (or whatever the name,
> > > > > > something to designate that it's both entry and exit probe) as one
> > > > > > program and in the code there would be some way to determine whether
> > > > > > we are in entry mode or exit mode (helper or field in the custom
> > > > > > context type, the latter being faster and more usable, but it's
> > > > > > probably not critical).
> > > > >
> > > > > hum, so the single program would be for both entry and exit probe,
> > > > > I'll need to check how bad it'd be for us, but it'd probably mean
> > > > > just one extra tail call, so it's likely ok
> > > >
> > > > I guess, I don't know what you are doing there :) I'd recommend
> > > > looking at utilizing BPF global subprogs instead of tail calls, if
> > > > your kernel allows for that, as that's a saner way to scale BPF
> > > > verification.
> > >
>
> [...]