
[v2,bpf-next,04/14] bpf: implement minimal BPF perf link

Message ID 20210726161211.925206-5-andrii@kernel.org (mailing list archive)
State Changes Requested
Delegated to: BPF
Series BPF perf link and user-provided context value

Checks

Context Check Description
netdev/cover_letter success Link
netdev/fixes_present success Link
netdev/patch_count success Link
netdev/tree_selection success Clearly marked for bpf-next
netdev/subject_prefix success Link
netdev/cc_maintainers warning 17 maintainers not CCed: jolsa@redhat.com netdev@vger.kernel.org joe@cilium.io haoluo@google.com mark.rutland@arm.com yhs@fb.com kpsingh@kernel.org mingo@redhat.com alexander.shishkin@linux.intel.com kafai@fb.com rostedt@goodmis.org acme@kernel.org john.fastabend@gmail.com namhyung@kernel.org songliubraving@fb.com linux-perf-users@vger.kernel.org quentin@isovalent.com
netdev/source_inline success Was 0 now: 0
netdev/verify_signedoff success Link
netdev/module_param success Was 0 now: 0
netdev/build_32bit success Errors and warnings before: 11868 this patch: 11868
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/verify_fixes success Link
netdev/checkpatch warning WARNING: line length of 82 exceeds 80 columns WARNING: line length of 88 exceeds 80 columns WARNING: line length of 89 exceeds 80 columns
netdev/build_allmodconfig_warn success Errors and warnings before: 11370 this patch: 11370
netdev/header_inline success Link

Commit Message

Andrii Nakryiko July 26, 2021, 4:12 p.m. UTC
Introduce a new type of BPF link - BPF perf link. This brings perf_event-based
BPF program attachments (perf_event, tracepoints, kprobes, and uprobes) into
the common BPF link infrastructure, making it possible to list all active
perf_event-based attachments, auto-detach the BPF program from the perf_event
when the link's FD is closed, and get generic BPF link fdinfo/get_info
functionality.

The BPF_LINK_CREATE command expects a perf_event FD as target_fd. No extra
flags are currently supported.

Force-detaching and atomic BPF program updates are not yet implemented, but
with perf_event-based BPF links we now have a common framework for this
without the need to extend the ioctl()-based perf_event interface.

One interesting consideration is a new value for bpf_attach_type, which the
BPF_LINK_CREATE command expects. Generally, it's either a 1-to-1 mapping from
bpf_attach_type to bpf_prog_type, or a many-to-1 mapping from a subset of
bpf_attach_types to one bpf_prog_type (e.g., see BPF_PROG_TYPE_SK_SKB or
BPF_PROG_TYPE_CGROUP_SOCK). In this case, though, we have three different
program types (KPROBE, TRACEPOINT, PERF_EVENT) using the same perf_event-based
mechanism, so it's many bpf_prog_types to one bpf_attach_type. I chose to
define a single BPF_PERF_EVENT attach type for all of them and adjust
link_create()'s logic for checking the correspondence between attach type and
program type.

The alternative would be to define three new attach types (e.g., BPF_KPROBE,
BPF_TRACEPOINT, and BPF_PERF_EVENT), but that seemed like unnecessary
overkill, and BPF_KPROBE would cause naming conflicts with the BPF_KPROBE()
macro defined by libbpf. I chose not to do this to avoid unnecessary
proliferation of bpf_attach_type enum values and to avoid having to deal with
naming conflicts.

Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
---
 include/linux/bpf_types.h      |   3 +
 include/linux/trace_events.h   |   3 +
 include/uapi/linux/bpf.h       |   2 +
 kernel/bpf/syscall.c           | 105 ++++++++++++++++++++++++++++++---
 kernel/events/core.c           |  10 ++--
 tools/include/uapi/linux/bpf.h |   2 +
 6 files changed, 112 insertions(+), 13 deletions(-)
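
To make the new UAPI concrete, here is a minimal user-space sketch of creating
such a link (an editorial illustration, not part of the patch; it assumes
prog_fd is a loaded BPF_PROG_TYPE_KPROBE program and perf_fd comes from
perf_event_open() for the matching kprobe event):

#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/bpf.h>

static int bpf_perf_link_create(int prog_fd, int perf_fd)
{
	union bpf_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.link_create.prog_fd = prog_fd;
	attr.link_create.target_fd = perf_fd;		/* perf_event FD */
	attr.link_create.attach_type = BPF_PERF_EVENT;	/* new attach type */
	attr.link_create.flags = 0;			/* no flags supported yet */

	/* returns a link FD; closing it auto-detaches the program */
	return syscall(__NR_bpf, BPF_LINK_CREATE, &attr, sizeof(attr));
}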

Comments

Peter Zijlstra July 27, 2021, 9:04 a.m. UTC | #1
On Mon, Jul 26, 2021 at 09:12:01AM -0700, Andrii Nakryiko wrote:
> Introduce a new type of BPF link - BPF perf link. This brings perf_event-based
> BPF program attachments (perf_event, tracepoints, kprobes, and uprobes) into
> the common BPF link infrastructure, allowing to list all active perf_event
> based attachments, auto-detaching BPF program from perf_event when link's FD
> is closed, get generic BPF link fdinfo/get_info functionality.
> 
> BPF_LINK_CREATE command expects perf_event's FD as target_fd. No extra flags
> are currently supported.
> 
> Force-detaching and atomic BPF program updates are not yet implemented, but
> with perf_event-based BPF links we now have common framework for this without
> the need to extend ioctl()-based perf_event interface.
> 
> One interesting consideration is a new value for bpf_attach_type, which
> BPF_LINK_CREATE command expects. Generally, it's either 1-to-1 mapping from
> bpf_attach_type to bpf_prog_type, or many-to-1 mapping from a subset of
> bpf_attach_types to one bpf_prog_type (e.g., see BPF_PROG_TYPE_SK_SKB or
> BPF_PROG_TYPE_CGROUP_SOCK). In this case, though, we have three different
> program types (KPROBE, TRACEPOINT, PERF_EVENT) using the same perf_event-based
> mechanism, so it's many bpf_prog_types to one bpf_attach_type. I chose to
> define a single BPF_PERF_EVENT attach type for all of them and adjust
> link_create()'s logic for checking correspondence between attach type and
> program type.
> 
> The alternative would be to define three new attach types (e.g., BPF_KPROBE,
> BPF_TRACEPOINT, and BPF_PERF_EVENT), but that seemed like unnecessary overkill
> and BPF_KPROBE will cause naming conflicts with BPF_KPROBE() macro, defined by
> libbpf. I chose to not do this to avoid unnecessary proliferation of
> bpf_attach_type enum values and not have to deal with naming conflicts.
> 

So I have no idea what all that means... I don't speak BPF. That said,
the patch doesn't look terrible.

One little question below, but otherwise:

Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>

> +static void bpf_perf_link_release(struct bpf_link *link)
> +{
> +	struct bpf_perf_link *perf_link = container_of(link, struct bpf_perf_link, link);
> +	struct perf_event *event = perf_link->perf_file->private_data;
> +
> +	perf_event_free_bpf_prog(event);
> +	fput(perf_link->perf_file);
> +}

> +static int bpf_perf_link_attach(const union bpf_attr *attr, struct bpf_prog *prog)
> +{
> +	struct bpf_link_primer link_primer;
> +	struct bpf_perf_link *link;
> +	struct perf_event *event;
> +	struct file *perf_file;
> +	int err;
> +
> +	if (attr->link_create.flags)
> +		return -EINVAL;
> +
> +	perf_file = perf_event_get(attr->link_create.target_fd);
> +	if (IS_ERR(perf_file))
> +		return PTR_ERR(perf_file);
> +
> +	link = kzalloc(sizeof(*link), GFP_USER);
> +	if (!link) {
> +		err = -ENOMEM;
> +		goto out_put_file;
> +	}
> +	bpf_link_init(&link->link, BPF_LINK_TYPE_PERF_EVENT, &bpf_perf_link_lops, prog);
> +	link->perf_file = perf_file;
> +
> +	err = bpf_link_prime(&link->link, &link_primer);
> +	if (err) {
> +		kfree(link);
> +		goto out_put_file;
> +	}
> +
> +	event = perf_file->private_data;
> +	err = perf_event_set_bpf_prog(event, prog);
> +	if (err) {
> +		bpf_link_cleanup(&link_primer);
> +		goto out_put_file;
> +	}
> +	/* perf_event_set_bpf_prog() doesn't take its own refcnt on prog */

Is that otherwise expected? AFAICT the previous users of that function
were guaranteed the existence of the BPF program. But AFAICT there is
nothing that prevents perf_event_*_bpf_prog() from doing the additional
refcounting if that is more convenient.

> +	bpf_prog_inc(prog);
> +
> +	return bpf_link_settle(&link_primer);
> +
> +out_put_file:
> +	fput(perf_file);
> +	return err;
> +}
Peter Zijlstra July 27, 2021, 9:12 a.m. UTC | #2
On Mon, Jul 26, 2021 at 09:12:01AM -0700, Andrii Nakryiko wrote:
> diff --git a/include/linux/trace_events.h b/include/linux/trace_events.h
> index ad413b382a3c..8ac92560d3a3 100644
> --- a/include/linux/trace_events.h
> +++ b/include/linux/trace_events.h
> @@ -803,6 +803,9 @@ extern void ftrace_profile_free_filter(struct perf_event *event);
>  void perf_trace_buf_update(void *record, u16 type);
>  void *perf_trace_buf_alloc(int size, struct pt_regs **regs, int *rctxp);
>  
> +int perf_event_set_bpf_prog(struct perf_event *event, struct bpf_prog *prog);
> +void perf_event_free_bpf_prog(struct perf_event *event);
> +
>  void bpf_trace_run1(struct bpf_prog *prog, u64 arg1);
>  void bpf_trace_run2(struct bpf_prog *prog, u64 arg1, u64 arg2);
>  void bpf_trace_run3(struct bpf_prog *prog, u64 arg1, u64 arg2,

Oh, I just noticed, is this the right header to put these in? Should
this not go into include/linux/perf_event.h ?
Jiri Olsa July 27, 2021, 3:40 p.m. UTC | #3
On Mon, Jul 26, 2021 at 09:12:01AM -0700, Andrii Nakryiko wrote:
> Introduce a new type of BPF link - BPF perf link. This brings perf_event-based
> BPF program attachments (perf_event, tracepoints, kprobes, and uprobes) into
> the common BPF link infrastructure, allowing to list all active perf_event
> based attachments, auto-detaching BPF program from perf_event when link's FD
> is closed, get generic BPF link fdinfo/get_info functionality.
> 
> BPF_LINK_CREATE command expects perf_event's FD as target_fd. No extra flags
> are currently supported.
> 
> Force-detaching and atomic BPF program updates are not yet implemented, but
> with perf_event-based BPF links we now have common framework for this without
> the need to extend ioctl()-based perf_event interface.
> 
> One interesting consideration is a new value for bpf_attach_type, which
> BPF_LINK_CREATE command expects. Generally, it's either 1-to-1 mapping from
> bpf_attach_type to bpf_prog_type, or many-to-1 mapping from a subset of
> bpf_attach_types to one bpf_prog_type (e.g., see BPF_PROG_TYPE_SK_SKB or
> BPF_PROG_TYPE_CGROUP_SOCK). In this case, though, we have three different
> program types (KPROBE, TRACEPOINT, PERF_EVENT) using the same perf_event-based
> mechanism, so it's many bpf_prog_types to one bpf_attach_type. I chose to
> define a single BPF_PERF_EVENT attach type for all of them and adjust
> link_create()'s logic for checking correspondence between attach type and
> program type.
> 
> The alternative would be to define three new attach types (e.g., BPF_KPROBE,
> BPF_TRACEPOINT, and BPF_PERF_EVENT), but that seemed like unnecessary overkill
> and BPF_KPROBE will cause naming conflicts with BPF_KPROBE() macro, defined by
> libbpf. I chose to not do this to avoid unnecessary proliferation of
> bpf_attach_type enum values and not have to deal with naming conflicts.
> 
> Cc: Peter Zijlstra <peterz@infradead.org>
> Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
> ---
>  include/linux/bpf_types.h      |   3 +
>  include/linux/trace_events.h   |   3 +
>  include/uapi/linux/bpf.h       |   2 +
>  kernel/bpf/syscall.c           | 105 ++++++++++++++++++++++++++++++---
>  kernel/events/core.c           |  10 ++--
>  tools/include/uapi/linux/bpf.h |   2 +
>  6 files changed, 112 insertions(+), 13 deletions(-)
> 
> diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
> index a9db1eae6796..0a1ada7f174d 100644
> --- a/include/linux/bpf_types.h
> +++ b/include/linux/bpf_types.h
> @@ -135,3 +135,6 @@ BPF_LINK_TYPE(BPF_LINK_TYPE_ITER, iter)
>  #ifdef CONFIG_NET
>  BPF_LINK_TYPE(BPF_LINK_TYPE_NETNS, netns)
>  #endif
> +#ifdef CONFIG_PERF_EVENTS
> +BPF_LINK_TYPE(BPF_LINK_TYPE_PERF_EVENT, perf)
> +#endif
> diff --git a/include/linux/trace_events.h b/include/linux/trace_events.h
> index ad413b382a3c..8ac92560d3a3 100644
> --- a/include/linux/trace_events.h
> +++ b/include/linux/trace_events.h
> @@ -803,6 +803,9 @@ extern void ftrace_profile_free_filter(struct perf_event *event);
>  void perf_trace_buf_update(void *record, u16 type);
>  void *perf_trace_buf_alloc(int size, struct pt_regs **regs, int *rctxp);
>  
> +int perf_event_set_bpf_prog(struct perf_event *event, struct bpf_prog *prog);
> +void perf_event_free_bpf_prog(struct perf_event *event);
> +
>  void bpf_trace_run1(struct bpf_prog *prog, u64 arg1);
>  void bpf_trace_run2(struct bpf_prog *prog, u64 arg1, u64 arg2);
>  void bpf_trace_run3(struct bpf_prog *prog, u64 arg1, u64 arg2,
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 2db6925e04f4..00b1267ab4f0 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -993,6 +993,7 @@ enum bpf_attach_type {
>  	BPF_SK_SKB_VERDICT,
>  	BPF_SK_REUSEPORT_SELECT,
>  	BPF_SK_REUSEPORT_SELECT_OR_MIGRATE,
> +	BPF_PERF_EVENT,
>  	__MAX_BPF_ATTACH_TYPE
>  };
>  
> @@ -1006,6 +1007,7 @@ enum bpf_link_type {
>  	BPF_LINK_TYPE_ITER = 4,
>  	BPF_LINK_TYPE_NETNS = 5,
>  	BPF_LINK_TYPE_XDP = 6,
> +	BPF_LINK_TYPE_PERF_EVENT = 6,

hi, should be 7

jirka
Andrii Nakryiko July 27, 2021, 8:56 p.m. UTC | #4
On Tue, Jul 27, 2021 at 2:15 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Mon, Jul 26, 2021 at 09:12:01AM -0700, Andrii Nakryiko wrote:
> > diff --git a/include/linux/trace_events.h b/include/linux/trace_events.h
> > index ad413b382a3c..8ac92560d3a3 100644
> > --- a/include/linux/trace_events.h
> > +++ b/include/linux/trace_events.h
> > @@ -803,6 +803,9 @@ extern void ftrace_profile_free_filter(struct perf_event *event);
> >  void perf_trace_buf_update(void *record, u16 type);
> >  void *perf_trace_buf_alloc(int size, struct pt_regs **regs, int *rctxp);
> >
> > +int perf_event_set_bpf_prog(struct perf_event *event, struct bpf_prog *prog);
> > +void perf_event_free_bpf_prog(struct perf_event *event);
> > +
> >  void bpf_trace_run1(struct bpf_prog *prog, u64 arg1);
> >  void bpf_trace_run2(struct bpf_prog *prog, u64 arg1, u64 arg2);
> >  void bpf_trace_run3(struct bpf_prog *prog, u64 arg1, u64 arg2,
>
> Oh, I just noticed, is this the right header to put these in? Should
> this not go into include/linux/perf_event.h ?

Not that I care much, but this one has perf_event_attach_bpf_prog()
and perf_event_detach_bpf_prog(), so it felt appropriate to put it
here. perf_event.h only seems to have perf_event_bpf_event() for
BPF-related stuff (which seems to be just notification events, not
really BPF functionality per se). But let me know if you prefer to add
these new ones to perf_event.h.
Andrii Nakryiko July 27, 2021, 8:56 p.m. UTC | #5
On Tue, Jul 27, 2021 at 8:40 AM Jiri Olsa <jolsa@redhat.com> wrote:
>
> On Mon, Jul 26, 2021 at 09:12:01AM -0700, Andrii Nakryiko wrote:
> > Introduce a new type of BPF link - BPF perf link. This brings perf_event-based
> > BPF program attachments (perf_event, tracepoints, kprobes, and uprobes) into
> > the common BPF link infrastructure, allowing to list all active perf_event
> > based attachments, auto-detaching BPF program from perf_event when link's FD
> > is closed, get generic BPF link fdinfo/get_info functionality.
> >
> > BPF_LINK_CREATE command expects perf_event's FD as target_fd. No extra flags
> > are currently supported.
> >
> > Force-detaching and atomic BPF program updates are not yet implemented, but
> > with perf_event-based BPF links we now have common framework for this without
> > the need to extend ioctl()-based perf_event interface.
> >
> > One interesting consideration is a new value for bpf_attach_type, which
> > BPF_LINK_CREATE command expects. Generally, it's either 1-to-1 mapping from
> > bpf_attach_type to bpf_prog_type, or many-to-1 mapping from a subset of
> > bpf_attach_types to one bpf_prog_type (e.g., see BPF_PROG_TYPE_SK_SKB or
> > BPF_PROG_TYPE_CGROUP_SOCK). In this case, though, we have three different
> > program types (KPROBE, TRACEPOINT, PERF_EVENT) using the same perf_event-based
> > mechanism, so it's many bpf_prog_types to one bpf_attach_type. I chose to
> > define a single BPF_PERF_EVENT attach type for all of them and adjust
> > link_create()'s logic for checking correspondence between attach type and
> > program type.
> >
> > The alternative would be to define three new attach types (e.g., BPF_KPROBE,
> > BPF_TRACEPOINT, and BPF_PERF_EVENT), but that seemed like unnecessary overkill
> > and BPF_KPROBE will cause naming conflicts with BPF_KPROBE() macro, defined by
> > libbpf. I chose to not do this to avoid unnecessary proliferation of
> > bpf_attach_type enum values and not have to deal with naming conflicts.
> >
> > Cc: Peter Zijlstra <peterz@infradead.org>
> > Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
> > ---
> >  include/linux/bpf_types.h      |   3 +
> >  include/linux/trace_events.h   |   3 +
> >  include/uapi/linux/bpf.h       |   2 +
> >  kernel/bpf/syscall.c           | 105 ++++++++++++++++++++++++++++++---
> >  kernel/events/core.c           |  10 ++--
> >  tools/include/uapi/linux/bpf.h |   2 +
> >  6 files changed, 112 insertions(+), 13 deletions(-)
> >
> > diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
> > index a9db1eae6796..0a1ada7f174d 100644
> > --- a/include/linux/bpf_types.h
> > +++ b/include/linux/bpf_types.h
> > @@ -135,3 +135,6 @@ BPF_LINK_TYPE(BPF_LINK_TYPE_ITER, iter)
> >  #ifdef CONFIG_NET
> >  BPF_LINK_TYPE(BPF_LINK_TYPE_NETNS, netns)
> >  #endif
> > +#ifdef CONFIG_PERF_EVENTS
> > +BPF_LINK_TYPE(BPF_LINK_TYPE_PERF_EVENT, perf)
> > +#endif
> > diff --git a/include/linux/trace_events.h b/include/linux/trace_events.h
> > index ad413b382a3c..8ac92560d3a3 100644
> > --- a/include/linux/trace_events.h
> > +++ b/include/linux/trace_events.h
> > @@ -803,6 +803,9 @@ extern void ftrace_profile_free_filter(struct perf_event *event);
> >  void perf_trace_buf_update(void *record, u16 type);
> >  void *perf_trace_buf_alloc(int size, struct pt_regs **regs, int *rctxp);
> >
> > +int perf_event_set_bpf_prog(struct perf_event *event, struct bpf_prog *prog);
> > +void perf_event_free_bpf_prog(struct perf_event *event);
> > +
> >  void bpf_trace_run1(struct bpf_prog *prog, u64 arg1);
> >  void bpf_trace_run2(struct bpf_prog *prog, u64 arg1, u64 arg2);
> >  void bpf_trace_run3(struct bpf_prog *prog, u64 arg1, u64 arg2,
> > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> > index 2db6925e04f4..00b1267ab4f0 100644
> > --- a/include/uapi/linux/bpf.h
> > +++ b/include/uapi/linux/bpf.h
> > @@ -993,6 +993,7 @@ enum bpf_attach_type {
> >       BPF_SK_SKB_VERDICT,
> >       BPF_SK_REUSEPORT_SELECT,
> >       BPF_SK_REUSEPORT_SELECT_OR_MIGRATE,
> > +     BPF_PERF_EVENT,
> >       __MAX_BPF_ATTACH_TYPE
> >  };
> >
> > @@ -1006,6 +1007,7 @@ enum bpf_link_type {
> >       BPF_LINK_TYPE_ITER = 4,
> >       BPF_LINK_TYPE_NETNS = 5,
> >       BPF_LINK_TYPE_XDP = 6,
> > +     BPF_LINK_TYPE_PERF_EVENT = 6,
>
> hi, should be 7
>

doh! Eagle eyes! Will fix, thanks :)

> jirka
>
Yonghong Song July 29, 2021, 5:35 p.m. UTC | #6
On 7/26/21 9:12 AM, Andrii Nakryiko wrote:
> Introduce a new type of BPF link - BPF perf link. This brings perf_event-based
> BPF program attachments (perf_event, tracepoints, kprobes, and uprobes) into
> the common BPF link infrastructure, allowing to list all active perf_event
> based attachments, auto-detaching BPF program from perf_event when link's FD
> is closed, get generic BPF link fdinfo/get_info functionality.
> 
> BPF_LINK_CREATE command expects perf_event's FD as target_fd. No extra flags
> are currently supported.
> 
> Force-detaching and atomic BPF program updates are not yet implemented, but
> with perf_event-based BPF links we now have common framework for this without
> the need to extend ioctl()-based perf_event interface.
> 
> One interesting consideration is a new value for bpf_attach_type, which
> BPF_LINK_CREATE command expects. Generally, it's either 1-to-1 mapping from
> bpf_attach_type to bpf_prog_type, or many-to-1 mapping from a subset of
> bpf_attach_types to one bpf_prog_type (e.g., see BPF_PROG_TYPE_SK_SKB or
> BPF_PROG_TYPE_CGROUP_SOCK). In this case, though, we have three different
> program types (KPROBE, TRACEPOINT, PERF_EVENT) using the same perf_event-based
> mechanism, so it's many bpf_prog_types to one bpf_attach_type. I chose to
> define a single BPF_PERF_EVENT attach type for all of them and adjust
> link_create()'s logic for checking correspondence between attach type and
> program type.
> 
> The alternative would be to define three new attach types (e.g., BPF_KPROBE,
> BPF_TRACEPOINT, and BPF_PERF_EVENT), but that seemed like unnecessary overkill
> and BPF_KPROBE will cause naming conflicts with BPF_KPROBE() macro, defined by
> libbpf. I chose to not do this to avoid unnecessary proliferation of
> bpf_attach_type enum values and not have to deal with naming conflicts.
> 
> Cc: Peter Zijlstra <peterz@infradead.org>
> Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
> ---
>   include/linux/bpf_types.h      |   3 +
>   include/linux/trace_events.h   |   3 +
>   include/uapi/linux/bpf.h       |   2 +
>   kernel/bpf/syscall.c           | 105 ++++++++++++++++++++++++++++++---
>   kernel/events/core.c           |  10 ++--
>   tools/include/uapi/linux/bpf.h |   2 +
>   6 files changed, 112 insertions(+), 13 deletions(-)
> 
> diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
> index a9db1eae6796..0a1ada7f174d 100644
> --- a/include/linux/bpf_types.h
> +++ b/include/linux/bpf_types.h
> @@ -135,3 +135,6 @@ BPF_LINK_TYPE(BPF_LINK_TYPE_ITER, iter)
>   #ifdef CONFIG_NET
>   BPF_LINK_TYPE(BPF_LINK_TYPE_NETNS, netns)
>   #endif
> +#ifdef CONFIG_PERF_EVENTS
> +BPF_LINK_TYPE(BPF_LINK_TYPE_PERF_EVENT, perf)
> +#endif
> diff --git a/include/linux/trace_events.h b/include/linux/trace_events.h
> index ad413b382a3c..8ac92560d3a3 100644
> --- a/include/linux/trace_events.h
> +++ b/include/linux/trace_events.h
> @@ -803,6 +803,9 @@ extern void ftrace_profile_free_filter(struct perf_event *event);
>   void perf_trace_buf_update(void *record, u16 type);
>   void *perf_trace_buf_alloc(int size, struct pt_regs **regs, int *rctxp);
>   
> +int perf_event_set_bpf_prog(struct perf_event *event, struct bpf_prog *prog);
> +void perf_event_free_bpf_prog(struct perf_event *event);
> +
>   void bpf_trace_run1(struct bpf_prog *prog, u64 arg1);
>   void bpf_trace_run2(struct bpf_prog *prog, u64 arg1, u64 arg2);
>   void bpf_trace_run3(struct bpf_prog *prog, u64 arg1, u64 arg2,
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 2db6925e04f4..00b1267ab4f0 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -993,6 +993,7 @@ enum bpf_attach_type {
>   	BPF_SK_SKB_VERDICT,
>   	BPF_SK_REUSEPORT_SELECT,
>   	BPF_SK_REUSEPORT_SELECT_OR_MIGRATE,
> +	BPF_PERF_EVENT,
>   	__MAX_BPF_ATTACH_TYPE
>   };
>   
> @@ -1006,6 +1007,7 @@ enum bpf_link_type {
>   	BPF_LINK_TYPE_ITER = 4,
>   	BPF_LINK_TYPE_NETNS = 5,
>   	BPF_LINK_TYPE_XDP = 6,
> +	BPF_LINK_TYPE_PERF_EVENT = 6,

As Jiri has pointed out, BPF_LINK_TYPE_PERF_EVENT = 7.

>   
>   	MAX_BPF_LINK_TYPE,
>   };
> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> index 9a2068e39d23..80c03bedd6e6 100644
> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c
> @@ -2906,6 +2906,79 @@ static const struct bpf_link_ops bpf_raw_tp_link_lops = {
>   	.fill_link_info = bpf_raw_tp_link_fill_link_info,
>   };
>   
> +#ifdef CONFIG_PERF_EVENTS
> +struct bpf_perf_link {
> +	struct bpf_link link;
> +	struct file *perf_file;
> +};
> +
> +static void bpf_perf_link_release(struct bpf_link *link)
> +{
> +	struct bpf_perf_link *perf_link = container_of(link, struct bpf_perf_link, link);
> +	struct perf_event *event = perf_link->perf_file->private_data;
> +
> +	perf_event_free_bpf_prog(event);
> +	fput(perf_link->perf_file);
> +}
> +
> +static void bpf_perf_link_dealloc(struct bpf_link *link)
> +{
> +	struct bpf_perf_link *perf_link = container_of(link, struct bpf_perf_link, link);
> +
> +	kfree(perf_link);
> +}
> +
> +static const struct bpf_link_ops bpf_perf_link_lops = {
> +	.release = bpf_perf_link_release,
> +	.dealloc = bpf_perf_link_dealloc,
> +};
> +
> +static int bpf_perf_link_attach(const union bpf_attr *attr, struct bpf_prog *prog)
> +{
> +	struct bpf_link_primer link_primer;
> +	struct bpf_perf_link *link;
> +	struct perf_event *event;
> +	struct file *perf_file;
> +	int err;
> +
> +	if (attr->link_create.flags)
> +		return -EINVAL;
> +
> +	perf_file = perf_event_get(attr->link_create.target_fd);
> +	if (IS_ERR(perf_file))
> +		return PTR_ERR(perf_file);
> +
> +	link = kzalloc(sizeof(*link), GFP_USER);

add __GFP_NOWARN flag?

> +	if (!link) {
> +		err = -ENOMEM;
> +		goto out_put_file;
> +	}
> +	bpf_link_init(&link->link, BPF_LINK_TYPE_PERF_EVENT, &bpf_perf_link_lops, prog);
> +	link->perf_file = perf_file;
> +
> +	err = bpf_link_prime(&link->link, &link_primer);
> +	if (err) {
> +		kfree(link);
> +		goto out_put_file;
> +	}
> +
> +	event = perf_file->private_data;
> +	err = perf_event_set_bpf_prog(event, prog);
> +	if (err) {
> +		bpf_link_cleanup(&link_primer);

Do you need kfree(link) here?

> +		goto out_put_file;
> +	}
> +	/* perf_event_set_bpf_prog() doesn't take its own refcnt on prog */
> +	bpf_prog_inc(prog);
> +
> +	return bpf_link_settle(&link_primer);
> +
> +out_put_file:
> +	fput(perf_file);
> +	return err;
> +}
> +#endif /* CONFIG_PERF_EVENTS */
> +
>   #define BPF_RAW_TRACEPOINT_OPEN_LAST_FIELD raw_tracepoint.prog_fd
>   
[...]
Andrii Nakryiko July 30, 2021, 4:16 a.m. UTC | #7
On Thu, Jul 29, 2021 at 10:36 AM Yonghong Song <yhs@fb.com> wrote:
>
>
>
> On 7/26/21 9:12 AM, Andrii Nakryiko wrote:
> > Introduce a new type of BPF link - BPF perf link. This brings perf_event-based
> > BPF program attachments (perf_event, tracepoints, kprobes, and uprobes) into
> > the common BPF link infrastructure, allowing to list all active perf_event
> > based attachments, auto-detaching BPF program from perf_event when link's FD
> > is closed, get generic BPF link fdinfo/get_info functionality.
> >
> > BPF_LINK_CREATE command expects perf_event's FD as target_fd. No extra flags
> > are currently supported.
> >
> > Force-detaching and atomic BPF program updates are not yet implemented, but
> > with perf_event-based BPF links we now have common framework for this without
> > the need to extend ioctl()-based perf_event interface.
> >
> > One interesting consideration is a new value for bpf_attach_type, which
> > BPF_LINK_CREATE command expects. Generally, it's either 1-to-1 mapping from
> > bpf_attach_type to bpf_prog_type, or many-to-1 mapping from a subset of
> > bpf_attach_types to one bpf_prog_type (e.g., see BPF_PROG_TYPE_SK_SKB or
> > BPF_PROG_TYPE_CGROUP_SOCK). In this case, though, we have three different
> > program types (KPROBE, TRACEPOINT, PERF_EVENT) using the same perf_event-based
> > mechanism, so it's many bpf_prog_types to one bpf_attach_type. I chose to
> > define a single BPF_PERF_EVENT attach type for all of them and adjust
> > link_create()'s logic for checking correspondence between attach type and
> > program type.
> >
> > The alternative would be to define three new attach types (e.g., BPF_KPROBE,
> > BPF_TRACEPOINT, and BPF_PERF_EVENT), but that seemed like unnecessary overkill
> > and BPF_KPROBE will cause naming conflicts with BPF_KPROBE() macro, defined by
> > libbpf. I chose to not do this to avoid unnecessary proliferation of
> > bpf_attach_type enum values and not have to deal with naming conflicts.
> >
> > Cc: Peter Zijlstra <peterz@infradead.org>
> > Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
> > ---
> >   include/linux/bpf_types.h      |   3 +
> >   include/linux/trace_events.h   |   3 +
> >   include/uapi/linux/bpf.h       |   2 +
> >   kernel/bpf/syscall.c           | 105 ++++++++++++++++++++++++++++++---
> >   kernel/events/core.c           |  10 ++--
> >   tools/include/uapi/linux/bpf.h |   2 +
> >   6 files changed, 112 insertions(+), 13 deletions(-)
> >
> > diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
> > index a9db1eae6796..0a1ada7f174d 100644
> > --- a/include/linux/bpf_types.h
> > +++ b/include/linux/bpf_types.h
> > @@ -135,3 +135,6 @@ BPF_LINK_TYPE(BPF_LINK_TYPE_ITER, iter)
> >   #ifdef CONFIG_NET
> >   BPF_LINK_TYPE(BPF_LINK_TYPE_NETNS, netns)
> >   #endif
> > +#ifdef CONFIG_PERF_EVENTS
> > +BPF_LINK_TYPE(BPF_LINK_TYPE_PERF_EVENT, perf)
> > +#endif
> > diff --git a/include/linux/trace_events.h b/include/linux/trace_events.h
> > index ad413b382a3c..8ac92560d3a3 100644
> > --- a/include/linux/trace_events.h
> > +++ b/include/linux/trace_events.h
> > @@ -803,6 +803,9 @@ extern void ftrace_profile_free_filter(struct perf_event *event);
> >   void perf_trace_buf_update(void *record, u16 type);
> >   void *perf_trace_buf_alloc(int size, struct pt_regs **regs, int *rctxp);
> >
> > +int perf_event_set_bpf_prog(struct perf_event *event, struct bpf_prog *prog);
> > +void perf_event_free_bpf_prog(struct perf_event *event);
> > +
> >   void bpf_trace_run1(struct bpf_prog *prog, u64 arg1);
> >   void bpf_trace_run2(struct bpf_prog *prog, u64 arg1, u64 arg2);
> >   void bpf_trace_run3(struct bpf_prog *prog, u64 arg1, u64 arg2,
> > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> > index 2db6925e04f4..00b1267ab4f0 100644
> > --- a/include/uapi/linux/bpf.h
> > +++ b/include/uapi/linux/bpf.h
> > @@ -993,6 +993,7 @@ enum bpf_attach_type {
> >       BPF_SK_SKB_VERDICT,
> >       BPF_SK_REUSEPORT_SELECT,
> >       BPF_SK_REUSEPORT_SELECT_OR_MIGRATE,
> > +     BPF_PERF_EVENT,
> >       __MAX_BPF_ATTACH_TYPE
> >   };
> >
> > @@ -1006,6 +1007,7 @@ enum bpf_link_type {
> >       BPF_LINK_TYPE_ITER = 4,
> >       BPF_LINK_TYPE_NETNS = 5,
> >       BPF_LINK_TYPE_XDP = 6,
> > +     BPF_LINK_TYPE_PERF_EVENT = 6,
>
> As Jiri has pointed out, BPF_LINK_TYPE_PERF_EVENT = 7.

yep, fixed

>
> >
> >       MAX_BPF_LINK_TYPE,
> >   };
> > diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> > index 9a2068e39d23..80c03bedd6e6 100644
> > --- a/kernel/bpf/syscall.c
> > +++ b/kernel/bpf/syscall.c
> > @@ -2906,6 +2906,79 @@ static const struct bpf_link_ops bpf_raw_tp_link_lops = {
> >       .fill_link_info = bpf_raw_tp_link_fill_link_info,
> >   };
> >
> > +#ifdef CONFIG_PERF_EVENTS
> > +struct bpf_perf_link {
> > +     struct bpf_link link;
> > +     struct file *perf_file;
> > +};
> > +
> > +static void bpf_perf_link_release(struct bpf_link *link)
> > +{
> > +     struct bpf_perf_link *perf_link = container_of(link, struct bpf_perf_link, link);
> > +     struct perf_event *event = perf_link->perf_file->private_data;
> > +
> > +     perf_event_free_bpf_prog(event);
> > +     fput(perf_link->perf_file);
> > +}
> > +
> > +static void bpf_perf_link_dealloc(struct bpf_link *link)
> > +{
> > +     struct bpf_perf_link *perf_link = container_of(link, struct bpf_perf_link, link);
> > +
> > +     kfree(perf_link);
> > +}
> > +
> > +static const struct bpf_link_ops bpf_perf_link_lops = {
> > +     .release = bpf_perf_link_release,
> > +     .dealloc = bpf_perf_link_dealloc,
> > +};
> > +
> > +static int bpf_perf_link_attach(const union bpf_attr *attr, struct bpf_prog *prog)
> > +{
> > +     struct bpf_link_primer link_primer;
> > +     struct bpf_perf_link *link;
> > +     struct perf_event *event;
> > +     struct file *perf_file;
> > +     int err;
> > +
> > +     if (attr->link_create.flags)
> > +             return -EINVAL;
> > +
> > +     perf_file = perf_event_get(attr->link_create.target_fd);
> > +     if (IS_ERR(perf_file))
> > +             return PTR_ERR(perf_file);
> > +
> > +     link = kzalloc(sizeof(*link), GFP_USER);
>
> add __GFP_NOWARN flag?

I looked at a few other bpf_link alloc places in this file; they don't
use the NOWARN flag. I think the idea with the NOWARN flag is to avoid
memory alloc warnings when the amount of allocated memory depends on a
user-specified parameter (like the size of the map value). In this
case it's just a single fixed-size kernel object, so while users can
create lots of them, each is fixed in size. It's similar to any other
kernel object (e.g., struct file). So I think it's good as is.
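
A rough illustration of that convention (an editorial sketch, not code from
this patch), contrasting a fixed-size object with a user-sized allocation:

	/* fixed-size kernel object: an allocation-failure warning is fine */
	link = kzalloc(sizeof(*link), GFP_USER);

	/* size driven by user input: suppress the allocation-failure splat */
	value = kvmalloc(map->value_size, GFP_USER | __GFP_NOWARN);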

>
> > +     if (!link) {
> > +             err = -ENOMEM;
> > +             goto out_put_file;
> > +     }
> > +     bpf_link_init(&link->link, BPF_LINK_TYPE_PERF_EVENT, &bpf_perf_link_lops, prog);
> > +     link->perf_file = perf_file;
> > +
> > +     err = bpf_link_prime(&link->link, &link_primer);
> > +     if (err) {
> > +             kfree(link);
> > +             goto out_put_file;
> > +     }
> > +
> > +     event = perf_file->private_data;
> > +     err = perf_event_set_bpf_prog(event, prog);
> > +     if (err) {
> > +             bpf_link_cleanup(&link_primer);
>
> Do you need kfree(link) here?

bpf_link_cleanup() will call kfree() in a deferred fashion. This is due
to bpf_link_prime() allocating an anon_inode file internally, so it needs
to be freed carefully, and that's what bpf_link_cleanup() is for.
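
A rough editorial outline of the lifecycle being described (simplified; see
kernel/bpf/syscall.c for the real implementation):

	/*
	 * bpf_link_prime()   - allocates an anon_inode file for the link and
	 *                      reserves an FD for it
	 * bpf_link_cleanup() - used when attaching fails after priming: it
	 *                      detaches the link from the file and fput()s it;
	 *                      the final fput() runs the file's ->release(),
	 *                      which drops the last link reference and ends up
	 *                      in link->ops->dealloc(), i.e. kfree(link) here,
	 *                      so no explicit kfree() is needed on this path
	 */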

>
> > +             goto out_put_file;
> > +     }
> > +     /* perf_event_set_bpf_prog() doesn't take its own refcnt on prog */
> > +     bpf_prog_inc(prog);
> > +
> > +     return bpf_link_settle(&link_primer);
> > +
> > +out_put_file:
> > +     fput(perf_file);
> > +     return err;
> > +}
> > +#endif /* CONFIG_PERF_EVENTS */
> > +
> >   #define BPF_RAW_TRACEPOINT_OPEN_LAST_FIELD raw_tracepoint.prog_fd
> >
> [...]
Andrii Nakryiko July 30, 2021, 4:23 a.m. UTC | #8
On Tue, Jul 27, 2021 at 2:04 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Mon, Jul 26, 2021 at 09:12:01AM -0700, Andrii Nakryiko wrote:
> > Introduce a new type of BPF link - BPF perf link. This brings perf_event-based
> > BPF program attachments (perf_event, tracepoints, kprobes, and uprobes) into
> > the common BPF link infrastructure, allowing to list all active perf_event
> > based attachments, auto-detaching BPF program from perf_event when link's FD
> > is closed, get generic BPF link fdinfo/get_info functionality.
> >
> > BPF_LINK_CREATE command expects perf_event's FD as target_fd. No extra flags
> > are currently supported.
> >
> > Force-detaching and atomic BPF program updates are not yet implemented, but
> > with perf_event-based BPF links we now have common framework for this without
> > the need to extend ioctl()-based perf_event interface.
> >
> > One interesting consideration is a new value for bpf_attach_type, which
> > BPF_LINK_CREATE command expects. Generally, it's either 1-to-1 mapping from
> > bpf_attach_type to bpf_prog_type, or many-to-1 mapping from a subset of
> > bpf_attach_types to one bpf_prog_type (e.g., see BPF_PROG_TYPE_SK_SKB or
> > BPF_PROG_TYPE_CGROUP_SOCK). In this case, though, we have three different
> > program types (KPROBE, TRACEPOINT, PERF_EVENT) using the same perf_event-based
> > mechanism, so it's many bpf_prog_types to one bpf_attach_type. I chose to
> > define a single BPF_PERF_EVENT attach type for all of them and adjust
> > link_create()'s logic for checking correspondence between attach type and
> > program type.
> >
> > The alternative would be to define three new attach types (e.g., BPF_KPROBE,
> > BPF_TRACEPOINT, and BPF_PERF_EVENT), but that seemed like unnecessary overkill
> > and BPF_KPROBE will cause naming conflicts with BPF_KPROBE() macro, defined by
> > libbpf. I chose to not do this to avoid unnecessary proliferation of
> > bpf_attach_type enum values and not have to deal with naming conflicts.
> >
>
> So I have no idea what all that means... I don't speak BPF. That said,
> the patch doesn't look terrible.
>
> One little question below, but otherwise:
>
> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
>
> > +static void bpf_perf_link_release(struct bpf_link *link)
> > +{
> > +     struct bpf_perf_link *perf_link = container_of(link, struct bpf_perf_link, link);
> > +     struct perf_event *event = perf_link->perf_file->private_data;
> > +
> > +     perf_event_free_bpf_prog(event);
> > +     fput(perf_link->perf_file);
> > +}
>
> > +static int bpf_perf_link_attach(const union bpf_attr *attr, struct bpf_prog *prog)
> > +{
> > +     struct bpf_link_primer link_primer;
> > +     struct bpf_perf_link *link;
> > +     struct perf_event *event;
> > +     struct file *perf_file;
> > +     int err;
> > +
> > +     if (attr->link_create.flags)
> > +             return -EINVAL;
> > +
> > +     perf_file = perf_event_get(attr->link_create.target_fd);
> > +     if (IS_ERR(perf_file))
> > +             return PTR_ERR(perf_file);
> > +
> > +     link = kzalloc(sizeof(*link), GFP_USER);
> > +     if (!link) {
> > +             err = -ENOMEM;
> > +             goto out_put_file;
> > +     }
> > +     bpf_link_init(&link->link, BPF_LINK_TYPE_PERF_EVENT, &bpf_perf_link_lops, prog);
> > +     link->perf_file = perf_file;
> > +
> > +     err = bpf_link_prime(&link->link, &link_primer);
> > +     if (err) {
> > +             kfree(link);
> > +             goto out_put_file;
> > +     }
> > +
> > +     event = perf_file->private_data;
> > +     err = perf_event_set_bpf_prog(event, prog);
> > +     if (err) {
> > +             bpf_link_cleanup(&link_primer);
> > +             goto out_put_file;
> > +     }
> > +     /* perf_event_set_bpf_prog() doesn't take its own refcnt on prog */
>
> Is that otherwise expected? AFAICT the previous users of that function
> were guaranteed the existance of the BPF program. But afaict there is
> nothing that prevents perf_event_*_bpf_prog() from doing the addition
> refcounting if that is more convenient.

Sorry, I missed this on my last pass. Yes, it's expected. The general
convention we use for BPF when passing a bpf_prog (and bpf_map and other
objects like that) is that the caller already has an incremented
refcnt before calling the callee. If the callee succeeds, that refcnt is
"transferred" into the callee (so the callee doesn't increment it, and the
caller doesn't put it). If the callee errors out, the caller decrements the
refcnt after any necessary clean up, but the callee does nothing. While
asymmetrical, in practice it results in simple and straightforward
error-handling logic.

In this case bpf_perf_link_attach() assumes one refcnt from its
caller, but if everything is OK and perf_event_set_bpf_prog()
succeeds, we need to keep 2 refcnts: one for the bpf_link and one for
perf_event_set_bpf_prog() internally. So we just bump the refcnt one
extra time. I intentionally removed bpf_prog_put() from
perf_event_set_bpf_prog() in the previous patch to make error handling
uniform with the rest of the code and simpler overall.
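
A simplified editorial sketch of that convention, loosely following what
link_create() does around bpf_perf_link_attach() (not verbatim kernel code):

	int fd;

	prog = bpf_prog_get(attr->link_create.prog_fd);	/* caller takes one ref */
	if (IS_ERR(prog))
		return PTR_ERR(prog);

	fd = bpf_perf_link_attach(attr, prog);	/* returns link FD or -errno */
	if (fd < 0)
		bpf_prog_put(prog);	/* callee failed and touched nothing */
	/*
	 * On success, the caller's ref was transferred into the link; the
	 * extra bpf_prog_inc() inside bpf_perf_link_attach() covers the ref
	 * that perf_event_set_bpf_prog() now relies on.
	 */
	return fd;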

>
> > +     bpf_prog_inc(prog);
> > +
> > +     return bpf_link_settle(&link_primer);
> > +
> > +out_put_file:
> > +     fput(perf_file);
> > +     return err;
> > +}
Yonghong Song July 30, 2021, 5:42 a.m. UTC | #9
On 7/29/21 9:16 PM, Andrii Nakryiko wrote:
> On Thu, Jul 29, 2021 at 10:36 AM Yonghong Song <yhs@fb.com> wrote:
>>
>>
>>
>> On 7/26/21 9:12 AM, Andrii Nakryiko wrote:
>>> Introduce a new type of BPF link - BPF perf link. This brings perf_event-based
>>> BPF program attachments (perf_event, tracepoints, kprobes, and uprobes) into
>>> the common BPF link infrastructure, allowing to list all active perf_event
>>> based attachments, auto-detaching BPF program from perf_event when link's FD
>>> is closed, get generic BPF link fdinfo/get_info functionality.
>>>
>>> BPF_LINK_CREATE command expects perf_event's FD as target_fd. No extra flags
>>> are currently supported.
>>>
>>> Force-detaching and atomic BPF program updates are not yet implemented, but
>>> with perf_event-based BPF links we now have common framework for this without
>>> the need to extend ioctl()-based perf_event interface.
>>>
>>> One interesting consideration is a new value for bpf_attach_type, which
>>> BPF_LINK_CREATE command expects. Generally, it's either 1-to-1 mapping from
>>> bpf_attach_type to bpf_prog_type, or many-to-1 mapping from a subset of
>>> bpf_attach_types to one bpf_prog_type (e.g., see BPF_PROG_TYPE_SK_SKB or
>>> BPF_PROG_TYPE_CGROUP_SOCK). In this case, though, we have three different
>>> program types (KPROBE, TRACEPOINT, PERF_EVENT) using the same perf_event-based
>>> mechanism, so it's many bpf_prog_types to one bpf_attach_type. I chose to
>>> define a single BPF_PERF_EVENT attach type for all of them and adjust
>>> link_create()'s logic for checking correspondence between attach type and
>>> program type.
>>>
>>> The alternative would be to define three new attach types (e.g., BPF_KPROBE,
>>> BPF_TRACEPOINT, and BPF_PERF_EVENT), but that seemed like unnecessary overkill
>>> and BPF_KPROBE will cause naming conflicts with BPF_KPROBE() macro, defined by
>>> libbpf. I chose to not do this to avoid unnecessary proliferation of
>>> bpf_attach_type enum values and not have to deal with naming conflicts.
>>>
>>> Cc: Peter Zijlstra <peterz@infradead.org>
>>> Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
>>> ---
>>>    include/linux/bpf_types.h      |   3 +
>>>    include/linux/trace_events.h   |   3 +
>>>    include/uapi/linux/bpf.h       |   2 +
>>>    kernel/bpf/syscall.c           | 105 ++++++++++++++++++++++++++++++---
>>>    kernel/events/core.c           |  10 ++--
>>>    tools/include/uapi/linux/bpf.h |   2 +
>>>    6 files changed, 112 insertions(+), 13 deletions(-)
>>>
>>> diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
>>> index a9db1eae6796..0a1ada7f174d 100644
>>> --- a/include/linux/bpf_types.h
>>> +++ b/include/linux/bpf_types.h
>>> @@ -135,3 +135,6 @@ BPF_LINK_TYPE(BPF_LINK_TYPE_ITER, iter)
>>>    #ifdef CONFIG_NET
>>>    BPF_LINK_TYPE(BPF_LINK_TYPE_NETNS, netns)
>>>    #endif
>>> +#ifdef CONFIG_PERF_EVENTS
>>> +BPF_LINK_TYPE(BPF_LINK_TYPE_PERF_EVENT, perf)
>>> +#endif
>>> diff --git a/include/linux/trace_events.h b/include/linux/trace_events.h
>>> index ad413b382a3c..8ac92560d3a3 100644
>>> --- a/include/linux/trace_events.h
>>> +++ b/include/linux/trace_events.h
>>> @@ -803,6 +803,9 @@ extern void ftrace_profile_free_filter(struct perf_event *event);
>>>    void perf_trace_buf_update(void *record, u16 type);
>>>    void *perf_trace_buf_alloc(int size, struct pt_regs **regs, int *rctxp);
>>>
>>> +int perf_event_set_bpf_prog(struct perf_event *event, struct bpf_prog *prog);
>>> +void perf_event_free_bpf_prog(struct perf_event *event);
>>> +
>>>    void bpf_trace_run1(struct bpf_prog *prog, u64 arg1);
>>>    void bpf_trace_run2(struct bpf_prog *prog, u64 arg1, u64 arg2);
>>>    void bpf_trace_run3(struct bpf_prog *prog, u64 arg1, u64 arg2,
>>> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
>>> index 2db6925e04f4..00b1267ab4f0 100644
>>> --- a/include/uapi/linux/bpf.h
>>> +++ b/include/uapi/linux/bpf.h
>>> @@ -993,6 +993,7 @@ enum bpf_attach_type {
>>>        BPF_SK_SKB_VERDICT,
>>>        BPF_SK_REUSEPORT_SELECT,
>>>        BPF_SK_REUSEPORT_SELECT_OR_MIGRATE,
>>> +     BPF_PERF_EVENT,
>>>        __MAX_BPF_ATTACH_TYPE
>>>    };
>>>
>>> @@ -1006,6 +1007,7 @@ enum bpf_link_type {
>>>        BPF_LINK_TYPE_ITER = 4,
>>>        BPF_LINK_TYPE_NETNS = 5,
>>>        BPF_LINK_TYPE_XDP = 6,
>>> +     BPF_LINK_TYPE_PERF_EVENT = 6,
>>
>> As Jiri has pointed out, BPF_LINK_TYPE_PERF_EVENT = 7.
> 
> yep, fixed
> 
>>
>>>
>>>        MAX_BPF_LINK_TYPE,
>>>    };
>>> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
>>> index 9a2068e39d23..80c03bedd6e6 100644
>>> --- a/kernel/bpf/syscall.c
>>> +++ b/kernel/bpf/syscall.c
>>> @@ -2906,6 +2906,79 @@ static const struct bpf_link_ops bpf_raw_tp_link_lops = {
>>>        .fill_link_info = bpf_raw_tp_link_fill_link_info,
>>>    };
>>>
>>> +#ifdef CONFIG_PERF_EVENTS
>>> +struct bpf_perf_link {
>>> +     struct bpf_link link;
>>> +     struct file *perf_file;
>>> +};
>>> +
>>> +static void bpf_perf_link_release(struct bpf_link *link)
>>> +{
>>> +     struct bpf_perf_link *perf_link = container_of(link, struct bpf_perf_link, link);
>>> +     struct perf_event *event = perf_link->perf_file->private_data;
>>> +
>>> +     perf_event_free_bpf_prog(event);
>>> +     fput(perf_link->perf_file);
>>> +}
>>> +
>>> +static void bpf_perf_link_dealloc(struct bpf_link *link)
>>> +{
>>> +     struct bpf_perf_link *perf_link = container_of(link, struct bpf_perf_link, link);
>>> +
>>> +     kfree(perf_link);
>>> +}
>>> +
>>> +static const struct bpf_link_ops bpf_perf_link_lops = {
>>> +     .release = bpf_perf_link_release,
>>> +     .dealloc = bpf_perf_link_dealloc,
>>> +};
>>> +
>>> +static int bpf_perf_link_attach(const union bpf_attr *attr, struct bpf_prog *prog)
>>> +{
>>> +     struct bpf_link_primer link_primer;
>>> +     struct bpf_perf_link *link;
>>> +     struct perf_event *event;
>>> +     struct file *perf_file;
>>> +     int err;
>>> +
>>> +     if (attr->link_create.flags)
>>> +             return -EINVAL;
>>> +
>>> +     perf_file = perf_event_get(attr->link_create.target_fd);
>>> +     if (IS_ERR(perf_file))
>>> +             return PTR_ERR(perf_file);
>>> +
>>> +     link = kzalloc(sizeof(*link), GFP_USER);
>>
>> add __GFP_NOWARN flag?
> 
> I looked at few other bpf_link_alloc places in this file, they don't
> use NOWARN flag. I think the idea with NOWARN flag is to avoid memory
> alloc warnings when amount of allocated memory depends on
> user-specified parameter (like the size of the map value). In this
> case it's just a single fixed-size kernel object, so while users can
> create lots of them, each is fixed in size. It's similar as any other
> kernel object (e.g., struct file). So I think it's good as is.

That is fine. This is really a small struct, so it's unlikely we'd have issues.

> 
>>
>>> +     if (!link) {
>>> +             err = -ENOMEM;
>>> +             goto out_put_file;
>>> +     }
>>> +     bpf_link_init(&link->link, BPF_LINK_TYPE_PERF_EVENT, &bpf_perf_link_lops, prog);
>>> +     link->perf_file = perf_file;
>>> +
>>> +     err = bpf_link_prime(&link->link, &link_primer);
>>> +     if (err) {
>>> +             kfree(link);
>>> +             goto out_put_file;
>>> +     }
>>> +
>>> +     event = perf_file->private_data;
>>> +     err = perf_event_set_bpf_prog(event, prog);
>>> +     if (err) {
>>> +             bpf_link_cleanup(&link_primer);
>>
>> Do you need kfree(link) here?
> 
> bpf_link_cleanup() will call kfree() in deferred fashion. This is due
> to bpf_link_prime() allocating anon_inode file internally, so it needs
> to be freed carefully and that's what bpf_link_cleanup() is for.

Looking at the code again, I was able to figure it out. Indeed,
kfree(link) is called through file->release().

> 
>>
>>> +             goto out_put_file;
>>> +     }
>>> +     /* perf_event_set_bpf_prog() doesn't take its own refcnt on prog */
>>> +     bpf_prog_inc(prog);
>>> +
>>> +     return bpf_link_settle(&link_primer);
>>> +
>>> +out_put_file:
>>> +     fput(perf_file);
>>> +     return err;
>>> +}
>>> +#endif /* CONFIG_PERF_EVENTS */
>>> +
>>>    #define BPF_RAW_TRACEPOINT_OPEN_LAST_FIELD raw_tracepoint.prog_fd
>>>
>> [...]

Patch

diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
index a9db1eae6796..0a1ada7f174d 100644
--- a/include/linux/bpf_types.h
+++ b/include/linux/bpf_types.h
@@ -135,3 +135,6 @@  BPF_LINK_TYPE(BPF_LINK_TYPE_ITER, iter)
 #ifdef CONFIG_NET
 BPF_LINK_TYPE(BPF_LINK_TYPE_NETNS, netns)
 #endif
+#ifdef CONFIG_PERF_EVENTS
+BPF_LINK_TYPE(BPF_LINK_TYPE_PERF_EVENT, perf)
+#endif
diff --git a/include/linux/trace_events.h b/include/linux/trace_events.h
index ad413b382a3c..8ac92560d3a3 100644
--- a/include/linux/trace_events.h
+++ b/include/linux/trace_events.h
@@ -803,6 +803,9 @@  extern void ftrace_profile_free_filter(struct perf_event *event);
 void perf_trace_buf_update(void *record, u16 type);
 void *perf_trace_buf_alloc(int size, struct pt_regs **regs, int *rctxp);
 
+int perf_event_set_bpf_prog(struct perf_event *event, struct bpf_prog *prog);
+void perf_event_free_bpf_prog(struct perf_event *event);
+
 void bpf_trace_run1(struct bpf_prog *prog, u64 arg1);
 void bpf_trace_run2(struct bpf_prog *prog, u64 arg1, u64 arg2);
 void bpf_trace_run3(struct bpf_prog *prog, u64 arg1, u64 arg2,
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 2db6925e04f4..00b1267ab4f0 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -993,6 +993,7 @@  enum bpf_attach_type {
 	BPF_SK_SKB_VERDICT,
 	BPF_SK_REUSEPORT_SELECT,
 	BPF_SK_REUSEPORT_SELECT_OR_MIGRATE,
+	BPF_PERF_EVENT,
 	__MAX_BPF_ATTACH_TYPE
 };
 
@@ -1006,6 +1007,7 @@  enum bpf_link_type {
 	BPF_LINK_TYPE_ITER = 4,
 	BPF_LINK_TYPE_NETNS = 5,
 	BPF_LINK_TYPE_XDP = 6,
+	BPF_LINK_TYPE_PERF_EVENT = 6,
 
 	MAX_BPF_LINK_TYPE,
 };
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 9a2068e39d23..80c03bedd6e6 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -2906,6 +2906,79 @@  static const struct bpf_link_ops bpf_raw_tp_link_lops = {
 	.fill_link_info = bpf_raw_tp_link_fill_link_info,
 };
 
+#ifdef CONFIG_PERF_EVENTS
+struct bpf_perf_link {
+	struct bpf_link link;
+	struct file *perf_file;
+};
+
+static void bpf_perf_link_release(struct bpf_link *link)
+{
+	struct bpf_perf_link *perf_link = container_of(link, struct bpf_perf_link, link);
+	struct perf_event *event = perf_link->perf_file->private_data;
+
+	perf_event_free_bpf_prog(event);
+	fput(perf_link->perf_file);
+}
+
+static void bpf_perf_link_dealloc(struct bpf_link *link)
+{
+	struct bpf_perf_link *perf_link = container_of(link, struct bpf_perf_link, link);
+
+	kfree(perf_link);
+}
+
+static const struct bpf_link_ops bpf_perf_link_lops = {
+	.release = bpf_perf_link_release,
+	.dealloc = bpf_perf_link_dealloc,
+};
+
+static int bpf_perf_link_attach(const union bpf_attr *attr, struct bpf_prog *prog)
+{
+	struct bpf_link_primer link_primer;
+	struct bpf_perf_link *link;
+	struct perf_event *event;
+	struct file *perf_file;
+	int err;
+
+	if (attr->link_create.flags)
+		return -EINVAL;
+
+	perf_file = perf_event_get(attr->link_create.target_fd);
+	if (IS_ERR(perf_file))
+		return PTR_ERR(perf_file);
+
+	link = kzalloc(sizeof(*link), GFP_USER);
+	if (!link) {
+		err = -ENOMEM;
+		goto out_put_file;
+	}
+	bpf_link_init(&link->link, BPF_LINK_TYPE_PERF_EVENT, &bpf_perf_link_lops, prog);
+	link->perf_file = perf_file;
+
+	err = bpf_link_prime(&link->link, &link_primer);
+	if (err) {
+		kfree(link);
+		goto out_put_file;
+	}
+
+	event = perf_file->private_data;
+	err = perf_event_set_bpf_prog(event, prog);
+	if (err) {
+		bpf_link_cleanup(&link_primer);
+		goto out_put_file;
+	}
+	/* perf_event_set_bpf_prog() doesn't take its own refcnt on prog */
+	bpf_prog_inc(prog);
+
+	return bpf_link_settle(&link_primer);
+
+out_put_file:
+	fput(perf_file);
+	return err;
+}
+#endif /* CONFIG_PERF_EVENTS */
+
 #define BPF_RAW_TRACEPOINT_OPEN_LAST_FIELD raw_tracepoint.prog_fd
 
 static int bpf_raw_tracepoint_open(const union bpf_attr *attr)
@@ -4147,15 +4220,26 @@  static int link_create(union bpf_attr *attr, bpfptr_t uattr)
 	if (ret)
 		goto out;
 
-	if (prog->type == BPF_PROG_TYPE_EXT) {
+	switch (prog->type) {
+	case BPF_PROG_TYPE_EXT:
 		ret = tracing_bpf_link_attach(attr, uattr, prog);
 		goto out;
-	}
-
-	ptype = attach_type_to_prog_type(attr->link_create.attach_type);
-	if (ptype == BPF_PROG_TYPE_UNSPEC || ptype != prog->type) {
-		ret = -EINVAL;
-		goto out;
+	case BPF_PROG_TYPE_PERF_EVENT:
+	case BPF_PROG_TYPE_KPROBE:
+	case BPF_PROG_TYPE_TRACEPOINT:
+		if (attr->link_create.attach_type != BPF_PERF_EVENT) {
+			ret = -EINVAL;
+			goto out;
+		}
+		ptype = prog->type;
+		break;
+	default:
+		ptype = attach_type_to_prog_type(attr->link_create.attach_type);
+		if (ptype == BPF_PROG_TYPE_UNSPEC || ptype != prog->type) {
+			ret = -EINVAL;
+			goto out;
+		}
+		break;
 	}
 
 	switch (ptype) {
@@ -4179,6 +4263,13 @@  static int link_create(union bpf_attr *attr, bpfptr_t uattr)
 	case BPF_PROG_TYPE_XDP:
 		ret = bpf_xdp_link_attach(attr, prog);
 		break;
+#endif
+#ifdef CONFIG_PERF_EVENTS
+	case BPF_PROG_TYPE_PERF_EVENT:
+	case BPF_PROG_TYPE_TRACEPOINT:
+	case BPF_PROG_TYPE_KPROBE:
+		ret = bpf_perf_link_attach(attr, prog);
+		break;
 #endif
 	default:
 		ret = -EINVAL;
diff --git a/kernel/events/core.c b/kernel/events/core.c
index bf4689403498..b125943599ce 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -4697,7 +4697,6 @@  find_get_context(struct pmu *pmu, struct task_struct *task,
 }
 
 static void perf_event_free_filter(struct perf_event *event);
-static void perf_event_free_bpf_prog(struct perf_event *event);
 
 static void free_event_rcu(struct rcu_head *head)
 {
@@ -5574,7 +5573,6 @@  static inline int perf_fget_light(int fd, struct fd *p)
 static int perf_event_set_output(struct perf_event *event,
 				 struct perf_event *output_event);
 static int perf_event_set_filter(struct perf_event *event, void __user *arg);
-static int perf_event_set_bpf_prog(struct perf_event *event, struct bpf_prog *prog);
 static int perf_copy_attr(struct perf_event_attr __user *uattr,
 			  struct perf_event_attr *attr);
 
@@ -10013,7 +10011,7 @@  static inline bool perf_event_is_tracing(struct perf_event *event)
 	return false;
 }
 
-static int perf_event_set_bpf_prog(struct perf_event *event, struct bpf_prog *prog)
+int perf_event_set_bpf_prog(struct perf_event *event, struct bpf_prog *prog)
 {
 	bool is_kprobe, is_tracepoint, is_syscall_tp;
 
@@ -10047,7 +10045,7 @@  static int perf_event_set_bpf_prog(struct perf_event *event, struct bpf_prog *pr
 	return perf_event_attach_bpf_prog(event, prog);
 }
 
-static void perf_event_free_bpf_prog(struct perf_event *event)
+void perf_event_free_bpf_prog(struct perf_event *event)
 {
 	if (!perf_event_is_tracing(event)) {
 		perf_event_free_bpf_handler(event);
@@ -10066,12 +10064,12 @@  static void perf_event_free_filter(struct perf_event *event)
 {
 }
 
-static int perf_event_set_bpf_prog(struct perf_event *event, struct bpf_prog *prog)
+int perf_event_set_bpf_prog(struct perf_event *event, struct bpf_prog *prog)
 {
 	return -ENOENT;
 }
 
-static void perf_event_free_bpf_prog(struct perf_event *event)
+void perf_event_free_bpf_prog(struct perf_event *event)
 {
 }
 #endif /* CONFIG_EVENT_TRACING */
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 2db6925e04f4..00b1267ab4f0 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -993,6 +993,7 @@  enum bpf_attach_type {
 	BPF_SK_SKB_VERDICT,
 	BPF_SK_REUSEPORT_SELECT,
 	BPF_SK_REUSEPORT_SELECT_OR_MIGRATE,
+	BPF_PERF_EVENT,
 	__MAX_BPF_ATTACH_TYPE
 };
 
@@ -1006,6 +1007,7 @@  enum bpf_link_type {
 	BPF_LINK_TYPE_ITER = 4,
 	BPF_LINK_TYPE_NETNS = 5,
 	BPF_LINK_TYPE_XDP = 6,
+	BPF_LINK_TYPE_PERF_EVENT = 6,
 
 	MAX_BPF_LINK_TYPE,
 };