
[v2,bpf-next,04/14] bpf: implement minimal BPF perf link

Message ID 20210726161211.925206-5-andrii@kernel.org (mailing list archive)
State Changes Requested
Delegated to: BPF
Series BPF perf link and user-provided context value

Checks

Context Check Description
netdev/cover_letter success Link
netdev/fixes_present success Link
netdev/patch_count success Link
netdev/tree_selection success Clearly marked for bpf-next
netdev/subject_prefix success Link
netdev/cc_maintainers warning 17 maintainers not CCed: jolsa@redhat.com netdev@vger.kernel.org joe@cilium.io haoluo@google.com mark.rutland@arm.com yhs@fb.com kpsingh@kernel.org mingo@redhat.com alexander.shishkin@linux.intel.com kafai@fb.com rostedt@goodmis.org acme@kernel.org john.fastabend@gmail.com namhyung@kernel.org songliubraving@fb.com linux-perf-users@vger.kernel.org quentin@isovalent.com
netdev/source_inline success Was 0 now: 0
netdev/verify_signedoff success Link
netdev/module_param success Was 0 now: 0
netdev/build_32bit success Errors and warnings before: 11868 this patch: 11868
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/verify_fixes success Link
netdev/checkpatch warning WARNING: line length of 82 exceeds 80 columns WARNING: line length of 88 exceeds 80 columns WARNING: line length of 89 exceeds 80 columns
netdev/build_allmodconfig_warn success Errors and warnings before: 11370 this patch: 11370
netdev/header_inline success Link

Commit Message

Andrii Nakryiko July 26, 2021, 4:12 p.m. UTC
Introduce a new type of BPF link - BPF perf link. This brings perf_event-based
BPF program attachments (perf_event, tracepoints, kprobes, and uprobes) into
the common BPF link infrastructure, making it possible to list all active
perf_event-based attachments, auto-detach the BPF program from the perf_event
when the link's FD is closed, and get generic BPF link fdinfo/get_info
functionality.

The BPF_LINK_CREATE command expects a perf_event FD as target_fd. No extra
flags are currently supported.

Force-detaching and atomic BPF program updates are not yet implemented, but
with perf_event-based BPF links we now have a common framework for this
without the need to extend the ioctl()-based perf_event interface.

One interesting consideration is a new value for bpf_attach_type, which the
BPF_LINK_CREATE command expects. Generally, it's either a 1-to-1 mapping from
bpf_attach_type to bpf_prog_type, or a many-to-1 mapping from a subset of
bpf_attach_types to one bpf_prog_type (e.g., see BPF_PROG_TYPE_SK_SKB or
BPF_PROG_TYPE_CGROUP_SOCK). In this case, though, we have three different
program types (KPROBE, TRACEPOINT, PERF_EVENT) using the same perf_event-based
mechanism, so it's many bpf_prog_types to one bpf_attach_type. I chose to
define a single BPF_PERF_EVENT attach type for all of them and adjust
link_create()'s logic for checking the correspondence between attach type and
program type.

The alternative would be to define three new attach types (e.g., BPF_KPROBE,
BPF_TRACEPOINT, and BPF_PERF_EVENT), but that seemed like unnecessary
overkill, and BPF_KPROBE would cause naming conflicts with the BPF_KPROBE()
macro defined by libbpf. I chose not to do this to avoid unnecessary
proliferation of bpf_attach_type enum values and to avoid having to deal with
naming conflicts.

Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
---
 include/linux/bpf_types.h      |   3 +
 include/linux/trace_events.h   |   3 +
 include/uapi/linux/bpf.h       |   2 +
 kernel/bpf/syscall.c           | 105 ++++++++++++++++++++++++++++++---
 kernel/events/core.c           |  10 ++--
 tools/include/uapi/linux/bpf.h |   2 +
 6 files changed, 112 insertions(+), 13 deletions(-)
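
To make the new UAPI concrete, here is a minimal user-space sketch of creating
such a link (an editorial illustration, not part of the patch; it assumes
prog_fd is a loaded BPF_PROG_TYPE_KPROBE program and perf_fd comes from
perf_event_open() for the matching kprobe event):

#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/bpf.h>

static int bpf_perf_link_create(int prog_fd, int perf_fd)
{
	union bpf_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.link_create.prog_fd = prog_fd;
	attr.link_create.target_fd = perf_fd;		/* perf_event FD */
	attr.link_create.attach_type = BPF_PERF_EVENT;	/* new attach type */
	attr.link_create.flags = 0;			/* no flags supported yet */

	/* returns a link FD; closing it auto-detaches the program */
	return syscall(__NR_bpf, BPF_LINK_CREATE, &attr, sizeof(attr));
}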

Comments

Peter Zijlstra July 27, 2021, 9:04 a.m. UTC | #1
On Mon, Jul 26, 2021 at 09:12:01AM -0700, Andrii Nakryiko wrote:
> Introduce a new type of BPF link - BPF perf link. This brings perf_event-based
> BPF program attachments (perf_event, tracepoints, kprobes, and uprobes) into
> the common BPF link infrastructure, allowing to list all active perf_event
> based attachments, auto-detaching BPF program from perf_event when link's FD
> is closed, get generic BPF link fdinfo/get_info functionality.
> 
> BPF_LINK_CREATE command expects perf_event's FD as target_fd. No extra flags
> are currently supported.
> 
> Force-detaching and atomic BPF program updates are not yet implemented, but
> with perf_event-based BPF links we now have common framework for this without
> the need to extend ioctl()-based perf_event interface.
> 
> One interesting consideration is a new value for bpf_attach_type, which
> BPF_LINK_CREATE command expects. Generally, it's either 1-to-1 mapping from
> bpf_attach_type to bpf_prog_type, or many-to-1 mapping from a subset of
> bpf_attach_types to one bpf_prog_type (e.g., see BPF_PROG_TYPE_SK_SKB or
> BPF_PROG_TYPE_CGROUP_SOCK). In this case, though, we have three different
> program types (KPROBE, TRACEPOINT, PERF_EVENT) using the same perf_event-based
> mechanism, so it's many bpf_prog_types to one bpf_attach_type. I chose to
> define a single BPF_PERF_EVENT attach type for all of them and adjust
> link_create()'s logic for checking correspondence between attach type and
> program type.
> 
> The alternative would be to define three new attach types (e.g., BPF_KPROBE,
> BPF_TRACEPOINT, and BPF_PERF_EVENT), but that seemed like unnecessary overkill
> and BPF_KPROBE will cause naming conflicts with BPF_KPROBE() macro, defined by
> libbpf. I chose to not do this to avoid unnecessary proliferation of
> bpf_attach_type enum values and not have to deal with naming conflicts.
> 

So I have no idea what all that means... I don't speak BPF. That said,
the patch doesn't look terrible.

One little question below, but otherwise:

Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>

> +static void bpf_perf_link_release(struct bpf_link *link)
> +{
> +	struct bpf_perf_link *perf_link = container_of(link, struct bpf_perf_link, link);
> +	struct perf_event *event = perf_link->perf_file->private_data;
> +
> +	perf_event_free_bpf_prog(event);
> +	fput(perf_link->perf_file);
> +}

> +static int bpf_perf_link_attach(const union bpf_attr *attr, struct bpf_prog *prog)
> +{
> +	struct bpf_link_primer link_primer;
> +	struct bpf_perf_link *link;
> +	struct perf_event *event;
> +	struct file *perf_file;
> +	int err;
> +
> +	if (attr->link_create.flags)
> +		return -EINVAL;
> +
> +	perf_file = perf_event_get(attr->link_create.target_fd);
> +	if (IS_ERR(perf_file))
> +		return PTR_ERR(perf_file);
> +
> +	link = kzalloc(sizeof(*link), GFP_USER);
> +	if (!link) {
> +		err = -ENOMEM;
> +		goto out_put_file;
> +	}
> +	bpf_link_init(&link->link, BPF_LINK_TYPE_PERF_EVENT, &bpf_perf_link_lops, prog);
> +	link->perf_file = perf_file;
> +
> +	err = bpf_link_prime(&link->link, &link_primer);
> +	if (err) {
> +		kfree(link);
> +		goto out_put_file;
> +	}
> +
> +	event = perf_file->private_data;
> +	err = perf_event_set_bpf_prog(event, prog);
> +	if (err) {
> +		bpf_link_cleanup(&link_primer);
> +		goto out_put_file;
> +	}
> +	/* perf_event_set_bpf_prog() doesn't take its own refcnt on prog */

Is that otherwise expected? AFAICT the previous users of that function
were guaranteed the existence of the BPF program. But AFAICT there is
nothing that prevents perf_event_*_bpf_prog() from doing the additional
refcounting if that is more convenient.

> +	bpf_prog_inc(prog);
> +
> +	return bpf_link_settle(&link_primer);
> +
> +out_put_file:
> +	fput(perf_file);
> +	return err;
> +}
Peter Zijlstra July 27, 2021, 9:12 a.m. UTC | #2
On Mon, Jul 26, 2021 at 09:12:01AM -0700, Andrii Nakryiko wrote:
> diff --git a/include/linux/trace_events.h b/include/linux/trace_events.h
> index ad413b382a3c..8ac92560d3a3 100644
> --- a/include/linux/trace_events.h
> +++ b/include/linux/trace_events.h
> @@ -803,6 +803,9 @@ extern void ftrace_profile_free_filter(struct perf_event *event);
>  void perf_trace_buf_update(void *record, u16 type);
>  void *perf_trace_buf_alloc(int size, struct pt_regs **regs, int *rctxp);
>  
> +int perf_event_set_bpf_prog(struct perf_event *event, struct bpf_prog *prog);
> +void perf_event_free_bpf_prog(struct perf_event *event);
> +
>  void bpf_trace_run1(struct bpf_prog *prog, u64 arg1);
>  void bpf_trace_run2(struct bpf_prog *prog, u64 arg1, u64 arg2);
>  void bpf_trace_run3(struct bpf_prog *prog, u64 arg1, u64 arg2,

Oh, I just noticed, is this the right header to put these in? Should
this not go into include/linux/perf_event.h ?
Jiri Olsa July 27, 2021, 3:40 p.m. UTC | #3
On Mon, Jul 26, 2021 at 09:12:01AM -0700, Andrii Nakryiko wrote:
> Introduce a new type of BPF link - BPF perf link. This brings perf_event-based
> BPF program attachments (perf_event, tracepoints, kprobes, and uprobes) into
> the common BPF link infrastructure, allowing to list all active perf_event
> based attachments, auto-detaching BPF program from perf_event when link's FD
> is closed, get generic BPF link fdinfo/get_info functionality.
> 
> BPF_LINK_CREATE command expects perf_event's FD as target_fd. No extra flags
> are currently supported.
> 
> Force-detaching and atomic BPF program updates are not yet implemented, but
> with perf_event-based BPF links we now have common framework for this without
> the need to extend ioctl()-based perf_event interface.
> 
> One interesting consideration is a new value for bpf_attach_type, which
> BPF_LINK_CREATE command expects. Generally, it's either 1-to-1 mapping from
> bpf_attach_type to bpf_prog_type, or many-to-1 mapping from a subset of
> bpf_attach_types to one bpf_prog_type (e.g., see BPF_PROG_TYPE_SK_SKB or
> BPF_PROG_TYPE_CGROUP_SOCK). In this case, though, we have three different
> program types (KPROBE, TRACEPOINT, PERF_EVENT) using the same perf_event-based
> mechanism, so it's many bpf_prog_types to one bpf_attach_type. I chose to
> define a single BPF_PERF_EVENT attach type for all of them and adjust
> link_create()'s logic for checking correspondence between attach type and
> program type.
> 
> The alternative would be to define three new attach types (e.g., BPF_KPROBE,
> BPF_TRACEPOINT, and BPF_PERF_EVENT), but that seemed like unnecessary overkill
> and BPF_KPROBE will cause naming conflicts with BPF_KPROBE() macro, defined by
> libbpf. I chose to not do this to avoid unnecessary proliferation of
> bpf_attach_type enum values and not have to deal with naming conflicts.
> 
> Cc: Peter Zijlstra <peterz@infradead.org>
> Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
> ---
>  include/linux/bpf_types.h      |   3 +
>  include/linux/trace_events.h   |   3 +
>  include/uapi/linux/bpf.h       |   2 +
>  kernel/bpf/syscall.c           | 105 ++++++++++++++++++++++++++++++---
>  kernel/events/core.c           |  10 ++--
>  tools/include/uapi/linux/bpf.h |   2 +
>  6 files changed, 112 insertions(+), 13 deletions(-)
> 
> diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
> index a9db1eae6796..0a1ada7f174d 100644
> --- a/include/linux/bpf_types.h
> +++ b/include/linux/bpf_types.h
> @@ -135,3 +135,6 @@ BPF_LINK_TYPE(BPF_LINK_TYPE_ITER, iter)
>  #ifdef CONFIG_NET
>  BPF_LINK_TYPE(BPF_LINK_TYPE_NETNS, netns)
>  #endif
> +#ifdef CONFIG_PERF_EVENTS
> +BPF_LINK_TYPE(BPF_LINK_TYPE_PERF_EVENT, perf)
> +#endif
> diff --git a/include/linux/trace_events.h b/include/linux/trace_events.h
> index ad413b382a3c..8ac92560d3a3 100644
> --- a/include/linux/trace_events.h
> +++ b/include/linux/trace_events.h
> @@ -803,6 +803,9 @@ extern void ftrace_profile_free_filter(struct perf_event *event);
>  void perf_trace_buf_update(void *record, u16 type);
>  void *perf_trace_buf_alloc(int size, struct pt_regs **regs, int *rctxp);
>  
> +int perf_event_set_bpf_prog(struct perf_event *event, struct bpf_prog *prog);
> +void perf_event_free_bpf_prog(struct perf_event *event);
> +
>  void bpf_trace_run1(struct bpf_prog *prog, u64 arg1);
>  void bpf_trace_run2(struct bpf_prog *prog, u64 arg1, u64 arg2);
>  void bpf_trace_run3(struct bpf_prog *prog, u64 arg1, u64 arg2,
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 2db6925e04f4..00b1267ab4f0 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -993,6 +993,7 @@ enum bpf_attach_type {
>  	BPF_SK_SKB_VERDICT,
>  	BPF_SK_REUSEPORT_SELECT,
>  	BPF_SK_REUSEPORT_SELECT_OR_MIGRATE,
> +	BPF_PERF_EVENT,
>  	__MAX_BPF_ATTACH_TYPE
>  };
>  
> @@ -1006,6 +1007,7 @@ enum bpf_link_type {
>  	BPF_LINK_TYPE_ITER = 4,
>  	BPF_LINK_TYPE_NETNS = 5,
>  	BPF_LINK_TYPE_XDP = 6,
> +	BPF_LINK_TYPE_PERF_EVENT = 6,

hi, should be 7

jirka
Andrii Nakryiko July 27, 2021, 8:56 p.m. UTC | #4
On Tue, Jul 27, 2021 at 2:15 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Mon, Jul 26, 2021 at 09:12:01AM -0700, Andrii Nakryiko wrote:
> > diff --git a/include/linux/trace_events.h b/include/linux/trace_events.h
> > index ad413b382a3c..8ac92560d3a3 100644
> > --- a/include/linux/trace_events.h
> > +++ b/include/linux/trace_events.h
> > @@ -803,6 +803,9 @@ extern void ftrace_profile_free_filter(struct perf_event *event);
> >  void perf_trace_buf_update(void *record, u16 type);
> >  void *perf_trace_buf_alloc(int size, struct pt_regs **regs, int *rctxp);
> >
> > +int perf_event_set_bpf_prog(struct perf_event *event, struct bpf_prog *prog);
> > +void perf_event_free_bpf_prog(struct perf_event *event);
> > +
> >  void bpf_trace_run1(struct bpf_prog *prog, u64 arg1);
> >  void bpf_trace_run2(struct bpf_prog *prog, u64 arg1, u64 arg2);
> >  void bpf_trace_run3(struct bpf_prog *prog, u64 arg1, u64 arg2,
>
> Oh, I just noticed, is this the right header to put these in? Should
> this not go into include/linux/perf_event.h ?

Not that I care much, but this one has perf_event_attach_bpf_prog()
and perf_event_detach_bpf_prog(), so it felt appropriate to put it
here. perf_event.h only seems to have perf_event_bpf_event() for
BPF-related stuff (which seems to be just notification events, not
really BPF functionality per se). But let me know if you prefer to add
these new ones to perf_event.h.
Andrii Nakryiko July 27, 2021, 8:56 p.m. UTC | #5
On Tue, Jul 27, 2021 at 8:40 AM Jiri Olsa <jolsa@redhat.com> wrote:
>
> On Mon, Jul 26, 2021 at 09:12:01AM -0700, Andrii Nakryiko wrote:
> > Introduce a new type of BPF link - BPF perf link. This brings perf_event-based
> > BPF program attachments (perf_event, tracepoints, kprobes, and uprobes) into
> > the common BPF link infrastructure, allowing to list all active perf_event
> > based attachments, auto-detaching BPF program from perf_event when link's FD
> > is closed, get generic BPF link fdinfo/get_info functionality.
> >
> > BPF_LINK_CREATE command expects perf_event's FD as target_fd. No extra flags
> > are currently supported.
> >
> > Force-detaching and atomic BPF program updates are not yet implemented, but
> > with perf_event-based BPF links we now have common framework for this without
> > the need to extend ioctl()-based perf_event interface.
> >
> > One interesting consideration is a new value for bpf_attach_type, which
> > BPF_LINK_CREATE command expects. Generally, it's either 1-to-1 mapping from
> > bpf_attach_type to bpf_prog_type, or many-to-1 mapping from a subset of
> > bpf_attach_types to one bpf_prog_type (e.g., see BPF_PROG_TYPE_SK_SKB or
> > BPF_PROG_TYPE_CGROUP_SOCK). In this case, though, we have three different
> > program types (KPROBE, TRACEPOINT, PERF_EVENT) using the same perf_event-based
> > mechanism, so it's many bpf_prog_types to one bpf_attach_type. I chose to
> > define a single BPF_PERF_EVENT attach type for all of them and adjust
> > link_create()'s logic for checking correspondence between attach type and
> > program type.
> >
> > The alternative would be to define three new attach types (e.g., BPF_KPROBE,
> > BPF_TRACEPOINT, and BPF_PERF_EVENT), but that seemed like unnecessary overkill
> > and BPF_KPROBE will cause naming conflicts with BPF_KPROBE() macro, defined by
> > libbpf. I chose to not do this to avoid unnecessary proliferation of
> > bpf_attach_type enum values and not have to deal with naming conflicts.
> >
> > Cc: Peter Zijlstra <peterz@infradead.org>
> > Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
> > ---
> >  include/linux/bpf_types.h      |   3 +
> >  include/linux/trace_events.h   |   3 +
> >  include/uapi/linux/bpf.h       |   2 +
> >  kernel/bpf/syscall.c           | 105 ++++++++++++++++++++++++++++++---
> >  kernel/events/core.c           |  10 ++--
> >  tools/include/uapi/linux/bpf.h |   2 +
> >  6 files changed, 112 insertions(+), 13 deletions(-)
> >
> > diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
> > index a9db1eae6796..0a1ada7f174d 100644
> > --- a/include/linux/bpf_types.h
> > +++ b/include/linux/bpf_types.h
> > @@ -135,3 +135,6 @@ BPF_LINK_TYPE(BPF_LINK_TYPE_ITER, iter)
> >  #ifdef CONFIG_NET
> >  BPF_LINK_TYPE(BPF_LINK_TYPE_NETNS, netns)
> >  #endif
> > +#ifdef CONFIG_PERF_EVENTS
> > +BPF_LINK_TYPE(BPF_LINK_TYPE_PERF_EVENT, perf)
> > +#endif
> > diff --git a/include/linux/trace_events.h b/include/linux/trace_events.h
> > index ad413b382a3c..8ac92560d3a3 100644
> > --- a/include/linux/trace_events.h
> > +++ b/include/linux/trace_events.h
> > @@ -803,6 +803,9 @@ extern void ftrace_profile_free_filter(struct perf_event *event);
> >  void perf_trace_buf_update(void *record, u16 type);
> >  void *perf_trace_buf_alloc(int size, struct pt_regs **regs, int *rctxp);
> >
> > +int perf_event_set_bpf_prog(struct perf_event *event, struct bpf_prog *prog);
> > +void perf_event_free_bpf_prog(struct perf_event *event);
> > +
> >  void bpf_trace_run1(struct bpf_prog *prog, u64 arg1);
> >  void bpf_trace_run2(struct bpf_prog *prog, u64 arg1, u64 arg2);
> >  void bpf_trace_run3(struct bpf_prog *prog, u64 arg1, u64 arg2,
> > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> > index 2db6925e04f4..00b1267ab4f0 100644
> > --- a/include/uapi/linux/bpf.h
> > +++ b/include/uapi/linux/bpf.h
> > @@ -993,6 +993,7 @@ enum bpf_attach_type {
> >       BPF_SK_SKB_VERDICT,
> >       BPF_SK_REUSEPORT_SELECT,
> >       BPF_SK_REUSEPORT_SELECT_OR_MIGRATE,
> > +     BPF_PERF_EVENT,
> >       __MAX_BPF_ATTACH_TYPE
> >  };
> >
> > @@ -1006,6 +1007,7 @@ enum bpf_link_type {
> >       BPF_LINK_TYPE_ITER = 4,
> >       BPF_LINK_TYPE_NETNS = 5,
> >       BPF_LINK_TYPE_XDP = 6,
> > +     BPF_LINK_TYPE_PERF_EVENT = 6,
>
> hi, should be 7
>

doh! Eagle eyes! Will fix, thanks :)

> jirka
>
Yonghong Song July 29, 2021, 5:35 p.m. UTC | #6
On 7/26/21 9:12 AM, Andrii Nakryiko wrote:
> Introduce a new type of BPF link - BPF perf link. This brings perf_event-based
> BPF program attachments (perf_event, tracepoints, kprobes, and uprobes) into
> the common BPF link infrastructure, allowing to list all active perf_event
> based attachments, auto-detaching BPF program from perf_event when link's FD
> is closed, get generic BPF link fdinfo/get_info functionality.
> 
> BPF_LINK_CREATE command expects perf_event's FD as target_fd. No extra flags
> are currently supported.
> 
> Force-detaching and atomic BPF program updates are not yet implemented, but
> with perf_event-based BPF links we now have common framework for this without
> the need to extend ioctl()-based perf_event interface.
> 
> One interesting consideration is a new value for bpf_attach_type, which
> BPF_LINK_CREATE command expects. Generally, it's either 1-to-1 mapping from
> bpf_attach_type to bpf_prog_type, or many-to-1 mapping from a subset of
> bpf_attach_types to one bpf_prog_type (e.g., see BPF_PROG_TYPE_SK_SKB or
> BPF_PROG_TYPE_CGROUP_SOCK). In this case, though, we have three different
> program types (KPROBE, TRACEPOINT, PERF_EVENT) using the same perf_event-based
> mechanism, so it's many bpf_prog_types to one bpf_attach_type. I chose to
> define a single BPF_PERF_EVENT attach type for all of them and adjust
> link_create()'s logic for checking correspondence between attach type and
> program type.
> 
> The alternative would be to define three new attach types (e.g., BPF_KPROBE,
> BPF_TRACEPOINT, and BPF_PERF_EVENT), but that seemed like unnecessary overkill
> and BPF_KPROBE will cause naming conflicts with BPF_KPROBE() macro, defined by
> libbpf. I chose to not do this to avoid unnecessary proliferation of
> bpf_attach_type enum values and not have to deal with naming conflicts.
> 
> Cc: Peter Zijlstra <peterz@infradead.org>
> Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
> ---
>   include/linux/bpf_types.h      |   3 +
>   include/linux/trace_events.h   |   3 +
>   include/uapi/linux/bpf.h       |   2 +
>   kernel/bpf/syscall.c           | 105 ++++++++++++++++++++++++++++++---
>   kernel/events/core.c           |  10 ++--
>   tools/include/uapi/linux/bpf.h |   2 +
>   6 files changed, 112 insertions(+), 13 deletions(-)
> 
> diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
> index a9db1eae6796..0a1ada7f174d 100644
> --- a/include/linux/bpf_types.h
> +++ b/include/linux/bpf_types.h
> @@ -135,3 +135,6 @@ BPF_LINK_TYPE(BPF_LINK_TYPE_ITER, iter)
>   #ifdef CONFIG_NET
>   BPF_LINK_TYPE(BPF_LINK_TYPE_NETNS, netns)
>   #endif
> +#ifdef CONFIG_PERF_EVENTS
> +BPF_LINK_TYPE(BPF_LINK_TYPE_PERF_EVENT, perf)
> +#endif
> diff --git a/include/linux/trace_events.h b/include/linux/trace_events.h
> index ad413b382a3c..8ac92560d3a3 100644
> --- a/include/linux/trace_events.h
> +++ b/include/linux/trace_events.h
> @@ -803,6 +803,9 @@ extern void ftrace_profile_free_filter(struct perf_event *event);
>   void perf_trace_buf_update(void *record, u16 type);
>   void *perf_trace_buf_alloc(int size, struct pt_regs **regs, int *rctxp);
>   
> +int perf_event_set_bpf_prog(struct perf_event *event, struct bpf_prog *prog);
> +void perf_event_free_bpf_prog(struct perf_event *event);
> +
>   void bpf_trace_run1(struct bpf_prog *prog, u64 arg1);
>   void bpf_trace_run2(struct bpf_prog *prog, u64 arg1, u64 arg2);
>   void bpf_trace_run3(struct bpf_prog *prog, u64 arg1, u64 arg2,
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 2db6925e04f4..00b1267ab4f0 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -993,6 +993,7 @@ enum bpf_attach_type {
>   	BPF_SK_SKB_VERDICT,
>   	BPF_SK_REUSEPORT_SELECT,
>   	BPF_SK_REUSEPORT_SELECT_OR_MIGRATE,
> +	BPF_PERF_EVENT,
>   	__MAX_BPF_ATTACH_TYPE
>   };
>   
> @@ -1006,6 +1007,7 @@ enum bpf_link_type {
>   	BPF_LINK_TYPE_ITER = 4,
>   	BPF_LINK_TYPE_NETNS = 5,
>   	BPF_LINK_TYPE_XDP = 6,
> +	BPF_LINK_TYPE_PERF_EVENT = 6,

As Jiri has pointed out, BPF_LINK_TYPE_PERF_EVENT = 7.

>   
>   	MAX_BPF_LINK_TYPE,
>   };
> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> index 9a2068e39d23..80c03bedd6e6 100644
> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c
> @@ -2906,6 +2906,79 @@ static const struct bpf_link_ops bpf_raw_tp_link_lops = {
>   	.fill_link_info = bpf_raw_tp_link_fill_link_info,
>   };
>   
> +#ifdef CONFIG_PERF_EVENTS
> +struct bpf_perf_link {
> +	struct bpf_link link;
> +	struct file *perf_file;
> +};
> +
> +static void bpf_perf_link_release(struct bpf_link *link)
> +{
> +	struct bpf_perf_link *perf_link = container_of(link, struct bpf_perf_link, link);
> +	struct perf_event *event = perf_link->perf_file->private_data;
> +
> +	perf_event_free_bpf_prog(event);
> +	fput(perf_link->perf_file);
> +}
> +
> +static void bpf_perf_link_dealloc(struct bpf_link *link)
> +{
> +	struct bpf_perf_link *perf_link = container_of(link, struct bpf_perf_link, link);
> +
> +	kfree(perf_link);
> +}
> +
> +static const struct bpf_link_ops bpf_perf_link_lops = {
> +	.release = bpf_perf_link_release,
> +	.dealloc = bpf_perf_link_dealloc,
> +};
> +
> +static int bpf_perf_link_attach(const union bpf_attr *attr, struct bpf_prog *prog)
> +{
> +	struct bpf_link_primer link_primer;
> +	struct bpf_perf_link *link;
> +	struct perf_event *event;
> +	struct file *perf_file;
> +	int err;
> +
> +	if (attr->link_create.flags)
> +		return -EINVAL;
> +
> +	perf_file = perf_event_get(attr->link_create.target_fd);
> +	if (IS_ERR(perf_file))
> +		return PTR_ERR(perf_file);
> +
> +	link = kzalloc(sizeof(*link), GFP_USER);

add __GFP_NOWARN flag?

> +	if (!link) {
> +		err = -ENOMEM;
> +		goto out_put_file;
> +	}
> +	bpf_link_init(&link->link, BPF_LINK_TYPE_PERF_EVENT, &bpf_perf_link_lops, prog);
> +	link->perf_file = perf_file;
> +
> +	err = bpf_link_prime(&link->link, &link_primer);
> +	if (err) {
> +		kfree(link);
> +		goto out_put_file;
> +	}
> +
> +	event = perf_file->private_data;
> +	err = perf_event_set_bpf_prog(event, prog);
> +	if (err) {
> +		bpf_link_cleanup(&link_primer);

Do you need kfree(link) here?

> +		goto out_put_file;
> +	}
> +	/* perf_event_set_bpf_prog() doesn't take its own refcnt on prog */
> +	bpf_prog_inc(prog);
> +
> +	return bpf_link_settle(&link_primer);
> +
> +out_put_file:
> +	fput(perf_file);
> +	return err;
> +}
> +#endif /* CONFIG_PERF_EVENTS */
> +
>   #define BPF_RAW_TRACEPOINT_OPEN_LAST_FIELD raw_tracepoint.prog_fd
>   
[...]
Andrii Nakryiko July 30, 2021, 4:16 a.m. UTC | #7
On Thu, Jul 29, 2021 at 10:36 AM Yonghong Song <yhs@fb.com> wrote:
>
>
>
> On 7/26/21 9:12 AM, Andrii Nakryiko wrote:
> > Introduce a new type of BPF link - BPF perf link. This brings perf_event-based
> > BPF program attachments (perf_event, tracepoints, kprobes, and uprobes) into
> > the common BPF link infrastructure, allowing to list all active perf_event
> > based attachments, auto-detaching BPF program from perf_event when link's FD
> > is closed, get generic BPF link fdinfo/get_info functionality.
> >
> > BPF_LINK_CREATE command expects perf_event's FD as target_fd. No extra flags
> > are currently supported.
> >
> > Force-detaching and atomic BPF program updates are not yet implemented, but
> > with perf_event-based BPF links we now have common framework for this without
> > the need to extend ioctl()-based perf_event interface.
> >
> > One interesting consideration is a new value for bpf_attach_type, which
> > BPF_LINK_CREATE command expects. Generally, it's either 1-to-1 mapping from
> > bpf_attach_type to bpf_prog_type, or many-to-1 mapping from a subset of
> > bpf_attach_types to one bpf_prog_type (e.g., see BPF_PROG_TYPE_SK_SKB or
> > BPF_PROG_TYPE_CGROUP_SOCK). In this case, though, we have three different
> > program types (KPROBE, TRACEPOINT, PERF_EVENT) using the same perf_event-based
> > mechanism, so it's many bpf_prog_types to one bpf_attach_type. I chose to
> > define a single BPF_PERF_EVENT attach type for all of them and adjust
> > link_create()'s logic for checking correspondence between attach type and
> > program type.
> >
> > The alternative would be to define three new attach types (e.g., BPF_KPROBE,
> > BPF_TRACEPOINT, and BPF_PERF_EVENT), but that seemed like unnecessary overkill
> > and BPF_KPROBE will cause naming conflicts with BPF_KPROBE() macro, defined by
> > libbpf. I chose to not do this to avoid unnecessary proliferation of
> > bpf_attach_type enum values and not have to deal with naming conflicts.
> >
> > Cc: Peter Zijlstra <peterz@infradead.org>
> > Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
> > ---
> >   include/linux/bpf_types.h      |   3 +
> >   include/linux/trace_events.h   |   3 +
> >   include/uapi/linux/bpf.h       |   2 +
> >   kernel/bpf/syscall.c           | 105 ++++++++++++++++++++++++++++++---
> >   kernel/events/core.c           |  10 ++--
> >   tools/include/uapi/linux/bpf.h |   2 +
> >   6 files changed, 112 insertions(+), 13 deletions(-)
> >
> > diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
> > index a9db1eae6796..0a1ada7f174d 100644
> > --- a/include/linux/bpf_types.h
> > +++ b/include/linux/bpf_types.h
> > @@ -135,3 +135,6 @@ BPF_LINK_TYPE(BPF_LINK_TYPE_ITER, iter)
> >   #ifdef CONFIG_NET
> >   BPF_LINK_TYPE(BPF_LINK_TYPE_NETNS, netns)
> >   #endif
> > +#ifdef CONFIG_PERF_EVENTS
> > +BPF_LINK_TYPE(BPF_LINK_TYPE_PERF_EVENT, perf)
> > +#endif
> > diff --git a/include/linux/trace_events.h b/include/linux/trace_events.h
> > index ad413b382a3c..8ac92560d3a3 100644
> > --- a/include/linux/trace_events.h
> > +++ b/include/linux/trace_events.h
> > @@ -803,6 +803,9 @@ extern void ftrace_profile_free_filter(struct perf_event *event);
> >   void perf_trace_buf_update(void *record, u16 type);
> >   void *perf_trace_buf_alloc(int size, struct pt_regs **regs, int *rctxp);
> >
> > +int perf_event_set_bpf_prog(struct perf_event *event, struct bpf_prog *prog);
> > +void perf_event_free_bpf_prog(struct perf_event *event);
> > +
> >   void bpf_trace_run1(struct bpf_prog *prog, u64 arg1);
> >   void bpf_trace_run2(struct bpf_prog *prog, u64 arg1, u64 arg2);
> >   void bpf_trace_run3(struct bpf_prog *prog, u64 arg1, u64 arg2,
> > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> > index 2db6925e04f4..00b1267ab4f0 100644
> > --- a/include/uapi/linux/bpf.h
> > +++ b/include/uapi/linux/bpf.h
> > @@ -993,6 +993,7 @@ enum bpf_attach_type {
> >       BPF_SK_SKB_VERDICT,
> >       BPF_SK_REUSEPORT_SELECT,
> >       BPF_SK_REUSEPORT_SELECT_OR_MIGRATE,
> > +     BPF_PERF_EVENT,
> >       __MAX_BPF_ATTACH_TYPE
> >   };
> >
> > @@ -1006,6 +1007,7 @@ enum bpf_link_type {
> >       BPF_LINK_TYPE_ITER = 4,
> >       BPF_LINK_TYPE_NETNS = 5,
> >       BPF_LINK_TYPE_XDP = 6,
> > +     BPF_LINK_TYPE_PERF_EVENT = 6,
>
> As Jiri has pointed out, BPF_LINK_TYPE_PERF_EVENT = 7.

yep, fixed

>
> >
> >       MAX_BPF_LINK_TYPE,
> >   };
> > diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> > index 9a2068e39d23..80c03bedd6e6 100644
> > --- a/kernel/bpf/syscall.c
> > +++ b/kernel/bpf/syscall.c
> > @@ -2906,6 +2906,79 @@ static const struct bpf_link_ops bpf_raw_tp_link_lops = {
> >       .fill_link_info = bpf_raw_tp_link_fill_link_info,
> >   };
> >
> > +#ifdef CONFIG_PERF_EVENTS
> > +struct bpf_perf_link {
> > +     struct bpf_link link;
> > +     struct file *perf_file;
> > +};
> > +
> > +static void bpf_perf_link_release(struct bpf_link *link)
> > +{
> > +     struct bpf_perf_link *perf_link = container_of(link, struct bpf_perf_link, link);
> > +     struct perf_event *event = perf_link->perf_file->private_data;
> > +
> > +     perf_event_free_bpf_prog(event);
> > +     fput(perf_link->perf_file);
> > +}
> > +
> > +static void bpf_perf_link_dealloc(struct bpf_link *link)
> > +{
> > +     struct bpf_perf_link *perf_link = container_of(link, struct bpf_perf_link, link);
> > +
> > +     kfree(perf_link);
> > +}
> > +
> > +static const struct bpf_link_ops bpf_perf_link_lops = {
> > +     .release = bpf_perf_link_release,
> > +     .dealloc = bpf_perf_link_dealloc,
> > +};
> > +
> > +static int bpf_perf_link_attach(const union bpf_attr *attr, struct bpf_prog *prog)
> > +{
> > +     struct bpf_link_primer link_primer;
> > +     struct bpf_perf_link *link;
> > +     struct perf_event *event;
> > +     struct file *perf_file;
> > +     int err;
> > +
> > +     if (attr->link_create.flags)
> > +             return -EINVAL;
> > +
> > +     perf_file = perf_event_get(attr->link_create.target_fd);
> > +     if (IS_ERR(perf_file))
> > +             return PTR_ERR(perf_file);
> > +
> > +     link = kzalloc(sizeof(*link), GFP_USER);
>
> add __GFP_NOWARN flag?

I looked at a few other bpf_link alloc places in this file; they don't
use the NOWARN flag. I think the idea with the NOWARN flag is to avoid
memory alloc warnings when the amount of allocated memory depends on a
user-specified parameter (like the size of the map value). In this
case it's just a single fixed-size kernel object, so while users can
create lots of them, each is fixed in size. It's similar to any other
kernel object (e.g., struct file). So I think it's good as is.
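
A rough illustration of that convention (an editorial sketch, not code from
this patch), contrasting a fixed-size object with a user-sized allocation:

	/* fixed-size kernel object: an allocation-failure warning is fine */
	link = kzalloc(sizeof(*link), GFP_USER);

	/* size driven by user input: suppress the allocation-failure splat */
	value = kvmalloc(map->value_size, GFP_USER | __GFP_NOWARN);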

>
> > +     if (!link) {
> > +             err = -ENOMEM;
> > +             goto out_put_file;
> > +     }
> > +     bpf_link_init(&link->link, BPF_LINK_TYPE_PERF_EVENT, &bpf_perf_link_lops, prog);
> > +     link->perf_file = perf_file;
> > +
> > +     err = bpf_link_prime(&link->link, &link_primer);
> > +     if (err) {
> > +             kfree(link);
> > +             goto out_put_file;
> > +     }
> > +
> > +     event = perf_file->private_data;
> > +     err = perf_event_set_bpf_prog(event, prog);
> > +     if (err) {
> > +             bpf_link_cleanup(&link_primer);
>
> Do you need kfree(link) here?

bpf_link_cleanup() will call kfree() in a deferred fashion. This is due
to bpf_link_prime() allocating an anon_inode file internally, so it needs
to be freed carefully, and that's what bpf_link_cleanup() is for.
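
A rough editorial outline of the lifecycle being described (simplified; see
kernel/bpf/syscall.c for the real implementation):

	/*
	 * bpf_link_prime()   - allocates an anon_inode file for the link and
	 *                      reserves an FD for it
	 * bpf_link_cleanup() - used when attaching fails after priming: it
	 *                      detaches the link from the file and fput()s it;
	 *                      the final fput() runs the file's ->release(),
	 *                      which drops the last link reference and ends up
	 *                      in link->ops->dealloc(), i.e. kfree(link) here,
	 *                      so no explicit kfree() is needed on this path
	 */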

>
> > +             goto out_put_file;
> > +     }
> > +     /* perf_event_set_bpf_prog() doesn't take its own refcnt on prog */
> > +     bpf_prog_inc(prog);
> > +
> > +     return bpf_link_settle(&link_primer);
> > +
> > +out_put_file:
> > +     fput(perf_file);
> > +     return err;
> > +}
> > +#endif /* CONFIG_PERF_EVENTS */
> > +
> >   #define BPF_RAW_TRACEPOINT_OPEN_LAST_FIELD raw_tracepoint.prog_fd
> >
> [...]
Andrii Nakryiko July 30, 2021, 4:23 a.m. UTC | #8
On Tue, Jul 27, 2021 at 2:04 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Mon, Jul 26, 2021 at 09:12:01AM -0700, Andrii Nakryiko wrote:
> > Introduce a new type of BPF link - BPF perf link. This brings perf_event-based
> > BPF program attachments (perf_event, tracepoints, kprobes, and uprobes) into
> > the common BPF link infrastructure, allowing to list all active perf_event
> > based attachments, auto-detaching BPF program from perf_event when link's FD
> > is closed, get generic BPF link fdinfo/get_info functionality.
> >
> > BPF_LINK_CREATE command expects perf_event's FD as target_fd. No extra flags
> > are currently supported.
> >
> > Force-detaching and atomic BPF program updates are not yet implemented, but
> > with perf_event-based BPF links we now have common framework for this without
> > the need to extend ioctl()-based perf_event interface.
> >
> > One interesting consideration is a new value for bpf_attach_type, which
> > BPF_LINK_CREATE command expects. Generally, it's either 1-to-1 mapping from
> > bpf_attach_type to bpf_prog_type, or many-to-1 mapping from a subset of
> > bpf_attach_types to one bpf_prog_type (e.g., see BPF_PROG_TYPE_SK_SKB or
> > BPF_PROG_TYPE_CGROUP_SOCK). In this case, though, we have three different
> > program types (KPROBE, TRACEPOINT, PERF_EVENT) using the same perf_event-based
> > mechanism, so it's many bpf_prog_types to one bpf_attach_type. I chose to
> > define a single BPF_PERF_EVENT attach type for all of them and adjust
> > link_create()'s logic for checking correspondence between attach type and
> > program type.
> >
> > The alternative would be to define three new attach types (e.g., BPF_KPROBE,
> > BPF_TRACEPOINT, and BPF_PERF_EVENT), but that seemed like unnecessary overkill
> > and BPF_KPROBE will cause naming conflicts with BPF_KPROBE() macro, defined by
> > libbpf. I chose to not do this to avoid unnecessary proliferation of
> > bpf_attach_type enum values and not have to deal with naming conflicts.
> >
>
> So I have no idea what all that means... I don't speak BPF. That said,
> the patch doesn't look terrible.
>
> One little question below, but otherwise:
>
> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
>
> > +static void bpf_perf_link_release(struct bpf_link *link)
> > +{
> > +     struct bpf_perf_link *perf_link = container_of(link, struct bpf_perf_link, link);
> > +     struct perf_event *event = perf_link->perf_file->private_data;
> > +
> > +     perf_event_free_bpf_prog(event);
> > +     fput(perf_link->perf_file);
> > +}
>
> > +static int bpf_perf_link_attach(const union bpf_attr *attr, struct bpf_prog *prog)
> > +{
> > +     struct bpf_link_primer link_primer;
> > +     struct bpf_perf_link *link;
> > +     struct perf_event *event;
> > +     struct file *perf_file;
> > +     int err;
> > +
> > +     if (attr->link_create.flags)
> > +             return -EINVAL;
> > +
> > +     perf_file = perf_event_get(attr->link_create.target_fd);
> > +     if (IS_ERR(perf_file))
> > +             return PTR_ERR(perf_file);
> > +
> > +     link = kzalloc(sizeof(*link), GFP_USER);
> > +     if (!link) {
> > +             err = -ENOMEM;
> > +             goto out_put_file;
> > +     }
> > +     bpf_link_init(&link->link, BPF_LINK_TYPE_PERF_EVENT, &bpf_perf_link_lops, prog);
> > +     link->perf_file = perf_file;
> > +
> > +     err = bpf_link_prime(&link->link, &link_primer);
> > +     if (err) {
> > +             kfree(link);
> > +             goto out_put_file;
> > +     }
> > +
> > +     event = perf_file->private_data;
> > +     err = perf_event_set_bpf_prog(event, prog);
> > +     if (err) {
> > +             bpf_link_cleanup(&link_primer);
> > +             goto out_put_file;
> > +     }
> > +     /* perf_event_set_bpf_prog() doesn't take its own refcnt on prog */
>
> Is that otherwise expected? AFAICT the previous users of that function
> were guaranteed the existance of the BPF program. But afaict there is
> nothing that prevents perf_event_*_bpf_prog() from doing the addition
> refcounting if that is more convenient.

Sorry, I missed this on my last pass. Yes, it's expected. The general
convention we use for BPF when passing a bpf_prog (and bpf_map and other
objects like that) is that the caller already has an incremented
refcnt before calling the callee. If the callee succeeds, that refcnt is
"transferred" into the callee (so the callee doesn't increment it, and the
caller doesn't put it). If the callee errors out, the caller decrements the
refcnt after any necessary clean up, but the callee does nothing. While
asymmetrical, in practice it results in simple and straightforward
error-handling logic.

In this case bpf_perf_link_attach() assumes one refcnt from its
caller, but if everything is OK and perf_event_set_bpf_prog()
succeeds, we need to keep 2 refcnts: one for the bpf_link and one for
perf_event_set_bpf_prog() internally. So we just bump the refcnt one
extra time. I intentionally removed bpf_prog_put() from
perf_event_set_bpf_prog() in the previous patch to make error handling
uniform with the rest of the code and simpler overall.
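
A simplified editorial sketch of that convention, loosely following what
link_create() does around bpf_perf_link_attach() (not verbatim kernel code):

	int fd;

	prog = bpf_prog_get(attr->link_create.prog_fd);	/* caller takes one ref */
	if (IS_ERR(prog))
		return PTR_ERR(prog);

	fd = bpf_perf_link_attach(attr, prog);	/* returns link FD or -errno */
	if (fd < 0)
		bpf_prog_put(prog);	/* callee failed and touched nothing */
	/*
	 * On success, the caller's ref was transferred into the link; the
	 * extra bpf_prog_inc() inside bpf_perf_link_attach() covers the ref
	 * that perf_event_set_bpf_prog() now relies on.
	 */
	return fd;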

>
> > +     bpf_prog_inc(prog);
> > +
> > +     return bpf_link_settle(&link_primer);
> > +
> > +out_put_file:
> > +     fput(perf_file);
> > +     return err;
> > +}
Yonghong Song July 30, 2021, 5:42 a.m. UTC | #9
On 7/29/21 9:16 PM, Andrii Nakryiko wrote:
> On Thu, Jul 29, 2021 at 10:36 AM Yonghong Song <yhs@fb.com> wrote:
>>
>>
>>
>> On 7/26/21 9:12 AM, Andrii Nakryiko wrote:
>>> Introduce a new type of BPF link - BPF perf link. This brings perf_event-based
>>> BPF program attachments (perf_event, tracepoints, kprobes, and uprobes) into
>>> the common BPF link infrastructure, allowing to list all active perf_event
>>> based attachments, auto-detaching BPF program from perf_event when link's FD
>>> is closed, get generic BPF link fdinfo/get_info functionality.
>>>
>>> BPF_LINK_CREATE command expects perf_event's FD as target_fd. No extra flags
>>> are currently supported.
>>>
>>> Force-detaching and atomic BPF program updates are not yet implemented, but
>>> with perf_event-based BPF links we now have common framework for this without
>>> the need to extend ioctl()-based perf_event interface.
>>>
>>> One interesting consideration is a new value for bpf_attach_type, which
>>> BPF_LINK_CREATE command expects. Generally, it's either 1-to-1 mapping from
>>> bpf_attach_type to bpf_prog_type, or many-to-1 mapping from a subset of
>>> bpf_attach_types to one bpf_prog_type (e.g., see BPF_PROG_TYPE_SK_SKB or
>>> BPF_PROG_TYPE_CGROUP_SOCK). In this case, though, we have three different
>>> program types (KPROBE, TRACEPOINT, PERF_EVENT) using the same perf_event-based
>>> mechanism, so it's many bpf_prog_types to one bpf_attach_type. I chose to
>>> define a single BPF_PERF_EVENT attach type for all of them and adjust
>>> link_create()'s logic for checking correspondence between attach type and
>>> program type.
>>>
>>> The alternative would be to define three new attach types (e.g., BPF_KPROBE,
>>> BPF_TRACEPOINT, and BPF_PERF_EVENT), but that seemed like unnecessary overkill
>>> and BPF_KPROBE will cause naming conflicts with BPF_KPROBE() macro, defined by
>>> libbpf. I chose to not do this to avoid unnecessary proliferation of
>>> bpf_attach_type enum values and not have to deal with naming conflicts.
>>>
>>> Cc: Peter Zijlstra <peterz@infradead.org>
>>> Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
>>> ---
>>>    include/linux/bpf_types.h      |   3 +
>>>    include/linux/trace_events.h   |   3 +
>>>    include/uapi/linux/bpf.h       |   2 +
>>>    kernel/bpf/syscall.c           | 105 ++++++++++++++++++++++++++++++---
>>>    kernel/events/core.c           |  10 ++--
>>>    tools/include/uapi/linux/bpf.h |   2 +
>>>    6 files changed, 112 insertions(+), 13 deletions(-)
>>>
>>> diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
>>> index a9db1eae6796..0a1ada7f174d 100644
>>> --- a/include/linux/bpf_types.h
>>> +++ b/include/linux/bpf_types.h
>>> @@ -135,3 +135,6 @@ BPF_LINK_TYPE(BPF_LINK_TYPE_ITER, iter)
>>>    #ifdef CONFIG_NET
>>>    BPF_LINK_TYPE(BPF_LINK_TYPE_NETNS, netns)
>>>    #endif
>>> +#ifdef CONFIG_PERF_EVENTS
>>> +BPF_LINK_TYPE(BPF_LINK_TYPE_PERF_EVENT, perf)
>>> +#endif
>>> diff --git a/include/linux/trace_events.h b/include/linux/trace_events.h
>>> index ad413b382a3c..8ac92560d3a3 100644
>>> --- a/include/linux/trace_events.h
>>> +++ b/include/linux/trace_events.h
>>> @@ -803,6 +803,9 @@ extern void ftrace_profile_free_filter(struct perf_event *event);
>>>    void perf_trace_buf_update(void *record, u16 type);
>>>    void *perf_trace_buf_alloc(int size, struct pt_regs **regs, int *rctxp);
>>>
>>> +int perf_event_set_bpf_prog(struct perf_event *event, struct bpf_prog *prog);
>>> +void perf_event_free_bpf_prog(struct perf_event *event);
>>> +
>>>    void bpf_trace_run1(struct bpf_prog *prog, u64 arg1);
>>>    void bpf_trace_run2(struct bpf_prog *prog, u64 arg1, u64 arg2);
>>>    void bpf_trace_run3(struct bpf_prog *prog, u64 arg1, u64 arg2,
>>> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
>>> index 2db6925e04f4..00b1267ab4f0 100644
>>> --- a/include/uapi/linux/bpf.h
>>> +++ b/include/uapi/linux/bpf.h
>>> @@ -993,6 +993,7 @@ enum bpf_attach_type {
>>>        BPF_SK_SKB_VERDICT,
>>>        BPF_SK_REUSEPORT_SELECT,
>>>        BPF_SK_REUSEPORT_SELECT_OR_MIGRATE,
>>> +     BPF_PERF_EVENT,
>>>        __MAX_BPF_ATTACH_TYPE
>>>    };
>>>
>>> @@ -1006,6 +1007,7 @@ enum bpf_link_type {
>>>        BPF_LINK_TYPE_ITER = 4,
>>>        BPF_LINK_TYPE_NETNS = 5,
>>>        BPF_LINK_TYPE_XDP = 6,
>>> +     BPF_LINK_TYPE_PERF_EVENT = 6,
>>
>> As Jiri has pointed out, BPF_LINK_TYPE_PERF_EVENT = 7.
> 
> yep, fixed
> 
>>
>>>
>>>        MAX_BPF_LINK_TYPE,
>>>    };
>>> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
>>> index 9a2068e39d23..80c03bedd6e6 100644
>>> --- a/kernel/bpf/syscall.c
>>> +++ b/kernel/bpf/syscall.c
>>> @@ -2906,6 +2906,79 @@ static const struct bpf_link_ops bpf_raw_tp_link_lops = {
>>>        .fill_link_info = bpf_raw_tp_link_fill_link_info,
>>>    };
>>>
>>> +#ifdef CONFIG_PERF_EVENTS
>>> +struct bpf_perf_link {
>>> +     struct bpf_link link;
>>> +     struct file *perf_file;
>>> +};
>>> +
>>> +static void bpf_perf_link_release(struct bpf_link *link)
>>> +{
>>> +     struct bpf_perf_link *perf_link = container_of(link, struct bpf_perf_link, link);
>>> +     struct perf_event *event = perf_link->perf_file->private_data;
>>> +
>>> +     perf_event_free_bpf_prog(event);
>>> +     fput(perf_link->perf_file);
>>> +}
>>> +
>>> +static void bpf_perf_link_dealloc(struct bpf_link *link)
>>> +{
>>> +     struct bpf_perf_link *perf_link = container_of(link, struct bpf_perf_link, link);
>>> +
>>> +     kfree(perf_link);
>>> +}
>>> +
>>> +static const struct bpf_link_ops bpf_perf_link_lops = {
>>> +     .release = bpf_perf_link_release,
>>> +     .dealloc = bpf_perf_link_dealloc,
>>> +};
>>> +
>>> +static int bpf_perf_link_attach(const union bpf_attr *attr, struct bpf_prog *prog)
>>> +{
>>> +     struct bpf_link_primer link_primer;
>>> +     struct bpf_perf_link *link;
>>> +     struct perf_event *event;
>>> +     struct file *perf_file;
>>> +     int err;
>>> +
>>> +     if (attr->link_create.flags)
>>> +             return -EINVAL;
>>> +
>>> +     perf_file = perf_event_get(attr->link_create.target_fd);
>>> +     if (IS_ERR(perf_file))
>>> +             return PTR_ERR(perf_file);
>>> +
>>> +     link = kzalloc(sizeof(*link), GFP_USER);
>>
>> add __GFP_NOWARN flag?
> 
> I looked at few other bpf_link_alloc places in this file, they don't
> use NOWARN flag. I think the idea with NOWARN flag is to avoid memory
> alloc warnings when amount of allocated memory depends on
> user-specified parameter (like the size of the map value). In this
> case it's just a single fixed-size kernel object, so while users can
> create lots of them, each is fixed in size. It's similar as any other
> kernel object (e.g., struct file). So I think it's good as is.

That is fine. This is really a small struct, so it's unlikely we'd have issues.

> 
>>
>>> +     if (!link) {
>>> +             err = -ENOMEM;
>>> +             goto out_put_file;
>>> +     }
>>> +     bpf_link_init(&link->link, BPF_LINK_TYPE_PERF_EVENT, &bpf_perf_link_lops, prog);
>>> +     link->perf_file = perf_file;
>>> +
>>> +     err = bpf_link_prime(&link->link, &link_primer);
>>> +     if (err) {
>>> +             kfree(link);
>>> +             goto out_put_file;
>>> +     }
>>> +
>>> +     event = perf_file->private_data;
>>> +     err = perf_event_set_bpf_prog(event, prog);
>>> +     if (err) {
>>> +             bpf_link_cleanup(&link_primer);
>>
>> Do you need kfree(link) here?
> 
> bpf_link_cleanup() will call kfree() in deferred fashion. This is due
> to bpf_link_prime() allocating anon_inode file internally, so it needs
> to be freed carefully and that's what bpf_link_cleanup() is for.

Looking at the code again, I was able to figure it out. Indeed,
kfree(link) is called through file->release().

> 
>>
>>> +             goto out_put_file;
>>> +     }
>>> +     /* perf_event_set_bpf_prog() doesn't take its own refcnt on prog */
>>> +     bpf_prog_inc(prog);
>>> +
>>> +     return bpf_link_settle(&link_primer);
>>> +
>>> +out_put_file:
>>> +     fput(perf_file);
>>> +     return err;
>>> +}
>>> +#endif /* CONFIG_PERF_EVENTS */
>>> +
>>>    #define BPF_RAW_TRACEPOINT_OPEN_LAST_FIELD raw_tracepoint.prog_fd
>>>
>> [...]

Patch

diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
index a9db1eae6796..0a1ada7f174d 100644
--- a/include/linux/bpf_types.h
+++ b/include/linux/bpf_types.h
@@ -135,3 +135,6 @@  BPF_LINK_TYPE(BPF_LINK_TYPE_ITER, iter)
 #ifdef CONFIG_NET
 BPF_LINK_TYPE(BPF_LINK_TYPE_NETNS, netns)
 #endif
+#ifdef CONFIG_PERF_EVENTS
+BPF_LINK_TYPE(BPF_LINK_TYPE_PERF_EVENT, perf)
+#endif
diff --git a/include/linux/trace_events.h b/include/linux/trace_events.h
index ad413b382a3c..8ac92560d3a3 100644
--- a/include/linux/trace_events.h
+++ b/include/linux/trace_events.h
@@ -803,6 +803,9 @@  extern void ftrace_profile_free_filter(struct perf_event *event);
 void perf_trace_buf_update(void *record, u16 type);
 void *perf_trace_buf_alloc(int size, struct pt_regs **regs, int *rctxp);
 
+int perf_event_set_bpf_prog(struct perf_event *event, struct bpf_prog *prog);
+void perf_event_free_bpf_prog(struct perf_event *event);
+
 void bpf_trace_run1(struct bpf_prog *prog, u64 arg1);
 void bpf_trace_run2(struct bpf_prog *prog, u64 arg1, u64 arg2);
 void bpf_trace_run3(struct bpf_prog *prog, u64 arg1, u64 arg2,
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 2db6925e04f4..00b1267ab4f0 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -993,6 +993,7 @@  enum bpf_attach_type {
 	BPF_SK_SKB_VERDICT,
 	BPF_SK_REUSEPORT_SELECT,
 	BPF_SK_REUSEPORT_SELECT_OR_MIGRATE,
+	BPF_PERF_EVENT,
 	__MAX_BPF_ATTACH_TYPE
 };
 
@@ -1006,6 +1007,7 @@  enum bpf_link_type {
 	BPF_LINK_TYPE_ITER = 4,
 	BPF_LINK_TYPE_NETNS = 5,
 	BPF_LINK_TYPE_XDP = 6,
+	BPF_LINK_TYPE_PERF_EVENT = 6,
 
 	MAX_BPF_LINK_TYPE,
 };
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 9a2068e39d23..80c03bedd6e6 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -2906,6 +2906,79 @@  static const struct bpf_link_ops bpf_raw_tp_link_lops = {
 	.fill_link_info = bpf_raw_tp_link_fill_link_info,
 };
 
+#ifdef CONFIG_PERF_EVENTS
+struct bpf_perf_link {
+	struct bpf_link link;
+	struct file *perf_file;
+};
+
+static void bpf_perf_link_release(struct bpf_link *link)
+{
+	struct bpf_perf_link *perf_link = container_of(link, struct bpf_perf_link, link);
+	struct perf_event *event = perf_link->perf_file->private_data;
+
+	perf_event_free_bpf_prog(event);
+	fput(perf_link->perf_file);
+}
+
+static void bpf_perf_link_dealloc(struct bpf_link *link)
+{
+	struct bpf_perf_link *perf_link = container_of(link, struct bpf_perf_link, link);
+
+	kfree(perf_link);
+}
+
+static const struct bpf_link_ops bpf_perf_link_lops = {
+	.release = bpf_perf_link_release,
+	.dealloc = bpf_perf_link_dealloc,
+};
+
+static int bpf_perf_link_attach(const union bpf_attr *attr, struct bpf_prog *prog)
+{
+	struct bpf_link_primer link_primer;
+	struct bpf_perf_link *link;
+	struct perf_event *event;
+	struct file *perf_file;
+	int err;
+
+	if (attr->link_create.flags)
+		return -EINVAL;
+
+	perf_file = perf_event_get(attr->link_create.target_fd);
+	if (IS_ERR(perf_file))
+		return PTR_ERR(perf_file);
+
+	link = kzalloc(sizeof(*link), GFP_USER);
+	if (!link) {
+		err = -ENOMEM;
+		goto out_put_file;
+	}
+	bpf_link_init(&link->link, BPF_LINK_TYPE_PERF_EVENT, &bpf_perf_link_lops, prog);
+	link->perf_file = perf_file;
+
+	err = bpf_link_prime(&link->link, &link_primer);
+	if (err) {
+		kfree(link);
+		goto out_put_file;
+	}
+
+	event = perf_file->private_data;
+	err = perf_event_set_bpf_prog(event, prog);
+	if (err) {
+		bpf_link_cleanup(&link_primer);
+		goto out_put_file;
+	}
+	/* perf_event_set_bpf_prog() doesn't take its own refcnt on prog */
+	bpf_prog_inc(prog);
+
+	return bpf_link_settle(&link_primer);
+
+out_put_file:
+	fput(perf_file);
+	return err;
+}
+#endif /* CONFIG_PERF_EVENTS */
+
 #define BPF_RAW_TRACEPOINT_OPEN_LAST_FIELD raw_tracepoint.prog_fd
 
 static int bpf_raw_tracepoint_open(const union bpf_attr *attr)
@@ -4147,15 +4220,26 @@  static int link_create(union bpf_attr *attr, bpfptr_t uattr)
 	if (ret)
 		goto out;
 
-	if (prog->type == BPF_PROG_TYPE_EXT) {
+	switch (prog->type) {
+	case BPF_PROG_TYPE_EXT:
 		ret = tracing_bpf_link_attach(attr, uattr, prog);
 		goto out;
-	}
-
-	ptype = attach_type_to_prog_type(attr->link_create.attach_type);
-	if (ptype == BPF_PROG_TYPE_UNSPEC || ptype != prog->type) {
-		ret = -EINVAL;
-		goto out;
+	case BPF_PROG_TYPE_PERF_EVENT:
+	case BPF_PROG_TYPE_KPROBE:
+	case BPF_PROG_TYPE_TRACEPOINT:
+		if (attr->link_create.attach_type != BPF_PERF_EVENT) {
+			ret = -EINVAL;
+			goto out;
+		}
+		ptype = prog->type;
+		break;
+	default:
+		ptype = attach_type_to_prog_type(attr->link_create.attach_type);
+		if (ptype == BPF_PROG_TYPE_UNSPEC || ptype != prog->type) {
+			ret = -EINVAL;
+			goto out;
+		}
+		break;
 	}
 
 	switch (ptype) {
@@ -4179,6 +4263,13 @@  static int link_create(union bpf_attr *attr, bpfptr_t uattr)
 	case BPF_PROG_TYPE_XDP:
 		ret = bpf_xdp_link_attach(attr, prog);
 		break;
+#endif
+#ifdef CONFIG_PERF_EVENTS
+	case BPF_PROG_TYPE_PERF_EVENT:
+	case BPF_PROG_TYPE_TRACEPOINT:
+	case BPF_PROG_TYPE_KPROBE:
+		ret = bpf_perf_link_attach(attr, prog);
+		break;
 #endif
 	default:
 		ret = -EINVAL;
diff --git a/kernel/events/core.c b/kernel/events/core.c
index bf4689403498..b125943599ce 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -4697,7 +4697,6 @@  find_get_context(struct pmu *pmu, struct task_struct *task,
 }
 
 static void perf_event_free_filter(struct perf_event *event);
-static void perf_event_free_bpf_prog(struct perf_event *event);
 
 static void free_event_rcu(struct rcu_head *head)
 {
@@ -5574,7 +5573,6 @@  static inline int perf_fget_light(int fd, struct fd *p)
 static int perf_event_set_output(struct perf_event *event,
 				 struct perf_event *output_event);
 static int perf_event_set_filter(struct perf_event *event, void __user *arg);
-static int perf_event_set_bpf_prog(struct perf_event *event, struct bpf_prog *prog);
 static int perf_copy_attr(struct perf_event_attr __user *uattr,
 			  struct perf_event_attr *attr);
 
@@ -10013,7 +10011,7 @@  static inline bool perf_event_is_tracing(struct perf_event *event)
 	return false;
 }
 
-static int perf_event_set_bpf_prog(struct perf_event *event, struct bpf_prog *prog)
+int perf_event_set_bpf_prog(struct perf_event *event, struct bpf_prog *prog)
 {
 	bool is_kprobe, is_tracepoint, is_syscall_tp;
 
@@ -10047,7 +10045,7 @@  static int perf_event_set_bpf_prog(struct perf_event *event, struct bpf_prog *pr
 	return perf_event_attach_bpf_prog(event, prog);
 }
 
-static void perf_event_free_bpf_prog(struct perf_event *event)
+void perf_event_free_bpf_prog(struct perf_event *event)
 {
 	if (!perf_event_is_tracing(event)) {
 		perf_event_free_bpf_handler(event);
@@ -10066,12 +10064,12 @@  static void perf_event_free_filter(struct perf_event *event)
 {
 }
 
-static int perf_event_set_bpf_prog(struct perf_event *event, struct bpf_prog *prog)
+int perf_event_set_bpf_prog(struct perf_event *event, struct bpf_prog *prog)
 {
 	return -ENOENT;
 }
 
-static void perf_event_free_bpf_prog(struct perf_event *event)
+void perf_event_free_bpf_prog(struct perf_event *event)
 {
 }
 #endif /* CONFIG_EVENT_TRACING */
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 2db6925e04f4..00b1267ab4f0 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -993,6 +993,7 @@  enum bpf_attach_type {
 	BPF_SK_SKB_VERDICT,
 	BPF_SK_REUSEPORT_SELECT,
 	BPF_SK_REUSEPORT_SELECT_OR_MIGRATE,
+	BPF_PERF_EVENT,
 	__MAX_BPF_ATTACH_TYPE
 };
 
@@ -1006,6 +1007,7 @@  enum bpf_link_type {
 	BPF_LINK_TYPE_ITER = 4,
 	BPF_LINK_TYPE_NETNS = 5,
 	BPF_LINK_TYPE_XDP = 6,
+	BPF_LINK_TYPE_PERF_EVENT = 6,
 
 	MAX_BPF_LINK_TYPE,
 };