diff mbox series

[v2,bpf-next,1/3] bpf: Introduce bpf_timer

Message ID 20210611042442.65444-2-alexei.starovoitov@gmail.com (mailing list archive)
State Superseded
Delegated to: BPF
Headers show
Series bpf: Introduce BPF timers. | expand

Checks

Context Check Description
netdev/cover_letter success Link
netdev/fixes_present success Link
netdev/patch_count success Link
netdev/tree_selection success Clearly marked for bpf-next
netdev/subject_prefix success Link
netdev/cc_maintainers warning 10 maintainers not CCed: joe@cilium.io yhs@fb.com kpsingh@kernel.org mingo@redhat.com kafai@fb.com rostedt@goodmis.org ast@kernel.org john.fastabend@gmail.com songliubraving@fb.com quentin@isovalent.com
netdev/source_inline success Was 0 now: 0
netdev/verify_signedoff success Link
netdev/module_param success Was 0 now: 0
netdev/build_32bit success Errors and warnings before: 10043 this patch: 10043
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/verify_fixes success Link
netdev/checkpatch fail CHECK: Unnecessary parentheses around function pointer (t->callback_fn) CHECK: multiple assignments should be avoided ERROR: code indent should use tabs where possible ERROR: space prohibited before that ':' (ctx:WxV) WARNING: line length of 81 exceeds 80 columns WARNING: line length of 82 exceeds 80 columns WARNING: line length of 83 exceeds 80 columns WARNING: line length of 84 exceeds 80 columns WARNING: line length of 86 exceeds 80 columns WARNING: line length of 89 exceeds 80 columns WARNING: line length of 93 exceeds 80 columns WARNING: please, no spaces at the start of a line
netdev/build_allmodconfig_warn success Errors and warnings before: 10457 this patch: 10457
netdev/header_inline success Link

Commit Message

Alexei Starovoitov June 11, 2021, 4:24 a.m. UTC
From: Alexei Starovoitov <ast@kernel.org>

Introduce 'struct bpf_timer { __u64 :64; __u64 :64; };' that can be embedded
in hash/array/lru maps as a regular field, and helpers to operate on it:

// Initialize the timer to call 'callback_fn' static function
// First 4 bits of 'flags' specify clockid.
// Only CLOCK_MONOTONIC, CLOCK_REALTIME, CLOCK_BOOTTIME are allowed.
long bpf_timer_init(struct bpf_timer *timer, void *callback_fn, int flags);

// Start the timer and set its expiration 'nsec' nanoseconds from the current time.
long bpf_timer_start(struct bpf_timer *timer, u64 nsec);

// Cancel the timer and wait for callback_fn to finish if it was running.
long bpf_timer_cancel(struct bpf_timer *timer);

Here is what a BPF program might look like:
struct map_elem {
    int counter;
    struct bpf_timer timer;
};

struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 1000);
    __type(key, int);
    __type(value, struct map_elem);
} hmap SEC(".maps");

static int timer_cb(void *map, int *key, struct map_elem *val);
/* val points to particular map element that contains bpf_timer. */

SEC("fentry/bpf_fentry_test1")
int BPF_PROG(test1, int a)
{
    struct map_elem *val;
    int key = 0;

    val = bpf_map_lookup_elem(&hmap, &key);
    if (val) {
        bpf_timer_init(&val->timer, timer_cb, CLOCK_REALTIME);
        bpf_timer_start(&val->timer, 1000 /* call timer_cb in 1 usec */);
    }
    return 0;
}
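
For illustration only (a sketch, not part of this patch), the timer_cb body
could bump the per-element counter and re-arm the timer with the helpers above:

static int timer_cb(void *map, int *key, struct map_elem *val)
{
    /* illustrative body: count invocations and re-arm 1 sec from now */
    val->counter++;
    bpf_timer_start(&val->timer, 1000000000 /* 1 sec in nsec */);
    return 0;
}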

This patch adds helper implementations that rely on hrtimers
to call bpf functions as timers expire.
The following patch adds necessary safety checks.

Only programs with CAP_BPF are allowed to use bpf_timer.

The number of timers used by the program is constrained by
the memcg recorded at map creation time.

The bpf_timer_init() helper receives hidden 'map' and 'prog' arguments
supplied by the verifier. The prog pointer is needed to refcount the bpf
program and make sure it doesn't get freed while the timer is armed.

The bpf_map_delete_elem() and bpf_map_update_elem() operations cancel
and free the timer if the given map element had one allocated.
The "bpftool map update" command can be used to cancel timers.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 include/linux/bpf.h            |   2 +
 include/uapi/linux/bpf.h       |  40 ++++++
 kernel/bpf/helpers.c           | 227 +++++++++++++++++++++++++++++++++
 kernel/bpf/verifier.c          | 109 ++++++++++++++++
 kernel/trace/bpf_trace.c       |   2 +-
 scripts/bpf_doc.py             |   2 +
 tools/include/uapi/linux/bpf.h |  40 ++++++
 7 files changed, 421 insertions(+), 1 deletion(-)

Comments

Cong Wang June 11, 2021, 6:42 a.m. UTC | #1
On Thu, Jun 10, 2021 at 9:27 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> From: Alexei Starovoitov <ast@kernel.org>
>
> Introduce 'struct bpf_timer { __u64 :64; __u64 :64; };' that can be embedded
> in hash/array/lru maps as regular field and helpers to operate on it:

Can be or has to be? Huge difference here.

In the other thread, you said it is global data, which implies that it does
not have to be in a map.

In your test case or your example, all timers are still in a map. So what has
changed since then? Looks like nothing to me.

>
> The bpf_timer_init() helper is receiving hidden 'map' and 'prog' arguments
> supplied by the verifier. The prog pointer is needed to do refcnting of bpf
> program to make sure that program doesn't get freed while timer is armed.
>

Nice trick but...

[...]

> +BPF_CALL_2(bpf_timer_start, struct bpf_timer_kern *, timer, u64, nsecs)
> +{
> +       struct bpf_hrtimer *t;
> +       int ret = 0;
> +
> +       ____bpf_spin_lock(&timer->lock);
> +       t = timer->timer;
> +       if (!t) {
> +               ret = -EINVAL;
> +               goto out;
> +       }
> +       if (!hrtimer_active(&t->timer) || hrtimer_callback_running(&t->timer))
> +               /* If the timer wasn't active or callback already executing
> +                * bump the prog refcnt to keep it alive until
> +                * callback is invoked (again).
> +                */
> +               bpf_prog_inc(t->prog);

Hmm, finally you begin refcounting it, which you were strongly against. ;)

Three questions:

1. Can t->prog be freed between bpf_timer_init() and bpf_timer_start()?
If the timer subprog is always in the same prog which installs it, then
this is fine. But then how do multiple programs share a timer? In the
case of conntrack, either ingress or egress could install the timer;
it only depends on which one gets traffic first. Do they have to copy
the same subprog for the same timer?

2. Can t->prog be freed between a timer expiration and bpf_timer_start()
again? It gets a refcnt when starting a timer and puts it when cancelled
or expired, so t->prog can be freed right after cancellation or expiration. What
if another program which shares this timer wants to restart this timer?

3. Since you offer no API to read the expiration time, why not just use
BPF_TEST_RUN with a user-space timer? This is preferred by Andrii.

Thanks.
Cong Wang June 11, 2021, 7:05 a.m. UTC | #2
On Thu, Jun 10, 2021 at 9:27 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
> // Initialize the timer to call 'callback_fn' static function
> // First 4 bits of 'flags' specify clockid.
> // Only CLOCK_MONOTONIC, CLOCK_REALTIME, CLOCK_BOOTTIME are allowed.
> long bpf_timer_init(struct bpf_timer *timer, void *callback_fn, int flags);

Another unpopular point of view:

This init() is not suitable for bpf programs, because unlike kernel modules,
there are no init or exit functions for a bpf program. And timer init is
typically called during module init.

(bpf spinlock is different, because it can be simply initialized as UNLOCKED.)
Alexei Starovoitov June 11, 2021, 6:45 p.m. UTC | #3
On Thu, Jun 10, 2021 at 11:42:24PM -0700, Cong Wang wrote:
> On Thu, Jun 10, 2021 at 9:27 PM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > From: Alexei Starovoitov <ast@kernel.org>

Please stick to one email thread in the future, ok?

I'll consolidate them here:

> What is your use case to justify your own code? Asking because
> you deny mine, so clearly my use case is not yours.

I mentioned several use cases in the prior threads.
tldr: any periodic event in tracing, networking, security.
Garbage collection falls into this category as well, but please internalize
that implementing conntrack as it is today in the kernel is an explicit non-goal.

> And more importantly, why not just use BPF_TEST_RUN with
> a user-space timer? Asking because you offer no API to read or
> modify timer expiration, so literally the same with BPF_TEST_RUN
> approach.

a wrapper on top of hrtimer_get_remaining() like bpf_timer_get_remaining()
is trivial to add, but what is the use case?
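
For the sake of discussion, a rough sketch of what such a hypothetical helper
could look like (not part of this series), reusing the locking pattern of the
bpf_timer_* helpers in this patch:

BPF_CALL_1(bpf_timer_get_remaining, struct bpf_timer_kern *, timer)
{
        struct bpf_hrtimer *t;
        s64 ret = -EINVAL;

        ____bpf_spin_lock(&timer->lock);
        t = timer->timer;
        if (t)
                /* remaining time in nsec until an armed timer expires */
                ret = ktime_to_ns(hrtimer_get_remaining(&t->timer));
        ____bpf_spin_unlock(&timer->lock);
        return ret;
}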

> >
> > Introduce 'struct bpf_timer { __u64 :64; __u64 :64; };' that can be embedded
> > in hash/array/lru maps as regular field and helpers to operate on it:
> 
> Can be or has to be? Huge difference here.

map elements don't have to use timers.

> In the other thread, you said it is global data, which implies that it does
> not have to be in a map.

global data is a map. That was explained in the prior thread as well.

> 
> In your test case or your example, all timers are still in a map. So what has
> changed since then? Looks nothing to me.

look again?

> Hmm, finally you begin refcounting it, which you were strongly against. ;)

That was already answered in the prior thread.
tldr: there were two options. This is one of them. Another can be added
in the future as well.

> Three questions:
> 
> 1. Can t->prog be freed between bpf_timer_init() and bpf_timer_start()?

yes.

> If the timer subprog is always in the same prog which installs it, then

installs it? I'm not following the question.

> this is fine. But then how do multiple programs share a timer? 

there is only one callback function.

> In the
> case of conntrack, either ingress or egress could install the timer,
> it only depends which one gets traffic first. Do they have to copy
> the same subprog for the same timer?

conntrack is an explicit non-goal.

> 
> 2. Can t->prog be freed between a timer expiration and bpf_timer_start()
> again? 

If it's already armed with the first bpf_timer_start() it won't be freed.

> It gets a refcnt when starting a timer and puts it when cancelling
> or expired, so t->prog can be freed right after cancelling or expired. What
> if another program which shares this timer wants to restart this timer?

There is only one callback_fn per timer. Another program can share
the struct bpf_timer and the map. It might have subprog callback_fn code
that looks exactly the same as callback_fn in the first prog.
For example when libbpf loads progs/timer.c (after it was compiled into .o)
it might share a subprog in the future (when kernel has support for
dynamic linking). From bpf user pov it's a single .c file.
The split into programs and subprograms is an implementation detail
that a C programmer doesn't need to worry about.
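
As a minimal sketch (hypothetical names, reusing the hmap, timer_cb and test1
program from the commit message), a second program in the same .c file could
operate on the same timer like this:

SEC("fentry/bpf_fentry_test2")
int BPF_PROG(test2, int a, u64 b)
{
    struct map_elem *val;
    int key = 0;

    val = bpf_map_lookup_elem(&hmap, &key);
    if (val) {
        /* returns -EBUSY if test1 (or anyone else) already initialized it */
        bpf_timer_init(&val->timer, timer_cb, CLOCK_REALTIME);
        /* re-arm the shared timer; the callback stays timer_cb */
        bpf_timer_start(&val->timer, 1000);
    }
    return 0;
}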

> 3. Since you offer no API to read the expiration time, why not just use
> BPF_TEST_RUN with a user-space timer? This is preferred by Andrii.

Andrii's point was that there should be no syscall cmds that replicate
bpf_timer_init/start/cancel helpers. I agree with this.


> Thanks.
>
> Another unpopular point of view:
>
> This init() is not suitable for bpf programs, because unlike kernel modules,
> there is no init or exit functions for a bpf program. And timer init
> is typically
> called during module init.

Already answered this in the prior thread. There will be __init and __fini like
subprograms in bpf progs.

Please apply the patches to your local tree and do a few experiments based
on selftests/bpf/progs/timer.c. I think experimenting with the code
will answer all of your questions.
Yonghong Song June 11, 2021, 10:12 p.m. UTC | #4
On 6/10/21 9:24 PM, Alexei Starovoitov wrote:
> From: Alexei Starovoitov <ast@kernel.org>
> 
> Introduce 'struct bpf_timer { __u64 :64; __u64 :64; };' that can be embedded
> in hash/array/lru maps as regular field and helpers to operate on it:
> 
> // Initialize the timer to call 'callback_fn' static function
> // First 4 bits of 'flags' specify clockid.
> // Only CLOCK_MONOTONIC, CLOCK_REALTIME, CLOCK_BOOTTIME are allowed.
> long bpf_timer_init(struct bpf_timer *timer, void *callback_fn, int flags);
> 
> // Start the timer and set its expiration 'nsec' nanoseconds from the current time.
> long bpf_timer_start(struct bpf_timer *timer, u64 nsec);
> 
> // Cancel the timer and wait for callback_fn to finish if it was running.
> long bpf_timer_cancel(struct bpf_timer *timer);
> 
> Here is how BPF program might look like:
> struct map_elem {
>      int counter;
>      struct bpf_timer timer;
> };
> 
> struct {
>      __uint(type, BPF_MAP_TYPE_HASH);
>      __uint(max_entries, 1000);
>      __type(key, int);
>      __type(value, struct map_elem);
> } hmap SEC(".maps");
> 
> static int timer_cb(void *map, int *key, struct map_elem *val);
> /* val points to particular map element that contains bpf_timer. */
> 
> SEC("fentry/bpf_fentry_test1")
> int BPF_PROG(test1, int a)
> {
>      struct map_elem *val;
>      int key = 0;
> 
>      val = bpf_map_lookup_elem(&hmap, &key);
>      if (val) {
>          bpf_timer_init(&val->timer, timer_cb, CLOCK_REALTIME);
>          bpf_timer_start(&val->timer, 1000 /* call timer_cb2 in 1 usec */);
>      }
> }
> 
> This patch adds helper implementations that rely on hrtimers
> to call bpf functions as timers expire.
> The following patch adds necessary safety checks.
> 
> Only programs with CAP_BPF are allowed to use bpf_timer.
> 
> The amount of timers used by the program is constrained by
> the memcg recorded at map creation time.
> 
> The bpf_timer_init() helper is receiving hidden 'map' and 'prog' arguments
> supplied by the verifier. The prog pointer is needed to do refcnting of bpf
> program to make sure that program doesn't get freed while timer is armed.
> 
> The bpf_map_delete_elem() and bpf_map_update_elem() operations cancel
> and free the timer if given map element had it allocated.
> "bpftool map update" command can be used to cancel timers.
> 
> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
> ---
>   include/linux/bpf.h            |   2 +
>   include/uapi/linux/bpf.h       |  40 ++++++
>   kernel/bpf/helpers.c           | 227 +++++++++++++++++++++++++++++++++
>   kernel/bpf/verifier.c          | 109 ++++++++++++++++
>   kernel/trace/bpf_trace.c       |   2 +-
>   scripts/bpf_doc.py             |   2 +
>   tools/include/uapi/linux/bpf.h |  40 ++++++
>   7 files changed, 421 insertions(+), 1 deletion(-)
> 
[...]
>   
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 2c1ba70abbf1..d25bbcdad8e6 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -4778,6 +4778,38 @@ union bpf_attr {
>    * 		Execute close syscall for given FD.
>    * 	Return
>    * 		A syscall result.
> + *
> + * long bpf_timer_init(struct bpf_timer *timer, void *callback_fn, int flags)
> + *	Description
> + *		Initialize the timer to call *callback_fn* static function.
> + *		First 4 bits of *flags* specify clockid. Only CLOCK_MONOTONIC,
> + *		CLOCK_REALTIME, CLOCK_BOOTTIME are allowed.
> + *		All other bits of *flags* are reserved.
> + *	Return
> + *		0 on success.
> + *		**-EBUSY** if *timer* is already initialized.
> + *		**-EINVAL** if invalid *flags* are passed.
> + *
> + * long bpf_timer_start(struct bpf_timer *timer, u64 nsecs)
> + *	Description
> + *		Start the timer and set its expiration N nanoseconds from the
> + *		current time. The timer callback_fn will be invoked in soft irq
> + *		context on some cpu and will not repeat unless another
> + *		bpf_timer_start() is made. In such case the next invocation can
> + *		migrate to a different cpu.
> + *	Return
> + *		0 on success.
> + *		**-EINVAL** if *timer* was not initialized with bpf_timer_init() earlier.
> + *
> + * long bpf_timer_cancel(struct bpf_timer *timer)
> + *	Description
> + *		Cancel the timer and wait for callback_fn to finish if it was running.
> + *	Return
> + *		0 if the timer was not active.
> + *		1 if the timer was active.
> + *		**-EINVAL** if *timer* was not initialized with bpf_timer_init() earlier.
> + *		**-EDEADLK** if callback_fn tried to call bpf_timer_cancel() on its own timer
> + *		which would have led to a deadlock otherwise.
>    */
>   #define __BPF_FUNC_MAPPER(FN)		\
>   	FN(unspec),			\
> @@ -4949,6 +4981,9 @@ union bpf_attr {
>   	FN(sys_bpf),			\
>   	FN(btf_find_by_name_kind),	\
>   	FN(sys_close),			\
> +	FN(timer_init),			\
> +	FN(timer_start),		\
> +	FN(timer_cancel),		\
>   	/* */
>   
>   /* integer value in 'imm' field of BPF_CALL instruction selects which helper
> @@ -6061,6 +6096,11 @@ struct bpf_spin_lock {
>   	__u32	val;
>   };
>   
> +struct bpf_timer {
> +	__u64 :64;
> +	__u64 :64;
> +};
> +
>   struct bpf_sysctl {
>   	__u32	write;		/* Sysctl is being read (= 0) or written (= 1).
>   				 * Allows 1,2,4-byte read, but no write.
> diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
> index 544773970dbc..3a693d451ca3 100644
> --- a/kernel/bpf/helpers.c
> +++ b/kernel/bpf/helpers.c
> @@ -985,6 +985,227 @@ const struct bpf_func_proto bpf_snprintf_proto = {
>   	.arg5_type	= ARG_CONST_SIZE_OR_ZERO,
>   };
>   
> +struct bpf_hrtimer {
> +	struct hrtimer timer;
> +	struct bpf_map *map;
> +	struct bpf_prog *prog;
> +	void *callback_fn;
> +	void *value;
> +};
> +
> +/* the actual struct hidden inside uapi struct bpf_timer */
> +struct bpf_timer_kern {
> +	struct bpf_hrtimer *timer;
> +	struct bpf_spin_lock lock;
> +};

Looks like on a 32-bit system, sizeof(struct bpf_timer_kern) is 8 bytes
(64 bits) while sizeof(struct bpf_timer) is 16 bytes (128 bits).

struct bpf_spin_lock {
         __u32   val;
};

struct bpf_timer {
	__u64 :64;
	__u64 :64;
};

Checking the code, we may not have issues since struct
"bpf_timer" is only used to reserve space and the
map value copy routine handles that properly.

Maybe we can still make it consistent, with the
two fields in bpf_timer_kern mapping to the
two fields in bpf_timer?

struct bpf_timer_kern {
	__bpf_md_ptr(struct bpf_hrtimer *, timer);
	struct bpf_spin_lock lock;
};
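
For reference, the __bpf_md_ptr() macro in include/uapi/linux/bpf.h is roughly
the following; it pads the pointer to 64 bits regardless of the architecture's
pointer size, which is what would keep the layout identical on 32-bit:

#define __bpf_md_ptr(type, name) \
union { \
        type name; \
        __u64 :64; \
} __attribute__((aligned(8)))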

> +
> +static DEFINE_PER_CPU(struct bpf_hrtimer *, hrtimer_running);
> +
> +static enum hrtimer_restart bpf_timer_cb(struct hrtimer *timer)
> +{
> +	struct bpf_hrtimer *t = container_of(timer, struct bpf_hrtimer, timer);
> +	struct bpf_prog *prog = t->prog;
> +	struct bpf_map *map = t->map;
> +	void *key;
> +	u32 idx;
> +	int ret;
> +
> +	/* bpf_timer_cb() runs in hrtimer_run_softirq. It doesn't migrate and
> +	 * cannot be preempted by another bpf_timer_cb() on the same cpu.
> +	 * Remember the timer this callback is servicing to prevent
> +	 * deadlock if callback_fn() calls bpf_timer_cancel() on the same timer.
> +	 */
> +	this_cpu_write(hrtimer_running, t);
> +	if (map->map_type == BPF_MAP_TYPE_ARRAY) {
> +		struct bpf_array *array = container_of(map, struct bpf_array, map);
> +
> +		/* compute the key */
> +		idx = ((char *)t->value - array->value) / array->elem_size;
> +		key = &idx;
> +	} else { /* hash or lru */
> +		key = t->value - round_up(map->key_size, 8);
> +	}
> +
> +	ret = BPF_CAST_CALL(t->callback_fn)((u64)(long)map,
> +					    (u64)(long)key,
> +					    (u64)(long)t->value, 0, 0);
> +	WARN_ON(ret != 0); /* Next patch disallows 1 in the verifier */
> +
> +	/* The bpf function finished executed. Drop the prog refcnt.
> +	 * It could reach zero here and trigger free of bpf_prog
> +	 * and subsequent free of the maps that were holding timers.
> +	 * If callback_fn called bpf_timer_start on this timer
> +	 * the prog refcnt will be > 0.
> +	 *
> +	 * If callback_fn deleted map element the 't' could have been freed,
> +	 * hence t->prog deref is done earlier.
> +	 */
> +	bpf_prog_put(prog);
> +	this_cpu_write(hrtimer_running, NULL);
> +	return HRTIMER_NORESTART;
> +}
> +
> +BPF_CALL_5(bpf_timer_init, struct bpf_timer_kern *, timer, void *, cb, int, flags,
> +	   struct bpf_map *, map, struct bpf_prog *, prog)
> +{
[...]
Yonghong Song June 14, 2021, 4:51 p.m. UTC | #5
On 6/10/21 9:24 PM, Alexei Starovoitov wrote:
> From: Alexei Starovoitov <ast@kernel.org>
> 
> Introduce 'struct bpf_timer { __u64 :64; __u64 :64; };' that can be embedded
> in hash/array/lru maps as regular field and helpers to operate on it:
> 
> // Initialize the timer to call 'callback_fn' static function
> // First 4 bits of 'flags' specify clockid.
> // Only CLOCK_MONOTONIC, CLOCK_REALTIME, CLOCK_BOOTTIME are allowed.
> long bpf_timer_init(struct bpf_timer *timer, void *callback_fn, int flags);
> 
> // Start the timer and set its expiration 'nsec' nanoseconds from the current time.
> long bpf_timer_start(struct bpf_timer *timer, u64 nsec);
> 
> // Cancel the timer and wait for callback_fn to finish if it was running.
> long bpf_timer_cancel(struct bpf_timer *timer);
> 
> Here is how BPF program might look like:
> struct map_elem {
>      int counter;
>      struct bpf_timer timer;
> };
> 
> struct {
>      __uint(type, BPF_MAP_TYPE_HASH);
>      __uint(max_entries, 1000);
>      __type(key, int);
>      __type(value, struct map_elem);
> } hmap SEC(".maps");
> 
> static int timer_cb(void *map, int *key, struct map_elem *val);
> /* val points to particular map element that contains bpf_timer. */
> 
> SEC("fentry/bpf_fentry_test1")
> int BPF_PROG(test1, int a)
> {
>      struct map_elem *val;
>      int key = 0;
> 
>      val = bpf_map_lookup_elem(&hmap, &key);
>      if (val) {
>          bpf_timer_init(&val->timer, timer_cb, CLOCK_REALTIME);
>          bpf_timer_start(&val->timer, 1000 /* call timer_cb2 in 1 usec */);
>      }
> }
> 
> This patch adds helper implementations that rely on hrtimers
> to call bpf functions as timers expire.
> The following patch adds necessary safety checks.
> 
> Only programs with CAP_BPF are allowed to use bpf_timer.
> 
> The amount of timers used by the program is constrained by
> the memcg recorded at map creation time.
> 
> The bpf_timer_init() helper is receiving hidden 'map' and 'prog' arguments
> supplied by the verifier. The prog pointer is needed to do refcnting of bpf
> program to make sure that program doesn't get freed while timer is armed.
> 
> The bpf_map_delete_elem() and bpf_map_update_elem() operations cancel
> and free the timer if given map element had it allocated.
> "bpftool map update" command can be used to cancel timers.
> 
> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
> ---
>   include/linux/bpf.h            |   2 +
>   include/uapi/linux/bpf.h       |  40 ++++++
>   kernel/bpf/helpers.c           | 227 +++++++++++++++++++++++++++++++++
>   kernel/bpf/verifier.c          | 109 ++++++++++++++++
>   kernel/trace/bpf_trace.c       |   2 +-
>   scripts/bpf_doc.py             |   2 +
>   tools/include/uapi/linux/bpf.h |  40 ++++++
>   7 files changed, 421 insertions(+), 1 deletion(-)
> 
[...]
> +
> +static enum hrtimer_restart bpf_timer_cb(struct hrtimer *timer)
> +{
> +	struct bpf_hrtimer *t = container_of(timer, struct bpf_hrtimer, timer);
> +	struct bpf_prog *prog = t->prog;
> +	struct bpf_map *map = t->map;
> +	void *key;
> +	u32 idx;
> +	int ret;
> +
> +	/* bpf_timer_cb() runs in hrtimer_run_softirq. It doesn't migrate and
> +	 * cannot be preempted by another bpf_timer_cb() on the same cpu.
> +	 * Remember the timer this callback is servicing to prevent
> +	 * deadlock if callback_fn() calls bpf_timer_cancel() on the same timer.
> +	 */
> +	this_cpu_write(hrtimer_running, t);
> +	if (map->map_type == BPF_MAP_TYPE_ARRAY) {
> +		struct bpf_array *array = container_of(map, struct bpf_array, map);
> +
> +		/* compute the key */
> +		idx = ((char *)t->value - array->value) / array->elem_size;
> +		key = &idx;
> +	} else { /* hash or lru */
> +		key = t->value - round_up(map->key_size, 8);
> +	}
> +
> +	ret = BPF_CAST_CALL(t->callback_fn)((u64)(long)map,
> +					    (u64)(long)key,
> +					    (u64)(long)t->value, 0, 0);
> +	WARN_ON(ret != 0); /* Next patch disallows 1 in the verifier */

I didn't find where the next patch disallows a callback return value of 1 in
the verifier. If we indeed disallow return value 1 in the verifier, we
don't need the WARN_ON here. Did I miss anything?

> +
> +	/* The bpf function finished executed. Drop the prog refcnt.
> +	 * It could reach zero here and trigger free of bpf_prog
> +	 * and subsequent free of the maps that were holding timers.
> +	 * If callback_fn called bpf_timer_start on this timer
> +	 * the prog refcnt will be > 0.
> +	 *
> +	 * If callback_fn deleted map element the 't' could have been freed,
> +	 * hence t->prog deref is done earlier.
> +	 */
> +	bpf_prog_put(prog);
> +	this_cpu_write(hrtimer_running, NULL);
> +	return HRTIMER_NORESTART;
> +}
> +
> +BPF_CALL_5(bpf_timer_init, struct bpf_timer_kern *, timer, void *, cb, int, flags,
> +	   struct bpf_map *, map, struct bpf_prog *, prog)
> +{
> +	clockid_t clockid = flags & (MAX_CLOCKS - 1);
> +	struct bpf_hrtimer *t;
> +	int ret = 0;
> +
> +	BUILD_BUG_ON(MAX_CLOCKS != 16);
> +	if (flags >= MAX_CLOCKS ||
> +	    /* similar to timerfd except _ALARM variants are not supported */
> +            (clockid != CLOCK_MONOTONIC &&
> +             clockid != CLOCK_REALTIME &&
> +             clockid != CLOCK_BOOTTIME))
> +		return -EINVAL;
> +	____bpf_spin_lock(&timer->lock);
> +	t = timer->timer;
> +	if (t) {
> +		ret = -EBUSY;
> +		goto out;
> +	}
> +	/* allocate hrtimer via map_kmalloc to use memcg accounting */
> +	t = bpf_map_kmalloc_node(map, sizeof(*t), GFP_ATOMIC, NUMA_NO_NODE);
> +	if (!t) {
> +		ret = -ENOMEM;
> +		goto out;
> +	}
> +	t->callback_fn = cb;
> +	t->value = (void *)timer /* - offset of bpf_timer inside elem */;
> +	t->map = map;
> +	t->prog = prog;
> +	hrtimer_init(&t->timer, clockid, HRTIMER_MODE_REL_SOFT);
> +	t->timer.function = bpf_timer_cb;
> +	timer->timer = t;
> +out:
> +	____bpf_spin_unlock(&timer->lock);
> +	return ret;
> +}
> +
> +static const struct bpf_func_proto bpf_timer_init_proto = {
> +	.func		= bpf_timer_init,
> +	.gpl_only	= true,
> +	.ret_type	= RET_INTEGER,
> +	.arg1_type	= ARG_PTR_TO_TIMER,
> +	.arg2_type	= ARG_PTR_TO_FUNC,
> +	.arg3_type	= ARG_ANYTHING,
> +};
> +
> +BPF_CALL_2(bpf_timer_start, struct bpf_timer_kern *, timer, u64, nsecs)
> +{
> +	struct bpf_hrtimer *t;
> +	int ret = 0;
> +
> +	____bpf_spin_lock(&timer->lock);
> +	t = timer->timer;
> +	if (!t) {
> +		ret = -EINVAL;
> +		goto out;
> +	}
> +	if (!hrtimer_active(&t->timer) || hrtimer_callback_running(&t->timer))
> +		/* If the timer wasn't active or callback already executing
> +		 * bump the prog refcnt to keep it alive until
> +		 * callback is invoked (again).
> +		 */
> +		bpf_prog_inc(t->prog);

I am not 100% sure. But could we have a race condition here?
    cpu 1: running bpf_timer_start() helper call
    cpu 2: doing hrtimer work (calling callback etc.)

Is it possible that
   !hrtimer_active(&t->timer) || hrtimer_callback_running(&t->timer)
may be true, and then right before bpf_prog_inc(t->prog) the callback
finishes? If hrtimer_callback_running() was true, it is possible that the
callback function could have dropped the reference count for t->prog,
so we could already go into the body of the function
__bpf_prog_put()?

static void __bpf_prog_put(struct bpf_prog *prog, bool do_idr_lock)
{
         if (atomic64_dec_and_test(&prog->aux->refcnt)) {
                 perf_event_bpf_event(prog, PERF_BPF_EVENT_PROG_UNLOAD, 0);
                 bpf_audit_prog(prog, BPF_AUDIT_UNLOAD);
                 /* bpf_prog_free_id() must be called first */
                 bpf_prog_free_id(prog, do_idr_lock);
                 __bpf_prog_put_noref(prog, true);
         }
}


> +	hrtimer_start(&t->timer, ns_to_ktime(nsecs), HRTIMER_MODE_REL_SOFT);
> +out:
> +	____bpf_spin_unlock(&timer->lock);
> +	return ret;
> +}
> +
> +static const struct bpf_func_proto bpf_timer_start_proto = {
> +	.func		= bpf_timer_start,
> +	.gpl_only	= true,
> +	.ret_type	= RET_INTEGER,
> +	.arg1_type	= ARG_PTR_TO_TIMER,
> +	.arg2_type	= ARG_ANYTHING,
> +};
> +
> +BPF_CALL_1(bpf_timer_cancel, struct bpf_timer_kern *, timer)
> +{
> +	struct bpf_hrtimer *t;
> +	int ret = 0;
> +
> +	____bpf_spin_lock(&timer->lock);
> +	t = timer->timer;
> +	if (!t) {
> +		ret = -EINVAL;
> +		goto out;
> +	}
> +	if (this_cpu_read(hrtimer_running) == t) {
> +		/* If bpf callback_fn is trying to bpf_timer_cancel()
> +		 * its own timer the hrtimer_cancel() will deadlock
> +		 * since it waits for callback_fn to finish
> +		 */
> +		ret = -EDEADLK;
> +		goto out;
> +	}
> +	/* Cancel the timer and wait for associated callback to finish
> +	 * if it was running.
> +	 */
> +	if (hrtimer_cancel(&t->timer) == 1) {

Again, could we have a race here between the bpf program and hrtimer_cancel()?

> +		/* If the timer was active then drop the prog refcnt,
> +		 * since callback will not be invoked.
> +		 */
> +		bpf_prog_put(t->prog);
> +		ret = 1;
> +	}
> +out:
> +	____bpf_spin_unlock(&timer->lock);
> +	return ret;
> +}
> +
> +static const struct bpf_func_proto bpf_timer_cancel_proto = {
> +	.func		= bpf_timer_cancel,
> +	.gpl_only	= true,
> +	.ret_type	= RET_INTEGER,
> +	.arg1_type	= ARG_PTR_TO_TIMER,
> +};
> +
> +/* This function is called by delete_element in htab and lru maps
> + * and by map_free for array, lru, htab maps.
> + */
> +void bpf_timer_cancel_and_free(void *val)
> +{
> +	struct bpf_timer_kern *timer = val;
> +	struct bpf_hrtimer *t;
> +
> +	____bpf_spin_lock(&timer->lock);
> +	t = timer->timer;
> +	if (!t)
> +		goto out;
> +	/* Cancel the timer and wait for callback to complete if it was
> +	 * running. Only individual delete_element in htab or lru maps can
> +	 * return 1 from hrtimer_cancel.
> +	 * The whole map is destroyed when its refcnt reaches zero.
> +	 * That happens after bpf prog refcnt reaches zero.
> +	 * bpf prog refcnt will not reach zero until all timers are executed.
> +	 * So when maps are destroyed hrtimer_cancel will surely return 0.
> +	 * In such case t->prog is a pointer to freed memory.
> +	 *
> +	 * When htab or lru is deleting individual element check that
> +	 * bpf_map_delete_elem() isn't trying to delete elem with running timer.
> +	 * In such case don't call hrtimer_cancel() (since it will deadlock)
> +	 * and don't call hrtimer_try_to_cancel() (since it will just return -1).
> +	 * Instead free the timer and set timer->timer = NULL.
> +	 * The subsequent bpf_timer_start/cancel() helpers won't be able to use it.
> +	 * In preallocated maps it's safe to do timer->timer = NULL.
> +	 * The memory could be reused for another element while current timer
> +	 * callback can still do bpf_timer_init() on it.
> +	 * In non-preallocated maps timer->timer = NULL will happen after
> +	 * callback completes, since prog execution is an RCU critical section.
> +	 */
> +	if (this_cpu_read(hrtimer_running) != t &&
> +	    hrtimer_cancel(&t->timer) == 1)
> +		bpf_prog_put(t->prog);
> +	kfree(t);
> +	timer->timer = NULL;
> +out:
> +	____bpf_spin_unlock(&timer->lock);
> +}
> +
>   const struct bpf_func_proto bpf_get_current_task_proto __weak;
>   const struct bpf_func_proto bpf_probe_read_user_proto __weak;
>   const struct bpf_func_proto bpf_probe_read_user_str_proto __weak;
> @@ -1051,6 +1272,12 @@ bpf_base_func_proto(enum bpf_func_id func_id)
>   		return &bpf_per_cpu_ptr_proto;
>   	case BPF_FUNC_this_cpu_ptr:
>   		return &bpf_this_cpu_ptr_proto;
> +	case BPF_FUNC_timer_init:
> +		return &bpf_timer_init_proto;
> +	case BPF_FUNC_timer_start:
> +		return &bpf_timer_start_proto;
> +	case BPF_FUNC_timer_cancel:
> +		return &bpf_timer_cancel_proto;
>   	default:
>   		break;
>   	}
> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index 1de4b8c6ee42..44ec9760b562 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -4656,6 +4656,35 @@ static int process_spin_lock(struct bpf_verifier_env *env, int regno,
>   	return 0;
>   }
>   
> +static int process_timer_func(struct bpf_verifier_env *env, int regno,
> +			      struct bpf_call_arg_meta *meta)
> +{
> +	struct bpf_reg_state *regs = cur_regs(env), *reg = &regs[regno];
> +	bool is_const = tnum_is_const(reg->var_off);
> +	struct bpf_map *map = reg->map_ptr;
> +	u64 val = reg->var_off.value;
> +
> +	if (!is_const) {
> +		verbose(env,
> +			"R%d doesn't have constant offset. bpf_timer has to be at the constant offset\n",
> +			regno);
> +		return -EINVAL;
> +	}
> +	if (!map->btf) {
> +		verbose(env, "map '%s' has to have BTF in order to use bpf_timer\n",
> +			map->name);
> +		return -EINVAL;
> +	}
> +	if (val) {
> +		/* This restriction will be removed in the next patch */
> +		verbose(env, "bpf_timer field can only be first in the map value element\n");
> +		return -EINVAL;
> +	}
> +	WARN_ON(meta->map_ptr);

Could you explain when this could happen?

> +	meta->map_ptr = map;
> +	return 0;
> +}
> +
>   static bool arg_type_is_mem_ptr(enum bpf_arg_type type)
>   {
>   	return type == ARG_PTR_TO_MEM ||
> @@ -4788,6 +4817,7 @@ static const struct bpf_reg_types percpu_btf_ptr_types = { .types = { PTR_TO_PER
>   static const struct bpf_reg_types func_ptr_types = { .types = { PTR_TO_FUNC } };
>   static const struct bpf_reg_types stack_ptr_types = { .types = { PTR_TO_STACK } };
>   static const struct bpf_reg_types const_str_ptr_types = { .types = { PTR_TO_MAP_VALUE } };
> +static const struct bpf_reg_types timer_types = { .types = { PTR_TO_MAP_VALUE } };
>   
[...]
Alexei Starovoitov June 15, 2021, 3:29 a.m. UTC | #6
On Mon, Jun 14, 2021 at 9:51 AM Yonghong Song <yhs@fb.com> wrote:
> > +     ret = BPF_CAST_CALL(t->callback_fn)((u64)(long)map,
> > +                                         (u64)(long)key,
> > +                                         (u64)(long)t->value, 0, 0);
> > +     WARN_ON(ret != 0); /* Next patch disallows 1 in the verifier */
>
> I didn't find that next patch disallows callback return value 1 in the
> verifier. If we indeed disallows return value 1 in the verifier. We
> don't need WARN_ON here. Did I miss anything?

Ohh. I forgot to address this bit in the verifier. Will fix.

> > +     if (!hrtimer_active(&t->timer) || hrtimer_callback_running(&t->timer))
> > +             /* If the timer wasn't active or callback already executing
> > +              * bump the prog refcnt to keep it alive until
> > +              * callback is invoked (again).
> > +              */
> > +             bpf_prog_inc(t->prog);
>
> I am not 100% sure. But could we have race condition here?
>     cpu 1: running bpf_timer_start() helper call
>     cpu 2: doing hrtimer work (calling callback etc.)
>
> Is it possible that
>    !hrtimer_active(&t->timer) || hrtimer_callback_running(&t->timer)
> may be true and then right before bpf_prog_inc(t->prog), it becomes
> true? If hrtimer_callback_running() is called, it is possible that
> callback function could have dropped the reference count for t->prog,
> so we could already go into the body of the function
> __bpf_prog_put()?

you're correct. Indeed there is a race.
Circular dependency is a never ending headache.
That's the same design mistake as with tail_calls.
It felt that this case would be simpler than tail_calls and a bpf program
pinning itself with bpf_prog_inc can be made to work... nope.
I'll get rid of this and switch to something 'obviously correct'.
Probably a linked list with a lock to keep a set of init-ed timers and
auto-cancel them on prog refcnt going to zero.
To do 'bpf daemon' the prog would need to be pinned.

> > +     if (val) {
> > +             /* This restriction will be removed in the next patch */
> > +             verbose(env, "bpf_timer field can only be first in the map value element\n");
> > +             return -EINVAL;
> > +     }
> > +     WARN_ON(meta->map_ptr);
>
> Could you explain when this could happen?

Only if there is a verifier bug or a new helper is added with both a timer arg
and a map arg. I'll switch to verbose() + -EFAULT instead.
Alexei Starovoitov June 15, 2021, 3:33 a.m. UTC | #7
On Fri, Jun 11, 2021 at 3:12 PM Yonghong Song <yhs@fb.com> wrote:
> > +struct bpf_hrtimer {
> > +     struct hrtimer timer;
> > +     struct bpf_map *map;
> > +     struct bpf_prog *prog;
> > +     void *callback_fn;
> > +     void *value;
> > +};
> > +
> > +/* the actual struct hidden inside uapi struct bpf_timer */
> > +struct bpf_timer_kern {
> > +     struct bpf_hrtimer *timer;
> > +     struct bpf_spin_lock lock;
> > +};
>
> Looks like in 32bit system, sizeof(struct bpf_timer_kern) is 64
> and sizeof(struct bpf_timer) is 128.
>
> struct bpf_spin_lock {
>          __u32   val;
> };
>
> struct bpf_timer {
>         __u64 :64;
>         __u64 :64;
> };
>
> Checking the code, we may not have issues as structure
> "bpf_timer" is only used to reserve spaces and
> map copy value routine handles that properly.
>
> Maybe we can still make it consistent with
> two fields in bpf_timer_kern mapping to
> two fields in bpf_timer?
>
> struct bpf_timer_kern {
>         __bpf_md_ptr(struct bpf_hrtimer *, timer);
>         struct bpf_spin_lock lock;
> };

Such alignment of fields is not necessary,
since the fields are not accessible directly from the bpf prog.
struct bpf_timer_kern needs to fit into struct bpf_timer and
the alignof of these two structs needs to be the same.
That's all. I'll add a build_bug_on to make sure.
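
For example (a sketch of the intended checks, not the final code):

        BUILD_BUG_ON(sizeof(struct bpf_timer_kern) > sizeof(struct bpf_timer));
        BUILD_BUG_ON(__alignof__(struct bpf_timer_kern) != __alignof__(struct bpf_timer));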
Yonghong Song June 15, 2021, 4:21 a.m. UTC | #8
On 6/14/21 8:33 PM, Alexei Starovoitov wrote:
> On Fri, Jun 11, 2021 at 3:12 PM Yonghong Song <yhs@fb.com> wrote:
>>> +struct bpf_hrtimer {
>>> +     struct hrtimer timer;
>>> +     struct bpf_map *map;
>>> +     struct bpf_prog *prog;
>>> +     void *callback_fn;
>>> +     void *value;
>>> +};
>>> +
>>> +/* the actual struct hidden inside uapi struct bpf_timer */
>>> +struct bpf_timer_kern {
>>> +     struct bpf_hrtimer *timer;
>>> +     struct bpf_spin_lock lock;
>>> +};
>>
>> Looks like in 32bit system, sizeof(struct bpf_timer_kern) is 64
>> and sizeof(struct bpf_timer) is 128.
>>
>> struct bpf_spin_lock {
>>           __u32   val;
>> };
>>
>> struct bpf_timer {
>>          __u64 :64;
>>          __u64 :64;
>> };
>>
>> Checking the code, we may not have issues as structure
>> "bpf_timer" is only used to reserve spaces and
>> map copy value routine handles that properly.
>>
>> Maybe we can still make it consistent with
>> two fields in bpf_timer_kern mapping to
>> two fields in bpf_timer?
>>
>> struct bpf_timer_kern {
>>          __bpf_md_ptr(struct bpf_hrtimer *, timer);
>>          struct bpf_spin_lock lock;
>> };
> 
> Such alignment of fields is not necessary,
> since the fields are not accessible directly from bpf prog.
> struct bpf_timer_kern needs to fit into struct bpf_timer and
> alignof these two structs needs to be the same.
> That's all. I'll add build_bug_on to make sure.

Sounds good to me. Thanks!
Andrii Nakryiko June 15, 2021, 4:48 a.m. UTC | #9
On Thu, Jun 10, 2021 at 9:24 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> From: Alexei Starovoitov <ast@kernel.org>
>
> Introduce 'struct bpf_timer { __u64 :64; __u64 :64; };' that can be embedded
> in hash/array/lru maps as regular field and helpers to operate on it:
>
> // Initialize the timer to call 'callback_fn' static function
> // First 4 bits of 'flags' specify clockid.
> // Only CLOCK_MONOTONIC, CLOCK_REALTIME, CLOCK_BOOTTIME are allowed.
> long bpf_timer_init(struct bpf_timer *timer, void *callback_fn, int flags);
>
> // Start the timer and set its expiration 'nsec' nanoseconds from the current time.
> long bpf_timer_start(struct bpf_timer *timer, u64 nsec);
>
> // Cancel the timer and wait for callback_fn to finish if it was running.
> long bpf_timer_cancel(struct bpf_timer *timer);
>
> Here is how BPF program might look like:
> struct map_elem {
>     int counter;
>     struct bpf_timer timer;
> };
>
> struct {
>     __uint(type, BPF_MAP_TYPE_HASH);
>     __uint(max_entries, 1000);
>     __type(key, int);
>     __type(value, struct map_elem);
> } hmap SEC(".maps");
>
> static int timer_cb(void *map, int *key, struct map_elem *val);
> /* val points to particular map element that contains bpf_timer. */
>
> SEC("fentry/bpf_fentry_test1")
> int BPF_PROG(test1, int a)
> {
>     struct map_elem *val;
>     int key = 0;
>
>     val = bpf_map_lookup_elem(&hmap, &key);
>     if (val) {
>         bpf_timer_init(&val->timer, timer_cb, CLOCK_REALTIME);
>         bpf_timer_start(&val->timer, 1000 /* call timer_cb2 in 1 usec */);
>     }
> }
>
> This patch adds helper implementations that rely on hrtimers
> to call bpf functions as timers expire.
> The following patch adds necessary safety checks.
>
> Only programs with CAP_BPF are allowed to use bpf_timer.
>
> The amount of timers used by the program is constrained by
> the memcg recorded at map creation time.
>
> The bpf_timer_init() helper is receiving hidden 'map' and 'prog' arguments
> supplied by the verifier. The prog pointer is needed to do refcnting of bpf
> program to make sure that program doesn't get freed while timer is armed.
>
> The bpf_map_delete_elem() and bpf_map_update_elem() operations cancel
> and free the timer if given map element had it allocated.
> "bpftool map update" command can be used to cancel timers.
>
> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
> ---

Looks great!

Acked-by: Andrii Nakryiko <andrii@kernel.org>

>  include/linux/bpf.h            |   2 +
>  include/uapi/linux/bpf.h       |  40 ++++++
>  kernel/bpf/helpers.c           | 227 +++++++++++++++++++++++++++++++++
>  kernel/bpf/verifier.c          | 109 ++++++++++++++++
>  kernel/trace/bpf_trace.c       |   2 +-
>  scripts/bpf_doc.py             |   2 +
>  tools/include/uapi/linux/bpf.h |  40 ++++++
>  7 files changed, 421 insertions(+), 1 deletion(-)
>

[...]

> + *
> + * long bpf_timer_init(struct bpf_timer *timer, void *callback_fn, int flags)
> + *     Description
> + *             Initialize the timer to call *callback_fn* static function.
> + *             First 4 bits of *flags* specify clockid. Only CLOCK_MONOTONIC,
> + *             CLOCK_REALTIME, CLOCK_BOOTTIME are allowed.
> + *             All other bits of *flags* are reserved.
> + *     Return
> + *             0 on success.
> + *             **-EBUSY** if *timer* is already initialized.
> + *             **-EINVAL** if invalid *flags* are passed.
> + *
> + * long bpf_timer_start(struct bpf_timer *timer, u64 nsecs)
> + *     Description
> + *             Start the timer and set its expiration N nanoseconds from the
> + *             current time. The timer callback_fn will be invoked in soft irq
> + *             context on some cpu and will not repeat unless another
> + *             bpf_timer_start() is made. In such case the next invocation can
> + *             migrate to a different cpu.

This is a nice description, thanks.

> + *     Return
> + *             0 on success.
> + *             **-EINVAL** if *timer* was not initialized with bpf_timer_init() earlier.
> + *
> + * long bpf_timer_cancel(struct bpf_timer *timer)
> + *     Description
> + *             Cancel the timer and wait for callback_fn to finish if it was running.
> + *     Return
> + *             0 if the timer was not active.
> + *             1 if the timer was active.
> + *             **-EINVAL** if *timer* was not initialized with bpf_timer_init() earlier.
> + *             **-EDEADLK** if callback_fn tried to call bpf_timer_cancel() on its own timer
> + *             which would have led to a deadlock otherwise.
>   */

[...]

> +       ret = BPF_CAST_CALL(t->callback_fn)((u64)(long)map,
> +                                           (u64)(long)key,
> +                                           (u64)(long)t->value, 0, 0);
> +       WARN_ON(ret != 0); /* Next patch disallows 1 in the verifier */
> +
> +       /* The bpf function finished executed. Drop the prog refcnt.

typo: execution

> +        * It could reach zero here and trigger free of bpf_prog
> +        * and subsequent free of the maps that were holding timers.
> +        * If callback_fn called bpf_timer_start on this timer
> +        * the prog refcnt will be > 0.
> +        *
> +        * If callback_fn deleted map element the 't' could have been freed,
> +        * hence t->prog deref is done earlier.
> +        */
> +       bpf_prog_put(prog);
> +       this_cpu_write(hrtimer_running, NULL);
> +       return HRTIMER_NORESTART;
> +}
> +

[...]
Andrii Nakryiko June 15, 2021, 5:31 a.m. UTC | #10
On Mon, Jun 14, 2021 at 8:29 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Mon, Jun 14, 2021 at 9:51 AM Yonghong Song <yhs@fb.com> wrote:
> > > +     ret = BPF_CAST_CALL(t->callback_fn)((u64)(long)map,
> > > +                                         (u64)(long)key,
> > > +                                         (u64)(long)t->value, 0, 0);
> > > +     WARN_ON(ret != 0); /* Next patch disallows 1 in the verifier */
> >
> > I didn't find that next patch disallows callback return value 1 in the
> > verifier. If we indeed disallows return value 1 in the verifier. We
> > don't need WARN_ON here. Did I miss anything?
>
> Ohh. I forgot to address this bit in the verifier. Will fix.
>
> > > +     if (!hrtimer_active(&t->timer) || hrtimer_callback_running(&t->timer))
> > > +             /* If the timer wasn't active or callback already executing
> > > +              * bump the prog refcnt to keep it alive until
> > > +              * callback is invoked (again).
> > > +              */
> > > +             bpf_prog_inc(t->prog);
> >
> > I am not 100% sure. But could we have race condition here?
> >     cpu 1: running bpf_timer_start() helper call
> >     cpu 2: doing hrtimer work (calling callback etc.)
> >
> > Is it possible that
> >    !hrtimer_active(&t->timer) || hrtimer_callback_running(&t->timer)
> > may be true and then right before bpf_prog_inc(t->prog), it becomes
> > true? If hrtimer_callback_running() is called, it is possible that
> > callback function could have dropped the reference count for t->prog,
> > so we could already go into the body of the function
> > __bpf_prog_put()?
>
> you're correct. Indeed there is a race.
> Circular dependency is a never ending headache.
> That's the same design mistake as with tail_calls.
> It felt that this case would be simpler than tail_calls and a bpf program
> pinning itself with bpf_prog_inc can be made to work... nope.
> I'll get rid of this and switch to something 'obviously correct'.
> Probably a link list with a lock to keep a set of init-ed timers and
> auto-cancel them on prog refcnt going to zero.
> To do 'bpf daemon' the prog would need to be pinned.

Hm.. wouldn't this eliminate that race:

switch (hrtimer_try_to_cancel(&t->timer))
{
case 0:
    /* nothing was queued */
    bpf_prog_inc(t->prog);
    break;
case 1:
    /* already have refcnt and it won't be bpf_prog_put by callback */
    break;
case -1:
    /* callback is running and will bpf_prog_put, so we need to take another refcnt */
    bpf_prog_inc(t->prog);
    break;
}
hrtimer_start(&t->timer, ns_to_ktime(nsecs), HRTIMER_MODE_REL_SOFT);

So instead of guessing (racily) whether there is a queued callback or
not, try to cancel just in case there is. Then rely on the nice
guarantees that the hrtimer cancellation API provides.

Reading a bit more of the hrtimer API, I'm more concerned now with the
per-cpu variable (hrtimer_running). Seems like the timer can get
migrated from one CPU to another, so all the auxiliary per-CPU state
might get invalidated without us knowing about that.

But it's getting late, I'll think about all this a bit more tomorrow
with a fresh head.

>
> > > +     if (val) {
> > > +             /* This restriction will be removed in the next patch */
> > > +             verbose(env, "bpf_timer field can only be first in the map value element\n");
> > > +             return -EINVAL;
> > > +     }
> > > +     WARN_ON(meta->map_ptr);
> >
> > Could you explain when this could happen?
>
> Only if there is a verifier bug or new helper is added with arg to timer
> and arg to map. I'll switch to verbose() + efault instead.
Alexei Starovoitov June 15, 2021, 5:40 a.m. UTC | #11
On Mon, Jun 14, 2021 at 10:31 PM Andrii Nakryiko
<andrii.nakryiko@gmail.com> wrote:
>
> On Mon, Jun 14, 2021 at 8:29 PM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > On Mon, Jun 14, 2021 at 9:51 AM Yonghong Song <yhs@fb.com> wrote:
> > > > +     ret = BPF_CAST_CALL(t->callback_fn)((u64)(long)map,
> > > > +                                         (u64)(long)key,
> > > > +                                         (u64)(long)t->value, 0, 0);
> > > > +     WARN_ON(ret != 0); /* Next patch disallows 1 in the verifier */
> > >
> > > I didn't find that next patch disallows callback return value 1 in the
> > > verifier. If we indeed disallows return value 1 in the verifier. We
> > > don't need WARN_ON here. Did I miss anything?
> >
> > Ohh. I forgot to address this bit in the verifier. Will fix.
> >
> > > > +     if (!hrtimer_active(&t->timer) || hrtimer_callback_running(&t->timer))
> > > > +             /* If the timer wasn't active or callback already executing
> > > > +              * bump the prog refcnt to keep it alive until
> > > > +              * callback is invoked (again).
> > > > +              */
> > > > +             bpf_prog_inc(t->prog);
> > >
> > > I am not 100% sure. But could we have race condition here?
> > >     cpu 1: running bpf_timer_start() helper call
> > >     cpu 2: doing hrtimer work (calling callback etc.)
> > >
> > > Is it possible that
> > >    !hrtimer_active(&t->timer) || hrtimer_callback_running(&t->timer)
> > > may be true and then right before bpf_prog_inc(t->prog), it becomes
> > > true? If hrtimer_callback_running() is called, it is possible that
> > > callback function could have dropped the reference count for t->prog,
> > > so we could already go into the body of the function
> > > __bpf_prog_put()?
> >
> > you're correct. Indeed there is a race.
> > Circular dependency is a never ending headache.
> > That's the same design mistake as with tail_calls.
> > It felt that this case would be simpler than tail_calls and a bpf program
> > pinning itself with bpf_prog_inc can be made to work... nope.
> > I'll get rid of this and switch to something 'obviously correct'.
> > Probably a link list with a lock to keep a set of init-ed timers and
> > auto-cancel them on prog refcnt going to zero.
> > To do 'bpf daemon' the prog would need to be pinned.
>
> Hm.. wouldn't this eliminate that race:
>
> switch (hrtimer_try_to_cancel(&t->timer))
> {
> case 0:
>     /* nothing was queued */
>     bpf_prog_inc(t->prog);
>     break;
> case 1:
>     /* already have refcnt and it won't be bpf_prog_put by callback */
>     break;
> case -1:
>     /* callback is running and will bpf_prog_put, so we need to take
> another refcnt */
>     bpf_prog_inc(t->prog);
>     break;
> }
> hrtimer_start(&t->timer, ns_to_ktime(nsecs), HRTIMER_MODE_REL_SOFT);
>
> So instead of guessing (racily) whether there is a queued callback or
> not, try to cancel just in case there is. Then rely on the nice
> guarantees that hrtimer cancellation API provides.

I haven't thought it through yet, but the above approach could
indeed solve this particular race. Unfortunately there are other races.
There is an issue with bpf_timer_init. Since it doesn't take a refcnt,
another program might do a lookup and bpf_timer_start
while the first prog got to refcnt=0 and got freed.
Adding a refcnt to bpf_timer_init() makes the prog self-pinned
and no callback might ever be executed (if there was no bpf_timer_start),
so that will cause a high chance of the bpf prog being stuck in the kernel.
There could be ref+uref schemes similar to tail_calls to address all that,
but it gets ugly quickly.
imo all these issues and races are a sign that such self-pinning
shouldn't be allowed.
Cong Wang June 15, 2021, 6:10 a.m. UTC | #12
On Fri, Jun 11, 2021 at 11:45 AM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Thu, Jun 10, 2021 at 11:42:24PM -0700, Cong Wang wrote:
> > On Thu, Jun 10, 2021 at 9:27 PM Alexei Starovoitov
> > <alexei.starovoitov@gmail.com> wrote:
> > >
> > > From: Alexei Starovoitov <ast@kernel.org>
>
> Please stick to one email thread in the future, ok?
>
> I'll consolidate them here:
>
> > What is your use case to justify your own code? Asking because
> > you deny mine, so clearly my use case is not yours.
>
> I mentioned several use cases in the prior threads.
> tldr: any periodic event in tracing, networking, security.
> Garbage collection falls into this category as well, but please internalize
> that implementing conntrack as it is today in the kernel is an explicit non-goal.

You need to read my use case again; it is for the conntrack
in Cilium, not the kernel one.

>
> > And more importantly, why not just use BPF_TEST_RUN with
> > a user-space timer? Asking because you offer no API to read or
> > modify timer expiration, so literally the same with BPF_TEST_RUN
> > approach.
>
> a wrapper on top of hrtimer_get_remaining() like bpf_timer_get_remaining()
> is trivial to add, but what is the use case?

If you do not have any use case, then why not stick to BPF_TEST_RUN
with user-space timers? Then of course your patches are not needed
at all.

>
> > >
> > > Introduce 'struct bpf_timer { __u64 :64; __u64 :64; };' that can be embedded
> > > in hash/array/lru maps as regular field and helpers to operate on it:
> >
> > Can be or has to be? Huge difference here.
>
> map elements don't have to use timers.

You interpret this in the wrong way; what I asked is whether a bpf timer has
to be embedded in a map. IOW, can a bpf timer be standalone global
data?

>
> > In the other thread, you said it is global data, which implies that it does
> > not have to be in a map.
>
> global data is a map. That was explained in the prior thread as well.
>

I think you implied a bpf timer can exist without a map, hence I am asking.


> >
> > In your test case or your example, all timers are still in a map. So what has
> > changed since then? Looks nothing to me.
>
> look again?

Yes, I just looked at it again; it is only more confusing, not less.

>
> > Hmm, finally you begin refcounting it, which you were strongly against. ;)
>
> That was already answered in the prior thread.
> tldr: there were two options. This is one of them. Another can be added
> in the future as well.
>
> > Three questions:
> >
> > 1. Can t->prog be freed between bpf_timer_init() and bpf_timer_start()?
>
> yes.

Good. So if a program only initializes the timer and then exits,
the other program which shares this timer will crash when it calls
bpf_timer_start(), right?

>
> > If the timer subprog is always in the same prog which installs it, then
>
> installs it? I'm not following the quesiton.
>
> > this is fine. But then how do multiple programs share a timer?
>
> there is only one callback function.

That's exactly my question. How is one callback function shared
by multiple eBPF programs which want to share the timer?


>
> > In the
> > case of conntrack, either ingress or egress could install the timer,
> > it only depends which one gets traffic first. Do they have to copy
> > the same subprog for the same timer?
>
> conntrack is an explicit non-goal.

I interpret this as you do not want timers to be shared by multiple
eBPF programs, correct? Weirdly, maps are shared, timers are
placed in a map, so timers should be shared naturally too.

>
> >
> > 2. Can t->prog be freed between a timer expiration and bpf_timer_start()
> > again?
>
> If it's already armed with the first bpf_timer_start() it won't be freed.

Why? I see t->prog is released in your timer callback:

+       bpf_prog_put(prog);
+       this_cpu_write(hrtimer_running, NULL);
+       return HRTIMER_NORESTART;

>
> > It gets a refcnt when starting a timer and puts it when cancelling
> > or expired, so t->prog can be freed right after cancelling or expired. What
> > if another program which shares this timer wants to restart this timer?
>
> There is only one callback_fn per timer. Another program can share
> the struct bpf_timer and the map. It might have subprog callback_fn code
> that looks exactly the same as callback_fn in the first prog.
> For example when libbpf loads progs/timer.c (after it was compiled into .o)
> it might share a subprog in the future (when kernel has support for
> dynamic linking). From bpf user pov it's a single .c file.
> The split into programs and subprograms is an implemenation detail
> that C programmer doesn't need to worry about.

Not exactly; they share the same C file but can still be loaded/unloaded
separately. And logically, whether a timer has been initialized once or
twice makes a huge difference for programmers.

>
> > 3. Since you offer no API to read the expiration time, why not just use
> > BPF_TEST_RUN with a user-space timer? This is preferred by Andrii.
>
> Andrii point was that there should be no syscall cmds that replicate
> bpf_timer_init/start/cancel helpers. I agree with this.

Actually there is no strong reason to bother with a bpf timer unless you
want to access the timer itself, which mostly contains the expiration.
User-space timers work just fine for your cases; even if they did not,
extending BPF_TEST_RUN would.
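
For illustration, a rough user-space sketch of that approach (a timerfd
loop driving BPF_PROG_TEST_RUN; assumes libbpf's bpf_prog_test_run_opts()
and a prog type that accepts an empty ctx/data):

#include <stdint.h>
#include <time.h>
#include <unistd.h>
#include <sys/timerfd.h>
#include <bpf/bpf.h>

/* Run an already-loaded prog (prog_fd) every period_ns nanoseconds
 * (period_ns < 1s for brevity). ctx/data setup depends on the prog
 * type and is omitted here.
 */
static int run_periodically(int prog_fd, long period_ns)
{
	struct itimerspec its = {
		.it_interval.tv_nsec = period_ns,
		.it_value.tv_nsec    = period_ns,
	};
	int tfd = timerfd_create(CLOCK_MONOTONIC, 0);
	uint64_t expirations;

	if (tfd < 0 || timerfd_settime(tfd, 0, &its, NULL) < 0)
		return -1;

	for (;;) {
		/* blocks until the user-space timer fires */
		if (read(tfd, &expirations, sizeof(expirations)) != sizeof(expirations))
			break;
		LIBBPF_OPTS(bpf_test_run_opts, opts);
		if (bpf_prog_test_run_opts(prog_fd, &opts))
			break;
	}
	close(tfd);
	return 0;
}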

>
>
> > Thanks.
> >
> > Another unpopular point of view:
> >
> > This init() is not suitable for bpf programs, because unlike kernel modules,
> > there is no init or exit functions for a bpf program. And timer init
> > is typically
> > called during module init.
>
> Already answered this in the prior thread. There will be __init and __fini like
> subprograms in bpf progs.

I interpret this as init does not make sense until we have __init and __fini?

>
> Please apply the patches to your local tree and do few experiments based
> on selftests/bpf/progs/timer.c. I think experimenting with the code
> will answer all of your questions.

Sounds like you found a great excuse for a failure of documentation.
What I asked about are just fundamental design questions you should have
covered in your cover letter, which is literally empty.

Thanks.
Andrii Nakryiko June 15, 2021, 3:24 p.m. UTC | #13
On Mon, Jun 14, 2021 at 10:41 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Mon, Jun 14, 2021 at 10:31 PM Andrii Nakryiko
> <andrii.nakryiko@gmail.com> wrote:
> >
> > On Mon, Jun 14, 2021 at 8:29 PM Alexei Starovoitov
> > <alexei.starovoitov@gmail.com> wrote:
> > >
> > > On Mon, Jun 14, 2021 at 9:51 AM Yonghong Song <yhs@fb.com> wrote:
> > > > > +     ret = BPF_CAST_CALL(t->callback_fn)((u64)(long)map,
> > > > > +                                         (u64)(long)key,
> > > > > +                                         (u64)(long)t->value, 0, 0);
> > > > > +     WARN_ON(ret != 0); /* Next patch disallows 1 in the verifier */
> > > >
> > > > I didn't find that next patch disallows callback return value 1 in the
> > > > verifier. If we indeed disallows return value 1 in the verifier. We
> > > > don't need WARN_ON here. Did I miss anything?
> > >
> > > Ohh. I forgot to address this bit in the verifier. Will fix.
> > >
> > > > > +     if (!hrtimer_active(&t->timer) || hrtimer_callback_running(&t->timer))
> > > > > +             /* If the timer wasn't active or callback already executing
> > > > > +              * bump the prog refcnt to keep it alive until
> > > > > +              * callback is invoked (again).
> > > > > +              */
> > > > > +             bpf_prog_inc(t->prog);
> > > >
> > > > I am not 100% sure. But could we have race condition here?
> > > >     cpu 1: running bpf_timer_start() helper call
> > > >     cpu 2: doing hrtimer work (calling callback etc.)
> > > >
> > > > Is it possible that
> > > >    !hrtimer_active(&t->timer) || hrtimer_callback_running(&t->timer)
> > > > may be true and then right before bpf_prog_inc(t->prog), it becomes
> > > > true? If hrtimer_callback_running() is called, it is possible that
> > > > callback function could have dropped the reference count for t->prog,
> > > > so we could already go into the body of the function
> > > > __bpf_prog_put()?
> > >
> > > you're correct. Indeed there is a race.
> > > Circular dependency is a never ending headache.
> > > That's the same design mistake as with tail_calls.
> > > It felt that this case would be simpler than tail_calls and a bpf program
> > > pinning itself with bpf_prog_inc can be made to work... nope.
> > > I'll get rid of this and switch to something 'obviously correct'.
> > > Probably a link list with a lock to keep a set of init-ed timers and
> > > auto-cancel them on prog refcnt going to zero.
> > > To do 'bpf daemon' the prog would need to be pinned.
> >
> > Hm.. wouldn't this eliminate that race:
> >
> > switch (hrtimer_try_to_cancel(&t->timer))
> > {
> > case 0:
> >     /* nothing was queued */
> >     bpf_prog_inc(t->prog);
> >     break;
> > case 1:
> >     /* already have refcnt and it won't be bpf_prog_put by callback */
> >     break;
> > case -1:
> >     /* callback is running and will bpf_prog_put, so we need to take
> > another refcnt */
> >     bpf_prog_inc(t->prog);
> >     break;
> > }
> > hrtimer_start(&t->timer, ns_to_ktime(nsecs), HRTIMER_MODE_REL_SOFT);
> >
> > So instead of guessing (racily) whether there is a queued callback or
> > not, try to cancel just in case there is. Then rely on the nice
> > guarantees that hrtimer cancellation API provides.
>
> I haven't thought it through yet, but the above approach could
> indeed solve this particular race. Unfortunately there are other races.
> There is an issue with bpf_timer_init. Since it doesn't take refcnt
> another program might do lookup and bpf_timer_start
> while the first prog got to refcnt=0 and got freed.

I think it's because of the API design. bpf_timer_init() takes a
callback (i.e., bpf_prog) but doesn't really do anything with it (so
doesn't take a refcnt). It's both problematic, as you point out, and
unnecessarily restricting because it doesn't allow changing the
callback (e.g., when the map is shared and the bpf_program has to be
changed). If you change the API to be:

long bpf_timer_init(struct bpf_timer *timer, int flags);
long bpf_timer_start(struct bpf_timer *timer, void *callback_fn, u64 nsecs);

You'll avoid this problem because bpf_timer_start will take refcnt
when arming (or re-arming) the timer. bpf_timer_init() will only take
care of initial memory allocation and hrtimer_init, but will leave
timer->prog as NULL until bpf_timer_start(). Wouldn't that solve all
the problems and be more flexible/powerful? If necessary, we can teach
bpf_timer_cb() to take spinlock briefly to avoid races fetching prog
pointer, but I haven't thought much about whether that's necessary.
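
For illustration, a hypothetical BPF-side sketch of how that split might
be used (assumes the proposed signatures above and a map value that
embeds struct bpf_timer, as in the selftest):

/* assumes the usual includes (vmlinux.h, bpf_helpers.h, bpf_tracing.h)
 * and a hash map 'hmap' with 'struct elem' values
 */
struct elem {
	int counter;
	struct bpf_timer timer;
};

static int timer_cb(void *map, int *key, struct elem *val)
{
	val->counter++;
	/* re-arm from the callback; a different callback could be passed here */
	bpf_timer_start(&val->timer, timer_cb, 1000000 /* 1 msec */);
	return 0;
}

SEC("fentry/bpf_fentry_test1")
int BPF_PROG(arm_timer, int a)
{
	struct elem *val;
	int key = 0;

	val = bpf_map_lookup_elem(&hmap, &key);
	if (!val)
		return 0;
	/* init no longer takes the callback; the prog refcnt would be
	 * taken by bpf_timer_start() instead
	 */
	bpf_timer_init(&val->timer, CLOCK_MONOTONIC);
	bpf_timer_start(&val->timer, timer_cb, 1000000);
	return 0;
}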

If we wanted to push this to extreme, btw, we don't really need
bpf_timer_init(), bpf_timer_start() can do bpf_hrtimer allocation the
very first time (having pre-allocated spinlock makes this non-racy and
easy). But I don't know how expensive hrtimer_init() is, so it might
still make sense to split those two operations. Further, merging
bpf_timer_init() and bpf_timer_start() would require 6 input
arguments, which is a bit problematic. I have an idea how to get rid
of the necessity to pass in bpf_prog (so we'll be fine with just 5
explicit arguments), which simplifies other things (like
bpf_cgroup_storage implementation) as well, but I don't have patches
yet.

> Adding refcnt to bpf_timer_init() makes the prog self pinned
> and no callback might ever be executed (if there were no bpf_timer_start),
> so that will cause a high chance of bpf prog stuck in the kernel.
> There could be ref+uref schemes similar to tail_calls to address all that,
> but it gets ugly quickly.
> imo all these issues and races is a sign that such self pinning
> shouldn't be allowed.

I think prog refcounting is actually the saner and less surprising
approach here and we just need to spend a bit more time thinking how
to make everything work reliably. hrtimer API seems to be designed to
handle cases like this, which makes everything much easier.
Alexei Starovoitov June 16, 2021, 4:26 a.m. UTC | #14
On Tue, Jun 15, 2021 at 08:24:13AM -0700, Andrii Nakryiko wrote:
> On Mon, Jun 14, 2021 at 10:41 PM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > On Mon, Jun 14, 2021 at 10:31 PM Andrii Nakryiko
> > <andrii.nakryiko@gmail.com> wrote:
> > >
> > > On Mon, Jun 14, 2021 at 8:29 PM Alexei Starovoitov
> > > <alexei.starovoitov@gmail.com> wrote:
> > > >
> > > > On Mon, Jun 14, 2021 at 9:51 AM Yonghong Song <yhs@fb.com> wrote:
> > > > > > +     ret = BPF_CAST_CALL(t->callback_fn)((u64)(long)map,
> > > > > > +                                         (u64)(long)key,
> > > > > > +                                         (u64)(long)t->value, 0, 0);
> > > > > > +     WARN_ON(ret != 0); /* Next patch disallows 1 in the verifier */
> > > > >
> > > > > I didn't find that next patch disallows callback return value 1 in the
> > > > > verifier. If we indeed disallows return value 1 in the verifier. We
> > > > > don't need WARN_ON here. Did I miss anything?
> > > >
> > > > Ohh. I forgot to address this bit in the verifier. Will fix.
> > > >
> > > > > > +     if (!hrtimer_active(&t->timer) || hrtimer_callback_running(&t->timer))
> > > > > > +             /* If the timer wasn't active or callback already executing
> > > > > > +              * bump the prog refcnt to keep it alive until
> > > > > > +              * callback is invoked (again).
> > > > > > +              */
> > > > > > +             bpf_prog_inc(t->prog);
> > > > >
> > > > > I am not 100% sure. But could we have race condition here?
> > > > >     cpu 1: running bpf_timer_start() helper call
> > > > >     cpu 2: doing hrtimer work (calling callback etc.)
> > > > >
> > > > > Is it possible that
> > > > >    !hrtimer_active(&t->timer) || hrtimer_callback_running(&t->timer)
> > > > > may be true and then right before bpf_prog_inc(t->prog), it becomes
> > > > > true? If hrtimer_callback_running() is called, it is possible that
> > > > > callback function could have dropped the reference count for t->prog,
> > > > > so we could already go into the body of the function
> > > > > __bpf_prog_put()?
> > > >
> > > > you're correct. Indeed there is a race.
> > > > Circular dependency is a never ending headache.
> > > > That's the same design mistake as with tail_calls.
> > > > It felt that this case would be simpler than tail_calls and a bpf program
> > > > pinning itself with bpf_prog_inc can be made to work... nope.
> > > > I'll get rid of this and switch to something 'obviously correct'.
> > > > Probably a link list with a lock to keep a set of init-ed timers and
> > > > auto-cancel them on prog refcnt going to zero.
> > > > To do 'bpf daemon' the prog would need to be pinned.
> > >
> > > Hm.. wouldn't this eliminate that race:
> > >
> > > switch (hrtimer_try_to_cancel(&t->timer))
> > > {
> > > case 0:
> > >     /* nothing was queued */
> > >     bpf_prog_inc(t->prog);
> > >     break;
> > > case 1:
> > >     /* already have refcnt and it won't be bpf_prog_put by callback */
> > >     break;
> > > case -1:
> > >     /* callback is running and will bpf_prog_put, so we need to take
> > > another refcnt */
> > >     bpf_prog_inc(t->prog);
> > >     break;
> > > }
> > > hrtimer_start(&t->timer, ns_to_ktime(nsecs), HRTIMER_MODE_REL_SOFT);

It turned out that this approach has the same race Yonghong mentioned.
Calling hrtimer_callback_running() directly or through hrtimer_try_to_cancel()
with the extra cpu_base->lock as above doesn't prevent the race.
bpf_prog_put could have already happened and above case:
 case -1:
     /* callback is running and will bpf_prog_put, so we need to take
 another refcnt */
     bpf_prog_inc(t->prog);

would be incrementing refcnt from zero.
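
Spelled out as a timeline (a rough sketch of the problematic ordering,
not exact code):

/* CPU 1: bpf_timer_start()           CPU 2: bpf_timer_cb()
 *   hrtimer_try_to_cancel() -> -1
 *                                      callback_fn() returns
 *                                      bpf_prog_put(prog)   refcnt -> 0
 *                                      prog gets freed
 *   case -1:
 *     bpf_prog_inc(t->prog)            increments refcnt of freed memory
 */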

> > >
> > > So instead of guessing (racily) whether there is a queued callback or
> > > not, try to cancel just in case there is. Then rely on the nice
> > > guarantees that hrtimer cancellation API provides.
> >
> > I haven't thought it through yet, but the above approach could
> > indeed solve this particular race. Unfortunately there are other races.
> > There is an issue with bpf_timer_init. Since it doesn't take refcnt
> > another program might do lookup and bpf_timer_start
> > while the first prog got to refcnt=0 and got freed.
> 
> I think it's because of an API design. bpf_timer_init() takes a
> callback (i.e., bpf_prog) but doesn't really do anything with it (so
> doesn't take refcnt). It's both problematic, as you point out, and
> unnecessarily restricting because it doesn't allow to change the
> callback (e.g., when map is shared and bpf_program has to be changed).
> If you change API to be:
> 
> long bpf_timer_init(struct bpf_timer *timer, int flags);
> long bpf_timer_start(struct bpf_timer *timer, void *callback_fn, u64 nsecs);
> 
> You'll avoid this problem because bpf_timer_start will take refcnt
> when arming (or re-arming) the timer. bpf_timer_init() will only take
> care of initial memory allocation and hrtimer_init, but will leave
> timer->prog as NULL until bpf_timer_start(). Wouldn't that solve all
> the problems and be more flexible/powerful?

Unfortunately no. The race would still be present and I don't see
a clean way of solving it.

> If necessary, we can teach
> bpf_timer_cb() to take spinlock briefly to avoid races fetching prog
> pointer, but I haven't thought much about whether that's necessary.

That doesn't help either.
hrtimer_try_to_cancel() returning -1 (or checking it via
hrtimer_callback_running) doesn't mean that refcnt > 0.
It could have reached zero in bpf_timer_cb.
I wondered whether introducing bpf_prog_inc_if_not_zero()
and using it in bpf_timer_start() could solve it...
Nope. The prog pointer could already be freed if processing
of bpf_timer_cb is slow enough.
Then I wondered whether we could move the refcnt from prog into
'struct bpf_timer_kern'...
Then considered ref/uref counting...
It's a slippery slope of wrong turns.
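
For reference, a helper along these lines already exists in
kernel/bpf/syscall.c; bpf_prog_inc_not_zero() is roughly the following,
and it only helps when the prog pointer itself is known to still be
valid, which is exactly what can't be guaranteed here:

/* roughly the existing helper; caller must know prog isn't freed yet */
struct bpf_prog *bpf_prog_inc_not_zero(struct bpf_prog *prog)
{
	int refold;

	refold = atomic64_fetch_add_unless(&prog->aux->refcnt, 1, 0);
	if (!refold)
		return ERR_PTR(-ENOENT);
	return prog;
}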

> If we wanted to push this to extreme, btw, we don't really need
> bpf_timer_init(), bpf_timer_start() can do bpf_hrtimer allocation the
> very first time (having pre-allocated spinlock makes this non-racy and
> easy). But I don't know how expensive hrtimer_init() is, so it might
> still make sense to split those two operations.

hrtimer_init is cheap, but bpf_timer_init() is expensive due
to memory allocation.
It's done once, so arguably should be ok,
but I'd like to avoid reinventing the wheel and stick
to api-s similar to hrtimer.
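
For reference, the in-kernel hrtimer pattern these helpers mirror is
roughly the following (a minimal sketch, not tied to any subsystem;
names are made up):

#include <linux/hrtimer.h>
#include <linux/ktime.h>

static struct hrtimer my_timer;

static enum hrtimer_restart my_timer_fn(struct hrtimer *timer)
{
	/* do work; hrtimer_forward_now() + HRTIMER_RESTART would re-arm */
	return HRTIMER_NORESTART;
}

static void my_setup(void)
{
	hrtimer_init(&my_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL_SOFT);
	my_timer.function = my_timer_fn;
	hrtimer_start(&my_timer, ns_to_ktime(NSEC_PER_MSEC), HRTIMER_MODE_REL_SOFT);
}

static void my_teardown(void)
{
	/* waits for a running callback to finish, like bpf_timer_cancel() */
	hrtimer_cancel(&my_timer);
}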

> Further, merging
> bpf_timer_init() and bpf_timer_start() would require 6 input
> arguments, which is a bit problematic. I have an idea how to get rid
> of the necessity to pass in bpf_prog (so we'll be fine with just 5
> explicit arguments), which simplifies other things (like
> bpf_cgroup_storage implementation) as well, but I don't have patches
> yet.
> 
> > Adding refcnt to bpf_timer_init() makes the prog self pinned
> > and no callback might ever be executed (if there were no bpf_timer_start),
> > so that will cause a high chance of bpf prog stuck in the kernel.
> > There could be ref+uref schemes similar to tail_calls to address all that,
> > but it gets ugly quickly.
> > imo all these issues and races is a sign that such self pinning
> > shouldn't be allowed.
> 
> I think prog refcounting is actually the saner and less surprising
> approach here and we just need to spend a bit more time thinking how
> to make everything work reliably. hrtimer API seems to be designed to
> handle cases like this, which makes everything much easier.

I made two 180 degree turns already. In the beginning I was strongly
against circular dependencies, since the old history of tail_call taught
us a valuable lesson. Then I somehow convinced myself that this time
around it would be different and went with this crazy refcnting scheme.
Over the last couple of weeks you, me, Yonghong, Toke and others invested
countless hours thinking through the race conditions. It's a sign that
the design took the wrong turn.
Circular dependencies must be avoided if possible. Here it's such a case.
There is no need to complicate bpf_timer with crazy refcnting schemes.
The user space can simply pin the program in bpffs. In the future we might
introduce a self-pinning helper that would pin the program and create a file.
Sort-of like syscall prog type would pin self.
That would be explicit and clean api instead of obscure hacks inside bpf_timer*().
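
For illustration, that user-space pinning is a couple of lines with
libbpf (program name and path below are hypothetical); bpftool prog pin
does the same from the shell:

#include <bpf/libbpf.h>

/* keep the prog (and any timers it armed) alive after this process
 * exits by pinning it in bpffs
 */
static int pin_timer_prog(struct bpf_object *obj)
{
	struct bpf_program *prog;

	prog = bpf_object__find_program_by_name(obj, "timer_prog");
	if (!prog)
		return -1;
	return bpf_program__pin(prog, "/sys/fs/bpf/timer_prog");
}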
Alexei Starovoitov June 16, 2021, 4:53 a.m. UTC | #15
On Mon, Jun 14, 2021 at 11:10:46PM -0700, Cong Wang wrote:
> On Fri, Jun 11, 2021 at 11:45 AM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > On Thu, Jun 10, 2021 at 11:42:24PM -0700, Cong Wang wrote:
> > > On Thu, Jun 10, 2021 at 9:27 PM Alexei Starovoitov
> > > <alexei.starovoitov@gmail.com> wrote:
> > > >
> > > > From: Alexei Starovoitov <ast@kernel.org>
> >
> > Please stick to one email thread in the future, ok?
> >
> > I'll consolidate them here:
> >
> > > What is your use case to justify your own code? Asking because
> > > you deny mine, so clearly my use case is not yours.
> >
> > I mentioned several use cases in the prior threads.
> > tldr: any periodic event in tracing, networking, security.
> > Garbage collection falls into this category as well, but please internalize
> > that implementing conntrack as it is today in the kernel is an explicit non-goal.
> 
> You need to read my use case again, it is for the conntrack
> in Cilium, not the kernel one.

Cong,

the toxicity in your reply is a bit too much.
Clearly you're very upset with something.
You cannot make peace with the fact that your timer patches were rejected?
You realize that they were 'changes requested' with specific
feedback on what you can do?
You've said that it's not possible and came up with reasons to believe so.
I had no choice but to implement the suggested changes myself.
Why? Because the requests for bpf timers came up many times over the years.
The first time was in 2013, when bpf didn't even exist upstream.
It was still in RFC stages. Then during XDP development, etc.
Every time, folks who needed timers found a workaround.
Even today the garbage collection can be implemented in the bpf prog
without the PROG_RUN cmd and user space. The prog can be attached
to a periodic perf event.
If cilium conntrack needed garbage collection they could have
implemented it long ago without user space daemons and kernel additions.
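
For illustration, such a periodic attachment might look roughly like
this with libbpf and a software perf event (frequency, cpu and names
are illustrative):

#include <unistd.h>
#include <sys/syscall.h>
#include <linux/perf_event.h>
#include <bpf/libbpf.h>

/* invoke a SEC("perf_event") prog ~10 times a second on CPU 0 */
static struct bpf_link *attach_periodic(struct bpf_program *prog)
{
	struct perf_event_attr attr = {
		.type        = PERF_TYPE_SOFTWARE,
		.config      = PERF_COUNT_SW_CPU_CLOCK,
		.size        = sizeof(attr),
		.freq        = 1,
		.sample_freq = 10,
	};
	int pfd = syscall(__NR_perf_event_open, &attr, -1 /* any pid */,
			  0 /* cpu 0 */, -1 /* no group */, 0);

	if (pfd < 0)
		return NULL;
	return bpf_program__attach_perf_event(prog, pfd);
}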

> > > 1. Can t->prog be freed between bpf_timer_init() and bpf_timer_start()?
> >
> > yes.
> 
> Good. So if a program which only initializes the timer and then exits,
> the other program which shares this timer will crash when it calls
> bpf_timer_start(), right?

Correct. There is a discussion about it in a different thread.

> 
> >
> > > If the timer subprog is always in the same prog which installs it, then
> >
> > installs it? I'm not following the question.
> >
> > > this is fine. But then how do multiple programs share a timer?
> >
> > there is only one callback function.
> 
> That's exactly my question. How is one callback function shared
> by multiple eBPF programs which want to share the timer?

The callback function is a piece of .text. Inside a C program it's written
once and then compiled once into a single piece of BPF asm code by llvm.
Later libbpf can copy-paste it into many bpf programs.
The C programmer doesn't see it and doesn't need to worry about it.
From the kernel memory consumption point of view it's a bit inefficient.
In the future we will add support for dynamic linking in the kernel
and in libbpf. The bpf progs will be able to share already loaded subprograms.
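
For illustration, the C programmer's view might be roughly the following
(a hypothetical sketch; assumes the map_elem/hmap layout from the example
in the cover letter and the usual BPF includes):

/* one callback, written once in the .c file ... */
static int timer_cb(void *map, int *key, struct map_elem *val)
{
	val->counter++;
	return 0;
}

static __always_inline void arm(void)
{
	struct map_elem *val;
	int key = 0;

	val = bpf_map_lookup_elem(&hmap, &key);
	if (!val)
		return;
	/* init returns -EBUSY after the first call; ignored here */
	bpf_timer_init(&val->timer, timer_cb, CLOCK_MONOTONIC);
	bpf_timer_start(&val->timer, 1000000 /* 1 msec */);
}

/* ... referenced from two independent programs. Today each program
 * carries its own copy of timer_cb's instructions; with dynamic
 * linking they could share one loaded copy.
 */
SEC("tc")
int handle_ingress(struct __sk_buff *skb)
{
	arm();
	return 0;
}

SEC("tc")
int handle_egress(struct __sk_buff *skb)
{
	arm();
	return 0;
}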

> 
> >
> > > In the
> > > case of conntrack, either ingress or egress could install the timer,
> > > it only depends which one gets traffic first. Do they have to copy
> > > the same subprog for the same timer?
> >
> > conntrack is an explicit non-goal.
> 
> I interpret this as you do not want timers to be shared by multiple
> eBPF programs, correct? Weirdly, maps are shared, timers are
> placed in a map, so timers should be shared naturally too.

I only meant that re-implementing kernel conntrack as a bpf program
is an explicit non-goal.
Meaning that people can do it and with this timer api it can be easily done.
But it's explicitly excluded from api requirements.
It doesn't mean that it's hard to do. It means that reimplementing kernel
conntrack as-is, with its non-scalable garbage collection algorithm,
is outside the scope of this work.

> > > 2. Can t->prog be freed between a timer expiration and bpf_timer_start()
> > > again?
> >
> > If it's already armed with the first bpf_timer_start() it won't be freed.
> 
> Why? I see t->prog is released in your timer callback:

That's another race and there is another thread where it's being discussed.
Please read the whole thread to get on the same page with everyone else.

> 
> Not exactly, they share the same C file but still can be loaded/unloaded
> separately. And logically, whether a timer has been initialized once or
> twice makes a huge difference for programmers.

I've looked at how the kernel uses hrtimer APIs and didn't find a single
case where the timer function is redefined or more than one timer callback
is used with a single hrtimer.

> Sounds like you found a great excuse for a failure of documentation.

I got different feedback regarding documentation that is already present in
the patches, but I'll expand the comments and documentation further
to make it more clear.
Andrii Nakryiko June 16, 2021, 5:54 a.m. UTC | #16
On Tue, Jun 15, 2021 at 9:26 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Tue, Jun 15, 2021 at 08:24:13AM -0700, Andrii Nakryiko wrote:
> > On Mon, Jun 14, 2021 at 10:41 PM Alexei Starovoitov
> > <alexei.starovoitov@gmail.com> wrote:
> > >
> > > On Mon, Jun 14, 2021 at 10:31 PM Andrii Nakryiko
> > > <andrii.nakryiko@gmail.com> wrote:
> > > >
> > > > On Mon, Jun 14, 2021 at 8:29 PM Alexei Starovoitov
> > > > <alexei.starovoitov@gmail.com> wrote:
> > > > >
> > > > > On Mon, Jun 14, 2021 at 9:51 AM Yonghong Song <yhs@fb.com> wrote:
> > > > > > > +     ret = BPF_CAST_CALL(t->callback_fn)((u64)(long)map,
> > > > > > > +                                         (u64)(long)key,
> > > > > > > +                                         (u64)(long)t->value, 0, 0);
> > > > > > > +     WARN_ON(ret != 0); /* Next patch disallows 1 in the verifier */
> > > > > >
> > > > > > I didn't find that next patch disallows callback return value 1 in the
> > > > > > verifier. If we indeed disallows return value 1 in the verifier. We
> > > > > > don't need WARN_ON here. Did I miss anything?
> > > > >
> > > > > Ohh. I forgot to address this bit in the verifier. Will fix.
> > > > >
> > > > > > > +     if (!hrtimer_active(&t->timer) || hrtimer_callback_running(&t->timer))
> > > > > > > +             /* If the timer wasn't active or callback already executing
> > > > > > > +              * bump the prog refcnt to keep it alive until
> > > > > > > +              * callback is invoked (again).
> > > > > > > +              */
> > > > > > > +             bpf_prog_inc(t->prog);
> > > > > >
> > > > > > I am not 100% sure. But could we have race condition here?
> > > > > >     cpu 1: running bpf_timer_start() helper call
> > > > > >     cpu 2: doing hrtimer work (calling callback etc.)
> > > > > >
> > > > > > Is it possible that
> > > > > >    !hrtimer_active(&t->timer) || hrtimer_callback_running(&t->timer)
> > > > > > may be true and then right before bpf_prog_inc(t->prog), it becomes
> > > > > > true? If hrtimer_callback_running() is called, it is possible that
> > > > > > callback function could have dropped the reference count for t->prog,
> > > > > > so we could already go into the body of the function
> > > > > > __bpf_prog_put()?
> > > > >
> > > > > you're correct. Indeed there is a race.
> > > > > Circular dependency is a never ending headache.
> > > > > That's the same design mistake as with tail_calls.
> > > > > It felt that this case would be simpler than tail_calls and a bpf program
> > > > > pinning itself with bpf_prog_inc can be made to work... nope.
> > > > > I'll get rid of this and switch to something 'obviously correct'.
> > > > > Probably a link list with a lock to keep a set of init-ed timers and
> > > > > auto-cancel them on prog refcnt going to zero.
> > > > > To do 'bpf daemon' the prog would need to be pinned.
> > > >
> > > > Hm.. wouldn't this eliminate that race:
> > > >
> > > > switch (hrtimer_try_to_cancel(&t->timer))
> > > > {
> > > > case 0:
> > > >     /* nothing was queued */
> > > >     bpf_prog_inc(t->prog);
> > > >     break;
> > > > case 1:
> > > >     /* already have refcnt and it won't be bpf_prog_put by callback */
> > > >     break;
> > > > case -1:
> > > >     /* callback is running and will bpf_prog_put, so we need to take
> > > > another refcnt */
> > > >     bpf_prog_inc(t->prog);
> > > >     break;
> > > > }
> > > > hrtimer_start(&t->timer, ns_to_ktime(nsecs), HRTIMER_MODE_REL_SOFT);
>
> Turned out that this approach has the same race as Yonghong mentioned.
> Calling hrtimer_callback_running() directly or through hrtimer_try_to_cancel()
> with extra cpu_base->lock as in above doesn't prevent the race.
> bpf_prog_put could have already happened and above case:
>  case -1:
>      /* callback is running and will bpf_prog_put, so we need to take
>  another refcnt */
>      bpf_prog_inc(t->prog);
>
> would be incrementing refcnt from zero.
>
> > > >
> > > > So instead of guessing (racily) whether there is a queued callback or
> > > > not, try to cancel just in case there is. Then rely on the nice
> > > > guarantees that hrtimer cancellation API provides.
> > >
> > > I haven't thought it through yet, but the above approach could
> > > indeed solve this particular race. Unfortunately there are other races.
> > > There is an issue with bpf_timer_init. Since it doesn't take refcnt
> > > another program might do lookup and bpf_timer_start
> > > while the first prog got to refcnt=0 and got freed.
> >
> > I think it's because of an API design. bpf_timer_init() takes a
> > callback (i.e., bpf_prog) but doesn't really do anything with it (so
> > doesn't take refcnt). It's both problematic, as you point out, and
> > unnecessarily restricting because it doesn't allow to change the
> > callback (e.g., when map is shared and bpf_program has to be changed).
> > If you change API to be:
> >
> > long bpf_timer_init(struct bpf_timer *timer, int flags);
> > long bpf_timer_start(struct bpf_timer *timer, void *callback_fn, u64 nsecs);
> >
> > You'll avoid this problem because bpf_timer_start will take refcnt
> > when arming (or re-arming) the timer. bpf_timer_init() will only take
> > care of initial memory allocation and hrtimer_init, but will leave
> > timer->prog as NULL until bpf_timer_start(). Wouldn't that solve all
> > the problems and be more flexible/powerful?
>
> Unfortunately no. The race would still be present and I don't see
> a clean way of solving.
>
> > If necessary, we can teach
> > bpf_timer_cb() to take spinlock briefly to avoid races fetching prog
> > pointer, but I haven't thought much about whether that's necessary.
>
> That doesn't help either.
> hrtimer_try_to_cancel() returning -1 (or checking it via
> hrtimer_callback_running) doesn't mean that refcnt > 0.
> It could have reached zero in bpf_timer_cb.
> I thought whether introducing bpf_prog_inc_if_not_zero()
> and using it in bpf_timer_start() could solve it...
> Nope. The prog pointer could already be freed if processing
> of bpf_timer_cb is slow enough.
> Then I thought whether we can move refcnt from prog into
> 'struct bpf_timer_kern'...
> Then considered ref/uref counting...
> It's a slippery slope of wrong turns.
>
> > If we wanted to push this to extreme, btw, we don't really need
> > bpf_timer_init(), bpf_timer_start() can do bpf_hrtimer allocation the
> > very first time (having pre-allocated spinlock makes this non-racy and
> > easy). But I don't know how expensive hrtimer_init() is, so it might
> > still make sense to split those two operations.
>
> hrtimer_init is cheap, but bpf_timer_init() is expensive due
> to memory allocation.
> It's done once, so arguably should be ok,
> but I'd like to avoid reinventing the wheel and stick
> to api-s similar to hrtimer.
>
> > Further, merging
> > bpf_timer_init() and bpf_timer_start() would require 6 input
> > arguments, which is a bit problematic. I have an idea how to get rid
> > of the necessity to pass in bpf_prog (so we'll be fine with just 5
> > explicit arguments), which simplifies other things (like
> > bpf_cgroup_storage implementation) as well, but I don't have patches
> > yet.
> >
> > > Adding refcnt to bpf_timer_init() makes the prog self pinned
> > > and no callback might ever be executed (if there were no bpf_timer_start),
> > > so that will cause a high chance of bpf prog stuck in the kernel.
> > > There could be ref+uref schemes similar to tail_calls to address all that,
> > > but it gets ugly quickly.
> > > imo all these issues and races is a sign that such self pinning
> > > shouldn't be allowed.
> >
> > I think prog refcounting is actually the saner and less surprising
> > approach here and we just need to spend a bit more time thinking how
> > to make everything work reliably. hrtimer API seems to be designed to
> > handle cases like this, which makes everything much easier.
>
> I made two 180 degree turns already. In the beginning I was strongly
> against circular dependencies since old history of tail_call taught us
> a valuable lesson. Then somehow convinced myself that this time around it will
> be different and went with this crazy refcnting scheme. The last couple
> weeks you, me, Yonghong, Toke and others invested countless hours thinking
> through the race conditions. It's a sign that design took the wrong turn.
> Circular dependencies must be avoided if possible. Here it's such a case.

It could be the case, of course. But let's try to think this through
to the end before giving up. I think it's mostly because we are trying
to be too clever with lockless synchronization. I'm still not
convinced that it's a bad design.

I had a feeling that bpf_timer_cb needs to take lock as well. But once
we add that, refcounting becomes simpler and more deterministic, IMO.
Here's what I have in mind. I keep only important parts of the code,
so it's not a complete implementation. Please take a look below, I
left a few comments here and there.


struct bpf_hrtimer {
       struct hrtimer timer;
       struct bpf_map *map;
       void *value;

       struct bpf_prog *prog;
       void *callback_fn;

       /* pointer to that lock in struct bpf_timer_kern
        * so that we can access it from bpf_timer_cb()
        */
       struct bpf_spin_lock *lock;
};

BPF_CALL_3(bpf_timer_init, struct bpf_timer_kern *, timer, int, flags,
           struct bpf_map *, map)
{
       struct bpf_hrtimer *t;
       int ret = 0;

       ____bpf_spin_lock(&timer->lock);
       t = timer->timer;
       if (t) {
               ret = -EBUSY;
               goto out;
       }
       /* allocate hrtimer via map_kmalloc to use memcg accounting */
       t = bpf_map_kmalloc_node(map, sizeof(*t), GFP_ATOMIC, NUMA_NO_NODE);
       if (!t) {
               ret = -ENOMEM;
               goto out;
       }
       t->value = (void *)timer /* - offset of bpf_timer inside elem */;
       t->map = map;
       t->timer.function = bpf_timer_cb;

       /* we'll init them in bpf_timer_start */
       t->prog = NULL;
       t->callback_fn = NULL;

       hrtimer_init(&t->timer, clockid, HRTIMER_MODE_REL_SOFT);
       timer->timer = t;
out:
       ____bpf_spin_unlock(&timer->lock);
       return ret;
}


BPF_CALL_4(bpf_timer_start, struct bpf_timer_kern *, timer, u64, nsecs,
           void *, cb, struct bpf_prog *, prog)
{
       struct bpf_hrtimer *t;
       int ret = 0;

       ____bpf_spin_lock(&timer->lock);
       t = timer->timer;
       if (!t) {
               ret = -EINVAL;
               goto out;
       }

       /* doesn't matter what it returns, we just request cancellation */
       hrtimer_try_to_cancel(&t->timer);

       /* t->prog might not be the same as prog (!) */
       if (prog != t->prog) {
            /* callback hasn't yet dropped refcnt */
           if (t->prog) /* if it's null bpf_timer_cb() is running and
will put it later */
               bpf_prog_put(t->prog);

           if (IS_ERR(bpf_prog_inc_not_zero(prog))) {
               /* this will only happen if prog is still running (and
it's actually us),
                * but it was already put to zero, e.g., by closing last FD,
                * so there is no point in scheduling a new run
                */
               t->prog = NULL;
               t->callback_fn = NULL;
               ret = -E_WE_ARE_SHUTTING_DOWN;
               goto out;
           }
       } /* otherwise we keep existing refcnt on t->prog == prog */

       /* potentially new combination of prog and cb */
       t->prog = prog;
       t->callback_fn = cb;

       hrtimer_start(&t->timer, ns_to_ktime(nsecs), HRTIMER_MODE_REL_SOFT);
out:
       ____bpf_spin_unlock(&timer->lock);
       return ret;
}

BPF_CALL_1(bpf_timer_cancel, struct bpf_timer_kern *, timer)
{
       struct bpf_hrtimer *t;
       int ret = 0;

       ____bpf_spin_lock(&timer->lock);
       t = timer->timer;
       if (!t) {
               ret = -EINVAL;
               goto out;
       }

       /* this part I still worry about due to possibility of cpu migration,
        * we need to think if we should migrate_disable() in bpf_timer_cb()
        * and bpf_timer_* helpers(), but that's a separate topic
        */
       if (this_cpu_read(hrtimer_running) == t) {
               ret = -EDEADLK;
               goto out;
       }

       ret = hrtimer_cancel(&t->timer);

       if (t->prog) {
            /* bpf_timer_cb hasn't put it yet (and now won't) */
            bpf_prog_put(t->prog);
            t->prog = NULL;
            t->callback_fn = NULL;
       }
out:
       ____bpf_spin_unlock(&timer->lock);
       return ret;
}

static enum hrtimer_restart bpf_timer_cb(struct hrtimer *timer)
{
       struct bpf_hrtimer *t = container_of(timer, struct bpf_hrtimer, timer);
       struct bpf_map *map = t->map;
       struct bpf_prog *prog;
       void *key, *callback_fn;
       u32 idx;
       int ret;

       /* this is very IMPORTANT  */
       ____bpf_spin_lock(t->lock);

       prog = t->prog;
       if (!prog) {
           /* we were cancelled, prog is put already, exit early */
           ____bpf_spin_unlock(t->lock);
           return HRTIMER_NORESTART;
       }
       callback_fn = t->callback_fn;

       /* make sure bpf_timer_cancel/bpf_timer_start won't
bpf_prog_put our prog */
       t->prog = NULL;
       t->callback_fn = NULL;

       ____bpf_spin_unlock(t->lock);

       /* at this point we "own" prog's refcnt decrement */

       this_cpu_write(hrtimer_running, t);

       ...

       ret = BPF_CAST_CALL(callback_fn)((u64)(long)map,
                                        (u64)(long)key,
                                        (u64)(long)t->value, 0, 0);
       WARN_ON(ret != 0); /* Next patch disallows 1 in the verifier */

       bpf_prog_put(prog); /* always correct and non-racy */

       this_cpu_write(hrtimer_running, NULL);

       return HRTIMER_NORESTART;
}

bpf_timer_cancel_and_free() is mostly the same, with the t->prog NULL check
as everywhere else.
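
A rough sketch of what that might look like, under the same assumptions
as the snippets above:

void bpf_timer_cancel_and_free(void *val)
{
       struct bpf_timer_kern *timer = val;
       struct bpf_hrtimer *t;

       ____bpf_spin_lock(&timer->lock);
       t = timer->timer;
       if (!t)
               goto out;
       /* same cpu-migration caveat as in bpf_timer_cancel() */
       if (this_cpu_read(hrtimer_running) != t)
               hrtimer_cancel(&t->timer);
       if (t->prog) {
               /* armed but bpf_timer_cb() hasn't run; drop its refcnt */
               bpf_prog_put(t->prog);
               t->prog = NULL;
               t->callback_fn = NULL;
       }
       kfree(t);
       timer->timer = NULL;
out:
       ____bpf_spin_unlock(&timer->lock);
}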

> There is no need to complicate bpf_timer with crazy refcnting schemes.
> The user space can simply pin the program in bpffs. In the future we might
> introduce a self-pinning helper that would pin the program and create a file.
> Sort-of like syscall prog type would pin self.
> That would be explicit and clean api instead of obscure hacks inside bpf_timer*().

Do I understand correctly that the alternative that you are proposing
is to keep some linked list of all map_values across all maps in the
system that have initialized bpf_hrtimer with that particular bpf_prog
in them? And when bpf_prog is put to zero you'll go and destruct them
all in a race-free way?

I have a bit of a hard time imagining how that will be implemented
exactly, so I might be overcomplicating that in my mind. Will be happy
to see the working code.
Alexei Starovoitov June 16, 2021, 4:52 p.m. UTC | #17
On Tue, Jun 15, 2021 at 10:54:40PM -0700, Andrii Nakryiko wrote:
> 
> It could be the case, of course. But let's try to think this through
> to the end before giving up. I think it's mostly because we are trying
> to be too clever with lockless synchronization.

imo your proposed code fits "too clever" too ;)
Just a reminder that a few emails ago you argued
for an "obviously correct" approach, but now...

> I had a feeling that bpf_timer_cb needs to take lock as well. But once
> we add that, refcounting becomes simpler and more deterministic, IMO.
> Here's what I have in mind. I keep only important parts of the code,
> so it's not a complete implementation. Please take a look below, I
> left a few comments here and there.
> 
> 
> struct bpf_hrtimer {
>        struct hrtimer timer;
>        struct bpf_map *map;
>        void *value;
> 
>        struct bpf_prog *prog;
>        void *callback_fn;
> 
>        /* pointer to that lock in struct bpf_timer_kern
>         * so that we can access it from bpf_timer_cb()
>         */
>        struct bpf_spin_lock *lock;
> };
> 
> BPF_CALL_5(bpf_timer_init, struct bpf_timer_kern *, timer, int, flags,
>            struct bpf_map *, map)
> {
>        struct bpf_hrtimer *t;
>        int ret = 0;
> 
>        ____bpf_spin_lock(&timer->lock);
>        t = timer->timer;
>        if (t) {
>                ret = -EBUSY;
>                goto out;
>        }
>        /* allocate hrtimer via map_kmalloc to use memcg accounting */
>        t = bpf_map_kmalloc_node(map, sizeof(*t), GFP_ATOMIC, NUMA_NO_NODE);
>        if (!t) {
>                ret = -ENOMEM;
>                goto out;
>        }
>        t->value = (void *)timer /* - offset of bpf_timer inside elem */;
>        t->map = map;
>        t->timer.function = bpf_timer_cb;
> 
>        /* we'll init them in bpf_timer_start */
>        t->prog = NULL;
>        t->callback_fn = NULL;
> 
>        hrtimer_init(&t->timer, clockid, HRTIMER_MODE_REL_SOFT);
>        timer->timer = t;
> out:
>        ____bpf_spin_unlock(&timer->lock);
>        return ret;
> }
> 
> 
> BPF_CALL_2(bpf_timer_start, struct bpf_timer_kern *, timer, u64, nsecs,
>            void *, cb, struct bpf_prog *, prog)
> {
>        struct bpf_hrtimer *t;
>        int ret = 0;
> 
>        ____bpf_spin_lock(&timer->lock);
>        t = timer->timer;
>        if (!t) {
>                ret = -EINVAL;
>                goto out;
>        }
> 
>        /* doesn't matter what it returns, we just request cancellation */
>        hrtimer_try_to_cancel(&t->timer);
> 
>        /* t->prog might not be the same as prog (!) */
>        if (prog != t->prog) {
>             /* callback hasn't yet dropped refcnt */
>            if (t->prog) /* if it's null bpf_timer_cb() is running and
> will put it later */
>                bpf_prog_put(t->prog);
> 
>            if (IS_ERR(bpf_prog_inc_not_zero(prog))) {
>                /* this will only happen if prog is still running (and
> it's actually us),
>                 * but it was already put to zero, e.g., by closing last FD,
>                 * so there is no point in scheduling a new run
>                 */

I have a bit of a mind explosion here... everything will be alright.

>                t->prog = NULL;
>                t->callback_fn = NULL;
>                ret = -E_WE_ARE_SHUTTING_DOWN;
>                goto out;
>            }
>        } /* otherwise we keep existing refcnt on t->prog == prog */
> 
>        /* potentially new combination of prog and cb */
>        t->prog = prog;
>        t->callback_fn = cb;
> 
>        hrtimer_start(&t->timer, ns_to_ktime(nsecs), HRTIMER_MODE_REL_SOFT);
> out:
>        ____bpf_spin_unlock(&timer->lock);
>        return ret;
> }
> 
> BPF_CALL_1(bpf_timer_cancel, struct bpf_timer_kern *, timer)
> {
>        struct bpf_hrtimer *t;
>        int ret = 0;
> 
>        ____bpf_spin_lock(&timer->lock);
>        t = timer->timer;
>        if (!t) {
>                ret = -EINVAL;
>                goto out;
>        }
> 
>        /* this part I still worry about due to possibility of cpu migration,
>         * we need to think if we should migrate_disable() in bpf_timer_cb()
>         * and bpf_timer_* helpers(), but that's a separate topic
>         */
>        if (this_cpu_read(hrtimer_running) == t) {
>                ret = -EDEADLK;
>                goto out;
>        }
> 
>        ret = hrtimer_cancel(&t->timer);
> 
>        if (t->prog) {
>             /* bpf_timer_cb hasn't put it yet (and now won't) */
>             bpf_prog_put(t->prog);
>             t->prog = NULL;
>             t->callback_fn = NULL;
>        }
> out:
>        ____bpf_spin_unlock(&timer->lock);
>        return ret;
> }
> 
> static enum hrtimer_restart bpf_timer_cb(struct hrtimer *timer)
> {
>        struct bpf_hrtimer *t = container_of(timer, struct bpf_hrtimer, timer);
>        struct bpf_map *map = t->map;
>        struct bpf_prog *prog;
>        void *key, *callback_fn;
>        u32 idx;
>        int ret;
> 
>        /* this is very IMPORTANT  */
>        ____bpf_spin_lock(t->lock);
> 
>        prog = t->prog;
>        if (!prog) {
>            /* we were cancelled, prog is put already, exit early */
>            ____bpf_spin_unlock(&timer->lock);
>            return HRTIMER_NORESTART;
>        }
>        callback_fn = t->callback_fn;
> 
>        /* make sure bpf_timer_cancel/bpf_timer_start won't
> bpf_prog_put our prog */
>        t->prog = NULL;
>        t->callback_fn = NULL;
> 
>        ____bpf_spin_unlock(t->lock);
> 
>        /* at this point we "own" prog's refcnt decrement */
> 
>        this_cpu_write(hrtimer_running, t);
> 
>        ...
> 
>        ret = BPF_CAST_CALL(t->callback_fn)((u64)(long)map,
>                                            (u64)(long)key,
>                                            (u64)(long)value, 0, 0);
>        WARN_ON(ret != 0); /* Next patch disallows 1 in the verifier */
> 
>        bpf_prog_put(prog); /* always correct and non-racy */
> 
>        this_cpu_write(hrtimer_running, NULL);
> 
>        return HRTIMER_NORESTART;
> }
> 
> bpf_timer_cancel_and_free() is mostly the same with t->prog NULL check
> as everywhere else

I haven't started a detailed analysis of the above proposal, but it looks
overly complicated at first glance. Not saying it's bad or good.
Just the complexity and races are striking.

> 
> > There is no need to complicate bpf_timer with crazy refcnting schemes.
> > The user space can simply pin the program in bpffs. In the future we might
> > introduce a self-pinning helper that would pin the program and create a file.
> > Sort-of like syscall prog type would pin self.
> > That would be explicit and clean api instead of obscure hacks inside bpf_timer*().
> 
> Do I understand correctly that the alternative that you are proposing
> is to keep some linked list of all map_values across all maps in the
> system that have initialized bpf_hrtimer with that particular bpf_prog
> in them? And when bpf_prog is put to zero you'll go and destruct them
> all in a race-free way?
> 
> I have a bit of a hard time imagining how that will be implemented
> exactly, so I might be overcomplicating that in my mind. Will be happy
> to see the working code.

Here is working code...
Note how patch 1 is so much simpler without complicated refcnting.
And how patch 2 removes for_each_map_element that was necessary earlier.
Also note that the link list approach is an optimization.
Instead of keeping a link list, bpf_free_used_timers() could call
a map-specific op to iterate all elems and free timers with
timer->prog == prog_going_away.
That was my initial proposal a couple of months ago.
The link list is purely an optimization instead of for_each_map_elem.
From c11bf0aa23f1df25682056f2c581c9bc9bd8df31 Mon Sep 17 00:00:00 2001
From: Alexei Starovoitov <ast@kernel.org>
Date: Wed, 16 Jun 2021 09:19:36 -0700
Subject: [PATCH bpf-next 1/2] bpf: Cancel and free timers when prog is going
 away.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 include/linux/bpf.h  |  3 ++
 kernel/bpf/core.c    |  3 ++
 kernel/bpf/helpers.c | 70 +++++++++++++++++++++++++-------------------
 3 files changed, 46 insertions(+), 30 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 7c403235c7e8..f67ea2512844 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -245,6 +245,7 @@ static inline void copy_map_value(struct bpf_map *map, void *dst, void *src)
 void copy_map_value_locked(struct bpf_map *map, void *dst, void *src,
 			   bool lock_src);
 void bpf_timer_cancel_and_free(void *timer);
+void bpf_free_used_timers(struct bpf_prog_aux *aux);
 int bpf_obj_name_cpy(char *dst, const char *src, unsigned int size);
 
 struct bpf_offload_dev;
@@ -871,6 +872,8 @@ struct bpf_prog_aux {
 	u32 size_poke_tab;
 	struct bpf_ksym ksym;
 	const struct bpf_prog_ops *ops;
+	spinlock_t timers_lock;
+	struct hlist_head used_timers;
 	struct bpf_map **used_maps;
 	struct mutex used_maps_mutex; /* mutex for used_maps and used_map_cnt */
 	struct btf_mod_pair *used_btfs;
diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index 5e31ee9f7512..aa7960986a75 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -104,6 +104,8 @@ struct bpf_prog *bpf_prog_alloc_no_stats(unsigned int size, gfp_t gfp_extra_flag
 	fp->jit_requested = ebpf_jit_enabled();
 
 	INIT_LIST_HEAD_RCU(&fp->aux->ksym.lnode);
+	INIT_HLIST_HEAD(&fp->aux->used_timers);
+	spin_lock_init(&fp->aux->timers_lock);
 	mutex_init(&fp->aux->used_maps_mutex);
 	mutex_init(&fp->aux->dst_mutex);
 
@@ -2201,6 +2203,7 @@ static void bpf_prog_free_deferred(struct work_struct *work)
 	int i;
 
 	aux = container_of(work, struct bpf_prog_aux, work);
+	bpf_free_used_timers(aux);
 	bpf_free_used_maps(aux);
 	bpf_free_used_btfs(aux);
 	if (bpf_prog_is_dev_bound(aux))
diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
index b8df592c33cc..08f5d0f73f68 100644
--- a/kernel/bpf/helpers.c
+++ b/kernel/bpf/helpers.c
@@ -987,6 +987,7 @@ const struct bpf_func_proto bpf_snprintf_proto = {
 
 struct bpf_hrtimer {
 	struct hrtimer timer;
+	struct hlist_node hlist;
 	struct bpf_map *map;
 	struct bpf_prog *prog;
 	void *callback_fn;
@@ -1004,7 +1005,6 @@ static DEFINE_PER_CPU(struct bpf_hrtimer *, hrtimer_running);
 static enum hrtimer_restart bpf_timer_cb(struct hrtimer *timer)
 {
 	struct bpf_hrtimer *t = container_of(timer, struct bpf_hrtimer, timer);
-	struct bpf_prog *prog = t->prog;
 	struct bpf_map *map = t->map;
 	void *key;
 	u32 idx;
@@ -1031,16 +1031,6 @@ static enum hrtimer_restart bpf_timer_cb(struct hrtimer *timer)
 					    (u64)(long)t->value, 0, 0);
 	WARN_ON(ret != 0); /* Next patch disallows 1 in the verifier */
 
-	/* The bpf function finished executed. Drop the prog refcnt.
-	 * It could reach zero here and trigger free of bpf_prog
-	 * and subsequent free of the maps that were holding timers.
-	 * If callback_fn called bpf_timer_start on this timer
-	 * the prog refcnt will be > 0.
-	 *
-	 * If callback_fn deleted map element the 't' could have been freed,
-	 * hence t->prog deref is done earlier.
-	 */
-	bpf_prog_put(prog);
 	this_cpu_write(hrtimer_running, NULL);
 	return HRTIMER_NORESTART;
 }
@@ -1077,6 +1067,10 @@ BPF_CALL_5(bpf_timer_init, struct bpf_timer_kern *, timer, void *, cb, int, flag
 	t->prog = prog;
 	hrtimer_init(&t->timer, clockid, HRTIMER_MODE_REL_SOFT);
 	t->timer.function = bpf_timer_cb;
+	INIT_HLIST_NODE(&t->hlist);
+	spin_lock(&prog->aux->timers_lock);
+	hlist_add_head_rcu(&t->hlist, &prog->aux->used_timers);
+	spin_unlock(&prog->aux->timers_lock);
 	timer->timer = t;
 out:
 	____bpf_spin_unlock(&timer->lock);
@@ -1103,12 +1097,6 @@ BPF_CALL_2(bpf_timer_start, struct bpf_timer_kern *, timer, u64, nsecs)
 		ret = -EINVAL;
 		goto out;
 	}
-	if (!hrtimer_active(&t->timer) || hrtimer_callback_running(&t->timer))
-		/* If the timer wasn't active or callback already executing
-		 * bump the prog refcnt to keep it alive until
-		 * callback is invoked (again).
-		 */
-		bpf_prog_inc(t->prog);
 	hrtimer_start(&t->timer, ns_to_ktime(nsecs), HRTIMER_MODE_REL_SOFT);
 out:
 	____bpf_spin_unlock(&timer->lock);
@@ -1145,13 +1133,7 @@ BPF_CALL_1(bpf_timer_cancel, struct bpf_timer_kern *, timer)
 	/* Cancel the timer and wait for associated callback to finish
 	 * if it was running.
 	 */
-	if (hrtimer_cancel(&t->timer) == 1) {
-		/* If the timer was active then drop the prog refcnt,
-		 * since callback will not be invoked.
-		 */
-		bpf_prog_put(t->prog);
-		ret = 1;
-	}
+	ret = hrtimer_cancel(&t->timer);
 out:
 	____bpf_spin_unlock(&timer->lock);
 	return ret;
@@ -1164,8 +1146,10 @@ static const struct bpf_func_proto bpf_timer_cancel_proto = {
 	.arg1_type	= ARG_PTR_TO_TIMER,
 };
 
-/* This function is called by delete_element in htab and lru maps
- * and by map_free for array, lru, htab maps.
+/* This function is called by map_delete/update_elem for individual
+ * element and by bpf_free_used_timers when prog is going away.
+ * When map is destroyed by ops->map_free all bpf_timers in there
+ * are freed.
  */
 void bpf_timer_cancel_and_free(void *val)
 {
@@ -1177,7 +1161,7 @@ void bpf_timer_cancel_and_free(void *val)
 	if (!t)
 		goto out;
 	/* Cancel the timer and wait for callback to complete if it was
-	 * running. Only individual delete_element in htab or lru maps can
+	 * running. Only delete/update of individual element can
 	 * return 1 from hrtimer_cancel.
 	 * The whole map is destroyed when its refcnt reaches zero.
 	 * That happens after bpf prog refcnt reaches zero.
@@ -1197,15 +1181,41 @@ void bpf_timer_cancel_and_free(void *val)
 	 * In non-preallocated maps timer->timer = NULL will happen after
 	 * callback completes, since prog execution is an RCU critical section.
 	 */
-	if (this_cpu_read(hrtimer_running) != t &&
-	    hrtimer_cancel(&t->timer) == 1)
-		bpf_prog_put(t->prog);
+	if (this_cpu_read(hrtimer_running) != t)
+		hrtimer_cancel(&t->timer);
+
+	spin_lock(&t->prog->aux->timers_lock);
+	hlist_del_rcu(&t->hlist);
+	spin_unlock(&t->prog->aux->timers_lock);
+	t->prog = LIST_POISON1;
 	kfree(t);
 	timer->timer = NULL;
 out:
 	____bpf_spin_unlock(&timer->lock);
 }
 
+/* This function is called after prog->refcnt reaches zero.
+ * It's called before bpf_free_used_maps to clean up timers in maps
+ * if going away prog had callback_fn-s for them.
+ */
+void bpf_free_used_timers(struct bpf_prog_aux *aux)
+{
+	struct bpf_timer_kern *timer;
+	struct bpf_hrtimer *t;
+	struct hlist_node *n;
+
+	rcu_read_lock();
+	hlist_for_each_entry_safe(t, n, &aux->used_timers, hlist) {
+		timer = t->value + t->map->timer_off;
+		/* The map isn't going away. The 'timer' points into map
+		 * element that isn't going away either, but cancel_and_free
+		 * could be racing with parallel map_delete_elem.
+		 */
+		bpf_timer_cancel_and_free(timer);
+	}
+	rcu_read_unlock();
+}
+
 const struct bpf_func_proto bpf_get_current_task_proto __weak;
 const struct bpf_func_proto bpf_probe_read_user_proto __weak;
 const struct bpf_func_proto bpf_probe_read_user_str_proto __weak;
diff mbox series

Patch

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 86dec5001ae2..3816b6bae6d3 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -221,6 +221,7 @@  static inline void copy_map_value(struct bpf_map *map, void *dst, void *src)
 }
 void copy_map_value_locked(struct bpf_map *map, void *dst, void *src,
 			   bool lock_src);
+void bpf_timer_cancel_and_free(void *timer);
 int bpf_obj_name_cpy(char *dst, const char *src, unsigned int size);
 
 struct bpf_offload_dev;
@@ -314,6 +315,7 @@  enum bpf_arg_type {
 	ARG_PTR_TO_FUNC,	/* pointer to a bpf program function */
 	ARG_PTR_TO_STACK_OR_NULL,	/* pointer to stack or NULL */
 	ARG_PTR_TO_CONST_STR,	/* pointer to a null terminated read-only string */
+	ARG_PTR_TO_TIMER,	/* pointer to bpf_timer */
 	__BPF_ARG_TYPE_MAX,
 };
 
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 2c1ba70abbf1..d25bbcdad8e6 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -4778,6 +4778,38 @@  union bpf_attr {
  * 		Execute close syscall for given FD.
  * 	Return
  * 		A syscall result.
+ *
+ * long bpf_timer_init(struct bpf_timer *timer, void *callback_fn, int flags)
+ *	Description
+ *		Initialize the timer to call *callback_fn* static function.
+ *		First 4 bits of *flags* specify clockid. Only CLOCK_MONOTONIC,
+ *		CLOCK_REALTIME, CLOCK_BOOTTIME are allowed.
+ *		All other bits of *flags* are reserved.
+ *	Return
+ *		0 on success.
+ *		**-EBUSY** if *timer* is already initialized.
+ *		**-EINVAL** if invalid *flags* are passed.
+ *
+ * long bpf_timer_start(struct bpf_timer *timer, u64 nsecs)
+ *	Description
+ *		Start the timer and set its expiration N nanoseconds from the
+ *		current time. The timer callback_fn will be invoked in soft irq
+ *		context on some cpu and will not repeat unless another
+ *		bpf_timer_start() is made. In such case the next invocation can
+ *		migrate to a different cpu.
+ *	Return
+ *		0 on success.
+ *		**-EINVAL** if *timer* was not initialized with bpf_timer_init() earlier.
+ *
+ * long bpf_timer_cancel(struct bpf_timer *timer)
+ *	Description
+ *		Cancel the timer and wait for callback_fn to finish if it was running.
+ *	Return
+ *		0 if the timer was not active.
+ *		1 if the timer was active.
+ *		**-EINVAL** if *timer* was not initialized with bpf_timer_init() earlier.
+ *		**-EDEADLK** if callback_fn tried to call bpf_timer_cancel() on its own timer
+ *		which would have led to a deadlock otherwise.
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -4949,6 +4981,9 @@  union bpf_attr {
 	FN(sys_bpf),			\
 	FN(btf_find_by_name_kind),	\
 	FN(sys_close),			\
+	FN(timer_init),			\
+	FN(timer_start),		\
+	FN(timer_cancel),		\
 	/* */
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
@@ -6061,6 +6096,11 @@  struct bpf_spin_lock {
 	__u32	val;
 };
 
+struct bpf_timer {
+	__u64 :64;
+	__u64 :64;
+};
+
 struct bpf_sysctl {
 	__u32	write;		/* Sysctl is being read (= 0) or written (= 1).
 				 * Allows 1,2,4-byte read, but no write.
diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
index 544773970dbc..3a693d451ca3 100644
--- a/kernel/bpf/helpers.c
+++ b/kernel/bpf/helpers.c
@@ -985,6 +985,227 @@  const struct bpf_func_proto bpf_snprintf_proto = {
 	.arg5_type	= ARG_CONST_SIZE_OR_ZERO,
 };
 
+struct bpf_hrtimer {
+	struct hrtimer timer;
+	struct bpf_map *map;
+	struct bpf_prog *prog;
+	void *callback_fn;
+	void *value;
+};
+
+/* the actual struct hidden inside uapi struct bpf_timer */
+struct bpf_timer_kern {
+	struct bpf_hrtimer *timer;
+	struct bpf_spin_lock lock;
+};
+
+static DEFINE_PER_CPU(struct bpf_hrtimer *, hrtimer_running);
+
+static enum hrtimer_restart bpf_timer_cb(struct hrtimer *timer)
+{
+	struct bpf_hrtimer *t = container_of(timer, struct bpf_hrtimer, timer);
+	struct bpf_prog *prog = t->prog;
+	struct bpf_map *map = t->map;
+	void *key;
+	u32 idx;
+	int ret;
+
+	/* bpf_timer_cb() runs in hrtimer_run_softirq. It doesn't migrate and
+	 * cannot be preempted by another bpf_timer_cb() on the same cpu.
+	 * Remember the timer this callback is servicing to prevent
+	 * deadlock if callback_fn() calls bpf_timer_cancel() on the same timer.
+	 */
+	this_cpu_write(hrtimer_running, t);
+	if (map->map_type == BPF_MAP_TYPE_ARRAY) {
+		struct bpf_array *array = container_of(map, struct bpf_array, map);
+
+		/* compute the key */
+		idx = ((char *)t->value - array->value) / array->elem_size;
+		key = &idx;
+	} else { /* hash or lru */
+		key = t->value - round_up(map->key_size, 8);
+	}
+
+	ret = BPF_CAST_CALL(t->callback_fn)((u64)(long)map,
+					    (u64)(long)key,
+					    (u64)(long)t->value, 0, 0);
+	WARN_ON(ret != 0); /* Next patch disallows 1 in the verifier */
+
+	/* The bpf function finished executed. Drop the prog refcnt.
+	 * It could reach zero here and trigger free of bpf_prog
+	 * and subsequent free of the maps that were holding timers.
+	 * If callback_fn called bpf_timer_start on this timer
+	 * the prog refcnt will be > 0.
+	 *
+	 * If callback_fn deleted map element the 't' could have been freed,
+	 * hence t->prog deref is done earlier.
+	 */
+	bpf_prog_put(prog);
+	this_cpu_write(hrtimer_running, NULL);
+	return HRTIMER_NORESTART;
+}
+
+BPF_CALL_5(bpf_timer_init, struct bpf_timer_kern *, timer, void *, cb, int, flags,
+	   struct bpf_map *, map, struct bpf_prog *, prog)
+{
+	clockid_t clockid = flags & (MAX_CLOCKS - 1);
+	struct bpf_hrtimer *t;
+	int ret = 0;
+
+	BUILD_BUG_ON(MAX_CLOCKS != 16);
+	if (flags >= MAX_CLOCKS ||
+	    /* similar to timerfd except _ALARM variants are not supported */
+            (clockid != CLOCK_MONOTONIC &&
+             clockid != CLOCK_REALTIME &&
+             clockid != CLOCK_BOOTTIME))
+		return -EINVAL;
+	____bpf_spin_lock(&timer->lock);
+	t = timer->timer;
+	if (t) {
+		ret = -EBUSY;
+		goto out;
+	}
+	/* allocate hrtimer via map_kmalloc to use memcg accounting */
+	t = bpf_map_kmalloc_node(map, sizeof(*t), GFP_ATOMIC, NUMA_NO_NODE);
+	if (!t) {
+		ret = -ENOMEM;
+		goto out;
+	}
+	t->callback_fn = cb;
+	t->value = (void *)timer /* - offset of bpf_timer inside elem */;
+	t->map = map;
+	t->prog = prog;
+	hrtimer_init(&t->timer, clockid, HRTIMER_MODE_REL_SOFT);
+	t->timer.function = bpf_timer_cb;
+	timer->timer = t;
+out:
+	____bpf_spin_unlock(&timer->lock);
+	return ret;
+}
+
+static const struct bpf_func_proto bpf_timer_init_proto = {
+	.func		= bpf_timer_init,
+	.gpl_only	= true,
+	.ret_type	= RET_INTEGER,
+	.arg1_type	= ARG_PTR_TO_TIMER,
+	.arg2_type	= ARG_PTR_TO_FUNC,
+	.arg3_type	= ARG_ANYTHING,
+};
+
+BPF_CALL_2(bpf_timer_start, struct bpf_timer_kern *, timer, u64, nsecs)
+{
+	struct bpf_hrtimer *t;
+	int ret = 0;
+
+	____bpf_spin_lock(&timer->lock);
+	t = timer->timer;
+	if (!t) {
+		ret = -EINVAL;
+		goto out;
+	}
+	if (!hrtimer_active(&t->timer) || hrtimer_callback_running(&t->timer))
+		/* If the timer wasn't active or the callback is already
+		 * executing, bump the prog refcnt to keep it alive until
+		 * the callback is invoked (again).
+		 */
+		bpf_prog_inc(t->prog);
+	hrtimer_start(&t->timer, ns_to_ktime(nsecs), HRTIMER_MODE_REL_SOFT);
+out:
+	____bpf_spin_unlock(&timer->lock);
+	return ret;
+}
+
+static const struct bpf_func_proto bpf_timer_start_proto = {
+	.func		= bpf_timer_start,
+	.gpl_only	= true,
+	.ret_type	= RET_INTEGER,
+	.arg1_type	= ARG_PTR_TO_TIMER,
+	.arg2_type	= ARG_ANYTHING,
+};
+
+BPF_CALL_1(bpf_timer_cancel, struct bpf_timer_kern *, timer)
+{
+	struct bpf_hrtimer *t;
+	int ret = 0;
+
+	____bpf_spin_lock(&timer->lock);
+	t = timer->timer;
+	if (!t) {
+		ret = -EINVAL;
+		goto out;
+	}
+	if (this_cpu_read(hrtimer_running) == t) {
+		/* If the bpf callback_fn is trying to bpf_timer_cancel()
+		 * its own timer, hrtimer_cancel() would deadlock,
+		 * since it waits for callback_fn to finish.
+		 */
+		ret = -EDEADLK;
+		goto out;
+	}
+	/* Cancel the timer and wait for associated callback to finish
+	 * if it was running.
+	 */
+	if (hrtimer_cancel(&t->timer) == 1) {
+		/* If the timer was active then drop the prog refcnt,
+		 * since callback will not be invoked.
+		 */
+		bpf_prog_put(t->prog);
+		ret = 1;
+	}
+out:
+	____bpf_spin_unlock(&timer->lock);
+	return ret;
+}
+
+static const struct bpf_func_proto bpf_timer_cancel_proto = {
+	.func		= bpf_timer_cancel,
+	.gpl_only	= true,
+	.ret_type	= RET_INTEGER,
+	.arg1_type	= ARG_PTR_TO_TIMER,
+};
+
+/* This function is called by delete_element in htab and lru maps
+ * and by map_free for array, lru, htab maps.
+ */
+void bpf_timer_cancel_and_free(void *val)
+{
+	struct bpf_timer_kern *timer = val;
+	struct bpf_hrtimer *t;
+
+	____bpf_spin_lock(&timer->lock);
+	t = timer->timer;
+	if (!t)
+		goto out;
+	/* Cancel the timer and wait for callback to complete if it was
+	 * running. Only individual delete_element in htab or lru maps can
+	 * return 1 from hrtimer_cancel.
+	 * The whole map is destroyed when its refcnt reaches zero.
+	 * That happens after bpf prog refcnt reaches zero.
+	 * bpf prog refcnt will not reach zero until all timers are executed.
+	 * So when maps are destroyed, hrtimer_cancel() will surely return 0.
+	 * In that case t->prog is a pointer to freed memory.
+	 *
+	 * When htab or lru is deleting an individual element, check whether
+	 * bpf_map_delete_elem() is trying to delete the element whose timer
+	 * callback is currently running on this cpu.
+	 * In that case don't call hrtimer_cancel() (since it would deadlock)
+	 * and don't call hrtimer_try_to_cancel() (since it would just return -1).
+	 * Instead free the timer and set timer->timer = NULL.
+	 * The subsequent bpf_timer_start/cancel() helpers won't be able to use it.
+	 * In preallocated maps it's safe to do timer->timer = NULL.
+	 * The memory could be reused for another element while the current
+	 * timer callback can still do bpf_timer_init() on it.
+	 * In non-preallocated maps timer->timer = NULL will happen after the
+	 * callback completes, since prog execution is an RCU critical section.
+	 */
+	if (this_cpu_read(hrtimer_running) != t &&
+	    hrtimer_cancel(&t->timer) == 1)
+		bpf_prog_put(t->prog);
+	kfree(t);
+	timer->timer = NULL;
+out:
+	____bpf_spin_unlock(&timer->lock);
+}
+
 const struct bpf_func_proto bpf_get_current_task_proto __weak;
 const struct bpf_func_proto bpf_probe_read_user_proto __weak;
 const struct bpf_func_proto bpf_probe_read_user_str_proto __weak;
@@ -1051,6 +1272,12 @@  bpf_base_func_proto(enum bpf_func_id func_id)
 		return &bpf_per_cpu_ptr_proto;
 	case BPF_FUNC_this_cpu_ptr:
 		return &bpf_this_cpu_ptr_proto;
+	case BPF_FUNC_timer_init:
+		return &bpf_timer_init_proto;
+	case BPF_FUNC_timer_start:
+		return &bpf_timer_start_proto;
+	case BPF_FUNC_timer_cancel:
+		return &bpf_timer_cancel_proto;
 	default:
 		break;
 	}
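
For illustration (not part of the patch): a minimal sketch of the BPF-program
side of the helpers above. It assumes a map value type 'struct elem' whose
first field is the struct bpf_timer (as the verifier change below requires)
and an arbitrary 100 msec period; the callback re-arms its own timer to get
periodic behavior, and cancelling the timer it is currently servicing would
return -EDEADLK per bpf_timer_cancel() above.

struct elem {
	struct bpf_timer t;	/* must be the first field in this patch */
};

/* Matches callback_fn(struct bpf_map *map, void *key, void *value) as set up
 * by set_timer_init_callback_state() in the verifier below.
 */
static int timer_cb(void *map, int *key, struct elem *val)
{
	/* Re-arming keeps the prog refcnt > 0 until the next expiration. */
	bpf_timer_start(&val->t, 100000000 /* 100 msec */);

	/* bpf_timer_cancel(&val->t) here would return -EDEADLK, since
	 * hrtimer_running points at this very timer.
	 */
	return 0;
}
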
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 1de4b8c6ee42..44ec9760b562 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -4656,6 +4656,35 @@  static int process_spin_lock(struct bpf_verifier_env *env, int regno,
 	return 0;
 }
 
+static int process_timer_func(struct bpf_verifier_env *env, int regno,
+			      struct bpf_call_arg_meta *meta)
+{
+	struct bpf_reg_state *regs = cur_regs(env), *reg = &regs[regno];
+	bool is_const = tnum_is_const(reg->var_off);
+	struct bpf_map *map = reg->map_ptr;
+	u64 val = reg->var_off.value;
+
+	if (!is_const) {
+		verbose(env,
+			"R%d doesn't have constant offset. bpf_timer has to be at a constant offset\n",
+			regno);
+		return -EINVAL;
+	}
+	if (!map->btf) {
+		verbose(env, "map '%s' has to have BTF in order to use bpf_timer\n",
+			map->name);
+		return -EINVAL;
+	}
+	if (val) {
+		/* This restriction will be removed in the next patch */
+		verbose(env, "bpf_timer field can only be first in the map value element\n");
+		return -EINVAL;
+	}
+	WARN_ON(meta->map_ptr);
+	meta->map_ptr = map;
+	return 0;
+}
+
 static bool arg_type_is_mem_ptr(enum bpf_arg_type type)
 {
 	return type == ARG_PTR_TO_MEM ||
@@ -4788,6 +4817,7 @@  static const struct bpf_reg_types percpu_btf_ptr_types = { .types = { PTR_TO_PER
 static const struct bpf_reg_types func_ptr_types = { .types = { PTR_TO_FUNC } };
 static const struct bpf_reg_types stack_ptr_types = { .types = { PTR_TO_STACK } };
 static const struct bpf_reg_types const_str_ptr_types = { .types = { PTR_TO_MAP_VALUE } };
+static const struct bpf_reg_types timer_types = { .types = { PTR_TO_MAP_VALUE } };
 
 static const struct bpf_reg_types *compatible_reg_types[__BPF_ARG_TYPE_MAX] = {
 	[ARG_PTR_TO_MAP_KEY]		= &map_key_value_types,
@@ -4819,6 +4849,7 @@  static const struct bpf_reg_types *compatible_reg_types[__BPF_ARG_TYPE_MAX] = {
 	[ARG_PTR_TO_FUNC]		= &func_ptr_types,
 	[ARG_PTR_TO_STACK_OR_NULL]	= &stack_ptr_types,
 	[ARG_PTR_TO_CONST_STR]		= &const_str_ptr_types,
+	[ARG_PTR_TO_TIMER]		= &timer_types,
 };
 
 static int check_reg_type(struct bpf_verifier_env *env, u32 regno,
@@ -5000,6 +5031,9 @@  static int check_func_arg(struct bpf_verifier_env *env, u32 arg,
 			verbose(env, "verifier internal error\n");
 			return -EFAULT;
 		}
+	} else if (arg_type == ARG_PTR_TO_TIMER) {
+		if (process_timer_func(env, regno, meta))
+			return -EACCES;
 	} else if (arg_type == ARG_PTR_TO_FUNC) {
 		meta->subprogno = reg->subprogno;
 	} else if (arg_type_is_mem_ptr(arg_type)) {
@@ -5742,6 +5776,43 @@  static int set_map_elem_callback_state(struct bpf_verifier_env *env,
 	return 0;
 }
 
+static int set_timer_init_callback_state(struct bpf_verifier_env *env,
+					 struct bpf_func_state *caller,
+					 struct bpf_func_state *callee,
+					 int insn_idx)
+{
+	struct bpf_insn_aux_data *insn_aux = &env->insn_aux_data[insn_idx];
+	struct bpf_map *map_ptr;
+
+	if (bpf_map_ptr_poisoned(insn_aux)) {
+		verbose(env, "bpf_timer_init abusing map_ptr\n");
+		return -EINVAL;
+	}
+
+	map_ptr = BPF_MAP_PTR(insn_aux->map_ptr_state);
+
+	/* bpf_timer_init(struct bpf_timer *timer, void *callback_fn, int flags);
+	 * callback_fn(struct bpf_map *map, void *key, void *value);
+	 */
+	callee->regs[BPF_REG_1].type = CONST_PTR_TO_MAP;
+	__mark_reg_known_zero(&callee->regs[BPF_REG_1]);
+	callee->regs[BPF_REG_1].map_ptr = map_ptr;
+
+	callee->regs[BPF_REG_2].type = PTR_TO_MAP_KEY;
+	__mark_reg_known_zero(&callee->regs[BPF_REG_2]);
+	callee->regs[BPF_REG_2].map_ptr = map_ptr;
+
+	callee->regs[BPF_REG_3].type = PTR_TO_MAP_VALUE;
+	__mark_reg_known_zero(&callee->regs[BPF_REG_3]);
+	callee->regs[BPF_REG_3].map_ptr = map_ptr;
+
+	/* unused */
+	__mark_reg_not_init(env, &callee->regs[BPF_REG_4]);
+	__mark_reg_not_init(env, &callee->regs[BPF_REG_5]);
+	callee->in_callback_fn = true;
+	return 0;
+}
+
 static int prepare_func_exit(struct bpf_verifier_env *env, int *insn_idx)
 {
 	struct bpf_verifier_state *state = env->cur_state;
@@ -5837,6 +5908,7 @@  record_func_map(struct bpf_verifier_env *env, struct bpf_call_arg_meta *meta,
 	    func_id != BPF_FUNC_map_pop_elem &&
 	    func_id != BPF_FUNC_map_peek_elem &&
 	    func_id != BPF_FUNC_for_each_map_elem &&
+	    func_id != BPF_FUNC_timer_init &&
 	    func_id != BPF_FUNC_redirect_map)
 		return 0;
 
@@ -6069,6 +6141,13 @@  static int check_helper_call(struct bpf_verifier_env *env, struct bpf_insn *insn
 			return -EINVAL;
 	}
 
+	if (func_id == BPF_FUNC_timer_init) {
+		err = __check_func_call(env, insn, insn_idx_p, meta.subprogno,
+					set_timer_init_callback_state);
+		if (err < 0)
+			return -EINVAL;
+	}
+
 	if (func_id == BPF_FUNC_snprintf) {
 		err = check_bpf_snprintf_call(env, regs);
 		if (err < 0)
@@ -12526,6 +12605,36 @@  static int do_misc_fixups(struct bpf_verifier_env *env)
 			insn      = new_prog->insnsi + i + delta;
 			continue;
 		}
+		if (insn->imm == BPF_FUNC_timer_init) {
+			aux = &env->insn_aux_data[i + delta];
+			if (bpf_map_ptr_poisoned(aux)) {
+				verbose(env, "bpf_timer_init abusing map_ptr\n");
+				return -EINVAL;
+			}
+			map_ptr = BPF_MAP_PTR(aux->map_ptr_state);
+			{
+				/* BPF_LD_IMM64 expands to two insns, hence
+				 * four slots for the two hidden arguments.
+				 */
+				struct bpf_insn ld_addrs[4] = {
+					BPF_LD_IMM64(BPF_REG_4, (long)map_ptr),
+					BPF_LD_IMM64(BPF_REG_5, (long)prog),
+				};
+
+				insn_buf[0] = ld_addrs[0];
+				insn_buf[1] = ld_addrs[1];
+				insn_buf[2] = ld_addrs[2];
+				insn_buf[3] = ld_addrs[3];
+			}
+			insn_buf[4] = *insn;
+			cnt = 5;
+
+			new_prog = bpf_patch_insn_data(env, i + delta, insn_buf, cnt);
+			if (!new_prog)
+				return -ENOMEM;
+
+			delta    += cnt - 1;
+			env->prog = prog = new_prog;
+			insn      = new_prog->insnsi + i + delta;
+			goto patch_call_imm;
+		}
 
 		/* BPF_EMIT_CALL() assumptions in some of the map_gen_lookup
 		 * and other inlining handlers are currently limited to 64 bit
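
For illustration (not part of the patch): after the do_misc_fixups() rewrite
above, a bpf_timer_init() call site in the program is roughly equivalent to
the sequence below, which is how the two hidden arguments reach the
BPF_CALL_5() implementation even though the uapi signature has only three
(register names per the BPF calling convention; r1-r3 are set up by the
program itself):

	r1 = &timer		/* from the program */
	r2 = callback_fn	/* from the program */
	r3 = flags		/* from the program */
	r4 = map_ptr ll		/* BPF_LD_IMM64 inserted by the verifier */
	r5 = prog ll		/* BPF_LD_IMM64 inserted by the verifier */
	call bpf_timer_init
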
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index d2d7cf6cfe83..453a46c2d732 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -1065,7 +1065,7 @@  bpf_tracing_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 	case BPF_FUNC_snprintf:
 		return &bpf_snprintf_proto;
 	default:
-		return NULL;
+		return bpf_base_func_proto(func_id);
 	}
 }
 
diff --git a/scripts/bpf_doc.py b/scripts/bpf_doc.py
index 2d94025b38e9..00ac7b79cddb 100755
--- a/scripts/bpf_doc.py
+++ b/scripts/bpf_doc.py
@@ -547,6 +547,7 @@  COMMANDS
             'struct inode',
             'struct socket',
             'struct file',
+            'struct bpf_timer',
     ]
     known_types = {
             '...',
@@ -594,6 +595,7 @@  COMMANDS
             'struct inode',
             'struct socket',
             'struct file',
+            'struct bpf_timer',
     }
     mapped_types = {
             'u8': '__u8',
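
For illustration (not part of the patch): with 'struct bpf_timer' added to the
known types above, bpf_doc.py can generate the usual bpf_helper_defs.h
declarations for the new helpers, roughly as below (the helper ids are shown
only as an example; they follow the helpers' positions in __BPF_FUNC_MAPPER):

static long (*bpf_timer_init)(struct bpf_timer *timer, void *callback_fn, int flags) = (void *) 169;
static long (*bpf_timer_start)(struct bpf_timer *timer, __u64 nsecs) = (void *) 170;
static long (*bpf_timer_cancel)(struct bpf_timer *timer) = (void *) 171;
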
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 2c1ba70abbf1..d25bbcdad8e6 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -4778,6 +4778,38 @@  union bpf_attr {
  * 		Execute close syscall for given FD.
  * 	Return
  * 		A syscall result.
+ *
+ * long bpf_timer_init(struct bpf_timer *timer, void *callback_fn, int flags)
+ *	Description
+ *		Initialize the timer to call the *callback_fn* static function.
+ *		The first 4 bits of *flags* specify the clockid. Only
+ *		CLOCK_MONOTONIC, CLOCK_REALTIME and CLOCK_BOOTTIME are allowed.
+ *		All other bits of *flags* are reserved.
+ *	Return
+ *		0 on success.
+ *		**-EBUSY** if *timer* is already initialized.
+ *		**-EINVAL** if invalid *flags* are passed.
+ *
+ * long bpf_timer_start(struct bpf_timer *timer, u64 nsecs)
+ *	Description
+ *		Start the timer and set its expiration *nsecs* nanoseconds from
+ *		the current time. The timer *callback_fn* will be invoked in
+ *		soft irq context on some cpu and will not repeat unless another
+ *		bpf_timer_start() call is made. In that case the next invocation
+ *		can run on a different cpu.
+ *	Return
+ *		0 on success.
+ *		**-EINVAL** if *timer* was not initialized with bpf_timer_init() earlier.
+ *
+ * long bpf_timer_cancel(struct bpf_timer *timer)
+ *	Description
+ *		Cancel the timer and wait for callback_fn to finish if it was running.
+ *	Return
+ *		0 if the timer was not active.
+ *		1 if the timer was active.
+ *		**-EINVAL** if *timer* was not initialized with bpf_timer_init() earlier.
+ *		**-EDEADLK** if *callback_fn* tried to call bpf_timer_cancel() on its
+ *		own timer, which would have led to a deadlock otherwise.
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -4949,6 +4981,9 @@  union bpf_attr {
 	FN(sys_bpf),			\
 	FN(btf_find_by_name_kind),	\
 	FN(sys_close),			\
+	FN(timer_init),			\
+	FN(timer_start),		\
+	FN(timer_cancel),		\
 	/* */
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
@@ -6061,6 +6096,11 @@  struct bpf_spin_lock {
 	__u32	val;
 };
 
+struct bpf_timer {
+	__u64 :64;
+	__u64 :64;
+};
+
 struct bpf_sysctl {
 	__u32	write;		/* Sysctl is being read (= 0) or written (= 1).
 				 * Allows 1,2,4-byte read, but no write.