diff mbox series

[bpf-next,v2,11/16] bpf: wq: add bpf_wq_init

Message ID 20240420-bpf_wq-v2-11-6c986a5a741f@kernel.org (mailing list archive)
State Accepted
Commit eb48f6cd41a0f7803770a76bbffb6bd5b1b2ae2f
Headers show
Series Introduce bpf_wq | expand

Commit Message

Benjamin Tissoires April 20, 2024, 9:09 a.m. UTC
We need to teach the verifier about the second argument, which is declared
as void * but is actually of type KF_ARG_PTR_TO_MAP. We could have dropped
this extra case by declaring the second argument as struct bpf_map *,
but that would force users to add an extra cast to get their programs to
compile.

We also need to duplicate the timer code that checks whether the map
argument matches the provided workqueue.

Signed-off-by: Benjamin Tissoires <bentiss@kernel.org>

---

FWIW, I still have one concern with this implementation:
- bpf_wq_work() accesses ->prog without protection, and I think it might
  race with bpf_wq_set_callback(): if we have the following:

  CPU 0                                     CPU 1
  bpf_wq_set_callback()
  bpf_start()
                                            bpf_wq_work():
                                              prog = cb->prog;

  bpf_wq_set_callback()
    cb->prog = prog;
    bpf_prog_put(prev)
    rcu_assign_ptr(cb->callback_fn,
                   callback_fn);
                                           callback = READ_ONCE(w->cb.callback_fn);

  As I understand it, callback_fn is fine, and prog might be too, but we
  clearly have an inconsistency between "prog" and "callback_fn", as they
  can come from two different bpf_wq_set_callback() calls.

  IMO we should protect this with the async->lock, but I'm not sure
  whether that is OK or not.

---

changes in v2:
- dropped BPF_F_WQ_SLEEPABLE
- use __map suffix to drop special case in verifier
- in bpf_wq_work: use new local variable cb
---
 kernel/bpf/helpers.c | 104 ++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 102 insertions(+), 2 deletions(-)

Comments

Alexei Starovoitov April 24, 2024, 2:55 a.m. UTC | #1
On Sat, Apr 20, 2024 at 2:10 AM Benjamin Tissoires <bentiss@kernel.org> wrote:
>
> We need to teach the verifier about the second argument which is declared
> as void * but which is of type KF_ARG_PTR_TO_MAP. We could have dropped
> this extra case if we declared the second argument as struct bpf_map *,
> but that means users will have to do extra casting to have their program
> compile.
>
> We also need to duplicate the timer code for the checking if the map
> argument is matching the provided workqueue.
>
> Signed-off-by: Benjamin Tissoires <bentiss@kernel.org>
>
> ---
>
> FWIW, I still have one concern with this implementation:
> - bpf_wq_work() access ->prog without protection, but I think this might
>   be racing with bpf_wq_set_callback(): if we have the following:
>
>   CPU 0                                     CPU 1
>   bpf_wq_set_callback()
>   bpf_start()
>                                             bpf_wq_work():
>                                               prog = cb->prog;
>
>   bpf_wq_set_callback()
>     cb->prog = prog;
>     bpf_prog_put(prev)
>     rcu_assign_ptr(cb->callback_fn,
>                    callback_fn);
>                                            callback = READ_ONCE(w->cb.callback_fn);
>
>   As I understand callback_fn is fine, prog might be, but we clearly
>   have an inconstency between "prog" and "callback_fn" as they can come
>   from 2 different bpf_wq_set_callback() calls.
>
>   IMO we should protect this by the async->lock, but I'm not sure if
>   it's OK or not.

I see the concern, but I think it's overkill.
Here 'prog' is used to pass into __bpf_prog_enter_sleepable_recur()
to keep the standard pattern of calling into a sleepable prog.
But it won't recurse.
We could open-code migrate_disable() etc. from there, everything
except this_cpu_inc_return, but that would be overkill.
The passed 'prog' is irrelevant.
If somebody tries really hard by having two progs sharing the same
map with a bpf_wq and racing to set_callback... I can see how
prog won't match callback, but it won't make a difference.
prog is not going to trigger the recursion check (unless somebody
is really obsessed) and not going to UAF.
I imagine it's possible to attach somewhere in the core wq callback
invocation path with fentry, set_callback to the same prog,
and technically it's kinda sorta recursion, but different subprogs,
so not a safety issue.
The code as-is is fine, imo.
Alexei Starovoitov April 24, 2024, 3:06 p.m. UTC | #2
On Tue, Apr 23, 2024 at 7:55 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Sat, Apr 20, 2024 at 2:10 AM Benjamin Tissoires <bentiss@kernel.org> wrote:
> >
> > [...]
>
> [...]
> The code as-is is fine. imo.

After sleeping on it, I realized that the use of
__bpf_prog_enter_sleepable_recur() here is very much incorrect :(
The tests are passing only because we don't inc prog->active
when we run the prog via prog_run cmd.
Adding the following:
diff --git a/net/bpf/test_run.c b/net/bpf/test_run.c
index f6aad4ed2ab2..0732dfe22204 100644
--- a/net/bpf/test_run.c
+++ b/net/bpf/test_run.c
@@ -1514,7 +1514,9 @@ int bpf_prog_test_run_syscall(struct bpf_prog *prog,
        }

        rcu_read_lock_trace();
+       this_cpu_inc_return(*(prog->active));
        retval = bpf_prog_run_pin_on_cpu(prog, ctx);
+       this_cpu_dec(*(prog->active));
        rcu_read_unlock_trace();

makes the test fail sporadically, or fail 100% of the time when the
kernel is booted with 1 CPU.

Could you send a quick follow up to
replace __bpf_prog_enter_sleepable_recur() with
        rcu_read_lock_trace();
        migrate_disable();
?

Or I'll do it in an hour or so.
Alexei Starovoitov April 24, 2024, 4:14 p.m. UTC | #3
On Wed, Apr 24, 2024 at 8:06 AM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Tue, Apr 23, 2024 at 7:55 PM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > [...]
>
> [...]
> Or I'll do it in an hour or so.

Considering the two broken-build reports already,
I've applied the following fix:
https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/commit/?id=dc92febf7b93da5049fe177804e6b1961fcc6bd7

that addresses the build issue on !JIT and fixes this recursion problem.
Benjamin Tissoires April 25, 2024, 8:50 a.m. UTC | #4
On Apr 24 2024, Alexei Starovoitov wrote:
> [...]
> 
> Considering two broken-build reports already
> I've applied the following fix:
> https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/commit/?id=dc92febf7b93da5049fe177804e6b1961fcc6bd7
> 
> that addresses the build issue on !JIT and fixes this recursion problem.

Thanks a lot for fixing this on my behalf. I was slightly puzzled by the
broken-build report, but I really wasn't in a position to think straight
yesterday (I got a pretty bad muscle pain in my back and had to go to
the doctor for painkillers).

I haven't forgotten about the double list walk in 9/16; I'll hopefully
send the fix today.

Cheers,
Benjamin

Patch

diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
index 9fd12d480b8b..b12486adc47a 100644
--- a/kernel/bpf/helpers.c
+++ b/kernel/bpf/helpers.c
@@ -1109,11 +1109,18 @@  struct bpf_hrtimer {
 	struct hrtimer timer;
 };
 
-/* the actual struct hidden inside uapi struct bpf_timer */
+struct bpf_work {
+	struct bpf_async_cb cb;
+	struct work_struct work;
+	struct work_struct delete_work;
+};
+
+/* the actual struct hidden inside uapi struct bpf_timer and bpf_wq */
 struct bpf_async_kern {
 	union {
 		struct bpf_async_cb *cb;
 		struct bpf_hrtimer *timer;
+		struct bpf_work *work;
 	};
 	/* bpf_spin_lock is used here instead of spinlock_t to make
 	 * sure that it always fits into space reserved by struct bpf_timer
@@ -1124,6 +1131,7 @@  struct bpf_async_kern {
 
 enum bpf_async_type {
 	BPF_ASYNC_TYPE_TIMER = 0,
+	BPF_ASYNC_TYPE_WQ,
 };
 
 static DEFINE_PER_CPU(struct bpf_hrtimer *, hrtimer_running);
@@ -1167,11 +1175,64 @@  static enum hrtimer_restart bpf_timer_cb(struct hrtimer *hrtimer)
 	return HRTIMER_NORESTART;
 }
 
+static void bpf_wq_work(struct work_struct *work)
+{
+	struct bpf_work *w = container_of(work, struct bpf_work, work);
+	struct bpf_tramp_run_ctx __maybe_unused run_ctx;
+	struct bpf_async_cb *cb = &w->cb;
+	struct bpf_prog *prog = cb->prog;
+	struct bpf_map *map = cb->map;
+	bpf_callback_t callback_fn;
+	void *value = cb->value;
+	void *key;
+	u32 idx;
+
+	BTF_TYPE_EMIT(struct bpf_wq);
+
+	callback_fn = READ_ONCE(cb->callback_fn);
+	if (!callback_fn || !prog)
+		return;
+
+	if (map->map_type == BPF_MAP_TYPE_ARRAY) {
+		struct bpf_array *array = container_of(map, struct bpf_array, map);
+
+		/* compute the key */
+		idx = ((char *)value - array->value) / array->elem_size;
+		key = &idx;
+	} else { /* hash or lru */
+		key = value - round_up(map->key_size, 8);
+	}
+
+	run_ctx.bpf_cookie = 0;
+
+	if (!__bpf_prog_enter_sleepable_recur(prog, &run_ctx)) {
+		/* recursion detected */
+		__bpf_prog_exit_sleepable_recur(prog, 0, &run_ctx);
+		return;
+	}
+
+	callback_fn((u64)(long)map, (u64)(long)key, (u64)(long)value, 0, 0);
+	/* The verifier checked that return value is zero. */
+
+	__bpf_prog_exit_sleepable_recur(prog, 0 /* bpf_prog_run does runtime stats */,
+					&run_ctx);
+}
+
+static void bpf_wq_delete_work(struct work_struct *work)
+{
+	struct bpf_work *w = container_of(work, struct bpf_work, delete_work);
+
+	cancel_work_sync(&w->work);
+
+	kfree_rcu(w, cb.rcu);
+}
+
 static int __bpf_async_init(struct bpf_async_kern *async, struct bpf_map *map, u64 flags,
 			    enum bpf_async_type type)
 {
 	struct bpf_async_cb *cb;
 	struct bpf_hrtimer *t;
+	struct bpf_work *w;
 	clockid_t clockid;
 	size_t size;
 	int ret = 0;
@@ -1183,6 +1244,9 @@  static int __bpf_async_init(struct bpf_async_kern *async, struct bpf_map *map, u
 	case BPF_ASYNC_TYPE_TIMER:
 		size = sizeof(struct bpf_hrtimer);
 		break;
+	case BPF_ASYNC_TYPE_WQ:
+		size = sizeof(struct bpf_work);
+		break;
 	default:
 		return -EINVAL;
 	}
@@ -1201,13 +1265,22 @@  static int __bpf_async_init(struct bpf_async_kern *async, struct bpf_map *map, u
 		goto out;
 	}
 
-	if (type == BPF_ASYNC_TYPE_TIMER) {
+	switch (type) {
+	case BPF_ASYNC_TYPE_TIMER:
 		clockid = flags & (MAX_CLOCKS - 1);
 		t = (struct bpf_hrtimer *)cb;
 
 		hrtimer_init(&t->timer, clockid, HRTIMER_MODE_REL_SOFT);
 		t->timer.function = bpf_timer_cb;
 		cb->value = (void *)async - map->record->timer_off;
+		break;
+	case BPF_ASYNC_TYPE_WQ:
+		w = (struct bpf_work *)cb;
+
+		INIT_WORK(&w->work, bpf_wq_work);
+		INIT_WORK(&w->delete_work, bpf_wq_delete_work);
+		cb->value = (void *)async - map->record->wq_off;
+		break;
 	}
 	cb->map = map;
 	cb->prog = NULL;
@@ -1473,7 +1546,19 @@  void bpf_timer_cancel_and_free(void *val)
  */
 void bpf_wq_cancel_and_free(void *val)
 {
+	struct bpf_work *work;
+
 	BTF_TYPE_EMIT(struct bpf_wq);
+
+	work = (struct bpf_work *)__bpf_async_cancel_and_free(val);
+	if (!work)
+		return;
+	/* Trigger cancel of the sleepable work, but *do not* wait for
+	 * it to finish if it was running as we might not be in a
+	 * sleepable context.
+	 * kfree will be called once the work has finished.
+	 */
+	schedule_work(&work->delete_work);
 }
 
 BPF_CALL_2(bpf_kptr_xchg, void *, map_value, void *, ptr)
@@ -2612,6 +2697,20 @@  __bpf_kfunc void bpf_throw(u64 cookie)
 	WARN(1, "A call to BPF exception callback should never return\n");
 }
 
+__bpf_kfunc int bpf_wq_init(struct bpf_wq *wq, void *p__map, unsigned int flags)
+{
+	struct bpf_async_kern *async = (struct bpf_async_kern *)wq;
+	struct bpf_map *map = p__map;
+
+	BUILD_BUG_ON(sizeof(struct bpf_async_kern) > sizeof(struct bpf_wq));
+	BUILD_BUG_ON(__alignof__(struct bpf_async_kern) != __alignof__(struct bpf_wq));
+
+	if (flags)
+		return -EINVAL;
+
+	return __bpf_async_init(async, map, flags, BPF_ASYNC_TYPE_WQ);
+}
+
 __bpf_kfunc_end_defs();
 
 BTF_KFUNCS_START(generic_btf_ids)
@@ -2689,6 +2788,7 @@  BTF_ID_FLAGS(func, bpf_dynptr_is_rdonly)
 BTF_ID_FLAGS(func, bpf_dynptr_size)
 BTF_ID_FLAGS(func, bpf_dynptr_clone)
 BTF_ID_FLAGS(func, bpf_modify_return_test_tp)
+BTF_ID_FLAGS(func, bpf_wq_init)
 BTF_KFUNCS_END(common_btf_ids)
 
 static const struct btf_kfunc_id_set common_kfunc_set = {