
[v3,bpf-next,13/15] bpf: Prepare bpf_mem_alloc to be used by sleepable bpf programs.

Message ID 20220819214232.18784-14-alexei.starovoitov@gmail.com (mailing list archive)
State New
Series bpf: BPF specific memory allocator.

Commit Message

Alexei Starovoitov Aug. 19, 2022, 9:42 p.m. UTC
From: Alexei Starovoitov <ast@kernel.org>

Use call_rcu_tasks_trace() to wait for sleepable progs to finish.
Then use call_rcu() to wait for normal progs to finish
and finally do free_one() on each element when freeing objects
into global memory pool.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 kernel/bpf/memalloc.c | 14 +++++++++++++-
 1 file changed, 13 insertions(+), 1 deletion(-)

Comments

Kumar Kartikeya Dwivedi Aug. 19, 2022, 10:21 p.m. UTC | #1
On Fri, 19 Aug 2022 at 23:43, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> From: Alexei Starovoitov <ast@kernel.org>
>
> Use call_rcu_tasks_trace() to wait for sleepable progs to finish.
> Then use call_rcu() to wait for normal progs to finish
> and finally do free_one() on each element when freeing objects
> into global memory pool.
>
> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
> ---

I fear this can make OOM issues very easy to run into, because one
sleepable prog that sleeps for a long period of time can hold the
freeing of elements from another sleepable prog which either does not
sleep often or sleeps for a very short period of time, and has a high
update frequency. I'm mostly worried that unrelated sleepable programs
not even using the same map will begin to affect each other.

Have you considered other options? E.g. we could directly expose
bpf_rcu_read_lock/bpf_rcu_read_unlock to the program and enforce that
access to RCU protected map lookups only happens in such read
sections, and unlock invalidates all RCU protected pointers? Sleepable
helpers can then not be invoked inside the BPF RCU read section. The
program uses RCU read section while accessing such maps, and sleeps
after doing bpf_rcu_read_unlock. They can be kfuncs.

It might also be useful in general, to access RCU protected data from
sleepable programs (i.e. make some sections of the program RCU
protected and non-sleepable at runtime). It will allow use of elements
from dynamically allocated maps with bpf_mem_alloc while not having to
wait for RCU tasks trace grace period, which can extend into minutes
(or even longer if unlucky).

One difference is that with the approach in this patch you can pin a
lookup across a sleep cycle, which is not possible with preallocated
maps or with the explicit RCU section above, but I'm not sure it's
worth it. It isn't possible now.

>  kernel/bpf/memalloc.c | 14 +++++++++++++-
>  1 file changed, 13 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
> index 9e5ad7dc4dc7..d34383dc12d9 100644
> --- a/kernel/bpf/memalloc.c
> +++ b/kernel/bpf/memalloc.c
> @@ -224,6 +224,13 @@ static void __free_rcu(struct rcu_head *head)
>         atomic_set(&c->call_rcu_in_progress, 0);
>  }
>
> +static void __free_rcu_tasks_trace(struct rcu_head *head)
> +{
> +       struct bpf_mem_cache *c = container_of(head, struct bpf_mem_cache, rcu);
> +
> +       call_rcu(&c->rcu, __free_rcu);
> +}
> +
>  static void enque_to_free(struct bpf_mem_cache *c, void *obj)
>  {
>         struct llist_node *llnode = obj;
> @@ -249,7 +256,11 @@ static void do_call_rcu(struct bpf_mem_cache *c)
>                  * from __free_rcu() and from drain_mem_cache().
>                  */
>                 __llist_add(llnode, &c->waiting_for_gp);
> -       call_rcu(&c->rcu, __free_rcu);
> +       /* Use call_rcu_tasks_trace() to wait for sleepable progs to finish.
> +        * Then use call_rcu() to wait for normal progs to finish
> +        * and finally do free_one() on each element.
> +        */
> +       call_rcu_tasks_trace(&c->rcu, __free_rcu_tasks_trace);
>  }
>
>  static void free_bulk(struct bpf_mem_cache *c)
> @@ -452,6 +463,7 @@ void bpf_mem_alloc_destroy(struct bpf_mem_alloc *ma)
>                 /* c->waiting_for_gp list was drained, but __free_rcu might
>                  * still execute. Wait for it now before we free 'c'.
>                  */
> +               rcu_barrier_tasks_trace();
>                 rcu_barrier();
>                 free_percpu(ma->cache);
>                 ma->cache = NULL;
> --
> 2.30.2
>
Alexei Starovoitov Aug. 19, 2022, 10:43 p.m. UTC | #2
On Sat, Aug 20, 2022 at 12:21:46AM +0200, Kumar Kartikeya Dwivedi wrote:
> On Fri, 19 Aug 2022 at 23:43, Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > From: Alexei Starovoitov <ast@kernel.org>
> >
> > Use call_rcu_tasks_trace() to wait for sleepable progs to finish.
> > Then use call_rcu() to wait for normal progs to finish
> > and finally do free_one() on each element when freeing objects
> > into global memory pool.
> >
> > Signed-off-by: Alexei Starovoitov <ast@kernel.org>
> > ---
> 
> I fear this can make OOM issues very easy to run into, because one
> sleepable prog that sleeps for a long period of time can hold the
> freeing of elements from another sleepable prog which either does not
> sleep often or sleeps for a very short period of time, and has a high
> update frequency. I'm mostly worried that unrelated sleepable programs
> not even using the same map will begin to affect each other.

'sleep for long time'? sleepable bpf prog doesn't mean that they can sleep.
sleepable progs can copy_from_user, but they're not allowed to waste time.
I don't share OOM concerns at all.
max_entries and memcg limits are still there and enforced.
dynamic map is strictly better and memory efficient than full prealloc.

> Have you considered other options? E.g. we could directly expose
> bpf_rcu_read_lock/bpf_rcu_read_unlock to the program and enforce that
> access to RCU protected map lookups only happens in such read
> sections, and unlock invalidates all RCU protected pointers? Sleepable
> helpers can then not be invoked inside the BPF RCU read section. The
> program uses RCU read section while accessing such maps, and sleeps
> after doing bpf_rcu_read_unlock. They can be kfuncs.

Yes. We can add explicit bpf_rcu_read_lock and teach verifier about RCU CS,
but I don't see the value specifically for sleepable progs.
Current sleepable progs can do map lookup without extra kfuncs.
Explicit CS would force progs to be rewritten which is not great.

> It might also be useful in general, to access RCU protected data from
> sleepable programs (i.e. make some sections of the program RCU
> protected and non-sleepable at runtime). It will allow use of elements

For other cases, sure. We can introduce RCU protected objects and
explicit bpf_rcu_read_lock.

> from dynamically allocated maps with bpf_mem_alloc while not having to
> wait for RCU tasks trace grace period, which can extend into minutes
> (or even longer if unlucky).

sleepable bpf prog that lasts minutes? In what kind of situation?
We don't have bpf_sleep() helper and not going to add one any time soon.
Kumar Kartikeya Dwivedi Aug. 19, 2022, 10:56 p.m. UTC | #3
On Sat, 20 Aug 2022 at 00:43, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Sat, Aug 20, 2022 at 12:21:46AM +0200, Kumar Kartikeya Dwivedi wrote:
> > On Fri, 19 Aug 2022 at 23:43, Alexei Starovoitov
> > <alexei.starovoitov@gmail.com> wrote:
> > >
> > > From: Alexei Starovoitov <ast@kernel.org>
> > >
> > > Use call_rcu_tasks_trace() to wait for sleepable progs to finish.
> > > Then use call_rcu() to wait for normal progs to finish
> > > and finally do free_one() on each element when freeing objects
> > > into global memory pool.
> > >
> > > Signed-off-by: Alexei Starovoitov <ast@kernel.org>
> > > ---
> >
> > I fear this can make OOM issues very easy to run into, because one
> > sleepable prog that sleeps for a long period of time can hold the
> > freeing of elements from another sleepable prog which either does not
> > sleep often or sleeps for a very short period of time, and has a high
> > update frequency. I'm mostly worried that unrelated sleepable programs
> > not even using the same map will begin to affect each other.
>
> 'sleep for long time'? sleepable bpf prog doesn't mean that they can sleep.
> sleepable progs can copy_from_user, but they're not allowed to waste time.

It is certainly possible to waste time, but indirectly, not through
the BPF program itself.

If you have userfaultfd enabled (for unpriv users), an unprivileged
user can trap a sleepable BPF prog (say LSM) using bpf_copy_from_user
for as long as it wants. A similar case can be done using FUSE, IIRC.

You can then say it's a problem about unprivileged users being able to
use userfaultfd or FUSE, or we could think about fixing
bpf_copy_from_user to return -EFAULT for this case, but it is totally
possible right now for malicious userspace to extend the tasks trace
gp like this for minutes (or even longer) on a system where sleepable
BPF programs are using e.g. bpf_copy_from_user.

> I don't share OOM concerns at all.
> max_entries and memcg limits are still there and enforced.
> dynamic map is strictly better and memory efficient than full prealloc.
>
> > Have you considered other options? E.g. we could directly expose
> > bpf_rcu_read_lock/bpf_rcu_read_unlock to the program and enforce that
> > access to RCU protected map lookups only happens in such read
> > sections, and unlock invalidates all RCU protected pointers? Sleepable
> > helpers can then not be invoked inside the BPF RCU read section. The
> > program uses RCU read section while accessing such maps, and sleeps
> > after doing bpf_rcu_read_unlock. They can be kfuncs.
>
> Yes. We can add explicit bpf_rcu_read_lock and teach verifier about RCU CS,
> but I don't see the value specifically for sleepable progs.
> Current sleepable progs can do map lookup without extra kfuncs.
> Explicit CS would force progs to be rewritten which is not great.
>
> > It might also be useful in general, to access RCU protected data from
> > sleepable programs (i.e. make some sections of the program RCU
> > protected and non-sleepable at runtime). It will allow use of elements
>
> For other cases, sure. We can introduce RCU protected objects and
> explicit bpf_rcu_read_lock.
>
> > from dynamically allocated maps with bpf_mem_alloc while not having to
> > wait for RCU tasks trace grace period, which can extend into minutes
> > (or even longer if unlucky).
>
> sleepable bpf prog that lasts minutes? In what kind of situation?
> We don't have bpf_sleep() helper and not going to add one any time soon.
Alexei Starovoitov Aug. 19, 2022, 11:01 p.m. UTC | #4
On Fri, Aug 19, 2022 at 3:56 PM Kumar Kartikeya Dwivedi
<memxor@gmail.com> wrote:
>
> On Sat, 20 Aug 2022 at 00:43, Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > On Sat, Aug 20, 2022 at 12:21:46AM +0200, Kumar Kartikeya Dwivedi wrote:
> > > On Fri, 19 Aug 2022 at 23:43, Alexei Starovoitov
> > > <alexei.starovoitov@gmail.com> wrote:
> > > >
> > > > From: Alexei Starovoitov <ast@kernel.org>
> > > >
> > > > Use call_rcu_tasks_trace() to wait for sleepable progs to finish.
> > > > Then use call_rcu() to wait for normal progs to finish
> > > > and finally do free_one() on each element when freeing objects
> > > > into global memory pool.
> > > >
> > > > Signed-off-by: Alexei Starovoitov <ast@kernel.org>
> > > > ---
> > >
> > > I fear this can make OOM issues very easy to run into, because one
> > > sleepable prog that sleeps for a long period of time can hold the
> > > freeing of elements from another sleepable prog which either does not
> > > sleep often or sleeps for a very short period of time, and has a high
> > > update frequency. I'm mostly worried that unrelated sleepable programs
> > > not even using the same map will begin to affect each other.
> >
> > 'sleep for long time'? sleepable bpf prog doesn't mean that they can sleep.
> > sleepable progs can copy_from_user, but they're not allowed to waste time.
>
> It is certainly possible to waste time, but indirectly, not through
> the BPF program itself.
>
> If you have userfaultfd enabled (for unpriv users), an unprivileged
> user can trap a sleepable BPF prog (say LSM) using bpf_copy_from_user
> for as long as it wants. A similar case can be done using FUSE, IIRC.
>
> You can then say it's a problem about unprivileged users being able to
> use userfaultfd or FUSE, or we could think about fixing
> bpf_copy_from_user to return -EFAULT for this case, but it is totally
> possible right now for malicious userspace to extend the tasks trace
> gp like this for minutes (or even longer) on a system where sleepable
> BPF programs are using e.g. bpf_copy_from_user.

Well in that sense userfaultfd can keep all sorts of things
in the kernel from making progress.
But nothing to do with OOM.
There is still the max_entries limit.
The amount of objects in waiting_for_gp is guaranteed to be less
than full prealloc.
Kumar Kartikeya Dwivedi Aug. 24, 2022, 7:49 p.m. UTC | #5
On Sat, 20 Aug 2022 at 01:01, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Fri, Aug 19, 2022 at 3:56 PM Kumar Kartikeya Dwivedi
> <memxor@gmail.com> wrote:
> >
> > On Sat, 20 Aug 2022 at 00:43, Alexei Starovoitov
> > <alexei.starovoitov@gmail.com> wrote:
> > >
> > > On Sat, Aug 20, 2022 at 12:21:46AM +0200, Kumar Kartikeya Dwivedi wrote:
> > > > On Fri, 19 Aug 2022 at 23:43, Alexei Starovoitov
> > > > <alexei.starovoitov@gmail.com> wrote:
> > > > >
> > > > > From: Alexei Starovoitov <ast@kernel.org>
> > > > >
> > > > > Use call_rcu_tasks_trace() to wait for sleepable progs to finish.
> > > > > Then use call_rcu() to wait for normal progs to finish
> > > > > and finally do free_one() on each element when freeing objects
> > > > > into global memory pool.
> > > > >
> > > > > Signed-off-by: Alexei Starovoitov <ast@kernel.org>
> > > > > ---
> > > >
> > > > I fear this can make OOM issues very easy to run into, because one
> > > > sleepable prog that sleeps for a long period of time can hold the
> > > > freeing of elements from another sleepable prog which either does not
> > > > sleep often or sleeps for a very short period of time, and has a high
> > > > update frequency. I'm mostly worried that unrelated sleepable programs
> > > > not even using the same map will begin to affect each other.
> > >
> > > 'sleep for long time'? sleepable bpf prog doesn't mean that they can sleep.
> > > sleepable progs can copy_from_user, but they're not allowed to waste time.
> >
> > It is certainly possible to waste time, but indirectly, not through
> > the BPF program itself.
> >
> > If you have userfaultfd enabled (for unpriv users), an unprivileged
> > user can trap a sleepable BPF prog (say LSM) using bpf_copy_from_user
> > for as long as it wants. A similar case can be done using FUSE, IIRC.
> >
> > You can then say it's a problem about unprivileged users being able to
> > use userfaultfd or FUSE, or we could think about fixing
> > bpf_copy_from_user to return -EFAULT for this case, but it is totally
> > possible right now for malicious userspace to extend the tasks trace
> > gp like this for minutes (or even longer) on a system where sleepable
> > BPF programs are using e.g. bpf_copy_from_user.
>
> Well in that sense userfaultfd can keep all sorts of things
> in the kernel from making progress.
> But nothing to do with OOM.
> There is still the max_entries limit.
> The amount of objects in waiting_for_gp is guaranteed to be less
> than full prealloc.

My thinking was that once you hold the GP using uffd, we can assume
you will eventually hit a case where all such maps on the system have
their max_entries exhausted. So yes, it probably won't OOM, but it
would be bad regardless.

I think this just begs instead that uffd (and even FUSE) should not be
available to untrusted processes on the system by default. Both are
used regularly to widen hard to hit race conditions in the kernel.

But anyway, there's no easy way currently to guarantee the lifetime of
elements for the sleepable case while being as low overhead as trace
RCU, so it makes sense to go ahead with this.
Alexei Starovoitov Aug. 25, 2022, 12:08 a.m. UTC | #6
On Wed, Aug 24, 2022 at 12:50 PM Kumar Kartikeya Dwivedi
<memxor@gmail.com> wrote:
>
> On Sat, 20 Aug 2022 at 01:01, Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > On Fri, Aug 19, 2022 at 3:56 PM Kumar Kartikeya Dwivedi
> > <memxor@gmail.com> wrote:
> > >
> > > On Sat, 20 Aug 2022 at 00:43, Alexei Starovoitov
> > > <alexei.starovoitov@gmail.com> wrote:
> > > >
> > > > On Sat, Aug 20, 2022 at 12:21:46AM +0200, Kumar Kartikeya Dwivedi wrote:
> > > > > On Fri, 19 Aug 2022 at 23:43, Alexei Starovoitov
> > > > > <alexei.starovoitov@gmail.com> wrote:
> > > > > >
> > > > > > From: Alexei Starovoitov <ast@kernel.org>
> > > > > >
> > > > > > Use call_rcu_tasks_trace() to wait for sleepable progs to finish.
> > > > > > Then use call_rcu() to wait for normal progs to finish
> > > > > > and finally do free_one() on each element when freeing objects
> > > > > > into global memory pool.
> > > > > >
> > > > > > Signed-off-by: Alexei Starovoitov <ast@kernel.org>
> > > > > > ---
> > > > >
> > > > > I fear this can make OOM issues very easy to run into, because one
> > > > > sleepable prog that sleeps for a long period of time can hold the
> > > > > freeing of elements from another sleepable prog which either does not
> > > > > sleep often or sleeps for a very short period of time, and has a high
> > > > > update frequency. I'm mostly worried that unrelated sleepable programs
> > > > > not even using the same map will begin to affect each other.
> > > >
> > > > 'sleep for long time'? sleepable bpf prog doesn't mean that they can sleep.
> > > > sleepable progs can copy_from_user, but they're not allowed to waste time.
> > >
> > > It is certainly possible to waste time, but indirectly, not through
> > > the BPF program itself.
> > >
> > > If you have userfaultfd enabled (for unpriv users), an unprivileged
> > > user can trap a sleepable BPF prog (say LSM) using bpf_copy_from_user
> > > for as long as it wants. A similar case can be done using FUSE, IIRC.
> > >
> > > You can then say it's a problem about unprivileged users being able to
> > > use userfaultfd or FUSE, or we could think about fixing
> > > bpf_copy_from_user to return -EFAULT for this case, but it is totally
> > > possible right now for malicious userspace to extend the tasks trace
> > > gp like this for minutes (or even longer) on a system where sleepable
> > > BPF programs are using e.g. bpf_copy_from_user.
> >
> > Well in that sense userfaultfd can keep all sorts of things
> > in the kernel from making progress.
> > But nothing to do with OOM.
> > There is still the max_entries limit.
> > The amount of objects in waiting_for_gp is guaranteed to be less
> > than full prealloc.
>
> My thinking was that once you hold the GP using uffd, we can assume
> you will eventually hit a case where all such maps on the system have
> their max_entries exhausted. So yes, it probably won't OOM, but it
> would be bad regardless.
>
> I think this just begs instead that uffd (and even FUSE) should not be
> available to untrusted processes on the system by default. Both are
> used regularly to widen hard to hit race conditions in the kernel.
>
> But anyway, there's no easy way currently to guarantee the lifetime of
> elements for the sleepable case while being as low overhead as trace
> RCU, so it makes sense to go ahead with this.

Right. We evaluated SRCU for sleepable progs and it had too much overhead.
That's the reason rcu_tasks_trace was added, and sleepable bpf progs
are its only user so far.
The point I'm arguing is that call_rcu_tasks_trace in this patch
doesn't add any more mm concerns than the existing call_rcu does.
There are also CONFIG_PREEMPT_RCU and RT. uffd will cause similar
issues in such configs too.

Patch

diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
index 9e5ad7dc4dc7..d34383dc12d9 100644
--- a/kernel/bpf/memalloc.c
+++ b/kernel/bpf/memalloc.c
@@ -224,6 +224,13 @@  static void __free_rcu(struct rcu_head *head)
 	atomic_set(&c->call_rcu_in_progress, 0);
 }
 
+static void __free_rcu_tasks_trace(struct rcu_head *head)
+{
+	struct bpf_mem_cache *c = container_of(head, struct bpf_mem_cache, rcu);
+
+	call_rcu(&c->rcu, __free_rcu);
+}
+
 static void enque_to_free(struct bpf_mem_cache *c, void *obj)
 {
 	struct llist_node *llnode = obj;
@@ -249,7 +256,11 @@  static void do_call_rcu(struct bpf_mem_cache *c)
 		 * from __free_rcu() and from drain_mem_cache().
 		 */
 		__llist_add(llnode, &c->waiting_for_gp);
-	call_rcu(&c->rcu, __free_rcu);
+	/* Use call_rcu_tasks_trace() to wait for sleepable progs to finish.
+	 * Then use call_rcu() to wait for normal progs to finish
+	 * and finally do free_one() on each element.
+	 */
+	call_rcu_tasks_trace(&c->rcu, __free_rcu_tasks_trace);
 }
 
 static void free_bulk(struct bpf_mem_cache *c)
@@ -452,6 +463,7 @@  void bpf_mem_alloc_destroy(struct bpf_mem_alloc *ma)
 		/* c->waiting_for_gp list was drained, but __free_rcu might
 		 * still execute. Wait for it now before we free 'c'.
 		 */
+		rcu_barrier_tasks_trace();
 		rcu_barrier();
 		free_percpu(ma->cache);
 		ma->cache = NULL;
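
The last hunk places rcu_barrier_tasks_trace() before rcu_barrier(). A plausible reading is that the order matters because the tasks-trace callback is what queues the regular RCU callback. The userspace sketch below illustrates that reading. All model_* names are hypothetical, and model_barrier() is a simplification that only runs callbacks already queued for one flavor; it is not how the kernel barriers are implemented.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Userspace model only; these names are hypothetical, not kernel APIs. */
enum gp_kind { GP_TASKS_TRACE, GP_NORMAL, GP_KINDS };

typedef void (*model_cb)(void *arg);

struct model_slot {
	model_cb func;
	void *arg;
};

static struct model_slot pending[GP_KINDS];

static void model_call(enum gp_kind kind, model_cb cb, void *arg)
{
	assert(pending[kind].func == NULL);	/* one callback per flavor */
	pending[kind].func = cb;
	pending[kind].arg = arg;
}

/* Stand-in for rcu_barrier()/rcu_barrier_tasks_trace(): run whatever is
 * queued for this flavor right now. Callbacks queued later are not covered.
 */
static void model_barrier(enum gp_kind kind)
{
	struct model_slot slot = pending[kind];

	pending[kind].func = NULL;
	if (slot.func)
		slot.func(slot.arg);
}

struct model_cache {
	bool callback_in_flight;
};

/* Final stage: the deferred free ran, nothing references the cache. */
static void model_free_cb(void *arg)
{
	struct model_cache *c = arg;

	c->callback_in_flight = false;
}

/* First stage: chain into the regular flavor, as the patch does. */
static void model_tasks_trace_cb(void *arg)
{
	model_call(GP_NORMAL, model_free_cb, arg);
}

static void model_defer_free(struct model_cache *c)
{
	c->callback_in_flight = true;
	model_call(GP_TASKS_TRACE, model_tasks_trace_cb, c);
}

/* Returns true if the cache could be freed safely after the barriers. */
static bool model_destroy(struct model_cache *c, bool trace_barrier_first)
{
	if (trace_barrier_first) {
		model_barrier(GP_TASKS_TRACE);
		model_barrier(GP_NORMAL);
	} else {
		model_barrier(GP_NORMAL);
		model_barrier(GP_TASKS_TRACE);
	}
	return !c->callback_in_flight;
}
```

With the tasks-trace barrier first, the chained callback is queued before the regular barrier runs, so model_destroy() reports that nothing is in flight. With the order reversed, the regular barrier finds nothing to wait for and the chained callback is still pending when the function returns.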