diff mbox series

[v3] bpf: Remove in_atomic() from bpf_link_put().

Message ID 20230526112356.fOlWmeOF@linutronix.de (mailing list archive)
State Changes Requested
Delegated to: BPF
Headers show
Series [v3] bpf: Remove in_atomic() from bpf_link_put(). | expand

Checks

Context Check Description
bpf/vmtest-bpf-next-PR success PR summary
netdev/series_format warning Single patches do not need cover letters; Target tree name not specified in the subject
netdev/tree_selection success Guessed tree name to be net-next, async
netdev/fixes_present success Fixes tag not required for -next series
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 1470 this patch: 1470
netdev/cc_maintainers warning 8 maintainers not CCed: yhs@fb.com kpsingh@kernel.org martin.lau@linux.dev song@kernel.org sdf@google.com andrii@kernel.org jolsa@kernel.org haoluo@google.com
netdev/build_clang success Errors and warnings before: 176 this patch: 176
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn success Errors and warnings before: 1465 this patch: 1465
netdev/checkpatch success total: 0 errors, 0 warnings, 0 checks, 92 lines checked
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0
bpf/vmtest-bpf-next-VM_Test-1 success Logs for ShellCheck
bpf/vmtest-bpf-next-VM_Test-6 success Logs for set-matrix
bpf/vmtest-bpf-next-VM_Test-2 success Logs for build for aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-5 success Logs for build for x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-4 success Logs for build for x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-3 success Logs for build for s390x with gcc
bpf/vmtest-bpf-next-VM_Test-7 success Logs for test_maps on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-11 success Logs for test_progs on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-15 success Logs for test_progs_no_alu32 on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-19 success Logs for test_progs_no_alu32_parallel on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-22 success Logs for test_progs_parallel on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-25 success Logs for test_verifier on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-9 success Logs for test_maps on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-10 success Logs for test_maps on x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-13 success Logs for test_progs on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-14 success Logs for test_progs on x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-17 success Logs for test_progs_no_alu32 on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-18 success Logs for test_progs_no_alu32 on x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-20 success Logs for test_progs_no_alu32_parallel on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-21 success Logs for test_progs_no_alu32_parallel on x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-23 success Logs for test_progs_parallel on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-24 success Logs for test_progs_parallel on x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-27 success Logs for test_verifier on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-28 success Logs for test_verifier on x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-29 success Logs for veristat
bpf/vmtest-bpf-next-VM_Test-26 success Logs for test_verifier on s390x with gcc
bpf/vmtest-bpf-next-VM_Test-8 fail Logs for test_maps on s390x with gcc
bpf/vmtest-bpf-next-VM_Test-16 success Logs for test_progs_no_alu32 on s390x with gcc
bpf/vmtest-bpf-next-VM_Test-12 success Logs for test_progs on s390x with gcc

Commit Message

Sebastian Andrzej Siewior May 26, 2023, 11:23 a.m. UTC
bpf_free_inode() is invoked as a RCU callback. Usually RCU callbacks are
invoked within softirq context. By setting rcutree.use_softirq=0 boot
option the RCU callbacks will be invoked in a per-CPU kthread with
bottom halves disabled which implies a RCU read section.

On PREEMPT_RT the context remains fully preemptible. The RCU read
section however does not allow schedule() invocation. The latter happens
in mutex_lock() performed by bpf_trampoline_unlink_prog() originated
from bpf_link_put().

It was pointed out that the bpf_link_put() invocation should not be
delayed if originated from close(). It was also pointed out that other
invocations from within a syscall should also avoid the workqueue.
After an audit of all bpf_link_put() callers it looks that the only
atomic caller is the RCU callback. Everything else is called from
preemptible context because it is a syscall, a mutex_t is acquired near
by or due a GFP_KERNEL memory allocation.

Let bpf_link_put() free the resources directly. Add
bpf_link_put_from_atomic() which uses the kworker to free the
resources. Let bpf_any_put() invoke one or the other depending on the
context it is called from (RCU or not).

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
On 2023-05-25 10:51:23 [-0700], Andrii Nakryiko wrote:
v2…v3:
  - Drop bpf_link_put_direct(). Let bpf_link_put() do the direct free
    and add bpf_link_put_from_atomic() to do the delayed free via the
    worker.

v1…v2:
   - Add bpf_link_put_direct() to be used from bpf_link_release() as
     suggested.

> Looks good to me, but it's not sufficient. See kernel/bpf/inode.c, we
> need to do bpf_link_put_direct() from bpf_put_any(), which should be
> safe as well because it all is either triggered from bpf() syscall or
> by unlink()'ing BPF FS file. For file deletion we have the same
> requirement to have deterministic release of bpf_link.

Okay. I checked all callers and it seems that the only atomic caller is
the RCU callback.

 include/linux/bpf.h  |  5 +++++
 kernel/bpf/inode.c   | 13 ++++++++-----
 kernel/bpf/syscall.c | 21 ++++++++++++---------
 3 files changed, 25 insertions(+), 14 deletions(-)

Comments

Andrii Nakryiko May 31, 2023, 7 p.m. UTC | #1
On Fri, May 26, 2023 at 4:24 AM Sebastian Andrzej Siewior
<bigeasy@linutronix.de> wrote:
>
> bpf_free_inode() is invoked as a RCU callback. Usually RCU callbacks are
> invoked within softirq context. By setting rcutree.use_softirq=0 boot
> option the RCU callbacks will be invoked in a per-CPU kthread with
> bottom halves disabled which implies a RCU read section.
>
> On PREEMPT_RT the context remains fully preemptible. The RCU read
> section however does not allow schedule() invocation. The latter happens
> in mutex_lock() performed by bpf_trampoline_unlink_prog() originated
> from bpf_link_put().
>
> It was pointed out that the bpf_link_put() invocation should not be
> delayed if originated from close(). It was also pointed out that other
> invocations from within a syscall should also avoid the workqueue.
> After an audit of all bpf_link_put() callers it looks that the only
> atomic caller is the RCU callback. Everything else is called from
> preemptible context because it is a syscall, a mutex_t is acquired near
> by or due a GFP_KERNEL memory allocation.
>
> Let bpf_link_put() free the resources directly. Add
> bpf_link_put_from_atomic() which uses the kworker to free the
> resources. Let bpf_any_put() invoke one or the other depending on the
> context it is called from (RCU or not).
>
> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> ---
> On 2023-05-25 10:51:23 [-0700], Andrii Nakryiko wrote:
> v2…v3:
>   - Drop bpf_link_put_direct(). Let bpf_link_put() do the direct free
>     and add bpf_link_put_from_atomic() to do the delayed free via the
>     worker.

This seems like an unsafe "default put" choice. I think it makes more
sense to have bpf_link_put() do a safe scheduled put, and then having
a separate bpf_link_put_direct() for those special cases where we care
the most (in kernel/bpf/inode.c and kernel/bpf/syscall.c).

>
> v1…v2:
>    - Add bpf_link_put_direct() to be used from bpf_link_release() as
>      suggested.
>
> > Looks good to me, but it's not sufficient. See kernel/bpf/inode.c, we
> > need to do bpf_link_put_direct() from bpf_put_any(), which should be
> > safe as well because it all is either triggered from bpf() syscall or
> > by unlink()'ing BPF FS file. For file deletion we have the same
> > requirement to have deterministic release of bpf_link.
>
> Okay. I checked all callers and it seems that the only atomic caller is
> the RCU callback.
>
>  include/linux/bpf.h  |  5 +++++
>  kernel/bpf/inode.c   | 13 ++++++++-----
>  kernel/bpf/syscall.c | 21 ++++++++++++---------
>  3 files changed, 25 insertions(+), 14 deletions(-)
>
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index e53ceee1df370..dced1f880cfa6 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -2073,6 +2073,7 @@ int bpf_link_settle(struct bpf_link_primer *primer);
>  void bpf_link_cleanup(struct bpf_link_primer *primer);
>  void bpf_link_inc(struct bpf_link *link);
>  void bpf_link_put(struct bpf_link *link);
> +void bpf_link_put_from_atomic(struct bpf_link *link);
>  int bpf_link_new_fd(struct bpf_link *link);
>  struct file *bpf_link_new_file(struct bpf_link *link, int *reserved_fd);
>  struct bpf_link *bpf_link_get_from_fd(u32 ufd);
> @@ -2432,6 +2433,10 @@ static inline void bpf_link_put(struct bpf_link *link)
>  {
>  }
>
> +static inline void bpf_link_put_from_atomic(struct bpf_link *link)
> +{
> +}
> +
>  static inline int bpf_obj_get_user(const char __user *pathname, int flags)
>  {
>         return -EOPNOTSUPP;
> diff --git a/kernel/bpf/inode.c b/kernel/bpf/inode.c
> index 9948b542a470e..2e1e9f3c7f701 100644
> --- a/kernel/bpf/inode.c
> +++ b/kernel/bpf/inode.c
> @@ -49,7 +49,7 @@ static void *bpf_any_get(void *raw, enum bpf_type type)
>         return raw;
>  }
>
> -static void bpf_any_put(void *raw, enum bpf_type type)
> +static void bpf_any_put(void *raw, enum bpf_type type, bool may_sleep)
>  {
>         switch (type) {
>         case BPF_TYPE_PROG:
> @@ -59,7 +59,10 @@ static void bpf_any_put(void *raw, enum bpf_type type)
>                 bpf_map_put_with_uref(raw);
>                 break;
>         case BPF_TYPE_LINK:
> -               bpf_link_put(raw);
> +               if (may_sleep)
> +                       bpf_link_put(raw);
> +               else
> +                       bpf_link_put_from_atomic(raw);

Do we need to do this in two different ways here? The only situation
that has may_sleep=false is when called from superblock->free_inode.
According to documentation:

  Freeing memory in the callback is fine; doing
  more than that is possible, but requires a lot of care and is best
  avoided.

So it feels like cleaning up link should be safe to do from this
context as well? I've cc'ed linux-fsdevel@, hopefully they can advise.


>                 break;
>         default:
>                 WARN_ON_ONCE(1);
> @@ -490,7 +493,7 @@ int bpf_obj_pin_user(u32 ufd, const char __user *pathname)
>
>         ret = bpf_obj_do_pin(pathname, raw, type);
>         if (ret != 0)
> -               bpf_any_put(raw, type);
> +               bpf_any_put(raw, type, true);
>
>         return ret;
>  }
> @@ -552,7 +555,7 @@ int bpf_obj_get_user(const char __user *pathname, int flags)
>                 return -ENOENT;
>
>         if (ret < 0)
> -               bpf_any_put(raw, type);
> +               bpf_any_put(raw, type, true);
>         return ret;
>  }
>
> @@ -617,7 +620,7 @@ static void bpf_free_inode(struct inode *inode)
>         if (S_ISLNK(inode->i_mode))
>                 kfree(inode->i_link);
>         if (!bpf_inode_type(inode, &type))
> -               bpf_any_put(inode->i_private, type);
> +               bpf_any_put(inode->i_private, type, false);
>         free_inode_nonrcu(inode);
>  }
>
> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> index 14f39c1e573ee..87b07ebd6d146 100644
> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c
> @@ -2777,20 +2777,23 @@ static void bpf_link_put_deferred(struct work_struct *work)
>         bpf_link_free(link);
>  }
>
> -/* bpf_link_put can be called from atomic context, but ensures that resources
> - * are freed from process context
> +/* bpf_link_put_from_atomic is called from atomic context. It needs to be called
> + * from sleepable context in order to acquire sleeping locks during the process.
>   */
> -void bpf_link_put(struct bpf_link *link)
> +void bpf_link_put_from_atomic(struct bpf_link *link)
>  {
>         if (!atomic64_dec_and_test(&link->refcnt))
>                 return;
>
> -       if (in_atomic()) {
> -               INIT_WORK(&link->work, bpf_link_put_deferred);
> -               schedule_work(&link->work);
> -       } else {
> -               bpf_link_free(link);
> -       }
> +       INIT_WORK(&link->work, bpf_link_put_deferred);
> +       schedule_work(&link->work);
> +}
> +
> +void bpf_link_put(struct bpf_link *link)
> +{
> +       if (!atomic64_dec_and_test(&link->refcnt))
> +               return;
> +       bpf_link_free(link);
>  }
>  EXPORT_SYMBOL(bpf_link_put);
>
> --
> 2.40.1
>
Sebastian Andrzej Siewior June 5, 2023, 4:37 p.m. UTC | #2
On 2023-05-31 12:00:56 [-0700], Andrii Nakryiko wrote:
> > On 2023-05-25 10:51:23 [-0700], Andrii Nakryiko wrote:
> > v2…v3:
> >   - Drop bpf_link_put_direct(). Let bpf_link_put() do the direct free
> >     and add bpf_link_put_from_atomic() to do the delayed free via the
> >     worker.
> 
> This seems like an unsafe "default put" choice. I think it makes more
> sense to have bpf_link_put() do a safe scheduled put, and then having
> a separate bpf_link_put_direct() for those special cases where we care
> the most (in kernel/bpf/inode.c and kernel/bpf/syscall.c).

I audited them and ended up with them all being safe except for the one
from inode.c. I can reverse the logic if you want.

> > diff --git a/kernel/bpf/inode.c b/kernel/bpf/inode.c
> > index 9948b542a470e..2e1e9f3c7f701 100644
> > --- a/kernel/bpf/inode.c
> > +++ b/kernel/bpf/inode.c
> > @@ -59,7 +59,10 @@ static void bpf_any_put(void *raw, enum bpf_type type)
> >                 bpf_map_put_with_uref(raw);
> >                 break;
> >         case BPF_TYPE_LINK:
> > -               bpf_link_put(raw);
> > +               if (may_sleep)
> > +                       bpf_link_put(raw);
> > +               else
> > +                       bpf_link_put_from_atomic(raw);
> 
> Do we need to do this in two different ways here? The only situation
> that has may_sleep=false is when called from superblock->free_inode.
> According to documentation:
> 
>   Freeing memory in the callback is fine; doing
>   more than that is possible, but requires a lot of care and is best
>   avoided.
> 
> So it feels like cleaning up link should be safe to do from this
> context as well? I've cc'ed linux-fsdevel@, hopefully they can advise.

This is invoked from the RCU callback (which is usually softirq):
	destroy_inode()
	 -> call_rcu(&inode->i_rcu, i_callback);

Freeing memory is fine but there is a mutex that is held in the process.

Sebastian
Andrii Nakryiko June 5, 2023, 10:47 p.m. UTC | #3
On Mon, Jun 5, 2023 at 9:37 AM Sebastian Andrzej Siewior
<bigeasy@linutronix.de> wrote:
>
> On 2023-05-31 12:00:56 [-0700], Andrii Nakryiko wrote:
> > > On 2023-05-25 10:51:23 [-0700], Andrii Nakryiko wrote:
> > > v2…v3:
> > >   - Drop bpf_link_put_direct(). Let bpf_link_put() do the direct free
> > >     and add bpf_link_put_from_atomic() to do the delayed free via the
> > >     worker.
> >
> > This seems like an unsafe "default put" choice. I think it makes more
> > sense to have bpf_link_put() do a safe scheduled put, and then having
> > a separate bpf_link_put_direct() for those special cases where we care
> > the most (in kernel/bpf/inode.c and kernel/bpf/syscall.c).
>
> I audited them and ended up with them all being safe except for the one
> from inode.c. I can reverse the logic if you want.

I understand it's safe today, but I'm more worried for future places
that will do bpf_link_put(). Given it's only close() and BPF FS
unlink() that require synchronous semantics, I think it's fine to make
bpf_link_put() to be async, and have bpf_link_put_sync() (or direct,
or whatever suffix) as a conscious special case.

>
> > > diff --git a/kernel/bpf/inode.c b/kernel/bpf/inode.c
> > > index 9948b542a470e..2e1e9f3c7f701 100644
> > > --- a/kernel/bpf/inode.c
> > > +++ b/kernel/bpf/inode.c
> > > @@ -59,7 +59,10 @@ static void bpf_any_put(void *raw, enum bpf_type type)
> > >                 bpf_map_put_with_uref(raw);
> > >                 break;
> > >         case BPF_TYPE_LINK:
> > > -               bpf_link_put(raw);
> > > +               if (may_sleep)
> > > +                       bpf_link_put(raw);
> > > +               else
> > > +                       bpf_link_put_from_atomic(raw);
> >
> > Do we need to do this in two different ways here? The only situation
> > that has may_sleep=false is when called from superblock->free_inode.
> > According to documentation:
> >
> >   Freeing memory in the callback is fine; doing
> >   more than that is possible, but requires a lot of care and is best
> >   avoided.
> >
> > So it feels like cleaning up link should be safe to do from this
> > context as well? I've cc'ed linux-fsdevel@, hopefully they can advise.
>
> This is invoked from the RCU callback (which is usually softirq):
>         destroy_inode()
>          -> call_rcu(&inode->i_rcu, i_callback);
>
> Freeing memory is fine but there is a mutex that is held in the process.

I think it should be fine for BPF link destruction then?

>
> Sebastian
Sebastian Andrzej Siewior June 9, 2023, 2:19 p.m. UTC | #4
On 2023-06-05 15:47:23 [-0700], Andrii Nakryiko wrote:
> I understand it's safe today, but I'm more worried for future places
> that will do bpf_link_put(). Given it's only close() and BPF FS
> unlink() that require synchronous semantics, I think it's fine to make
> bpf_link_put() to be async, and have bpf_link_put_sync() (or direct,
> or whatever suffix) as a conscious special case.

Okay, let me do that then.

> > This is invoked from the RCU callback (which is usually softirq):
> >         destroy_inode()
> >          -> call_rcu(&inode->i_rcu, i_callback);
> >
> > Freeing memory is fine but there is a mutex that is held in the process.
> 
> I think it should be fine for BPF link destruction then?

bpf_any_put() needs that "may sleep" exception for BPF_TYPE_LINK if it
comes from RCU.
I will swap that patch to be async by default and make sync for
bpf_any_put() if called from close (except for the RCU case).

Sebastian
diff mbox series

Patch

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index e53ceee1df370..dced1f880cfa6 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -2073,6 +2073,7 @@  int bpf_link_settle(struct bpf_link_primer *primer);
 void bpf_link_cleanup(struct bpf_link_primer *primer);
 void bpf_link_inc(struct bpf_link *link);
 void bpf_link_put(struct bpf_link *link);
+void bpf_link_put_from_atomic(struct bpf_link *link);
 int bpf_link_new_fd(struct bpf_link *link);
 struct file *bpf_link_new_file(struct bpf_link *link, int *reserved_fd);
 struct bpf_link *bpf_link_get_from_fd(u32 ufd);
@@ -2432,6 +2433,10 @@  static inline void bpf_link_put(struct bpf_link *link)
 {
 }
 
+static inline void bpf_link_put_from_atomic(struct bpf_link *link)
+{
+}
+
 static inline int bpf_obj_get_user(const char __user *pathname, int flags)
 {
 	return -EOPNOTSUPP;
diff --git a/kernel/bpf/inode.c b/kernel/bpf/inode.c
index 9948b542a470e..2e1e9f3c7f701 100644
--- a/kernel/bpf/inode.c
+++ b/kernel/bpf/inode.c
@@ -49,7 +49,7 @@  static void *bpf_any_get(void *raw, enum bpf_type type)
 	return raw;
 }
 
-static void bpf_any_put(void *raw, enum bpf_type type)
+static void bpf_any_put(void *raw, enum bpf_type type, bool may_sleep)
 {
 	switch (type) {
 	case BPF_TYPE_PROG:
@@ -59,7 +59,10 @@  static void bpf_any_put(void *raw, enum bpf_type type)
 		bpf_map_put_with_uref(raw);
 		break;
 	case BPF_TYPE_LINK:
-		bpf_link_put(raw);
+		if (may_sleep)
+			bpf_link_put(raw);
+		else
+			bpf_link_put_from_atomic(raw);
 		break;
 	default:
 		WARN_ON_ONCE(1);
@@ -490,7 +493,7 @@  int bpf_obj_pin_user(u32 ufd, const char __user *pathname)
 
 	ret = bpf_obj_do_pin(pathname, raw, type);
 	if (ret != 0)
-		bpf_any_put(raw, type);
+		bpf_any_put(raw, type, true);
 
 	return ret;
 }
@@ -552,7 +555,7 @@  int bpf_obj_get_user(const char __user *pathname, int flags)
 		return -ENOENT;
 
 	if (ret < 0)
-		bpf_any_put(raw, type);
+		bpf_any_put(raw, type, true);
 	return ret;
 }
 
@@ -617,7 +620,7 @@  static void bpf_free_inode(struct inode *inode)
 	if (S_ISLNK(inode->i_mode))
 		kfree(inode->i_link);
 	if (!bpf_inode_type(inode, &type))
-		bpf_any_put(inode->i_private, type);
+		bpf_any_put(inode->i_private, type, false);
 	free_inode_nonrcu(inode);
 }
 
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 14f39c1e573ee..87b07ebd6d146 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -2777,20 +2777,23 @@  static void bpf_link_put_deferred(struct work_struct *work)
 	bpf_link_free(link);
 }
 
-/* bpf_link_put can be called from atomic context, but ensures that resources
- * are freed from process context
+/* bpf_link_put_from_atomic is called from atomic context. It needs to be called
+ * from sleepable context in order to acquire sleeping locks during the process.
  */
-void bpf_link_put(struct bpf_link *link)
+void bpf_link_put_from_atomic(struct bpf_link *link)
 {
 	if (!atomic64_dec_and_test(&link->refcnt))
 		return;
 
-	if (in_atomic()) {
-		INIT_WORK(&link->work, bpf_link_put_deferred);
-		schedule_work(&link->work);
-	} else {
-		bpf_link_free(link);
-	}
+	INIT_WORK(&link->work, bpf_link_put_deferred);
+	schedule_work(&link->work);
+}
+
+void bpf_link_put(struct bpf_link *link)
+{
+	if (!atomic64_dec_and_test(&link->refcnt))
+		return;
+	bpf_link_free(link);
 }
 EXPORT_SYMBOL(bpf_link_put);