[v3] bpf: Remove in_atomic() from bpf_link_put().

Message ID	20230526112356.fOlWmeOF@linutronix.de (mailing list archive)
State	Changes Requested
Delegated to:	BPF
Headers	show Received: from lindbergh.monkeyblade.net (lindbergh.monkeyblade.net [23.128.96.19]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 1B4D5742E1 for <bpf@vger.kernel.org>; Fri, 26 May 2023 11:24:03 +0000 (UTC) Date: Fri, 26 May 2023 13:23:56 +0200 From: Sebastian Andrzej Siewior <bigeasy@linutronix.de> To: Andrii Nakryiko <andrii.nakryiko@gmail.com> Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com>, bpf <bpf@vger.kernel.org>, Alexei Starovoitov <ast@kernel.org>, Daniel Borkmann <daniel@iogearbox.net>, John Fastabend <john.fastabend@gmail.com>, "Paul E. McKenney" <paulmck@kernel.org>, Peter Zijlstra <peterz@infradead.org>, Thomas Gleixner <tglx@linutronix.de> Subject: [PATCH v3] bpf: Remove in_atomic() from bpf_link_put(). Message-ID: <20230526112356.fOlWmeOF@linutronix.de> References: <20230509132433.2FSY_6t7@linutronix.de> <CAEf4BzZcPKsRJDQfdVk9D1Nt6kgT4STpEUrsQ=UD3BDZnNp8eQ@mail.gmail.com> <CAADnVQLzZyZ+cPqBFfrqa8wtQ8ZhWvTSN6oD9z4Y2gtrfs8Vdg@mail.gmail.com> <CAEf4BzY-MRYnzGiZmW7AVJYgYdHW1_jOphbipRrHRTtdfq3_wQ@mail.gmail.com> <20230525141813.TFZLWM4M@linutronix.de> <CAEf4Bzaipoo6X_2Fh5WTV-m0yjP0pvhqi7-FPFtGOrSzNpdGJQ@mail.gmail.com> Precedence: bulk MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Disposition: inline Content-Transfer-Encoding: quoted-printable In-Reply-To: <CAEf4Bzaipoo6X_2Fh5WTV-m0yjP0pvhqi7-FPFtGOrSzNpdGJQ@mail.gmail.com>
Series	[v3] bpf: Remove in_atomic() from bpf_link_put(). \| expand [v3] bpf: Remove in_atomic() from bpf_link_put().

Context	Check	Description
bpf/vmtest-bpf-next-PR	success	PR summary
netdev/series_format	warning	Single patches do not need cover letters; Target tree name not specified in the subject
netdev/tree_selection	success	Guessed tree name to be net-next, async
netdev/fixes_present	success	Fixes tag not required for -next series
netdev/header_inline	success	No static functions without inline keyword in header files
netdev/build_32bit	success	Errors and warnings before: 1470 this patch: 1470
netdev/cc_maintainers	warning	8 maintainers not CCed: yhs@fb.com kpsingh@kernel.org martin.lau@linux.dev song@kernel.org sdf@google.com andrii@kernel.org jolsa@kernel.org haoluo@google.com
netdev/build_clang	success	Errors and warnings before: 176 this patch: 176
netdev/verify_signedoff	success	Signed-off-by tag matches author and committer
netdev/deprecated_api	success	None detected
netdev/check_selftest	success	No net selftest shell script
netdev/verify_fixes	success	No Fixes tag
netdev/build_allmodconfig_warn	success	Errors and warnings before: 1465 this patch: 1465
netdev/checkpatch	success	total: 0 errors, 0 warnings, 0 checks, 92 lines checked
netdev/kdoc	success	Errors and warnings before: 0 this patch: 0
netdev/source_inline	success	Was 0 now: 0
bpf/vmtest-bpf-next-VM_Test-1	success	Logs for ShellCheck
bpf/vmtest-bpf-next-VM_Test-6	success	Logs for set-matrix
bpf/vmtest-bpf-next-VM_Test-2	success	Logs for build for aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-5	success	Logs for build for x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-4	success	Logs for build for x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-3	success	Logs for build for s390x with gcc
bpf/vmtest-bpf-next-VM_Test-7	success	Logs for test_maps on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-11	success	Logs for test_progs on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-15	success	Logs for test_progs_no_alu32 on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-19	success	Logs for test_progs_no_alu32_parallel on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-22	success	Logs for test_progs_parallel on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-25	success	Logs for test_verifier on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-9	success	Logs for test_maps on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-10	success	Logs for test_maps on x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-13	success	Logs for test_progs on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-14	success	Logs for test_progs on x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-17	success	Logs for test_progs_no_alu32 on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-18	success	Logs for test_progs_no_alu32 on x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-20	success	Logs for test_progs_no_alu32_parallel on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-21	success	Logs for test_progs_no_alu32_parallel on x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-23	success	Logs for test_progs_parallel on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-24	success	Logs for test_progs_parallel on x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-27	success	Logs for test_verifier on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-28	success	Logs for test_verifier on x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-29	success	Logs for veristat
bpf/vmtest-bpf-next-VM_Test-26	success	Logs for test_verifier on s390x with gcc
bpf/vmtest-bpf-next-VM_Test-8	fail	Logs for test_maps on s390x with gcc
bpf/vmtest-bpf-next-VM_Test-16	success	Logs for test_progs_no_alu32 on s390x with gcc
bpf/vmtest-bpf-next-VM_Test-12	success	Logs for test_progs on s390x with gcc

Sebastian Andrzej Siewior May 26, 2023, 11:23 a.m. UTC

bpf_free_inode() is invoked as a RCU callback. Usually RCU callbacks are
invoked within softirq context. By setting rcutree.use_softirq=0 boot
option the RCU callbacks will be invoked in a per-CPU kthread with
bottom halves disabled which implies a RCU read section.

On PREEMPT_RT the context remains fully preemptible. The RCU read
section however does not allow schedule() invocation. The latter happens
in mutex_lock() performed by bpf_trampoline_unlink_prog() originated
from bpf_link_put().

It was pointed out that the bpf_link_put() invocation should not be
delayed if originated from close(). It was also pointed out that other
invocations from within a syscall should also avoid the workqueue.
After an audit of all bpf_link_put() callers it looks that the only
atomic caller is the RCU callback. Everything else is called from
preemptible context because it is a syscall, a mutex_t is acquired near
by or due a GFP_KERNEL memory allocation.

Let bpf_link_put() free the resources directly. Add
bpf_link_put_from_atomic() which uses the kworker to free the
resources. Let bpf_any_put() invoke one or the other depending on the
context it is called from (RCU or not).

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
On 2023-05-25 10:51:23 [-0700], Andrii Nakryiko wrote:
v2…v3:
  - Drop bpf_link_put_direct(). Let bpf_link_put() do the direct free
    and add bpf_link_put_from_atomic() to do the delayed free via the
    worker.

v1…v2:
   - Add bpf_link_put_direct() to be used from bpf_link_release() as
     suggested.

> Looks good to me, but it's not sufficient. See kernel/bpf/inode.c, we
> need to do bpf_link_put_direct() from bpf_put_any(), which should be
> safe as well because it all is either triggered from bpf() syscall or
> by unlink()'ing BPF FS file. For file deletion we have the same
> requirement to have deterministic release of bpf_link.

Okay. I checked all callers and it seems that the only atomic caller is
the RCU callback.

 include/linux/bpf.h  |  5 +++++
 kernel/bpf/inode.c   | 13 ++++++++-----
 kernel/bpf/syscall.c | 21 ++++++++++++---------
 3 files changed, 25 insertions(+), 14 deletions(-)

Andrii Nakryiko May 31, 2023, 7 p.m. UTC | #1

On Fri, May 26, 2023 at 4:24 AM Sebastian Andrzej Siewior
<bigeasy@linutronix.de> wrote:
>
> bpf_free_inode() is invoked as a RCU callback. Usually RCU callbacks are
> invoked within softirq context. By setting rcutree.use_softirq=0 boot
> option the RCU callbacks will be invoked in a per-CPU kthread with
> bottom halves disabled which implies a RCU read section.
>
> On PREEMPT_RT the context remains fully preemptible. The RCU read
> section however does not allow schedule() invocation. The latter happens
> in mutex_lock() performed by bpf_trampoline_unlink_prog() originated
> from bpf_link_put().
>
> It was pointed out that the bpf_link_put() invocation should not be
> delayed if originated from close(). It was also pointed out that other
> invocations from within a syscall should also avoid the workqueue.
> After an audit of all bpf_link_put() callers it looks that the only
> atomic caller is the RCU callback. Everything else is called from
> preemptible context because it is a syscall, a mutex_t is acquired near
> by or due a GFP_KERNEL memory allocation.
>
> Let bpf_link_put() free the resources directly. Add
> bpf_link_put_from_atomic() which uses the kworker to free the
> resources. Let bpf_any_put() invoke one or the other depending on the
> context it is called from (RCU or not).
>
> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> ---
> On 2023-05-25 10:51:23 [-0700], Andrii Nakryiko wrote:
> v2…v3:
>   - Drop bpf_link_put_direct(). Let bpf_link_put() do the direct free
>     and add bpf_link_put_from_atomic() to do the delayed free via the
>     worker.

This seems like an unsafe "default put" choice. I think it makes more
sense to have bpf_link_put() do a safe scheduled put, and then having
a separate bpf_link_put_direct() for those special cases where we care
the most (in kernel/bpf/inode.c and kernel/bpf/syscall.c).

>
> v1…v2:
>    - Add bpf_link_put_direct() to be used from bpf_link_release() as
>      suggested.
>
> > Looks good to me, but it's not sufficient. See kernel/bpf/inode.c, we
> > need to do bpf_link_put_direct() from bpf_put_any(), which should be
> > safe as well because it all is either triggered from bpf() syscall or
> > by unlink()'ing BPF FS file. For file deletion we have the same
> > requirement to have deterministic release of bpf_link.
>
> Okay. I checked all callers and it seems that the only atomic caller is
> the RCU callback.
>
>  include/linux/bpf.h  |  5 +++++
>  kernel/bpf/inode.c   | 13 ++++++++-----
>  kernel/bpf/syscall.c | 21 ++++++++++++---------
>  3 files changed, 25 insertions(+), 14 deletions(-)
>
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index e53ceee1df370..dced1f880cfa6 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -2073,6 +2073,7 @@ int bpf_link_settle(struct bpf_link_primer *primer);
>  void bpf_link_cleanup(struct bpf_link_primer *primer);
>  void bpf_link_inc(struct bpf_link *link);
>  void bpf_link_put(struct bpf_link *link);
> +void bpf_link_put_from_atomic(struct bpf_link *link);
>  int bpf_link_new_fd(struct bpf_link *link);
>  struct file *bpf_link_new_file(struct bpf_link *link, int *reserved_fd);
>  struct bpf_link *bpf_link_get_from_fd(u32 ufd);
> @@ -2432,6 +2433,10 @@ static inline void bpf_link_put(struct bpf_link *link)
>  {
>  }
>
> +static inline void bpf_link_put_from_atomic(struct bpf_link *link)
> +{
> +}
> +
>  static inline int bpf_obj_get_user(const char __user *pathname, int flags)
>  {
>         return -EOPNOTSUPP;
> diff --git a/kernel/bpf/inode.c b/kernel/bpf/inode.c
> index 9948b542a470e..2e1e9f3c7f701 100644
> --- a/kernel/bpf/inode.c
> +++ b/kernel/bpf/inode.c
> @@ -49,7 +49,7 @@ static void *bpf_any_get(void *raw, enum bpf_type type)
>         return raw;
>  }
>
> -static void bpf_any_put(void *raw, enum bpf_type type)
> +static void bpf_any_put(void *raw, enum bpf_type type, bool may_sleep)
>  {
>         switch (type) {
>         case BPF_TYPE_PROG:
> @@ -59,7 +59,10 @@ static void bpf_any_put(void *raw, enum bpf_type type)
>                 bpf_map_put_with_uref(raw);
>                 break;
>         case BPF_TYPE_LINK:
> -               bpf_link_put(raw);
> +               if (may_sleep)
> +                       bpf_link_put(raw);
> +               else
> +                       bpf_link_put_from_atomic(raw);

Do we need to do this in two different ways here? The only situation
that has may_sleep=false is when called from superblock->free_inode.
According to documentation:

  Freeing memory in the callback is fine; doing
  more than that is possible, but requires a lot of care and is best
  avoided.

So it feels like cleaning up link should be safe to do from this
context as well? I've cc'ed linux-fsdevel@, hopefully they can advise.


>                 break;
>         default:
>                 WARN_ON_ONCE(1);
> @@ -490,7 +493,7 @@ int bpf_obj_pin_user(u32 ufd, const char __user *pathname)
>
>         ret = bpf_obj_do_pin(pathname, raw, type);
>         if (ret != 0)
> -               bpf_any_put(raw, type);
> +               bpf_any_put(raw, type, true);
>
>         return ret;
>  }
> @@ -552,7 +555,7 @@ int bpf_obj_get_user(const char __user *pathname, int flags)
>                 return -ENOENT;
>
>         if (ret < 0)
> -               bpf_any_put(raw, type);
> +               bpf_any_put(raw, type, true);
>         return ret;
>  }
>
> @@ -617,7 +620,7 @@ static void bpf_free_inode(struct inode *inode)
>         if (S_ISLNK(inode->i_mode))
>                 kfree(inode->i_link);
>         if (!bpf_inode_type(inode, &type))
> -               bpf_any_put(inode->i_private, type);
> +               bpf_any_put(inode->i_private, type, false);
>         free_inode_nonrcu(inode);
>  }
>
> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> index 14f39c1e573ee..87b07ebd6d146 100644
> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c
> @@ -2777,20 +2777,23 @@ static void bpf_link_put_deferred(struct work_struct *work)
>         bpf_link_free(link);
>  }
>
> -/* bpf_link_put can be called from atomic context, but ensures that resources
> - * are freed from process context
> +/* bpf_link_put_from_atomic is called from atomic context. It needs to be called
> + * from sleepable context in order to acquire sleeping locks during the process.
>   */
> -void bpf_link_put(struct bpf_link *link)
> +void bpf_link_put_from_atomic(struct bpf_link *link)
>  {
>         if (!atomic64_dec_and_test(&link->refcnt))
>                 return;
>
> -       if (in_atomic()) {
> -               INIT_WORK(&link->work, bpf_link_put_deferred);
> -               schedule_work(&link->work);
> -       } else {
> -               bpf_link_free(link);
> -       }
> +       INIT_WORK(&link->work, bpf_link_put_deferred);
> +       schedule_work(&link->work);
> +}
> +
> +void bpf_link_put(struct bpf_link *link)
> +{
> +       if (!atomic64_dec_and_test(&link->refcnt))
> +               return;
> +       bpf_link_free(link);
>  }
>  EXPORT_SYMBOL(bpf_link_put);
>
> --
> 2.40.1
>

Sebastian Andrzej Siewior June 5, 2023, 4:37 p.m. UTC | #2

On 2023-05-31 12:00:56 [-0700], Andrii Nakryiko wrote:
> > On 2023-05-25 10:51:23 [-0700], Andrii Nakryiko wrote:
> > v2…v3:
> >   - Drop bpf_link_put_direct(). Let bpf_link_put() do the direct free
> >     and add bpf_link_put_from_atomic() to do the delayed free via the
> >     worker.
> 
> This seems like an unsafe "default put" choice. I think it makes more
> sense to have bpf_link_put() do a safe scheduled put, and then having
> a separate bpf_link_put_direct() for those special cases where we care
> the most (in kernel/bpf/inode.c and kernel/bpf/syscall.c).

I audited them and ended up with them all being safe except for the one
from inode.c. I can reverse the logic if you want.

> > diff --git a/kernel/bpf/inode.c b/kernel/bpf/inode.c
> > index 9948b542a470e..2e1e9f3c7f701 100644
> > --- a/kernel/bpf/inode.c
> > +++ b/kernel/bpf/inode.c
> > @@ -59,7 +59,10 @@ static void bpf_any_put(void *raw, enum bpf_type type)
> >                 bpf_map_put_with_uref(raw);
> >                 break;
> >         case BPF_TYPE_LINK:
> > -               bpf_link_put(raw);
> > +               if (may_sleep)
> > +                       bpf_link_put(raw);
> > +               else
> > +                       bpf_link_put_from_atomic(raw);
> 
> Do we need to do this in two different ways here? The only situation
> that has may_sleep=false is when called from superblock->free_inode.
> According to documentation:
> 
>   Freeing memory in the callback is fine; doing
>   more than that is possible, but requires a lot of care and is best
>   avoided.
> 
> So it feels like cleaning up link should be safe to do from this
> context as well? I've cc'ed linux-fsdevel@, hopefully they can advise.

This is invoked from the RCU callback (which is usually softirq):
	destroy_inode()
	 -> call_rcu(&inode->i_rcu, i_callback);

Freeing memory is fine but there is a mutex that is held in the process.

Sebastian

Andrii Nakryiko June 5, 2023, 10:47 p.m. UTC | #3

On Mon, Jun 5, 2023 at 9:37 AM Sebastian Andrzej Siewior
<bigeasy@linutronix.de> wrote:
>
> On 2023-05-31 12:00:56 [-0700], Andrii Nakryiko wrote:
> > > On 2023-05-25 10:51:23 [-0700], Andrii Nakryiko wrote:
> > > v2…v3:
> > >   - Drop bpf_link_put_direct(). Let bpf_link_put() do the direct free
> > >     and add bpf_link_put_from_atomic() to do the delayed free via the
> > >     worker.
> >
> > This seems like an unsafe "default put" choice. I think it makes more
> > sense to have bpf_link_put() do a safe scheduled put, and then having
> > a separate bpf_link_put_direct() for those special cases where we care
> > the most (in kernel/bpf/inode.c and kernel/bpf/syscall.c).
>
> I audited them and ended up with them all being safe except for the one
> from inode.c. I can reverse the logic if you want.

I understand it's safe today, but I'm more worried for future places
that will do bpf_link_put(). Given it's only close() and BPF FS
unlink() that require synchronous semantics, I think it's fine to make
bpf_link_put() to be async, and have bpf_link_put_sync() (or direct,
or whatever suffix) as a conscious special case.

>
> > > diff --git a/kernel/bpf/inode.c b/kernel/bpf/inode.c
> > > index 9948b542a470e..2e1e9f3c7f701 100644
> > > --- a/kernel/bpf/inode.c
> > > +++ b/kernel/bpf/inode.c
> > > @@ -59,7 +59,10 @@ static void bpf_any_put(void *raw, enum bpf_type type)
> > >                 bpf_map_put_with_uref(raw);
> > >                 break;
> > >         case BPF_TYPE_LINK:
> > > -               bpf_link_put(raw);
> > > +               if (may_sleep)
> > > +                       bpf_link_put(raw);
> > > +               else
> > > +                       bpf_link_put_from_atomic(raw);
> >
> > Do we need to do this in two different ways here? The only situation
> > that has may_sleep=false is when called from superblock->free_inode.
> > According to documentation:
> >
> >   Freeing memory in the callback is fine; doing
> >   more than that is possible, but requires a lot of care and is best
> >   avoided.
> >
> > So it feels like cleaning up link should be safe to do from this
> > context as well? I've cc'ed linux-fsdevel@, hopefully they can advise.
>
> This is invoked from the RCU callback (which is usually softirq):
>         destroy_inode()
>          -> call_rcu(&inode->i_rcu, i_callback);
>
> Freeing memory is fine but there is a mutex that is held in the process.

I think it should be fine for BPF link destruction then?

>
> Sebastian

Sebastian Andrzej Siewior June 9, 2023, 2:19 p.m. UTC | #4

On 2023-06-05 15:47:23 [-0700], Andrii Nakryiko wrote:
> I understand it's safe today, but I'm more worried for future places
> that will do bpf_link_put(). Given it's only close() and BPF FS
> unlink() that require synchronous semantics, I think it's fine to make
> bpf_link_put() to be async, and have bpf_link_put_sync() (or direct,
> or whatever suffix) as a conscious special case.

Okay, let me do that then.

> > This is invoked from the RCU callback (which is usually softirq):
> >         destroy_inode()
> >          -> call_rcu(&inode->i_rcu, i_callback);
> >
> > Freeing memory is fine but there is a mutex that is held in the process.
> 
> I think it should be fine for BPF link destruction then?

bpf_any_put() needs that "may sleep" exception for BPF_TYPE_LINK if it
comes from RCU.
I will swap that patch to be async by default and make sync for
bpf_any_put() if called from close (except for the RCU case).

Sebastian

[v3] bpf: Remove in_atomic() from bpf_link_put().

Checks

Commit Message

Comments

Patch