diff mbox series

[v2,bpf-next,3/7] bpf: Use bpf_mem_free_rcu when bpf_obj_dropping refcounted nodes

Message ID 20230821193311.3290257-4-davemarchevsky@fb.com (mailing list archive)
State Accepted
Commit 7e26cd12ad1c8f3e55d32542c7e4708a9e6a3c02
Delegated to: BPF
Headers show
Series BPF Refcount followups 3: bpf_mem_free_rcu refcounted nodes | expand

Checks

Context Check Description
netdev/series_format success Posting correctly formatted
netdev/tree_selection success Clearly marked for bpf-next, async
netdev/fixes_present success Fixes tag not required for -next series
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 1370 this patch: 1370
netdev/cc_maintainers warning 7 maintainers not CCed: kpsingh@kernel.org martin.lau@linux.dev john.fastabend@gmail.com sdf@google.com song@kernel.org jolsa@kernel.org haoluo@google.com
netdev/build_clang success Errors and warnings before: 1353 this patch: 1353
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn success Errors and warnings before: 1393 this patch: 1393
netdev/checkpatch success total: 0 errors, 0 warnings, 0 checks, 12 lines checked
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0
bpf/vmtest-bpf-next-VM_Test-9 success Logs for test_maps on x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-8 success Logs for test_maps on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-19 success Logs for test_progs_no_alu32_parallel on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-20 success Logs for test_progs_no_alu32_parallel on x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-22 success Logs for test_progs_parallel on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-24 success Logs for test_verifier on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-26 success Logs for test_verifier on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-27 success Logs for test_verifier on x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-28 success Logs for veristat
bpf/vmtest-bpf-next-VM_Test-13 success Logs for test_progs on x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-10 success Logs for test_progs on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-12 success Logs for test_progs on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-14 success Logs for test_progs_no_alu32 on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-16 success Logs for test_progs_no_alu32 on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-17 success Logs for test_progs_no_alu32 on x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-18 success Logs for test_progs_no_alu32_parallel on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-21 success Logs for test_progs_parallel on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-23 success Logs for test_progs_parallel on x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-11 fail Logs for test_progs on s390x with gcc
bpf/vmtest-bpf-next-VM_Test-15 success Logs for test_progs_no_alu32 on s390x with gcc
bpf/vmtest-bpf-next-VM_Test-25 success Logs for test_verifier on s390x with gcc
bpf/vmtest-bpf-next-PR success PR summary
bpf/vmtest-bpf-next-VM_Test-2 success Logs for build for aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-1 success Logs for ShellCheck
bpf/vmtest-bpf-next-VM_Test-0 success Logs for ${{ matrix.test }} on ${{ matrix.arch }} with ${{ matrix.toolchain_full }}
bpf/vmtest-bpf-next-VM_Test-4 success Logs for build for x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-7 success Logs for veristat
bpf/vmtest-bpf-next-VM_Test-5 success Logs for build for x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-6 success Logs for set-matrix
bpf/vmtest-bpf-next-VM_Test-3 fail Logs for build for s390x with gcc

Commit Message

Dave Marchevsky Aug. 21, 2023, 7:33 p.m. UTC
This is the final fix for the use-after-free scenario described in
commit 7793fc3babe9 ("bpf: Make bpf_refcount_acquire fallible for
non-owning refs"). That commit, by virtue of changing
bpf_refcount_acquire's refcount_inc to a refcount_inc_not_zero, fixed
the "refcount incr on 0" splat. The not_zero check in
refcount_inc_not_zero, though, still occurs on memory that could have
been free'd and reused, so the commit didn't properly fix the root
cause.

This patch actually fixes the issue by free'ing using the recently-added
bpf_mem_free_rcu, which ensures that the memory is not reused until an
RCU grace period has elapsed. Once that has happened, there are no
non-owning references alive that point to the recently-free'd memory, so
it can be safely reused.

Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
---
 kernel/bpf/helpers.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)
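
The change itself is the single hunk below in __bpf_obj_drop_impl() (kernel/bpf/helpers.c), reproduced from the diff quoted in the thread: objects whose btf_record has a refcount field are freed with bpf_mem_free_rcu(), everything else keeps using bpf_mem_free().

diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
index eb91cae0612a..945a85e25ac5 100644
--- a/kernel/bpf/helpers.c
+++ b/kernel/bpf/helpers.c
@@ -1913,7 +1913,11 @@ void __bpf_obj_drop_impl(void *p, const struct btf_record *rec)
 
 	if (rec)
 		bpf_obj_free_fields(rec, p);
-	bpf_mem_free(&bpf_global_ma, p);
+
+	if (rec && rec->refcount_off >= 0)
+		bpf_mem_free_rcu(&bpf_global_ma, p);
+	else
+		bpf_mem_free(&bpf_global_ma, p);
 }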

Comments

Yonghong Song Aug. 23, 2023, 6:26 a.m. UTC | #1
On 8/21/23 12:33 PM, Dave Marchevsky wrote:
> This is the final fix for the use-after-free scenario described in
> commit 7793fc3babe9 ("bpf: Make bpf_refcount_acquire fallible for
> non-owning refs"). That commit, by virtue of changing
> bpf_refcount_acquire's refcount_inc to a refcount_inc_not_zero, fixed
> the "refcount incr on 0" splat. The not_zero check in
> refcount_inc_not_zero, though, still occurs on memory that could have
> been free'd and reused, so the commit didn't properly fix the root
> cause.
> 
> This patch actually fixes the issue by free'ing using the recently-added
> bpf_mem_free_rcu, which ensures that the memory is not reused until
> RCU grace period has elapsed. If that has happened then
> there are no non-owning references alive that point to the
> recently-free'd memory, so it can be safely reused.
> 
> Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
> Acked-by: Yonghong Song <yonghong.song@linux.dev>
> ---
>   kernel/bpf/helpers.c | 6 +++++-
>   1 file changed, 5 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
> index eb91cae0612a..945a85e25ac5 100644
> --- a/kernel/bpf/helpers.c
> +++ b/kernel/bpf/helpers.c
> @@ -1913,7 +1913,11 @@ void __bpf_obj_drop_impl(void *p, const struct btf_record *rec)
>   
>   	if (rec)
>   		bpf_obj_free_fields(rec, p);

While reviewing my percpu kptr patch at

https://lore.kernel.org/bpf/20230814172809.1361446-1-yonghong.song@linux.dev/T/#m2f7631b8047e9f5da60a0a9cd8717fceaf1adbb7

Kumar mentioned that although the percpu memory itself is freed under
RCU, its record fields are freed immediately. This causes a problem:
there may still be active uses of these fields within an RCU critical
section, and after bpf_obj_free_fields() some fields may be
re-initialized with new memory that then never gets a chance to be freed.

Do we have a problem here as well?

I am wondering whether I could create another flavor of bpf_mem_free_rcu
with a pre_free_callback function, something like
   bpf_mem_free_rcu_cb2(struct bpf_mem_alloc *ma, void *ptr,
       void (*cb)(void *, void *), void *arg1, void *arg2)

The cb(arg1, arg2) would be called right before the real free of "ptr".

For example, for this patch, the callback function can be

static void bpf_obj_free_fields_cb(void *rec, void *p)
{
	if (rec)
		bpf_obj_free_fields(rec, p);
	/* We need to ensure that recursive freeing of fields is done
	 * immediately, which means we will add a parameter to
	 * __bpf_obj_drop_impl() to indicate whether bpf_mem_free or
	 * bpf_mem_free_rcu should be called.
	 */
}

bpf_mem_free_rcu_cb2(&bpf_global_ma, p, bpf_obj_free_fields_cb, rec, p);

In __bpf_obj_drop_impl(), we need to ensure that recursive freeing of
fields is done immediately, which means adding a parameter to
__bpf_obj_drop_impl() to indicate whether bpf_mem_free or
bpf_mem_free_rcu should be called. If __bpf_obj_drop_impl() is called
from bpf_obj_drop_impl(), the rcu version should be used; otherwise
the non-rcu version should be used.

Is this something we need to worry about here as well?
What do you think of the above interface?
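
A rough sketch of how that could fit together, using the
bpf_mem_free_rcu_cb2() signature proposed above (the extra 'use_rcu'
parameter name is just a placeholder):

void __bpf_obj_drop_impl(void *p, const struct btf_record *rec, bool use_rcu)
{
	if (use_rcu) {
		/* fields are freed by bpf_obj_free_fields_cb() right before
		 * the object itself, after an RCU grace period
		 */
		bpf_mem_free_rcu_cb2(&bpf_global_ma, p, bpf_obj_free_fields_cb,
				     (void *)rec, p);
		return;
	}

	/* recursive frees, e.g. from bpf_list_head_free()/bpf_rb_root_free(),
	 * keep freeing fields and the object immediately
	 */
	if (rec)
		bpf_obj_free_fields(rec, p);
	bpf_mem_free(&bpf_global_ma, p);
}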


> -	bpf_mem_free(&bpf_global_ma, p);
> +
> +	if (rec && rec->refcount_off >= 0)
> +		bpf_mem_free_rcu(&bpf_global_ma, p);
> +	else
> +		bpf_mem_free(&bpf_global_ma, p);
>   }
>   
>   __bpf_kfunc void bpf_obj_drop_impl(void *p__alloc, void *meta__ign)
Alexei Starovoitov Aug. 23, 2023, 4:20 p.m. UTC | #2
On Tue, Aug 22, 2023 at 11:26 PM Yonghong Song <yonghong.song@linux.dev> wrote:
>
>
>
> On 8/21/23 12:33 PM, Dave Marchevsky wrote:
> > This is the final fix for the use-after-free scenario described in
> > commit 7793fc3babe9 ("bpf: Make bpf_refcount_acquire fallible for
> > non-owning refs"). That commit, by virtue of changing
> > bpf_refcount_acquire's refcount_inc to a refcount_inc_not_zero, fixed
> > the "refcount incr on 0" splat. The not_zero check in
> > refcount_inc_not_zero, though, still occurs on memory that could have
> > been free'd and reused, so the commit didn't properly fix the root
> > cause.
> >
> > This patch actually fixes the issue by free'ing using the recently-added
> > bpf_mem_free_rcu, which ensures that the memory is not reused until
> > RCU grace period has elapsed. If that has happened then
> > there are no non-owning references alive that point to the
> > recently-free'd memory, so it can be safely reused.
> >
> > Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
> > Acked-by: Yonghong Song <yonghong.song@linux.dev>
> > ---
> >   kernel/bpf/helpers.c | 6 +++++-
> >   1 file changed, 5 insertions(+), 1 deletion(-)
> >
> > diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
> > index eb91cae0612a..945a85e25ac5 100644
> > --- a/kernel/bpf/helpers.c
> > +++ b/kernel/bpf/helpers.c
> > @@ -1913,7 +1913,11 @@ void __bpf_obj_drop_impl(void *p, const struct btf_record *rec)
> >
> >       if (rec)
> >               bpf_obj_free_fields(rec, p);
>
> During reviewing my percpu kptr patch with link
>
> https://lore.kernel.org/bpf/20230814172809.1361446-1-yonghong.song@linux.dev/T/#m2f7631b8047e9f5da60a0a9cd8717fceaf1adbb7
> Kumar mentioned although percpu memory itself is freed under rcu.
> But its record fields are freed immediately. This will cause
> the problem since there may be some active uses of these fields
> within rcu cs and after bpf_obj_free_fields(), some fields may
> be re-initialized with new memory but they do not have chances
> to free any more.
>
> Do we have problem here as well?

I think it's not an issue here or in your percpu patch,
since bpf_obj_free_fields() calls __bpf_obj_drop_impl() which will
call bpf_mem_free_rcu() (after this patch set lands).

In other words, all bpf pointers are either properly lifetime-checked
through the verifier and freed via an immediate bpf_mem_free(), or they
are freed via bpf_mem_free_rcu().


>
> I am thinking whether I could create another flavor of bpf_mem_free_rcu
> with a pre_free_callback function, something like
>    bpf_mem_free_rcu_cb2(struct bpf_mem_alloc *ma, void *ptr,
>        void (*cb)(void *, void *), void *arg1, void *arg2)
>
> The cb(arg1, arg2) will be called right before the real free of "ptr".
>
> For example, for this patch, the callback function can be
>
> static bpf_obj_free_fields_cb(void *rec, void *p)
> {
>         if (rec)
>                 bpf_obj_free_fields(rec, p);
>                 /* we need to ensure recursive freeing fields free
>                  * needs to be done immediately, which means we will
>                  * add a parameter to __bpf_obj_drop_impl() to
>                  * indicate whether bpf_mem_free or bpf_mem_free_rcu
>                  * should be called.
>                  */
> }
>
> bpf_mem_free_rcu_cb2(&bpf_global_ma, p, bpf_obj_free_fields_cb, rec, p);

It sounds nice in theory, but it will be difficult to implement.
bpf_ma would need to add two extra 8-byte pointers for every object,
but there is no room at present. So freeing after an rcu gp with a
callback would need to allocate 16 bytes (at least), which might fail,
and so on. bpf_mem_free_rcu_cb2() would be a last resort.
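
For illustration, assuming arg2 can be the object pointer itself (as in
the bpf_obj_free_fields_cb() example above), the state that would have
to be stashed per object until the RCU callback runs is roughly:

/* hypothetical layout; these are the two extra 8-byte pointers that
 * bpf_ma has no room for today
 */
struct bpf_free_cb_state {
	void (*cb)(void *arg1, void *arg2);
	void *arg1;
};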
Yonghong Song Aug. 23, 2023, 8:29 p.m. UTC | #3
On 8/23/23 9:20 AM, Alexei Starovoitov wrote:
> On Tue, Aug 22, 2023 at 11:26 PM Yonghong Song <yonghong.song@linux.dev> wrote:
>>
>>
>>
>> On 8/21/23 12:33 PM, Dave Marchevsky wrote:
>>> This is the final fix for the use-after-free scenario described in
>>> commit 7793fc3babe9 ("bpf: Make bpf_refcount_acquire fallible for
>>> non-owning refs"). That commit, by virtue of changing
>>> bpf_refcount_acquire's refcount_inc to a refcount_inc_not_zero, fixed
>>> the "refcount incr on 0" splat. The not_zero check in
>>> refcount_inc_not_zero, though, still occurs on memory that could have
>>> been free'd and reused, so the commit didn't properly fix the root
>>> cause.
>>>
>>> This patch actually fixes the issue by free'ing using the recently-added
>>> bpf_mem_free_rcu, which ensures that the memory is not reused until
>>> RCU grace period has elapsed. If that has happened then
>>> there are no non-owning references alive that point to the
>>> recently-free'd memory, so it can be safely reused.
>>>
>>> Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
>>> Acked-by: Yonghong Song <yonghong.song@linux.dev>
>>> ---
>>>    kernel/bpf/helpers.c | 6 +++++-
>>>    1 file changed, 5 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
>>> index eb91cae0612a..945a85e25ac5 100644
>>> --- a/kernel/bpf/helpers.c
>>> +++ b/kernel/bpf/helpers.c
>>> @@ -1913,7 +1913,11 @@ void __bpf_obj_drop_impl(void *p, const struct btf_record *rec)
>>>
>>>        if (rec)
>>>                bpf_obj_free_fields(rec, p);
>>
>> During reviewing my percpu kptr patch with link
>>
>> https://lore.kernel.org/bpf/20230814172809.1361446-1-yonghong.song@linux.dev/T/#m2f7631b8047e9f5da60a0a9cd8717fceaf1adbb7
>> Kumar mentioned although percpu memory itself is freed under rcu.
>> But its record fields are freed immediately. This will cause
>> the problem since there may be some active uses of these fields
>> within rcu cs and after bpf_obj_free_fields(), some fields may
>> be re-initialized with new memory but they do not have chances
>> to free any more.
>>
>> Do we have problem here as well?
> 
> I think it's not an issue here or in your percpu patch,
> since bpf_obj_free_fields() calls __bpf_obj_drop_impl() which will
> call bpf_mem_free_rcu() (after this patch set lands).

The following is my understanding.

void bpf_obj_free_fields(const struct btf_record *rec, void *obj)
{
         const struct btf_field *fields;
         int i;

         if (IS_ERR_OR_NULL(rec))
                 return;
         fields = rec->fields;
         for (i = 0; i < rec->cnt; i++) {
                 struct btf_struct_meta *pointee_struct_meta;
                 const struct btf_field *field = &fields[i];
                 void *field_ptr = obj + field->offset;
                 void *xchgd_field;

                 switch (fields[i].type) {
                 case BPF_SPIN_LOCK:
                         break;
                 case BPF_TIMER:
                         bpf_timer_cancel_and_free(field_ptr);
                         break;
                 case BPF_KPTR_UNREF:
                         WRITE_ONCE(*(u64 *)field_ptr, 0);
                         break;
                 case BPF_KPTR_REF:
			......
                         break;
                 case BPF_LIST_HEAD:
                         if (WARN_ON_ONCE(rec->spin_lock_off < 0))
                                 continue;
                        bpf_list_head_free(field, field_ptr, obj + rec->spin_lock_off);
                         break;
                 case BPF_RB_ROOT:
                         if (WARN_ON_ONCE(rec->spin_lock_off < 0))
                                 continue;
                        bpf_rb_root_free(field, field_ptr, obj + rec->spin_lock_off);
                         break;
                 case BPF_LIST_NODE:
                 case BPF_RB_NODE:
                 case BPF_REFCOUNT:
                         break;
                 default:
                         WARN_ON_ONCE(1);
                         continue;
                 }
         }
}

For percpu kptr, the remaining possible actionable fields are
	BPF_LIST_HEAD and BPF_RB_ROOT

So BPF_LIST_HEAD and BPF_RB_ROOT will try to go through all
list/rb nodes to unlink them from the list_head/rb_root.

So yes, rb_nodes and list_nodes will call __bpf_obj_drop_impl().
Depending on whether the corresponding rec with the rb_node/list_node
has a refcount or not, it may call bpf_mem_free() or bpf_mem_free_rcu().
If bpf_mem_free() is called, then the object is immediately freed, but
it may still be used by some bpf prog (under rcu) concurrently; could
this be an issue? Changing bpf_mem_free() in __bpf_obj_drop_impl() to
bpf_mem_free_rcu() should fix this problem.
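
A minimal sketch of that alternative (illustrative only, not part of
this patch), assuming __bpf_obj_drop_impl() keeps its current signature
and simply defers every free:

void __bpf_obj_drop_impl(void *p, const struct btf_record *rec)
{
	if (rec)
		bpf_obj_free_fields(rec, p);

	/* defer the free unconditionally so that concurrent readers inside
	 * an rcu critical section never see the memory reused
	 */
	bpf_mem_free_rcu(&bpf_global_ma, p);
}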

Another thing is related to list_head/rb_root.
During bpf_obj_free_fields(), is it possible that another cpu
may allocate a list_node/rb_node and add to list_head/rb_root?
If this is true, then we might have a memory leak.
But I don't know whether this is possible or not.

I think local kptr has the same issue as percpu kptr.

> 
> In other words all bpf pointers are either properly life time
> checked through the verifier and freed via immediate bpf_mem_free()
> or they're bpf_mem_free_rcu().
> 
> 
>>
>> I am thinking whether I could create another flavor of bpf_mem_free_rcu
>> with a pre_free_callback function, something like
>>     bpf_mem_free_rcu_cb2(struct bpf_mem_alloc *ma, void *ptr,
>>         void (*cb)(void *, void *), void *arg1, void *arg2)
>>
>> The cb(arg1, arg2) will be called right before the real free of "ptr".
>>
>> For example, for this patch, the callback function can be
>>
>> static bpf_obj_free_fields_cb(void *rec, void *p)
>> {
>>          if (rec)
>>                  bpf_obj_free_fields(rec, p);
>>                  /* we need to ensure recursive freeing fields free
>>                   * needs to be done immediately, which means we will
>>                   * add a parameter to __bpf_obj_drop_impl() to
>>                   * indicate whether bpf_mem_free or bpf_mem_free_rcu
>>                   * should be called.
>>                   */
>> }
>>
>> bpf_mem_free_rcu_cb2(&bpf_global_ma, p, bpf_obj_free_fields_cb, rec, p);
> 
> It sounds nice in theory, but will be difficult to implement.
> bpf_ma would need to add extra two 8 byte pointers for every object,
> but there is no room at present. So to free after rcu gp with a callback
> it would need to allocate 16 byte (at least) which might fail and so on.
> bpf_mem_free_rcu_cb2() would be the last resort.

I don't like this either. I hope we can find better solutions than this.
Alexei Starovoitov Aug. 24, 2023, 1:38 a.m. UTC | #4
On Wed, Aug 23, 2023 at 1:29 PM Yonghong Song <yonghong.song@linux.dev> wrote:
>
>
>
> On 8/23/23 9:20 AM, Alexei Starovoitov wrote:
> > On Tue, Aug 22, 2023 at 11:26 PM Yonghong Song <yonghong.song@linux.dev> wrote:
> >>
> >>
> >>
> >> On 8/21/23 12:33 PM, Dave Marchevsky wrote:
> >>> This is the final fix for the use-after-free scenario described in
> >>> commit 7793fc3babe9 ("bpf: Make bpf_refcount_acquire fallible for
> >>> non-owning refs"). That commit, by virtue of changing
> >>> bpf_refcount_acquire's refcount_inc to a refcount_inc_not_zero, fixed
> >>> the "refcount incr on 0" splat. The not_zero check in
> >>> refcount_inc_not_zero, though, still occurs on memory that could have
> >>> been free'd and reused, so the commit didn't properly fix the root
> >>> cause.
> >>>
> >>> This patch actually fixes the issue by free'ing using the recently-added
> >>> bpf_mem_free_rcu, which ensures that the memory is not reused until
> >>> RCU grace period has elapsed. If that has happened then
> >>> there are no non-owning references alive that point to the
> >>> recently-free'd memory, so it can be safely reused.
> >>>
> >>> Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
> >>> Acked-by: Yonghong Song <yonghong.song@linux.dev>
> >>> ---
> >>>    kernel/bpf/helpers.c | 6 +++++-
> >>>    1 file changed, 5 insertions(+), 1 deletion(-)
> >>>
> >>> diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
> >>> index eb91cae0612a..945a85e25ac5 100644
> >>> --- a/kernel/bpf/helpers.c
> >>> +++ b/kernel/bpf/helpers.c
> >>> @@ -1913,7 +1913,11 @@ void __bpf_obj_drop_impl(void *p, const struct btf_record *rec)
> >>>
> >>>        if (rec)
> >>>                bpf_obj_free_fields(rec, p);
> >>
> >> During reviewing my percpu kptr patch with link
> >>
> >> https://lore.kernel.org/bpf/20230814172809.1361446-1-yonghong.song@linux.dev/T/#m2f7631b8047e9f5da60a0a9cd8717fceaf1adbb7
> >> Kumar mentioned although percpu memory itself is freed under rcu.
> >> But its record fields are freed immediately. This will cause
> >> the problem since there may be some active uses of these fields
> >> within rcu cs and after bpf_obj_free_fields(), some fields may
> >> be re-initialized with new memory but they do not have chances
> >> to free any more.
> >>
> >> Do we have problem here as well?
> >
> > I think it's not an issue here or in your percpu patch,
> > since bpf_obj_free_fields() calls __bpf_obj_drop_impl() which will
> > call bpf_mem_free_rcu() (after this patch set lands).
>
> The following is my understanding.
>
> void bpf_obj_free_fields(const struct btf_record *rec, void *obj)
> {
>          const struct btf_field *fields;
>          int i;
>
>          if (IS_ERR_OR_NULL(rec))
>                  return;
>          fields = rec->fields;
>          for (i = 0; i < rec->cnt; i++) {
>                  struct btf_struct_meta *pointee_struct_meta;
>                  const struct btf_field *field = &fields[i];
>                  void *field_ptr = obj + field->offset;
>                  void *xchgd_field;
>
>                  switch (fields[i].type) {
>                  case BPF_SPIN_LOCK:
>                          break;
>                  case BPF_TIMER:
>                          bpf_timer_cancel_and_free(field_ptr);
>                          break;
>                  case BPF_KPTR_UNREF:
>                          WRITE_ONCE(*(u64 *)field_ptr, 0);
>                          break;
>                  case BPF_KPTR_REF:
>                         ......
>                          break;
>                  case BPF_LIST_HEAD:
>                          if (WARN_ON_ONCE(rec->spin_lock_off < 0))
>                                  continue;
>                          bpf_list_head_free(field, field_ptr, obj +
> rec->spin_lock_off);
>                          break;
>                  case BPF_RB_ROOT:
>                          if (WARN_ON_ONCE(rec->spin_lock_off < 0))
>                                  continue;
>                          bpf_rb_root_free(field, field_ptr, obj +
> rec->spin_lock_off);
>                          break;
>                  case BPF_LIST_NODE:
>                  case BPF_RB_NODE:
>                  case BPF_REFCOUNT:
>                          break;
>                  default:
>                          WARN_ON_ONCE(1);
>                          continue;
>                  }
>          }
> }
>
> For percpu kptr, the remaining possible actiionable fields are
>         BPF_LIST_HEAD and BPF_RB_ROOT
>
> So BPF_LIST_HEAD and BPF_RB_ROOT will try to go through all
> list/rb nodes to unlink them from the list_head/rb_root.
>
> So yes, rb_nodes and list nodes will call __bpf_obj_drop_impl().
> Depending on whether the correspondingrec
> with rb_node/list_node has ref count or not,
> it may call bpf_mem_free() or bpf_mem_free_rcu(). If
> bpf_mem_free() is called, then the field is immediately freed
> but it may be used by some bpf prog (under rcu) concurrently,
> could this be an issue?

I see. Yeah. Looks like percpu makes such fields refcount-like.
For non-percpu, non-refcount, only one bpf prog on one cpu can observe
that object. That's why we're doing a plain bpf_mem_free() for them.

So this patch is a good fix for refcounted, but you and Kumar are
correct that it's not sufficient for the case when a percpu struct
includes multiple rb_roots, one for each cpu.

> Changing bpf_mem_free() in
> __bpf_obj_drop_impl() to bpf_mem_free_rcu() should fix this problem.

I guess we can do that when the obj is either refcounted or can be
inside percpu, but it might not be enough. More below.

> Another thing is related to list_head/rb_root.
> During bpf_obj_free_fields(), is it possible that another cpu
> may allocate a list_node/rb_node and add to list_head/rb_root?

It's not an issue for the single owner case and for refcounted.
Access to rb_root/list_head is always lock protected.
For refcounted, the obj needs to be acquired (from the verifier pov),
meaning it has refcount = 1, to be able to do spin_lock and
operate on the list_head.

But bpf_rb_root_free is indeed an issue for percpu, since each
cpu has its own rb_root field with its own bpf_spin_lock, but
for_each_cpu() { bpf_obj_free_fields(); } violates the access contract.

percpu and rb_root create such a maze of dependencies that
I think it's better to disallow rb_root-s and kptr-s inside percpu
for now.
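
For illustration, the kind of layout that would be disallowed is roughly
the following (a sketch only; the __percpu_kptr tag is from the percpu
kptr series linked above, and the type names are made up):

struct node_data {
	long key;
	struct bpf_rb_node node;
};

/* each cpu's copy of 'pc_inner' gets its own rb_root + lock, and
 * for_each_cpu() cleanup of those roots is what creates the maze
 * described above
 */
struct pc_inner {
	struct bpf_spin_lock lock;
	struct bpf_rb_root root __contains(node_data, node);
};

struct map_value {
	struct pc_inner __percpu_kptr *pc;
};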

> If this is true, then we might have a memory leak.
> But I don't whether this is possible or not.
>
> I think local kptr has the issue as percpu kptr.

Let's tackle one at a time.
I still think Dave's patch set is a good fix for refcounted,
while we need to think more about the percpu case.
Alexei Starovoitov Aug. 24, 2023, 2:09 a.m. UTC | #5
On Wed, Aug 23, 2023 at 6:38 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Wed, Aug 23, 2023 at 1:29 PM Yonghong Song <yonghong.song@linux.dev> wrote:
> >
> >
> >
> > On 8/23/23 9:20 AM, Alexei Starovoitov wrote:
> > > On Tue, Aug 22, 2023 at 11:26 PM Yonghong Song <yonghong.song@linux.dev> wrote:
> > >>
> > >>
> > >>
> > >> On 8/21/23 12:33 PM, Dave Marchevsky wrote:
> > >>> This is the final fix for the use-after-free scenario described in
> > >>> commit 7793fc3babe9 ("bpf: Make bpf_refcount_acquire fallible for
> > >>> non-owning refs"). That commit, by virtue of changing
> > >>> bpf_refcount_acquire's refcount_inc to a refcount_inc_not_zero, fixed
> > >>> the "refcount incr on 0" splat. The not_zero check in
> > >>> refcount_inc_not_zero, though, still occurs on memory that could have
> > >>> been free'd and reused, so the commit didn't properly fix the root
> > >>> cause.
> > >>>
> > >>> This patch actually fixes the issue by free'ing using the recently-added
> > >>> bpf_mem_free_rcu, which ensures that the memory is not reused until
> > >>> RCU grace period has elapsed. If that has happened then
> > >>> there are no non-owning references alive that point to the
> > >>> recently-free'd memory, so it can be safely reused.
> > >>>
> > >>> Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
> > >>> Acked-by: Yonghong Song <yonghong.song@linux.dev>
> > >>> ---
> > >>>    kernel/bpf/helpers.c | 6 +++++-
> > >>>    1 file changed, 5 insertions(+), 1 deletion(-)
> > >>>
> > >>> diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
> > >>> index eb91cae0612a..945a85e25ac5 100644
> > >>> --- a/kernel/bpf/helpers.c
> > >>> +++ b/kernel/bpf/helpers.c
> > >>> @@ -1913,7 +1913,11 @@ void __bpf_obj_drop_impl(void *p, const struct btf_record *rec)
> > >>>
> > >>>        if (rec)
> > >>>                bpf_obj_free_fields(rec, p);
> > >>
> > >> During reviewing my percpu kptr patch with link
> > >>
> > >> https://lore.kernel.org/bpf/20230814172809.1361446-1-yonghong.song@linux.dev/T/#m2f7631b8047e9f5da60a0a9cd8717fceaf1adbb7
> > >> Kumar mentioned although percpu memory itself is freed under rcu.
> > >> But its record fields are freed immediately. This will cause
> > >> the problem since there may be some active uses of these fields
> > >> within rcu cs and after bpf_obj_free_fields(), some fields may
> > >> be re-initialized with new memory but they do not have chances
> > >> to free any more.
> > >>
> > >> Do we have problem here as well?
> > >
> > > I think it's not an issue here or in your percpu patch,
> > > since bpf_obj_free_fields() calls __bpf_obj_drop_impl() which will
> > > call bpf_mem_free_rcu() (after this patch set lands).
> >
> > The following is my understanding.
> >
> > void bpf_obj_free_fields(const struct btf_record *rec, void *obj)
> > {
> >          const struct btf_field *fields;
> >          int i;
> >
> >          if (IS_ERR_OR_NULL(rec))
> >                  return;
> >          fields = rec->fields;
> >          for (i = 0; i < rec->cnt; i++) {
> >                  struct btf_struct_meta *pointee_struct_meta;
> >                  const struct btf_field *field = &fields[i];
> >                  void *field_ptr = obj + field->offset;
> >                  void *xchgd_field;
> >
> >                  switch (fields[i].type) {
> >                  case BPF_SPIN_LOCK:
> >                          break;
> >                  case BPF_TIMER:
> >                          bpf_timer_cancel_and_free(field_ptr);
> >                          break;
> >                  case BPF_KPTR_UNREF:
> >                          WRITE_ONCE(*(u64 *)field_ptr, 0);
> >                          break;
> >                  case BPF_KPTR_REF:
> >                         ......
> >                          break;
> >                  case BPF_LIST_HEAD:
> >                          if (WARN_ON_ONCE(rec->spin_lock_off < 0))
> >                                  continue;
> >                          bpf_list_head_free(field, field_ptr, obj +
> > rec->spin_lock_off);
> >                          break;
> >                  case BPF_RB_ROOT:
> >                          if (WARN_ON_ONCE(rec->spin_lock_off < 0))
> >                                  continue;
> >                          bpf_rb_root_free(field, field_ptr, obj +
> > rec->spin_lock_off);
> >                          break;
> >                  case BPF_LIST_NODE:
> >                  case BPF_RB_NODE:
> >                  case BPF_REFCOUNT:
> >                          break;
> >                  default:
> >                          WARN_ON_ONCE(1);
> >                          continue;
> >                  }
> >          }
> > }
> >
> > For percpu kptr, the remaining possible actiionable fields are
> >         BPF_LIST_HEAD and BPF_RB_ROOT
> >
> > So BPF_LIST_HEAD and BPF_RB_ROOT will try to go through all
> > list/rb nodes to unlink them from the list_head/rb_root.
> >
> > So yes, rb_nodes and list nodes will call __bpf_obj_drop_impl().
> > Depending on whether the correspondingrec
> > with rb_node/list_node has ref count or not,
> > it may call bpf_mem_free() or bpf_mem_free_rcu(). If
> > bpf_mem_free() is called, then the field is immediately freed
> > but it may be used by some bpf prog (under rcu) concurrently,
> > could this be an issue?
>
> I see. Yeah. Looks like percpu makes such fields refcount-like.
> For non-percpu non-refcount only one bpf prog on one cpu can observe
> that object. That's why we're doing plain bpf_mem_free() for them.
>
> So this patch is a good fix for refcounted, but you and Kumar are
> correct that it's not sufficient for the case when percpu struct
> includes multiple rb_roots. One for each cpu.
>
> > Changing bpf_mem_free() in
> > __bpf_obj_drop_impl() to bpf_mem_free_rcu() should fix this problem.
>
> I guess we can do that when obj is either refcount or can be
> insider percpu, but it might not be enough. More below.
>
> > Another thing is related to list_head/rb_root.
> > During bpf_obj_free_fields(), is it possible that another cpu
> > may allocate a list_node/rb_node and add to list_head/rb_root?
>
> It's not an issue for the single owner case and for refcounted.
> Access to rb_root/list_head is always lock protected.
> For refcounted the obj needs to be acquired (from the verifier pov)
> meaning to have refcount =1 to be able to do spin_lock and
> operate on list_head.
>
> But bpf_rb_root_free is indeed an issue for percpu, since each
> percpu has its own rb root field with its own bpf_spinlock, but
> for_each_cpu() {bpf_obj_free_fields();} violates access contract.
>
> percpu and rb_root creates such a maze of dependencies that
> I think it's better to disallow rb_root-s and kptr-s inside percpu
> for now.
>
> > If this is true, then we might have a memory leak.
> > But I don't whether this is possible or not.
> >
> > I think local kptr has the issue as percpu kptr.
>
> Let's tackle one at a time.
> I still think Dave's patch set is a good fix for recounted,
> while we need to think more about percpu case.

I'm planning to land the series even though there could still be bugs to fix.
At least the patches fix known issues.
Please yell if you think we have to wait for another release.

Like I think the following is needed regardless:
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index bb78212fa5b2..efa6482b1c2c 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -7210,6 +7210,10 @@ static int process_spin_lock(struct bpf_verifier_env *env, int regno,
                        map ? map->name : "kptr");
                return -EINVAL;
        }
+       if (type_is_non_owning_ref(reg->type)) {
+               verbose(env, "Accessing locks in non-owning object is not allowed\n");
+               return -EINVAL;
+       }
        if (rec->spin_lock_off != val + reg->off) {
                verbose(env, "off %lld doesn't point to 'struct bpf_spin_lock' that is at %d\n",

atm I don't see where we enforce such a rule.
Since refcounted non-owning refs can have refcount == 0, it's not safe
to access non-scalar objects inside them.
Maybe stronger enforcement is needed?
Yonghong Song Aug. 24, 2023, 3:52 a.m. UTC | #6
On 8/23/23 6:38 PM, Alexei Starovoitov wrote:
> On Wed, Aug 23, 2023 at 1:29 PM Yonghong Song <yonghong.song@linux.dev> wrote:
>>
>>
>>
>> On 8/23/23 9:20 AM, Alexei Starovoitov wrote:
>>> On Tue, Aug 22, 2023 at 11:26 PM Yonghong Song <yonghong.song@linux.dev> wrote:
>>>>
>>>>
>>>>
>>>> On 8/21/23 12:33 PM, Dave Marchevsky wrote:
>>>>> This is the final fix for the use-after-free scenario described in
>>>>> commit 7793fc3babe9 ("bpf: Make bpf_refcount_acquire fallible for
>>>>> non-owning refs"). That commit, by virtue of changing
>>>>> bpf_refcount_acquire's refcount_inc to a refcount_inc_not_zero, fixed
>>>>> the "refcount incr on 0" splat. The not_zero check in
>>>>> refcount_inc_not_zero, though, still occurs on memory that could have
>>>>> been free'd and reused, so the commit didn't properly fix the root
>>>>> cause.
>>>>>
>>>>> This patch actually fixes the issue by free'ing using the recently-added
>>>>> bpf_mem_free_rcu, which ensures that the memory is not reused until
>>>>> RCU grace period has elapsed. If that has happened then
>>>>> there are no non-owning references alive that point to the
>>>>> recently-free'd memory, so it can be safely reused.
>>>>>
>>>>> Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
>>>>> Acked-by: Yonghong Song <yonghong.song@linux.dev>
>>>>> ---
>>>>>     kernel/bpf/helpers.c | 6 +++++-
>>>>>     1 file changed, 5 insertions(+), 1 deletion(-)
>>>>>
>>>>> diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
>>>>> index eb91cae0612a..945a85e25ac5 100644
>>>>> --- a/kernel/bpf/helpers.c
>>>>> +++ b/kernel/bpf/helpers.c
>>>>> @@ -1913,7 +1913,11 @@ void __bpf_obj_drop_impl(void *p, const struct btf_record *rec)
>>>>>
>>>>>         if (rec)
>>>>>                 bpf_obj_free_fields(rec, p);
>>>>
>>>> During reviewing my percpu kptr patch with link
>>>>
>>>> https://lore.kernel.org/bpf/20230814172809.1361446-1-yonghong.song@linux.dev/T/#m2f7631b8047e9f5da60a0a9cd8717fceaf1adbb7
>>>> Kumar mentioned although percpu memory itself is freed under rcu.
>>>> But its record fields are freed immediately. This will cause
>>>> the problem since there may be some active uses of these fields
>>>> within rcu cs and after bpf_obj_free_fields(), some fields may
>>>> be re-initialized with new memory but they do not have chances
>>>> to free any more.
>>>>
>>>> Do we have problem here as well?
>>>
>>> I think it's not an issue here or in your percpu patch,
>>> since bpf_obj_free_fields() calls __bpf_obj_drop_impl() which will
>>> call bpf_mem_free_rcu() (after this patch set lands).
>>
>> The following is my understanding.
>>
>> void bpf_obj_free_fields(const struct btf_record *rec, void *obj)
>> {
>>           const struct btf_field *fields;
>>           int i;
>>
>>           if (IS_ERR_OR_NULL(rec))
>>                   return;
>>           fields = rec->fields;
>>           for (i = 0; i < rec->cnt; i++) {
>>                   struct btf_struct_meta *pointee_struct_meta;
>>                   const struct btf_field *field = &fields[i];
>>                   void *field_ptr = obj + field->offset;
>>                   void *xchgd_field;
>>
>>                   switch (fields[i].type) {
>>                   case BPF_SPIN_LOCK:
>>                           break;
>>                   case BPF_TIMER:
>>                           bpf_timer_cancel_and_free(field_ptr);
>>                           break;
>>                   case BPF_KPTR_UNREF:
>>                           WRITE_ONCE(*(u64 *)field_ptr, 0);
>>                           break;
>>                   case BPF_KPTR_REF:
>>                          ......
>>                           break;
>>                   case BPF_LIST_HEAD:
>>                           if (WARN_ON_ONCE(rec->spin_lock_off < 0))
>>                                   continue;
>>                           bpf_list_head_free(field, field_ptr, obj +
>> rec->spin_lock_off);
>>                           break;
>>                   case BPF_RB_ROOT:
>>                           if (WARN_ON_ONCE(rec->spin_lock_off < 0))
>>                                   continue;
>>                           bpf_rb_root_free(field, field_ptr, obj +
>> rec->spin_lock_off);
>>                           break;
>>                   case BPF_LIST_NODE:
>>                   case BPF_RB_NODE:
>>                   case BPF_REFCOUNT:
>>                           break;
>>                   default:
>>                           WARN_ON_ONCE(1);
>>                           continue;
>>                   }
>>           }
>> }
>>
>> For percpu kptr, the remaining possible actiionable fields are
>>          BPF_LIST_HEAD and BPF_RB_ROOT
>>
>> So BPF_LIST_HEAD and BPF_RB_ROOT will try to go through all
>> list/rb nodes to unlink them from the list_head/rb_root.
>>
>> So yes, rb_nodes and list nodes will call __bpf_obj_drop_impl().
>> Depending on whether the correspondingrec
>> with rb_node/list_node has ref count or not,
>> it may call bpf_mem_free() or bpf_mem_free_rcu(). If
>> bpf_mem_free() is called, then the field is immediately freed
>> but it may be used by some bpf prog (under rcu) concurrently,
>> could this be an issue?
> 
> I see. Yeah. Looks like percpu makes such fields refcount-like.
> For non-percpu non-refcount only one bpf prog on one cpu can observe
> that object. That's why we're doing plain bpf_mem_free() for them.
> 
> So this patch is a good fix for refcounted, but you and Kumar are
> correct that it's not sufficient for the case when percpu struct
> includes multiple rb_roots. One for each cpu.
> 
>> Changing bpf_mem_free() in
>> __bpf_obj_drop_impl() to bpf_mem_free_rcu() should fix this problem.
> 
> I guess we can do that when obj is either refcount or can be
> insider percpu, but it might not be enough. More below.
> 
>> Another thing is related to list_head/rb_root.
>> During bpf_obj_free_fields(), is it possible that another cpu
>> may allocate a list_node/rb_node and add to list_head/rb_root?
> 
> It's not an issue for the single owner case and for refcounted.
> Access to rb_root/list_head is always lock protected.
> For refcounted the obj needs to be acquired (from the verifier pov)
> meaning to have refcount =1 to be able to do spin_lock and
> operate on list_head.

Martin and I came up with the following example earlier today,
assuming the map value struct contains a list_head and a spin_lock.

          cpu 0                              cpu 1
       key = 1;
       v = bpf_map_lookup(&map, &key);
                                         key = 1;
                                         bpf_map_delete_elem(&map, &key);
                                         /* destruction continues and
                                          * bpf_obj_free_fields() are
                                          * called.
                                          */
                                         /* in bpf_list_head_free():
                                          * __bpf_spin_lock_irqsave(...)
                                          * ...
                                          * __bpf_spin_unlock_irqrestore();
                                          */

       n = bpf_obj_new(...)
       bpf_spin_lock(&v->lock);
       bpf_list_push_front(&v->head, &n->node);
       bpf_spin_unlock(&v->lock);

In cpu 1's bpf_obj_free_fields(), there is a list_head, so
bpf_list_head_free() is called. In bpf_list_head_free(), we do

         __bpf_spin_lock_irqsave(spin_lock);
         if (!head->next || list_empty(head))
                 goto unlock;
         head = head->next;
unlock:
         INIT_LIST_HEAD(orig_head);
         __bpf_spin_unlock_irqrestore(spin_lock);

         while (head != orig_head) {
                 void *obj = head;

                 obj -= field->graph_root.node_offset;
                 head = head->next;
                 /* The contained type can also have resources, including a
                  * bpf_list_head which needs to be freed.
                  */
                 migrate_disable();
                 __bpf_obj_drop_impl(obj, field->graph_root.value_rec);
                 migrate_enable();
         }

So it is possible that cpu 0 may add an element to the list_head
which will never be freed.

This applies to both list_head and rb_root. I am aware that
this may be an existing issue for some maps, e.g. the hash map,
so it may not be a big problem. Just want to mention it though.

> 
> But bpf_rb_root_free is indeed an issue for percpu, since each
> percpu has its own rb root field with its own bpf_spinlock, but
> for_each_cpu() {bpf_obj_free_fields();} violates access contract.

Could you explain what 'access contract' means here? For the non-percpu
case, 'x' special fields may be checked. For the percpu case, it is
just 'x * nr_cpus' special fields to be checked.

> 
> percpu and rb_root creates such a maze of dependencies that
> I think it's better to disallow rb_root-s and kptr-s inside percpu
> for now.

I can certainly disallow rb_root and list_head. I just want to
understand what kind of issues we face specifically for percpu kptr.

Current kptrs cannot be nested since the first argument of
bpf_kptr_xchg() must be a map_value. So the only impactful fields
(w.r.t. bpf_obj_free_fields()) are rb_root and list_head.

> 
>> If this is true, then we might have a memory leak.
>> But I don't whether this is possible or not.
>>
>> I think local kptr has the issue as percpu kptr.
> 
> Let's tackle one at a time.
> I still think Dave's patch set is a good fix for recounted,
> while we need to think more about percpu case.

I agree that Dave's patch indeed fixed an existing issue.
We can resolve other issues (like above) gradually.
Yonghong Song Aug. 24, 2023, 4:01 a.m. UTC | #7
On 8/23/23 7:09 PM, Alexei Starovoitov wrote:
> On Wed, Aug 23, 2023 at 6:38 PM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
>>
>> On Wed, Aug 23, 2023 at 1:29 PM Yonghong Song <yonghong.song@linux.dev> wrote:
>>>
>>>
>>>
>>> On 8/23/23 9:20 AM, Alexei Starovoitov wrote:
>>>> On Tue, Aug 22, 2023 at 11:26 PM Yonghong Song <yonghong.song@linux.dev> wrote:
>>>>>
>>>>>
>>>>>
>>>>> On 8/21/23 12:33 PM, Dave Marchevsky wrote:
>>>>>> This is the final fix for the use-after-free scenario described in
>>>>>> commit 7793fc3babe9 ("bpf: Make bpf_refcount_acquire fallible for
>>>>>> non-owning refs"). That commit, by virtue of changing
>>>>>> bpf_refcount_acquire's refcount_inc to a refcount_inc_not_zero, fixed
>>>>>> the "refcount incr on 0" splat. The not_zero check in
>>>>>> refcount_inc_not_zero, though, still occurs on memory that could have
>>>>>> been free'd and reused, so the commit didn't properly fix the root
>>>>>> cause.
>>>>>>
>>>>>> This patch actually fixes the issue by free'ing using the recently-added
>>>>>> bpf_mem_free_rcu, which ensures that the memory is not reused until
>>>>>> RCU grace period has elapsed. If that has happened then
>>>>>> there are no non-owning references alive that point to the
>>>>>> recently-free'd memory, so it can be safely reused.
>>>>>>
>>>>>> Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
>>>>>> Acked-by: Yonghong Song <yonghong.song@linux.dev>
>>>>>> ---
>>>>>>     kernel/bpf/helpers.c | 6 +++++-
>>>>>>     1 file changed, 5 insertions(+), 1 deletion(-)
>>>>>>
>>>>>> diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
>>>>>> index eb91cae0612a..945a85e25ac5 100644
>>>>>> --- a/kernel/bpf/helpers.c
>>>>>> +++ b/kernel/bpf/helpers.c
>>>>>> @@ -1913,7 +1913,11 @@ void __bpf_obj_drop_impl(void *p, const struct btf_record *rec)
>>>>>>
>>>>>>         if (rec)
>>>>>>                 bpf_obj_free_fields(rec, p);
>>>>>
>>>>> During reviewing my percpu kptr patch with link
>>>>>
>>>>> https://lore.kernel.org/bpf/20230814172809.1361446-1-yonghong.song@linux.dev/T/#m2f7631b8047e9f5da60a0a9cd8717fceaf1adbb7
>>>>> Kumar mentioned although percpu memory itself is freed under rcu.
>>>>> But its record fields are freed immediately. This will cause
>>>>> the problem since there may be some active uses of these fields
>>>>> within rcu cs and after bpf_obj_free_fields(), some fields may
>>>>> be re-initialized with new memory but they do not have chances
>>>>> to free any more.
>>>>>
>>>>> Do we have problem here as well?
>>>>
>>>> I think it's not an issue here or in your percpu patch,
>>>> since bpf_obj_free_fields() calls __bpf_obj_drop_impl() which will
>>>> call bpf_mem_free_rcu() (after this patch set lands).
>>>
>>> The following is my understanding.
>>>
>>> void bpf_obj_free_fields(const struct btf_record *rec, void *obj)
>>> {
>>>           const struct btf_field *fields;
>>>           int i;
>>>
>>>           if (IS_ERR_OR_NULL(rec))
>>>                   return;
>>>           fields = rec->fields;
>>>           for (i = 0; i < rec->cnt; i++) {
>>>                   struct btf_struct_meta *pointee_struct_meta;
>>>                   const struct btf_field *field = &fields[i];
>>>                   void *field_ptr = obj + field->offset;
>>>                   void *xchgd_field;
>>>
>>>                   switch (fields[i].type) {
>>>                   case BPF_SPIN_LOCK:
>>>                           break;
>>>                   case BPF_TIMER:
>>>                           bpf_timer_cancel_and_free(field_ptr);
>>>                           break;
>>>                   case BPF_KPTR_UNREF:
>>>                           WRITE_ONCE(*(u64 *)field_ptr, 0);
>>>                           break;
>>>                   case BPF_KPTR_REF:
>>>                          ......
>>>                           break;
>>>                   case BPF_LIST_HEAD:
>>>                           if (WARN_ON_ONCE(rec->spin_lock_off < 0))
>>>                                   continue;
>>>                           bpf_list_head_free(field, field_ptr, obj +
>>> rec->spin_lock_off);
>>>                           break;
>>>                   case BPF_RB_ROOT:
>>>                           if (WARN_ON_ONCE(rec->spin_lock_off < 0))
>>>                                   continue;
>>>                           bpf_rb_root_free(field, field_ptr, obj +
>>> rec->spin_lock_off);
>>>                           break;
>>>                   case BPF_LIST_NODE:
>>>                   case BPF_RB_NODE:
>>>                   case BPF_REFCOUNT:
>>>                           break;
>>>                   default:
>>>                           WARN_ON_ONCE(1);
>>>                           continue;
>>>                   }
>>>           }
>>> }
>>>
>>> For percpu kptr, the remaining possible actiionable fields are
>>>          BPF_LIST_HEAD and BPF_RB_ROOT
>>>
>>> So BPF_LIST_HEAD and BPF_RB_ROOT will try to go through all
>>> list/rb nodes to unlink them from the list_head/rb_root.
>>>
>>> So yes, rb_nodes and list nodes will call __bpf_obj_drop_impl().
>>> Depending on whether the correspondingrec
>>> with rb_node/list_node has ref count or not,
>>> it may call bpf_mem_free() or bpf_mem_free_rcu(). If
>>> bpf_mem_free() is called, then the field is immediately freed
>>> but it may be used by some bpf prog (under rcu) concurrently,
>>> could this be an issue?
>>
>> I see. Yeah. Looks like percpu makes such fields refcount-like.
>> For non-percpu non-refcount only one bpf prog on one cpu can observe
>> that object. That's why we're doing plain bpf_mem_free() for them.
>>
>> So this patch is a good fix for refcounted, but you and Kumar are
>> correct that it's not sufficient for the case when percpu struct
>> includes multiple rb_roots. One for each cpu.
>>
>>> Changing bpf_mem_free() in
>>> __bpf_obj_drop_impl() to bpf_mem_free_rcu() should fix this problem.
>>
>> I guess we can do that when obj is either refcount or can be
>> insider percpu, but it might not be enough. More below.
>>
>>> Another thing is related to list_head/rb_root.
>>> During bpf_obj_free_fields(), is it possible that another cpu
>>> may allocate a list_node/rb_node and add to list_head/rb_root?
>>
>> It's not an issue for the single owner case and for refcounted.
>> Access to rb_root/list_head is always lock protected.
>> For refcounted the obj needs to be acquired (from the verifier pov)
>> meaning to have refcount =1 to be able to do spin_lock and
>> operate on list_head.
>>
>> But bpf_rb_root_free is indeed an issue for percpu, since each
>> percpu has its own rb root field with its own bpf_spinlock, but
>> for_each_cpu() {bpf_obj_free_fields();} violates access contract.
>>
>> percpu and rb_root creates such a maze of dependencies that
>> I think it's better to disallow rb_root-s and kptr-s inside percpu
>> for now.
>>
>>> If this is true, then we might have a memory leak.
>>> But I don't whether this is possible or not.
>>>
>>> I think local kptr has the issue as percpu kptr.
>>
>> Let's tackle one at a time.
>> I still think Dave's patch set is a good fix for recounted,
>> while we need to think more about percpu case.
> 
> I'm planning to land the series though there could be still bugs to fix.
> At least they're fixing known issues.
> Please yell if you think we have to wait for another release.
> 
> Like I think the following is needed regardless:
> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index bb78212fa5b2..efa6482b1c2c 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -7210,6 +7210,10 @@ static int process_spin_lock(struct
> bpf_verifier_env *env, int regno,
>                          map ? map->name : "kptr");
>                  return -EINVAL;
>          }
> +       if (type_is_non_owning_ref(reg->type)) {
> +               verbose(env, "Accessing locks in non-owning object is
> not allowed\n");
> +               return -EINVAL;
> +       }
>          if (rec->spin_lock_off != val + reg->off) {
>                  verbose(env, "off %lld doesn't point to 'struct
> bpf_spin_lock' that is at %d\n",
> 
> atm I don't see where we enforce such.
> Since refcounted non-own refs can have refcount=0 it's not safe to access
> non-scalar objects inside them.
> Maybe stronger enforcement is needed?

Currently in bpf_obj_free_fields(), BPF_SPIN_LOCK is not touched, which
means it CAN be used for a non-owning reference. I do not have
strong preferences one way or another. Not sure whether Dave has
tests which use spin_lock with a non-owning ref or not.
Alexei Starovoitov Aug. 24, 2023, 10:03 p.m. UTC | #8
On Wed, Aug 23, 2023 at 8:52 PM Yonghong Song <yonghong.song@linux.dev> wrote:
>
>
>
> On 8/23/23 6:38 PM, Alexei Starovoitov wrote:
> > On Wed, Aug 23, 2023 at 1:29 PM Yonghong Song <yonghong.song@linux.dev> wrote:
> >>
> >>
> >>
> >> On 8/23/23 9:20 AM, Alexei Starovoitov wrote:
> >>> On Tue, Aug 22, 2023 at 11:26 PM Yonghong Song <yonghong.song@linux.dev> wrote:
> >>>>
> >>>>
> >>>>
> >>>> On 8/21/23 12:33 PM, Dave Marchevsky wrote:
> >>>>> This is the final fix for the use-after-free scenario described in
> >>>>> commit 7793fc3babe9 ("bpf: Make bpf_refcount_acquire fallible for
> >>>>> non-owning refs"). That commit, by virtue of changing
> >>>>> bpf_refcount_acquire's refcount_inc to a refcount_inc_not_zero, fixed
> >>>>> the "refcount incr on 0" splat. The not_zero check in
> >>>>> refcount_inc_not_zero, though, still occurs on memory that could have
> >>>>> been free'd and reused, so the commit didn't properly fix the root
> >>>>> cause.
> >>>>>
> >>>>> This patch actually fixes the issue by free'ing using the recently-added
> >>>>> bpf_mem_free_rcu, which ensures that the memory is not reused until
> >>>>> RCU grace period has elapsed. If that has happened then
> >>>>> there are no non-owning references alive that point to the
> >>>>> recently-free'd memory, so it can be safely reused.
> >>>>>
> >>>>> Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
> >>>>> Acked-by: Yonghong Song <yonghong.song@linux.dev>
> >>>>> ---
> >>>>>     kernel/bpf/helpers.c | 6 +++++-
> >>>>>     1 file changed, 5 insertions(+), 1 deletion(-)
> >>>>>
> >>>>> diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
> >>>>> index eb91cae0612a..945a85e25ac5 100644
> >>>>> --- a/kernel/bpf/helpers.c
> >>>>> +++ b/kernel/bpf/helpers.c
> >>>>> @@ -1913,7 +1913,11 @@ void __bpf_obj_drop_impl(void *p, const struct btf_record *rec)
> >>>>>
> >>>>>         if (rec)
> >>>>>                 bpf_obj_free_fields(rec, p);
> >>>>
> >>>> During reviewing my percpu kptr patch with link
> >>>>
> >>>> https://lore.kernel.org/bpf/20230814172809.1361446-1-yonghong.song@linux.dev/T/#m2f7631b8047e9f5da60a0a9cd8717fceaf1adbb7
> >>>> Kumar mentioned although percpu memory itself is freed under rcu.
> >>>> But its record fields are freed immediately. This will cause
> >>>> the problem since there may be some active uses of these fields
> >>>> within rcu cs and after bpf_obj_free_fields(), some fields may
> >>>> be re-initialized with new memory but they do not have chances
> >>>> to free any more.
> >>>>
> >>>> Do we have problem here as well?
> >>>
> >>> I think it's not an issue here or in your percpu patch,
> >>> since bpf_obj_free_fields() calls __bpf_obj_drop_impl() which will
> >>> call bpf_mem_free_rcu() (after this patch set lands).
> >>
> >> The following is my understanding.
> >>
> >> void bpf_obj_free_fields(const struct btf_record *rec, void *obj)
> >> {
> >>           const struct btf_field *fields;
> >>           int i;
> >>
> >>           if (IS_ERR_OR_NULL(rec))
> >>                   return;
> >>           fields = rec->fields;
> >>           for (i = 0; i < rec->cnt; i++) {
> >>                   struct btf_struct_meta *pointee_struct_meta;
> >>                   const struct btf_field *field = &fields[i];
> >>                   void *field_ptr = obj + field->offset;
> >>                   void *xchgd_field;
> >>
> >>                   switch (fields[i].type) {
> >>                   case BPF_SPIN_LOCK:
> >>                           break;
> >>                   case BPF_TIMER:
> >>                           bpf_timer_cancel_and_free(field_ptr);
> >>                           break;
> >>                   case BPF_KPTR_UNREF:
> >>                           WRITE_ONCE(*(u64 *)field_ptr, 0);
> >>                           break;
> >>                   case BPF_KPTR_REF:
> >>                          ......
> >>                           break;
> >>                   case BPF_LIST_HEAD:
> >>                           if (WARN_ON_ONCE(rec->spin_lock_off < 0))
> >>                                   continue;
> >>                           bpf_list_head_free(field, field_ptr, obj +
> >> rec->spin_lock_off);
> >>                           break;
> >>                   case BPF_RB_ROOT:
> >>                           if (WARN_ON_ONCE(rec->spin_lock_off < 0))
> >>                                   continue;
> >>                           bpf_rb_root_free(field, field_ptr, obj +
> >> rec->spin_lock_off);
> >>                           break;
> >>                   case BPF_LIST_NODE:
> >>                   case BPF_RB_NODE:
> >>                   case BPF_REFCOUNT:
> >>                           break;
> >>                   default:
> >>                           WARN_ON_ONCE(1);
> >>                           continue;
> >>                   }
> >>           }
> >> }
> >>
> >> For percpu kptr, the remaining possible actiionable fields are
> >>          BPF_LIST_HEAD and BPF_RB_ROOT
> >>
> >> So BPF_LIST_HEAD and BPF_RB_ROOT will try to go through all
> >> list/rb nodes to unlink them from the list_head/rb_root.
> >>
> >> So yes, rb_nodes and list nodes will call __bpf_obj_drop_impl().
> >> Depending on whether the corresponding rec
> >> with the rb_node/list_node has a refcount or not,
> >> it may call bpf_mem_free() or bpf_mem_free_rcu(). If
> >> bpf_mem_free() is called, then the field is immediately freed,
> >> but it may still be used by some bpf prog (under RCU) concurrently;
> >> could this be an issue?
> >
> > I see. Yeah. Looks like percpu makes such fields refcount-like.
> > For non-percpu, non-refcount objects only one bpf prog on one cpu can
> > observe the object. That's why we're doing a plain bpf_mem_free() for them.
> >
> > So this patch is a good fix for refcounted, but you and Kumar are
> > correct that it's not sufficient for the case where a percpu struct
> > includes multiple rb_roots, one for each cpu.
> >
> >> Changing bpf_mem_free() in
> >> __bpf_obj_drop_impl() to bpf_mem_free_rcu() should fix this problem.
> >
> > I guess we can do that when the obj is either refcounted or can be
> > inside percpu, but it might not be enough. More below.
> >
> >> Another thing is related to list_head/rb_root.
> >> During bpf_obj_free_fields(), is it possible that another cpu
> >> may allocate a list_node/rb_node and add it to the list_head/rb_root?
> >
> > It's not an issue for the single owner case and for refcounted.
> > Access to rb_root/list_head is always lock protected.
> > For refcounted, the obj needs to be acquired (from the verifier pov),
> > meaning it has refcount = 1, to be able to do spin_lock and
> > operate on the list_head.
>
> Martin and I came up with the following example earlier today,
> assuming the map value struct contains a list_head and a spin_lock.
>
>           cpu 0                              cpu 1
>        key = 1;
>        v = bpf_map_lookup(&map, &key);
>                                          key = 1;
>                                          bpf_map_delete_elem(&map, &key);
>                                          /* destruction continues and
>                                           * bpf_obj_free_fields() is
>                                           * called.
>                                           */
>                                          /* in bpf_list_head_free():
>                                           * __bpf_spin_lock_irqsave(...)
>                                           * ...
>                                           * __bpf_spin_unlock_irqrestore();
>                                           */
>
>        n = bpf_obj_new(...);
>        bpf_spin_lock(&v->lock);
>        bpf_list_push_front(&v->head, &n->node);
>        bpf_spin_unlock(&v->lock);
>
> On cpu 1, bpf_obj_free_fields() sees a list_head, so
> bpf_list_head_free() is called. In bpf_list_head_free(), we do:
>
>          __bpf_spin_lock_irqsave(spin_lock);
>          if (!head->next || list_empty(head))
>                  goto unlock;
>          head = head->next;
> unlock:
>          INIT_LIST_HEAD(orig_head);
>          __bpf_spin_unlock_irqrestore(spin_lock);
>
>          while (head != orig_head) {
>                  void *obj = head;
>
>                  obj -= field->graph_root.node_offset;
>                  head = head->next;
>                  /* The contained type can also have resources, including a
>                   * bpf_list_head which needs to be freed.
>                   */
>                  migrate_disable();
>                  __bpf_obj_drop_impl(obj, field->graph_root.value_rec);
>                  migrate_enable();
>          }
>
> So it is possible that cpu 0 may add an element to the list_head
> which will never be freed.
>
> This applies to both list_head and rb_root. I am aware that
> this may be an existing issue for some maps, e.g. the hash map,
> so it may not be a big problem. Just wanted to mention it though.

Argh. That is indeed a problem, and it's not limited to rbtree.
cpu0 doing kptr_xchg into a map value that was just deleted by cpu1
will cause a leak.
It affects both sleepable and non-sleepable progs.
For sleepable progs I see no other option but to enforce that 'v' is
used inside an RCU CS.
rcu_unlock currently converts mem_rcu pointers into untrusted.
I think to fix the above it should convert all ptr_to_map_value to
untrusted as well.
In addition we can extend the bpf_mem_alloc API with a dtor:
register it at the time of bpf_mem_alloc_init() and use
bpf_mem_cache_free_rcu() in htab_elem_free() when kptr or
other special fields are present in the value.
For preallocated htabs we'd need to reinstate call_rcu().
This would be a heavy fix. Better ideas?
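
For concreteness, the htab_elem_free() part of that direction might look roughly
like the sketch below. This is hypothetical, not actual kernel code: it ignores
the PERCPU_HASH case and the dtor registration (the full idea would also move the
field cleanup done by check_and_free_fields() behind the grace period), and it
simply defers the element free to RCU whenever the value carries special
btf_record fields:

	static void htab_elem_free(struct bpf_htab *htab, struct htab_elem *l)
	{
		check_and_free_fields(htab, l);
		if (!IS_ERR_OR_NULL(htab->map.record))
			/* value has kptr/list_head/rb_root/etc. fields: defer the
			 * free past an RCU grace period so readers still inside an
			 * RCU CS cannot observe the element being reused
			 */
			bpf_mem_cache_free_rcu(&htab->ma, l);
		else
			bpf_mem_cache_free(&htab->ma, l);
	}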

>
> >
> > But bpf_rb_root_free is indeed an issue for percpu, since each per-cpu
> > copy has its own rb_root field with its own bpf_spin_lock, but
> > for_each_cpu() {bpf_obj_free_fields();} violates the access contract.
>
> Could you explain what 'access contract' means here? For the non-percpu
> case, 'x' special fields may be checked. For the percpu case, it is
> just 'x * nr_cpus' special fields to be checked.

Right. That's what I mean by refcounted-like.

>
> >
> > percpu and rb_root create such a maze of dependencies that
> > I think it's better to disallow rb_root-s and kptr-s inside percpu
> > for now.
>
> I can certainly disallow rb_root and list_head. Just want to
> understand what kind of issues we face specifically for percpu kptr.

It's just a matter of priorities. Fixing the above is more urgent than
adding percpu kptrs with their own corner cases to analyze.
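
The kptr_xchg leak window described at the top of the reply above looks roughly
like this on the BPF program side. This is a minimal sketch: the map, struct and
program names are made up for illustration, and bpf_obj_new()/bpf_obj_drop() are
assumed to come from the selftests' bpf_experimental.h.

	#include "vmlinux.h"
	#include <bpf/bpf_helpers.h>
	#include "bpf_experimental.h"

	struct node_data {
		long key;
		long data;
	};

	struct map_val {
		struct node_data __kptr *ptr;
	};

	struct {
		__uint(type, BPF_MAP_TYPE_HASH);
		__uint(max_entries, 16);
		__type(key, int);
		__type(value, struct map_val);
	} leak_map SEC(".maps");

	SEC("tc")
	int cpu0_prog(void *ctx)
	{
		struct node_data *n, *old;
		struct map_val *v;
		int key = 1;

		v = bpf_map_lookup_elem(&leak_map, &key);
		if (!v)
			return 0;

		n = bpf_obj_new(typeof(*n));
		if (!n)
			return 0;

		/* If another cpu runs bpf_map_delete_elem() for this key right
		 * here, bpf_obj_free_fields() has already swept v->ptr, so the
		 * object stored below is never freed by element destruction.
		 */
		old = bpf_kptr_xchg(&v->ptr, n);
		if (old)
			bpf_obj_drop(old);
		return 0;
	}

	char _license[] SEC("license") = "GPL";
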
Yonghong Song Aug. 24, 2023, 10:25 p.m. UTC | #9
On 8/24/23 3:03 PM, Alexei Starovoitov wrote:
> It's just a matter of priorities. Fixing the above is more urgent than
> adding percpu kptrs with their own corner cases to analyze.

Okay, the percpu kptr patch set will still take some time,
so I agree that the existing refcount fix should have higher priority.

Patch

diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
index eb91cae0612a..945a85e25ac5 100644
--- a/kernel/bpf/helpers.c
+++ b/kernel/bpf/helpers.c
@@ -1913,7 +1913,11 @@  void __bpf_obj_drop_impl(void *p, const struct btf_record *rec)
 
 	if (rec)
 		bpf_obj_free_fields(rec, p);
-	bpf_mem_free(&bpf_global_ma, p);
+
+	if (rec && rec->refcount_off >= 0)
+		bpf_mem_free_rcu(&bpf_global_ma, p);
+	else
+		bpf_mem_free(&bpf_global_ma, p);
 }
 
 __bpf_kfunc void bpf_obj_drop_impl(void *p__alloc, void *meta__ign)
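
For context, here is a minimal BPF-program-side sketch of the refcounted-node
pattern this patch protects. It is modeled loosely on the refcounted_kptr
selftests: the kfunc wrappers (bpf_obj_new, bpf_refcount_acquire, bpf_rbtree_add,
bpf_obj_drop) and the __contains() tag are assumed to come from the selftests'
bpf_experimental.h, while private() and container_of() are defined locally for
the sketch.

	#include "vmlinux.h"
	#include <bpf/bpf_helpers.h>
	#include "bpf_experimental.h"

	#ifndef container_of
	#define container_of(ptr, type, member) \
		((type *)((void *)(ptr) - __builtin_offsetof(type, member)))
	#endif
	#define private(name) SEC(".data." #name) __hidden __attribute__((aligned(8)))

	struct node_data {
		long key;
		struct bpf_rb_node r;
		struct bpf_refcount ref;
	};

	private(A) struct bpf_spin_lock glock;
	private(A) struct bpf_rb_root groot __contains(node_data, r);

	static bool less(struct bpf_rb_node *a, const struct bpf_rb_node *b)
	{
		struct node_data *na = container_of(a, struct node_data, r);
		struct node_data *nb = container_of((struct bpf_rb_node *)b,
						    struct node_data, r);

		return na->key < nb->key;
	}

	SEC("tc")
	int add_and_drop(void *ctx)
	{
		struct node_data *n, *m;

		n = bpf_obj_new(typeof(*n));
		if (!n)
			return 0;

		/* Take a second reference before handing ownership to the tree. */
		m = bpf_refcount_acquire(n);
		if (!m) {
			/* bpf_refcount_acquire is fallible, so the verifier
			 * requires this check even though it cannot fail while
			 * an owning reference is still held.
			 */
			bpf_obj_drop(n);
			return 0;
		}

		bpf_spin_lock(&glock);
		bpf_rbtree_add(&groot, &n->r, less); /* tree now owns n's reference */
		bpf_spin_unlock(&glock);

		/* Drop the program's remaining reference. If it is the last one,
		 * the node is now freed with bpf_mem_free_rcu(), so non-owning
		 * references created earlier cannot race with memory reuse.
		 */
		bpf_obj_drop(m);
		return 0;
	}

	char _license[] SEC("license") = "GPL";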