
[bpf-next,01/13] bpf: Loosen alloc obj test in verifier's reg_btf_record

Message ID 20221206231000.3180914-2-davemarchevsky@fb.com (mailing list archive)
State Superseded
Commit d8939cb0a03ce7e4e69f65bbd31b79fe42f7d5e6
Delegated to: BPF
Series BPF rbtree next-gen datastructure

Checks

Context Check Description
netdev/tree_selection success Clearly marked for bpf-next, async
netdev/fixes_present success Fixes tag not required for -next series
netdev/subject_prefix success Link
netdev/cover_letter success Series has a cover letter
netdev/patch_count success Link
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 10 this patch: 10
netdev/cc_maintainers warning 8 maintainers not CCed: kpsingh@kernel.org haoluo@google.com song@kernel.org yhs@fb.com martin.lau@linux.dev sdf@google.com john.fastabend@gmail.com jolsa@kernel.org
netdev/build_clang success Errors and warnings before: 5 this patch: 5
netdev/module_param success Was 0 now: 0
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success Fixes tag looks correct
netdev/build_allmodconfig_warn success Errors and warnings before: 10 this patch: 10
netdev/checkpatch success total: 0 errors, 0 warnings, 0 checks, 19 lines checked
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0
bpf/vmtest-bpf-next-PR fail PR summary
bpf/vmtest-bpf-next-VM_Test-1 success Logs for ${{ matrix.test }} on ${{ matrix.arch }} with ${{ matrix.toolchain }}
bpf/vmtest-bpf-next-VM_Test-2 success Logs for ShellCheck
bpf/vmtest-bpf-next-VM_Test-3 fail Logs for build for aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-4 fail Logs for build for aarch64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-5 fail Logs for build for s390x with gcc
bpf/vmtest-bpf-next-VM_Test-6 fail Logs for build for x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-7 fail Logs for build for x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-8 success Logs for llvm-toolchain
bpf/vmtest-bpf-next-VM_Test-9 success Logs for set-matrix

Commit Message

Dave Marchevsky Dec. 6, 2022, 11:09 p.m. UTC
btf->struct_meta_tab is populated by btf_parse_struct_metas in btf.c.
There, a BTF record is created for any type containing a spin_lock or
any next-gen datastructure node/head.

Currently, for non-MAP_VALUE types, reg_btf_record will only search for
a record using struct_meta_tab if the reg->type exactly matches
(PTR_TO_BTF_ID | MEM_ALLOC). This exact match is too strict: an
"allocated obj" type - returned from bpf_obj_new - might pick up other
flags while working its way through the program.

Loosen the check to be exact for base_type and just use MEM_ALLOC mask
for type_flag.
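
For illustration (not from the patch itself): a reg->type that has picked up
PTR_UNTRUSTED on top of the allocated-object flags is rejected by the old
exact comparison but accepted by the loosened check:

  u32 t = PTR_TO_BTF_ID | MEM_ALLOC | PTR_UNTRUSTED;

  t == (PTR_TO_BTF_ID | MEM_ALLOC)                             // old check: false
  base_type(t) == PTR_TO_BTF_ID && (type_flag(t) & MEM_ALLOC)  // new check: true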

This patch is marked Fixes as the original intent of reg_btf_record was
unlikely to have been to fail finding btf_record for valid alloc obj
types with additional flags, some of which (e.g. PTR_UNTRUSTED)
are valid register type states for alloc obj independent of this series.
However, I didn't find a specific broken repro case outside of this
series' added functionality, so it's possible that nothing was
triggering this logic error before.

Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
cc: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Fixes: 4e814da0d599 ("bpf: Allow locking bpf_spin_lock in allocated objects")
---
 kernel/bpf/verifier.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

Comments

Kumar Kartikeya Dwivedi Dec. 7, 2022, 4:41 p.m. UTC | #1
On Wed, Dec 07, 2022 at 04:39:48AM IST, Dave Marchevsky wrote:
> btf->struct_meta_tab is populated by btf_parse_struct_metas in btf.c.
> There, a BTF record is created for any type containing a spin_lock or
> any next-gen datastructure node/head.
>
> Currently, for non-MAP_VALUE types, reg_btf_record will only search for
> a record using struct_meta_tab if the reg->type exactly matches
> (PTR_TO_BTF_ID | MEM_ALLOC). This exact match is too strict: an
> "allocated obj" type - returned from bpf_obj_new - might pick up other
> flags while working its way through the program.
>

Not following. Only PTR_TO_BTF_ID | MEM_ALLOC is the valid reg->type that can be
passed to helpers. reg_btf_record is used in helpers to inspect the btf_record.
Any other flag combination (the only one possible is PTR_UNTRUSTED right now)
cannot be passed to helpers in the first place. The reason to set PTR_UNTRUSTED
is to make them unpassable to helpers.

> Loosen the check to be exact for base_type and just use MEM_ALLOC mask
> for type_flag.
>
> This patch is marked Fixes as the original intent of reg_btf_record was
> unlikely to have been to fail finding btf_record for valid alloc obj
> types with additional flags, some of which (e.g. PTR_UNTRUSTED)
> are valid register type states for alloc obj independent of this series.

That was the actual intent, same as how check_ptr_to_btf_access uses the exact
reg->type to allow the BPF_WRITE case.

I think this series is the one introducing this case, passing bpf_rbtree_first's
result to bpf_rbtree_remove, which I think is not possible to make safe in the
first place. We decided to do bpf_list_pop_front instead of bpf_list_entry ->
bpf_list_del due to this exact issue. More in [0].

 [0]: https://lore.kernel.org/bpf/CAADnVQKifhUk_HE+8qQ=AOhAssH6w9LZ082Oo53rwaS+tAGtOw@mail.gmail.com

> However, I didn't find a specific broken repro case outside of this
> series' added functionality, so it's possible that nothing was
> triggering this logic error before.
>
> Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
> cc: Kumar Kartikeya Dwivedi <memxor@gmail.com>
> Fixes: 4e814da0d599 ("bpf: Allow locking bpf_spin_lock in allocated objects")
> ---
>  kernel/bpf/verifier.c | 7 ++++++-
>  1 file changed, 6 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index 1d51bd9596da..67a13110bc22 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -451,6 +451,11 @@ static bool reg_type_not_null(enum bpf_reg_type type)
>  		type == PTR_TO_SOCK_COMMON;
>  }
>
> +static bool type_is_ptr_alloc_obj(u32 type)
> +{
> +	return base_type(type) == PTR_TO_BTF_ID && type_flag(type) & MEM_ALLOC;
> +}
> +
>  static struct btf_record *reg_btf_record(const struct bpf_reg_state *reg)
>  {
>  	struct btf_record *rec = NULL;
> @@ -458,7 +463,7 @@ static struct btf_record *reg_btf_record(const struct bpf_reg_state *reg)
>
>  	if (reg->type == PTR_TO_MAP_VALUE) {
>  		rec = reg->map_ptr->record;
> -	} else if (reg->type == (PTR_TO_BTF_ID | MEM_ALLOC)) {
> +	} else if (type_is_ptr_alloc_obj(reg->type)) {
>  		meta = btf_find_struct_meta(reg->btf, reg->btf_id);
>  		if (meta)
>  			rec = meta->record;
> --
> 2.30.2
>
Dave Marchevsky Dec. 7, 2022, 6:34 p.m. UTC | #2
On 12/7/22 11:41 AM, Kumar Kartikeya Dwivedi wrote:
> On Wed, Dec 07, 2022 at 04:39:48AM IST, Dave Marchevsky wrote:
>> btf->struct_meta_tab is populated by btf_parse_struct_metas in btf.c.
>> There, a BTF record is created for any type containing a spin_lock or
>> any next-gen datastructure node/head.
>>
>> Currently, for non-MAP_VALUE types, reg_btf_record will only search for
>> a record using struct_meta_tab if the reg->type exactly matches
>> (PTR_TO_BTF_ID | MEM_ALLOC). This exact match is too strict: an
>> "allocated obj" type - returned from bpf_obj_new - might pick up other
>> flags while working its way through the program.
>>
> 
> Not following. Only PTR_TO_BTF_ID | MEM_ALLOC is the valid reg->type that can be
> passed to helpers. reg_btf_record is used in helpers to inspect the btf_record.
> Any other flag combination (the only one possible is PTR_UNTRUSTED right now)
> cannot be passed to helpers in the first place. The reason to set PTR_UNTRUSTED
> is to make then unpassable to helpers.
> 

I see what you mean. If reg_btf_record is only used on regs which are args,
then the exact match helps enforce PTR_UNTRUSTED not being an acceptable
type flag for an arg. Most uses of reg_btf_record seem to be on arg regs,
but then we have its use in reg_may_point_to_spin_lock, which is itself
used in mark_ptr_or_null_reg and on BPF_REG_0 in check_kfunc_call. So I'm not
sure that it's only used on arg regs currently.

Regardless, if the intended use is on arg regs only, it should be renamed to
arg_reg_btf_record or similar to make that clear, as current name sounds like
it should be applicable to any reg, and thus not enforce constraints particular
to arg regs.

But I think it's better to leave it general and enforce those constraints
elsewhere. For kfuncs this is already happening in check_kfunc_args, where the
big switch statements for KF_ARG_* are doing exact type matching.
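
(E.g. the KF_ARG_PTR_TO_ALLOC_BTF_ID case there does, roughly:

  case KF_ARG_PTR_TO_ALLOC_BTF_ID:
          if (reg->type != (PTR_TO_BTF_ID | MEM_ALLOC)) {
                  verbose(env, "arg#%d expected pointer to allocated object\n", i);
                  return -EINVAL;
          }
          ...
          break;

so the exact-match constraint on args is already enforced in that switch.)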

>> Loosen the check to be exact for base_type and just use MEM_ALLOC mask
>> for type_flag.
>>
>> This patch is marked Fixes as the original intent of reg_btf_record was
>> unlikely to have been to fail finding btf_record for valid alloc obj
>> types with additional flags, some of which (e.g. PTR_UNTRUSTED)
>> are valid register type states for alloc obj independent of this series.
> 
> That was the actual intent, same as how check_ptr_to_btf_access uses the exact
> reg->type to allow the BPF_WRITE case.
> 
> I think this series is the one introducing this case, passing bpf_rbtree_first's
> result to bpf_rbtree_remove, which I think is not possible to make safe in the
> first place. We decided to do bpf_list_pop_front instead of bpf_list_entry ->
> bpf_list_del due to this exact issue. More in [0].
> 
>  [0]: https://lore.kernel.org/bpf/CAADnVQKifhUk_HE+8qQ=AOhAssH6w9LZ082Oo53rwaS+tAGtOw@mail.gmail.com
> 

Thanks for the link, I better understand what Alexei meant in his comment on
patch 9 of this series. For the helpers added in this series, we can make
bpf_rbtree_first -> bpf_rbtree_remove safe by invalidating all release_on_unlock
refs after the rbtree_remove in same manner as they're invalidated after
spin_unlock currently.

Logic for why this is safe:

  * If we have two non-owning refs to nodes in a tree, e.g. from
    bpf_rbtree_add(node) and calling bpf_rbtree_first() immediately after,
    we have no way of knowing if they're aliases of same node.

  * If bpf_rbtree_remove takes arbitrary non-owning ref to node in the tree,
    it might be removing a node that's already been removed, e.g.:

        n = bpf_obj_new(...);
        bpf_spin_lock(&lock);

        bpf_rbtree_add(&tree, &n->node);
        // n is now non-owning ref to node which was added
        res = bpf_rbtree_first(&tree);
        if (!res) {}
        m = container_of(res, struct node_data, node);
        // m is now non-owning ref to the same node
        bpf_rbtree_remove(&tree, &n->node);
        bpf_rbtree_remove(&tree, &m->node); // BAD

        bpf_spin_unlock(&lock);

  * bpf_rbtree_remove is the only "pop()" currently. Non-owning refs are at risk
    of pointing to something that was already removed _only_ after a
    rbtree_remove, so if we invalidate them all after rbtree_remove they can't
    be inputs to subsequent remove()s

This does conflate current "release non-owning refs because it's not safe to
read from them" reasoning with new "release non-owning refs so they can't be
passed to remove()". Ideally we could add some new tag to these refs that
prevents them from being passed to remove()-type fns, but does allow them to
be read, e.g.:

  n = bpf_obj_new(...);
  bpf_spin_lock(&lock);

  bpf_rbtree_add(&tree, &n->node);
  // n is now non-owning ref to node which was added
  res = bpf_rbtree_first(&tree);
  if (!res) {}
  m = container_of(res, struct node_data, node);
  // m is now non-owning ref to the same node
  n = bpf_rbtree_remove(&tree, &n->node);
  // n is now owning ref again, m is non-owning ref to same node
  x = m->key; // this should be safe since we're still in CS
  bpf_rbtree_remove(&tree, &m->node); // But this should be prevented

  bpf_spin_unlock(&lock);

But this would introduce too much addt'l complexity for now IMO. The proposal
of just invalidating all non-owning refs prevents both the unsafe second
remove() and the safe x = m->key.

I will give it a shot; if it doesn't work, I can change rbtree_remove to
rbtree_remove_first w/o node param. But per that linked convo such logic
should be tackled eventually, so might as well chip away at it now.

>> However, I didn't find a specific broken repro case outside of this
>> series' added functionality, so it's possible that nothing was
>> triggering this logic error before.
>>
>> Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
>> cc: Kumar Kartikeya Dwivedi <memxor@gmail.com>
>> Fixes: 4e814da0d599 ("bpf: Allow locking bpf_spin_lock in allocated objects")
>> ---
>>  kernel/bpf/verifier.c | 7 ++++++-
>>  1 file changed, 6 insertions(+), 1 deletion(-)
>>
>> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
>> index 1d51bd9596da..67a13110bc22 100644
>> --- a/kernel/bpf/verifier.c
>> +++ b/kernel/bpf/verifier.c
>> @@ -451,6 +451,11 @@ static bool reg_type_not_null(enum bpf_reg_type type)
>>  		type == PTR_TO_SOCK_COMMON;
>>  }
>>
>> +static bool type_is_ptr_alloc_obj(u32 type)
>> +{
>> +	return base_type(type) == PTR_TO_BTF_ID && type_flag(type) & MEM_ALLOC;
>> +}
>> +
>>  static struct btf_record *reg_btf_record(const struct bpf_reg_state *reg)
>>  {
>>  	struct btf_record *rec = NULL;
>> @@ -458,7 +463,7 @@ static struct btf_record *reg_btf_record(const struct bpf_reg_state *reg)
>>
>>  	if (reg->type == PTR_TO_MAP_VALUE) {
>>  		rec = reg->map_ptr->record;
>> -	} else if (reg->type == (PTR_TO_BTF_ID | MEM_ALLOC)) {
>> +	} else if (type_is_ptr_alloc_obj(reg->type)) {
>>  		meta = btf_find_struct_meta(reg->btf, reg->btf_id);
>>  		if (meta)
>>  			rec = meta->record;
>> --
>> 2.30.2
>>
Alexei Starovoitov Dec. 7, 2022, 6:59 p.m. UTC | #3
On Wed, Dec 07, 2022 at 01:34:44PM -0500, Dave Marchevsky wrote:
> On 12/7/22 11:41 AM, Kumar Kartikeya Dwivedi wrote:
> > On Wed, Dec 07, 2022 at 04:39:48AM IST, Dave Marchevsky wrote:
> >> btf->struct_meta_tab is populated by btf_parse_struct_metas in btf.c.
> >> There, a BTF record is created for any type containing a spin_lock or
> >> any next-gen datastructure node/head.
> >>
> >> Currently, for non-MAP_VALUE types, reg_btf_record will only search for
> >> a record using struct_meta_tab if the reg->type exactly matches
> >> (PTR_TO_BTF_ID | MEM_ALLOC). This exact match is too strict: an
> >> "allocated obj" type - returned from bpf_obj_new - might pick up other
> >> flags while working its way through the program.
> >>
> > 
> > Not following. Only PTR_TO_BTF_ID | MEM_ALLOC is the valid reg->type that can be
> > passed to helpers. reg_btf_record is used in helpers to inspect the btf_record.
> > Any other flag combination (the only one possible is PTR_UNTRUSTED right now)
> > cannot be passed to helpers in the first place. The reason to set PTR_UNTRUSTED
> > is to make then unpassable to helpers.
> > 
> 
> I see what you mean. If reg_btf_record is only used on regs which are args,
> then the exact match helps enforce PTR_UNTRUSTED not being an acceptable
> type flag for an arg. Most uses of reg_btf_record seem to be on arg regs,
> but then we have its use in reg_may_point_to_spin_lock, which is itself
> used in mark_ptr_or_null_reg and on BPF_REG_0 in check_kfunc_call. So I'm not
> sure that it's only used on arg regs currently.
> 
> Regardless, if the intended use is on arg regs only, it should be renamed to
> arg_reg_btf_record or similar to make that clear, as current name sounds like
> it should be applicable to any reg, and thus not enforce constraints particular
> to arg regs.
> 
> But I think it's better to leave it general and enforce those constraints
> elsewhere. For kfuncs this is already happening in check_kfunc_args, where the
> big switch statements for KF_ARG_* are doing exact type matching.
> 
> >> Loosen the check to be exact for base_type and just use MEM_ALLOC mask
> >> for type_flag.
> >>
> >> This patch is marked Fixes as the original intent of reg_btf_record was
> >> unlikely to have been to fail finding btf_record for valid alloc obj
> >> types with additional flags, some of which (e.g. PTR_UNTRUSTED)
> >> are valid register type states for alloc obj independent of this series.
> > 
> > That was the actual intent, same as how check_ptr_to_btf_access uses the exact
> > reg->type to allow the BPF_WRITE case.
> > 
> > I think this series is the one introducing this case, passing bpf_rbtree_first's
> > result to bpf_rbtree_remove, which I think is not possible to make safe in the
> > first place. We decided to do bpf_list_pop_front instead of bpf_list_entry ->
> > bpf_list_del due to this exact issue. More in [0].
> > 
> >  [0]: https://lore.kernel.org/bpf/CAADnVQKifhUk_HE+8qQ=AOhAssH6w9LZ082Oo53rwaS+tAGtOw@mail.gmail.com
> > 
> 
> Thanks for the link, I better understand what Alexei meant in his comment on
> patch 9 of this series. For the helpers added in this series, we can make
> bpf_rbtree_first -> bpf_rbtree_remove safe by invalidating all release_on_unlock
> refs after the rbtree_remove in same manner as they're invalidated after
> spin_unlock currently.
> 
> Logic for why this is safe:
> 
>   * If we have two non-owning refs to nodes in a tree, e.g. from
>     bpf_rbtree_add(node) and calling bpf_rbtree_first() immediately after,
>     we have no way of knowing if they're aliases of same node.
> 
>   * If bpf_rbtree_remove takes arbitrary non-owning ref to node in the tree,
>     it might be removing a node that's already been removed, e.g.:
> 
>         n = bpf_obj_new(...);
>         bpf_spin_lock(&lock);
> 
>         bpf_rbtree_add(&tree, &n->node);
>         // n is now non-owning ref to node which was added
>         res = bpf_rbtree_first();
>         if (!m) {}
>         m = container_of(res, struct node_data, node);
>         // m is now non-owning ref to the same node
>         bpf_rbtree_remove(&tree, &n->node);
>         bpf_rbtree_remove(&tree, &m->node); // BAD

Let me clarify my previous email:

Above doesn't have to be 'BAD'.
Instead of
if (WARN_ON_ONCE(RB_EMPTY_NODE(n)))

we can drop WARN and simply return.
If node is not part of the tree -> nop.

Same for bpf_rbtree_add.
If it's already added -> nop.

Then we can have bpf_rbtree_first() returning PTR_TRUSTED with acquire semantics.
We do all these checks under the same rbtree root lock, so it's safe.
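
A rough sketch of that run-time behavior (simplified to plain rb_root/rb_node
internally; illustrative only, not the actual helper from this series):

  void *bpf_rbtree_remove(struct bpf_rb_root *root, struct bpf_rb_node *node)
  {
          struct rb_node *n = (struct rb_node *)node;

          if (RB_EMPTY_NODE(n))   /* not linked -> nop, no WARN */
                  return NULL;
          rb_erase(n, (struct rb_root *)root);
          RB_CLEAR_NODE(n);       /* a later remove of this node is also a nop */
          return n;
  }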

>         bpf_spin_unlock(&lock);
> 
>   * bpf_rbtree_remove is the only "pop()" currently. Non-owning refs are at risk
>     of pointing to something that was already removed _only_ after a
>     rbtree_remove, so if we invalidate them all after rbtree_remove they can't
>     be inputs to subsequent remove()s

With above proposed run-time checks both bpf_rbtree_remove and bpf_rbtree_add
can have release semantics.
No need for special release_on_unlock hacks.

> This does conflate current "release non-owning refs because it's not safe to
> read from them" reasoning with new "release non-owning refs so they can't be
> passed to remove()". Ideally we could add some new tag to these refs that
> prevents them from being passed to remove()-type fns, but does allow them to
> be read, e.g.:
> 
>   n = bpf_obj_new(...);

'n' is acquired.

>   bpf_spin_lock(&lock);
> 
>   bpf_rbtree_add(&tree, &n->node);
>   // n is now non-owning ref to node which was added

since bpf_rbtree_add does release on 'n'...

>   res = bpf_rbtree_first();
>   if (!m) {}
>   m = container_of(res, struct node_data, node);
>   // m is now non-owning ref to the same node

... below is not allowed by the verifier.
>   n = bpf_rbtree_remove(&tree, &n->node);

I'm not sure what the idea is behind returning 'n' from remove...
Maybe it should be a simple bool?

>   // n is now owning ref again, m is non-owning ref to same node
>   x = m->key; // this should be safe since we're still in CS

below works because 'm' came from bpf_rbtree_first that acquired 'res'.

>   bpf_rbtree_remove(&tree, &m->node); // But this should be prevented
> 
>   bpf_spin_unlock(&lock);
>
Kumar Kartikeya Dwivedi Dec. 7, 2022, 7:03 p.m. UTC | #4
On Thu, Dec 08, 2022 at 12:04:44AM IST, Dave Marchevsky wrote:
> On 12/7/22 11:41 AM, Kumar Kartikeya Dwivedi wrote:
> > On Wed, Dec 07, 2022 at 04:39:48AM IST, Dave Marchevsky wrote:
> >> btf->struct_meta_tab is populated by btf_parse_struct_metas in btf.c.
> >> There, a BTF record is created for any type containing a spin_lock or
> >> any next-gen datastructure node/head.
> >>
> >> Currently, for non-MAP_VALUE types, reg_btf_record will only search for
> >> a record using struct_meta_tab if the reg->type exactly matches
> >> (PTR_TO_BTF_ID | MEM_ALLOC). This exact match is too strict: an
> >> "allocated obj" type - returned from bpf_obj_new - might pick up other
> >> flags while working its way through the program.
> >>
> >
> > Not following. Only PTR_TO_BTF_ID | MEM_ALLOC is the valid reg->type that can be
> > passed to helpers. reg_btf_record is used in helpers to inspect the btf_record.
> > Any other flag combination (the only one possible is PTR_UNTRUSTED right now)
> > cannot be passed to helpers in the first place. The reason to set PTR_UNTRUSTED
> > is to make then unpassable to helpers.
> >
>
> I see what you mean. If reg_btf_record is only used on regs which are args,
> then the exact match helps enforce PTR_UNTRUSTED not being an acceptable
> type flag for an arg. Most uses of reg_btf_record seem to be on arg regs,
> but then we have its use in reg_may_point_to_spin_lock, which is itself
> used in mark_ptr_or_null_reg and on BPF_REG_0 in check_kfunc_call. So I'm not
> sure that it's only used on arg regs currently.
>
> Regardless, if the intended use is on arg regs only, it should be renamed to
> arg_reg_btf_record or similar to make that clear, as current name sounds like
> it should be applicable to any reg, and thus not enforce constraints particular
> to arg regs.
>
> But I think it's better to leave it general and enforce those constraints
> elsewhere. For kfuncs this is already happening in check_kfunc_args, where the
> big switch statements for KF_ARG_* are doing exact type matching.
>
> >> Loosen the check to be exact for base_type and just use MEM_ALLOC mask
> >> for type_flag.
> >>
> >> This patch is marked Fixes as the original intent of reg_btf_record was
> >> unlikely to have been to fail finding btf_record for valid alloc obj
> >> types with additional flags, some of which (e.g. PTR_UNTRUSTED)
> >> are valid register type states for alloc obj independent of this series.
> >
> > That was the actual intent, same as how check_ptr_to_btf_access uses the exact
> > reg->type to allow the BPF_WRITE case.
> >
> > I think this series is the one introducing this case, passing bpf_rbtree_first's
> > result to bpf_rbtree_remove, which I think is not possible to make safe in the
> > first place. We decided to do bpf_list_pop_front instead of bpf_list_entry ->
> > bpf_list_del due to this exact issue. More in [0].
> >
> >  [0]: https://lore.kernel.org/bpf/CAADnVQKifhUk_HE+8qQ=AOhAssH6w9LZ082Oo53rwaS+tAGtOw@mail.gmail.com
> >
>
> Thanks for the link, I better understand what Alexei meant in his comment on
> patch 9 of this series. For the helpers added in this series, we can make
> bpf_rbtree_first -> bpf_rbtree_remove safe by invalidating all release_on_unlock
> refs after the rbtree_remove in same manner as they're invalidated after
> spin_unlock currently.
>

Rather than doing that, you'll cut down on a lot of complexity and confusion
regarding PTR_UNTRUSTED's use in this set by removing bpf_rbtree_first and
bpf_rbtree_remove, and simply exposing bpf_rbtree_pop_front.
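
For comparison, the program side with a pop-style kfunc would look roughly
like this (bpf_rbtree_pop_front's name and signature are assumed here, it is
not part of this series):

  bpf_spin_lock(&lock);
  res = bpf_rbtree_pop_front(&tree);  // KF_ACQUIRE | KF_RET_NULL
  bpf_spin_unlock(&lock);
  if (!res)
    return 0;
  n = container_of(res, struct node_data, node);
  // n is an owning ref; use it, then release it
  bpf_obj_drop(n);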

> Logic for why this is safe:
>
>   * If we have two non-owning refs to nodes in a tree, e.g. from
>     bpf_rbtree_add(node) and calling bpf_rbtree_first() immediately after,
>     we have no way of knowing if they're aliases of same node.
>
>   * If bpf_rbtree_remove takes arbitrary non-owning ref to node in the tree,
>     it might be removing a node that's already been removed, e.g.:
>
>         n = bpf_obj_new(...);
>         bpf_spin_lock(&lock);
>
>         bpf_rbtree_add(&tree, &n->node);
>         // n is now non-owning ref to node which was added
>         res = bpf_rbtree_first();
>         if (!m) {}
>         m = container_of(res, struct node_data, node);
>         // m is now non-owning ref to the same node
>         bpf_rbtree_remove(&tree, &n->node);
>         bpf_rbtree_remove(&tree, &m->node); // BAD
>
>         bpf_spin_unlock(&lock);
>
>   * bpf_rbtree_remove is the only "pop()" currently. Non-owning refs are at risk
>     of pointing to something that was already removed _only_ after a
>     rbtree_remove, so if we invalidate them all after rbtree_remove they can't
>     be inputs to subsequent remove()s
>
> This does conflate current "release non-owning refs because it's not safe to
> read from them" reasoning with new "release non-owning refs so they can't be
> passed to remove()". Ideally we could add some new tag to these refs that
> prevents them from being passed to remove()-type fns, but does allow them to
> be read, e.g.:
>
>   n = bpf_obj_new(...);
>   bpf_spin_lock(&lock);
>
>   bpf_rbtree_add(&tree, &n->node);
>   // n is now non-owning ref to node which was added
>   res = bpf_rbtree_first();
>   if (!m) {}
>   m = container_of(res, struct node_data, node);
>   // m is now non-owning ref to the same node
>   n = bpf_rbtree_remove(&tree, &n->node);
>   // n is now owning ref again, m is non-owning ref to same node
>   x = m->key; // this should be safe since we're still in CS
>   bpf_rbtree_remove(&tree, &m->node); // But this should be prevented
>
>   bpf_spin_unlock(&lock);
>
> But this would introduce too much addt'l complexity for now IMO. The proposal
> of just invalidating all non-owning refs prevents both the unsafe second
> remove() and the safe x = m->key.
>
> I will give it a shot, if it doesn't work can change rbtree_remove to
> rbtree_remove_first w/o node param. But per that linked convo such logic
> should be tackled eventually, might as well chip away at it now.
>

I sympathise with your goal to make it as close to kernel programming style as
possible. I was exploring the same option (as you saw in that link). But based
on multiple discussions so far and trying different approaches, I'm convinced
the additional complexity in the verifier is not worth it.

Both bpf_list_del and bpf_rbtree_remove are useful and should be added, but
should work on e.g. 'current' node in iteration callback. In that context
verifier knows that the node is part of the list/rbtree. Introducing more than
one node in that same context introduces potential aliasing which hinders the
verifier's ability to reason about safety. Then, it has to be pessimistic like
your case and invalidate everything to prevent invalid use, so that double
list_del and double rbtree_remove are not possible.

You will avoid all the problems with PTR_UNTRUSTED being passed to helpers if
you adopt such an approach. The code will become much simpler, while allowing
people to do the same thing without any loss of usability.
Dave Marchevsky Dec. 7, 2022, 8:38 p.m. UTC | #5
On 12/7/22 1:59 PM, Alexei Starovoitov wrote:
> On Wed, Dec 07, 2022 at 01:34:44PM -0500, Dave Marchevsky wrote:
>> On 12/7/22 11:41 AM, Kumar Kartikeya Dwivedi wrote:
>>> On Wed, Dec 07, 2022 at 04:39:48AM IST, Dave Marchevsky wrote:
>>>> btf->struct_meta_tab is populated by btf_parse_struct_metas in btf.c.
>>>> There, a BTF record is created for any type containing a spin_lock or
>>>> any next-gen datastructure node/head.
>>>>
>>>> Currently, for non-MAP_VALUE types, reg_btf_record will only search for
>>>> a record using struct_meta_tab if the reg->type exactly matches
>>>> (PTR_TO_BTF_ID | MEM_ALLOC). This exact match is too strict: an
>>>> "allocated obj" type - returned from bpf_obj_new - might pick up other
>>>> flags while working its way through the program.
>>>>
>>>
>>> Not following. Only PTR_TO_BTF_ID | MEM_ALLOC is the valid reg->type that can be
>>> passed to helpers. reg_btf_record is used in helpers to inspect the btf_record.
>>> Any other flag combination (the only one possible is PTR_UNTRUSTED right now)
>>> cannot be passed to helpers in the first place. The reason to set PTR_UNTRUSTED
>>> is to make then unpassable to helpers.
>>>
>>
>> I see what you mean. If reg_btf_record is only used on regs which are args,
>> then the exact match helps enforce PTR_UNTRUSTED not being an acceptable
>> type flag for an arg. Most uses of reg_btf_record seem to be on arg regs,
>> but then we have its use in reg_may_point_to_spin_lock, which is itself
>> used in mark_ptr_or_null_reg and on BPF_REG_0 in check_kfunc_call. So I'm not
>> sure that it's only used on arg regs currently.
>>
>> Regardless, if the intended use is on arg regs only, it should be renamed to
>> arg_reg_btf_record or similar to make that clear, as current name sounds like
>> it should be applicable to any reg, and thus not enforce constraints particular
>> to arg regs.
>>
>> But I think it's better to leave it general and enforce those constraints
>> elsewhere. For kfuncs this is already happening in check_kfunc_args, where the
>> big switch statements for KF_ARG_* are doing exact type matching.
>>
>>>> Loosen the check to be exact for base_type and just use MEM_ALLOC mask
>>>> for type_flag.
>>>>
>>>> This patch is marked Fixes as the original intent of reg_btf_record was
>>>> unlikely to have been to fail finding btf_record for valid alloc obj
>>>> types with additional flags, some of which (e.g. PTR_UNTRUSTED)
>>>> are valid register type states for alloc obj independent of this series.
>>>
>>> That was the actual intent, same as how check_ptr_to_btf_access uses the exact
>>> reg->type to allow the BPF_WRITE case.
>>>
>>> I think this series is the one introducing this case, passing bpf_rbtree_first's
>>> result to bpf_rbtree_remove, which I think is not possible to make safe in the
>>> first place. We decided to do bpf_list_pop_front instead of bpf_list_entry ->
>>> bpf_list_del due to this exact issue. More in [0].
>>>
>>>  [0]: https://lore.kernel.org/bpf/CAADnVQKifhUk_HE+8qQ=AOhAssH6w9LZ082Oo53rwaS+tAGtOw@mail.gmail.com
>>>
>>
>> Thanks for the link, I better understand what Alexei meant in his comment on
>> patch 9 of this series. For the helpers added in this series, we can make
>> bpf_rbtree_first -> bpf_rbtree_remove safe by invalidating all release_on_unlock
>> refs after the rbtree_remove in same manner as they're invalidated after
>> spin_unlock currently.
>>
>> Logic for why this is safe:
>>
>>   * If we have two non-owning refs to nodes in a tree, e.g. from
>>     bpf_rbtree_add(node) and calling bpf_rbtree_first() immediately after,
>>     we have no way of knowing if they're aliases of same node.
>>
>>   * If bpf_rbtree_remove takes arbitrary non-owning ref to node in the tree,
>>     it might be removing a node that's already been removed, e.g.:
>>
>>         n = bpf_obj_new(...);
>>         bpf_spin_lock(&lock);
>>
>>         bpf_rbtree_add(&tree, &n->node);
>>         // n is now non-owning ref to node which was added
>>         res = bpf_rbtree_first();
>>         if (!m) {}
>>         m = container_of(res, struct node_data, node);
>>         // m is now non-owning ref to the same node
>>         bpf_rbtree_remove(&tree, &n->node);
>>         bpf_rbtree_remove(&tree, &m->node); // BAD
> 
> Let me clarify my previous email:
> 
> Above doesn't have to be 'BAD'.
> Instead of
> if (WARN_ON_ONCE(RB_EMPTY_NODE(n)))
> 
> we can drop WARN and simply return.
> If node is not part of the tree -> nop.
> 
> Same for bpf_rbtree_add.
> If it's already added -> nop.
> 

These runtime checks can certainly be done, but if we can guarantee via
verifier type system that a particular ptr-to-node is guaranteed to be in /
not be in a tree, that's better, no?

Feels like a similar train of thought to "fail verification when correct rbtree
lock isn't held" vs "just check if lock is held in every rbtree API kfunc".

> Then we can have bpf_rbtree_first() returning PTR_TRUSTED with acquire semantics.
> We do all these checks under the same rbtree root lock, so it's safe.
> 

I'll comment on PTR_TRUSTED in our discussion on patch 10.

>>         bpf_spin_unlock(&lock);
>>
>>   * bpf_rbtree_remove is the only "pop()" currently. Non-owning refs are at risk
>>     of pointing to something that was already removed _only_ after a
>>     rbtree_remove, so if we invalidate them all after rbtree_remove they can't
>>     be inputs to subsequent remove()s
> 
> With above proposed run-time checks both bpf_rbtree_remove and bpf_rbtree_add
> can have release semantics.
> No need for special release_on_unlock hacks.
> 

If we want to be able to interact w/ nodes after they've been added to the
rbtree, but before critical section ends, we need to support non-owning refs,
which are currently implemented using special release_on_unlock logic.

If we go with the runtime check suggestion from above, we'd need to implement
'conditional release' similarly to earlier "rbtree map" attempt:
https://lore.kernel.org/bpf/20220830172759.4069786-14-davemarchevsky@fb.com/ .

If rbtree_add has release semantics for its node arg, but the node is already
in some tree and runtime check fails, the reference should not be released as
rbtree_add() was a nop.

Similarly, if rbtree_remove has release semantics for its node arg and acquire
semantics for its return value, runtime check failing should result in the
node arg not being released. Acquire semantics for the retval are already
conditional - if retval == NULL, mark_ptr_or_null regs will release the
acquired ref before it can be used. So no issue with failing rbtree_remove
messing up acquire.

For this reason rbtree_remove and rbtree_first are tagged
KF_ACQUIRE | KF_RET_NULL. "special release_on_unlock hacks" can likely be
refactored into a similar flag, KF_RELEASE_NON_OWN or similar.
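
Sketched out, the conditional-acquire pattern a program follows with
KF_ACQUIRE | KF_RET_NULL looks roughly like this (variable names reused from
the earlier examples):

  res = bpf_rbtree_remove(&tree, &n->node);
  if (!res) {
    // runtime remove failed: the would-be acquired retval is released by the
    // NULL check, and under this scheme the node arg is not released either
    bpf_spin_unlock(&lock);
    return 0;
  }
  n = container_of(res, struct node_data, node);
  // n is an owning ref again
  bpf_spin_unlock(&lock);
  bpf_obj_drop(n);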

>> This does conflate current "release non-owning refs because it's not safe to
>> read from them" reasoning with new "release non-owning refs so they can't be
>> passed to remove()". Ideally we could add some new tag to these refs that
>> prevents them from being passed to remove()-type fns, but does allow them to
>> be read, e.g.:
>>
>>   n = bpf_obj_new(...);
> 
> 'n' is acquired.
> 
>>   bpf_spin_lock(&lock);
>>
>>   bpf_rbtree_add(&tree, &n->node);
>>   // n is now non-owning ref to node which was added
> 
> since bpf_rbtree_add does release on 'n'...
> 
>>   res = bpf_rbtree_first();
>>   if (!m) {}
>>   m = container_of(res, struct node_data, node);
>>   // m is now non-owning ref to the same node
> 
> ... below is not allowed by the verifier.
>>   n = bpf_rbtree_remove(&tree, &n->node);
> 
> I'm not sure what's an idea to return 'n' from remove...
> Maybe it should be simple bool ?
> 

I agree that returning node from rbtree_remove is not strictly necessary, since
rbtree_remove can be thought of as turning its non-owning ref argument into an
owning ref, instead of taking non-owning ref and returning owning ref. But such
an operation isn't really an 'acquire' by current verifier logic, since only
retvals can be 'acquired'. So we'd need to add some logic to enable acquire
semantics for args. Furthermore it's not really 'acquiring' a new ref, rather
changing properties of node arg ref.

However, if rbtree_remove can fail, such a "turn non-owning into owning"
operation will need to be able to fail as well, and the program will need to
be able to check for failure. Returning 'acquire' result in retval makes
this simple - just check for NULL. For your "return bool" proposal, we'd have
to add verifier logic which turns the 'acquired' owning ref back into non-owning
based on check of the bool, which will add some verifier complexity.

IIRC when doing experimentation with "rbtree map" implementation, I did
something like this and decided that the additional complexity wasn't worth
it when retval can just be used. 

>>   // n is now owning ref again, m is non-owning ref to same node
>>   x = m->key; // this should be safe since we're still in CS
> 
> below works because 'm' cames from bpf_rbtree_first that acquired 'res'.
> 
>>   bpf_rbtree_remove(&tree, &m->node); // But this should be prevented
>>
>>   bpf_spin_unlock(&lock);
>>
Alexei Starovoitov Dec. 7, 2022, 10:46 p.m. UTC | #6
On Wed, Dec 07, 2022 at 03:38:55PM -0500, Dave Marchevsky wrote:
> On 12/7/22 1:59 PM, Alexei Starovoitov wrote:
> > On Wed, Dec 07, 2022 at 01:34:44PM -0500, Dave Marchevsky wrote:
> >> On 12/7/22 11:41 AM, Kumar Kartikeya Dwivedi wrote:
> >>> On Wed, Dec 07, 2022 at 04:39:48AM IST, Dave Marchevsky wrote:
> >>>> btf->struct_meta_tab is populated by btf_parse_struct_metas in btf.c.
> >>>> There, a BTF record is created for any type containing a spin_lock or
> >>>> any next-gen datastructure node/head.
> >>>>
> >>>> Currently, for non-MAP_VALUE types, reg_btf_record will only search for
> >>>> a record using struct_meta_tab if the reg->type exactly matches
> >>>> (PTR_TO_BTF_ID | MEM_ALLOC). This exact match is too strict: an
> >>>> "allocated obj" type - returned from bpf_obj_new - might pick up other
> >>>> flags while working its way through the program.
> >>>>
> >>>
> >>> Not following. Only PTR_TO_BTF_ID | MEM_ALLOC is the valid reg->type that can be
> >>> passed to helpers. reg_btf_record is used in helpers to inspect the btf_record.
> >>> Any other flag combination (the only one possible is PTR_UNTRUSTED right now)
> >>> cannot be passed to helpers in the first place. The reason to set PTR_UNTRUSTED
> >>> is to make then unpassable to helpers.
> >>>
> >>
> >> I see what you mean. If reg_btf_record is only used on regs which are args,
> >> then the exact match helps enforce PTR_UNTRUSTED not being an acceptable
> >> type flag for an arg. Most uses of reg_btf_record seem to be on arg regs,
> >> but then we have its use in reg_may_point_to_spin_lock, which is itself
> >> used in mark_ptr_or_null_reg and on BPF_REG_0 in check_kfunc_call. So I'm not
> >> sure that it's only used on arg regs currently.
> >>
> >> Regardless, if the intended use is on arg regs only, it should be renamed to
> >> arg_reg_btf_record or similar to make that clear, as current name sounds like
> >> it should be applicable to any reg, and thus not enforce constraints particular
> >> to arg regs.
> >>
> >> But I think it's better to leave it general and enforce those constraints
> >> elsewhere. For kfuncs this is already happening in check_kfunc_args, where the
> >> big switch statements for KF_ARG_* are doing exact type matching.
> >>
> >>>> Loosen the check to be exact for base_type and just use MEM_ALLOC mask
> >>>> for type_flag.
> >>>>
> >>>> This patch is marked Fixes as the original intent of reg_btf_record was
> >>>> unlikely to have been to fail finding btf_record for valid alloc obj
> >>>> types with additional flags, some of which (e.g. PTR_UNTRUSTED)
> >>>> are valid register type states for alloc obj independent of this series.
> >>>
> >>> That was the actual intent, same as how check_ptr_to_btf_access uses the exact
> >>> reg->type to allow the BPF_WRITE case.
> >>>
> >>> I think this series is the one introducing this case, passing bpf_rbtree_first's
> >>> result to bpf_rbtree_remove, which I think is not possible to make safe in the
> >>> first place. We decided to do bpf_list_pop_front instead of bpf_list_entry ->
> >>> bpf_list_del due to this exact issue. More in [0].
> >>>
> >>>  [0]: https://lore.kernel.org/bpf/CAADnVQKifhUk_HE+8qQ=AOhAssH6w9LZ082Oo53rwaS+tAGtOw@mail.gmail.com
> >>>
> >>
> >> Thanks for the link, I better understand what Alexei meant in his comment on
> >> patch 9 of this series. For the helpers added in this series, we can make
> >> bpf_rbtree_first -> bpf_rbtree_remove safe by invalidating all release_on_unlock
> >> refs after the rbtree_remove in same manner as they're invalidated after
> >> spin_unlock currently.
> >>
> >> Logic for why this is safe:
> >>
> >>   * If we have two non-owning refs to nodes in a tree, e.g. from
> >>     bpf_rbtree_add(node) and calling bpf_rbtree_first() immediately after,
> >>     we have no way of knowing if they're aliases of same node.
> >>
> >>   * If bpf_rbtree_remove takes arbitrary non-owning ref to node in the tree,
> >>     it might be removing a node that's already been removed, e.g.:
> >>
> >>         n = bpf_obj_new(...);
> >>         bpf_spin_lock(&lock);
> >>
> >>         bpf_rbtree_add(&tree, &n->node);
> >>         // n is now non-owning ref to node which was added
> >>         res = bpf_rbtree_first();
> >>         if (!m) {}
> >>         m = container_of(res, struct node_data, node);
> >>         // m is now non-owning ref to the same node
> >>         bpf_rbtree_remove(&tree, &n->node);
> >>         bpf_rbtree_remove(&tree, &m->node); // BAD
> > 
> > Let me clarify my previous email:
> > 
> > Above doesn't have to be 'BAD'.
> > Instead of
> > if (WARN_ON_ONCE(RB_EMPTY_NODE(n)))
> > 
> > we can drop WARN and simply return.
> > If node is not part of the tree -> nop.
> > 
> > Same for bpf_rbtree_add.
> > If it's already added -> nop.
> > 
> 
> These runtime checks can certainly be done, but if we can guarantee via
> verifier type system that a particular ptr-to-node is guaranteed to be in /
> not be in a tree, that's better, no?
> 
> Feels like a similar train of thought to "fail verification when correct rbtree
> lock isn't held" vs "just check if lock is held in every rbtree API kfunc".
> 
> > Then we can have bpf_rbtree_first() returning PTR_TRUSTED with acquire semantics.
> > We do all these checks under the same rbtree root lock, so it's safe.
> > 
> 
> I'll comment on PTR_TRUSTED in our discussion on patch 10.
> 
> >>         bpf_spin_unlock(&lock);
> >>
> >>   * bpf_rbtree_remove is the only "pop()" currently. Non-owning refs are at risk
> >>     of pointing to something that was already removed _only_ after a
> >>     rbtree_remove, so if we invalidate them all after rbtree_remove they can't
> >>     be inputs to subsequent remove()s
> > 
> > With above proposed run-time checks both bpf_rbtree_remove and bpf_rbtree_add
> > can have release semantics.
> > No need for special release_on_unlock hacks.
> > 
> 
> If we want to be able to interact w/ nodes after they've been added to the
> rbtree, but before critical section ends, we need to support non-owning refs,
> which are currently implemented using special release_on_unlock logic.
> 
> If we go with the runtime check suggestion from above, we'd need to implement
> 'conditional release' similarly to earlier "rbtree map" attempt:
> https://lore.kernel.org/bpf/20220830172759.4069786-14-davemarchevsky@fb.com/ .
> 
> If rbtree_add has release semantics for its node arg, but the node is already
> in some tree and runtime check fails, the reference should not be released as
> rbtree_add() was a nop.

Got it.
The conditional release is tricky. We should probably avoid it for now.

I think we can either go with Kumar's proposal and do
bpf_rbtree_pop_front() instead of bpf_rbtree_first(),
which avoids all these issues...

but considering that we'll have inline iterators soon and should be able to do:

struct bpf_rbtree_iter it;
struct bpf_rb_node * node;

bpf_rbtree_iter_init(&it, rb_root); // locks the rbtree
while ((node = bpf_rbtree_iter_next(&it))) {
  if (node->field == condition) {
    struct bpf_rb_node *n;

    n = bpf_rbtree_remove(rb_root, node);
    bpf_spin_lock(another_rb_root);
    bpf_rbtree_add(another_rb_root, n);
    bpf_spin_unlock(another_rb_root);
    break;
  }
}
bpf_rbtree_iter_destroy(&it);

We can treat the 'node' returned from bpf_rbtree_iter_next() the same way
as return from bpf_rbtree_first() ->  PTR_TRUSTED | MAYBE_NULL,
but not acquired (ref_obj_id == 0).

bpf_rbtree_add -> KF_RELEASE
so we cannot pass not acquired pointers into it.

We should probably remove release_on_unlock logic as Kumar suggesting and
make bpf_list_push_front/back to be KF_RELEASE.

Then
bpf_list_pop_front/back stay KF_ACQUIRE | KF_RET_NULL
and
bpf_rbtree_remove is also KF_ACQUIRE | KF_RET_NULL.

The difference is bpf_list_pop has only 'head'
while bpf_rbtree_remove has 'root' and 'node' where 'node' has to be PTR_TRUSTED
(but not acquired).

bpf_rbtree_add will always succeed.
bpf_rbtree_remove will conditionally fail if 'node' is not linked.

Similarly we can extend link list with
n = bpf_list_remove(node)
which will have KF_ACQUIRE | KF_RET_NULL semantics.

Then everything is nicely uniform.
We'll be able to iterate rbtree and iterate link lists.
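
At kfunc registration time this scheme might be expressed roughly as follows
(sketch only; which functions get which flags is exactly what's being
discussed):

  BTF_ID_FLAGS(func, bpf_list_push_front, KF_RELEASE)
  BTF_ID_FLAGS(func, bpf_list_push_back, KF_RELEASE)
  BTF_ID_FLAGS(func, bpf_list_pop_front, KF_ACQUIRE | KF_RET_NULL)
  BTF_ID_FLAGS(func, bpf_list_pop_back, KF_ACQUIRE | KF_RET_NULL)
  BTF_ID_FLAGS(func, bpf_rbtree_add, KF_RELEASE)
  BTF_ID_FLAGS(func, bpf_rbtree_remove, KF_ACQUIRE | KF_RET_NULL)
  BTF_ID_FLAGS(func, bpf_list_remove, KF_ACQUIRE | KF_RET_NULL)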

There are downsides, of course.
Like the following from your test case:
+       bpf_spin_lock(&glock);
+       bpf_rbtree_add(&groot, &n->node, less);
+       bpf_rbtree_add(&groot, &m->node, less);
+       res = bpf_rbtree_remove(&groot, &n->node);
+       bpf_spin_unlock(&glock);
will not work.
Since bpf_rbtree_add() releases 'n' and it becomes UNTRUSTED.
(assuming release_on_unlock is removed).

I think it's fine for now. I have to agree with Kumar that it's hard to come up
with realistic use case where 'n' should be accessed after it was added to link
list or rbtree. Above test case doesn't look real.

This part of your test case:
+       bpf_spin_lock(&glock);
+       bpf_rbtree_add(&groot, &n->node, less);
+       bpf_rbtree_add(&groot, &m->node, less);
+       bpf_rbtree_add(&groot, &o->node, less);
+
+       res = bpf_rbtree_first(&groot);
+       if (!res) {
+               bpf_spin_unlock(&glock);
+               return 2;
+       }
+
+       o = container_of(res, struct node_data, node);
+       res = bpf_rbtree_remove(&groot, &o->node);
+       bpf_spin_unlock(&glock);

will work, because bpf_rbtree_first returns PTR_TRUSTED | MAYBE_NULL.

> Similarly, if rbtree_remove has release semantics for its node arg and acquire
> semantics for its return value, runtime check failing should result in the
> node arg not being released. Acquire semantics for the retval are already
> conditional - if retval == NULL, mark_ptr_or_null regs will release the
> acquired ref before it can be used. So no issue with failing rbtree_remove
> messing up acquire.
> 
> For this reason rbtree_remove and rbtree_first are tagged
> KF_ACQUIRE | KF_RET_NULL. "special release_on_unlock hacks" can likely be
> refactored into a similar flag, KF_RELEASE_NON_OWN or similar.

I guess what I'm proposing above is sort of the KF_RELEASE_NON_OWN idea,
but from a different angle.
I'd like to avoid introducing new flags.
I think PTR_TRUSTED is enough.

> > I'm not sure what's an idea to return 'n' from remove...
> > Maybe it should be simple bool ?
> > 
> 
> I agree that returning node from rbtree_remove is not strictly necessary, since
> rbtree_remove can be thought of turning its non-owning ref argument into an
> owning ref, instead of taking non-owning ref and returning owning ref. But such
> an operation isn't really an 'acquire' by current verifier logic, since only
> retvals can be 'acquired'. So we'd need to add some logic to enable acquire
> semantics for args. Furthermore it's not really 'acquiring' a new ref, rather
> changing properties of node arg ref.
> 
> However, if rbtree_remove can fail, such a "turn non-owning into owning"
> operation will need to be able to fail as well, and the program will need to
> be able to check for failure. Returning 'acquire' result in retval makes
> this simple - just check for NULL. For your "return bool" proposal, we'd have
> to add verifier logic which turns the 'acquired' owning ref back into non-owning
> based on check of the bool, which will add some verifier complexity.
> 
> IIRC when doing experimentation with "rbtree map" implementation, I did
> something like this and decided that the additional complexity wasn't worth
> it when retval can just be used. 

Agree. Forget 'bool' idea.
Dave Marchevsky Dec. 7, 2022, 11:42 p.m. UTC | #7
On 12/7/22 5:46 PM, Alexei Starovoitov wrote:
> On Wed, Dec 07, 2022 at 03:38:55PM -0500, Dave Marchevsky wrote:
>> On 12/7/22 1:59 PM, Alexei Starovoitov wrote:
>>> On Wed, Dec 07, 2022 at 01:34:44PM -0500, Dave Marchevsky wrote:
>>>> On 12/7/22 11:41 AM, Kumar Kartikeya Dwivedi wrote:
>>>>> On Wed, Dec 07, 2022 at 04:39:48AM IST, Dave Marchevsky wrote:
>>>>>> btf->struct_meta_tab is populated by btf_parse_struct_metas in btf.c.
>>>>>> There, a BTF record is created for any type containing a spin_lock or
>>>>>> any next-gen datastructure node/head.
>>>>>>
>>>>>> Currently, for non-MAP_VALUE types, reg_btf_record will only search for
>>>>>> a record using struct_meta_tab if the reg->type exactly matches
>>>>>> (PTR_TO_BTF_ID | MEM_ALLOC). This exact match is too strict: an
>>>>>> "allocated obj" type - returned from bpf_obj_new - might pick up other
>>>>>> flags while working its way through the program.
>>>>>>
>>>>>
>>>>> Not following. Only PTR_TO_BTF_ID | MEM_ALLOC is the valid reg->type that can be
>>>>> passed to helpers. reg_btf_record is used in helpers to inspect the btf_record.
>>>>> Any other flag combination (the only one possible is PTR_UNTRUSTED right now)
>>>>> cannot be passed to helpers in the first place. The reason to set PTR_UNTRUSTED
>>>>> is to make then unpassable to helpers.
>>>>>
>>>>
>>>> I see what you mean. If reg_btf_record is only used on regs which are args,
>>>> then the exact match helps enforce PTR_UNTRUSTED not being an acceptable
>>>> type flag for an arg. Most uses of reg_btf_record seem to be on arg regs,
>>>> but then we have its use in reg_may_point_to_spin_lock, which is itself
>>>> used in mark_ptr_or_null_reg and on BPF_REG_0 in check_kfunc_call. So I'm not
>>>> sure that it's only used on arg regs currently.
>>>>
>>>> Regardless, if the intended use is on arg regs only, it should be renamed to
>>>> arg_reg_btf_record or similar to make that clear, as current name sounds like
>>>> it should be applicable to any reg, and thus not enforce constraints particular
>>>> to arg regs.
>>>>
>>>> But I think it's better to leave it general and enforce those constraints
>>>> elsewhere. For kfuncs this is already happening in check_kfunc_args, where the
>>>> big switch statements for KF_ARG_* are doing exact type matching.
>>>>
>>>>>> Loosen the check to be exact for base_type and just use MEM_ALLOC mask
>>>>>> for type_flag.
>>>>>>
>>>>>> This patch is marked Fixes as the original intent of reg_btf_record was
>>>>>> unlikely to have been to fail finding btf_record for valid alloc obj
>>>>>> types with additional flags, some of which (e.g. PTR_UNTRUSTED)
>>>>>> are valid register type states for alloc obj independent of this series.
>>>>>
>>>>> That was the actual intent, same as how check_ptr_to_btf_access uses the exact
>>>>> reg->type to allow the BPF_WRITE case.
>>>>>
>>>>> I think this series is the one introducing this case, passing bpf_rbtree_first's
>>>>> result to bpf_rbtree_remove, which I think is not possible to make safe in the
>>>>> first place. We decided to do bpf_list_pop_front instead of bpf_list_entry ->
>>>>> bpf_list_del due to this exact issue. More in [0].
>>>>>
>>>>>  [0]: https://lore.kernel.org/bpf/CAADnVQKifhUk_HE+8qQ=AOhAssH6w9LZ082Oo53rwaS+tAGtOw@mail.gmail.com
>>>>>
>>>>
>>>> Thanks for the link, I better understand what Alexei meant in his comment on
>>>> patch 9 of this series. For the helpers added in this series, we can make
>>>> bpf_rbtree_first -> bpf_rbtree_remove safe by invalidating all release_on_unlock
>>>> refs after the rbtree_remove in same manner as they're invalidated after
>>>> spin_unlock currently.
>>>>
>>>> Logic for why this is safe:
>>>>
>>>>   * If we have two non-owning refs to nodes in a tree, e.g. from
>>>>     bpf_rbtree_add(node) and calling bpf_rbtree_first() immediately after,
>>>>     we have no way of knowing if they're aliases of same node.
>>>>
>>>>   * If bpf_rbtree_remove takes arbitrary non-owning ref to node in the tree,
>>>>     it might be removing a node that's already been removed, e.g.:
>>>>
>>>>         n = bpf_obj_new(...);
>>>>         bpf_spin_lock(&lock);
>>>>
>>>>         bpf_rbtree_add(&tree, &n->node);
>>>>         // n is now non-owning ref to node which was added
>>>>         res = bpf_rbtree_first();
>>>>         if (!m) {}
>>>>         m = container_of(res, struct node_data, node);
>>>>         // m is now non-owning ref to the same node
>>>>         bpf_rbtree_remove(&tree, &n->node);
>>>>         bpf_rbtree_remove(&tree, &m->node); // BAD
>>>
>>> Let me clarify my previous email:
>>>
>>> Above doesn't have to be 'BAD'.
>>> Instead of
>>> if (WARN_ON_ONCE(RB_EMPTY_NODE(n)))
>>>
>>> we can drop WARN and simply return.
>>> If node is not part of the tree -> nop.
>>>
>>> Same for bpf_rbtree_add.
>>> If it's already added -> nop.
>>>
>>
>> These runtime checks can certainly be done, but if we can guarantee via
>> verifier type system that a particular ptr-to-node is guaranteed to be in /
>> not be in a tree, that's better, no?
>>
>> Feels like a similar train of thought to "fail verification when correct rbtree
>> lock isn't held" vs "just check if lock is held in every rbtree API kfunc".
>>
>>> Then we can have bpf_rbtree_first() returning PTR_TRUSTED with acquire semantics.
>>> We do all these checks under the same rbtree root lock, so it's safe.
>>>
>>
>> I'll comment on PTR_TRUSTED in our discussion on patch 10.
>>
>>>>         bpf_spin_unlock(&lock);
>>>>
>>>>   * bpf_rbtree_remove is the only "pop()" currently. Non-owning refs are at risk
>>>>     of pointing to something that was already removed _only_ after a
>>>>     rbtree_remove, so if we invalidate them all after rbtree_remove they can't
>>>>     be inputs to subsequent remove()s
>>>
>>> With above proposed run-time checks both bpf_rbtree_remove and bpf_rbtree_add
>>> can have release semantics.
>>> No need for special release_on_unlock hacks.
>>>
>>
>> If we want to be able to interact w/ nodes after they've been added to the
>> rbtree, but before critical section ends, we need to support non-owning refs,
>> which are currently implemented using special release_on_unlock logic.
>>
>> If we go with the runtime check suggestion from above, we'd need to implement
>> 'conditional release' similarly to earlier "rbtree map" attempt:
>> https://lore.kernel.org/bpf/20220830172759.4069786-14-davemarchevsky@fb.com/ .
>>
>> If rbtree_add has release semantics for its node arg, but the node is already
>> in some tree and runtime check fails, the reference should not be released as
>> rbtree_add() was a nop.
> 
> Got it.
> The conditional release is tricky. We should probably avoid it for now.
> 
> I think we can either go with Kumar's proposal and do
> bpf_rbtree_pop_front() instead of bpf_rbtree_first()
> that avoids all these issues...
> 
> but considering that we'll have inline iterators soon and should be able to do:
> 
> struct bpf_rbtree_iter it;
> struct bpf_rb_node * node;
> 
> bpf_rbtree_iter_init(&it, rb_root); // locks the rbtree
> while ((node = bpf_rbtree_iter_next(&it)) {
>   if (node->field == condition) {
>     struct bpf_rb_node *n;
> 
>     n = bpf_rbtree_remove(rb_root, node);
>     bpf_spin_lock(another_rb_root);
>     bpf_rbtree_add(another_rb_root, n);
>     bpf_spin_unlock(another_rb_root);
>     break;
>   }
> }
> bpf_rbtree_iter_destroy(&it);
> 
> We can treat the 'node' returned from bpf_rbtree_iter_next() the same way
> as return from bpf_rbtree_first() ->  PTR_TRUSTED | MAYBE_NULL,
> but not acquired (ref_obj_id == 0).
> 
> bpf_rbtree_add -> KF_RELEASE
> so we cannot pass not acquired pointers into it.
> 
> We should probably remove release_on_unlock logic as Kumar suggesting and
> make bpf_list_push_front/back to be KF_RELEASE.
> 
> Then
> bpf_list_pop_front/back stay KF_ACQUIRE | KF_RET_NULL
> and
> bpf_rbtree_remove is also KF_ACQUIRE | KF_RET_NULL.
> 
> The difference is bpf_list_pop has only 'head'
> while bpf_rbtree_remove has 'root' and 'node' where 'node' has to be PTR_TRUSTED
> (but not acquired).
> 
> bpf_rbtree_add will always succeed.
> bpf_rbtree_remove will conditionally fail if 'node' is not linked.
> 
> Similarly we can extend link list with
> n = bpf_list_remove(node)
> which will have KF_ACQUIRE | KF_RET_NULL semantics.
> 
> Then everything is nicely uniform.
> We'll be able to iterate rbtree and iterate link lists.
> 
> There are downsides, of course.
> Like the following from your test case:
> +       bpf_spin_lock(&glock);
> +       bpf_rbtree_add(&groot, &n->node, less);
> +       bpf_rbtree_add(&groot, &m->node, less);
> +       res = bpf_rbtree_remove(&groot, &n->node);
> +       bpf_spin_unlock(&glock);
> will not work.
> Since bpf_rbtree_add() releases 'n' and it becomes UNTRUSTED.
> (assuming release_on_unlock is removed).
> 
> I think it's fine for now. I have to agree with Kumar that it's hard to come up
> with realistic use case where 'n' should be accessed after it was added to link
> list or rbtree. Above test case doesn't look real.
> 
> This part of your test case:
> +       bpf_spin_lock(&glock);
> +       bpf_rbtree_add(&groot, &n->node, less);
> +       bpf_rbtree_add(&groot, &m->node, less);
> +       bpf_rbtree_add(&groot, &o->node, less);
> +
> +       res = bpf_rbtree_first(&groot);
> +       if (!res) {
> +               bpf_spin_unlock(&glock);
> +               return 2;
> +       }
> +
> +       o = container_of(res, struct node_data, node);
> +       res = bpf_rbtree_remove(&groot, &o->node);
> +       bpf_spin_unlock(&glock);
> 
> will work, because bpf_rbtree_first returns PTR_TRUSTED | MAYBE_NULL.
> 
>> Similarly, if rbtree_remove has release semantics for its node arg and acquire
>> semantics for its return value, runtime check failing should result in the
>> node arg not being released. Acquire semantics for the retval are already
>> conditional - if retval == NULL, mark_ptr_or_null regs will release the
>> acquired ref before it can be used. So no issue with failing rbtree_remove
>> messing up acquire.
>>
>> For this reason rbtree_remove and rbtree_first are tagged
>> KF_ACQUIRE | KF_RET_NULL. "special release_on_unlock hacks" can likely be
>> refactored into a similar flag, KF_RELEASE_NON_OWN or similar.
> 
> I guess what I'm propsing above is sort-of KF_RELEASE_NON_OWN idea,
> but from a different angle.
> I'd like to avoid introducing new flags.
> I think PTR_TRUSTED is enough.
> 
>>> I'm not sure what's an idea to return 'n' from remove...
>>> Maybe it should be simple bool ?
>>>
>>
>> I agree that returning node from rbtree_remove is not strictly necessary, since
>> rbtree_remove can be thought of turning its non-owning ref argument into an
>> owning ref, instead of taking non-owning ref and returning owning ref. But such
>> an operation isn't really an 'acquire' by current verifier logic, since only
>> retvals can be 'acquired'. So we'd need to add some logic to enable acquire
>> semantics for args. Furthermore it's not really 'acquiring' a new ref, rather
>> changing properties of node arg ref.
>>
>> However, if rbtree_remove can fail, such a "turn non-owning into owning"
>> operation will need to be able to fail as well, and the program will need to
>> be able to check for failure. Returning 'acquire' result in retval makes
>> this simple - just check for NULL. For your "return bool" proposal, we'd have
>> to add verifier logic which turns the 'acquired' owning ref back into non-owning
>> based on check of the bool, which will add some verifier complexity.
>>
>> IIRC when doing experimentation with "rbtree map" implementation, I did
>> something like this and decided that the additional complexity wasn't worth
>> it when retval can just be used. 
> 
> Agree. Forget 'bool' idea.

We will merge this convo w/ similar one in the cover letter's thread, and
continue w/ replies there.

Patch

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 1d51bd9596da..67a13110bc22 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -451,6 +451,11 @@  static bool reg_type_not_null(enum bpf_reg_type type)
 		type == PTR_TO_SOCK_COMMON;
 }
 
+static bool type_is_ptr_alloc_obj(u32 type)
+{
+	return base_type(type) == PTR_TO_BTF_ID && type_flag(type) & MEM_ALLOC;
+}
+
 static struct btf_record *reg_btf_record(const struct bpf_reg_state *reg)
 {
 	struct btf_record *rec = NULL;
@@ -458,7 +463,7 @@  static struct btf_record *reg_btf_record(const struct bpf_reg_state *reg)
 
 	if (reg->type == PTR_TO_MAP_VALUE) {
 		rec = reg->map_ptr->record;
-	} else if (reg->type == (PTR_TO_BTF_ID | MEM_ALLOC)) {
+	} else if (type_is_ptr_alloc_obj(reg->type)) {
 		meta = btf_find_struct_meta(reg->btf, reg->btf_id);
 		if (meta)
 			rec = meta->record;