
[bpf-next,00/13] BPF rbtree next-gen datastructure

Message ID 20221206231000.3180914-1-davemarchevsky@fb.com (mailing list archive)

Message

Dave Marchevsky Dec. 6, 2022, 11:09 p.m. UTC
This series adds an rbtree datastructure following the "next-gen
datastructure" precedent set by the recently-added linked list [0]. This is
a reimplementation of the previous rbtree RFC [1] using kfunc + kptr
instead of adding a new map type. This series adds a smaller set of API
functions than that RFC - just the minimum needed to support the current
cgfifo example scheduler in the ongoing sched_ext effort [2], namely:

  bpf_rbtree_add
  bpf_rbtree_remove
  bpf_rbtree_first

The meat of this series is bugfixes and verifier infra work to support
these API functions. Adding more rbtree kfuncs in future patches should
be straightforward as a result.

BPF rbtree uses struct rb_root_cached + the existing rbtree lib under the
hood. From the BPF program writer's perspective, a BPF rbtree is very
similar to the existing linked list. Consider the following example:

  struct node_data {
    long key;
    long data;
    struct bpf_rb_node node;
  };

  static bool less(struct bpf_rb_node *a, const struct bpf_rb_node *b)
  {
    struct node_data *node_a;
    struct node_data *node_b;

    node_a = container_of(a, struct node_data, node);
    node_b = container_of(b, struct node_data, node);

    return node_a->key < node_b->key;
  }

  private(A) struct bpf_spin_lock glock;
  private(A) struct bpf_rb_root groot __contains(node_data, node);

  /* ... in BPF program */
  struct node_data *n, *m;
  struct bpf_rb_node *res;

  n = bpf_obj_new(typeof(*n));
  if (!n)
    /* skip */
  n->key = 5;
  n->data = 10;

  bpf_spin_lock(&glock);
  bpf_rbtree_add(&groot, &n->node, less);
  bpf_spin_unlock(&glock);

  bpf_spin_lock(&glock);
  res = bpf_rbtree_first(&groot);
  if (!res)
    /* skip */
  res = bpf_rbtree_remove(&groot, res);
  if (!res)
    /* skip */
  bpf_spin_unlock(&glock);

  m = container_of(res, struct node_data, node);
  bpf_obj_drop(m);

Some obvious similarities:

  * The special bpf_rb_root and bpf_rb_node types have the same semantics
    as bpf_list_head and bpf_list_node, respectively
  * __contains is used to associate a node type with a root
  * The spin_lock associated with an rbtree must be held when using
    rbtree API kfuncs
  * Nodes are allocated via bpf_obj_new and dropped via bpf_obj_drop
  * The rbtree takes ownership of a node's lifetime when the node is added.
    Removing a node gives ownership back to the program, requiring a
    bpf_obj_drop before program exit

Some new additions as well:

  * Support for callbacks in kfunc args is added to enable the 'less'
    callback use above
  * bpf_rbtree_first's release_on_unlock handling is a bit novel, as
    it's the first next-gen datastructure API function to release_on_unlock
    its return reg instead of a (nonexistent) node arg
  * Because all references to nodes already added to the rbtree are
    'non-owning', i.e. release_on_unlock and PTR_UNTRUSTED,
    bpf_rbtree_remove must accept such a reference in order to remove a
    node from the tree

It seemed better to special-case some 'new additions' verifier logic for
now instead of adding new type flags and concepts, as some of the concepts
(e.g. PTR_UNTRUSTED + release_on_unlock) need a refactoring pass before
we pile more on. Regardless, the net-new verifier logic added in this
patchset is minimal. Verifier changes are mostly generalizations of
existing linked-list logic plus some bugfixes.

A note on naming: 

Some existing list-specific helpers are renamed to 'datastructure_head',
'datastructure_node', etc. Probably a more concise and accurate naming
would be something like 'ng_ds_head' for 'next-gen datastructure'.

For folks who weren't following the conversations over the past few months,
though, such a naming scheme might seem to indicate that _all_ next-gen
datastructures must have certain semantics, like release_on_unlock,
which aren't necessarily required. For this reason I'd like some
feedback on how to name things.

Summary of patches:

  Patches 1, 2, and 10 are bugfixes which are likely worth applying
  independently of the rbtree implementation. Patch 12 is somewhere between
  a nice-to-have and a bugfix.

  Patches 3 and 4 are nonfunctional refactor/rename.

  Patches 5 - 9 implement the meat of rbtree support in this series,
  gradually building up to implemented kfuncs that verify as expected.
  Patch 11 adds bpf_rbtree_{add,first,remove} decls to bpf_experimental.h.

  Patch 13 adds tests.

  [0]: lore.kernel.org/bpf/20221118015614.2013203-1-memxor@gmail.com
  [1]: lore.kernel.org/bpf/20220830172759.4069786-1-davemarchevsky@fb.com
  [2]: lore.kernel.org/bpf/20221130082313.3241517-1-tj@kernel.org

Future work:
  Enabling writes to release_on_unlock refs should be done before the
  functionality of BPF rbtree can truly be considered complete.
  Implementing this proved more complex than expected so it's been
  pushed off to a future patch.

Dave Marchevsky (13):
  bpf: Loosen alloc obj test in verifier's reg_btf_record
  bpf: map_check_btf should fail if btf_parse_fields fails
  bpf: Minor refactor of ref_set_release_on_unlock
  bpf: rename list_head -> datastructure_head in field info types
  bpf: Add basic bpf_rb_{root,node} support
  bpf: Add bpf_rbtree_{add,remove,first} kfuncs
  bpf: Add support for bpf_rb_root and bpf_rb_node in kfunc args
  bpf: Add callback validation to kfunc verifier logic
  bpf: Special verifier handling for bpf_rbtree_{remove, first}
  bpf, x86: BPF_PROBE_MEM handling for insn->off < 0
  bpf: Add bpf_rbtree_{add,remove,first} decls to bpf_experimental.h
  libbpf: Make BTF mandatory if program BTF has spin_lock or alloc_obj
    type
  selftests/bpf: Add rbtree selftests

 arch/x86/net/bpf_jit_comp.c                   | 123 +++--
 include/linux/bpf.h                           |  21 +-
 include/uapi/linux/bpf.h                      |  11 +
 kernel/bpf/btf.c                              | 181 ++++---
 kernel/bpf/helpers.c                          |  75 ++-
 kernel/bpf/syscall.c                          |  33 +-
 kernel/bpf/verifier.c                         | 506 +++++++++++++++---
 tools/include/uapi/linux/bpf.h                |  11 +
 tools/lib/bpf/libbpf.c                        |  50 +-
 .../testing/selftests/bpf/bpf_experimental.h  |  24 +
 .../selftests/bpf/prog_tests/linked_list.c    |  12 +-
 .../testing/selftests/bpf/prog_tests/rbtree.c | 184 +++++++
 tools/testing/selftests/bpf/progs/rbtree.c    | 180 +++++++
 .../progs/rbtree_btf_fail__add_wrong_type.c   |  48 ++
 .../progs/rbtree_btf_fail__wrong_node_type.c  |  21 +
 .../testing/selftests/bpf/progs/rbtree_fail.c | 263 +++++++++
 16 files changed, 1549 insertions(+), 194 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/rbtree.c
 create mode 100644 tools/testing/selftests/bpf/progs/rbtree.c
 create mode 100644 tools/testing/selftests/bpf/progs/rbtree_btf_fail__add_wrong_type.c
 create mode 100644 tools/testing/selftests/bpf/progs/rbtree_btf_fail__wrong_node_type.c
 create mode 100644 tools/testing/selftests/bpf/progs/rbtree_fail.c

Comments

patchwork-bot+netdevbpf@kernel.org Dec. 7, 2022, 2:50 a.m. UTC | #1
Hello:

This series was applied to bpf/bpf-next.git (master)
by Alexei Starovoitov <ast@kernel.org>:

On Tue, 6 Dec 2022 15:09:47 -0800 you wrote:
> This series adds a rbtree datastructure following the "next-gen
> datastructure" precedent set by recently-added linked-list [0]. This is
> a reimplementation of previous rbtree RFC [1] to use kfunc + kptr
> instead of adding a new map type. This series adds a smaller set of API
> functions than that RFC - just the minimum needed to support current
> cgfifo example scheduler in ongoing sched_ext effort [2], namely:
> 
> [...]

Here is the summary with links:
  - [bpf-next,01/13] bpf: Loosen alloc obj test in verifier's reg_btf_record
    https://git.kernel.org/bpf/bpf-next/c/d8939cb0a03c
  - [bpf-next,02/13] bpf: map_check_btf should fail if btf_parse_fields fails
    (no matching commit)
  - [bpf-next,03/13] bpf: Minor refactor of ref_set_release_on_unlock
    (no matching commit)
  - [bpf-next,04/13] bpf: rename list_head -> datastructure_head in field info types
    (no matching commit)
  - [bpf-next,05/13] bpf: Add basic bpf_rb_{root,node} support
    (no matching commit)
  - [bpf-next,06/13] bpf: Add bpf_rbtree_{add,remove,first} kfuncs
    (no matching commit)
  - [bpf-next,07/13] bpf: Add support for bpf_rb_root and bpf_rb_node in kfunc args
    (no matching commit)
  - [bpf-next,08/13] bpf: Add callback validation to kfunc verifier logic
    (no matching commit)
  - [bpf-next,09/13] bpf: Special verifier handling for bpf_rbtree_{remove, first}
    (no matching commit)
  - [bpf-next,10/13] bpf, x86: BPF_PROBE_MEM handling for insn->off < 0
    (no matching commit)
  - [bpf-next,11/13] bpf: Add bpf_rbtree_{add,remove,first} decls to bpf_experimental.h
    (no matching commit)
  - [bpf-next,12/13] libbpf: Make BTF mandatory if program BTF has spin_lock or alloc_obj type
    (no matching commit)
  - [bpf-next,13/13] selftests/bpf: Add rbtree selftests
    (no matching commit)

You are awesome, thank you!
Kumar Kartikeya Dwivedi Dec. 7, 2022, 7:36 p.m. UTC | #2
On Wed, Dec 07, 2022 at 04:39:47AM IST, Dave Marchevsky wrote:
> This series adds a rbtree datastructure following the "next-gen
> datastructure" precedent set by recently-added linked-list [0]. This is
> a reimplementation of previous rbtree RFC [1] to use kfunc + kptr
> instead of adding a new map type. This series adds a smaller set of API
> functions than that RFC - just the minimum needed to support current
> cgfifo example scheduler in ongoing sched_ext effort [2], namely:
>
>   bpf_rbtree_add
>   bpf_rbtree_remove
>   bpf_rbtree_first
>
> [...]
>
> Future work:
>   Enabling writes to release_on_unlock refs should be done before the
>   functionality of BPF rbtree can truly be considered complete.
>   Implementing this proved more complex than expected so it's been
>   pushed off to a future patch.
>

TBH, I think we need to revisit whether there's a strong need for this. I would
even argue that we should simply make the release semantics of rbtree_add,
list_push helpers stronger and remove release_on_unlock logic entirely,
releasing the node immediately. I don't see why it is so critical to have read,
and more importantly, write access to nodes after losing their ownership. And
that too is only available until the lock is unlocked.

I think this relaxed release logic and write support is the wrong direction to
take, as it has a direct bearing on what can be done with a node inside the
critical section. There's already the problem with not being able to do
bpf_obj_drop easily inside the critical section with this. That might be useful
for draining operations while holding the lock.

Semantically in other languages, once you move an object, accessing it is
usually a bug, and in most of the cases it is sufficient to prepare it before
insertion. We are certainly in the same territory here with these APIs.

Can you elaborate on actual use cases where immediate release or not having
write support makes it hard or impossible to support a certain use case, so that
it is easier to understand the requirements and design things accordingly?
Dave Marchevsky Dec. 7, 2022, 10:28 p.m. UTC | #3
On 12/7/22 2:36 PM, Kumar Kartikeya Dwivedi wrote:
> On Wed, Dec 07, 2022 at 04:39:47AM IST, Dave Marchevsky wrote:
>> This series adds a rbtree datastructure following the "next-gen
>> datastructure" precedent set by recently-added linked-list [0]. This is
>> a reimplementation of previous rbtree RFC [1] to use kfunc + kptr
>> instead of adding a new map type. This series adds a smaller set of API
>> functions than that RFC - just the minimum needed to support current
>> cgfifo example scheduler in ongoing sched_ext effort [2], namely:
>>
>>   bpf_rbtree_add
>>   bpf_rbtree_remove
>>   bpf_rbtree_first
>>
>> [...]
>>
>> Future work:
>>   Enabling writes to release_on_unlock refs should be done before the
>>   functionality of BPF rbtree can truly be considered complete.
>>   Implementing this proved more complex than expected so it's been
>>   pushed off to a future patch.
>>

> 
> TBH, I think we need to revisit whether there's a strong need for this. I would
> even argue that we should simply make the release semantics of rbtree_add,
> list_push helpers stronger and remove release_on_unlock logic entirely,
> releasing the node immediately. I don't see why it is so critical to have read,
> and more importantly, write access to nodes after losing their ownership. And
> that too is only available until the lock is unlocked.
> 

Moved the next paragraph here to ease reply, it was the last paragraph
in your response.

> 
> Can you elaborate on actual use cases where immediate release or not having
> write support makes it hard or impossible to support a certain use case, so that
> it is easier to understand the requirements and design things accordingly?
>

Sure, the main usecase and impetus behind this for me is the sched_ext work
Tejun and others are doing (https://lwn.net/Articles/916291/). One of the
things they'd like to be able to do is implement a CFS-like scheduler using
rbtree entirely in BPF. This would prove that sched_ext + BPF can be used to
implement complicated scheduling logic.

If we can implement such complicated scheduling logic, but it has so much
BPF-specific twisting of program logic that it's incomprehensible to scheduler
folks, that's not great. The overlap between "BPF experts" and "scheduler
experts" is small, and we want the latter group to be able to read BPF
scheduling logic without too much struggle. Lower learning curve makes folks
more likely to experiment with sched_ext.

When 'rbtree map' was in brainstorming / prototyping, non-owning reference
semantics were called out as moving BPF datastructures closer to their kernel
equivalents from a UX perspective.

If the "it makes BPF code better resemble normal kernel code" argument was
the only reason to do this I wouldn't feel so strongly, but there are
practical concerns as well:
concerns as well:

If we could only read / write from rbtree node if it isn't in a tree, the common
operation of "find this node and update its data" would require removing and
re-adding it. For rbtree, these unnecessary remove and add operations could
result in unnecessary rebalancing. Going back to the sched_ext usecase,
if we have a rbtree with task or cgroup stats that need to be updated often,
unnecessary rebalancing would make this update slower than if non-owning refs
allowed in-place read/write of node data.

Also, we eventually want to be able to have a node that's part of both a
list and rbtree. Likely adding such a node to both would require calling
kfunc for adding to list, and separate kfunc call for adding to rbtree.
Once the node has been added to list, we need some way to represent a reference
to that node so that we can pass it to rbtree add kfunc. Sounds like a
non-owning reference to me, albeit with different semantics than current
release_on_unlock.

> I think this relaxed release logic and write support is the wrong direction to
> take, as it has a direct bearing on what can be done with a node inside the
> critical section. There's already the problem with not being able to do
> bpf_obj_drop easily inside the critical section with this. That might be useful
> for draining operations while holding the lock.
> 

The bpf_obj_drop case is similar to your "can't pass non-owning reference
to bpf_rbtree_remove" concern from patch 1's thread. If we have:

  n = bpf_obj_new(...); // n is owning ref
  bpf_rbtree_add(&tree, &n->node); // n is non-owning ref

  res = bpf_rbtree_first(&tree);
  if (!res) {...}
  m = container_of(res, struct node_data, node); // m is non-owning ref

  res = bpf_rbtree_remove(&tree, &n->node);
  n = container_of(res, struct node_data, node); // n is owning ref, m points to same memory

  bpf_obj_drop(n);
  // Not safe to use m anymore

Datastructures which support bpf_obj_drop in the critical section can
do same as my bpf_rbtree_remove suggestion: just invalidate all non-owning
references after bpf_obj_drop. Then there's no potential use-after-free.
(For the above example, pretend bpf_rbtree_remove didn't already invalidate
'm', or that there's some other way to obtain non-owning ref to 'n''s node
after rbtree_remove)

I think that, in practice, operations where the BPF program wants to remove
/ delete nodes will be distinct from operations where program just wants to 
obtain some non-owning refs and do read / write. At least for sched_ext usecase
this is true. So all the additional clobbers won't require program writer
to do special workarounds to deal with verifier in the common case.

> Semantically in other languages, once you move an object, accessing it is
> usually a bug, and in most of the cases it is sufficient to prepare it before
> insertion. We are certainly in the same territory here with these APIs.

Sure, but 'add'/'remove' for these intrusive linked datastructures is
_not_ a 'move'. Obscuring this from the user and forcing them to use
less performant patterns for the sake of some verifier complexity, or desire
to mimic semantics of languages w/o reference stability, doesn't make sense to
me.

If we were to add some datastructures without reference stability, sure, let's
not do non-owning references for those. So let's make this non-owning reference
stuff easy to turn on/off, perhaps via KF_RELEASE_NON_OWN or similar flags,
which will coincidentally make it very easy to remove if we later decide that
the complexity isn't worth it.
Alexei Starovoitov Dec. 7, 2022, 11:06 p.m. UTC | #4
On Wed, Dec 07, 2022 at 05:28:34PM -0500, Dave Marchevsky wrote:
> On 12/7/22 2:36 PM, Kumar Kartikeya Dwivedi wrote:
> > On Wed, Dec 07, 2022 at 04:39:47AM IST, Dave Marchevsky wrote:
> >> This series adds a rbtree datastructure following the "next-gen
> >> datastructure" precedent set by recently-added linked-list [0]. This is
> >> a reimplementation of previous rbtree RFC [1] to use kfunc + kptr
> >> instead of adding a new map type. This series adds a smaller set of API
> >> functions than that RFC - just the minimum needed to support current
> >> cgfifo example scheduler in ongoing sched_ext effort [2], namely:
> >>
> >>   bpf_rbtree_add
> >>   bpf_rbtree_remove
> >>   bpf_rbtree_first
> >>
> >> [...]
> >>
> >> Future work:
> >>   Enabling writes to release_on_unlock refs should be done before the
> >>   functionality of BPF rbtree can truly be considered complete.
> >>   Implementing this proved more complex than expected so it's been
> >>   pushed off to a future patch.
> >>
> 
> > 
> > TBH, I think we need to revisit whether there's a strong need for this. I would
> > even argue that we should simply make the release semantics of rbtree_add,
> > list_push helpers stronger and remove release_on_unlock logic entirely,
> > releasing the node immediately. I don't see why it is so critical to have read,
> > and more importantly, write access to nodes after losing their ownership. And
> > that too is only available until the lock is unlocked.
> > 
> 
> Moved the next paragraph here to ease reply, it was the last paragraph
> in your response.
> 
> > 
> > Can you elaborate on actual use cases where immediate release or not having
> > write support makes it hard or impossible to support a certain use case, so that
> > it is easier to understand the requirements and design things accordingly?
> >
> 
> Sure, the main usecase and impetus behind this for me is the sched_ext work
> Tejun and others are doing (https://lwn.net/Articles/916291/). One of the
> things they'd like to be able to do is implement a CFS-like scheduler using
> rbtree entirely in BPF. This would prove that sched_ext + BPF can be used to
> implement complicated scheduling logic.
> 
> If we can implement such complicated scheduling logic, but it has so much
> BPF-specific twisting of program logic that it's incomprehensible to scheduler
> folks, that's not great. The overlap between "BPF experts" and "scheduler
> experts" is small, and we want the latter group to be able to read BPF
> scheduling logic without too much struggle. Lower learning curve makes folks
> more likely to experiment with sched_ext.
> 
> When 'rbtree map' was in brainstorming / prototyping, non-owning reference
> semantics were called out as moving BPF datastructures closer to their kernel
> equivalents from a UX perspective.

Our emails crossed. See my previous email.
Agree on the above.

>> If the "it makes BPF code better resemble normal kernel code" argument was the
> only reason to do this I wouldn't feel so strongly, but there are practical
> concerns as well:
> 
> If we could only read / write from rbtree node if it isn't in a tree, the common
> operation of "find this node and update its data" would require removing and
> re-adding it. For rbtree, these unnecessary remove and add operations could

Not really. See my previous email.

> result in unnecessary rebalancing. Going back to the sched_ext usecase,
> if we have a rbtree with task or cgroup stats that need to be updated often,
> unnecessary rebalancing would make this update slower than if non-owning refs
> allowed in-place read/write of node data.

Agree. Read/write from non-owning refs is necessary.
In the other email I'm arguing that PTR_TRUSTED with ref_obj_id == 0
(your non-owning ref) should not be mixed with release_on_unlock logic.

KF_RELEASE should still accept as args and release only ptrs with ref_obj_id > 0.

> 
> Also, we eventually want to be able to have a node that's part of both a
> list and rbtree. Likely adding such a node to both would require calling
> kfunc for adding to list, and separate kfunc call for adding to rbtree.
> Once the node has been added to list, we need some way to represent a reference
> to that node so that we can pass it to rbtree add kfunc. Sounds like a
> non-owning reference to me, albeit with different semantics than current
> release_on_unlock.

A node with both link list and rbtree would be a new concept.
We'd need to introduce 'struct bpf_refcnt' and make sure prog does the right thing.
That's a future discussion.

> 
> > I think this relaxed release logic and write support is the wrong direction to
> > take, as it has a direct bearing on what can be done with a node inside the
> > critical section. There's already the problem with not being able to do
> > bpf_obj_drop easily inside the critical section with this. That might be useful
> > for draining operations while holding the lock.
> > 
> 
> The bpf_obj_drop case is similar to your "can't pass non-owning reference
> to bpf_rbtree_remove" concern from patch 1's thread. If we have:
> 
>   n = bpf_obj_new(...); // n is owning ref
>   bpf_rbtree_add(&tree, &n->node); // n is non-owning ref

what I proposed in the other email...
n should be untrusted here.
That's != 'n is non-owning ref'

>   res = bpf_rbtree_first(&tree);
>   if (!res) {...}
>   m = container_of(res, struct node_data, node); // m is non-owning ref

agree. m == PTR_TRUSTED with ref_obj_id == 0.

>   res = bpf_rbtree_remove(&tree, &n->node);

a typo here? Did you mean 'm->node' ?

and after 'if (res)' ...
>   n = container_of(res, struct node_data, node); // n is owning ref, m points to same memory

agree. n -> ref_obj_id > 0

>   bpf_obj_drop(n);

above is ok to do.
'n' becomes UNTRUSTED or invalid.

>   // Not safe to use m anymore

'm' should have become UNTRUSTED after bpf_rbtree_remove.

> Datastructures which support bpf_obj_drop in the critical section can
> do same as my bpf_rbtree_remove suggestion: just invalidate all non-owning
> references after bpf_obj_drop.

'invalidate all' sounds suspicious.
I don't think we need to do a sweeping search after bpf_obj_drop.

> Then there's no potential use-after-free.
> (For the above example, pretend bpf_rbtree_remove didn't already invalidate
> 'm', or that there's some other way to obtain non-owning ref to 'n''s node
> after rbtree_remove)
> 
> I think that, in practice, operations where the BPF program wants to remove
> / delete nodes will be distinct from operations where program just wants to 
> obtain some non-owning refs and do read / write. At least for sched_ext usecase
> this is true. So all the additional clobbers won't require program writer
> to do special workarounds to deal with verifier in the common case.
> 
> > Semantically in other languages, once you move an object, accessing it is
> > usually a bug, and in most of the cases it is sufficient to prepare it before
> > insertion. We are certainly in the same territory here with these APIs.
> 
> Sure, but 'add'/'remove' for these intrusive linked datastructures is
> _not_ a 'move'. Obscuring this from the user and forcing them to use
> less performant patterns for the sake of some verifier complexity, or desire
> to mimic semantics of languages w/o reference stability, doesn't make sense to
> me.

I agree, but everything we discuss in the above looks orthogonal
to release_on_unlock that myself and Kumar are proposing to drop.

> If we were to add some datastructures without reference stability, sure, let's
> not do non-owning references for those. So let's make this non-owning reference
> stuff easy to turn on/off, perhaps via KF_RELEASE_NON_OWN or similar flags,
> which will coincidentally make it very easy to remove if we later decide that
> the complexity isn't worth it. 

You mean KF_RELEASE_NON_OWN would be applied to bpf_rbtree_remove() ?
So it accepts PTR_TRUSTED ref_obj_id == 0 arg and makes it PTR_UNTRUSTED ?
If so then I agree. The 'release' part of the name was confusing.
It's also not clear which arg it applies to.
bpf_rbtree_remove has two args. Both are PTR_TRUSTED.
I wouldn't introduce a new flag for this just yet.
We can hard code bpf_rbtree_remove, bpf_list_pop for now
or use our name suffix hack.
Dave Marchevsky Dec. 8, 2022, 1:18 a.m. UTC | #5
On 12/7/22 6:06 PM, Alexei Starovoitov wrote:
> On Wed, Dec 07, 2022 at 05:28:34PM -0500, Dave Marchevsky wrote:
>> On 12/7/22 2:36 PM, Kumar Kartikeya Dwivedi wrote:
>>> On Wed, Dec 07, 2022 at 04:39:47AM IST, Dave Marchevsky wrote:
>>>> This series adds a rbtree datastructure following the "next-gen
>>>> datastructure" precedent set by recently-added linked-list [0]. This is
>>>> a reimplementation of previous rbtree RFC [1] to use kfunc + kptr
>>>> instead of adding a new map type. This series adds a smaller set of API
>>>> functions than that RFC - just the minimum needed to support current
>>>> cgfifo example scheduler in ongoing sched_ext effort [2], namely:
>>>>
>>>>   bpf_rbtree_add
>>>>   bpf_rbtree_remove
>>>>   bpf_rbtree_first
>>>>
>>>> [...]
>>>>
>>>> Future work:
>>>>   Enabling writes to release_on_unlock refs should be done before the
>>>>   functionality of BPF rbtree can truly be considered complete.
>>>>   Implementing this proved more complex than expected so it's been
>>>>   pushed off to a future patch.
>>>>
>>
>>>
>>> TBH, I think we need to revisit whether there's a strong need for this. I would
>>> even argue that we should simply make the release semantics of rbtree_add,
>>> list_push helpers stronger and remove release_on_unlock logic entirely,
>>> releasing the node immediately. I don't see why it is so critical to have read,
>>> and more importantly, write access to nodes after losing their ownership. And
>>> that too is only available until the lock is unlocked.
>>>
>>
>> Moved the next paragraph here to ease reply, it was the last paragraph
>> in your response.
>>
>>>
>>> Can you elaborate on actual use cases where immediate release or not having
>>> write support makes it hard or impossible to support a certain use case, so that
>>> it is easier to understand the requirements and design things accordingly?
>>>
>>
>> Sure, the main usecase and impetus behind this for me is the sched_ext work
>> Tejun and others are doing (https://lwn.net/Articles/916291/ ). One of the
>> things they'd like to be able to do is implement a CFS-like scheduler using
>> rbtree entirely in BPF. This would prove that sched_ext + BPF can be used to
>> implement complicated scheduling logic.
>>
>> If we can implement such complicated scheduling logic, but it has so much
>> BPF-specific twisting of program logic that it's incomprehensible to scheduler
>> folks, that's not great. The overlap between "BPF experts" and "scheduler
>> experts" is small, and we want the latter group to be able to read BPF
>> scheduling logic without too much struggle. Lower learning curve makes folks
>> more likely to experiment with sched_ext.
>>
>> When 'rbtree map' was in brainstorming / prototyping, non-owning reference
>> semantics were called out as moving BPF datastructures closer to their kernel
>> equivalents from a UX perspective.
> 
> Our emails crossed. See my previous email.
> Agree on the above.
> 
>> If the "it makes BPF code better resemble normal kernel code" argument was the
>> only reason to do this I wouldn't feel so strongly, but there are practical
>> concerns as well:
>>
>> If we could only read / write from rbtree node if it isn't in a tree, the common
>> operation of "find this node and update its data" would require removing and
>> re-adding it. For rbtree, these unnecessary remove and add operations could
> 
> Not really. See my previous email.
> 
>> result in unnecessary rebalancing. Going back to the sched_ext usecase,
>> if we have a rbtree with task or cgroup stats that need to be updated often,
>> unnecessary rebalancing would make this update slower than if non-owning refs
>> allowed in-place read/write of node data.
> 
> Agree. Read/write from non-owning refs is necessary.
> In the other email I'm arguing that PTR_TRUSTED with ref_obj_id == 0
> (your non-owning ref) should not be mixed with release_on_unlock logic.
> 
> KF_RELEASE should still accept as args and release only ptrs with ref_obj_id > 0.
> 
>>
>> Also, we eventually want to be able to have a node that's part of both a
>> list and rbtree. Likely adding such a node to both would require calling
>> kfunc for adding to list, and separate kfunc call for adding to rbtree.
>> Once the node has been added to list, we need some way to represent a reference
>> to that node so that we can pass it to rbtree add kfunc. Sounds like a
>> non-owning reference to me, albeit with different semantics than current
>> release_on_unlock.
> 
> A node with both link list and rbtree would be a new concept.
> We'd need to introduce 'struct bpf_refcnt' and make sure prog does the right thing.
> That's a future discussion.
> 
>>
>>> I think this relaxed release logic and write support is the wrong direction to
>>> take, as it has a direct bearing on what can be done with a node inside the
>>> critical section. There's already the problem with not being able to do
>>> bpf_obj_drop easily inside the critical section with this. That might be useful
>>> for draining operations while holding the lock.
>>>
>>
>> The bpf_obj_drop case is similar to your "can't pass non-owning reference
>> to bpf_rbtree_remove" concern from patch 1's thread. If we have:
>>
>>   n = bpf_obj_new(...); // n is owning ref
>>   bpf_rbtree_add(&tree, &n->node); // n is non-owning ref
> 
> what I proposed in the other email...
> n should be untrusted here.
> That's != 'n is non-owning ref'
> 
>>   res = bpf_rbtree_first(&tree);
>>   if (!res) {...}
>>   m = container_of(res, struct node_data, node); // m is non-owning ref
> 
> agree. m == PTR_TRUSTED with ref_obj_id == 0.
> 
>>   res = bpf_rbtree_remove(&tree, &n->node);
> 
> a typo here? Did you mean 'm->node' ?
> 
> and after 'if (res)' ...
>>   n = container_of(res, struct node_data, node); // n is owning ref, m points to same memory
> 
> agree. n -> ref_obj_id > 0
> 
>>   bpf_obj_drop(n);
> 
> above is ok to do.
> 'n' becomes UNTRUSTED or invalid.
> 
>>   // Not safe to use m anymore
> 
> 'm' should have become UNTRUSTED after bpf_rbtree_remove.
> 
>> Datastructures which support bpf_obj_drop in the critical section can
>> do the same as my bpf_rbtree_remove suggestion: just invalidate all non-owning
>> references after bpf_obj_drop.
> 
> 'invalidate all' sounds suspicious.
> I don't think we need to do a sweeping search after bpf_obj_drop.
> 
>> Then there's no potential use-after-free.
>> (For the above example, pretend bpf_rbtree_remove didn't already invalidate
>> 'm', or that there's some other way to obtain non-owning ref to 'n''s node
>> after rbtree_remove)
>>
>> I think that, in practice, operations where the BPF program wants to remove
>> / delete nodes will be distinct from operations where program just wants to 
>> obtain some non-owning refs and do read / write. At least for sched_ext usecase
>> this is true. So all the additional clobbers won't require program writer
>> to do special workarounds to deal with verifier in the common case.
>>
>>> Semantically in other languages, once you move an object, accessing it is
>>> usually a bug, and in most of the cases it is sufficient to prepare it before
>>> insertion. We are certainly in the same territory here with these APIs.
>>
>> Sure, but 'add'/'remove' for these intrusive linked datastructures is
>> _not_ a 'move'. Obscuring this from the user and forcing them to use
>> less performant patterns for the sake of some verifier complexity, or desire
>> to mimic semantics of languages w/o reference stability, doesn't make sense to
>> me.
> 
> I agree, but everything we discuss in the above looks orthogonal
> to release_on_unlock that myself and Kumar are proposing to drop.
> 
>> If we were to add some datastructures without reference stability, sure, let's
>> not do non-owning references for those. So let's make this non-owning reference
>> stuff easy to turn on/off, perhaps via KF_RELEASE_NON_OWN or similar flags,
>> which will coincidentally make it very easy to remove if we later decide that
>> the complexity isn't worth it. 
> 
> You mean KF_RELEASE_NON_OWN would be applied to bpf_rbtree_remove() ?
> So it accepts PTR_TRUSTED ref_obj_id == 0 arg and makes it PTR_UNTRUSTED ?
> If so then I agree. The 'release' part of the name was confusing.
> It's also not clear which arg it applies to.
> bpf_rbtree_remove has two args. Both are PTR_TRUSTED.
> I wouldn't introduce a new flag for this just yet.
> We can hard code bpf_rbtree_remove, bpf_list_pop for now
> or use our name suffix hack.

Before replying to specific things in this email, I think it would be useful
to have a subthread clearing up definitions and semantics, as I think we're
talking past each other a bit.


On a conceptual level I've still been using "owning reference" and "non-owning
reference" to understand rbtree operations. I'll use those here and try to map
them to actual verifier concepts later.

owning reference

  * This reference controls the lifetime of the pointee
  * Ownership of pointee must be 'released' by passing it to some rbtree
    API kfunc - rbtree_add in our case - or via bpf_obj_drop, which frees it
    * If not released before program ends, verifier considers prog invalid
  * Access to the memory ref is pointing at will not page fault

non-owning reference

  * No ownership of pointee so can't pass ownership via rbtree_add, not allowed
    to bpf_obj_drop
  * No control of lifetime, but can infer memory safety based on context
    (see explanation below)
  * Access to the memory ref is pointing at will not page fault
    (see explanation below)

2) From verifier's perspective non-owning references can only exist
between spin_lock and spin_unlock. Why? After spin_unlock another program
can do arbitrary operations on the rbtree like removing and free-ing
via bpf_obj_drop. A non-owning ref to some chunk of memory that was remove'd,
free'd, and reused via bpf_obj_new would point to an entirely different thing.
Or the memory could go away.

To prevent this logic violation all non-owning references are invalidated by
verifier after critical section ends. This is necessary to ensure "will
not page fault" property of non-owning reference. So if verifier hasn't
invalidated a non-owning ref, accessing it will not page fault.

Currently bpf_obj_drop is not allowed in the critical section, so similarly,
if there's a valid non-owning ref, we must be in critical section, and can
conclude that the ref's memory hasn't been dropped-and-free'd or dropped-
and-reused.

1) Any reference to a node that is in a rbtree _must_ be non-owning, since
the tree has control of pointee lifetime. Similarly, any ref to a node
that isn't in rbtree _must_ be owning. (let's ignore raw read from kptr_xchg'd
node in map_val for now)

Moving on to rbtree API:

bpf_rbtree_add(&tree, &node);
  'node' is an owning ref, becomes a non-owning ref.

bpf_rbtree_first(&tree);
  retval is a non-owning ref, since first() node is still in tree

bpf_rbtree_remove(&tree, &node);
  'node' is a non-owning ref, retval is an owning ref

All of the above can only be called when rbtree's lock is held, so invalidation
of all non-owning refs on spin_unlock is fine for rbtree_remove.

Nice property of paragraph marked with 1) above is the ability to use the
type system to prevent rbtree_add of node that's already in rbtree and
rbtree_remove of node that's not in one. So we can forego runtime
checking of "already in tree", "already not in tree".

But, as you and Kumar talked about in the past and referenced in patch 1's
thread, non-owning refs may alias each other, or an owning ref, and have no
way of knowing whether this is the case. So if X and Y are two non-owning refs
that alias each other, and bpf_rbtree_remove(tree, X) is called, a subsequent
call to bpf_rbtree_remove(tree, Y) would be removing node from tree which
already isn't in any tree (since prog has an owning ref to it). But verifier
doesn't know X and Y alias each other. So previous paragraph's "forego
runtime checks" statement can only hold if we invalidate all non-owning refs
after 'destructive' rbtree_remove operation.


It doesn't matter to me which combination of type flags, ref_obj_id, other
reg state stuff, and special-casing is used to implement owning and non-owning
refs. Specific ones chosen in this series for rbtree node:

owning ref: PTR_TO_BTF_ID | MEM_ALLOC (w/ type that contains bpf_rb_node)
            ref_obj_id > 0

non-owning ref: PTR_TO_BTF_ID | MEM_ALLOC (w/ type that contains bpf_rb_node)
                PTR_UNTRUSTED
                  - used for "can't pass ownership", not PROBE_MEM
                  - this is why I mentioned "decomposing UNTRUSTED into more
                    granular reg traits" in another thread
                ref_obj_id > 0
                release_on_unlock = true
                  - used due to paragraphs starting with 2) above                

Any other combination of type and reg state that gives me the semantics def'd
above works for me.


Based on this reply and others from today, I think you're saying that these
concepts should be implemented using:

owning ref: PTR_TO_BTF_ID | MEM_ALLOC (w/ rb_node type)
            PTR_TRUSTED
            ref_obj_id > 0

non-owning ref: PTR_TO_BTF_ID | MEM_ALLOC (w/ rb_node type)
                PTR_TRUSTED
                ref_obj_id == 0
                 - used for "can't pass ownership", since funcs that expect
                   owning ref need ref_obj_id > 0

And you're also adding 'untrusted' here, mainly as a result of
bpf_rbtree_add(tree, node) - 'node' becoming untrusted after it's added,
instead of becoming a non-owning ref. 'untrusted' would have state like:

PTR_TO_BTF_ID | MEM_ALLOC (w/ rb_node type)
PTR_UNTRUSTED
ref_obj_id == 0?

I think your "non-owning ref" definition also differs from mine, specifically
yours doesn't seem to have "will not page fault". For this reason, you don't
see the need for release_on_unlock logic, since that's used to prevent refs
escaping critical section and potentially referring to free'd memory.

This is where I start to get confused. Some questions:

  * If we get rid of release_on_unlock, and with mass invalidation of
    non-owning refs entirely, shouldn't non-owning refs be marked PTR_UNTRUSTED?

  * Since refs can alias each other, how to deal with bpf_obj_drop-and-reuse
    in this scheme, since non-owning ref can escape spin_unlock b/c no mass
    invalidation? PTR_UNTRUSTED isn't sufficient here

  * If non-owning ref can live past spin_unlock, do we expect read from
    such ref after _unlock to go through bpf_probe_read()? Otherwise direct
    read might fault and silently write 0.

  * For your 'untrusted', but not non-owning ref concept, I'm not sure
    what this gives us that's better than just invalidating the ref which
    gets in this state (rbtree_{add,remove} 'node' arg, bpf_obj_drop node)

I'm also not sure if you agree with my paragraph marked 1) above. But IMO the
release_on_unlock difference, and the perhaps-differing non-owning ref concept
are where we're really talking past each other.
Alexei Starovoitov Dec. 8, 2022, 3:51 a.m. UTC | #6
On Wed, Dec 07, 2022 at 08:18:25PM -0500, Dave Marchevsky wrote:
> 
> Before replying to specific things in this email, I think it would be useful
> to have a subthread clearing up definitions and semantics, as I think we're
> talking past each other a bit.

Yeah. We were not on the same page.
The concepts of 'owning ref' and 'non-owning ref' appeared 'new' to me.
I remember discussing 'conditional release' and OBJ_NON_OWNING_REF long ago
and I thought we agreed that both are not necessary and with that
I assumed that anything 'non-owning' as a concept is gone too.
So the only thing left (in my mind) was the 'owning' concept.
Which I mapped as ref_obj_id > 0. In other words 'owning' meant 'acquired'.

Please have this detailed explanation in the commit log next time to
avoid this back and forth.
Now to the fun part...

> 
> On a conceptual level I've still been using "owning reference" and "non-owning
> reference" to understand rbtree operations. I'll use those here and try to map
> them to actual verifier concepts later.
> 
> owning reference
> 
>   * This reference controls the lifetime of the pointee
>   * Ownership of pointee must be 'released' by passing it to some rbtree
>     API kfunc - rbtree_add in our case - or via bpf_obj_drop, which frees it
>     * If not released before program ends, verifier considers prog invalid
>   * Access to the memory ref is pointing at will not page fault

agree.

> non-owning reference
> 
>   * No ownership of pointee so can't pass ownership via rbtree_add, not allowed
>     to bpf_obj_drop
>   * No control of lifetime, but can infer memory safety based on context
>     (see explanation below)
>   * Access to the memory ref is pointing at will not page fault
>     (see explanation below)

agree with addition that both read and write should be allowed into this
'non-owning' ptr.
Which breaks if you map this to something that ORs with PTR_UNTRUSTED.

> 2) From verifier's perspective non-owning references can only exist
> between spin_lock and spin_unlock. Why? After spin_unlock another program
> can do arbitrary operations on the rbtree like removing and free-ing
> via bpf_obj_drop. A non-owning ref to some chunk of memory that was remove'd,
> free'd, and reused via bpf_obj_new would point to an entirely different thing.
> Or the memory could go away.

agree that spin_unlock needs to clean up 'non-owning'.

> To prevent this logic violation all non-owning references are invalidated by
> verifier after critical section ends. This is necessary to ensure "will
> not page fault" property of non-owning reference. So if verifier hasn't
> invalidated a non-owning ref, accessing it will not page fault.
> 
> Currently bpf_obj_drop is not allowed in the critical section, so similarly,
> if there's a valid non-owning ref, we must be in critical section, and can
> conclude that the ref's memory hasn't been dropped-and-free'd or dropped-
> and-reused.

I don't understand why is that a problem.

> 1) Any reference to a node that is in a rbtree _must_ be non-owning, since
> the tree has control of pointee lifetime. Similarly, any ref to a node
> that isn't in rbtree _must_ be owning. (let's ignore raw read from kptr_xchg'd
> node in map_val for now)

Also not clear why such restriction is necessary.

> Moving on to rbtree API:
> 
> bpf_rbtree_add(&tree, &node);
>   'node' is an owning ref, becomes a non-owning ref.
> 
> bpf_rbtree_first(&tree);
>   retval is a non-owning ref, since first() node is still in tree
> 
> bpf_rbtree_remove(&tree, &node);
>   'node' is a non-owning ref, retval is an owning ref

agree on the above definition.

> All of the above can only be called when rbtree's lock is held, so invalidation
> of all non-owning refs on spin_unlock is fine for rbtree_remove.
> 
> Nice property of paragraph marked with 1) above is the ability to use the
> type system to prevent rbtree_add of node that's already in rbtree and
> rbtree_remove of node that's not in one. So we can forego runtime
> checking of "already in tree", "already not in tree".

I think it's easier to add runtime check inside bpf_rbtree_remove()
since it already returns MAYBE_NULL. No 'conditional release' necessary.
And with that we don't need to worry about aliases.

> But, as you and Kumar talked about in the past and referenced in patch 1's
> thread, non-owning refs may alias each other, or an owning ref, and have no
> way of knowing whether this is the case. So if X and Y are two non-owning refs
> that alias each other, and bpf_rbtree_remove(tree, X) is called, a subsequent
> call to bpf_rbtree_remove(tree, Y) would be removing node from tree which
> already isn't in any tree (since prog has an owning ref to it). But verifier
> doesn't know X and Y alias each other. So previous paragraph's "forego
> runtime checks" statement can only hold if we invalidate all non-owning refs
> after 'destructive' rbtree_remove operation.

right. we either invalidate all non-owning after bpf_rbtree_remove
or do run-time check in bpf_rbtree_remove.
Consider the following:
bpf_spin_lock
n = bpf_rbtree_first(root);
m = bpf_rbtree_first(root);
x = bpf_rbtree_remove(root, n)
y = bpf_rbtree_remove(root, m)
bpf_spin_unlock
if (x)
   bpf_obj_drop(x)
if (y)
   bpf_obj_drop(y)

If we invalidate after bpf_rbtree_remove() the above will be rejected by the verifier.
If we do run-time check the above will be accepted and will work without crashing.
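A userspace model of the runtime-check variant makes the difference concrete. This sketch stands in a singly linked list for the rbtree (the balancing is irrelevant to the point) and uses a membership flag in the spirit of the kernel's RB_EMPTY_NODE() convention; none of these names are the proposed kfunc API.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in node: 'linked' plays the role of the RB_EMPTY_NODE() state. */
struct toy_node {
	long key;
	struct toy_node *next;
	int linked;
};

struct toy_root { struct toy_node *first; };

static void toy_add(struct toy_root *root, struct toy_node *n)
{
	if (n->linked)		/* already in the structure: nop */
		return;
	n->next = root->first;
	root->first = n;
	n->linked = 1;
}

/* Runtime-checked remove: removing a node that is no longer in the
 * structure (e.g. via an aliasing ref) is a nop that returns NULL
 * instead of corrupting the structure. */
static struct toy_node *toy_remove(struct toy_root *root, struct toy_node *n)
{
	struct toy_node **pp;

	if (!n->linked)
		return NULL;
	for (pp = &root->first; *pp; pp = &(*pp)->next) {
		if (*pp == n) {
			*pp = n->next;
			n->linked = 0;
			return n;
		}
	}
	return NULL;
}
```

With this model the n/m/x/y sequence above runs without crashing: both first() results alias the same node, the first remove succeeds, and the second returns NULL.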

The problem with release_on_unlock is that it marks 'n' after 1st remove
as UNTRUSTED which means 'no write' and 'read via probe_read'.
That's not good imo.

> 
> It doesn't matter to me which combination of type flags, ref_obj_id, other
> reg state stuff, and special-casing is used to implement owning and non-owning
> refs. Specific ones chosen in this series for rbtree node:
> 
> owning ref: PTR_TO_BTF_ID | MEM_ALLOC (w/ type that contains bpf_rb_node)
>             ref_obj_id > 0
> 
> non-owning ref: PTR_TO_BTF_ID | MEM_ALLOC (w/ type that contains bpf_rb_node)
>                 PTR_UNTRUSTED
>                   - used for "can't pass ownership", not PROBE_MEM
>                   - this is why I mentioned "decomposing UNTRUSTED into more
>                     granular reg traits" in another thread

Now I understand, but that was very hard to grasp.
UNTRUSTED means 'no write' and 'read via probe_read'.
ref_set_release_on_unlock() also keeps ref_obj_id > 0 as you're correctly
pointing out below:
>                 ref_obj_id > 0
>                 release_on_unlock = true
>                   - used due to paragraphs starting with 2) above                

but the problem with ref_set_release_on_unlock() that it mixes real ref-d
pointers with ref_obj_id > 0 with UNTRUSTED && ref_obj_id > 0.
And the latter is a quite confusing combination in my mind,
since we consider everything with ref_obj_id > 0 as good for KF_TRUSTED_ARGS.

> Any other combination of type and reg state that gives me the semantics def'd
> above works for me.
> 
> 
> Based on this reply and others from today, I think you're saying that these
> concepts should be implemented using:
> 
> owning ref: PTR_TO_BTF_ID | MEM_ALLOC (w/ rb_node type)
>             PTR_TRUSTED
>             ref_obj_id > 0

Almost.
I propose:
PTR_TO_BTF_ID | MEM_ALLOC  && ref_obj_id > 0

See the definition of is_trusted_reg().
It's ref_obj_id > 0 || flag == (MEM_ALLOC | PTR_TRUSTED)

I was saying 'trusted' because of is_trusted_reg() definition.
Sorry for confusion.

> non-owning ref: PTR_TO_BTF_ID | MEM_ALLOC (w/ rb_node type)
>                 PTR_TRUSTED
>                 ref_obj_id == 0
>                  - used for "can't pass ownership", since funcs that expect
>                    owning ref need ref_obj_id > 0

I propose:
PTR_TO_BTF_ID | MEM_ALLOC && ref_obj_id == 0

Both 'owning' and 'non-owning' will fit for KF_TRUSTED_ARGS kfuncs.

And we will be able to pass 'non-owning' under spin_lock into other kfuncs
and owning outside of spin_lock into other kfuncs.
Which is a good thing.

> And you're also adding 'untrusted' here, mainly as a result of
> bpf_rbtree_add(tree, node) - 'node' becoming untrusted after it's added,
> instead of becoming a non-owning ref. 'untrusted' would have state like:
> 
> PTR_TO_BTF_ID | MEM_ALLOC (w/ rb_node type)
> PTR_UNTRUSTED
> ref_obj_id == 0?

I'm not sure whether we really need full untrusted after going through bpf_rbtree_add()
or whether 'non-owning' is enough.
If it's full untrusted it will be:
PTR_TO_BTF_ID | PTR_UNTRUSTED && ref_obj_id == 0

tbh I don't remember why we even have 'MEM_ALLOC | PTR_UNTRUSTED'.

> I think your "non-owning ref" definition also differs from mine, specifically
> yours doesn't seem to have "will not page fault". For this reason, you don't
> see the need for release_on_unlock logic, since that's used to prevent refs
> escaping critical section and potentially referring to free'd memory.

Not quite.
We should be able to read/write directly through
PTR_TO_BTF_ID | MEM_ALLOC && ref_obj_id == 0
and we need to convert it to __mark_reg_unknown() after bpf_spin_unlock
the way release_reference() is doing.
I'm just not happy with using acquire_reference/release_reference() logic
(as release_on_unlock is doing) for cleaning after unlock.
Since we need to clean 'non-owning' ptrs in unlock it's confusing
to call the process 'release'.
I was hoping we can search through all states and __mark_reg_unknown() (or UNTRUSTED)
every reg where 
reg->id == cur_state->active_lock.id &&
flag == PTR_TO_BTF_ID | MEM_ALLOC && ref_obj_id == 0

By deleting release_on_unlock I meant deleting the release_on_unlock flag
and removing ref_set_release_on_unlock.
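The unlock-time sweep being described could look roughly like the following userspace sketch. The field names only loosely follow struct bpf_reg_state, and toy_unlock_sweep is an illustration of the matching condition, not actual verifier code.

```c
#include <assert.h>
#include <stdbool.h>

/* Minimal stand-in for the register-state fields the sweep looks at. */
struct toy_reg {
	int id;            /* lock id the non-owning ref is tied to */
	int ref_obj_id;    /* 0 for non-owning, > 0 for owning */
	bool mem_alloc;    /* PTR_TO_BTF_ID | MEM_ALLOC stand-in */
	bool unknown;      /* __mark_reg_unknown() stand-in */
};

/* On spin_unlock: mark unknown every MEM_ALLOC reg with ref_obj_id == 0
 * whose id matches the active lock, leaving owning refs untouched. */
static void toy_unlock_sweep(struct toy_reg *regs, int n, int active_lock_id)
{
	for (int i = 0; i < n; i++)
		if (regs[i].mem_alloc && regs[i].ref_obj_id == 0 &&
		    regs[i].id == active_lock_id)
			regs[i].unknown = true;
}
```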

> This is where I start to get confused. Some questions:
> 
>   * If we get rid of release_on_unlock, and with mass invalidation of
>     non-owning refs entirely, shouldn't non-owning refs be marked PTR_UNTRUSTED?

Since we'll be cleaning all
PTR_TO_BTF_ID | MEM_ALLOC && ref_obj_id == 0
it shouldn't affect ptrs with ref_obj_id > 0 that came from bpf_obj_new.

The verifier already enforces that bpf_spin_unlock will be present
at the right place in bpf prog.
When the verifier sees it, it will clean all non-owning refs with this spinlock 'id'.
So no concerns of leaking 'non-owning' outside.

While processing bpf_rbtree_first we need to:
regs[BPF_REG_0].type = PTR_TO_BTF_ID | MEM_ALLOC;
regs[BPF_REG_0].id = active_lock.id;
regs[BPF_REG_0].ref_obj_id = 0;

>   * Since refs can alias each other, how to deal with bpf_obj_drop-and-reuse
>     in this scheme, since non-owning ref can escape spin_unlock b/c no mass
>     invalidation? PTR_UNTRUSTED isn't sufficient here

run-time check in bpf_rbtree_remove (and in the future bpf_list_remove)
should address it, no?

>   * If non-owning ref can live past spin_unlock, do we expect read from
>     such ref after _unlock to go through bpf_probe_read()? Otherwise direct
>     read might fault and silently write 0.

unlock has to clean them.

>   * For your 'untrusted', but not non-owning ref concept, I'm not sure
>     what this gives us that's better than just invalidating the ref which
>     gets in this state (rbtree_{add,remove} 'node' arg, bpf_obj_drop node)

Whether to mark unknown or untrusted or non-owning after bpf_rbtree_add() is a difficult one.
Untrusted will allow prog to do read only access (via probe_read) into the node
but might hide bugs.
The cleanup after bpf_spin_unlock of non-owning and clean up after
bpf_rbtree_add() does not have to be the same.
Currently I'm leaning towards PTR_UNTRUSTED for cleanup after bpf_spin_unlock
and non-owning after bpf_rbtree_add.

Walking the example from previous email:

struct bpf_rbtree_iter it;
struct bpf_rb_node * node;
struct bpf_rb_node *n, *m;

bpf_rbtree_iter_init(&it, rb_root); // locks the rbtree works as bpf_spin_lock
while ((node = bpf_rbtree_iter_next(&it)) {
  // node -> PTR_TO_BTF_ID | MEM_ALLOC | MAYBE_NULL && ref_obj_id == 0
  if (node && node->field == condition) {

    n = bpf_rbtree_remove(rb_root, node);
    if (!n) ...;
    // n -> PTR_TO_BTF_ID | MEM_ALLOC && ref_obj_id == X
    m = bpf_rbtree_remove(rb_root, node); // ok, but fails in run-time
    if (!m) ...;
    // m -> PTR_TO_BTF_ID | MEM_ALLOC && ref_obj_id == Y

    // node is still:
    // node -> PTR_TO_BTF_ID | MEM_ALLOC && ref_obj_id == 0 && id == active_lock[0].id

    // assume we allow double locks one day
    bpf_spin_lock(another_rb_root);
    bpf_rbtree_add(another_rb_root, n);
    // n -> PTR_TO_BTF_ID | MEM_ALLOC && ref_obj_id == 0 && id == active_lock[1].id
    bpf_spin_unlock(another_rb_root);
    // n -> PTR_TO_BTF_ID | PTR_UNTRUSTED && ref_obj_id == 0
    break;
  }
}
// node -> PTR_TO_BTF_ID | MEM_ALLOC && ref_obj_id == 0 && id == active_lock[0].id
bpf_rbtree_iter_destroy(&it); // does unlock
// node -> PTR_TO_BTF_ID | PTR_UNTRUSTED
// n -> PTR_TO_BTF_ID | PTR_UNTRUSTED
// m -> PTR_TO_BTF_ID | MEM_ALLOC && ref_obj_id == Y
bpf_obj_drop(m);
Dave Marchevsky Dec. 8, 2022, 8:28 a.m. UTC | #7
On 12/7/22 10:51 PM, Alexei Starovoitov wrote:
> On Wed, Dec 07, 2022 at 08:18:25PM -0500, Dave Marchevsky wrote:
>>
>> Before replying to specific things in this email, I think it would be useful
>> to have a subthread clearing up definitions and semantics, as I think we're
>> talking past each other a bit.
> 
> Yeah. We were not on the same page.
> The concepts of 'owning ref' and 'non-owning ref' appeared 'new' to me.
> I remember discussing 'conditional release' and OBJ_NON_OWNING_REF long ago
> and I thought we agreed that both are not necessary and with that
> I assumed that anything 'non-owning' as a concept is gone too.
> So the only thing left (in my mind) was the 'owning' concept.
> Which I mapped as ref_obj_id > 0. In other words 'owning' meant 'acquired'.
> 

Whereas in my mind the release_on_unlock logic was specifically added to
implement the mass invalidation part of non-owning reference semantics, and it
being accepted implied that we weren't getting rid of the concept :).

> Please have this detailed explanation in the commit log next time to
> avoid this back and forth.
> Now to the fun part...
> 

I will add a documentation commit explaining 'owning' and 'non-owning' ref
as they pertain to these datastructures, after we agree about the semantics.

Speaking of which, although I have a few questions / clarifications, I think
we're more in agreement after your reply. After one more round of clarification
I will summarize conclusions to see if we agree on enough to move forward.

>>
>> On a conceptual level I've still been using "owning reference" and "non-owning
>> reference" to understand rbtree operations. I'll use those here and try to map
>> them to actual verifier concepts later.
>>
>> owning reference
>>
>>   * This reference controls the lifetime of the pointee
>>   * Ownership of pointee must be 'released' by passing it to some rbtree
>>     API kfunc - rbtree_add in our case - or via bpf_obj_drop, which frees it
>>     * If not released before program ends, verifier considers prog invalid
>>   * Access to the memory ref is pointing at will not page fault
> 
> agree.
> 
>> non-owning reference
>>
>>   * No ownership of pointee so can't pass ownership via rbtree_add, not allowed
>>     to bpf_obj_drop
>>   * No control of lifetime, but can infer memory safety based on context
>>     (see explanation below)
>>   * Access to the memory ref is pointing at will not page fault
>>     (see explanation below)
> 
> agree with addition that both read and write should be allowed into this
> 'non-owning' ptr.
> Which breaks if you map this to something that ORs with PTR_UNTRUSTED.
> 

Agree re: read/write allowed. PTR_UNTRUSTED was an implementation detail.
Sounds like we agree on general purpose of owning, non-owning. Looks like
we're in agreement about above semantics.

>> 2) From verifier's perspective non-owning references can only exist
>> between spin_lock and spin_unlock. Why? After spin_unlock another program
>> can do arbitrary operations on the rbtree like removing and free-ing
>> via bpf_obj_drop. A non-owning ref to some chunk of memory that was remove'd,
>> free'd, and reused via bpf_obj_new would point to an entirely different thing.
>> Or the memory could go away.
> 
> agree that spin_unlock needs to clean up 'non-owning'.

Another point of agreement.

> 
>> To prevent this logic violation all non-owning references are invalidated by
>> verifier after critical section ends. This is necessary to ensure "will
>> not page fault" property of non-owning reference. So if verifier hasn't
>> invalidated a non-owning ref, accessing it will not page fault.
>>
>> Currently bpf_obj_drop is not allowed in the critical section, so similarly,
>> if there's a valid non-owning ref, we must be in critical section, and can
>> conclude that the ref's memory hasn't been dropped-and-free'd or dropped-
>> and-reused.
> 
> I don't understand why is that a problem.
> 
>> 1) Any reference to a node that is in a rbtree _must_ be non-owning, since
>> the tree has control of pointee lifetime. Similarly, any ref to a node
>> that isn't in rbtree _must_ be owning. (let's ignore raw read from kptr_xchg'd
>> node in map_val for now)
> 
> Also not clear why such restriction is necessary.
> 

If we have this restriction and bpf_rbtree_remove also mass invalidates
non-owning refs, the type system will ensure that only nodes that are in a tree
will be passed to bpf_rbtree_remove, and we can avoid the runtime check.

But below you mention preferring the runtime check, mostly noting here to
refer back when continuing reply below.

>> Moving on to rbtree API:
>>
>> bpf_rbtree_add(&tree, &node);
>>   'node' is an owning ref, becomes a non-owning ref.
>>
>> bpf_rbtree_first(&tree);
>>   retval is a non-owning ref, since first() node is still in tree
>>
>> bpf_rbtree_remove(&tree, &node);
>>   'node' is a non-owning ref, retval is an owning ref
> 
> agree on the above definition.
> 
>> All of the above can only be called when rbtree's lock is held, so invalidation
>> of all non-owning refs on spin_unlock is fine for rbtree_remove.
>>
>> Nice property of paragraph marked with 1) above is the ability to use the
>> type system to prevent rbtree_add of node that's already in rbtree and
>> rbtree_remove of node that's not in one. So we can forego runtime
>> checking of "already in tree", "already not in tree".
> 
> I think it's easier to add runtime check inside bpf_rbtree_remove()
> since it already returns MAYBE_NULL. No 'conditional release' necessary.
> And with that we don't need to worry about aliases.
> 

To clarify: You're proposing that we don't worry about solving the aliasing
problem at verification time. Instead rbtree_{add,remove} will deal with it
at runtime. Corollary of this is that my restriction tagged 1) above ("ref
to node in tree _must_ be non-owning, to node not in tree must be owning")
isn't something we're guaranteeing, due to possibility of aliasing.

So bpf_rbtree_remove might get a node that's not in tree, and
bpf_rbtree_add might get a node that's already in tree. Runtime behavior
of both should be 'nop'.


If that is an accurate restatement of your proposal, the verifier
logic will need to be changed:

For bpf_rbtree_remove(&tree, &node), if node is already not in a tree,
retval will be NULL, effectively not acquiring an owning ref due to
mark_ptr_or_null_reg's logic.

In this case, do we want to invalidate
the 'node' arg as well? Or just leave it as a non-owning ref that points
to a node not in any tree? I think the latter requires fewer verifier changes,
but I can see the argument for the former if we want restriction 1) to
mostly hold, except under aliasing.

The above scenario is the only case where bpf_rbtree_remove fails and
returns NULL.

(In this series it can fail and RET_NULL for this reason, but my earlier comment
about type system + invalidate all-non owning after remove as discussed below
was my original intent. So I shouldn't have been allowing RET_NULL for my
version of these semantics.)


For bpf_rbtree_add(&tree, &node, less), if the arg is already in a tree, then
'node' isn't really an owning ref, and we need to tag it as non-owning,
and the program then won't need to bpf_obj_drop it before exiting. If the node
wasn't already in a tree and rbtree_add actually added it, 'node' would
also be tagged as non-owning, since the tree now owns it.

Do we need some way to indicate whether the 'already in tree' case happened?
If so, we would need to change the retval from void to bool or struct bpf_rb_node *.

Above scenario is only case where bpf_rbtree_add fails and returns
NULL / false. 
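One way to surface the 'already in tree' case is a boolean retval. The sketch below is hypothetical (the actual kfunc currently returns void, and toy_rbtree_add / the 'linked' flag are stand-ins, with the comparison callback and tree internals elided):

```c
#include <assert.h>
#include <stdbool.h>

struct toy_node { long key; bool linked; };
struct toy_root { int nr; /* tree internals elided */ };

/* Hypothetical retval change: true if the node was inserted, false if it
 * was already in a tree and the call degenerated to a nop. */
static bool toy_rbtree_add(struct toy_root *root, struct toy_node *n)
{
	if (n->linked)
		return false;
	n->linked = true;
	root->nr++;
	return true;
}
```

A program could then bpf_obj_drop the node only when the add reports failure, rather than relying on the verifier to retag 'node' as non-owning in both outcomes.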

>> But, as you and Kumar talked about in the past and referenced in patch 1's
>> thread, non-owning refs may alias each other, or an owning ref, and have no
>> way of knowing whether this is the case. So if X and Y are two non-owning refs
>> that alias each other, and bpf_rbtree_remove(tree, X) is called, a subsequent
>> call to bpf_rbtree_remove(tree, Y) would be removing node from tree which
>> already isn't in any tree (since prog has an owning ref to it). But verifier
>> doesn't know X and Y alias each other. So previous paragraph's "forego
>> runtime checks" statement can only hold if we invalidate all non-owning refs
>> after 'destructive' rbtree_remove operation.
> 
> right. we either invalidate all non-owning after bpf_rbtree_remove
> or do run-time check in bpf_rbtree_remove.
> Consider the following:
> bpf_spin_lock
> n = bpf_rbtree_first(root);
> m = bpf_rbtree_first(root);
> x = bpf_rbtree_remove(root, n)
> y = bpf_rbtree_remove(root, m)
> bpf_spin_unlock
> if (x)
>    bpf_obj_drop(x)
> if (y)
>    bpf_obj_drop(y)
> 
> If we invalidate after bpf_rbtree_remove() the above will be rejected by the verifier.
> If we do run-time check the above will be accepted and will work without crashing.
> 

Agreed, although the above example's invalid double-remove of the same node is
the kind of thing I'd like to see prevented at verification time instead of at
runtime. Regardless, continuing with your runtime check idea.

> The problem with release_on_unlock is that it marks 'n' after 1st remove
> as UNTRUSTED which means 'no write' and 'read via probe_read'.
> That's not good imo.
>

Based on your response to the paragraph below this one, I think we're in agreement
that using PTR_UNTRUSTED for non-owning refs gives them a bunch of traits
they don't need, when I just wanted "can't pass ownership". So agreed that
PTR_UNTRUSTED is too blunt an instrument here.

Regarding "marks 'n' after 1st remove", the series isn't currently doing this,
I proposed it as a way to prevent aliasing problem, but I think your proposal
is explicitly not trying to prevent aliasing problem at verification time. So
for your semantics we would only have non-owning cleanup after spin_unlock.
And such cleanup might just mark refs PTR_UNTRUSTED instead of invalidating
entirely.

>>
>> It doesn't matter to me which combination of type flags, ref_obj_id, other
>> reg state stuff, and special-casing is used to implement owning and non-owning
>> refs. Specific ones chosen in this series for rbtree node:
>>
>> owning ref: PTR_TO_BTF_ID | MEM_ALLOC (w/ type that contains bpf_rb_node)
>>             ref_obj_id > 0
>>
>> non-owning ref: PTR_TO_BTF_ID | MEM_ALLOC (w/ type that contains bpf_rb_node)
>>                 PTR_UNTRUSTED
>>                   - used for "can't pass ownership", not PROBE_MEM
>>                   - this is why I mentioned "decomposing UNTRUSTED into more
>>                     granular reg traits" in another thread
> 
> Now I understand, but that was very hard to grasp.
> UNTRUSTED means 'no write' and 'read via probe_read'.
> ref_set_release_on_unlock() also keeps ref_obj_id > 0 as you're correctly
> pointing out below:
>>                 ref_obj_id > 0
>>                 release_on_unlock = true
>>                   - used due to paragraphs starting with 2) above                
> 
> but the problem with ref_set_release_on_unlock() is that it mixes real ref-d
> pointers with ref_obj_id > 0 with UNTRUSTED && ref_obj_id > 0.
> And the latter is a quite confusing combination in my mind,
> since we consider everything with ref_obj_id > 0 as good for KF_TRUSTED_ARGS.
> 

I think I understand your desire to get rid of release_on_unlock now. It's not
due to disliking the concept of "clean up non-owning refs after spin_unlock",
which you earlier agreed was necessary, but rather the specifics of
release_on_unlock mechanism used to achieve this. 

If so, I think I agree with your reasoning for why the mechanism is bad in
light of how you want owning/non-owning implemented. To summarize your
statements about release_on_unlock mechanism from the rest of your reply:

  * 'ref_obj_id > 0' already has a specific meaning wrt. is_trusted_reg,
    and we may want to support both TRUSTED and UNTRUSTED non-owning refs

    * My comment: Currently is_trusted_reg is only used for
      KF_ARG_PTR_TO_BTF_ID, while rbtree and list types are assigned special
      KF_ARGs. So hypothetically could have different 'is_trusted_reg' logic.
      I don't actually think that's a good idea, though, especially since
      rbtree / list types are really specializations of PTR_TO_BTF_ID anyways.
      So agreed.

  * Instead of using 'acquire' and (modified) 'release', we can achieve
    "clean-up non-owning after spin_unlock" by associating non-owning
    refs with active_lock.id when they're created. We can store this in
    reg.id, which is currently unused for PTR_TO_BTF_ID (afaict).

    * This will solve issue raised by previous point, allowing us to have
      non-owning refs which are truly 'untrusted' according to is_trusted_reg.

    * My comment: This all sounds reasonable. On spin_unlock we have
      active_lock.id, so can do bpf_for_each_reg_in_vstate to look for
      PTR_TO_BTF_IDs matching the id and do 'cleanup' for them.

>> Any other combination of type and reg state that gives me the semantics def'd
>> above works4me.
>>
>>
>> Based on this reply and others from today, I think you're saying that these
>> concepts should be implemented using:
>>
>> owning ref: PTR_TO_BTF_ID | MEM_ALLOC (w/ rb_node type)
>>             PTR_TRUSTED
>>             ref_obj_id > 0
> 
> Almost.
> I propose:
> PTR_TO_BTF_ID | MEM_ALLOC  && ref_obj_id > 0
> 
> See the definition of is_trusted_reg().
> It's ref_obj_id > 0 || flag == (MEM_ALLOC | PTR_TRUSTED)
> 
> I was saying 'trusted' because of is_trusted_reg() definition.
> Sorry for confusion.
>

I see. Sounds reasonable.

>> non-owning ref: PTR_TO_BTF_ID | MEM_ALLOC (w/ rb_node type)
>>                 PTR_TRUSTED
>>                 ref_obj_id == 0
>>                  - used for "can't pass ownership", since funcs that expect
>>                    owning ref need ref_obj_id > 0
> 
> I propose:
> PTR_TO_BTF_ID | MEM_ALLOC && ref_obj_id == 0
> 

Also sounds reasonable, perhaps with the addition of id > 0 to account for
your desired changes to release_on_unlock mechanism?

> Both 'owning' and 'non-owning' will fit for KF_TRUSTED_ARGS kfuncs.
> 
> And we will be able to pass 'non-owning' under spin_lock into other kfuncs
> and owning outside of spin_lock into other kfuncs.
> Which is a good thing.
> 

Allowing passing of owning ref outside of spin_lock sounds reasonable to me.
'non-owning' under spinlock will have the same "what if this touches __key"
issue I brought up in another thread. But you mentioned not preventing that
and I don't necessarily disagree, so just noting here.

>> And you're also adding 'untrusted' here, mainly as a result of
>> bpf_rbtree_add(tree, node) - 'node' becoming untrusted after it's added,
>> instead of becoming a non-owning ref. 'untrusted' would have state like:
>>
>> PTR_TO_BTF_ID | MEM_ALLOC (w/ rb_node type)
>> PTR_UNTRUSTED
>> ref_obj_id == 0?
> 
> I'm not sure whether we really need full untrusted after going through bpf_rbtree_add()
> or doing 'non-owning' is enough.
> If it's full untrusted it will be:
> PTR_TO_BTF_ID | PTR_UNTRUSTED && ref_obj_id == 0
> 

Yeah, I don't see what this "full untrusted" is giving us either. Let's have
"cleanup non-owning refs on spin_unlock" just invalidate the regs for now,
instead of converting to "full untrusted"?

Adding "full untrusted" later won't make any valid programs written with
"just invalidate the regs" in mind fail the verifier. So painless to add later.

> tbh I don't remember why we even have 'MEM_ALLOC | PTR_UNTRUSTED'.
> 

I think such type combo was only added to implement non-owning refs. If it's
rewritten to use your type combos I don't think there'll be any uses of
MEM_ALLOC | PTR_UNTRUSTED remaining.

>> I think your "non-owning ref" definition also differs from mine, specifically
>> yours doesn't seem to have "will not page fault". For this reason, you don't
>> see the need for release_on_unlock logic, since that's used to prevent refs
>> escaping critical section and potentially referring to free'd memory.
> 
> Not quite.
> We should be able to read/write directly through
> PTR_TO_BTF_ID | MEM_ALLOC && ref_obj_id == 0
> and we need to convert it to __mark_reg_unknown() after bpf_spin_unlock
> the way release_reference() is doing.
> I'm just not happy with using acquire_reference/release_reference() logic
> (as release_on_unlock is doing) for cleaning after unlock.
> Since we need to clean 'non-owning' ptrs in unlock it's confusing
> to call the process 'release'.
> I was hoping we can search through all states and __mark_reg_unknown() (or UNTRUSTED)
> every reg where 
> reg->id == cur_state->active_lock.id &&
> flag == PTR_TO_BTF_ID | MEM_ALLOC && ref_obj_id == 0
> 
> By deleting release_on_unlock I meant delete the release_on_unlock flag
> and remove ref_set_release_on_unlock.
> 

Summarized above, but: agreed, and thanks for clarifying what you meant by 
"delete release_on_unlock".

>> This is where I start to get confused. Some questions:
>>
>>   * If we get rid of release_on_unlock, and with mass invalidation of
>>     non-owning refs entirely, shouldn't non-owning refs be marked PTR_UNTRUSTED?
> 
> Since we'll be cleaning all
> PTR_TO_BTF_ID | MEM_ALLOC && ref_obj_id == 0
> it shouldn't affect ptrs with ref_obj_id > 0 that came from bpf_obj_new.
> 
> The verifier already enforces that bpf_spin_unlock will be present
> at the right place in bpf prog.
> When the verifier sees it it will clean all non-owning refs with this spinlock 'id'.
> So no concerns of leaking 'non-owning' outside.
> 

Sounds like we don't want "full untrusted" or any PTR_UNTRUSTED non-owning ref.

> While processing bpf_rbtree_first we need to:
> regs[BPF_REG_0].type = PTR_TO_BTF_ID | MEM_ALLOC;
> regs[BPF_REG_0].id = active_lock.id;
> regs[BPF_REG_0].ref_obj_id = 0;
> 

Agreed.

>>   * Since refs can alias each other, how to deal with bpf_obj_drop-and-reuse
>>     in this scheme, since non-owning ref can escape spin_unlock b/c no mass
>>     invalidation? PTR_UNTRUSTED isn't sufficient here
> 
> run-time check in bpf_rbtree_remove (and in the future bpf_list_remove)
> should address it, no?
> 

If we don't do "full untrusted" and cleanup non-owning refs by invalidating,
_and_ don't allow bpf_obj_{new,drop} in critical section, then I don't think
this is an issue.

But to elaborate on the issue, if we instead cleaned up non-owning by marking 
untrusted:

struct node_data *n = bpf_obj_new(typeof(*n));
struct node_data *m, *o;
struct some_other_type *t;

bpf_spin_lock(&lock);

bpf_rbtree_add(&tree, n);
m = bpf_rbtree_first();
o = bpf_rbtree_first(); // m and o are non-owning, point to same node

m = bpf_rbtree_remove(&tree, m); // m is owning

bpf_spin_unlock(&lock); // o is "full untrusted", marked PTR_UNTRUSTED

bpf_obj_drop(m);
t = bpf_obj_new(typeof(*t)); // pretend that exact chunk of memory that was
                             // dropped in previous statement is returned here

data = o->some_data_field;   // PROBE_MEM, but no page fault, so load will
                             // succeed, but will read garbage from another type
                             // while verifier thinks it's reading from node_data


If we clean up by invalidating, but eventually enable bpf_obj_{new,drop} inside
critical section, we'll have similar issue.

It's not necessarily "crash the kernel" dangerous, but it may anger program
writers since they can't be sure they're not reading garbage in this scenario.

>>   * If non-owning ref can live past spin_unlock, do we expect read from
>>     such ref after _unlock to go through bpf_probe_read()? Otherwise direct
>>     read might fault and silently write 0.
> 
> unlock has to clean them.
> 

Ack.

>>   * For your 'untrusted', but not non-owning ref concept, I'm not sure
>>     what this gives us that's better than just invalidating the ref which
>>     gets in this state (rbtree_{add,remove} 'node' arg, bpf_obj_drop node)
> 
> Whether to mark unknown or untrusted or non-owning after bpf_rbtree_add() is a difficult one.
> Untrusted will allow prog to do read only access (via probe_read) into the node
> but might hide bugs.
> The cleanup after bpf_spin_unlock of non-owning and clean up after
> bpf_rbtree_add() does not have to be the same.

This is a good point.

> Currently I'm leaning towards PTR_UNTRUSTED for cleanup after bpf_spin_unlock
> and non-owning after bpf_rbtree_add.
> 
> Walking the example from previous email:
> 
> struct bpf_rbtree_iter it;
> struct bpf_rb_node * node;
> struct bpf_rb_node *n, *m;
> 
> bpf_rbtree_iter_init(&it, rb_root); // locks the rbtree works as bpf_spin_lock
> while ((node = bpf_rbtree_iter_next(&it)) {
>   // node -> PTR_TO_BTF_ID | MEM_ALLOC | MAYBE_NULL && ref_obj_id == 0
>   if (node && node->field == condition) {
> 
>     n = bpf_rbtree_remove(rb_root, node);
>     if (!n) ...;
>     // n -> PTR_TO_BTF_ID | MEM_ALLOC && ref_obj_id == X
>     m = bpf_rbtree_remove(rb_root, node); // ok, but fails in run-time
>     if (!m) ...;
>     // m -> PTR_TO_BTF_ID | MEM_ALLOC && ref_obj_id == Y
> 
>     // node is still:
>     // node -> PTR_TO_BTF_ID | MEM_ALLOC && ref_obj_id == 0 && id == active_lock[0].id
> 
>     // assume we allow double locks one day
>     bpf_spin_lock(another_rb_root);
>     bpf_rbtree_add(another_rb_root, n);
>     // n -> PTR_TO_BTF_ID | MEM_ALLOC && ref_obj_id == 0 && id == active_lock[1].id
>     bpf_spin_unlock(another_rb_root);
>     // n -> PTR_TO_BTF_ID | PTR_UNTRUSTED && ref_obj_id == 0
>     break;
>   }
> }
> // node -> PTR_TO_BTF_ID | MEM_ALLOC && ref_obj_id == 0 && id == active_lock[0].id
> bpf_rbtree_iter_destroy(&it); // does unlock
> // node -> PTR_TO_BTF_ID | PTR_UNTRUSTED
> // n -> PTR_TO_BTF_ID | PTR_UNTRUSTED
> // m -> PTR_TO_BTF_ID | MEM_ALLOC && ref_obj_id == Y
> bpf_obj_drop(m);

This seems like a departure from other statements in your reply, where you're
leaning towards "non-owning and trusted" -> "full untrusted" after unlock
being unnecessary. I think the combo of reference aliases +
bpf_obj_drop-and-reuse makes everything hard to reason about.

Regardless, your comments annotating reg state look correct to me.
Kumar Kartikeya Dwivedi Dec. 8, 2022, 12:57 p.m. UTC | #8
On Thu, Dec 08, 2022 at 01:58:44PM IST, Dave Marchevsky wrote:
> On 12/7/22 10:51 PM, Alexei Starovoitov wrote:
> > On Wed, Dec 07, 2022 at 08:18:25PM -0500, Dave Marchevsky wrote:
> >>
> >> Before replying to specific things in this email, I think it would be useful
> >> to have a subthread clearing up definitions and semantics, as I think we're
> >> talking past each other a bit.
> >
> > Yeah. We were not on the same page.
> > The concepts of 'owning ref' and 'non-owning ref' appeared 'new' to me.
> > I remember discussing 'conditional release' and OBJ_NON_OWNING_REF long ago
> > and I thought we agreed that both are not necessary and with that
> > I assumed that anything 'non-owning' as a concept is gone too.
> > So the only thing left (in my mind) was the 'owning' concept.
> > Which I mapped as ref_obj_id > 0. In other words 'owning' meant 'acquired'.
> >
>
> Whereas in my mind the release_on_unlock logic was specifically added to
> implement the mass invalidation part of non-owning reference semantics, and it
> being accepted implied that we weren't getting rid of the concept :).
>
> > Please have this detailed explanation in the commit log next time to
> > avoid this back and forth.
> > Now to the fun part...
> >
>
> I will add a documentation commit explaining 'owning' and 'non-owning' ref
> as they pertain to these datastructures, after we agree about the semantics.
>
> Speaking of which, although I have a few questions / clarifications, I think
> we're more in agreement after your reply. After one more round of clarification
> I will summarize conclusions to see if we agree on enough to move forward.
>
> >>
> >> On a conceptual level I've still been using "owning reference" and "non-owning
> >> reference" to understand rbtree operations. I'll use those here and try to map
> >> them to actual verifier concepts later.
> >>
> >> owning reference
> >>
> >>   * This reference controls the lifetime of the pointee
> >>   * Ownership of pointee must be 'released' by passing it to some rbtree
> >>     API kfunc - rbtree_add in our case - or via bpf_obj_drop, which frees it
> >>     * If not released before program ends, verifier considers prog invalid
> >>   * Access to the memory ref is pointing at will not page fault
> >
> > agree.
> >
> >> non-owning reference
> >>
> >>   * No ownership of pointee so can't pass ownership via rbtree_add, not allowed
> >>     to bpf_obj_drop
> >>   * No control of lifetime, but can infer memory safety based on context
> >>     (see explanation below)
> >>   * Access to the memory ref is pointing at will not page fault
> >>     (see explanation below)
> >
> > agree with addition that both read and write should be allowed into this
> > 'non-owning' ptr.
> > Which breaks if you map this to something that ORs with PTR_UNTRUSTED.
> >
>
> Agree re: read/write allowed. PTR_UNTRUSTED was an implementation detail.
> Sounds like we agree on general purpose of owning, non-owning. Looks like
> we're in agreement about above semantics.
>

Yes, PTR_UNTRUSTED is not appropriate for this. My opposition was also more to
the idea of mapping PTR_UNTRUSTED to non-owning references.
If we do PTR_TO_BTF_ID | MEM_ALLOC for them with ref_obj_id == 0, it SGTM.

> >> 2) From verifier's perspective non-owning references can only exist
> >> between spin_lock and spin_unlock. Why? After spin_unlock another program
> >> can do arbitrary operations on the rbtree like removing and free-ing
> >> via bpf_obj_drop. A non-owning ref to some chunk of memory that was remove'd,
> >> free'd, and reused via bpf_obj_new would point to an entirely different thing.
> >> Or the memory could go away.
> >
> > agree that spin_unlock needs to clean up 'non-owning'.
>
> Another point of agreement.
>

+1

> >
> >> To prevent this logic violation all non-owning references are invalidated by
> >> verifier after critical section ends. This is necessary to ensure "will
> >> not page fault" property of non-owning reference. So if verifier hasn't
> >> invalidated a non-owning ref, accessing it will not page fault.
> >>
> >> Currently bpf_obj_drop is not allowed in the critical section, so similarly,
> >> if there's a valid non-owning ref, we must be in critical section, and can
> >> conclude that the ref's memory hasn't been dropped-and-free'd or dropped-
> >> and-reused.
> >
> > I don't understand why is that a problem.
> >
> >> 1) Any reference to a node that is in a rbtree _must_ be non-owning, since
> >> the tree has control of pointee lifetime. Similarly, any ref to a node
> >> that isn't in rbtree _must_ be owning. (let's ignore raw read from kptr_xchg'd
> >> node in map_val for now)

The last case is going to be marked PTR_UNTRUSTED.

> >
> > Also not clear why such restriction is necessary.
> >
>
> If we have this restriction and bpf_rbtree_release also mass invalidates
> non-owning refs, the type system will ensure that only nodes that are in a tree
> will be passed to bpf_rbtree_release, and we can avoid the runtime check.
>

I like this property. This was also how I proposed implementing it for lists.
e.g. Any bpf_list_del would invalidate the result of prior bpf_list_first_entry
and bpf_list_last_entry to ensure safety.

It's a bit similar to aliasing XOR mutability guarantees that Rust has. We're
trying to implement a simple borrow checking mechanism.

Once the collection is mutated, any prior non-owning references become
invalidated. It can be further refined (e.g. bpf_rbtree_add won't do
invalidation on mutation) based on the properties of the data structure.

> But below you mention preferring the runtime check, mostly noting here to
> refer back when continuing reply below.
>
> >> Moving on to rbtree API:
> >>
> >> bpf_rbtree_add(&tree, &node);
> >>   'node' is an owning ref, becomes a non-owning ref.
> >>
> >> bpf_rbtree_first(&tree);
> >>   retval is a non-owning ref, since first() node is still in tree
> >>
> >> bpf_rbtree_remove(&tree, &node);
> >>   'node' is a non-owning ref, retval is an owning ref
> >
> > agree on the above definition.
> > >> All of the above can only be called when rbtree's lock is held, so invalidation
> >> of all non-owning refs on spin_unlock is fine for rbtree_remove.
> >>
> >> Nice property of paragraph marked with 1) above is the ability to use the
> >> type system to prevent rbtree_add of node that's already in rbtree and
> >> rbtree_remove of node that's not in one. So we can forego runtime
> >> checking of "already in tree", "already not in tree".
> >
> > I think it's easier to add runtime check inside bpf_rbtree_remove()
> > since it already returns MAYBE_NULL. No 'conditional release' necessary.
> > And with that we don't need to worry about aliases.
> >
>
> To clarify: You're proposing that we don't worry about solving the aliasing
> problem at verification time. Instead rbtree_{add,remove} will deal with it
> at runtime. Corollary of this is that my restriction tagged 1) above ("ref
> to node in tree _must_ be non-owning, to node not in tree must be owning")
> isn't something we're guaranteeing, due to possibility of aliasing.
>
> So bpf_rbtree_remove might get a node that's not in tree, and
> bpf_rbtree_add might get a node that's already in tree. Runtime behavior
> of both should be 'nop'.
>
>
> If that is an accurate restatement of your proposal, the verifier
> logic will need to be changed:
>
> For bpf_rbtree_remove(&tree, &node), if node is already not in a tree,
> retval will be NULL, effectively not acquiring an owning ref due to
> mark_ptr_or_null_reg's logic.
>
> In this case, do we want to invalidate
> arg 'node' as well? Or just leave it as a non-owning ref that points
> to node not in tree? I think the latter requires fewer verifier changes,
> but can see the argument for the former if we want restriction 1) to
> mostly be true, unless aliasing.
>
> The above scenario is the only case where bpf_rbtree_remove fails and
> returns NULL.
>
> (In this series it can fail and RET_NULL for this reason, but my earlier comment
> about type system + invalidate all-non owning after remove as discussed below
> was my original intent. So I shouldn't have been allowing RET_NULL for my
> version of these semantics.)
>

I agree with Dave to rely on the invariant that non-owning refs point to nodes
that are part of the collection. Then bpf_rbtree_remove is simply KF_ACQUIRE.

>
> For bpf_rbtree_add(&tree, &node, less), if arg is already in tree, then
> 'node' isn't really an owning ref, and we need to tag it as non-owning,
> and program then won't need to bpf_obj_drop it before exiting. If node
> wasn't already in tree and rbtree_add actually added it, 'node' would
> also be tagged as non-owning, since tree now owns it.
>
> Do we need some way to indicate whether 'already in tree' case happened?
> If so, would need to change retval from void to bool or struct bpf_rb_node *.
>
> Above scenario is only case where bpf_rbtree_add fails and returns
> NULL / false.
>

Why should we allow a node that is not acquired to be passed to bpf_rbtree_add?

> >> But, as you and Kumar talked about in the past and referenced in patch 1's
> >> thread, non-owning refs may alias each other, or an owning ref, and have no
> >> way of knowing whether this is the case. So if X and Y are two non-owning refs
> >> that alias each other, and bpf_rbtree_remove(tree, X) is called, a subsequent
> >> call to bpf_rbtree_remove(tree, Y) would be removing node from tree which
> >> already isn't in any tree (since prog has an owning ref to it). But verifier
> >> doesn't know X and Y alias each other. So previous paragraph's "forego
> >> runtime checks" statement can only hold if we invalidate all non-owning refs
> >> after 'destructive' rbtree_remove operation.
> >
> > right. we either invalidate all non-owning after bpf_rbtree_remove
> > or do run-time check in bpf_rbtree_remove.
> > Consider the following:
> > bpf_spin_lock
> > n = bpf_rbtree_first(root);
> > m = bpf_rbtree_first(root);
> > x = bpf_rbtree_remove(root, n)
> > y = bpf_rbtree_remove(root, m)
> > bpf_spin_unlock
> > if (x)
> >    bpf_obj_drop(x)
> > if (y)
> >    bpf_obj_drop(y)
> >
> > If we invalidate after bpf_rbtree_remove() the above will be rejected by the verifier.
> > If we do run-time check the above will be accepted and will work without crashing.
> >
>
> Agreed, although the above example's invalid double-remove of same node is
> the kind of thing I'd like to be prevented at verification time instead of
> runtime. Regardless, continuing with your runtime check idea.
>

I agree with Dave, it seems better to invalidate non-owning refs after first
remove rather than allowing this to work.

> > The problem with release_on_unlock is that it marks 'n' after 1st remove
> > as UNTRUSTED which means 'no write' and 'read via probe_read'.
> > That's not good imo.
> >
>
> Based on your response to the paragraph below this one, I think we're in agreement
> that using PTR_UNTRUSTED for non-owning refs gives them a bunch of traits they
> don't need, when I just wanted "can't pass ownership". So agreed that
> PTR_UNTRUSTED is too blunt an instrument here.
>

I think this is the part of the confusion which has left me wondering so far.
The discussion in this thread is making things more clear.

PTR_UNTRUSTED was never meant to be the kind of non-owning reference you want to
be returned from bpf_rbtree_first. PTR_TO_BTF_ID | MEM_ALLOC with ref_obj_id == 0
is the right choice.

> Regarding "marks 'n' after 1st remove", the series isn't currently doing this,
> I proposed it as a way to prevent aliasing problem, but I think your proposal
> is explicitly not trying to prevent aliasing problem at verification time. So
> for your semantics we would only have non-owning cleanup after spin_unlock.
> And such cleanup might just mark refs PTR_UNTRUSTED instead of invalidating
> entirely.
>

I would prefer proper invalidation using mark_reg_unknown.

> >>
> >> It doesn't matter to me which combination of type flags, ref_obj_id, other
> >> reg state stuff, and special-casing is used to implement owning and non-owning
> >> refs. Specific ones chosen in this series for rbtree node:
> >>
> >> owning ref: PTR_TO_BTF_ID | MEM_ALLOC (w/ type that contains bpf_rb_node)
> >>             ref_obj_id > 0
> >>
> >> non-owning ref: PTR_TO_BTF_ID | MEM_ALLOC (w/ type that contains bpf_rb_node)
> >>                 PTR_UNTRUSTED
> >>                   - used for "can't pass ownership", not PROBE_MEM
> >>                   - this is why I mentioned "decomposing UNTRUSTED into more
> >>                     granular reg traits" in another thread
> >
> > Now I understand, but that was very hard to grasp.
> > UNTRUSTED means 'no write' and 'read via probe_read'.
> > ref_set_release_on_unlock() also keeps ref_obj_id > 0 as you're correctly
> > pointing out below:
> >>                 ref_obj_id > 0
> >>                 release_on_unlock = true
> >>                   - used due to paragraphs starting with 2) above
> >
> > but the problem with ref_set_release_on_unlock() is that it mixes real ref-d
> > pointers with ref_obj_id > 0 with UNTRUSTED && ref_obj_id > 0.
> > And the latter is a quite confusing combination in my mind,
> > since we consider everything with ref_obj_id > 0 as good for KF_TRUSTED_ARGS.
> >
>
> I think I understand your desire to get rid of release_on_unlock now. It's not
> due to disliking the concept of "clean up non-owning refs after spin_unlock",
> which you earlier agreed was necessary, but rather the specifics of
> release_on_unlock mechanism used to achieve this.
>
> If so, I think I agree with your reasoning for why the mechanism is bad in
> light of how you want owning/non-owning implemented. To summarize your
> statements about release_on_unlock mechanism from the rest of your reply:
>
>   * 'ref_obj_id > 0' already has a specific meaning wrt. is_trusted_reg,
>     and we may want to support both TRUSTED and UNTRUSTED non-owning refs
>
>     * My comment: Currently is_trusted_reg is only used for
>       KF_ARG_PTR_TO_BTF_ID, while rbtree and list types are assigned special
>       KF_ARGs. So hypothetically could have different 'is_trusted_reg' logic.
>       I don't actually think that's a good idea, though, especially since
>       rbtree / list types are really specializations of PTR_TO_BTF_ID anyways.
>       So agreed.
>
>   * Instead of using 'acquire' and (modified) 'release', we can achieve
>     "clean-up non-owning after spin_unlock" by associating non-owning
>     refs with active_lock.id when they're created. We can store this in
>     reg.id, which is currently unused for PTR_TO_BTF_ID (afaict).
>

I don't mind using active_lock.id for invalidation, but using reg->id to
associate it with the reg is a bad idea IMO. reg->id is already preserved and
set when the object has a bpf_spin_lock in it, and reusing it would allow doing
bpf_spin_unlock with that non-owning ref if it has a spin lock, essentially
unlocking a different spin lock whenever the reg->btf of the already-locked
spin lock reg is the same, due to the same active_lock.id.

Even if you prevent it somehow it's more confusing to overload reg->id again for
this purpose.

It makes more sense to introduce a new, dedicated nonref_obj_id for this
purpose, to associate the reg back to the reg->id of the collection it came from.

Also, there are two cases of invalidation, one is on remove from rbtree, which
should only invalidate non-owning references into the rbtree, and one is on
unlock, which should invalidate all non-owning references.

bpf_rbtree_remove shouldn't invalidate non-owning into list protected by same
lock, but unlocking should do it for both rbtree and list non-owning refs it is
protecting.

So it seems you will have to maintain two IDs for non-owning references: one for
the collection it comes from, and one for the lock region it was obtained in.

>     * This will solve issue raised by previous point, allowing us to have
>       non-owning refs which are truly 'untrusted' according to is_trusted_reg.
>
>     * My comment: This all sounds reasonable. On spin_unlock we have
>       active_lock.id, so can do bpf_for_each_reg_in_vstate to look for
>       PTR_TO_BTF_IDs matching the id and do 'cleanup' for them.
>
> >> Any other combination of type and reg state that gives me the semantics def'd
> >> above works4me.
> >>
> >>
> >> Based on this reply and others from today, I think you're saying that these
> >> concepts should be implemented using:
> >>
> >> owning ref: PTR_TO_BTF_ID | MEM_ALLOC (w/ rb_node type)
> >>             PTR_TRUSTED
> >>             ref_obj_id > 0
> >
> > Almost.
> > I propose:
> > PTR_TO_BTF_ID | MEM_ALLOC  && ref_obj_id > 0
> >
> > See the definition of is_trusted_reg().
> > It's ref_obj_id > 0 || flag == (MEM_ALLOC | PTR_TRUSTED)
> >
> > I was saying 'trusted' because of is_trusted_reg() definition.
> > Sorry for confusion.
> >
>
> I see. Sounds reasonable.
>
> >> non-owning ref: PTR_TO_BTF_ID | MEM_ALLOC (w/ rb_node type)
> >>                 PTR_TRUSTED
> >>                 ref_obj_id == 0
> >>                  - used for "can't pass ownership", since funcs that expect
> >>                    owning ref need ref_obj_id > 0
> >
> > I propose:
> > PTR_TO_BTF_ID | MEM_ALLOC && ref_obj_id == 0
> >
>
> Also sounds reasonable, perhaps with the addition of id > 0 to account for
> your desired changes to release_on_unlock mechanism?
>
> > Both 'owning' and 'non-owning' will fit for KF_TRUSTED_ARGS kfuncs.
> >
> > And we will be able to pass 'non-owning' under spin_lock into other kfuncs
> > and owning outside of spin_lock into other kfuncs.
> > Which is a good thing.
> >
>
> Allowing passing of owning ref outside of spin_lock sounds reasonable to me.
> 'non-owning' under spinlock will have the same "what if this touches __key"
> issue I brought up in another thread. But you mentioned not preventing that
> and I don't necessarily disagree, so just noting here.
>

Yeah, I agree with Alexei that writing to the key is a non-issue. The 'less'
cb may not actually do the correct thing at all, so in that sense writing to
the key is a minor concern. In any case, violating the 'sorted' property is
not something we should be trying to prevent.

> >> And you're also adding 'untrusted' here, mainly as a result of
> >> bpf_rbtree_add(tree, node) - 'node' becoming untrusted after it's added,
> >> instead of becoming a non-owning ref. 'untrusted' would have state like:
> >>
> >> PTR_TO_BTF_ID | MEM_ALLOC (w/ rb_node type)
> >> PTR_UNTRUSTED
> >> ref_obj_id == 0?
> >
> > I'm not sure whether we really need full untrusted after going through bpf_rbtree_add()
> > or doing 'non-owning' is enough.
> > If it's full untrusted it will be:
> > PTR_TO_BTF_ID | PTR_UNTRUSTED && ref_obj_id == 0
> >
>
> Yeah, I don't see what this "full untrusted" is giving us either. Let's have
> "cleanup non-owning refs on spin_unlock" just invalidate the regs for now,
> instead of converting to "full untrusted"?
>

+1, I prefer invalidating completely on unlock.

> Adding "full untrusted" later won't make any valid programs written with
> "just invalidate the regs" in mind fail the verifier. So painless to add later.
>

+1

> > tbh I don't remember why we even have 'MEM_ALLOC | PTR_UNTRUSTED'.
> >

Eventually it will also be used for alloc obj kptr loaded from maps.

>
> I think such type combo was only added to implement non-owning refs. If it's
> rewritten to use your type combos I don't think there'll be any uses of
> MEM_ALLOC | PTR_UNTRUSTED remaining.
>

To be clear I was not intending to use PTR_UNTRUSTED to do such non-owning refs.

> >> I think your "non-owning ref" definition also differs from mine, specifically
> >> yours doesn't seem to have "will not page fault". For this reason, you don't
> >> see the need for release_on_unlock logic, since that's used to prevent refs
> >> escaping critical section and potentially referring to free'd memory.
> >
> > Not quite.
> > We should be able to read/write directly through
> > PTR_TO_BTF_ID | MEM_ALLOC && ref_obj_id == 0
> > and we need to convert it to __mark_reg_unknown() after bpf_spin_unlock
> > the way release_reference() is doing.
> > I'm just not happy with using acquire_reference/release_reference() logic
> > (as release_on_unlock is doing) for cleaning after unlock.
> > Since we need to clean 'non-owning' ptrs in unlock it's confusing
> > to call the process 'release'.
> > I was hoping we can search through all states and __mark_reg_unknown() (or UNTRUSTED)
> > every reg where
> > reg->id == cur_state->active_lock.id &&
> > flag == PTR_TO_BTF_ID | MEM_ALLOC && ref_obj_id == 0
> >
> > By deleting relase_on_unlock I meant delete release_on_unlock flag
> > and remove ref_set_release_on_unlock.
> >
>
> Summarized above, but: agreed, and thanks for clarifying what you meant by
> "delete release_on_unlock".
>
> >> This is where I start to get confused. Some questions:
> >>
> >>   * If we get rid of release_on_unlock, and with mass invalidation of
> >>     non-owning refs entirely, shouldn't non-owning refs be marked PTR_UNTRUSTED?
> >
> > Since we'll be cleaning all
> > PTR_TO_BTF_ID | MEM_ALLOC && ref_obj_id == 0
> > it shouldn't affect ptrs with ref_obj_id > 0 that came from bpf_obj_new.
> >
> > The verifier already enforces that bpf_spin_unlock will be present
> > at the right place in bpf prog.
> > When the verifier sees it it will clean all non-owning refs with this spinlock 'id'.
> > So no concerns of leaking 'non-owning' outside.
> >
>
> Sounds like we don't want "full untrusted" or any PTR_UNTRUSTED non-owning ref.
>
> > While processing bpf_rbtree_first we need to:
> > regs[BPF_REG_0].type = PTR_TO_BTF_ID | MEM_ALLOC;
> > regs[BPF_REG_0].id = active_lock.id;
> > regs[BPF_REG_0].ref_obj_id = 0;
> >
>
> Agreed.
>

I'm a bit concerned about putting active_lock.id in reg->id. I don't object to the
idea, but to the implementation, since we accept PTR_TO_BTF_ID | MEM_ALLOC in
bpf_spin_lock/bpf_spin_unlock. It will lead to confusion. Currently this exact
reg->type never has reg->ref_obj_id == 0. Maybe that needs to be checked for
those helper calls.

Just thinking out loud; maybe it's fine, but we need to be careful: reg->id
changes meaning when ref_obj_id == 0.

> >>   * Since refs can alias each other, how to deal with bpf_obj_drop-and-reuse
> >>     in this scheme, since non-owning ref can escape spin_unlock b/c no mass
> >>     invalidation? PTR_UNTRUSTED isn't sufficient here
> >
> > run-time check in bpf_rbtree_remove (and in the future bpf_list_remove)
> > should address it, no?
> >
>
> If we don't do "full untrusted" and cleanup non-owning refs by invalidating,
> _and_ don't allow bpf_obj_{new,drop} in critical section, then I don't think
> this is an issue.
>

bpf_obj_drop, if/when enabled, can also do invalidation. But let's table that
discussion until we introduce it. We most likely won't need it inside the CS.

> But to elaborate on the issue, if we instead cleaned up non-owning by marking
> untrusted:
>
> struct node_data *n = bpf_obj_new(typeof(*n));
> struct node_data *m, *o;
> struct some_other_type *t;
>
> bpf_spin_lock(&lock);
>
> bpf_rbtree_add(&tree, n);
> m = bpf_rbtree_first();
> o = bpf_rbtree_first(); // m and o are non-owning, point to same node
>
> m = bpf_rbtree_remove(&tree, m); // m is owning
>
> bpf_spin_unlock(&lock); // o is "full untrusted", marked PTR_UNTRUSTED
>
> bpf_obj_drop(m);
> t = bpf_obj_new(typeof(*t)); // pretend that exact chunk of memory that was
>                              // dropped in previous statement is returned here
>
> data = o->some_data_field;   // PROBE_MEM, but no page fault, so load will
>                              // succeed, but will read garbage from another type
>                              // while verifier thinks it's reading from node_data
>
>
> If we clean up by invalidating, but eventually enable bpf_obj_{new,drop} inside
> critical section, we'll have similar issue.
>
> It's not necessarily "crash the kernel" dangerous, but it may anger program
> writers since they can't be sure they're not reading garbage in this scenario.
>

I think it's better to clean by invalidating. We have better tools to form
untrusted pointers (like bpf_rdonly_cast) now if the BPF program writer needs
such an escape hatch for some reason. It's also easier to review where an
untrusted pointer is being used in a program, and has zero cost at runtime.

> >>   * If non-owning ref can live past spin_unlock, do we expect read from
> >>     such ref after _unlock to go through bpf_probe_read()? Otherwise direct
> >>     read might fault and silently write 0.
> >
> > unlock has to clean them.
> >
>
> Ack.
>
> >>   * For your 'untrusted', but not non-owning ref concept, I'm not sure
> >>     what this gives us that's better than just invalidating the ref which
> >>     gets in this state (rbtree_{add,remove} 'node' arg, bpf_obj_drop node)
> >
> > Whether to mark unknown or untrusted or non-owning after bpf_rbtree_add() is a difficult one.
> > Untrusted will allow prog to do read only access (via probe_read) into the node
> > but might hide bugs.
> > The cleanup after bpf_spin_unlock of non-owning and clean up after
> > bpf_rbtree_add() does not have to be the same.
>
> This is a good point.
>

So far I'm leaning towards:

bpf_rbtree_add(node) : node becomes non-owned ref
bpf_spin_unlock(lock) : node is invalidated

> > Currently I'm leaning towards PTR_UNTRUSTED for cleanup after bpf_spin_unlock
> > and non-owning after bpf_rbtree_add.
> >
> > Walking the example from previous email:
> >
> > struct bpf_rbtree_iter it;
> > struct bpf_rb_node * node;
> > struct bpf_rb_node *n, *m;
> >
> > bpf_rbtree_iter_init(&it, rb_root); // locks the rbtree; works as bpf_spin_lock
> > while ((node = bpf_rbtree_iter_next(&it))) {
> >   // node -> PTR_TO_BTF_ID | MEM_ALLOC | MAYBE_NULL && ref_obj_id == 0
> >   if (node && node->field == condition) {
> >
> >     n = bpf_rbtree_remove(rb_root, node);
> >     if (!n) ...;
> >     // n -> PTR_TO_BTF_ID | MEM_ALLOC && ref_obj_id == X
> >     m = bpf_rbtree_remove(rb_root, node); // ok, but fails in run-time
> >     if (!m) ...;
> >     // m -> PTR_TO_BTF_ID | MEM_ALLOC && ref_obj_id == Y
> >

I would simply disallow this second remove during verification, as Dave is
suggesting, by invalidating non-owning refs for rb_root.

> >     // node is still:
> >     // node -> PTR_TO_BTF_ID | MEM_ALLOC && ref_obj_id == 0 && id == active_lock[0].id
> >
> >     // assume we allow double locks one day
> >     bpf_spin_lock(another_rb_root);
> >     bpf_rbtree_add(another_rb_root, n);
> >     // n -> PTR_TO_BTF_ID | MEM_ALLOC && ref_obj_id == 0 && id == active_lock[1].id
> >     bpf_spin_unlock(another_rb_root);
> >     // n -> PTR_TO_BTF_ID | PTR_UNTRUSTED && ref_obj_id == 0
> >     break;
> >   }
> > }
> > // node -> PTR_TO_BTF_ID | MEM_ALLOC && ref_obj_id == 0 && id == active_lock[0].id
> > bpf_rbtree_iter_destroy(&it); // does unlock
> > // node -> PTR_TO_BTF_ID | PTR_UNTRUSTED
> > // n -> PTR_TO_BTF_ID | PTR_UNTRUSTED
> > // m -> PTR_TO_BTF_ID | MEM_ALLOC && ref_obj_id == Y
> > bpf_obj_drop(m);
>
> This seems like a departure from other statements in your reply, where you're
> leaning towards "non-owning and trusted" -> "full untrusted" after unlock
> being unnecessary. I think the combo of reference aliases + bpf_obj_drop-and-
> reuse make everything hard to reason about.
>
> Regardless, your comments annotating reg state look correct to me.

I think it's much clearer in this thread what you wanted to do. It would
be good, after the thread concludes, to summarize how you're going to
finally implement all this before respinning.
Alexei Starovoitov Dec. 8, 2022, 8:36 p.m. UTC | #9
On Thu, Dec 08, 2022 at 06:27:29PM +0530, Kumar Kartikeya Dwivedi wrote:
> 
> I don't mind using active_lock.id for invalidation, but using reg->id to
> associate it with reg is a bad idea IMO, it's already preserved and set when the
> object has bpf_spin_lock in it, and it's going to allow doing bpf_spin_unlock
> with that non-owning ref if it has a spin lock, essentially unlocking a different
> spin lock if the reg->btf of already locked spin lock reg is same due to same
> active_lock.id.

Right. Overwriting reg->id was a bad idea.

> Even if you prevent it somehow it's more confusing to overload reg->id again for
> this purpose.
> 
> It makes more sense to introduce a new nonref_obj_id instead dedicated for this
> purpose, to associate it back to the reg->id of the collection it is coming from.

nonref_obj_id name sounds too generic and I'm not sure that it shouldn't be
connected to reg->id the way we do it for ref_obj_id.

> Also, there are two cases of invalidation, one is on remove from rbtree, which
> should only invalidate non-owning references into the rbtree, and one is on
> unlock, which should invalidate all non-owning references.

Two cases only if we're going to do invalidation on rbtree_remove.

> bpf_rbtree_remove shouldn't invalidate non-owning refs into a list protected by
> the same lock, but unlocking should do it for both rbtree and list non-owning
> refs it is protecting.
> 
> So it seems you will have to maintain two IDs for non-owning references, one for
> the collection it comes from, and one for the lock region it is obtained in.

Right. Like this ?
collection_id = rbroot->reg->id; // to track the collection it came from
active_lock_id = cur_state->active_lock.id // to track the lock region

but before we proceed let me demonstrate an example where
cleanup on rbtree_remove is not user friendly:

bpf_spin_lock
x = bpf_list_first(); if (!x) ..
y = bpf_list_last(); if (!y) ..

n = bpf_list_remove(x); if (!n) ..

bpf_list_add_after(n, y); // we should allow this
bpf_spin_unlock

We don't have such apis right now.
The point here is that cleanup after bpf_list_remove/bpf_rbtree_remove will destroy
all regs that point somewhere in the collection.
This way we save a run-time check in bpf_rbtree_remove, but sacrifice usability.

x and y could be pointing to the same thing.
In such a case bpf_list_add_after() should fail at runtime after discovering
that 'y' is unlinked.

Similarly with bpf_rbtree_add().
Currently it cannot fail. It takes owning ref and will release it.
We can mark it as KF_RELEASE and no extra verifier changes necessary.

But in the future we might have failing add/insert operations on lists and rbtree.
If they're failing we'd need to struggle with 'conditional release' verifier additions,
the bpf prog would need to check return value, etc.

I think we better deal with it in run-time.
The verifier could supply bpf_list_add_after() with two hidden args:
- container_of offset (delta between rb_node and beginning of prog's struct)
- struct btf_struct_meta *meta
Then inside bpf_list_add_after or any failing KF_RELEASE kfunc
it can call bpf_obj_drop_impl() on that element.
Then from the verifier pov the KF_RELEASE function did the release
and 'owning ref' became 'non-owning ref'.
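To make the run-time-check direction concrete, here is a small user-space sketch (plain C; all names are invented stand-ins, not actual kernel APIs). remove() poisons the node's links so a second remove, or an add-after using an already-removed anchor, fails at run time; and the failing add drops the element itself, emulating the bpf_obj_drop_impl()-on-failure idea so the KF_RELEASE contract holds from the verifier's point of view:

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative model only: an unlinked node is detected by next == NULL,
 * emulating poisoned links in the real implementation. */
struct node { struct node *prev, *next; };

static int dropped_cnt; /* counts emulated bpf_obj_drop_impl() calls */

static void obj_drop(struct node *n) /* stand-in for bpf_obj_drop_impl() */
{
	n->prev = n->next = NULL;
	dropped_cnt++;
}

static int is_unlinked(const struct node *n)
{
	return n->next == NULL;
}

/* bpf_list_remove()-like: nop and fail at run time if already unlinked */
static int list_remove(struct node *n)
{
	if (is_unlinked(n))
		return -1;
	n->prev->next = n->next;
	n->next->prev = n->prev;
	n->prev = n->next = NULL; /* poison so a second remove fails */
	return 0;
}

/* bpf_list_add_after()-like KF_RELEASE kfunc: on failure it drops 'n' itself,
 * so ownership always transfers from the verifier's point of view */
static int list_add_after(struct node *pos, struct node *n)
{
	if (is_unlinked(pos)) {
		obj_drop(n);
		return -1;
	}
	n->next = pos->next;
	n->prev = pos;
	pos->next->prev = n;
	pos->next = n;
	return 0;
}
```

With x and y aliasing the same element, the second remove and the add-after both fail cleanly at run time instead of corrupting the list, and the failed add releases its argument.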

> > >> And you're also adding 'untrusted' here, mainly as a result of
> > >> bpf_rbtree_add(tree, node) - 'node' becoming untrusted after it's added,
> > >> instead of becoming a non-owning ref. 'untrusted' would have state like:
> > >>
> > >> PTR_TO_BTF_ID | MEM_ALLOC (w/ rb_node type)
> > >> PTR_UNTRUSTED
> > >> ref_obj_id == 0?
> > >
> > > I'm not sure whether we really need full untrusted after going through bpf_rbtree_add()
> > > or doing 'non-owning' is enough.
> > > If it's full untrusted it will be:
> > > PTR_TO_BTF_ID | PTR_UNTRUSTED && ref_obj_id == 0
> > >
> >
> > Yeah, I don't see what this "full untrusted" is giving us either. Let's have
> > "cleanup non-owning refs on spin_unlock" just invalidate the regs for now,
> > instead of converting to "full untrusted"?
> >
> 
> +1, I prefer invalidating completely on unlock.

fine by me.

> 
> I think it's better to clean by invalidating. We have better tools to form
> untrusted pointers (like bpf_rdonly_cast) now if the BPF program writer needs
> such an escape hatch for some reason. It's also easier to review where an
> untrusted pointer is being used in a program, and has zero cost at runtime.

ok. Since it's more strict we can relax to untrusted later if necessary.

> So far I'm leaning towards:
> 
> bpf_rbtree_add(node) : node becomes non-owned ref
> bpf_spin_unlock(lock) : node is invalidated

ok

> > > Currently I'm leaning towards PTR_UNTRUSTED for cleanup after bpf_spin_unlock
> > > and non-owning after bpf_rbtree_add.
> > >
> > > Walking the example from previous email:
> > >
> > > struct bpf_rbtree_iter it;
> > > struct bpf_rb_node * node;
> > > struct bpf_rb_node *n, *m;
> > >
> > > bpf_rbtree_iter_init(&it, rb_root); // locks the rbtree; works as bpf_spin_lock
> > > while ((node = bpf_rbtree_iter_next(&it))) {
> > >   // node -> PTR_TO_BTF_ID | MEM_ALLOC | MAYBE_NULL && ref_obj_id == 0
> > >   if (node && node->field == condition) {
> > >
> > >     n = bpf_rbtree_remove(rb_root, node);
> > >     if (!n) ...;
> > >     // n -> PTR_TO_BTF_ID | MEM_ALLOC && ref_obj_id == X
> > >     m = bpf_rbtree_remove(rb_root, node); // ok, but fails in run-time
> > >     if (!m) ...;
> > >     // m -> PTR_TO_BTF_ID | MEM_ALLOC && ref_obj_id == Y
> > >
> 
> I would simply disallow this second remove during verification, as Dave is
> suggesting, by invalidating non-owning refs for rb_root.

Looks like cleanup from non-owning to untrusted|unknown on bpf_rbtree_remove is our
only remaining disagreement.
I feel run-time checks will be fast enough and will improve usability.

Also it feels that not doing cleanup on rbtree_remove is simpler to
implement and reason about.

Here is the proposal with one new field 'active_lock_id':

first = bpf_rbtree_first(root) KF_RET_NULL
  check_reg_allocation_locked() checks that root->reg->id == cur->active_lock.id
  R0 = PTR_TO_BTF_ID|MEM_ALLOC|PTR_MAYBE_NULL ref_obj_id = 0;
  R0->active_lock_id = root->reg->id
  R0->id = ++env->id_gen; which will be cleared after !NULL check inside prog.

same way we can add rb_find, rb_find_first,
but not rb_next, rb_prev, since they don't have 'root' argument.

bpf_rbtree_add(root, node, cb); KF_RELEASE.
  needs to see PTR_TO_BTF_ID|MEM_ALLOC node->ref_obj_id > 0
  check_reg_allocation_locked() checks that root->reg->id == cur->active_lock.id
  calls release_reference(node->ref_obj_id)
  converts 'node' to PTR_TO_BTF_ID|MEM_ALLOC ref_obj_id = 0;
  node->active_lock_id = root->reg->id

'node' is equivalent to 'first'. They both point to some element
inside rbtree and valid inside spin_locked region.
It's ok to read|write to both under lock.

removed_node = bpf_rbtree_remove(root, node); KF_ACQUIRE|KF_RET_NULL
  need to see PTR_TO_BTF_ID|MEM_ALLOC node->ref_obj_id = 0; and 
  usual check_reg_allocation_locked(root)
  R0 = PTR_TO_BTF_ID|MEM_ALLOC|MAYBE_NULL
  R0->ref_obj_id = R0->id = acquire_reference_state();
  R0->active_lock_id should stay 0
  mark_reg_unknown(node)

bpf_spin_unlock(lock);
  checks lock->id == cur->active_lock.id
  for all regs in state 
    if (reg->active_lock_id == lock->id)
       mark_reg_unknown(reg)
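For illustration, the proposal above can be modeled with a few lines of user-space C (flag values, function names, and the active_lock_id field are all hypothetical stand-ins for verifier internals):

```c
#include <assert.h>

/* Invented flag values; the real verifier's type bits differ. */
enum { PTR_TO_BTF_ID = 1, MEM_ALLOC = 2, SCALAR_UNKNOWN = 128 };

struct reg_state {
	int type;
	int id;
	int ref_obj_id;     /* > 0: owning ref */
	int active_lock_id; /* proposed field: ties a non-owning ref to a lock region */
};

static int id_gen;

/* bpf_rbtree_first(root): non-owning ref tied to the held lock's id */
static void mark_first(struct reg_state *r, int lock_id)
{
	r->type = PTR_TO_BTF_ID | MEM_ALLOC;
	r->ref_obj_id = 0;
	r->active_lock_id = lock_id;
	r->id = ++id_gen; /* cleared after the !NULL check in the prog */
}

/* bpf_rbtree_add(root, node): KF_RELEASE; owning ref becomes non-owning */
static void mark_added(struct reg_state *node, int lock_id)
{
	node->ref_obj_id = 0;
	node->active_lock_id = lock_id;
}

/* bpf_rbtree_remove(root, node): KF_ACQUIRE; result is owning, lock id stays 0.
 * (The verifier would also mark_reg_unknown() the 'node' argument; omitted.) */
static void mark_removed(struct reg_state *r)
{
	r->type = PTR_TO_BTF_ID | MEM_ALLOC;
	r->ref_obj_id = r->id = ++id_gen;
	r->active_lock_id = 0;
}

/* bpf_spin_unlock(lock): clobber every non-owning ref of this lock region */
static void unlock_clean(struct reg_state *regs, int n, int lock_id)
{
	for (int i = 0; i < n; i++)
		if (regs[i].active_lock_id == lock_id)
			regs[i].type = SCALAR_UNKNOWN; /* mark_reg_unknown() */
}
```

Owning refs (ref_obj_id > 0, active_lock_id == 0) survive the unlock; everything tied to the lock region is clobbered, i.e. the "invalidate completely on unlock" behavior agreed on above.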
Dave Marchevsky Dec. 8, 2022, 11:35 p.m. UTC | #10
On 12/8/22 3:36 PM, Alexei Starovoitov wrote:
> On Thu, Dec 08, 2022 at 06:27:29PM +0530, Kumar Kartikeya Dwivedi wrote:
>>
>> I don't mind using active_lock.id for invalidation, but using reg->id to
>> associate it with reg is a bad idea IMO, it's already preserved and set when the
>> object has bpf_spin_lock in it, and it's going to allow doing bpf_spin_unlock
>> with that non-owning ref if it has a spin lock, essentially unlocking a different
>> spin lock if the reg->btf of already locked spin lock reg is same due to same
>> active_lock.id.
> 
> Right. Overwriting reg->id was a bad idea.
> 
>> Even if you prevent it somehow it's more confusing to overload reg->id again for
>> this purpose.
>>
>> It makes more sense to introduce a new nonref_obj_id instead dedicated for this
>> purpose, to associate it back to the reg->id of the collection it is coming from.
> 
> nonref_obj_id name sounds too generic and I'm not sure that it shouldn't be
> connected to reg->id the way we do it for ref_obj_id.
> 
>> Also, there are two cases of invalidation, one is on remove from rbtree, which
>> should only invalidate non-owning references into the rbtree, and one is on
>> unlock, which should invalidate all non-owning references.
> 
> Two cases only if we're going to do invalidation on rbtree_remove.
> 
>> bpf_rbtree_remove shouldn't invalidate non-owning refs into a list protected by
>> the same lock, but unlocking should do it for both rbtree and list non-owning
>> refs it is protecting.
>>
>> So it seems you will have to maintain two IDs for non-owning references, one for
>> the collection it comes from, and one for the lock region it is obtained in.
> 
> Right. Like this ?
> collection_id = rbroot->reg->id; // to track the collection it came from
> active_lock_id = cur_state->active_lock.id // to track the lock region
> 
> but before we proceed let me demonstrate an example where
> cleanup on rbtree_remove is not user friendly:
> 
> bpf_spin_lock
> x = bpf_list_first(); if (!x) ..
> y = bpf_list_last(); if (!y) ..
> 
> n = bpf_list_remove(x); if (!n) ..
> 
> bpf_list_add_after(n, y); // we should allow this
> bpf_spin_unlock
> 
> We don't have such apis right now.
> The point here is that cleanup after bpf_list_remove/bpf_rbtree_remove will destroy
> all regs that point somewhere in the collection.
> This way we save a run-time check in bpf_rbtree_remove, but sacrifice usability.
> 
> x and y could be pointing to the same thing.
> In such a case bpf_list_add_after() should fail at runtime after discovering
> that 'y' is unlinked.
> 
> Similarly with bpf_rbtree_add().
> Currently it cannot fail. It takes owning ref and will release it.
> We can mark it as KF_RELEASE and no extra verifier changes necessary.
> 
> But in the future we might have failing add/insert operations on lists and rbtree.
> If they're failing we'd need to struggle with 'conditional release' verifier additions,
> the bpf prog would need to check return value, etc.
> 
> I think we better deal with it in run-time.
> The verifier could supply bpf_list_add_after() with two hidden args:
> - container_of offset (delta between rb_node and beginning of prog's struct)
> - struct btf_struct_meta *meta
> Then inside bpf_list_add_after or any failing KF_RELEASE kfunc
> it can call bpf_obj_drop_impl() on that element.
> Then from the verifier pov the KF_RELEASE function did the release
> and 'owning ref' became 'non-owning ref'.
> 
>>>>> And you're also adding 'untrusted' here, mainly as a result of
>>>>> bpf_rbtree_add(tree, node) - 'node' becoming untrusted after it's added,
>>>>> instead of becoming a non-owning ref. 'untrusted' would have state like:
>>>>>
>>>>> PTR_TO_BTF_ID | MEM_ALLOC (w/ rb_node type)
>>>>> PTR_UNTRUSTED
>>>>> ref_obj_id == 0?
>>>>
>>>> I'm not sure whether we really need full untrusted after going through bpf_rbtree_add()
>>>> or doing 'non-owning' is enough.
>>>> If it's full untrusted it will be:
>>>> PTR_TO_BTF_ID | PTR_UNTRUSTED && ref_obj_id == 0
>>>>
>>>
>>> Yeah, I don't see what this "full untrusted" is giving us either. Let's have
>>> "cleanup non-owning refs on spin_unlock" just invalidate the regs for now,
>>> instead of converting to "full untrusted"?
>>>
>>
>> +1, I prefer invalidating completely on unlock.
> 
> fine by me.
> 
>>
>> I think it's better to clean by invalidating. We have better tools to form
>> untrusted pointers (like bpf_rdonly_cast) now if the BPF program writer needs
>> such an escape hatch for some reason. It's also easier to review where an
>> untrusted pointer is being used in a program, and has zero cost at runtime.
> 
> ok. Since it's more strict we can relax to untrusted later if necessary.
> 
>> So far I'm leaning towards:
>>
>> bpf_rbtree_add(node) : node becomes non-owned ref
>> bpf_spin_unlock(lock) : node is invalidated
> 
> ok
> 
>>>> Currently I'm leaning towards PTR_UNTRUSTED for cleanup after bpf_spin_unlock
>>>> and non-owning after bpf_rbtree_add.
>>>>
>>>> Walking the example from previous email:
>>>>
>>>> struct bpf_rbtree_iter it;
>>>> struct bpf_rb_node * node;
>>>> struct bpf_rb_node *n, *m;
>>>>
>>>> bpf_rbtree_iter_init(&it, rb_root); // locks the rbtree; works as bpf_spin_lock
>>>> while ((node = bpf_rbtree_iter_next(&it))) {
>>>>   // node -> PTR_TO_BTF_ID | MEM_ALLOC | MAYBE_NULL && ref_obj_id == 0
>>>>   if (node && node->field == condition) {
>>>>
>>>>     n = bpf_rbtree_remove(rb_root, node);
>>>>     if (!n) ...;
>>>>     // n -> PTR_TO_BTF_ID | MEM_ALLOC && ref_obj_id == X
>>>>     m = bpf_rbtree_remove(rb_root, node); // ok, but fails in run-time
>>>>     if (!m) ...;
>>>>     // m -> PTR_TO_BTF_ID | MEM_ALLOC && ref_obj_id == Y
>>>>
>>
>> I would simply disallow this second remove during verification, as Dave is
>> suggesting, by invalidating non-owning refs for rb_root.
> 
> Looks like cleanup from non-owning to untrusted|unknown on bpf_rbtree_remove is our
> only remaining disagreement.
> I feel run-time checks will be fast enough and will improve usability.
> 
> Also it feels that not doing cleanup on rbtree_remove is simpler to
> implement and reason about.
> 
> Here is the proposal with one new field 'active_lock_id':
> 
> first = bpf_rbtree_first(root) KF_RET_NULL
>   check_reg_allocation_locked() checks that root->reg->id == cur->active_lock.id
>   R0 = PTR_TO_BTF_ID|MEM_ALLOC|PTR_MAYBE_NULL ref_obj_id = 0;
>   R0->active_lock_id = root->reg->id
>   R0->id = ++env->id_gen; which will be cleared after !NULL check inside prog.
> 
> same way we can add rb_find, rb_find_first,
> but not rb_next, rb_prev, since they don't have 'root' argument.
> 
> bpf_rbtree_add(root, node, cb); KF_RELEASE.
>   needs to see PTR_TO_BTF_ID|MEM_ALLOC node->ref_obj_id > 0
>   check_reg_allocation_locked() checks that root->reg->id == cur->active_lock.id
>   calls release_reference(node->ref_obj_id)
>   converts 'node' to PTR_TO_BTF_ID|MEM_ALLOC ref_obj_id = 0;
>   node->active_lock_id = root->reg->id
> 
> 'node' is equivalent to 'first'. They both point to some element
> inside rbtree and valid inside spin_locked region.
> It's ok to read|write to both under lock.
> 
> removed_node = bpf_rbtree_remove(root, node); KF_ACQUIRE|KF_RET_NULL
>   need to see PTR_TO_BTF_ID|MEM_ALLOC node->ref_obj_id = 0; and 
>   usual check_reg_allocation_locked(root)
>   R0 = PTR_TO_BTF_ID|MEM_ALLOC|MAYBE_NULL
>   R0->ref_obj_id = R0->id = acquire_reference_state();
>   R0->active_lock_id should stay 0
>   mark_reg_unknown(node)
> 
> bpf_spin_unlock(lock);
>   checks lock->id == cur->active_lock.id
>   for all regs in state 
>     if (reg->active_lock_id == lock->id)
>        mark_reg_unknown(reg)

OK, so sounds like a few more points of agreement, regardless of whether
we go the runtime checking route or the other one:

  * We're tossing 'full untrusted' for now. non-owning references will not be
    allowed to escape critical section. They'll be clobbered w/
    mark_reg_unknown.
    * No pressing need to make bpf_obj_drop callable from critical section.
      As a result no owning or non-owning ref access can page fault.

  * When spin_lock is unlocked, verifier needs to know about all non-owning
    references so that it can clobber them. Current implementation -
    ref_obj_id + release_on_unlock - is bad for a number of reasons, should
    be replaced with something that doesn't use ref_obj_id or reg->id.
    * Specific better approach was proposed above: new field + keep track
      of lock and datastructure identity.


Differences in proposed approaches:

"Type System checks + invalidation on 'destructive' rbtree ops"

  * This approach tries to prevent aliasing problems by invalidating
    non-owning refs after 'destructive' rbtree ops - like rbtree_remove -
    in addition to invalidation on spin_unlock

  * Type system guarantees invariants:
    * "if it's an owning ref, the node is guaranteed to not be in an rbtree"
    * "if it's a non-owning ref, the node is guaranteed to be in an rbtree"

  * Downside: mass non-owning ref invalidation on rbtree_remove will cause some
    programs that logically don't have an aliasing problem to be rejected by the
    verifier. Will affect usability depending on how bad this is.
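A small user-space sketch of this downside (all names invented): under collection-scoped invalidation, a single bpf_rbtree_remove() clobbers every non-owning ref into that root, including aliases the program never uses unsafely:

```c
#include <assert.h>

/* Invented model of the "type system" approach: each non-owning ref records
 * the collection (root) it points into; a remove clobbers all refs into it. */
struct reg_state {
	int valid;         /* 1 = usable non-owning ref, 0 = mark_reg_unknown()'d */
	int collection_id; /* reg->id of the rb_root this ref points into */
};

static void mark_nonowning(struct reg_state *r, int root_id)
{
	r->valid = 1;
	r->collection_id = root_id;
}

/* bpf_rbtree_remove(root, node): invalidate every non-owning ref into 'root' */
static void remove_clean(struct reg_state *regs, int n, int root_id)
{
	for (int i = 0; i < n; i++)
		if (regs[i].collection_id == root_id)
			regs[i].valid = 0;
}
```

Refs into other collections under the same lock survive, but every alias into the removed-from root is lost, which is exactly the usability cost described above.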


"Runtime checks + spin_unlock invalidation only"

  * This approach allows for the possibility of aliasing problem. As a result
    the invariants guaranteed in point 2 above don't necessarily hold.
    * Helpers that add or remove need to account for possibility that the node
      they're operating on has already been added / removed. Need to check this
      at runtime and nop if so.

  * non-owning refs are only invalidated on spin_unlock.
    * As a result, usability issues of previous approach don't happen here.

  * Downside: Need to do runtime checks, some additional verifier complexity
    to deal with "runtime check failed" case due to prev approach's invariant
    not holding

Conversion of non-owning refs to 'untrusted' at an invalidation point (unlock
or remove) can be added to either approach (maybe - at least it was specifically
discussed for "runtime checks"). Such untrusted refs, by virtue of being
PTR_UNTRUSTED, can fault, and aren't accepted by rbtree_{add, remove} as input.
For the "type system" approach this might ameliorate some of the usability
issues. For the "runtime checks" approach it would only be useful to let
such refs escape spin_unlock.

But we're not going to do non-owning -> 'untrusted' for now, just listing for
completeness.


The distance between what I have now and the "type system" approach is smaller
than the distance to the "runtime checks" approach. And to get from "type system"
to "runtime checks" I'd need to:

  * Remove 'destructive op' invalidation points
  * Add runtime checks to rbtree_{add,remove}
  * Add verifier handling of runtime check failure possibility

Of which only the first point is getting rid of something added for the
"type system" approach, and won't be much work relative to all the refactoring
and other improvements that are common between the two approaches.

So for V2 I will do the "type system + invalidation on 'destructive' ops"
approach as it'll take less time. This'll get eyes on common improvements
faster. Then can do a "runtime checks" v3 and we can compare usability of both
on same base.
Alexei Starovoitov Dec. 9, 2022, 12:39 a.m. UTC | #11
On Thu, Dec 08, 2022 at 06:35:24PM -0500, Dave Marchevsky wrote:
> > 
> > Here is the proposal with one new field 'active_lock_id':
> > 
> > first = bpf_rbtree_first(root) KF_RET_NULL
> >   check_reg_allocation_locked() checks that root->reg->id == cur->active_lock.id
> >   R0 = PTR_TO_BTF_ID|MEM_ALLOC|PTR_MAYBE_NULL ref_obj_id = 0;
> >   R0->active_lock_id = root->reg->id
> >   R0->id = ++env->id_gen; which will be cleared after !NULL check inside prog.
> > 
> > same way we can add rb_find, rb_find_first,
> > but not rb_next, rb_prev, since they don't have 'root' argument.
> > 
> > bpf_rbtree_add(root, node, cb); KF_RELEASE.
> >   needs to see PTR_TO_BTF_ID|MEM_ALLOC node->ref_obj_id > 0
> >   check_reg_allocation_locked() checks that root->reg->id == cur->active_lock.id
> >   calls release_reference(node->ref_obj_id)
> >   converts 'node' to PTR_TO_BTF_ID|MEM_ALLOC ref_obj_id = 0;
> >   node->active_lock_id = root->reg->id
> > 
> > 'node' is equivalent to 'first'. They both point to some element
> > inside rbtree and valid inside spin_locked region.
> > It's ok to read|write to both under lock.
> > 
> > removed_node = bpf_rbtree_remove(root, node); KF_ACQUIRE|KF_RET_NULL
> >   need to see PTR_TO_BTF_ID|MEM_ALLOC node->ref_obj_id = 0; and 
> >   usual check_reg_allocation_locked(root)
> >   R0 = PTR_TO_BTF_ID|MEM_ALLOC|MAYBE_NULL
> >   R0->ref_obj_id = R0->id = acquire_reference_state();
> >   R0->active_lock_id should stay 0
> >   mark_reg_unknown(node)
> > 
> > bpf_spin_unlock(lock);
> >   checks lock->id == cur->active_lock.id
> >   for all regs in state 
> >     if (reg->active_lock_id == lock->id)
> >        mark_reg_unknown(reg)
> 
> OK, so sounds like a few more points of agreement, regardless of whether
> we go the runtime checking route or the other one:
> 
>   * We're tossing 'full untrusted' for now. non-owning references will not be
>     allowed to escape critical section. They'll be clobbered w/
>     mark_reg_unknown.

agree

>     * No pressing need to make bpf_obj_drop callable from critical section.
>       As a result no owning or non-owning ref access can page fault.

agree

> 
>   * When spin_lock is unlocked, verifier needs to know about all non-owning
>     references so that it can clobber them. Current implementation -
>     ref_obj_id + release_on_unlock - is bad for a number of reasons, should
>     be replaced with something that doesn't use ref_obj_id or reg->id.
>     * Specific better approach was proposed above: new field + keep track
>       of lock and datastructure identity.

yes

> 
> Differences in proposed approaches:
> 
> "Type System checks + invalidation on 'destructive' rbtree ops"
> 
>   * This approach tries to prevent aliasing problems by invalidating
>     non-owning refs after 'destructive' rbtree ops - like rbtree_remove -
>     in addition to invalidation on spin_unlock
> 
>   * Type system guarantees invariants:
>     * "if it's an owning ref, the node is guaranteed to not be in an rbtree"
>     * "if it's a non-owning ref, the node is guaranteed to be in an rbtree"
> 
>   * Downside: mass non-owning ref invalidation on rbtree_remove will cause some
>     programs that logically don't have an aliasing problem to be rejected by the
>     verifier. Will affect usability depending on how bad this is.

yes.

> 
> 
> "Runtime checks + spin_unlock invalidation only"
> 
>   * This approach allows for the possibility of aliasing problem. As a result
>     the invariants guaranteed in point 2 above don't necessarily hold.
>     * Helpers that add or remove need to account for possibility that the node
>       they're operating on has already been added / removed. Need to check this
>       at runtime and nop if so.

Only 'remove' needs to check.
'add' is operating on 'owning ref'. It cannot fail.
Some future 'add_here(root, owning_node_to_add, nonowning_location)'
may need to fail.

> 
>   * non-owning refs are only invalidated on spin_unlock.
>     * As a result, usability issues of previous approach don't happen here.
> 
>   * Downside: Need to do runtime checks, some additional verifier complexity
>     to deal with "runtime check failed" case due to prev approach's invariant
>     not holding
> 
> Conversion of non-owning refs to 'untrusted' at an invalidation point (unlock
> or remove) can be added to either approach (maybe - at least it was specifically
> discussed for "runtime checks"). Such untrusted refs, by virtue of being
> PTR_UNTRUSTED, can fault, and aren't accepted by rbtree_{add, remove} as input.

correct.

> For the "type system" approach this might ameliorate some of the usability
> issues. For the "runtime checks" approach it would only be useful to let
> such refs escape spin_unlock.

the prog can do bpf_rdonly_cast() even after mark_unknown.

> But we're not going to do non-owning -> 'untrusted' for now, just listing for
> completeness.

right, because of bpf_rdonly_cast availability.

> The distance between what I have now and the "type system" approach is smaller
> than the distance to the "runtime checks" approach. And to get from "type system"
> to "runtime checks" I'd need to:
> 
>   * Remove 'destructive op' invalidation points
>   * Add runtime checks to rbtree_{add,remove}
>   * Add verifier handling of runtime check failure possibility
> 
> Of which only the first point is getting rid of something added for the
> "type system" approach, and won't be much work relative to all the refactoring
> and other improvements that are common between the two approaches.
> 
> So for V2 I will do the "type system + invalidation on 'destructive' ops"
> approach as it'll take less time. This'll get eyes on common improvements
> faster. Then can do a "runtime checks" v3 and we can compare usability of both
> on same base.

Sure, if you think cleanup on rbtree_remove is faster to implement
then definitely go for it.
I was imagining the other way around, but it's fine. Happy to be wrong.
I'm not seeing, though, how you're going to do that cleanup.
Another id-like field?
Before doing all the coding, could you post a proposal in the format I used above?
imo it's much easier to think through in that form than by analyzing the src code.