
[v2,bpf-next,0/9] Shared ownership for local kptrs

Message ID 20230415201811.343116-1-davemarchevsky@fb.com (mailing list archive)

Message

Dave Marchevsky April 15, 2023, 8:18 p.m. UTC
This series adds support for refcounted local kptrs to the verifier. A local
kptr is 'refcounted' if its type contains a struct bpf_refcount field:

  struct refcounted_node {
    long data;
    struct bpf_list_node ll;
    struct bpf_refcount ref;
  };

bpf_refcount is used to implement shared ownership for local kptrs.

Motivating usecase
==================

If a struct has two collection node fields, e.g.:

  struct node {
    long key;
    long val;
    struct bpf_rb_node rb;
    struct bpf_list_node ll;
  };

It's not currently possible to add a node to both the list and rbtree:

  long bpf_prog(void *ctx)
  {
    struct node *n = bpf_obj_new(typeof(*n));
    if (!n) { /* ... */ }

    bpf_spin_lock(&lock);

    bpf_list_push_back(&head, &n->ll);
    bpf_rbtree_add(&root, &n->rb, less); /* Assume a reasonable less() */
    bpf_spin_unlock(&lock);
  }

The above program will fail verification due to current owning / non-owning ref
logic: after bpf_list_push_back, n is a non-owning reference and thus cannot be
passed to bpf_rbtree_add. The only way to get an owning reference for the node
that was added is to bpf_list_pop_{front,back} it.

More generally, verifier ownership semantics expect that a node has one
owner (program, collection, or stashed in a map) with exclusive ownership
of the node's lifetime. The owner frees the node's underlying memory when it
itself goes away.

Without a shared ownership concept it's impossible to express many real-world
usecases such that they pass verification.
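For illustration, here is roughly how the motivating example can be written
once this series is applied (a sketch only, loosely based on the
refcounted_kptr selftests; it assumes lock, head, root and less are declared
as in the earlier snippet):

  struct node {
    long key;
    long val;
    struct bpf_rb_node rb;
    struct bpf_list_node ll;
    struct bpf_refcount ref; /* enables shared ownership */
  };

  long bpf_prog(void *ctx)
  {
    struct node *n, *m;

    n = bpf_obj_new(typeof(*n));
    if (!n)
      return 0;

    /* Bump the refcount and acquire a second owning reference */
    m = bpf_refcount_acquire(n);

    bpf_spin_lock(&lock);
    /* Ownership of n passes to the list, ownership of m to the rbtree */
    bpf_list_push_back(&head, &n->ll);
    bpf_rbtree_add(&root, &m->rb, less);
    bpf_spin_unlock(&lock);
    return 0;
  }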

Semantic Changes
================

Before this series, the verifier could make this statement: "whoever has the
owning reference has exclusive ownership of the referent's lifetime". As
demonstrated in the previous section, this implies that a BPF program can't
have an owning reference to some node if that node is in a collection. If
such a state were possible, the node would have multiple owners, each thinking
they have exclusive ownership. In order to support shared ownership it's
necessary to modify the exclusive ownership semantic.

After this series' changes, an owning reference has ownership of the referent's
lifetime, but it's not necessarily exclusive. The referent's underlying memory
is guaranteed to be valid (i.e. not free'd) until the reference is dropped or
used for collection insert.

This change doesn't affect UX of owning or non-owning references much:

  * insert kfuncs (bpf_rbtree_add, bpf_list_push_{front,back}) still require
    an owning reference arg, as ownership still must be passed to the
    collection in a shared-ownership world.

  * non-owning references still refer to valid memory without claiming
    any ownership.

One important conclusion that followed from the "exclusive ownership" statement
is no longer valid, though. In an exclusive-ownership world, if a BPF prog has
an owning reference to a node, the verifier can conclude that no collection has
ownership of it. This conclusion was used to avoid runtime checking in the
implementations of insert and remove operations ("has the node already been
{inserted, removed}?").

In a shared-ownership world the aforementioned conclusion is no longer valid,
which necessitates doing runtime checking in insert and remove operation
kfuncs, and those functions possibly failing to insert or remove anything.
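
For example, after this series an insert kfunc returns an error (and releases
the passed-in owning reference) if the node is already owned by some
collection, and bpf_rbtree_remove may return NULL. A rough program-side
sketch, assuming the usual selftest helpers (container_of) and the
declarations from the earlier example:

  struct bpf_rb_node *res;

  bpf_spin_lock(&lock);
  if (bpf_rbtree_add(&root, &n->rb, less)) {
    /* n was already owned by a collection: nothing was inserted and
     * the kfunc released the owning reference that was passed in */
  }

  res = bpf_rbtree_first(&root);
  if (!res) {
    bpf_spin_unlock(&lock);
    return 0;
  }
  res = bpf_rbtree_remove(&root, res); /* may now fail and return NULL */
  bpf_spin_unlock(&lock);

  if (res)
    bpf_obj_drop(container_of(res, struct node, rb));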

Luckily the verifier changes necessary to go from exclusive to shared ownership
were fairly minimal. Patches in this series which do change verifier semantics
generally have some summary dedicated to explaining why certain usecases
Just Work for shared ownership without verifier changes.

Implementation
==============

The changes in this series can be categorized as follows:

  * struct bpf_refcount opaque field + plumbing
  * support for refcounted kptrs in bpf_obj_new and bpf_obj_drop
  * bpf_refcount_acquire kfunc
    * enables shared ownership by bumping the refcount and acquiring an owning ref
  * support for possibly-failing collection insertion and removal
    * insertion changes are more complex

If a patch's changes have some nuance to their effect - or lack of effect - on
verifier behavior, the patch summary talks about it at length.

Patch contents:
  * Patch 1 removes btf_field_offs struct
  * Patch 2 adds struct bpf_refcount and associated plumbing
  * Patch 3 modifies semantics of bpf_obj_drop and bpf_obj_new to handle
    refcounted kptrs
  * Patch 4 adds bpf_refcount_acquire
  * Patches 5-7 add support for possibly-failing collection insert and remove
  * Patch 8 centralizes constructor-like functionality for local kptr types
  * Patch 9 adds tests for new functionality

base-commit: 4a1e885c6d143ff1b557ec7f3fc6ddf39c51502f

Changelog:

v1 -> v2: lore.kernel.org/bpf/20230410190753.2012798-1-davemarchevsky@fb.com

Patch #s used below refer to the patch's position in v1 unless otherwise
specified.

  * General
    * Rebase onto latest bpf-next (base-commit updated above)

  * Patch 4 - "bpf: Add bpf_refcount_acquire kfunc"
    * Fix typo in summary (Alexei)
  * Patch 7 - "Migrate bpf_rbtree_remove to possibly fail"
    * Modify a paragraph in patch summary to more clearly state that only
      bpf_rbtree_remove's non-owning ref clobbering behavior is changed by the
      patch (Alexei)
    * refcount_off == -1 -> refcount_off < 0  in "node type w/ both list
      and rb_node fields" check, since any negative value means "no
      bpf_refcount field found", and furthermore refcount_off is never
      explicitly set to -1, but rather -EINVAL. (Alexei)
    * Instead of just changing "btf: list_node and rb_node in same struct" test
      expectation to pass instead of fail, do some refactoring to test both
      "list_node, rb_node, and bpf_refcount" (success) and "list_node, rb_node,
      _no_ bpf_refcount" (failure) cases. This ensures that logic change in
      previous bullet point is correct.
      * v1's "btf: list_node and rb_node in same struct" test changes didn't
        add bpf_refcount, so the fact that btf load succeeded w/ list and
        rb_nodes but no bpf_refcount field is further proof that this logic
        was incorrect in v1.
  * Patch 8 - "bpf: Centralize btf_field-specific initialization logic"
    * Instead of doing __init_field_infer_size in kfuncs when taking
      bpf_list_head type input which might've been 0-initialized in map, go
      back to simple oneliner initialization. Add short comment explaining why
      this is necessary. (Alexei)
  * Patch 9 - "selftests/bpf: Add refcounted_kptr tests"
    * Don't __always_inline helper fns in progs/refcounted_kptr.c (Alexei)

Dave Marchevsky (9):
  bpf: Remove btf_field_offs, use btf_record's fields instead
  bpf: Introduce opaque bpf_refcount struct and add btf_record plumbing
  bpf: Support refcounted local kptrs in existing semantics
  bpf: Add bpf_refcount_acquire kfunc
  bpf: Migrate bpf_rbtree_add and bpf_list_push_{front,back} to possibly
    fail
  selftests/bpf: Modify linked_list tests to work with macro-ified
    inserts
  bpf: Migrate bpf_rbtree_remove to possibly fail
  bpf: Centralize btf_field-specific initialization logic
  selftests/bpf: Add refcounted_kptr tests

 include/linux/bpf.h                           |  80 ++--
 include/linux/bpf_verifier.h                  |   7 +-
 include/linux/btf.h                           |   2 -
 include/uapi/linux/bpf.h                      |   4 +
 kernel/bpf/btf.c                              | 126 ++----
 kernel/bpf/helpers.c                          | 113 +++--
 kernel/bpf/map_in_map.c                       |  15 -
 kernel/bpf/syscall.c                          |  23 +-
 kernel/bpf/verifier.c                         | 155 +++++--
 tools/include/uapi/linux/bpf.h                |   4 +
 .../testing/selftests/bpf/bpf_experimental.h  |  60 ++-
 .../selftests/bpf/prog_tests/linked_list.c    |  96 +++--
 .../testing/selftests/bpf/prog_tests/rbtree.c |  25 ++
 .../bpf/prog_tests/refcounted_kptr.c          |  18 +
 .../testing/selftests/bpf/progs/linked_list.c |  34 +-
 .../testing/selftests/bpf/progs/linked_list.h |   4 +-
 .../selftests/bpf/progs/linked_list_fail.c    |  96 +++--
 tools/testing/selftests/bpf/progs/rbtree.c    |  74 +++-
 .../testing/selftests/bpf/progs/rbtree_fail.c |  77 ++--
 .../selftests/bpf/progs/refcounted_kptr.c     | 406 ++++++++++++++++++
 .../bpf/progs/refcounted_kptr_fail.c          |  72 ++++
 21 files changed, 1114 insertions(+), 377 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/refcounted_kptr.c
 create mode 100644 tools/testing/selftests/bpf/progs/refcounted_kptr.c
 create mode 100644 tools/testing/selftests/bpf/progs/refcounted_kptr_fail.c

Comments

patchwork-bot+netdevbpf@kernel.org April 16, 2023, 12:50 a.m. UTC | #1
Hello:

This series was applied to bpf/bpf-next.git (master)
by Alexei Starovoitov <ast@kernel.org>:

On Sat, 15 Apr 2023 13:18:02 -0700 you wrote:
> This series adds support for refcounted local kptrs to the verifier. A local
> kptr is 'refcounted' if its type contains a struct bpf_refcount field:
> 
>   struct refcounted_node {
>     long data;
>     struct bpf_list_node ll;
>     struct bpf_refcount ref;
>   };
> 
> [...]

Here is the summary with links:
  - [v2,bpf-next,1/9] bpf: Remove btf_field_offs, use btf_record's fields instead
    https://git.kernel.org/bpf/bpf-next/c/cd2a8079014a
  - [v2,bpf-next,2/9] bpf: Introduce opaque bpf_refcount struct and add btf_record plumbing
    https://git.kernel.org/bpf/bpf-next/c/d54730b50bae
  - [v2,bpf-next,3/9] bpf: Support refcounted local kptrs in existing semantics
    https://git.kernel.org/bpf/bpf-next/c/1512217c47f0
  - [v2,bpf-next,4/9] bpf: Add bpf_refcount_acquire kfunc
    https://git.kernel.org/bpf/bpf-next/c/7c50b1cb76ac
  - [v2,bpf-next,5/9] bpf: Migrate bpf_rbtree_add and bpf_list_push_{front,back} to possibly fail
    https://git.kernel.org/bpf/bpf-next/c/d2dcc67df910
  - [v2,bpf-next,6/9] selftests/bpf: Modify linked_list tests to work with macro-ified inserts
    https://git.kernel.org/bpf/bpf-next/c/de67ba3968fa
  - [v2,bpf-next,7/9] bpf: Migrate bpf_rbtree_remove to possibly fail
    https://git.kernel.org/bpf/bpf-next/c/404ad75a36fb
  - [v2,bpf-next,8/9] bpf: Centralize btf_field-specific initialization logic
    https://git.kernel.org/bpf/bpf-next/c/3e81740a9062
  - [v2,bpf-next,9/9] selftests/bpf: Add refcounted_kptr tests
    https://git.kernel.org/bpf/bpf-next/c/6147f15131e2

You are awesome, thank you!
Kumar Kartikeya Dwivedi April 22, 2023, 2:03 a.m. UTC | #2
On Sat, Apr 15, 2023 at 10:18:02PM CEST, Dave Marchevsky wrote:
> This series adds support for refcounted local kptrs to the verifier. A local
> kptr is 'refcounted' if its type contains a struct bpf_refcount field:
>
> [...]

I think there are a series of major issues right now. I am not sure everything
can be addressed using bug fixes.

If I had to summarize the main problems in one-liners:
The mutation of the node fields of an object can be racy.
Lack of collection identity either at runtime or verification.

--

It is possible for multiple CPUs to get owning references to an object containing
an rbtree or list node, and they can attempt to modify those fields in parallel
without any synchronization.

CPU 0					CPU 1
n = bpf_obj_new(...)
m = bpf_refcount_acquire(n)
kptr_xchg(map, m)
					m = kptr_xchg(map, NULL)
// m == n
bpf_spin_lock(lock1)			bpf_spin_lock(lock2)
bpf_rbtree_add(rbtree1, m)		bpf_rbtree_add(rbtree2, n)
	if (!RB_EMPTY_NODE)			if (!RB_EMPTY_NODE) // RACE, both continue with 'else'
bpf_spin_unlock(lock1)			bpf_spin_unlock(lock2)

--

Blocking kptr_xchg for shared ownership nodes as a stopgap solution won't be
sufficient. Consider this:

Two CPUs can do (consider rbtree1 having the only element we add from CPU 0):

CPU 0					CPU 1
n = bpf_obj_new(...)
bpf_spin_lock(lock1)
bpf_rbtree_add(rbtree1, n)
m = bpf_refcount_acquire(n)
bpf_spin_unlock(lock1)
					bpf_spin_lock(lock1)
					n = bpf_rbtree_remove(bpf_rbtree_first(rbtree1))
					bpf_spin_unlock(lock1)
// let m == n
bpf_spin_lock(lock1)			bpf_spin_lock(lock2)
bpf_rbtree_add(rbtree1, m)		bpf_rbtree_add(rbtree2, n)
	if (!RB_EMPTY_NODE)			if (!RB_EMPTY_NODE) // RACE, both continue with 'else'
bpf_spin_unlock(lock1)			bpf_spin_unlock(lock2)

--

There's also no discussion on the problem with collection identities we
discussed before (maybe you plan to address it later):
https://lore.kernel.org/bpf/45e80d2e-af16-8584-12ec-c4c301d9a69d@meta.com

But static tracking won't be sufficient any longer. Consider another case
where the runtime will be confused about which rbtree a node belongs to.

CPU 0								CPU 1
n = bpf_obj_new(...)
m = bpf_refcount_acquire(n)
kptr_xchg(map, m)
								p = kptr_xchg(map, NULL)
								lock(lock2)
								bpf_rbtree_add(rbtree2, p->rnode)
								unlock(lock2)
lock(lock1)
bpf_list_push_back(head1, n->lnode)
// make n non-owning ref
bpf_rbtree_remove(rbtree1, n->rnode); // OOPS, remove without lock2
unlock(lock1)

--

I can come up with multiple other examples. The point I'm trying to drive home
is that it's a more fundamental issue in the way things are set up right now,
not something that was overlooked during the implementation.

The root cause is that access to a node's fields is going to be racy once
multiple CPUs have *mutable* references to it. The lack of ownership information
mapped to the collection either at runtime or during verification is also a
separate problem.

When we were discussing this a few months ago, we tried to form consensus on
synchronizing updates over a node using an 'owner' pointer to eliminate similar
races. Every node field has an associated owner field, which each updater
modifying that node synchronizes over.

In short:
Node's owner describes its state at runtime.
node.owner == ptr_of_ds // part of DS
node.owner == NULL // not part of DS
node.owner == BPF_PTR_POISON // busy (about to go to NULL or ptr_of_ds before BPF_EXIT)

cmpxchg(node.owner, NULL, BPF_PTR_POISON) to make a free node busy.
bpf_rbtree_add and such will do this to claim node ownership before trying to
link it in, and then store owner to ptr_of_ds. The _store_ will be the
*linearization* point of bpf_rbtree_add, not cmpxchg.

READ_ONCE(node.owner) == ptr_of_ds to ensure the node belongs to the locked ds, and it
will remain in this state until the end of the CS, since the ptr_of_ds to NULL transition
only happens in remove under a held lock for the ds. bpf_rbtree_remove will do this
check before the WRITE_ONCE of NULL to unlink a node.

Of course, this is slower and requires extra space in the object, but unless
this or something else is used to eliminate races, there will be bugs.
Alexei Starovoitov April 22, 2023, 3:21 a.m. UTC | #3
On Sat, Apr 22, 2023 at 04:03:45AM +0200, Kumar Kartikeya Dwivedi wrote:
> On Sat, Apr 15, 2023 at 10:18:02PM CEST, Dave Marchevsky wrote:
> > This series adds support for refcounted local kptrs to the verifier. A local
> > kptr is 'refcounted' if its type contains a struct bpf_refcount field:
> >
> > [...]
> 
> I think there are a series of major issues right now. I am not sure everything
> can be addressed using bug fixes.
> 
> If I had to summarize the main problems in one liners:
> The mutation of the node fields of an object can be racy.
> Lack of collection identity either at runtime or verification.
> 
> --
> 
> It is possible for multiple CPUs to get owned references to an object containing
> a rbtree or list node, and they can attempt to modify those fields in parallel
> without any synchronization.
> 
> CPU 0					CPU 1
> n = bpf_obj_new(...)
> m = bpf_refcount_acquire(n)
> kptr_xchg(map, m)
> 					m = kptr_xchg(map, NULL)
> // m == n
> bpf_spin_lock(lock1)			bpf_spin_lock(lock2)
> bpf_rbtree_add(rbtree1, m)		bpf_rbtree_add(rbtree2, n)
> 	if (!RB_EMPTY_NODE)			if (!RB_EMPTY_NODE) // RACE, both continue with 'else'
> bpf_spin_unlock(lock1)			bpf_spin_unlock(lock2)
> 
> --
> 
> Blocking kptr_xchg for shared ownership nodes as a stopgap solution won't be
> sufficient. Consider this:
> 
> Two CPUs can do (consider rbtree1 having the only element we add from CPU 0):
> 
> CPU 0					CPU 1
> n = bpf_obj_new(...)
> bpf_spin_lock(lock1)
> bpf_rbtree_add(rbtree1, n)
> m = bpf_refcount_acquire(n)
> bpf_spin_unlock(lock1)
> 					bpf_spin_lock(lock1)
> 					n = bpf_rbtree_remove(bpf_rbtree_first(rbtree1))
> 					bpf_spin_unlock(lock1)
> // let m == n
> bpf_spin_lock(lock1)			bpf_spin_lock(lock2)
> bpf_rbtree_add(rbtree1, m)		bpf_rbtree_add(rbtree2, n)
> 	if (!RB_EMPTY_NODE)			if (!RB_EMPTY_NODE) // RACE, both continue with 'else'
> bpf_spin_unlock(lock1)			bpf_spin_unlock(lock2)
> 
> --
> 
> There's also no discussion on the problem with collection identities we
> discussed before (maybe you plan to address it later):
> https://lore.kernel.org/bpf/45e80d2e-af16-8584-12ec-c4c301d9a69d@meta.com
> 
> But static tracaking won't be sufficient any longer. Considering another case
> where the runtime will be confused about which rbtree a node belongs to.
> 
> CPU 0								CPU 1
> n = bpf_obj_new(...)
> m = bpf_refcount_acquire(n)
> kptr_xchg(map, m)
> 								p = kptr_xchg(map, NULL)
> 								lock(lock2)
> 								bpf_rbtree_add(rbtree2, p->rnode)
> 								unlock(lock2)
> lock(lock1)
> bpf_list_push_back(head1, n->lnode)
> // make n non-owning ref
> bpf_rbtree_remove(rbtree1, n->rnode); // OOPS, remove without lock2
> unlock(lock1)

Thanks for describing the races.

> --
> 
> I can come up with multiple other examples. The point I'm trying to drive home
> is that it's a more fundamental issue in the way things are set up right now,
> not something that was overlooked during the implementation.
> 
> The root cause is that access to a node's fields is going to be racy once
> multiple CPUs have *mutable* references to it. The lack of ownership information
> mapped to the collection either at runtime or during verification is also a
> separate problem.
> 
> When we were discussing this a few months ago, we tried to form consensus on
> synchronizing updates over a node using an 'owner' pointer to eliminate similar
> races. Every node field has an associated owner field, which each updater
> modifying that node synchronizes over.
> 
> In short:
> Node's owner describes its state at runtime.
> node.owner == ptr_of_ds // part of DS

what is DS ?

> node.owner == NULL // not part of DS
> node.owner == BPF_PTR_POISON // busy (about to go to NULL or ptr_of_ds before BPF_EXIT)
> 
> cmpxchg(node.owner, NULL, BPF_PTR_POISON) to make a free node busy.
> bpf_rbtree_add and such will do this to claim node ownership before trying to
> link it in, and then store owner to ptr_of_ds. The _store_ will be the
> *linearization* point of bpf_rbtree_add, not cmpxchg.
> 
> READ_ONCE(node.owner) == ptr_of_ds to ensure node belongs to locked ds, and will
> remain in this state until the end of CS, since ptr_to_ds to NULL only happens
> in remove under a held lock for the ds. bpf_rbtree_remove will do this check
> before WRITE_ONCE of NULL to unlink a node.
> 
> Ofcourse, this is slower, and requires extra space in the object, but unless
> this or something else is used to eliminate races, there will be bugs.

Makes sense to me. Where do you propose to store the owner? Another explicit
field that bpf prog has to provide inside a node?
A hidden field in bpf_refcount ?
A hidden field in bpf_list_node / bpf_rb_node ?
Kumar Kartikeya Dwivedi April 22, 2023, 6:42 p.m. UTC | #4
On Sat, Apr 22, 2023 at 05:21:36AM CEST, Alexei Starovoitov wrote:
> On Sat, Apr 22, 2023 at 04:03:45AM +0200, Kumar Kartikeya Dwivedi wrote:
> > On Sat, Apr 15, 2023 at 10:18:02PM CEST, Dave Marchevsky wrote:
> > > This series adds support for refcounted local kptrs to the verifier. A local
> > > kptr is 'refcounted' if its type contains a struct bpf_refcount field:
> > >
> > > [...]
> >
> > I think there are a series of major issues right now. I am not sure everything
> > can be addressed using bug fixes.
> >
> > If I had to summarize the main problems in one liners:
> > The mutation of the node fields of an object can be racy.
> > Lack of collection identity either at runtime or verification.
> >
> > --
> >
> > It is possible for multiple CPUs to get owned references to an object containing
> > a rbtree or list node, and they can attempt to modify those fields in parallel
> > without any synchronization.
> >
> > CPU 0					CPU 1
> > n = bpf_obj_new(...)
> > m = bpf_refcount_acquire(n)
> > kptr_xchg(map, m)
> > 					m = kptr_xchg(map, NULL)
> > // m == n
> > bpf_spin_lock(lock1)			bpf_spin_lock(lock2)
> > bpf_rbtree_add(rbtree1, m)		bpf_rbtree_add(rbtree2, n)
> > 	if (!RB_EMPTY_NODE)			if (!RB_EMPTY_NODE) // RACE, both continue with 'else'
> > bpf_spin_unlock(lock1)			bpf_spin_unlock(lock2)
> >
> > --
> >
> > Blocking kptr_xchg for shared ownership nodes as a stopgap solution won't be
> > sufficient. Consider this:
> >
> > Two CPUs can do (consider rbtree1 having the only element we add from CPU 0):
> >
> > CPU 0					CPU 1
> > n = bpf_obj_new(...)
> > bpf_spin_lock(lock1)
> > bpf_rbtree_add(rbtree1, n)
> > m = bpf_refcount_acquire(n)
> > bpf_spin_unlock(lock1)
> > 					bpf_spin_lock(lock1)
> > 					n = bpf_rbtree_remove(bpf_rbtree_first(rbtree1))
> > 					bpf_spin_unlock(lock1)
> > // let m == n
> > bpf_spin_lock(lock1)			bpf_spin_lock(lock2)
> > bpf_rbtree_add(rbtree1, m)		bpf_rbtree_add(rbtree2, n)
> > 	if (!RB_EMPTY_NODE)			if (!RB_EMPTY_NODE) // RACE, both continue with 'else'
> > bpf_spin_unlock(lock1)			bpf_spin_unlock(lock2)
> >
> > --
> >
> > There's also no discussion on the problem with collection identities we
> > discussed before (maybe you plan to address it later):
> > https://lore.kernel.org/bpf/45e80d2e-af16-8584-12ec-c4c301d9a69d@meta.com
> >
> > But static tracaking won't be sufficient any longer. Considering another case
> > where the runtime will be confused about which rbtree a node belongs to.
> >
> > CPU 0								CPU 1
> > n = bpf_obj_new(...)
> > m = bpf_refcount_acquire(n)
> > kptr_xchg(map, m)
> > 								p = kptr_xchg(map, NULL)
> > 								lock(lock2)
> > 								bpf_rbtree_add(rbtree2, p->rnode)
> > 								unlock(lock2)
> > lock(lock1)
> > bpf_list_push_back(head1, n->lnode)
> > // make n non-owning ref
> > bpf_rbtree_remove(rbtree1, n->rnode); // OOPS, remove without lock2
> > unlock(lock1)
>
> Thanks for describing the races.
>
> > --
> >
> > I can come up with multiple other examples. The point I'm trying to drive home
> > is that it's a more fundamental issue in the way things are set up right now,
> > not something that was overlooked during the implementation.
> >
> > The root cause is that access to a node's fields is going to be racy once
> > multiple CPUs have *mutable* references to it. The lack of ownership information
> > mapped to the collection either at runtime or during verification is also a
> > separate problem.
> >
> > When we were discussing this a few months ago, we tried to form consensus on
> > synchronizing updates over a node using an 'owner' pointer to eliminate similar
> > races. Every node field has an associated owner field, which each updater
> > modifying that node synchronizes over.
> >
> > In short:
> > Node's owner describes its state at runtime.
> > node.owner == ptr_of_ds // part of DS
>
> what is DS ?
>

The graph data structure.

> > node.owner == NULL // not part of DS
> > node.owner == BPF_PTR_POISON // busy (about to go to NULL or ptr_of_ds before BPF_EXIT)
> >
> > cmpxchg(node.owner, NULL, BPF_PTR_POISON) to make a free node busy.
> > bpf_rbtree_add and such will do this to claim node ownership before trying to
> > link it in, and then store owner to ptr_of_ds. The _store_ will be the
> > *linearization* point of bpf_rbtree_add, not cmpxchg.
> >
> > READ_ONCE(node.owner) == ptr_of_ds to ensure node belongs to locked ds, and will
> > remain in this state until the end of CS, since ptr_to_ds to NULL only happens
> > in remove under a held lock for the ds. bpf_rbtree_remove will do this check
> > before WRITE_ONCE of NULL to unlink a node.
> >
> > Ofcourse, this is slower, and requires extra space in the object, but unless
> > this or something else is used to eliminate races, there will be bugs.
>
> Makes sense to me. Where do you propose to store the owner? Another explicit
> field that bpf prog has to provide inside a node?
> A hidden field in bpf_refcount ?
> A hidden field in bpf_list_node / bpf_rb_node ?

It will have to be with each bpf_list_node / bpf_rb_node.

Pseudocode wise, this is how things will work:

bpf_rbtree_add(rbtree, node):
	// Move a node from free to busy state
	if (!cmpxchg(node.owner, NULL, BPF_PTR_POISON))
		return -EINVAL
	__rb_add(...)
	WRITE_ONCE(node.owner, rbtree)

bpf_rbtree_remove(rbtree, node):
	// Check if node is part of rbtree:
	// If != rbtree, it could be changing concurrently
	// If == rbtree, changing it requires the lock we are holding
	if (READ_ONCE(node.owner) != rbtree)
		return -EINVAL
	__rb_remove(...)
	WRITE_ONCE(node.owner, NULL)

Likewise for the linked list helpers.

My opinion would be to have different bpf_list_node_shared and
bpf_rb_node_shared types for the refcount_off >= 0 case, which encode
bpf_list_node and bpf_rb_node together with their owner, since the space and
runtime-check tradeoff is not required if you're using exclusive ownership.
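
Layout-wise that could look something like the following (purely illustrative;
no such types exist in this series):

	struct bpf_rb_node_shared {
		struct bpf_rb_node node;
		/* NULL: free, BPF_PTR_POISON: busy, otherwise the owning bpf_rb_root */
		void *owner;
	};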

One more point is that you can minimize cost of cmpxchg by adding compound
operation helpers.

For instance (under appropriate locking):

bpf_rbtree_move(rbtree1, rbtree2, node):
	if (READ_ONCE(node.owner) != rbtree1)
		return -EINVAL
	__rb_remove(rbtree1, node)
	__rb_add(rbtree2, node)
	WRITE_ONCE(node.owner, rbtree2)

Instead of:
	bpf_rbtree_remove(rbtree1, node)
	bpf_rbtree_add(rbtree2, node)
Alexei Starovoitov April 22, 2023, 9:25 p.m. UTC | #5
On Sat, Apr 22, 2023 at 08:42:47PM +0200, Kumar Kartikeya Dwivedi wrote:
> On Sat, Apr 22, 2023 at 05:21:36AM CEST, Alexei Starovoitov wrote:
> > On Sat, Apr 22, 2023 at 04:03:45AM +0200, Kumar Kartikeya Dwivedi wrote:
> > > On Sat, Apr 15, 2023 at 10:18:02PM CEST, Dave Marchevsky wrote:
> > > > This series adds support for refcounted local kptrs to the verifier. A local
> > > > kptr is 'refcounted' if its type contains a struct bpf_refcount field:
> > > >
> > > > [...]
> > >
> > > I think there are a series of major issues right now. I am not sure everything
> > > can be addressed using bug fixes.
> > >
> > > If I had to summarize the main problems in one liners:
> > > The mutation of the node fields of an object can be racy.
> > > Lack of collection identity either at runtime or verification.
> > >
> > > --
> > >
> > > It is possible for multiple CPUs to get owned references to an object containing
> > > a rbtree or list node, and they can attempt to modify those fields in parallel
> > > without any synchronization.
> > >
> > > CPU 0					CPU 1
> > > n = bpf_obj_new(...)
> > > m = bpf_refcount_acquire(n)
> > > kptr_xchg(map, m)
> > > 					m = kptr_xchg(map, NULL)
> > > // m == n
> > > bpf_spin_lock(lock1)			bpf_spin_lock(lock2)
> > > bpf_rbtree_add(rbtree1, m)		bpf_rbtree_add(rbtree2, n)
> > > 	if (!RB_EMPTY_NODE)			if (!RB_EMPTY_NODE) // RACE, both continue with 'else'
> > > bpf_spin_unlock(lock1)			bpf_spin_unlock(lock2)
> > >
> > > --
> > >
> > > Blocking kptr_xchg for shared ownership nodes as a stopgap solution won't be
> > > sufficient. Consider this:
> > >
> > > Two CPUs can do (consider rbtree1 having the only element we add from CPU 0):
> > >
> > > CPU 0					CPU 1
> > > n = bpf_obj_new(...)
> > > bpf_spin_lock(lock1)
> > > bpf_rbtree_add(rbtree1, n)
> > > m = bpf_refcount_acquire(n)
> > > bpf_spin_unlock(lock1)
> > > 					bpf_spin_lock(lock1)
> > > 					n = bpf_rbtree_remove(bpf_rbtree_first(rbtree1))
> > > 					bpf_spin_unlock(lock1)
> > > // let m == n
> > > bpf_spin_lock(lock1)			bpf_spin_lock(lock2)
> > > bpf_rbtree_add(rbtree1, m)		bpf_rbtree_add(rbtree2, n)
> > > 	if (!RB_EMPTY_NODE)			if (!RB_EMPTY_NODE) // RACE, both continue with 'else'
> > > bpf_spin_unlock(lock1)			bpf_spin_unlock(lock2)
> > >
> > > --
> > >
> > > There's also no discussion on the problem with collection identities we
> > > discussed before (maybe you plan to address it later):
> > > https://lore.kernel.org/bpf/45e80d2e-af16-8584-12ec-c4c301d9a69d@meta.com
> > >
> > > But static tracaking won't be sufficient any longer. Considering another case
> > > where the runtime will be confused about which rbtree a node belongs to.
> > >
> > > CPU 0								CPU 1
> > > n = bpf_obj_new(...)
> > > m = bpf_refcount_acquire(n)
> > > kptr_xchg(map, m)
> > > 								p = kptr_xchg(map, NULL)
> > > 								lock(lock2)
> > > 								bpf_rbtree_add(rbtree2, p->rnode)
> > > 								unlock(lock2)
> > > lock(lock1)
> > > bpf_list_push_back(head1, n->lnode)
> > > // make n non-owning ref
> > > bpf_rbtree_remove(rbtree1, n->rnode); // OOPS, remove without lock2
> > > unlock(lock1)
> >
> > Thanks for describing the races.
> >
> > > --
> > >
> > > I can come up with multiple other examples. The point I'm trying to drive home
> > > is that it's a more fundamental issue in the way things are set up right now,
> > > not something that was overlooked during the implementation.
> > >
> > > The root cause is that access to a node's fields is going to be racy once
> > > multiple CPUs have *mutable* references to it. The lack of ownership information
> > > mapped to the collection either at runtime or during verification is also a
> > > separate problem.
> > >
> > > When we were discussing this a few months ago, we tried to form consensus on
> > > synchronizing updates over a node using an 'owner' pointer to eliminate similar
> > > races. Every node field has an associated owner field, which each updater
> > > modifying that node synchronizes over.
> > >
> > > In short:
> > > Node's owner describes its state at runtime.
> > > node.owner == ptr_of_ds // part of DS
> >
> > what is DS ?
> >
> 
> The graph data structure.
> 
> > > node.owner == NULL // not part of DS
> > > node.owner == BPF_PTR_POISON // busy (about to go to NULL or ptr_of_ds before BPF_EXIT)
> > >
> > > cmpxchg(node.owner, NULL, BPF_PTR_POISON) to make a free node busy.
> > > bpf_rbtree_add and such will do this to claim node ownership before trying to
> > > link it in, and then store owner to ptr_of_ds. The _store_ will be the
> > > *linearization* point of bpf_rbtree_add, not cmpxchg.
> > >
> > > READ_ONCE(node.owner) == ptr_of_ds to ensure node belongs to locked ds, and will
> > > remain in this state until the end of CS, since ptr_to_ds to NULL only happens
> > > in remove under a held lock for the ds. bpf_rbtree_remove will do this check
> > > before WRITE_ONCE of NULL to unlink a node.
> > >
> > > Ofcourse, this is slower, and requires extra space in the object, but unless
> > > this or something else is used to eliminate races, there will be bugs.
> >
> > Makes sense to me. Where do you propose to store the owner? Another explicit
> > field that bpf prog has to provide inside a node?
> > A hidden field in bpf_refcount ?
> > A hidden field in bpf_list_node / bpf_rb_node ?
> 
> It will have to be with each bpf_list_node / bpf_rb_node.

yeah. realized after pressed 'send'.

> Pseudocode wise, this is how things will work:
> 
> bpf_rbtree_add(rbtree, node):
> 	// Move a node from free to busy state
> 	if (!cmpxchg(node.owner, NULL, BPF_PTR_POISON))
> 		return -EINVAL

should be bpf_obj_drop here instead of plain return.
In other words the current 
if (!RB_EMPTY_NODE()) bpf_obj_drop
will be replaced by the above cmpxchg

> 	__rb_add(...)
> 	WRITE_ONCE(node.owner, rbtree)
> 
> bpf_rbtree_remove(rbtree, node):
> 	// Check if node is part of rbtree:
> 	// If != rbtree, it could be changing concurrently
> 	// If == rbtree, changing it requires the lock we are holding
> 	if (READ_ONCE(node.owner) != rbtree)
> 		return -EINVAL

should be 'return NULL'.
The above check will replace current RB_EMPTY_NODE check.

> 	__rb_remove(...)
> 	WRITE_ONCE(node.owner, NULL)

Compared to the existing code it's only an extra WRITE_ONCE
and a cmpxchg instead of load+cmp.
imo the perf cost is roughly the same.
Because of that there is no need to have
two bpf_rbtree_add/remove for shared/exclusive nodes.

> Likewise for the linked list helpers.
> 
> My opinion would be to have a different bpf_list_node_shared and
> bpf_rb_node_shared for refcount_off >= 0 case, which encode both bpf_list_node
> and bpf_rb_node and their owner together, since the space and runtime check
> tradeoff is not required if you're using exclusive ownership.

true, but runtime performance of bpf_rbtree_add/remove is the same for both cases.
With bpf_rb_node_shared the bpf prog will save 8 bytes of memory at the expense
of plenty of verifier complexity that would need to have two checks
if (btf_id == bpf_rb_node_shared || btf_id == bpf_rb_node) in a bunch of places
and one of the two in another set of places.
I'd rather waste 8 bytes than complicate the verifier.
Having a valid node->owner for the exclusive rb-tree and linked list is a nice debug feature too.
Not saying we screwed up the non-shared rb-tree, but I won't be surprised if
somebody figures out how to construct a similar race there in the future.

> One more point is that you can minimize cost of cmpxchg by adding compound
> operation helpers.
> 
> For instance (under appropriate locking):
> 
> bpf_rbtree_move(rbtree1, rbtree2, node):
> 	if (READ_ONCE(node.owner) != rbtree1)
> 		return -EINVAL
> 	__rb_remove(rbtree1, node)
> 	__rb_add(rbtree2, node)
> 	WRITE_ONCE(node.owner, rbtree2)
> 
> Instead of:
> 	bpf_rbtree_remove(rbtree1, node)
> 	bpf_rbtree_add(rbtree2, node)

That makes sense. Not urgent. Distant follow up.