
[bpf-next,v4,22/24] bpf: Introduce single ownership BPF linked list API

Message ID 20221103191013.1236066-23-memxor@gmail.com (mailing list archive)
State Superseded
Delegated to: BPF
Series Local kptrs, BPF linked lists

Checks

Context Check Description
netdev/tree_selection success Clearly marked for bpf-next, async
netdev/fixes_present success Fixes tag not required for -next series
netdev/subject_prefix success Link
netdev/cover_letter success Series has a cover letter
netdev/patch_count fail Series longer than 15 patches (and no cover letter)
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit fail Errors and warnings before: 75 this patch: 81
netdev/cc_maintainers warning 11 maintainers not CCed: sdf@google.com kpsingh@kernel.org mykolal@fb.com haoluo@google.com linux-kselftest@vger.kernel.org yhs@fb.com shuah@kernel.org jolsa@kernel.org martin.lau@linux.dev song@kernel.org john.fastabend@gmail.com
netdev/build_clang fail Errors and warnings before: 17 this patch: 20
netdev/module_param success Was 0 now: 0
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn fail Errors and warnings before: 75 this patch: 81
netdev/checkpatch warning CHECK: Unnecessary parentheses around 'insn->src_reg == BPF_PSEUDO_CALL' CHECK: braces {} should be used on all arms of this statement CHECK: extern prototypes should be avoided in .h files WARNING: line length of 100 exceeds 80 columns WARNING: line length of 101 exceeds 80 columns WARNING: line length of 102 exceeds 80 columns WARNING: line length of 104 exceeds 80 columns WARNING: line length of 112 exceeds 80 columns WARNING: line length of 118 exceeds 80 columns WARNING: line length of 81 exceeds 80 columns WARNING: line length of 82 exceeds 80 columns WARNING: line length of 83 exceeds 80 columns WARNING: line length of 84 exceeds 80 columns WARNING: line length of 85 exceeds 80 columns WARNING: line length of 86 exceeds 80 columns WARNING: line length of 87 exceeds 80 columns WARNING: line length of 88 exceeds 80 columns WARNING: line length of 89 exceeds 80 columns WARNING: line length of 91 exceeds 80 columns WARNING: line length of 92 exceeds 80 columns WARNING: line length of 93 exceeds 80 columns WARNING: line length of 94 exceeds 80 columns WARNING: line length of 95 exceeds 80 columns WARNING: line length of 96 exceeds 80 columns
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0
bpf/vmtest-bpf-next-VM_Test-2 fail Logs for build for s390x with gcc
bpf/vmtest-bpf-next-VM_Test-1 pending Logs for ${{ matrix.test }} on ${{ matrix.arch }} with ${{ matrix.toolchain }}
bpf/vmtest-bpf-next-VM_Test-3 fail Logs for build for s390x with gcc
bpf/vmtest-bpf-next-VM_Test-4 fail Logs for build for x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-5 fail Logs for build for x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-6 success Logs for llvm-toolchain
bpf/vmtest-bpf-next-VM_Test-7 success Logs for set-matrix
bpf/vmtest-bpf-next-PR fail merge-conflict

Commit Message

Kumar Kartikeya Dwivedi Nov. 3, 2022, 7:10 p.m. UTC
Add a linked list API for use in BPF programs, which expects
protection from a bpf_spin_lock in the same allocation as the
bpf_list_head. Future patches will extend the same infrastructure to
have different flavors with varying protection domains and visibility
(e.g. a percpu variant with local_t protection, usable in NMI progs).

The following functions are added to kick things off:

bpf_list_push_front
bpf_list_push_back
bpf_list_pop_front
bpf_list_pop_back

The lock protecting the bpf_list_head needs to be taken for all
operations.

Once a node has been added to the list, its pointer changes to
PTR_UNTRUSTED. However, it is only released once the lock protecting the
list is unlocked. For such local kptrs with PTR_UNTRUSTED set but an
active ref_obj_id, it is still permitted to read and write to them as
long as the lock is held.

bpf_list_pop_front and bpf_list_pop_back delete the first or last item
of the list respectively, and return a pointer to the element at the
list_node offset. The user can then use a container_of style macro to get
the actual entry type. The verifier, however, statically knows the actual
type, so the safety properties are still preserved.

With these additions, programs can now manage their own linked lists and
store their objects in them.
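
For illustration, a minimal sketch of the push side is shown below. The names
(struct foo, glock, ghead) are made up, and the private()/__contains()
annotations and the bpf_obj_new() wrapper follow the selftest helper
conventions used in this series, so treat this as a sketch rather than a
verbatim excerpt:

	struct foo {
		int data;
		struct bpf_list_node node;
	};

	/* The bpf_spin_lock and bpf_list_head must live in the same allocation;
	 * globals placed in the same custom data section satisfy that.
	 */
	private(A) struct bpf_spin_lock glock;
	private(A) struct bpf_list_head ghead __contains(foo, node);

	SEC("tc")
	int push_example(void *ctx)
	{
		struct foo *f;

		f = bpf_obj_new(typeof(*f));
		if (!f)
			return 0;
		/* Fully initialize the object before insertion; once pushed, the
		 * program no longer owns the node.
		 */
		f->data = 13;

		bpf_spin_lock(&glock);
		bpf_list_push_front(&ghead, &f->node);
		bpf_spin_unlock(&glock);
		return 0;
	}

As discussed in the thread below, writes to f after the push are currently
rejected by the verifier; only reads are allowed until bpf_spin_unlock.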

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
---
 include/linux/bpf_verifier.h                  |   5 +
 kernel/bpf/helpers.c                          |  50 ++-
 kernel/bpf/verifier.c                         | 377 ++++++++++++++++--
 .../testing/selftests/bpf/bpf_experimental.h  |  28 ++
 4 files changed, 425 insertions(+), 35 deletions(-)

Comments

Dave Marchevsky Nov. 4, 2022, 5:56 a.m. UTC | #1
On 11/3/22 3:10 PM, Kumar Kartikeya Dwivedi wrote:
> Add a linked list API for use in BPF programs, where it expects
> protection from the bpf_spin_lock in the same allocation as the
> bpf_list_head. Future patches will extend the same infrastructure to
> have different flavors with varying protection domains and visibility
> (e.g. percpu variant with local_t protection, usable in NMI progs).
> 
> The following functions are added to kick things off:
> 
> bpf_list_push_front
> bpf_list_push_back
> bpf_list_pop_front
> bpf_list_pop_back
> 
> The lock protecting the bpf_list_head needs to be taken for all
> operations.
> 
> Once a node has been added to the list, it's pointer changes to
> PTR_UNTRUSTED. However, it is only released once the lock protecting the
> list is unlocked. For such local kptrs with PTR_UNTRUSTED set but an
> active ref_obj_id, it is still permitted to read and write to them as
> long as the lock is held.

I think "still permitted to ... write to them" is not accurate
for this version of the series. In v2 you mentioned [0]:

"""
I have switched things a bit to disallow stores, which is a bug right now in
this set, because one can do this:

push_front(head, &p->node);
p2 = container_of(pop_front(head));
// p2 == p
bpf_obj_drop(p2);
p->data = ...;

One can always fully initialize the object _before_ inserting it into the list,
in some cases that will be the requirement (like adding to RCU protected lists)
for correctness.
"""

I confirmed this is currently the case by moving the data write after the
list_push in the selftest and running it:

@@ -87,8 +87,8 @@ static __always_inline int list_push_pop(struct bpf_spin_lock *lock,
        }

        bpf_spin_lock(lock);
-       f->data = 13;
        bpf_list_push_front(head, &f->node);
+       f->data = 13;
        bpf_spin_unlock(lock);

Got "only read is supported" from verifier.
I think it's fine to punt on supporting writes for now and do it in followups.

  [0]: https://lore.kernel.org/bpf/20221025190033.bqjr7466o7pdd3wo@apollo/

> 
> bpf_list_pop_front and bpf_list_pop_back delete the first or last item
> of the list respectively, and return pointer to the element at the
> list_node offset. The user can then use container_of style macro to get
> the actual entry type. The verifier however statically knows the actual
> type, so the safety properties are still preserved.
> 
> With these additions, programs can now manage their own linked lists and
> store their objects in them.
> 
> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
> ---

[...]

will comment on rest of patch in separate msg
Kumar Kartikeya Dwivedi Nov. 4, 2022, 7:42 a.m. UTC | #2
On Fri, Nov 04, 2022 at 11:26:39AM IST, Dave Marchevsky wrote:
> On 11/3/22 3:10 PM, Kumar Kartikeya Dwivedi wrote:
> > Add a linked list API for use in BPF programs, where it expects
> > protection from the bpf_spin_lock in the same allocation as the
> > bpf_list_head. Future patches will extend the same infrastructure to
> > have different flavors with varying protection domains and visibility
> > (e.g. percpu variant with local_t protection, usable in NMI progs).
> >
> > The following functions are added to kick things off:
> >
> > bpf_list_push_front
> > bpf_list_push_back
> > bpf_list_pop_front
> > bpf_list_pop_back
> >
> > The lock protecting the bpf_list_head needs to be taken for all
> > operations.
> >
> > Once a node has been added to the list, it's pointer changes to
> > PTR_UNTRUSTED. However, it is only released once the lock protecting the
> > list is unlocked. For such local kptrs with PTR_UNTRUSTED set but an
> > active ref_obj_id, it is still permitted to read and write to them as
> > long as the lock is held.
>
> I think "still permitted to ... write to them" is not accurate
> for this version of the series. In v2 you mentioned [0]:
>
> """
> I have switched things a bit to disallow stores, which is a bug right now in
> this set, because one can do this:
>
> push_front(head, &p->node);
> p2 = container_of(pop_front(head));
> // p2 == p
> bpf_obj_drop(p2);
> p->data = ...;
>
> One can always fully initialize the object _before_ inserting it into the list,
> in some cases that will be the requirement (like adding to RCU protected lists)
> for correctness.
> """
>
> I confirmed this is currently the case by moving data write after
> list_push in the selftest and running it:
>
> @@ -87,8 +87,8 @@ static __always_inline int list_push_pop(struct bpf_spin_lock *lock,
>         }
>
>         bpf_spin_lock(lock);
> -       f->data = 13;
>         bpf_list_push_front(head, &f->node);
> +       f->data = 13;
>         bpf_spin_unlock(lock);
>
> Got "only read is supported" from verifier.
> I think it's fine to punt on supporting writes for now and do it in followups.
>

Thanks for catching it, I'll fix up the commit message.

Also, just to manage expectations: I think enabling writes after pushing the
object won't be possible to make safe, unless the definition of "safe" is
twisted.

As shown in that example, we can reach a point where the object has been freed
but we still hold an untrusted pointer to it. Once it has been freed, the object
can be reallocated and be in use again concurrently, possibly as a different type.

I was contemplating whether to simply drop this whole set_release_on_unlock
logic entirely. Not sure it's worth the added complexity, at least for now. Once
you push, you simply lose ownership of the object and any registers are
immediately killed.
Dave Marchevsky Nov. 5, 2022, 2:15 a.m. UTC | #3
On 11/4/22 3:42 AM, Kumar Kartikeya Dwivedi wrote:
> On Fri, Nov 04, 2022 at 11:26:39AM IST, Dave Marchevsky wrote:
>> On 11/3/22 3:10 PM, Kumar Kartikeya Dwivedi wrote:
>>> Add a linked list API for use in BPF programs, where it expects
>>> protection from the bpf_spin_lock in the same allocation as the
>>> bpf_list_head. Future patches will extend the same infrastructure to
>>> have different flavors with varying protection domains and visibility
>>> (e.g. percpu variant with local_t protection, usable in NMI progs).
>>>
>>> The following functions are added to kick things off:
>>>
>>> bpf_list_push_front
>>> bpf_list_push_back
>>> bpf_list_pop_front
>>> bpf_list_pop_back
>>>
>>> The lock protecting the bpf_list_head needs to be taken for all
>>> operations.
>>>
>>> Once a node has been added to the list, it's pointer changes to
>>> PTR_UNTRUSTED. However, it is only released once the lock protecting the
>>> list is unlocked. For such local kptrs with PTR_UNTRUSTED set but an
>>> active ref_obj_id, it is still permitted to read and write to them as
>>> long as the lock is held.
>>
>> I think "still permitted to ... write to them" is not accurate
>> for this version of the series. In v2 you mentioned [0]:
>>
>> """
>> I have switched things a bit to disallow stores, which is a bug right now in
>> this set, because one can do this:
>>
>> push_front(head, &p->node);
>> p2 = container_of(pop_front(head));
>> // p2 == p
>> bpf_obj_drop(p2);
>> p->data = ...;
>>
>> One can always fully initialize the object _before_ inserting it into the list,
>> in some cases that will be the requirement (like adding to RCU protected lists)
>> for correctness.
>> """
>>
>> I confirmed this is currently the case by moving data write after
>> list_push in the selftest and running it:
>>
>> @@ -87,8 +87,8 @@ static __always_inline int list_push_pop(struct bpf_spin_lock *lock,
>>         }
>>
>>         bpf_spin_lock(lock);
>> -       f->data = 13;
>>         bpf_list_push_front(head, &f->node);
>> +       f->data = 13;
>>         bpf_spin_unlock(lock);
>>
>> Got "only read is supported" from verifier.
>> I think it's fine to punt on supporting writes for now and do it in followups.
>>
> 
> Thanks for catching it, I'll fix up the commit message.
> 
> Also, just to manage the expectations I think enabling writes after pushing the
> object won't be possible to make safe, unless the definition of "safe" is
> twisted.
> 
> As shown in that example, we can reach a point where it has been freed but we
> hold an untrusted pointer to it. Once it has been freed the object can be
> reallocated and be in use again concurrently, possibly as a different type.
I think this patchset prevents that example from passing the verifier, provided
that bpf_spin_{lock,unlock} are added in the right places:

  p = bpf_obj_new(typeof(*p));
  if (!p)
    return;

  bpf_spin_lock(lock);
  push_front(head, &p->node);
  p2 = container_of(pop_front(head));
  // p2 == p
  if (!p2)
    return;
  bpf_spin_unlock(lock);

  bpf_obj_drop(p2);
  p->data = ...;

After bpf_obj_new R0 is of type ptr_or_null_BTF_TYPE with a nonzero ref_obj_id,
let's say 1.
After the null check any regs holding p have _or_null removed.

After push_front any regs w/ ref_obj_id==1 become untrusted and have
release_on_unlock set. After pop_front R0 is ptr_or_null_BTF_TYPE with a nonzero
ref_obj_id different from the bpf_obj_new result, let's say 2.
The null check works the same way as before.

After bpf_spin_unlock all release_on_unlock references are clobbered, so
p->data will result in "invalid mem access 'scalar'" verifier error.

The verifier requires list manipulation functions to be in a critical section
and doesn't allow bpf_obj_new or bpf_obj_drop in the critical section.
p2 == p but having different ref_obj_ids is OK because only p2 will escape
bpf_spin_unlock unclobbered to be released by bpf_obj_drop, and no free()s
are allowed until we're out of the critical section. So by the time we can
call bpf_obj_drop, we only have one valid reference remaining (the "trusted" one).

My OBJ_NON_OWNING_REF stuff in the rbtree patchsets had the same logic; IIRC the
only difference is that my "add to data structure" function was considered
a RELEASE func that also did an ACQUIRE of a release_on_unlock retval. But this
was before the static lock checking discussions, so the RELEASE/ACQUIRE could fail.
I think directly modifying the arg reg like you're doing in set_release_on_unlock
is cleaner.

> I was contemplating whether to simply drop this whole set_release_on_unlock
> logic entirely. Not sure it's worth the added complexity, atleast for now. Once
> you push you simply lose ownership of the object and any registers are
> immediately killed.

I think that being able to read / modify the data structure node after it's been
added is pretty critical, at least from a UX perspective.

Totally fine with it being dropped from the series and experimented with
later, though.
Alexei Starovoitov Nov. 5, 2022, 6:16 p.m. UTC | #4
On Fri, Nov 4, 2022 at 7:15 PM Dave Marchevsky <davemarchevsky@meta.com> wrote:
>
> > I was contemplating whether to simply drop this whole set_release_on_unlock
> > logic entirely. Not sure it's worth the added complexity, atleast for now. Once
> > you push you simply lose ownership of the object and any registers are
> > immediately killed.
>
> I think that being able to read / modify the datastructure node after it's been
> added is pretty critical, at least from a UX perspective.
>
> Totally fine with it being dropped from the series and experimented with
> later, though.

Kumar,
please split the release_on_unlock logic into a separate patch.
afaics it doesn't have to be introduced together with these kfuncs.
Kumar Kartikeya Dwivedi Nov. 6, 2022, 1:53 a.m. UTC | #5
On Sat, 5 Nov 2022 at 23:46, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Fri, Nov 4, 2022 at 7:15 PM Dave Marchevsky <davemarchevsky@meta.com> wrote:
> >
> > > I was contemplating whether to simply drop this whole set_release_on_unlock
> > > logic entirely. Not sure it's worth the added complexity, atleast for now. Once
> > > you push you simply lose ownership of the object and any registers are
> > > immediately killed.
> >
> > I think that being able to read / modify the datastructure node after it's been
> > added is pretty critical, at least from a UX perspective.
> >
> > Totally fine with it being dropped from the series and experimented with
> > later, though.
>
> Kumar,
> please split release_on_unlock logic into separate patch.
> afaics it doesn't have to be introduced together with these kfuncs.

Ack.

Patch

diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h
index 1e9c782e0974..bde8f9e11132 100644
--- a/include/linux/bpf_verifier.h
+++ b/include/linux/bpf_verifier.h
@@ -229,6 +229,11 @@  struct bpf_reference_state {
 	 * exiting a callback function.
 	 */
 	int callback_ref;
+	/* Mark the reference state to release the registers sharing the same id
+	 * on bpf_spin_unlock (for nodes that we will lose ownership to but are
+	 * safe to access inside the critical section).
+	 */
+	bool release_on_unlock;
 };
 
 /* state of the program:
diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
index a30f6573e805..0acd87ed22fc 100644
--- a/kernel/bpf/helpers.c
+++ b/kernel/bpf/helpers.c
@@ -1774,6 +1774,50 @@  void bpf_obj_drop_impl(void *p__lkptr, void *meta__ign)
 	bpf_mem_free(&bpf_global_ma, p);
 }
 
+static void __bpf_list_add(struct bpf_list_node *node, struct bpf_list_head *head, bool tail)
+{
+	struct list_head *n = (void *)node, *h = (void *)head;
+
+	if (unlikely(!h->next))
+		INIT_LIST_HEAD(h);
+	if (unlikely(!n->next))
+		INIT_LIST_HEAD(n);
+	tail ? list_add_tail(n, h) : list_add(n, h);
+}
+
+void bpf_list_push_front(struct bpf_list_head *head, struct bpf_list_node *node)
+{
+	return __bpf_list_add(node, head, false);
+}
+
+void bpf_list_push_back(struct bpf_list_head *head, struct bpf_list_node *node)
+{
+	return __bpf_list_add(node, head, true);
+}
+
+static struct bpf_list_node *__bpf_list_del(struct bpf_list_head *head, bool tail)
+{
+	struct list_head *n, *h = (void *)head;
+
+	if (unlikely(!h->next))
+		INIT_LIST_HEAD(h);
+	if (list_empty(h))
+		return NULL;
+	n = tail ? h->prev : h->next;
+	list_del_init(n);
+	return (struct bpf_list_node *)n;
+}
+
+struct bpf_list_node *bpf_list_pop_front(struct bpf_list_head *head)
+{
+	return __bpf_list_del(head, false);
+}
+
+struct bpf_list_node *bpf_list_pop_back(struct bpf_list_head *head)
+{
+	return __bpf_list_del(head, true);
+}
+
 __diag_pop();
 
 BTF_SET8_START(generic_btf_ids)
@@ -1781,7 +1825,11 @@  BTF_SET8_START(generic_btf_ids)
 BTF_ID_FLAGS(func, crash_kexec, KF_DESTRUCTIVE)
 #endif
 BTF_ID_FLAGS(func, bpf_obj_new_impl, KF_ACQUIRE | KF_RET_NULL)
-BTF_ID_FLAGS(func, bpf_kptr_drop_impl, KF_RELEASE)
+BTF_ID_FLAGS(func, bpf_obj_drop_impl, KF_RELEASE)
+BTF_ID_FLAGS(func, bpf_list_push_front)
+BTF_ID_FLAGS(func, bpf_list_push_back)
+BTF_ID_FLAGS(func, bpf_list_pop_front, KF_ACQUIRE | KF_RET_NULL)
+BTF_ID_FLAGS(func, bpf_list_pop_back, KF_ACQUIRE | KF_RET_NULL)
 BTF_SET8_END(generic_btf_ids)
 
 static const struct btf_kfunc_id_set generic_kfunc_set = {
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 58e58678382a..c3675f858707 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -5495,7 +5495,9 @@  static int process_spin_lock(struct bpf_verifier_env *env, int regno,
 			cur->active_spin_lock_ptr = btf;
 		cur->active_spin_lock_id = reg->id;
 	} else {
+		struct bpf_func_state *fstate = cur_func(env);
 		void *ptr;
+		int i;
 
 		if (map)
 			ptr = map;
@@ -5513,6 +5515,16 @@  static int process_spin_lock(struct bpf_verifier_env *env, int regno,
 		}
 		cur->active_spin_lock_ptr = NULL;
 		cur->active_spin_lock_id = 0;
+
+		for (i = 0; i < fstate->acquired_refs; i++) {
+			/* WARN because this reference state cannot be freed
+			 * before this point, as bpf_spin_lock CS does not
+			 * allow functions that release the local kptr
+			 * immediately.
+			 */
+			if (fstate->refs[i].release_on_unlock)
+				WARN_ON_ONCE(release_reference(env, fstate->refs[i].id));
+		}
 	}
 	return 0;
 }
@@ -7708,6 +7720,9 @@  struct bpf_kfunc_call_arg_meta {
 		struct btf *btf;
 		u32 btf_id;
 	} arg_obj_drop;
+	struct {
+		struct btf_field *field;
+	} arg_list_head;
 };
 
 static bool is_kfunc_acquire(struct bpf_kfunc_call_arg_meta *meta)
@@ -7818,13 +7833,17 @@  static bool is_kfunc_arg_ret_buf_size(const struct btf *btf,
 
 enum {
 	KF_ARG_DYNPTR_ID,
+	KF_ARG_LIST_HEAD_ID,
+	KF_ARG_LIST_NODE_ID,
 };
 
 BTF_ID_LIST(kf_arg_btf_ids)
 BTF_ID(struct, bpf_dynptr_kern)
+BTF_ID(struct, bpf_list_head)
+BTF_ID(struct, bpf_list_node)
 
-static bool is_kfunc_arg_dynptr(const struct btf *btf,
-				const struct btf_param *arg)
+static bool __is_kfunc_ptr_arg_type(const struct btf *btf,
+				    const struct btf_param *arg, int type)
 {
 	const struct btf_type *t;
 	u32 res_id;
@@ -7837,7 +7856,22 @@  static bool is_kfunc_arg_dynptr(const struct btf *btf,
 	t = btf_type_skip_modifiers(btf, t->type, &res_id);
 	if (!t)
 		return false;
-	return btf_types_are_same(btf, res_id, btf_vmlinux, kf_arg_btf_ids[KF_ARG_DYNPTR_ID]);
+	return btf_types_are_same(btf, res_id, btf_vmlinux, kf_arg_btf_ids[type]);
+}
+
+static bool is_kfunc_arg_dynptr(const struct btf *btf, const struct btf_param *arg)
+{
+	return __is_kfunc_ptr_arg_type(btf, arg, KF_ARG_DYNPTR_ID);
+}
+
+static bool is_kfunc_arg_list_head(const struct btf *btf, const struct btf_param *arg)
+{
+	return __is_kfunc_ptr_arg_type(btf, arg, KF_ARG_LIST_HEAD_ID);
+}
+
+static bool is_kfunc_arg_list_node(const struct btf *btf, const struct btf_param *arg)
+{
+	return __is_kfunc_ptr_arg_type(btf, arg, KF_ARG_LIST_NODE_ID);
 }
 
 /* Returns true if struct is composed of scalars, 4 levels of nesting allowed */
@@ -7892,9 +7926,11 @@  static u32 *reg2btf_ids[__BPF_REG_TYPE_MAX] = {
 enum kfunc_ptr_arg_type {
 	KF_ARG_PTR_TO_CTX,
 	KF_ARG_PTR_TO_LOCAL_BTF_ID,  /* Local kptr */
-	KF_ARG_PTR_TO_BTF_ID,	     /* Also covers reg2btf_ids conversions */
 	KF_ARG_PTR_TO_KPTR_STRONG,   /* PTR_TO_KPTR but type specific */
 	KF_ARG_PTR_TO_DYNPTR,
+	KF_ARG_PTR_TO_LIST_HEAD,
+	KF_ARG_PTR_TO_LIST_NODE,
+	KF_ARG_PTR_TO_BTF_ID,	     /* Also covers reg2btf_ids conversions */
 	KF_ARG_PTR_TO_MEM,
 	KF_ARG_PTR_TO_MEM_SIZE,	     /* Size derived from next argument, skip it */
 };
@@ -7902,16 +7938,28 @@  enum kfunc_ptr_arg_type {
 enum special_kfunc_type {
 	KF_bpf_obj_new_impl,
 	KF_bpf_obj_drop_impl,
+	KF_bpf_list_push_front,
+	KF_bpf_list_push_back,
+	KF_bpf_list_pop_front,
+	KF_bpf_list_pop_back,
 };
 
 BTF_SET_START(special_kfunc_set)
 BTF_ID(func, bpf_obj_new_impl)
 BTF_ID(func, bpf_obj_drop_impl)
+BTF_ID(func, bpf_list_push_front)
+BTF_ID(func, bpf_list_push_back)
+BTF_ID(func, bpf_list_pop_front)
+BTF_ID(func, bpf_list_pop_back)
 BTF_SET_END(special_kfunc_set)
 
 BTF_ID_LIST(special_kfunc_list)
 BTF_ID(func, bpf_obj_new_impl)
 BTF_ID(func, bpf_obj_drop_impl)
+BTF_ID(func, bpf_list_push_front)
+BTF_ID(func, bpf_list_push_back)
+BTF_ID(func, bpf_list_pop_front)
+BTF_ID(func, bpf_list_pop_back)
 
 static enum kfunc_ptr_arg_type
 get_kfunc_ptr_arg_type(struct bpf_verifier_env *env,
@@ -7936,15 +7984,6 @@  get_kfunc_ptr_arg_type(struct bpf_verifier_env *env,
 	if (is_kfunc_arg_local_kptr(meta->btf, &args[argno]))
 		return KF_ARG_PTR_TO_LOCAL_BTF_ID;
 
-	if ((base_type(reg->type) == PTR_TO_BTF_ID || reg2btf_ids[base_type(reg->type)])) {
-		if (!btf_type_is_struct(ref_t)) {
-			verbose(env, "kernel function %s args#%d pointer type %s %s is not supported\n",
-				meta->func_name, argno, btf_type_str(ref_t), ref_tname);
-			return -EINVAL;
-		}
-		return KF_ARG_PTR_TO_BTF_ID;
-	}
-
 	if (is_kfunc_arg_kptr_get(meta, argno)) {
 		if (!btf_type_is_ptr(ref_t)) {
 			verbose(env, "arg#0 BTF type must be a double pointer for kptr_get kfunc\n");
@@ -7963,6 +8002,21 @@  get_kfunc_ptr_arg_type(struct bpf_verifier_env *env,
 	if (is_kfunc_arg_dynptr(meta->btf, &args[argno]))
 		return KF_ARG_PTR_TO_DYNPTR;
 
+	if (is_kfunc_arg_list_head(meta->btf, &args[argno]))
+		return KF_ARG_PTR_TO_LIST_HEAD;
+
+	if (is_kfunc_arg_list_node(meta->btf, &args[argno]))
+		return KF_ARG_PTR_TO_LIST_NODE;
+
+	if ((base_type(reg->type) == PTR_TO_BTF_ID || reg2btf_ids[base_type(reg->type)])) {
+		if (!btf_type_is_struct(ref_t)) {
+			verbose(env, "kernel function %s args#%d pointer type %s %s is not supported\n",
+				meta->func_name, argno, btf_type_str(ref_t), ref_tname);
+			return -EINVAL;
+		}
+		return KF_ARG_PTR_TO_BTF_ID;
+	}
+
 	if (argno + 1 < nargs && is_kfunc_arg_mem_size(meta->btf, &args[argno + 1], &regs[regno + 1]))
 		arg_mem_size = true;
 
@@ -8049,6 +8103,225 @@  static int process_kf_arg_ptr_to_kptr_strong(struct bpf_verifier_env *env,
 	return 0;
 }
 
+static int ref_set_release_on_unlock(struct bpf_verifier_env *env, u32 ref_obj_id)
+{
+	struct bpf_func_state *state = cur_func(env);
+	struct bpf_reg_state *reg;
+	int i;
+
+	/* bpf_spin_lock only allows calling list_push and list_pop, no BPF
+	 * subprogs, no global functions, so this acquired refs state will
+	 * remain unchanged till we find registers to kill on bpf_spin_unlock.
+	 *
+	 * The acquired refs state is therefore not modified inside the
+	 * bpf_spin_lock critical section by any means, nor is it copied into
+	 * another frame as subprog calls are disallowed.
+	 */
+	if (!ref_obj_id) {
+		verbose(env, "verifier internal error: ref_obj_id is zero for release_on_unlock\n");
+		return -EFAULT;
+	}
+	for (i = 0; i < state->acquired_refs; i++) {
+		if (state->refs[i].id == ref_obj_id) {
+			WARN_ON_ONCE(state->refs[i].release_on_unlock);
+			state->refs[i].release_on_unlock = true;
+			/* Now mark everyone sharing same ref_obj_id as untrusted */
+			bpf_for_each_reg_in_vstate(env->cur_state, state, reg, ({
+				if (reg->ref_obj_id == ref_obj_id)
+					reg->type |= PTR_UNTRUSTED;
+			}));
+			return 0;
+		}
+	}
+	verbose(env, "verifier internal error: ref state missing for ref_obj_id\n");
+	return -EFAULT;
+}
+
+/* Implementation details:
+ *
+ * Each register points to some region of memory, which we define as an
+ * allocation. Each allocation may embed a bpf_spin_lock which protects any
+ * special BPF objects (bpf_list_head, bpf_rb_root, etc.) part of the same
+ * allocation. The lock and the data it protects are co-located in the same
+ * memory region.
+ *
+ * Hence, everytime a register holds a pointer value pointing to such
+ * allocation, the verifier preserves a unique reg->id for it.
+ *
+ * The verifier remembers the lock 'class' and the lock 'id' whenever
+ * bpf_spin_lock is called.
+ *
+ * To enable this, lock state in the verifier captures two values:
+ *	active_spin_lock_ptr = A value identifying the register's class
+ *	active_spin_lock_id  = A unique ID for each register pointer value
+ *
+ * Currently, PTR_TO_MAP_VALUE and PTR_TO_BTF_ID | MEM_TYPE_LOCAL are the two
+ * supported register types.
+ *
+ * The active_spin_lock_ptr in case of map values is the reg->map_ptr, and in
+ * case of local kptrs is the reg->btf pointer.
+ *
+ * The active_spin_lock_id is non-unique for maps supporting direct_value_addr,
+ * as we can establish the provenance of the map value statically for each
+ * distinct lookup into such maps.
+ *
+ * In case of global variables, they use array maps with max_entries = 1, hence
+ * their active_spin_lock_ptr becomes map_ptr and id = 0 (since they all point
+ * into the same map value as max_entries is 1).
+ *
+ * In case of inner map lookups, the inner map pointer has same map_ptr as the
+ * outer map pointer (in verifier context), but each lookup into an inner map
+ * assigns a fresh reg->id to the lookup, so while lookups into distinct inner
+ * maps from the same outer map share the same map_ptr as active_spin_lock_ptr,
+ * they will get different reg->id assigned to each lookup.
+ *
+ * In case of local kptrs, active_spin_lock_ptr is the reg->btf, and the reg->id
+ * is a unique ID preserved after the NULL pointer check on the local kptr after
+ * its allocation using bpf_obj_new.
+ */
+static int check_reg_allocation_locked(struct bpf_verifier_env *env, struct bpf_reg_state *reg)
+{
+	void *ptr;
+	u32 id;
+
+	switch ((int)reg->type) {
+	case PTR_TO_MAP_VALUE:
+		ptr = reg->map_ptr;
+		break;
+	case PTR_TO_BTF_ID | MEM_TYPE_LOCAL:
+		ptr = reg->btf;
+		break;
+	default:
+		verbose(env, "verifier internal error: unknown reg type for lock check\n");
+		return -EFAULT;
+	}
+	id = reg->id;
+
+	if (env->cur_state->active_spin_lock_ptr != ptr ||
+	    env->cur_state->active_spin_lock_id != id) {
+		verbose(env, "mismatch between held lock and object allocation provenance\n");
+		return -EINVAL;
+	}
+	return 0;
+}
+
+static bool is_bpf_list_api_kfunc(u32 btf_id)
+{
+	return btf_id == special_kfunc_list[KF_bpf_list_push_front] ||
+	       btf_id == special_kfunc_list[KF_bpf_list_push_back] ||
+	       btf_id == special_kfunc_list[KF_bpf_list_pop_front] ||
+	       btf_id == special_kfunc_list[KF_bpf_list_pop_back];
+}
+
+static int process_kf_arg_ptr_to_list_head(struct bpf_verifier_env *env,
+					   struct bpf_reg_state *reg, u32 regno,
+					   struct bpf_kfunc_call_arg_meta *meta)
+{
+	struct btf_record *rec = NULL;
+	struct btf_field *field;
+	u32 list_head_off;
+
+	if (meta->btf != btf_vmlinux || !is_bpf_list_api_kfunc(meta->func_id)) {
+		verbose(env, "verifier internal error: bpf_list_head argument for unknown kfunc\n");
+		return -EFAULT;
+	}
+
+	if (reg->type == PTR_TO_MAP_VALUE) {
+		rec = reg->map_ptr->record;
+	} else /* PTR_TO_BTF_ID | MEM_TYPE_LOCAL */ {
+		struct btf_struct_meta *meta;
+
+		meta = btf_find_struct_meta(reg->btf, reg->btf_id);
+		if (!meta) {
+			verbose(env, "bpf_list_head not found for local kptr\n");
+			return -EINVAL;
+		}
+		rec = meta->record;
+	}
+
+	if (!tnum_is_const(reg->var_off)) {
+		verbose(env,
+			"R%d doesn't have constant offset. bpf_list_head has to be at the constant offset\n",
+			regno);
+		return -EINVAL;
+	}
+
+	list_head_off = reg->off + reg->var_off.value;
+	field = btf_record_find(rec, list_head_off, BPF_LIST_HEAD);
+	if (!field) {
+		verbose(env, "bpf_list_head not found at offset=%u\n", list_head_off);
+		return -EINVAL;
+	}
+
+	/* All functions require bpf_list_head to be protected using a bpf_spin_lock */
+	if (check_reg_allocation_locked(env, reg)) {
+		verbose(env, "bpf_spin_lock at off=%d must be held for manipulating bpf_list_head\n",
+			rec->spin_lock_off);
+		return -EINVAL;
+	}
+
+	if (meta->arg_list_head.field) {
+		verbose(env, "verifier internal error: repeating bpf_list_head arg\n");
+		return -EFAULT;
+	}
+	meta->arg_list_head.field = field;
+	return 0;
+}
+
+static int process_kf_arg_ptr_to_list_node(struct bpf_verifier_env *env,
+					   struct bpf_reg_state *reg, u32 regno,
+					   struct bpf_kfunc_call_arg_meta *meta)
+{
+	struct btf_struct_meta *struct_meta;
+	struct btf_field *field;
+	struct btf_record *rec;
+	u32 list_node_off;
+
+	if (meta->btf != btf_vmlinux ||
+	    (meta->func_id != special_kfunc_list[KF_bpf_list_push_front] &&
+	     meta->func_id != special_kfunc_list[KF_bpf_list_push_back])) {
+		verbose(env, "verifier internal error: bpf_list_head argument for unknown kfunc\n");
+		return -EFAULT;
+	}
+
+	if (!tnum_is_const(reg->var_off)) {
+		verbose(env,
+			"R%d doesn't have constant offset. bpf_list_head has to be at the constant offset\n",
+			regno);
+		return -EINVAL;
+	}
+
+	struct_meta = btf_find_struct_meta(reg->btf, reg->btf_id);
+	if (!struct_meta) {
+		verbose(env, "bpf_list_node not found for local kptr\n");
+		return -EINVAL;
+	}
+	rec = struct_meta->record;
+
+	list_node_off = reg->off + reg->var_off.value;
+	field = btf_record_find(rec, list_node_off, BPF_LIST_NODE);
+	if (!field || field->offset != list_node_off) {
+		verbose(env, "bpf_list_node not found at offset=%u\n", list_node_off);
+		return -EINVAL;
+	}
+
+	field = meta->arg_list_head.field;
+
+	if (!btf_struct_ids_match(&env->log, reg->btf, reg->btf_id, 0, field->list_head.btf,
+				  field->list_head.value_btf_id, true)) {
+		verbose(env, "bpf_list_head value type does not match arg#1\n");
+		return -EINVAL;
+	}
+
+	if (list_node_off != field->list_head.node_offset) {
+		verbose(env, "arg#1 offset must be for bpf_list_node at off=%d\n",
+			field->list_head.node_offset);
+		return -EINVAL;
+	}
+	/* Set arg#1 for expiration after unlock */
+	return ref_set_release_on_unlock(env, reg->ref_obj_id);
+}
+
 static int check_kfunc_args(struct bpf_verifier_env *env, struct bpf_kfunc_call_arg_meta *meta)
 {
 	const char *func_name = meta->func_name, *ref_tname;
@@ -8167,6 +8440,8 @@  static int check_kfunc_args(struct bpf_verifier_env *env, struct bpf_kfunc_call_
 			break;
 		case KF_ARG_PTR_TO_KPTR_STRONG:
 		case KF_ARG_PTR_TO_DYNPTR:
+		case KF_ARG_PTR_TO_LIST_HEAD:
+		case KF_ARG_PTR_TO_LIST_NODE:
 		case KF_ARG_PTR_TO_MEM:
 		case KF_ARG_PTR_TO_MEM_SIZE:
 			/* Trusted by default */
@@ -8204,17 +8479,6 @@  static int check_kfunc_args(struct bpf_verifier_env *env, struct bpf_kfunc_call_
 				meta->arg_obj_drop.btf_id = reg->btf_id;
 			}
 			break;
-		case KF_ARG_PTR_TO_BTF_ID:
-			/* Only base_type is checked, further checks are done here */
-			if (reg->type != PTR_TO_BTF_ID &&
-			    (!reg2btf_ids[base_type(reg->type)] || type_flag(reg->type))) {
-				verbose(env, "arg#%d expected pointer to btf or socket\n", i);
-				return -EINVAL;
-			}
-			ret = process_kf_arg_ptr_to_btf_id(env, reg, ref_t, ref_tname, ref_id, meta, i);
-			if (ret < 0)
-				return ret;
-			break;
 		case KF_ARG_PTR_TO_KPTR_STRONG:
 			if (reg->type != PTR_TO_MAP_VALUE) {
 				verbose(env, "arg#0 expected pointer to map value\n");
@@ -8242,6 +8506,44 @@  static int check_kfunc_args(struct bpf_verifier_env *env, struct bpf_kfunc_call_
 				return -EINVAL;
 			}
 			break;
+		case KF_ARG_PTR_TO_LIST_HEAD:
+			if (reg->type != PTR_TO_MAP_VALUE &&
+			    reg->type != (PTR_TO_BTF_ID | MEM_TYPE_LOCAL)) {
+				verbose(env, "arg#%d expected pointer to map value or local kptr\n", i);
+				return -EINVAL;
+			}
+			if (reg->type == (PTR_TO_BTF_ID | MEM_TYPE_LOCAL) && !reg->ref_obj_id) {
+				verbose(env, "local kptr must be referenced\n");
+				return -EINVAL;
+			}
+			ret = process_kf_arg_ptr_to_list_head(env, reg, regno, meta);
+			if (ret < 0)
+				return ret;
+			break;
+		case KF_ARG_PTR_TO_LIST_NODE:
+			if (reg->type != (PTR_TO_BTF_ID | MEM_TYPE_LOCAL)) {
+				verbose(env, "arg#%d expected point to local kptr\n", i);
+				return -EINVAL;
+			}
+			if (!reg->ref_obj_id) {
+				verbose(env, "local kptr must be referenced\n");
+				return -EINVAL;
+			}
+			ret = process_kf_arg_ptr_to_list_node(env, reg, regno, meta);
+			if (ret < 0)
+				return ret;
+			break;
+		case KF_ARG_PTR_TO_BTF_ID:
+			/* Only base_type is checked, further checks are done here */
+			if (reg->type != PTR_TO_BTF_ID &&
+			    (!reg2btf_ids[base_type(reg->type)] || type_flag(reg->type))) {
+				verbose(env, "arg#%d expected pointer to btf or socket\n", i);
+				return -EINVAL;
+			}
+			ret = process_kf_arg_ptr_to_btf_id(env, reg, ref_t, ref_tname, ref_id, meta, i);
+			if (ret < 0)
+				return ret;
+			break;
 		case KF_ARG_PTR_TO_MEM:
 			resolve_ret = btf_resolve_size(btf, ref_t, &type_size);
 			if (IS_ERR(resolve_ret)) {
@@ -8362,11 +8664,6 @@  static int check_kfunc_call(struct bpf_verifier_env *env, struct bpf_insn *insn,
 		ptr_type = btf_type_skip_modifiers(desc_btf, t->type, &ptr_type_id);
 
 		if (meta.btf == btf_vmlinux && btf_id_set_contains(&special_kfunc_set, meta.func_id)) {
-			if (!btf_type_is_void(ptr_type)) {
-				verbose(env, "kernel function %s must have void * return type\n",
-					meta.func_name);
-				return -EINVAL;
-			}
 			if (meta.func_id == special_kfunc_list[KF_bpf_obj_new_impl]) {
 				const struct btf_type *ret_t;
 				struct btf *ret_btf;
@@ -8404,6 +8701,15 @@  static int check_kfunc_call(struct bpf_verifier_env *env, struct bpf_insn *insn,
 				env->insn_aux_data[insn_idx].kptr_struct_meta =
 					btf_find_struct_meta(meta.arg_obj_drop.btf,
 							     meta.arg_obj_drop.btf_id);
+			} else if (meta.func_id == special_kfunc_list[KF_bpf_list_pop_front] ||
+				   meta.func_id == special_kfunc_list[KF_bpf_list_pop_back]) {
+				struct btf_field *field = meta.arg_list_head.field;
+
+				mark_reg_known_zero(env, regs, BPF_REG_0);
+				regs[BPF_REG_0].type = PTR_TO_BTF_ID | MEM_TYPE_LOCAL;
+				regs[BPF_REG_0].btf = field->list_head.btf;
+				regs[BPF_REG_0].btf_id = field->list_head.value_btf_id;
+				regs[BPF_REG_0].off = field->list_head.node_offset;
 			} else {
 				verbose(env, "kernel function %s unhandled dynamic return type\n",
 					meta.func_name);
@@ -13072,11 +13378,14 @@  static int do_check(struct bpf_verifier_env *env)
 					return -EINVAL;
 				}
 
-				if (env->cur_state->active_spin_lock_ptr &&
-				    (insn->src_reg == BPF_PSEUDO_CALL ||
-				     insn->imm != BPF_FUNC_spin_unlock)) {
-					verbose(env, "function calls are not allowed while holding a lock\n");
-					return -EINVAL;
+				if (env->cur_state->active_spin_lock_ptr) {
+					if ((insn->src_reg == BPF_REG_0 && insn->imm != BPF_FUNC_spin_unlock) ||
+					    (insn->src_reg == BPF_PSEUDO_CALL) ||
+					    (insn->src_reg == BPF_PSEUDO_KFUNC_CALL &&
+					     (insn->off != 0 || !is_bpf_list_api_kfunc(insn->imm)))) {
+						verbose(env, "function calls are not allowed while holding a lock\n");
+						return -EINVAL;
+					}
 				}
 				if (insn->src_reg == BPF_PSEUDO_CALL)
 					err = check_func_call(env, insn, &env->insn_idx);
diff --git a/tools/testing/selftests/bpf/bpf_experimental.h b/tools/testing/selftests/bpf/bpf_experimental.h
index 29a5520a4250..4a76c64e50ad 100644
--- a/tools/testing/selftests/bpf/bpf_experimental.h
+++ b/tools/testing/selftests/bpf/bpf_experimental.h
@@ -31,3 +31,31 @@  extern void bpf_obj_drop_impl(void *kptr, void *meta) __ksym;
 
 /* Convenience macro to wrap over bpf_obj_drop_impl */
 #define bpf_obj_drop(kptr) bpf_obj_drop_impl(kptr, NULL)
+
+/* Description
+ *	Add a new entry to the beginning of the BPF linked list.
+ * Returns
+ *	Void.
+ */
+extern void bpf_list_push_front(struct bpf_list_head *head, struct bpf_list_node *node) __ksym;
+
+/* Description
+ *	Add a new entry to the end of the BPF linked list.
+ * Returns
+ *	Void.
+ */
+extern void bpf_list_push_back(struct bpf_list_head *head, struct bpf_list_node *node) __ksym;
+
+/* Description
+ *	Remove the entry at the beginning of the BPF linked list.
+ * Returns
+ *	Pointer to bpf_list_node of deleted entry, or NULL if list is empty.
+ */
+extern struct bpf_list_node *bpf_list_pop_front(struct bpf_list_head *head) __ksym;
+
+/* Description
+ *	Remove the entry at the end of the BPF linked list.
+ * Returns
+ *	Pointer to bpf_list_node of deleted entry, or NULL if list is empty.
+ */
+extern struct bpf_list_node *bpf_list_pop_back(struct bpf_list_head *head) __ksym;
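
As an illustrative sketch of the pop side, assuming a struct foo that embeds
'int data' and 'struct bpf_list_node node' and a container_of macro defined
the usual offsetof way (names here are not from the patch):

	static __always_inline int pop_and_free(struct bpf_spin_lock *lock,
						struct bpf_list_head *head)
	{
		struct bpf_list_node *n;
		struct foo *f;

		bpf_spin_lock(lock);
		n = bpf_list_pop_front(head);
		bpf_spin_unlock(lock);
		if (!n)
			return 0;
		/* The popped reference owns the object; convert the bpf_list_node
		 * pointer back to the entry type and release it.
		 */
		f = container_of(n, struct foo, node);
		bpf_obj_drop(f);
		return 1;
	}

Note that bpf_obj_drop() is called only after bpf_spin_unlock(), since functions
that free the object are not allowed inside the critical section.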