[bpf-next,v3,03/13] bpf: Allow storing unreferenced kptr in map

Message ID 20220320155510.671497-4-memxor@gmail.com (mailing list archive)
State Changes Requested
Delegated to: BPF
Series Introduce typed pointer support in BPF maps

Checks

Context Check Description
netdev/tree_selection success Clearly marked for bpf-next
netdev/fixes_present success Fixes tag not required for -next series
netdev/subject_prefix success
netdev/cover_letter success Series has a cover letter
netdev/patch_count success
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 1456 this patch: 1456
netdev/cc_maintainers warning 6 maintainers not CCed: netdev@vger.kernel.org songliubraving@fb.com yhs@fb.com john.fastabend@gmail.com kafai@fb.com kpsingh@kernel.org
netdev/build_clang success Errors and warnings before: 172 this patch: 172
netdev/module_param success Was 0 now: 0
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn success Errors and warnings before: 1473 this patch: 1473
netdev/checkpatch warning WARNING: line length of 100 exceeds 80 columns WARNING: line length of 101 exceeds 80 columns WARNING: line length of 102 exceeds 80 columns WARNING: line length of 81 exceeds 80 columns WARNING: line length of 82 exceeds 80 columns WARNING: line length of 84 exceeds 80 columns WARNING: line length of 85 exceeds 80 columns WARNING: line length of 86 exceeds 80 columns WARNING: line length of 88 exceeds 80 columns WARNING: line length of 89 exceeds 80 columns WARNING: line length of 90 exceeds 80 columns WARNING: line length of 91 exceeds 80 columns WARNING: line length of 92 exceeds 80 columns WARNING: line length of 93 exceeds 80 columns WARNING: line length of 94 exceeds 80 columns WARNING: line length of 98 exceeds 80 columns
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0
bpf/vmtest-bpf-next-PR fail PR summary

Commit Message

Kumar Kartikeya Dwivedi March 20, 2022, 3:55 p.m. UTC
This commit introduces a new pointer type 'kptr' which can be embedded
in a map value to hold a PTR_TO_BTF_ID stored by a BPF program during
its invocation. When storing to such a kptr, the BPF program's
PTR_TO_BTF_ID register must have the same type as in the map value's
BTF, and loading a kptr marks the destination register as PTR_TO_BTF_ID
with the correct kernel BTF and BTF ID.

Such kptrs are unreferenced, i.e. by the time another invocation of the
BPF program loads this pointer, the object which the pointer points to
may no longer exist. Since PTR_TO_BTF_ID loads (using BPF_LDX) are
patched to PROBE_MEM loads by the verifier, it is safe to allow the
user to still access such an invalid pointer, but passing such pointers
into BPF helpers and kfuncs should not be permitted. A future patch in
this series will close this gap.

The flexibility offered by allowing programs to dereference such
invalid pointers while remaining safe at runtime frees the verifier
from doing complex lifetime tracking. As long as the user ensures that
the object remains valid, the data it reads from the kernel object will
be valid.

The user indicates that a certain pointer must be treated as a kptr
capable of accepting stores of PTR_TO_BTF_ID of a certain type by using
the BTF type tag 'kptr' on the pointed-to type of the pointer. This
information is then recorded in the object BTF, which is passed into
the kernel by way of the map's BTF information. The name and kind from
the map value BTF are used to look up the in-kernel type, and the
actual BTF and BTF ID are recorded in the map struct in a new
kptr_off_tab member. For now, only storing pointers to structs is
permitted.

An example of this specification is shown below:

	#define __kptr __attribute__((btf_type_tag("kptr")))

	struct map_value {
		...
		struct task_struct __kptr *task;
		...
	};

Then, in a BPF program, the user may store a PTR_TO_BTF_ID of type
task_struct into the map, and load it later.
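
For instance, a program could look as follows (a sketch only: the map
definition, section name, and helpers follow the usual libbpf
conventions and are not part of this patch):

	struct {
		__uint(type, BPF_MAP_TYPE_ARRAY);
		__uint(max_entries, 1);
		__type(key, int);
		__type(value, struct map_value);
	} array_map SEC(".maps");

	SEC("tp_btf/task_newtask")
	int BPF_PROG(store_task, struct task_struct *task, u64 clone_flags)
	{
		struct map_value *v;
		int key = 0;

		v = bpf_map_lookup_elem(&array_map, &key);
		if (v)
			/* Plain store of a PTR_TO_BTF_ID; the verifier
			 * checks that its type matches the map value BTF.
			 */
			v->task = task;
		return 0;
	}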

Note that the destination register is marked PTR_TO_BTF_ID_OR_NULL:
since the verifier cannot statically know whether the value is NULL, it
must treat all loads at that map value offset as loading a possibly
NULL pointer.
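
Continuing the sketch above, a later invocation loading the kptr must
therefore check for NULL before use:

	struct task_struct *t;

	t = v->task;	/* t is PTR_TO_BTF_ID_OR_NULL here */
	if (!t)
		return 0;
	/* t is now PTR_TO_BTF_ID; loads through it are patched to
	 * PROBE_MEM, so even a dangling pointer cannot crash the kernel.
	 */
	bpf_printk("comm: %s", t->comm);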

Only BPF_LDX, BPF_STX, and BPF_ST with insn->imm = 0 (to denote NULL)
are the instructions allowed to access such a pointer. On BPF_LDX, the
destination register is updated to be a PTR_TO_BTF_ID, and on BPF_STX,
it is checked whether the source register type is a PTR_TO_BTF_ID with
the same BTF type as specified in the map BTF. The access size must
always be BPF_DW.
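
In C source terms, these roughly correspond to the following
(illustrative only; the exact instruction selection is up to the
compiler):

	t = v->task;	/* BPF_LDX: dst becomes PTR_TO_BTF_ID_OR_NULL */
	v->task = task;	/* BPF_STX: src must be a matching PTR_TO_BTF_ID */
	v->task = NULL;	/* BPF_ST with imm = 0: clears the kptr */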

For map in map support, the kptr_off_tab for the outer map is copied
from the inner map's kptr_off_tab. A deep copy was chosen over
refcounting kptr_off_tab because the copy only needs to be done when
parameterizing using inner_map_fd in the map in map case, and would
hence be unnecessary for all other users.

It is not permitted to use the MAP_FREEZE command or mmap on a BPF map
containing kptrs, similar to the bpf_timer case.
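
From userspace this means that, e.g., freezing such a map fails (a
sketch using libbpf's bpf_map_freeze()):

	err = bpf_map_freeze(map_fd);	/* rejected with ENOTSUPP */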

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
---
 include/linux/bpf.h     |  29 +++++++-
 include/linux/btf.h     |   2 +
 kernel/bpf/btf.c        | 161 ++++++++++++++++++++++++++++++++++------
 kernel/bpf/map_in_map.c |   5 +-
 kernel/bpf/syscall.c    | 112 +++++++++++++++++++++++++++-
 kernel/bpf/verifier.c   | 120 ++++++++++++++++++++++++++++++
 6 files changed, 401 insertions(+), 28 deletions(-)

Comments

Joanne Koong March 21, 2022, 11:39 p.m. UTC | #1
On Sun, Mar 20, 2022 at 5:27 PM Kumar Kartikeya Dwivedi
<memxor@gmail.com> wrote:
>
[...]
> +
>  struct bpf_map *bpf_map_get(u32 ufd);
>  struct bpf_map *bpf_map_get_with_uref(u32 ufd);
>  struct bpf_map *__bpf_map_get(struct fd f);
> diff --git a/include/linux/btf.h b/include/linux/btf.h
> index 36bc09b8e890..5b578dc81c04 100644
> --- a/include/linux/btf.h
> +++ b/include/linux/btf.h
> @@ -123,6 +123,8 @@ bool btf_member_is_reg_int(const struct btf *btf, const struct btf_type *s,
>                            u32 expected_offset, u32 expected_size);
>  int btf_find_spin_lock(const struct btf *btf, const struct btf_type *t);
>  int btf_find_timer(const struct btf *btf, const struct btf_type *t);
> +struct bpf_map_value_off *btf_find_kptr(const struct btf *btf,
> +                                       const struct btf_type *t);

nit: given that "btf_find_kptr" allocates memory as well, maybe the
name "btf_parse_kptr" would be more reflective?

>  bool btf_type_is_void(const struct btf_type *t);
>  s32 btf_find_by_name_kind(const struct btf *btf, const char *name, u8 kind);
>  const struct btf_type *btf_type_skip_modifiers(const struct btf *btf,
> diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
> index 9e17af936a7a..92afbec0a887 100644
> --- a/kernel/bpf/btf.c
> +++ b/kernel/bpf/btf.c
> @@ -3164,9 +3164,16 @@ static void btf_struct_log(struct btf_verifier_env *env,
>  enum {
>         BTF_FIELD_SPIN_LOCK,
>         BTF_FIELD_TIMER,
> +       BTF_FIELD_KPTR,
> +};
> +
> +enum {
> +       BTF_FIELD_IGNORE = 0,
> +       BTF_FIELD_FOUND  = 1,
>  };
>
>  struct btf_field_info {
> +       const struct btf_type *type;
>         u32 off;
>  };
>
> @@ -3174,23 +3181,48 @@ static int btf_find_field_struct(const struct btf *btf, const struct btf_type *t
>                                  u32 off, int sz, struct btf_field_info *info)
>  {
>         if (!__btf_type_is_struct(t))
> -               return 0;
> +               return BTF_FIELD_IGNORE;
>         if (t->size != sz)
> -               return 0;
> -       if (info->off != -ENOENT)
> -               /* only one such field is allowed */
> -               return -E2BIG;
> +               return BTF_FIELD_IGNORE;
>         info->off = off;
> -       return 0;
> +       return BTF_FIELD_FOUND;
> +}
> +
> +static int btf_find_field_kptr(const struct btf *btf, const struct btf_type *t,
> +                              u32 off, int sz, struct btf_field_info *info)
> +{
> +       /* For PTR, sz is always == 8 */
> +       if (!btf_type_is_ptr(t))
> +               return BTF_FIELD_IGNORE;
> +       t = btf_type_by_id(btf, t->type);
> +
> +       if (!btf_type_is_type_tag(t))
> +               return BTF_FIELD_IGNORE;
> +       /* Reject extra tags */
> +       if (btf_type_is_type_tag(btf_type_by_id(btf, t->type)))
> +               return -EINVAL;
> +       if (strcmp("kptr", __btf_name_by_offset(btf, t->name_off)))
> +               return -EINVAL;
> +
> +       /* Get the base type */
> +       if (btf_type_is_modifier(t))
> +               t = btf_type_skip_modifiers(btf, t->type, NULL);
> +       /* Only pointer to struct is allowed */
> +       if (!__btf_type_is_struct(t))
> +               return -EINVAL;
> +
> +       info->type = t;
> +       info->off = off;
> +       return BTF_FIELD_FOUND;
>  }
>
>  static int btf_find_struct_field(const struct btf *btf, const struct btf_type *t,
>                                  const char *name, int sz, int align, int field_type,
> -                                struct btf_field_info *info)
> +                                struct btf_field_info *info, int info_cnt)

From my understanding, this patch now modifies btf_find_struct_field
and btf_find_datasec_var such that the "info" array that is passed in
has to be of max possible size + 1 while "info_cnt" is the max possible
count, or we risk writing beyond the "info" array passed in. It seems
like we could just modify the btf_find_struct_field/btf_find_datasec_var
logic so that the user can pass in an info array of the max possible
size instead of max possible size + 1, or is your concern that this
would require more idx >= info_cnt checks inside the functions? Maybe
we should include a comment here and in btf_find_datasec_var to
document that "info" should always be of max possible size + 1?
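
e.g., a comment along these lines (a sketch, not from the patch):

	/* 'info' must point to an array with room for info_cnt + 1
	 * entries; at most info_cnt fields are reported, the extra slot
	 * is needed because a candidate is written before the bounds
	 * check.
	 */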

>  {
>         const struct btf_member *member;
> +       int ret, idx = 0;
>         u32 i, off;
> -       int ret;
>
>         for_each_member(i, t, member) {
>                 const struct btf_type *member_type = btf_type_by_id(btf,
> @@ -3210,24 +3242,35 @@ static int btf_find_struct_field(const struct btf *btf, const struct btf_type *t
>                 switch (field_type) {
>                 case BTF_FIELD_SPIN_LOCK:
>                 case BTF_FIELD_TIMER:
> -                       ret = btf_find_field_struct(btf, member_type, off, sz, info);
> +                       ret = btf_find_field_struct(btf, member_type, off, sz, &info[idx]);
> +                       if (ret < 0)
> +                               return ret;
> +                       break;
> +               case BTF_FIELD_KPTR:
> +                       ret = btf_find_field_kptr(btf, member_type, off, sz, &info[idx]);
>                         if (ret < 0)
>                                 return ret;
>                         break;
>                 default:
>                         return -EFAULT;
>                 }
> +
> +               if (ret == BTF_FIELD_FOUND && idx >= info_cnt)
> +                       return -E2BIG;
> +               else if (ret == BTF_FIELD_IGNORE)
> +                       continue;
> +               ++idx;
>         }
> -       return 0;
> +       return idx;
>  }
>
>  static int btf_find_datasec_var(const struct btf *btf, const struct btf_type *t,
>                                 const char *name, int sz, int align, int field_type,
> -                               struct btf_field_info *info)
> +                               struct btf_field_info *info, int info_cnt)
>  {
>         const struct btf_var_secinfo *vsi;
> +       int ret, idx = 0;
>         u32 i, off;
> -       int ret;
>
>         for_each_vsi(i, t, vsi) {
>                 const struct btf_type *var = btf_type_by_id(btf, vsi->type);
> @@ -3245,19 +3288,30 @@ static int btf_find_datasec_var(const struct btf *btf, const struct btf_type *t,
>                 switch (field_type) {
>                 case BTF_FIELD_SPIN_LOCK:
>                 case BTF_FIELD_TIMER:
> -                       ret = btf_find_field_struct(btf, var_type, off, sz, info);
> +                       ret = btf_find_field_struct(btf, var_type, off, sz, &info[idx]);
> +                       if (ret < 0)
> +                               return ret;
> +                       break;
> +               case BTF_FIELD_KPTR:
> +                       ret = btf_find_field_kptr(btf, var_type, off, sz, &info[idx]);
>                         if (ret < 0)
>                                 return ret;
>                         break;
>                 default:
>                         return -EFAULT;
>                 }
> +
> +               if (ret == BTF_FIELD_FOUND && idx >= info_cnt)
> +                       return -E2BIG;
> +               if (ret == BTF_FIELD_IGNORE)
> +                       continue;
> +               ++idx;
>         }
> -       return 0;
> +       return idx;
>  }
>
>  static int btf_find_field(const struct btf *btf, const struct btf_type *t,
> -                         int field_type, struct btf_field_info *info)
> +                         int field_type, struct btf_field_info *info, int info_cnt)
>  {
>         const char *name;
>         int sz, align;
> @@ -3273,14 +3327,20 @@ static int btf_find_field(const struct btf *btf, const struct btf_type *t,
>                 sz = sizeof(struct bpf_timer);
>                 align = __alignof__(struct bpf_timer);
>                 break;
> +       case BTF_FIELD_KPTR:
> +               name = NULL;
> +               sz = sizeof(u64);
> +               align = __alignof__(u64);
> +               break;
>         default:
>                 return -EFAULT;
>         }
>
> +       /* The maximum allowed fields of a certain type will be info_cnt - 1 */
>         if (__btf_type_is_struct(t))
> -               return btf_find_struct_field(btf, t, name, sz, align, field_type, info);
> +               return btf_find_struct_field(btf, t, name, sz, align, field_type, info, info_cnt - 1);
>         else if (btf_type_is_datasec(t))
> -               return btf_find_datasec_var(btf, t, name, sz, align, field_type, info);
> +               return btf_find_datasec_var(btf, t, name, sz, align, field_type, info, info_cnt - 1);
>         return -EINVAL;
>  }
>
> @@ -3290,24 +3350,79 @@ static int btf_find_field(const struct btf *btf, const struct btf_type *t,
>   */
>  int btf_find_spin_lock(const struct btf *btf, const struct btf_type *t)
>  {
> -       struct btf_field_info info = { .off = -ENOENT };
> +       /* btf_find_field requires array of size max + 1 */
> +       struct btf_field_info info_arr[2];
>         int ret;
>
> -       ret = btf_find_field(btf, t, BTF_FIELD_SPIN_LOCK, &info);
> +       ret = btf_find_field(btf, t, BTF_FIELD_SPIN_LOCK, info_arr, ARRAY_SIZE(info_arr));
>         if (ret < 0)
>                 return ret;
> -       return info.off;
> +       if (!ret)
> +               return -ENOENT;
> +       return info_arr[0].off;
>  }
>
>  int btf_find_timer(const struct btf *btf, const struct btf_type *t)
>  {
> -       struct btf_field_info info = { .off = -ENOENT };
> +       /* btf_find_field requires array of size max + 1 */
> +       struct btf_field_info info_arr[2];
>         int ret;
>
> -       ret = btf_find_field(btf, t, BTF_FIELD_TIMER, &info);
> +       ret = btf_find_field(btf, t, BTF_FIELD_TIMER, info_arr, ARRAY_SIZE(info_arr));
>         if (ret < 0)
>                 return ret;
> -       return info.off;
> +       if (!ret)
> +               return -ENOENT;
> +       return info_arr[0].off;
> +}
> +
> +struct bpf_map_value_off *btf_find_kptr(const struct btf *btf,
> +                                       const struct btf_type *t)
> +{
> +       /* btf_find_field requires array of size max + 1 */
> +       struct btf_field_info info_arr[BPF_MAP_VALUE_OFF_MAX + 1];
> +       struct bpf_map_value_off *tab;
> +       int ret, i, nr_off;
> +
> +       /* Revisit stack usage when bumping BPF_MAP_VALUE_OFF_MAX */
> +       BUILD_BUG_ON(BPF_MAP_VALUE_OFF_MAX != 8);
> +
> +       ret = btf_find_field(btf, t, BTF_FIELD_KPTR, info_arr, ARRAY_SIZE(info_arr));
> +       if (ret < 0)
> +               return ERR_PTR(ret);
> +       if (!ret)
> +               return NULL;
> +
> +       nr_off = ret;
> +       tab = kzalloc(offsetof(struct bpf_map_value_off, off[nr_off]), GFP_KERNEL | __GFP_NOWARN);
> +       if (!tab)
> +               return ERR_PTR(-ENOMEM);
> +
> +       tab->nr_off = 0;
> +       for (i = 0; i < nr_off; i++) {
> +               const struct btf_type *t;
> +               struct btf *off_btf;
> +               s32 id;
> +
> +               t = info_arr[i].type;
> +               id = bpf_find_btf_id(__btf_name_by_offset(btf, t->name_off), BTF_INFO_KIND(t->info),
> +                                    &off_btf);
> +               if (id < 0) {
> +                       ret = id;
> +                       goto end;
> +               }
> +
> +               tab->off[i].offset = info_arr[i].off;
> +               tab->off[i].btf_id = id;
> +               tab->off[i].btf = off_btf;
> +               tab->nr_off = i + 1;
> +       }
> +       return tab;
> +end:
> +       while (tab->nr_off--)
> +               btf_put(tab->off[tab->nr_off].btf);
> +       kfree(tab);
> +       return ERR_PTR(ret);
>  }
>
>  static void __btf_struct_show(const struct btf *btf, const struct btf_type *t,
> diff --git a/kernel/bpf/map_in_map.c b/kernel/bpf/map_in_map.c
> index 5cd8f5277279..135205d0d560 100644
> --- a/kernel/bpf/map_in_map.c
> +++ b/kernel/bpf/map_in_map.c
> @@ -52,6 +52,7 @@ struct bpf_map *bpf_map_meta_alloc(int inner_map_ufd)
>         inner_map_meta->max_entries = inner_map->max_entries;
>         inner_map_meta->spin_lock_off = inner_map->spin_lock_off;
>         inner_map_meta->timer_off = inner_map->timer_off;
> +       inner_map_meta->kptr_off_tab = bpf_map_copy_kptr_off_tab(inner_map);
>         if (inner_map->btf) {
>                 btf_get(inner_map->btf);
>                 inner_map_meta->btf = inner_map->btf;
> @@ -71,6 +72,7 @@ struct bpf_map *bpf_map_meta_alloc(int inner_map_ufd)
>
>  void bpf_map_meta_free(struct bpf_map *map_meta)
>  {
> +       bpf_map_free_kptr_off_tab(map_meta);
>         btf_put(map_meta->btf);
>         kfree(map_meta);
>  }
> @@ -83,7 +85,8 @@ bool bpf_map_meta_equal(const struct bpf_map *meta0,
>                 meta0->key_size == meta1->key_size &&
>                 meta0->value_size == meta1->value_size &&
>                 meta0->timer_off == meta1->timer_off &&
> -               meta0->map_flags == meta1->map_flags;
> +               meta0->map_flags == meta1->map_flags &&
> +               bpf_map_equal_kptr_off_tab(meta0, meta1);
>  }
>
>  void *bpf_map_fd_get_ptr(struct bpf_map *map,
> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> index cdaa1152436a..5990d6fa97ab 100644
> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c
> @@ -6,6 +6,7 @@
>  #include <linux/bpf_trace.h>
>  #include <linux/bpf_lirc.h>
>  #include <linux/bpf_verifier.h>
> +#include <linux/bsearch.h>
>  #include <linux/btf.h>
>  #include <linux/syscalls.h>
>  #include <linux/slab.h>
> @@ -473,12 +474,95 @@ static void bpf_map_release_memcg(struct bpf_map *map)
>  }
>  #endif
>
> +static int bpf_map_kptr_off_cmp(const void *a, const void *b)
> +{
> +       const struct bpf_map_value_off_desc *off_desc1 = a, *off_desc2 = b;
> +
> +       if (off_desc1->offset < off_desc2->offset)
> +               return -1;
> +       else if (off_desc1->offset > off_desc2->offset)
> +               return 1;
> +       return 0;
> +}
> +
> +struct bpf_map_value_off_desc *bpf_map_kptr_off_contains(struct bpf_map *map, u32 offset)
> +{
> +       /* Since members are iterated in btf_find_field in increasing order,
> +        * offsets appended to kptr_off_tab are in increasing order, so we can
> +        * do bsearch to find exact match.
> +        */
> +       struct bpf_map_value_off *tab;
> +
> +       if (!map_value_has_kptr(map))
> +               return NULL;
> +       tab = map->kptr_off_tab;
> +       return bsearch(&offset, tab->off, tab->nr_off, sizeof(tab->off[0]), bpf_map_kptr_off_cmp);
> +}
> +
> +void bpf_map_free_kptr_off_tab(struct bpf_map *map)
> +{
> +       struct bpf_map_value_off *tab = map->kptr_off_tab;
> +       int i;
> +
> +       if (!map_value_has_kptr(map))
> +               return;
> +       for (i = 0; i < tab->nr_off; i++) {
> +               struct btf *btf = tab->off[i].btf;
> +
> +               btf_put(btf);
> +       }
> +       kfree(tab);
> +       map->kptr_off_tab = NULL;
> +}
> +
> +struct bpf_map_value_off *bpf_map_copy_kptr_off_tab(const struct bpf_map *map)
> +{
> +       struct bpf_map_value_off *tab = map->kptr_off_tab, *new_tab;
> +       int size, i, ret;
> +
> +       if (!map_value_has_kptr(map))
> +               return ERR_PTR(-ENOENT);
> +       /* Do a deep copy of the kptr_off_tab */
> +       for (i = 0; i < tab->nr_off; i++)
> +               btf_get(tab->off[i].btf);
> +
> +       size = offsetof(struct bpf_map_value_off, off[tab->nr_off]);
> +       new_tab = kzalloc(size, GFP_KERNEL | __GFP_NOWARN);
> +       if (!new_tab) {
> +               ret = -ENOMEM;
> +               goto end;
> +       }
> +       memcpy(new_tab, tab, size);
> +       return new_tab;
> +end:
> +       while (i--)
> +               btf_put(tab->off[i].btf);
> +       return ERR_PTR(ret);
> +}
> +
> +bool bpf_map_equal_kptr_off_tab(const struct bpf_map *map_a, const struct bpf_map *map_b)
> +{
> +       struct bpf_map_value_off *tab_a = map_a->kptr_off_tab, *tab_b = map_b->kptr_off_tab;
> +       bool a_has_kptr = map_value_has_kptr(map_a), b_has_kptr = map_value_has_kptr(map_b);
> +       int size;
> +
> +       if (!a_has_kptr && !b_has_kptr)
> +               return true;
> +       if ((a_has_kptr && !b_has_kptr) || (!a_has_kptr && b_has_kptr))

nit: I think we could simplify this second if check to
if (!a_has_kptr || !b_has_kptr)
    return false;

> +               return false;
> +       if (tab_a->nr_off != tab_b->nr_off)
> +               return false;
> +       size = offsetof(struct bpf_map_value_off, off[tab_a->nr_off]);
> +       return !memcmp(tab_a, tab_b, size);
> +}
> +
>  /* called from workqueue */
>  static void bpf_map_free_deferred(struct work_struct *work)
>  {
>         struct bpf_map *map = container_of(work, struct bpf_map, work);
>
>         security_bpf_map_free(map);
> +       bpf_map_free_kptr_off_tab(map);
>         bpf_map_release_memcg(map);
>         /* implementation dependent freeing */
>         map->ops->map_free(map);
> @@ -640,7 +724,7 @@ static int bpf_map_mmap(struct file *filp, struct vm_area_struct *vma)
>         int err;
>
>         if (!map->ops->map_mmap || map_value_has_spin_lock(map) ||
> -           map_value_has_timer(map))
> +           map_value_has_timer(map) || map_value_has_kptr(map))
>                 return -ENOTSUPP;
>
>         if (!(vma->vm_flags & VM_SHARED))
> @@ -820,9 +904,31 @@ static int map_check_btf(struct bpf_map *map, const struct btf *btf,
>                         return -EOPNOTSUPP;
>         }
>
> -       if (map->ops->map_check_btf)
> +       map->kptr_off_tab = btf_find_kptr(btf, value_type);
> +       if (map_value_has_kptr(map)) {
> +               if (!bpf_capable())
> +                       return -EPERM;
> +               if (map->map_flags & BPF_F_RDONLY_PROG) {
> +                       ret = -EACCES;
> +                       goto free_map_tab;
> +               }
> +               if (map->map_type != BPF_MAP_TYPE_HASH &&
> +                   map->map_type != BPF_MAP_TYPE_LRU_HASH &&
> +                   map->map_type != BPF_MAP_TYPE_ARRAY) {
> +                       ret = -EOPNOTSUPP;
> +                       goto free_map_tab;
> +               }
> +       }
> +
> +       if (map->ops->map_check_btf) {
>                 ret = map->ops->map_check_btf(map, btf, key_type, value_type);
> +               if (ret < 0)
> +                       goto free_map_tab;
> +       }
>
> +       return ret;
> +free_map_tab:
> +       bpf_map_free_kptr_off_tab(map);
>         return ret;
>  }
>
> @@ -1639,7 +1745,7 @@ static int map_freeze(const union bpf_attr *attr)
>                 return PTR_ERR(map);
>
>         if (map->map_type == BPF_MAP_TYPE_STRUCT_OPS ||
> -           map_value_has_timer(map)) {
> +           map_value_has_timer(map) || map_value_has_kptr(map)) {
>                 fdput(f);
>                 return -ENOTSUPP;
>         }
> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index 4ce9a528fb63..744b7362e52e 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -3507,6 +3507,94 @@ int check_ptr_off_reg(struct bpf_verifier_env *env,
>         return __check_ptr_off_reg(env, reg, regno, false);
>  }
>
> +static int map_kptr_match_type(struct bpf_verifier_env *env,
> +                              struct bpf_map_value_off_desc *off_desc,
> +                              struct bpf_reg_state *reg, u32 regno)
> +{
> +       const char *targ_name = kernel_type_name(off_desc->btf, off_desc->btf_id);
> +       const char *reg_name = "";
> +
> +       if (reg->type != PTR_TO_BTF_ID && reg->type != PTR_TO_BTF_ID_OR_NULL)
> +               goto bad_type;
> +
> +       if (!btf_is_kernel(reg->btf)) {
> +               verbose(env, "R%d must point to kernel BTF\n", regno);
> +               return -EINVAL;
> +       }
> +       /* We need to verify reg->type and reg->btf, before accessing reg->btf */
> +       reg_name = kernel_type_name(reg->btf, reg->btf_id);
> +
> +       if (__check_ptr_off_reg(env, reg, regno, true))
> +               return -EACCES;
> +
> +       if (!btf_struct_ids_match(&env->log, reg->btf, reg->btf_id, reg->off,
> +                                 off_desc->btf, off_desc->btf_id))
> +               goto bad_type;
> +       return 0;
> +bad_type:
> +       verbose(env, "invalid kptr access, R%d type=%s%s ", regno,
> +               reg_type_str(env, reg->type), reg_name);
> +       verbose(env, "expected=%s%s\n", reg_type_str(env, PTR_TO_BTF_ID), targ_name);
> +       return -EINVAL;
> +}
> +
> +/* Returns an error, or 0 if ignoring the access, or 1 if register state was
> + * updated, in which case later updates must be skipped.
> + */
> +static int check_map_kptr_access(struct bpf_verifier_env *env, u32 regno,
> +                                int off, int size, int value_regno,
> +                                enum bpf_access_type t, int insn_idx)

Did you mean to include the "enum bpf_access_type t" argument? I don't
see where it's used in this function.

> +{
> +       struct bpf_reg_state *reg = reg_state(env, regno), *val_reg;
> +       struct bpf_insn *insn = &env->prog->insnsi[insn_idx];
> +       struct bpf_map_value_off_desc *off_desc;
> +       struct bpf_map *map = reg->map_ptr;
> +       int class = BPF_CLASS(insn->code);
> +
> +       /* Things we already checked for in check_map_access:
> +        *  - Reject cases where variable offset may touch BTF ID pointer
> +        *  - size of access (must be BPF_DW)
> +        *  - off_desc->offset == off + reg->var_off.value
> +        */
> +       if (!tnum_is_const(reg->var_off))
> +               return 0;
> +
> +       off_desc = bpf_map_kptr_off_contains(map, off + reg->var_off.value);
> +       if (!off_desc)
> +               return 0;
> +
> +       /* Only BPF_[LDX,STX,ST] | BPF_MEM | BPF_DW is supported */
> +       if (BPF_MODE(insn->code) != BPF_MEM)
> +               goto end;
> +
> +       if (class == BPF_LDX) {
> +               val_reg = reg_state(env, value_regno);
> +               /* We can simply mark the value_regno receiving the pointer
> +                * value from map as PTR_TO_BTF_ID, with the correct type.
> +                */
> +               mark_btf_ld_reg(env, cur_regs(env), value_regno, PTR_TO_BTF_ID, off_desc->btf,
> +                               off_desc->btf_id, PTR_MAYBE_NULL);
> +               val_reg->id = ++env->id_gen;
> +       } else if (class == BPF_STX) {
> +               val_reg = reg_state(env, value_regno);
> +               if (!register_is_null(val_reg) &&
> +                   map_kptr_match_type(env, off_desc, val_reg, value_regno))
> +                       return -EACCES;
> +       } else if (class == BPF_ST) {
> +               if (insn->imm) {
> +                       verbose(env, "BPF_ST imm must be 0 when storing to kptr at off=%u\n",
> +                               off_desc->offset);
> +                       return -EACCES;
> +               }
> +       } else {
> +               goto end;
> +       }
> +       return 1;
> +end:
> +       verbose(env, "kptr in map can only be accessed using BPF_LDX/BPF_STX/BPF_ST\n");
> +       return -EACCES;
> +}
> +
>  /* check read/write into a map element with possible variable offset */
>  static int check_map_access(struct bpf_verifier_env *env, u32 regno,
>                             int off, int size, bool zero_size_allowed)
> @@ -3545,6 +3633,32 @@ static int check_map_access(struct bpf_verifier_env *env, u32 regno,
>                         return -EACCES;
>                 }
>         }
> +       if (map_value_has_kptr(map)) {
> +               struct bpf_map_value_off *tab = map->kptr_off_tab;
> +               int i;
> +
> +               for (i = 0; i < tab->nr_off; i++) {
> +                       u32 p = tab->off[i].offset;
> +
> +                       if (reg->smin_value + off < p + sizeof(u64) &&
> +                           p < reg->umax_value + off + size) {
> +                               if (!tnum_is_const(reg->var_off)) {
> +                                       verbose(env, "kptr access cannot have variable offset\n");
> +                                       return -EACCES;
> +                               }
> +                               if (p != off + reg->var_off.value) {
> +                                       verbose(env, "kptr access misaligned expected=%u off=%llu\n",
> +                                               p, off + reg->var_off.value);
> +                                       return -EACCES;
> +                               }
> +                               if (size != bpf_size_to_bytes(BPF_DW)) {
> +                                       verbose(env, "kptr access size must be BPF_DW\n");
> +                                       return -EACCES;
> +                               }
> +                               break;
> +                       }
> +               }
> +       }
>         return err;
>  }
>
> @@ -4421,6 +4535,10 @@ static int check_mem_access(struct bpf_verifier_env *env, int insn_idx, u32 regn
>                 if (err)
>                         return err;
>                 err = check_map_access(env, regno, off, size, false);
> +               err = err ?: check_map_kptr_access(env, regno, off, size, value_regno, t, insn_idx);
> +               if (err < 0)
> +                       return err;
> +               /* if err == 0, check_map_kptr_access ignored the access */
>                 if (!err && t == BPF_READ && value_regno >= 0) {
>                         struct bpf_map *map = reg->map_ptr;
>
> @@ -4442,6 +4560,8 @@ static int check_mem_access(struct bpf_verifier_env *env, int insn_idx, u32 regn)
>                                 mark_reg_unknown(env, regs, value_regno);
>                         }
>                 }
> +               /* clear err == 1 */
> +               err = err < 0 ? err : 0;

I find this flow a bit unintuitive to follow. Would something like

    err = check_map_access(env, regno, off, size, false);
    if (err)
        return err;
    if (bpf_map_kptr_off_contains(map, off + reg->var_off.value, &off_desc)) {
        /* where check_map_kptr_access now returns 0 on success
         * and an error code otherwise
         */
        err = check_map_kptr_access(...);
    } else if (!err && t == BPF_READ && value_regno >= 0) {
        ...
    }

be clearer?


>         } else if (base_type(reg->type) == PTR_TO_MEM) {
>                 bool rdonly_mem = type_is_rdonly_mem(reg->type);
>
> --
> 2.35.1
>
Andrii Nakryiko March 22, 2022, 5:45 a.m. UTC | #2
On Sun, Mar 20, 2022 at 8:55 AM Kumar Kartikeya Dwivedi
<memxor@gmail.com> wrote:
>

[...]

> +static int btf_find_field_kptr(const struct btf *btf, const struct btf_type *t,
> +                              u32 off, int sz, struct btf_field_info *info)
> +{
> +       /* For PTR, sz is always == 8 */
> +       if (!btf_type_is_ptr(t))
> +               return BTF_FIELD_IGNORE;
> +       t = btf_type_by_id(btf, t->type);
> +
> +       if (!btf_type_is_type_tag(t))
> +               return BTF_FIELD_IGNORE;
> +       /* Reject extra tags */
> +       if (btf_type_is_type_tag(btf_type_by_id(btf, t->type)))
> +               return -EINVAL;

Can we have tag -> const -> tag -> volatile -> tag in BTF? Wouldn't
you assume there are no more tags with just this check?


> +       if (strcmp("kptr", __btf_name_by_offset(btf, t->name_off)))
> +               return -EINVAL;
> +
> +       /* Get the base type */
> +       if (btf_type_is_modifier(t))
> +               t = btf_type_skip_modifiers(btf, t->type, NULL);
> +       /* Only pointer to struct is allowed */
> +       if (!__btf_type_is_struct(t))
> +               return -EINVAL;
> +
> +       info->type = t;
> +       info->off = off;
> +       return BTF_FIELD_FOUND;
>  }
>
>  static int btf_find_struct_field(const struct btf *btf, const struct btf_type *t,
>                                  const char *name, int sz, int align, int field_type,
> -                                struct btf_field_info *info)
> +                                struct btf_field_info *info, int info_cnt)
>  {
>         const struct btf_member *member;
> +       int ret, idx = 0;
>         u32 i, off;
> -       int ret;
>
>         for_each_member(i, t, member) {
>                 const struct btf_type *member_type = btf_type_by_id(btf,
> @@ -3210,24 +3242,35 @@ static int btf_find_struct_field(const struct btf *btf, const struct btf_type *t
>                 switch (field_type) {
>                 case BTF_FIELD_SPIN_LOCK:
>                 case BTF_FIELD_TIMER:
> -                       ret = btf_find_field_struct(btf, member_type, off, sz, info);
> +                       ret = btf_find_field_struct(btf, member_type, off, sz, &info[idx]);
> +                       if (ret < 0)
> +                               return ret;
> +                       break;
> +               case BTF_FIELD_KPTR:
> +                       ret = btf_find_field_kptr(btf, member_type, off, sz, &info[idx]);
>                         if (ret < 0)
>                                 return ret;
>                         break;
>                 default:
>                         return -EFAULT;
>                 }
> +
> +               if (ret == BTF_FIELD_FOUND && idx >= info_cnt)

hm.. haven't you already written to info[info_cnt] above by this
point? I see that above you pass (info_cnt - 1), but why such tricks if
you can have a temporary struct btf_field_info on the stack, write into
it, and if BTF_FIELD_FOUND and idx < info_cnt, then write it into
info[idx]?
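
i.e., something like this inside the loop (a sketch of the suggestion,
not the actual patch):

	struct btf_field_info tmp;

	ret = btf_find_field_kptr(btf, member_type, off, sz, &tmp);
	if (ret < 0)
		return ret;
	if (ret == BTF_FIELD_IGNORE)
		continue;
	if (idx >= info_cnt)
		return -E2BIG;
	info[idx++] = tmp;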


> +                       return -E2BIG;
> +               else if (ret == BTF_FIELD_IGNORE)
> +                       continue;
> +               ++idx;
>         }
> -       return 0;
> +       return idx;
>  }
>
>  static int btf_find_datasec_var(const struct btf *btf, const struct btf_type *t,
>                                 const char *name, int sz, int align, int field_type,
> -                               struct btf_field_info *info)
> +                               struct btf_field_info *info, int info_cnt)
>  {
>         const struct btf_var_secinfo *vsi;
> +       int ret, idx = 0;
>         u32 i, off;
> -       int ret;
>
>         for_each_vsi(i, t, vsi) {
>                 const struct btf_type *var = btf_type_by_id(btf, vsi->type);
> @@ -3245,19 +3288,30 @@ static int btf_find_datasec_var(const struct btf *btf, const struct btf_type *t,
>                 switch (field_type) {
>                 case BTF_FIELD_SPIN_LOCK:
>                 case BTF_FIELD_TIMER:
> -                       ret = btf_find_field_struct(btf, var_type, off, sz, info);
> +                       ret = btf_find_field_struct(btf, var_type, off, sz, &info[idx]);
> +                       if (ret < 0)
> +                               return ret;
> +                       break;
> +               case BTF_FIELD_KPTR:
> +                       ret = btf_find_field_kptr(btf, var_type, off, sz, &info[idx]);
>                         if (ret < 0)
>                                 return ret;
>                         break;
>                 default:
>                         return -EFAULT;
>                 }
> +
> +               if (ret == BTF_FIELD_FOUND && idx >= info_cnt)

same, already writing past the end of array?

> +                       return -E2BIG;
> +               if (ret == BTF_FIELD_IGNORE)
> +                       continue;
> +               ++idx;
>         }
> -       return 0;
> +       return idx;
>  }
>
>  static int btf_find_field(const struct btf *btf, const struct btf_type *t,
> -                         int field_type, struct btf_field_info *info)
> +                         int field_type, struct btf_field_info *info, int info_cnt)
>  {
>         const char *name;
>         int sz, align;
> @@ -3273,14 +3327,20 @@ static int btf_find_field(const struct btf *btf, const struct btf_type *t,
>                 sz = sizeof(struct bpf_timer);
>                 align = __alignof__(struct bpf_timer);
>                 break;
> +       case BTF_FIELD_KPTR:
> +               name = NULL;
> +               sz = sizeof(u64);
> +               align = __alignof__(u64);

can be 4 on 32-bit arch, is that ok?

> +               break;
>         default:
>                 return -EFAULT;
>         }
>
> +       /* The maximum allowed fields of a certain type will be info_cnt - 1 */
>         if (__btf_type_is_struct(t))
> -               return btf_find_struct_field(btf, t, name, sz, align, field_type, info);
> +               return btf_find_struct_field(btf, t, name, sz, align, field_type, info, info_cnt - 1);

why -1, to avoid overwriting past the end of array?

>         else if (btf_type_is_datasec(t))
> -               return btf_find_datasec_var(btf, t, name, sz, align, field_type, info);
> +               return btf_find_datasec_var(btf, t, name, sz, align, field_type, info, info_cnt - 1);
>         return -EINVAL;
>  }
>
> @@ -3290,24 +3350,79 @@ static int btf_find_field(const struct btf *btf, const struct btf_type *t,
>   */
>  int btf_find_spin_lock(const struct btf *btf, const struct btf_type *t)
>  {
> -       struct btf_field_info info = { .off = -ENOENT };
> +       /* btf_find_field requires array of size max + 1 */

ok, right, as I expected above, but see also my earlier suggestion to
not have these weird implicit expectations

> +       struct btf_field_info info_arr[2];
>         int ret;
>
> -       ret = btf_find_field(btf, t, BTF_FIELD_SPIN_LOCK, &info);
> +       ret = btf_find_field(btf, t, BTF_FIELD_SPIN_LOCK, info_arr, ARRAY_SIZE(info_arr));
>         if (ret < 0)
>                 return ret;
> -       return info.off;
> +       if (!ret)
> +               return -ENOENT;
> +       return info_arr[0].off;
>  }
>
>  int btf_find_timer(const struct btf *btf, const struct btf_type *t)
>  {
> -       struct btf_field_info info = { .off = -ENOENT };
> +       /* btf_find_field requires array of size max + 1 */
> +       struct btf_field_info info_arr[2];
>         int ret;
>
> -       ret = btf_find_field(btf, t, BTF_FIELD_TIMER, &info);
> +       ret = btf_find_field(btf, t, BTF_FIELD_TIMER, info_arr, ARRAY_SIZE(info_arr));
>         if (ret < 0)
>                 return ret;
> -       return info.off;
> +       if (!ret)
> +               return -ENOENT;
> +       return info_arr[0].off;
> +}
> +
> +struct bpf_map_value_off *btf_find_kptr(const struct btf *btf,
> +                                       const struct btf_type *t)
> +{
> +       /* btf_find_field requires array of size max + 1 */
> +       struct btf_field_info info_arr[BPF_MAP_VALUE_OFF_MAX + 1];
> +       struct bpf_map_value_off *tab;
> +       int ret, i, nr_off;
> +
> +       /* Revisit stack usage when bumping BPF_MAP_VALUE_OFF_MAX */
> +       BUILD_BUG_ON(BPF_MAP_VALUE_OFF_MAX != 8);

you can store a u32 type_id instead of the full btf_type pointer; the
type lookup below in the loop is cheap and won't fail
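
i.e., roughly (a sketch):

	struct btf_field_info {
		u32 type_id;	/* instead of const struct btf_type *type */
		u32 off;
	};

	/* ... and in btf_find_kptr()'s loop: */
	t = btf_type_by_id(btf, info_arr[i].type_id);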


> +
> +       ret = btf_find_field(btf, t, BTF_FIELD_KPTR, info_arr, ARRAY_SIZE(info_arr));
> +       if (ret < 0)
> +               return ERR_PTR(ret);
> +       if (!ret)
> +               return NULL;
> +

[...]

> +
> +bool bpf_map_equal_kptr_off_tab(const struct bpf_map *map_a, const struct bpf_map *map_b)
> +{
> +       struct bpf_map_value_off *tab_a = map_a->kptr_off_tab, *tab_b = map_b->kptr_off_tab;
> +       bool a_has_kptr = map_value_has_kptr(map_a), b_has_kptr = map_value_has_kptr(map_b);
> +       int size;
> +
> +       if (!a_has_kptr && !b_has_kptr)
> +               return true;
> +       if ((a_has_kptr && !b_has_kptr) || (!a_has_kptr && b_has_kptr))
> +               return false;

if (a_has_kptr != b_has_kptr)
    return false;

> +       if (tab_a->nr_off != tab_b->nr_off)
> +               return false;
> +       size = offsetof(struct bpf_map_value_off, off[tab_a->nr_off]);
> +       return !memcmp(tab_a, tab_b, size);
> +}
> +
>  /* called from workqueue */
>  static void bpf_map_free_deferred(struct work_struct *work)
>  {
>         struct bpf_map *map = container_of(work, struct bpf_map, work);
>
>         security_bpf_map_free(map);
> +       bpf_map_free_kptr_off_tab(map);
>         bpf_map_release_memcg(map);
>         /* implementation dependent freeing */
>         map->ops->map_free(map);
> @@ -640,7 +724,7 @@ static int bpf_map_mmap(struct file *filp, struct vm_area_struct *vma)
>         int err;
>
>         if (!map->ops->map_mmap || map_value_has_spin_lock(map) ||
> -           map_value_has_timer(map))
> +           map_value_has_timer(map) || map_value_has_kptr(map))
>                 return -ENOTSUPP;
>
>         if (!(vma->vm_flags & VM_SHARED))
> @@ -820,9 +904,31 @@ static int map_check_btf(struct bpf_map *map, const struct btf *btf,
>                         return -EOPNOTSUPP;
>         }
>
> -       if (map->ops->map_check_btf)
> +       map->kptr_off_tab = btf_find_kptr(btf, value_type);

btf_find_kptr() is so confusingly named. It certainly can find more
than one kptr, so at least it should be btf_find_kptrs(). Combining
with Joanne's suggestion, btf_parse_kptrs() would indeed be better.

> +       if (map_value_has_kptr(map)) {
> +               if (!bpf_capable())
> +                       return -EPERM;
> +               if (map->map_flags & BPF_F_RDONLY_PROG) {
> +                       ret = -EACCES;
> +                       goto free_map_tab;
> +               }
> +               if (map->map_type != BPF_MAP_TYPE_HASH &&
> +                   map->map_type != BPF_MAP_TYPE_LRU_HASH &&
> +                   map->map_type != BPF_MAP_TYPE_ARRAY) {

what about PERCPU_ARRAY, for instance? Is there something
fundamentally wrong with supporting it for local storage maps?

> +                       ret = -EOPNOTSUPP;
> +                       goto free_map_tab;
> +               }
> +       }
> +
> +       if (map->ops->map_check_btf) {
>                 ret = map->ops->map_check_btf(map, btf, key_type, value_type);
> +               if (ret < 0)
> +                       goto free_map_tab;
> +       }
>
> +       return ret;
> +free_map_tab:
> +       bpf_map_free_kptr_off_tab(map);
>         return ret;
>  }
>
> @@ -1639,7 +1745,7 @@ static int map_freeze(const union bpf_attr *attr)
>                 return PTR_ERR(map);
>
>         if (map->map_type == BPF_MAP_TYPE_STRUCT_OPS ||
> -           map_value_has_timer(map)) {
> +           map_value_has_timer(map) || map_value_has_kptr(map)) {
>                 fdput(f);
>                 return -ENOTSUPP;
>         }
> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index 4ce9a528fb63..744b7362e52e 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -3507,6 +3507,94 @@ int check_ptr_off_reg(struct bpf_verifier_env *env,
>         return __check_ptr_off_reg(env, reg, regno, false);
>  }
>
> +static int map_kptr_match_type(struct bpf_verifier_env *env,
> +                              struct bpf_map_value_off_desc *off_desc,
> +                              struct bpf_reg_state *reg, u32 regno)
> +{
> +       const char *targ_name = kernel_type_name(off_desc->btf, off_desc->btf_id);
> +       const char *reg_name = "";
> +
> +       if (reg->type != PTR_TO_BTF_ID && reg->type != PTR_TO_BTF_ID_OR_NULL)

base_type(reg->type) != PTR_TO_BTF_ID ?

> +               goto bad_type;
> +
> +       if (!btf_is_kernel(reg->btf)) {
> +               verbose(env, "R%d must point to kernel BTF\n", regno);
> +               return -EINVAL;
> +       }
> +       /* We need to verify reg->type and reg->btf, before accessing reg->btf */
> +       reg_name = kernel_type_name(reg->btf, reg->btf_id);
> +
> +       if (__check_ptr_off_reg(env, reg, regno, true))
> +               return -EACCES;
> +
> +       if (!btf_struct_ids_match(&env->log, reg->btf, reg->btf_id, reg->off,
> +                                 off_desc->btf, off_desc->btf_id))
> +               goto bad_type;
> +       return 0;
> +bad_type:
> +       verbose(env, "invalid kptr access, R%d type=%s%s ", regno,
> +               reg_type_str(env, reg->type), reg_name);
> +       verbose(env, "expected=%s%s\n", reg_type_str(env, PTR_TO_BTF_ID), targ_name);

why two separate verbose calls? You can easily combine them (and they
should be output on a single line, given it's a single error)

> +       return -EINVAL;
> +}
> +

[...]
Kumar Kartikeya Dwivedi March 22, 2022, 7:04 a.m. UTC | #3
On Tue, Mar 22, 2022 at 05:09:30AM IST, Joanne Koong wrote:
> On Sun, Mar 20, 2022 at 5:27 PM Kumar Kartikeya Dwivedi
> <memxor@gmail.com> wrote:
> >
> [...]
> > +
> >  struct bpf_map *bpf_map_get(u32 ufd);
> >  struct bpf_map *bpf_map_get_with_uref(u32 ufd);
> >  struct bpf_map *__bpf_map_get(struct fd f);
> > diff --git a/include/linux/btf.h b/include/linux/btf.h
> > index 36bc09b8e890..5b578dc81c04 100644
> > --- a/include/linux/btf.h
> > +++ b/include/linux/btf.h
> > @@ -123,6 +123,8 @@ bool btf_member_is_reg_int(const struct btf *btf, const struct btf_type *s,
> >                            u32 expected_offset, u32 expected_size);
> >  int btf_find_spin_lock(const struct btf *btf, const struct btf_type *t);
> >  int btf_find_timer(const struct btf *btf, const struct btf_type *t);
> > +struct bpf_map_value_off *btf_find_kptr(const struct btf *btf,
> > +                                       const struct btf_type *t);
>
> nit: given that "btf_find_kptr" allocates memory as well, maybe the
> name "btf_parse_kptr" would be more reflective?
>

Good point, will change.

> >  bool btf_type_is_void(const struct btf_type *t);
> >  s32 btf_find_by_name_kind(const struct btf *btf, const char *name, u8 kind);
> >  const struct btf_type *btf_type_skip_modifiers(const struct btf *btf,
> > diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
> > index 9e17af936a7a..92afbec0a887 100644
> > --- a/kernel/bpf/btf.c
> > +++ b/kernel/bpf/btf.c
> > @@ -3164,9 +3164,16 @@ static void btf_struct_log(struct btf_verifier_env *env,
> >  enum {
> >         BTF_FIELD_SPIN_LOCK,
> >         BTF_FIELD_TIMER,
> > +       BTF_FIELD_KPTR,
> > +};
> > +
> > +enum {
> > +       BTF_FIELD_IGNORE = 0,
> > +       BTF_FIELD_FOUND  = 1,
> >  };
> >
> >  struct btf_field_info {
> > +       const struct btf_type *type;
> >         u32 off;
> >  };
> >
> > @@ -3174,23 +3181,48 @@ static int btf_find_field_struct(const struct btf *btf, const struct btf_type *t
> >                                  u32 off, int sz, struct btf_field_info *info)
> >  {
> >         if (!__btf_type_is_struct(t))
> > -               return 0;
> > +               return BTF_FIELD_IGNORE;
> >         if (t->size != sz)
> > -               return 0;
> > -       if (info->off != -ENOENT)
> > -               /* only one such field is allowed */
> > -               return -E2BIG;
> > +               return BTF_FIELD_IGNORE;
> >         info->off = off;
> > -       return 0;
> > +       return BTF_FIELD_FOUND;
> > +}
> > +
> > +static int btf_find_field_kptr(const struct btf *btf, const struct btf_type *t,
> > +                              u32 off, int sz, struct btf_field_info *info)
> > +{
> > +       /* For PTR, sz is always == 8 */
> > +       if (!btf_type_is_ptr(t))
> > +               return BTF_FIELD_IGNORE;
> > +       t = btf_type_by_id(btf, t->type);
> > +
> > +       if (!btf_type_is_type_tag(t))
> > +               return BTF_FIELD_IGNORE;
> > +       /* Reject extra tags */
> > +       if (btf_type_is_type_tag(btf_type_by_id(btf, t->type)))
> > +               return -EINVAL;
> > +       if (strcmp("kptr", __btf_name_by_offset(btf, t->name_off)))
> > +               return -EINVAL;
> > +
> > +       /* Get the base type */
> > +       if (btf_type_is_modifier(t))
> > +               t = btf_type_skip_modifiers(btf, t->type, NULL);
> > +       /* Only pointer to struct is allowed */
> > +       if (!__btf_type_is_struct(t))
> > +               return -EINVAL;
> > +
> > +       info->type = t;
> > +       info->off = off;
> > +       return BTF_FIELD_FOUND;
> >  }
> >
> >  static int btf_find_struct_field(const struct btf *btf, const struct btf_type *t,
> >                                  const char *name, int sz, int align, int field_type,
> > -                                struct btf_field_info *info)
> > +                                struct btf_field_info *info, int info_cnt)
>
> From my understanding, this patch now modifies btf_find_struct_field
> and btf_find_datasec_var such that the "info" passed in has to be an
> array of size max possible + 1, while "info_cnt" is the max possible
> count, or we risk writing beyond the "info" array. It seems like we
> could just modify the btf_find_struct_field/btf_find_datasec_var logic
> so that the caller can pass in an info array of the max possible size
> instead of max possible size + 1 - or is your concern that this would
> require more idx >= info_cnt checks inside the functions? Maybe we
> should include a comment here and in btf_find_datasec_var to document
> that "info" should always be max possible size + 1?
>

For some context on why this was changed, see [0].

I agree it's pretty ugly. My first thought was to do the check inside the
functions, but that isn't great either, so I went with this. One more
suggestion, from Alexei, was to split it into a find step and a fill-info
step, because the error on idx >= info_cnt should only be raised after a
field has actually been found. Right now find and fill happen together, so
to error out correctly you need one extra element the loop can fill into
before bailing at ARRAY_SIZE - 1 (which is the actual maximum).

TBH the find + fill split looks best to me (rough sketch below), but I'm
open to more suggestions.

[0]: https://lore.kernel.org/bpf/20220319181538.nbqdkprjrzkxk7v4@ast-mbp.dhcp.thefacebook.com
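
For reference, a rough sketch of the find + fill shape I have in mind
(assumed code, not part of this patch):

        struct btf_field_info tmp;
        int ret, idx = 0;

        for_each_member(i, t, member) {
                ...
                ret = btf_find_field_kptr(btf, member_type, off, sz, &tmp);
                if (ret < 0)
                        return ret;
                if (ret == BTF_FIELD_IGNORE)
                        continue;
                /* Error out only after a field was actually found */
                if (idx >= info_cnt)
                        return -E2BIG;
                info[idx++] = tmp;
        }
        return idx;

With that, callers can size the info array to the real maximum, without
the extra sentinel slot.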

> >  {
> >         const struct btf_member *member;
> > +       int ret, idx = 0;
> >         u32 i, off;
> > -       int ret;
> >
> >         for_each_member(i, t, member) {
> >                 const struct btf_type *member_type = btf_type_by_id(btf,
> > @@ -3210,24 +3242,35 @@ static int btf_find_struct_field(const struct btf *btf, const struct btf_type *t
> >                 switch (field_type) {
> >                 case BTF_FIELD_SPIN_LOCK:
> >                 case BTF_FIELD_TIMER:
> > -                       ret = btf_find_field_struct(btf, member_type, off, sz, info);
> > +                       ret = btf_find_field_struct(btf, member_type, off, sz, &info[idx]);
> > +                       if (ret < 0)
> > +                               return ret;
> > +                       break;
> > +               case BTF_FIELD_KPTR:
> > +                       ret = btf_find_field_kptr(btf, member_type, off, sz, &info[idx]);
> >                         if (ret < 0)
> >                                 return ret;
> >                         break;
> >                 default:
> >                         return -EFAULT;
> >                 }
> > +
> > +               if (ret == BTF_FIELD_FOUND && idx >= info_cnt)
> > +                       return -E2BIG;
> > +               else if (ret == BTF_FIELD_IGNORE)
> > +                       continue;
> > +               ++idx;
> >         }
> > -       return 0;
> > +       return idx;
> >  }
> >
> >  static int btf_find_datasec_var(const struct btf *btf, const struct btf_type *t,
> >                                 const char *name, int sz, int align, int field_type,
> > -                               struct btf_field_info *info)
> > +                               struct btf_field_info *info, int info_cnt)
> >  {
> >         const struct btf_var_secinfo *vsi;
> > +       int ret, idx = 0;
> >         u32 i, off;
> > -       int ret;
> >
> >         for_each_vsi(i, t, vsi) {
> >                 const struct btf_type *var = btf_type_by_id(btf, vsi->type);
> > @@ -3245,19 +3288,30 @@ static int btf_find_datasec_var(const struct btf *btf, const struct btf_type *t,
> >                 switch (field_type) {
> >                 case BTF_FIELD_SPIN_LOCK:
> >                 case BTF_FIELD_TIMER:
> > -                       ret = btf_find_field_struct(btf, var_type, off, sz, info);
> > +                       ret = btf_find_field_struct(btf, var_type, off, sz, &info[idx]);
> > +                       if (ret < 0)
> > +                               return ret;
> > +                       break;
> > +               case BTF_FIELD_KPTR:
> > +                       ret = btf_find_field_kptr(btf, var_type, off, sz, &info[idx]);
> >                         if (ret < 0)
> >                                 return ret;
> >                         break;
> >                 default:
> >                         return -EFAULT;
> >                 }
> > +
> > +               if (ret == BTF_FIELD_FOUND && idx >= info_cnt)
> > +                       return -E2BIG;
> > +               if (ret == BTF_FIELD_IGNORE)
> > +                       continue;
> > +               ++idx;
> >         }
> > -       return 0;
> > +       return idx;
> >  }
> >
> >  static int btf_find_field(const struct btf *btf, const struct btf_type *t,
> > -                         int field_type, struct btf_field_info *info)
> > +                         int field_type, struct btf_field_info *info, int info_cnt)
> >  {
> >         const char *name;
> >         int sz, align;
> > @@ -3273,14 +3327,20 @@ static int btf_find_field(const struct btf *btf, const struct btf_type *t,
> >                 sz = sizeof(struct bpf_timer);
> >                 align = __alignof__(struct bpf_timer);
> >                 break;
> > +       case BTF_FIELD_KPTR:
> > +               name = NULL;
> > +               sz = sizeof(u64);
> > +               align = __alignof__(u64);
> > +               break;
> >         default:
> >                 return -EFAULT;
> >         }
> >
> > +       /* The maximum allowed fields of a certain type will be info_cnt - 1 */
> >         if (__btf_type_is_struct(t))
> > -               return btf_find_struct_field(btf, t, name, sz, align, field_type, info);
> > +               return btf_find_struct_field(btf, t, name, sz, align, field_type, info, info_cnt - 1);
> >         else if (btf_type_is_datasec(t))
> > -               return btf_find_datasec_var(btf, t, name, sz, align, field_type, info);
> > +               return btf_find_datasec_var(btf, t, name, sz, align, field_type, info, info_cnt - 1);
> >         return -EINVAL;
> >  }
> >
> > @@ -3290,24 +3350,79 @@ static int btf_find_field(const struct btf *btf, const struct btf_type *t,
> >   */
> >  int btf_find_spin_lock(const struct btf *btf, const struct btf_type *t)
> >  {
> > -       struct btf_field_info info = { .off = -ENOENT };
> > +       /* btf_find_field requires array of size max + 1 */
> > +       struct btf_field_info info_arr[2];
> >         int ret;
> >
> > -       ret = btf_find_field(btf, t, BTF_FIELD_SPIN_LOCK, &info);
> > +       ret = btf_find_field(btf, t, BTF_FIELD_SPIN_LOCK, info_arr, ARRAY_SIZE(info_arr));
> >         if (ret < 0)
> >                 return ret;
> > -       return info.off;
> > +       if (!ret)
> > +               return -ENOENT;
> > +       return info_arr[0].off;
> >  }
> >
> >  int btf_find_timer(const struct btf *btf, const struct btf_type *t)
> >  {
> > -       struct btf_field_info info = { .off = -ENOENT };
> > +       /* btf_find_field requires array of size max + 1 */
> > +       struct btf_field_info info_arr[2];
> >         int ret;
> >
> > -       ret = btf_find_field(btf, t, BTF_FIELD_TIMER, &info);
> > +       ret = btf_find_field(btf, t, BTF_FIELD_TIMER, info_arr, ARRAY_SIZE(info_arr));
> >         if (ret < 0)
> >                 return ret;
> > -       return info.off;
> > +       if (!ret)
> > +               return -ENOENT;
> > +       return info_arr[0].off;
> > +}
> > +
> > +struct bpf_map_value_off *btf_find_kptr(const struct btf *btf,
> > +                                       const struct btf_type *t)
> > +{
> > +       /* btf_find_field requires array of size max + 1 */
> > +       struct btf_field_info info_arr[BPF_MAP_VALUE_OFF_MAX + 1];
> > +       struct bpf_map_value_off *tab;
> > +       int ret, i, nr_off;
> > +
> > +       /* Revisit stack usage when bumping BPF_MAP_VALUE_OFF_MAX */
> > +       BUILD_BUG_ON(BPF_MAP_VALUE_OFF_MAX != 8);
> > +
> > +       ret = btf_find_field(btf, t, BTF_FIELD_KPTR, info_arr, ARRAY_SIZE(info_arr));
> > +       if (ret < 0)
> > +               return ERR_PTR(ret);
> > +       if (!ret)
> > +               return NULL;
> > +
> > +       nr_off = ret;
> > +       tab = kzalloc(offsetof(struct bpf_map_value_off, off[nr_off]), GFP_KERNEL | __GFP_NOWARN);
> > +       if (!tab)
> > +               return ERR_PTR(-ENOMEM);
> > +
> > +       tab->nr_off = 0;
> > +       for (i = 0; i < nr_off; i++) {
> > +               const struct btf_type *t;
> > +               struct btf *off_btf;
> > +               s32 id;
> > +
> > +               t = info_arr[i].type;
> > +               id = bpf_find_btf_id(__btf_name_by_offset(btf, t->name_off), BTF_INFO_KIND(t->info),
> > +                                    &off_btf);
> > +               if (id < 0) {
> > +                       ret = id;
> > +                       goto end;
> > +               }
> > +
> > +               tab->off[i].offset = info_arr[i].off;
> > +               tab->off[i].btf_id = id;
> > +               tab->off[i].btf = off_btf;
> > +               tab->nr_off = i + 1;
> > +       }
> > +       return tab;
> > +end:
> > +       while (tab->nr_off--)
> > +               btf_put(tab->off[tab->nr_off].btf);
> > +       kfree(tab);
> > +       return ERR_PTR(ret);
> >  }
> >
> >  static void __btf_struct_show(const struct btf *btf, const struct btf_type *t,
> > diff --git a/kernel/bpf/map_in_map.c b/kernel/bpf/map_in_map.c
> > index 5cd8f5277279..135205d0d560 100644
> > --- a/kernel/bpf/map_in_map.c
> > +++ b/kernel/bpf/map_in_map.c
> > @@ -52,6 +52,7 @@ struct bpf_map *bpf_map_meta_alloc(int inner_map_ufd)
> >         inner_map_meta->max_entries = inner_map->max_entries;
> >         inner_map_meta->spin_lock_off = inner_map->spin_lock_off;
> >         inner_map_meta->timer_off = inner_map->timer_off;
> > +       inner_map_meta->kptr_off_tab = bpf_map_copy_kptr_off_tab(inner_map);
> >         if (inner_map->btf) {
> >                 btf_get(inner_map->btf);
> >                 inner_map_meta->btf = inner_map->btf;
> > @@ -71,6 +72,7 @@ struct bpf_map *bpf_map_meta_alloc(int inner_map_ufd)
> >
> >  void bpf_map_meta_free(struct bpf_map *map_meta)
> >  {
> > +       bpf_map_free_kptr_off_tab(map_meta);
> >         btf_put(map_meta->btf);
> >         kfree(map_meta);
> >  }
> > @@ -83,7 +85,8 @@ bool bpf_map_meta_equal(const struct bpf_map *meta0,
> >                 meta0->key_size == meta1->key_size &&
> >                 meta0->value_size == meta1->value_size &&
> >                 meta0->timer_off == meta1->timer_off &&
> > -               meta0->map_flags == meta1->map_flags;
> > +               meta0->map_flags == meta1->map_flags &&
> > +               bpf_map_equal_kptr_off_tab(meta0, meta1);
> >  }
> >
> >  void *bpf_map_fd_get_ptr(struct bpf_map *map,
> > diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> > index cdaa1152436a..5990d6fa97ab 100644
> > --- a/kernel/bpf/syscall.c
> > +++ b/kernel/bpf/syscall.c
> > @@ -6,6 +6,7 @@
> >  #include <linux/bpf_trace.h>
> >  #include <linux/bpf_lirc.h>
> >  #include <linux/bpf_verifier.h>
> > +#include <linux/bsearch.h>
> >  #include <linux/btf.h>
> >  #include <linux/syscalls.h>
> >  #include <linux/slab.h>
> > @@ -473,12 +474,95 @@ static void bpf_map_release_memcg(struct bpf_map *map)
> >  }
> >  #endif
> >
> > +static int bpf_map_kptr_off_cmp(const void *a, const void *b)
> > +{
> > +       const struct bpf_map_value_off_desc *off_desc1 = a, *off_desc2 = b;
> > +
> > +       if (off_desc1->offset < off_desc2->offset)
> > +               return -1;
> > +       else if (off_desc1->offset > off_desc2->offset)
> > +               return 1;
> > +       return 0;
> > +}
> > +
> > +struct bpf_map_value_off_desc *bpf_map_kptr_off_contains(struct bpf_map *map, u32 offset)
> > +{
> > +       /* Since members are iterated in btf_find_field in increasing order,
> > +        * offsets appended to kptr_off_tab are in increasing order, so we can
> > +        * do bsearch to find exact match.
> > +        */
> > +       struct bpf_map_value_off *tab;
> > +
> > +       if (!map_value_has_kptr(map))
> > +               return NULL;
> > +       tab = map->kptr_off_tab;
> > +       return bsearch(&offset, tab->off, tab->nr_off, sizeof(tab->off[0]), bpf_map_kptr_off_cmp);
> > +}
> > +
> > +void bpf_map_free_kptr_off_tab(struct bpf_map *map)
> > +{
> > +       struct bpf_map_value_off *tab = map->kptr_off_tab;
> > +       int i;
> > +
> > +       if (!map_value_has_kptr(map))
> > +               return;
> > +       for (i = 0; i < tab->nr_off; i++) {
> > +               struct btf *btf = tab->off[i].btf;
> > +
> > +               btf_put(btf);
> > +       }
> > +       kfree(tab);
> > +       map->kptr_off_tab = NULL;
> > +}
> > +
> > +struct bpf_map_value_off *bpf_map_copy_kptr_off_tab(const struct bpf_map *map)
> > +{
> > +       struct bpf_map_value_off *tab = map->kptr_off_tab, *new_tab;
> > +       int size, i, ret;
> > +
> > +       if (!map_value_has_kptr(map))
> > +               return ERR_PTR(-ENOENT);
> > +       /* Do a deep copy of the kptr_off_tab */
> > +       for (i = 0; i < tab->nr_off; i++)
> > +               btf_get(tab->off[i].btf);
> > +
> > +       size = offsetof(struct bpf_map_value_off, off[tab->nr_off]);
> > +       new_tab = kzalloc(size, GFP_KERNEL | __GFP_NOWARN);
> > +       if (!new_tab) {
> > +               ret = -ENOMEM;
> > +               goto end;
> > +       }
> > +       memcpy(new_tab, tab, size);
> > +       return new_tab;
> > +end:
> > +       while (i--)
> > +               btf_put(tab->off[i].btf);
> > +       return ERR_PTR(ret);
> > +}
> > +
> > +bool bpf_map_equal_kptr_off_tab(const struct bpf_map *map_a, const struct bpf_map *map_b)
> > +{
> > +       struct bpf_map_value_off *tab_a = map_a->kptr_off_tab, *tab_b = map_b->kptr_off_tab;
> > +       bool a_has_kptr = map_value_has_kptr(map_a), b_has_kptr = map_value_has_kptr(map_b);
> > +       int size;
> > +
> > +       if (!a_has_kptr && !b_has_kptr)
> > +               return true;
> > +       if ((a_has_kptr && !b_has_kptr) || (!a_has_kptr && b_has_kptr))
>
> nit: I think we could simplify this second if check to
> if (!a_has_kptr || !b_has_kptr)
>     return false;
>

Ack.

> > +               return false;
> > +       if (tab_a->nr_off != tab_b->nr_off)
> > +               return false;
> > +       size = offsetof(struct bpf_map_value_off, off[tab_a->nr_off]);
> > +       return !memcmp(tab_a, tab_b, size);
> > +}
> > +
> >  /* called from workqueue */
> >  static void bpf_map_free_deferred(struct work_struct *work)
> >  {
> >         struct bpf_map *map = container_of(work, struct bpf_map, work);
> >
> >         security_bpf_map_free(map);
> > +       bpf_map_free_kptr_off_tab(map);
> >         bpf_map_release_memcg(map);
> >         /* implementation dependent freeing */
> >         map->ops->map_free(map);
> > @@ -640,7 +724,7 @@ static int bpf_map_mmap(struct file *filp, struct vm_area_struct *vma)
> >         int err;
> >
> >         if (!map->ops->map_mmap || map_value_has_spin_lock(map) ||
> > -           map_value_has_timer(map))
> > +           map_value_has_timer(map) || map_value_has_kptr(map))
> >                 return -ENOTSUPP;
> >
> >         if (!(vma->vm_flags & VM_SHARED))
> > @@ -820,9 +904,31 @@ static int map_check_btf(struct bpf_map *map, const struct btf *btf,
> >                         return -EOPNOTSUPP;
> >         }
> >
> > -       if (map->ops->map_check_btf)
> > +       map->kptr_off_tab = btf_find_kptr(btf, value_type);
> > +       if (map_value_has_kptr(map)) {
> > +               if (!bpf_capable())
> > +                       return -EPERM;
> > +               if (map->map_flags & BPF_F_RDONLY_PROG) {
> > +                       ret = -EACCES;
> > +                       goto free_map_tab;
> > +               }
> > +               if (map->map_type != BPF_MAP_TYPE_HASH &&
> > +                   map->map_type != BPF_MAP_TYPE_LRU_HASH &&
> > +                   map->map_type != BPF_MAP_TYPE_ARRAY) {
> > +                       ret = -EOPNOTSUPP;
> > +                       goto free_map_tab;
> > +               }
> > +       }
> > +
> > +       if (map->ops->map_check_btf) {
> >                 ret = map->ops->map_check_btf(map, btf, key_type, value_type);
> > +               if (ret < 0)
> > +                       goto free_map_tab;
> > +       }
> >
> > +       return ret;
> > +free_map_tab:
> > +       bpf_map_free_kptr_off_tab(map);
> >         return ret;
> >  }
> >
> > @@ -1639,7 +1745,7 @@ static int map_freeze(const union bpf_attr *attr)
> >                 return PTR_ERR(map);
> >
> >         if (map->map_type == BPF_MAP_TYPE_STRUCT_OPS ||
> > -           map_value_has_timer(map)) {
> > +           map_value_has_timer(map) || map_value_has_kptr(map)) {
> >                 fdput(f);
> >                 return -ENOTSUPP;
> >         }
> > diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> > index 4ce9a528fb63..744b7362e52e 100644
> > --- a/kernel/bpf/verifier.c
> > +++ b/kernel/bpf/verifier.c
> > @@ -3507,6 +3507,94 @@ int check_ptr_off_reg(struct bpf_verifier_env *env,
> >         return __check_ptr_off_reg(env, reg, regno, false);
> >  }
> >
> > +static int map_kptr_match_type(struct bpf_verifier_env *env,
> > +                              struct bpf_map_value_off_desc *off_desc,
> > +                              struct bpf_reg_state *reg, u32 regno)
> > +{
> > +       const char *targ_name = kernel_type_name(off_desc->btf, off_desc->btf_id);
> > +       const char *reg_name = "";
> > +
> > +       if (reg->type != PTR_TO_BTF_ID && reg->type != PTR_TO_BTF_ID_OR_NULL)
> > +               goto bad_type;
> > +
> > +       if (!btf_is_kernel(reg->btf)) {
> > +               verbose(env, "R%d must point to kernel BTF\n", regno);
> > +               return -EINVAL;
> > +       }
> > +       /* We need to verify reg->type and reg->btf, before accessing reg->btf */
> > +       reg_name = kernel_type_name(reg->btf, reg->btf_id);
> > +
> > +       if (__check_ptr_off_reg(env, reg, regno, true))
> > +               return -EACCES;
> > +
> > +       if (!btf_struct_ids_match(&env->log, reg->btf, reg->btf_id, reg->off,
> > +                                 off_desc->btf, off_desc->btf_id))
> > +               goto bad_type;
> > +       return 0;
> > +bad_type:
> > +       verbose(env, "invalid kptr access, R%d type=%s%s ", regno,
> > +               reg_type_str(env, reg->type), reg_name);
> > +       verbose(env, "expected=%s%s\n", reg_type_str(env, PTR_TO_BTF_ID), targ_name);
> > +       return -EINVAL;
> > +}
> > +
> > +/* Returns an error, or 0 if ignoring the access, or 1 if register state was
> > + * updated, in which case later updates must be skipped.
> > + */
> > +static int check_map_kptr_access(struct bpf_verifier_env *env, u32 regno,
> > +                                int off, int size, int value_regno,
> > +                                enum bpf_access_type t, int insn_idx)
>
> Did you mean to include the "enum bpf_access_type t" argument? I'm not
> seeing where it's being used in this function.
>

Good catch. This was only needed when I supported BPF_XCHG for directly
swapping a kptr_ref. Since that has been dropped, I will remove this unused
parameter.

> > +{
> > +       struct bpf_reg_state *reg = reg_state(env, regno), *val_reg;
> > +       struct bpf_insn *insn = &env->prog->insnsi[insn_idx];
> > +       struct bpf_map_value_off_desc *off_desc;
> > +       struct bpf_map *map = reg->map_ptr;
> > +       int class = BPF_CLASS(insn->code);
> > +
> > +       /* Things we already checked for in check_map_access:
> > +        *  - Reject cases where variable offset may touch BTF ID pointer
> > +        *  - size of access (must be BPF_DW)
> > +        *  - off_desc->offset == off + reg->var_off.value
> > +        */
> > +       if (!tnum_is_const(reg->var_off))
> > +               return 0;
> > +
> > +       off_desc = bpf_map_kptr_off_contains(map, off + reg->var_off.value);
> > +       if (!off_desc)
> > +               return 0;
> > +
> > +       /* Only BPF_[LDX,STX,ST] | BPF_MEM | BPF_DW is supported */
> > +       if (BPF_MODE(insn->code) != BPF_MEM)
> > +               goto end;
> > +
> > +       if (class == BPF_LDX) {
> > +               val_reg = reg_state(env, value_regno);
> > +               /* We can simply mark the value_regno receiving the pointer
> > +                * value from map as PTR_TO_BTF_ID, with the correct type.
> > +                */
> > +               mark_btf_ld_reg(env, cur_regs(env), value_regno, PTR_TO_BTF_ID, off_desc->btf,
> > +                               off_desc->btf_id, PTR_MAYBE_NULL);
> > +               val_reg->id = ++env->id_gen;
> > +       } else if (class == BPF_STX) {
> > +               val_reg = reg_state(env, value_regno);
> > +               if (!register_is_null(val_reg) &&
> > +                   map_kptr_match_type(env, off_desc, val_reg, value_regno))
> > +                       return -EACCES;
> > +       } else if (class == BPF_ST) {
> > +               if (insn->imm) {
> > +                       verbose(env, "BPF_ST imm must be 0 when storing to kptr at off=%u\n",
> > +                               off_desc->offset);
> > +                       return -EACCES;
> > +               }
> > +       } else {
> > +               goto end;
> > +       }
> > +       return 1;
> > +end:
> > +       verbose(env, "kptr in map can only be accessed using BPF_LDX/BPF_STX/BPF_ST\n");
> > +       return -EACCES;
> > +}
> > +
> >  /* check read/write into a map element with possible variable offset */
> >  static int check_map_access(struct bpf_verifier_env *env, u32 regno,
> >                             int off, int size, bool zero_size_allowed)
> > @@ -3545,6 +3633,32 @@ static int check_map_access(struct bpf_verifier_env *env, u32 regno,
> >                         return -EACCES;
> >                 }
> >         }
> > +       if (map_value_has_kptr(map)) {
> > +               struct bpf_map_value_off *tab = map->kptr_off_tab;
> > +               int i;
> > +
> > +               for (i = 0; i < tab->nr_off; i++) {
> > +                       u32 p = tab->off[i].offset;
> > +
> > +                       if (reg->smin_value + off < p + sizeof(u64) &&
> > +                           p < reg->umax_value + off + size) {
> > +                               if (!tnum_is_const(reg->var_off)) {
> > +                                       verbose(env, "kptr access cannot have variable offset\n");
> > +                                       return -EACCES;
> > +                               }
> > +                               if (p != off + reg->var_off.value) {
> > +                                       verbose(env, "kptr access misaligned expected=%u off=%llu\n",
> > +                                               p, off + reg->var_off.value);
> > +                                       return -EACCES;
> > +                               }
> > +                               if (size != bpf_size_to_bytes(BPF_DW)) {
> > +                                       verbose(env, "kptr access size must be BPF_DW\n");
> > +                                       return -EACCES;
> > +                               }
> > +                               break;
> > +                       }
> > +               }
> > +       }
> >         return err;
> >  }
> >
> > @@ -4421,6 +4535,10 @@ static int check_mem_access(struct bpf_verifier_env *env, int insn_idx, u32 regn
> >                 if (err)
> >                         return err;
> >                 err = check_map_access(env, regno, off, size, false);
> > +               err = err ?: check_map_kptr_access(env, regno, off, size, value_regno, t, insn_idx);
> > +               if (err < 0)
> > +                       return err;
> > +               /* if err == 0, check_map_kptr_access ignored the access */
> >                 if (!err && t == BPF_READ && value_regno >= 0) {
> >                         struct bpf_map *map = reg->map_ptr;
> >
> > @@ -4442,6 +4560,8 @@ static int check_mem_access(struct bpf_verifier_env *env, int insn_idx, u32 regn)
> >                                 mark_reg_unknown(env, regs, value_regno);
> >                         }
> >                 }
> > +               /* clear err == 1 */
> > +               err = err < 0 ? err : 0;
>
> I find this flow a bit unintuitive to follow. Would something like
>
>     err = check_map_access(env, regno, off, size, false);
>     if (err)
>         return err;
>     if (bpf_map_kptr_off_contains(map, off + reg->var_off.value, &off_desc)) {
>         err = check_map_kptr_access(...) *where check_map_kptr_access
> now returns 0 on success and error code otherwise*
>     } else if (!err && t == BPF_READ && value_regno >= 0) {
>         ...
>     }
>
> be clearer?
>

Agreed, will change.
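
Roughly, for the respin (a sketch only; this assumes
bpf_map_kptr_off_contains() gains an out parameter for the offset
descriptor, as in your pseudocode, and that check_map_kptr_access()
returns 0 on success and a negative error otherwise):

        err = check_map_access(env, regno, off, size, false);
        if (err)
                return err;
        if (tnum_is_const(reg->var_off) &&
            bpf_map_kptr_off_contains(map, off + reg->var_off.value, &off_desc)) {
                /* Access to a kptr slot in the map value */
                err = check_map_kptr_access(env, regno, off_desc, value_regno, insn_idx);
        } else if (t == BPF_READ && value_regno >= 0) {
                /* existing handling for normal map value loads */
                ...
        }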

>
> >         } else if (base_type(reg->type) == PTR_TO_MEM) {
> >                 bool rdonly_mem = type_is_rdonly_mem(reg->type);
> >
> > --
> > 2.35.1
> >

--
Kartikeya
Kumar Kartikeya Dwivedi March 22, 2022, 7:16 a.m. UTC | #4
On Tue, Mar 22, 2022 at 11:15:42AM IST, Andrii Nakryiko wrote:
> On Sun, Mar 20, 2022 at 8:55 AM Kumar Kartikeya Dwivedi
> <memxor@gmail.com> wrote:
> >
> > This commit introduces a new pointer type 'kptr' which can be embedded
> > in a map value as holds a PTR_TO_BTF_ID stored by a BPF program during
> > its invocation. Storing to such a kptr, BPF program's PTR_TO_BTF_ID
> > register must have the same type as in the map value's BTF, and loading
> > a kptr marks the destination register as PTR_TO_BTF_ID with the correct
> > kernel BTF and BTF ID.
> >
> > Such kptr are unreferenced, i.e. by the time another invocation of the
> > BPF program loads this pointer, the object which the pointer points to
> > may not longer exist. Since PTR_TO_BTF_ID loads (using BPF_LDX) are
> > patched to PROBE_MEM loads by the verifier, it would safe to allow user
> > to still access such invalid pointer, but passing such pointers into
> > BPF helpers and kfuncs should not be permitted. A future patch in this
> > series will close this gap.
> >
> > The flexibility offered by allowing programs to dereference such invalid
> > pointers while being safe at runtime frees the verifier from doing
> > complex lifetime tracking. As long as the user may ensure that the
> > object remains valid, it can ensure data read by it from the kernel
> > object is valid.
> >
> > The user indicates that a certain pointer must be treated as kptr
> > capable of accepting stores of PTR_TO_BTF_ID of a certain type, by using
> > a BTF type tag 'kptr' on the pointed to type of the pointer. Then, this
> > information is recorded in the object BTF which will be passed into the
> > kernel by way of map's BTF information. The name and kind from the map
> > value BTF is used to look up the in-kernel type, and the actual BTF and
> > BTF ID is recorded in the map struct in a new kptr_off_tab member. For
> > now, only storing pointers to structs is permitted.
> >
> > An example of this specification is shown below:
> >
> >         #define __kptr __attribute__((btf_type_tag("kptr")))
> >
> >         struct map_value {
> >                 ...
> >                 struct task_struct __kptr *task;
> >                 ...
> >         };
> >
> > Then, in a BPF program, user may store PTR_TO_BTF_ID with the type
> > task_struct into the map, and then load it later.
> >
> > Note that the destination register is marked PTR_TO_BTF_ID_OR_NULL, as
> > the verifier cannot know whether the value is NULL or not statically, it
> > must treat all potential loads at that map value offset as loading a
> > possibly NULL pointer.
> >
> > Only BPF_LDX, BPF_STX, and BPF_ST with insn->imm = 0 (to denote NULL)
> > are allowed instructions that can access such a pointer. On BPF_LDX, the
> > destination register is updated to be a PTR_TO_BTF_ID, and on BPF_STX,
> > it is checked whether the source register type is a PTR_TO_BTF_ID with
> > same BTF type as specified in the map BTF. The access size must always
> > be BPF_DW.
> >
> > For the map in map support, the kptr_off_tab for outer map is copied
> > from the inner map's kptr_off_tab. It was chosen to do a deep copy
> > instead of introducing a refcount to kptr_off_tab, because the copy only
> > needs to be done when paramterizing using inner_map_fd in the map in map
> > case, hence would be unnecessary for all other users.
> >
> > It is not permitted to use MAP_FREEZE command and mmap for BPF map
> > having kptr, similar to the bpf_timer case.
> >
> > Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
> > ---
> >  include/linux/bpf.h     |  29 +++++++-
> >  include/linux/btf.h     |   2 +
> >  kernel/bpf/btf.c        | 161 ++++++++++++++++++++++++++++++++++------
> >  kernel/bpf/map_in_map.c |   5 +-
> >  kernel/bpf/syscall.c    | 112 +++++++++++++++++++++++++++-
> >  kernel/bpf/verifier.c   | 120 ++++++++++++++++++++++++++++++
> >  6 files changed, 401 insertions(+), 28 deletions(-)
> >
>
> [...]
>
> > +static int btf_find_field_kptr(const struct btf *btf, const struct btf_type *t,
> > +                              u32 off, int sz, struct btf_field_info *info)
> > +{
> > +       /* For PTR, sz is always == 8 */
> > +       if (!btf_type_is_ptr(t))
> > +               return BTF_FIELD_IGNORE;
> > +       t = btf_type_by_id(btf, t->type);
> > +
> > +       if (!btf_type_is_type_tag(t))
> > +               return BTF_FIELD_IGNORE;
> > +       /* Reject extra tags */
> > +       if (btf_type_is_type_tag(btf_type_by_id(btf, t->type)))
> > +               return -EINVAL;
>
> Can we have tag -> const -> tag -> volatile -> tag in BTF? Wouldn't
> you assume there are no more tags with just this check?
>

All tags are supposed to come before other modifiers, so any tags appear
first and contiguously. See [0].

Alexei suggested rejecting all other tags for now.

 [0]: https://lore.kernel.org/bpf/20220127154627.665163-1-yhs@fb.com
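
In source terms (illustrative only, assuming the tag ordering rules
from [0]):

        #define __kptr  __attribute__((btf_type_tag("kptr")))
        #define __other __attribute__((btf_type_tag("other")))

        struct task_struct __kptr *a;         /* PTR -> tag -> struct:
                                               * accepted */
        const struct task_struct __kptr *b;   /* PTR -> tag -> const -> struct:
                                               * accepted, trailing modifiers
                                               * are skipped after the tag */
        struct task_struct __kptr __other *c; /* PTR -> tag -> tag -> struct:
                                               * rejected by the extra tag
                                               * check */

so looking at just the type following the first tag is sufficient.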

>
> > +       if (strcmp("kptr", __btf_name_by_offset(btf, t->name_off)))
> > +               return -EINVAL;
> > +
> > +       /* Get the base type */
> > +       if (btf_type_is_modifier(t))
> > +               t = btf_type_skip_modifiers(btf, t->type, NULL);
> > +       /* Only pointer to struct is allowed */
> > +       if (!__btf_type_is_struct(t))
> > +               return -EINVAL;
> > +
> > +       info->type = t;
> > +       info->off = off;
> > +       return BTF_FIELD_FOUND;
> >  }
> >
> >  static int btf_find_struct_field(const struct btf *btf, const struct btf_type *t,
> >                                  const char *name, int sz, int align, int field_type,
> > -                                struct btf_field_info *info)
> > +                                struct btf_field_info *info, int info_cnt)
> >  {
> >         const struct btf_member *member;
> > +       int ret, idx = 0;
> >         u32 i, off;
> > -       int ret;
> >
> >         for_each_member(i, t, member) {
> >                 const struct btf_type *member_type = btf_type_by_id(btf,
> > @@ -3210,24 +3242,35 @@ static int btf_find_struct_field(const struct btf *btf, const struct btf_type *t
> >                 switch (field_type) {
> >                 case BTF_FIELD_SPIN_LOCK:
> >                 case BTF_FIELD_TIMER:
> > -                       ret = btf_find_field_struct(btf, member_type, off, sz, info);
> > +                       ret = btf_find_field_struct(btf, member_type, off, sz, &info[idx]);
> > +                       if (ret < 0)
> > +                               return ret;
> > +                       break;
> > +               case BTF_FIELD_KPTR:
> > +                       ret = btf_find_field_kptr(btf, member_type, off, sz, &info[idx]);
> >                         if (ret < 0)
> >                                 return ret;
> >                         break;
> >                 default:
> >                         return -EFAULT;
> >                 }
> > +
> > +               if (ret == BTF_FIELD_FOUND && idx >= info_cnt)
>
> hm.. haven't you already written info[info_cnt] above by now? I see
> that above you do (info_cnt - 1), but why such tricks if you can have
> a temporary struct btf_field_info on the stack, write into it, and if
> BTF_FIELD_FOUND and idx < info_cnt then write it into info[idx]?
>
>
> > +                       return -E2BIG;
> > +               else if (ret == BTF_FIELD_IGNORE)
> > +                       continue;
> > +               ++idx;
> >         }
> > -       return 0;
> > +       return idx;
> >  }
> >
> >  static int btf_find_datasec_var(const struct btf *btf, const struct btf_type *t,
> >                                 const char *name, int sz, int align, int field_type,
> > -                               struct btf_field_info *info)
> > +                               struct btf_field_info *info, int info_cnt)
> >  {
> >         const struct btf_var_secinfo *vsi;
> > +       int ret, idx = 0;
> >         u32 i, off;
> > -       int ret;
> >
> >         for_each_vsi(i, t, vsi) {
> >                 const struct btf_type *var = btf_type_by_id(btf, vsi->type);
> > @@ -3245,19 +3288,30 @@ static int btf_find_datasec_var(const struct btf *btf, const struct btf_type *t,
> >                 switch (field_type) {
> >                 case BTF_FIELD_SPIN_LOCK:
> >                 case BTF_FIELD_TIMER:
> > -                       ret = btf_find_field_struct(btf, var_type, off, sz, info);
> > +                       ret = btf_find_field_struct(btf, var_type, off, sz, &info[idx]);
> > +                       if (ret < 0)
> > +                               return ret;
> > +                       break;
> > +               case BTF_FIELD_KPTR:
> > +                       ret = btf_find_field_kptr(btf, var_type, off, sz, &info[idx]);
> >                         if (ret < 0)
> >                                 return ret;
> >                         break;
> >                 default:
> >                         return -EFAULT;
> >                 }
> > +
> > +               if (ret == BTF_FIELD_FOUND && idx >= info_cnt)
>
> same, already writing past the end of array?
>
> > +                       return -E2BIG;
> > +               if (ret == BTF_FIELD_IGNORE)
> > +                       continue;
> > +               ++idx;
> >         }
> > -       return 0;
> > +       return idx;
> >  }
> >
> >  static int btf_find_field(const struct btf *btf, const struct btf_type *t,
> > -                         int field_type, struct btf_field_info *info)
> > +                         int field_type, struct btf_field_info *info, int info_cnt)
> >  {
> >         const char *name;
> >         int sz, align;
> > @@ -3273,14 +3327,20 @@ static int btf_find_field(const struct btf *btf, const struct btf_type *t,
> >                 sz = sizeof(struct bpf_timer);
> >                 align = __alignof__(struct bpf_timer);
> >                 break;
> > +       case BTF_FIELD_KPTR:
> > +               name = NULL;
> > +               sz = sizeof(u64);
> > +               align = __alignof__(u64);
>
> can be 4 on 32-bit arch, is that ok?
>

Good catch, it must be 8, so I will hardcode it.
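
i.e., presumably:

        case BTF_FIELD_KPTR:
                name = NULL;
                sz = sizeof(u64);
                /* BPF pointers are always 8 bytes, even on 32-bit hosts */
                align = 8;
                break;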

> > +               break;
> >         default:
> >                 return -EFAULT;
> >         }
> >
> > +       /* The maximum allowed fields of a certain type will be info_cnt - 1 */
> >         if (__btf_type_is_struct(t))
> > -               return btf_find_struct_field(btf, t, name, sz, align, field_type, info);
> > +               return btf_find_struct_field(btf, t, name, sz, align, field_type, info, info_cnt - 1);
>
> why -1, to avoid overwriting past the end of array?
>

Yes, see my reply to Joanne; let's continue the discussion there.

> >         else if (btf_type_is_datasec(t))
> > -               return btf_find_datasec_var(btf, t, name, sz, align, field_type, info);
> > +               return btf_find_datasec_var(btf, t, name, sz, align, field_type, info, info_cnt - 1);
> >         return -EINVAL;
> >  }
> >
> > @@ -3290,24 +3350,79 @@ static int btf_find_field(const struct btf *btf, const struct btf_type *t,
> >   */
> >  int btf_find_spin_lock(const struct btf *btf, const struct btf_type *t)
> >  {
> > -       struct btf_field_info info = { .off = -ENOENT };
> > +       /* btf_find_field requires array of size max + 1 */
>
> ok, right, as I expected above, but see also suggestion to not have
> these weird implicit expectations
>
> > +       struct btf_field_info info_arr[2];
> >         int ret;
> >
> > -       ret = btf_find_field(btf, t, BTF_FIELD_SPIN_LOCK, &info);
> > +       ret = btf_find_field(btf, t, BTF_FIELD_SPIN_LOCK, info_arr, ARRAY_SIZE(info_arr));
> >         if (ret < 0)
> >                 return ret;
> > -       return info.off;
> > +       if (!ret)
> > +               return -ENOENT;
> > +       return info_arr[0].off;
> >  }
> >
> >  int btf_find_timer(const struct btf *btf, const struct btf_type *t)
> >  {
> > -       struct btf_field_info info = { .off = -ENOENT };
> > +       /* btf_find_field requires array of size max + 1 */
> > +       struct btf_field_info info_arr[2];
> >         int ret;
> >
> > -       ret = btf_find_field(btf, t, BTF_FIELD_TIMER, &info);
> > +       ret = btf_find_field(btf, t, BTF_FIELD_TIMER, info_arr, ARRAY_SIZE(info_arr));
> >         if (ret < 0)
> >                 return ret;
> > -       return info.off;
> > +       if (!ret)
> > +               return -ENOENT;
> > +       return info_arr[0].off;
> > +}
> > +
> > +struct bpf_map_value_off *btf_find_kptr(const struct btf *btf,
> > +                                       const struct btf_type *t)
> > +{
> > +       /* btf_find_field requires array of size max + 1 */
> > +       struct btf_field_info info_arr[BPF_MAP_VALUE_OFF_MAX + 1];
> > +       struct bpf_map_value_off *tab;
> > +       int ret, i, nr_off;
> > +
> > +       /* Revisit stack usage when bumping BPF_MAP_VALUE_OFF_MAX */
> > +       BUILD_BUG_ON(BPF_MAP_VALUE_OFF_MAX != 8);
>
> you can store u32 type_id instead of the full btf_type pointer; the
> type lookup below in the loop is cheap and won't fail
>

Ok, will switch to type_id.
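
i.e. something along these lines (sketch):

        struct btf_field_info {
                u32 type_id;    /* was: const struct btf_type *type */
                u32 off;
        };

and in btf_find_kptr():

        /* The lookup is cheap and cannot fail at this point */
        t = btf_type_by_id(btf, info_arr[i].type_id);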

>
> > +
> > +       ret = btf_find_field(btf, t, BTF_FIELD_KPTR, info_arr, ARRAY_SIZE(info_arr));
> > +       if (ret < 0)
> > +               return ERR_PTR(ret);
> > +       if (!ret)
> > +               return NULL;
> > +
>
> [...]
>
> > +
> > +bool bpf_map_equal_kptr_off_tab(const struct bpf_map *map_a, const struct bpf_map *map_b)
> > +{
> > +       struct bpf_map_value_off *tab_a = map_a->kptr_off_tab, *tab_b = map_b->kptr_off_tab;
> > +       bool a_has_kptr = map_value_has_kptr(map_a), b_has_kptr = map_value_has_kptr(map_b);
> > +       int size;
> > +
> > +       if (!a_has_kptr && !b_has_kptr)
> > +               return true;
> > +       if ((a_has_kptr && !b_has_kptr) || (!a_has_kptr && b_has_kptr))
> > +               return false;
>
> if (a_has_kptr != b_has_kptr)
>     return false;
>

Ack.

> > +       if (tab_a->nr_off != tab_b->nr_off)
> > +               return false;
> > +       size = offsetof(struct bpf_map_value_off, off[tab_a->nr_off]);
> > +       return !memcmp(tab_a, tab_b, size);
> > +}
> > +
> >  /* called from workqueue */
> >  static void bpf_map_free_deferred(struct work_struct *work)
> >  {
> >         struct bpf_map *map = container_of(work, struct bpf_map, work);
> >
> >         security_bpf_map_free(map);
> > +       bpf_map_free_kptr_off_tab(map);
> >         bpf_map_release_memcg(map);
> >         /* implementation dependent freeing */
> >         map->ops->map_free(map);
> > @@ -640,7 +724,7 @@ static int bpf_map_mmap(struct file *filp, struct vm_area_struct *vma)
> >         int err;
> >
> >         if (!map->ops->map_mmap || map_value_has_spin_lock(map) ||
> > -           map_value_has_timer(map))
> > +           map_value_has_timer(map) || map_value_has_kptr(map))
> >                 return -ENOTSUPP;
> >
> >         if (!(vma->vm_flags & VM_SHARED))
> > @@ -820,9 +904,31 @@ static int map_check_btf(struct bpf_map *map, const struct btf *btf,
> >                         return -EOPNOTSUPP;
> >         }
> >
> > -       if (map->ops->map_check_btf)
> > +       map->kptr_off_tab = btf_find_kptr(btf, value_type);
>
> btf_find_kptr() is so confusingly named. It certainly can find more
> than one kptr, so at least it should be btf_find_kptrs(). Combining
> with Joanne's suggestion, btf_parse_kptrs() would indeed be better.
>

Ok.

> > +       if (map_value_has_kptr(map)) {
> > +               if (!bpf_capable())
> > +                       return -EPERM;
> > +               if (map->map_flags & BPF_F_RDONLY_PROG) {
> > +                       ret = -EACCES;
> > +                       goto free_map_tab;
> > +               }
> > +               if (map->map_type != BPF_MAP_TYPE_HASH &&
> > +                   map->map_type != BPF_MAP_TYPE_LRU_HASH &&
> > +                   map->map_type != BPF_MAP_TYPE_ARRAY) {
>
> what about PERCPU_ARRAY, for instance? Is there something
> fundamentally wrong with supporting it for local storage maps?
>

Plugging support into maps that already take timers was easier to begin
with; I can add percpu support as a follow-up.

In the case of local storage, I'm a little worried about how we prevent
creating reference cycles. There was a thread where find_get_task_by_pid
was proposed as an unstable helper; once we e.g. support embedding
task_struct in a map and allow storing such a pointer in task local
storage, it would be pretty easy to construct a circular reference cycle
(sketch below).

Should we think about this now, or worry about it when task_struct is
actually supported as a kptr? It's not only task_struct; the same applies
to sock.

There's a discussion to be had, hence I left it out for now.
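
To make the worry concrete (hypothetical sketch; this assumes
task_struct is supported as a referenced kptr and kptr fields are
allowed in task local storage, neither of which is true in this series):

        struct map_value {
                struct task_struct __kptr *task;
        };

        /* task local storage map, keyed by the task itself */
        struct {
                __uint(type, BPF_MAP_TYPE_TASK_STORAGE);
                __uint(map_flags, BPF_F_NO_PREALLOC);
                __type(key, int);
                __type(value, struct map_value);
        } task_store SEC(".maps");

If a program stores a reference to task T in T's own storage slot, the
task pins its local storage and the storage pins the task, so neither
can ever be freed.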

> > +                       ret = -EOPNOTSUPP;
> > +                       goto free_map_tab;
> > +               }
> > +       }
> > +
> > +       if (map->ops->map_check_btf) {
> >                 ret = map->ops->map_check_btf(map, btf, key_type, value_type);
> > +               if (ret < 0)
> > +                       goto free_map_tab;
> > +       }
> >
> > +       return ret;
> > +free_map_tab:
> > +       bpf_map_free_kptr_off_tab(map);
> >         return ret;
> >  }
> >
> > @@ -1639,7 +1745,7 @@ static int map_freeze(const union bpf_attr *attr)
> >                 return PTR_ERR(map);
> >
> >         if (map->map_type == BPF_MAP_TYPE_STRUCT_OPS ||
> > -           map_value_has_timer(map)) {
> > +           map_value_has_timer(map) || map_value_has_kptr(map)) {
> >                 fdput(f);
> >                 return -ENOTSUPP;
> >         }
> > diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> > index 4ce9a528fb63..744b7362e52e 100644
> > --- a/kernel/bpf/verifier.c
> > +++ b/kernel/bpf/verifier.c
> > @@ -3507,6 +3507,94 @@ int check_ptr_off_reg(struct bpf_verifier_env *env,
> >         return __check_ptr_off_reg(env, reg, regno, false);
> >  }
> >
> > +static int map_kptr_match_type(struct bpf_verifier_env *env,
> > +                              struct bpf_map_value_off_desc *off_desc,
> > +                              struct bpf_reg_state *reg, u32 regno)
> > +{
> > +       const char *targ_name = kernel_type_name(off_desc->btf, off_desc->btf_id);
> > +       const char *reg_name = "";
> > +
> > +       if (reg->type != PTR_TO_BTF_ID && reg->type != PTR_TO_BTF_ID_OR_NULL)
>
> base_type(reg->type) != PTR_TO_BTF_ID ?
>
> > +               goto bad_type;
> > +
> > +       if (!btf_is_kernel(reg->btf)) {
> > +               verbose(env, "R%d must point to kernel BTF\n", regno);
> > +               return -EINVAL;
> > +       }
> > +       /* We need to verify reg->type and reg->btf, before accessing reg->btf */
> > +       reg_name = kernel_type_name(reg->btf, reg->btf_id);
> > +
> > +       if (__check_ptr_off_reg(env, reg, regno, true))
> > +               return -EACCES;
> > +
> > +       if (!btf_struct_ids_match(&env->log, reg->btf, reg->btf_id, reg->off,
> > +                                 off_desc->btf, off_desc->btf_id))
> > +               goto bad_type;
> > +       return 0;
> > +bad_type:
> > +       verbose(env, "invalid kptr access, R%d type=%s%s ", regno,
> > +               reg_type_str(env, reg->type), reg_name);
> > +       verbose(env, "expected=%s%s\n", reg_type_str(env, PTR_TO_BTF_ID), targ_name);
>
> why two separate verbose calls? You can easily combine them (and they
> should be output on a single line, given it's a single error)
>

reg_type_str() cannot be called more than once in the same statement, since
it reuses the same buffer.
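
i.e. reg_type_str() formats into a single scratch buffer hanging off env,
so combining both into one statement (sketch):

        verbose(env, "invalid kptr access, R%d type=%s%s expected=%s%s\n",
                regno, reg_type_str(env, reg->type), reg_name,
                reg_type_str(env, PTR_TO_BTF_ID), targ_name);

would print the result of whichever call ran last for both %s specifiers,
hence the two separate verbose() calls.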

> > +       return -EINVAL;
> > +}
> > +
>
> [...]

--
Kartikeya
Kumar Kartikeya Dwivedi March 22, 2022, 7:43 a.m. UTC | #5
On Tue, Mar 22, 2022 at 12:46:40PM IST, Kumar Kartikeya Dwivedi wrote:
> On Tue, Mar 22, 2022 at 11:15:42AM IST, Andrii Nakryiko wrote:
> > On Sun, Mar 20, 2022 at 8:55 AM Kumar Kartikeya Dwivedi
> > <memxor@gmail.com> wrote:
> > >
> > > This commit introduces a new pointer type 'kptr' which can be embedded
> > > in a map value as holds a PTR_TO_BTF_ID stored by a BPF program during
> > > its invocation. Storing to such a kptr, BPF program's PTR_TO_BTF_ID
> > > register must have the same type as in the map value's BTF, and loading
> > > a kptr marks the destination register as PTR_TO_BTF_ID with the correct
> > > kernel BTF and BTF ID.
> > >
> > > Such kptr are unreferenced, i.e. by the time another invocation of the
> > > BPF program loads this pointer, the object which the pointer points to
> > > may not longer exist. Since PTR_TO_BTF_ID loads (using BPF_LDX) are
> > > patched to PROBE_MEM loads by the verifier, it would safe to allow user
> > > to still access such invalid pointer, but passing such pointers into
> > > BPF helpers and kfuncs should not be permitted. A future patch in this
> > > series will close this gap.
> > >
> > > The flexibility offered by allowing programs to dereference such invalid
> > > pointers while being safe at runtime frees the verifier from doing
> > > complex lifetime tracking. As long as the user may ensure that the
> > > object remains valid, it can ensure data read by it from the kernel
> > > object is valid.
> > >
> > > The user indicates that a certain pointer must be treated as kptr
> > > capable of accepting stores of PTR_TO_BTF_ID of a certain type, by using
> > > a BTF type tag 'kptr' on the pointed to type of the pointer. Then, this
> > > information is recorded in the object BTF which will be passed into the
> > > kernel by way of map's BTF information. The name and kind from the map
> > > value BTF is used to look up the in-kernel type, and the actual BTF and
> > > BTF ID is recorded in the map struct in a new kptr_off_tab member. For
> > > now, only storing pointers to structs is permitted.
> > >
> > > An example of this specification is shown below:
> > >
> > >         #define __kptr __attribute__((btf_type_tag("kptr")))
> > >
> > >         struct map_value {
> > >                 ...
> > >                 struct task_struct __kptr *task;
> > >                 ...
> > >         };
> > >
> > > Then, in a BPF program, user may store PTR_TO_BTF_ID with the type
> > > task_struct into the map, and then load it later.
> > >
> > > Note that the destination register is marked PTR_TO_BTF_ID_OR_NULL, as
> > > the verifier cannot know whether the value is NULL or not statically, it
> > > must treat all potential loads at that map value offset as loading a
> > > possibly NULL pointer.
> > >
> > > Only BPF_LDX, BPF_STX, and BPF_ST with insn->imm = 0 (to denote NULL)
> > > are allowed instructions that can access such a pointer. On BPF_LDX, the
> > > destination register is updated to be a PTR_TO_BTF_ID, and on BPF_STX,
> > > it is checked whether the source register type is a PTR_TO_BTF_ID with
> > > same BTF type as specified in the map BTF. The access size must always
> > > be BPF_DW.
> > >
> > > For the map in map support, the kptr_off_tab for outer map is copied
> > > from the inner map's kptr_off_tab. It was chosen to do a deep copy
> > > instead of introducing a refcount to kptr_off_tab, because the copy only
> > > needs to be done when paramterizing using inner_map_fd in the map in map
> > > case, hence would be unnecessary for all other users.
> > >
> > > It is not permitted to use MAP_FREEZE command and mmap for BPF map
> > > having kptr, similar to the bpf_timer case.
> > >
> > > Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
> > > ---
> > >  include/linux/bpf.h     |  29 +++++++-
> > >  include/linux/btf.h     |   2 +
> > >  kernel/bpf/btf.c        | 161 ++++++++++++++++++++++++++++++++++------
> > >  kernel/bpf/map_in_map.c |   5 +-
> > >  kernel/bpf/syscall.c    | 112 +++++++++++++++++++++++++++-
> > >  kernel/bpf/verifier.c   | 120 ++++++++++++++++++++++++++++++
> > >  6 files changed, 401 insertions(+), 28 deletions(-)
> > >
> >
> > [...]
> >
> > > +static int btf_find_field_kptr(const struct btf *btf, const struct btf_type *t,
> > > +                              u32 off, int sz, struct btf_field_info *info)
> > > +{
> > > +       /* For PTR, sz is always == 8 */
> > > +       if (!btf_type_is_ptr(t))
> > > +               return BTF_FIELD_IGNORE;
> > > +       t = btf_type_by_id(btf, t->type);
> > > +
> > > +       if (!btf_type_is_type_tag(t))
> > > +               return BTF_FIELD_IGNORE;
> > > +       /* Reject extra tags */
> > > +       if (btf_type_is_type_tag(btf_type_by_id(btf, t->type)))
> > > +               return -EINVAL;
> >
> > Can we have tag -> const -> tag -> volatile -> tag in BTF? Wouldn't
> > you assume there are no more tags with just this check?
> >
>
> All tags are supposed to come before other modifiers, so any tags appear
> first and contiguously. See [0].
>
> Alexei suggested rejecting all other tags for now.
>
>  [0]: https://lore.kernel.org/bpf/20220127154627.665163-1-yhs@fb.com
>
> >
> > > +       if (strcmp("kptr", __btf_name_by_offset(btf, t->name_off)))
> > > +               return -EINVAL;
> > > +
> > > +       /* Get the base type */
> > > +       if (btf_type_is_modifier(t))
> > > +               t = btf_type_skip_modifiers(btf, t->type, NULL);
> > > +       /* Only pointer to struct is allowed */
> > > +       if (!__btf_type_is_struct(t))
> > > +               return -EINVAL;
> > > +
> > > +       info->type = t;
> > > +       info->off = off;
> > > +       return BTF_FIELD_FOUND;
> > >  }
> > >
> > >  static int btf_find_struct_field(const struct btf *btf, const struct btf_type *t,
> > >                                  const char *name, int sz, int align, int field_type,
> > > -                                struct btf_field_info *info)
> > > +                                struct btf_field_info *info, int info_cnt)
> > >  {
> > >         const struct btf_member *member;
> > > +       int ret, idx = 0;
> > >         u32 i, off;
> > > -       int ret;
> > >
> > >         for_each_member(i, t, member) {
> > >                 const struct btf_type *member_type = btf_type_by_id(btf,
> > > @@ -3210,24 +3242,35 @@ static int btf_find_struct_field(const struct btf *btf, const struct btf_type *t
> > >                 switch (field_type) {
> > >                 case BTF_FIELD_SPIN_LOCK:
> > >                 case BTF_FIELD_TIMER:
> > > -                       ret = btf_find_field_struct(btf, member_type, off, sz, info);
> > > +                       ret = btf_find_field_struct(btf, member_type, off, sz, &info[idx]);
> > > +                       if (ret < 0)
> > > +                               return ret;
> > > +                       break;
> > > +               case BTF_FIELD_KPTR:
> > > +                       ret = btf_find_field_kptr(btf, member_type, off, sz, &info[idx]);
> > >                         if (ret < 0)
> > >                                 return ret;
> > >                         break;
> > >                 default:
> > >                         return -EFAULT;
> > >                 }
> > > +
> > > +               if (ret == BTF_FIELD_FOUND && idx >= info_cnt)
> >
> > hm.. haven't you already written info[info_cnt] above by now? I see
> > that above you do (info_cnt - 1), but why such tricks if you can have
> > a temporary struct btf_field_info on the stack, write into it, and if
> > BTF_FIELD_FOUND and idx < info_cnt then write it into info[idx]?
> >
> >
> > > +                       return -E2BIG;
> > > +               else if (ret == BTF_FIELD_IGNORE)
> > > +                       continue;
> > > +               ++idx;
> > >         }
> > > -       return 0;
> > > +       return idx;
> > >  }
> > >
> > >  static int btf_find_datasec_var(const struct btf *btf, const struct btf_type *t,
> > >                                 const char *name, int sz, int align, int field_type,
> > > -                               struct btf_field_info *info)
> > > +                               struct btf_field_info *info, int info_cnt)
> > >  {
> > >         const struct btf_var_secinfo *vsi;
> > > +       int ret, idx = 0;
> > >         u32 i, off;
> > > -       int ret;
> > >
> > >         for_each_vsi(i, t, vsi) {
> > >                 const struct btf_type *var = btf_type_by_id(btf, vsi->type);
> > > @@ -3245,19 +3288,30 @@ static int btf_find_datasec_var(const struct btf *btf, const struct btf_type *t,
> > >                 switch (field_type) {
> > >                 case BTF_FIELD_SPIN_LOCK:
> > >                 case BTF_FIELD_TIMER:
> > > -                       ret = btf_find_field_struct(btf, var_type, off, sz, info);
> > > +                       ret = btf_find_field_struct(btf, var_type, off, sz, &info[idx]);
> > > +                       if (ret < 0)
> > > +                               return ret;
> > > +                       break;
> > > +               case BTF_FIELD_KPTR:
> > > +                       ret = btf_find_field_kptr(btf, var_type, off, sz, &info[idx]);
> > >                         if (ret < 0)
> > >                                 return ret;
> > >                         break;
> > >                 default:
> > >                         return -EFAULT;
> > >                 }
> > > +
> > > +               if (ret == BTF_FIELD_FOUND && idx >= info_cnt)
> >
> > same, already writing past the end of array?
> >
> > > +                       return -E2BIG;
> > > +               if (ret == BTF_FIELD_IGNORE)
> > > +                       continue;
> > > +               ++idx;
> > >         }
> > > -       return 0;
> > > +       return idx;
> > >  }
> > >
> > >  static int btf_find_field(const struct btf *btf, const struct btf_type *t,
> > > -                         int field_type, struct btf_field_info *info)
> > > +                         int field_type, struct btf_field_info *info, int info_cnt)
> > >  {
> > >         const char *name;
> > >         int sz, align;
> > > @@ -3273,14 +3327,20 @@ static int btf_find_field(const struct btf *btf, const struct btf_type *t,
> > >                 sz = sizeof(struct bpf_timer);
> > >                 align = __alignof__(struct bpf_timer);
> > >                 break;
> > > +       case BTF_FIELD_KPTR:
> > > +               name = NULL;
> > > +               sz = sizeof(u64);
> > > +               align = __alignof__(u64);
> >
> > can be 4 on 32-bit arch, is that ok?
> >
>
> Good catch; it must be 8, so I will hardcode it.
>
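A minimal sketch of that hardcoding (an assumed shape for the follow-up; the
final patch may differ):

	case BTF_FIELD_KPTR:
		name = NULL;
		sz = sizeof(u64);
		/* __alignof__(u64) can be 4 on 32-bit, so hardcode 8 */
		align = 8;
		break;
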
> > > +               break;
> > >         default:
> > >                 return -EFAULT;
> > >         }
> > >
> > > +       /* The maximum allowed fields of a certain type will be info_cnt - 1 */
> > >         if (__btf_type_is_struct(t))
> > > -               return btf_find_struct_field(btf, t, name, sz, align, field_type, info);
> > > +               return btf_find_struct_field(btf, t, name, sz, align, field_type, info, info_cnt - 1);
> >
> > why -1, to avoid overwriting past the end of array?
> >
>
> Yes, see my reply to Joanne, let's continue discussing it there.
>
> > >         else if (btf_type_is_datasec(t))
> > > -               return btf_find_datasec_var(btf, t, name, sz, align, field_type, info);
> > > +               return btf_find_datasec_var(btf, t, name, sz, align, field_type, info, info_cnt - 1);
> > >         return -EINVAL;
> > >  }
> > >
> > > @@ -3290,24 +3350,79 @@ static int btf_find_field(const struct btf *btf, const struct btf_type *t,
> > >   */
> > >  int btf_find_spin_lock(const struct btf *btf, const struct btf_type *t)
> > >  {
> > > -       struct btf_field_info info = { .off = -ENOENT };
> > > +       /* btf_find_field requires array of size max + 1 */
> >
> > ok, right, as I expected above, but see also the suggestion to not have
> > these weird implicit expectations
> >
> > > +       struct btf_field_info info_arr[2];
> > >         int ret;
> > >
> > > -       ret = btf_find_field(btf, t, BTF_FIELD_SPIN_LOCK, &info);
> > > +       ret = btf_find_field(btf, t, BTF_FIELD_SPIN_LOCK, info_arr, ARRAY_SIZE(info_arr));
> > >         if (ret < 0)
> > >                 return ret;
> > > -       return info.off;
> > > +       if (!ret)
> > > +               return -ENOENT;
> > > +       return info_arr[0].off;
> > >  }
> > >
> > >  int btf_find_timer(const struct btf *btf, const struct btf_type *t)
> > >  {
> > > -       struct btf_field_info info = { .off = -ENOENT };
> > > +       /* btf_find_field requires array of size max + 1 */
> > > +       struct btf_field_info info_arr[2];
> > >         int ret;
> > >
> > > -       ret = btf_find_field(btf, t, BTF_FIELD_TIMER, &info);
> > > +       ret = btf_find_field(btf, t, BTF_FIELD_TIMER, info_arr, ARRAY_SIZE(info_arr));
> > >         if (ret < 0)
> > >                 return ret;
> > > -       return info.off;
> > > +       if (!ret)
> > > +               return -ENOENT;
> > > +       return info_arr[0].off;
> > > +}
> > > +
> > > +struct bpf_map_value_off *btf_find_kptr(const struct btf *btf,
> > > +                                       const struct btf_type *t)
> > > +{
> > > +       /* btf_find_field requires array of size max + 1 */
> > > +       struct btf_field_info info_arr[BPF_MAP_VALUE_OFF_MAX + 1];
> > > +       struct bpf_map_value_off *tab;
> > > +       int ret, i, nr_off;
> > > +
> > > +       /* Revisit stack usage when bumping BPF_MAP_VALUE_OFF_MAX */
> > > +       BUILD_BUG_ON(BPF_MAP_VALUE_OFF_MAX != 8);
> >
> > you can store u32 type_id instead of the full btf_type pointer; the type
> > lookup below in the loop is cheap and won't fail
> >
>
> Ok, will switch to type_id.
>
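A minimal sketch of that change (field names are assumptions):

	struct btf_field_info {
		u32 type_id;	/* was: const struct btf_type *type */
		u32 off;
	};

	/* resolve only when needed; lookup by id is cheap and won't fail */
	t = btf_type_by_id(btf, info_arr[i].type_id);
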
> >
> > > +
> > > +       ret = btf_find_field(btf, t, BTF_FIELD_KPTR, info_arr, ARRAY_SIZE(info_arr));
> > > +       if (ret < 0)
> > > +               return ERR_PTR(ret);
> > > +       if (!ret)
> > > +               return NULL;
> > > +
> >
> > [...]
> >
> > > +
> > > +bool bpf_map_equal_kptr_off_tab(const struct bpf_map *map_a, const struct bpf_map *map_b)
> > > +{
> > > +       struct bpf_map_value_off *tab_a = map_a->kptr_off_tab, *tab_b = map_b->kptr_off_tab;
> > > +       bool a_has_kptr = map_value_has_kptr(map_a), b_has_kptr = map_value_has_kptr(map_b);
> > > +       int size;
> > > +
> > > +       if (!a_has_kptr && !b_has_kptr)
> > > +               return true;
> > > +       if ((a_has_kptr && !b_has_kptr) || (!a_has_kptr && b_has_kptr))
> > > +               return false;
> >
> > if (a_has_kptr != b_has_kptr)
> >     return false;
> >
>
> Ack.
>
> > > +       if (tab_a->nr_off != tab_b->nr_off)
> > > +               return false;
> > > +       size = offsetof(struct bpf_map_value_off, off[tab_a->nr_off]);
> > > +       return !memcmp(tab_a, tab_b, size);
> > > +}
> > > +
> > >  /* called from workqueue */
> > >  static void bpf_map_free_deferred(struct work_struct *work)
> > >  {
> > >         struct bpf_map *map = container_of(work, struct bpf_map, work);
> > >
> > >         security_bpf_map_free(map);
> > > +       bpf_map_free_kptr_off_tab(map);
> > >         bpf_map_release_memcg(map);
> > >         /* implementation dependent freeing */
> > >         map->ops->map_free(map);
> > > @@ -640,7 +724,7 @@ static int bpf_map_mmap(struct file *filp, struct vm_area_struct *vma)
> > >         int err;
> > >
> > >         if (!map->ops->map_mmap || map_value_has_spin_lock(map) ||
> > > -           map_value_has_timer(map))
> > > +           map_value_has_timer(map) || map_value_has_kptr(map))
> > >                 return -ENOTSUPP;
> > >
> > >         if (!(vma->vm_flags & VM_SHARED))
> > > @@ -820,9 +904,31 @@ static int map_check_btf(struct bpf_map *map, const struct btf *btf,
> > >                         return -EOPNOTSUPP;
> > >         }
> > >
> > > -       if (map->ops->map_check_btf)
> > > +       map->kptr_off_tab = btf_find_kptr(btf, value_type);
> >
> > btf_find_kptr() is so confusingly named. It certainly can find more
> > than one kptr, so at least it should be btf_find_kptrs(). Combining
> > with Joanne's suggestion, btf_parse_kptrs() would indeed be better.
> >
>
> Ok.
>
> > > +       if (map_value_has_kptr(map)) {
> > > +               if (!bpf_capable())
> > > +                       return -EPERM;
> > > +               if (map->map_flags & BPF_F_RDONLY_PROG) {
> > > +                       ret = -EACCES;
> > > +                       goto free_map_tab;
> > > +               }
> > > +               if (map->map_type != BPF_MAP_TYPE_HASH &&
> > > +                   map->map_type != BPF_MAP_TYPE_LRU_HASH &&
> > > +                   map->map_type != BPF_MAP_TYPE_ARRAY) {
> >
> > what about PERCPU_ARRAY, for instance? Is there something
> > fundamentally wrong with supporting it for local storage maps?
> >
>
> Plugging support into maps that already take timers was easier to begin
> with; I can do percpu support as a follow-up.
>
> In case of local storage, I'm a little worried about how we prevent creating
> reference cycles. There was a thread where find_get_task_by_pid was proposed
> as an unstable helper; once we e.g. support embedding task_struct in a map
> and allow storing such a pointer in task local storage, it would be pretty
> easy to construct a circular reference cycle.
>
> Should we think about this now, or should we worry about it when task_struct
> is actually supported as a kptr? It's not only task_struct; the same applies
> to sock.
>
> There's a discussion to be had, hence I left it out for now.
>
> > > +                       ret = -EOPNOTSUPP;
> > > +                       goto free_map_tab;
> > > +               }
> > > +       }
> > > +
> > > +       if (map->ops->map_check_btf) {
> > >                 ret = map->ops->map_check_btf(map, btf, key_type, value_type);
> > > +               if (ret < 0)
> > > +                       goto free_map_tab;
> > > +       }
> > >
> > > +       return ret;
> > > +free_map_tab:
> > > +       bpf_map_free_kptr_off_tab(map);
> > >         return ret;
> > >  }
> > >
> > > @@ -1639,7 +1745,7 @@ static int map_freeze(const union bpf_attr *attr)
> > >                 return PTR_ERR(map);
> > >
> > >         if (map->map_type == BPF_MAP_TYPE_STRUCT_OPS ||
> > > -           map_value_has_timer(map)) {
> > > +           map_value_has_timer(map) || map_value_has_kptr(map)) {
> > >                 fdput(f);
> > >                 return -ENOTSUPP;
> > >         }
> > > diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> > > index 4ce9a528fb63..744b7362e52e 100644
> > > --- a/kernel/bpf/verifier.c
> > > +++ b/kernel/bpf/verifier.c
> > > @@ -3507,6 +3507,94 @@ int check_ptr_off_reg(struct bpf_verifier_env *env,
> > >         return __check_ptr_off_reg(env, reg, regno, false);
> > >  }
> > >
> > > +static int map_kptr_match_type(struct bpf_verifier_env *env,
> > > +                              struct bpf_map_value_off_desc *off_desc,
> > > +                              struct bpf_reg_state *reg, u32 regno)
> > > +{
> > > +       const char *targ_name = kernel_type_name(off_desc->btf, off_desc->btf_id);
> > > +       const char *reg_name = "";
> > > +
> > > +       if (reg->type != PTR_TO_BTF_ID && reg->type != PTR_TO_BTF_ID_OR_NULL)
> >
> > base_type(reg->type) != PTR_TO_BTF_ID ?
> >
> > > +               goto bad_type;
> > > +
> > > +       if (!btf_is_kernel(reg->btf)) {
> > > +               verbose(env, "R%d must point to kernel BTF\n", regno);
> > > +               return -EINVAL;
> > > +       }
> > > +       /* We need to verify reg->type and reg->btf, before accessing reg->btf */
> > > +       reg_name = kernel_type_name(reg->btf, reg->btf_id);
> > > +
> > > +       if (__check_ptr_off_reg(env, reg, regno, true))
> > > +               return -EACCES;
> > > +
> > > +       if (!btf_struct_ids_match(&env->log, reg->btf, reg->btf_id, reg->off,
> > > +                                 off_desc->btf, off_desc->btf_id))
> > > +               goto bad_type;
> > > +       return 0;
> > > +bad_type:
> > > +       verbose(env, "invalid kptr access, R%d type=%s%s ", regno,
> > > +               reg_type_str(env, reg->type), reg_name);
> > > +       verbose(env, "expected=%s%s\n", reg_type_str(env, PTR_TO_BTF_ID), targ_name);
> >
> > why two separate verbose calls? You can easily combine them (and they
> > should be output on a single line given it's a single error)
> >
>
> reg_type_str cannot be called more than once in the same statement, since it
> reuses the same buffer.
>

I think to fix this we can use an array of buffers (e.g. a max of 6 or so) and
then index with i++ % ARRAY_SIZE(...); this would allow calling it twice in
the same statement.
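
A minimal sketch of that rotating-buffer idea (the type_str_bufs array and
its index are assumed fields, not existing verifier state):

	#define TYPE_STR_BUF_LEN 64
	#define TYPE_STR_BUF_CNT 4	/* assumed; enough for one statement */

	static const char *reg_type_str(struct bpf_verifier_env *env,
					enum bpf_reg_type type)
	{
		char *buf = env->type_str_bufs[env->type_str_buf_idx++ %
					       TYPE_STR_BUF_CNT];

		/* ... format the type name into buf exactly as today ... */
		return buf;
	}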

> > > +       return -EINVAL;
> > > +}
> > > +
> >
> > [...]
>
> --
> Kartikeya

--
Kartikeya
Martin KaFai Lau March 22, 2022, 6:06 p.m. UTC | #6
On Sun, Mar 20, 2022 at 09:25:00PM +0530, Kumar Kartikeya Dwivedi wrote:
> @@ -820,9 +904,31 @@ static int map_check_btf(struct bpf_map *map, const struct btf *btf,
>  			return -EOPNOTSUPP;
>  	}
>  
> -	if (map->ops->map_check_btf)
> +	map->kptr_off_tab = btf_find_kptr(btf, value_type);
> +	if (map_value_has_kptr(map)) {
> +		if (!bpf_capable())
> +			return -EPERM;
Not sure if this has been brought up.

No need to bpf_map_free_kptr_off_tab() in this case?

> +		if (map->map_flags & BPF_F_RDONLY_PROG) {
> +			ret = -EACCES;
> +			goto free_map_tab;
> +		}
> +		if (map->map_type != BPF_MAP_TYPE_HASH &&
> +		    map->map_type != BPF_MAP_TYPE_LRU_HASH &&
> +		    map->map_type != BPF_MAP_TYPE_ARRAY) {
> +			ret = -EOPNOTSUPP;
> +			goto free_map_tab;
> +		}
> +	}
If btf_find_kptr() returns an err, can it be ignored so that map creation
continues?

btw, it is quite unusual to store an err ptr in the map.
How about only storing NULL or a valid ptr in map->kptr_off_tab?
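
A sketch of that suggestion (using the btf_parse_kptrs name proposed
elsewhere in this thread; the error-handling policy is the open question
above):

	struct bpf_map_value_off *tab;

	tab = btf_parse_kptrs(btf, value_type);
	if (IS_ERR(tab))
		return PTR_ERR(tab);	/* or ignore and continue, per above */
	map->kptr_off_tab = tab;	/* now only NULL or a valid pointer */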

> +
> +	if (map->ops->map_check_btf) {
>  		ret = map->ops->map_check_btf(map, btf, key_type, value_type);
> +		if (ret < 0)
> +			goto free_map_tab;
> +	}
>
> +	return ret;
> +free_map_tab:
> +	bpf_map_free_kptr_off_tab(map);
>  	return ret;
>  }
Andrii Nakryiko March 22, 2022, 6:52 p.m. UTC | #7
On Tue, Mar 22, 2022 at 12:16 AM Kumar Kartikeya Dwivedi
<memxor@gmail.com> wrote:
>
> On Tue, Mar 22, 2022 at 11:15:42AM IST, Andrii Nakryiko wrote:
> > On Sun, Mar 20, 2022 at 8:55 AM Kumar Kartikeya Dwivedi
> > <memxor@gmail.com> wrote:
> > >
> > > [...]
> >
> > [...]
> >
> > > +static int btf_find_field_kptr(const struct btf *btf, const struct btf_type *t,
> > > +                              u32 off, int sz, struct btf_field_info *info)
> > > +{
> > > +       /* For PTR, sz is always == 8 */
> > > +       if (!btf_type_is_ptr(t))
> > > +               return BTF_FIELD_IGNORE;
> > > +       t = btf_type_by_id(btf, t->type);
> > > +
> > > +       if (!btf_type_is_type_tag(t))
> > > +               return BTF_FIELD_IGNORE;
> > > +       /* Reject extra tags */
> > > +       if (btf_type_is_type_tag(btf_type_by_id(btf, t->type)))
> > > +               return -EINVAL;
> >
> > Can we have tag -> const -> tag -> volatile -> tag in BTF? Wouldn't
> > you assume there are no more tags with just this check?
> >
>
> All tags are supposed to come before other modifiers, so tags come first,
> contiguously. See [0].

It doesn't seem like the kernel's BTF validator enforces this; we should
probably tighten that up a bit. Clang won't emit such BTF, but nothing
prevents a user from generating non-conformant BTF on their own either.
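
For illustration, such non-conformant BTF is easy to hand-craft from
userspace with libbpf (a hypothetical repro, not something clang would emit;
error checks omitted):

	#include <bpf/btf.h>

	static struct btf *make_nonconformant_btf(void)
	{
		struct btf *btf = btf__new_empty();
		int s, tag, cst;

		s = btf__add_struct(btf, "foo", 4);
		tag = btf__add_type_tag(btf, "kptr", s);
		cst = btf__add_const(btf, tag);
		/* a second tag after const: tag -> const -> tag */
		btf__add_type_tag(btf, "kptr", cst);
		return btf;
	}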

>
> Alexei suggested rejecting all other tags for now.
>
>  [0]: https://lore.kernel.org/bpf/20220127154627.665163-1-yhs@fb.com
>
> >
> > > +       if (strcmp("kptr", __btf_name_by_offset(btf, t->name_off)))
> > > +               return -EINVAL;
> > > +
> > > +       /* Get the base type */
> > > +       if (btf_type_is_modifier(t))
> > > +               t = btf_type_skip_modifiers(btf, t->type, NULL);
> > > +       /* Only pointer to struct is allowed */
> > > +       if (!__btf_type_is_struct(t))
> > > +               return -EINVAL;
> > > +
> > > +       info->type = t;
> > > +       info->off = off;
> > > +       return BTF_FIELD_FOUND;
> > >  }

[...]

> > > +       if (map_value_has_kptr(map)) {
> > > +               if (!bpf_capable())
> > > +                       return -EPERM;
> > > +               if (map->map_flags & BPF_F_RDONLY_PROG) {
> > > +                       ret = -EACCES;
> > > +                       goto free_map_tab;
> > > +               }
> > > +               if (map->map_type != BPF_MAP_TYPE_HASH &&
> > > +                   map->map_type != BPF_MAP_TYPE_LRU_HASH &&
> > > +                   map->map_type != BPF_MAP_TYPE_ARRAY) {
> >
> > what about PERCPU_ARRAY, for instance? Is there something
> > fundamentally wrong with supporting it for local storage maps?
> >
>
> > Plugging support into maps that already take timers was easier to begin
> > with; I can do percpu support as a follow-up.
>
> > In case of local storage, I'm a little worried about how we prevent creating
> > reference cycles. There was a thread where find_get_task_by_pid was proposed
> > as an unstable helper; once we e.g. support embedding task_struct in a map
> > and allow storing such a pointer in task local storage, it would be pretty
> > easy to construct a circular reference cycle.
> >
> > Should we think about this now, or should we worry about it when task_struct
> > is actually supported as a kptr? It's not only task_struct; the same applies
> > to sock.
>
> There's a discussion to be had, hence I left it out for now.

PERCPU_ARRAY seemed (and still seems) like a safe map to support (same
as PERCPU_HASH), which is why I asked. I see concerns about local
storage, though, thanks.
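
To make the cycle concern concrete, a purely hypothetical sketch (task_struct
kptrs are not supported yet, and the pinning only applies to the refcounted
kptrs added later in this series; assumes <vmlinux.h> and
<bpf/bpf_helpers.h>):

	#define __kptr __attribute__((btf_type_tag("kptr")))

	struct map_value {
		struct task_struct __kptr *task;
	};

	struct {
		__uint(type, BPF_MAP_TYPE_TASK_STORAGE);
		__uint(map_flags, BPF_F_NO_PREALLOC);
		__type(key, int);
		__type(value, struct map_value);
	} task_store SEC(".maps");

	/* storing a task's own refcounted pointer into its local storage
	 * would let the task pin its storage and the storage pin the task */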

>
> > > +                       ret = -EOPNOTSUPP;
> > > +                       goto free_map_tab;
> > > +               }
> > > +       }
> > > +
> > > +       if (map->ops->map_check_btf) {
> > >                 ret = map->ops->map_check_btf(map, btf, key_type, value_type);
> > > +               if (ret < 0)
> > > +                       goto free_map_tab;
> > > +       }
> > >
> > > +       return ret;
> > > +free_map_tab:
> > > +       bpf_map_free_kptr_off_tab(map);
> > >         return ret;
> > >  }
> > >
> > > @@ -1639,7 +1745,7 @@ static int map_freeze(const union bpf_attr *attr)
> > >                 return PTR_ERR(map);
> > >
> > >         if (map->map_type == BPF_MAP_TYPE_STRUCT_OPS ||
> > > -           map_value_has_timer(map)) {
> > > +           map_value_has_timer(map) || map_value_has_kptr(map)) {
> > >                 fdput(f);
> > >                 return -ENOTSUPP;
> > >         }
> > > diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> > > index 4ce9a528fb63..744b7362e52e 100644
> > > --- a/kernel/bpf/verifier.c
> > > +++ b/kernel/bpf/verifier.c
> > > @@ -3507,6 +3507,94 @@ int check_ptr_off_reg(struct bpf_verifier_env *env,
> > >         return __check_ptr_off_reg(env, reg, regno, false);
> > >  }
> > >
> > > +static int map_kptr_match_type(struct bpf_verifier_env *env,
> > > +                              struct bpf_map_value_off_desc *off_desc,
> > > +                              struct bpf_reg_state *reg, u32 regno)
> > > +{
> > > +       const char *targ_name = kernel_type_name(off_desc->btf, off_desc->btf_id);
> > > +       const char *reg_name = "";
> > > +
> > > +       if (reg->type != PTR_TO_BTF_ID && reg->type != PTR_TO_BTF_ID_OR_NULL)
> >
> > base_type(reg->type) != PTR_TO_BTF_ID ?
> >
> > > +               goto bad_type;
> > > +
> > > +       if (!btf_is_kernel(reg->btf)) {
> > > +               verbose(env, "R%d must point to kernel BTF\n", regno);
> > > +               return -EINVAL;
> > > +       }
> > > +       /* We need to verify reg->type and reg->btf, before accessing reg->btf */
> > > +       reg_name = kernel_type_name(reg->btf, reg->btf_id);
> > > +
> > > +       if (__check_ptr_off_reg(env, reg, regno, true))
> > > +               return -EACCES;
> > > +
> > > +       if (!btf_struct_ids_match(&env->log, reg->btf, reg->btf_id, reg->off,
> > > +                                 off_desc->btf, off_desc->btf_id))
> > > +               goto bad_type;
> > > +       return 0;
> > > +bad_type:
> > > +       verbose(env, "invalid kptr access, R%d type=%s%s ", regno,
> > > +               reg_type_str(env, reg->type), reg_name);
> > > +       verbose(env, "expected=%s%s\n", reg_type_str(env, PTR_TO_BTF_ID), targ_name);
> >
> > why two separate verbose calls? You can easily combine them (and they
> > should be output on a single line given it's a single error)
> >
>
> reg_type_str cannot be called more than once in the same statement, since it
> reuses the same buffer.
>

ah, subtle, ok, never mind then, no big deal to have two verbose()
calls if there is a reason for it

> > > +       return -EINVAL;
> > > +}
> > > +
> >
> > [...]
>
> --
> Kartikeya
Andrii Nakryiko March 22, 2022, 8:22 p.m. UTC | #8
On Tue, Mar 22, 2022 at 12:05 AM Kumar Kartikeya Dwivedi
<memxor@gmail.com> wrote:
>
> On Tue, Mar 22, 2022 at 05:09:30AM IST, Joanne Koong wrote:
> > On Sun, Mar 20, 2022 at 5:27 PM Kumar Kartikeya Dwivedi
> > <memxor@gmail.com> wrote:
> > >
> > > [...]
> > [...]
> > > +
> > >  struct bpf_map *bpf_map_get(u32 ufd);
> > >  struct bpf_map *bpf_map_get_with_uref(u32 ufd);
> > >  struct bpf_map *__bpf_map_get(struct fd f);
> > > diff --git a/include/linux/btf.h b/include/linux/btf.h
> > > index 36bc09b8e890..5b578dc81c04 100644
> > > --- a/include/linux/btf.h
> > > +++ b/include/linux/btf.h
> > > @@ -123,6 +123,8 @@ bool btf_member_is_reg_int(const struct btf *btf, const struct btf_type *s,
> > >                            u32 expected_offset, u32 expected_size);
> > >  int btf_find_spin_lock(const struct btf *btf, const struct btf_type *t);
> > >  int btf_find_timer(const struct btf *btf, const struct btf_type *t);
> > > +struct bpf_map_value_off *btf_find_kptr(const struct btf *btf,
> > > +                                       const struct btf_type *t);
> >
> > nit: given that "btf_find_kptr" allocates memory as well, maybe the
> > name "btf_parse_kptr" would be more reflective?
> >
>
> Good point, will change.
>
> > >  bool btf_type_is_void(const struct btf_type *t);
> > >  s32 btf_find_by_name_kind(const struct btf *btf, const char *name, u8 kind);
> > >  const struct btf_type *btf_type_skip_modifiers(const struct btf *btf,
> > > diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
> > > index 9e17af936a7a..92afbec0a887 100644
> > > --- a/kernel/bpf/btf.c
> > > +++ b/kernel/bpf/btf.c
> > > @@ -3164,9 +3164,16 @@ static void btf_struct_log(struct btf_verifier_env *env,
> > >  enum {
> > >         BTF_FIELD_SPIN_LOCK,
> > >         BTF_FIELD_TIMER,
> > > +       BTF_FIELD_KPTR,
> > > +};
> > > +
> > > +enum {
> > > +       BTF_FIELD_IGNORE = 0,
> > > +       BTF_FIELD_FOUND  = 1,
> > >  };
> > >
> > >  struct btf_field_info {
> > > +       const struct btf_type *type;
> > >         u32 off;
> > >  };
> > >
> > > @@ -3174,23 +3181,48 @@ static int btf_find_field_struct(const struct btf *btf, const struct btf_type *t
> > >                                  u32 off, int sz, struct btf_field_info *info)
> > >  {
> > >         if (!__btf_type_is_struct(t))
> > > -               return 0;
> > > +               return BTF_FIELD_IGNORE;
> > >         if (t->size != sz)
> > > -               return 0;
> > > -       if (info->off != -ENOENT)
> > > -               /* only one such field is allowed */
> > > -               return -E2BIG;
> > > +               return BTF_FIELD_IGNORE;
> > >         info->off = off;
> > > -       return 0;
> > > +       return BTF_FIELD_FOUND;
> > > +}
> > > +
> > > +static int btf_find_field_kptr(const struct btf *btf, const struct btf_type *t,
> > > +                              u32 off, int sz, struct btf_field_info *info)
> > > +{
> > > +       /* For PTR, sz is always == 8 */
> > > +       if (!btf_type_is_ptr(t))
> > > +               return BTF_FIELD_IGNORE;
> > > +       t = btf_type_by_id(btf, t->type);
> > > +
> > > +       if (!btf_type_is_type_tag(t))
> > > +               return BTF_FIELD_IGNORE;
> > > +       /* Reject extra tags */
> > > +       if (btf_type_is_type_tag(btf_type_by_id(btf, t->type)))
> > > +               return -EINVAL;
> > > +       if (strcmp("kptr", __btf_name_by_offset(btf, t->name_off)))
> > > +               return -EINVAL;
> > > +
> > > +       /* Get the base type */
> > > +       if (btf_type_is_modifier(t))
> > > +               t = btf_type_skip_modifiers(btf, t->type, NULL);
> > > +       /* Only pointer to struct is allowed */
> > > +       if (!__btf_type_is_struct(t))
> > > +               return -EINVAL;
> > > +
> > > +       info->type = t;
> > > +       info->off = off;
> > > +       return BTF_FIELD_FOUND;
> > >  }
> > >
> > >  static int btf_find_struct_field(const struct btf *btf, const struct btf_type *t,
> > >                                  const char *name, int sz, int align, int field_type,
> > > -                                struct btf_field_info *info)
> > > +                                struct btf_field_info *info, int info_cnt)
> >
> > From my understanding, this patch now modifies btf_find_struct_field
> > and btf_find_datasec_var such that the "info" that is passed in has to
> > be an array of size max possible + 1 while "info_cnt" is the max
> > possible count, or we risk writing beyond the "info" array passed in.
> > It seems like we could just modify the
> > btf_find_struct_field/btf_find_datasec_var logic so that the user can
> > just pass in info array of max possible size instead of max possible
> > size + 1 - or is your concern that this would require more idx >=
> > info_cnt checks inside the functions? Maybe we should include a
> > comment here and in btf_find_datasec_var to document that "info"
> > should always be max possible size + 1?
> >
>
> So for some context on why this was changed, follow [0].
>
> I agree it's pretty ugly. My first thought was to check it inside the
> functions, but that is also not very great. So I went with this; one more
> suggestion from Alexei was to split it into a find and then a fill of info,
> because the error on idx >= info_cnt should only happen after we find. Right
> now the find and fill happen together, so to error out, you need an extra
> element it can fill before you bail for ARRAY_SIZE - 1 (which is the actual
> max).
>
> TBH the find + fill split looks best to me, but open to more suggestions.

I think there is a much simpler way that doesn't require unnecessary
copying or splitting anything:

struct btf_field_info tmp;

...

ret = btf_find_field_struct(btf, member_type, off, sz,
                            idx < info_cnt ? &info[idx] : &tmp);

...

That's it.
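
Slotted into btf_find_struct_field, that might look like this (a sketch
following the patch's naming; only the kptr case is shown):

	struct btf_field_info tmp;
	int ret, idx = 0;

	for_each_member(i, t, member) {
		...
		ret = btf_find_field_kptr(btf, member_type, off, sz,
					  idx < info_cnt ? &info[idx] : &tmp);
		if (ret < 0)
			return ret;
		if (ret == BTF_FIELD_IGNORE)
			continue;
		/* one more field was found than the caller can accept */
		if (idx >= info_cnt)
			return -E2BIG;
		++idx;
	}
	return idx;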

>
> [0]: https://lore.kernel.org/bpf/20220319181538.nbqdkprjrzkxk7v4@ast-mbp.dhcp.thefacebook.com
>

[...]
Kumar Kartikeya Dwivedi March 25, 2022, 2:42 p.m. UTC | #9
On Wed, Mar 23, 2022 at 12:22:20AM IST, Andrii Nakryiko wrote:
> On Tue, Mar 22, 2022 at 12:16 AM Kumar Kartikeya Dwivedi
> <memxor@gmail.com> wrote:
> >
> > On Tue, Mar 22, 2022 at 11:15:42AM IST, Andrii Nakryiko wrote:
> > > On Sun, Mar 20, 2022 at 8:55 AM Kumar Kartikeya Dwivedi
> > > <memxor@gmail.com> wrote:
> > > >
> > > > [...]
> > >
> > > [...]
> > >
> > > > +static int btf_find_field_kptr(const struct btf *btf, const struct btf_type *t,
> > > > +                              u32 off, int sz, struct btf_field_info *info)
> > > > +{
> > > > +       /* For PTR, sz is always == 8 */
> > > > +       if (!btf_type_is_ptr(t))
> > > > +               return BTF_FIELD_IGNORE;
> > > > +       t = btf_type_by_id(btf, t->type);
> > > > +
> > > > +       if (!btf_type_is_type_tag(t))
> > > > +               return BTF_FIELD_IGNORE;
> > > > +       /* Reject extra tags */
> > > > +       if (btf_type_is_type_tag(btf_type_by_id(btf, t->type)))
> > > > +               return -EINVAL;
> > >
> > > Can we have tag -> const -> tag -> volatile -> tag in BTF? Wouldn't
> > > you assume there are no more tags with just this check?
> > >
> >
> > All tags are supposed to come before other modifiers, so tags come first,
> > contiguously. See [0].
>
> It doesn't seem like the kernel's BTF validator enforces this; we should
> probably tighten that up a bit. Clang won't emit such BTF, but nothing
> prevents a user from generating non-conformant BTF on their own either.
>

Right, what would be a good place to do this validation? When loading BTF
using the bpf(2) syscall, or when we do btf_parse_kptrs?
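
Wherever it ends up, the check itself could be a simple walk over the
modifier chain (a sketch reusing helpers already used in this patch; error
checks omitted for brevity):

	static int btf_check_type_tags_first(const struct btf *btf, u32 id)
	{
		const struct btf_type *t = btf_type_by_id(btf, id);
		bool seen_non_tag = false;

		while (btf_type_is_modifier(t)) {
			if (btf_type_is_type_tag(t)) {
				if (seen_non_tag)
					return -EINVAL; /* tag after const etc. */
			} else {
				seen_non_tag = true;
			}
			t = btf_type_by_id(btf, t->type);
		}
		return 0;
	}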

> >
> > Alexei suggested rejecting all other tags for now.
> >
> >  [0]: https://lore.kernel.org/bpf/20220127154627.665163-1-yhs@fb.com
> >
> > >
> > > > +       if (strcmp("kptr", __btf_name_by_offset(btf, t->name_off)))
> > > > +               return -EINVAL;
> > > > +
> > > > +       /* Get the base type */
> > > > +       if (btf_type_is_modifier(t))
> > > > +               t = btf_type_skip_modifiers(btf, t->type, NULL);
> > > > +       /* Only pointer to struct is allowed */
> > > > +       if (!__btf_type_is_struct(t))
> > > > +               return -EINVAL;
> > > > +
> > > > +       info->type = t;
> > > > +       info->off = off;
> > > > +       return BTF_FIELD_FOUND;
> > > >  }
>
> [...]
>
> > > > +       if (map_value_has_kptr(map)) {
> > > > +               if (!bpf_capable())
> > > > +                       return -EPERM;
> > > > +               if (map->map_flags & BPF_F_RDONLY_PROG) {
> > > > +                       ret = -EACCES;
> > > > +                       goto free_map_tab;
> > > > +               }
> > > > +               if (map->map_type != BPF_MAP_TYPE_HASH &&
> > > > +                   map->map_type != BPF_MAP_TYPE_LRU_HASH &&
> > > > +                   map->map_type != BPF_MAP_TYPE_ARRAY) {
> > >
> > > what about PERCPU_ARRAY, for instance? Is there something
> > > fundamentally wrong with supporting it for local storage maps?
> > >
> >
> > Plugging support into maps that already take timers was easier to begin
> > with; I can do percpu support as a follow-up.
> >
> > In case of local storage, I'm a little worried about how we prevent creating
> > reference cycles. There was a thread where find_get_task_by_pid was proposed
> > as an unstable helper; once we e.g. support embedding task_struct in a map
> > and allow storing such a pointer in task local storage, it would be pretty
> > easy to construct a circular reference cycle.
> >
> > Should we think about this now, or should we worry about it when task_struct
> > is actually supported as a kptr? It's not only task_struct; the same applies
> > to sock.
> >
> > There's a discussion to be had, hence I left it out for now.
>
> PERCPU_ARRAY seemed (and still seems) like a safe map to support (same
> as PERCPU_HASH), which is why I asked. I see concerns about local
> storage, though, thanks.
>

I'll look into it after this lands.

> >
> > > > +                       ret = -EOPNOTSUPP;
> > > > +                       goto free_map_tab;
> > > > +               }
> > > > +       }
> > > > +
> > > > +       if (map->ops->map_check_btf) {
> > > >                 ret = map->ops->map_check_btf(map, btf, key_type, value_type);
> > > > +               if (ret < 0)
> > > > +                       goto free_map_tab;
> > > > +       }
> > > >
> > > > +       return ret;
> > > > +free_map_tab:
> > > > +       bpf_map_free_kptr_off_tab(map);
> > > >         return ret;
> > > >  }
> > > >
> > > > @@ -1639,7 +1745,7 @@ static int map_freeze(const union bpf_attr *attr)
> > > >                 return PTR_ERR(map);
> > > >
> > > >         if (map->map_type == BPF_MAP_TYPE_STRUCT_OPS ||
> > > > -           map_value_has_timer(map)) {
> > > > +           map_value_has_timer(map) || map_value_has_kptr(map)) {
> > > >                 fdput(f);
> > > >                 return -ENOTSUPP;
> > > >         }
> > > > diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> > > > index 4ce9a528fb63..744b7362e52e 100644
> > > > --- a/kernel/bpf/verifier.c
> > > > +++ b/kernel/bpf/verifier.c
> > > > @@ -3507,6 +3507,94 @@ int check_ptr_off_reg(struct bpf_verifier_env *env,
> > > >         return __check_ptr_off_reg(env, reg, regno, false);
> > > >  }
> > > >
> > > > +static int map_kptr_match_type(struct bpf_verifier_env *env,
> > > > +                              struct bpf_map_value_off_desc *off_desc,
> > > > +                              struct bpf_reg_state *reg, u32 regno)
> > > > +{
> > > > +       const char *targ_name = kernel_type_name(off_desc->btf, off_desc->btf_id);
> > > > +       const char *reg_name = "";
> > > > +
> > > > +       if (reg->type != PTR_TO_BTF_ID && reg->type != PTR_TO_BTF_ID_OR_NULL)
> > >
> > > base_type(reg->type) != PTR_TO_BTF_ID ?
> > >
> > > > +               goto bad_type;
> > > > +
> > > > +       if (!btf_is_kernel(reg->btf)) {
> > > > +               verbose(env, "R%d must point to kernel BTF\n", regno);
> > > > +               return -EINVAL;
> > > > +       }
> > > > +       /* We need to verify reg->type and reg->btf, before accessing reg->btf */
> > > > +       reg_name = kernel_type_name(reg->btf, reg->btf_id);
> > > > +
> > > > +       if (__check_ptr_off_reg(env, reg, regno, true))
> > > > +               return -EACCES;
> > > > +
> > > > +       if (!btf_struct_ids_match(&env->log, reg->btf, reg->btf_id, reg->off,
> > > > +                                 off_desc->btf, off_desc->btf_id))
> > > > +               goto bad_type;
> > > > +       return 0;
> > > > +bad_type:
> > > > +       verbose(env, "invalid kptr access, R%d type=%s%s ", regno,
> > > > +               reg_type_str(env, reg->type), reg_name);
> > > > +       verbose(env, "expected=%s%s\n", reg_type_str(env, PTR_TO_BTF_ID), targ_name);
> > >
> > > why two separate verbose calls? You can easily combine them (and they
> > > should be output on a single line given it's a single error)
> > >
> >
> > reg_type_str cannot be called more than once in the same statement, since it
> > reuses the same buffer.
> >
>
> ah, subtle, ok, never mind then, no big deal to have two verbose()
> calls if there is a reason for it
>
> > > > +       return -EINVAL;
> > > > +}
> > > > +
> > >
> > > [...]
> >
> > --
> > Kartikeya

--
Kartikeya
Kumar Kartikeya Dwivedi March 25, 2022, 2:45 p.m. UTC | #10
On Tue, Mar 22, 2022 at 11:36:55PM IST, Martin KaFai Lau wrote:
> On Sun, Mar 20, 2022 at 09:25:00PM +0530, Kumar Kartikeya Dwivedi wrote:
> > @@ -820,9 +904,31 @@ static int map_check_btf(struct bpf_map *map, const struct btf *btf,
> >  			return -EOPNOTSUPP;
> >  	}
> >
> > -	if (map->ops->map_check_btf)
> > +	map->kptr_off_tab = btf_find_kptr(btf, value_type);
> > +	if (map_value_has_kptr(map)) {
> > +		if (!bpf_capable())
> > +			return -EPERM;
> Not sure if this has been brought up.
>
> No need to bpf_map_free_kptr_off_tab() in this case?
>

Good catch, it should indeed be freed.
For the error case in map_check_btf's caller, I'm relying on the map_free
callback to handle the freeing.

> > +		if (map->map_flags & BPF_F_RDONLY_PROG) {
> > +			ret = -EACCES;
> > +			goto free_map_tab;
> > +		}
> > +		if (map->map_type != BPF_MAP_TYPE_HASH &&
> > +		    map->map_type != BPF_MAP_TYPE_LRU_HASH &&
> > +		    map->map_type != BPF_MAP_TYPE_ARRAY) {
> > +			ret = -EOPNOTSUPP;
> > +			goto free_map_tab;
> > +		}
> > +	}
> If btf_find_kptr() returns an err, can it be ignored so that map creation
> continues?
>
> btw, it is quite unusual to store an err ptr in the map.
> How about only storing NULL or a valid ptr in map->kptr_off_tab?
>

It allows us to report a clear error from process_kptr_func, similar to
storing an error in place of spin_lock_off and timer_off, so for consistency
I kept it similar to those. But IS_ERR_OR_NULL still means no kptr_off_tab is
present; in places where it matters we don't distinguish between the two
(e.g. map_value_has_kptr also checks for both).
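
i.e., presumably along these lines (a sketch of the semantics described;
naming follows the patch):

	static inline bool map_value_has_kptr(const struct bpf_map *map)
	{
		/* both NULL and an err ptr mean "no kptrs in the map value" */
		return !IS_ERR_OR_NULL(map->kptr_off_tab);
	}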

> > +
> > +	if (map->ops->map_check_btf) {
> >  		ret = map->ops->map_check_btf(map, btf, key_type, value_type);
> > +		if (ret < 0)
> > +			goto free_map_tab;
> > +	}
> >
> > +	return ret;
> > +free_map_tab:
> > +	bpf_map_free_kptr_off_tab(map);
> >  	return ret;
> >  }

--
Kartikeya
Kumar Kartikeya Dwivedi March 25, 2022, 2:51 p.m. UTC | #11
On Wed, Mar 23, 2022 at 01:52:52AM IST, Andrii Nakryiko wrote:
> On Tue, Mar 22, 2022 at 12:05 AM Kumar Kartikeya Dwivedi
> <memxor@gmail.com> wrote:
> >
> > On Tue, Mar 22, 2022 at 05:09:30AM IST, Joanne Koong wrote:
> > > On Sun, Mar 20, 2022 at 5:27 PM Kumar Kartikeya Dwivedi
> > > <memxor@gmail.com> wrote:
> > > >
> > > > [...]
> > > [...]
> > > > +
> > > >  struct bpf_map *bpf_map_get(u32 ufd);
> > > >  struct bpf_map *bpf_map_get_with_uref(u32 ufd);
> > > >  struct bpf_map *__bpf_map_get(struct fd f);
> > > > diff --git a/include/linux/btf.h b/include/linux/btf.h
> > > > index 36bc09b8e890..5b578dc81c04 100644
> > > > --- a/include/linux/btf.h
> > > > +++ b/include/linux/btf.h
> > > > @@ -123,6 +123,8 @@ bool btf_member_is_reg_int(const struct btf *btf, const struct btf_type *s,
> > > >                            u32 expected_offset, u32 expected_size);
> > > >  int btf_find_spin_lock(const struct btf *btf, const struct btf_type *t);
> > > >  int btf_find_timer(const struct btf *btf, const struct btf_type *t);
> > > > +struct bpf_map_value_off *btf_find_kptr(const struct btf *btf,
> > > > +                                       const struct btf_type *t);
> > >
> > > nit: given that "btf_find_kptr" allocates memory as well, maybe the
> > > name "btf_parse_kptr" would be more reflective?
> > >
> >
> > Good point, will change.
> >
> > > >  bool btf_type_is_void(const struct btf_type *t);
> > > >  s32 btf_find_by_name_kind(const struct btf *btf, const char *name, u8 kind);
> > > >  const struct btf_type *btf_type_skip_modifiers(const struct btf *btf,
> > > > diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
> > > > index 9e17af936a7a..92afbec0a887 100644
> > > > --- a/kernel/bpf/btf.c
> > > > +++ b/kernel/bpf/btf.c
> > > > @@ -3164,9 +3164,16 @@ static void btf_struct_log(struct btf_verifier_env *env,
> > > >  enum {
> > > >         BTF_FIELD_SPIN_LOCK,
> > > >         BTF_FIELD_TIMER,
> > > > +       BTF_FIELD_KPTR,
> > > > +};
> > > > +
> > > > +enum {
> > > > +       BTF_FIELD_IGNORE = 0,
> > > > +       BTF_FIELD_FOUND  = 1,
> > > >  };
> > > >
> > > >  struct btf_field_info {
> > > > +       const struct btf_type *type;
> > > >         u32 off;
> > > >  };
> > > >
> > > > @@ -3174,23 +3181,48 @@ static int btf_find_field_struct(const struct btf *btf, const struct btf_type *t
> > > >                                  u32 off, int sz, struct btf_field_info *info)
> > > >  {
> > > >         if (!__btf_type_is_struct(t))
> > > > -               return 0;
> > > > +               return BTF_FIELD_IGNORE;
> > > >         if (t->size != sz)
> > > > -               return 0;
> > > > -       if (info->off != -ENOENT)
> > > > -               /* only one such field is allowed */
> > > > -               return -E2BIG;
> > > > +               return BTF_FIELD_IGNORE;
> > > >         info->off = off;
> > > > -       return 0;
> > > > +       return BTF_FIELD_FOUND;
> > > > +}
> > > > +
> > > > +static int btf_find_field_kptr(const struct btf *btf, const struct btf_type *t,
> > > > +                              u32 off, int sz, struct btf_field_info *info)
> > > > +{
> > > > +       /* For PTR, sz is always == 8 */
> > > > +       if (!btf_type_is_ptr(t))
> > > > +               return BTF_FIELD_IGNORE;
> > > > +       t = btf_type_by_id(btf, t->type);
> > > > +
> > > > +       if (!btf_type_is_type_tag(t))
> > > > +               return BTF_FIELD_IGNORE;
> > > > +       /* Reject extra tags */
> > > > +       if (btf_type_is_type_tag(btf_type_by_id(btf, t->type)))
> > > > +               return -EINVAL;
> > > > +       if (strcmp("kptr", __btf_name_by_offset(btf, t->name_off)))
> > > > +               return -EINVAL;
> > > > +
> > > > +       /* Get the base type */
> > > > +       if (btf_type_is_modifier(t))
> > > > +               t = btf_type_skip_modifiers(btf, t->type, NULL);
> > > > +       /* Only pointer to struct is allowed */
> > > > +       if (!__btf_type_is_struct(t))
> > > > +               return -EINVAL;
> > > > +
> > > > +       info->type = t;
> > > > +       info->off = off;
> > > > +       return BTF_FIELD_FOUND;
> > > >  }
> > > >
> > > >  static int btf_find_struct_field(const struct btf *btf, const struct btf_type *t,
> > > >                                  const char *name, int sz, int align, int field_type,
> > > > -                                struct btf_field_info *info)
> > > > +                                struct btf_field_info *info, int info_cnt)
> > >
> > > From my understanding, this patch now modifies btf_find_struct_field
> > > and btf_find_datasec_var such that the "info" array passed in has to
> > > have max possible size + 1 elements while "info_cnt" is the max
> > > possible count, or we risk writing beyond the end of the "info" array.
> > > It seems like we could just modify the
> > > btf_find_struct_field/btf_find_datasec_var logic so that the caller
> > > can pass in an info array of exactly the max possible size, or is
> > > your concern that this would require more idx >= info_cnt checks
> > > inside the functions? Maybe we should include a comment here and in
> > > btf_find_datasec_var to document that "info" must always have max
> > > possible size + 1 elements?
> > >
> >
> > For some context on why this was changed, see [0].
> >
> > I agree it's pretty ugly. My first thought was to do the check inside
> > the functions, but that is not great either. So I went with this; one
> > more suggestion from Alexei was to split it into a find pass and a
> > separate fill pass, because the error on idx >= info_cnt should only
> > happen after we actually find a field. Right now find and fill happen
> > together, so to be able to error out, you need an extra element that
> > can be filled before you bail at ARRAY_SIZE - 1 (which is the actual
> > max).
> >
> > TBH the find + fill split looks best to me, but open to more suggestions.
>
> I think there is a much simpler way that doesn't require unnecessary
> copying or splitting anything:
>
> struct btf_field_info tmp;
>
> ...
>
> ret = btf_find_field_struct(btf, member_type, off, sz,
>                             idx < info_cnt ? &info[idx] : &tmp);
>
> ...
>
> That's it.
>

Indeed, not sure why I was overthinking this, it should work :).
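
For reference, a rough sketch of how that would look in the
btf_find_struct_field loop (untested, surrounding code elided; the
same shape would apply to btf_find_datasec_var):

    struct btf_field_info tmp;
    int ret, idx = 0;
    ...
    for_each_member(i, t, member) {
            ...
            /* Always hand the helper a valid slot to fill; spill
             * into tmp once the caller-provided array is full.
             */
            ret = btf_find_field_struct(btf, member_type, off, sz,
                                        idx < info_cnt ? &info[idx] : &tmp);
            if (ret < 0)
                    return ret;
            if (ret == BTF_FIELD_IGNORE)
                    continue;
            /* We found one more field than the caller has room for */
            if (idx >= info_cnt)
                    return -E2BIG;
            ++idx;
    }
    return idx;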

> >
> > [0]: https://lore.kernel.org/bpf/20220319181538.nbqdkprjrzkxk7v4@ast-mbp.dhcp.thefacebook.com
> >
>
> [...]

--
Kartikeya
Andrii Nakryiko March 25, 2022, 10:59 p.m. UTC | #12
On Fri, Mar 25, 2022 at 7:42 AM Kumar Kartikeya Dwivedi
<memxor@gmail.com> wrote:
>
> On Wed, Mar 23, 2022 at 12:22:20AM IST, Andrii Nakryiko wrote:
> > On Tue, Mar 22, 2022 at 12:16 AM Kumar Kartikeya Dwivedi
> > <memxor@gmail.com> wrote:
> > >
> > > On Tue, Mar 22, 2022 at 11:15:42AM IST, Andrii Nakryiko wrote:
> > > > On Sun, Mar 20, 2022 at 8:55 AM Kumar Kartikeya Dwivedi
> > > > <memxor@gmail.com> wrote:
> > > > >
> > > > > This commit introduces a new pointer type 'kptr' which can be embedded
> > > > > in a map value and holds a PTR_TO_BTF_ID stored by a BPF program during
> > > > > its invocation. When storing to such a kptr, the BPF program's
> > > > > PTR_TO_BTF_ID register must have the same type as in the map value's
> > > > > BTF, and loading a kptr marks the destination register as PTR_TO_BTF_ID
> > > > > with the correct kernel BTF and BTF ID.
> > > > >
> > > > > Such kptrs are unreferenced, i.e. by the time another invocation of the
> > > > > BPF program loads this pointer, the object which the pointer points to
> > > > > may no longer exist. Since PTR_TO_BTF_ID loads (using BPF_LDX) are
> > > > > patched to PROBE_MEM loads by the verifier, it would be safe to allow
> > > > > the user to still access such an invalid pointer, but passing such
> > > > > pointers into BPF helpers and kfuncs should not be permitted. A future
> > > > > patch in this series will close this gap.
> > > > >
> > > > > The flexibility offered by allowing programs to dereference such invalid
> > > > > pointers while being safe at runtime frees the verifier from doing
> > > > > complex lifetime tracking. As long as the user ensures that the object
> > > > > remains valid, the data the program reads from the kernel object will
> > > > > be valid.
> > > > >
> > > > > The user indicates that a certain pointer must be treated as a kptr
> > > > > capable of accepting stores of PTR_TO_BTF_ID of a certain type, by using
> > > > > a BTF type tag 'kptr' on the pointed-to type of the pointer. This
> > > > > information is recorded in the object BTF, which will be passed into the
> > > > > kernel by way of the map's BTF information. The name and kind from the
> > > > > map value BTF are used to look up the in-kernel type, and the actual BTF
> > > > > and BTF ID are recorded in the map struct in a new kptr_off_tab member.
> > > > > For now, only storing pointers to structs is permitted.
> > > > >
> > > > > An example of this specification is shown below:
> > > > >
> > > > >         #define __kptr __attribute__((btf_type_tag("kptr")))
> > > > >
> > > > >         struct map_value {
> > > > >                 ...
> > > > >                 struct task_struct __kptr *task;
> > > > >                 ...
> > > > >         };
> > > > >
> > > > > Then, in a BPF program, the user may store a PTR_TO_BTF_ID with type
> > > > > task_struct into the map, and load it back later.
> > > > >
> > > > > Note that the destination register is marked PTR_TO_BTF_ID_OR_NULL:
> > > > > since the verifier cannot statically know whether the value is NULL,
> > > > > it must treat every load at that map value offset as loading a
> > > > > possibly NULL pointer.
> > > > >
> > > > > Only BPF_LDX, BPF_STX, and BPF_ST with insn->imm = 0 (to denote NULL)
> > > > > are allowed to access such a pointer. On BPF_LDX, the destination
> > > > > register is updated to be a PTR_TO_BTF_ID, and on BPF_STX, it is
> > > > > checked whether the source register type is a PTR_TO_BTF_ID with the
> > > > > same BTF type as specified in the map BTF. The access size must always
> > > > > be BPF_DW.
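> > > > >
> > > > > As a rough illustrative sketch (not part of this patch; the map
> > > > > definition, section name, and libbpf helpers are assumed), a program
> > > > > using the map_value above could look like:
> > > > >
> > > > >         struct {
> > > > >                 __uint(type, BPF_MAP_TYPE_ARRAY);
> > > > >                 __uint(max_entries, 1);
> > > > >                 __type(key, int);
> > > > >                 __type(value, struct map_value);
> > > > >         } array_map SEC(".maps");
> > > > >
> > > > >         SEC("tp_btf/task_newtask")
> > > > >         int BPF_PROG(store_task, struct task_struct *task, u64 clone_flags)
> > > > >         {
> > > > >                 struct map_value *v;
> > > > >                 int key = 0;
> > > > >
> > > > >                 v = bpf_map_lookup_elem(&array_map, &key);
> > > > >                 if (!v)
> > > > >                         return 0;
> > > > >                 /* BPF_STX: src must be PTR_TO_BTF_ID of task_struct */
> > > > >                 v->task = task;
> > > > >                 /* BPF_LDX: dst is PTR_TO_BTF_ID_OR_NULL, so check NULL */
> > > > >                 task = v->task;
> > > > >                 if (task)
> > > > >                         bpf_printk("pid=%d", task->pid);
> > > > >                 return 0;
> > > > >         }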
> > > > >
> > > > > For map in map support, the kptr_off_tab of the outer map is copied
> > > > > from the inner map's kptr_off_tab. A deep copy was chosen over
> > > > > refcounting kptr_off_tab, because the copy only needs to be done when
> > > > > parameterizing using inner_map_fd in the map in map case, hence a
> > > > > refcount would be unnecessary for all other users.
> > > > >
> > > > > It is not permitted to use the MAP_FREEZE command or mmap for a BPF map
> > > > > holding a kptr, similar to the bpf_timer case.
> > > > >
> > > > > Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
> > > > > ---
> > > > >  include/linux/bpf.h     |  29 +++++++-
> > > > >  include/linux/btf.h     |   2 +
> > > > >  kernel/bpf/btf.c        | 161 ++++++++++++++++++++++++++++++++++------
> > > > >  kernel/bpf/map_in_map.c |   5 +-
> > > > >  kernel/bpf/syscall.c    | 112 +++++++++++++++++++++++++++-
> > > > >  kernel/bpf/verifier.c   | 120 ++++++++++++++++++++++++++++++
> > > > >  6 files changed, 401 insertions(+), 28 deletions(-)
> > > > >
> > > >
> > > > [...]
> > > >
> > > > > +static int btf_find_field_kptr(const struct btf *btf, const struct btf_type *t,
> > > > > +                              u32 off, int sz, struct btf_field_info *info)
> > > > > +{
> > > > > +       /* For PTR, sz is always == 8 */
> > > > > +       if (!btf_type_is_ptr(t))
> > > > > +               return BTF_FIELD_IGNORE;
> > > > > +       t = btf_type_by_id(btf, t->type);
> > > > > +
> > > > > +       if (!btf_type_is_type_tag(t))
> > > > > +               return BTF_FIELD_IGNORE;
> > > > > +       /* Reject extra tags */
> > > > > +       if (btf_type_is_type_tag(btf_type_by_id(btf, t->type)))
> > > > > +               return -EINVAL;
> > > >
> > > > Can we have tag -> const -> tag -> volatile -> tag in BTF? Wouldn't
> > > > you assume there are no more tags with just this check?
> > > >
> > >
> > > All tags are supposed to come before other modifiers, so tags appear
> > > first, contiguously. See [0].
> >
> > It doesn't seem like the kernel's BTF validator enforces this; we should
> > probably tighten that up a bit. Clang won't emit such BTF, but nothing
> > prevents a user from generating non-conformant BTF on their own either.
> >
>
> Right, what would be a good place to do this validation? When loading BTF
> via the bpf(2) syscall, or when we do this in btf_parse_kptrs?

Given this is a generic BTF property we are expecting and enforcing, it
should be in the normal BTF validation logic.
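
A rough sketch of what such a check could look like (untested, and the
hook point in the validation pass is assumed; the helpers are the
existing ones in btf.c):

    /* Type tags must come first and contiguously in a modifier chain:
     * tag -> tag -> const is fine, tag -> const -> tag is not.
     */
    static int btf_check_type_tags_order(const struct btf *btf,
                                         const struct btf_type *t)
    {
            bool tags_done = false;

            while (btf_type_is_modifier(t)) {
                    if (btf_type_is_type_tag(t)) {
                            /* A tag after a non-tag modifier is invalid */
                            if (tags_done)
                                    return -EINVAL;
                    } else {
                            tags_done = true;
                    }
                    t = btf_type_by_id(btf, t->type);
            }
            return 0;
    }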

>
> > >
> > > Alexei suggested to reject all other tags for now.
> > >
> > >  [0]: https://lore.kernel.org/bpf/20220127154627.665163-1-yhs@fb.com
> > >
> > > >
> > > > > +       if (strcmp("kptr", __btf_name_by_offset(btf, t->name_off)))
> > > > > +               return -EINVAL;
> > > > > +
> > > > > +       /* Get the base type */
> > > > > +       if (btf_type_is_modifier(t))
> > > > > +               t = btf_type_skip_modifiers(btf, t->type, NULL);
> > > > > +       /* Only pointer to struct is allowed */
> > > > > +       if (!__btf_type_is_struct(t))
> > > > > +               return -EINVAL;
> > > > > +
> > > > > +       info->type = t;
> > > > > +       info->off = off;
> > > > > +       return BTF_FIELD_FOUND;
> > > > >  }
> >
> > [...]
> >
> > > > > +       if (map_value_has_kptr(map)) {
> > > > > +               if (!bpf_capable())
> > > > > +                       return -EPERM;
> > > > > +               if (map->map_flags & BPF_F_RDONLY_PROG) {
> > > > > +                       ret = -EACCES;
> > > > > +                       goto free_map_tab;
> > > > > +               }
> > > > > +               if (map->map_type != BPF_MAP_TYPE_HASH &&
> > > > > +                   map->map_type != BPF_MAP_TYPE_LRU_HASH &&
> > > > > +                   map->map_type != BPF_MAP_TYPE_ARRAY) {
> > > >
> > > > what about PERCPU_ARRAY, for instance? Is there something
> > > > fundamentally wrong with supporting it for local storage maps?
> > > >
> > >
> > > Plugging support into maps that already take timers was easier to begin
> > > with; I can do percpu support as a follow-up.
> > >
> > > In the case of local storage, I'm a little worried about how we prevent
> > > creating reference cycles. There was a thread where find_get_task_by_pid
> > > was proposed as an unstable helper; once we e.g. support embedding
> > > task_struct in a map and allow storing such a pointer in task local
> > > storage, it would be pretty easy to construct a circular reference cycle.
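> > >
> > > To make that concrete (everything below is hypothetical and not part
> > > of this series: a referenced kptr tag, an acquired reference, and
> > > kptr support in task local storage are all assumed):
> > >
> > >         struct map_value {
> > >                 struct task_struct __kptr_ref *task; /* hypothetical */
> > >         };
> > >
> > >         /* in a program running against task p */
> > >         v = bpf_task_storage_get(&task_map, p, NULL,
> > >                                  BPF_LOCAL_STORAGE_GET_F_CREATE);
> > >         if (v)
> > >                 v->task = p_acq; /* hypothetical acquired ref to p */
> > >
> > > The local storage would then pin the task while the task pins its own
> > > local storage, so neither could ever be freed.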
> > >
> > > Should we think about this now, or worry about it when task_struct is
> > > actually supported as a kptr? It's not only task_struct; the same
> > > applies to sock.
> > >
> > > There's a discussion to be had, hence I left it out for now.
> >
> > PERCPU_ARRAY seemed (and still seems) like a safe map to support (same
> > as PERCPU_HASH), which is why I asked. I see concerns about local
> > storage, though, thanks.
> >
>
> I'll look into it after this lands.
>
> > >
> > > > > +                       ret = -EOPNOTSUPP;
> > > > > +                       goto free_map_tab;
> > > > > +               }
> > > > > +       }
> > > > > +

[...]
diff mbox series

Patch

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 88449fbbe063..f35920d279dd 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -155,6 +155,22 @@  struct bpf_map_ops {
 	const struct bpf_iter_seq_info *iter_seq_info;
 };
 
+enum {
+	/* Support at most 8 pointers in a BPF map value */
+	BPF_MAP_VALUE_OFF_MAX = 8,
+};
+
+struct bpf_map_value_off_desc {
+	u32 offset;
+	u32 btf_id;
+	struct btf *btf;
+};
+
+struct bpf_map_value_off {
+	u32 nr_off;
+	struct bpf_map_value_off_desc off[];
+};
+
 struct bpf_map {
 	/* The first two cachelines with read-mostly members of which some
 	 * are also accessed in fast-path (e.g. ops, max_entries).
@@ -171,6 +187,7 @@  struct bpf_map {
 	u64 map_extra; /* any per-map-type extra fields */
 	u32 map_flags;
 	int spin_lock_off; /* >=0 valid offset, <0 error */
+	struct bpf_map_value_off *kptr_off_tab;
 	int timer_off; /* >=0 valid offset, <0 error */
 	u32 id;
 	int numa_node;
@@ -184,7 +201,7 @@  struct bpf_map {
 	char name[BPF_OBJ_NAME_LEN];
 	bool bypass_spec_v1;
 	bool frozen; /* write-once; write-protected by freeze_mutex */
-	/* 14 bytes hole */
+	/* 6 bytes hole */
 
 	/* The 3rd and 4th cacheline with misc members to avoid false sharing
 	 * particularly with refcounting.
@@ -217,6 +234,11 @@  static inline bool map_value_has_timer(const struct bpf_map *map)
 	return map->timer_off >= 0;
 }
 
+static inline bool map_value_has_kptr(const struct bpf_map *map)
+{
+	return !IS_ERR_OR_NULL(map->kptr_off_tab);
+}
+
 static inline void check_and_init_map_value(struct bpf_map *map, void *dst)
 {
 	if (unlikely(map_value_has_spin_lock(map)))
@@ -1497,6 +1519,11 @@  void bpf_prog_put(struct bpf_prog *prog);
 void bpf_prog_free_id(struct bpf_prog *prog, bool do_idr_lock);
 void bpf_map_free_id(struct bpf_map *map, bool do_idr_lock);
 
+struct bpf_map_value_off_desc *bpf_map_kptr_off_contains(struct bpf_map *map, u32 offset);
+void bpf_map_free_kptr_off_tab(struct bpf_map *map);
+struct bpf_map_value_off *bpf_map_copy_kptr_off_tab(const struct bpf_map *map);
+bool bpf_map_equal_kptr_off_tab(const struct bpf_map *map_a, const struct bpf_map *map_b);
+
 struct bpf_map *bpf_map_get(u32 ufd);
 struct bpf_map *bpf_map_get_with_uref(u32 ufd);
 struct bpf_map *__bpf_map_get(struct fd f);
diff --git a/include/linux/btf.h b/include/linux/btf.h
index 36bc09b8e890..5b578dc81c04 100644
--- a/include/linux/btf.h
+++ b/include/linux/btf.h
@@ -123,6 +123,8 @@  bool btf_member_is_reg_int(const struct btf *btf, const struct btf_type *s,
 			   u32 expected_offset, u32 expected_size);
 int btf_find_spin_lock(const struct btf *btf, const struct btf_type *t);
 int btf_find_timer(const struct btf *btf, const struct btf_type *t);
+struct bpf_map_value_off *btf_find_kptr(const struct btf *btf,
+					const struct btf_type *t);
 bool btf_type_is_void(const struct btf_type *t);
 s32 btf_find_by_name_kind(const struct btf *btf, const char *name, u8 kind);
 const struct btf_type *btf_type_skip_modifiers(const struct btf *btf,
diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
index 9e17af936a7a..92afbec0a887 100644
--- a/kernel/bpf/btf.c
+++ b/kernel/bpf/btf.c
@@ -3164,9 +3164,16 @@  static void btf_struct_log(struct btf_verifier_env *env,
 enum {
 	BTF_FIELD_SPIN_LOCK,
 	BTF_FIELD_TIMER,
+	BTF_FIELD_KPTR,
+};
+
+enum {
+	BTF_FIELD_IGNORE = 0,
+	BTF_FIELD_FOUND  = 1,
 };
 
 struct btf_field_info {
+	const struct btf_type *type;
 	u32 off;
 };
 
@@ -3174,23 +3181,48 @@  static int btf_find_field_struct(const struct btf *btf, const struct btf_type *t
 				 u32 off, int sz, struct btf_field_info *info)
 {
 	if (!__btf_type_is_struct(t))
-		return 0;
+		return BTF_FIELD_IGNORE;
 	if (t->size != sz)
-		return 0;
-	if (info->off != -ENOENT)
-		/* only one such field is allowed */
-		return -E2BIG;
+		return BTF_FIELD_IGNORE;
 	info->off = off;
-	return 0;
+	return BTF_FIELD_FOUND;
+}
+
+static int btf_find_field_kptr(const struct btf *btf, const struct btf_type *t,
+			       u32 off, int sz, struct btf_field_info *info)
+{
+	/* For PTR, sz is always == 8 */
+	if (!btf_type_is_ptr(t))
+		return BTF_FIELD_IGNORE;
+	t = btf_type_by_id(btf, t->type);
+
+	if (!btf_type_is_type_tag(t))
+		return BTF_FIELD_IGNORE;
+	/* Reject extra tags */
+	if (btf_type_is_type_tag(btf_type_by_id(btf, t->type)))
+		return -EINVAL;
+	if (strcmp("kptr", __btf_name_by_offset(btf, t->name_off)))
+		return -EINVAL;
+
+	/* Get the base type */
+	if (btf_type_is_modifier(t))
+		t = btf_type_skip_modifiers(btf, t->type, NULL);
+	/* Only pointer to struct is allowed */
+	if (!__btf_type_is_struct(t))
+		return -EINVAL;
+
+	info->type = t;
+	info->off = off;
+	return BTF_FIELD_FOUND;
 }
 
 static int btf_find_struct_field(const struct btf *btf, const struct btf_type *t,
 				 const char *name, int sz, int align, int field_type,
-				 struct btf_field_info *info)
+				 struct btf_field_info *info, int info_cnt)
 {
 	const struct btf_member *member;
+	int ret, idx = 0;
 	u32 i, off;
-	int ret;
 
 	for_each_member(i, t, member) {
 		const struct btf_type *member_type = btf_type_by_id(btf,
@@ -3210,24 +3242,35 @@  static int btf_find_struct_field(const struct btf *btf, const struct btf_type *t
 		switch (field_type) {
 		case BTF_FIELD_SPIN_LOCK:
 		case BTF_FIELD_TIMER:
-			ret = btf_find_field_struct(btf, member_type, off, sz, info);
+			ret = btf_find_field_struct(btf, member_type, off, sz, &info[idx]);
+			if (ret < 0)
+				return ret;
+			break;
+		case BTF_FIELD_KPTR:
+			ret = btf_find_field_kptr(btf, member_type, off, sz, &info[idx]);
 			if (ret < 0)
 				return ret;
 			break;
 		default:
 			return -EFAULT;
 		}
+
+		if (ret == BTF_FIELD_FOUND && idx >= info_cnt)
+			return -E2BIG;
+		else if (ret == BTF_FIELD_IGNORE)
+			continue;
+		++idx;
 	}
-	return 0;
+	return idx;
 }
 
 static int btf_find_datasec_var(const struct btf *btf, const struct btf_type *t,
 				const char *name, int sz, int align, int field_type,
-				struct btf_field_info *info)
+				struct btf_field_info *info, int info_cnt)
 {
 	const struct btf_var_secinfo *vsi;
+	int ret, idx = 0;
 	u32 i, off;
-	int ret;
 
 	for_each_vsi(i, t, vsi) {
 		const struct btf_type *var = btf_type_by_id(btf, vsi->type);
@@ -3245,19 +3288,30 @@  static int btf_find_datasec_var(const struct btf *btf, const struct btf_type *t,
 		switch (field_type) {
 		case BTF_FIELD_SPIN_LOCK:
 		case BTF_FIELD_TIMER:
-			ret = btf_find_field_struct(btf, var_type, off, sz, info);
+			ret = btf_find_field_struct(btf, var_type, off, sz, &info[idx]);
+			if (ret < 0)
+				return ret;
+			break;
+		case BTF_FIELD_KPTR:
+			ret = btf_find_field_kptr(btf, var_type, off, sz, &info[idx]);
 			if (ret < 0)
 				return ret;
 			break;
 		default:
 			return -EFAULT;
 		}
+
+		if (ret == BTF_FIELD_FOUND && idx >= info_cnt)
+			return -E2BIG;
+		if (ret == BTF_FIELD_IGNORE)
+			continue;
+		++idx;
 	}
-	return 0;
+	return idx;
 }
 
 static int btf_find_field(const struct btf *btf, const struct btf_type *t,
-			  int field_type, struct btf_field_info *info)
+			  int field_type, struct btf_field_info *info, int info_cnt)
 {
 	const char *name;
 	int sz, align;
@@ -3273,14 +3327,20 @@  static int btf_find_field(const struct btf *btf, const struct btf_type *t,
 		sz = sizeof(struct bpf_timer);
 		align = __alignof__(struct bpf_timer);
 		break;
+	case BTF_FIELD_KPTR:
+		name = NULL;
+		sz = sizeof(u64);
+		align = __alignof__(u64);
+		break;
 	default:
 		return -EFAULT;
 	}
 
+	/* The maximum allowed fields of a certain type will be info_cnt - 1 */
 	if (__btf_type_is_struct(t))
-		return btf_find_struct_field(btf, t, name, sz, align, field_type, info);
+		return btf_find_struct_field(btf, t, name, sz, align, field_type, info, info_cnt - 1);
 	else if (btf_type_is_datasec(t))
-		return btf_find_datasec_var(btf, t, name, sz, align, field_type, info);
+		return btf_find_datasec_var(btf, t, name, sz, align, field_type, info, info_cnt - 1);
 	return -EINVAL;
 }
 
@@ -3290,24 +3350,79 @@  static int btf_find_field(const struct btf *btf, const struct btf_type *t,
  */
 int btf_find_spin_lock(const struct btf *btf, const struct btf_type *t)
 {
-	struct btf_field_info info = { .off = -ENOENT };
+	/* btf_find_field requires array of size max + 1 */
+	struct btf_field_info info_arr[2];
 	int ret;
 
-	ret = btf_find_field(btf, t, BTF_FIELD_SPIN_LOCK, &info);
+	ret = btf_find_field(btf, t, BTF_FIELD_SPIN_LOCK, info_arr, ARRAY_SIZE(info_arr));
 	if (ret < 0)
 		return ret;
-	return info.off;
+	if (!ret)
+		return -ENOENT;
+	return info_arr[0].off;
 }
 
 int btf_find_timer(const struct btf *btf, const struct btf_type *t)
 {
-	struct btf_field_info info = { .off = -ENOENT };
+	/* btf_find_field requires array of size max + 1 */
+	struct btf_field_info info_arr[2];
 	int ret;
 
-	ret = btf_find_field(btf, t, BTF_FIELD_TIMER, &info);
+	ret = btf_find_field(btf, t, BTF_FIELD_TIMER, info_arr, ARRAY_SIZE(info_arr));
 	if (ret < 0)
 		return ret;
-	return info.off;
+	if (!ret)
+		return -ENOENT;
+	return info_arr[0].off;
+}
+
+struct bpf_map_value_off *btf_find_kptr(const struct btf *btf,
+					const struct btf_type *t)
+{
+	/* btf_find_field requires array of size max + 1 */
+	struct btf_field_info info_arr[BPF_MAP_VALUE_OFF_MAX + 1];
+	struct bpf_map_value_off *tab;
+	int ret, i, nr_off;
+
+	/* Revisit stack usage when bumping BPF_MAP_VALUE_OFF_MAX */
+	BUILD_BUG_ON(BPF_MAP_VALUE_OFF_MAX != 8);
+
+	ret = btf_find_field(btf, t, BTF_FIELD_KPTR, info_arr, ARRAY_SIZE(info_arr));
+	if (ret < 0)
+		return ERR_PTR(ret);
+	if (!ret)
+		return NULL;
+
+	nr_off = ret;
+	tab = kzalloc(offsetof(struct bpf_map_value_off, off[nr_off]), GFP_KERNEL | __GFP_NOWARN);
+	if (!tab)
+		return ERR_PTR(-ENOMEM);
+
+	tab->nr_off = 0;
+	for (i = 0; i < nr_off; i++) {
+		const struct btf_type *t;
+		struct btf *off_btf;
+		s32 id;
+
+		t = info_arr[i].type;
+		id = bpf_find_btf_id(__btf_name_by_offset(btf, t->name_off), BTF_INFO_KIND(t->info),
+				     &off_btf);
+		if (id < 0) {
+			ret = id;
+			goto end;
+		}
+
+		tab->off[i].offset = info_arr[i].off;
+		tab->off[i].btf_id = id;
+		tab->off[i].btf = off_btf;
+		tab->nr_off = i + 1;
+	}
+	return tab;
+end:
+	while (tab->nr_off--)
+		btf_put(tab->off[tab->nr_off].btf);
+	kfree(tab);
+	return ERR_PTR(ret);
 }
 
 static void __btf_struct_show(const struct btf *btf, const struct btf_type *t,
diff --git a/kernel/bpf/map_in_map.c b/kernel/bpf/map_in_map.c
index 5cd8f5277279..135205d0d560 100644
--- a/kernel/bpf/map_in_map.c
+++ b/kernel/bpf/map_in_map.c
@@ -52,6 +52,7 @@  struct bpf_map *bpf_map_meta_alloc(int inner_map_ufd)
 	inner_map_meta->max_entries = inner_map->max_entries;
 	inner_map_meta->spin_lock_off = inner_map->spin_lock_off;
 	inner_map_meta->timer_off = inner_map->timer_off;
+	inner_map_meta->kptr_off_tab = bpf_map_copy_kptr_off_tab(inner_map);
 	if (inner_map->btf) {
 		btf_get(inner_map->btf);
 		inner_map_meta->btf = inner_map->btf;
@@ -71,6 +72,7 @@  struct bpf_map *bpf_map_meta_alloc(int inner_map_ufd)
 
 void bpf_map_meta_free(struct bpf_map *map_meta)
 {
+	bpf_map_free_kptr_off_tab(map_meta);
 	btf_put(map_meta->btf);
 	kfree(map_meta);
 }
@@ -83,7 +85,8 @@  bool bpf_map_meta_equal(const struct bpf_map *meta0,
 		meta0->key_size == meta1->key_size &&
 		meta0->value_size == meta1->value_size &&
 		meta0->timer_off == meta1->timer_off &&
-		meta0->map_flags == meta1->map_flags;
+		meta0->map_flags == meta1->map_flags &&
+		bpf_map_equal_kptr_off_tab(meta0, meta1);
 }
 
 void *bpf_map_fd_get_ptr(struct bpf_map *map,
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index cdaa1152436a..5990d6fa97ab 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -6,6 +6,7 @@ 
 #include <linux/bpf_trace.h>
 #include <linux/bpf_lirc.h>
 #include <linux/bpf_verifier.h>
+#include <linux/bsearch.h>
 #include <linux/btf.h>
 #include <linux/syscalls.h>
 #include <linux/slab.h>
@@ -473,12 +474,95 @@  static void bpf_map_release_memcg(struct bpf_map *map)
 }
 #endif
 
+static int bpf_map_kptr_off_cmp(const void *a, const void *b)
+{
+	const struct bpf_map_value_off_desc *off_desc1 = a, *off_desc2 = b;
+
+	if (off_desc1->offset < off_desc2->offset)
+		return -1;
+	else if (off_desc1->offset > off_desc2->offset)
+		return 1;
+	return 0;
+}
+
+struct bpf_map_value_off_desc *bpf_map_kptr_off_contains(struct bpf_map *map, u32 offset)
+{
+	/* Since members are iterated in btf_find_field in increasing order,
+	 * offsets appended to kptr_off_tab are in increasing order, so we can
+	 * do bsearch to find exact match.
+	 */
+	struct bpf_map_value_off *tab;
+
+	if (!map_value_has_kptr(map))
+		return NULL;
+	tab = map->kptr_off_tab;
+	return bsearch(&offset, tab->off, tab->nr_off, sizeof(tab->off[0]), bpf_map_kptr_off_cmp);
+}
+
+void bpf_map_free_kptr_off_tab(struct bpf_map *map)
+{
+	struct bpf_map_value_off *tab = map->kptr_off_tab;
+	int i;
+
+	if (!map_value_has_kptr(map))
+		return;
+	for (i = 0; i < tab->nr_off; i++) {
+		struct btf *btf = tab->off[i].btf;
+
+		btf_put(btf);
+	}
+	kfree(tab);
+	map->kptr_off_tab = NULL;
+}
+
+struct bpf_map_value_off *bpf_map_copy_kptr_off_tab(const struct bpf_map *map)
+{
+	struct bpf_map_value_off *tab = map->kptr_off_tab, *new_tab;
+	int size, i, ret;
+
+	if (!map_value_has_kptr(map))
+		return ERR_PTR(-ENOENT);
+	/* Do a deep copy of the kptr_off_tab */
+	for (i = 0; i < tab->nr_off; i++)
+		btf_get(tab->off[i].btf);
+
+	size = offsetof(struct bpf_map_value_off, off[tab->nr_off]);
+	new_tab = kzalloc(size, GFP_KERNEL | __GFP_NOWARN);
+	if (!new_tab) {
+		ret = -ENOMEM;
+		goto end;
+	}
+	memcpy(new_tab, tab, size);
+	return new_tab;
+end:
+	while (i--)
+		btf_put(tab->off[i].btf);
+	return ERR_PTR(ret);
+}
+
+bool bpf_map_equal_kptr_off_tab(const struct bpf_map *map_a, const struct bpf_map *map_b)
+{
+	struct bpf_map_value_off *tab_a = map_a->kptr_off_tab, *tab_b = map_b->kptr_off_tab;
+	bool a_has_kptr = map_value_has_kptr(map_a), b_has_kptr = map_value_has_kptr(map_b);
+	int size;
+
+	if (!a_has_kptr && !b_has_kptr)
+		return true;
+	if ((a_has_kptr && !b_has_kptr) || (!a_has_kptr && b_has_kptr))
+		return false;
+	if (tab_a->nr_off != tab_b->nr_off)
+		return false;
+	size = offsetof(struct bpf_map_value_off, off[tab_a->nr_off]);
+	return !memcmp(tab_a, tab_b, size);
+}
+
 /* called from workqueue */
 static void bpf_map_free_deferred(struct work_struct *work)
 {
 	struct bpf_map *map = container_of(work, struct bpf_map, work);
 
 	security_bpf_map_free(map);
+	bpf_map_free_kptr_off_tab(map);
 	bpf_map_release_memcg(map);
 	/* implementation dependent freeing */
 	map->ops->map_free(map);
@@ -640,7 +724,7 @@  static int bpf_map_mmap(struct file *filp, struct vm_area_struct *vma)
 	int err;
 
 	if (!map->ops->map_mmap || map_value_has_spin_lock(map) ||
-	    map_value_has_timer(map))
+	    map_value_has_timer(map) || map_value_has_kptr(map))
 		return -ENOTSUPP;
 
 	if (!(vma->vm_flags & VM_SHARED))
@@ -820,9 +904,31 @@  static int map_check_btf(struct bpf_map *map, const struct btf *btf,
 			return -EOPNOTSUPP;
 	}
 
-	if (map->ops->map_check_btf)
+	map->kptr_off_tab = btf_find_kptr(btf, value_type);
+	if (map_value_has_kptr(map)) {
+		if (!bpf_capable())
+			return -EPERM;
+		if (map->map_flags & BPF_F_RDONLY_PROG) {
+			ret = -EACCES;
+			goto free_map_tab;
+		}
+		if (map->map_type != BPF_MAP_TYPE_HASH &&
+		    map->map_type != BPF_MAP_TYPE_LRU_HASH &&
+		    map->map_type != BPF_MAP_TYPE_ARRAY) {
+			ret = -EOPNOTSUPP;
+			goto free_map_tab;
+		}
+	}
+
+	if (map->ops->map_check_btf) {
 		ret = map->ops->map_check_btf(map, btf, key_type, value_type);
+		if (ret < 0)
+			goto free_map_tab;
+	}
 
+	return ret;
+free_map_tab:
+	bpf_map_free_kptr_off_tab(map);
 	return ret;
 }
 
@@ -1639,7 +1745,7 @@  static int map_freeze(const union bpf_attr *attr)
 		return PTR_ERR(map);
 
 	if (map->map_type == BPF_MAP_TYPE_STRUCT_OPS ||
-	    map_value_has_timer(map)) {
+	    map_value_has_timer(map) || map_value_has_kptr(map)) {
 		fdput(f);
 		return -ENOTSUPP;
 	}
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 4ce9a528fb63..744b7362e52e 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -3507,6 +3507,94 @@  int check_ptr_off_reg(struct bpf_verifier_env *env,
 	return __check_ptr_off_reg(env, reg, regno, false);
 }
 
+static int map_kptr_match_type(struct bpf_verifier_env *env,
+			       struct bpf_map_value_off_desc *off_desc,
+			       struct bpf_reg_state *reg, u32 regno)
+{
+	const char *targ_name = kernel_type_name(off_desc->btf, off_desc->btf_id);
+	const char *reg_name = "";
+
+	if (reg->type != PTR_TO_BTF_ID && reg->type != PTR_TO_BTF_ID_OR_NULL)
+		goto bad_type;
+
+	if (!btf_is_kernel(reg->btf)) {
+		verbose(env, "R%d must point to kernel BTF\n", regno);
+		return -EINVAL;
+	}
+	/* We need to verify reg->type and reg->btf, before accessing reg->btf */
+	reg_name = kernel_type_name(reg->btf, reg->btf_id);
+
+	if (__check_ptr_off_reg(env, reg, regno, true))
+		return -EACCES;
+
+	if (!btf_struct_ids_match(&env->log, reg->btf, reg->btf_id, reg->off,
+				  off_desc->btf, off_desc->btf_id))
+		goto bad_type;
+	return 0;
+bad_type:
+	verbose(env, "invalid kptr access, R%d type=%s%s ", regno,
+		reg_type_str(env, reg->type), reg_name);
+	verbose(env, "expected=%s%s\n", reg_type_str(env, PTR_TO_BTF_ID), targ_name);
+	return -EINVAL;
+}
+
+/* Returns an error, or 0 if ignoring the access, or 1 if register state was
+ * updated, in which case later updates must be skipped.
+ */
+static int check_map_kptr_access(struct bpf_verifier_env *env, u32 regno,
+				 int off, int size, int value_regno,
+				 enum bpf_access_type t, int insn_idx)
+{
+	struct bpf_reg_state *reg = reg_state(env, regno), *val_reg;
+	struct bpf_insn *insn = &env->prog->insnsi[insn_idx];
+	struct bpf_map_value_off_desc *off_desc;
+	struct bpf_map *map = reg->map_ptr;
+	int class = BPF_CLASS(insn->code);
+
+	/* Things we already checked for in check_map_access:
+	 *  - Reject cases where variable offset may touch BTF ID pointer
+	 *  - size of access (must be BPF_DW)
+	 *  - off_desc->offset == off + reg->var_off.value
+	 */
+	if (!tnum_is_const(reg->var_off))
+		return 0;
+
+	off_desc = bpf_map_kptr_off_contains(map, off + reg->var_off.value);
+	if (!off_desc)
+		return 0;
+
+	/* Only BPF_[LDX,STX,ST] | BPF_MEM | BPF_DW is supported */
+	if (BPF_MODE(insn->code) != BPF_MEM)
+		goto end;
+
+	if (class == BPF_LDX) {
+		val_reg = reg_state(env, value_regno);
+		/* We can simply mark the value_regno receiving the pointer
+		 * value from map as PTR_TO_BTF_ID, with the correct type.
+		 */
+		mark_btf_ld_reg(env, cur_regs(env), value_regno, PTR_TO_BTF_ID, off_desc->btf,
+				off_desc->btf_id, PTR_MAYBE_NULL);
+		val_reg->id = ++env->id_gen;
+	} else if (class == BPF_STX) {
+		val_reg = reg_state(env, value_regno);
+		if (!register_is_null(val_reg) &&
+		    map_kptr_match_type(env, off_desc, val_reg, value_regno))
+			return -EACCES;
+	} else if (class == BPF_ST) {
+		if (insn->imm) {
+			verbose(env, "BPF_ST imm must be 0 when storing to kptr at off=%u\n",
+				off_desc->offset);
+			return -EACCES;
+		}
+	} else {
+		goto end;
+	}
+	return 1;
+end:
+	verbose(env, "kptr in map can only be accessed using BPF_LDX/BPF_STX/BPF_ST\n");
+	return -EACCES;
+}
+
 /* check read/write into a map element with possible variable offset */
 static int check_map_access(struct bpf_verifier_env *env, u32 regno,
 			    int off, int size, bool zero_size_allowed)
@@ -3545,6 +3633,32 @@  static int check_map_access(struct bpf_verifier_env *env, u32 regno,
 			return -EACCES;
 		}
 	}
+	if (map_value_has_kptr(map)) {
+		struct bpf_map_value_off *tab = map->kptr_off_tab;
+		int i;
+
+		for (i = 0; i < tab->nr_off; i++) {
+			u32 p = tab->off[i].offset;
+
+			if (reg->smin_value + off < p + sizeof(u64) &&
+			    p < reg->umax_value + off + size) {
+				if (!tnum_is_const(reg->var_off)) {
+					verbose(env, "kptr access cannot have variable offset\n");
+					return -EACCES;
+				}
+				if (p != off + reg->var_off.value) {
+					verbose(env, "kptr access misaligned expected=%u off=%llu\n",
+						p, off + reg->var_off.value);
+					return -EACCES;
+				}
+				if (size != bpf_size_to_bytes(BPF_DW)) {
+					verbose(env, "kptr access size must be BPF_DW\n");
+					return -EACCES;
+				}
+				break;
+			}
+		}
+	}
 	return err;
 }
 
@@ -4421,6 +4535,10 @@  static int check_mem_access(struct bpf_verifier_env *env, int insn_idx, u32 regn
 		if (err)
 			return err;
 		err = check_map_access(env, regno, off, size, false);
+		err = err ?: check_map_kptr_access(env, regno, off, size, value_regno, t, insn_idx);
+		if (err < 0)
+			return err;
+		/* if err == 0, check_map_kptr_access ignored the access */
 		if (!err && t == BPF_READ && value_regno >= 0) {
 			struct bpf_map *map = reg->map_ptr;
 
@@ -4442,6 +4560,8 @@  static int check_mem_access(struct bpf_verifier_env *env, int insn_idx, u32 regn
 				mark_reg_unknown(env, regs, value_regno);
 			}
 		}
+		/* clear err == 1 */
+		err = err < 0 ? err : 0;
 	} else if (base_type(reg->type) == PTR_TO_MEM) {
 		bool rdonly_mem = type_is_rdonly_mem(reg->type);