
[bpf-next,10/13] bpf, x86: BPF_PROBE_MEM handling for insn->off < 0

Message ID 20221206231000.3180914-11-davemarchevsky@fb.com (mailing list archive)
State Superseded
Delegated to: BPF
Series BPF rbtree next-gen datastructure

Checks

Context Check Description
netdev/tree_selection success Clearly marked for bpf-next, async
netdev/fixes_present success Fixes tag not required for -next series
netdev/subject_prefix success Link
netdev/cover_letter success Series has a cover letter
netdev/patch_count success Link
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 0 this patch: 0
netdev/cc_maintainers warning 18 maintainers not CCed: yoshfuji@linux-ipv6.org mingo@redhat.com x86@kernel.org netdev@vger.kernel.org kpsingh@kernel.org bp@alien8.de haoluo@google.com davem@davemloft.net song@kernel.org yhs@fb.com tglx@linutronix.de dsahern@kernel.org dave.hansen@linux.intel.com martin.lau@linux.dev sdf@google.com john.fastabend@gmail.com jolsa@kernel.org hpa@zytor.com
netdev/build_clang success Errors and warnings before: 5 this patch: 5
netdev/module_param success Was 0 now: 0
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn success Errors and warnings before: 2 this patch: 2
netdev/checkpatch warning WARNING: line length of 100 exceeds 80 columns WARNING: line length of 101 exceeds 80 columns WARNING: line length of 103 exceeds 80 columns WARNING: line length of 106 exceeds 80 columns WARNING: line length of 82 exceeds 80 columns WARNING: line length of 84 exceeds 80 columns WARNING: line length of 85 exceeds 80 columns WARNING: line length of 86 exceeds 80 columns WARNING: line length of 87 exceeds 80 columns WARNING: line length of 92 exceeds 80 columns WARNING: line length of 94 exceeds 80 columns WARNING: line length of 97 exceeds 80 columns WARNING: line length of 98 exceeds 80 columns WARNING: line length of 99 exceeds 80 columns
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0
bpf/vmtest-bpf-next-PR fail PR summary
bpf/vmtest-bpf-next-VM_Test-1 success Logs for ${{ matrix.test }} on ${{ matrix.arch }} with ${{ matrix.toolchain }}
bpf/vmtest-bpf-next-VM_Test-2 success Logs for ShellCheck
bpf/vmtest-bpf-next-VM_Test-3 fail Logs for build for aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-4 fail Logs for build for aarch64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-5 fail Logs for build for s390x with gcc
bpf/vmtest-bpf-next-VM_Test-6 fail Logs for build for x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-7 fail Logs for build for x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-8 success Logs for llvm-toolchain
bpf/vmtest-bpf-next-VM_Test-9 success Logs for set-matrix

Commit Message

Dave Marchevsky Dec. 6, 2022, 11:09 p.m. UTC
The current comment in the BPF_PROBE_MEM JIT code claims that the verifier
prevents insn->off < 0, but this appears not to be true, irrespective of
the changes in this series. Regardless, the changes in this series will
result in an example like:

  struct example_node {
    long key;
    long val;
    struct bpf_rb_node node;
  };

  /* In BPF prog, assume root contains example_node nodes */
  struct bpf_rb_node *res = bpf_rbtree_first(&root);
  if (!res)
    return 1;

  struct example_node *n = container_of(res, struct example_node, node);
  long key = n->key;

Resulting in a load with off = -16, as bpf_rbtree_first's return is
modified by verifier to be PTR_TO_BTF_ID of example_node w/ offset =
offsetof(struct example_node, node), instead of PTR_TO_BTF_ID of
bpf_rb_node. So it's necessary to support negative insn->off when
jitting BPF_PROBE_MEM.

In order to ensure that a page fault for a BPF_PROBE_MEM load of *src_reg +
insn->off is safely handled, we must confirm that *src_reg + insn->off is
in the kernel's memory. Two runtime checks are emitted to confirm that:

  1) (*src_reg + insn->off) > boundary between user and kernel address
  spaces
  2) (*src_reg + insn->off) does not overflow to a small positive
  number. This might happen if some function meant to set src_reg
  returns ERR_PTR(-EINVAL) or similar.
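
In C terms, the invariant these two checks are meant to establish looks
roughly like the sketch below. It is purely illustrative (the JIT emits
x86 instructions under register constraints rather than computing the sum
into a spare register up front), and task_size_max/page_size stand in for
TASK_SIZE_MAX/PAGE_SIZE:

  #include <stdbool.h>
  #include <stdint.h>

  /* Illustrative model of the invariant, not the emitted sequence */
  static bool probe_mem_addr_ok(uint64_t src_reg, int16_t off,
                                uint64_t task_size_max, uint64_t page_size)
  {
      uint64_t boundary = task_size_max + page_size;
      uint64_t addr = src_reg + (uint64_t)(int64_t)off;

      /* check 1: the address actually loaded must be past the
       * user/kernel boundary (plus guard page)
       */
      if (addr < boundary)
          return false;

      /* check 2: the addition must not have wrapped around 2^64,
       * e.g. src_reg == ERR_PTR(-EINVAL) plus a positive insn->off
       */
      if (off >= 0 ? addr < src_reg : addr > src_reg)
          return false;

      return true;
  }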

Check 1 is currently slightly off - it compares a

  u64 limit = TASK_SIZE_MAX + PAGE_SIZE + abs(insn->off);

to *src_reg, aborting the load if limit is larger. Rewriting this as an
inequality:

  *src_reg > TASK_SIZE_MAX + PAGE_SIZE + abs(insn->off)
  *src_reg - abs(insn->off) > TASK_SIZE_MAX + PAGE_SIZE

shows that this isn't quite right even if insn->off is positive, as we
really want:

  *src_reg + insn->off > TASK_SIZE_MAX + PAGE_SIZE
  *src_reg > TASK_SIZE_MAX + PAGE_SIZE - insn->off

since *src_reg + insn->off is the address we'll be loading from, not
*src_reg - insn->off or *src_reg - abs(insn->off). So change the limit
computation's '+ abs(insn->off)' to '- insn->off', removing the abs(),
which the existing comment indicates was only added to ignore negative
insn->off.
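
A quick numeric check, assuming TASK_SIZE_MAX + PAGE_SIZE evaluates to
0x800000000000 as in the disassembly further down (the exact value depends
on the paging configuration):

  #include <stdio.h>
  #include <stdlib.h>

  int main(void)
  {
      unsigned long long base = 0x800000000000ULL; /* TASK_SIZE_MAX + PAGE_SIZE */
      short offs[] = { 16, -16 };

      for (int i = 0; i < 2; i++) {
          short off = offs[i];
          unsigned long long old_limit = base + llabs(off); /* before this patch */
          unsigned long long new_limit = base - off;        /* after this patch  */

          printf("off=%d old=%#llx new=%#llx\n", off, old_limit, new_limit);
      }
      return 0;
  }

For off = 16 the old limit (0x800000000010) differs from the one we
actually want (0x7ffffffffff0), i.e. the emitted comparison wasn't testing
the address being loaded; for off = -16 the two limits happen to coincide,
so for negative offsets the real problem is Check 2 below.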

For Check 2, "does not overflow to a small positive number" is currently
confirmed by copying src_reg into a scratch register, emitting an
'add insn->off' on the copy, and checking the carry flag. While this works
fine for a positive insn->off, a small negative insn->off like -16 is
encoded as a huge unsigned addend, so the addition is almost guaranteed to
set the carry flag for any kernel address and the check wrongly rejects
the load.
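
For example, with a made-up kernel address (0xffff888012345670 is purely
illustrative):

  #include <stdio.h>
  #include <stdint.h>

  int main(void)
  {
      uint64_t src = 0xffff888012345670ULL;     /* hypothetical kernel addr */
      uint64_t addend = (uint64_t)(int64_t)-16; /* 0xfffffffffffffff0 */
      uint64_t sum = src + addend;              /* wraps past 2^64 */
      int carry = sum < src;                    /* CF after the x86 'add' */

      /* carry == 1 here, so the old JNC is not taken, we fall through to
       * the xor, and the destination register is zeroed even though
       * src - 16 is a perfectly good kernel address.
       */
      printf("sum=%#llx carry=%d\n", (unsigned long long)sum, carry);
      return 0;
  }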

This patch addresses this by not doing Check 2 at BPF prog runtime when
insn->off is negative, and instead doing a stronger check at JIT time.
The logic supporting this is as follows:

1) Assume insn->off is negative, call the largest such negative offset
   MAX_NEGATIVE_OFF. So insn->off >= MAX_NEGATIVE_OFF for all possible
   insn->off.

2) *src_reg + insn->off will not wrap over to an unexpected address by
   virtue of negative insn->off, but it might wrap under if
   -insn->off > *src_reg, as that implies *src_reg + insn->off < 0

3) Inequality (TASK_SIZE_MAX + PAGE_SIZE - insn->off) > (TASK_SIZE_MAX + PAGE_SIZE)
   must be true since insn->off is negative.

4) If we've completed check 1, we know that
   src_reg >= (TASK_SIZE_MAX + PAGE_SIZE - insn->off)

5) Combining statements 3 and 4, we know src_reg > (TASK_SIZE_MAX + PAGE_SIZE)

6) By statements 1, 4, and 5, if we can prove
   (TASK_SIZE_MAX + PAGE_SIZE) > -MAX_NEGATIVE_OFF, we'll know that
   (TASK_SIZE_MAX + PAGE_SIZE) > -insn->off for all possible insn->off
   values. We can rewrite this as (TASK_SIZE_MAX + PAGE_SIZE) +
   MAX_NEGATIVE_OFF > 0.

   Since src_reg > TASK_SIZE_MAX + PAGE_SIZE and MAX_NEGATIVE_OFF is
   negative, if the previous inequality is true,
   src_reg + MAX_NEGATIVE_OFF > 0 is also true for all src_reg values.
   Similarly, since insn->off >= MAX_NEGATIVE_OFF for all possible
   negative insn->off vals, src_reg + insn->off > 0 and there can be no
   wrapping under.

So proving (TASK_SIZE_MAX + PAGE_SIZE) + MAX_NEGATIVE_OFF > 0 implies
*src_reg + insn->off > 0 for any src_reg that's passed check 1 and any
negative insn->off. Luckily the former inequality does not need to be
checked at runtime, and in fact could be a static_assert if
TASK_SIZE_MAX wasn't determined by a function when CONFIG_X86_5LEVEL
kconfig is used.
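
If TASK_SIZE_MAX were a constant expression, the check could be written as
something like the sketch below (the *_EXAMPLE stand-ins are illustrative
values consistent with the disassembly further down; the patch instead does
the equivalent test at runtime in check_probe_mem_task_size_overflow()):

  #include <assert.h>

  #define TASK_SIZE_MAX_EXAMPLE 0x7ffffffff000ULL
  #define PAGE_SIZE_EXAMPLE     0x1000ULL
  #define S16_MIN_EXAMPLE       (-32768)

  static_assert((TASK_SIZE_MAX_EXAMPLE + PAGE_SIZE_EXAMPLE) > -(long long)S16_MIN_EXAMPLE,
                "(TASK_SIZE_MAX + PAGE_SIZE) + (max negative insn->off) must be > 0");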

Regardless, we can just check (TASK_SIZE_MAX + PAGE_SIZE) +
MAX_NEGATIVE_OFF > 0 once per do_jit call instead of emitting a runtime
check. Given that insn->off is a s16 and is unlikely to grow larger,
this check should always succeed on any x86 processor made in the 21st
century. If it doesn't, fail all do_jit calls and complain loudly, on the
assumption that the BPF subsystem is misconfigured or has a bug.

A few instructions are saved for negative insn->offs as a result. Using
the struct example_node / off = -16 example from before, code looks
like:

BEFORE CHANGE
  72:   movabs $0x800000000010,%r11
  7c:   cmp    %r11,%rdi
  7f:   jb     0x000000000000008d         (check 1 on 7c and here)
  81:   mov    %rdi,%r11
  84:   add    $0xfffffffffffffff0,%r11   (check 2, will set carry for almost any r11, so bug for
  8b:   jae    0x0000000000000091          negative insn->off)
  8d:   xor    %edi,%edi                  (as a result long key = n->key; will be 0'd out here)
  8f:   jmp    0x0000000000000095
  91:   mov    -0x10(%rdi),%rdi
  95:

AFTER CHANGE:
  5a:   movabs $0x800000000010,%r11
  64:   cmp    %r11,%rdi
  67:   jae    0x000000000000006d     (check 1 on 64 and here, but now JNC instead of JC)
  69:   xor    %edi,%edi              (no check 2, 0 out if %rdi - %r11 < 0)
  6b:   jmp    0x0000000000000071
  6d:   mov    -0x10(%rdi),%rdi
  71:

We could do the same for insn->off == 0, but for now keep code
generation unchanged for previously working nonnegative insn->offs.

Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
---
 arch/x86/net/bpf_jit_comp.c | 123 +++++++++++++++++++++++++++---------
 1 file changed, 92 insertions(+), 31 deletions(-)

Comments

Alexei Starovoitov Dec. 7, 2022, 2:39 a.m. UTC | #1
On Tue, Dec 06, 2022 at 03:09:57PM -0800, Dave Marchevsky wrote:
> Current comment in BPF_PROBE_MEM jit code claims that verifier prevents
> insn->off < 0, but this appears to not be true irrespective of changes
> in this series. Regardless, changes in this series will result in an
> example like:
> 
>   struct example_node {
>     long key;
>     long val;
>     struct bpf_rb_node node;
>   }
> 
>   /* In BPF prog, assume root contains example_node nodes */
>   struct bpf_rb_node res = bpf_rbtree_first(&root);
>   if (!res)
>     return 1;
> 
>   struct example_node n = container_of(res, struct example_node, node);
>   long key = n->key;
> 
> Resulting in a load with off = -16, as bpf_rbtree_first's return is

Looks like the bug in the previous patch:
+                       } else if (meta.func_id == special_kfunc_list[KF_bpf_rbtree_remove] ||
+                                  meta.func_id == special_kfunc_list[KF_bpf_rbtree_first]) {
+                               struct btf_field *field = meta.arg_rbtree_root.field;
+
+                               mark_reg_datastructure_node(regs, BPF_REG_0,
+                                                           &field->datastructure_head);

The R0 .off should have been:
 regs[BPF_REG_0].off = field->rb_node.node_offset;

node, not root.

PTR_TO_BTF_ID should have been returned with an appropriate 'off',
so that container_of() would bring it back to zero offset.

The approach of returning untrusted from bpf_rbtree_first is questionable.
Without doing that this issue would not have surfaced.

All PTR_TO_BTF_ID need to have positive offset.
I'm not sure btf_struct_walk() and other PTR_TO_BTF_ID accessors
can deal with negative offsets.
There could be all kinds of things to fix.

> modified by verifier to be PTR_TO_BTF_ID of example_node w/ offset =
> offsetof(struct example_node, node), instead of PTR_TO_BTF_ID of
> bpf_rb_node. So it's necessary to support negative insn->off when
> jitting BPF_PROBE_MEM.

I'm not convinced it's necessary.
container_of() seems to be the only case where bpf prog can convert
PTR_TO_BTF_ID with off >= 0 to negative off.
Normal pointer walking will not make it negative.

> In order to ensure that page fault for a BPF_PROBE_MEM load of *src_reg +
> insn->off is safely handled, we must confirm that *src_reg + insn->off is
> in kernel's memory. Two runtime checks are emitted to confirm that:
> 
>   1) (*src_reg + insn->off) > boundary between user and kernel address
>   spaces
>   2) (*src_reg + insn->off) does not overflow to a small positive
>   number. This might happen if some function meant to set src_reg
>   returns ERR_PTR(-EINVAL) or similar.
> 
> Check 1 currently is sligtly off - it compares a
> 
>   u64 limit = TASK_SIZE_MAX + PAGE_SIZE + abs(insn->off);
> 
> to *src_reg, aborting the load if limit is larger. Rewriting this as an
> inequality:
> 
>   *src_reg > TASK_SIZE_MAX + PAGE_SIZE + abs(insn->off)
>   *src_reg - abs(insn->off) > TASK_SIZE_MAX + PAGE_SIZE
> 
> shows that this isn't quite right even if insn->off is positive, as we
> really want:
> 
>   *src_reg + insn->off > TASK_SIZE_MAX + PAGE_SIZE
>   *src_reg > TASK_SIZE_MAX + PAGE_SIZE - insn_off
> 
> Since *src_reg + insn->off is the address we'll be loading from, not
> *src_reg - insn->off or *src_reg - abs(insn->off). So change the
> subtraction to an addition and remove the abs(), as comment indicates
> that it was only added to ignore negative insn->off.
> 
> For Check 2, currently "does not overflow to a small positive number" is
> confirmed by emitting an 'add insn->off, src_reg' instruction and
> checking for carry flag. While this works fine for a positive insn->off,
> a small negative insn->off like -16 is almost guaranteed to wrap over to
> a small positive number when added to any kernel address.
> 
> This patch addresses this by not doing Check 2 at BPF prog runtime when
> insn->off is negative, rather doing a stronger check at JIT-time. The
> logic supporting this is as follows:
> 
> 1) Assume insn->off is negative, call the largest such negative offset
>    MAX_NEGATIVE_OFF. So insn->off >= MAX_NEGATIVE_OFF for all possible
>    insn->off.
> 
> 2) *src_reg + insn->off will not wrap over to an unexpected address by
>    virtue of negative insn->off, but it might wrap under if
>    -insn->off > *src_reg, as that implies *src_reg + insn->off < 0
> 
> 3) Inequality (TASK_SIZE_MAX + PAGE_SIZE - insn->off) > (TASK_SIZE_MAX + PAGE_SIZE)
>    must be true since insn->off is negative.
> 
> 4) If we've completed check 1, we know that
>    src_reg >= (TASK_SIZE_MAX + PAGE_SIZE - insn->off)
> 
> 5) Combining statements 3 and 4, we know src_reg > (TASK_SIZE_MAX + PAGE_SIZE)
> 
> 6) By statements 1, 4, and 5, if we can prove
>    (TASK_SIZE_MAX + PAGE_SIZE) > -MAX_NEGATIVE_OFF, we'll know that
>    (TASK_SIZE_MAX + PAGE_SIZE) > -insn->off for all possible insn->off
>    values. We can rewrite this as (TASK_SIZE_MAX + PAGE_SIZE) +
>    MAX_NEGATIVE_OFF > 0.
> 
>    Since src_reg > TASK_SIZE_MAX + PAGE_SIZE and MAX_NEGATIVE_OFF is
>    negative, if the previous inequality is true,
>    src_reg + MAX_NEGATIVE_OFF > 0 is also true for all src_reg values.
>    Similarly, since insn->off >= MAX_NEGATIVE_OFF for all possible
>    negative insn->off vals, src_reg + insn->off > 0 and there can be no
>    wrapping under.
> 
> So proving (TASK_SIZE_MAX + PAGE_SIZE) + MAX_NEGATIVE_OFF > 0 implies
> *src_reg + insn->off > 0 for any src_reg that's passed check 1 and any
> negative insn->off. Luckily the former inequality does not need to be
> checked at runtime, and in fact could be a static_assert if
> TASK_SIZE_MAX wasn't determined by a function when CONFIG_X86_5LEVEL
> kconfig is used.
> 
> Regardless, we can just check (TASK_SIZE_MAX + PAGE_SIZE) +
> MAX_NEGATIVE_OFF > 0 once per do_jit call instead of emitting a runtime
> check. Given that insn->off is a s16 and is unlikely to grow larger,
> this check should always succeed on any x86 processor made in the 21st
> century. If it doesn't fail all do_jit calls and complain loudly with
> the assumption that the BPF subsystem is misconfigured or has a bug.
> 
> A few instructions are saved for negative insn->offs as a result. Using
> the struct example_node / off = -16 example from before, code looks
> like:

This is quite complex to review. I couldn't convince myself
that droping 2nd check is safe, but don't have an argument to
prove that it's not safe.
Let's get to these details when there is need to support negative off.

> 
> BEFORE CHANGE
>   72:   movabs $0x800000000010,%r11
>   7c:   cmp    %r11,%rdi
>   7f:   jb     0x000000000000008d         (check 1 on 7c and here)
>   81:   mov    %rdi,%r11
>   84:   add    $0xfffffffffffffff0,%r11   (check 2, will set carry for almost any r11, so bug for
>   8b:   jae    0x0000000000000091          negative insn->off)
>   8d:   xor    %edi,%edi                  (as a result long key = n->key; will be 0'd out here)
>   8f:   jmp    0x0000000000000095
>   91:   mov    -0x10(%rdi),%rdi
>   95:
> 
> AFTER CHANGE:
>   5a:   movabs $0x800000000010,%r11
>   64:   cmp    %r11,%rdi
>   67:   jae    0x000000000000006d     (check 1 on 64 and here, but now JNC instead of JC)
>   69:   xor    %edi,%edi              (no check 2, 0 out if %rdi - %r11 < 0)
>   6b:   jmp    0x0000000000000071
>   6d:   mov    -0x10(%rdi),%rdi
>   71:
> 
> We could do the same for insn->off == 0, but for now keep code
> generation unchanged for previously working nonnegative insn->offs.
> 
> Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
> ---
>  arch/x86/net/bpf_jit_comp.c | 123 +++++++++++++++++++++++++++---------
>  1 file changed, 92 insertions(+), 31 deletions(-)
> 
> diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
> index 36ffe67ad6e5..843f619d0d35 100644
> --- a/arch/x86/net/bpf_jit_comp.c
> +++ b/arch/x86/net/bpf_jit_comp.c
> @@ -11,6 +11,7 @@
>  #include <linux/bpf.h>
>  #include <linux/memory.h>
>  #include <linux/sort.h>
> +#include <linux/limits.h>
>  #include <asm/extable.h>
>  #include <asm/set_memory.h>
>  #include <asm/nospec-branch.h>
> @@ -94,6 +95,7 @@ static int bpf_size_to_x86_bytes(int bpf_size)
>   */
>  #define X86_JB  0x72
>  #define X86_JAE 0x73
> +#define X86_JNC 0x73
>  #define X86_JE  0x74
>  #define X86_JNE 0x75
>  #define X86_JBE 0x76
> @@ -950,6 +952,36 @@ static void emit_shiftx(u8 **pprog, u32 dst_reg, u8 src_reg, bool is64, u8 op)
>  	*pprog = prog;
>  }
>  
> +/* Check that condition necessary for PROBE_MEM handling for insn->off < 0
> + * holds.
> + *
> + * This could be a static_assert((TASK_SIZE_MAX + PAGE_SIZE) > -S16_MIN),
> + * but TASK_SIZE_MAX can't always be evaluated at compile time, so let's not
> + * assume insn->off size either
> + */
> +static int check_probe_mem_task_size_overflow(void)
> +{
> +	struct bpf_insn insn;
> +	s64 max_negative;
> +
> +	switch (sizeof(insn.off)) {
> +	case 2:
> +		max_negative = S16_MIN;
> +		break;
> +	default:
> +		pr_err("bpf_jit_error: unexpected bpf_insn->off size\n");
> +		return -EFAULT;
> +	}
> +
> +	if (!((TASK_SIZE_MAX + PAGE_SIZE) > -max_negative)) {
> +		pr_err("bpf jit error: assumption does not hold:\n");
> +		pr_err("\t(TASK_SIZE_MAX + PAGE_SIZE) + (max negative insn->off) > 0\n");
> +		return -EFAULT;
> +	}
> +
> +	return 0;
> +}
> +
>  #define INSN_SZ_DIFF (((addrs[i] - addrs[i - 1]) - (prog - temp)))
>  
>  static int do_jit(struct bpf_prog *bpf_prog, int *addrs, u8 *image, u8 *rw_image,
> @@ -967,6 +999,10 @@ static int do_jit(struct bpf_prog *bpf_prog, int *addrs, u8 *image, u8 *rw_image
>  	u8 *prog = temp;
>  	int err;
>  
> +	err = check_probe_mem_task_size_overflow();
> +	if (err)
> +		return err;
> +
>  	detect_reg_usage(insn, insn_cnt, callee_regs_used,
>  			 &tail_call_seen);
>  
> @@ -1359,20 +1395,30 @@ st:			if (is_imm8(insn->off))
>  		case BPF_LDX | BPF_MEM | BPF_DW:
>  		case BPF_LDX | BPF_PROBE_MEM | BPF_DW:
>  			if (BPF_MODE(insn->code) == BPF_PROBE_MEM) {
> -				/* Though the verifier prevents negative insn->off in BPF_PROBE_MEM
> -				 * add abs(insn->off) to the limit to make sure that negative
> -				 * offset won't be an issue.
> -				 * insn->off is s16, so it won't affect valid pointers.
> -				 */
> -				u64 limit = TASK_SIZE_MAX + PAGE_SIZE + abs(insn->off);
> -				u8 *end_of_jmp1, *end_of_jmp2;
> -
>  				/* Conservatively check that src_reg + insn->off is a kernel address:
> -				 * 1. src_reg + insn->off >= limit
> -				 * 2. src_reg + insn->off doesn't become small positive.
> -				 * Cannot do src_reg + insn->off >= limit in one branch,
> -				 * since it needs two spare registers, but JIT has only one.
> +				 * 1. src_reg + insn->off >= TASK_SIZE_MAX + PAGE_SIZE
> +				 * 2. src_reg + insn->off doesn't overflow and become small positive
> +				 *
> +				 * For check 1, to save regs, do
> +				 * src_reg >= (TASK_SIZE_MAX + PAGE_SIZE - insn->off) call rhs
> +				 * of inequality 'limit'
> +				 *
> +				 * For check 2:
> +				 * If insn->off is positive, add src_reg + insn->off and check
> +				 * overflow directly
> +				 * If insn->off is negative, we know that
> +				 *   (TASK_SIZE_MAX + PAGE_SIZE - insn->off) > (TASK_SIZE_MAX + PAGE_SIZE)
> +				 * and from check 1 we know
> +				 *   src_reg >= (TASK_SIZE_MAX + PAGE_SIZE - insn->off)
> +				 * So if (TASK_SIZE_MAX + PAGE_SIZE) + MAX_NEGATIVE_OFF > 0 we can
> +				 * be sure that src_reg + insn->off won't overflow in either
> +				 * direction and avoid runtime check entirely.
> +				 *
> +				 * check_probe_mem_task_size_overflow confirms the above assumption
> +				 * at the beginning of this function
>  				 */
> +				u64 limit = TASK_SIZE_MAX + PAGE_SIZE - insn->off;
> +				u8 *end_of_jmp1, *end_of_jmp2;
>  
>  				/* movabsq r11, limit */
>  				EMIT2(add_1mod(0x48, AUX_REG), add_1reg(0xB8, AUX_REG));
> @@ -1381,32 +1427,47 @@ st:			if (is_imm8(insn->off))
>  				/* cmp src_reg, r11 */
>  				maybe_emit_mod(&prog, src_reg, AUX_REG, true);
>  				EMIT2(0x39, add_2reg(0xC0, src_reg, AUX_REG));
> -				/* if unsigned '<' goto end_of_jmp2 */
> -				EMIT2(X86_JB, 0);
> -				end_of_jmp1 = prog;
> -
> -				/* mov r11, src_reg */
> -				emit_mov_reg(&prog, true, AUX_REG, src_reg);
> -				/* add r11, insn->off */
> -				maybe_emit_1mod(&prog, AUX_REG, true);
> -				EMIT2_off32(0x81, add_1reg(0xC0, AUX_REG), insn->off);
> -				/* jmp if not carry to start_of_ldx
> -				 * Otherwise ERR_PTR(-EINVAL) + 128 will be the user addr
> -				 * that has to be rejected.
> -				 */
> -				EMIT2(0x73 /* JNC */, 0);
> -				end_of_jmp2 = prog;
> +				if (insn->off >= 0) {
> +					/* cmp src_reg, r11 */
> +					/* if unsigned '<' goto end_of_jmp2 */
> +					EMIT2(X86_JB, 0);
> +					end_of_jmp1 = prog;
> +
> +					/* mov r11, src_reg */
> +					emit_mov_reg(&prog, true, AUX_REG, src_reg);
> +					/* add r11, insn->off */
> +					maybe_emit_1mod(&prog, AUX_REG, true);
> +					EMIT2_off32(0x81, add_1reg(0xC0, AUX_REG), insn->off);
> +					/* jmp if not carry to start_of_ldx
> +					 * Otherwise ERR_PTR(-EINVAL) + 128 will be the user addr
> +					 * that has to be rejected.
> +					 */
> +					EMIT2(X86_JNC, 0);
> +					end_of_jmp2 = prog;
> +				} else {
> +					/* cmp src_reg, r11 */
> +					/* if unsigned '>=' goto start_of_ldx
> +					 * w/o needing to do check 2
> +					 */
> +					EMIT2(X86_JAE, 0);
> +					end_of_jmp1 = prog;
> +				}
>  
>  				/* xor dst_reg, dst_reg */
>  				emit_mov_imm32(&prog, false, dst_reg, 0);
>  				/* jmp byte_after_ldx */
>  				EMIT2(0xEB, 0);
>  
> -				/* populate jmp_offset for JB above to jump to xor dst_reg */
> -				end_of_jmp1[-1] = end_of_jmp2 - end_of_jmp1;
> -				/* populate jmp_offset for JNC above to jump to start_of_ldx */
>  				start_of_ldx = prog;
> -				end_of_jmp2[-1] = start_of_ldx - end_of_jmp2;
> +				if (insn->off >= 0) {
> +					/* populate jmp_offset for JB above to jump to xor dst_reg */
> +					end_of_jmp1[-1] = end_of_jmp2 - end_of_jmp1;
> +					/* populate jmp_offset for JNC above to jump to start_of_ldx */
> +					end_of_jmp2[-1] = start_of_ldx - end_of_jmp2;
> +				} else {
> +					/* populate jmp_offset for JAE above to jump to start_of_ldx */
> +					end_of_jmp1[-1] = start_of_ldx - end_of_jmp1;
> +				}
>  			}
>  			emit_ldx(&prog, BPF_SIZE(insn->code), dst_reg, src_reg, insn->off);
>  			if (BPF_MODE(insn->code) == BPF_PROBE_MEM) {
> -- 
> 2.30.2
>
Dave Marchevsky Dec. 7, 2022, 6:46 a.m. UTC | #2
On 12/6/22 9:39 PM, Alexei Starovoitov wrote:
> On Tue, Dec 06, 2022 at 03:09:57PM -0800, Dave Marchevsky wrote:
>> Current comment in BPF_PROBE_MEM jit code claims that verifier prevents
>> insn->off < 0, but this appears to not be true irrespective of changes
>> in this series. Regardless, changes in this series will result in an
>> example like:
>>
>>   struct example_node {
>>     long key;
>>     long val;
>>     struct bpf_rb_node node;
>>   }
>>
>>   /* In BPF prog, assume root contains example_node nodes */
>>   struct bpf_rb_node res = bpf_rbtree_first(&root);
>>   if (!res)
>>     return 1;
>>
>>   struct example_node n = container_of(res, struct example_node, node);
>>   long key = n->key;
>>
>> Resulting in a load with off = -16, as bpf_rbtree_first's return is
> 
> Looks like the bug in the previous patch:
> +                       } else if (meta.func_id == special_kfunc_list[KF_bpf_rbtree_remove] ||
> +                                  meta.func_id == special_kfunc_list[KF_bpf_rbtree_first]) {
> +                               struct btf_field *field = meta.arg_rbtree_root.field;
> +
> +                               mark_reg_datastructure_node(regs, BPF_REG_0,
> +                                                           &field->datastructure_head);
> 
> The R0 .off should have been:
>  regs[BPF_REG_0].off = field->rb_node.node_offset;
> 
> node, not root.
> 
> PTR_TO_BTF_ID should have been returned with approriate 'off',
> so that container_of() would it bring back to zero offset.
> 

The root's btf_field is used to hold information about the node type. Of
specific interest to us are value_btf_id and node_offset, which
mark_reg_datastructure_node uses to set REG_0's type and offset correctly.

This "use head type to keep info about node type" strategy felt strange to me
initially too: all PTR_TO_BTF_ID regs are passing around their type info, so
why not use that to lookup bpf_rb_node field info? But consider that
bpf_rbtree_first (and bpf_list_pop_{front,back}) doesn't take a node as
input arg, so there's no opportunity to get btf_field info from input
reg type. 

So we'll need to keep this info in rbtree_root's btf_field
regardless, and since any rbtree API function that operates on a node
also operates on a root and expects its node arg to match the node
type expected by the root, might as well use root's field as the main
lookup for this info and not even have &field->rb_node for now.
All __process_kf_arg_ptr_to_datastructure_node calls (added earlier
in the series) use the &meta->arg_{list_head,rbtree_root}.field for same
reason.

So it's setting the reg offset correctly.

> All PTR_TO_BTF_ID need to have positive offset.
> I'm not sure btf_struct_walk() and other PTR_TO_BTF_ID accessors
> can deal with negative offsets.
> There could be all kinds of things to fix.

I think you may be conflating reg offset and insn offset here. None of the
changes in this series result in a PTR_TO_BTF_ID reg w/ negative offset
being returned. But LLVM may generate load insns with a negative offset,
and since we're passing around pointers to bpf_rb_node that may come
after useful data fields in a type, this will happen more often.

Consider this small example from selftests in this series:

struct node_data {
  long key;
  long data;
  struct bpf_rb_node node;
};

static bool less(struct bpf_rb_node *a, const struct bpf_rb_node *b)
{
        struct node_data *node_a;
        struct node_data *node_b;

        node_a = container_of(a, struct node_data, node);
        node_b = container_of(b, struct node_data, node);

        return node_a->key < node_b->key;
}

llvm-objdump shows this bpf bytecode for 'less':

0000000000000000 <less>:
;       return node_a->key < node_b->key;
       0:       79 22 f0 ff 00 00 00 00 r2 = *(u64 *)(r2 - 0x10)
       1:       79 11 f0 ff 00 00 00 00 r1 = *(u64 *)(r1 - 0x10)
       2:       b4 00 00 00 01 00 00 00 w0 = 0x1
;       return node_a->key < node_b->key;
       3:       cd 21 01 00 00 00 00 00 if r1 s< r2 goto +0x1 <LBB2_2>
       4:       b4 00 00 00 00 00 00 00 w0 = 0x0

0000000000000028 <LBB2_2>:
;       return node_a->key < node_b->key;
       5:       95 00 00 00 00 00 00 00 exit

Insns 0 and 1 are loading node_b->key and node_a->key, respectively, using
negative insn->off. Verifier's view of R1 and R2 before insn 0 is
untrusted_ptr_node_data(off=16). If there were some intermediate insns
storing result of container_of() before dereferencing:

  r3 = (r2 - 0x10)
  r2 = *(u64 *)(r3)

Verifier would see R3 as untrusted_ptr_node_data(off=0), and load for
r2 would have insn->off = 0. But LLVM decides to just do a load-with-offset
using original arg ptrs to less() instead of storing container_of() ptr
adjustments.

Since the container_of usage and code pattern in the above example's less()
aren't particularly specific to this series, I think there are other
scenarios where such code would be generated, and I considered this a
general bugfix in the cover letter.

[ below paragraph was moved here, it originally preceded "All PTR_TO_BTF_ID"
  paragraph ]

> The apporach of returning untrusted from bpf_rbtree_first is questionable.
> Without doing that this issue would not have surfaced.
> 

I agree re: PTR_UNTRUSTED, but note that my earlier example doesn't involve
bpf_rbtree_first. Regardless, I think the issue is that PTR_UNTRUSTED is
used to denote a few separate traits of a PTR_TO_BTF_ID reg:

  * "I have no ownership over the thing I'm pointing to"
  * "My backing memory may go away at any time"
  * "Access to my fields might result in page fault"
  * "Kfuncs shouldn't accept me as an arg"

Seems like original PTR_UNTRUSTED usage really wanted to denote the first
point and the others were just naturally implied from the first. But
as you've noted there are some things using PTR_UNTRUSTED that really
want to make more granular statements:

ref_set_release_on_unlock logic sets release_on_unlock = true and adds
PTR_UNTRUSTED to the reg type. In this case PTR_UNTRUSTED is trying to say:

  * "I have no ownership over the thing I'm pointing to"
  * "My backing memory may go away at any time _after_ bpf_spin_unlock"
    * Before spin_unlock it's guaranteed to be valid
  * "Kfuncs shouldn't accept me as an arg"
    * We don't want arbitrary kfunc saving and accessing release_on_unlock
      reg after bpf_spin_unlock, as its backing memory can go away any time
      after spin_unlock.

The "backing memory" statement PTR_UNTRUSTED is making is a blunt superset
of what release_on_unlock really needs.

For less() callback we just want

  * "I have no ownership over the thing I'm pointing to"
  * "Kfuncs shouldn't accept me as an arg"

There is probably a way to decompose PTR_UNTRUSTED into a few flags such that
it's possible to denote these things separately and avoid unwanted additional
behavior. But after talking to David Vernet about current complexity of
PTR_TRUSTED and PTR_UNTRUSTED logic and his desire to refactor, it seemed
better to continue with PTR_UNTRUSTED blunt instrument with a bit of
special casing for now, instead of piling on more flags.
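
Purely as an illustration of what such a decomposition might look like
(these flag names are hypothetical, nothing like this exists in the
verifier today):

enum ptr_trust_aspect {
        PTR_NO_OWNERSHIP   = 1 << 0, /* no ownership over the pointee */
        PTR_MEM_MAY_VANISH = 1 << 1, /* backing memory may go away at any time */
        PTR_MAY_FAULT      = 1 << 2, /* field access might page-fault */
        PTR_NO_KFUNC_ARG   = 1 << 3, /* kfuncs shouldn't accept it as an arg */
};

/* today's PTR_UNTRUSTED would be all four;
 * release_on_unlock regs really want NO_OWNERSHIP | NO_KFUNC_ARG, with the
 *   backing memory guaranteed valid until bpf_spin_unlock;
 * less() callback args want NO_OWNERSHIP | NO_KFUNC_ARG only.
 */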

> 
>> modified by verifier to be PTR_TO_BTF_ID of example_node w/ offset =
>> offsetof(struct example_node, node), instead of PTR_TO_BTF_ID of
>> bpf_rb_node. So it's necessary to support negative insn->off when
>> jitting BPF_PROBE_MEM.
> 
> I'm not convinced it's necessary.
> container_of() seems to be the only case where bpf prog can convert
> PTR_TO_BTF_ID with off >= 0 to negative off.
> Normal pointer walking will not make it negative.
> 

I see what you mean - if some non-container_of case resulted in load generation
with negative insn->off, this probably would've been noticed already. But
hopefully my replies above explain why it should be addressed now.

>> In order to ensure that page fault for a BPF_PROBE_MEM load of *src_reg +
>> insn->off is safely handled, we must confirm that *src_reg + insn->off is
>> in kernel's memory. Two runtime checks are emitted to confirm that:
>>
>>   1) (*src_reg + insn->off) > boundary between user and kernel address
>>   spaces
>>   2) (*src_reg + insn->off) does not overflow to a small positive
>>   number. This might happen if some function meant to set src_reg
>>   returns ERR_PTR(-EINVAL) or similar.
>>
>> Check 1 currently is sligtly off - it compares a
>>
>>   u64 limit = TASK_SIZE_MAX + PAGE_SIZE + abs(insn->off);
>>
>> to *src_reg, aborting the load if limit is larger. Rewriting this as an
>> inequality:
>>
>>   *src_reg > TASK_SIZE_MAX + PAGE_SIZE + abs(insn->off)
>>   *src_reg - abs(insn->off) > TASK_SIZE_MAX + PAGE_SIZE
>>
>> shows that this isn't quite right even if insn->off is positive, as we
>> really want:
>>
>>   *src_reg + insn->off > TASK_SIZE_MAX + PAGE_SIZE
>>   *src_reg > TASK_SIZE_MAX + PAGE_SIZE - insn_off
>>
>> Since *src_reg + insn->off is the address we'll be loading from, not
>> *src_reg - insn->off or *src_reg - abs(insn->off). So change the
>> subtraction to an addition and remove the abs(), as comment indicates
>> that it was only added to ignore negative insn->off.
>>
>> For Check 2, currently "does not overflow to a small positive number" is
>> confirmed by emitting an 'add insn->off, src_reg' instruction and
>> checking for carry flag. While this works fine for a positive insn->off,
>> a small negative insn->off like -16 is almost guaranteed to wrap over to
>> a small positive number when added to any kernel address.
>>
>> This patch addresses this by not doing Check 2 at BPF prog runtime when
>> insn->off is negative, rather doing a stronger check at JIT-time. The
>> logic supporting this is as follows:
>>
>> 1) Assume insn->off is negative, call the largest such negative offset
>>    MAX_NEGATIVE_OFF. So insn->off >= MAX_NEGATIVE_OFF for all possible
>>    insn->off.
>>
>> 2) *src_reg + insn->off will not wrap over to an unexpected address by
>>    virtue of negative insn->off, but it might wrap under if
>>    -insn->off > *src_reg, as that implies *src_reg + insn->off < 0
>>
>> 3) Inequality (TASK_SIZE_MAX + PAGE_SIZE - insn->off) > (TASK_SIZE_MAX + PAGE_SIZE)
>>    must be true since insn->off is negative.
>>
>> 4) If we've completed check 1, we know that
>>    src_reg >= (TASK_SIZE_MAX + PAGE_SIZE - insn->off)
>>
>> 5) Combining statements 3 and 4, we know src_reg > (TASK_SIZE_MAX + PAGE_SIZE)
>>
>> 6) By statements 1, 4, and 5, if we can prove
>>    (TASK_SIZE_MAX + PAGE_SIZE) > -MAX_NEGATIVE_OFF, we'll know that
>>    (TASK_SIZE_MAX + PAGE_SIZE) > -insn->off for all possible insn->off
>>    values. We can rewrite this as (TASK_SIZE_MAX + PAGE_SIZE) +
>>    MAX_NEGATIVE_OFF > 0.
>>
>>    Since src_reg > TASK_SIZE_MAX + PAGE_SIZE and MAX_NEGATIVE_OFF is
>>    negative, if the previous inequality is true,
>>    src_reg + MAX_NEGATIVE_OFF > 0 is also true for all src_reg values.
>>    Similarly, since insn->off >= MAX_NEGATIVE_OFF for all possible
>>    negative insn->off vals, src_reg + insn->off > 0 and there can be no
>>    wrapping under.
>>
>> So proving (TASK_SIZE_MAX + PAGE_SIZE) + MAX_NEGATIVE_OFF > 0 implies
>> *src_reg + insn->off > 0 for any src_reg that's passed check 1 and any
>> negative insn->off. Luckily the former inequality does not need to be
>> checked at runtime, and in fact could be a static_assert if
>> TASK_SIZE_MAX wasn't determined by a function when CONFIG_X86_5LEVEL
>> kconfig is used.
>>
>> Regardless, we can just check (TASK_SIZE_MAX + PAGE_SIZE) +
>> MAX_NEGATIVE_OFF > 0 once per do_jit call instead of emitting a runtime
>> check. Given that insn->off is a s16 and is unlikely to grow larger,
>> this check should always succeed on any x86 processor made in the 21st
>> century. If it doesn't fail all do_jit calls and complain loudly with
>> the assumption that the BPF subsystem is misconfigured or has a bug.
>>
>> A few instructions are saved for negative insn->offs as a result. Using
>> the struct example_node / off = -16 example from before, code looks
>> like:
> 
> This is quite complex to review. I couldn't convince myself
> that droping 2nd check is safe, but don't have an argument to
> prove that it's not safe.
> Let's get to these details when there is need to support negative off.
> 

Hopefully the above explanation shows that there's a need to support it now.
I will try to simplify and rephrase the summary to make it easier to follow,
but will prioritize addressing feedback in less complex patches, so this
patch may not change for a few respins.

>>
>> BEFORE CHANGE
>>   72:   movabs $0x800000000010,%r11
>>   7c:   cmp    %r11,%rdi
>>   7f:   jb     0x000000000000008d         (check 1 on 7c and here)
>>   81:   mov    %rdi,%r11
>>   84:   add    $0xfffffffffffffff0,%r11   (check 2, will set carry for almost any r11, so bug for
>>   8b:   jae    0x0000000000000091          negative insn->off)
>>   8d:   xor    %edi,%edi                  (as a result long key = n->key; will be 0'd out here)
>>   8f:   jmp    0x0000000000000095
>>   91:   mov    -0x10(%rdi),%rdi
>>   95:
>>
>> AFTER CHANGE:
>>   5a:   movabs $0x800000000010,%r11
>>   64:   cmp    %r11,%rdi
>>   67:   jae    0x000000000000006d     (check 1 on 64 and here, but now JNC instead of JC)
>>   69:   xor    %edi,%edi              (no check 2, 0 out if %rdi - %r11 < 0)
>>   6b:   jmp    0x0000000000000071
>>   6d:   mov    -0x10(%rdi),%rdi
>>   71:
>>
>> We could do the same for insn->off == 0, but for now keep code
>> generation unchanged for previously working nonnegative insn->offs.
>>
>> Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
>> ---
>>  arch/x86/net/bpf_jit_comp.c | 123 +++++++++++++++++++++++++++---------
>>  1 file changed, 92 insertions(+), 31 deletions(-)
>>
>> diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
>> index 36ffe67ad6e5..843f619d0d35 100644
>> --- a/arch/x86/net/bpf_jit_comp.c
>> +++ b/arch/x86/net/bpf_jit_comp.c
>> @@ -11,6 +11,7 @@
>>  #include <linux/bpf.h>
>>  #include <linux/memory.h>
>>  #include <linux/sort.h>
>> +#include <linux/limits.h>
>>  #include <asm/extable.h>
>>  #include <asm/set_memory.h>
>>  #include <asm/nospec-branch.h>
>> @@ -94,6 +95,7 @@ static int bpf_size_to_x86_bytes(int bpf_size)
>>   */
>>  #define X86_JB  0x72
>>  #define X86_JAE 0x73
>> +#define X86_JNC 0x73
>>  #define X86_JE  0x74
>>  #define X86_JNE 0x75
>>  #define X86_JBE 0x76
>> @@ -950,6 +952,36 @@ static void emit_shiftx(u8 **pprog, u32 dst_reg, u8 src_reg, bool is64, u8 op)
>>  	*pprog = prog;
>>  }
>>  
>> +/* Check that condition necessary for PROBE_MEM handling for insn->off < 0
>> + * holds.
>> + *
>> + * This could be a static_assert((TASK_SIZE_MAX + PAGE_SIZE) > -S16_MIN),
>> + * but TASK_SIZE_MAX can't always be evaluated at compile time, so let's not
>> + * assume insn->off size either
>> + */
>> +static int check_probe_mem_task_size_overflow(void)
>> +{
>> +	struct bpf_insn insn;
>> +	s64 max_negative;
>> +
>> +	switch (sizeof(insn.off)) {
>> +	case 2:
>> +		max_negative = S16_MIN;
>> +		break;
>> +	default:
>> +		pr_err("bpf_jit_error: unexpected bpf_insn->off size\n");
>> +		return -EFAULT;
>> +	}
>> +
>> +	if (!((TASK_SIZE_MAX + PAGE_SIZE) > -max_negative)) {
>> +		pr_err("bpf jit error: assumption does not hold:\n");
>> +		pr_err("\t(TASK_SIZE_MAX + PAGE_SIZE) + (max negative insn->off) > 0\n");
>> +		return -EFAULT;
>> +	}
>> +
>> +	return 0;
>> +}
>> +
>>  #define INSN_SZ_DIFF (((addrs[i] - addrs[i - 1]) - (prog - temp)))
>>  
>>  static int do_jit(struct bpf_prog *bpf_prog, int *addrs, u8 *image, u8 *rw_image,
>> @@ -967,6 +999,10 @@ static int do_jit(struct bpf_prog *bpf_prog, int *addrs, u8 *image, u8 *rw_image
>>  	u8 *prog = temp;
>>  	int err;
>>  
>> +	err = check_probe_mem_task_size_overflow();
>> +	if (err)
>> +		return err;
>> +
>>  	detect_reg_usage(insn, insn_cnt, callee_regs_used,
>>  			 &tail_call_seen);
>>  
>> @@ -1359,20 +1395,30 @@ st:			if (is_imm8(insn->off))
>>  		case BPF_LDX | BPF_MEM | BPF_DW:
>>  		case BPF_LDX | BPF_PROBE_MEM | BPF_DW:
>>  			if (BPF_MODE(insn->code) == BPF_PROBE_MEM) {
>> -				/* Though the verifier prevents negative insn->off in BPF_PROBE_MEM
>> -				 * add abs(insn->off) to the limit to make sure that negative
>> -				 * offset won't be an issue.
>> -				 * insn->off is s16, so it won't affect valid pointers.
>> -				 */
>> -				u64 limit = TASK_SIZE_MAX + PAGE_SIZE + abs(insn->off);
>> -				u8 *end_of_jmp1, *end_of_jmp2;
>> -
>>  				/* Conservatively check that src_reg + insn->off is a kernel address:
>> -				 * 1. src_reg + insn->off >= limit
>> -				 * 2. src_reg + insn->off doesn't become small positive.
>> -				 * Cannot do src_reg + insn->off >= limit in one branch,
>> -				 * since it needs two spare registers, but JIT has only one.
>> +				 * 1. src_reg + insn->off >= TASK_SIZE_MAX + PAGE_SIZE
>> +				 * 2. src_reg + insn->off doesn't overflow and become small positive
>> +				 *
>> +				 * For check 1, to save regs, do
>> +				 * src_reg >= (TASK_SIZE_MAX + PAGE_SIZE - insn->off) call rhs
>> +				 * of inequality 'limit'
>> +				 *
>> +				 * For check 2:
>> +				 * If insn->off is positive, add src_reg + insn->off and check
>> +				 * overflow directly
>> +				 * If insn->off is negative, we know that
>> +				 *   (TASK_SIZE_MAX + PAGE_SIZE - insn->off) > (TASK_SIZE_MAX + PAGE_SIZE)
>> +				 * and from check 1 we know
>> +				 *   src_reg >= (TASK_SIZE_MAX + PAGE_SIZE - insn->off)
>> +				 * So if (TASK_SIZE_MAX + PAGE_SIZE) + MAX_NEGATIVE_OFF > 0 we can
>> +				 * be sure that src_reg + insn->off won't overflow in either
>> +				 * direction and avoid runtime check entirely.
>> +				 *
>> +				 * check_probe_mem_task_size_overflow confirms the above assumption
>> +				 * at the beginning of this function
>>  				 */
>> +				u64 limit = TASK_SIZE_MAX + PAGE_SIZE - insn->off;
>> +				u8 *end_of_jmp1, *end_of_jmp2;
>>  
>>  				/* movabsq r11, limit */
>>  				EMIT2(add_1mod(0x48, AUX_REG), add_1reg(0xB8, AUX_REG));
>> @@ -1381,32 +1427,47 @@ st:			if (is_imm8(insn->off))
>>  				/* cmp src_reg, r11 */
>>  				maybe_emit_mod(&prog, src_reg, AUX_REG, true);
>>  				EMIT2(0x39, add_2reg(0xC0, src_reg, AUX_REG));
>> -				/* if unsigned '<' goto end_of_jmp2 */
>> -				EMIT2(X86_JB, 0);
>> -				end_of_jmp1 = prog;
>> -
>> -				/* mov r11, src_reg */
>> -				emit_mov_reg(&prog, true, AUX_REG, src_reg);
>> -				/* add r11, insn->off */
>> -				maybe_emit_1mod(&prog, AUX_REG, true);
>> -				EMIT2_off32(0x81, add_1reg(0xC0, AUX_REG), insn->off);
>> -				/* jmp if not carry to start_of_ldx
>> -				 * Otherwise ERR_PTR(-EINVAL) + 128 will be the user addr
>> -				 * that has to be rejected.
>> -				 */
>> -				EMIT2(0x73 /* JNC */, 0);
>> -				end_of_jmp2 = prog;
>> +				if (insn->off >= 0) {
>> +					/* cmp src_reg, r11 */
>> +					/* if unsigned '<' goto end_of_jmp2 */
>> +					EMIT2(X86_JB, 0);
>> +					end_of_jmp1 = prog;
>> +
>> +					/* mov r11, src_reg */
>> +					emit_mov_reg(&prog, true, AUX_REG, src_reg);
>> +					/* add r11, insn->off */
>> +					maybe_emit_1mod(&prog, AUX_REG, true);
>> +					EMIT2_off32(0x81, add_1reg(0xC0, AUX_REG), insn->off);
>> +					/* jmp if not carry to start_of_ldx
>> +					 * Otherwise ERR_PTR(-EINVAL) + 128 will be the user addr
>> +					 * that has to be rejected.
>> +					 */
>> +					EMIT2(X86_JNC, 0);
>> +					end_of_jmp2 = prog;
>> +				} else {
>> +					/* cmp src_reg, r11 */
>> +					/* if unsigned '>=' goto start_of_ldx
>> +					 * w/o needing to do check 2
>> +					 */
>> +					EMIT2(X86_JAE, 0);
>> +					end_of_jmp1 = prog;
>> +				}
>>  
>>  				/* xor dst_reg, dst_reg */
>>  				emit_mov_imm32(&prog, false, dst_reg, 0);
>>  				/* jmp byte_after_ldx */
>>  				EMIT2(0xEB, 0);
>>  
>> -				/* populate jmp_offset for JB above to jump to xor dst_reg */
>> -				end_of_jmp1[-1] = end_of_jmp2 - end_of_jmp1;
>> -				/* populate jmp_offset for JNC above to jump to start_of_ldx */
>>  				start_of_ldx = prog;
>> -				end_of_jmp2[-1] = start_of_ldx - end_of_jmp2;
>> +				if (insn->off >= 0) {
>> +					/* populate jmp_offset for JB above to jump to xor dst_reg */
>> +					end_of_jmp1[-1] = end_of_jmp2 - end_of_jmp1;
>> +					/* populate jmp_offset for JNC above to jump to start_of_ldx */
>> +					end_of_jmp2[-1] = start_of_ldx - end_of_jmp2;
>> +				} else {
>> +					/* populate jmp_offset for JAE above to jump to start_of_ldx */
>> +					end_of_jmp1[-1] = start_of_ldx - end_of_jmp1;
>> +				}
>>  			}
>>  			emit_ldx(&prog, BPF_SIZE(insn->code), dst_reg, src_reg, insn->off);
>>  			if (BPF_MODE(insn->code) == BPF_PROBE_MEM) {
>> -- 
>> 2.30.2
>>
Alexei Starovoitov Dec. 7, 2022, 6:06 p.m. UTC | #3
On Wed, Dec 07, 2022 at 01:46:56AM -0500, Dave Marchevsky wrote:
> On 12/6/22 9:39 PM, Alexei Starovoitov wrote:
> > On Tue, Dec 06, 2022 at 03:09:57PM -0800, Dave Marchevsky wrote:
> >> Current comment in BPF_PROBE_MEM jit code claims that verifier prevents
> >> insn->off < 0, but this appears to not be true irrespective of changes
> >> in this series. Regardless, changes in this series will result in an
> >> example like:
> >>
> >>   struct example_node {
> >>     long key;
> >>     long val;
> >>     struct bpf_rb_node node;
> >>   }
> >>
> >>   /* In BPF prog, assume root contains example_node nodes */
> >>   struct bpf_rb_node res = bpf_rbtree_first(&root);
> >>   if (!res)
> >>     return 1;
> >>
> >>   struct example_node n = container_of(res, struct example_node, node);
> >>   long key = n->key;
> >>
> >> Resulting in a load with off = -16, as bpf_rbtree_first's return is
> > 
> > Looks like the bug in the previous patch:
> > +                       } else if (meta.func_id == special_kfunc_list[KF_bpf_rbtree_remove] ||
> > +                                  meta.func_id == special_kfunc_list[KF_bpf_rbtree_first]) {
> > +                               struct btf_field *field = meta.arg_rbtree_root.field;
> > +
> > +                               mark_reg_datastructure_node(regs, BPF_REG_0,
> > +                                                           &field->datastructure_head);
> > 
> > The R0 .off should have been:
> >  regs[BPF_REG_0].off = field->rb_node.node_offset;
> > 
> > node, not root.
> > 
> > PTR_TO_BTF_ID should have been returned with approriate 'off',
> > so that container_of() would it bring back to zero offset.
> > 
> 
> The root's btf_field is used to hold information about the node type. Of
> specific interest to us are value_btf_id and node_offset, which
> mark_reg_datastructure_node uses to set REG_0's type and offset correctly.
> 
> This "use head type to keep info about node type" strategy felt strange to me
> initially too: all PTR_TO_BTF_ID regs are passing around their type info, so
> why not use that to lookup bpf_rb_node field info? But consider that
> bpf_rbtree_first (and bpf_list_pop_{front,back}) doesn't take a node as
> input arg, so there's no opportunity to get btf_field info from input
> reg type. 
> 
> So we'll need to keep this info in rbtree_root's btf_field
> regardless, and since any rbtree API function that operates on a node
> also operates on a root and expects its node arg to match the node
> type expected by the root, might as well use root's field as the main
> lookup for this info and not even have &field->rb_node for now.
> All __process_kf_arg_ptr_to_datastructure_node calls (added earlier
> in the series) use the &meta->arg_{list_head,rbtree_root}.field for same
> reason.
> 
> So it's setting the reg offset correctly.

Ok. Got it. Then the commit log is incorrectly describing the failing scenario.
It's a container_of() inside bool less() that is generating negative offsets.

> > All PTR_TO_BTF_ID need to have positive offset.
> > I'm not sure btf_struct_walk() and other PTR_TO_BTF_ID accessors
> > can deal with negative offsets.
> > There could be all kinds of things to fix.
> 
> I think you may be conflating reg offset and insn offset here. None of the
> changes in this series result in a PTR_TO_BTF_ID reg w/ negative offset
> being returned. But LLVM may generate load insns with a negative offset,
> and since we're passing around pointers to bpf_rb_node that may come
> after useful data fields in a type, this will happen more often.
> 
> Consider this small example from selftests in this series:
> 
> struct node_data {
>   long key;
>   long data;
>   struct bpf_rb_node node;
> };
> 
> static bool less(struct bpf_rb_node *a, const struct bpf_rb_node *b)
> {
>         struct node_data *node_a;
>         struct node_data *node_b;
> 
>         node_a = container_of(a, struct node_data, node);
>         node_b = container_of(b, struct node_data, node);
> 
>         return node_a->key < node_b->key;
> }
> 
> llvm-objdump shows this bpf bytecode for 'less':
> 
> 0000000000000000 <less>:
> ;       return node_a->key < node_b->key;
>        0:       79 22 f0 ff 00 00 00 00 r2 = *(u64 *)(r2 - 0x10)
>        1:       79 11 f0 ff 00 00 00 00 r1 = *(u64 *)(r1 - 0x10)
>        2:       b4 00 00 00 01 00 00 00 w0 = 0x1
> ;       return node_a->key < node_b->key;

I see. That's the same bug.
The args to callback should have been PTR_TO_BTF_ID | PTR_TRUSTED with 
correct positive offset.
Then node_a = container_of(a, struct node_data, node);
would have produced correct offset into proper btf_id.

The verifier should be passing into less() the btf_id
of struct node_data instead of btf_id of struct bpf_rb_node.

>        3:       cd 21 01 00 00 00 00 00 if r1 s< r2 goto +0x1 <LBB2_2>
>        4:       b4 00 00 00 00 00 00 00 w0 = 0x0
> 
> 0000000000000028 <LBB2_2>:
> ;       return node_a->key < node_b->key;
>        5:       95 00 00 00 00 00 00 00 exit
> 
> Insns 0 and 1 are loading node_b->key and node_a->key, respectively, using
> negative insn->off. Verifier's view or R1 and R2 before insn 0 is
> untrusted_ptr_node_data(off=16). If there were some intermediate insns
> storing result of container_of() before dereferencing:
> 
>   r3 = (r2 - 0x10)
>   r2 = *(u64 *)(r3)
> 
> Verifier would see R3 as untrusted_ptr_node_data(off=0), and load for
> r2 would have insn->off = 0. But LLVM decides to just do a load-with-offset
> using original arg ptrs to less() instead of storing container_of() ptr
> adjustments.
> 
> Since the container_of usage and code pattern in above example's less()
> isn't particularly specific to this series, I think there are other scenarios
> where such code would be generated and considered this a general bugfix in
> cover letter.

imo the negative offset looks specific to two misuses of PTR_UNTRUSTED in this set.

> 
> [ below paragraph was moved here, it originally preceded "All PTR_TO_BTF_ID"
>   paragraph ]
> 
> > The apporach of returning untrusted from bpf_rbtree_first is questionable.
> > Without doing that this issue would not have surfaced.
> > 
> 
> I agree re: PTR_UNTRUSTED, but note that my earlier example doesn't involve
> bpf_rbtree_first. Regardless, I think the issue is that PTR_UNTRUSTED is
> used to denote a few separate traits of a PTR_TO_BTF_ID reg:
> 
>   * "I have no ownership over the thing I'm pointing to"
>   * "My backing memory may go away at any time"
>   * "Access to my fields might result in page fault"
>   * "Kfuncs shouldn't accept me as an arg"
> 
> Seems like original PTR_UNTRUSTED usage really wanted to denote the first
> point and the others were just naturally implied from the first. But
> as you've noted there are some things using PTR_UNTRUSTED that really
> want to make more granular statements:

I think PTR_UNTRUSTED implies all of the above. All 4 statements are connected.

> ref_set_release_on_unlock logic sets release_on_unlock = true and adds
> PTR_UNTRUSTED to the reg type. In this case PTR_UNTRUSTED is trying to say:
> 
>   * "I have no ownership over the thing I'm pointing to"
>   * "My backing memory may go away at any time _after_ bpf_spin_unlock"
>     * Before spin_unlock it's guaranteed to be valid
>   * "Kfuncs shouldn't accept me as an arg"
>     * We don't want arbitrary kfunc saving and accessing release_on_unlock
>       reg after bpf_spin_unlock, as its backing memory can go away any time
>       after spin_unlock.
> 
> The "backing memory" statement PTR_UNTRUSTED is making is a blunt superset
> of what release_on_unlock really needs.
> 
> For less() callback we just want
> 
>   * "I have no ownership over the thing I'm pointing to"
>   * "Kfuncs shouldn't accept me as an arg"
> 
> There is probably a way to decompose PTR_UNTRUSTED into a few flags such that
> it's possible to denote these things separately and avoid unwanted additional
> behavior. But after talking to David Vernet about current complexity of
> PTR_TRUSTED and PTR_UNTRUSTED logic and his desire to refactor, it seemed
> better to continue with PTR_UNTRUSTED blunt instrument with a bit of
> special casing for now, instead of piling on more flags.

Exactly. More flags will only increase the confusion.
Please try to make callback args as proper PTR_TRUSTED and disallow calling specific
rbtree kfuncs while inside this particular callback to prevent recursion.
That would solve all these issues, no?
Writing into such PTR_TRUSTED should still be allowed inside the cb, though it's bogus.

Consider less() receiving btf_id ptr_trusted of struct node_data and it contains
both link list and rbtree.
It should still be safe to operate on link list part of that node from less()
though it's not something we would ever recommend.
The kfunc call on rb tree part of struct node_data is problematic because
of recursion, right? No other safety concerns ?

> > 
> >> modified by verifier to be PTR_TO_BTF_ID of example_node w/ offset =
> >> offsetof(struct example_node, node), instead of PTR_TO_BTF_ID of
> >> bpf_rb_node. So it's necessary to support negative insn->off when
> >> jitting BPF_PROBE_MEM.
> > 
> > I'm not convinced it's necessary.
> > container_of() seems to be the only case where bpf prog can convert
> > PTR_TO_BTF_ID with off >= 0 to negative off.
> > Normal pointer walking will not make it negative.
> > 
> 
> I see what you mean - if some non-container_of case resulted in load generation
> with negative insn->off, this probably would've been noticed already. But
> hopefully my replies above explain why it should be addressed now.

Even with container_of() usage we should be passing proper btf_id of container
struct, so that callbacks and non-callbacks can properly container_of() it
and still get offset >= 0.

> >>
> >> A few instructions are saved for negative insn->offs as a result. Using
> >> the struct example_node / off = -16 example from before, code looks
> >> like:
> > 
> > This is quite complex to review. I couldn't convince myself
> > that droping 2nd check is safe, but don't have an argument to
> > prove that it's not safe.
> > Let's get to these details when there is need to support negative off.
> > 
> 
> Hopefully above explanation shows that there's need to support it now.
> I will try to simplify and rephrase the summary to make it easier to follow,
> but will prioritize addressing feedback in less complex patches, so this
> patch may not change for a few respins.

I'm not saying that this patch will never be needed.
Supporting negative offsets here is a good thing.
I'm arguing that it's not necessary to enable bpf_rbtree.
Dave Marchevsky Dec. 7, 2022, 11:39 p.m. UTC | #4
On 12/7/22 1:06 PM, Alexei Starovoitov wrote:
> On Wed, Dec 07, 2022 at 01:46:56AM -0500, Dave Marchevsky wrote:
>> On 12/6/22 9:39 PM, Alexei Starovoitov wrote:
>>> On Tue, Dec 06, 2022 at 03:09:57PM -0800, Dave Marchevsky wrote:
>>>> Current comment in BPF_PROBE_MEM jit code claims that verifier prevents
>>>> insn->off < 0, but this appears to not be true irrespective of changes
>>>> in this series. Regardless, changes in this series will result in an
>>>> example like:
>>>>
>>>>   struct example_node {
>>>>     long key;
>>>>     long val;
>>>>     struct bpf_rb_node node;
>>>>   }
>>>>
>>>>   /* In BPF prog, assume root contains example_node nodes */
>>>>   struct bpf_rb_node res = bpf_rbtree_first(&root);
>>>>   if (!res)
>>>>     return 1;
>>>>
>>>>   struct example_node n = container_of(res, struct example_node, node);
>>>>   long key = n->key;
>>>>
>>>> Resulting in a load with off = -16, as bpf_rbtree_first's return is
>>>
>>> Looks like the bug in the previous patch:
>>> +                       } else if (meta.func_id == special_kfunc_list[KF_bpf_rbtree_remove] ||
>>> +                                  meta.func_id == special_kfunc_list[KF_bpf_rbtree_first]) {
>>> +                               struct btf_field *field = meta.arg_rbtree_root.field;
>>> +
>>> +                               mark_reg_datastructure_node(regs, BPF_REG_0,
>>> +                                                           &field->datastructure_head);
>>>
>>> The R0 .off should have been:
>>>  regs[BPF_REG_0].off = field->rb_node.node_offset;
>>>
>>> node, not root.
>>>
>>> PTR_TO_BTF_ID should have been returned with approriate 'off',
>>> so that container_of() would it bring back to zero offset.
>>>
>>
>> The root's btf_field is used to hold information about the node type. Of
>> specific interest to us are value_btf_id and node_offset, which
>> mark_reg_datastructure_node uses to set REG_0's type and offset correctly.
>>
>> This "use head type to keep info about node type" strategy felt strange to me
>> initially too: all PTR_TO_BTF_ID regs are passing around their type info, so
>> why not use that to lookup bpf_rb_node field info? But consider that
>> bpf_rbtree_first (and bpf_list_pop_{front,back}) doesn't take a node as
>> input arg, so there's no opportunity to get btf_field info from input
>> reg type. 
>>
>> So we'll need to keep this info in rbtree_root's btf_field
>> regardless, and since any rbtree API function that operates on a node
>> also operates on a root and expects its node arg to match the node
>> type expected by the root, might as well use root's field as the main
>> lookup for this info and not even have &field->rb_node for now.
>> All __process_kf_arg_ptr_to_datastructure_node calls (added earlier
>> in the series) use the &meta->arg_{list_head,rbtree_root}.field for same
>> reason.
>>
>> So it's setting the reg offset correctly.
> 
> Ok. Got it. Then the commit log is incorrectly describing the failing scenario.
> It's a container_of() inside bool less() that is generating negative offsets.
> 

I noticed this happening with container_of() both inside less() and in the
example in the patch summary - specifically in the rbtree_first_and_remove
'success' selftest added in patch 13. There, operations like this:

  bpf_spin_lock(&glock);
  res = bpf_rbtree_first(&groot);
  if (!res) {...}

  o = container_of(res, struct node_data, node);
  first_data[1] = o->data;
  bpf_spin_unlock(&glock);

Would fail to set first_data[1] to the expected value, instead setting
it to 0, since the existing PROBE_MEM guard ends up zeroing the destination
register instead of doing the load (see the sketch below).
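
For reference, a small standalone sketch of my reading of why the pre-patch
check rejects such a load: adding the sign-extended negative offset to a valid
kernel pointer always produces a carry, which the JNC branch treats as an
overflow. The kernel address below is an assumption for illustration only.

  /* Hedged sketch: an unsigned add of a sign-extended negative insn->off to a
   * kernel pointer wraps past 2^64, i.e. sets the carry flag that the old
   * PROBE_MEM check 2 tests for. Compile and run as a normal userspace program.
   */
  #include <stdint.h>
  #include <stdio.h>

  int main(void)
  {
          uint64_t src_reg = 0xffff888012345678ULL;   /* assumed kernel address */
          uint64_t off = (uint64_t)(int64_t)-16;      /* sign-extended insn->off */
          uint64_t sum;

          /* __builtin_add_overflow on u64 reports exactly the carry condition */
          int carry = __builtin_add_overflow(src_reg, off, &sum);

          printf("carry=%d sum=0x%llx\n", carry, (unsigned long long)sum);
          return 0;
  }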

>>> All PTR_TO_BTF_ID need to have positive offset.
>>> I'm not sure btf_struct_walk() and other PTR_TO_BTF_ID accessors
>>> can deal with negative offsets.
>>> There could be all kinds of things to fix.
>>
>> I think you may be conflating reg offset and insn offset here. None of the
>> changes in this series result in a PTR_TO_BTF_ID reg w/ negative offset
>> being returned. But LLVM may generate load insns with a negative offset,
>> and since we're passing around pointers to bpf_rb_node that may come
>> after useful data fields in a type, this will happen more often.
>>
>> Consider this small example from selftests in this series:
>>
>> struct node_data {
>>   long key;
>>   long data;
>>   struct bpf_rb_node node;
>> };
>>
>> static bool less(struct bpf_rb_node *a, const struct bpf_rb_node *b)
>> {
>>         struct node_data *node_a;
>>         struct node_data *node_b;
>>
>>         node_a = container_of(a, struct node_data, node);
>>         node_b = container_of(b, struct node_data, node);
>>
>>         return node_a->key < node_b->key;
>> }
>>
>> llvm-objdump shows this bpf bytecode for 'less':
>>
>> 0000000000000000 <less>:
>> ;       return node_a->key < node_b->key;
>>        0:       79 22 f0 ff 00 00 00 00 r2 = *(u64 *)(r2 - 0x10)
>>        1:       79 11 f0 ff 00 00 00 00 r1 = *(u64 *)(r1 - 0x10)
>>        2:       b4 00 00 00 01 00 00 00 w0 = 0x1
>> ;       return node_a->key < node_b->key;
> 
> I see. That's the same bug.
> The args to callback should have been PTR_TO_BTF_ID | PTR_TRUSTED with 
> correct positive offset.
> Then node_a = container_of(a, struct node_data, node);
> would have produced correct offset into proper btf_id.
> 
> The verifier should be passing into less() the btf_id
> of struct node_data instead of btf_id of struct bpf_rb_node.
> 

The verifier is already passing the struct node_data type, not bpf_rb_node.
For less() args, and rbtree_{first,remove} retval, mark_reg_datastructure_node
- added in patch 8 - is doing as you describe.

Verifier sees less' arg regs as R=ptr_to_node_data(off=16). If it was
instead passing R=ptr_to_bpf_rb_node(off=0), attempting to access *(reg - 0x10)
would cause verifier err.
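
To make the offset arithmetic concrete, here is a minimal standalone mock of
what container_of() does in this case. The container_of definition is a
simplified kernel-style version and the bpf_rb_node stub is an assumption that
only stands in for layout:

  /* Minimal userspace mock of the container_of() arithmetic */
  #include <stddef.h>
  #include <stdio.h>

  struct bpf_rb_node { unsigned long __opaque[4]; };  /* layout stand-in only */

  struct node_data {
          long key;                   /* offset 0  */
          long data;                  /* offset 8  */
          struct bpf_rb_node node;    /* offset 16 */
  };

  #define container_of(ptr, type, member) \
          ((type *)((char *)(ptr) - offsetof(type, member)))

  int main(void)
  {
          struct node_data n = { .key = 42 };
          struct bpf_rb_node *res = &n.node;  /* what the verifier tracks as off=16 */
          struct node_data *o = container_of(res, struct node_data, node);

          /* The subtraction is offsetof(struct node_data, node) == 16, which
           * LLVM folds straight into the load as insn->off = -0x10.
           */
          printf("offsetof(node)=%zu key=%ld\n",
                 offsetof(struct node_data, node), o->key);
          return 0;
  }

So a reg tracked as ptr_to_node_data(off=16) ends up dereferenced with a
-0x10 insn->off rather than being rebased to off=0 first.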

>>        3:       cd 21 01 00 00 00 00 00 if r1 s< r2 goto +0x1 <LBB2_2>
>>        4:       b4 00 00 00 00 00 00 00 w0 = 0x0
>>
>> 0000000000000028 <LBB2_2>:
>> ;       return node_a->key < node_b->key;
>>        5:       95 00 00 00 00 00 00 00 exit
>>
>> Insns 0 and 1 are loading node_b->key and node_a->key, respectively, using
>> negative insn->off. Verifier's view of R1 and R2 before insn 0 is
>> untrusted_ptr_node_data(off=16). If there were some intermediate insns
>> storing result of container_of() before dereferencing:
>>
>>   r3 = (r2 - 0x10)
>>   r2 = *(u64 *)(r3)
>>
>> Verifier would see R3 as untrusted_ptr_node_data(off=0), and load for
>> r2 would have insn->off = 0. But LLVM decides to just do a load-with-offset
>> using original arg ptrs to less() instead of storing container_of() ptr
>> adjustments.
>>
>> Since the container_of usage and code pattern in above example's less()
>> isn't particularly specific to this series, I think there are other scenarios
>> where such code would be generated, so I considered this a general bugfix in
>> the cover letter.
> 
> imo the negative offset looks specific to two misuses of PTR_UNTRUSTED in this set.
> 

If I used PTR_TRUSTED here, the JITted instructions would still do a load like
r2 = *(u64 *)(r2 - 0x10). There would just be no BPF_PROBE_MEM runtime checking
insns generated, avoiding the negative insn->off issue there. But the negative
insn->off load being generated is not specific to PTR_UNTRUSTED.

>>
>> [ below paragraph was moved here, it originally preceded "All PTR_TO_BTF_ID"
>>   paragraph ]
>>
>>> The approach of returning untrusted from bpf_rbtree_first is questionable.
>>> Without doing that this issue would not have surfaced.
>>>
>>
>> I agree re: PTR_UNTRUSTED, but note that my earlier example doesn't involve
>> bpf_rbtree_first. Regardless, I think the issue is that PTR_UNTRUSTED is
>> used to denote a few separate traits of a PTR_TO_BTF_ID reg:
>>
>>   * "I have no ownership over the thing I'm pointing to"
>>   * "My backing memory may go away at any time"
>>   * "Access to my fields might result in page fault"
>>   * "Kfuncs shouldn't accept me as an arg"
>>
>> Seems like original PTR_UNTRUSTED usage really wanted to denote the first
>> point and the others were just naturally implied from the first. But
>> as you've noted there are some things using PTR_UNTRUSTED that really
>> want to make more granular statements:
> 
> I think PTR_UNTRUSTED implies all of the above. All 4 statements are connected.
> 
>> ref_set_release_on_unlock logic sets release_on_unlock = true and adds
>> PTR_UNTRUSTED to the reg type. In this case PTR_UNTRUSTED is trying to say:
>>
>>   * "I have no ownership over the thing I'm pointing to"
>>   * "My backing memory may go away at any time _after_ bpf_spin_unlock"
>>     * Before spin_unlock it's guaranteed to be valid
>>   * "Kfuncs shouldn't accept me as an arg"
>>     * We don't want arbitrary kfunc saving and accessing release_on_unlock
>>       reg after bpf_spin_unlock, as its backing memory can go away any time
>>       after spin_unlock.
>>
>> The "backing memory" statement PTR_UNTRUSTED is making is a blunt superset
>> of what release_on_unlock really needs.
>>
>> For less() callback we just want
>>
>>   * "I have no ownership over the thing I'm pointing to"
>>   * "Kfuncs shouldn't accept me as an arg"
>>
>> There is probably a way to decompose PTR_UNTRUSTED into a few flags such that
>> it's possible to denote these things separately and avoid unwanted additional
>> behavior. But after talking to David Vernet about current complexity of
>> PTR_TRUSTED and PTR_UNTRUSTED logic and his desire to refactor, it seemed
>> better to continue with PTR_UNTRUSTED blunt instrument with a bit of
>> special casing for now, instead of piling on more flags.
> 
> Exactly. More flags will only increase the confusion.
> Please try to make callback args as proper PTR_TRUSTED and disallow calling specific
> rbtree kfuncs while inside this particular callback to prevent recursion.
> That would solve all these issues, no?
> Writing into such PTR_TRUSTED should be still allowed inside cb though it's bogus.
> 
> Consider less() receiving btf_id ptr_trusted of struct node_data and it contains
> both link list and rbtree.
> It should still be safe to operate on link list part of that node from less()
> though it's not something we would ever recommend.

I definitely want to allow writes on non-owning references. In order to properly
support this, there needs to be a way to designate a field as a "key":

struct node_data {
  long key __key;
  long data;
  struct bpf_rb_node node;
};

or perhaps on the rb_root via __contains or separate tag:

struct bpf_rb_root groot __contains(struct node_data, node, key);

This is necessary because rbtree's less() uses the key field to determine order,
so we don't want to allow writes to the key field while the node is in an rbtree.
If such a write were possible, the rbtree could easily be placed in an invalid
state, since the new key may mean that the rbtree is no longer sorted. Subsequent
add() operations would run less() against the new key, so other nodes would be
placed in the wrong spot as well.
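
As a hypothetical illustration in the style of the patch 13 selftests (the
node_data / groot / glock definitions, the kfunc declarations, container_of,
SEC("tc") and the function name are assumed and not shown), this is the kind
of write a __key tag would need to reject:

  /* Hypothetical fragment, not an actual selftest from this series */
  SEC("tc")
  long rbtree_write_key_of_first(void *ctx)
  {
          struct bpf_rb_node *res;
          struct node_data *o;

          bpf_spin_lock(&glock);
          res = bpf_rbtree_first(&groot);
          if (!res) {
                  bpf_spin_unlock(&glock);
                  return 0;
          }
          o = container_of(res, struct node_data, node);
          /* If this write were allowed while the node is linked, groot may no
           * longer be sorted by key, and later add() operations would place
           * new nodes relative to this node's stale position.
           */
          o->key += 100;
          bpf_spin_unlock(&glock);
          return 0;
  }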

Since PTR_UNTRUSTED currently allows read but not write, and prevents use of a
non-owning ref as a kfunc arg, it seemed to be a reasonable tag for less() args.

I was planning on adding __key / non-owning-ref write support as a followup, but
adding it as part of this series will probably save a lot of back-and-forth.
Will try to add it.

> The kfunc call on rb tree part of struct node_data is problematic because
> of recursion, right? No other safety concerns ?
> 
>>>
>>>> modified by verifier to be PTR_TO_BTF_ID of example_node w/ offset =
>>>> offsetof(struct example_node, node), instead of PTR_TO_BTF_ID of
>>>> bpf_rb_node. So it's necessary to support negative insn->off when
>>>> jitting BPF_PROBE_MEM.
>>>
>>> I'm not convinced it's necessary.
>>> container_of() seems to be the only case where bpf prog can convert
>>> PTR_TO_BTF_ID with off >= 0 to negative off.
>>> Normal pointer walking will not make it negative.
>>>
>>
>> I see what you mean - if some non-container_of case resulted in load generation
>> with negative insn->off, this probably would've been noticed already. But
>> hopefully my replies above explain why it should be addressed now.
> 
> Even with container_of() usage we should be passing proper btf_id of container
> struct, so that callbacks and non-callbacks can properly container_of() it
> and still get offset >= 0.
> 

This was addressed earlier in my response.

>>>>
>>>> A few instructions are saved for negative insn->offs as a result. Using
>>>> the struct example_node / off = -16 example from before, code looks
>>>> like:
>>>
>>> This is quite complex to review. I couldn't convince myself
>>> that dropping the 2nd check is safe, but don't have an argument to
>>> prove that it's not safe.
>>> Let's get to these details when there is need to support negative off.
>>>
>>
>> Hopefully above explanation shows that there's need to support it now.
>> I will try to simplify and rephrase the summary to make it easier to follow,
>> but will prioritize addressing feedback in less complex patches, so this
>> patch may not change for a few respins.
> 
> I'm not saying that this patch will never be needed.
> Supporting negative offsets here is a good thing.
> I'm arguing that it's not necessary to enable bpf_rbtree.
Alexei Starovoitov Dec. 8, 2022, 12:47 a.m. UTC | #5
On Wed, Dec 07, 2022 at 06:39:38PM -0500, Dave Marchevsky wrote:
> >>
> >> 0000000000000000 <less>:
> >> ;       return node_a->key < node_b->key;
> >>        0:       79 22 f0 ff 00 00 00 00 r2 = *(u64 *)(r2 - 0x10)
> >>        1:       79 11 f0 ff 00 00 00 00 r1 = *(u64 *)(r1 - 0x10)
> >>        2:       b4 00 00 00 01 00 00 00 w0 = 0x1
> >> ;       return node_a->key < node_b->key;
> > 
> > I see. That's the same bug.
> > The args to callback should have been PTR_TO_BTF_ID | PTR_TRUSTED with 
> > correct positive offset.
> > Then node_a = container_of(a, struct node_data, node);
> > would have produced correct offset into proper btf_id.
> > 
> > The verifier should be passing into less() the btf_id
> > of struct node_data instead of btf_id of struct bpf_rb_node.
> > 
> 
> The verifier is already passing the struct node_data type, not bpf_rb_node.
> For less() args, and rbtree_{first,remove} retval, mark_reg_datastructure_node
> - added in patch 8 - is doing as you describe.
> 
> Verifier sees less' arg regs as R=ptr_to_node_data(off=16). If it was
> instead passing R=ptr_to_bpf_rb_node(off=0), attempting to access *(reg - 0x10)
> would cause verifier err.

Ahh. I finally got it :)
Please put these details in the commit log when you respin.

> >>        3:       cd 21 01 00 00 00 00 00 if r1 s< r2 goto +0x1 <LBB2_2>
> >>        4:       b4 00 00 00 00 00 00 00 w0 = 0x0
> >>
> >> 0000000000000028 <LBB2_2>:
> >> ;       return node_a->key < node_b->key;
> >>        5:       95 00 00 00 00 00 00 00 exit
> >>
> >> Insns 0 and 1 are loading node_b->key and node_a->key, respectively, using
> >> negative insn->off. Verifier's view of R1 and R2 before insn 0 is
> >> untrusted_ptr_node_data(off=16). If there were some intermediate insns
> >> storing result of container_of() before dereferencing:
> >>
> >>   r3 = (r2 - 0x10)
> >>   r2 = *(u64 *)(r3)
> >>
> >> Verifier would see R3 as untrusted_ptr_node_data(off=0), and load for
> >> r2 would have insn->off = 0. But LLVM decides to just do a load-with-offset
> >> using original arg ptrs to less() instead of storing container_of() ptr
> >> adjustments.
> >>
> >> Since the container_of usage and code pattern in above example's less()
> >> isn't particularly specific to this series, I think there are other scenarios
> >> where such code would be generated, so I considered this a general bugfix in
> >> the cover letter.
> > 
> > imo the negative offset looks specific to two misuses of PTR_UNTRUSTED in this set.
> > 
> 
> If I used PTR_TRUSTED here, the JITted instructions would still do a load like
> r2 = *(u64 *)(r2 - 0x10). There would just be no BPF_PROBE_MEM runtime checking
> insns generated, avoiding negative insn issue there. But the negative insn->off
> load being generated is not specific to PTR_UNTRUSTED.

yep.

> > 
> > Exactly. More flags will only increase the confusion.
> > Please try to make callback args as proper PTR_TRUSTED and disallow calling specific
> > rbtree kfuncs while inside this particular callback to prevent recursion.
> > That would solve all these issues, no?
> > Writing into such PTR_TRUSTED should be still allowed inside cb though it's bogus.
> > 
> > Consider less() receiving btf_id ptr_trusted of struct node_data and it contains
> > both link list and rbtree.
> > It should still be safe to operate on link list part of that node from less()
> > though it's not something we would ever recommend.
> 
> I definitely want to allow writes on non-owning references. In order to properly
> support this, there needs to be a way to designate a field as a "key":
> 
> struct node_data {
>   long key __key;
>   long data;
>   struct bpf_rb_node node;
> };
> 
> or perhaps on the rb_root via __contains or separate tag:
> 
> struct bpf_rb_root groot __contains(struct node_data, node, key);
> 
> This is necessary because rbtree's less() uses key field to determine order, so
> we don't want to allow write to the key field when the node is in a rbtree. If
> such a write were possible the rbtree could easily be placed in an invalid state
> since the new key may mean that the rbtree is no longer sorted. Subsequent add()
> operations would compare less() using the new key, so other nodes will be placed
> in wrong spot as well.
> 
> Since PTR_UNTRUSTED currently allows read but not write, and prevents use of
> non-owning ref as kfunc arg, it seemed to be reasonable tag for less() args.
> 
> I was planning on adding __key / non-owning-ref write support as a followup, but
> adding it as part of this series will probably save a lot of back-and-forth.
> Will try to add it.

Just a key mark might not be enough. less() could be doing all sorts of complex
logic on more than one field and even on global fields.
But what is the concern with writing into 'key'?
The rbtree will not be sorted. find/add operations will not be correct,
but nothing will crash. At the end bpf_rb_root_free() will walk all
unsorted nodes anyway and free them all.
Even if we pass PTR_TRUSTED | MEM_RDONLY pointers into less(), less()
can still do nonsensical things like returning random true/false.
Doesn't look like an issue to me.
Dave Marchevsky Dec. 8, 2022, 8:50 a.m. UTC | #6
On 12/7/22 7:47 PM, Alexei Starovoitov wrote:
> On Wed, Dec 07, 2022 at 06:39:38PM -0500, Dave Marchevsky wrote:
>>>>
>>>> 0000000000000000 <less>:
>>>> ;       return node_a->key < node_b->key;
>>>>        0:       79 22 f0 ff 00 00 00 00 r2 = *(u64 *)(r2 - 0x10)
>>>>        1:       79 11 f0 ff 00 00 00 00 r1 = *(u64 *)(r1 - 0x10)
>>>>        2:       b4 00 00 00 01 00 00 00 w0 = 0x1
>>>> ;       return node_a->key < node_b->key;
>>>
>>> I see. That's the same bug.
>>> The args to callback should have been PTR_TO_BTF_ID | PTR_TRUSTED with 
>>> correct positive offset.
>>> Then node_a = container_of(a, struct node_data, node);
>>> would have produced correct offset into proper btf_id.
>>>
>>> The verifier should be passing into less() the btf_id
>>> of struct node_data instead of btf_id of struct bpf_rb_node.
>>>
>>
>> The verifier is already passing the struct node_data type, not bpf_rb_node.
>> For less() args, and rbtree_{first,remove} retval, mark_reg_datastructure_node
>> - added in patch 8 - is doing as you describe.
>>
>> Verifier sees less' arg regs as R=ptr_to_node_data(off=16). If it was
>> instead passing R=ptr_to_bpf_rb_node(off=0), attempting to access *(reg - 0x10)
>> would cause verifier err.
> 
> Ahh. I finally got it :)
> Please put these details in the commit log when you respin.
> 

Glad it finally started making sense.
Will do a big improvement of the patch summary after addressing other
feedback on this series.

>>>>        3:       cd 21 01 00 00 00 00 00 if r1 s< r2 goto +0x1 <LBB2_2>
>>>>        4:       b4 00 00 00 00 00 00 00 w0 = 0x0
>>>>
>>>> 0000000000000028 <LBB2_2>:
>>>> ;       return node_a->key < node_b->key;
>>>>        5:       95 00 00 00 00 00 00 00 exit
>>>>
>>>> Insns 0 and 1 are loading node_b->key and node_a->key, respectively, using
>>>> negative insn->off. Verifier's view of R1 and R2 before insn 0 is
>>>> untrusted_ptr_node_data(off=16). If there were some intermediate insns
>>>> storing result of container_of() before dereferencing:
>>>>
>>>>   r3 = (r2 - 0x10)
>>>>   r2 = *(u64 *)(r3)
>>>>
>>>> Verifier would see R3 as untrusted_ptr_node_data(off=0), and load for
>>>> r2 would have insn->off = 0. But LLVM decides to just do a load-with-offset
>>>> using original arg ptrs to less() instead of storing container_of() ptr
>>>> adjustments.
>>>>
>>>> Since the container_of usage and code pattern in above example's less()
>>>> isn't particularly specific to this series, I think there are other scenarios
>>>> where such code would be generated, so I considered this a general bugfix in
>>>> the cover letter.
>>>
>>> imo the negative offset looks specific to two misuses of PTR_UNTRUSTED in this set.
>>>
>>
>> If I used PTR_TRUSTED here, the JITted instructions would still do a load like
>> r2 = *(u64 *)(r2 - 0x10). There would just be no BPF_PROBE_MEM runtime checking
>> insns generated, avoiding negative insn issue there. But the negative insn->off
>> load being generated is not specific to PTR_UNTRUSTED.
> 
> yep.
> 
>>>
>>> Exactly. More flags will only increase the confusion.
>>> Please try to make callback args as proper PTR_TRUSTED and disallow calling specific
>>> rbtree kfuncs while inside this particular callback to prevent recursion.
>>> That would solve all these issues, no?
>>> Writing into such PTR_TRUSTED should be still allowed inside cb though it's bogus.
>>>
>>> Consider less() receiving btf_id ptr_trusted of struct node_data and it contains
>>> both link list and rbtree.
>>> It should still be safe to operate on link list part of that node from less()
>>> though it's not something we would ever recommend.
>>
>> I definitely want to allow writes on non-owning references. In order to properly
>> support this, there needs to be a way to designate a field as a "key":
>>
>> struct node_data {
>>   long key __key;
>>   long data;
>>   struct bpf_rb_node node;
>> };
>>
>> or perhaps on the rb_root via __contains or separate tag:
>>
>> struct bpf_rb_root groot __contains(struct node_data, node, key);
>>
>> This is necessary because rbtree's less() uses key field to determine order, so
>> we don't want to allow write to the key field when the node is in a rbtree. If
>> such a write were possible the rbtree could easily be placed in an invalid state
>> since the new key may mean that the rbtree is no longer sorted. Subsequent add()
>> operations would compare less() using the new key, so other nodes will be placed
>> in wrong spot as well.
>>
>> Since PTR_UNTRUSTED currently allows read but not write, and prevents use of
>> non-owning ref as kfunc arg, it seemed to be reasonable tag for less() args.
>>
>> I was planning on adding __key / non-owning-ref write support as a followup, but
>> adding it as part of this series will probably save a lot of back-and-forth.
>> Will try to add it.
> 
> Just key mark might not be enough. less() could be doing all sort of complex
> logic on more than one field and even global fields.
> But what is the concern with writing into 'key' ?
> The rbtree will not be sorted. find/add operation will not be correct,
> but nothing will crash. At the end bpf_rb_root_free() will walk all
> unsorted nodes anyway and free them all.
> Even if we pass PTR_TRUSTED | MEM_RDONLY pointers into less() the less()
> can still do nonsensical things like returning random true/false.
> Doesn't look like an issue to me.

Agreed re: complex logic + global fields, less() being able to do nonsensical
things, and writing to key not crashing anything even if it breaks the tree.

OK, let's forget about __key. In the next version of the series non-owning refs
will be writable. We can add more protection in the future if it's deemed
necessary. Since this means non-owning refs won't be PTR_UNTRUSTED anymore,
I can split this patch out from the rest of the series after confirming that
it isn't necessary to ship rbtree.

Still want to convince you that skipping the 2nd check is correct before
I page out the details, but that's less urgent now. IIUC, although the cause of
the issue is clear now, you'd still like me to clarify the details of the
solution.
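
To make that argument concrete, here is a small standalone sketch of the
arithmetic, using assumed x86-64 constants (the TASK_SIZE_MAX value below is
illustrative for 4-level paging; the JIT of course uses the real macro):

  /* Hedged sketch of the no-overflow argument for insn->off < 0.
   * Constants are assumptions for illustration, not from kernel headers.
   */
  #include <assert.h>
  #include <stdint.h>

  int main(void)
  {
          const uint64_t task_size_max = 0x00007ffffffff000ULL;  /* assumed */
          const uint64_t page_size = 4096;
          const int64_t insn_off = -32768;        /* S16_MIN, worst case */

          /* check 1 emitted by the JIT: src_reg >= limit, with */
          const uint64_t limit = task_size_max + page_size - insn_off;

          /* Any src_reg passing check 1 gives
           *   src_reg + insn_off >= task_size_max + page_size > 0,
           * so the add cannot wrap below zero and the result stays a kernel
           * address; the old 2nd (carry) check is redundant for negative offs.
           */
          const uint64_t smallest_passing_src = limit;

          assert(smallest_passing_src + insn_off == task_size_max + page_size);
          return 0;
  }
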
diff mbox series

Patch

diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
index 36ffe67ad6e5..843f619d0d35 100644
--- a/arch/x86/net/bpf_jit_comp.c
+++ b/arch/x86/net/bpf_jit_comp.c
@@ -11,6 +11,7 @@ 
 #include <linux/bpf.h>
 #include <linux/memory.h>
 #include <linux/sort.h>
+#include <linux/limits.h>
 #include <asm/extable.h>
 #include <asm/set_memory.h>
 #include <asm/nospec-branch.h>
@@ -94,6 +95,7 @@  static int bpf_size_to_x86_bytes(int bpf_size)
  */
 #define X86_JB  0x72
 #define X86_JAE 0x73
+#define X86_JNC 0x73
 #define X86_JE  0x74
 #define X86_JNE 0x75
 #define X86_JBE 0x76
@@ -950,6 +952,36 @@  static void emit_shiftx(u8 **pprog, u32 dst_reg, u8 src_reg, bool is64, u8 op)
 	*pprog = prog;
 }
 
+/* Check that condition necessary for PROBE_MEM handling for insn->off < 0
+ * holds.
+ *
+ * This could be a static_assert((TASK_SIZE_MAX + PAGE_SIZE) > -S16_MIN),
+ * but TASK_SIZE_MAX can't always be evaluated at compile time, so let's not
+ * assume insn->off size either
+ */
+static int check_probe_mem_task_size_overflow(void)
+{
+	struct bpf_insn insn;
+	s64 max_negative;
+
+	switch (sizeof(insn.off)) {
+	case 2:
+		max_negative = S16_MIN;
+		break;
+	default:
+		pr_err("bpf_jit_error: unexpected bpf_insn->off size\n");
+		return -EFAULT;
+	}
+
+	if (!((TASK_SIZE_MAX + PAGE_SIZE) > -max_negative)) {
+		pr_err("bpf jit error: assumption does not hold:\n");
+		pr_err("\t(TASK_SIZE_MAX + PAGE_SIZE) + (max negative insn->off) > 0\n");
+		return -EFAULT;
+	}
+
+	return 0;
+}
+
 #define INSN_SZ_DIFF (((addrs[i] - addrs[i - 1]) - (prog - temp)))
 
 static int do_jit(struct bpf_prog *bpf_prog, int *addrs, u8 *image, u8 *rw_image,
@@ -967,6 +999,10 @@  static int do_jit(struct bpf_prog *bpf_prog, int *addrs, u8 *image, u8 *rw_image
 	u8 *prog = temp;
 	int err;
 
+	err = check_probe_mem_task_size_overflow();
+	if (err)
+		return err;
+
 	detect_reg_usage(insn, insn_cnt, callee_regs_used,
 			 &tail_call_seen);
 
@@ -1359,20 +1395,30 @@  st:			if (is_imm8(insn->off))
 		case BPF_LDX | BPF_MEM | BPF_DW:
 		case BPF_LDX | BPF_PROBE_MEM | BPF_DW:
 			if (BPF_MODE(insn->code) == BPF_PROBE_MEM) {
-				/* Though the verifier prevents negative insn->off in BPF_PROBE_MEM
-				 * add abs(insn->off) to the limit to make sure that negative
-				 * offset won't be an issue.
-				 * insn->off is s16, so it won't affect valid pointers.
-				 */
-				u64 limit = TASK_SIZE_MAX + PAGE_SIZE + abs(insn->off);
-				u8 *end_of_jmp1, *end_of_jmp2;
-
 				/* Conservatively check that src_reg + insn->off is a kernel address:
-				 * 1. src_reg + insn->off >= limit
-				 * 2. src_reg + insn->off doesn't become small positive.
-				 * Cannot do src_reg + insn->off >= limit in one branch,
-				 * since it needs two spare registers, but JIT has only one.
+				 * 1. src_reg + insn->off >= TASK_SIZE_MAX + PAGE_SIZE
+				 * 2. src_reg + insn->off doesn't overflow and become small positive
+				 *
+				 * For check 1, to save regs, do
+				 * src_reg >= (TASK_SIZE_MAX + PAGE_SIZE - insn->off) call rhs
+				 * of inequality 'limit'
+				 *
+				 * For check 2:
+				 * If insn->off is positive, add src_reg + insn->off and check
+				 * overflow directly
+				 * If insn->off is negative, we know that
+				 *   (TASK_SIZE_MAX + PAGE_SIZE - insn->off) > (TASK_SIZE_MAX + PAGE_SIZE)
+				 * and from check 1 we know
+				 *   src_reg >= (TASK_SIZE_MAX + PAGE_SIZE - insn->off)
+				 * So if (TASK_SIZE_MAX + PAGE_SIZE) + MAX_NEGATIVE_OFF > 0 we can
+				 * be sure that src_reg + insn->off won't overflow in either
+				 * direction and avoid runtime check entirely.
+				 *
+				 * check_probe_mem_task_size_overflow confirms the above assumption
+				 * at the beginning of this function
 				 */
+				u64 limit = TASK_SIZE_MAX + PAGE_SIZE - insn->off;
+				u8 *end_of_jmp1, *end_of_jmp2;
 
 				/* movabsq r11, limit */
 				EMIT2(add_1mod(0x48, AUX_REG), add_1reg(0xB8, AUX_REG));
@@ -1381,32 +1427,47 @@  st:			if (is_imm8(insn->off))
 				/* cmp src_reg, r11 */
 				maybe_emit_mod(&prog, src_reg, AUX_REG, true);
 				EMIT2(0x39, add_2reg(0xC0, src_reg, AUX_REG));
-				/* if unsigned '<' goto end_of_jmp2 */
-				EMIT2(X86_JB, 0);
-				end_of_jmp1 = prog;
-
-				/* mov r11, src_reg */
-				emit_mov_reg(&prog, true, AUX_REG, src_reg);
-				/* add r11, insn->off */
-				maybe_emit_1mod(&prog, AUX_REG, true);
-				EMIT2_off32(0x81, add_1reg(0xC0, AUX_REG), insn->off);
-				/* jmp if not carry to start_of_ldx
-				 * Otherwise ERR_PTR(-EINVAL) + 128 will be the user addr
-				 * that has to be rejected.
-				 */
-				EMIT2(0x73 /* JNC */, 0);
-				end_of_jmp2 = prog;
+				if (insn->off >= 0) {
+					/* cmp src_reg, r11 */
+					/* if unsigned '<' goto end_of_jmp2 */
+					EMIT2(X86_JB, 0);
+					end_of_jmp1 = prog;
+
+					/* mov r11, src_reg */
+					emit_mov_reg(&prog, true, AUX_REG, src_reg);
+					/* add r11, insn->off */
+					maybe_emit_1mod(&prog, AUX_REG, true);
+					EMIT2_off32(0x81, add_1reg(0xC0, AUX_REG), insn->off);
+					/* jmp if not carry to start_of_ldx
+					 * Otherwise ERR_PTR(-EINVAL) + 128 will be the user addr
+					 * that has to be rejected.
+					 */
+					EMIT2(X86_JNC, 0);
+					end_of_jmp2 = prog;
+				} else {
+					/* cmp src_reg, r11 */
+					/* if unsigned '>=' goto start_of_ldx
+					 * w/o needing to do check 2
+					 */
+					EMIT2(X86_JAE, 0);
+					end_of_jmp1 = prog;
+				}
 
 				/* xor dst_reg, dst_reg */
 				emit_mov_imm32(&prog, false, dst_reg, 0);
 				/* jmp byte_after_ldx */
 				EMIT2(0xEB, 0);
 
-				/* populate jmp_offset for JB above to jump to xor dst_reg */
-				end_of_jmp1[-1] = end_of_jmp2 - end_of_jmp1;
-				/* populate jmp_offset for JNC above to jump to start_of_ldx */
 				start_of_ldx = prog;
-				end_of_jmp2[-1] = start_of_ldx - end_of_jmp2;
+				if (insn->off >= 0) {
+					/* populate jmp_offset for JB above to jump to xor dst_reg */
+					end_of_jmp1[-1] = end_of_jmp2 - end_of_jmp1;
+					/* populate jmp_offset for JNC above to jump to start_of_ldx */
+					end_of_jmp2[-1] = start_of_ldx - end_of_jmp2;
+				} else {
+					/* populate jmp_offset for JAE above to jump to start_of_ldx */
+					end_of_jmp1[-1] = start_of_ldx - end_of_jmp1;
+				}
 			}
 			emit_ldx(&prog, BPF_SIZE(insn->code), dst_reg, src_reg, insn->off);
 			if (BPF_MODE(insn->code) == BPF_PROBE_MEM) {