
[bpf-next,10/13] bpf, x86: BPF_PROBE_MEM handling for insn->off < 0

Message ID 20221206231000.3180914-11-davemarchevsky@fb.com (mailing list archive)
State Superseded
Delegated to: BPF
Series BPF rbtree next-gen datastructure

Checks

Context Check Description
netdev/tree_selection success Clearly marked for bpf-next, async
netdev/fixes_present success Fixes tag not required for -next series
netdev/subject_prefix success Link
netdev/cover_letter success Series has a cover letter
netdev/patch_count success Link
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 0 this patch: 0
netdev/cc_maintainers warning 18 maintainers not CCed: yoshfuji@linux-ipv6.org mingo@redhat.com x86@kernel.org netdev@vger.kernel.org kpsingh@kernel.org bp@alien8.de haoluo@google.com davem@davemloft.net song@kernel.org yhs@fb.com tglx@linutronix.de dsahern@kernel.org dave.hansen@linux.intel.com martin.lau@linux.dev sdf@google.com john.fastabend@gmail.com jolsa@kernel.org hpa@zytor.com
netdev/build_clang success Errors and warnings before: 5 this patch: 5
netdev/module_param success Was 0 now: 0
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn success Errors and warnings before: 2 this patch: 2
netdev/checkpatch warning WARNING: line length of 100 exceeds 80 columns WARNING: line length of 101 exceeds 80 columns WARNING: line length of 103 exceeds 80 columns WARNING: line length of 106 exceeds 80 columns WARNING: line length of 82 exceeds 80 columns WARNING: line length of 84 exceeds 80 columns WARNING: line length of 85 exceeds 80 columns WARNING: line length of 86 exceeds 80 columns WARNING: line length of 87 exceeds 80 columns WARNING: line length of 92 exceeds 80 columns WARNING: line length of 94 exceeds 80 columns WARNING: line length of 97 exceeds 80 columns WARNING: line length of 98 exceeds 80 columns WARNING: line length of 99 exceeds 80 columns
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0
bpf/vmtest-bpf-next-PR fail PR summary
bpf/vmtest-bpf-next-VM_Test-1 success Logs for ${{ matrix.test }} on ${{ matrix.arch }} with ${{ matrix.toolchain }}
bpf/vmtest-bpf-next-VM_Test-2 success Logs for ShellCheck
bpf/vmtest-bpf-next-VM_Test-3 fail Logs for build for aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-4 fail Logs for build for aarch64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-5 fail Logs for build for s390x with gcc
bpf/vmtest-bpf-next-VM_Test-6 fail Logs for build for x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-7 fail Logs for build for x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-8 success Logs for llvm-toolchain
bpf/vmtest-bpf-next-VM_Test-9 success Logs for set-matrix

Commit Message

Dave Marchevsky Dec. 6, 2022, 11:09 p.m. UTC
The current comment in the BPF_PROBE_MEM JIT code claims that the verifier
prevents insn->off < 0, but this appears not to be true, irrespective of
the changes in this series. Regardless, the changes in this series will
result in an example like:

  struct example_node {
    long key;
    long val;
    struct bpf_rb_node node;
  };

  /* In BPF prog, assume root contains example_node nodes */
  struct bpf_rb_node *res = bpf_rbtree_first(&root);
  if (!res)
    return 1;

  struct example_node *n = container_of(res, struct example_node, node);
  long key = n->key;

Resulting in a load with off = -16, as bpf_rbtree_first's return is
modified by verifier to be PTR_TO_BTF_ID of example_node w/ offset =
offsetof(struct example_node, node), instead of PTR_TO_BTF_ID of
bpf_rb_node. So it's necessary to support negative insn->off when
jitting BPF_PROBE_MEM.

In order to ensure that a page fault for a BPF_PROBE_MEM load of *src_reg +
insn->off is safely handled, we must confirm that *src_reg + insn->off is
in the kernel's memory. Two runtime checks are emitted to confirm that:

  1) (*src_reg + insn->off) > boundary between user and kernel address
  spaces
  2) (*src_reg + insn->off) does not overflow to a small positive
  number. This might happen if some function meant to set src_reg
  returns ERR_PTR(-EINVAL) or similar.
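
In C terms, the invariant these two checks are meant to establish looks
roughly like the sketch below. It is purely illustrative (the JIT emits
x86 instructions under register constraints rather than computing the sum
into a spare register up front), and task_size_max/page_size stand in for
TASK_SIZE_MAX/PAGE_SIZE:

  #include <stdbool.h>
  #include <stdint.h>

  /* Illustrative model of the invariant, not the emitted sequence */
  static bool probe_mem_addr_ok(uint64_t src_reg, int16_t off,
                                uint64_t task_size_max, uint64_t page_size)
  {
      uint64_t boundary = task_size_max + page_size;
      uint64_t addr = src_reg + (uint64_t)(int64_t)off;

      /* check 1: the address actually loaded must be past the
       * user/kernel boundary (plus guard page)
       */
      if (addr < boundary)
          return false;

      /* check 2: the addition must not have wrapped around 2^64,
       * e.g. src_reg == ERR_PTR(-EINVAL) plus a positive insn->off
       */
      if (off >= 0 ? addr < src_reg : addr > src_reg)
          return false;

      return true;
  }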

Check 1 is currently slightly off - it compares a

  u64 limit = TASK_SIZE_MAX + PAGE_SIZE + abs(insn->off);

to *src_reg, aborting the load if limit is larger. Rewriting this as an
inequality:

  *src_reg > TASK_SIZE_MAX + PAGE_SIZE + abs(insn->off)
  *src_reg - abs(insn->off) > TASK_SIZE_MAX + PAGE_SIZE

shows that this isn't quite right even if insn->off is positive, as we
really want:

  *src_reg + insn->off > TASK_SIZE_MAX + PAGE_SIZE
  *src_reg > TASK_SIZE_MAX + PAGE_SIZE - insn->off

since *src_reg + insn->off is the address we'll be loading from, not
*src_reg - insn->off or *src_reg - abs(insn->off). So change the limit
computation's '+ abs(insn->off)' to '- insn->off', removing the abs(),
which the existing comment indicates was only added to ignore negative
insn->off.
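
A quick numeric check, assuming TASK_SIZE_MAX + PAGE_SIZE evaluates to
0x800000000000 as in the disassembly further down (the exact value depends
on the paging configuration):

  #include <stdio.h>
  #include <stdlib.h>

  int main(void)
  {
      unsigned long long base = 0x800000000000ULL; /* TASK_SIZE_MAX + PAGE_SIZE */
      short offs[] = { 16, -16 };

      for (int i = 0; i < 2; i++) {
          short off = offs[i];
          unsigned long long old_limit = base + llabs(off); /* before this patch */
          unsigned long long new_limit = base - off;        /* after this patch  */

          printf("off=%d old=%#llx new=%#llx\n", off, old_limit, new_limit);
      }
      return 0;
  }

For off = 16 the old limit (0x800000000010) differs from the one we
actually want (0x7ffffffffff0), i.e. the emitted comparison wasn't testing
the address being loaded; for off = -16 the two limits happen to coincide,
so for negative offsets the real problem is Check 2 below.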

For Check 2, "does not overflow to a small positive number" is currently
confirmed by copying src_reg into a scratch register, emitting an
'add insn->off' on the copy, and checking the carry flag. While this works
fine for a positive insn->off, a small negative insn->off like -16 is
encoded as a huge unsigned addend, so the addition is almost guaranteed to
set the carry flag for any kernel address and the check wrongly rejects
the load.
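
For example, with a made-up kernel address (0xffff888012345670 is purely
illustrative):

  #include <stdio.h>
  #include <stdint.h>

  int main(void)
  {
      uint64_t src = 0xffff888012345670ULL;     /* hypothetical kernel addr */
      uint64_t addend = (uint64_t)(int64_t)-16; /* 0xfffffffffffffff0 */
      uint64_t sum = src + addend;              /* wraps past 2^64 */
      int carry = sum < src;                    /* CF after the x86 'add' */

      /* carry == 1 here, so the old JNC is not taken, we fall through to
       * the xor, and the destination register is zeroed even though
       * src - 16 is a perfectly good kernel address.
       */
      printf("sum=%#llx carry=%d\n", (unsigned long long)sum, carry);
      return 0;
  }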

This patch addresses this by not doing Check 2 at BPF prog runtime when
insn->off is negative, and instead doing a stronger check at JIT time.
The logic supporting this is as follows:

1) Assume insn->off is negative, call the largest such negative offset
   MAX_NEGATIVE_OFF. So insn->off >= MAX_NEGATIVE_OFF for all possible
   insn->off.

2) *src_reg + insn->off will not wrap over to an unexpected address by
   virtue of negative insn->off, but it might wrap under if
   -insn->off > *src_reg, as that implies *src_reg + insn->off < 0

3) Inequality (TASK_SIZE_MAX + PAGE_SIZE - insn->off) > (TASK_SIZE_MAX + PAGE_SIZE)
   must be true since insn->off is negative.

4) If we've completed check 1, we know that
   src_reg >= (TASK_SIZE_MAX + PAGE_SIZE - insn->off)

5) Combining statements 3 and 4, we know src_reg > (TASK_SIZE_MAX + PAGE_SIZE)

6) By statements 1, 4, and 5, if we can prove
   (TASK_SIZE_MAX + PAGE_SIZE) > -MAX_NEGATIVE_OFF, we'll know that
   (TASK_SIZE_MAX + PAGE_SIZE) > -insn->off for all possible insn->off
   values. We can rewrite this as (TASK_SIZE_MAX + PAGE_SIZE) +
   MAX_NEGATIVE_OFF > 0.

   Since src_reg > TASK_SIZE_MAX + PAGE_SIZE and MAX_NEGATIVE_OFF is
   negative, if the previous inequality is true,
   src_reg + MAX_NEGATIVE_OFF > 0 is also true for all src_reg values.
   Similarly, since insn->off >= MAX_NEGATIVE_OFF for all possible
   negative insn->off vals, src_reg + insn->off > 0 and there can be no
   wrapping under.

So proving (TASK_SIZE_MAX + PAGE_SIZE) + MAX_NEGATIVE_OFF > 0 implies
*src_reg + insn->off > 0 for any src_reg that's passed check 1 and any
negative insn->off. Luckily the former inequality does not need to be
checked at runtime, and in fact could be a static_assert if
TASK_SIZE_MAX wasn't determined by a function when CONFIG_X86_5LEVEL
kconfig is used.
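
If TASK_SIZE_MAX were a constant expression, the check could be written as
something like the sketch below (the *_EXAMPLE stand-ins are illustrative
values consistent with the disassembly further down; the patch instead does
the equivalent test at runtime in check_probe_mem_task_size_overflow()):

  #include <assert.h>

  #define TASK_SIZE_MAX_EXAMPLE 0x7ffffffff000ULL
  #define PAGE_SIZE_EXAMPLE     0x1000ULL
  #define S16_MIN_EXAMPLE       (-32768)

  static_assert((TASK_SIZE_MAX_EXAMPLE + PAGE_SIZE_EXAMPLE) > -(long long)S16_MIN_EXAMPLE,
                "(TASK_SIZE_MAX + PAGE_SIZE) + (max negative insn->off) must be > 0");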

Regardless, we can just check (TASK_SIZE_MAX + PAGE_SIZE) +
MAX_NEGATIVE_OFF > 0 once per do_jit call instead of emitting a runtime
check. Given that insn->off is a s16 and is unlikely to grow larger,
this check should always succeed on any x86 processor made in the 21st
century. If it doesn't, fail all do_jit calls and complain loudly, on the
assumption that the BPF subsystem is misconfigured or has a bug.

A few instructions are saved for negative insn->offs as a result. Using
the struct example_node / off = -16 example from before, code looks
like:

BEFORE CHANGE
  72:   movabs $0x800000000010,%r11
  7c:   cmp    %r11,%rdi
  7f:   jb     0x000000000000008d         (check 1 on 7c and here)
  81:   mov    %rdi,%r11
  84:   add    $0xfffffffffffffff0,%r11   (check 2, will set carry for almost any r11, so bug for
  8b:   jae    0x0000000000000091          negative insn->off)
  8d:   xor    %edi,%edi                  (as a result long key = n->key; will be 0'd out here)
  8f:   jmp    0x0000000000000095
  91:   mov    -0x10(%rdi),%rdi
  95:

AFTER CHANGE:
  5a:   movabs $0x800000000010,%r11
  64:   cmp    %r11,%rdi
  67:   jae    0x000000000000006d     (check 1 on 64 and here, but now JNC instead of JC)
  69:   xor    %edi,%edi              (no check 2, 0 out if %rdi - %r11 < 0)
  6b:   jmp    0x0000000000000071
  6d:   mov    -0x10(%rdi),%rdi
  71:

We could do the same for insn->off == 0, but for now keep code
generation unchanged for previously working nonnegative insn->offs.

Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
---
 arch/x86/net/bpf_jit_comp.c | 123 +++++++++++++++++++++++++++---------
 1 file changed, 92 insertions(+), 31 deletions(-)

Comments

Alexei Starovoitov Dec. 7, 2022, 2:39 a.m. UTC | #1
On Tue, Dec 06, 2022 at 03:09:57PM -0800, Dave Marchevsky wrote:
> Current comment in BPF_PROBE_MEM jit code claims that verifier prevents
> insn->off < 0, but this appears to not be true irrespective of changes
> in this series. Regardless, changes in this series will result in an
> example like:
> 
>   struct example_node {
>     long key;
>     long val;
>     struct bpf_rb_node node;
>   }
> 
>   /* In BPF prog, assume root contains example_node nodes */
>   struct bpf_rb_node res = bpf_rbtree_first(&root);
>   if (!res)
>     return 1;
> 
>   struct example_node n = container_of(res, struct example_node, node);
>   long key = n->key;
> 
> Resulting in a load with off = -16, as bpf_rbtree_first's return is

Looks like the bug in the previous patch:
+                       } else if (meta.func_id == special_kfunc_list[KF_bpf_rbtree_remove] ||
+                                  meta.func_id == special_kfunc_list[KF_bpf_rbtree_first]) {
+                               struct btf_field *field = meta.arg_rbtree_root.field;
+
+                               mark_reg_datastructure_node(regs, BPF_REG_0,
+                                                           &field->datastructure_head);

The R0 .off should have been:
 regs[BPF_REG_0].off = field->rb_node.node_offset;

node, not root.

PTR_TO_BTF_ID should have been returned with an appropriate 'off',
so that container_of() would bring it back to zero offset.

The approach of returning untrusted from bpf_rbtree_first is questionable.
Without doing that this issue would not have surfaced.

All PTR_TO_BTF_ID need to have positive offset.
I'm not sure btf_struct_walk() and other PTR_TO_BTF_ID accessors
can deal with negative offsets.
There could be all kinds of things to fix.

> modified by verifier to be PTR_TO_BTF_ID of example_node w/ offset =
> offsetof(struct example_node, node), instead of PTR_TO_BTF_ID of
> bpf_rb_node. So it's necessary to support negative insn->off when
> jitting BPF_PROBE_MEM.

I'm not convinced it's necessary.
container_of() seems to be the only case where bpf prog can convert
PTR_TO_BTF_ID with off >= 0 to negative off.
Normal pointer walking will not make it negative.

> In order to ensure that page fault for a BPF_PROBE_MEM load of *src_reg +
> insn->off is safely handled, we must confirm that *src_reg + insn->off is
> in kernel's memory. Two runtime checks are emitted to confirm that:
> 
>   1) (*src_reg + insn->off) > boundary between user and kernel address
>   spaces
>   2) (*src_reg + insn->off) does not overflow to a small positive
>   number. This might happen if some function meant to set src_reg
>   returns ERR_PTR(-EINVAL) or similar.
> 
> Check 1 currently is sligtly off - it compares a
> 
>   u64 limit = TASK_SIZE_MAX + PAGE_SIZE + abs(insn->off);
> 
> to *src_reg, aborting the load if limit is larger. Rewriting this as an
> inequality:
> 
>   *src_reg > TASK_SIZE_MAX + PAGE_SIZE + abs(insn->off)
>   *src_reg - abs(insn->off) > TASK_SIZE_MAX + PAGE_SIZE
> 
> shows that this isn't quite right even if insn->off is positive, as we
> really want:
> 
>   *src_reg + insn->off > TASK_SIZE_MAX + PAGE_SIZE
>   *src_reg > TASK_SIZE_MAX + PAGE_SIZE - insn_off
> 
> Since *src_reg + insn->off is the address we'll be loading from, not
> *src_reg - insn->off or *src_reg - abs(insn->off). So change the
> subtraction to an addition and remove the abs(), as comment indicates
> that it was only added to ignore negative insn->off.
> 
> For Check 2, currently "does not overflow to a small positive number" is
> confirmed by emitting an 'add insn->off, src_reg' instruction and
> checking for carry flag. While this works fine for a positive insn->off,
> a small negative insn->off like -16 is almost guaranteed to wrap over to
> a small positive number when added to any kernel address.
> 
> This patch addresses this by not doing Check 2 at BPF prog runtime when
> insn->off is negative, rather doing a stronger check at JIT-time. The
> logic supporting this is as follows:
> 
> 1) Assume insn->off is negative, call the largest such negative offset
>    MAX_NEGATIVE_OFF. So insn->off >= MAX_NEGATIVE_OFF for all possible
>    insn->off.
> 
> 2) *src_reg + insn->off will not wrap over to an unexpected address by
>    virtue of negative insn->off, but it might wrap under if
>    -insn->off > *src_reg, as that implies *src_reg + insn->off < 0
> 
> 3) Inequality (TASK_SIZE_MAX + PAGE_SIZE - insn->off) > (TASK_SIZE_MAX + PAGE_SIZE)
>    must be true since insn->off is negative.
> 
> 4) If we've completed check 1, we know that
>    src_reg >= (TASK_SIZE_MAX + PAGE_SIZE - insn->off)
> 
> 5) Combining statements 3 and 4, we know src_reg > (TASK_SIZE_MAX + PAGE_SIZE)
> 
> 6) By statements 1, 4, and 5, if we can prove
>    (TASK_SIZE_MAX + PAGE_SIZE) > -MAX_NEGATIVE_OFF, we'll know that
>    (TASK_SIZE_MAX + PAGE_SIZE) > -insn->off for all possible insn->off
>    values. We can rewrite this as (TASK_SIZE_MAX + PAGE_SIZE) +
>    MAX_NEGATIVE_OFF > 0.
> 
>    Since src_reg > TASK_SIZE_MAX + PAGE_SIZE and MAX_NEGATIVE_OFF is
>    negative, if the previous inequality is true,
>    src_reg + MAX_NEGATIVE_OFF > 0 is also true for all src_reg values.
>    Similarly, since insn->off >= MAX_NEGATIVE_OFF for all possible
>    negative insn->off vals, src_reg + insn->off > 0 and there can be no
>    wrapping under.
> 
> So proving (TASK_SIZE_MAX + PAGE_SIZE) + MAX_NEGATIVE_OFF > 0 implies
> *src_reg + insn->off > 0 for any src_reg that's passed check 1 and any
> negative insn->off. Luckily the former inequality does not need to be
> checked at runtime, and in fact could be a static_assert if
> TASK_SIZE_MAX wasn't determined by a function when CONFIG_X86_5LEVEL
> kconfig is used.
> 
> Regardless, we can just check (TASK_SIZE_MAX + PAGE_SIZE) +
> MAX_NEGATIVE_OFF > 0 once per do_jit call instead of emitting a runtime
> check. Given that insn->off is a s16 and is unlikely to grow larger,
> this check should always succeed on any x86 processor made in the 21st
> century. If it doesn't fail all do_jit calls and complain loudly with
> the assumption that the BPF subsystem is misconfigured or has a bug.
> 
> A few instructions are saved for negative insn->offs as a result. Using
> the struct example_node / off = -16 example from before, code looks
> like:

This is quite complex to review. I couldn't convince myself
that droping 2nd check is safe, but don't have an argument to
prove that it's not safe.
Let's get to these details when there is need to support negative off.

> 
> BEFORE CHANGE
>   72:   movabs $0x800000000010,%r11
>   7c:   cmp    %r11,%rdi
>   7f:   jb     0x000000000000008d         (check 1 on 7c and here)
>   81:   mov    %rdi,%r11
>   84:   add    $0xfffffffffffffff0,%r11   (check 2, will set carry for almost any r11, so bug for
>   8b:   jae    0x0000000000000091          negative insn->off)
>   8d:   xor    %edi,%edi                  (as a result long key = n->key; will be 0'd out here)
>   8f:   jmp    0x0000000000000095
>   91:   mov    -0x10(%rdi),%rdi
>   95:
> 
> AFTER CHANGE:
>   5a:   movabs $0x800000000010,%r11
>   64:   cmp    %r11,%rdi
>   67:   jae    0x000000000000006d     (check 1 on 64 and here, but now JNC instead of JC)
>   69:   xor    %edi,%edi              (no check 2, 0 out if %rdi - %r11 < 0)
>   6b:   jmp    0x0000000000000071
>   6d:   mov    -0x10(%rdi),%rdi
>   71:
> 
> We could do the same for insn->off == 0, but for now keep code
> generation unchanged for previously working nonnegative insn->offs.
> 
> Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
> ---
>  arch/x86/net/bpf_jit_comp.c | 123 +++++++++++++++++++++++++++---------
>  1 file changed, 92 insertions(+), 31 deletions(-)
> 
> diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
> index 36ffe67ad6e5..843f619d0d35 100644
> --- a/arch/x86/net/bpf_jit_comp.c
> +++ b/arch/x86/net/bpf_jit_comp.c
> @@ -11,6 +11,7 @@
>  #include <linux/bpf.h>
>  #include <linux/memory.h>
>  #include <linux/sort.h>
> +#include <linux/limits.h>
>  #include <asm/extable.h>
>  #include <asm/set_memory.h>
>  #include <asm/nospec-branch.h>
> @@ -94,6 +95,7 @@ static int bpf_size_to_x86_bytes(int bpf_size)
>   */
>  #define X86_JB  0x72
>  #define X86_JAE 0x73
> +#define X86_JNC 0x73
>  #define X86_JE  0x74
>  #define X86_JNE 0x75
>  #define X86_JBE 0x76
> @@ -950,6 +952,36 @@ static void emit_shiftx(u8 **pprog, u32 dst_reg, u8 src_reg, bool is64, u8 op)
>  	*pprog = prog;
>  }
>  
> +/* Check that condition necessary for PROBE_MEM handling for insn->off < 0
> + * holds.
> + *
> + * This could be a static_assert((TASK_SIZE_MAX + PAGE_SIZE) > -S16_MIN),
> + * but TASK_SIZE_MAX can't always be evaluated at compile time, so let's not
> + * assume insn->off size either
> + */
> +static int check_probe_mem_task_size_overflow(void)
> +{
> +	struct bpf_insn insn;
> +	s64 max_negative;
> +
> +	switch (sizeof(insn.off)) {
> +	case 2:
> +		max_negative = S16_MIN;
> +		break;
> +	default:
> +		pr_err("bpf_jit_error: unexpected bpf_insn->off size\n");
> +		return -EFAULT;
> +	}
> +
> +	if (!((TASK_SIZE_MAX + PAGE_SIZE) > -max_negative)) {
> +		pr_err("bpf jit error: assumption does not hold:\n");
> +		pr_err("\t(TASK_SIZE_MAX + PAGE_SIZE) + (max negative insn->off) > 0\n");
> +		return -EFAULT;
> +	}
> +
> +	return 0;
> +}
> +
>  #define INSN_SZ_DIFF (((addrs[i] - addrs[i - 1]) - (prog - temp)))
>  
>  static int do_jit(struct bpf_prog *bpf_prog, int *addrs, u8 *image, u8 *rw_image,
> @@ -967,6 +999,10 @@ static int do_jit(struct bpf_prog *bpf_prog, int *addrs, u8 *image, u8 *rw_image
>  	u8 *prog = temp;
>  	int err;
>  
> +	err = check_probe_mem_task_size_overflow();
> +	if (err)
> +		return err;
> +
>  	detect_reg_usage(insn, insn_cnt, callee_regs_used,
>  			 &tail_call_seen);
>  
> @@ -1359,20 +1395,30 @@ st:			if (is_imm8(insn->off))
>  		case BPF_LDX | BPF_MEM | BPF_DW:
>  		case BPF_LDX | BPF_PROBE_MEM | BPF_DW:
>  			if (BPF_MODE(insn->code) == BPF_PROBE_MEM) {
> -				/* Though the verifier prevents negative insn->off in BPF_PROBE_MEM
> -				 * add abs(insn->off) to the limit to make sure that negative
> -				 * offset won't be an issue.
> -				 * insn->off is s16, so it won't affect valid pointers.
> -				 */
> -				u64 limit = TASK_SIZE_MAX + PAGE_SIZE + abs(insn->off);
> -				u8 *end_of_jmp1, *end_of_jmp2;
> -
>  				/* Conservatively check that src_reg + insn->off is a kernel address:
> -				 * 1. src_reg + insn->off >= limit
> -				 * 2. src_reg + insn->off doesn't become small positive.
> -				 * Cannot do src_reg + insn->off >= limit in one branch,
> -				 * since it needs two spare registers, but JIT has only one.
> +				 * 1. src_reg + insn->off >= TASK_SIZE_MAX + PAGE_SIZE
> +				 * 2. src_reg + insn->off doesn't overflow and become small positive
> +				 *
> +				 * For check 1, to save regs, do
> +				 * src_reg >= (TASK_SIZE_MAX + PAGE_SIZE - insn->off) call rhs
> +				 * of inequality 'limit'
> +				 *
> +				 * For check 2:
> +				 * If insn->off is positive, add src_reg + insn->off and check
> +				 * overflow directly
> +				 * If insn->off is negative, we know that
> +				 *   (TASK_SIZE_MAX + PAGE_SIZE - insn->off) > (TASK_SIZE_MAX + PAGE_SIZE)
> +				 * and from check 1 we know
> +				 *   src_reg >= (TASK_SIZE_MAX + PAGE_SIZE - insn->off)
> +				 * So if (TASK_SIZE_MAX + PAGE_SIZE) + MAX_NEGATIVE_OFF > 0 we can
> +				 * be sure that src_reg + insn->off won't overflow in either
> +				 * direction and avoid runtime check entirely.
> +				 *
> +				 * check_probe_mem_task_size_overflow confirms the above assumption
> +				 * at the beginning of this function
>  				 */
> +				u64 limit = TASK_SIZE_MAX + PAGE_SIZE - insn->off;
> +				u8 *end_of_jmp1, *end_of_jmp2;
>  
>  				/* movabsq r11, limit */
>  				EMIT2(add_1mod(0x48, AUX_REG), add_1reg(0xB8, AUX_REG));
> @@ -1381,32 +1427,47 @@ st:			if (is_imm8(insn->off))
>  				/* cmp src_reg, r11 */
>  				maybe_emit_mod(&prog, src_reg, AUX_REG, true);
>  				EMIT2(0x39, add_2reg(0xC0, src_reg, AUX_REG));
> -				/* if unsigned '<' goto end_of_jmp2 */
> -				EMIT2(X86_JB, 0);
> -				end_of_jmp1 = prog;
> -
> -				/* mov r11, src_reg */
> -				emit_mov_reg(&prog, true, AUX_REG, src_reg);
> -				/* add r11, insn->off */
> -				maybe_emit_1mod(&prog, AUX_REG, true);
> -				EMIT2_off32(0x81, add_1reg(0xC0, AUX_REG), insn->off);
> -				/* jmp if not carry to start_of_ldx
> -				 * Otherwise ERR_PTR(-EINVAL) + 128 will be the user addr
> -				 * that has to be rejected.
> -				 */
> -				EMIT2(0x73 /* JNC */, 0);
> -				end_of_jmp2 = prog;
> +				if (insn->off >= 0) {
> +					/* cmp src_reg, r11 */
> +					/* if unsigned '<' goto end_of_jmp2 */
> +					EMIT2(X86_JB, 0);
> +					end_of_jmp1 = prog;
> +
> +					/* mov r11, src_reg */
> +					emit_mov_reg(&prog, true, AUX_REG, src_reg);
> +					/* add r11, insn->off */
> +					maybe_emit_1mod(&prog, AUX_REG, true);
> +					EMIT2_off32(0x81, add_1reg(0xC0, AUX_REG), insn->off);
> +					/* jmp if not carry to start_of_ldx
> +					 * Otherwise ERR_PTR(-EINVAL) + 128 will be the user addr
> +					 * that has to be rejected.
> +					 */
> +					EMIT2(X86_JNC, 0);
> +					end_of_jmp2 = prog;
> +				} else {
> +					/* cmp src_reg, r11 */
> +					/* if unsigned '>=' goto start_of_ldx
> +					 * w/o needing to do check 2
> +					 */
> +					EMIT2(X86_JAE, 0);
> +					end_of_jmp1 = prog;
> +				}
>  
>  				/* xor dst_reg, dst_reg */
>  				emit_mov_imm32(&prog, false, dst_reg, 0);
>  				/* jmp byte_after_ldx */
>  				EMIT2(0xEB, 0);
>  
> -				/* populate jmp_offset for JB above to jump to xor dst_reg */
> -				end_of_jmp1[-1] = end_of_jmp2 - end_of_jmp1;
> -				/* populate jmp_offset for JNC above to jump to start_of_ldx */
>  				start_of_ldx = prog;
> -				end_of_jmp2[-1] = start_of_ldx - end_of_jmp2;
> +				if (insn->off >= 0) {
> +					/* populate jmp_offset for JB above to jump to xor dst_reg */
> +					end_of_jmp1[-1] = end_of_jmp2 - end_of_jmp1;
> +					/* populate jmp_offset for JNC above to jump to start_of_ldx */
> +					end_of_jmp2[-1] = start_of_ldx - end_of_jmp2;
> +				} else {
> +					/* populate jmp_offset for JAE above to jump to start_of_ldx */
> +					end_of_jmp1[-1] = start_of_ldx - end_of_jmp1;
> +				}
>  			}
>  			emit_ldx(&prog, BPF_SIZE(insn->code), dst_reg, src_reg, insn->off);
>  			if (BPF_MODE(insn->code) == BPF_PROBE_MEM) {
> -- 
> 2.30.2
>
Dave Marchevsky Dec. 7, 2022, 6:46 a.m. UTC | #2
On 12/6/22 9:39 PM, Alexei Starovoitov wrote:
> On Tue, Dec 06, 2022 at 03:09:57PM -0800, Dave Marchevsky wrote:
>> Current comment in BPF_PROBE_MEM jit code claims that verifier prevents
>> insn->off < 0, but this appears to not be true irrespective of changes
>> in this series. Regardless, changes in this series will result in an
>> example like:
>>
>>   struct example_node {
>>     long key;
>>     long val;
>>     struct bpf_rb_node node;
>>   }
>>
>>   /* In BPF prog, assume root contains example_node nodes */
>>   struct bpf_rb_node res = bpf_rbtree_first(&root);
>>   if (!res)
>>     return 1;
>>
>>   struct example_node n = container_of(res, struct example_node, node);
>>   long key = n->key;
>>
>> Resulting in a load with off = -16, as bpf_rbtree_first's return is
> 
> Looks like the bug in the previous patch:
> +                       } else if (meta.func_id == special_kfunc_list[KF_bpf_rbtree_remove] ||
> +                                  meta.func_id == special_kfunc_list[KF_bpf_rbtree_first]) {
> +                               struct btf_field *field = meta.arg_rbtree_root.field;
> +
> +                               mark_reg_datastructure_node(regs, BPF_REG_0,
> +                                                           &field->datastructure_head);
> 
> The R0 .off should have been:
>  regs[BPF_REG_0].off = field->rb_node.node_offset;
> 
> node, not root.
> 
> PTR_TO_BTF_ID should have been returned with approriate 'off',
> so that container_of() would it bring back to zero offset.
> 

The root's btf_field is used to hold information about the node type. Of
specific interest to us are value_btf_id and node_offset, which
mark_reg_datastructure_node uses to set REG_0's type and offset correctly.

This "use head type to keep info about node type" strategy felt strange to me
initially too: all PTR_TO_BTF_ID regs are passing around their type info, so
why not use that to lookup bpf_rb_node field info? But consider that
bpf_rbtree_first (and bpf_list_pop_{front,back}) doesn't take a node as
input arg, so there's no opportunity to get btf_field info from input
reg type. 

So we'll need to keep this info in rbtree_root's btf_field
regardless, and since any rbtree API function that operates on a node
also operates on a root and expects its node arg to match the node
type expected by the root, might as well use root's field as the main
lookup for this info and not even have &field->rb_node for now.
All __process_kf_arg_ptr_to_datastructure_node calls (added earlier
in the series) use the &meta->arg_{list_head,rbtree_root}.field for same
reason.

So it's setting the reg offset correctly.

> All PTR_TO_BTF_ID need to have positive offset.
> I'm not sure btf_struct_walk() and other PTR_TO_BTF_ID accessors
> can deal with negative offsets.
> There could be all kinds of things to fix.

I think you may be conflating reg offset and insn offset here. None of the
changes in this series result in a PTR_TO_BTF_ID reg w/ negative offset
being returned. But LLVM may generate load insns with a negative offset,
and since we're passing around pointers to bpf_rb_node that may come
after useful data fields in a type, this will happen more often.

Consider this small example from selftests in this series:

struct node_data {
  long key;
  long data;
  struct bpf_rb_node node;
};

static bool less(struct bpf_rb_node *a, const struct bpf_rb_node *b)
{
        struct node_data *node_a;
        struct node_data *node_b;

        node_a = container_of(a, struct node_data, node);
        node_b = container_of(b, struct node_data, node);

        return node_a->key < node_b->key;
}

llvm-objdump shows this bpf bytecode for 'less':

0000000000000000 <less>:
;       return node_a->key < node_b->key;
       0:       79 22 f0 ff 00 00 00 00 r2 = *(u64 *)(r2 - 0x10)
       1:       79 11 f0 ff 00 00 00 00 r1 = *(u64 *)(r1 - 0x10)
       2:       b4 00 00 00 01 00 00 00 w0 = 0x1
;       return node_a->key < node_b->key;
       3:       cd 21 01 00 00 00 00 00 if r1 s< r2 goto +0x1 <LBB2_2>
       4:       b4 00 00 00 00 00 00 00 w0 = 0x0

0000000000000028 <LBB2_2>:
;       return node_a->key < node_b->key;
       5:       95 00 00 00 00 00 00 00 exit

Insns 0 and 1 are loading node_b->key and node_a->key, respectively, using
negative insn->off. Verifier's view of R1 and R2 before insn 0 is
untrusted_ptr_node_data(off=16). If there were some intermediate insns
storing result of container_of() before dereferencing:

  r3 = (r2 - 0x10)
  r2 = *(u64 *)(r3)

Verifier would see R3 as untrusted_ptr_node_data(off=0), and load for
r2 would have insn->off = 0. But LLVM decides to just do a load-with-offset
using original arg ptrs to less() instead of storing container_of() ptr
adjustments.

Since the container_of usage and code pattern in the above example's less()
aren't particularly specific to this series, I think there are other
scenarios where such code would be generated, and I considered this a
general bugfix in the cover letter.

[ below paragraph was moved here, it originally preceded "All PTR_TO_BTF_ID"
  paragraph ]

> The apporach of returning untrusted from bpf_rbtree_first is questionable.
> Without doing that this issue would not have surfaced.
> 

I agree re: PTR_UNTRUSTED, but note that my earlier example doesn't involve
bpf_rbtree_first. Regardless, I think the issue is that PTR_UNTRUSTED is
used to denote a few separate traits of a PTR_TO_BTF_ID reg:

  * "I have no ownership over the thing I'm pointing to"
  * "My backing memory may go away at any time"
  * "Access to my fields might result in page fault"
  * "Kfuncs shouldn't accept me as an arg"

Seems like original PTR_UNTRUSTED usage really wanted to denote the first
point and the others were just naturally implied from the first. But
as you've noted there are some things using PTR_UNTRUSTED that really
want to make more granular statements:

ref_set_release_on_unlock logic sets release_on_unlock = true and adds
PTR_UNTRUSTED to the reg type. In this case PTR_UNTRUSTED is trying to say:

  * "I have no ownership over the thing I'm pointing to"
  * "My backing memory may go away at any time _after_ bpf_spin_unlock"
    * Before spin_unlock it's guaranteed to be valid
  * "Kfuncs shouldn't accept me as an arg"
    * We don't want arbitrary kfunc saving and accessing release_on_unlock
      reg after bpf_spin_unlock, as its backing memory can go away any time
      after spin_unlock.

The "backing memory" statement PTR_UNTRUSTED is making is a blunt superset
of what release_on_unlock really needs.

For less() callback we just want

  * "I have no ownership over the thing I'm pointing to"
  * "Kfuncs shouldn't accept me as an arg"

There is probably a way to decompose PTR_UNTRUSTED into a few flags such that
it's possible to denote these things separately and avoid unwanted additional
behavior. But after talking to David Vernet about current complexity of
PTR_TRUSTED and PTR_UNTRUSTED logic and his desire to refactor, it seemed
better to continue with PTR_UNTRUSTED blunt instrument with a bit of
special casing for now, instead of piling on more flags.
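
Purely as an illustration of what such a decomposition might look like
(these flag names are hypothetical, nothing like this exists in the
verifier today):

enum ptr_trust_aspect {
        PTR_NO_OWNERSHIP   = 1 << 0, /* no ownership over the pointee */
        PTR_MEM_MAY_VANISH = 1 << 1, /* backing memory may go away at any time */
        PTR_MAY_FAULT      = 1 << 2, /* field access might page-fault */
        PTR_NO_KFUNC_ARG   = 1 << 3, /* kfuncs shouldn't accept it as an arg */
};

/* today's PTR_UNTRUSTED would be all four;
 * release_on_unlock regs really want NO_OWNERSHIP | NO_KFUNC_ARG, with the
 *   backing memory guaranteed valid until bpf_spin_unlock;
 * less() callback args want NO_OWNERSHIP | NO_KFUNC_ARG only.
 */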

> 
>> modified by verifier to be PTR_TO_BTF_ID of example_node w/ offset =
>> offsetof(struct example_node, node), instead of PTR_TO_BTF_ID of
>> bpf_rb_node. So it's necessary to support negative insn->off when
>> jitting BPF_PROBE_MEM.
> 
> I'm not convinced it's necessary.
> container_of() seems to be the only case where bpf prog can convert
> PTR_TO_BTF_ID with off >= 0 to negative off.
> Normal pointer walking will not make it negative.
> 

I see what you mean - if some non-container_of case resulted in load generation
with negative insn->off, this probably would've been noticed already. But
hopefully my replies above explain why it should be addressed now.

>> In order to ensure that page fault for a BPF_PROBE_MEM load of *src_reg +
>> insn->off is safely handled, we must confirm that *src_reg + insn->off is
>> in kernel's memory. Two runtime checks are emitted to confirm that:
>>
>>   1) (*src_reg + insn->off) > boundary between user and kernel address
>>   spaces
>>   2) (*src_reg + insn->off) does not overflow to a small positive
>>   number. This might happen if some function meant to set src_reg
>>   returns ERR_PTR(-EINVAL) or similar.
>>
>> Check 1 currently is sligtly off - it compares a
>>
>>   u64 limit = TASK_SIZE_MAX + PAGE_SIZE + abs(insn->off);
>>
>> to *src_reg, aborting the load if limit is larger. Rewriting this as an
>> inequality:
>>
>>   *src_reg > TASK_SIZE_MAX + PAGE_SIZE + abs(insn->off)
>>   *src_reg - abs(insn->off) > TASK_SIZE_MAX + PAGE_SIZE
>>
>> shows that this isn't quite right even if insn->off is positive, as we
>> really want:
>>
>>   *src_reg + insn->off > TASK_SIZE_MAX + PAGE_SIZE
>>   *src_reg > TASK_SIZE_MAX + PAGE_SIZE - insn_off
>>
>> Since *src_reg + insn->off is the address we'll be loading from, not
>> *src_reg - insn->off or *src_reg - abs(insn->off). So change the
>> subtraction to an addition and remove the abs(), as comment indicates
>> that it was only added to ignore negative insn->off.
>>
>> For Check 2, currently "does not overflow to a small positive number" is
>> confirmed by emitting an 'add insn->off, src_reg' instruction and
>> checking for carry flag. While this works fine for a positive insn->off,
>> a small negative insn->off like -16 is almost guaranteed to wrap over to
>> a small positive number when added to any kernel address.
>>
>> This patch addresses this by not doing Check 2 at BPF prog runtime when
>> insn->off is negative, rather doing a stronger check at JIT-time. The
>> logic supporting this is as follows:
>>
>> 1) Assume insn->off is negative, call the largest such negative offset
>>    MAX_NEGATIVE_OFF. So insn->off >= MAX_NEGATIVE_OFF for all possible
>>    insn->off.
>>
>> 2) *src_reg + insn->off will not wrap over to an unexpected address by
>>    virtue of negative insn->off, but it might wrap under if
>>    -insn->off > *src_reg, as that implies *src_reg + insn->off < 0
>>
>> 3) Inequality (TASK_SIZE_MAX + PAGE_SIZE - insn->off) > (TASK_SIZE_MAX + PAGE_SIZE)
>>    must be true since insn->off is negative.
>>
>> 4) If we've completed check 1, we know that
>>    src_reg >= (TASK_SIZE_MAX + PAGE_SIZE - insn->off)
>>
>> 5) Combining statements 3 and 4, we know src_reg > (TASK_SIZE_MAX + PAGE_SIZE)
>>
>> 6) By statements 1, 4, and 5, if we can prove
>>    (TASK_SIZE_MAX + PAGE_SIZE) > -MAX_NEGATIVE_OFF, we'll know that
>>    (TASK_SIZE_MAX + PAGE_SIZE) > -insn->off for all possible insn->off
>>    values. We can rewrite this as (TASK_SIZE_MAX + PAGE_SIZE) +
>>    MAX_NEGATIVE_OFF > 0.
>>
>>    Since src_reg > TASK_SIZE_MAX + PAGE_SIZE and MAX_NEGATIVE_OFF is
>>    negative, if the previous inequality is true,
>>    src_reg + MAX_NEGATIVE_OFF > 0 is also true for all src_reg values.
>>    Similarly, since insn->off >= MAX_NEGATIVE_OFF for all possible
>>    negative insn->off vals, src_reg + insn->off > 0 and there can be no
>>    wrapping under.
>>
>> So proving (TASK_SIZE_MAX + PAGE_SIZE) + MAX_NEGATIVE_OFF > 0 implies
>> *src_reg + insn->off > 0 for any src_reg that's passed check 1 and any
>> negative insn->off. Luckily the former inequality does not need to be
>> checked at runtime, and in fact could be a static_assert if
>> TASK_SIZE_MAX wasn't determined by a function when CONFIG_X86_5LEVEL
>> kconfig is used.
>>
>> Regardless, we can just check (TASK_SIZE_MAX + PAGE_SIZE) +
>> MAX_NEGATIVE_OFF > 0 once per do_jit call instead of emitting a runtime
>> check. Given that insn->off is a s16 and is unlikely to grow larger,
>> this check should always succeed on any x86 processor made in the 21st
>> century. If it doesn't fail all do_jit calls and complain loudly with
>> the assumption that the BPF subsystem is misconfigured or has a bug.
>>
>> A few instructions are saved for negative insn->offs as a result. Using
>> the struct example_node / off = -16 example from before, code looks
>> like:
> 
> This is quite complex to review. I couldn't convince myself
> that droping 2nd check is safe, but don't have an argument to
> prove that it's not safe.
> Let's get to these details when there is need to support negative off.
> 

Hopefully the above explanation shows that there's a need to support it now.
I will try to simplify and rephrase the summary to make it easier to follow,
but will prioritize addressing feedback in less complex patches, so this
patch may not change for a few respins.

>>
>> BEFORE CHANGE
>>   72:   movabs $0x800000000010,%r11
>>   7c:   cmp    %r11,%rdi
>>   7f:   jb     0x000000000000008d         (check 1 on 7c and here)
>>   81:   mov    %rdi,%r11
>>   84:   add    $0xfffffffffffffff0,%r11   (check 2, will set carry for almost any r11, so bug for
>>   8b:   jae    0x0000000000000091          negative insn->off)
>>   8d:   xor    %edi,%edi                  (as a result long key = n->key; will be 0'd out here)
>>   8f:   jmp    0x0000000000000095
>>   91:   mov    -0x10(%rdi),%rdi
>>   95:
>>
>> AFTER CHANGE:
>>   5a:   movabs $0x800000000010,%r11
>>   64:   cmp    %r11,%rdi
>>   67:   jae    0x000000000000006d     (check 1 on 64 and here, but now JNC instead of JC)
>>   69:   xor    %edi,%edi              (no check 2, 0 out if %rdi - %r11 < 0)
>>   6b:   jmp    0x0000000000000071
>>   6d:   mov    -0x10(%rdi),%rdi
>>   71:
>>
>> We could do the same for insn->off == 0, but for now keep code
>> generation unchanged for previously working nonnegative insn->offs.
>>
>> Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
>> ---
>>  arch/x86/net/bpf_jit_comp.c | 123 +++++++++++++++++++++++++++---------
>>  1 file changed, 92 insertions(+), 31 deletions(-)
>>
>> diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
>> index 36ffe67ad6e5..843f619d0d35 100644
>> --- a/arch/x86/net/bpf_jit_comp.c
>> +++ b/arch/x86/net/bpf_jit_comp.c
>> @@ -11,6 +11,7 @@
>>  #include <linux/bpf.h>
>>  #include <linux/memory.h>
>>  #include <linux/sort.h>
>> +#include <linux/limits.h>
>>  #include <asm/extable.h>
>>  #include <asm/set_memory.h>
>>  #include <asm/nospec-branch.h>
>> @@ -94,6 +95,7 @@ static int bpf_size_to_x86_bytes(int bpf_size)
>>   */
>>  #define X86_JB  0x72
>>  #define X86_JAE 0x73
>> +#define X86_JNC 0x73
>>  #define X86_JE  0x74
>>  #define X86_JNE 0x75
>>  #define X86_JBE 0x76
>> @@ -950,6 +952,36 @@ static void emit_shiftx(u8 **pprog, u32 dst_reg, u8 src_reg, bool is64, u8 op)
>>  	*pprog = prog;
>>  }
>>  
>> +/* Check that condition necessary for PROBE_MEM handling for insn->off < 0
>> + * holds.
>> + *
>> + * This could be a static_assert((TASK_SIZE_MAX + PAGE_SIZE) > -S16_MIN),
>> + * but TASK_SIZE_MAX can't always be evaluated at compile time, so let's not
>> + * assume insn->off size either
>> + */
>> +static int check_probe_mem_task_size_overflow(void)
>> +{
>> +	struct bpf_insn insn;
>> +	s64 max_negative;
>> +
>> +	switch (sizeof(insn.off)) {
>> +	case 2:
>> +		max_negative = S16_MIN;
>> +		break;
>> +	default:
>> +		pr_err("bpf_jit_error: unexpected bpf_insn->off size\n");
>> +		return -EFAULT;
>> +	}
>> +
>> +	if (!((TASK_SIZE_MAX + PAGE_SIZE) > -max_negative)) {
>> +		pr_err("bpf jit error: assumption does not hold:\n");
>> +		pr_err("\t(TASK_SIZE_MAX + PAGE_SIZE) + (max negative insn->off) > 0\n");
>> +		return -EFAULT;
>> +	}
>> +
>> +	return 0;
>> +}
>> +
>>  #define INSN_SZ_DIFF (((addrs[i] - addrs[i - 1]) - (prog - temp)))
>>  
>>  static int do_jit(struct bpf_prog *bpf_prog, int *addrs, u8 *image, u8 *rw_image,
>> @@ -967,6 +999,10 @@ static int do_jit(struct bpf_prog *bpf_prog, int *addrs, u8 *image, u8 *rw_image
>>  	u8 *prog = temp;
>>  	int err;
>>  
>> +	err = check_probe_mem_task_size_overflow();
>> +	if (err)
>> +		return err;
>> +
>>  	detect_reg_usage(insn, insn_cnt, callee_regs_used,
>>  			 &tail_call_seen);
>>  
>> @@ -1359,20 +1395,30 @@ st:			if (is_imm8(insn->off))
>>  		case BPF_LDX | BPF_MEM | BPF_DW:
>>  		case BPF_LDX | BPF_PROBE_MEM | BPF_DW:
>>  			if (BPF_MODE(insn->code) == BPF_PROBE_MEM) {
>> -				/* Though the verifier prevents negative insn->off in BPF_PROBE_MEM
>> -				 * add abs(insn->off) to the limit to make sure that negative
>> -				 * offset won't be an issue.
>> -				 * insn->off is s16, so it won't affect valid pointers.
>> -				 */
>> -				u64 limit = TASK_SIZE_MAX + PAGE_SIZE + abs(insn->off);
>> -				u8 *end_of_jmp1, *end_of_jmp2;
>> -
>>  				/* Conservatively check that src_reg + insn->off is a kernel address:
>> -				 * 1. src_reg + insn->off >= limit
>> -				 * 2. src_reg + insn->off doesn't become small positive.
>> -				 * Cannot do src_reg + insn->off >= limit in one branch,
>> -				 * since it needs two spare registers, but JIT has only one.
>> +				 * 1. src_reg + insn->off >= TASK_SIZE_MAX + PAGE_SIZE
>> +				 * 2. src_reg + insn->off doesn't overflow and become small positive
>> +				 *
>> +				 * For check 1, to save regs, do
>> +				 * src_reg >= (TASK_SIZE_MAX + PAGE_SIZE - insn->off) call rhs
>> +				 * of inequality 'limit'
>> +				 *
>> +				 * For check 2:
>> +				 * If insn->off is positive, add src_reg + insn->off and check
>> +				 * overflow directly
>> +				 * If insn->off is negative, we know that
>> +				 *   (TASK_SIZE_MAX + PAGE_SIZE - insn->off) > (TASK_SIZE_MAX + PAGE_SIZE)
>> +				 * and from check 1 we know
>> +				 *   src_reg >= (TASK_SIZE_MAX + PAGE_SIZE - insn->off)
>> +				 * So if (TASK_SIZE_MAX + PAGE_SIZE) + MAX_NEGATIVE_OFF > 0 we can
>> +				 * be sure that src_reg + insn->off won't overflow in either
>> +				 * direction and avoid runtime check entirely.
>> +				 *
>> +				 * check_probe_mem_task_size_overflow confirms the above assumption
>> +				 * at the beginning of this function
>>  				 */
>> +				u64 limit = TASK_SIZE_MAX + PAGE_SIZE - insn->off;
>> +				u8 *end_of_jmp1, *end_of_jmp2;
>>  
>>  				/* movabsq r11, limit */
>>  				EMIT2(add_1mod(0x48, AUX_REG), add_1reg(0xB8, AUX_REG));
>> @@ -1381,32 +1427,47 @@ st:			if (is_imm8(insn->off))
>>  				/* cmp src_reg, r11 */
>>  				maybe_emit_mod(&prog, src_reg, AUX_REG, true);
>>  				EMIT2(0x39, add_2reg(0xC0, src_reg, AUX_REG));
>> -				/* if unsigned '<' goto end_of_jmp2 */
>> -				EMIT2(X86_JB, 0);
>> -				end_of_jmp1 = prog;
>> -
>> -				/* mov r11, src_reg */
>> -				emit_mov_reg(&prog, true, AUX_REG, src_reg);
>> -				/* add r11, insn->off */
>> -				maybe_emit_1mod(&prog, AUX_REG, true);
>> -				EMIT2_off32(0x81, add_1reg(0xC0, AUX_REG), insn->off);
>> -				/* jmp if not carry to start_of_ldx
>> -				 * Otherwise ERR_PTR(-EINVAL) + 128 will be the user addr
>> -				 * that has to be rejected.
>> -				 */
>> -				EMIT2(0x73 /* JNC */, 0);
>> -				end_of_jmp2 = prog;
>> +				if (insn->off >= 0) {
>> +					/* cmp src_reg, r11 */
>> +					/* if unsigned '<' goto end_of_jmp2 */
>> +					EMIT2(X86_JB, 0);
>> +					end_of_jmp1 = prog;
>> +
>> +					/* mov r11, src_reg */
>> +					emit_mov_reg(&prog, true, AUX_REG, src_reg);
>> +					/* add r11, insn->off */
>> +					maybe_emit_1mod(&prog, AUX_REG, true);
>> +					EMIT2_off32(0x81, add_1reg(0xC0, AUX_REG), insn->off);
>> +					/* jmp if not carry to start_of_ldx
>> +					 * Otherwise ERR_PTR(-EINVAL) + 128 will be the user addr
>> +					 * that has to be rejected.
>> +					 */
>> +					EMIT2(X86_JNC, 0);
>> +					end_of_jmp2 = prog;
>> +				} else {
>> +					/* cmp src_reg, r11 */
>> +					/* if unsigned '>=' goto start_of_ldx
>> +					 * w/o needing to do check 2
>> +					 */
>> +					EMIT2(X86_JAE, 0);
>> +					end_of_jmp1 = prog;
>> +				}
>>  
>>  				/* xor dst_reg, dst_reg */
>>  				emit_mov_imm32(&prog, false, dst_reg, 0);
>>  				/* jmp byte_after_ldx */
>>  				EMIT2(0xEB, 0);
>>  
>> -				/* populate jmp_offset for JB above to jump to xor dst_reg */
>> -				end_of_jmp1[-1] = end_of_jmp2 - end_of_jmp1;
>> -				/* populate jmp_offset for JNC above to jump to start_of_ldx */
>>  				start_of_ldx = prog;
>> -				end_of_jmp2[-1] = start_of_ldx - end_of_jmp2;
>> +				if (insn->off >= 0) {
>> +					/* populate jmp_offset for JB above to jump to xor dst_reg */
>> +					end_of_jmp1[-1] = end_of_jmp2 - end_of_jmp1;
>> +					/* populate jmp_offset for JNC above to jump to start_of_ldx */
>> +					end_of_jmp2[-1] = start_of_ldx - end_of_jmp2;
>> +				} else {
>> +					/* populate jmp_offset for JAE above to jump to start_of_ldx */
>> +					end_of_jmp1[-1] = start_of_ldx - end_of_jmp1;
>> +				}
>>  			}
>>  			emit_ldx(&prog, BPF_SIZE(insn->code), dst_reg, src_reg, insn->off);
>>  			if (BPF_MODE(insn->code) == BPF_PROBE_MEM) {
>> -- 
>> 2.30.2
>>
Alexei Starovoitov Dec. 7, 2022, 6:06 p.m. UTC | #3
On Wed, Dec 07, 2022 at 01:46:56AM -0500, Dave Marchevsky wrote:
> On 12/6/22 9:39 PM, Alexei Starovoitov wrote:
> > On Tue, Dec 06, 2022 at 03:09:57PM -0800, Dave Marchevsky wrote:
> >> Current comment in BPF_PROBE_MEM jit code claims that verifier prevents
> >> insn->off < 0, but this appears to not be true irrespective of changes
> >> in this series. Regardless, changes in this series will result in an
> >> example like:
> >>
> >>   struct example_node {
> >>     long key;
> >>     long val;
> >>     struct bpf_rb_node node;
> >>   }
> >>
> >>   /* In BPF prog, assume root contains example_node nodes */
> >>   struct bpf_rb_node res = bpf_rbtree_first(&root);
> >>   if (!res)
> >>     return 1;
> >>
> >>   struct example_node n = container_of(res, struct example_node, node);
> >>   long key = n->key;
> >>
> >> Resulting in a load with off = -16, as bpf_rbtree_first's return is
> > 
> > Looks like the bug in the previous patch:
> > +                       } else if (meta.func_id == special_kfunc_list[KF_bpf_rbtree_remove] ||
> > +                                  meta.func_id == special_kfunc_list[KF_bpf_rbtree_first]) {
> > +                               struct btf_field *field = meta.arg_rbtree_root.field;
> > +
> > +                               mark_reg_datastructure_node(regs, BPF_REG_0,
> > +                                                           &field->datastructure_head);
> > 
> > The R0 .off should have been:
> >  regs[BPF_REG_0].off = field->rb_node.node_offset;
> > 
> > node, not root.
> > 
> > PTR_TO_BTF_ID should have been returned with approriate 'off',
> > so that container_of() would it bring back to zero offset.
> > 
> 
> The root's btf_field is used to hold information about the node type. Of
> specific interest to us are value_btf_id and node_offset, which
> mark_reg_datastructure_node uses to set REG_0's type and offset correctly.
> 
> This "use head type to keep info about node type" strategy felt strange to me
> initially too: all PTR_TO_BTF_ID regs are passing around their type info, so
> why not use that to lookup bpf_rb_node field info? But consider that
> bpf_rbtree_first (and bpf_list_pop_{front,back}) doesn't take a node as
> input arg, so there's no opportunity to get btf_field info from input
> reg type. 
> 
> So we'll need to keep this info in rbtree_root's btf_field
> regardless, and since any rbtree API function that operates on a node
> also operates on a root and expects its node arg to match the node
> type expected by the root, might as well use root's field as the main
> lookup for this info and not even have &field->rb_node for now.
> All __process_kf_arg_ptr_to_datastructure_node calls (added earlier
> in the series) use the &meta->arg_{list_head,rbtree_root}.field for same
> reason.
> 
> So it's setting the reg offset correctly.

Ok. Got it. Then the commit log is incorrectly describing the failing scenario.
It's a container_of() inside bool less() that is generating negative offsets.

> > All PTR_TO_BTF_ID need to have positive offset.
> > I'm not sure btf_struct_walk() and other PTR_TO_BTF_ID accessors
> > can deal with negative offsets.
> > There could be all kinds of things to fix.
> 
> I think you may be conflating reg offset and insn offset here. None of the
> changes in this series result in a PTR_TO_BTF_ID reg w/ negative offset
> being returned. But LLVM may generate load insns with a negative offset,
> and since we're passing around pointers to bpf_rb_node that may come
> after useful data fields in a type, this will happen more often.
> 
> Consider this small example from selftests in this series:
> 
> struct node_data {
>   long key;
>   long data;
>   struct bpf_rb_node node;
> };
> 
> static bool less(struct bpf_rb_node *a, const struct bpf_rb_node *b)
> {
>         struct node_data *node_a;
>         struct node_data *node_b;
> 
>         node_a = container_of(a, struct node_data, node);
>         node_b = container_of(b, struct node_data, node);
> 
>         return node_a->key < node_b->key;
> }
> 
> llvm-objdump shows this bpf bytecode for 'less':
> 
> 0000000000000000 <less>:
> ;       return node_a->key < node_b->key;
>        0:       79 22 f0 ff 00 00 00 00 r2 = *(u64 *)(r2 - 0x10)
>        1:       79 11 f0 ff 00 00 00 00 r1 = *(u64 *)(r1 - 0x10)
>        2:       b4 00 00 00 01 00 00 00 w0 = 0x1
> ;       return node_a->key < node_b->key;

I see. That's the same bug.
The args to callback should have been PTR_TO_BTF_ID | PTR_TRUSTED with 
correct positive offset.
Then node_a = container_of(a, struct node_data, node);
would have produced correct offset into proper btf_id.

The verifier should be passing into less() the btf_id
of struct node_data instead of btf_id of struct bpf_rb_node.

>        3:       cd 21 01 00 00 00 00 00 if r1 s< r2 goto +0x1 <LBB2_2>
>        4:       b4 00 00 00 00 00 00 00 w0 = 0x0
> 
> 0000000000000028 <LBB2_2>:
> ;       return node_a->key < node_b->key;
>        5:       95 00 00 00 00 00 00 00 exit
> 
> Insns 0 and 1 are loading node_b->key and node_a->key, respectively, using
> negative insn->off. Verifier's view or R1 and R2 before insn 0 is
> untrusted_ptr_node_data(off=16). If there were some intermediate insns
> storing result of container_of() before dereferencing:
> 
>   r3 = (r2 - 0x10)
>   r2 = *(u64 *)(r3)
> 
> Verifier would see R3 as untrusted_ptr_node_data(off=0), and load for
> r2 would have insn->off = 0. But LLVM decides to just do a load-with-offset
> using original arg ptrs to less() instead of storing container_of() ptr
> adjustments.
> 
> Since the container_of usage and code pattern in above example's less()
> isn't particularly specific to this series, I think there are other scenarios
> where such code would be generated and considered this a general bugfix in
> cover letter.

imo the negative offset looks specific to two misuses of PTR_UNTRUSTED in this set.

> 
> [ below paragraph was moved here, it originally preceded "All PTR_TO_BTF_ID"
>   paragraph ]
> 
> > The apporach of returning untrusted from bpf_rbtree_first is questionable.
> > Without doing that this issue would not have surfaced.
> > 
> 
> I agree re: PTR_UNTRUSTED, but note that my earlier example doesn't involve
> bpf_rbtree_first. Regardless, I think the issue is that PTR_UNTRUSTED is
> used to denote a few separate traits of a PTR_TO_BTF_ID reg:
> 
>   * "I have no ownership over the thing I'm pointing to"
>   * "My backing memory may go away at any time"
>   * "Access to my fields might result in page fault"
>   * "Kfuncs shouldn't accept me as an arg"
> 
> Seems like original PTR_UNTRUSTED usage really wanted to denote the first
> point and the others were just naturally implied from the first. But
> as you've noted there are some things using PTR_UNTRUSTED that really
> want to make more granular statements:

I think PTR_UNTRUSTED implies all of the above. All 4 statements are connected.

> ref_set_release_on_unlock logic sets release_on_unlock = true and adds
> PTR_UNTRUSTED to the reg type. In this case PTR_UNTRUSTED is trying to say:
> 
>   * "I have no ownership over the thing I'm pointing to"
>   * "My backing memory may go away at any time _after_ bpf_spin_unlock"
>     * Before spin_unlock it's guaranteed to be valid
>   * "Kfuncs shouldn't accept me as an arg"
>     * We don't want arbitrary kfunc saving and accessing release_on_unlock
>       reg after bpf_spin_unlock, as its backing memory can go away any time
>       after spin_unlock.
> 
> The "backing memory" statement PTR_UNTRUSTED is making is a blunt superset
> of what release_on_unlock really needs.
> 
> For less() callback we just want
> 
>   * "I have no ownership over the thing I'm pointing to"
>   * "Kfuncs shouldn't accept me as an arg"
> 
> There is probably a way to decompose PTR_UNTRUSTED into a few flags such that
> it's possible to denote these things separately and avoid unwanted additional
> behavior. But after talking to David Vernet about current complexity of
> PTR_TRUSTED and PTR_UNTRUSTED logic and his desire to refactor, it seemed
> better to continue with PTR_UNTRUSTED blunt instrument with a bit of
> special casing for now, instead of piling on more flags.

Exactly. More flags will only increase the confusion.
Please try to make callback args as proper PTR_TRUSTED and disallow calling specific
rbtree kfuncs while inside this particular callback to prevent recursion.
That would solve all these issues, no?
Writing into such PTR_TRUSTED should still be allowed inside the cb, though it's bogus.

Consider less() receiving btf_id ptr_trusted of struct node_data and it contains
both link list and rbtree.
It should still be safe to operate on link list part of that node from less()
though it's not something we would ever recommend.
The kfunc call on rb tree part of struct node_data is problematic because
of recursion, right? No other safety concerns ?

> > 
> >> modified by verifier to be PTR_TO_BTF_ID of example_node w/ offset =
> >> offsetof(struct example_node, node), instead of PTR_TO_BTF_ID of
> >> bpf_rb_node. So it's necessary to support negative insn->off when
> >> jitting BPF_PROBE_MEM.
> > 
> > I'm not convinced it's necessary.
> > container_of() seems to be the only case where bpf prog can convert
> > PTR_TO_BTF_ID with off >= 0 to negative off.
> > Normal pointer walking will not make it negative.
> > 
> 
> I see what you mean - if some non-container_of case resulted in load generation
> with negative insn->off, this probably would've been noticed already. But
> hopefully my replies above explain why it should be addressed now.

Even with container_of() usage we should be passing proper btf_id of container
struct, so that callbacks and non-callbacks can properly container_of() it
and still get offset >= 0.

> >>
> >> A few instructions are saved for negative insn->offs as a result. Using
> >> the struct example_node / off = -16 example from before, code looks
> >> like:
> > 
> > This is quite complex to review. I couldn't convince myself
> > that droping 2nd check is safe, but don't have an argument to
> > prove that it's not safe.
> > Let's get to these details when there is need to support negative off.
> > 
> 
> Hopefully above explanation shows that there's need to support it now.
> I will try to simplify and rephrase the summary to make it easier to follow,
> but will prioritize addressing feedback in less complex patches, so this
> patch may not change for a few respins.

I'm not saying that this patch will never be needed.
Supporting negative offsets here is a good thing.
I'm arguing that it's not necessary to enable bpf_rbtree.
Dave Marchevsky Dec. 7, 2022, 11:39 p.m. UTC | #4
On 12/7/22 1:06 PM, Alexei Starovoitov wrote:
> On Wed, Dec 07, 2022 at 01:46:56AM -0500, Dave Marchevsky wrote:
>> On 12/6/22 9:39 PM, Alexei Starovoitov wrote:
>>> On Tue, Dec 06, 2022 at 03:09:57PM -0800, Dave Marchevsky wrote:
>>>> Current comment in BPF_PROBE_MEM jit code claims that verifier prevents
>>>> insn->off < 0, but this appears to not be true irrespective of changes
>>>> in this series. Regardless, changes in this series will result in an
>>>> example like:
>>>>
>>>>   struct example_node {
>>>>     long key;
>>>>     long val;
>>>>     struct bpf_rb_node node;
>>>>   }
>>>>
>>>>   /* In BPF prog, assume root contains example_node nodes */
>>>>   struct bpf_rb_node res = bpf_rbtree_first(&root);
>>>>   if (!res)
>>>>     return 1;
>>>>
>>>>   struct example_node n = container_of(res, struct example_node, node);
>>>>   long key = n->key;
>>>>
>>>> Resulting in a load with off = -16, as bpf_rbtree_first's return is
>>>
>>> Looks like the bug in the previous patch:
>>> +                       } else if (meta.func_id == special_kfunc_list[KF_bpf_rbtree_remove] ||
>>> +                                  meta.func_id == special_kfunc_list[KF_bpf_rbtree_first]) {
>>> +                               struct btf_field *field = meta.arg_rbtree_root.field;
>>> +
>>> +                               mark_reg_datastructure_node(regs, BPF_REG_0,
>>> +                                                           &field->datastructure_head);
>>>
>>> The R0 .off should have been:
>>>  regs[BPF_REG_0].off = field->rb_node.node_offset;
>>>
>>> node, not root.
>>>
>>> PTR_TO_BTF_ID should have been returned with approriate 'off',
>>> so that container_of() would it bring back to zero offset.
>>>
>>
>> The root's btf_field is used to hold information about the node type. Of
>> specific interest to us are value_btf_id and node_offset, which
>> mark_reg_datastructure_node uses to set REG_0's type and offset correctly.
>>
>> This "use head type to keep info about node type" strategy felt strange to me
>> initially too: all PTR_TO_BTF_ID regs are passing around their type info, so
>> why not use that to lookup bpf_rb_node field info? But consider that
>> bpf_rbtree_first (and bpf_list_pop_{front,back}) doesn't take a node as
>> input arg, so there's no opportunity to get btf_field info from input
>> reg type. 
>>
>> So we'll need to keep this info in rbtree_root's btf_field
>> regardless, and since any rbtree API function that operates on a node
>> also operates on a root and expects its node arg to match the node
>> type expected by the root, might as well use root's field as the main
>> lookup for this info and not even have &field->rb_node for now.
>> All __process_kf_arg_ptr_to_datastructure_node calls (added earlier
>> in the series) use the &meta->arg_{list_head,rbtree_root}.field for same
>> reason.
>>
>> So it's setting the reg offset correctly.
> 
> Ok. Got it. Then the commit log is incorrectly describing the failing scenario.
> It's a container_of() inside bool less() that is generating negative offsets.
> 

I noticed this happening with container_of() both inside less() and in the
example in the patch summary - specifically in the rbtree_first_and_remove
'success' selftest added in patch 13. There, operations like this:

  bpf_spin_lock(&glock);
  res = bpf_rbtree_first(&groot);
  if (!res) {...}

  o = container_of(res, struct node_data, node);
  first_data[1] = o->data;
  bpf_spin_unlock(&glock);

Would fail to set first_data[1] to the expected value, instead setting
it to 0, since the existing PROBE_MEM guard ends up zeroing the destination
register instead of doing the load (see the sketch below).
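
For reference, a small standalone sketch of my reading of why the pre-patch
check rejects such a load: adding the sign-extended negative offset to a valid
kernel pointer always produces a carry, which the JNC branch treats as an
overflow. The kernel address below is an assumption for illustration only.

  /* Hedged sketch: an unsigned add of a sign-extended negative insn->off to a
   * kernel pointer wraps past 2^64, i.e. sets the carry flag that the old
   * PROBE_MEM check 2 tests for. Compile and run as a normal userspace program.
   */
  #include <stdint.h>
  #include <stdio.h>

  int main(void)
  {
          uint64_t src_reg = 0xffff888012345678ULL;   /* assumed kernel address */
          uint64_t off = (uint64_t)(int64_t)-16;      /* sign-extended insn->off */
          uint64_t sum;

          /* __builtin_add_overflow on u64 reports exactly the carry condition */
          int carry = __builtin_add_overflow(src_reg, off, &sum);

          printf("carry=%d sum=0x%llx\n", carry, (unsigned long long)sum);
          return 0;
  }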

>>> All PTR_TO_BTF_ID need to have positive offset.
>>> I'm not sure btf_struct_walk() and other PTR_TO_BTF_ID accessors
>>> can deal with negative offsets.
>>> There could be all kinds of things to fix.
>>
>> I think you may be conflating reg offset and insn offset here. None of the
>> changes in this series result in a PTR_TO_BTF_ID reg w/ negative offset
>> being returned. But LLVM may generate load insns with a negative offset,
>> and since we're passing around pointers to bpf_rb_node that may come
>> after useful data fields in a type, this will happen more often.
>>
>> Consider this small example from selftests in this series:
>>
>> struct node_data {
>>   long key;
>>   long data;
>>   struct bpf_rb_node node;
>> };
>>
>> static bool less(struct bpf_rb_node *a, const struct bpf_rb_node *b)
>> {
>>         struct node_data *node_a;
>>         struct node_data *node_b;
>>
>>         node_a = container_of(a, struct node_data, node);
>>         node_b = container_of(b, struct node_data, node);
>>
>>         return node_a->key < node_b->key;
>> }
>>
>> llvm-objdump shows this bpf bytecode for 'less':
>>
>> 0000000000000000 <less>:
>> ;       return node_a->key < node_b->key;
>>        0:       79 22 f0 ff 00 00 00 00 r2 = *(u64 *)(r2 - 0x10)
>>        1:       79 11 f0 ff 00 00 00 00 r1 = *(u64 *)(r1 - 0x10)
>>        2:       b4 00 00 00 01 00 00 00 w0 = 0x1
>> ;       return node_a->key < node_b->key;
> 
> I see. That's the same bug.
> The args to callback should have been PTR_TO_BTF_ID | PTR_TRUSTED with 
> correct positive offset.
> Then node_a = container_of(a, struct node_data, node);
> would have produced correct offset into proper btf_id.
> 
> The verifier should be passing into less() the btf_id
> of struct node_data instead of btf_id of struct bpf_rb_node.
> 

The verifier is already passing the struct node_data type, not bpf_rb_node.
For less() args, and rbtree_{first,remove} retval, mark_reg_datastructure_node
- added in patch 8 - is doing as you describe.

Verifier sees less' arg regs as R=ptr_to_node_data(off=16). If it was
instead passing R=ptr_to_bpf_rb_node(off=0), attempting to access *(reg - 0x10)
would cause verifier err.
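
To make the offset arithmetic concrete, here is a minimal standalone mock of
what container_of() does in this case. The container_of definition is a
simplified kernel-style version and the bpf_rb_node stub is an assumption that
only stands in for layout:

  /* Minimal userspace mock of the container_of() arithmetic */
  #include <stddef.h>
  #include <stdio.h>

  struct bpf_rb_node { unsigned long __opaque[4]; };  /* layout stand-in only */

  struct node_data {
          long key;                   /* offset 0  */
          long data;                  /* offset 8  */
          struct bpf_rb_node node;    /* offset 16 */
  };

  #define container_of(ptr, type, member) \
          ((type *)((char *)(ptr) - offsetof(type, member)))

  int main(void)
  {
          struct node_data n = { .key = 42 };
          struct bpf_rb_node *res = &n.node;  /* what the verifier tracks as off=16 */
          struct node_data *o = container_of(res, struct node_data, node);

          /* The subtraction is offsetof(struct node_data, node) == 16, which
           * LLVM folds straight into the load as insn->off = -0x10.
           */
          printf("offsetof(node)=%zu key=%ld\n",
                 offsetof(struct node_data, node), o->key);
          return 0;
  }

So a reg tracked as ptr_to_node_data(off=16) ends up dereferenced with a
-0x10 insn->off rather than being rebased to off=0 first.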

>>        3:       cd 21 01 00 00 00 00 00 if r1 s< r2 goto +0x1 <LBB2_2>
>>        4:       b4 00 00 00 00 00 00 00 w0 = 0x0
>>
>> 0000000000000028 <LBB2_2>:
>> ;       return node_a->key < node_b->key;
>>        5:       95 00 00 00 00 00 00 00 exit
>>
>> Insns 0 and 1 are loading node_b->key and node_a->key, respectively, using
>> negative insn->off. Verifier's view of R1 and R2 before insn 0 is
>> untrusted_ptr_node_data(off=16). If there were some intermediate insns
>> storing result of container_of() before dereferencing:
>>
>>   r3 = (r2 - 0x10)
>>   r2 = *(u64 *)(r3)
>>
>> Verifier would see R3 as untrusted_ptr_node_data(off=0), and load for
>> r2 would have insn->off = 0. But LLVM decides to just do a load-with-offset
>> using original arg ptrs to less() instead of storing container_of() ptr
>> adjustments.
>>
>> Since the container_of usage and code pattern in above example's less()
>> isn't particularly specific to this series, I think there are other scenarios
>> where such code would be generated, so I considered this a general bugfix in
>> the cover letter.
> 
> imo the negative offset looks specific to two misuses of PTR_UNTRUSTED in this set.
> 

If I used PTR_TRUSTED here, the JITted instructions would still do a load like
r2 = *(u64 *)(r2 - 0x10). There would just be no BPF_PROBE_MEM runtime checking
insns generated, avoiding the negative insn->off issue there. But the negative
insn->off load being generated is not specific to PTR_UNTRUSTED.

>>
>> [ below paragraph was moved here, it originally preceded "All PTR_TO_BTF_ID"
>>   paragraph ]
>>
>>> The approach of returning untrusted from bpf_rbtree_first is questionable.
>>> Without doing that this issue would not have surfaced.
>>>
>>
>> I agree re: PTR_UNTRUSTED, but note that my earlier example doesn't involve
>> bpf_rbtree_first. Regardless, I think the issue is that PTR_UNTRUSTED is
>> used to denote a few separate traits of a PTR_TO_BTF_ID reg:
>>
>>   * "I have no ownership over the thing I'm pointing to"
>>   * "My backing memory may go away at any time"
>>   * "Access to my fields might result in page fault"
>>   * "Kfuncs shouldn't accept me as an arg"
>>
>> Seems like original PTR_UNTRUSTED usage really wanted to denote the first
>> point and the others were just naturally implied from the first. But
>> as you've noted there are some things using PTR_UNTRUSTED that really
>> want to make more granular statements:
> 
> I think PTR_UNTRUSTED implies all of the above. All 4 statements are connected.
> 
>> ref_set_release_on_unlock logic sets release_on_unlock = true and adds
>> PTR_UNTRUSTED to the reg type. In this case PTR_UNTRUSTED is trying to say:
>>
>>   * "I have no ownership over the thing I'm pointing to"
>>   * "My backing memory may go away at any time _after_ bpf_spin_unlock"
>>     * Before spin_unlock it's guaranteed to be valid
>>   * "Kfuncs shouldn't accept me as an arg"
>>     * We don't want arbitrary kfunc saving and accessing release_on_unlock
>>       reg after bpf_spin_unlock, as its backing memory can go away any time
>>       after spin_unlock.
>>
>> The "backing memory" statement PTR_UNTRUSTED is making is a blunt superset
>> of what release_on_unlock really needs.
>>
>> For less() callback we just want
>>
>>   * "I have no ownership over the thing I'm pointing to"
>>   * "Kfuncs shouldn't accept me as an arg"
>>
>> There is probably a way to decompose PTR_UNTRUSTED into a few flags such that
>> it's possible to denote these things separately and avoid unwanted additional
>> behavior. But after talking to David Vernet about current complexity of
>> PTR_TRUSTED and PTR_UNTRUSTED logic and his desire to refactor, it seemed
>> better to continue with PTR_UNTRUSTED blunt instrument with a bit of
>> special casing for now, instead of piling on more flags.
> 
> Exactly. More flags will only increase the confusion.
> Please try to make callback args as proper PTR_TRUSTED and disallow calling specific
> rbtree kfuncs while inside this particular callback to prevent recursion.
> That would solve all these issues, no?
> Writing into such PTR_TRUSTED should be still allowed inside cb though it's bogus.
> 
> Consider less() receiving btf_id ptr_trusted of struct node_data and it contains
> both link list and rbtree.
> It should still be safe to operate on link list part of that node from less()
> though it's not something we would ever recommend.

I definitely want to allow writes on non-owning references. In order to properly
support this, there needs to be a way to designate a field as a "key":

struct node_data {
  long key __key;
  long data;
  struct bpf_rb_node node;
};

or perhaps on the rb_root via __contains or separate tag:

struct bpf_rb_root groot __contains(struct node_data, node, key);

This is necessary because rbtree's less() uses the key field to determine order,
so we don't want to allow writes to the key field while the node is in an rbtree.
If such a write were possible, the rbtree could easily be placed in an invalid
state, since the new key may mean that the rbtree is no longer sorted. Subsequent
add() operations would run less() against the new key, so other nodes would be
placed in the wrong spot as well.
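
As a hypothetical illustration in the style of the patch 13 selftests (the
node_data / groot / glock definitions, the kfunc declarations, container_of,
SEC("tc") and the function name are assumed and not shown), this is the kind
of write a __key tag would need to reject:

  /* Hypothetical fragment, not an actual selftest from this series */
  SEC("tc")
  long rbtree_write_key_of_first(void *ctx)
  {
          struct bpf_rb_node *res;
          struct node_data *o;

          bpf_spin_lock(&glock);
          res = bpf_rbtree_first(&groot);
          if (!res) {
                  bpf_spin_unlock(&glock);
                  return 0;
          }
          o = container_of(res, struct node_data, node);
          /* If this write were allowed while the node is linked, groot may no
           * longer be sorted by key, and later add() operations would place
           * new nodes relative to this node's stale position.
           */
          o->key += 100;
          bpf_spin_unlock(&glock);
          return 0;
  }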

Since PTR_UNTRUSTED currently allows read but not write, and prevents use of a
non-owning ref as a kfunc arg, it seemed to be a reasonable tag for less() args.

I was planning on adding __key / non-owning-ref write support as a followup, but
adding it as part of this series will probably save a lot of back-and-forth.
Will try to add it.

> The kfunc call on rb tree part of struct node_data is problematic because
> of recursion, right? No other safety concerns ?
> 
>>>
>>>> modified by verifier to be PTR_TO_BTF_ID of example_node w/ offset =
>>>> offsetof(struct example_node, node), instead of PTR_TO_BTF_ID of
>>>> bpf_rb_node. So it's necessary to support negative insn->off when
>>>> jitting BPF_PROBE_MEM.
>>>
>>> I'm not convinced it's necessary.
>>> container_of() seems to be the only case where bpf prog can convert
>>> PTR_TO_BTF_ID with off >= 0 to negative off.
>>> Normal pointer walking will not make it negative.
>>>
>>
>> I see what you mean - if some non-container_of case resulted in load generation
>> with negative insn->off, this probably would've been noticed already. But
>> hopefully my replies above explain why it should be addressed now.
> 
> Even with container_of() usage we should be passing proper btf_id of container
> struct, so that callbacks and non-callbacks can properly container_of() it
> and still get offset >= 0.
> 

This was addressed earlier in my response.

>>>>
>>>> A few instructions are saved for negative insn->offs as a result. Using
>>>> the struct example_node / off = -16 example from before, code looks
>>>> like:
>>>
>>> This is quite complex to review. I couldn't convince myself
>>> that dropping the 2nd check is safe, but don't have an argument to
>>> prove that it's not safe.
>>> Let's get to these details when there is need to support negative off.
>>>
>>
>> Hopefully above explanation shows that there's need to support it now.
>> I will try to simplify and rephrase the summary to make it easier to follow,
>> but will prioritize addressing feedback in less complex patches, so this
>> patch may not change for a few respins.
> 
> I'm not saying that this patch will never be needed.
> Supporting negative offsets here is a good thing.
> I'm arguing that it's not necessary to enable bpf_rbtree.
Alexei Starovoitov Dec. 8, 2022, 12:47 a.m. UTC | #5
On Wed, Dec 07, 2022 at 06:39:38PM -0500, Dave Marchevsky wrote:
> >>
> >> 0000000000000000 <less>:
> >> ;       return node_a->key < node_b->key;
> >>        0:       79 22 f0 ff 00 00 00 00 r2 = *(u64 *)(r2 - 0x10)
> >>        1:       79 11 f0 ff 00 00 00 00 r1 = *(u64 *)(r1 - 0x10)
> >>        2:       b4 00 00 00 01 00 00 00 w0 = 0x1
> >> ;       return node_a->key < node_b->key;
> > 
> > I see. That's the same bug.
> > The args to callback should have been PTR_TO_BTF_ID | PTR_TRUSTED with 
> > correct positive offset.
> > Then node_a = container_of(a, struct node_data, node);
> > would have produced correct offset into proper btf_id.
> > 
> > The verifier should be passing into less() the btf_id
> > of struct node_data instead of btf_id of struct bpf_rb_node.
> > 
> 
> The verifier is already passing the struct node_data type, not bpf_rb_node.
> For less() args, and rbtree_{first,remove} retval, mark_reg_datastructure_node
> - added in patch 8 - is doing as you describe.
> 
> Verifier sees less' arg regs as R=ptr_to_node_data(off=16). If it was
> instead passing R=ptr_to_bpf_rb_node(off=0), attempting to access *(reg - 0x10)
> would cause verifier err.

Ahh. I finally got it :)
Please put these details in the commit log when you respin.

> >>        3:       cd 21 01 00 00 00 00 00 if r1 s< r2 goto +0x1 <LBB2_2>
> >>        4:       b4 00 00 00 00 00 00 00 w0 = 0x0
> >>
> >> 0000000000000028 <LBB2_2>:
> >> ;       return node_a->key < node_b->key;
> >>        5:       95 00 00 00 00 00 00 00 exit
> >>
> >> Insns 0 and 1 are loading node_b->key and node_a->key, respectively, using
> >> negative insn->off. Verifier's view of R1 and R2 before insn 0 is
> >> untrusted_ptr_node_data(off=16). If there were some intermediate insns
> >> storing result of container_of() before dereferencing:
> >>
> >>   r3 = (r2 - 0x10)
> >>   r2 = *(u64 *)(r3)
> >>
> >> Verifier would see R3 as untrusted_ptr_node_data(off=0), and load for
> >> r2 would have insn->off = 0. But LLVM decides to just do a load-with-offset
> >> using original arg ptrs to less() instead of storing container_of() ptr
> >> adjustments.
> >>
> >> Since the container_of usage and code pattern in above example's less()
> >> isn't particularly specific to this series, I think there are other scenarios
> >> where such code would be generated, so I considered this a general bugfix in
> >> the cover letter.
> > 
> > imo the negative offset looks specific to two misuses of PTR_UNTRUSTED in this set.
> > 
> 
> If I used PTR_TRUSTED here, the JITted instructions would still do a load like
> r2 = *(u64 *)(r2 - 0x10). There would just be no BPF_PROBE_MEM runtime checking
> insns generated, avoiding negative insn issue there. But the negative insn->off
> load being generated is not specific to PTR_UNTRUSTED.

yep.

> > 
> > Exactly. More flags will only increase the confusion.
> > Please try to make callback args as proper PTR_TRUSTED and disallow calling specific
> > rbtree kfuncs while inside this particular callback to prevent recursion.
> > That would solve all these issues, no?
> > Writing into such PTR_TRUSTED should be still allowed inside cb though it's bogus.
> > 
> > Consider less() receiving btf_id ptr_trusted of struct node_data and it contains
> > both link list and rbtree.
> > It should still be safe to operate on link list part of that node from less()
> > though it's not something we would ever recommend.
> 
> I definitely want to allow writes on non-owning references. In order to properly
> support this, there needs to be a way to designate a field as a "key":
> 
> struct node_data {
>   long key __key;
>   long data;
>   struct bpf_rb_node node;
> };
> 
> or perhaps on the rb_root via __contains or separate tag:
> 
> struct bpf_rb_root groot __contains(struct node_data, node, key);
> 
> This is necessary because rbtree's less() uses key field to determine order, so
> we don't want to allow write to the key field when the node is in a rbtree. If
> such a write were possible the rbtree could easily be placed in an invalid state
> since the new key may mean that the rbtree is no longer sorted. Subsequent add()
> operations would compare less() using the new key, so other nodes will be placed
> in wrong spot as well.
> 
> Since PTR_UNTRUSTED currently allows read but not write, and prevents use of
> non-owning ref as kfunc arg, it seemed to be reasonable tag for less() args.
> 
> I was planning on adding __key / non-owning-ref write support as a followup, but
> adding it as part of this series will probably save a lot of back-and-forth.
> Will try to add it.

Just a key mark might not be enough. less() could be doing all sorts of complex
logic on more than one field and even on global fields.
But what is the concern with writing into 'key'?
The rbtree will not be sorted. find/add operations will not be correct,
but nothing will crash. At the end bpf_rb_root_free() will walk all
unsorted nodes anyway and free them all.
Even if we pass PTR_TRUSTED | MEM_RDONLY pointers into less(), less()
can still do nonsensical things like returning random true/false.
Doesn't look like an issue to me.
Dave Marchevsky Dec. 8, 2022, 8:50 a.m. UTC | #6
On 12/7/22 7:47 PM, Alexei Starovoitov wrote:
> On Wed, Dec 07, 2022 at 06:39:38PM -0500, Dave Marchevsky wrote:
>>>>
>>>> 0000000000000000 <less>:
>>>> ;       return node_a->key < node_b->key;
>>>>        0:       79 22 f0 ff 00 00 00 00 r2 = *(u64 *)(r2 - 0x10)
>>>>        1:       79 11 f0 ff 00 00 00 00 r1 = *(u64 *)(r1 - 0x10)
>>>>        2:       b4 00 00 00 01 00 00 00 w0 = 0x1
>>>> ;       return node_a->key < node_b->key;
>>>
>>> I see. That's the same bug.
>>> The args to callback should have been PTR_TO_BTF_ID | PTR_TRUSTED with 
>>> correct positive offset.
>>> Then node_a = container_of(a, struct node_data, node);
>>> would have produced correct offset into proper btf_id.
>>>
>>> The verifier should be passing into less() the btf_id
>>> of struct node_data instead of btf_id of struct bpf_rb_node.
>>>
>>
>> The verifier is already passing the struct node_data type, not bpf_rb_node.
>> For less() args, and rbtree_{first,remove} retval, mark_reg_datastructure_node
>> - added in patch 8 - is doing as you describe.
>>
>> Verifier sees less' arg regs as R=ptr_to_node_data(off=16). If it was
>> instead passing R=ptr_to_bpf_rb_node(off=0), attempting to access *(reg - 0x10)
>> would cause verifier err.
> 
> Ahh. I finally got it :)
> Please put these details in the commit log when you respin.
> 

Glad it finally started making sense.
Will do a big improvement of the patch summary after addressing other
feedback on this series.

>>>>        3:       cd 21 01 00 00 00 00 00 if r1 s< r2 goto +0x1 <LBB2_2>
>>>>        4:       b4 00 00 00 00 00 00 00 w0 = 0x0
>>>>
>>>> 0000000000000028 <LBB2_2>:
>>>> ;       return node_a->key < node_b->key;
>>>>        5:       95 00 00 00 00 00 00 00 exit
>>>>
>>>> Insns 0 and 1 are loading node_b->key and node_a->key, respectively, using
>>>> negative insn->off. Verifier's view of R1 and R2 before insn 0 is
>>>> untrusted_ptr_node_data(off=16). If there were some intermediate insns
>>>> storing result of container_of() before dereferencing:
>>>>
>>>>   r3 = (r2 - 0x10)
>>>>   r2 = *(u64 *)(r3)
>>>>
>>>> Verifier would see R3 as untrusted_ptr_node_data(off=0), and load for
>>>> r2 would have insn->off = 0. But LLVM decides to just do a load-with-offset
>>>> using original arg ptrs to less() instead of storing container_of() ptr
>>>> adjustments.
>>>>
>>>> Since the container_of usage and code pattern in above example's less()
>>>> isn't particularly specific to this series, I think there are other scenarios
>>>> where such code would be generated, so I considered this a general bugfix in
>>>> the cover letter.
>>>
>>> imo the negative offset looks specific to two misuses of PTR_UNTRUSTED in this set.
>>>
>>
>> If I used PTR_TRUSTED here, the JITted instructions would still do a load like
>> r2 = *(u64 *)(r2 - 0x10). There would just be no BPF_PROBE_MEM runtime checking
>> insns generated, avoiding negative insn issue there. But the negative insn->off
>> load being generated is not specific to PTR_UNTRUSTED.
> 
> yep.
> 
>>>
>>> Exactly. More flags will only increase the confusion.
>>> Please try to make callback args as proper PTR_TRUSTED and disallow calling specific
>>> rbtree kfuncs while inside this particular callback to prevent recursion.
>>> That would solve all these issues, no?
>>> Writing into such PTR_TRUSTED should be still allowed inside cb though it's bogus.
>>>
>>> Consider less() receiving btf_id ptr_trusted of struct node_data and it contains
>>> both link list and rbtree.
>>> It should still be safe to operate on link list part of that node from less()
>>> though it's not something we would ever recommend.
>>
>> I definitely want to allow writes on non-owning references. In order to properly
>> support this, there needs to be a way to designate a field as a "key":
>>
>> struct node_data {
>>   long key __key;
>>   long data;
>>   struct bpf_rb_node node;
>> };
>>
>> or perhaps on the rb_root via __contains or separate tag:
>>
>> struct bpf_rb_root groot __contains(struct node_data, node, key);
>>
>> This is necessary because rbtree's less() uses key field to determine order, so
>> we don't want to allow write to the key field when the node is in a rbtree. If
>> such a write were possible the rbtree could easily be placed in an invalid state
>> since the new key may mean that the rbtree is no longer sorted. Subsequent add()
>> operations would compare less() using the new key, so other nodes will be placed
>> in wrong spot as well.
>>
>> Since PTR_UNTRUSTED currently allows read but not write, and prevents use of
>> non-owning ref as kfunc arg, it seemed to be reasonable tag for less() args.
>>
>> I was planning on adding __key / non-owning-ref write support as a followup, but
>> adding it as part of this series will probably save a lot of back-and-forth.
>> Will try to add it.
> 
> Just key mark might not be enough. less() could be doing all sort of complex
> logic on more than one field and even global fields.
> But what is the concern with writing into 'key' ?
> The rbtree will not be sorted. find/add operation will not be correct,
> but nothing will crash. At the end bpf_rb_root_free() will walk all
> unsorted nodes anyway and free them all.
> Even if we pass PTR_TRUSTED | MEM_RDONLY pointers into less() the less()
> can still do nonsensical things like returning random true/false.
> Doesn't look like an issue to me.

Agreed re: complex logic + global fields, less() being able to do nonsensical
things, and writing to key not crashing anything even if it breaks the tree.

OK, let's forget about __key. In the next version of the series non-owning refs
will be writable. We can add more protection in the future if it's deemed
necessary. Since this means non-owning refs won't be PTR_UNTRUSTED anymore,
I can split this patch out from the rest of the series after confirming that
it isn't necessary to ship rbtree.

Still want to convince you that skipping the 2nd check is correct before
I page out the details, but that's less urgent now. IIUC, although the cause of
the issue is clear now, you'd still like me to clarify the details of the
solution.
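
To make that argument concrete, here is a small standalone sketch of the
arithmetic, using assumed x86-64 constants (the TASK_SIZE_MAX value below is
illustrative for 4-level paging; the JIT of course uses the real macro):

  /* Hedged sketch of the no-overflow argument for insn->off < 0.
   * Constants are assumptions for illustration, not from kernel headers.
   */
  #include <assert.h>
  #include <stdint.h>

  int main(void)
  {
          const uint64_t task_size_max = 0x00007ffffffff000ULL;  /* assumed */
          const uint64_t page_size = 4096;
          const int64_t insn_off = -32768;        /* S16_MIN, worst case */

          /* check 1 emitted by the JIT: src_reg >= limit, with */
          const uint64_t limit = task_size_max + page_size - insn_off;

          /* Any src_reg passing check 1 gives
           *   src_reg + insn_off >= task_size_max + page_size > 0,
           * so the add cannot wrap below zero and the result stays a kernel
           * address; the old 2nd (carry) check is redundant for negative offs.
           */
          const uint64_t smallest_passing_src = limit;

          assert(smallest_passing_src + insn_off == task_size_max + page_size);
          return 0;
  }
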
diff mbox series

Patch

diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
index 36ffe67ad6e5..843f619d0d35 100644
--- a/arch/x86/net/bpf_jit_comp.c
+++ b/arch/x86/net/bpf_jit_comp.c
@@ -11,6 +11,7 @@ 
 #include <linux/bpf.h>
 #include <linux/memory.h>
 #include <linux/sort.h>
+#include <linux/limits.h>
 #include <asm/extable.h>
 #include <asm/set_memory.h>
 #include <asm/nospec-branch.h>
@@ -94,6 +95,7 @@  static int bpf_size_to_x86_bytes(int bpf_size)
  */
 #define X86_JB  0x72
 #define X86_JAE 0x73
+#define X86_JNC 0x73
 #define X86_JE  0x74
 #define X86_JNE 0x75
 #define X86_JBE 0x76
@@ -950,6 +952,36 @@  static void emit_shiftx(u8 **pprog, u32 dst_reg, u8 src_reg, bool is64, u8 op)
 	*pprog = prog;
 }
 
+/* Check that condition necessary for PROBE_MEM handling for insn->off < 0
+ * holds.
+ *
+ * This could be a static_assert((TASK_SIZE_MAX + PAGE_SIZE) > -S16_MIN),
+ * but TASK_SIZE_MAX can't always be evaluated at compile time, so let's not
+ * assume insn->off size either
+ */
+static int check_probe_mem_task_size_overflow(void)
+{
+	struct bpf_insn insn;
+	s64 max_negative;
+
+	switch (sizeof(insn.off)) {
+	case 2:
+		max_negative = S16_MIN;
+		break;
+	default:
+		pr_err("bpf_jit_error: unexpected bpf_insn->off size\n");
+		return -EFAULT;
+	}
+
+	if (!((TASK_SIZE_MAX + PAGE_SIZE) > -max_negative)) {
+		pr_err("bpf jit error: assumption does not hold:\n");
+		pr_err("\t(TASK_SIZE_MAX + PAGE_SIZE) + (max negative insn->off) > 0\n");
+		return -EFAULT;
+	}
+
+	return 0;
+}
+
 #define INSN_SZ_DIFF (((addrs[i] - addrs[i - 1]) - (prog - temp)))
 
 static int do_jit(struct bpf_prog *bpf_prog, int *addrs, u8 *image, u8 *rw_image,
@@ -967,6 +999,10 @@  static int do_jit(struct bpf_prog *bpf_prog, int *addrs, u8 *image, u8 *rw_image
 	u8 *prog = temp;
 	int err;
 
+	err = check_probe_mem_task_size_overflow();
+	if (err)
+		return err;
+
 	detect_reg_usage(insn, insn_cnt, callee_regs_used,
 			 &tail_call_seen);
 
@@ -1359,20 +1395,30 @@  st:			if (is_imm8(insn->off))
 		case BPF_LDX | BPF_MEM | BPF_DW:
 		case BPF_LDX | BPF_PROBE_MEM | BPF_DW:
 			if (BPF_MODE(insn->code) == BPF_PROBE_MEM) {
-				/* Though the verifier prevents negative insn->off in BPF_PROBE_MEM
-				 * add abs(insn->off) to the limit to make sure that negative
-				 * offset won't be an issue.
-				 * insn->off is s16, so it won't affect valid pointers.
-				 */
-				u64 limit = TASK_SIZE_MAX + PAGE_SIZE + abs(insn->off);
-				u8 *end_of_jmp1, *end_of_jmp2;
-
 				/* Conservatively check that src_reg + insn->off is a kernel address:
-				 * 1. src_reg + insn->off >= limit
-				 * 2. src_reg + insn->off doesn't become small positive.
-				 * Cannot do src_reg + insn->off >= limit in one branch,
-				 * since it needs two spare registers, but JIT has only one.
+				 * 1. src_reg + insn->off >= TASK_SIZE_MAX + PAGE_SIZE
+				 * 2. src_reg + insn->off doesn't overflow and become small positive
+				 *
+				 * For check 1, to save regs, do
+				 * src_reg >= (TASK_SIZE_MAX + PAGE_SIZE - insn->off) call rhs
+				 * of inequality 'limit'
+				 *
+				 * For check 2:
+				 * If insn->off is positive, add src_reg + insn->off and check
+				 * overflow directly
+				 * If insn->off is negative, we know that
+				 *   (TASK_SIZE_MAX + PAGE_SIZE - insn->off) > (TASK_SIZE_MAX + PAGE_SIZE)
+				 * and from check 1 we know
+				 *   src_reg >= (TASK_SIZE_MAX + PAGE_SIZE - insn->off)
+				 * So if (TASK_SIZE_MAX + PAGE_SIZE) + MAX_NEGATIVE_OFF > 0 we can
+				 * be sure that src_reg + insn->off won't overflow in either
+				 * direction and avoid runtime check entirely.
+				 *
+				 * check_probe_mem_task_size_overflow confirms the above assumption
+				 * at the beginning of this function
 				 */
+				u64 limit = TASK_SIZE_MAX + PAGE_SIZE - insn->off;
+				u8 *end_of_jmp1, *end_of_jmp2;
 
 				/* movabsq r11, limit */
 				EMIT2(add_1mod(0x48, AUX_REG), add_1reg(0xB8, AUX_REG));
@@ -1381,32 +1427,47 @@  st:			if (is_imm8(insn->off))
 				/* cmp src_reg, r11 */
 				maybe_emit_mod(&prog, src_reg, AUX_REG, true);
 				EMIT2(0x39, add_2reg(0xC0, src_reg, AUX_REG));
-				/* if unsigned '<' goto end_of_jmp2 */
-				EMIT2(X86_JB, 0);
-				end_of_jmp1 = prog;
-
-				/* mov r11, src_reg */
-				emit_mov_reg(&prog, true, AUX_REG, src_reg);
-				/* add r11, insn->off */
-				maybe_emit_1mod(&prog, AUX_REG, true);
-				EMIT2_off32(0x81, add_1reg(0xC0, AUX_REG), insn->off);
-				/* jmp if not carry to start_of_ldx
-				 * Otherwise ERR_PTR(-EINVAL) + 128 will be the user addr
-				 * that has to be rejected.
-				 */
-				EMIT2(0x73 /* JNC */, 0);
-				end_of_jmp2 = prog;
+				if (insn->off >= 0) {
+					/* cmp src_reg, r11 */
+					/* if unsigned '<' goto end_of_jmp2 */
+					EMIT2(X86_JB, 0);
+					end_of_jmp1 = prog;
+
+					/* mov r11, src_reg */
+					emit_mov_reg(&prog, true, AUX_REG, src_reg);
+					/* add r11, insn->off */
+					maybe_emit_1mod(&prog, AUX_REG, true);
+					EMIT2_off32(0x81, add_1reg(0xC0, AUX_REG), insn->off);
+					/* jmp if not carry to start_of_ldx
+					 * Otherwise ERR_PTR(-EINVAL) + 128 will be the user addr
+					 * that has to be rejected.
+					 */
+					EMIT2(X86_JNC, 0);
+					end_of_jmp2 = prog;
+				} else {
+					/* cmp src_reg, r11 */
+					/* if unsigned '>=' goto start_of_ldx
+					 * w/o needing to do check 2
+					 */
+					EMIT2(X86_JAE, 0);
+					end_of_jmp1 = prog;
+				}
 
 				/* xor dst_reg, dst_reg */
 				emit_mov_imm32(&prog, false, dst_reg, 0);
 				/* jmp byte_after_ldx */
 				EMIT2(0xEB, 0);
 
-				/* populate jmp_offset for JB above to jump to xor dst_reg */
-				end_of_jmp1[-1] = end_of_jmp2 - end_of_jmp1;
-				/* populate jmp_offset for JNC above to jump to start_of_ldx */
 				start_of_ldx = prog;
-				end_of_jmp2[-1] = start_of_ldx - end_of_jmp2;
+				if (insn->off >= 0) {
+					/* populate jmp_offset for JB above to jump to xor dst_reg */
+					end_of_jmp1[-1] = end_of_jmp2 - end_of_jmp1;
+					/* populate jmp_offset for JNC above to jump to start_of_ldx */
+					end_of_jmp2[-1] = start_of_ldx - end_of_jmp2;
+				} else {
+					/* populate jmp_offset for JAE above to jump to start_of_ldx */
+					end_of_jmp1[-1] = start_of_ldx - end_of_jmp1;
+				}
 			}
 			emit_ldx(&prog, BPF_SIZE(insn->code), dst_reg, src_reg, insn->off);
 			if (BPF_MODE(insn->code) == BPF_PROBE_MEM) {