diff mbox series

[bpf-next,1/2] bpf, x86: Improve PROBE_MEM runtime load check

Message ID 20221213182726.325137-1-davemarchevsky@fb.com (mailing list archive)
State Superseded
Delegated to: BPF
Headers show
Series [bpf-next,1/2] bpf, x86: Improve PROBE_MEM runtime load check | expand

Checks

Context Check Description
netdev/tree_selection success Clearly marked for bpf-next
netdev/fixes_present success Fixes tag not required for -next series
netdev/subject_prefix success Link
netdev/cover_letter success Single patches do not need cover letters
netdev/patch_count success Link
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 0 this patch: 0
netdev/cc_maintainers warning 18 maintainers not CCed: yoshfuji@linux-ipv6.org mingo@redhat.com x86@kernel.org netdev@vger.kernel.org kpsingh@kernel.org bp@alien8.de haoluo@google.com davem@davemloft.net song@kernel.org yhs@fb.com tglx@linutronix.de dsahern@kernel.org dave.hansen@linux.intel.com martin.lau@linux.dev sdf@google.com john.fastabend@gmail.com jolsa@kernel.org hpa@zytor.com
netdev/build_clang success Errors and warnings before: 5 this patch: 5
netdev/module_param success Was 0 now: 0
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn success Errors and warnings before: 2 this patch: 2
netdev/checkpatch warning WARNING: line length of 101 exceeds 80 columns WARNING: line length of 85 exceeds 80 columns WARNING: line length of 90 exceeds 80 columns WARNING: line length of 92 exceeds 80 columns WARNING: line length of 94 exceeds 80 columns WARNING: line length of 95 exceeds 80 columns WARNING: line length of 96 exceeds 80 columns WARNING: line length of 97 exceeds 80 columns WARNING: line length of 98 exceeds 80 columns WARNING: line length of 99 exceeds 80 columns
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0
bpf/vmtest-bpf-next-PR success PR summary
bpf/vmtest-bpf-next-VM_Test-1 success Logs for ShellCheck
bpf/vmtest-bpf-next-VM_Test-7 success Logs for llvm-toolchain
bpf/vmtest-bpf-next-VM_Test-8 success Logs for set-matrix
bpf/vmtest-bpf-next-VM_Test-2 success Logs for build for aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-3 success Logs for build for aarch64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-5 success Logs for build for x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-6 success Logs for build for x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-4 success Logs for build for s390x with gcc
bpf/vmtest-bpf-next-VM_Test-12 success Logs for test_maps on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-13 success Logs for test_maps on x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-17 success Logs for test_progs on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-22 success Logs for test_progs_no_alu32 on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-23 success Logs for test_progs_no_alu32 on x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-27 success Logs for test_progs_no_alu32_parallel on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-28 success Logs for test_progs_no_alu32_parallel on x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-32 success Logs for test_progs_parallel on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-37 success Logs for test_verifier on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-38 success Logs for test_verifier on x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-9 success Logs for test_maps on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-10 success Logs for test_maps on aarch64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-14 success Logs for test_progs on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-15 success Logs for test_progs on aarch64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-18 success Logs for test_progs on x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-19 fail Logs for test_progs_no_alu32 on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-20 success Logs for test_progs_no_alu32 on aarch64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-24 success Logs for test_progs_no_alu32_parallel on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-25 success Logs for test_progs_no_alu32_parallel on aarch64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-29 success Logs for test_progs_parallel on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-30 success Logs for test_progs_parallel on aarch64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-33 success Logs for test_progs_parallel on x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-34 success Logs for test_verifier on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-35 success Logs for test_verifier on aarch64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-26 success Logs for test_progs_no_alu32_parallel on s390x with gcc
bpf/vmtest-bpf-next-VM_Test-36 success Logs for test_verifier on s390x with gcc
bpf/vmtest-bpf-next-VM_Test-16 success Logs for test_progs on s390x with gcc
bpf/vmtest-bpf-next-VM_Test-31 success Logs for test_progs_parallel on s390x with gcc
bpf/vmtest-bpf-next-VM_Test-21 success Logs for test_progs_no_alu32 on s390x with gcc
bpf/vmtest-bpf-next-VM_Test-11 success Logs for test_maps on s390x with gcc

Commit Message

Dave Marchevsky Dec. 13, 2022, 6:27 p.m. UTC
This patch rewrites the runtime PROBE_MEM check insns emitted by the BPF
JIT in order to ensure load safety. The changes in the patch fix two
issues with the previous logic and more generally improve size of
emitted code. Paragraphs between this one and "FIX 1" below explain the
purpose of the runtime check and examine the current implementation.

When a load is marked PROBE_MEM - e.g. due to PTR_UNTRUSTED access - the
address being loaded from is not necessarily valid. The BPF jit sets up
exception handlers for each such load which catch page faults and 0 out
the destination register.

Arbitrary register-relative loads can escape this exception handling
mechanism. Specifically, a load like dst_reg = *(src_reg + off) will not
trigger BPF exception handling if (src_reg + off) is outside of kernel
address space, resulting in an uncaught page fault. A concrete example
of such behavior is a program like:

  struct result {
    char space[40];
    long a;
  };

  /* if err, returns ERR_PTR(-EINVAL) */
  struct result *ptr = get_ptr_maybe_err();
  long x = ptr->a;

If get_ptr_maybe_err returns ERR_PTR(-EINVAL) and the result isn't
checked for err, 'ptr' will be (u64)-EINVAL, a number close to
U64_MAX. The address computed for the ptr->a load will exceed U64_MAX
and wrap over to a small positive u64, which is in userspace and thus
not covered by the BPF exception handling mechanism.
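
For concreteness, with -EINVAL == -22 and 'a' at offset 40 in struct
result (an illustrative calculation, not taken from the patch itself):

  (u64)ERR_PTR(-EINVAL)         = 0xffffffffffffffea
  + offsetof(struct result, a)  /* 40 == 0x28 */
                                = 0x0000000000000012  (wraps past U64_MAX)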

In order to prevent such loads from occurring, the BPF jit emits some
instructions which do runtime checking of (src_reg + off) and skip the
actual load if it's out of range. As an example, here are instructions
emitted for a %rdi = *(%rdi + 0x10) PROBE_MEM load:

  72:   movabs $0x800000000010,%r11 --|
  7c:   cmp    %r11,%rdi              |- 72 - 7f: Check 1
  7f:    jb    0x000000000000008d   --|
  81:   mov    %rdi,%r11             -----|
  84:   add    $0x0000000000000010,%r11   |- 81-8b: Check 2
  8b:   jnc    0x0000000000000091    -----|
  8d:   xor    %edi,%edi             ---- 0 out dest
  8f:   jmp    0x0000000000000095
  91:   mov    0x10(%rdi),%rdi       ---- Actual load
  95:
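
In C-like pseudocode, the above sequence behaves roughly as follows (a
sketch using the constants from the listing, not the JIT's literal
output):

  if (src_reg < 0x800000000010)          /* Check 1: jb taken -> zero dst */
          goto zero_dst;
  if (src_reg + 0x10 < src_reg)          /* Check 2: add wrapped -> zero dst */
          goto zero_dst;
  dst_reg = *(u64 *)(src_reg + 0x10);    /* actual load, covered by ex handler */
  goto done;
  zero_dst:
  dst_reg = 0;
  done: ;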

The JIT considers kernel address space to start at TASK_SIZE_MAX +
PAGE_SIZE. Determining whether a load will be outside of kernel address
space should be a simple check:

  (src_reg + off) >= TASK_SIZE_MAX + PAGE_SIZE

But because there is only one spare register when the checking logic is
emitted, this logic is split into two checks:

  Check 1: src_reg >= (TASK_SIZE_MAX + PAGE_SIZE - off)
  Check 2: src_reg + off doesn't wrap over U64_MAX and result in small pos u64

Emitted insns implementing Checks 1 and 2 are annotated in the above
example. Check 1 can be done with the single spare register since the
source reg is by definition the left-hand side of the inequality.
Since adding 'off' to both sides of Check 1's inequality yields the
original inequality we want, testing Check 1 is equivalent to testing
that inequality, except in the case where src_reg + off wraps past
U64_MAX. That is why Check 2 must actually compute src_reg + off if
Check 1 passes, again using the single spare reg.
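
Written out, the equivalence Check 1 relies on is (ignoring the
wraparound case, which Check 2 exists to handle):

  Check 1:    src_reg       >= (TASK_SIZE_MAX + PAGE_SIZE) - off
  + off:      src_reg + off >=  TASK_SIZE_MAX + PAGE_SIZE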

FIX 1: The Check 1 inequality listed above is not what the current code
is doing. The current code is a bit more pessimistic, instead checking:

  src_reg >= (TASK_SIZE_MAX + PAGE_SIZE + abs(off))

The 0x800000000010 in the above example comes from this current check.
If Check 1 were corrected to use the correct right-hand side, the value
would be 0x7ffffffffff0. This patch changes the checking logic more
broadly (FIX 2 below will elaborate), fixing this issue as a side effect
of the rewrite. Regardless, it's important to understand why Check 1
should have been using TASK_SIZE_MAX + PAGE_SIZE - off before proceeding.
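
With off = 0x10 from the example above, the two right-hand sides work
out to (values taken from the emitted movabs constants):

  TASK_SIZE_MAX + PAGE_SIZE           = 0x800000000000
  current Check 1   (limit + 0x10)    = 0x800000000010
  corrected Check 1 (limit - 0x10)    = 0x7ffffffffff0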

FIX 2: The current code relies on a 'jnc' to determine whether the
src_reg + off addition wrapped over. For negative offsets this logic is
incorrect. Consider the Check 2 insns emitted when off = -0x10:

  81:   mov    %rdi,%r11
  84:   add    $0xfffffffffffffff0,%r11
  8b:   jnc    0x0000000000000091

The 2's complement representation of -0x10 is a large positive u64. Any
value of src_reg that passes Check 1 will result in the carry flag being
set by the (src_reg + off) addition. So a load with any negative offset
will always fail Check 2 at runtime and never do the actual load. This
patch fixes the negative offset issue by rewriting both checks so that
they do not rely on the carry flag.
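
As a concrete illustration, take a hypothetical valid kernel address in
src_reg (any address passing Check 1 behaves the same way):

  src_reg        = 0xffff888000000000
  off (as u64)   = 0xfffffffffffffff0
  src_reg + off  = 0xffff887ffffffff0   /* a fine kernel address, but the
                                           64-bit add carried, so CF=1 */

The jnc falls through, dst_reg is zeroed and the load is skipped even
though the address was perfectly loadable.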

The rewrite takes advantage of the fact that, while we only have one
scratch reg to hold arbitrary values, we know the offset at JIT time.
Thus we can use src_reg as a temporary scratch reg to hold src_reg +
offset, since we can return it to its original value by later
subtracting offset. As a result we can directly check the original
inequality we care about:

  (src_reg + off) >= TASK_SIZE_MAX + PAGE_SIZE

For a load like %rdi = *(%rsi + -0x10), this results in emitted code:

  43:   movabs $0x800000000000,%r11
  4d:   add    $0xfffffffffffffff0,%rsi --- src_reg += off
  54:   cmp    %r11,%rsi                --- Check original inequality
  57:   jae    0x000000000000005d
  59:   xor    %edi,%edi
  5b:   jmp    0x0000000000000061
  5d:   mov    0x0(%rsi),%rdi           --- Actual Load
  61:   sub    $0xfffffffffffffff0,%rsi --- src_reg -= off

Note that the actual load is always done with offset 0, since previous
insns have already done src_reg += off. Regardless of whether the new
check succeeds or fails, insn 61 is always executed, returning src_reg
to its original value.
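
In the same C-like pseudocode style, the new sequence for
%rdi = *(%rsi + -0x10) is roughly (a sketch, not the JIT's literal
output):

  src_reg += off;                             /* add, may wrap */
  if (src_reg >= TASK_SIZE_MAX + PAGE_SIZE)   /* cmp + jae */
          dst_reg = *(u64 *)src_reg;          /* actual load at offset 0 */
  else
          dst_reg = 0;                        /* xor dst */
  src_reg -= off;                             /* always executed: restore src */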

Because the goal of these checks is to ensure that the loaded-from
address will be protected by the BPF exception handler, the new check
can safely ignore any wrapover from insn 4d. If such a wrapped-over
address passes insn 54 + 57's cmp-and-jmp, it will have that protection,
so the load can proceed.

As an aside, since the offset in the above calculations comes from a
bpf_insn, it's an s16 and thus src_reg + off won't wrap under 0 unless
src_reg is an anomalously low address in user address space. But again,
if such a wrapunder results in an address in kernelspace, it's fine for
the purpose of this check.

IMPROVEMENTS: The above improved logic is 8 insns vs the original
logic's 9, and has 1 fewer jmp. The number of checking insns can be
further reduced in common scenarios:

If src_reg == dst_reg, the actual load insn will clobber src_reg, so
there's no original src_reg state for the sub insn immediately following
the load to restore, and it can be omitted. In fact, it must be omitted,
since it would otherwise incorrectly subtract from the result of the
load. So for src_reg == dst_reg, the JIT emits these insns:

  3c:   movabs $0x800000000000,%r11
  46:   add    $0xfffffffffffffff0,%rdi
  4d:   cmp    %r11,%rdi
  50:   jae    0x0000000000000056
  52:   xor    %edi,%edi
  54:   jmp    0x000000000000005a
  56:   mov    0x0(%rdi),%rdi
  5a:

The only difference from the larger example is the omitted sub, which
would have been insn 5a in this example.

If offset == 0, we can similarly omit the sub, since there's nothing
added that needs to be subtracted back out. For the same reason we can
omit the add as well, resulting in the JIT emitting these insns:

  46:   movabs $0x800000000000,%r11
  4d:   cmp    %r11,%rdi
  50:   jae    0x0000000000000056
  52:   xor    %edi,%edi
  54:   jmp    0x000000000000005a
  56:   mov    0x0(%rdi),%rdi
  5a:

Although the above example also has src_reg == dst_reg, the same
offset == 0 optimization is equally valid when src_reg != dst_reg.
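
Put in the same pseudocode terms, the offset == 0 check-and-load reduces
to just:

  if (src_reg >= TASK_SIZE_MAX + PAGE_SIZE)
          dst_reg = *(u64 *)src_reg;
  else
          dst_reg = 0;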

To summarize the improvements in emitted insn count for the
check-and-load:

BEFORE:                8 check insns, 3 jmps
AFTER (general case):  7 check insns, 2 jmps (12.5% fewer insns, 33% fewer jmps)
AFTER (src == dst):    6 check insns, 2 jmps (25% fewer insns)
AFTER (offset == 0):   5 check insns, 2 jmps (37.5% fewer insns)

(Above counts don't include the 1 load insn, just checking around it)

Based on the BPF bytecode + JITted x86 insns I saw while experimenting
with these improvements, I expect the src_reg == dst_reg case to occur
most often, followed by offset == 0, then the general case.

Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
---
I originally noticed this issue while working on rbtree and submitted a
fix as part of that series: https://lore.kernel.org/bpf/20221206231000.3180914-11-davemarchevsky@fb.com/

This patch is more of an independent reimplementation than a follow-up,
so it's not necessary to have read that patch or the discussion around
it to understand this one.

Submitting independently as I won't actually need this for rbtree, and
it fixes issues which exist independently of the rbtree implementation.

 arch/x86/net/bpf_jit_comp.c | 70 +++++++++++++++++++++----------------
 1 file changed, 39 insertions(+), 31 deletions(-)

Comments

Yonghong Song Dec. 15, 2022, 11:38 p.m. UTC | #1
On 12/13/22 10:27 AM, Dave Marchevsky wrote:
> [...]
> 
> As an aside, since the offset in the above calculations comes from a
> bpf_insn, it's an s16 and thus src_reg + off won't wrap under 0 unless
> src_reg is an anomalously low address in user address space. But again,
> if such a wrapunder results in an address in kernelspace, it's fine for
> the purpose of this check.

Not sure how useful the above paragraph is. Are we talking about
'offset' wraparound here? If src_reg is indeed a very low address in
user space, the following compare
    (src_reg + off) >= TASK_SIZE_MAX + PAGE_SIZE
should be false and everything is fine. What exactly is this paragraph
meant to convey?

> 
> [...]
> Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>

Acked-by: Yonghong Song <yhs@fb.com>
diff mbox series

Patch

diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
index 36ffe67ad6e5..e3e2b57e4e13 100644
--- a/arch/x86/net/bpf_jit_comp.c
+++ b/arch/x86/net/bpf_jit_comp.c
@@ -992,6 +992,7 @@  static int do_jit(struct bpf_prog *bpf_prog, int *addrs, u8 *image, u8 *rw_image
 		u8 b2 = 0, b3 = 0;
 		u8 *start_of_ldx;
 		s64 jmp_offset;
+		s16 insn_off;
 		u8 jmp_cond;
 		u8 *func;
 		int nops;
@@ -1358,57 +1359,52 @@  st:			if (is_imm8(insn->off))
 		case BPF_LDX | BPF_PROBE_MEM | BPF_W:
 		case BPF_LDX | BPF_MEM | BPF_DW:
 		case BPF_LDX | BPF_PROBE_MEM | BPF_DW:
+			insn_off = insn->off;
+
 			if (BPF_MODE(insn->code) == BPF_PROBE_MEM) {
-				/* Though the verifier prevents negative insn->off in BPF_PROBE_MEM
-				 * add abs(insn->off) to the limit to make sure that negative
-				 * offset won't be an issue.
-				 * insn->off is s16, so it won't affect valid pointers.
+				/* Conservatively check that src_reg + insn->off is a kernel address:
+				 *   src_reg + insn->off >= TASK_SIZE_MAX + PAGE_SIZE
+				 * src_reg is used as scratch for src_reg += insn->off and restored
+				 * after emit_ldx if necessary
 				 */
-				u64 limit = TASK_SIZE_MAX + PAGE_SIZE + abs(insn->off);
-				u8 *end_of_jmp1, *end_of_jmp2;
 
-				/* Conservatively check that src_reg + insn->off is a kernel address:
-				 * 1. src_reg + insn->off >= limit
-				 * 2. src_reg + insn->off doesn't become small positive.
-				 * Cannot do src_reg + insn->off >= limit in one branch,
-				 * since it needs two spare registers, but JIT has only one.
+				u64 limit = TASK_SIZE_MAX + PAGE_SIZE;
+				u8 *end_of_jmp;
+
+				/* At end of these emitted checks, insn->off will have been added
+				 * to src_reg, so no need to do relative load with insn->off offset
 				 */
+				insn_off = 0;
 
 				/* movabsq r11, limit */
 				EMIT2(add_1mod(0x48, AUX_REG), add_1reg(0xB8, AUX_REG));
 				EMIT((u32)limit, 4);
 				EMIT(limit >> 32, 4);
+
+				if (insn->off) {
+					/* add src_reg, insn->off */
+					maybe_emit_1mod(&prog, src_reg, true);
+					EMIT2_off32(0x81, add_1reg(0xC0, src_reg), insn->off);
+				}
+
 				/* cmp src_reg, r11 */
 				maybe_emit_mod(&prog, src_reg, AUX_REG, true);
 				EMIT2(0x39, add_2reg(0xC0, src_reg, AUX_REG));
-				/* if unsigned '<' goto end_of_jmp2 */
-				EMIT2(X86_JB, 0);
-				end_of_jmp1 = prog;
-
-				/* mov r11, src_reg */
-				emit_mov_reg(&prog, true, AUX_REG, src_reg);
-				/* add r11, insn->off */
-				maybe_emit_1mod(&prog, AUX_REG, true);
-				EMIT2_off32(0x81, add_1reg(0xC0, AUX_REG), insn->off);
-				/* jmp if not carry to start_of_ldx
-				 * Otherwise ERR_PTR(-EINVAL) + 128 will be the user addr
-				 * that has to be rejected.
-				 */
-				EMIT2(0x73 /* JNC */, 0);
-				end_of_jmp2 = prog;
+
+				/* if unsigned '>=', goto load */
+				EMIT2(X86_JAE, 0);
+				end_of_jmp = prog;
 
 				/* xor dst_reg, dst_reg */
 				emit_mov_imm32(&prog, false, dst_reg, 0);
 				/* jmp byte_after_ldx */
 				EMIT2(0xEB, 0);
 
-				/* populate jmp_offset for JB above to jump to xor dst_reg */
-				end_of_jmp1[-1] = end_of_jmp2 - end_of_jmp1;
-				/* populate jmp_offset for JNC above to jump to start_of_ldx */
+				/* populate jmp_offset for JAE above to jump to start_of_ldx */
 				start_of_ldx = prog;
-				end_of_jmp2[-1] = start_of_ldx - end_of_jmp2;
+				end_of_jmp[-1] = start_of_ldx - end_of_jmp;
 			}
-			emit_ldx(&prog, BPF_SIZE(insn->code), dst_reg, src_reg, insn->off);
+			emit_ldx(&prog, BPF_SIZE(insn->code), dst_reg, src_reg, insn_off);
 			if (BPF_MODE(insn->code) == BPF_PROBE_MEM) {
 				struct exception_table_entry *ex;
 				u8 *_insn = image + proglen + (start_of_ldx - temp);
@@ -1417,6 +1413,18 @@  st:			if (is_imm8(insn->off))
 				/* populate jmp_offset for JMP above */
 				start_of_ldx[-1] = prog - start_of_ldx;
 
+				if (insn->off && src_reg != dst_reg) {
+					/* sub src_reg, insn->off
+					 * Restore src_reg after "add src_reg, insn->off" in prev
+					 * if statement. But if src_reg == dst_reg, emit_ldx
+					 * above already clobbered src_reg, so no need to restore.
+					 * If add src_reg, insn->off was unnecessary, no need to
+					 * restore either.
+					 */
+					maybe_emit_1mod(&prog, src_reg, true);
+					EMIT2_off32(0x81, add_1reg(0xE8, src_reg), insn->off);
+				}
+
 				if (!bpf_prog->aux->extable)
 					break;