
[bpf-next] bpf: Relax precision marking in open coded iters and may_goto loop.

Message ID 20240522024713.59136-1-alexei.starovoitov@gmail.com (mailing list archive)
State Superseded
Delegated to: BPF

Checks

Context Check Description
bpf/vmtest-bpf-next-PR fail PR summary
bpf/vmtest-bpf-next-VM_Test-3 success Logs for Validate matrix.py
bpf/vmtest-bpf-next-VM_Test-2 success Logs for Unittests
bpf/vmtest-bpf-next-VM_Test-0 success Logs for Lint
bpf/vmtest-bpf-next-VM_Test-5 success Logs for aarch64-gcc / build-release
bpf/vmtest-bpf-next-VM_Test-1 success Logs for ShellCheck
bpf/vmtest-bpf-next-VM_Test-10 success Logs for aarch64-gcc / veristat
bpf/vmtest-bpf-next-VM_Test-12 success Logs for s390x-gcc / build-release
bpf/vmtest-bpf-next-VM_Test-4 success Logs for aarch64-gcc / build / build for aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-14 success Logs for s390x-gcc / test (test_progs, false, 360) / test_progs on s390x with gcc
bpf/vmtest-bpf-next-VM_Test-8 success Logs for aarch64-gcc / test (test_progs_no_alu32, false, 360) / test_progs_no_alu32 on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-11 success Logs for s390x-gcc / build / build for s390x with gcc
bpf/vmtest-bpf-next-VM_Test-9 success Logs for aarch64-gcc / test (test_verifier, false, 360) / test_verifier on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-7 success Logs for aarch64-gcc / test (test_progs, false, 360) / test_progs on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-34 success Logs for x86_64-llvm-17 / veristat
bpf/vmtest-bpf-next-VM_Test-13 success Logs for s390x-gcc / test (test_maps, false, 360) / test_maps on s390x with gcc
bpf/vmtest-bpf-next-VM_Test-17 success Logs for s390x-gcc / veristat
bpf/vmtest-bpf-next-VM_Test-36 success Logs for x86_64-llvm-18 / build-release / build for x86_64 with llvm-18 and -O2 optimization
bpf/vmtest-bpf-next-VM_Test-18 success Logs for set-matrix
bpf/vmtest-bpf-next-VM_Test-28 success Logs for x86_64-llvm-17 / build / build for x86_64 with llvm-17
bpf/vmtest-bpf-next-VM_Test-29 success Logs for x86_64-llvm-17 / build-release / build for x86_64 with llvm-17 and -O2 optimization
bpf/vmtest-bpf-next-VM_Test-16 success Logs for s390x-gcc / test (test_verifier, false, 360) / test_verifier on s390x with gcc
bpf/vmtest-bpf-next-VM_Test-42 success Logs for x86_64-llvm-18 / veristat
bpf/vmtest-bpf-next-VM_Test-15 success Logs for s390x-gcc / test (test_progs_no_alu32, false, 360) / test_progs_no_alu32 on s390x with gcc
bpf/vmtest-bpf-next-VM_Test-35 success Logs for x86_64-llvm-18 / build / build for x86_64 with llvm-18
bpf/vmtest-bpf-next-VM_Test-6 success Logs for aarch64-gcc / test (test_maps, false, 360) / test_maps on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-20 success Logs for x86_64-gcc / build-release
bpf/vmtest-bpf-next-VM_Test-19 success Logs for x86_64-gcc / build / build for x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-21 success Logs for x86_64-gcc / test (test_maps, false, 360) / test_maps on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-32 success Logs for x86_64-llvm-17 / test (test_progs_no_alu32, false, 360) / test_progs_no_alu32 on x86_64 with llvm-17
bpf/vmtest-bpf-next-VM_Test-26 success Logs for x86_64-gcc / test (test_verifier, false, 360) / test_verifier on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-31 success Logs for x86_64-llvm-17 / test (test_progs, false, 360) / test_progs on x86_64 with llvm-17
bpf/vmtest-bpf-next-VM_Test-41 success Logs for x86_64-llvm-18 / test (test_verifier, false, 360) / test_verifier on x86_64 with llvm-18
bpf/vmtest-bpf-next-VM_Test-33 success Logs for x86_64-llvm-17 / test (test_verifier, false, 360) / test_verifier on x86_64 with llvm-17
bpf/vmtest-bpf-next-VM_Test-30 success Logs for x86_64-llvm-17 / test (test_maps, false, 360) / test_maps on x86_64 with llvm-17
bpf/vmtest-bpf-next-VM_Test-23 success Logs for x86_64-gcc / test (test_progs_no_alu32, false, 360) / test_progs_no_alu32 on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-24 success Logs for x86_64-gcc / test (test_progs_no_alu32_parallel, true, 30) / test_progs_no_alu32_parallel on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-22 success Logs for x86_64-gcc / test (test_progs, false, 360) / test_progs on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-27 fail Logs for x86_64-gcc / veristat / veristat on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-25 success Logs for x86_64-gcc / test (test_progs_parallel, true, 30) / test_progs_parallel on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-39 success Logs for x86_64-llvm-18 / test (test_progs_cpuv4, false, 360) / test_progs_cpuv4 on x86_64 with llvm-18
bpf/vmtest-bpf-next-VM_Test-37 success Logs for x86_64-llvm-18 / test (test_maps, false, 360) / test_maps on x86_64 with llvm-18
bpf/vmtest-bpf-next-VM_Test-40 success Logs for x86_64-llvm-18 / test (test_progs_no_alu32, false, 360) / test_progs_no_alu32 on x86_64 with llvm-18
bpf/vmtest-bpf-next-VM_Test-38 success Logs for x86_64-llvm-18 / test (test_progs, false, 360) / test_progs on x86_64 with llvm-18

Commit Message

Alexei Starovoitov May 22, 2024, 2:47 a.m. UTC
From: Alexei Starovoitov <ast@kernel.org>

Motivation for the patch
------------------------
Open coded iterators and may_goto are a great mechanism to implement loops,
but counted loops are problematic. For example:
  for (i = 0; i < 100 && can_loop; i++)
is verified as a bounded loop, since the i < 100 condition forces the verifier
to mark 'i' as precise, and loop states at different iterations are not equivalent.
That removes the benefit of open coded iterators and may_goto.
The workaround is to do:
  int zero = 0; /* global or volatile variable */
  for (i = zero; i < 100 && can_loop; i++)
to hide the value of 'i' from the verifier.
It's unnatural, and so far users haven't learned such an odd programming pattern.

This patch aims to improve the verifier to support
  for (i = 0; i < 100000 && can_loop; i++)
as an open coded iter loop (when 'i' doesn't need to be precise).
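
For illustration, a minimal sketch of the pattern this patch wants to accept
(hypothetical program; assumes the can_loop macro from the selftests'
bpf_experimental.h):

  SEC("syscall")
  int counted_loop(void *ctx)
  {
          __u64 i, sum = 0;

          /* plain 'i = 0', no 'i = zero' obfuscation needed */
          for (i = 0; i < 100000 && can_loop; i++)
                  sum += i; /* 'i' is never used as a memory index */
          return 0;
  }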

Algorithm
---------
First of all:
   if (is_may_goto_insn_at(env, insn_idx)) {
+          update_loop_entry(cur, &sl->state);
           if (states_equal(env, &sl->state, cur, RANGE_WITHIN)) {
-                  update_loop_entry(cur, &sl->state);

This should be correct, since reaching the same insn should
satisfy the "if h1 in path" requirement of the update_loop_entry() algorithm.
Updating loop_entry only on a state match is too conservative.

With that, get_loop_entry() can be used to gate the is_branch_taken() logic:
when 'if (i < 1000)' is done within an open coded iterator or in a loop with
may_goto, don't invoke is_branch_taken().
When it's skipped, don't do reg_bounds_sanity_check() either, since it will
surely see range violations.

Now, consider progs/iters_task_vma.c that has the following logic:
    bpf_for_each(...) {
       if (i > 1000)
          break;

       arr[i] = ..;
    }

Skipping the precision mark at if (i > 1000) keeps 'i' imprecise,
but arr[i] will mark 'i' as precise anyway, because 'arr' is a map.
On the next iteration of the loop the patch does copy_precision(),
which copies precision markings from the top of the loop into the next
state of the loop. So on the next iteration 'i' will be seen as precise.

Hence the key part of the patch:
-       pred = is_branch_taken(dst_reg, src_reg, opcode, is_jmp32);
+       if (!get_loop_entry(this_branch) || src_reg->precise || dst_reg->precise ||
+           (BPF_SRC(insn->code) == BPF_K && insn->imm == 0))
+               pred = is_branch_taken(dst_reg, src_reg, opcode, is_jmp32);

!get_loop_entry(this_branch) -> if not inside an open coded iter, keep the
  existing is_branch_taken() logic, since bounded loops rely on it.

src_reg->precise || dst_reg->precise -> if later inside the loop 'i' was
  actually marked as precise, then we have to do is_branch_taken(), and the
  above bpf_for_each() will be verified as a bounded loop checking all 1000
  iterations. Otherwise we would keep incrementing 'i', it would eventually
  get out of bounds in arr[i], and the verifier would reject such a memory
  access.

BPF_SRC(insn->code) == BPF_K && insn->imm == 0 -> if it's a check for
  an exit condition of an open coded iterator, then do is_branch_taken() as
  well. Otherwise open coded iterators won't work at all.

Now consider the same example:
    bpf_for_each(...) {
       if (i > 1000)
          break;

       arr[i] = ..;
    }
but 'arr' is an arena pointer. In this case 'i > 1000' will keep 'i'
imprecise, and arr[i] will keep it imprecise as well, so the whole loop
will be verified with open coded iterator logic.

Now the following works:
-       for (i = zero; i < 1000; i++)
+       for (i = 0; i < 100000 && can_loop; i++) {
                htab_update_elem(htab, i, i);
+               arr[i] = i; // either arr1 or arr2
+       }
+char __arena arr1[100000]; /* works */
+char arr2[100000]; /* runs into 1M limit */

So users can now use the 'for (i = 0; ...' pattern everywhere, and the
verifier will fall back to bounded loop logic and a precise 'i' when 'i'
is used in a map-style memory access.
For arena based algorithms 'i' will stay imprecise.
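
To make the contrast concrete, a hypothetical back-to-back example
(arr_map/arr_arena are made-up names; assumes can_loop and the __arena
attribute from the selftests):

  char arr_map[1000];           /* .bss, i.e. map-backed memory */
  char __arena arr_arena[1000]; /* arena memory */

  for (i = 0; i < 1000 && can_loop; i++)
          arr_map[i] = 1;   /* access marks 'i' precise -> bounded loop logic */

  for (i = 0; i < 1000 && can_loop; i++)
          arr_arena[i] = 1; /* 'i' stays imprecise -> converges like may_goto */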

-       for (i = zero; i < ARR_SZ && can_loop; i++)
+       /* i = 0 is ok here, since i is not used in memory access */
+       for (i = 0; i < ARR_SZ && can_loop; i++)
                sum += i;
+
+       /* have to use i = zero due to arr[i] where arr is not an arena */
        for (i = zero; i < ARR_SZ; i++) {
                barrier_var(i);
                sum += i + arr[i];

and the i = zero workaround in iter_obfuscate_counter() can be removed.

copy_precision() is a hack, of course, to demonstrate an idea.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 kernel/bpf/verifier.c                         | 94 +++++++++++++++++--
 .../testing/selftests/bpf/progs/arena_htab.c  | 11 ++-
 tools/testing/selftests/bpf/progs/iters.c     | 18 +---
 .../bpf/progs/verifier_iterating_callbacks.c  | 17 ++--
 4 files changed, 112 insertions(+), 28 deletions(-)

Comments

Andrii Nakryiko May 22, 2024, 5:33 p.m. UTC | #1
On Tue, May 21, 2024 at 7:47 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> From: Alexei Starovoitov <ast@kernel.org>
>
[...]

> Now, consider progs/iters_task_vma.c that has the following logic:
>     bpf_for_each(...) {
>        if (i > 1000)

I'm wondering, maybe we should change the rules around handling inequality
(>, >=, <, <=) comparisons for register(s) that have a constant value
(or maybe actually any value).

My reasoning is the following. When we have something like this
`if (i > 1000)` condition, that means that for the fallthrough branch
whether i is 0, or 1, or 2, or whatever doesn't really matter, because
the code presumably works for any value in [0, 999] range, right? So
maybe in addition to marking it precise and keeping i's range estimate
the same, we should extend this range according to the inequality condition?

That is, even if we know on the first iteration that i is 0 (!precise),
when we simulate this conditional jump instruction, adjust i's range
to be [0, 999] (precise) in the fallthrough branch, and [1000,
U64_MAX] in the true branch?

I.e., make conditional jumps into "range widening" instructions?

Have you thought about this approach? Do you think it will work in
practice? I'm sure it can't be as simple as that, but still, it's worth
considering. Curious to hear Eduard's opinion as well; he's dealt
with this a lot in the past.
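
For concreteness, a hypothetical sketch for an unsigned `if (rX >= bound)`
check (widen_regs() is a made-up helper, not actual verifier code, and the
signed and var_off bounds would need the same treatment):

    static void widen_regs(struct bpf_reg_state *fall_reg,
                           struct bpf_reg_state *taken_reg, u64 bound)
    {
            /* fallthrough: the code must work for any value below bound */
            fall_reg->umin_value = 0;
            fall_reg->umax_value = bound - 1;
            /* taken branch: any value at or above bound */
            taken_reg->umin_value = bound;
            taken_reg->umax_value = U64_MAX;
    }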

>           break;
>
>        arr[i] = ..;
>     }

[...]

There is a lot to think about here, I'll try to get to this
today/tomorrow. But for now veristat is concerned about this change
([0]):

|File                              |Program                          |Verdict                |States Diff (%)|
|----------------------------------|---------------------------------|-----------------------|---------------|
|arena_htab_asm.bpf.o              |arena_htab_asm                   |success                |-80.91 %       |
|core_kern.bpf.o                   |balancer_ingress                 |success -> failure (!!)|+0.00 %        |
|dynptr_success.bpf.o              |test_read_write                  |success -> failure (!!)|+0.00 %        |
|iters.bpf.o                       |checkpoint_states_deletion       |success -> failure (!!)|+0.00 %        |
|iters.bpf.o                       |iter_multiple_sequential_loops   |success                |-11.43 %       |
|iters.bpf.o                       |iter_obfuscate_counter           |success                |+30.00 %       |
|iters.bpf.o                       |iter_pragma_unroll_loop          |success                |-23.08 %       |
|iters.bpf.o                       |iter_subprog_iters               |success                |+1.14 %        |
|iters.bpf.o                       |loop_state_deps1                 |failure                |+7.14 %        |
|iters.bpf.o                       |loop_state_deps2                 |failure                |-2.17 %        |
|iters_task_vma.bpf.o              |iter_task_vma_for_each           |success -> failure (!!)|+99.20 %       |
|linked_list.bpf.o                 |global_list_push_pop_multiple    |success -> failure (!!)|+0.00 %        |
|linked_list.bpf.o                 |inner_map_list_push_pop_multiple |success -> failure (!!)|+0.00 %        |
|linked_list.bpf.o                 |map_list_push_pop_multiple       |success -> failure (!!)|+0.00 %        |
|test_seg6_loop.bpf.o              |__add_egr_x                      |success -> failure (!!)|+0.00 %        |
|test_sysctl_loop1.bpf.o           |sysctl_tcp_mem                   |success -> failure (!!)|+0.00 %        |
|test_sysctl_loop2.bpf.o           |sysctl_tcp_mem                   |success -> failure (!!)|+0.00 %        |
|test_verif_scale2.bpf.o           |balancer_ingress                 |success -> failure (!!)|+0.00 %        |
|verifier_bounds.bpf.o             |bound_greater_than_u32_max       |success -> failure (!!)|+0.00 %        |
|verifier_bounds.bpf.o             |crossing_32_bit_signed_boundary_2|success -> failure (!!)|+0.00 %        |
|verifier_bounds.bpf.o             |crossing_64_bit_signed_boundary_2|success -> failure (!!)|+0.00 %        |
|verifier_iterating_callbacks.bpf.o|cond_break2                      |success                |+75.00 %       |
|verifier_iterating_callbacks.bpf.o|cond_break3                      |success                |+66.67 %       |
|verifier_iterating_callbacks.bpf.o|cond_break4                      |success                |+300.00 %      |
|verifier_iterating_callbacks.bpf.o|cond_break5                      |success                |+266.67 %      |

  [0] https://github.com/kernel-patches/bpf/actions/runs/9184700207/job/25257587541

[...]
Eduard Zingerman May 22, 2024, 8:18 p.m. UTC | #2
On Tue, 2024-05-21 at 19:47 -0700, Alexei Starovoitov wrote:

[...]

Regarding this part, since we discussed it off-list
(I'll continue with the rest of the patch a bit later).

> First of all:
>    if (is_may_goto_insn_at(env, insn_idx)) {
> +          update_loop_entry(cur, &sl->state);
>            if (states_equal(env, &sl->state, cur, RANGE_WITHIN)) {
> -                  update_loop_entry(cur, &sl->state);
> 
> This should be correct, since reaching the same insn should
> satisfy "if h1 in path" requirement of update_loop_entry() algorithm.
> It's too conservative to update loop_entry only on a state match.

So, this basically changes the definition of the verifier states loop.
Previously, we considered a state loop to be such a sequence of states
Si -> ... -> Sj -> ... -> Sk that states_equal(Si, Sk, RANGE_WITHIN)
is true.

With this change Si -> ... -> Sj -> ... -> Sk is a loop if call sites and
instruction pointers for Si and Sk match.

Whether or not Si and Sk are in the loop influences two things:
(a) if exact comparison is needed for states cache;
(b) if widening transformation could be applied to some scalars.

As far as I understand, all pairs (Si, Sk) marked as a loop using old
definition would be marked as such using new definition
(in addition to some new pairs).

I think that it is safe to apply (a) and (b) in strictly more cases.
(Although it is probably possible to conjure a program where such
 change would hinder verification convergence).

[...]
Alexei Starovoitov May 22, 2024, 9:09 p.m. UTC | #3
On Wed, May 22, 2024 at 10:33 AM Andrii Nakryiko
<andrii.nakryiko@gmail.com> wrote:
>
> On Tue, May 21, 2024 at 7:47 PM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
[...]

> >     bpf_for_each(...) {
> >        if (i > 1000)
>
> I'm wondering, maybe we should change the rules around handling inequality
> (>, >=, <, <=) comparisons for register(s) that have a constant value
> (or maybe actually any value).
>
> My reasoning is the following. When we have something like this
> `if (i > 1000)` condition, that means that for the fallthrough branch
> whether i is 0, or 1, or 2, or whatever doesn't really matter, because
> the code presumably works for any value in [0, 999] range, right? So
> maybe in addition to marking it precise and keeping i's range estimate
> the same, we should extend this range according to the inequality condition?
>
> That is, even if we know on the first iteration that i is 0 (!precise),
> when we simulate this conditional jump instruction, adjust i's range
> to be [0, 999] (precise) in the fallthrough branch, and [1000,
> U64_MAX] in the true branch?
>
> I.e., make conditional jumps into "range widening" instructions?
>
> Have you thought about this approach? Do you think it will work in
> practice? I'm sure it can't be as simple as that, but still, it's worth
> considering. Curious to hear Eduard's opinion as well; he's dealt
> with this a lot in the past.

I looked into doing exactly that [0, 999] and [1000, max],
then on the next iteration i+=1 insn will adjust it to
[1, 1000], but the next i < 1000 will widen it back to
[0, 999] and the state equivalence will be happy.
But my excitement was short lived, since both gcc and llvm
optimize the loop exit condition to !=
and they do it in the middle end.
Backends cannot influence this optimization.
I don't think it's practical to undo it in the backend.
So most of the loops written as:
for (i = 0; i < 1000; i++)
are compiled as
for (i = 0; i != 1000; i++)
for x86, arm, bpf, etc.

so if there is arr[i] inside the loop, the verifier has to rely on
bounded loop logic and check i=0, 1, 2, ... 999 one by one, since
nothing else inside the loop makes the array index bounded.

Another small obstacle is that we don't have [!=const] range,
so i != 100 cannot be widened into [100] and [!=100].
We can add that without too much trouble.
But it won't help this arr[i] case anyway.

We can make i != 100 to be [unknown] and [unknown].
It's bad for arr[i] too, but fine when arr is an arena pointer.
Unfortunately at the time of the 'if' we don't know what comes later.
If the verifier knew that it's only dealing with arena pointers
it could disable precision altogether.

So I went with conditional disable of is_branch_taken + mark_precise and
surprisingly it didn't break any tests.

...

> > copy_precision() is a hack, of course, to demonstrate an idea.

btw I think I know of a better way of doing copy_precision().
So don't pay much attention to it.

> >
> > Signed-off-by: Alexei Starovoitov <ast@kernel.org>
> > ---
>
> There is a lot to think about here, I'll try to get to this
> today/tomorrow. But for now veristat is concerned about this change
> ([0]):
>
> |File                              |Program                          |Verdict                |States Diff (%)|
> |----------------------------------|---------------------------------|-----------------------|---------------|
> |arena_htab_asm.bpf.o              |arena_htab_asm                   |success                |-80.91 %       |
> |core_kern.bpf.o                   |balancer_ingress                 |success -> failure (!!)|+0.00 %        |
> |dynptr_success.bpf.o              |test_read_write                  |success -> failure (!!)|+0.00 %        |
> |iters.bpf.o                       |checkpoint_states_deletion       |success -> failure (!!)|+0.00 %        |
> |iters.bpf.o                       |iter_multiple_sequential_loops   |success                |-11.43 %       |
> |iters.bpf.o                       |iter_obfuscate_counter           |success                |+30.00 %       |
> |iters.bpf.o                       |iter_pragma_unroll_loop          |success                |-23.08 %       |
> |iters.bpf.o                       |iter_subprog_iters               |success                |+1.14 %        |
> |iters.bpf.o                       |loop_state_deps1                 |failure                |+7.14 %        |
> |iters.bpf.o                       |loop_state_deps2                 |failure                |-2.17 %        |
> |iters_task_vma.bpf.o              |iter_task_vma_for_each           |success -> failure (!!)|+99.20 %       |
> |linked_list.bpf.o                 |global_list_push_pop_multiple    |success -> failure (!!)|+0.00 %        |
> |linked_list.bpf.o                 |inner_map_list_push_pop_multiple |success -> failure (!!)|+0.00 %        |
> |linked_list.bpf.o                 |map_list_push_pop_multiple       |success -> failure (!!)|+0.00 %        |
> |test_seg6_loop.bpf.o              |__add_egr_x                      |success -> failure (!!)|+0.00 %        |
> |test_sysctl_loop1.bpf.o           |sysctl_tcp_mem                   |success -> failure (!!)|+0.00 %        |
> |test_sysctl_loop2.bpf.o           |sysctl_tcp_mem                   |success -> failure (!!)|+0.00 %        |
> |test_verif_scale2.bpf.o           |balancer_ingress                 |success -> failure (!!)|+0.00 %        |
> |verifier_bounds.bpf.o             |bound_greater_than_u32_max       |success -> failure (!!)|+0.00 %        |

That was due to veristat being picky ;)
Extra verbose() output in the verifier, not gated by log_level,
didn't fit in the 64k veristat log buffer, and the ENOSPC turned into a failure.

> |verifier_bounds.bpf.o             |crossing_32_bit_signed_boundary_2|success -> failure (!!)|+0.00 %        |
> |verifier_bounds.bpf.o             |crossing_64_bit_signed_boundary_2|success -> failure (!!)|+0.00 %        |
> |verifier_iterating_callbacks.bpf.o|cond_break2                      |success                |+75.00 %       |
> |verifier_iterating_callbacks.bpf.o|cond_break3                      |success                |+66.67 %       |
> |verifier_iterating_callbacks.bpf.o|cond_break4                      |success                |+300.00 %      |
> |verifier_iterating_callbacks.bpf.o|cond_break5                      |success                |+266.67 %      |

This is expected, since the tests changed.
In this case the 300% regression is from 1 state to 3 states,
and from 10 to 21 states, ...
We should probably print absolute state values in veristat CI.
Alexei Starovoitov May 22, 2024, 9:13 p.m. UTC | #4
On Wed, May 22, 2024 at 1:18 PM Eduard Zingerman <eddyz87@gmail.com> wrote:
>
> On Tue, 2024-05-21 at 19:47 -0700, Alexei Starovoitov wrote:
>
> [...]
>
> Regarding this part, since we discussed it off-list
> (I'll continue with the rest of the patch a bit later).
>
> > First of all:
> >    if (is_may_goto_insn_at(env, insn_idx)) {
> > +          update_loop_entry(cur, &sl->state);
> >            if (states_equal(env, &sl->state, cur, RANGE_WITHIN)) {
> > -                  update_loop_entry(cur, &sl->state);
> >
> > This should be correct, since reaching the same insn should
> > satisfy "if h1 in path" requirement of update_loop_entry() algorithm.
> > It's too conservative to update loop_entry only on a state match.
>
> So, this basically changes the definition of the verifier states loop.
> Previously, we considered a state loop to be such a sequence of states
> Si -> ... -> Sj -> ... -> Sk that states_equal(Si, Sk, RANGE_WITHIN)
> is true.
>
> With this change Si -> ... -> Sj -> ... -> Sk is a loop if call sites and
> instruction pointers for Si and Sk match.
>
> Whether or not Si and Sk are in the loop influences two things:
> (a) if exact comparison is needed for states cache;
> (b) if widening transformation could be applied to some scalars.
>
> As far as I understand, all pairs (Si, Sk) marked as a loop using old
> definition would be marked as such using new definition
> (in addition to some new pairs).
>
> I think that it is safe to apply (a) and (b) in strictly more cases.

Agree with this conclusion.
As discussed offlist we can add a check that
Si->parent->parent...->parent == Sk
to make the algorithm "by the book".
I'll play with that.
Eduard Zingerman May 22, 2024, 9:54 p.m. UTC | #5
On Wed, 2024-05-22 at 14:13 -0700, Alexei Starovoitov wrote:

[...]

> Agree with this conclusion.
> As discussed offlist we can add a check that
> Si->parent->parent...->parent == Sk
> to make the algorithm "by the book".
> I'll play with that.

Actually, I don't think this is necessary, here is the code for
update_loop_entry():

    static void update_loop_entry(struct bpf_verifier_state *cur,
                                  struct bpf_verifier_state *hdr)
    {
            struct bpf_verifier_state *cur1, *hdr1;

            cur1 = get_loop_entry(cur) ?: cur;
            hdr1 = get_loop_entry(hdr) ?: hdr;
            if (hdr1->branches && hdr1->dfs_depth <= cur1->dfs_depth) {
                    cur->loop_entry = hdr;
                    hdr->used_as_loop_entry = true;
            }
    }
    
It relies on the following properties:
- every state in the current DFS path (except current)
  has branches > 0;
- states not in the DFS path are either:
  - in explored_states, are fully explored and have branches == 0;
  - in env->stack, are not yet explored and have branches == 0
    (and also not reachable from is_state_visited()).

So, I don't think there is a need to check that hdr1 is in the parent
chain for cur1.
Eduard Zingerman May 23, 2024, 12:02 a.m. UTC | #6
On Tue, 2024-05-21 at 19:47 -0700, Alexei Starovoitov wrote:

[...]

> Skipping the precision mark at if (i > 1000) keeps 'i' imprecise,
> but arr[i] will mark 'i' as precise anyway, because 'arr' is a map.
> On the next iteration of the loop the patch does copy_precision(),
> which copies precision markings from the top of the loop into the next
> state of the loop. So on the next iteration 'i' will be seen as precise.

Could you please elaborate a bit on why copy_precision() is necessary?
In general, the main idea of the patch is to skip precision marks in
certain cases, meaning that strictly more branches would be explored,
and it does not seem that copy_precision() is needed for safety reasons.

I tried turning copy_precision() off and saw a single test failing:

    $ ./test_progs -vvv -a iters/task_vma
    ...
    ; bpf_for_each(task_vma, vma, task, 0) { @ iters_task_vma.c:30
    35: (55) if r0 != 0x0 goto pc-15 21: R0_w=ptr_vm_area_struct(id=1002) R6=1000 R10=fp0 fp-8=iter_task_vma(ref_id=1,state=active,depth=1001) refs=1
    ; if (bpf_cmp_unlikely(seen, >=, 1000)) @ iters_task_vma.c:31
    21: (35) if r6 >= 0x3e8 goto pc+14    ; R6=1000 refs=1
    ; vm_ranges[seen].vm_start = vma->vm_start; @ iters_task_vma.c:34
    22: (bf) r1 = r6
    REG INVARIANTS VIOLATION (alu): range bounds violation u64=[0x3e8, 0x3e7] s64=[0x3e8, 0x3e7] u32=[0x3e8, 0x3e7] s32=[0x3e8, 0x3e8] var_off=(0x3e8, 0x0)
    23: R1_w=1000 R6=1000 refs=1
    23: (67) r1 <<= 4                     ; R1_w=16000 refs=1
    24: (18) r2 = 0xffffc90000342008      ; R2_w=map_value(map=iters_ta.bss,ks=4,vs=16008,off=8) refs=1
    26: (0f) r2 += r1                     ; R1_w=16000 R2_w=map_value(map=iters_ta.bss,ks=4,vs=16008,off=16008) refs=1
    27: (79) r1 = *(u64 *)(r0 +0)         ; R0_w=ptr_vm_area_struct(id=1002) R1_w=scalar() refs=1
    28: (7b) *(u64 *)(r2 +0) = r1
    invalid access to map value, value_size=16008 off=16008 size=8
    R2 min value is outside of the allowed memory range
    processed 27035 insns (limit 1000000) max_states_per_insn 65 total_states 2003 peak_states 1008 mark_read 2

I wonder, what if we forgo the 'ignore_bad_range' flag and instead
consider branches with invalid range as impossible?
E.g. when pred == -2. Or when prediction says that branch would be
taken and another branch leads to invalid range.
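
A rough sketch of what I mean (hypothetical, untested):

    err = reg_bounds_sanity_check(env, true_reg1, "true_reg1");
    if (err)
            /* contradictory range => the true branch is impossible,
             * keep only the fallthrough branch
             */
            pred = 0;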

I'll give it a try later this evening, but still curious about the
reasoning behind copy_precision().

[...]
Alexei Starovoitov May 23, 2024, 2:31 a.m. UTC | #7
On Wed, May 22, 2024 at 5:02 PM Eduard Zingerman <eddyz87@gmail.com> wrote:
>
> On Tue, 2024-05-21 at 19:47 -0700, Alexei Starovoitov wrote:
>
> [...]
>
> > Skipping the precision mark at if (i > 1000) keeps 'i' imprecise,
> > but arr[i] will mark 'i' as precise anyway, because 'arr' is a map.
> > On the next iteration of the loop the patch does copy_precision(),
> > which copies precision markings from the top of the loop into the next
> > state of the loop. So on the next iteration 'i' will be seen as precise.
>
> Could you please elaborate a bit on why copy_precision() is necessary?
> In general, the main idea of the patch is to skip precision marks in
> certain cases, meaning that strictly more branches would be explored,
> and it does not seem that copy_precision() is needed for safety reasons.
>
> I tried turning copy_precision() off and saw a single test failing:
>
[...]

Exactly. It's failing without copy_precision().
I didn't add an additional selftest, but I tried to explain
the idea in the commit log.

That's the case where the verifier processes an open coded iter
with bounded loop logic.
Try:
-               if (bpf_cmp_unlikely(seen, >=, 1000))
+               if (bpf_cmp_unlikely(seen, ==, 1000))

and it will also pass.
In both cases there will be 27038 processed insn.
All 1k iterations will be checked by the verifier.

That's the main idea.
Let users write normal loops: for (i=0; i < 1000 && can_loop; ...)
and if there is no map value access through 'i' the loop
will converge quickly as may_goto.
If not, bounded loop logic will kick in for arr[i],
just like it did for vm_ranges[seen] in this test.

> I wonder, what if we forgo the 'ignore_bad_range' flag and instead
> consider branches with invalid range as impossible?
> E.g. when pred == -2. Or when prediction says that branch would be
> taken and another branch leads to invalid range.

That sounds pretty much like is_branch_taken().
If we don't explore the other branch, we have to mark the register as precise.

> I'll give it a try later this evening, but still curious about the
> reasoning behind copy_precision().

copy_precision() is kinda the main idea.
Once a loop iteration copies precision from the loop header,
the if (dst_reg->precise || src_reg->precise) logic kicks in,
is_branch_taken() starts to work, and it becomes bounded-loop-like,
where every iteration's states are different.
Alexei Starovoitov May 24, 2024, 4:19 a.m. UTC | #8
On Wed, May 22, 2024 at 2:09 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Wed, May 22, 2024 at 10:33 AM Andrii Nakryiko
> <andrii.nakryiko@gmail.com> wrote:
[...]
>
> I looked into doing exactly that [0, 999] and [1000, max],
> then on the next iteration i+=1 insn will adjust it to
> [1, 1000], but the next i < 1000 will widen it back to
> [0, 999] and the state equivalence will be happy.
> But my excitement was short lived, since both gcc and llvm
> optimize the loop exit condition to !=
> and they do it in the middle end.
> Backends cannot influence this optimization.
> I don't think it's practical to undo it in the backend.
> So most of the loops written as:
> for (i = 0; i < 1000; i++)
> are compiled as
> for (i = 0; i != 1000; i++)
> for x86, arm, bpf, etc.
>
> so if there is arr[i] inside the loop, the verifier has to rely on
> bounded loop logic and check i=0, 1, 2, ... 999 one by one, since
> nothing else inside the loop makes the array index bounded.

So I've decided to implement this idea anyway to see whether
it can be salvaged for some cases.
It turned out that it's not that bad. With an extra heuristic
it's almost working. Stay tuned.

Patch

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 77da1f438bec..7a1606ccf692 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -14882,7 +14882,7 @@  static int reg_set_min_max(struct bpf_verifier_env *env,
 			   struct bpf_reg_state *true_reg2,
 			   struct bpf_reg_state *false_reg1,
 			   struct bpf_reg_state *false_reg2,
-			   u8 opcode, bool is_jmp32)
+			   u8 opcode, bool is_jmp32, bool ignore_bad_range)
 {
 	int err;
 
@@ -14903,6 +14903,8 @@  static int reg_set_min_max(struct bpf_verifier_env *env,
 	reg_bounds_sync(true_reg1);
 	reg_bounds_sync(true_reg2);
 
+	if (ignore_bad_range)
+		return 0;
 	err = reg_bounds_sanity_check(env, true_reg1, "true_reg1");
 	err = err ?: reg_bounds_sanity_check(env, true_reg2, "true_reg2");
 	err = err ?: reg_bounds_sanity_check(env, false_reg1, "false_reg1");
@@ -15177,7 +15179,11 @@  static int check_cond_jmp_op(struct bpf_verifier_env *env,
 	}
 
 	is_jmp32 = BPF_CLASS(insn->code) == BPF_JMP32;
-	pred = is_branch_taken(dst_reg, src_reg, opcode, is_jmp32);
+	if (!get_loop_entry(this_branch) || src_reg->precise || dst_reg->precise ||
+	    (BPF_SRC(insn->code) == BPF_K && insn->imm == 0))
+		pred = is_branch_taken(dst_reg, src_reg, opcode, is_jmp32);
+	else
+		pred = -2;
 	if (pred >= 0) {
 		/* If we get here with a dst_reg pointer type it is because
 		 * above is_branch_taken() special cased the 0 comparison.
@@ -15229,13 +15235,13 @@  static int check_cond_jmp_op(struct bpf_verifier_env *env,
 		err = reg_set_min_max(env,
 				      &other_branch_regs[insn->dst_reg],
 				      &other_branch_regs[insn->src_reg],
-				      dst_reg, src_reg, opcode, is_jmp32);
+				      dst_reg, src_reg, opcode, is_jmp32, pred == -2);
 	} else /* BPF_SRC(insn->code) == BPF_K */ {
 		err = reg_set_min_max(env,
 				      &other_branch_regs[insn->dst_reg],
 				      src_reg /* fake one */,
 				      dst_reg, src_reg /* same fake one */,
-				      opcode, is_jmp32);
+				      opcode, is_jmp32, pred == -2);
 	}
 	if (err)
 		return err;
@@ -17217,6 +17223,81 @@  static int propagate_precision(struct bpf_verifier_env *env,
 	return 0;
 }
 
+static void __copy_precision(struct bpf_verifier_env *env,
+			     struct bpf_verifier_state *cur,
+			     const struct bpf_verifier_state *old)
+{
+	struct bpf_reg_state *state_reg;
+	struct bpf_func_state *state, *cur_fr;
+	int i, fr;
+	bool first;
+
+	for (fr = min(cur->curframe, old->curframe); fr >= 0; fr--) {
+		state = old->frame[fr];
+		cur_fr = cur->frame[fr];
+		state_reg = state->regs;
+		first = true;
+		verbose(env, "XX old state:");
+		print_verifier_state(env, state, true);
+		for (i = 0; i < BPF_REG_FP; i++, state_reg++) {
+			if (state_reg->type != SCALAR_VALUE ||
+			    !state_reg->precise ||
+			    !(state_reg->live & REG_LIVE_READ))
+				continue;
+			if (env->log.level & BPF_LOG_LEVEL2) {
+				if (first)
+					verbose(env, "XX frame %d: propagating r%d", fr, i);
+				else
+					verbose(env, ",r%d", i);
+			}
+			cur_fr->regs[i].precise = true;
+			first = false;
+		}
+
+		for (i = 0; i < min(cur_fr->allocated_stack, state->allocated_stack) / BPF_REG_SIZE; i++) {
+			if (!is_spilled_reg(&state->stack[i]))
+				continue;
+			state_reg = &state->stack[i].spilled_ptr;
+			if (state_reg->type != SCALAR_VALUE ||
+			    !state_reg->precise ||
+			    !(state_reg->live & REG_LIVE_READ))
+				continue;
+			if (env->log.level & BPF_LOG_LEVEL2) {
+				if (first)
+					verbose(env, "XX frame %d: propagating fp%d",
+						fr, (-i - 1) * BPF_REG_SIZE);
+				else
+					verbose(env, ",fp%d", (-i - 1) * BPF_REG_SIZE);
+			}
+			cur_fr->stack[i].spilled_ptr.precise = true;
+			first = false;
+		}
+		if (!first)
+			verbose(env, "\n");
+	}
+}
+
+static void copy_precision(struct bpf_verifier_env *env,
+			   struct bpf_verifier_state *cur,
+			   const struct bpf_verifier_state *old)
+{
+	if (!old)
+		return;
+	/*
+	 * parent state unlikely to have precise registers
+	 * due to mark_all_scalars_imprecise(), but let's try anyway.
+	 */
+	__copy_precision(env, cur, old);
+	old = old->parent;
+	if (!old)
+		return;
+	/*
+	 * This one might have precise scalars, since precision propagation
+	 * from array access will mark them in the parent.
+	 */
+	__copy_precision(env, cur, old);
+}
+
 static bool states_maybe_looping(struct bpf_verifier_state *old,
 				 struct bpf_verifier_state *cur)
 {
@@ -17409,6 +17490,7 @@  static int is_state_visited(struct bpf_verifier_env *env, int insn_idx)
 			 * => unsafe memory access at 11 would not be caught.
 			 */
 			if (is_iter_next_insn(env, insn_idx)) {
+				update_loop_entry(cur, &sl->state);
 				if (states_equal(env, &sl->state, cur, RANGE_WITHIN)) {
 					struct bpf_func_state *cur_frame;
 					struct bpf_reg_state *iter_state, *iter_reg;
@@ -17426,15 +17508,14 @@  static int is_state_visited(struct bpf_verifier_env *env, int insn_idx)
 					spi = __get_spi(iter_reg->off + iter_reg->var_off.value);
 					iter_state = &func(env, iter_reg)->stack[spi].spilled_ptr;
 					if (iter_state->iter.state == BPF_ITER_STATE_ACTIVE) {
-						update_loop_entry(cur, &sl->state);
 						goto hit;
 					}
 				}
 				goto skip_inf_loop_check;
 			}
 			if (is_may_goto_insn_at(env, insn_idx)) {
+				update_loop_entry(cur, &sl->state);
 				if (states_equal(env, &sl->state, cur, RANGE_WITHIN)) {
-					update_loop_entry(cur, &sl->state);
 					goto hit;
 				}
 				goto skip_inf_loop_check;
@@ -18066,6 +18147,7 @@  static int do_check(struct bpf_verifier_env *env)
 						return err;
 					break;
 				} else {
+					copy_precision(env, env->cur_state, env->cur_state->parent);
 					do_print_state = true;
 					continue;
 				}
diff --git a/tools/testing/selftests/bpf/progs/arena_htab.c b/tools/testing/selftests/bpf/progs/arena_htab.c
index 1e6ac187a6a0..ac45700ca5ad 100644
--- a/tools/testing/selftests/bpf/progs/arena_htab.c
+++ b/tools/testing/selftests/bpf/progs/arena_htab.c
@@ -18,24 +18,31 @@  void __arena *htab_for_user;
 bool skip = false;
 
 int zero = 0;
+char __arena arr1[100000]; /* works */
+char arr2[100000]; /* runs into 1M limit */
 
 SEC("syscall")
 int arena_htab_llvm(void *ctx)
 {
 #if defined(__BPF_FEATURE_ADDR_SPACE_CAST) || defined(BPF_ARENA_FORCE_ASM)
 	struct htab __arena *htab;
+	char __arena *arr = arr1;
 	__u64 i;
 
 	htab = bpf_alloc(sizeof(*htab));
 	cast_kern(htab);
 	htab_init(htab);
 
+	cast_kern(arr);
+
 	/* first run. No old elems in the table */
-	for (i = zero; i < 1000; i++)
+	for (i = 0; i < 100000 && can_loop; i++) {
 		htab_update_elem(htab, i, i);
+		arr[i] = i;
+	}
 
 	/* should replace all elems with new ones */
-	for (i = zero; i < 1000; i++)
+	for (i = 0; i < 100000 && can_loop; i++)
 		htab_update_elem(htab, i, i);
 	cast_user(htab);
 	htab_for_user = htab;
diff --git a/tools/testing/selftests/bpf/progs/iters.c b/tools/testing/selftests/bpf/progs/iters.c
index fe65e0952a1e..dfc2c9cc0529 100644
--- a/tools/testing/selftests/bpf/progs/iters.c
+++ b/tools/testing/selftests/bpf/progs/iters.c
@@ -188,6 +188,8 @@  int iter_pragma_unroll_loop(const void *ctx)
 	for (i = 0; i < 3; i++) {
 		v = bpf_iter_num_next(&it);
 		bpf_printk("ITER_BASIC: E3 VAL: i=%d v=%d", i, v ? *v : -1);
+		if (!v)
+			break;
 	}
 	bpf_iter_num_destroy(&it);
 
@@ -243,6 +245,8 @@  int iter_multiple_sequential_loops(const void *ctx)
 	for (i = 0; i < 3; i++) {
 		v = bpf_iter_num_next(&it);
 		bpf_printk("ITER_BASIC: E3 VAL: i=%d v=%d", i, v ? *v : -1);
+		if (!v)
+			break;
 	}
 	bpf_iter_num_destroy(&it);
 
@@ -291,10 +295,7 @@  int iter_obfuscate_counter(const void *ctx)
 {
 	struct bpf_iter_num it;
 	int *v, sum = 0;
-	/* Make i's initial value unknowable for verifier to prevent it from
-	 * pruning if/else branch inside the loop body and marking i as precise.
-	 */
-	int i = zero;
+	int i = 0;
 
 	MY_PID_GUARD();
 
@@ -304,15 +305,6 @@  int iter_obfuscate_counter(const void *ctx)
 
 		i += 1;
 
-		/* If we initialized i as `int i = 0;` above, verifier would
-		 * track that i becomes 1 on first iteration after increment
-		 * above, and here verifier would eagerly prune else branch
-		 * and mark i as precise, ruining open-coded iterator logic
-		 * completely, as each next iteration would have a different
-		 * *precise* value of i, and thus there would be no
-		 * convergence of state. This would result in reaching maximum
-		 * instruction limit, no matter what the limit is.
-		 */
 		if (i == 1)
 			x = 123;
 		else
diff --git a/tools/testing/selftests/bpf/progs/verifier_iterating_callbacks.c b/tools/testing/selftests/bpf/progs/verifier_iterating_callbacks.c
index bd676d7e615f..bd45a328fa85 100644
--- a/tools/testing/selftests/bpf/progs/verifier_iterating_callbacks.c
+++ b/tools/testing/selftests/bpf/progs/verifier_iterating_callbacks.c
@@ -318,8 +318,11 @@  int cond_break1(const void *ctx)
 	unsigned long i;
 	unsigned int sum = 0;
 
-	for (i = zero; i < ARR_SZ && can_loop; i++)
+	/* i = 0 is ok here, since i is not used in memory access */
+	for (i = 0; i < ARR_SZ && can_loop; i++)
 		sum += i;
+
+	/* have to use i = zero due to arr[i] where arr is not an arena */
 	for (i = zero; i < ARR_SZ; i++) {
 		barrier_var(i);
 		sum += i + arr[i];
@@ -336,8 +339,8 @@  int cond_break2(const void *ctx)
 	int i, j;
 	int sum = 0;
 
-	for (i = zero; i < 1000 && can_loop; i++)
-		for (j = zero; j < 1000; j++) {
+	for (i = 0; i < 1000 && can_loop; i++)
+		for (j = 0; j < 1000; j++) {
 			sum += i + j;
 			cond_break;
 	}
@@ -348,7 +351,7 @@  static __noinline int loop(void)
 {
 	int i, sum = 0;
 
-	for (i = zero; i <= 1000000 && can_loop; i++)
+	for (i = 0; i <= 1000000 && can_loop; i++)
 		sum += i;
 
 	return sum;
@@ -365,7 +368,7 @@  SEC("socket")
 __success __retval(1)
 int cond_break4(const void *ctx)
 {
-	int cnt = zero;
+	int cnt = 0;
 
 	for (;;) {
 		/* should eventually break out of the loop */
@@ -378,7 +381,7 @@  int cond_break4(const void *ctx)
 
 static __noinline int static_subprog(void)
 {
-	int cnt = zero;
+	int cnt = 0;
 
 	for (;;) {
 		cond_break;
@@ -392,7 +395,7 @@  SEC("socket")
 __success __retval(1)
 int cond_break5(const void *ctx)
 {
-	int cnt1 = zero, cnt2;
+	int cnt1 = 0, cnt2;
 
 	for (;;) {
 		cond_break;