[bpf-next,v2,0/2] bpf: Fix to preserve reg parent/live fields when copying range info

Message ID 20230106142214.1040390-1-eddyz87@gmail.com (mailing list archive)

Message

Eduard Zingerman Jan. 6, 2023, 2:22 p.m. UTC
Struct bpf_reg_state is copied directly in several places including:
- check_stack_write_fixed_off() (via save_register_state());
- check_stack_read_fixed_off();
- find_equal_scalars().

However, a literal copy of this struct also copies the following fields:

struct bpf_reg_state {
	...
	struct bpf_reg_state *parent;
	...
	enum bpf_reg_liveness live;
	...
};

This breaks the register parentage chain and the liveness marking logic.
The commit message for the first patch has a detailed example.
This patch-set replaces the direct copies with calls to a helper
copy_register_state(dst, src), which preserves the 'parent' and 'live'
fields of 'dst'.
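
A minimal sketch of what such a helper could look like (matching the
fields shown above; the actual patch may differ in details):

static void copy_register_state(struct bpf_reg_state *dst,
				const struct bpf_reg_state *src)
{
	struct bpf_reg_state *parent = dst->parent;
	enum bpf_reg_liveness live = dst->live;

	*dst = *src;		/* literal copy clobbers parent/live */
	dst->parent = parent;	/* restore the parentage chain */
	dst->live = live;	/* restore the liveness marks */
}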

The fix comes with a significant verifier runtime penalty for some
selftest binaries listed in tools/testing/selftests/bpf/veristat.cfg
and cilium BPF binaries (see [1]):

$ ./veristat -e file,prog,states -C -f 'states_diff>10' master-baseline.log current.log 
File                        Program                           States (A)  States (B)  States   (DIFF)
--------------------------  --------------------------------  ----------  ----------  ---------------
bpf_host.o                  tail_handle_ipv4_from_host               231         299    +68 (+29.44%)
bpf_host.o                  tail_handle_nat_fwd_ipv4                1088        1320   +232 (+21.32%)
bpf_host.o                  tail_handle_nat_fwd_ipv6                 716         729     +13 (+1.82%)
bpf_host.o                  tail_nodeport_nat_ingress_ipv4           281         314    +33 (+11.74%)
bpf_host.o                  tail_nodeport_nat_ingress_ipv6           245         256     +11 (+4.49%)
bpf_lxc.o                   tail_handle_nat_fwd_ipv4                1088        1320   +232 (+21.32%)
bpf_lxc.o                   tail_handle_nat_fwd_ipv6                 716         729     +13 (+1.82%)
bpf_lxc.o                   tail_ipv4_ct_egress                      239         262     +23 (+9.62%)
bpf_lxc.o                   tail_ipv4_ct_ingress                     239         262     +23 (+9.62%)
bpf_lxc.o                   tail_ipv4_ct_ingress_policy_only         239         262     +23 (+9.62%)
bpf_lxc.o                   tail_ipv6_ct_egress                      181         195     +14 (+7.73%)
bpf_lxc.o                   tail_ipv6_ct_ingress                     181         195     +14 (+7.73%)
bpf_lxc.o                   tail_ipv6_ct_ingress_policy_only         181         195     +14 (+7.73%)
bpf_lxc.o                   tail_nodeport_nat_ingress_ipv4           281         314    +33 (+11.74%)
bpf_lxc.o                   tail_nodeport_nat_ingress_ipv6           245         256     +11 (+4.49%)
bpf_overlay.o               tail_handle_nat_fwd_ipv4                 799         829     +30 (+3.75%)
bpf_overlay.o               tail_nodeport_nat_ingress_ipv4           281         314    +33 (+11.74%)
bpf_overlay.o               tail_nodeport_nat_ingress_ipv6           245         256     +11 (+4.49%)
bpf_sock.o                  cil_sock4_connect                         47          70    +23 (+48.94%)
bpf_sock.o                  cil_sock4_sendmsg                         45          68    +23 (+51.11%)
bpf_sock.o                  cil_sock6_post_bind                       31          42    +11 (+35.48%)
bpf_xdp.o                   tail_lb_ipv4                            4413        6457  +2044 (+46.32%)
bpf_xdp.o                   tail_lb_ipv6                            6876        7249    +373 (+5.42%)
test_cls_redirect.bpf.o     cls_redirect                            4704        4799     +95 (+2.02%)
test_tcp_hdr_options.bpf.o  estab                                    180         206    +26 (+14.44%)
xdp_synproxy_kern.bpf.o     syncookie_tc                           21059       21485    +426 (+2.02%)
xdp_synproxy_kern.bpf.o     syncookie_xdp                          21857       23122   +1265 (+5.79%)
--------------------------  --------------------------------  ----------  ----------  ---------------

I looked through the verification log for the bpf_xdp.o tail_lb_ipv4
program to identify the reason for the ~50% increase in visited states.
The slowdown is triggered by a difference in the handling of three stack
slots: fp-56, fp-72 and fp-80, with the main difference coming from fp-72.
In fact, the following change removes the difference entirely:

@@ -3256,7 +3256,10 @@ static void save_register_state(struct bpf_func_state *state,
 {
        int i;
 
-       copy_register_state(&state->stack[spi].spilled_ptr, reg);
+       if ((spi == 6 /*56*/ || spi == 8 /*72*/ || spi == 9 /*80*/) && size != BPF_REG_SIZE)
+               state->stack[spi].spilled_ptr = *reg;
+       else
+               copy_register_state(&state->stack[spi].spilled_ptr, reg);

For fp-56 I found the following pattern of divergence between the
verification logs with and without this patch:

- At some point insn 1862 is reached and checkpoint is created;
- At some other point insn 1862 is reached again:
  - with this patch:
    - the current state is considered *not* equivalent to the old checkpoint;
    - the reason for mismatch is the state of fp-56:
      - current state: fp-56=????mmmm
      - checkpoint: fp-56_rD=mmmmmmmm
  - without this patch the current state is considered equivalent to the
    checkpoint; fp-56 is not present in the checkpoint at all.

Here is a fragment of the verification log from when the checkpoint in
question is created at insn 1862:

checkpoint 1862:  ... fp-56=mmmmmmmm ...
1862: ...
1863: ...
1864: (61) r1 = *(u32 *)(r0 +0)
1865: ...
1866: (63) *(u32 *)(r10 -56) = r1     ; R1_w=scalar(...) R10=fp0 fp-56=
1867: (bf) r2 = r10                   ; R2_w=fp0 R10=fp0 
1868: (07) r2 += -56                  ; R2_w=fp-56
; return map_lookup_elem(&LB4_BACKEND_MAP_V2, &backend_id);
1869: (18) r1 = 0xffff888100286000    ; R1_w=map_ptr(off=0,ks=4,vs=8,imm=0)
1871: (85) call bpf_map_lookup_elem#1

- Without this patch:
  - at insn 1864 r1 liveness is set to REG_LIVE_WRITTEN;
  - at insn 1866 fp-56 liveness gets the REG_LIVE_WRITTEN mark because
    of the direct r1 copy in save_register_state();
  - at insn 1871 REG_LIVE_READ is not propagated to fp-56 at
    checkpoint 1862 because of the REG_LIVE_WRITTEN mark;
  - eventually fp-56 is pruned from checkpoint at 1862 in
    clean_func_state().
- With this patch:
  - at insn 1864 r1 liveness is set to REG_LIVE_WRITTEN;
  - at insn 1866 fp-56 liveness does *not* get the REG_LIVE_WRITTEN mark
    because the write size is not equal to BPF_REG_SIZE;
  - at insn 1871 REG_LIVE_READ is propagated to fp-56 at checkpoint 1862.

Hence, more states have to be visited by the verifier with this patch
compared to current master.
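
For context, a read mark climbs the 'parent' chain of a register or
stack slot until it is stopped by a REG_LIVE_WRITTEN mark; a simplified
sketch of that loop (the real mark_reg_read() in kernel/bpf/verifier.c
carries more bookkeeping):

static void mark_reg_read_sketch(const struct bpf_reg_state *state,
				 struct bpf_reg_state *parent)
{
	while (parent) {
		/* A WRITTEN mark screens the read: older checkpoints
		 * never see it, so the slot can later be pruned by
		 * clean_func_state().
		 */
		if (state->live & REG_LIVE_WRITTEN)
			break;
		parent->live |= REG_LIVE_READ;
		state = parent;
		parent = state->parent;
	}
}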

Similar patterns can be found for both fp-72 and fp-80, although these
are harder to track through the log because of the large number of insns
between the slot write and the bpf_map_lookup_elem() call that triggers
the read mark. The code in question boils down to the following C:

	struct ipv4_frag_id frag_id = {
		.daddr = ip4->daddr,
		.saddr = ip4->saddr,
		.id = ip4->id,
		.proto = ip4->protocol,
		.pad = 0,
	};
	...
	map_lookup_elem(..., &frag_id);

Where:
- .id is mapped to fp-72, write of size u16;
- .saddr is mapped to fp-80, write of size u32.

This patch-set is a continuation of discussion from [2].

Changes v1 -> v2 (no changes in the code itself):
- added an analysis of the tail_lb_ipv4 verification slowdown;
- rebased against a fresh master branch.

[1] git@github.com:anakryiko/cilium.git
[2] https://lore.kernel.org/bpf/517af2c57ee4b9ce2d96a8cf33f7295f2d2dfe13.camel@gmail.com/

Eduard Zingerman (2):
  bpf: Fix to preserve reg parent/live fields when copying range info
  selftests/bpf: Verify copy_register_state() preserves parent/live
    fields

 kernel/bpf/verifier.c                         | 25 +++++++++----
 .../selftests/bpf/verifier/search_pruning.c   | 36 +++++++++++++++++++
 2 files changed, 54 insertions(+), 7 deletions(-)

Comments

Andrii Nakryiko Jan. 12, 2023, 12:24 a.m. UTC | #1
On Fri, Jan 6, 2023 at 6:22 AM Eduard Zingerman <eddyz87@gmail.com> wrote:
>
> [...]
> I looked through the verification log for the bpf_xdp.o tail_lb_ipv4
> program to identify the reason for the ~50% increase in visited states.

It's just a 2K increase and it is fixing a real issue, so I think it's
totally acceptable (and see below about STACK_INVALID vs STACK_MISC).
Thanks for taking the time to analyze and explain this!

> [...]
> For fp-56 I found the following pattern of divergence between the
> verification logs with and without this patch:
>
> - At some point insn 1862 is reached and checkpoint is created;
> - At some other point insn 1862 is reached again:
>   - with this patch:
>     - the current state is considered *not* equivalent to the old checkpoint;
>     - the reason for mismatch is the state of fp-56:
>       - current state: fp-56=????mmmm
>       - checkpoint: fp-56_rD=mmmmmmmm

I'm wondering if we should consider allowing uninitialized
(STACK_INVALID) reads from the stack in general. It feels like it's
causing more issues than it is helpful in practice. A common code
pattern is to __builtin_memset() some struct first and only then
initialize it, doing the unnecessary work of zeroing out just to keep
the verifier from complaining about some irrelevant padding not being
initialized. I haven't thought about this much, but it feels that
STACK_MISC (initialized, but unknown scalar value) is basically
equivalent to STACK_INVALID for all intents and purposes. Thoughts?
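
A hypothetical illustration of that pattern (struct, map and field
names are made up; assumes the usual BPF helper declarations):

	struct val {
		__u64 a;
		__u8 b;
		/* 7 bytes of implicit padding */
	} v;

	/* Zero the whole struct up front, purely so that none of the
	 * stack bytes backing 'v' are left STACK_INVALID.
	 */
	__builtin_memset(&v, 0, sizeof(v));
	v.a = 1;
	v.b = 2;
	bpf_map_update_elem(&some_map, &key, &v, BPF_ANY);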

Obviously, this is a completely separate change and issue from what
you are addressing in this patch set.

Awesome job on tracking this down and fixing it! For the patch set:

Acked-by: Andrii Nakryiko <andrii@kernel.org>


> [...]
Eduard Zingerman Jan. 13, 2023, 8:02 p.m. UTC | #2
On Wed, 2023-01-11 at 16:24 -0800, Andrii Nakryiko wrote:
[...]
> I'm wondering if we should consider allowing uninitialized
> (STACK_INVALID) reads from the stack in general.
> [...]

Do you have an example of the __builtin_memset() usage?
I tried passing a partially initialized stack-allocated structure to
bpf_map_update_elem() and bpf_probe_write_user() and the verifier did
not complain.

Regarding STACK_MISC vs STACK_INVALID, I think it's ok to replace
STACK_INVALID with STACK_MISC if we are talking about STX/LDX/ALU
instructions, because after LDX you would get a full-range register and
you can't do much with a full-range value. However, if a structure
containing uninitialized fields (i.e. not just padding) is passed to
a helper or kfunc, is it an error?

> Obviously, this is a completely separate change and issue from what
> you are addressing in this patch set.
> 
> Awesome job on tracking this down and fixing it! For the patch set:

Thank you for reviewing this issue with me.

> 
> Acked-by: Andrii Nakryiko <andrii@kernel.org>
> 
> 
[...]
Andrii Nakryiko Jan. 13, 2023, 10:22 p.m. UTC | #3
On Fri, Jan 13, 2023 at 12:02 PM Eduard Zingerman <eddyz87@gmail.com> wrote:
>
> [...]
> Regarding STACK_MISC vs STACK_INVALID, I think it's ok to replace
> STACK_INVALID with STACK_MISC if we are talking about STX/LDX/ALU
> instructions, because after LDX you would get a full-range register and
> you can't do much with a full-range value. However, if a structure
> containing uninitialized fields (i.e. not just padding) is passed to
> a helper or kfunc, is it an error?

if we are passing the stack as memory to a helper/kfunc (which should be
the only valid use case with STACK_MISC, right?), then I think we
expect the helper/kfunc to treat it as memory with unknowable contents.
Not sure if I'm missing something, but MISC says it's some unknown
value, and the only difference between INVALID and MISC is that MISC's
value was written by the program explicitly, while for INVALID the
garbage value was there on the stack already (but is still an unknowable
scalar), which effectively is the same thing.

>
> > Obviously, this is a completely separate change and issue from what
> > you are addressing in this patch set.
> >
> > Awesome job on tracking this down and fixing it! For the patch set:
>
> Thank you for reviewing this issue with me.
>
> >
> > Acked-by: Andrii Nakryiko <andrii@kernel.org>
> >
> >
> [...]
Eduard Zingerman Jan. 14, 2023, 12:10 a.m. UTC | #4
On Fri, 2023-01-13 at 14:22 -0800, Andrii Nakryiko wrote:
> [...]
> 
> if we are passing the stack as memory to a helper/kfunc (which should be
> the only valid use case with STACK_MISC, right?), then I think we
> expect the helper/kfunc to treat it as memory with unknowable contents.
> Not sure if I'm missing something, but MISC says it's some unknown
> value, and the only difference between INVALID and MISC is that MISC's
> value was written by the program explicitly, while for INVALID the
> garbage value was there on the stack already (but is still an unknowable
> scalar), which effectively is the same thing.

I looked through the places where STACK_INVALID is used; here is the list:

- unmark_stack_slots_dynptr()
  Destroys dynptr marks. Suppose STACK_INVALID is replaced by
  STACK_MISC here; in this case a scalar read would be possible from
  such a slot, which in turn might lead to a pointer leak.
  Might be a problem?

- scrub_spilled_slot()
  Marks a spill slot STACK_MISC if it is not STACK_INVALID.
  Called from:
  - save_register_state() called from check_stack_write_fixed_off().
    For writes narrower than BPF_REG_SIZE (e.g. 32-bit) marks only the
    slot bytes not covered by the write.
  - check_stack_write_fixed_off() for insns like `fp[-8] = <const>`, to
    destroy previous stack marks.
  - check_stack_range_initialized(), where it always marks all 8 spi
    slots as STACK_MISC.
  Looks like STACK_MISC instead of STACK_INVALID wouldn't make a
  difference in these cases.

- check_stack_write_fixed_off()
  Marks an insn as sanitize_stack_spill if a pointer is spilled to a
  stack slot that is marked STACK_INVALID. This one is a bit strange.
  E.g. a program like this:

    ...
    42:  fp[-8] = ptr
    ...

  will mark insn (42) as sanitize_stack_spill.
  However, a program like this:

    ...
    21:  fp[-8] = 22   ;; marks as STACK_MISC
    ...
    42:  fp[-8] = ptr
    ...

  won't mark insn (42) as sanitize_stack_spill, which seems strange.

- stack_write_var_off()
  If !env->allow_ptr_leaks, writes are allowed only if the slots are
  not STACK_INVALID. I'm not sure I understand the intention.

- clean_func_state()
  STACK_INVALID is used to mark spis that are not REG_LIVE_READ as
  slots that should not take part in the state comparison. However,
  stacksafe() has a REG_LIVE_READ check as well, so this marking might
  be unnecessary.

- stacksafe()
  STACK_INVALID is used as a mark that some bytes of an spi are not
  important in a state cached for state comparison. E.g. a slot might
  be marked 'mmmm????' in an old state and 'mmmmmmmm' or 'mmmm0000' in
  a new state. However, other checks in stacksafe() would catch these
  variations.

The conclusion is that some pointer leakage checks might need
adjustment if STACK_INVALID is replaced by STACK_MISC.

> 
> > 
> > > Obviously, this is a completely separate change and issue from what
> > > you are addressing in this patch set.
> > > 
> > > Awesome job on tracking this down and fixing it! For the patch set:
> > 
> > Thank you for reviewing this issue with me.
> > 
> > > 
> > > Acked-by: Andrii Nakryiko <andrii@kernel.org>
> > > 
> > > 
> > [...]
Andrii Nakryiko Jan. 14, 2023, 1:17 a.m. UTC | #5
On Fri, Jan 13, 2023 at 4:10 PM Eduard Zingerman <eddyz87@gmail.com> wrote:
>
> [...]
> I looked through the places where STACK_INVALID is used; here is the list:
>
> - unmark_stack_slots_dynptr()
>   Destroys dynptr marks. Suppose STACK_INVALID is replaced by
>   STACK_MISC here; in this case a scalar read would be possible from
>   such a slot, which in turn might lead to a pointer leak.
>   Might be a problem?

We are already talking about enabling direct reads of STACK_DYNPTR
slots. So, not a problem?

>
> [...]
>
> The conclusion is that some pointer leakage checks might need
> adjustment if STACK_INVALID is replaced by STACK_MISC.

Just to be clear. My suggestion was to *treat* STACK_INVALID as
equivalent to STACK_MISC in stacksafe(), not really replace all the
uses of STACK_INVALID with STACK_MISC. And to be on the safe side, I'd
do it only if env->allow_ptr_leaks, of course.

> [...]
Andrii Nakryiko Jan. 14, 2023, 1:30 a.m. UTC | #6
On Fri, Jan 13, 2023 at 5:17 PM Andrii Nakryiko
<andrii.nakryiko@gmail.com> wrote:
>
> [...]
> Just to be clear. My suggestion was to *treat* STACK_INVALID as
> equivalent to STACK_MISC in stacksafe(), not really replace all the
> uses of STACK_INVALID with STACK_MISC. And to be on the safe side, I'd
> do it only if env->allow_ptr_leaks, of course.

Well, that, and to allow STACK_INVALID if env->allow_ptr_leaks in
check_stack_read_fixed_off(), of course, to avoid the "invalid read from
stack off %d+%d size %d\n" error (that fixes at least part of the
problem with uninitialized struct padding).

> [...]
patchwork-bot+netdevbpf@kernel.org Jan. 19, 2023, 11:30 p.m. UTC | #7
Hello:

This series was applied to bpf/bpf.git (master)
by Alexei Starovoitov <ast@kernel.org>:

On Fri,  6 Jan 2023 16:22:12 +0200 you wrote:
> Struct bpf_reg_state is copied directly in several places including:
> - check_stack_write_fixed_off() (via save_register_state());
> - check_stack_read_fixed_off();
> - find_equal_scalars().
> 
> However, a literal copy of this struct also copies the following fields:
> 
> [...]

Here is the summary with links:
  - [bpf-next,v2,1/2] bpf: Fix to preserve reg parent/live fields when copying range info
    https://git.kernel.org/bpf/bpf/c/71f656a50176
  - [bpf-next,v2,2/2] selftests/bpf: Verify copy_register_state() preserves parent/live fields
    https://git.kernel.org/bpf/bpf/c/b9fa9bc83929

You are awesome, thank you!
Alexei Starovoitov Jan. 19, 2023, 11:52 p.m. UTC | #8
On Fri, Jan 13, 2023 at 5:31 PM Andrii Nakryiko
<andrii.nakryiko@gmail.com> wrote:
>
> On Fri, Jan 13, 2023 at 5:17 PM Andrii Nakryiko
> <andrii.nakryiko@gmail.com> wrote:
> >
> > On Fri, Jan 13, 2023 at 4:10 PM Eduard Zingerman <eddyz87@gmail.com> wrote:
> > >
> > > On Fri, 2023-01-13 at 14:22 -0800, Andrii Nakryiko wrote:
> > > > On Fri, Jan 13, 2023 at 12:02 PM Eduard Zingerman <eddyz87@gmail.com> wrote:
> > > > >
> > > > > On Wed, 2023-01-11 at 16:24 -0800, Andrii Nakryiko wrote:
> > > > > [...]
> > > > > >
> > > > > > I'm wondering if we should consider allowing uninitialized
> > > > > > (STACK_INVALID) reads from stack, in general. It feels like it's
> > > > > > causing more issues than is actually helpful in practice. Common code
> > > > > > pattern is to __builtin_memset() some struct first, and only then
> > > > > > initialize it, basically doing unnecessary work of zeroing out. All
> > > > > > just to avoid verifier to complain about some irrelevant padding not
> > > > > > being initialized. I haven't thought about this much, but it feels
> > > > > > that STACK_MISC (initialized, but unknown scalar value) is basically
> > > > > > equivalent to STACK_INVALID for all intents and purposes. Thoughts?
> > > > >
> > > > > Do you have an example of the __builtin_memset() usage?
> > > > > I tried passing partially initialized stack allocated structure to
> > > > > bpf_map_update_elem() and bpf_probe_write_user() and verifier did not
> > > > > complain.
> > > > >
> > > > > Regarding STACK_MISC vs STACK_INVALID, I think it's ok to replace
> > > > > STACK_INVALID with STACK_MISC if we are talking about STX/LDX/ALU
> > > > > instructions because after LDX you would get a full range register and
> > > > > you can't do much with a full range value. However, if a structure
> > > > > containing un-initialized fields (e.g. not just padding) is passed to
> > > > > a helper or kfunc is it an error?
> > > >
> > > > if we are passing stack as a memory to helper/kfunc (which should be
> > > > the only valid use case with STACK_MISC, right?), then I think we
> > > > expect helper/kfunc to treat it as memory with unknowable contents.
> > > > Not sure if I'm missing something, but MISC says it's some unknown
> > > > value, and the only difference between INVALID and MISC is that MISC's
> > > > value was written by program explicitly, while for INVALID that
> > > > garbage value was there on the stack already (but still unknowable
> > > > scalar), which effectively is the same thing.
> > >
> > > I looked through the places where STACK_INVALID is used, here is the list:
> > >
> > > - unmark_stack_slots_dynptr()
> > >   Destroy dynptr marks. Suppose STACK_INVALID is replaced by
> > >   STACK_MISC here, in this case a scalar read would be possible from
> > >   such slot, which in turn might lead to pointer leak.
> > >   Might be a problem?
> >
> > We are already talking to enable reading STACK_DYNPTR slots directly.
> > So not a problem?
> >
> > >
> > > - scrub_spilled_slot()
> > >   mark spill slot STACK_MISC if not STACK_INVALID
> > >   Called from:
> > >   - save_register_state() called from check_stack_write_fixed_off()
> > >     Would mark not all slots only for 32-bit writes.
> > >   - check_stack_write_fixed_off() for insns like `fp[-8] = <const>` to
> > >     destroy previous stack marks.
> > >   - check_stack_range_initialized()
> > >     here it always marks all 8 spi slots as STACK_MISC.
> > >   Looks like STACK_MISC instead of STACK_INVALID wouldn't make a
> > >   difference in these cases.
> > >
> > > - check_stack_write_fixed_off()
> > >   Mark insn as sanitize_stack_spill if pointer is spilled to a stack
> > >   slot that is marked STACK_INVALID. This one is a bit strange.
> > >   E.g. the program like this:
> > >
> > >     ...
> > >     42:  fp[-8] = ptr
> > >     ...
> > >
> > >   Will mark insn (42) as sanitize_stack_spill.
> > >   However, the program like this:
> > >
> > >     ...
> > >     21:  fp[-8] = 22   ;; marks as STACK_MISC
> > >     ...
> > >     42:  fp[-8] = ptr
> > >     ...
> > >
> > >   Won't mark insn (42) as sanitize_stack_spill, which seems strange.
> > >
> > > - stack_write_var_off()
> > >   If !env->allow_ptr_leaks only allow writes if slots are not
> > >   STACK_INVALID. I'm not sure I understand the intention.
> > >
> > > - clean_func_state()
> > >   STACK_INVALID is used to mark spi's that are not REG_LIVE_READ as
> > >   such that should not take part in the state comparison. However,
> > >   stacksafe() has REG_LIVE_READ check as well, so this marking might
> > >   be unnecessary.
> > >
> > > - stacksafe()
> > >   STACK_INVALID is used as a mark that some bytes of an spi are not
> > >   important in a state cached for state comparison. E.g. a slot in an
> > >   old state might be marked 'mmmm????' and 'mmmmmmmm' or 'mmmm0000' in
> > >   a new state. However other checks in stacksafe() would catch these
> > >   variations.
> > >
> > > The conclusion being that some pointer leakage checks might need
> > > adjustment if STACK_INVALID is replaced by STACK_MISC.
> >
> > Just to be clear. My suggestion was to *treat* STACK_INVALID as
> > equivalent to STACK_MISC in stacksafe(), not really replace all the
> > uses of STACK_INVALID with STACK_MISC. And to be on the safe side, I'd
> > do it only if env->allow_ptr_leaks, of course.
>
> Well, that, and to allow STACK_INVALID if env->allow_ptr_leaks in
> check_stack_read_fixed_off(), of course, to avoid the "invalid read from
> stack off %d+%d size %d\n" error (that fixes at least part of the
> problem with uninitialized struct padding).

+1 to Andrii's idea.
It should help us recover this small increase in processed states.

Eduard,

The fix itself is brilliant. Thank you for investigating
and providing the detailed explanation.
I've read this thread and the previous one,
walked through all the points, and it all looks correct.
Sorry it took me a long time to remember the details
of the liveness logic and review it properly.

While you, Andrii, and I keep this tricky knowledge in our
heads, could you please document how liveness works in
Documentation/bpf/verifier.rst?
We'll be able to review it now, and next time it will be
easier to remember.

I've tried Andrii's suggestion:

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 7ee218827259..0f71ba6a56e2 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -3591,7 +3591,7 @@ static int check_stack_read_fixed_off(struct bpf_verifier_env *env,
 				copy_register_state(&state->regs[dst_regno], reg);
 				state->regs[dst_regno].subreg_def = subreg_def;
 			} else {
-				for (i = 0; i < size; i++) {
+				for (i = 0; i < size && !env->allow_uninit_stack; i++) {
 					type = stype[(slot - i) % BPF_REG_SIZE];
 					if (type == STACK_SPILL)
 						continue;
@@ -3628,7 +3628,7 @@ static int check_stack_read_fixed_off(struct bpf_verifier_env *env,
 		}
 		mark_reg_read(env, reg, reg->parent, REG_LIVE_READ64);
 	} else {
-		for (i = 0; i < size; i++) {
+		for (i = 0; i < size && !env->allow_uninit_stack; i++) {
 			type = stype[(slot - i) % BPF_REG_SIZE];
 			if (type == STACK_MISC)
 				continue;
@@ -13208,6 +13208,10 @@ static bool stacksafe(struct bpf_verifier_env *env, struct bpf_func_state *old,
 		if (old->stack[spi].slot_type[i % BPF_REG_SIZE] == STACK_INVALID)
 			continue;
 
+		if (env->allow_uninit_stack &&
+		    old->stack[spi].slot_type[i % BPF_REG_SIZE] == STACK_MISC)
+			continue;

and only the dynptr/invalid_read[134] tests failed,
which is expected and acceptable.
We can tweak those tests.

Could you take over this diff, run veristat analysis and
submit it as an official patch? I suspect we should see nice
improvements in states processed.

Thanks!
Alexei Starovoitov Jan. 20, 2023, 12:16 a.m. UTC | #9
On Thu, Jan 19, 2023 at 3:52 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
> [...]
> Could you take over this diff, run veristat analysis and
> submit it as an official patch? I suspect we should see nice
> improvements in states processed.

Indeed, some massive improvements:
./veristat -e file,prog,states -C -f 'states_diff<-10' bb aa
File                              Program                  States (A)  States (B)  States    (DIFF)
--------------------------------  -----------------------  ----------  ----------  ----------------
bpf_flow.bpf.o                    flow_dissector_0                 78          67     -11 (-14.10%)
loop6.bpf.o                       trace_virtqueue_add_sgs         336         316      -20 (-5.95%)
pyperf100.bpf.o                   on_event                       6213        4670   -1543 (-24.84%)
pyperf180.bpf.o                   on_event                      11470        8364   -3106 (-27.08%)
pyperf50.bpf.o                    on_event                       3263        2370    -893 (-27.37%)
pyperf600.bpf.o                   on_event                      30335       22200   -8135 (-26.82%)
pyperf600_bpf_loop.bpf.o          on_event                        287         145    -142 (-49.48%)
pyperf600_nounroll.bpf.o          on_event                      37101       34169    -2932 (-7.90%)
strobemeta.bpf.o                  on_event                      15939        4893  -11046 (-69.30%)
strobemeta_nounroll1.bpf.o        on_event                       1936        1538    -398 (-20.56%)
strobemeta_nounroll2.bpf.o        on_event                       4436        3991    -445 (-10.03%)
strobemeta_subprogs.bpf.o         on_event                       2025        1689    -336 (-16.59%)
test_cls_redirect.bpf.o           cls_redirect                   4865        4042    -823 (-16.92%)
test_cls_redirect_subprogs.bpf.o  cls_redirect                   4506        4389     -117 (-2.60%)
test_tcp_hdr_options.bpf.o        estab                           211         178     -33 (-15.64%)
test_xdp_noinline.bpf.o           balancer_ingress_v4             262         235     -27 (-10.31%)
test_xdp_noinline.bpf.o           balancer_ingress_v6             253         210     -43 (-17.00%)
xdp_synproxy_kern.bpf.o           syncookie_tc                  25086        7016  -18070 (-72.03%)
xdp_synproxy_kern.bpf.o           syncookie_xdp                 24206        6941  -17265 (-71.33%)
--------------------------------  -----------------------  ----------  ----------  ----------------
Eduard Zingerman Jan. 20, 2023, 1:39 p.m. UTC | #10
On Thu, 2023-01-19 at 15:52 -0800, Alexei Starovoitov wrote:
> On Fri, Jan 13, 2023 at 5:31 PM Andrii Nakryiko
> <andrii.nakryiko@gmail.com> wrote:
> > 
> > On Fri, Jan 13, 2023 at 5:17 PM Andrii Nakryiko
> > <andrii.nakryiko@gmail.com> wrote:
> > > 
> > > On Fri, Jan 13, 2023 at 4:10 PM Eduard Zingerman <eddyz87@gmail.com> wrote:
> > > > 
> > > > On Fri, 2023-01-13 at 14:22 -0800, Andrii Nakryiko wrote:
> > > > > On Fri, Jan 13, 2023 at 12:02 PM Eduard Zingerman <eddyz87@gmail.com> wrote:
> > > > > > 
> > > > > > On Wed, 2023-01-11 at 16:24 -0800, Andrii Nakryiko wrote:
> > > > > > [...]
> > > > > > > 
> > > > > > > I'm wondering if we should consider allowing uninitialized
> > > > > > > (STACK_INVALID) reads from stack, in general. It feels like it's
> > > > > > > causing more issues than is actually helpful in practice. Common code
> > > > > > > pattern is to __builtin_memset() some struct first, and only then
> > > > > > > initialize it, basically doing unnecessary work of zeroing out. All
> > > > > > > just to avoid verifier to complain about some irrelevant padding not
> > > > > > > being initialized. I haven't thought about this much, but it feels
> > > > > > > that STACK_MISC (initialized, but unknown scalar value) is basically
> > > > > > > equivalent to STACK_INVALID for all intents and purposes. Thoughts?
> > > > > > 
> > > > > > Do you have an example of the __builtin_memset() usage?
> > > > > > I tried passing partially initialized stack allocated structure to
> > > > > > bpf_map_update_elem() and bpf_probe_write_user() and verifier did not
> > > > > > complain.
> > > > > > 
> > > > > > Regarding STACK_MISC vs STACK_INVALID, I think it's ok to replace
> > > > > > STACK_INVALID with STACK_MISC if we are talking about STX/LDX/ALU
> > > > > > instructions because after LDX you would get a full range register and
> > > > > > you can't do much with a full range value. However, if a structure
> > > > > > containing un-initialized fields (e.g. not just padding) is passed to
> > > > > > a helper or kfunc is it an error?
> > > > > 
> > > > > if we are passing stack as a memory to helper/kfunc (which should be
> > > > > the only valid use case with STACK_MISC, right?), then I think we
> > > > > expect helper/kfunc to treat it as memory with unknowable contents.
> > > > > Not sure if I'm missing something, but MISC says it's some unknown
> > > > > value, and the only difference between INVALID and MISC is that MISC's
> > > > > value was written by program explicitly, while for INVALID that
> > > > > garbage value was there on the stack already (but still unknowable
> > > > > scalar), which effectively is the same thing.
> > > > 
> > > > I looked through the places where STACK_INVALID is used, here is the list:
> > > > 
> > > > - unmark_stack_slots_dynptr()
> > > >   Destroy dynptr marks. Suppose STACK_INVALID is replaced by
> > > >   STACK_MISC here, in this case a scalar read would be possible from
> > > >   such slot, which in turn might lead to pointer leak.
> > > >   Might be a problem?
> > > 
> > > We are already talking to enable reading STACK_DYNPTR slots directly.
> > > So not a problem?
> > > 
> > > > 
> > > > - scrub_spilled_slot()
> > > >   mark spill slot STACK_MISC if not STACK_INVALID
> > > >   Called from:
> > > >   - save_register_state() called from check_stack_write_fixed_off()
> > > >     Would mark not all slots only for 32-bit writes.
> > > >   - check_stack_write_fixed_off() for insns like `fp[-8] = <const>` to
> > > >     destroy previous stack marks.
> > > >   - check_stack_range_initialized()
> > > >     here it always marks all 8 spi slots as STACK_MISC.
> > > >   Looks like STACK_MISC instead of STACK_INVALID wouldn't make a
> > > >   difference in these cases.
> > > > 
> > > > - check_stack_write_fixed_off()
> > > >   Mark insn as sanitize_stack_spill if pointer is spilled to a stack
> > > >   slot that is marked STACK_INVALID. This one is a bit strange.
> > > >   E.g. the program like this:
> > > > 
> > > >     ...
> > > >     42:  fp[-8] = ptr
> > > >     ...
> > > > 
> > > >   Will mark insn (42) as sanitize_stack_spill.
> > > >   However, the program like this:
> > > > 
> > > >     ...
> > > >     21:  fp[-8] = 22   ;; marks as STACK_MISC
> > > >     ...
> > > >     42:  fp[-8] = ptr
> > > >     ...
> > > > 
> > > >   Won't mark insn (42) as sanitize_stack_spill, which seems strange.
> > > > 
> > > > - stack_write_var_off()
> > > >   If !env->allow_ptr_leaks only allow writes if slots are not
> > > >   STACK_INVALID. I'm not sure I understand the intention.
> > > > 
> > > > - clean_func_state()
> > > >   STACK_INVALID is used to mark spi's that are not REG_LIVE_READ as
> > > >   such that should not take part in the state comparison. However,
> > > >   stacksafe() has REG_LIVE_READ check as well, so this marking might
> > > >   be unnecessary.
> > > > 
> > > > - stacksafe()
> > > >   STACK_INVALID is used as a mark that some bytes of an spi are not
> > > >   important in a state cached for state comparison. E.g. a slot in an
> > > >   old state might be marked 'mmmm????' and 'mmmmmmmm' or 'mmmm0000' in
> > > >   a new state. However other checks in stacksafe() would catch these
> > > >   variations.
> > > > 
> > > > The conclusion being that some pointer leakage checks might need
> > > > adjustment if STACK_INVALID is replaced by STACK_MISC.
> > > 
> > > Just to be clear. My suggestion was to *treat* STACK_INVALID as
> > > equivalent to STACK_MISC in stacksafe(), not really replace all the
> > > uses of STACK_INVALID with STACK_MISC. And to be on the safe side, I'd
> > > do it only if env->allow_ptr_leaks, of course.
> > 
> > Well, that, and to allow STACK_INVALID if env->allow_ptr_leaks in
> > check_stack_read_fixed_off(), of course, to avoid "invalid read from
> > stack off %d+%d size %d\n" error (that's fixing at least part of the
> > problem with uninitialized struct padding).
> 
> +1 to Andrii's idea.
> It should help us recover this small increase in processed states.
> 
> Eduard,
> 
> The fix itself is brilliant. Thank you for investigating
> and providing the detailed explanation.
> I've read this thread and the previous one,
> walked through all the points and it all looks correct.
> Sorry it took me a long time to remember the details
> of liveness logic to review it properly.
> 
> While you, Andrii and me keep this tricky knowledge in our
> heads could you please document how liveness works in
> Documentation/bpf/verifier.rst ?
> We'll be able to review it now and next time it will be
> easier to remember.

I'll extend the doc and finalize your patch over the weekend;
thank you for the review.

> 
> I've tried Andrii's suggestion:
> 
> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index 7ee218827259..0f71ba6a56e2 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -3591,7 +3591,7 @@ static int check_stack_read_fixed_off(struct
> bpf_verifier_env *env,
> 
> copy_register_state(&state->regs[dst_regno], reg);
>                                 state->regs[dst_regno].subreg_def = subreg_def;
>                         } else {
> -                               for (i = 0; i < size; i++) {
> +                               for (i = 0; i < size &&
> !env->allow_uninit_stack; i++) {
>                                         type = stype[(slot - i) % BPF_REG_SIZE];
>                                         if (type == STACK_SPILL)
>                                                 continue;
> @@ -3628,7 +3628,7 @@ static int check_stack_read_fixed_off(struct
> bpf_verifier_env *env,
>                 }
>                 mark_reg_read(env, reg, reg->parent, REG_LIVE_READ64);
>         } else {
> -               for (i = 0; i < size; i++) {
> +               for (i = 0; i < size && !env->allow_uninit_stack; i++) {
>                         type = stype[(slot - i) % BPF_REG_SIZE];
>                         if (type == STACK_MISC)
>                                 continue;
> @@ -13208,6 +13208,10 @@ static bool stacksafe(struct bpf_verifier_env
> *env, struct bpf_func_state *old,
>                 if (old->stack[spi].slot_type[i % BPF_REG_SIZE] ==
> STACK_INVALID)
>                         continue;
> 
> +               if (env->allow_uninit_stack &&
> +                   old->stack[spi].slot_type[i % BPF_REG_SIZE] == STACK_MISC)
> +                       continue;
> 
> and only dynptr/invalid_read[134] tests failed
> which is expected and acceptable.
> We can tweak those tests.
> 
> Could you take over this diff, run veristat analysis and
> submit it as an official patch? I suspect we should see nice
> improvements in states processed.
> 
> Thanks!
Eduard Zingerman Jan. 30, 2023, 3:33 p.m. UTC | #11
On Thu, 2023-01-19 at 16:16 -0800, Alexei Starovoitov wrote:
[...]
> > > > 
> > > > Just to be clear. My suggestion was to *treat* STACK_INVALID as
> > > > equivalent to STACK_MISC in stacksafe(), not really replace all the
> > > > uses of STACK_INVALID with STACK_MISC. And to be on the safe side, I'd
> > > > do it only if env->allow_ptr_leaks, of course.
> > > 
> > > Well, that, and to allow STACK_INVALID if env->allow_ptr_leaks in
> > > check_stack_read_fixed_off(), of course, to avoid "invalid read from
> > > stack off %d+%d size %d\n" error (that's fixing at least part of the
> > > problem with uninitialized struct padding).
> > 
> > +1 to Andrii's idea.
> > It should help us recover this small increase in processed states.
> > 
[...]
> > 
> > I've tried Andrii's suggestion:
> > 
> > diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> > index 7ee218827259..0f71ba6a56e2 100644
> > --- a/kernel/bpf/verifier.c
> > +++ b/kernel/bpf/verifier.c
> > @@ -3591,7 +3591,7 @@ static int check_stack_read_fixed_off(struct bpf_verifier_env *env,
> >                                 copy_register_state(&state->regs[dst_regno], reg);
> >                                 state->regs[dst_regno].subreg_def = subreg_def;
> >                         } else {
> > -                               for (i = 0; i < size; i++) {
> > +                               for (i = 0; i < size && !env->allow_uninit_stack; i++) {
> >                                         type = stype[(slot - i) % BPF_REG_SIZE];
> >                                         if (type == STACK_SPILL)
> >                                                 continue;
> > @@ -3628,7 +3628,7 @@ static int check_stack_read_fixed_off(struct bpf_verifier_env *env,
> >                 }
> >                 mark_reg_read(env, reg, reg->parent, REG_LIVE_READ64);
> >         } else {
> > -               for (i = 0; i < size; i++) {
> > +               for (i = 0; i < size && !env->allow_uninit_stack; i++) {
> >                         type = stype[(slot - i) % BPF_REG_SIZE];
> >                         if (type == STACK_MISC)
> >                                 continue;
> > @@ -13208,6 +13208,10 @@ static bool stacksafe(struct bpf_verifier_env *env, struct bpf_func_state *old,
> >                 if (old->stack[spi].slot_type[i % BPF_REG_SIZE] == STACK_INVALID)
> >                         continue;
> > 
> > +               if (env->allow_uninit_stack &&
> > +                   old->stack[spi].slot_type[i % BPF_REG_SIZE] == STACK_MISC)
> > +                       continue;
> > 
> > and only dynptr/invalid_read[134] tests failed
> > which is expected and acceptable.
> > We can tweak those tests.
> > 
> > Could you take over this diff, run veristat analysis and
> > submit it as an official patch? I suspect we should see nice
> > improvements in states processed.
> 

Hi Alexei, Andrii,

Please note that the patch
"bpf: Fix to preserve reg parent/live fields when copying range info"
that started this conversation was applied to the `bpf` tree, not `bpf-next`,
so I'll wait until it makes its way to `bpf-next` before submitting formal
patches, as it changes the performance numbers collected by veristat.
I did all my experiments with this patch applied on top of `bpf-next`.

I adapted the patch suggested by Alexei and put it to my github for
now [1]. The performance gains are indeed significant:

$ ./veristat -e file,states -C -f 'states_pct<-30' master.log uninit-reads.log
File                        States (A)  States (B)  States    (DIFF)
--------------------------  ----------  ----------  ----------------
bpf_host.o                         349         244    -105 (-30.09%)
bpf_host.o                        1320         895    -425 (-32.20%)
bpf_lxc.o                         1320         895    -425 (-32.20%)
bpf_sock.o                          70          48     -22 (-31.43%)
bpf_sock.o                          68          46     -22 (-32.35%)
bpf_xdp.o                         1554         803    -751 (-48.33%)
bpf_xdp.o                         6457        2473   -3984 (-61.70%)
bpf_xdp.o                         7249        3908   -3341 (-46.09%)
pyperf600_bpf_loop.bpf.o           287         145    -142 (-49.48%)
strobemeta.bpf.o                 15879        4790  -11089 (-69.83%)
strobemeta_nounroll2.bpf.o       20505        3931  -16574 (-80.83%)
xdp_synproxy_kern.bpf.o          22564        7009  -15555 (-68.94%)
xdp_synproxy_kern.bpf.o          24206        6941  -17265 (-71.33%)
--------------------------  ----------  ----------  ----------------

However, this comes at the cost of allowing reads from uninitialized
stack locations. As far as I understand, access to an uninitialized
local variable is one of the most common errors when programming in C
(although a citation is needed).

Also, more tests fail after the register parentage chains patch is
applied than in Alexei's initial try: 10 verifier tests and 1 progs
test (test_global_func10.c; I have not modified it yet, as it should
wait for my changes for unprivileged execution mode support in
test_loader.c). I don't really like how I had to fix those tests.

I took a detailed look at the difference in verifier behavior between
master and the branch [1] for pyperf600_bpf_loop.bpf.o and identified
that it is caused by the fact that helper functions do not mark the
stack they access as REG_LIVE_WRITTEN. The details are in the commit
message [3], but the TLDR is the following example:

        1: bpf_probe_read_user(&foo, ...);
        2: if (*foo) ...

Here `*foo` does not get a REG_LIVE_WRITTEN mark when (1) is verified,
so the read of `*foo` at (2) might lead to excessive REG_LIVE_READ
marks and thus more verification states.
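
To make this concrete, here is a minimal BPF C sketch of the pattern
(user_ptr and the counter variable are assumed just for illustration):

        __u64 foo;    /* stack slot written only by the helper */

        /* the helper overwrites the full 8-byte slot, but the
         * verifier currently records no REG_LIVE_WRITTEN mark for
         * the slot */
        bpf_probe_read_user(&foo, sizeof(foo), user_ptr);

        /* the read below therefore propagates REG_LIVE_READ marks
         * into parent states, hindering pruning */
        if (foo)
                counter++;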

I prepared a patch that changes helper call verification to apply
REG_LIVE_WRITTEN when write size and alignment allow it; for now it
also lives on my github [2]. This patch has a less dramatic, but
nonetheless significant, performance impact:

$ veristat -e file,states -C -f 'states_pct<-30' master.log helpers-written.log
File                        States (A)  States (B)  States    (DIFF)
--------------------------  ----------  ----------  ----------------
pyperf600_bpf_loop.bpf.o           287         156    -131 (-45.64%)
strobemeta.bpf.o                 15879        4772  -11107 (-69.95%)
strobemeta_nounroll1.bpf.o        2065        1337    -728 (-35.25%)
strobemeta_nounroll2.bpf.o       20505        3788  -16717 (-81.53%)
test_cls_redirect.bpf.o           8129        4799   -3330 (-40.96%)
--------------------------  ----------  ----------  ----------------

I suggest that, instead of dropping a useful safety check, I further
investigate the difference in behavior between "uninit-reads.log" and
"helpers-written.log" and maybe figure out other improvements.
Unfortunately, the comparison process is extremely time consuming.

wdyt?
        
[1] https://github.com/eddyz87/bpf/tree/allow-uninit-stack-reads
[2] https://github.com/eddyz87/bpf/tree/mark-helper-stack-as-written
[3] https://github.com/kernel-patches/bpf/commit/b29842309271c21cbcb3f85d56cdf9f45f8382d2

> Indeed, some massive improvements:
> ./veristat -e file,prog,states -C -f 'states_diff<-10' bb aa
> File                              Program                  States (A)  States (B)  States    (DIFF)
> --------------------------------  -----------------------  ----------  ----------  ----------------
> bpf_flow.bpf.o                    flow_dissector_0                 78          67     -11 (-14.10%)
> loop6.bpf.o                       trace_virtqueue_add_sgs         336         316      -20 (-5.95%)
> pyperf100.bpf.o                   on_event                       6213        4670   -1543 (-24.84%)
> pyperf180.bpf.o                   on_event                      11470        8364   -3106 (-27.08%)
> pyperf50.bpf.o                    on_event                       3263        2370    -893 (-27.37%)
> pyperf600.bpf.o                   on_event                      30335       22200   -8135 (-26.82%)
> pyperf600_bpf_loop.bpf.o          on_event                        287         145    -142 (-49.48%)
> pyperf600_nounroll.bpf.o          on_event                      37101       34169    -2932 (-7.90%)
> strobemeta.bpf.o                  on_event                      15939        4893  -11046 (-69.30%)
> strobemeta_nounroll1.bpf.o        on_event                       1936        1538    -398 (-20.56%)
> strobemeta_nounroll2.bpf.o        on_event                       4436        3991    -445 (-10.03%)
> strobemeta_subprogs.bpf.o         on_event                       2025        1689    -336 (-16.59%)
> test_cls_redirect.bpf.o           cls_redirect                   4865        4042    -823 (-16.92%)
> test_cls_redirect_subprogs.bpf.o  cls_redirect                   4506        4389     -117 (-2.60%)
> test_tcp_hdr_options.bpf.o        estab                           211         178     -33 (-15.64%)
> test_xdp_noinline.bpf.o           balancer_ingress_v4             262         235     -27 (-10.31%)
> test_xdp_noinline.bpf.o           balancer_ingress_v6             253         210     -43 (-17.00%)
> xdp_synproxy_kern.bpf.o           syncookie_tc                  25086        7016  -18070 (-72.03%)
> xdp_synproxy_kern.bpf.o           syncookie_xdp                 24206        6941  -17265 (-71.33%)
> --------------------------------  -----------------------  ----------  ----------  ----------------
Andrii Nakryiko Jan. 31, 2023, 1:17 a.m. UTC | #12
On Mon, Jan 30, 2023 at 7:34 AM Eduard Zingerman <eddyz87@gmail.com> wrote:
>
> On Thu, 2023-01-19 at 16:16 -0800, Alexei Starovoitov wrote:
> [...]
> > > > >
> > > > > Just to be clear. My suggestion was to *treat* STACK_INVALID as
> > > > > equivalent to STACK_MISC in stacksafe(), not really replace all the
> > > > > uses of STACK_INVALID with STACK_MISC. And to be on the safe side, I'd
> > > > > do it only if env->allow_ptr_leaks, of course.
> > > >
> > > > Well, that, and to allow STACK_INVALID if env->allow_ptr_leaks in
> > > > check_stack_read_fixed_off(), of course, to avoid "invalid read from
> > > > stack off %d+%d size %d\n" error (that's fixing at least part of the
> > > > problem with uninitialized struct padding).
> > >
> > > +1 to Andrii's idea.
> > > It should help us recover this small increase in processed states.
> > >
> [...]
> > >
> > > I've tried Andrii's suggestion:
> > >
> > > diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> > > index 7ee218827259..0f71ba6a56e2 100644
> > > --- a/kernel/bpf/verifier.c
> > > +++ b/kernel/bpf/verifier.c
> > > @@ -3591,7 +3591,7 @@ static int check_stack_read_fixed_off(struct bpf_verifier_env *env,
> > >                                 copy_register_state(&state->regs[dst_regno], reg);
> > >                                 state->regs[dst_regno].subreg_def = subreg_def;
> > >                         } else {
> > > -                               for (i = 0; i < size; i++) {
> > > +                               for (i = 0; i < size && !env->allow_uninit_stack; i++) {
> > >                                         type = stype[(slot - i) % BPF_REG_SIZE];
> > >                                         if (type == STACK_SPILL)
> > >                                                 continue;
> > > @@ -3628,7 +3628,7 @@ static int check_stack_read_fixed_off(struct bpf_verifier_env *env,
> > >                 }
> > >                 mark_reg_read(env, reg, reg->parent, REG_LIVE_READ64);
> > >         } else {
> > > -               for (i = 0; i < size; i++) {
> > > +               for (i = 0; i < size && !env->allow_uninit_stack; i++) {
> > >                         type = stype[(slot - i) % BPF_REG_SIZE];
> > >                         if (type == STACK_MISC)
> > >                                 continue;
> > > @@ -13208,6 +13208,10 @@ static bool stacksafe(struct bpf_verifier_env *env, struct bpf_func_state *old,
> > >                 if (old->stack[spi].slot_type[i % BPF_REG_SIZE] == STACK_INVALID)
> > >                         continue;
> > >
> > > +               if (env->allow_uninit_stack &&
> > > +                   old->stack[spi].slot_type[i % BPF_REG_SIZE] == STACK_MISC)
> > > +                       continue;
> > >
> > > and only dynptr/invalid_read[134] tests failed
> > > which is expected and acceptable.
> > > We can tweak those tests.
> > >
> > > Could you take over this diff, run veristat analysis and
> > > submit it as an official patch? I suspect we should see nice
> > > improvements in states processed.
> >
>
> Hi Alexei, Andrii,
>
> Please note that the patch
> "bpf: Fix to preserve reg parent/live fields when copying range info"
> that started this conversation was applied to `bpf` tree, not `bpf-next`,
> so I'll wait until it gets its way to `bpf-next` before submitting formal
> patches, as it changes the performance numbers collected by veristat.
> I did all my experiments with this patch applied on top of `bpf-next`.
>
> I adapted the patch suggested by Alexei and put it to my github for
> now [1]. The performance gains are indeed significant:
>
> $ ./veristat -e file,states -C -f 'states_pct<-30' master.log uninit-reads.log
> File                        States (A)  States (B)  States    (DIFF)
> --------------------------  ----------  ----------  ----------------
> bpf_host.o                         349         244    -105 (-30.09%)
> bpf_host.o                        1320         895    -425 (-32.20%)
> bpf_lxc.o                         1320         895    -425 (-32.20%)
> bpf_sock.o                          70          48     -22 (-31.43%)
> bpf_sock.o                          68          46     -22 (-32.35%)
> bpf_xdp.o                         1554         803    -751 (-48.33%)
> bpf_xdp.o                         6457        2473   -3984 (-61.70%)
> bpf_xdp.o                         7249        3908   -3341 (-46.09%)
> pyperf600_bpf_loop.bpf.o           287         145    -142 (-49.48%)
> strobemeta.bpf.o                 15879        4790  -11089 (-69.83%)
> strobemeta_nounroll2.bpf.o       20505        3931  -16574 (-80.83%)
> xdp_synproxy_kern.bpf.o          22564        7009  -15555 (-68.94%)
> xdp_synproxy_kern.bpf.o          24206        6941  -17265 (-71.33%)
> --------------------------  ----------  ----------  ----------------
>
> However, this comes at a cost of allowing reads from uninitialized
> stack locations. As far as I understand access to uninitialized local
> variable is one of the most common errors when programming in C
> (although citation is needed).

Yeah, a citation is really needed :) I don't see this often in
practice, tbh. What I do see in practice is that people unnecessarily
__builtin_memset(0) a struct and then initialize all fields
field-by-field, instead of just using the nice C syntax:

struct my_struct s = {
   .field_a = 123,
   .field_b = 234,
};


And all that just because there is some padding between field_a and
field_b which the compiler won't zero-initialize.
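
A minimal sketch of the padding issue (the struct layout here is made
up for illustration):

        struct my_struct {
                __u8  field_a;  /* followed by 3 bytes of padding */
                __u32 field_b;
        };

        struct my_struct s = {
                .field_a = 123,
                .field_b = 234,
        };

        /* both fields are initialized, but the padding bytes are not,
         * so the corresponding stack bytes stay STACK_INVALID, and
         * passing &s to a helper that expects initialized memory,
         * e.g. bpf_perf_event_output(), is rejected today without a
         * prior __builtin_memset(&s, 0, sizeof(s)) */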

>
> Also more tests are failing after register parentage chains patch is
> applied than in Alexei's initial try: 10 verifier tests and 1 progs
> test (test_global_func10.c, I have not modified it yet, it should wait
> for my changes for unprivileged execution mode support in
> test_loader.c). I don't really like how I had to fix those tests.
>
> I took a detailed look at the difference in verifier behavior between
> master and the branch [1] for pyperf600_bpf_loop.bpf.o and identified
> that the difference is caused by the fact that helper functions do not
> mark the stack they access as REG_LIVE_WRITTEN, the details are in the
> commit message [3], but TLDR is the following example:
>
>         1: bpf_probe_read_user(&foo, ...);
>         2: if (*foo) ...
>
> Here `*foo` will not get REG_LIVE_WRITTEN mark when (1) is verified,
> thus `*foo` read at (2) might lead to excessive REG_LIVE_READ marks
> and thus more verification states.

This is a good fix in its own right, of course, we should definitely do this!

>
> I prepared a patch that changes helper calls verification to apply
> REG_LIVE_WRITTEN when write size and alignment allow this, again
> currently on my github [2]. This patch has less dramatic performance
> impact, but nonetheless significant:
>
> $ veristat -e file,states -C -f 'states_pct<-30' master.log helpers-written.log
> File                        States (A)  States (B)  States    (DIFF)
> --------------------------  ----------  ----------  ----------------
> pyperf600_bpf_loop.bpf.o           287         156    -131 (-45.64%)
> strobemeta.bpf.o                 15879        4772  -11107 (-69.95%)
> strobemeta_nounroll1.bpf.o        2065        1337    -728 (-35.25%)
> strobemeta_nounroll2.bpf.o       20505        3788  -16717 (-81.53%)
> test_cls_redirect.bpf.o           8129        4799   -3330 (-40.96%)
> --------------------------  ----------  ----------  ----------------
>
> I suggest that instead of dropping a useful safety check I can further
> investigate difference in behavior between "uninit-reads.log" and
> "helpers-written.log" and maybe figure out other improvements.
> Unfortunately the comparison process is extremely time consuming.
>
> wdyt?

I think the concerns about reading uninitialized stack slots are
overblown in practice (in terms of their supposed benefits for
programmer's productivity); I'd still do it, if only in the name of
improving user experience.

>
> [1] https://github.com/eddyz87/bpf/tree/allow-uninit-stack-reads
> [2] https://github.com/eddyz87/bpf/tree/mark-helper-stack-as-written
> [3] https://github.com/kernel-patches/bpf/commit/b29842309271c21cbcb3f85d56cdf9f45f8382d2
>
> > Indeed, some massive improvements:
> > ./veristat -e file,prog,states -C -f 'states_diff<-10' bb aa
> > File                              Program                  States (A)  States (B)  States    (DIFF)
> > --------------------------------  -----------------------  ----------  ----------  ----------------
> > bpf_flow.bpf.o                    flow_dissector_0                 78          67     -11 (-14.10%)
> > loop6.bpf.o                       trace_virtqueue_add_sgs         336         316      -20 (-5.95%)
> > pyperf100.bpf.o                   on_event                       6213        4670   -1543 (-24.84%)
> > pyperf180.bpf.o                   on_event                      11470        8364   -3106 (-27.08%)
> > pyperf50.bpf.o                    on_event                       3263        2370    -893 (-27.37%)
> > pyperf600.bpf.o                   on_event                      30335       22200   -8135 (-26.82%)
> > pyperf600_bpf_loop.bpf.o          on_event                        287         145    -142 (-49.48%)
> > pyperf600_nounroll.bpf.o          on_event                      37101       34169    -2932 (-7.90%)
> > strobemeta.bpf.o                  on_event                      15939        4893  -11046 (-69.30%)
> > strobemeta_nounroll1.bpf.o        on_event                       1936        1538    -398 (-20.56%)
> > strobemeta_nounroll2.bpf.o        on_event                       4436        3991    -445 (-10.03%)
> > strobemeta_subprogs.bpf.o         on_event                       2025        1689    -336 (-16.59%)
> > test_cls_redirect.bpf.o           cls_redirect                   4865        4042    -823 (-16.92%)
> > test_cls_redirect_subprogs.bpf.o  cls_redirect                   4506        4389     -117 (-2.60%)
> > test_tcp_hdr_options.bpf.o        estab                           211         178     -33 (-15.64%)
> > test_xdp_noinline.bpf.o           balancer_ingress_v4             262         235     -27 (-10.31%)
> > test_xdp_noinline.bpf.o           balancer_ingress_v6             253         210     -43 (-17.00%)
> > xdp_synproxy_kern.bpf.o           syncookie_tc                  25086        7016  -18070 (-72.03%)
> > xdp_synproxy_kern.bpf.o           syncookie_xdp                 24206        6941  -17265 (-71.33%)
> > --------------------------------  -----------------------  ----------  ----------  ----------------
>
Alexei Starovoitov Jan. 31, 2023, 2:42 a.m. UTC | #13
On Mon, Jan 30, 2023 at 05:17:19PM -0800, Andrii Nakryiko wrote:
> On Mon, Jan 30, 2023 at 7:34 AM Eduard Zingerman <eddyz87@gmail.com> wrote:
> >
> > On Thu, 2023-01-19 at 16:16 -0800, Alexei Starovoitov wrote:
> > [...]
> > > > > >
> > > > > > Just to be clear. My suggestion was to *treat* STACK_INVALID as
> > > > > > equivalent to STACK_MISC in stacksafe(), not really replace all the
> > > > > > uses of STACK_INVALID with STACK_MISC. And to be on the safe side, I'd
> > > > > > do it only if env->allow_ptr_leaks, of course.
> > > > >
> > > > > Well, that, and to allow STACK_INVALID if env->allow_ptr_leaks in
> > > > > check_stack_read_fixed_off(), of course, to avoid "invalid read from
> > > > > stack off %d+%d size %d\n" error (that's fixing at least part of the
> > > > > problem with uninitialized struct padding).
> > > >
> > > > +1 to Andrii's idea.
> > > > It should help us recover this small increase in processed states.
> > > >
> > [...]
> > > >
> > > > I've tried Andrii's suggestion:
> > > >
> > > > diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> > > > index 7ee218827259..0f71ba6a56e2 100644
> > > > --- a/kernel/bpf/verifier.c
> > > > +++ b/kernel/bpf/verifier.c
> > > > @@ -3591,7 +3591,7 @@ static int check_stack_read_fixed_off(struct bpf_verifier_env *env,
> > > >                                 copy_register_state(&state->regs[dst_regno], reg);
> > > >                                 state->regs[dst_regno].subreg_def = subreg_def;
> > > >                         } else {
> > > > -                               for (i = 0; i < size; i++) {
> > > > +                               for (i = 0; i < size && !env->allow_uninit_stack; i++) {
> > > >                                         type = stype[(slot - i) % BPF_REG_SIZE];
> > > >                                         if (type == STACK_SPILL)
> > > >                                                 continue;
> > > > @@ -3628,7 +3628,7 @@ static int check_stack_read_fixed_off(struct bpf_verifier_env *env,
> > > >                 }
> > > >                 mark_reg_read(env, reg, reg->parent, REG_LIVE_READ64);
> > > >         } else {
> > > > -               for (i = 0; i < size; i++) {
> > > > +               for (i = 0; i < size && !env->allow_uninit_stack; i++) {
> > > >                         type = stype[(slot - i) % BPF_REG_SIZE];
> > > >                         if (type == STACK_MISC)
> > > >                                 continue;
> > > > @@ -13208,6 +13208,10 @@ static bool stacksafe(struct bpf_verifier_env *env, struct bpf_func_state *old,
> > > >                 if (old->stack[spi].slot_type[i % BPF_REG_SIZE] == STACK_INVALID)
> > > >                         continue;
> > > >
> > > > +               if (env->allow_uninit_stack &&
> > > > +                   old->stack[spi].slot_type[i % BPF_REG_SIZE] == STACK_MISC)
> > > > +                       continue;
> > > >
> > > > and only dynptr/invalid_read[134] tests failed
> > > > which is expected and acceptable.
> > > > We can tweak those tests.
> > > >
> > > > Could you take over this diff, run veristat analysis and
> > > > submit it as an official patch? I suspect we should see nice
> > > > improvements in states processed.
> > >
> >
> > Hi Alexei, Andrii,
> >
> > Please note that the patch
> > "bpf: Fix to preserve reg parent/live fields when copying range info"
> > that started this conversation was applied to `bpf` tree, not `bpf-next`,
> > so I'll wait until it gets its way to `bpf-next` before submitting formal
> > patches, as it changes the performance numbers collected by veristat.
> > I did all my experiments with this patch applied on top of `bpf-next`.
> >
> > I adapted the patch suggested by Alexei and put it to my github for
> > now [1]. The performance gains are indeed significant:
> >
> > $ ./veristat -e file,states -C -f 'states_pct<-30' master.log uninit-reads.log
> > File                        States (A)  States (B)  States    (DIFF)
> > --------------------------  ----------  ----------  ----------------
> > bpf_host.o                         349         244    -105 (-30.09%)
> > bpf_host.o                        1320         895    -425 (-32.20%)
> > bpf_lxc.o                         1320         895    -425 (-32.20%)
> > bpf_sock.o                          70          48     -22 (-31.43%)
> > bpf_sock.o                          68          46     -22 (-32.35%)
> > bpf_xdp.o                         1554         803    -751 (-48.33%)
> > bpf_xdp.o                         6457        2473   -3984 (-61.70%)
> > bpf_xdp.o                         7249        3908   -3341 (-46.09%)
> > pyperf600_bpf_loop.bpf.o           287         145    -142 (-49.48%)
> > strobemeta.bpf.o                 15879        4790  -11089 (-69.83%)
> > strobemeta_nounroll2.bpf.o       20505        3931  -16574 (-80.83%)
> > xdp_synproxy_kern.bpf.o          22564        7009  -15555 (-68.94%)
> > xdp_synproxy_kern.bpf.o          24206        6941  -17265 (-71.33%)
> > --------------------------  ----------  ----------  ----------------
> >
> > However, this comes at a cost of allowing reads from uninitialized
> > stack locations. As far as I understand access to uninitialized local
> > variable is one of the most common errors when programming in C
> > (although citation is needed).
> 
> Yeah, a citation is really needed :) I don't see this often in
> practice, tbh. What I do see in practice is that people are
> unnecessarily __builtint_memset(0) struct and initialize all fields
> with field-by-field initialization, instead of just using a nice C
> syntax:
> 
> struct my_struct s = {
>    .field_a = 123,
>    .field_b = 234,
> };
> 
> 
> And all that just because there is some padding between field_a and
> field_b which the compiler won't zero-initialize.
> 
> >
> > Also more tests are failing after register parentage chains patch is
> > applied than in Alexei's initial try: 10 verifier tests and 1 progs
> > test (test_global_func10.c, I have not modified it yet, it should wait
> > for my changes for unprivileged execution mode support in
> > test_loader.c). I don't really like how I had to fix those tests.
> >
> > I took a detailed look at the difference in verifier behavior between
> > master and the branch [1] for pyperf600_bpf_loop.bpf.o and identified
> > that the difference is caused by the fact that helper functions do not
> > mark the stack they access as REG_LIVE_WRITTEN, the details are in the
> > commit message [3], but TLDR is the following example:
> >
> >         1: bpf_probe_read_user(&foo, ...);
> >         2: if (*foo) ...
> >
> > Here `*foo` will not get REG_LIVE_WRITTEN mark when (1) is verified,
> > thus `*foo` read at (2) might lead to excessive REG_LIVE_READ marks
> > and thus more verification states.
> 
> This is a good fix in its own right, of course, we should definitely do this!

+1

> >
> > I prepared a patch that changes helper calls verification to apply
> > REG_LIVE_WRITTEN when write size and alignment allow this, again
> > currently on my github [2]. This patch has less dramatic performance
> > impact, but nonetheless significant:
> >
> > $ veristat -e file,states -C -f 'states_pct<-30' master.log helpers-written.log
> > File                        States (A)  States (B)  States    (DIFF)
> > --------------------------  ----------  ----------  ----------------
> > pyperf600_bpf_loop.bpf.o           287         156    -131 (-45.64%)
> > strobemeta.bpf.o                 15879        4772  -11107 (-69.95%)
> > strobemeta_nounroll1.bpf.o        2065        1337    -728 (-35.25%)
> > strobemeta_nounroll2.bpf.o       20505        3788  -16717 (-81.53%)
> > test_cls_redirect.bpf.o           8129        4799   -3330 (-40.96%)
> > --------------------------  ----------  ----------  ----------------
> >
> > I suggest that instead of dropping a useful safety check I can further
> > investigate difference in behavior between "uninit-reads.log" and
> > "helpers-written.log" and maybe figure out other improvements.
> > Unfortunately the comparison process is extremely time consuming.
> >
> > wdyt?
> 
> I think reading uninitialized stack slot concerns are overblown in
> practice (in terms of their good implications for programmer's
> productivity), I'd still do it if only in the name of improving user
> experience.

+1
Let's do both (REG_LIVE_WRITTEN for helpers and allow uninit).

Uninit access should be caught by the compiler.
The verifier repeats the check for historical reasons, from the time
we tried to make it work for unpriv.
Allowing uninit reads won't increase the number of errors in bpf progs.
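
For example, a plain C function like the sketch below (made up for
illustration) is already flagged by clang with -Wall
(-Wsometimes-uninitialized):

        int f(int cond)
        {
                int x;          /* no initializer */

                if (cond)
                        x = 1;
                /* warning: variable 'x' is used uninitialized
                 * whenever 'if' condition is false */
                return x;
        }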
Eduard Zingerman Jan. 31, 2023, 8:29 a.m. UTC | #14
On Mon, 2023-01-30 at 18:42 -0800, Alexei Starovoitov wrote:
[...]
> > > 
> > > Hi Alexei, Andrii,
> > > 
> > > Please note that the patch
> > > "bpf: Fix to preserve reg parent/live fields when copying range info"
> > > that started this conversation was applied to `bpf` tree, not `bpf-next`,
> > > so I'll wait until it gets its way to `bpf-next` before submitting formal
> > > patches, as it changes the performance numbers collected by veristat.
> > > I did all my experiments with this patch applied on top of `bpf-next`.
> > > 
> > > I adapted the patch suggested by Alexei and put it to my github for
> > > now [1]. The performance gains are indeed significant:
> > > 
> > > $ ./veristat -e file,states -C -f 'states_pct<-30' master.log uninit-reads.log
> > > File                        States (A)  States (B)  States    (DIFF)
> > > --------------------------  ----------  ----------  ----------------
> > > bpf_host.o                         349         244    -105 (-30.09%)
> > > bpf_host.o                        1320         895    -425 (-32.20%)
> > > bpf_lxc.o                         1320         895    -425 (-32.20%)
> > > bpf_sock.o                          70          48     -22 (-31.43%)
> > > bpf_sock.o                          68          46     -22 (-32.35%)
> > > bpf_xdp.o                         1554         803    -751 (-48.33%)
> > > bpf_xdp.o                         6457        2473   -3984 (-61.70%)
> > > bpf_xdp.o                         7249        3908   -3341 (-46.09%)
> > > pyperf600_bpf_loop.bpf.o           287         145    -142 (-49.48%)
> > > strobemeta.bpf.o                 15879        4790  -11089 (-69.83%)
> > > strobemeta_nounroll2.bpf.o       20505        3931  -16574 (-80.83%)
> > > xdp_synproxy_kern.bpf.o          22564        7009  -15555 (-68.94%)
> > > xdp_synproxy_kern.bpf.o          24206        6941  -17265 (-71.33%)
> > > --------------------------  ----------  ----------  ----------------
> > > 
> > > However, this comes at a cost of allowing reads from uninitialized
> > > stack locations. As far as I understand access to uninitialized local
> > > variable is one of the most common errors when programming in C
> > > (although citation is needed).
> > 
> > Yeah, a citation is really needed :) I don't see this often in
> > practice, tbh. What I do see in practice is that people are
> > unnecessarily __builtint_memset(0) struct and initialize all fields
> > with field-by-field initialization, instead of just using a nice C
> > syntax:
> > 
> > struct my_struct s = {
> >    .field_a = 123,
> >    .field_b = 234,
> > };
> > 
> > 
> > And all that just because there is some padding between field_a and
> > field_b which the compiler won't zero-initialize.

Andrii, do you have such an example somewhere? If the use case is to
pass 's' to some helper function, it should already work if the
function argument type is 'ARG_PTR_TO_UNINIT_MEM'. This is handled by
the following code in verifier.c:check_stack_range_initialized():

        ...
        if (meta && meta->raw_mode) {
                ...
                meta->access_size = access_size;
                meta->regno = regno;
                return 0;
        }

But this works only for fixed stack offsets. I'm asking because it
might point to some bug.
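
For reference, a sketch of a case that already verifies (the helper
choice and buffer size are arbitrary):

        char buf[16];   /* never initialized on the stack */

        /* the destination argument of bpf_probe_read_user() is
         * ARG_PTR_TO_UNINIT_MEM, so check_stack_range_initialized()
         * takes the meta->raw_mode early return quoted above and
         * accepts the uninitialized buffer */
        bpf_probe_read_user(buf, sizeof(buf), user_ptr);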

> > 
> > > 
> > > Also more tests are failing after register parentage chains patch is
> > > applied than in Alexei's initial try: 10 verifier tests and 1 progs
> > > test (test_global_func10.c, I have not modified it yet, it should wait
> > > for my changes for unprivileged execution mode support in
> > > test_loader.c). I don't really like how I had to fix those tests.
> > > 
> > > I took a detailed look at the difference in verifier behavior between
> > > master and the branch [1] for pyperf600_bpf_loop.bpf.o and identified
> > > that the difference is caused by the fact that helper functions do not
> > > mark the stack they access as REG_LIVE_WRITTEN, the details are in the
> > > commit message [3], but TLDR is the following example:
> > > 
> > >         1: bpf_probe_read_user(&foo, ...);
> > >         2: if (*foo) ...
> > > 
> > > Here `*foo` will not get REG_LIVE_WRITTEN mark when (1) is verified,
> > > thus `*foo` read at (2) might lead to excessive REG_LIVE_READ marks
> > > and thus more verification states.
> > 
> > This is a good fix in its own right, of course, we should definitely do this!
> 
> +1
> 
> > > 
> > > I prepared a patch that changes helper calls verification to apply
> > > REG_LIVE_WRITTEN when write size and alignment allow this, again
> > > currently on my github [2]. This patch has less dramatic performance
> > > impact, but nonetheless significant:
> > > 
> > > $ veristat -e file,states -C -f 'states_pct<-30' master.log helpers-written.log
> > > File                        States (A)  States (B)  States    (DIFF)
> > > --------------------------  ----------  ----------  ----------------
> > > pyperf600_bpf_loop.bpf.o           287         156    -131 (-45.64%)
> > > strobemeta.bpf.o                 15879        4772  -11107 (-69.95%)
> > > strobemeta_nounroll1.bpf.o        2065        1337    -728 (-35.25%)
> > > strobemeta_nounroll2.bpf.o       20505        3788  -16717 (-81.53%)
> > > test_cls_redirect.bpf.o           8129        4799   -3330 (-40.96%)
> > > --------------------------  ----------  ----------  ----------------
> > > 
> > > I suggest that instead of dropping a useful safety check I can further
> > > investigate difference in behavior between "uninit-reads.log" and
> > > "helpers-written.log" and maybe figure out other improvements.
> > > Unfortunately the comparison process is extremely time consuming.
> > > 
> > > wdyt?
> > 
> > I think reading uninitialized stack slot concerns are overblown in
> > practice (in terms of their good implications for programmer's
> > productivity), I'd still do it if only in the name of improving user
> > experience.
> 
> +1
> Let's do both (REG_LIVE_WRITTEN for helpers and allow uninit).
> 
> Uninit access should be caught by the compiler.
> The verifier is repeating the check for historical reasons when we
> tried to make it work for unpriv.
> Allow uninit won't increase the number of errors in bpf progs.

Thank you for the feedback. I'll submit both patches when
"bpf: Fix to preserve reg parent/live fields when copying range info"
gets to bpf-next.

Thanks,
Eduard.
Andrii Nakryiko Jan. 31, 2023, 6:55 p.m. UTC | #15
On Tue, Jan 31, 2023 at 12:29 AM Eduard Zingerman <eddyz87@gmail.com> wrote:
>
> On Mon, 2023-01-30 at 18:42 -0800, Alexei Starovoitov wrote:
> [...]
> > > >
> > > > Hi Alexei, Andrii,
> > > >
> > > > Please note that the patch
> > > > "bpf: Fix to preserve reg parent/live fields when copying range info"
> > > > that started this conversation was applied to `bpf` tree, not `bpf-next`,
> > > > so I'll wait until it gets its way to `bpf-next` before submitting formal
> > > > patches, as it changes the performance numbers collected by veristat.
> > > > I did all my experiments with this patch applied on top of `bpf-next`.
> > > >
> > > > I adapted the patch suggested by Alexei and put it to my github for
> > > > now [1]. The performance gains are indeed significant:
> > > >
> > > > $ ./veristat -e file,states -C -f 'states_pct<-30' master.log uninit-reads.log
> > > > File                        States (A)  States (B)  States    (DIFF)
> > > > --------------------------  ----------  ----------  ----------------
> > > > bpf_host.o                         349         244    -105 (-30.09%)
> > > > bpf_host.o                        1320         895    -425 (-32.20%)
> > > > bpf_lxc.o                         1320         895    -425 (-32.20%)
> > > > bpf_sock.o                          70          48     -22 (-31.43%)
> > > > bpf_sock.o                          68          46     -22 (-32.35%)
> > > > bpf_xdp.o                         1554         803    -751 (-48.33%)
> > > > bpf_xdp.o                         6457        2473   -3984 (-61.70%)
> > > > bpf_xdp.o                         7249        3908   -3341 (-46.09%)
> > > > pyperf600_bpf_loop.bpf.o           287         145    -142 (-49.48%)
> > > > strobemeta.bpf.o                 15879        4790  -11089 (-69.83%)
> > > > strobemeta_nounroll2.bpf.o       20505        3931  -16574 (-80.83%)
> > > > xdp_synproxy_kern.bpf.o          22564        7009  -15555 (-68.94%)
> > > > xdp_synproxy_kern.bpf.o          24206        6941  -17265 (-71.33%)
> > > > --------------------------  ----------  ----------  ----------------
> > > >
> > > > However, this comes at a cost of allowing reads from uninitialized
> > > > stack locations. As far as I understand access to uninitialized local
> > > > variable is one of the most common errors when programming in C
> > > > (although citation is needed).
> > >
> > > Yeah, a citation is really needed :) I don't see this often in
> > > practice, tbh. What I do see in practice is that people are
> > > unnecessarily __builtint_memset(0) struct and initialize all fields
> > > with field-by-field initialization, instead of just using a nice C
> > > syntax:
> > >
> > > struct my_struct s = {
> > >    .field_a = 123,
> > >    .field_b = 234,
> > > };
> > >
> > >
> > > And all that just because there is some padding between field_a and
> > > field_b which the compiler won't zero-initialize.
>
> Andrii, do you have such example somewhere? If the use case is to pass
> 's' to some helper function it should already work if function
> argument type is 'ARG_PTR_TO_UNINIT_MEM'. Handled by the following
> code in the verifier.c:check_stack_range_initialized():
>
>         ...
>         if (meta && meta->raw_mode) {
>                 ...
>                 meta->access_size = access_size;
>                 meta->regno = regno;
>                 return 0;
>         }
>
> But only for fixed stack offsets. I'm asking because it might point to
> some bug.

I don't have specifics at hand, but there were at least a few
different internal cases where this popped up. And just grepping the
code base, we have tons of

__builtin_memset(&event, 0, sizeof(event));

in our code base, even though struct initialization would be more convenient.

I don't remember how exactly that struct was used, but one of the use
cases was passing it to bpf_perf_event_output(), which expects
initialized memory.
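
I.e. the prevailing workaround looks like this sketch (the event
struct, the events map and the ctx argument are placeholders):

        struct event_t event;

        /* zero everything, including padding, purely to satisfy the
         * verifier's initialized-stack requirement */
        __builtin_memset(&event, 0, sizeof(event));
        event.pid = bpf_get_current_pid_tgid() >> 32;
        bpf_perf_event_output(ctx, &events, BPF_F_CURRENT_CPU,
                              &event, sizeof(event));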

>
> > >
> > > >
> > > > Also more tests are failing after register parentage chains patch is
> > > > applied than in Alexei's initial try: 10 verifier tests and 1 progs
> > > > test (test_global_func10.c, I have not modified it yet, it should wait
> > > > for my changes for unprivileged execution mode support in
> > > > test_loader.c). I don't really like how I had to fix those tests.
> > > >
> > > > I took a detailed look at the difference in verifier behavior between
> > > > master and the branch [1] for pyperf600_bpf_loop.bpf.o and identified
> > > > that the difference is caused by the fact that helper functions do not
> > > > mark the stack they access as REG_LIVE_WRITTEN, the details are in the
> > > > commit message [3], but TLDR is the following example:
> > > >
> > > >         1: bpf_probe_read_user(&foo, ...);
> > > >         2: if (*foo) ...
> > > >
> > > > Here `*foo` will not get REG_LIVE_WRITTEN mark when (1) is verified,
> > > > thus `*foo` read at (2) might lead to excessive REG_LIVE_READ marks
> > > > and thus more verification states.
> > >
> > > This is a good fix in its own right, of course, we should definitely do this!
> >
> > +1
> >
> > > >
> > > > I prepared a patch that changes helper calls verification to apply
> > > > REG_LIVE_WRITTEN when write size and alignment allow this, again
> > > > currently on my github [2]. This patch has less dramatic performance
> > > > impact, but nonetheless significant:
> > > >
> > > > $ veristat -e file,states -C -f 'states_pct<-30' master.log helpers-written.log
> > > > File                        States (A)  States (B)  States    (DIFF)
> > > > --------------------------  ----------  ----------  ----------------
> > > > pyperf600_bpf_loop.bpf.o           287         156    -131 (-45.64%)
> > > > strobemeta.bpf.o                 15879        4772  -11107 (-69.95%)
> > > > strobemeta_nounroll1.bpf.o        2065        1337    -728 (-35.25%)
> > > > strobemeta_nounroll2.bpf.o       20505        3788  -16717 (-81.53%)
> > > > test_cls_redirect.bpf.o           8129        4799   -3330 (-40.96%)
> > > > --------------------------  ----------  ----------  ----------------
> > > >
> > > > I suggest that instead of dropping a useful safety check I can further
> > > > investigate difference in behavior between "uninit-reads.log" and
> > > > "helpers-written.log" and maybe figure out other improvements.
> > > > Unfortunately the comparison process is extremely time consuming.
> > > >
> > > > wdyt?
> > >
> > > I think reading uninitialized stack slot concerns are overblown in
> > > practice (in terms of their good implications for programmer's
> > > productivity), I'd still do it if only in the name of improving user
> > > experience.
> >
> > +1
> > Let's do both (REG_LIVE_WRITTEN for helpers and allow uninit).
> >
> > Uninit access should be caught by the compiler.
> > The verifier is repeating the check for historical reasons when we
> > tried to make it work for unpriv.
> > Allow uninit won't increase the number of errors in bpf progs.
>
> Thank you for the feedback. I'll submit both patches when
> "bpf: Fix to preserve reg parent/live fields when copying range info"
> will get to bpf-next.
>
> Thanks,
> Eduard.