mbox series

[bpf-next,v6,0/2] bpf: Add a generic bits iterator

Message ID 20240411131127.73098-1-laoar.shao@gmail.com (mailing list archive)
Headers show
Series bpf: Add a generic bits iterator | expand

Message

Yafang Shao April 11, 2024, 1:11 p.m. UTC
Three new kfuncs, namely bpf_iter_bits_{new,next,destroy}, have been
added for the new bpf_iter_bits functionality. These kfuncs enable the
iteration of the bits from a given address and a given number of bits.

- bpf_iter_bits_new
  Initialize a new bits iterator for a given memory area. Due to the
  limitation of bpf memalloc, the max number of bits to be iterated
  over is (4096 * 8).
- bpf_iter_bits_next
  Get the next bit in a bpf_iter_bits
- bpf_iter_bits_destroy
  Destroy a bpf_iter_bits

The bits iterator can be used in any context and on any address.

Changes:
- v5->v6:
  - Add positive tests (Andrii)
- v4->v5:
  - Simplify test cases (Andrii)
- v3->v4:
  - Fix endianness error on s390x (Andrii)
  - zero-initialize kit->bits_copy and zero out nr_bits (Andrii)
- v2->v3:
  - Optimization for u64/u32 mask (Andrii)
- v1->v2:
  - Simplify the CPU number verification code to avoid the failure on s390x
    (Eduard)
- bpf: Add bpf_iter_cpumask
  https://lwn.net/Articles/961104/
- bpf: Add new bpf helper bpf_for_each_cpu
  https://lwn.net/Articles/939939/

Yafang Shao (2):
  bpf: Add bits iterator
  selftests/bpf: Add selftest for bits iter

 kernel/bpf/helpers.c                          | 120 +++++++++++++++++
 .../selftests/bpf/prog_tests/verifier.c       |   2 +
 .../selftests/bpf/progs/verifier_bits_iter.c  | 127 ++++++++++++++++++
 3 files changed, 249 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/progs/verifier_bits_iter.c

Comments

Yafang Shao April 11, 2024, 1:50 p.m. UTC | #1
On Thu, Apr 11, 2024 at 9:11 PM Yafang Shao <laoar.shao@gmail.com> wrote:
>
> Three new kfuncs, namely bpf_iter_bits_{new,next,destroy}, have been
> added for the new bpf_iter_bits functionality. These kfuncs enable the
> iteration of the bits from a given address and a given number of bits.
>
> - bpf_iter_bits_new
>   Initialize a new bits iterator for a given memory area. Due to the
>   limitation of bpf memalloc, the max number of bits to be iterated
>   over is (4096 * 8).
> - bpf_iter_bits_next
>   Get the next bit in a bpf_iter_bits
> - bpf_iter_bits_destroy
>   Destroy a bpf_iter_bits
>
> The bits iterator can be used in any context and on any address.
>
> Changes:
> - v5->v6:
>   - Add positive tests (Andrii)
> - v4->v5:
>   - Simplify test cases (Andrii)
> - v3->v4:
>   - Fix endianness error on s390x (Andrii)
>   - zero-initialize kit->bits_copy and zero out nr_bits (Andrii)
> - v2->v3:
>   - Optimization for u64/u32 mask (Andrii)
> - v1->v2:
>   - Simplify the CPU number verification code to avoid the failure on s390x
>     (Eduard)
> - bpf: Add bpf_iter_cpumask
>   https://lwn.net/Articles/961104/
> - bpf: Add new bpf helper bpf_for_each_cpu
>   https://lwn.net/Articles/939939/
>
> Yafang Shao (2):
>   bpf: Add bits iterator
>   selftests/bpf: Add selftest for bits iter
>
>  kernel/bpf/helpers.c                          | 120 +++++++++++++++++
>  .../selftests/bpf/prog_tests/verifier.c       |   2 +
>  .../selftests/bpf/progs/verifier_bits_iter.c  | 127 ++++++++++++++++++
>  3 files changed, 249 insertions(+)
>  create mode 100644 tools/testing/selftests/bpf/progs/verifier_bits_iter.c
>
> --
> 2.39.1
>

It appears that the test case failed on s390x when the data is
a u32 value because we need to set the higher 32 bits.
will analyze it.


--
Regards
Yafang
Andrii Nakryiko April 25, 2024, 12:34 a.m. UTC | #2
On Thu, Apr 11, 2024 at 6:51 AM Yafang Shao <laoar.shao@gmail.com> wrote:
>
> On Thu, Apr 11, 2024 at 9:11 PM Yafang Shao <laoar.shao@gmail.com> wrote:
> >
> > Three new kfuncs, namely bpf_iter_bits_{new,next,destroy}, have been
> > added for the new bpf_iter_bits functionality. These kfuncs enable the
> > iteration of the bits from a given address and a given number of bits.
> >
> > - bpf_iter_bits_new
> >   Initialize a new bits iterator for a given memory area. Due to the
> >   limitation of bpf memalloc, the max number of bits to be iterated
> >   over is (4096 * 8).
> > - bpf_iter_bits_next
> >   Get the next bit in a bpf_iter_bits
> > - bpf_iter_bits_destroy
> >   Destroy a bpf_iter_bits
> >
> > The bits iterator can be used in any context and on any address.
> >
> > Changes:
> > - v5->v6:
> >   - Add positive tests (Andrii)
> > - v4->v5:
> >   - Simplify test cases (Andrii)
> > - v3->v4:
> >   - Fix endianness error on s390x (Andrii)
> >   - zero-initialize kit->bits_copy and zero out nr_bits (Andrii)
> > - v2->v3:
> >   - Optimization for u64/u32 mask (Andrii)
> > - v1->v2:
> >   - Simplify the CPU number verification code to avoid the failure on s390x
> >     (Eduard)
> > - bpf: Add bpf_iter_cpumask
> >   https://lwn.net/Articles/961104/
> > - bpf: Add new bpf helper bpf_for_each_cpu
> >   https://lwn.net/Articles/939939/
> >
> > Yafang Shao (2):
> >   bpf: Add bits iterator
> >   selftests/bpf: Add selftest for bits iter
> >
> >  kernel/bpf/helpers.c                          | 120 +++++++++++++++++
> >  .../selftests/bpf/prog_tests/verifier.c       |   2 +
> >  .../selftests/bpf/progs/verifier_bits_iter.c  | 127 ++++++++++++++++++
> >  3 files changed, 249 insertions(+)
> >  create mode 100644 tools/testing/selftests/bpf/progs/verifier_bits_iter.c
> >
> > --
> > 2.39.1
> >
>
> It appears that the test case failed on s390x when the data is
> a u32 value because we need to set the higher 32 bits.
> will analyze it.
>

Hey Yafang, did you get a chance to debug and fix the issue?

>
> --
> Regards
> Yafang
Yafang Shao April 25, 2024, 5:36 a.m. UTC | #3
On Thu, Apr 25, 2024 at 8:34 AM Andrii Nakryiko
<andrii.nakryiko@gmail.com> wrote:
>
> On Thu, Apr 11, 2024 at 6:51 AM Yafang Shao <laoar.shao@gmail.com> wrote:
> >
> > On Thu, Apr 11, 2024 at 9:11 PM Yafang Shao <laoar.shao@gmail.com> wrote:
> > >
> > > Three new kfuncs, namely bpf_iter_bits_{new,next,destroy}, have been
> > > added for the new bpf_iter_bits functionality. These kfuncs enable the
> > > iteration of the bits from a given address and a given number of bits.
> > >
> > > - bpf_iter_bits_new
> > >   Initialize a new bits iterator for a given memory area. Due to the
> > >   limitation of bpf memalloc, the max number of bits to be iterated
> > >   over is (4096 * 8).
> > > - bpf_iter_bits_next
> > >   Get the next bit in a bpf_iter_bits
> > > - bpf_iter_bits_destroy
> > >   Destroy a bpf_iter_bits
> > >
> > > The bits iterator can be used in any context and on any address.
> > >
> > > Changes:
> > > - v5->v6:
> > >   - Add positive tests (Andrii)
> > > - v4->v5:
> > >   - Simplify test cases (Andrii)
> > > - v3->v4:
> > >   - Fix endianness error on s390x (Andrii)
> > >   - zero-initialize kit->bits_copy and zero out nr_bits (Andrii)
> > > - v2->v3:
> > >   - Optimization for u64/u32 mask (Andrii)
> > > - v1->v2:
> > >   - Simplify the CPU number verification code to avoid the failure on s390x
> > >     (Eduard)
> > > - bpf: Add bpf_iter_cpumask
> > >   https://lwn.net/Articles/961104/
> > > - bpf: Add new bpf helper bpf_for_each_cpu
> > >   https://lwn.net/Articles/939939/
> > >
> > > Yafang Shao (2):
> > >   bpf: Add bits iterator
> > >   selftests/bpf: Add selftest for bits iter
> > >
> > >  kernel/bpf/helpers.c                          | 120 +++++++++++++++++
> > >  .../selftests/bpf/prog_tests/verifier.c       |   2 +
> > >  .../selftests/bpf/progs/verifier_bits_iter.c  | 127 ++++++++++++++++++
> > >  3 files changed, 249 insertions(+)
> > >  create mode 100644 tools/testing/selftests/bpf/progs/verifier_bits_iter.c
> > >
> > > --
> > > 2.39.1
> > >
> >
> > It appears that the test case failed on s390x when the data is
> > a u32 value because we need to set the higher 32 bits.
> > will analyze it.
> >
>
> Hey Yafang, did you get a chance to debug and fix the issue?
>

Hi Andrii,

Apologies for the delay; I recently returned from an extended holiday.

The issue stems from the limitations of bpf_probe_read_kernel() on
s390 architecture. The attachment provides a straightforward example
to illustrate this issue. The observed results are as follows:

    Error: #463/1 verifier_probe_read/probe read 4 bytes
    8897 run_subtest:PASS:obj_open_mem 0 nsec
    8898 run_subtest:PASS:unexpected_load_failure 0 nsec
    8899 do_prog_test_run:PASS:bpf_prog_test_run 0 nsec
    8900 run_subtest:FAIL:659 Unexpected retval: 2817064 != 512

    Error: #463/2 verifier_probe_read/probe read 8 bytes
    8903 run_subtest:PASS:obj_open_mem 0 nsec
    8904 run_subtest:PASS:unexpected_load_failure 0 nsec
    8905 do_prog_test_run:PASS:bpf_prog_test_run 0 nsec
    8906 run_subtest:FAIL:659 Unexpected retval: 0 != 512

More details can be found at:  https://github.com/kernel-patches/bpf/pull/6872

Should we consider this behavior of bpf_probe_read_kernel() as
expected, or is it something that requires fixing?
Yonghong Song April 25, 2024, 6:05 a.m. UTC | #4
On 4/24/24 10:36 PM, Yafang Shao wrote:
> On Thu, Apr 25, 2024 at 8:34 AM Andrii Nakryiko
> <andrii.nakryiko@gmail.com> wrote:
>> On Thu, Apr 11, 2024 at 6:51 AM Yafang Shao <laoar.shao@gmail.com> wrote:
>>> On Thu, Apr 11, 2024 at 9:11 PM Yafang Shao <laoar.shao@gmail.com> wrote:
>>>> Three new kfuncs, namely bpf_iter_bits_{new,next,destroy}, have been
>>>> added for the new bpf_iter_bits functionality. These kfuncs enable the
>>>> iteration of the bits from a given address and a given number of bits.
>>>>
>>>> - bpf_iter_bits_new
>>>>    Initialize a new bits iterator for a given memory area. Due to the
>>>>    limitation of bpf memalloc, the max number of bits to be iterated
>>>>    over is (4096 * 8).
>>>> - bpf_iter_bits_next
>>>>    Get the next bit in a bpf_iter_bits
>>>> - bpf_iter_bits_destroy
>>>>    Destroy a bpf_iter_bits
>>>>
>>>> The bits iterator can be used in any context and on any address.
>>>>
>>>> Changes:
>>>> - v5->v6:
>>>>    - Add positive tests (Andrii)
>>>> - v4->v5:
>>>>    - Simplify test cases (Andrii)
>>>> - v3->v4:
>>>>    - Fix endianness error on s390x (Andrii)
>>>>    - zero-initialize kit->bits_copy and zero out nr_bits (Andrii)
>>>> - v2->v3:
>>>>    - Optimization for u64/u32 mask (Andrii)
>>>> - v1->v2:
>>>>    - Simplify the CPU number verification code to avoid the failure on s390x
>>>>      (Eduard)
>>>> - bpf: Add bpf_iter_cpumask
>>>>    https://lwn.net/Articles/961104/
>>>> - bpf: Add new bpf helper bpf_for_each_cpu
>>>>    https://lwn.net/Articles/939939/
>>>>
>>>> Yafang Shao (2):
>>>>    bpf: Add bits iterator
>>>>    selftests/bpf: Add selftest for bits iter
>>>>
>>>>   kernel/bpf/helpers.c                          | 120 +++++++++++++++++
>>>>   .../selftests/bpf/prog_tests/verifier.c       |   2 +
>>>>   .../selftests/bpf/progs/verifier_bits_iter.c  | 127 ++++++++++++++++++
>>>>   3 files changed, 249 insertions(+)
>>>>   create mode 100644 tools/testing/selftests/bpf/progs/verifier_bits_iter.c
>>>>
>>>> --
>>>> 2.39.1
>>>>
>>> It appears that the test case failed on s390x when the data is
>>> a u32 value because we need to set the higher 32 bits.
>>> will analyze it.
>>>
>> Hey Yafang, did you get a chance to debug and fix the issue?
>>
> Hi Andrii,
>
> Apologies for the delay; I recently returned from an extended holiday.
>
> The issue stems from the limitations of bpf_probe_read_kernel() on
> s390 architecture. The attachment provides a straightforward example
> to illustrate this issue. The observed results are as follows:
>
>      Error: #463/1 verifier_probe_read/probe read 4 bytes
>      8897 run_subtest:PASS:obj_open_mem 0 nsec
>      8898 run_subtest:PASS:unexpected_load_failure 0 nsec
>      8899 do_prog_test_run:PASS:bpf_prog_test_run 0 nsec
>      8900 run_subtest:FAIL:659 Unexpected retval: 2817064 != 512
>
>      Error: #463/2 verifier_probe_read/probe read 8 bytes
>      8903 run_subtest:PASS:obj_open_mem 0 nsec
>      8904 run_subtest:PASS:unexpected_load_failure 0 nsec
>      8905 do_prog_test_run:PASS:bpf_prog_test_run 0 nsec
>      8906 run_subtest:FAIL:659 Unexpected retval: 0 != 512
>
> More details can be found at:  https://github.com/kernel-patches/bpf/pull/6872
>
> Should we consider this behavior of bpf_probe_read_kernel() as
> expected, or is it something that requires fixing?

Maybe to guard the result with macros like __s390x__ to differentiate 
s390 vs. arm64/x86_64? There are some examples in prog_tests/* having 
such guards. probe_user.c:#if defined(__s390x__) 
test_bpf_syscall_macro.c:#if defined(__aarch64__) || defined(__s390__) 
xdp_adjust_tail.c:#if defined(__s390x__) xdp_do_redirect.c:#if 
defined(__s390x__)
Yafang Shao April 25, 2024, 8:03 a.m. UTC | #5
On Thu, Apr 25, 2024 at 2:05 PM Yonghong Song <yonghong.song@linux.dev> wrote:
>
>
> On 4/24/24 10:36 PM, Yafang Shao wrote:
> > On Thu, Apr 25, 2024 at 8:34 AM Andrii Nakryiko
> > <andrii.nakryiko@gmail.com> wrote:
> >> On Thu, Apr 11, 2024 at 6:51 AM Yafang Shao <laoar.shao@gmail.com> wrote:
> >>> On Thu, Apr 11, 2024 at 9:11 PM Yafang Shao <laoar.shao@gmail.com> wrote:
> >>>> Three new kfuncs, namely bpf_iter_bits_{new,next,destroy}, have been
> >>>> added for the new bpf_iter_bits functionality. These kfuncs enable the
> >>>> iteration of the bits from a given address and a given number of bits.
> >>>>
> >>>> - bpf_iter_bits_new
> >>>>    Initialize a new bits iterator for a given memory area. Due to the
> >>>>    limitation of bpf memalloc, the max number of bits to be iterated
> >>>>    over is (4096 * 8).
> >>>> - bpf_iter_bits_next
> >>>>    Get the next bit in a bpf_iter_bits
> >>>> - bpf_iter_bits_destroy
> >>>>    Destroy a bpf_iter_bits
> >>>>
> >>>> The bits iterator can be used in any context and on any address.
> >>>>
> >>>> Changes:
> >>>> - v5->v6:
> >>>>    - Add positive tests (Andrii)
> >>>> - v4->v5:
> >>>>    - Simplify test cases (Andrii)
> >>>> - v3->v4:
> >>>>    - Fix endianness error on s390x (Andrii)
> >>>>    - zero-initialize kit->bits_copy and zero out nr_bits (Andrii)
> >>>> - v2->v3:
> >>>>    - Optimization for u64/u32 mask (Andrii)
> >>>> - v1->v2:
> >>>>    - Simplify the CPU number verification code to avoid the failure on s390x
> >>>>      (Eduard)
> >>>> - bpf: Add bpf_iter_cpumask
> >>>>    https://lwn.net/Articles/961104/
> >>>> - bpf: Add new bpf helper bpf_for_each_cpu
> >>>>    https://lwn.net/Articles/939939/
> >>>>
> >>>> Yafang Shao (2):
> >>>>    bpf: Add bits iterator
> >>>>    selftests/bpf: Add selftest for bits iter
> >>>>
> >>>>   kernel/bpf/helpers.c                          | 120 +++++++++++++++++
> >>>>   .../selftests/bpf/prog_tests/verifier.c       |   2 +
> >>>>   .../selftests/bpf/progs/verifier_bits_iter.c  | 127 ++++++++++++++++++
> >>>>   3 files changed, 249 insertions(+)
> >>>>   create mode 100644 tools/testing/selftests/bpf/progs/verifier_bits_iter.c
> >>>>
> >>>> --
> >>>> 2.39.1
> >>>>
> >>> It appears that the test case failed on s390x when the data is
> >>> a u32 value because we need to set the higher 32 bits.
> >>> will analyze it.
> >>>
> >> Hey Yafang, did you get a chance to debug and fix the issue?
> >>
> > Hi Andrii,
> >
> > Apologies for the delay; I recently returned from an extended holiday.
> >
> > The issue stems from the limitations of bpf_probe_read_kernel() on
> > s390 architecture. The attachment provides a straightforward example
> > to illustrate this issue. The observed results are as follows:
> >
> >      Error: #463/1 verifier_probe_read/probe read 4 bytes
> >      8897 run_subtest:PASS:obj_open_mem 0 nsec
> >      8898 run_subtest:PASS:unexpected_load_failure 0 nsec
> >      8899 do_prog_test_run:PASS:bpf_prog_test_run 0 nsec
> >      8900 run_subtest:FAIL:659 Unexpected retval: 2817064 != 512
> >
> >      Error: #463/2 verifier_probe_read/probe read 8 bytes
> >      8903 run_subtest:PASS:obj_open_mem 0 nsec
> >      8904 run_subtest:PASS:unexpected_load_failure 0 nsec
> >      8905 do_prog_test_run:PASS:bpf_prog_test_run 0 nsec
> >      8906 run_subtest:FAIL:659 Unexpected retval: 0 != 512
> >
> > More details can be found at:  https://github.com/kernel-patches/bpf/pull/6872
> >
> > Should we consider this behavior of bpf_probe_read_kernel() as
> > expected, or is it something that requires fixing?
>
> Maybe to guard the result with macros like __s390x__ to differentiate
> s390 vs. arm64/x86_64? There are some examples in prog_tests/* having
> such guards. probe_user.c:#if defined(__s390x__)
> test_bpf_syscall_macro.c:#if defined(__aarch64__) || defined(__s390__)
> xdp_adjust_tail.c:#if defined(__s390x__) xdp_do_redirect.c:#if
> defined(__s390x__)
>

That's feasible. Thank you for your suggestion. I'll make the necessary changes.
Andrii Nakryiko April 25, 2024, 6:15 p.m. UTC | #6
On Wed, Apr 24, 2024 at 10:37 PM Yafang Shao <laoar.shao@gmail.com> wrote:
>
> On Thu, Apr 25, 2024 at 8:34 AM Andrii Nakryiko
> <andrii.nakryiko@gmail.com> wrote:
> >
> > On Thu, Apr 11, 2024 at 6:51 AM Yafang Shao <laoar.shao@gmail.com> wrote:
> > >
> > > On Thu, Apr 11, 2024 at 9:11 PM Yafang Shao <laoar.shao@gmail.com> wrote:
> > > >
> > > > Three new kfuncs, namely bpf_iter_bits_{new,next,destroy}, have been
> > > > added for the new bpf_iter_bits functionality. These kfuncs enable the
> > > > iteration of the bits from a given address and a given number of bits.
> > > >
> > > > - bpf_iter_bits_new
> > > >   Initialize a new bits iterator for a given memory area. Due to the
> > > >   limitation of bpf memalloc, the max number of bits to be iterated
> > > >   over is (4096 * 8).
> > > > - bpf_iter_bits_next
> > > >   Get the next bit in a bpf_iter_bits
> > > > - bpf_iter_bits_destroy
> > > >   Destroy a bpf_iter_bits
> > > >
> > > > The bits iterator can be used in any context and on any address.
> > > >
> > > > Changes:
> > > > - v5->v6:
> > > >   - Add positive tests (Andrii)
> > > > - v4->v5:
> > > >   - Simplify test cases (Andrii)
> > > > - v3->v4:
> > > >   - Fix endianness error on s390x (Andrii)
> > > >   - zero-initialize kit->bits_copy and zero out nr_bits (Andrii)
> > > > - v2->v3:
> > > >   - Optimization for u64/u32 mask (Andrii)
> > > > - v1->v2:
> > > >   - Simplify the CPU number verification code to avoid the failure on s390x
> > > >     (Eduard)
> > > > - bpf: Add bpf_iter_cpumask
> > > >   https://lwn.net/Articles/961104/
> > > > - bpf: Add new bpf helper bpf_for_each_cpu
> > > >   https://lwn.net/Articles/939939/
> > > >
> > > > Yafang Shao (2):
> > > >   bpf: Add bits iterator
> > > >   selftests/bpf: Add selftest for bits iter
> > > >
> > > >  kernel/bpf/helpers.c                          | 120 +++++++++++++++++
> > > >  .../selftests/bpf/prog_tests/verifier.c       |   2 +
> > > >  .../selftests/bpf/progs/verifier_bits_iter.c  | 127 ++++++++++++++++++
> > > >  3 files changed, 249 insertions(+)
> > > >  create mode 100644 tools/testing/selftests/bpf/progs/verifier_bits_iter.c
> > > >
> > > > --
> > > > 2.39.1
> > > >
> > >
> > > It appears that the test case failed on s390x when the data is
> > > a u32 value because we need to set the higher 32 bits.
> > > will analyze it.
> > >
> >
> > Hey Yafang, did you get a chance to debug and fix the issue?
> >
>
> Hi Andrii,
>
> Apologies for the delay; I recently returned from an extended holiday.
>
> The issue stems from the limitations of bpf_probe_read_kernel() on
> s390 architecture. The attachment provides a straightforward example
> to illustrate this issue. The observed results are as follows:
>
>     Error: #463/1 verifier_probe_read/probe read 4 bytes
>     8897 run_subtest:PASS:obj_open_mem 0 nsec
>     8898 run_subtest:PASS:unexpected_load_failure 0 nsec
>     8899 do_prog_test_run:PASS:bpf_prog_test_run 0 nsec
>     8900 run_subtest:FAIL:659 Unexpected retval: 2817064 != 512
>
>     Error: #463/2 verifier_probe_read/probe read 8 bytes
>     8903 run_subtest:PASS:obj_open_mem 0 nsec
>     8904 run_subtest:PASS:unexpected_load_failure 0 nsec
>     8905 do_prog_test_run:PASS:bpf_prog_test_run 0 nsec
>     8906 run_subtest:FAIL:659 Unexpected retval: 0 != 512
>
> More details can be found at:  https://github.com/kernel-patches/bpf/pull/6872
>
> Should we consider this behavior of bpf_probe_read_kernel() as
> expected, or is it something that requires fixing?
>

I might be missing something, but there is nothing wrong with
bpf_probe_read_kernel() behavior. In "read 4" case you are overwriting
only upper 4 bytes of u64, so lower 4 bytes are garbage. In "read 8"
you are reading (upper) 4 bytes of garbage from uninitialized
data_dst.

So getting back to iter implementation. Make sure you are
zero-initializing that u64 value you are reading into?


> --
> Regards
> Yafang
Yafang Shao April 26, 2024, 5:04 a.m. UTC | #7
On Fri, Apr 26, 2024 at 2:15 AM Andrii Nakryiko
<andrii.nakryiko@gmail.com> wrote:
>
> On Wed, Apr 24, 2024 at 10:37 PM Yafang Shao <laoar.shao@gmail.com> wrote:
> >
> > On Thu, Apr 25, 2024 at 8:34 AM Andrii Nakryiko
> > <andrii.nakryiko@gmail.com> wrote:
> > >
> > > On Thu, Apr 11, 2024 at 6:51 AM Yafang Shao <laoar.shao@gmail.com> wrote:
> > > >
> > > > On Thu, Apr 11, 2024 at 9:11 PM Yafang Shao <laoar.shao@gmail.com> wrote:
> > > > >
> > > > > Three new kfuncs, namely bpf_iter_bits_{new,next,destroy}, have been
> > > > > added for the new bpf_iter_bits functionality. These kfuncs enable the
> > > > > iteration of the bits from a given address and a given number of bits.
> > > > >
> > > > > - bpf_iter_bits_new
> > > > >   Initialize a new bits iterator for a given memory area. Due to the
> > > > >   limitation of bpf memalloc, the max number of bits to be iterated
> > > > >   over is (4096 * 8).
> > > > > - bpf_iter_bits_next
> > > > >   Get the next bit in a bpf_iter_bits
> > > > > - bpf_iter_bits_destroy
> > > > >   Destroy a bpf_iter_bits
> > > > >
> > > > > The bits iterator can be used in any context and on any address.
> > > > >
> > > > > Changes:
> > > > > - v5->v6:
> > > > >   - Add positive tests (Andrii)
> > > > > - v4->v5:
> > > > >   - Simplify test cases (Andrii)
> > > > > - v3->v4:
> > > > >   - Fix endianness error on s390x (Andrii)
> > > > >   - zero-initialize kit->bits_copy and zero out nr_bits (Andrii)
> > > > > - v2->v3:
> > > > >   - Optimization for u64/u32 mask (Andrii)
> > > > > - v1->v2:
> > > > >   - Simplify the CPU number verification code to avoid the failure on s390x
> > > > >     (Eduard)
> > > > > - bpf: Add bpf_iter_cpumask
> > > > >   https://lwn.net/Articles/961104/
> > > > > - bpf: Add new bpf helper bpf_for_each_cpu
> > > > >   https://lwn.net/Articles/939939/
> > > > >
> > > > > Yafang Shao (2):
> > > > >   bpf: Add bits iterator
> > > > >   selftests/bpf: Add selftest for bits iter
> > > > >
> > > > >  kernel/bpf/helpers.c                          | 120 +++++++++++++++++
> > > > >  .../selftests/bpf/prog_tests/verifier.c       |   2 +
> > > > >  .../selftests/bpf/progs/verifier_bits_iter.c  | 127 ++++++++++++++++++
> > > > >  3 files changed, 249 insertions(+)
> > > > >  create mode 100644 tools/testing/selftests/bpf/progs/verifier_bits_iter.c
> > > > >
> > > > > --
> > > > > 2.39.1
> > > > >
> > > >
> > > > It appears that the test case failed on s390x when the data is
> > > > a u32 value because we need to set the higher 32 bits.
> > > > will analyze it.
> > > >
> > >
> > > Hey Yafang, did you get a chance to debug and fix the issue?
> > >
> >
> > Hi Andrii,
> >
> > Apologies for the delay; I recently returned from an extended holiday.
> >
> > The issue stems from the limitations of bpf_probe_read_kernel() on
> > s390 architecture. The attachment provides a straightforward example
> > to illustrate this issue. The observed results are as follows:
> >
> >     Error: #463/1 verifier_probe_read/probe read 4 bytes
> >     8897 run_subtest:PASS:obj_open_mem 0 nsec
> >     8898 run_subtest:PASS:unexpected_load_failure 0 nsec
> >     8899 do_prog_test_run:PASS:bpf_prog_test_run 0 nsec
> >     8900 run_subtest:FAIL:659 Unexpected retval: 2817064 != 512
> >
> >     Error: #463/2 verifier_probe_read/probe read 8 bytes
> >     8903 run_subtest:PASS:obj_open_mem 0 nsec
> >     8904 run_subtest:PASS:unexpected_load_failure 0 nsec
> >     8905 do_prog_test_run:PASS:bpf_prog_test_run 0 nsec
> >     8906 run_subtest:FAIL:659 Unexpected retval: 0 != 512
> >
> > More details can be found at:  https://github.com/kernel-patches/bpf/pull/6872
> >
> > Should we consider this behavior of bpf_probe_read_kernel() as
> > expected, or is it something that requires fixing?
> >
>
> I might be missing something, but there is nothing wrong with
> bpf_probe_read_kernel() behavior. In "read 4" case you are overwriting
> only upper 4 bytes of u64, so lower 4 bytes are garbage. In "read 8"
> you are reading (upper) 4 bytes of garbage from uninitialized
> data_dst.

The issue doesn't lie with the dst but rather with the src. Even after
initializing the destination, the operation still fails. You can find
more details in the following link:
https://github.com/kernel-patches/bpf/pull/6882. It appears that
bpf_probe_read_kernel() encounters difficulties when dealing with
non-long-aligned source addresses.

>
> So getting back to iter implementation. Make sure you are
> zero-initializing that u64 value you are reading into?
>

It has been zero-initialized:

+ kit->nr_bits = 0;
+ kit->bits_copy = 0;
Andrii Nakryiko April 26, 2024, 4:50 p.m. UTC | #8
On Thu, Apr 25, 2024 at 10:05 PM Yafang Shao <laoar.shao@gmail.com> wrote:
>
> On Fri, Apr 26, 2024 at 2:15 AM Andrii Nakryiko
> <andrii.nakryiko@gmail.com> wrote:
> >
> > On Wed, Apr 24, 2024 at 10:37 PM Yafang Shao <laoar.shao@gmail.com> wrote:
> > >
> > > On Thu, Apr 25, 2024 at 8:34 AM Andrii Nakryiko
> > > <andrii.nakryiko@gmail.com> wrote:
> > > >
> > > > On Thu, Apr 11, 2024 at 6:51 AM Yafang Shao <laoar.shao@gmail.com> wrote:
> > > > >
> > > > > On Thu, Apr 11, 2024 at 9:11 PM Yafang Shao <laoar.shao@gmail.com> wrote:
> > > > > >
> > > > > > Three new kfuncs, namely bpf_iter_bits_{new,next,destroy}, have been
> > > > > > added for the new bpf_iter_bits functionality. These kfuncs enable the
> > > > > > iteration of the bits from a given address and a given number of bits.
> > > > > >
> > > > > > - bpf_iter_bits_new
> > > > > >   Initialize a new bits iterator for a given memory area. Due to the
> > > > > >   limitation of bpf memalloc, the max number of bits to be iterated
> > > > > >   over is (4096 * 8).
> > > > > > - bpf_iter_bits_next
> > > > > >   Get the next bit in a bpf_iter_bits
> > > > > > - bpf_iter_bits_destroy
> > > > > >   Destroy a bpf_iter_bits
> > > > > >
> > > > > > The bits iterator can be used in any context and on any address.
> > > > > >
> > > > > > Changes:
> > > > > > - v5->v6:
> > > > > >   - Add positive tests (Andrii)
> > > > > > - v4->v5:
> > > > > >   - Simplify test cases (Andrii)
> > > > > > - v3->v4:
> > > > > >   - Fix endianness error on s390x (Andrii)
> > > > > >   - zero-initialize kit->bits_copy and zero out nr_bits (Andrii)
> > > > > > - v2->v3:
> > > > > >   - Optimization for u64/u32 mask (Andrii)
> > > > > > - v1->v2:
> > > > > >   - Simplify the CPU number verification code to avoid the failure on s390x
> > > > > >     (Eduard)
> > > > > > - bpf: Add bpf_iter_cpumask
> > > > > >   https://lwn.net/Articles/961104/
> > > > > > - bpf: Add new bpf helper bpf_for_each_cpu
> > > > > >   https://lwn.net/Articles/939939/
> > > > > >
> > > > > > Yafang Shao (2):
> > > > > >   bpf: Add bits iterator
> > > > > >   selftests/bpf: Add selftest for bits iter
> > > > > >
> > > > > >  kernel/bpf/helpers.c                          | 120 +++++++++++++++++
> > > > > >  .../selftests/bpf/prog_tests/verifier.c       |   2 +
> > > > > >  .../selftests/bpf/progs/verifier_bits_iter.c  | 127 ++++++++++++++++++
> > > > > >  3 files changed, 249 insertions(+)
> > > > > >  create mode 100644 tools/testing/selftests/bpf/progs/verifier_bits_iter.c
> > > > > >
> > > > > > --
> > > > > > 2.39.1
> > > > > >
> > > > >
> > > > > It appears that the test case failed on s390x when the data is
> > > > > a u32 value because we need to set the higher 32 bits.
> > > > > will analyze it.
> > > > >
> > > >
> > > > Hey Yafang, did you get a chance to debug and fix the issue?
> > > >
> > >
> > > Hi Andrii,
> > >
> > > Apologies for the delay; I recently returned from an extended holiday.
> > >
> > > The issue stems from the limitations of bpf_probe_read_kernel() on
> > > s390 architecture. The attachment provides a straightforward example
> > > to illustrate this issue. The observed results are as follows:
> > >
> > >     Error: #463/1 verifier_probe_read/probe read 4 bytes
> > >     8897 run_subtest:PASS:obj_open_mem 0 nsec
> > >     8898 run_subtest:PASS:unexpected_load_failure 0 nsec
> > >     8899 do_prog_test_run:PASS:bpf_prog_test_run 0 nsec
> > >     8900 run_subtest:FAIL:659 Unexpected retval: 2817064 != 512
> > >
> > >     Error: #463/2 verifier_probe_read/probe read 8 bytes
> > >     8903 run_subtest:PASS:obj_open_mem 0 nsec
> > >     8904 run_subtest:PASS:unexpected_load_failure 0 nsec
> > >     8905 do_prog_test_run:PASS:bpf_prog_test_run 0 nsec
> > >     8906 run_subtest:FAIL:659 Unexpected retval: 0 != 512
> > >
> > > More details can be found at:  https://github.com/kernel-patches/bpf/pull/6872
> > >
> > > Should we consider this behavior of bpf_probe_read_kernel() as
> > > expected, or is it something that requires fixing?
> > >
> >
> > I might be missing something, but there is nothing wrong with
> > bpf_probe_read_kernel() behavior. In "read 4" case you are overwriting
> > only upper 4 bytes of u64, so lower 4 bytes are garbage. In "read 8"
> > you are reading (upper) 4 bytes of garbage from uninitialized
> > data_dst.
>
> The issue doesn't lie with the dst but rather with the src. Even after
> initializing the destination, the operation still fails. You can find

Are you sure the operation "fails"? If it would fail, you'd get a
negative error code, but you are getting zero. Which actually makes
sense.

I think you are just getting confused by big endianness of s390x, and
there is nothing wrong with bpf_probe_read_kernel().

In both of your tests (I pasted your code below, it would be better if
you did it in your initial emails) you end up with 0x200 in *upper* 32
bits (on big endian) and lower bits are zeros. And __retval thing is
32-bit (despite BPF program returning long), so this return value is
truncated to *lower* 32-bits, which are, expectedly, zeroes.

So I think everything works as expected, but your tests (at least)
don't handle the big-endian arch well.

__description("probe read 4 bytes")
__success __retval(0x200)
long probe_read_4(void)
{
    int data = 0x200;
    long data_dst = 0;
    int err;

    err = bpf_probe_read_kernel(&data_dst, 4, &data);
    if (err)
        return err;

    return data_dst;
}

SEC("syscall")
__description("probe read 8 bytes")
__success __retval(0x200)
long probe_read_8(void)
{
    int data = 0x200;
    long data_dst = 0;
    int err;

    err = bpf_probe_read_kernel(&data_dst, 8, &data);
    if (err)
        return err;

    return data_dst;

}

> more details in the following link:
> https://github.com/kernel-patches/bpf/pull/6882. It appears that
> bpf_probe_read_kernel() encounters difficulties when dealing with
> non-long-aligned source addresses.
>
> >
> > So getting back to iter implementation. Make sure you are
> > zero-initializing that u64 value you are reading into?
> >
>
> It has been zero-initialized:
>
> + kit->nr_bits = 0;
> + kit->bits_copy = 0;
>

ok, then the problem is somewhere else, but it doesn't seem to be in
bpf_probe_read_kernel(). I'm forgetting what was the original test
failure for your patch set, but please double check again, taking into
account the big endianness of s390x.

> --
> Regards
> Yafang
Yafang Shao April 28, 2024, 1:47 p.m. UTC | #9
On Sat, Apr 27, 2024 at 12:51 AM Andrii Nakryiko
<andrii.nakryiko@gmail.com> wrote:
>
> On Thu, Apr 25, 2024 at 10:05 PM Yafang Shao <laoar.shao@gmail.com> wrote:
> >
> > On Fri, Apr 26, 2024 at 2:15 AM Andrii Nakryiko
> > <andrii.nakryiko@gmail.com> wrote:
> > >
> > > On Wed, Apr 24, 2024 at 10:37 PM Yafang Shao <laoar.shao@gmail.com> wrote:
> > > >
> > > > On Thu, Apr 25, 2024 at 8:34 AM Andrii Nakryiko
> > > > <andrii.nakryiko@gmail.com> wrote:
> > > > >
> > > > > On Thu, Apr 11, 2024 at 6:51 AM Yafang Shao <laoar.shao@gmail.com> wrote:
> > > > > >
> > > > > > On Thu, Apr 11, 2024 at 9:11 PM Yafang Shao <laoar.shao@gmail.com> wrote:
> > > > > > >
> > > > > > > Three new kfuncs, namely bpf_iter_bits_{new,next,destroy}, have been
> > > > > > > added for the new bpf_iter_bits functionality. These kfuncs enable the
> > > > > > > iteration of the bits from a given address and a given number of bits.
> > > > > > >
> > > > > > > - bpf_iter_bits_new
> > > > > > >   Initialize a new bits iterator for a given memory area. Due to the
> > > > > > >   limitation of bpf memalloc, the max number of bits to be iterated
> > > > > > >   over is (4096 * 8).
> > > > > > > - bpf_iter_bits_next
> > > > > > >   Get the next bit in a bpf_iter_bits
> > > > > > > - bpf_iter_bits_destroy
> > > > > > >   Destroy a bpf_iter_bits
> > > > > > >
> > > > > > > The bits iterator can be used in any context and on any address.
> > > > > > >
> > > > > > > Changes:
> > > > > > > - v5->v6:
> > > > > > >   - Add positive tests (Andrii)
> > > > > > > - v4->v5:
> > > > > > >   - Simplify test cases (Andrii)
> > > > > > > - v3->v4:
> > > > > > >   - Fix endianness error on s390x (Andrii)
> > > > > > >   - zero-initialize kit->bits_copy and zero out nr_bits (Andrii)
> > > > > > > - v2->v3:
> > > > > > >   - Optimization for u64/u32 mask (Andrii)
> > > > > > > - v1->v2:
> > > > > > >   - Simplify the CPU number verification code to avoid the failure on s390x
> > > > > > >     (Eduard)
> > > > > > > - bpf: Add bpf_iter_cpumask
> > > > > > >   https://lwn.net/Articles/961104/
> > > > > > > - bpf: Add new bpf helper bpf_for_each_cpu
> > > > > > >   https://lwn.net/Articles/939939/
> > > > > > >
> > > > > > > Yafang Shao (2):
> > > > > > >   bpf: Add bits iterator
> > > > > > >   selftests/bpf: Add selftest for bits iter
> > > > > > >
> > > > > > >  kernel/bpf/helpers.c                          | 120 +++++++++++++++++
> > > > > > >  .../selftests/bpf/prog_tests/verifier.c       |   2 +
> > > > > > >  .../selftests/bpf/progs/verifier_bits_iter.c  | 127 ++++++++++++++++++
> > > > > > >  3 files changed, 249 insertions(+)
> > > > > > >  create mode 100644 tools/testing/selftests/bpf/progs/verifier_bits_iter.c
> > > > > > >
> > > > > > > --
> > > > > > > 2.39.1
> > > > > > >
> > > > > >
> > > > > > It appears that the test case failed on s390x when the data is
> > > > > > a u32 value because we need to set the higher 32 bits.
> > > > > > will analyze it.
> > > > > >
> > > > >
> > > > > Hey Yafang, did you get a chance to debug and fix the issue?
> > > > >
> > > >
> > > > Hi Andrii,
> > > >
> > > > Apologies for the delay; I recently returned from an extended holiday.
> > > >
> > > > The issue stems from the limitations of bpf_probe_read_kernel() on
> > > > s390 architecture. The attachment provides a straightforward example
> > > > to illustrate this issue. The observed results are as follows:
> > > >
> > > >     Error: #463/1 verifier_probe_read/probe read 4 bytes
> > > >     8897 run_subtest:PASS:obj_open_mem 0 nsec
> > > >     8898 run_subtest:PASS:unexpected_load_failure 0 nsec
> > > >     8899 do_prog_test_run:PASS:bpf_prog_test_run 0 nsec
> > > >     8900 run_subtest:FAIL:659 Unexpected retval: 2817064 != 512
> > > >
> > > >     Error: #463/2 verifier_probe_read/probe read 8 bytes
> > > >     8903 run_subtest:PASS:obj_open_mem 0 nsec
> > > >     8904 run_subtest:PASS:unexpected_load_failure 0 nsec
> > > >     8905 do_prog_test_run:PASS:bpf_prog_test_run 0 nsec
> > > >     8906 run_subtest:FAIL:659 Unexpected retval: 0 != 512
> > > >
> > > > More details can be found at:  https://github.com/kernel-patches/bpf/pull/6872
> > > >
> > > > Should we consider this behavior of bpf_probe_read_kernel() as
> > > > expected, or is it something that requires fixing?
> > > >
> > >
> > > I might be missing something, but there is nothing wrong with
> > > bpf_probe_read_kernel() behavior. In "read 4" case you are overwriting
> > > only upper 4 bytes of u64, so lower 4 bytes are garbage. In "read 8"
> > > you are reading (upper) 4 bytes of garbage from uninitialized
> > > data_dst.
> >
> > The issue doesn't lie with the dst but rather with the src. Even after
> > initializing the destination, the operation still fails. You can find
>
> Are you sure the operation "fails"? If it would fail, you'd get a
> negative error code, but you are getting zero. Which actually makes
> sense.
>
> I think you are just getting confused by big endianness of s390x, and
> there is nothing wrong with bpf_probe_read_kernel().
>
> In both of your tests (I pasted your code below, it would be better if
> you did it in your initial emails) you end up with 0x200 in *upper* 32
> bits (on big endian) and lower bits are zeros. And __retval thing is
> 32-bit (despite BPF program returning long), so this return value is
> truncated to *lower* 32-bits, which are, expectedly, zeroes.

Thank you for clarifying. The presence of the 32-bit __retval led to
my misunderstanding :(

>
> So I think everything works as expected, but your tests (at least)
> don't handle the big-endian arch well.

The issue arises when the dst and src have different sizes, causing
bpf_probe_read_kernel_common() to handle them poorly on big-endian
machines. To address this, we need to calculate the offset for
copying, as demonstrated by the following

   bpf_probe_read_kernel_common(&kit->bits_copy + offset, size,
unsafe_ptr__ign);

One might wonder why this calculation is not incorporated directly
into the implementation of bpf_probe_read_kernel_common() ?

>
> __description("probe read 4 bytes")
> __success __retval(0x200)
> long probe_read_4(void)
> {
>     int data = 0x200;
>     long data_dst = 0;
>     int err;
>
>     err = bpf_probe_read_kernel(&data_dst, 4, &data);
>     if (err)
>         return err;
>
>     return data_dst;
> }
>
> SEC("syscall")
> __description("probe read 8 bytes")
> __success __retval(0x200)
> long probe_read_8(void)
> {
>     int data = 0x200;
>     long data_dst = 0;
>     int err;
>
>     err = bpf_probe_read_kernel(&data_dst, 8, &data);
>     if (err)
>         return err;
>
>     return data_dst;
>
> }
>
> > more details in the following link:
> > https://github.com/kernel-patches/bpf/pull/6882. It appears that
> > bpf_probe_read_kernel() encounters difficulties when dealing with
> > non-long-aligned source addresses.
> >
> > >
> > > So getting back to iter implementation. Make sure you are
> > > zero-initializing that u64 value you are reading into?
> > >
> >
> > It has been zero-initialized:
> >
> > + kit->nr_bits = 0;
> > + kit->bits_copy = 0;
> >
>
> ok, then the problem is somewhere else, but it doesn't seem to be in
> bpf_probe_read_kernel(). I'm forgetting what was the original test
> failure for your patch set, but please double check again, taking into
> account the big endianness of s390x.
>

If we aim to make it compatible with s390, we need to introduce some
constraints regarding the bits iteration.

1. We must replace nr_bits with size:

  bpf_iter_bits_new(struct bpf_iter_bits *it, const void
*unsafe_ptr__ign, u32 size)

2. The size must adhere to alignment requirements:

        if (size <= sizeof(u64)) {
                int offset = IS_ENABLED(CONFIG_S390) ? sizeof(u64) - size : 0;

                switch (size) {
                case 1:
                case 2:
                case 4:
                case 8:
                        break;
                default:
                        return -EINVAL;
                }

                err = bpf_probe_read_kernel_common(((char
*)&kit->bits_copy) + offset, size, unsafe_ptr__ign);
                if (err)
                        return -EFAULT;

                kit->size = size;
                kit->bit = -1;
                return 0;
        }

        /* Not long-aligned */
        if (size & (sizeof(unsigned long) - 1))
                return -EINVAL;

        ....

Does this meet your expectations?

--
Regards



Yafang
Andrii Nakryiko April 29, 2024, 4:27 p.m. UTC | #10
On Sun, Apr 28, 2024 at 6:47 AM Yafang Shao <laoar.shao@gmail.com> wrote:
>
> On Sat, Apr 27, 2024 at 12:51 AM Andrii Nakryiko
> <andrii.nakryiko@gmail.com> wrote:
> >
> > On Thu, Apr 25, 2024 at 10:05 PM Yafang Shao <laoar.shao@gmail.com> wrote:
> > >
> > > On Fri, Apr 26, 2024 at 2:15 AM Andrii Nakryiko
> > > <andrii.nakryiko@gmail.com> wrote:
> > > >
> > > > On Wed, Apr 24, 2024 at 10:37 PM Yafang Shao <laoar.shao@gmail.com> wrote:
> > > > >
> > > > > On Thu, Apr 25, 2024 at 8:34 AM Andrii Nakryiko
> > > > > <andrii.nakryiko@gmail.com> wrote:
> > > > > >
> > > > > > On Thu, Apr 11, 2024 at 6:51 AM Yafang Shao <laoar.shao@gmail.com> wrote:
> > > > > > >
> > > > > > > On Thu, Apr 11, 2024 at 9:11 PM Yafang Shao <laoar.shao@gmail.com> wrote:
> > > > > > > >
> > > > > > > > Three new kfuncs, namely bpf_iter_bits_{new,next,destroy}, have been
> > > > > > > > added for the new bpf_iter_bits functionality. These kfuncs enable the
> > > > > > > > iteration of the bits from a given address and a given number of bits.
> > > > > > > >
> > > > > > > > - bpf_iter_bits_new
> > > > > > > >   Initialize a new bits iterator for a given memory area. Due to the
> > > > > > > >   limitation of bpf memalloc, the max number of bits to be iterated
> > > > > > > >   over is (4096 * 8).
> > > > > > > > - bpf_iter_bits_next
> > > > > > > >   Get the next bit in a bpf_iter_bits
> > > > > > > > - bpf_iter_bits_destroy
> > > > > > > >   Destroy a bpf_iter_bits
> > > > > > > >
> > > > > > > > The bits iterator can be used in any context and on any address.
> > > > > > > >
> > > > > > > > Changes:
> > > > > > > > - v5->v6:
> > > > > > > >   - Add positive tests (Andrii)
> > > > > > > > - v4->v5:
> > > > > > > >   - Simplify test cases (Andrii)
> > > > > > > > - v3->v4:
> > > > > > > >   - Fix endianness error on s390x (Andrii)
> > > > > > > >   - zero-initialize kit->bits_copy and zero out nr_bits (Andrii)
> > > > > > > > - v2->v3:
> > > > > > > >   - Optimization for u64/u32 mask (Andrii)
> > > > > > > > - v1->v2:
> > > > > > > >   - Simplify the CPU number verification code to avoid the failure on s390x
> > > > > > > >     (Eduard)
> > > > > > > > - bpf: Add bpf_iter_cpumask
> > > > > > > >   https://lwn.net/Articles/961104/
> > > > > > > > - bpf: Add new bpf helper bpf_for_each_cpu
> > > > > > > >   https://lwn.net/Articles/939939/
> > > > > > > >
> > > > > > > > Yafang Shao (2):
> > > > > > > >   bpf: Add bits iterator
> > > > > > > >   selftests/bpf: Add selftest for bits iter
> > > > > > > >
> > > > > > > >  kernel/bpf/helpers.c                          | 120 +++++++++++++++++
> > > > > > > >  .../selftests/bpf/prog_tests/verifier.c       |   2 +
> > > > > > > >  .../selftests/bpf/progs/verifier_bits_iter.c  | 127 ++++++++++++++++++
> > > > > > > >  3 files changed, 249 insertions(+)
> > > > > > > >  create mode 100644 tools/testing/selftests/bpf/progs/verifier_bits_iter.c
> > > > > > > >
> > > > > > > > --
> > > > > > > > 2.39.1
> > > > > > > >
> > > > > > >
> > > > > > > It appears that the test case failed on s390x when the data is
> > > > > > > a u32 value because we need to set the higher 32 bits.
> > > > > > > will analyze it.
> > > > > > >
> > > > > >
> > > > > > Hey Yafang, did you get a chance to debug and fix the issue?
> > > > > >
> > > > >
> > > > > Hi Andrii,
> > > > >
> > > > > Apologies for the delay; I recently returned from an extended holiday.
> > > > >
> > > > > The issue stems from the limitations of bpf_probe_read_kernel() on
> > > > > s390 architecture. The attachment provides a straightforward example
> > > > > to illustrate this issue. The observed results are as follows:
> > > > >
> > > > >     Error: #463/1 verifier_probe_read/probe read 4 bytes
> > > > >     8897 run_subtest:PASS:obj_open_mem 0 nsec
> > > > >     8898 run_subtest:PASS:unexpected_load_failure 0 nsec
> > > > >     8899 do_prog_test_run:PASS:bpf_prog_test_run 0 nsec
> > > > >     8900 run_subtest:FAIL:659 Unexpected retval: 2817064 != 512
> > > > >
> > > > >     Error: #463/2 verifier_probe_read/probe read 8 bytes
> > > > >     8903 run_subtest:PASS:obj_open_mem 0 nsec
> > > > >     8904 run_subtest:PASS:unexpected_load_failure 0 nsec
> > > > >     8905 do_prog_test_run:PASS:bpf_prog_test_run 0 nsec
> > > > >     8906 run_subtest:FAIL:659 Unexpected retval: 0 != 512
> > > > >
> > > > > More details can be found at:  https://github.com/kernel-patches/bpf/pull/6872
> > > > >
> > > > > Should we consider this behavior of bpf_probe_read_kernel() as
> > > > > expected, or is it something that requires fixing?
> > > > >
> > > >
> > > > I might be missing something, but there is nothing wrong with
> > > > bpf_probe_read_kernel() behavior. In "read 4" case you are overwriting
> > > > only upper 4 bytes of u64, so lower 4 bytes are garbage. In "read 8"
> > > > you are reading (upper) 4 bytes of garbage from uninitialized
> > > > data_dst.
> > >
> > > The issue doesn't lie with the dst but rather with the src. Even after
> > > initializing the destination, the operation still fails. You can find
> >
> > Are you sure the operation "fails"? If it would fail, you'd get a
> > negative error code, but you are getting zero. Which actually makes
> > sense.
> >
> > I think you are just getting confused by big endianness of s390x, and
> > there is nothing wrong with bpf_probe_read_kernel().
> >
> > In both of your tests (I pasted your code below, it would be better if
> > you did it in your initial emails) you end up with 0x200 in *upper* 32
> > bits (on big endian) and lower bits are zeros. And __retval thing is
> > 32-bit (despite BPF program returning long), so this return value is
> > truncated to *lower* 32-bits, which are, expectedly, zeroes.
>
> Thank you for clarifying. The presence of the 32-bit __retval led to
> my misunderstanding :(
>
> >
> > So I think everything works as expected, but your tests (at least)
> > don't handle the big-endian arch well.
>
> The issue arises when the dst and src have different sizes, causing
> bpf_probe_read_kernel_common() to handle them poorly on big-endian
> machines. To address this, we need to calculate the offset for
> copying, as demonstrated by the following
>
>    bpf_probe_read_kernel_common(&kit->bits_copy + offset, size,
> unsafe_ptr__ign);
>
> One might wonder why this calculation is not incorporated directly
> into the implementation of bpf_probe_read_kernel_common() ?

Let's stop talking about
bpf_probe_read_kernel/bpf_probe_read_kernel_common doing anything
wrong or not handling anything wrong. It's not, it's correct. Your
code is *using* it wrong, and that's what we'll have to fix. The
contract of bpf_probe_read_kernel is in terms of memory addresses of
source/destination *bytes* and the *common* size of both source and
destination. bpf_probe_read_kernel() cannot know that you are passing
a pointer to int or long or whatever, it's just a void * pointer. So
if you are writing bytes over long, you need to take care of adjusting
offsets to accommodate big-endian.

It's a distraction to talk about bpf_probe_read_kernel() here (but
it's useful to understand its contract).

>
> >
> > __description("probe read 4 bytes")
> > __success __retval(0x200)
> > long probe_read_4(void)
> > {
> >     int data = 0x200;
> >     long data_dst = 0;
> >     int err;
> >
> >     err = bpf_probe_read_kernel(&data_dst, 4, &data);
> >     if (err)
> >         return err;
> >
> >     return data_dst;
> > }
> >
> > SEC("syscall")
> > __description("probe read 8 bytes")
> > __success __retval(0x200)
> > long probe_read_8(void)
> > {
> >     int data = 0x200;
> >     long data_dst = 0;
> >     int err;
> >
> >     err = bpf_probe_read_kernel(&data_dst, 8, &data);
> >     if (err)
> >         return err;
> >
> >     return data_dst;
> >
> > }
> >
> > > more details in the following link:
> > > https://github.com/kernel-patches/bpf/pull/6882. It appears that
> > > bpf_probe_read_kernel() encounters difficulties when dealing with
> > > non-long-aligned source addresses.
> > >
> > > >
> > > > So getting back to iter implementation. Make sure you are
> > > > zero-initializing that u64 value you are reading into?
> > > >
> > >
> > > It has been zero-initialized:
> > >
> > > + kit->nr_bits = 0;
> > > + kit->bits_copy = 0;
> > >
> >
> > ok, then the problem is somewhere else, but it doesn't seem to be in
> > bpf_probe_read_kernel(). I'm forgetting what was the original test
> > failure for your patch set, but please double check again, taking into
> > account the big endianness of s390x.
> >
>
> If we aim to make it compatible with s390, we need to introduce some
> constraints regarding the bits iteration.
>
> 1. We must replace nr_bits with size:
>
>   bpf_iter_bits_new(struct bpf_iter_bits *it, const void
> *unsafe_ptr__ign, u32 size)
>
> 2. The size must adhere to alignment requirements:
>
>         if (size <= sizeof(u64)) {
>                 int offset = IS_ENABLED(CONFIG_S390) ? sizeof(u64) - size : 0;
>
>                 switch (size) {
>                 case 1:
>                 case 2:
>                 case 4:
>                 case 8:
>                         break;
>                 default:
>                         return -EINVAL;
>                 }
>
>                 err = bpf_probe_read_kernel_common(((char
> *)&kit->bits_copy) + offset, size, unsafe_ptr__ign);
>                 if (err)
>                         return -EFAULT;
>
>                 kit->size = size;
>                 kit->bit = -1;
>                 return 0;
>         }
>
>         /* Not long-aligned */
>         if (size & (sizeof(unsigned long) - 1))
>                 return -EINVAL;
>
>         ....
>
> Does this meet your expectations?
>

I don't think you need to add any restrictions or change anything to
be byte-sized. You just need to calculate byte offset and do a bit
shift to place N bits from the mask into upper N bits of long on
big-endian. You might need to do some masking even for little endian
as well.

Which is why I suggested investing in simple but thorough tests. Write
down a few bit mask patterns of variable length (not just
multiple-of-8 sizes) and think about the sequence of bits that should
be returned. And then codify that.

Then check that both little- and big-endian arches work correctly.

This is all a bit tricky because kernel's find_next_bit() makes the
assumption that bit masks are long-based, while this bits iterator
protocol is based on bit sizes, which are not necessarily multiples of
8 bits. So after you copy memory as bytes, you might need to clear
(and shift, for big endian) some bits. Use simple but non-symmetrical
bit masks to test everything.

> --
> Regards
>
>
>
> Yafang
Yafang Shao May 1, 2024, 2:12 a.m. UTC | #11
On Tue, Apr 30, 2024 at 12:28 AM Andrii Nakryiko
<andrii.nakryiko@gmail.com> wrote:
>
> On Sun, Apr 28, 2024 at 6:47 AM Yafang Shao <laoar.shao@gmail.com> wrote:
> >
> > On Sat, Apr 27, 2024 at 12:51 AM Andrii Nakryiko
> > <andrii.nakryiko@gmail.com> wrote:
> > >
> > > On Thu, Apr 25, 2024 at 10:05 PM Yafang Shao <laoar.shao@gmail.com> wrote:
> > > >
> > > > On Fri, Apr 26, 2024 at 2:15 AM Andrii Nakryiko
> > > > <andrii.nakryiko@gmail.com> wrote:
> > > > >
> > > > > On Wed, Apr 24, 2024 at 10:37 PM Yafang Shao <laoar.shao@gmail.com> wrote:
> > > > > >
> > > > > > On Thu, Apr 25, 2024 at 8:34 AM Andrii Nakryiko
> > > > > > <andrii.nakryiko@gmail.com> wrote:
> > > > > > >
> > > > > > > On Thu, Apr 11, 2024 at 6:51 AM Yafang Shao <laoar.shao@gmail.com> wrote:
> > > > > > > >
> > > > > > > > On Thu, Apr 11, 2024 at 9:11 PM Yafang Shao <laoar.shao@gmail.com> wrote:
> > > > > > > > >
> > > > > > > > > Three new kfuncs, namely bpf_iter_bits_{new,next,destroy}, have been
> > > > > > > > > added for the new bpf_iter_bits functionality. These kfuncs enable the
> > > > > > > > > iteration of the bits from a given address and a given number of bits.
> > > > > > > > >
> > > > > > > > > - bpf_iter_bits_new
> > > > > > > > >   Initialize a new bits iterator for a given memory area. Due to the
> > > > > > > > >   limitation of bpf memalloc, the max number of bits to be iterated
> > > > > > > > >   over is (4096 * 8).
> > > > > > > > > - bpf_iter_bits_next
> > > > > > > > >   Get the next bit in a bpf_iter_bits
> > > > > > > > > - bpf_iter_bits_destroy
> > > > > > > > >   Destroy a bpf_iter_bits
> > > > > > > > >
> > > > > > > > > The bits iterator can be used in any context and on any address.
> > > > > > > > >
> > > > > > > > > Changes:
> > > > > > > > > - v5->v6:
> > > > > > > > >   - Add positive tests (Andrii)
> > > > > > > > > - v4->v5:
> > > > > > > > >   - Simplify test cases (Andrii)
> > > > > > > > > - v3->v4:
> > > > > > > > >   - Fix endianness error on s390x (Andrii)
> > > > > > > > >   - zero-initialize kit->bits_copy and zero out nr_bits (Andrii)
> > > > > > > > > - v2->v3:
> > > > > > > > >   - Optimization for u64/u32 mask (Andrii)
> > > > > > > > > - v1->v2:
> > > > > > > > >   - Simplify the CPU number verification code to avoid the failure on s390x
> > > > > > > > >     (Eduard)
> > > > > > > > > - bpf: Add bpf_iter_cpumask
> > > > > > > > >   https://lwn.net/Articles/961104/
> > > > > > > > > - bpf: Add new bpf helper bpf_for_each_cpu
> > > > > > > > >   https://lwn.net/Articles/939939/
> > > > > > > > >
> > > > > > > > > Yafang Shao (2):
> > > > > > > > >   bpf: Add bits iterator
> > > > > > > > >   selftests/bpf: Add selftest for bits iter
> > > > > > > > >
> > > > > > > > >  kernel/bpf/helpers.c                          | 120 +++++++++++++++++
> > > > > > > > >  .../selftests/bpf/prog_tests/verifier.c       |   2 +
> > > > > > > > >  .../selftests/bpf/progs/verifier_bits_iter.c  | 127 ++++++++++++++++++
> > > > > > > > >  3 files changed, 249 insertions(+)
> > > > > > > > >  create mode 100644 tools/testing/selftests/bpf/progs/verifier_bits_iter.c
> > > > > > > > >
> > > > > > > > > --
> > > > > > > > > 2.39.1
> > > > > > > > >
> > > > > > > >
> > > > > > > > It appears that the test case failed on s390x when the data is
> > > > > > > > a u32 value because we need to set the higher 32 bits.
> > > > > > > > will analyze it.
> > > > > > > >
> > > > > > >
> > > > > > > Hey Yafang, did you get a chance to debug and fix the issue?
> > > > > > >
> > > > > >
> > > > > > Hi Andrii,
> > > > > >
> > > > > > Apologies for the delay; I recently returned from an extended holiday.
> > > > > >
> > > > > > The issue stems from the limitations of bpf_probe_read_kernel() on
> > > > > > s390 architecture. The attachment provides a straightforward example
> > > > > > to illustrate this issue. The observed results are as follows:
> > > > > >
> > > > > >     Error: #463/1 verifier_probe_read/probe read 4 bytes
> > > > > >     8897 run_subtest:PASS:obj_open_mem 0 nsec
> > > > > >     8898 run_subtest:PASS:unexpected_load_failure 0 nsec
> > > > > >     8899 do_prog_test_run:PASS:bpf_prog_test_run 0 nsec
> > > > > >     8900 run_subtest:FAIL:659 Unexpected retval: 2817064 != 512
> > > > > >
> > > > > >     Error: #463/2 verifier_probe_read/probe read 8 bytes
> > > > > >     8903 run_subtest:PASS:obj_open_mem 0 nsec
> > > > > >     8904 run_subtest:PASS:unexpected_load_failure 0 nsec
> > > > > >     8905 do_prog_test_run:PASS:bpf_prog_test_run 0 nsec
> > > > > >     8906 run_subtest:FAIL:659 Unexpected retval: 0 != 512
> > > > > >
> > > > > > More details can be found at:  https://github.com/kernel-patches/bpf/pull/6872
> > > > > >
> > > > > > Should we consider this behavior of bpf_probe_read_kernel() as
> > > > > > expected, or is it something that requires fixing?
> > > > > >
> > > > >
> > > > > I might be missing something, but there is nothing wrong with
> > > > > bpf_probe_read_kernel() behavior. In "read 4" case you are overwriting
> > > > > only upper 4 bytes of u64, so lower 4 bytes are garbage. In "read 8"
> > > > > you are reading (upper) 4 bytes of garbage from uninitialized
> > > > > data_dst.
> > > >
> > > > The issue doesn't lie with the dst but rather with the src. Even after
> > > > initializing the destination, the operation still fails. You can find
> > >
> > > Are you sure the operation "fails"? If it would fail, you'd get a
> > > negative error code, but you are getting zero. Which actually makes
> > > sense.
> > >
> > > I think you are just getting confused by big endianness of s390x, and
> > > there is nothing wrong with bpf_probe_read_kernel().
> > >
> > > In both of your tests (I pasted your code below, it would be better if
> > > you did it in your initial emails) you end up with 0x200 in *upper* 32
> > > bits (on big endian) and lower bits are zeros. And __retval thing is
> > > 32-bit (despite BPF program returning long), so this return value is
> > > truncated to *lower* 32-bits, which are, expectedly, zeroes.
> >
> > Thank you for clarifying. The presence of the 32-bit __retval led to
> > my misunderstanding :(
> >
> > >
> > > So I think everything works as expected, but your tests (at least)
> > > don't handle the big-endian arch well.
> >
> > The issue arises when the dst and src have different sizes, causing
> > bpf_probe_read_kernel_common() to handle them poorly on big-endian
> > machines. To address this, we need to calculate the offset for
> > copying, as demonstrated by the following
> >
> >    bpf_probe_read_kernel_common(&kit->bits_copy + offset, size,
> > unsafe_ptr__ign);
> >
> > One might wonder why this calculation is not incorporated directly
> > into the implementation of bpf_probe_read_kernel_common() ?
>
> Let's stop talking about
> bpf_probe_read_kernel/bpf_probe_read_kernel_common doing anything
> wrong or not handling anything wrong. It's not, it's correct. Your
> code is *using* it wrong, and that's what we'll have to fix. The
> contract of bpf_probe_read_kernel is in terms of memory addresses of
> source/destination *bytes* and the *common* size of both source and
> destination. bpf_probe_read_kernel() cannot know that you are passing
> a pointer to int or long or whatever, it's just a void * pointer. So
> if you are writing bytes over long, you need to take care of adjusting
> offsets to accommodate big-endian.
>
> It's a distraction to talk about bpf_probe_read_kernel() here (but
> it's useful to understand its contract).
>
> >
> > >
> > > __description("probe read 4 bytes")
> > > __success __retval(0x200)
> > > long probe_read_4(void)
> > > {
> > >     int data = 0x200;
> > >     long data_dst = 0;
> > >     int err;
> > >
> > >     err = bpf_probe_read_kernel(&data_dst, 4, &data);
> > >     if (err)
> > >         return err;
> > >
> > >     return data_dst;
> > > }
> > >
> > > SEC("syscall")
> > > __description("probe read 8 bytes")
> > > __success __retval(0x200)
> > > long probe_read_8(void)
> > > {
> > >     int data = 0x200;
> > >     long data_dst = 0;
> > >     int err;
> > >
> > >     err = bpf_probe_read_kernel(&data_dst, 8, &data);
> > >     if (err)
> > >         return err;
> > >
> > >     return data_dst;
> > >
> > > }
> > >
> > > > more details in the following link:
> > > > https://github.com/kernel-patches/bpf/pull/6882. It appears that
> > > > bpf_probe_read_kernel() encounters difficulties when dealing with
> > > > non-long-aligned source addresses.
> > > >
> > > > >
> > > > > So getting back to iter implementation. Make sure you are
> > > > > zero-initializing that u64 value you are reading into?
> > > > >
> > > >
> > > > It has been zero-initialized:
> > > >
> > > > + kit->nr_bits = 0;
> > > > + kit->bits_copy = 0;
> > > >
> > >
> > > ok, then the problem is somewhere else, but it doesn't seem to be in
> > > bpf_probe_read_kernel(). I'm forgetting what was the original test
> > > failure for your patch set, but please double check again, taking into
> > > account the big endianness of s390x.
> > >
> >
> > If we aim to make it compatible with s390, we need to introduce some
> > constraints regarding the bits iteration.
> >
> > 1. We must replace nr_bits with size:
> >
> >   bpf_iter_bits_new(struct bpf_iter_bits *it, const void
> > *unsafe_ptr__ign, u32 size)
> >
> > 2. The size must adhere to alignment requirements:
> >
> >         if (size <= sizeof(u64)) {
> >                 int offset = IS_ENABLED(CONFIG_S390) ? sizeof(u64) - size : 0;
> >
> >                 switch (size) {
> >                 case 1:
> >                 case 2:
> >                 case 4:
> >                 case 8:
> >                         break;
> >                 default:
> >                         return -EINVAL;
> >                 }
> >
> >                 err = bpf_probe_read_kernel_common(((char
> > *)&kit->bits_copy) + offset, size, unsafe_ptr__ign);
> >                 if (err)
> >                         return -EFAULT;
> >
> >                 kit->size = size;
> >                 kit->bit = -1;
> >                 return 0;
> >         }
> >
> >         /* Not long-aligned */
> >         if (size & (sizeof(unsigned long) - 1))
> >                 return -EINVAL;
> >
> >         ....
> >
> > Does this meet your expectations?
> >
>
> I don't think you need to add any restrictions or change anything to
> be byte-sized. You just need to calculate byte offset and do a bit
> shift to place N bits from the mask into upper N bits of long on
> big-endian. You might need to do some masking even for little endian
> as well.
>
> Which is why I suggested investing in simple but thorough tests. Write
> down a few bit mask patterns of variable length (not just
> multiple-of-8 sizes) and think about the sequence of bits that should
> be returned. And then codify that.
>
> Then check that both little- and big-endian arches work correctly.
>
> This is all a bit tricky because kernel's find_next_bit() makes the
> assumption that bit masks are long-based, while this bits iterator
> protocol is based on bit sizes, which are not necessarily multiples of
> 8 bits. So after you copy memory as bytes, you might need to clear
> (and shift, for big endian) some bits. Use simple but non-symmetrical
> bit masks to test everything.
>

The reason for using 'size' instead of 'nr_bits' lies in the nature of
the pointer 'unsafe_ptr__ign', which is of type void *. Due to this
ambiguity in the type, the 'size' parameter serves to indicate the
size of the data being accessed. For instance, if the type is u32,
then the 'size' parameter corresponds to sizeof(u32)

    u32 src = 0x100;
    bpf_for_each(bits, bit, &src, sizeof(u32));

This approach shields the user of the bits iterator from concerns
about endianness, simplifying usage.

Conversely, if we were to use 'nr_bits', the user would need to
account for endianness themselves. In other words, the user would be
responsible for calculating the offset of 'src'.For instance, on a
big-endian machine, if the 'src' is of type u64 and the user only want
to iterate over 32 bits, the code would appear as follows:

    u64 src = 0x100;
    bpf_for_each(bits, bit, ((void *)&src) + (sizeof(u64) - (32 + 7) / 8), 32);

It may indeed be less user-friendly. However, I can make the
adjustment as you suggested, given that it aligns with the pattern
observed in bpf_probe_read_kernel_common().

--
Regards
Yafang
Andrii Nakryiko May 1, 2024, 4:14 a.m. UTC | #12
On Tue, Apr 30, 2024 at 7:12 PM Yafang Shao <laoar.shao@gmail.com> wrote:
>
> On Tue, Apr 30, 2024 at 12:28 AM Andrii Nakryiko
> <andrii.nakryiko@gmail.com> wrote:
> >
> > On Sun, Apr 28, 2024 at 6:47 AM Yafang Shao <laoar.shao@gmail.com> wrote:
> > >
> > > On Sat, Apr 27, 2024 at 12:51 AM Andrii Nakryiko
> > > <andrii.nakryiko@gmail.com> wrote:
> > > >
> > > > On Thu, Apr 25, 2024 at 10:05 PM Yafang Shao <laoar.shao@gmail.com> wrote:
> > > > >
> > > > > On Fri, Apr 26, 2024 at 2:15 AM Andrii Nakryiko
> > > > > <andrii.nakryiko@gmail.com> wrote:
> > > > > >
> > > > > > On Wed, Apr 24, 2024 at 10:37 PM Yafang Shao <laoar.shao@gmail.com> wrote:
> > > > > > >
> > > > > > > On Thu, Apr 25, 2024 at 8:34 AM Andrii Nakryiko
> > > > > > > <andrii.nakryiko@gmail.com> wrote:
> > > > > > > >
> > > > > > > > On Thu, Apr 11, 2024 at 6:51 AM Yafang Shao <laoar.shao@gmail.com> wrote:
> > > > > > > > >
> > > > > > > > > On Thu, Apr 11, 2024 at 9:11 PM Yafang Shao <laoar.shao@gmail.com> wrote:
> > > > > > > > > >
> > > > > > > > > > Three new kfuncs, namely bpf_iter_bits_{new,next,destroy}, have been
> > > > > > > > > > added for the new bpf_iter_bits functionality. These kfuncs enable the
> > > > > > > > > > iteration of the bits from a given address and a given number of bits.
> > > > > > > > > >
> > > > > > > > > > - bpf_iter_bits_new
> > > > > > > > > >   Initialize a new bits iterator for a given memory area. Due to the
> > > > > > > > > >   limitation of bpf memalloc, the max number of bits to be iterated
> > > > > > > > > >   over is (4096 * 8).
> > > > > > > > > > - bpf_iter_bits_next
> > > > > > > > > >   Get the next bit in a bpf_iter_bits
> > > > > > > > > > - bpf_iter_bits_destroy
> > > > > > > > > >   Destroy a bpf_iter_bits
> > > > > > > > > >
> > > > > > > > > > The bits iterator can be used in any context and on any address.
> > > > > > > > > >
> > > > > > > > > > Changes:
> > > > > > > > > > - v5->v6:
> > > > > > > > > >   - Add positive tests (Andrii)
> > > > > > > > > > - v4->v5:
> > > > > > > > > >   - Simplify test cases (Andrii)
> > > > > > > > > > - v3->v4:
> > > > > > > > > >   - Fix endianness error on s390x (Andrii)
> > > > > > > > > >   - zero-initialize kit->bits_copy and zero out nr_bits (Andrii)
> > > > > > > > > > - v2->v3:
> > > > > > > > > >   - Optimization for u64/u32 mask (Andrii)
> > > > > > > > > > - v1->v2:
> > > > > > > > > >   - Simplify the CPU number verification code to avoid the failure on s390x
> > > > > > > > > >     (Eduard)
> > > > > > > > > > - bpf: Add bpf_iter_cpumask
> > > > > > > > > >   https://lwn.net/Articles/961104/
> > > > > > > > > > - bpf: Add new bpf helper bpf_for_each_cpu
> > > > > > > > > >   https://lwn.net/Articles/939939/
> > > > > > > > > >
> > > > > > > > > > Yafang Shao (2):
> > > > > > > > > >   bpf: Add bits iterator
> > > > > > > > > >   selftests/bpf: Add selftest for bits iter
> > > > > > > > > >
> > > > > > > > > >  kernel/bpf/helpers.c                          | 120 +++++++++++++++++
> > > > > > > > > >  .../selftests/bpf/prog_tests/verifier.c       |   2 +
> > > > > > > > > >  .../selftests/bpf/progs/verifier_bits_iter.c  | 127 ++++++++++++++++++
> > > > > > > > > >  3 files changed, 249 insertions(+)
> > > > > > > > > >  create mode 100644 tools/testing/selftests/bpf/progs/verifier_bits_iter.c
> > > > > > > > > >
> > > > > > > > > > --
> > > > > > > > > > 2.39.1
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > > It appears that the test case failed on s390x when the data is
> > > > > > > > > a u32 value because we need to set the higher 32 bits.
> > > > > > > > > will analyze it.
> > > > > > > > >
> > > > > > > >
> > > > > > > > Hey Yafang, did you get a chance to debug and fix the issue?
> > > > > > > >
> > > > > > >
> > > > > > > Hi Andrii,
> > > > > > >
> > > > > > > Apologies for the delay; I recently returned from an extended holiday.
> > > > > > >
> > > > > > > The issue stems from the limitations of bpf_probe_read_kernel() on
> > > > > > > s390 architecture. The attachment provides a straightforward example
> > > > > > > to illustrate this issue. The observed results are as follows:
> > > > > > >
> > > > > > >     Error: #463/1 verifier_probe_read/probe read 4 bytes
> > > > > > >     8897 run_subtest:PASS:obj_open_mem 0 nsec
> > > > > > >     8898 run_subtest:PASS:unexpected_load_failure 0 nsec
> > > > > > >     8899 do_prog_test_run:PASS:bpf_prog_test_run 0 nsec
> > > > > > >     8900 run_subtest:FAIL:659 Unexpected retval: 2817064 != 512
> > > > > > >
> > > > > > >     Error: #463/2 verifier_probe_read/probe read 8 bytes
> > > > > > >     8903 run_subtest:PASS:obj_open_mem 0 nsec
> > > > > > >     8904 run_subtest:PASS:unexpected_load_failure 0 nsec
> > > > > > >     8905 do_prog_test_run:PASS:bpf_prog_test_run 0 nsec
> > > > > > >     8906 run_subtest:FAIL:659 Unexpected retval: 0 != 512
> > > > > > >
> > > > > > > More details can be found at:  https://github.com/kernel-patches/bpf/pull/6872
> > > > > > >
> > > > > > > Should we consider this behavior of bpf_probe_read_kernel() as
> > > > > > > expected, or is it something that requires fixing?
> > > > > > >
> > > > > >
> > > > > > I might be missing something, but there is nothing wrong with
> > > > > > bpf_probe_read_kernel() behavior. In "read 4" case you are overwriting
> > > > > > only upper 4 bytes of u64, so lower 4 bytes are garbage. In "read 8"
> > > > > > you are reading (upper) 4 bytes of garbage from uninitialized
> > > > > > data_dst.
> > > > >
> > > > > The issue doesn't lie with the dst but rather with the src. Even after
> > > > > initializing the destination, the operation still fails. You can find
> > > >
> > > > Are you sure the operation "fails"? If it would fail, you'd get a
> > > > negative error code, but you are getting zero. Which actually makes
> > > > sense.
> > > >
> > > > I think you are just getting confused by big endianness of s390x, and
> > > > there is nothing wrong with bpf_probe_read_kernel().
> > > >
> > > > In both of your tests (I pasted your code below, it would be better if
> > > > you did it in your initial emails) you end up with 0x200 in *upper* 32
> > > > bits (on big endian) and lower bits are zeros. And __retval thing is
> > > > 32-bit (despite BPF program returning long), so this return value is
> > > > truncated to *lower* 32-bits, which are, expectedly, zeroes.
> > >
> > > Thank you for clarifying. The presence of the 32-bit __retval led to
> > > my misunderstanding :(
> > >
> > > >
> > > > So I think everything works as expected, but your tests (at least)
> > > > don't handle the big-endian arch well.
> > >
> > > The issue arises when the dst and src have different sizes, causing
> > > bpf_probe_read_kernel_common() to handle them poorly on big-endian
> > > machines. To address this, we need to calculate the offset for
> > > copying, as demonstrated by the following
> > >
> > >    bpf_probe_read_kernel_common(&kit->bits_copy + offset, size,
> > > unsafe_ptr__ign);
> > >
> > > One might wonder why this calculation is not incorporated directly
> > > into the implementation of bpf_probe_read_kernel_common() ?
> >
> > Let's stop talking about
> > bpf_probe_read_kernel/bpf_probe_read_kernel_common doing anything
> > wrong or not handling anything wrong. It's not, it's correct. Your
> > code is *using* it wrong, and that's what we'll have to fix. The
> > contract of bpf_probe_read_kernel is in terms of memory addresses of
> > source/destination *bytes* and the *common* size of both source and
> > destination. bpf_probe_read_kernel() cannot know that you are passing
> > a pointer to int or long or whatever, it's just a void * pointer. So
> > if you are writing bytes over long, you need to take care of adjusting
> > offsets to accommodate big-endian.
> >
> > It's a distraction to talk about bpf_probe_read_kernel() here (but
> > it's useful to understand its contract).
> >
> > >
> > > >
> > > > __description("probe read 4 bytes")
> > > > __success __retval(0x200)
> > > > long probe_read_4(void)
> > > > {
> > > >     int data = 0x200;
> > > >     long data_dst = 0;
> > > >     int err;
> > > >
> > > >     err = bpf_probe_read_kernel(&data_dst, 4, &data);
> > > >     if (err)
> > > >         return err;
> > > >
> > > >     return data_dst;
> > > > }
> > > >
> > > > SEC("syscall")
> > > > __description("probe read 8 bytes")
> > > > __success __retval(0x200)
> > > > long probe_read_8(void)
> > > > {
> > > >     int data = 0x200;
> > > >     long data_dst = 0;
> > > >     int err;
> > > >
> > > >     err = bpf_probe_read_kernel(&data_dst, 8, &data);
> > > >     if (err)
> > > >         return err;
> > > >
> > > >     return data_dst;
> > > >
> > > > }
> > > >
> > > > > more details in the following link:
> > > > > https://github.com/kernel-patches/bpf/pull/6882. It appears that
> > > > > bpf_probe_read_kernel() encounters difficulties when dealing with
> > > > > non-long-aligned source addresses.
> > > > >
> > > > > >
> > > > > > So getting back to iter implementation. Make sure you are
> > > > > > zero-initializing that u64 value you are reading into?
> > > > > >
> > > > >
> > > > > It has been zero-initialized:
> > > > >
> > > > > + kit->nr_bits = 0;
> > > > > + kit->bits_copy = 0;
> > > > >
> > > >
> > > > ok, then the problem is somewhere else, but it doesn't seem to be in
> > > > bpf_probe_read_kernel(). I'm forgetting what was the original test
> > > > failure for your patch set, but please double check again, taking into
> > > > account the big endianness of s390x.
> > > >
> > >
> > > If we aim to make it compatible with s390, we need to introduce some
> > > constraints regarding the bits iteration.
> > >
> > > 1. We must replace nr_bits with size:
> > >
> > >   bpf_iter_bits_new(struct bpf_iter_bits *it, const void
> > > *unsafe_ptr__ign, u32 size)
> > >
> > > 2. The size must adhere to alignment requirements:
> > >
> > >         if (size <= sizeof(u64)) {
> > >                 int offset = IS_ENABLED(CONFIG_S390) ? sizeof(u64) - size : 0;
> > >
> > >                 switch (size) {
> > >                 case 1:
> > >                 case 2:
> > >                 case 4:
> > >                 case 8:
> > >                         break;
> > >                 default:
> > >                         return -EINVAL;
> > >                 }
> > >
> > >                 err = bpf_probe_read_kernel_common(((char
> > > *)&kit->bits_copy) + offset, size, unsafe_ptr__ign);
> > >                 if (err)
> > >                         return -EFAULT;
> > >
> > >                 kit->size = size;
> > >                 kit->bit = -1;
> > >                 return 0;
> > >         }
> > >
> > >         /* Not long-aligned */
> > >         if (size & (sizeof(unsigned long) - 1))
> > >                 return -EINVAL;
> > >
> > >         ....
> > >
> > > Does this meet your expectations?
> > >
> >
> > I don't think you need to add any restrictions or change anything to
> > be byte-sized. You just need to calculate byte offset and do a bit
> > shift to place N bits from the mask into upper N bits of long on
> > big-endian. You might need to do some masking even for little endian
> > as well.
> >
> > Which is why I suggested investing in simple but thorough tests. Write
> > down a few bit mask patterns of variable length (not just
> > multiple-of-8 sizes) and think about the sequence of bits that should
> > be returned. And then codify that.
> >
> > Then check that both little- and big-endian arches work correctly.
> >
> > This is all a bit tricky because kernel's find_next_bit() makes the
> > assumption that bit masks are long-based, while this bits iterator
> > protocol is based on bit sizes, which are not necessarily multiples of
> > 8 bits. So after you copy memory as bytes, you might need to clear
> > (and shift, for big endian) some bits. Use simple but non-symmetrical
> > bit masks to test everything.
> >
>
> The reason for using 'size' instead of 'nr_bits' lies in the nature of
> the pointer 'unsafe_ptr__ign', which is of type void *. Due to this
> ambiguity in the type, the 'size' parameter serves to indicate the
> size of the data being accessed. For instance, if the type is u32,
> then the 'size' parameter corresponds to sizeof(u32)
>
>     u32 src = 0x100;
>     bpf_for_each(bits, bit, &src, sizeof(u32));
>
> This approach shields the user of the bits iterator from concerns
> about endianness, simplifying usage.
>
> Conversely, if we were to use 'nr_bits', the user would need to
> account for endianness themselves. In other words, the user would be

Well, not exactly. Only if they try to cleverly use int/long for
bitmask instead of sticking to an array of bytes. All this endianness
comes into effect only when you are dealing with something else than
pure bytes. And if they do, I think it's ok to expect them to deal
with endianness correctly. (But also for little endian it will just
work even if they don't think much about it).

So I'd keep it as nr_bits and define an interface based on `char *` or
`void *` to make it clear we are dealing with bits in arbitrary-sized
byte arrays.

But if someone has strong opinions otherwise, I'd be happy to hear
some other arguments, of course. I just think it's confusing to be
dealing with *bit*masks but in terms of byte sizes, tbh.

And also, consider that in practice most users will be dealing with
sizeof(u32) or sizeof(u64) (or larger, but multiple-of-4-or-8-bytes)
bit masks. But the implementation has to be correct for all valid
inputs and use cases.

> responsible for calculating the offset of 'src'.For instance, on a
> big-endian machine, if the 'src' is of type u64 and the user only want
> to iterate over 32 bits, the code would appear as follows:
>
>     u64 src = 0x100;
>     bpf_for_each(bits, bit, ((void *)&src) + (sizeof(u64) - (32 + 7) / 8), 32);
>
> It may indeed be less user-friendly. However, I can make the
> adjustment as you suggested, given that it aligns with the pattern
> observed in bpf_probe_read_kernel_common().
>
> --
> Regards
> Yafang