mbox series

[RFC/PATCHES,00/12] pahole: Reproducible parallel DWARF loading/serial BTF encoding

Message ID 20240402193945.17327-1-acme@kernel.org (mailing list archive)
Headers show
Series pahole: Reproducible parallel DWARF loading/serial BTF encoding | expand

Message

Arnaldo Carvalho de Melo April 2, 2024, 7:39 p.m. UTC
Hi,

	This allows us to have reproducible builds while keeping the
DWARF loading phase in parallel, achieving a noticeable speedup as
showed in the commit log messages:

On a:

  model name    : Intel(R) Core(TM) i7-14700K

  8 performance cores (16 threads), 12 efficiency cores.

Serial encoding:

  $ perf stat -e cycles -r5 pahole --btf_encode_detached=vmlinux.btf.serial vmlinux
             5.18276 +- 0.00952 seconds time elapsed  ( +-  0.18% )

Parallel, but non-reproducible:

  $ perf stat -e cycles -r5 pahole -j --btf_encode_detached=vmlinux.btf.parallel vmlinux
              1.8529 +- 0.0159 seconds time elapsed  ( +-  0.86% )

reproducible build done using parallel DWARF loading + CUs-ordered-as-in-vmlinux serial BTF encoding:

  $ perf stat -e cycles -r5 pahole -j --reproducible_build --btf_encode_detached=vmlinux.btf.parallel.reproducible_build vmlinux
              2.3632 +- 0.0164 seconds time elapsed  ( +-  0.69% )

Please take a look, its in the 'next' branch at:

  https://git.kernel.org/pub/scm/devel/pahole/pahole.git
  https://git.kernel.org/pub/scm/devel/pahole/pahole.git/log/?h=next

There is a new tool to do regression testing on this feature:

  https://git.kernel.org/pub/scm/devel/pahole/pahole.git/commit/?h=next&id=c751214c19bf8591bf8e4abdc677cbadee08f630
  
And here a more detailed set of tests using it:

  https://git.kernel.org/pub/scm/devel/pahole/pahole.git/commit/?h=next&id=4451467ca16a6e31834f6f98661c63587ce556f7

Working on libbpf to allow for parallel reproducible BTF encoding is the
next step.

Thanks a lot,

- Arnaldo

Arnaldo Carvalho de Melo (12):
  core: Allow asking for a reproducible build
  pahole: Disable BTF multithreaded encoded when doing reproducible builds
  dwarf_loader: Separate creating the cu/dcu pair from processing it
  dwarf_loader: Introduce dwarf_cus__process_cu()
  dwarf_loader: Create the cu/dcu pair in dwarf_cus__nextcu()
  dwarf_loader: Remove unused 'thr_data' arg from dwarf_cus__create_and_process_cu()
  core: Add unlocked cus__add() variant
  core: Add cus__remove(), counterpart of cus__add()
  dwarf_loader: Add the cu to the cus list early, remove on LSK_DELETE
  core/dwarf_loader: Add functions to set state of CU processing
  pahole: Encode BTF serially in a reproducible build
  tests: Add a BTF reproducible generation test

 dwarf_loader.c              | 73 +++++++++++++++++++++++---------
 dwarves.c                   | 58 ++++++++++++++++++++++++-
 dwarves.h                   | 17 ++++++++
 pahole.c                    | 84 +++++++++++++++++++++++++++++++++++--
 tests/reproducible_build.sh | 56 +++++++++++++++++++++++++
 5 files changed, 264 insertions(+), 24 deletions(-)
 create mode 100755 tests/reproducible_build.sh

Comments

Eduard Zingerman April 4, 2024, 12:08 a.m. UTC | #1
On Tue, 2024-04-02 at 16:39 -0300, Arnaldo Carvalho de Melo wrote:
> Hi,
> 
> 	This allows us to have reproducible builds while keeping the
> DWARF loading phase in parallel, achieving a noticeable speedup as
> showed in the commit log messages:

[...]

> Working on libbpf to allow for parallel reproducible BTF encoding is the
> next step.

Another option would be to apply some sort of canonical ordering to
BTF itself. E.g. put all PTR before STRUCT, sort same kinds by name,
sort same kinds by vlen, etc. Something akin to [1], however this
experiment has several flaws:
- slowdown is much worse than with your patch-set;
- I see a small number of functions with identical names appearing and
  disappearing from final BTF.
  
I'll try to figure out the reason for slowdown tomorrow.

[1] https://github.com/eddyz87/dwarves/tree/sort-btf
Alan Maguire April 4, 2024, 8:05 a.m. UTC | #2
On 04/04/2024 01:08, Eduard Zingerman wrote:
> On Tue, 2024-04-02 at 16:39 -0300, Arnaldo Carvalho de Melo wrote:
>> Hi,
>>
>> 	This allows us to have reproducible builds while keeping the
>> DWARF loading phase in parallel, achieving a noticeable speedup as
>> showed in the commit log messages:
> 
> [...]
> 
>> Working on libbpf to allow for parallel reproducible BTF encoding is the
>> next step.
> 
> Another option would be to apply some sort of canonical ordering to
> BTF itself. E.g. put all PTR before STRUCT, sort same kinds by name,
> sort same kinds by vlen, etc. Something akin to [1], however this
> experiment has several flaws:
> - slowdown is much worse than with your patch-set;
> - I see a small number of functions with identical names appearing and
>   disappearing from final BTF.
>   

Could that be the handling of functions with same name, inconsistent
prototypes? We leave them out deliberately (see
btf_encoder__add_saved_funcs().

> I'll try to figure out the reason for slowdown tomorrow.
> 
> [1] https://github.com/eddyz87/dwarves/tree/sort-btf
>
Alan Maguire April 4, 2024, 8:58 a.m. UTC | #3
On 02/04/2024 20:39, Arnaldo Carvalho de Melo wrote:
> Hi,
> 
> 	This allows us to have reproducible builds while keeping the
> DWARF loading phase in parallel, achieving a noticeable speedup as
> showed in the commit log messages:
> 
> On a:
> 
>   model name    : Intel(R) Core(TM) i7-14700K
> 
>   8 performance cores (16 threads), 12 efficiency cores.
> 
> Serial encoding:
> 
>   $ perf stat -e cycles -r5 pahole --btf_encode_detached=vmlinux.btf.serial vmlinux
>              5.18276 +- 0.00952 seconds time elapsed  ( +-  0.18% )
> 
> Parallel, but non-reproducible:
> 
>   $ perf stat -e cycles -r5 pahole -j --btf_encode_detached=vmlinux.btf.parallel vmlinux
>               1.8529 +- 0.0159 seconds time elapsed  ( +-  0.86% )
> 
> reproducible build done using parallel DWARF loading + CUs-ordered-as-in-vmlinux serial BTF encoding:
> 
>   $ perf stat -e cycles -r5 pahole -j --reproducible_build --btf_encode_detached=vmlinux.btf.parallel.reproducible_build vmlinux
>               2.3632 +- 0.0164 seconds time elapsed  ( +-  0.69% )
> 
> Please take a look, its in the 'next' branch at:
> 
>   https://git.kernel.org/pub/scm/devel/pahole/pahole.git
>   https://git.kernel.org/pub/scm/devel/pahole/pahole.git/log/?h=next
> 
> There is a new tool to do regression testing on this feature:
> 
>   https://git.kernel.org/pub/scm/devel/pahole/pahole.git/commit/?h=next&id=c751214c19bf8591bf8e4abdc677cbadee08f630
>   
> And here a more detailed set of tests using it:
> 
>   https://git.kernel.org/pub/scm/devel/pahole/pahole.git/commit/?h=next&id=4451467ca16a6e31834f6f98661c63587ce556f7
> 
> Working on libbpf to allow for parallel reproducible BTF encoding is the
> next step.
> 
> Thanks a lot,
> 

Hey Arnaldo

In testing this series I've hit a segmentation fault:

Using host libthread_db library "/usr/lib64/libthread_db.so.1".
Core was generated by `pahole -J --btf_features=all --reproducible_build
-j vmlinux'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007f8c8260a58c in ptr_table__entry (pt=0x7f8c60001e70, id=77)
    at /home/almagui/src/dwarves/dwarves.c:612
612		return id >= pt->nr_entries ? NULL : pt->entries[id];
[Current thread is 1 (Thread 0x7f8c65400700 (LWP 624441))]
(gdb) bt
#0  0x00007f8c8260a58c in ptr_table__entry (pt=0x7f8c60001e70, id=77)
    at /home/almagui/src/dwarves/dwarves.c:612
#1  0x00007f8c8260ada2 in cu__type (cu=0x7f8c60001e40, id=77)
    at /home/almagui/src/dwarves/dwarves.c:806
#2  0x00007f8c8261342c in ftype__fprintf (ftype=0x7f8c60272f30,
    cu=0x7f8c60001e40, name=0x0, inlined=0, is_pointer=0, type_spacing=0,
    is_prototype=true, conf=0x7f8c653ff930, fp=0x7f8c3804bc90)
    at /home/almagui/src/dwarves/dwarves_fprintf.c:1388
#3  0x00007f8c8261289d in function__prototype_conf (func=0x7f8c60272f30,
    cu=0x7f8c60001e40, conf=0x7f8c653ff930, bf=0x7f8c27225dad "", len=512)
    at /home/almagui/src/dwarves/dwarves_fprintf.c:1183
#4  0x00007f8c8261b52b in proto__get (func=0x7f8c60272f30,
    proto=0x7f8c27225dad "", len=512)
    at /home/almagui/src/dwarves/btf_encoder.c:811
#5  0x00007f8c8261b665 in funcs__match (encoder=0x7f8c28023220,
    func=0x7f8c27225d88, f2=0x7f8c5805c560)
    at /home/almagui/src/dwarves/btf_encoder.c:839
#6  0x00007f8c8261b7fc in btf_encoder__save_func (encoder=0x7f8c28023220,
    fn=0x7f8c5805c560, func=0x7f8c27225d88)
    at /home/almagui/src/dwarves/btf_encoder.c:871
#7  0x00007f8c8261e361 in btf_encoder__encode_cu (encoder=0x7f8c28023220,
    cu=0x7f8c58001e20, conf_load=0x412400 <conf_load>)
    at /home/almagui/src/dwarves/btf_encoder.c:1888
#8  0x000000000040a36c in pahole_stealer (cu=0x7f8c58001e20,
    conf_load=0x412400 <conf_load>, thr_data=0x0)
    at /home/almagui/src/dwarves/pahole.c:3342
#9  0x00007f8c8262672c in cu__finalize (cu=0x7f8c38001e20, cus=0x21412a0,
    conf=0x412400 <conf_load>, thr_data=0x0)
    at /home/almagui/src/dwarves/dwarf_loader.c:3029
#10 0x00007f8c82626765 in cus__finalize (cus=0x21412a0, cu=0x7f8c38001e20,
    conf=0x412400 <conf_load>, thr_data=0x0)
    at /home/almagui/src/dwarves/dwarf_loader.c:3036
#11 0x00007f8c82626e9b in dwarf_cus__process_cu (dcus=0x7ffd71eaf0d0,
    cu_die=0x7f8c653ffeb0, cu=0x7f8c38001e20, thr_data=0x0)
    at /home/almagui/src/dwarves/dwarf_loader.c:3243
#12 0x00007f8c826270d2 in dwarf_cus__process_cu_thread (arg=0x7ffd71eaef50)
    at /home/almagui/src/dwarves/dwarf_loader.c:3313
#13 0x00007f8c816081da in start_thread () from /usr/lib64/libpthread.so.0
#14 0x00007f8c81239e73 in clone () from /usr/lib64/libc.so.6

So for conf_load->skip_encoding_btf_inconsistent_proto (enabled as part
of "all" and enabled for vmlinux/module BTF), we use dwarves_fprintf()
to write prototypes to check for inconsistent definitions.

Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007f8c8260a58c in ptr_table__entry (pt=0x7f8c60001e70, id=77)
    at /home/almagui/src/dwarves/dwarves.c:612
612		return id >= pt->nr_entries ? NULL : pt->entries[id];
[Current thread is 1 (Thread 0x7f8c65400700 (LWP 624441))]
(gdb) print *(struct ptr_table *)0x7f8c60001e70
$1 = {entries = 0x0, nr_entries = 2979, allocated_entries = 4096}
(gdb)

So it looks like the ptr_table has 2979 entries but entries is NULL;
could there be an issue where CU initialization is not yet complete
for some threads (it also happens very early in processing)? Can you
reproduce this failure at your end? Thanks!

Alan

> - Arnaldo
> 
> Arnaldo Carvalho de Melo (12):
>   core: Allow asking for a reproducible build
>   pahole: Disable BTF multithreaded encoded when doing reproducible builds
>   dwarf_loader: Separate creating the cu/dcu pair from processing it
>   dwarf_loader: Introduce dwarf_cus__process_cu()
>   dwarf_loader: Create the cu/dcu pair in dwarf_cus__nextcu()
>   dwarf_loader: Remove unused 'thr_data' arg from dwarf_cus__create_and_process_cu()
>   core: Add unlocked cus__add() variant
>   core: Add cus__remove(), counterpart of cus__add()
>   dwarf_loader: Add the cu to the cus list early, remove on LSK_DELETE
>   core/dwarf_loader: Add functions to set state of CU processing
>   pahole: Encode BTF serially in a reproducible build
>   tests: Add a BTF reproducible generation test
> 
>  dwarf_loader.c              | 73 +++++++++++++++++++++++---------
>  dwarves.c                   | 58 ++++++++++++++++++++++++-
>  dwarves.h                   | 17 ++++++++
>  pahole.c                    | 84 +++++++++++++++++++++++++++++++++++--
>  tests/reproducible_build.sh | 56 +++++++++++++++++++++++++
>  5 files changed, 264 insertions(+), 24 deletions(-)
>  create mode 100755 tests/reproducible_build.sh
>
Jiri Olsa April 4, 2024, 9:42 a.m. UTC | #4
On Tue, Apr 02, 2024 at 04:39:33PM -0300, Arnaldo Carvalho de Melo wrote:
> Hi,
> 
> 	This allows us to have reproducible builds while keeping the
> DWARF loading phase in parallel, achieving a noticeable speedup as
> showed in the commit log messages:
> 
> On a:
> 
>   model name    : Intel(R) Core(TM) i7-14700K
> 
>   8 performance cores (16 threads), 12 efficiency cores.
> 
> Serial encoding:
> 
>   $ perf stat -e cycles -r5 pahole --btf_encode_detached=vmlinux.btf.serial vmlinux
>              5.18276 +- 0.00952 seconds time elapsed  ( +-  0.18% )
> 
> Parallel, but non-reproducible:
> 
>   $ perf stat -e cycles -r5 pahole -j --btf_encode_detached=vmlinux.btf.parallel vmlinux
>               1.8529 +- 0.0159 seconds time elapsed  ( +-  0.86% )
> 
> reproducible build done using parallel DWARF loading + CUs-ordered-as-in-vmlinux serial BTF encoding:
> 
>   $ perf stat -e cycles -r5 pahole -j --reproducible_build --btf_encode_detached=vmlinux.btf.parallel.reproducible_build vmlinux
>               2.3632 +- 0.0164 seconds time elapsed  ( +-  0.69% )

hm, got it even faster than paralel build ;-) but it's within the
1 second deviation, I guess it shows better on bigger kernels

reproducible_build:

	[root@krava linux-qemu]# perf stat -e cycles -r 3 -- /home/jolsa/kernel/bpf/pahole/build/pahole -j --reproducible_build --btf_encode_detached=btf.2 ./vmlinux

	 Performance counter stats for '/home/jolsa/kernel/bpf/pahole/build/pahole -j --reproducible_build --btf_encode_detached=btf.2 ./vmlinux' (3 runs):

	   295,519,117,258      cycles                                                                  ( +- 19.48% )

		      9.43 +- 1.02 seconds time elapsed  ( +- 10.84% )

paralel build:

	[root@krava linux-qemu]# perf stat -e cycles -r 3 -- /home/jolsa/kernel/bpf/pahole/build/pahole -j  --btf_encode_detached=btf.2 ./vmlinux

	 Performance counter stats for '/home/jolsa/kernel/bpf/pahole/build/pahole -j --btf_encode_detached=btf.2 ./vmlinux' (3 runs):

	   391,320,973,331      cycles                                                                  ( +- 19.19% )

		     9.851 +- 0.695 seconds time elapsed  ( +-  7.06% )

1 cpu:

	[root@krava linux-qemu]# perf stat -e cycles -r 3 -- /home/jolsa/kernel/bpf/pahole/build/pahole --btf_encode_detached=btf.2 ./vmlinux

	 Performance counter stats for '/home/jolsa/kernel/bpf/pahole/build/pahole --btf_encode_detached=btf.2 ./vmlinux' (3 runs):

	   208,492,342,135      cycles                                                                  ( +- 19.43% )

		    16.761 +- 0.644 seconds time elapsed  ( +-  3.84% )

jirka

> 
> Please take a look, its in the 'next' branch at:
> 
>   https://git.kernel.org/pub/scm/devel/pahole/pahole.git
>   https://git.kernel.org/pub/scm/devel/pahole/pahole.git/log/?h=next
> 
> There is a new tool to do regression testing on this feature:
> 
>   https://git.kernel.org/pub/scm/devel/pahole/pahole.git/commit/?h=next&id=c751214c19bf8591bf8e4abdc677cbadee08f630
>   
> And here a more detailed set of tests using it:
> 
>   https://git.kernel.org/pub/scm/devel/pahole/pahole.git/commit/?h=next&id=4451467ca16a6e31834f6f98661c63587ce556f7
> 
> Working on libbpf to allow for parallel reproducible BTF encoding is the
> next step.
> 
> Thanks a lot,
> 
> - Arnaldo
> 
> Arnaldo Carvalho de Melo (12):
>   core: Allow asking for a reproducible build
>   pahole: Disable BTF multithreaded encoded when doing reproducible builds
>   dwarf_loader: Separate creating the cu/dcu pair from processing it
>   dwarf_loader: Introduce dwarf_cus__process_cu()
>   dwarf_loader: Create the cu/dcu pair in dwarf_cus__nextcu()
>   dwarf_loader: Remove unused 'thr_data' arg from dwarf_cus__create_and_process_cu()
>   core: Add unlocked cus__add() variant
>   core: Add cus__remove(), counterpart of cus__add()
>   dwarf_loader: Add the cu to the cus list early, remove on LSK_DELETE
>   core/dwarf_loader: Add functions to set state of CU processing
>   pahole: Encode BTF serially in a reproducible build
>   tests: Add a BTF reproducible generation test
> 
>  dwarf_loader.c              | 73 +++++++++++++++++++++++---------
>  dwarves.c                   | 58 ++++++++++++++++++++++++-
>  dwarves.h                   | 17 ++++++++
>  pahole.c                    | 84 +++++++++++++++++++++++++++++++++++--
>  tests/reproducible_build.sh | 56 +++++++++++++++++++++++++
>  5 files changed, 264 insertions(+), 24 deletions(-)
>  create mode 100755 tests/reproducible_build.sh
> 
> -- 
> 2.44.0
>
Alan Maguire April 8, 2024, noon UTC | #5
On 04/04/2024 09:58, Alan Maguire wrote:
> On 02/04/2024 20:39, Arnaldo Carvalho de Melo wrote:
>> Hi,
>>
>> 	This allows us to have reproducible builds while keeping the
>> DWARF loading phase in parallel, achieving a noticeable speedup as
>> showed in the commit log messages:
>>
>> On a:
>>
>>   model name    : Intel(R) Core(TM) i7-14700K
>>
>>   8 performance cores (16 threads), 12 efficiency cores.
>>
>> Serial encoding:
>>
>>   $ perf stat -e cycles -r5 pahole --btf_encode_detached=vmlinux.btf.serial vmlinux
>>              5.18276 +- 0.00952 seconds time elapsed  ( +-  0.18% )
>>
>> Parallel, but non-reproducible:
>>
>>   $ perf stat -e cycles -r5 pahole -j --btf_encode_detached=vmlinux.btf.parallel vmlinux
>>               1.8529 +- 0.0159 seconds time elapsed  ( +-  0.86% )
>>
>> reproducible build done using parallel DWARF loading + CUs-ordered-as-in-vmlinux serial BTF encoding:
>>
>>   $ perf stat -e cycles -r5 pahole -j --reproducible_build --btf_encode_detached=vmlinux.btf.parallel.reproducible_build vmlinux
>>               2.3632 +- 0.0164 seconds time elapsed  ( +-  0.69% )
>>
>> Please take a look, its in the 'next' branch at:
>>
>>   https://git.kernel.org/pub/scm/devel/pahole/pahole.git
>>   https://git.kernel.org/pub/scm/devel/pahole/pahole.git/log/?h=next
>>
>> There is a new tool to do regression testing on this feature:
>>
>>   https://git.kernel.org/pub/scm/devel/pahole/pahole.git/commit/?h=next&id=c751214c19bf8591bf8e4abdc677cbadee08f630
>>   
>> And here a more detailed set of tests using it:
>>
>>   https://git.kernel.org/pub/scm/devel/pahole/pahole.git/commit/?h=next&id=4451467ca16a6e31834f6f98661c63587ce556f7
>>
>> Working on libbpf to allow for parallel reproducible BTF encoding is the
>> next step.
>>
>> Thanks a lot,
>>
> 
> Hey Arnaldo
> 
> In testing this series I've hit a segmentation fault:
> 
> Using host libthread_db library "/usr/lib64/libthread_db.so.1".
> Core was generated by `pahole -J --btf_features=all --reproducible_build
> -j vmlinux'.
> Program terminated with signal SIGSEGV, Segmentation fault.
> #0  0x00007f8c8260a58c in ptr_table__entry (pt=0x7f8c60001e70, id=77)
>     at /home/almagui/src/dwarves/dwarves.c:612
> 612		return id >= pt->nr_entries ? NULL : pt->entries[id];
> [Current thread is 1 (Thread 0x7f8c65400700 (LWP 624441))]
> (gdb) bt
> #0  0x00007f8c8260a58c in ptr_table__entry (pt=0x7f8c60001e70, id=77)
>     at /home/almagui/src/dwarves/dwarves.c:612
> #1  0x00007f8c8260ada2 in cu__type (cu=0x7f8c60001e40, id=77)
>     at /home/almagui/src/dwarves/dwarves.c:806
> #2  0x00007f8c8261342c in ftype__fprintf (ftype=0x7f8c60272f30,
>     cu=0x7f8c60001e40, name=0x0, inlined=0, is_pointer=0, type_spacing=0,
>     is_prototype=true, conf=0x7f8c653ff930, fp=0x7f8c3804bc90)
>     at /home/almagui/src/dwarves/dwarves_fprintf.c:1388
> #3  0x00007f8c8261289d in function__prototype_conf (func=0x7f8c60272f30,
>     cu=0x7f8c60001e40, conf=0x7f8c653ff930, bf=0x7f8c27225dad "", len=512)
>     at /home/almagui/src/dwarves/dwarves_fprintf.c:1183
> #4  0x00007f8c8261b52b in proto__get (func=0x7f8c60272f30,
>     proto=0x7f8c27225dad "", len=512)
>     at /home/almagui/src/dwarves/btf_encoder.c:811
> #5  0x00007f8c8261b665 in funcs__match (encoder=0x7f8c28023220,
>     func=0x7f8c27225d88, f2=0x7f8c5805c560)
>     at /home/almagui/src/dwarves/btf_encoder.c:839
> #6  0x00007f8c8261b7fc in btf_encoder__save_func (encoder=0x7f8c28023220,
>     fn=0x7f8c5805c560, func=0x7f8c27225d88)
>     at /home/almagui/src/dwarves/btf_encoder.c:871
> #7  0x00007f8c8261e361 in btf_encoder__encode_cu (encoder=0x7f8c28023220,
>     cu=0x7f8c58001e20, conf_load=0x412400 <conf_load>)
>     at /home/almagui/src/dwarves/btf_encoder.c:1888
> #8  0x000000000040a36c in pahole_stealer (cu=0x7f8c58001e20,
>     conf_load=0x412400 <conf_load>, thr_data=0x0)
>     at /home/almagui/src/dwarves/pahole.c:3342
> #9  0x00007f8c8262672c in cu__finalize (cu=0x7f8c38001e20, cus=0x21412a0,
>     conf=0x412400 <conf_load>, thr_data=0x0)
>     at /home/almagui/src/dwarves/dwarf_loader.c:3029
> #10 0x00007f8c82626765 in cus__finalize (cus=0x21412a0, cu=0x7f8c38001e20,
>     conf=0x412400 <conf_load>, thr_data=0x0)
>     at /home/almagui/src/dwarves/dwarf_loader.c:3036
> #11 0x00007f8c82626e9b in dwarf_cus__process_cu (dcus=0x7ffd71eaf0d0,
>     cu_die=0x7f8c653ffeb0, cu=0x7f8c38001e20, thr_data=0x0)
>     at /home/almagui/src/dwarves/dwarf_loader.c:3243
> #12 0x00007f8c826270d2 in dwarf_cus__process_cu_thread (arg=0x7ffd71eaef50)
>     at /home/almagui/src/dwarves/dwarf_loader.c:3313
> #13 0x00007f8c816081da in start_thread () from /usr/lib64/libpthread.so.0
> #14 0x00007f8c81239e73 in clone () from /usr/lib64/libc.so.6
> 
> So for conf_load->skip_encoding_btf_inconsistent_proto (enabled as part
> of "all" and enabled for vmlinux/module BTF), we use dwarves_fprintf()
> to write prototypes to check for inconsistent definitions.
> 
> Program terminated with signal SIGSEGV, Segmentation fault.
> #0  0x00007f8c8260a58c in ptr_table__entry (pt=0x7f8c60001e70, id=77)
>     at /home/almagui/src/dwarves/dwarves.c:612
> 612		return id >= pt->nr_entries ? NULL : pt->entries[id];
> [Current thread is 1 (Thread 0x7f8c65400700 (LWP 624441))]
> (gdb) print *(struct ptr_table *)0x7f8c60001e70
> $1 = {entries = 0x0, nr_entries = 2979, allocated_entries = 4096}
> (gdb)
> 
> So it looks like the ptr_table has 2979 entries but entries is NULL;
> could there be an issue where CU initialization is not yet complete
> for some threads (it also happens very early in processing)? Can you
> reproduce this failure at your end? Thanks!
>

the following (when applied on top of the series) resolves the
segmentation fault for me:

diff --git a/pahole.c b/pahole.c
index 6c7e738..5ff0eaf 100644
--- a/pahole.c
+++ b/pahole.c
@@ -3348,8 +3348,8 @@ static enum load_steal_kind pahole_stealer(struct
cu *cu,
                if (conf_load->reproducible_build) {
                        ret = LSK__KEEPIT; // we're not processing the
cu passed to this function, so keep it.
-                        // Equivalent to LSK__DELETE since we processed
this
-                       cus__remove(cus, cu);
-                       cu__delete(cu);
                }
 out_btf:
                if (!thr_data) // See comment about reproducibe_build above


In other words, the problem is we remove/delete CUs when finished with
them in each thread (when BTF is generated).  However because the
save/add_saved_funcs stashes CU references in the associated struct
function * (to allow prototype comparison for the same function in
different CUs), we end up with stale CU references and in this case the
freed/nulled ptr_table caused an issue. As far as I can see we need to
retain CUs until all BTF has been merged from threads.

With the fix in place, I'm seeing less then 100msec difference between
reproducible/non-reproducible vmlinux BTF generation; that's great!

Alan

> Alan
> 
>> - Arnaldo
>>
>> Arnaldo Carvalho de Melo (12):
>>   core: Allow asking for a reproducible build
>>   pahole: Disable BTF multithreaded encoded when doing reproducible builds
>>   dwarf_loader: Separate creating the cu/dcu pair from processing it
>>   dwarf_loader: Introduce dwarf_cus__process_cu()
>>   dwarf_loader: Create the cu/dcu pair in dwarf_cus__nextcu()
>>   dwarf_loader: Remove unused 'thr_data' arg from dwarf_cus__create_and_process_cu()
>>   core: Add unlocked cus__add() variant
>>   core: Add cus__remove(), counterpart of cus__add()
>>   dwarf_loader: Add the cu to the cus list early, remove on LSK_DELETE
>>   core/dwarf_loader: Add functions to set state of CU processing
>>   pahole: Encode BTF serially in a reproducible build
>>   tests: Add a BTF reproducible generation test
>>
>>  dwarf_loader.c              | 73 +++++++++++++++++++++++---------
>>  dwarves.c                   | 58 ++++++++++++++++++++++++-
>>  dwarves.h                   | 17 ++++++++
>>  pahole.c                    | 84 +++++++++++++++++++++++++++++++++++--
>>  tests/reproducible_build.sh | 56 +++++++++++++++++++++++++
>>  5 files changed, 264 insertions(+), 24 deletions(-)
>>  create mode 100755 tests/reproducible_build.sh
>>
>
Arnaldo Carvalho de Melo April 8, 2024, 2:39 p.m. UTC | #6
On Mon, Apr 08, 2024 at 01:00:59PM +0100, Alan Maguire wrote:
> On 04/04/2024 09:58, Alan Maguire wrote:
> > Program terminated with signal SIGSEGV, Segmentation fault.
> > #0  0x00007f8c8260a58c in ptr_table__entry (pt=0x7f8c60001e70, id=77)
> >     at /home/almagui/src/dwarves/dwarves.c:612
> > 612		return id >= pt->nr_entries ? NULL : pt->entries[id];
> > [Current thread is 1 (Thread 0x7f8c65400700 (LWP 624441))]
> > (gdb) print *(struct ptr_table *)0x7f8c60001e70
> > $1 = {entries = 0x0, nr_entries = 2979, allocated_entries = 4096}
> > (gdb)

> > So it looks like the ptr_table has 2979 entries but entries is NULL;
> > could there be an issue where CU initialization is not yet complete
> > for some threads (it also happens very early in processing)? Can you
> > reproduce this failure at your end? Thanks!
 
> the following (when applied on top of the series) resolves the
> segmentation fault for me:
 
> diff --git a/pahole.c b/pahole.c
> index 6c7e738..5ff0eaf 100644
> --- a/pahole.c
> +++ b/pahole.c
> @@ -3348,8 +3348,8 @@ static enum load_steal_kind pahole_stealer(struct
> cu *cu,
>                 if (conf_load->reproducible_build) {
>                         ret = LSK__KEEPIT; // we're not processing the
> cu passed to this function, so keep it.
> -                        // Equivalent to LSK__DELETE since we processed
> this
> -                       cus__remove(cus, cu);
> -                       cu__delete(cu);
>                 }
>  out_btf:
>                 if (!thr_data) // See comment about reproducibe_build above
> 

Yeah, Jiri also pointed out this call to cu__delete() was new, I was
trying to avoid having unprocessed 'struct cu' using too much memory, so
after processing it, delete them, but as you found out there are
references to that memory...

> In other words, the problem is we remove/delete CUs when finished with
> them in each thread (when BTF is generated).  However because the
> save/add_saved_funcs stashes CU references in the associated struct
> function * (to allow prototype comparison for the same function in
> different CUs), we end up with stale CU references and in this case the
> freed/nulled ptr_table caused an issue. As far as I can see we need to
> retain CUs until all BTF has been merged from threads.
 
> With the fix in place, I'm seeing less then 100msec difference between
> reproducible/non-reproducible vmlinux BTF generation; that's great!

Excellent!

I'll remove it and add a note crediting you with the removal and having
the explanation about why its not possibe to delete it at that point
(references to the associated 'struct function').

Perhaps we can save this info in some other way that allows us to free
the CU after having it processed, I'll think about it.

But its good to see that the difference is small, great!

Thanks a lot!

- Arnaldo
Eduard Zingerman April 9, 2024, 2:34 p.m. UTC | #7
On Thu, 2024-04-04 at 09:05 +0100, Alan Maguire wrote:
[...]

> Could that be the handling of functions with same name, inconsistent
> prototypes? We leave them out deliberately (see
> btf_encoder__add_saved_funcs().
> 
> > I'll try to figure out the reason for slowdown tomorrow.
> > 
> > [1] https://github.com/eddyz87/dwarves/tree/sort-btf
> > 

Fwiw, the best I can do is here:
https://github.com/eddyz87/dwarves/tree/sort-btf

It adds total ordering to BTF types using kind, kflag, vlen, name etc properties,
and rebuilds final BTF to follow this order. Here are the measurements:

$ sudo cpupower frequency-set --min 3Ghz --max 3Ghz
$ nice -n18 perf stat -r50 pahole --reproducible_build -j --btf_encode_detached=/dev/null vmlinux
           ...
           3.08648 +- 0.00813 seconds time elapsed  ( +-  0.26% )

$ nice -n18 perf stat -r50 pahole -j --btf_encode_detached=/dev/null vmlinux
           ...
           3.00785 +- 0.00878 seconds time elapsed  ( +-  0.29% )

Which gives 2.6% total time penalty when reproducible build option is used.
./test_progs tests are passing.

Still, there are a few discrepancies in generated BTFs: some function
prototypes are included twice at random (about 30 IDs added/deleted).
This might be connected to Alan's suggestion and requires
further investigation.

All in all, Arnaldo's approach with CU ordering looks simpler.

Best regards,
Eduard
Alexei Starovoitov April 9, 2024, 2:56 p.m. UTC | #8
On Tue, Apr 9, 2024 at 7:34 AM Eduard Zingerman <eddyz87@gmail.com> wrote:
>
> On Thu, 2024-04-04 at 09:05 +0100, Alan Maguire wrote:
> [...]
>
> > Could that be the handling of functions with same name, inconsistent
> > prototypes? We leave them out deliberately (see
> > btf_encoder__add_saved_funcs().
> >
> > > I'll try to figure out the reason for slowdown tomorrow.
> > >
> > > [1] https://github.com/eddyz87/dwarves/tree/sort-btf
> > >
>
> Fwiw, the best I can do is here:
> https://github.com/eddyz87/dwarves/tree/sort-btf
>
> It adds total ordering to BTF types using kind, kflag, vlen, name etc properties,
> and rebuilds final BTF to follow this order. Here are the measurements:
>
> $ sudo cpupower frequency-set --min 3Ghz --max 3Ghz
> $ nice -n18 perf stat -r50 pahole --reproducible_build -j --btf_encode_detached=/dev/null vmlinux
>            ...
>            3.08648 +- 0.00813 seconds time elapsed  ( +-  0.26% )
>
> $ nice -n18 perf stat -r50 pahole -j --btf_encode_detached=/dev/null vmlinux
>            ...
>            3.00785 +- 0.00878 seconds time elapsed  ( +-  0.29% )
>
> Which gives 2.6% total time penalty when reproducible build option is used.
> ./test_progs tests are passing.
>
> Still, there are a few discrepancies in generated BTFs: some function
> prototypes are included twice at random (about 30 IDs added/deleted).
> This might be connected to Alan's suggestion and requires
> further investigation.
>
> All in all, Arnaldo's approach with CU ordering looks simpler.

I would actually go with sorted BTF, since it will probably
make diff-ing of BTFs practical. Will be easier to track changes
from one kernel version to another. vmlinux.h will become
a bit more sorted too and normal diff vmlinux_6_1.h vmlinux_6_2.h
will be possible.
Or am I misunderstanding the sorting concept?
Eduard Zingerman April 9, 2024, 3:01 p.m. UTC | #9
On Tue, 2024-04-09 at 07:56 -0700, Alexei Starovoitov wrote:
[...]

> I would actually go with sorted BTF, since it will probably
> make diff-ing of BTFs practical. Will be easier to track changes
> from one kernel version to another. vmlinux.h will become
> a bit more sorted too and normal diff vmlinux_6_1.h vmlinux_6_2.h
> will be possible.
> Or am I misunderstanding the sorting concept?

You understand the concept correctly, here is a sample:

  [1] INT '_Bool' size=1 bits_offset=0 nr_bits=8 encoding=BOOL
  [2] INT '__int128' size=16 bits_offset=0 nr_bits=128 encoding=SIGNED
  [3] INT '__int128 unsigned' size=16 bits_offset=0 nr_bits=128 encoding=(none)
  [4] INT 'char' size=1 bits_offset=0 nr_bits=8 encoding=(none)
  [5] INT 'int' size=4 bits_offset=0 nr_bits=32 encoding=SIGNED
  [6] INT 'long int' size=8 bits_offset=0 nr_bits=64 encoding=SIGNED
  [7] INT 'long long int' size=8 bits_offset=0 nr_bits=64 encoding=SIGNED
  ...
  [15085] STRUCT 'arch_elf_state' size=0 vlen=0
  [15086] STRUCT 'arch_vdso_data' size=0 vlen=0
  [15087] STRUCT 'bpf_run_ctx' size=0 vlen=0
  [15088] STRUCT 'dev_archdata' size=0 vlen=0
  [15089] STRUCT 'dyn_arch_ftrace' size=0 vlen=0
  [15090] STRUCT 'fscrypt_dummy_policy' size=0 vlen=0
  ...
  
(Sort by kind, than by vlen, than by name because sorting by name is a
 bit costly, then by member properties)
Arnaldo Carvalho de Melo April 9, 2024, 6:45 p.m. UTC | #10
On Tue, Apr 09, 2024 at 06:01:08PM +0300, Eduard Zingerman wrote:
> On Tue, 2024-04-09 at 07:56 -0700, Alexei Starovoitov wrote:
> [...]
 
> > I would actually go with sorted BTF, since it will probably
> > make diff-ing of BTFs practical. Will be easier to track changes

What kind of diff-ing of BTFs from different kernels are you interested
in?

in pahole's repository we have btfdiff, that will, given a vmlinux with
both DWARF and BTF use pahole to pretty print all types, expanded, and
then compare the two outputs, which should produce the same results from
BTF and DWARF. Ditto for DWARF from a vmlinux compared to a detached BTF
file.

And also now we have another regression test script that will produce
the output from 'btftool btf dump' for the BTF generated from DWARF in
serial mode, and then compare that with the output from 'bpftool btf
dump' for reproducible encodings done using -j 1 ...
number-of-processors-on-the-machine. All have to match, all types, all
BTF ids.

We can as well use something like btfdiff to compare the output from
'pahole --expand_types --sort' for two BTFs for two different kernels,
to see what are the new types and the changes to types in both.

What else do you want to compare? To be able to match we would have to
somehow have ranges for each DWARF CU so that when encoding and then
deduplicating we would have space in the ID space for new types to fill
in while keeping the old types IDs matching the same types in the new
vmlinux.

While ordering all types we would have to have ID space available from
each of the BTF kinds, no?

I haven't looked at Eduard's patches, is that what it is done?

> > from one kernel version to another. vmlinux.h will become
> > a bit more sorted too and normal diff vmlinux_6_1.h vmlinux_6_2.h
> > will be possible.
> > Or am I misunderstanding the sorting concept?
 
> You understand the concept correctly, here is a sample:
 
>   [1] INT '_Bool' size=1 bits_offset=0 nr_bits=8 encoding=BOOL
>   [2] INT '__int128' size=16 bits_offset=0 nr_bits=128 encoding=SIGNED
>   [3] INT '__int128 unsigned' size=16 bits_offset=0 nr_bits=128 encoding=(none)
>   [4] INT 'char' size=1 bits_offset=0 nr_bits=8 encoding=(none)
>   [5] INT 'int' size=4 bits_offset=0 nr_bits=32 encoding=SIGNED
>   [6] INT 'long int' size=8 bits_offset=0 nr_bits=64 encoding=SIGNED
>   [7] INT 'long long int' size=8 bits_offset=0 nr_bits=64 encoding=SIGNED

The above: so far so good, probably there will not be something that
will push what is now BTF id 6 to become 7 in a new vmlinux, but can we
say the same for the more dynamic parts, like the list of structs?

A struct can vanish, that abstraction not being used anymore in the
kernel, so its BTF id will vacate and all of the next struct IDs will
"fall down" and gets its IDs decremented, no?

If these difficulties are present as I mentioned, then rebuilding from
the BTF data with something like the existing 'pahole --expand_types
--sort' from the BTF from kernel N to compare with the same output for
kernel N + 1 should be enough to see what changed from one kernel to the
next one?

- Arnaldo

>   ...
>   [15085] STRUCT 'arch_elf_state' size=0 vlen=0
>   [15086] STRUCT 'arch_vdso_data' size=0 vlen=0
>   [15087] STRUCT 'bpf_run_ctx' size=0 vlen=0
>   [15088] STRUCT 'dev_archdata' size=0 vlen=0
>   [15089] STRUCT 'dyn_arch_ftrace' size=0 vlen=0
>   [15090] STRUCT 'fscrypt_dummy_policy' size=0 vlen=0
>   ...
>   
> (Sort by kind, than by vlen, than by name because sorting by name is a
>  bit costly, then by member properties)
Eduard Zingerman April 9, 2024, 7:29 p.m. UTC | #11
On Tue, 2024-04-09 at 15:45 -0300, Arnaldo Carvalho de Melo wrote:
> On Tue, Apr 09, 2024 at 06:01:08PM +0300, Eduard Zingerman wrote:
> > On Tue, 2024-04-09 at 07:56 -0700, Alexei Starovoitov wrote:
> > [...]
>  
> > > I would actually go with sorted BTF, since it will probably
> > > make diff-ing of BTFs practical. Will be easier to track changes
> 
> What kind of diff-ing of BTFs from different kernels are you interested
> in?
> 
> in pahole's repository we have btfdiff, that will, given a vmlinux with
> both DWARF and BTF use pahole to pretty print all types, expanded, and
> then compare the two outputs, which should produce the same results from
> BTF and DWARF. Ditto for DWARF from a vmlinux compared to a detached BTF
> file.
> 
> And also now we have another regression test script that will produce
> the output from 'btftool btf dump' for the BTF generated from DWARF in
> serial mode, and then compare that with the output from 'bpftool btf
> dump' for reproducible encodings done using -j 1 ...
> number-of-processors-on-the-machine. All have to match, all types, all
> BTF ids.
> 
> We can as well use something like btfdiff to compare the output from
> 'pahole --expand_types --sort' for two BTFs for two different kernels,
> to see what are the new types and the changes to types in both.
> 
> What else do you want to compare? To be able to match we would have to
> somehow have ranges for each DWARF CU so that when encoding and then
> deduplicating we would have space in the ID space for new types to fill
> in while keeping the old types IDs matching the same types in the new
> vmlinux.

As far as I understand Alexei, he means diffing two vmlinux.h files
generated for different kernel versions. The vmlinux.h is generated by
bpftool using command `bpftool btf dump file <binary-file> format c`.
The output is topologically sorted to satisfy C compiler, but ordering
is not total, so vmlinux.h content may vary from build to build if BTF
type order differs.

Thus, any kind of stable BTF type ordering would make vmlinux.h stable.
On the other hand, topological ordering used by bpftool
(the algorithm is in the libbpf, actually) might be extended with
additional rules to make the ordering total.

> While ordering all types we would have to have ID space available from
> each of the BTF kinds, no?
> 
> I haven't looked at Eduard's patches, is that what it is done?

No, I don't reserve any ID space, the output of 
`bpftool btf dump file <binary-file> format raw` is not suitable for
diffing w/o post-processing if some types are added or removed in the
middle.

I simply add a function to compare two BTF types and a pass that sorts
all BTF types before finalizing BTF generation.

> > > from one kernel version to another. vmlinux.h will become
> > > a bit more sorted too and normal diff vmlinux_6_1.h vmlinux_6_2.h
> > > will be possible.
> > > Or am I misunderstanding the sorting concept?
>  
> > You understand the concept correctly, here is a sample:
>  
> >   [1] INT '_Bool' size=1 bits_offset=0 nr_bits=8 encoding=BOOL
> >   [2] INT '__int128' size=16 bits_offset=0 nr_bits=128 encoding=SIGNED
> >   [3] INT '__int128 unsigned' size=16 bits_offset=0 nr_bits=128 encoding=(none)
> >   [4] INT 'char' size=1 bits_offset=0 nr_bits=8 encoding=(none)
> >   [5] INT 'int' size=4 bits_offset=0 nr_bits=32 encoding=SIGNED
> >   [6] INT 'long int' size=8 bits_offset=0 nr_bits=64 encoding=SIGNED
> >   [7] INT 'long long int' size=8 bits_offset=0 nr_bits=64 encoding=SIGNED
> 
> The above: so far so good, probably there will not be something that
> will push what is now BTF id 6 to become 7 in a new vmlinux, but can we
> say the same for the more dynamic parts, like the list of structs?
> 
> A struct can vanish, that abstraction not being used anymore in the
> kernel, so its BTF id will vacate and all of the next struct IDs will
> "fall down" and gets its IDs decremented, no?

Yes, this would happen.

> If these difficulties are present as I mentioned, then rebuilding from
> the BTF data with something like the existing 'pahole --expand_types
> --sort' from the BTF from kernel N to compare with the same output for
> kernel N + 1 should be enough to see what changed from one kernel to the
> next one?

Yes, this is an option.

Thanks,
Eduard
Alexei Starovoitov April 9, 2024, 7:34 p.m. UTC | #12
On Tue, Apr 9, 2024 at 12:29 PM Eduard Zingerman <eddyz87@gmail.com> wrote:
>
> On Tue, 2024-04-09 at 15:45 -0300, Arnaldo Carvalho de Melo wrote:
> > On Tue, Apr 09, 2024 at 06:01:08PM +0300, Eduard Zingerman wrote:
> > > On Tue, 2024-04-09 at 07:56 -0700, Alexei Starovoitov wrote:
> > > [...]
> >
> > > > I would actually go with sorted BTF, since it will probably
> > > > make diff-ing of BTFs practical. Will be easier to track changes
> >
> > What kind of diff-ing of BTFs from different kernels are you interested
> > in?
> >
> > in pahole's repository we have btfdiff, that will, given a vmlinux with
> > both DWARF and BTF use pahole to pretty print all types, expanded, and
> > then compare the two outputs, which should produce the same results from
> > BTF and DWARF. Ditto for DWARF from a vmlinux compared to a detached BTF
> > file.
> >
> > And also now we have another regression test script that will produce
> > the output from 'btftool btf dump' for the BTF generated from DWARF in
> > serial mode, and then compare that with the output from 'bpftool btf
> > dump' for reproducible encodings done using -j 1 ...
> > number-of-processors-on-the-machine. All have to match, all types, all
> > BTF ids.
> >
> > We can as well use something like btfdiff to compare the output from
> > 'pahole --expand_types --sort' for two BTFs for two different kernels,
> > to see what are the new types and the changes to types in both.
> >
> > What else do you want to compare? To be able to match we would have to
> > somehow have ranges for each DWARF CU so that when encoding and then
> > deduplicating we would have space in the ID space for new types to fill
> > in while keeping the old types IDs matching the same types in the new
> > vmlinux.
>
> As far as I understand Alexei, he means diffing two vmlinux.h files
> generated for different kernel versions. The vmlinux.h is generated by
> bpftool using command `bpftool btf dump file <binary-file> format c`.
> The output is topologically sorted to satisfy C compiler, but ordering
> is not total, so vmlinux.h content may vary from build to build if BTF
> type order differs.
>
> Thus, any kind of stable BTF type ordering would make vmlinux.h stable.
> On the other hand, topological ordering used by bpftool
> (the algorithm is in the libbpf, actually) might be extended with
> additional rules to make the ordering total.

Not looking for perfect ordering.
People check-in vmlinux.h into their repos.
If it's more or less sorted the git diff of updated vmlinux.h will be
a bit more human readable. Hopefully.
Arnaldo Carvalho de Melo April 9, 2024, 7:57 p.m. UTC | #13
On Tue, Apr 09, 2024 at 10:29:18PM +0300, Eduard Zingerman wrote:
> On Tue, 2024-04-09 at 15:45 -0300, Arnaldo Carvalho de Melo wrote:
> > On Tue, Apr 09, 2024 at 06:01:08PM +0300, Eduard Zingerman wrote:
> > > On Tue, 2024-04-09 at 07:56 -0700, Alexei Starovoitov wrote:
> > > [...]
> >  
> > > > I would actually go with sorted BTF, since it will probably
> > > > make diff-ing of BTFs practical. Will be easier to track changes
> > 
> > What kind of diff-ing of BTFs from different kernels are you interested
> > in?
> > 
> > in pahole's repository we have btfdiff, that will, given a vmlinux with
> > both DWARF and BTF use pahole to pretty print all types, expanded, and
> > then compare the two outputs, which should produce the same results from
> > BTF and DWARF. Ditto for DWARF from a vmlinux compared to a detached BTF
> > file.
> > 
> > And also now we have another regression test script that will produce
> > the output from 'btftool btf dump' for the BTF generated from DWARF in
> > serial mode, and then compare that with the output from 'bpftool btf
> > dump' for reproducible encodings done using -j 1 ...
> > number-of-processors-on-the-machine. All have to match, all types, all
> > BTF ids.
> > 
> > We can as well use something like btfdiff to compare the output from
> > 'pahole --expand_types --sort' for two BTFs for two different kernels,
> > to see what are the new types and the changes to types in both.
> > 
> > What else do you want to compare? To be able to match we would have to
> > somehow have ranges for each DWARF CU so that when encoding and then
> > deduplicating we would have space in the ID space for new types to fill
> > in while keeping the old types IDs matching the same types in the new
> > vmlinux.
> 
> As far as I understand Alexei, he means diffing two vmlinux.h files
> generated for different kernel versions. The vmlinux.h is generated by
> bpftool using command `bpftool btf dump file <binary-file> format c`.
> The output is topologically sorted to satisfy C compiler, but ordering
> is not total, so vmlinux.h content may vary from build to build if BTF
> type order differs.
> 
> Thus, any kind of stable BTF type ordering would make vmlinux.h stable.
> On the other hand, topological ordering used by bpftool
> (the algorithm is in the libbpf, actually) might be extended with
> additional rules to make the ordering total.

Interesting, the other tool that is in the pahole repo is 'fullcircle',
that given a .o file will generate a compileable file (a vmlinux.h say)
and then build it again to generate DWARF and then compare the original
DWARF with the new onbe.
 
> > While ordering all types we would have to have ID space available from
> > each of the BTF kinds, no?
> > 
> > I haven't looked at Eduard's patches, is that what it is done?
> 
> No, I don't reserve any ID space, the output of 
> `bpftool btf dump file <binary-file> format raw` is not suitable for
> diffing w/o post-processing if some types are added or removed in the
> middle.
 
> I simply add a function to compare two BTF types and a pass that sorts
> all BTF types before finalizing BTF generation.

Ok, so I see that the BTF ids for the types will change, its the
vmlinux.h that is to be compared.

root@x1:~# pahole -F btf --compile | tail -12
struct ncsi_aen_handler {
	unsigned char              type;                 /*     0     1 */

	/* XXX 3 bytes hole, try to pack */

	int                        payload;              /*     4     4 */
	int                        (*handler)(struct ncsi_dev_priv *, struct ncsi_aen_pkt_hdr *); /*     8     8 */

	/* size: 16, cachelines: 1, members: 3 */
	/* sum members: 13, holes: 1, sum holes: 3 */
	/* last cacheline: 16 bytes */
};
root@x1:~# pahole -F btf --compile > a.c ; echo 'int main(void) { struct ncsi_aen_handler b = { 1, } ; return b.type ; } ' >> a.c ; gcc -g -o bla -c a.c
root@x1:~# pahole --expand_types ncsi_aen_handler > from_kernel_btf
root@x1:~# pahole --expand_types -C ncsi_aen_handler bla > from_bla_dwarf
root@x1:~# diff -u from_kernel_btf from_bla_dwarf
root@x1:~#

The above is for a super simple struct, no expansions even, now for:

root@x1:~# pahole -F btf --compile > a.c ; echo 'int main(void) { struct task_struct b = { .prio = 12345, } ; return b.prio ; } ' >> a.c ; gcc -g -o bla -c a.c
root@x1:~# pahole --suppress_aligned_attribute --expand_types -C task_struct bla > from_bla_dwarf
root@x1:~# pahole --suppress_aligned_attribute --expand_types task_struct > from_kernel_btf
root@x1:~# diff -u from_kernel_btf from_bla_dwarf
root@x1:~#

I suppressed the align attribute as right now the output from pahole
when it finds the __attribute__ alignment present in DWARF is slightly
different, but equivalent (barring bugs) to when it infers the alignment
and adds it to BTF data, that has no alignment info other than the
member offsets (DWARF has both the member offsets to infer the alignment
_and_ attributes when they are present in the source code, sometimes
even duplicated, which probably is the reason for the difference in
output (albeit the end result should be equivalent)).

root@x1:~# pahole --expand_types task_struct | wc -l
1254
root@x1:~# pahole --expand_types task_struct | tail
	/* XXX last struct has 1 hole, 1 bit hole */

	/* size: 13696, cachelines: 214, members: 265 */
	/* sum members: 13522, holes: 20, sum holes: 158 */
	/* sum bitfield members: 83 bits, bit holes: 2, sum bit holes: 45 bits */
	/* member types with holes: 4, total: 6, bit holes: 2, total: 2 */
	/* paddings: 6, sum paddings: 49 */
	/* forced alignments: 2, forced holes: 2, sum forced holes: 88 */
};

root@x1:~#

I.e. the original BTF doesn't have to be sorted (well, it will keep the
order DWARF does, which, in turn, is another desire of reproducible
builds, it will not have the same output for two kernel releases, but
should be as close as possible) pahole (--sort or --compile) or bpftool
can do it either by plain sorting the types (pahole --sort, used by
btfdiff to compara output from DWARF to output from BTF) or by
generating a compilable source code (pahole --compile, aka
"topologically sorted to satisfy C compiler").
 
> > > > from one kernel version to another. vmlinux.h will become
> > > > a bit more sorted too and normal diff vmlinux_6_1.h vmlinux_6_2.h
> > > > will be possible.
> > > > Or am I misunderstanding the sorting concept?
  
> > > You understand the concept correctly, here is a sample:
  
> > >   [1] INT '_Bool' size=1 bits_offset=0 nr_bits=8 encoding=BOOL
> > >   [2] INT '__int128' size=16 bits_offset=0 nr_bits=128 encoding=SIGNED
> > >   [3] INT '__int128 unsigned' size=16 bits_offset=0 nr_bits=128 encoding=(none)
> > >   [4] INT 'char' size=1 bits_offset=0 nr_bits=8 encoding=(none)
> > >   [5] INT 'int' size=4 bits_offset=0 nr_bits=32 encoding=SIGNED
> > >   [6] INT 'long int' size=8 bits_offset=0 nr_bits=64 encoding=SIGNED
> > >   [7] INT 'long long int' size=8 bits_offset=0 nr_bits=64 encoding=SIGNED

> > The above: so far so good, probably there will not be something that
> > will push what is now BTF id 6 to become 7 in a new vmlinux, but can we
> > say the same for the more dynamic parts, like the list of structs?
 
> > A struct can vanish, that abstraction not being used anymore in the
> > kernel, so its BTF id will vacate and all of the next struct IDs will
> > "fall down" and gets its IDs decremented, no?
 
> Yes, this would happen.

We're on the same page.

> > If these difficulties are present as I mentioned, then rebuilding from
> > the BTF data with something like the existing 'pahole --expand_types
> > --sort' from the BTF from kernel N to compare with the same output for
> > kernel N + 1 should be enough to see what changed from one kernel to the
> > next one?
 
> Yes, this is an option.

Agreed. What I tried in my series was to do as little as possible to
make the serial output be the same as whatever level of paralelism we
have while making the whole process to cost as close to the
unconstrained parallelism that we had in place, i.e. to get a
reproducible build at the lowest cost in terms of code churn (the more
code we touch, the more chances we have of new bugs to be introduced)
and of CPU cycles/memory use, etc.

- Arnaldo
Arnaldo Carvalho de Melo April 12, 2024, 8:36 p.m. UTC | #14
On Mon, Apr 08, 2024 at 11:39:32AM -0300, Arnaldo Carvalho de Melo wrote:
> On Mon, Apr 08, 2024 at 01:00:59PM +0100, Alan Maguire wrote:
> > On 04/04/2024 09:58, Alan Maguire wrote:
> > > Program terminated with signal SIGSEGV, Segmentation fault.
> > > #0  0x00007f8c8260a58c in ptr_table__entry (pt=0x7f8c60001e70, id=77)
> > >     at /home/almagui/src/dwarves/dwarves.c:612
> > > 612		return id >= pt->nr_entries ? NULL : pt->entries[id];
> > > [Current thread is 1 (Thread 0x7f8c65400700 (LWP 624441))]
> > > (gdb) print *(struct ptr_table *)0x7f8c60001e70
> > > $1 = {entries = 0x0, nr_entries = 2979, allocated_entries = 4096}
> > > (gdb)
> 
> > > So it looks like the ptr_table has 2979 entries but entries is NULL;
> > > could there be an issue where CU initialization is not yet complete
> > > for some threads (it also happens very early in processing)? Can you
> > > reproduce this failure at your end? Thanks!
>  
> > the following (when applied on top of the series) resolves the
> > segmentation fault for me:
>  
> > diff --git a/pahole.c b/pahole.c
> > index 6c7e738..5ff0eaf 100644
> > --- a/pahole.c
> > +++ b/pahole.c
> > @@ -3348,8 +3348,8 @@ static enum load_steal_kind pahole_stealer(struct
> > cu *cu,
> >                 if (conf_load->reproducible_build) {
> >                         ret = LSK__KEEPIT; // we're not processing the
> > cu passed to this function, so keep it.
> > -                        // Equivalent to LSK__DELETE since we processed
> > this
> > -                       cus__remove(cus, cu);
> > -                       cu__delete(cu);
> >                 }
> >  out_btf:
> >                 if (!thr_data) // See comment about reproducibe_build above
> > 
> 
> Yeah, Jiri also pointed out this call to cu__delete() was new, I was
> trying to avoid having unprocessed 'struct cu' using too much memory, so
> after processing it, delete them, but as you found out there are
> references to that memory...
> 
> > In other words, the problem is we remove/delete CUs when finished with
> > them in each thread (when BTF is generated).  However because the
> > save/add_saved_funcs stashes CU references in the associated struct
> > function * (to allow prototype comparison for the same function in
> > different CUs), we end up with stale CU references and in this case the
> > freed/nulled ptr_table caused an issue. As far as I can see we need to
> > retain CUs until all BTF has been merged from threads.
>  
> > With the fix in place, I'm seeing less then 100msec difference between
> > reproducible/non-reproducible vmlinux BTF generation; that's great!
> 
> Excellent!
> 
> I'll remove it and add a note crediting you with the removal and having
> the explanation about why its not possibe to delete it at that point
> (references to the associated 'struct function').

So I removed that cus__remove + cu__delete and also the other one at the
flush operation, leaving all cleaning up to cus__delete() time:

⬢[acme@toolbox pahole]$ git diff
diff --git a/dwarves.c b/dwarves.c
index fbc8d8aa0060b7d0..1ec259f50dbd3778 100644
--- a/dwarves.c
+++ b/dwarves.c
@@ -489,8 +489,12 @@ struct cu *cus__get_next_processable_cu(struct cus *cus)
                        cu->state = CU__PROCESSING;
                        goto found;
                case CU__PROCESSING:
-                       // This will only happen when we get to parallel
-                       // reproducible BTF encoding, libbpf dedup work needed here.
+                       // This will happen when we get to parallel
+                       // reproducible BTF encoding, libbpf dedup work needed
+                       // here. The other possibility is when we're flushing
+                       // the DWARF processed CUs when the parallel DWARF
+                       // loading stoped and we still have CUs to encode to
+                       // BTF because of ordering requirements.
                        continue;
                case CU__UNPROCESSED:
                        // The first entry isn't loaded, signal the
diff --git a/pahole.c b/pahole.c
index 6c7e73835b3e9139..77772bb42bb443ce 100644
--- a/pahole.c
+++ b/pahole.c
@@ -3347,9 +3347,9 @@ static enum load_steal_kind pahole_stealer(struct cu *cu,
 
                if (conf_load->reproducible_build) {
                        ret = LSK__KEEPIT; // we're not processing the cu passed to this function, so keep it.
-                       // Equivalent to LSK__DELETE since we processed this
-                       cus__remove(cus, cu);
-                       cu__delete(cu);
+                       // Kinda equivalent to LSK__DELETE since we processed this, but we can't delete it
+                       // as we stash references to entries in CUs for 'struct function' in btf_encoder__add_saved_funcs()
+                       // and btf_encoder__save_func(), so we can't delete them here. - Alan Maguire
                }
 out_btf:
                if (!thr_data) // See comment about reproducibe_build above
@@ -3667,9 +3667,6 @@ static int cus__flush_reproducible_build(struct cus *cus, struct btf_encoder *en
                err = btf_encoder__encode_cu(encoder, cu, conf_load);
                if (err < 0)
                        break;
-
-               cus__remove(cus, cu);
-               cu__delete(cu);
⬢[acme@toolbox pahole]$


It ends up taking a bit more time on this 14700K with 32Gb, I'll later
try to remove that need to keep everything in memory and also double
check this hunch that this is due to keeping everyuthing in memory.

Can I take this (with the above patch, that is a bit bigger than yours)
as a Tested-by + Reviewed-by you?

- Arnald
Arnaldo Carvalho de Melo April 12, 2024, 8:37 p.m. UTC | #15
On Tue, Apr 09, 2024 at 05:34:46PM +0300, Eduard Zingerman wrote:
> Still, there are a few discrepancies in generated BTFs: some function
> prototypes are included twice at random (about 30 IDs added/deleted).
> This might be connected to Alan's suggestion and requires
> further investigation.
> 
> All in all, Arnaldo's approach with CU ordering looks simpler.

I'm going, for now, with the simple approach, can I take your comments
as a Reviewed-by: you?

- Arnaldo
Eduard Zingerman April 12, 2024, 8:40 p.m. UTC | #16
On Fri, 2024-04-12 at 17:37 -0300, Arnaldo Carvalho de Melo wrote:
> On Tue, Apr 09, 2024 at 05:34:46PM +0300, Eduard Zingerman wrote:
> > Still, there are a few discrepancies in generated BTFs: some function
> > prototypes are included twice at random (about 30 IDs added/deleted).
> > This might be connected to Alan's suggestion and requires
> > further investigation.
> > 
> > All in all, Arnaldo's approach with CU ordering looks simpler.
> 
> I'm going, for now, with the simple approach, can I take your comments
> as a Reviewed-by: you?

If you are going to post next version I'll go through the new series
and ack the patches (I understand the main idea but did not read the
series in detail).
Arnaldo Carvalho de Melo April 12, 2024, 9:09 p.m. UTC | #17
On Fri, Apr 12, 2024 at 11:40:44PM +0300, Eduard Zingerman wrote:
> On Fri, 2024-04-12 at 17:37 -0300, Arnaldo Carvalho de Melo wrote:
> > On Tue, Apr 09, 2024 at 05:34:46PM +0300, Eduard Zingerman wrote:
> > > Still, there are a few discrepancies in generated BTFs: some function
> > > prototypes are included twice at random (about 30 IDs added/deleted).
> > > This might be connected to Alan's suggestion and requires
> > > further investigation.
> > > 
> > > All in all, Arnaldo's approach with CU ordering looks simpler.
> > 
> > I'm going, for now, with the simple approach, can I take your comments
> > as a Reviewed-by: you?
> 
> If you are going to post next version I'll go through the new series
> and ack the patches (I understand the main idea but did not read the
> series in detail).

Ok, its now in the next branch, I'll repost here as well.

- Arnaldo
Eduard Zingerman April 12, 2024, 9:10 p.m. UTC | #18
On Fri, 2024-04-12 at 18:09 -0300, Arnaldo Carvalho de Melo wrote:
> On Fri, Apr 12, 2024 at 11:40:44PM +0300, Eduard Zingerman wrote:
> > On Fri, 2024-04-12 at 17:37 -0300, Arnaldo Carvalho de Melo wrote:
> > > On Tue, Apr 09, 2024 at 05:34:46PM +0300, Eduard Zingerman wrote:
> > > > Still, there are a few discrepancies in generated BTFs: some function
> > > > prototypes are included twice at random (about 30 IDs added/deleted).
> > > > This might be connected to Alan's suggestion and requires
> > > > further investigation.
> > > > 
> > > > All in all, Arnaldo's approach with CU ordering looks simpler.
> > > 
> > > I'm going, for now, with the simple approach, can I take your comments
> > > as a Reviewed-by: you?
> > 
> > If you are going to post next version I'll go through the new series
> > and ack the patches (I understand the main idea but did not read the
> > series in detail).
> 
> Ok, its now in the next branch, I'll repost here as well.

Great, thank you, I'll go through it over the weekend.