Message ID | 20210325065332.3122473-1-yhs@fb.com (mailing list archive) |
---|---|
State | Not Applicable |
Series | add option to merge more dwarf cu's into |
Context | Check | Description |
---|---|---|
netdev/tree_selection | success | Not a local patch |
On Wed, Mar 24, 2021 at 11:53:32PM -0700, Yonghong Song wrote:
> This patch added an option "merge_cus", which will permit
> to merge all debug info cu's into one pahole cu.
> For vmlinux built with clang thin-lto or lto, there exist
> cross cu type references. For example, you could have
>   compile unit 1:
>      tag 10:  type A
>   compile unit 2:
>      ...
>        refer to type A (tag 10 in compile unit 1)
> I only checked a few but have seen type A may be a simple type
> like "unsigned char" or a complex type like an array of base types.
>
> There are two different ways to resolve this issue:
> (1). merge all compile units as one pahole cu so tags/types
>      can be resolved easily, or
> (2). try to do on-demand type traversal in other debuginfo cu's
>      when we do die_process().
> The method (2) is much more complicated so I picked method (1).
> An option "merge_cus" is added to permit such an operation.
>
> Merging cu's will create a single cu with lots of types, tags
> and functions. For example with clang thin-lto built vmlinux,
> I saw 9M entries in types table, 5.2M in tags table. The
> below are pahole wallclock time for different hashbits:
> command line: time pahole -J --merge_cus vmlinux
>       # of hashbits       wallclock time in seconds
>           15                       460
>           16                       255
>           17                       131
>           18                       97
>           19                       75
>           20                       69
>           21                       64
>           22                       62
>           23                       58
>           24                       64
>
> Note that the number of hashbits 24 makes performance worse
> than 23. The reason could be that 23 hashbits can cover 8M
> buckets (close to 9M for the number of entries in types table).
> Higher number of hash bits allocates more memory and becomes
> less cache efficient compared to 23 hashbits.
>
> This patch picks # of hashbits 21 as the starting value
> and will try to allocate memory based on that, if memory
> allocation fails, we will go with less hashbits until
> we reach hashbits 15 which is the default for
> non merge-cu case.

I'll probably add a way to specify the starting max_hashbits to be able
to use 'perf stat' to show what causes the performance difference.

I'm also adding the man page patch below, now to build the kernel with
your bpf-next patch to test it.

- Arnaldo

[acme@five pahole]$ git diff
diff --git a/man-pages/pahole.1 b/man-pages/pahole.1
index cbbefbf22556412c..1be2a293ad4bcc50 100644
--- a/man-pages/pahole.1
+++ b/man-pages/pahole.1
@@ -208,6 +208,12 @@ information has float types.
 .B \-\-btf_gen_all
 Allow using all the BTF features supported by pahole.
 
+.TP
+.B \-\-merge_cus
+Merge all cus (except possible types_cu) when loading DWARF, this is needed
+when processing files that have inter-CU references, this happens, for instance
+when building the Linux kernel with clang using thin-LTO or LTO.
+
 .TP
 .B \-l, \-\-show_first_biggest_size_base_type_member
 Show first biggest size base_type member.
[acme@five pahole]$
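
The cross-CU reference problem from the patch description can be illustrated with a small sketch. This is not pahole code: `type_entry`, `type_table` and `type_table__find()` are hypothetical names, and a linear search stands in for pahole's hash tables. It only shows why a lookup confined to one CU's table misses a type defined in another CU, while a merged table resolves it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins, not pahole's real data structures. */
struct type_entry {
	uint64_t offset;	/* DWARF offset that identifies the type */
	const char *name;
};

struct type_table {
	const struct type_entry *entries;
	size_t nr;
};

/* Linear search used only for illustration; pahole uses hash tables. */
static inline const char *type_table__find(const struct type_table *t,
					   uint64_t offset)
{
	for (size_t i = 0; i < t->nr; i++)
		if (t->entries[i].offset == offset)
			return t->entries[i].name;
	return NULL;
}

static inline void cross_cu_lookup_demo(void)
{
	/* "tag 10: type A" lives in compile unit 1 only. */
	static const struct type_entry cu1[] = { { 10, "unsigned char" } };
	static const struct type_entry cu2[] = { { 200, "int" } };
	static const struct type_entry merged[] = {
		{ 10, "unsigned char" }, { 200, "int" },
	};
	const struct type_table cu1_table = { cu1, 1 };
	const struct type_table cu2_table = { cu2, 1 };
	const struct type_table merged_table = { merged, 2 };

	/* Defined in CU 1, so CU 1 can resolve it. */
	assert(type_table__find(&cu1_table, 10) != NULL);
	/* CU 2 refers to offset 10, but its own table cannot resolve it. */
	assert(type_table__find(&cu2_table, 10) == NULL);
	/* After merging, the same reference resolves. */
	assert(type_table__find(&merged_table, 10) != NULL);
}
```

Calling `cross_cu_lookup_demo()` runs the three checks. The price of merging is one very large table, which is what the hashbits discussion in the thread is about.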
On 3/26/21 7:41 AM, Arnaldo Carvalho de Melo wrote:
> On Wed, Mar 24, 2021 at 11:53:32PM -0700, Yonghong Song wrote:
>> [...]
>
> I'll probably add a way to specify the starting max_hashbits to be able
> to use 'perf stat' to show what causes the performance difference.

The problem is with hashtags__find(), esp. the loop

    uint32_t bucket = hashtags__fn(id);
    const struct hlist_head *head = hashtable + bucket;

    hlist_for_each_entry(tpos, pos, head, hash_node) {
            if (tpos->id == id)
                    return tpos;
    }

Say we have 8M types and (1 << 15) buckets, that means
each bucket will have 256 elements. So each lookup will traverse
the loop 128 iterations on average.

If we have 1 << 21 buckets, then each bucket will have 4 elements,
and the average number of loop iterations for hashtags__find()
will be 2.

If the patch needs a respin, I can add the above description
to the commit message.

>
> I'm also adding the man page patch below, now to build the kernel with
> your bpf-next patch to test it.

Thanks for adding the man page and testing, let me know if you
need any help!

>
> - Arnaldo
>
> [acme@five pahole]$ git diff
> diff --git a/man-pages/pahole.1 b/man-pages/pahole.1
> index cbbefbf22556412c..1be2a293ad4bcc50 100644
> --- a/man-pages/pahole.1
> +++ b/man-pages/pahole.1
> @@ -208,6 +208,12 @@ information has float types.
>  .B \-\-btf_gen_all
>  Allow using all the BTF features supported by pahole.
>
> +.TP
> +.B \-\-merge_cus
> +Merge all cus (except possible types_cu) when loading DWARF, this is needed
> +when processing files that have inter-CU references, this happens, for instance
> +when building the Linux kernel with clang using thin-LTO or LTO.
> +
>  .TP
>  .B \-l, \-\-show_first_biggest_size_base_type_member
>  Show first biggest size base_type member.
> [acme@five pahole]$
>
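
The bucket arithmetic in this reply can be checked with a short sketch. It assumes a perfectly uniform hash and takes "8M" to mean 8 << 20 entries; `entries_per_bucket()` and `avg_lookup_iterations()` are hypothetical helper names, not functions from pahole.

```c
#include <assert.h>
#include <stdint.h>

/* Chain length for a table of (1 << bits) buckets, assuming a uniform hash. */
static inline uint64_t entries_per_bucket(uint64_t nr_entries, unsigned int bits)
{
	assert(bits < 64);
	return nr_entries >> bits;
}

/*
 * A successful lookup walks about half of the chain on average,
 * which is the "average number of loop iterations" in the email.
 */
static inline uint64_t avg_lookup_iterations(uint64_t nr_entries, unsigned int bits)
{
	return entries_per_bucket(nr_entries, bits) / 2;
}
```

Under those assumptions, `entries_per_bucket(8ULL << 20, 15)` gives 256 and `avg_lookup_iterations(8ULL << 20, 15)` gives 128, while 21 hashbits gives 4 and 2. That is consistent with the steep drop in wallclock time between 15 and 21 hashbits in the table from the patch description.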
On Fri, Mar 26, 2021 at 11:41:32AM -0300, Arnaldo Carvalho de Melo wrote:
> I'm also adding the man page patch below, now to build the kernel with
> your bpf-next patch to test it.

[acme@five bpf]$ grep CONFIG_CLANG ../build/bpf_clang_thin_lto/.config
CONFIG_CLANG_VERSION=110000
[acme@five bpf]$ grep CLANG ../build/bpf_clang_thin_lto/.config
CONFIG_CC_IS_CLANG=y
CONFIG_CLANG_VERSION=110000
CONFIG_LTO_CLANG=y
CONFIG_ARCH_SUPPORTS_LTO_CLANG=y
CONFIG_ARCH_SUPPORTS_LTO_CLANG_THIN=y
CONFIG_HAS_LTO_CLANG=y
# CONFIG_LTO_CLANG_FULL is not set
CONFIG_LTO_CLANG_THIN=y
[acme@five bpf]$

Building now.

- Arnaldo
On Fri, Mar 26, 2021 at 08:18:07AM -0700, Yonghong Song wrote:
> On 3/26/21 7:41 AM, Arnaldo Carvalho de Melo wrote:
> > On Wed, Mar 24, 2021 at 11:53:32PM -0700, Yonghong Song wrote:
> > > [...]
> > >
> > > This patch picks # of hashbits 21 as the starting value
> > > and will try to allocate memory based on that, if memory
> > > allocation fails, we will go with less hashbits until
> > > we reach hashbits 15 which is the default for
> > > non merge-cu case.
> >
> > I'll probably add a way to specify the starting max_hashbits to be able
> > to use 'perf stat' to show what causes the performance difference.
>
> The problem is with hashtags__find(), esp. the loop
>
>     uint32_t bucket = hashtags__fn(id);
>     const struct hlist_head *head = hashtable + bucket;
>
>     hlist_for_each_entry(tpos, pos, head, hash_node) {
>             if (tpos->id == id)
>                     return tpos;
>     }
>
> Say we have 8M types and (1 << 15) buckets, that means
> each bucket will have 256 elements. So each lookup will traverse
> the loop 128 iterations on average.
>
> If we have 1 << 21 buckets, then each bucket will have 4 elements,
> and the average number of loop iterations for hashtags__find()
> will be 2.
>
> If the patch needs a respin, I can add the above description
> to the commit message.

I can add that, as a comment.

- Arnaldo
On Fri, Mar 26, 2021 at 08:18:07AM -0700, Yonghong Song wrote:
> On 3/26/21 7:41 AM, Arnaldo Carvalho de Melo wrote:
> > On Wed, Mar 24, 2021 at 11:53:32PM -0700, Yonghong Song wrote:
> > I'm also adding the man page patch below, now to build the kernel with
> > your bpf-next patch to test it.

> Thanks for adding man page and testing, let me know if you
> need any help!

So, this is also needed if the vmlinux was built with LTO:

[acme@seventh pahole]$ git diff btfdiff
diff --git a/btfdiff b/btfdiff
index 4db703245e7d..440241de7c2e 100755
--- a/btfdiff
+++ b/btfdiff
@@ -18,6 +18,7 @@ dwarf_output=$(mktemp /tmp/btfdiff.dwarf.XXXXXX)
 pahole_bin=${PAHOLE-"pahole"}
 
 ${pahole_bin} -F dwarf \
+	--merge_cus \
 	--flat_arrays \
 	--suppress_aligned_attribute \
 	--suppress_force_paddings \
[acme@seventh pahole]$

After that we're down to this diff, which probably isn't related to the
patches being tested, but to some difference in how clang encodes this in
DWARF and then how the BTF encoder does it, or perhaps to some problem in
the dwarves_fprintf.c routine, I'll check:

[acme@seventh pahole]$ ./btfdiff vmlinux
--- /tmp/btfdiff.dwarf.ik3LN3	2021-03-26 15:08:05.833806712 -0300
+++ /tmp/btfdiff.btf.69SSZs	2021-03-26 15:08:06.124802727 -0300
@@ -67233,7 +67233,7 @@ struct cpu_rmap {
 	struct {
 		u16                index;                /*    16     2 */
 		u16                dist;                 /*    18     2 */
-	} near[0]; /*    16     0 */
+	} near[]; /*    16     0 */
 
 	/* size: 16, cachelines: 1, members: 5 */
 	/* last cacheline: 16 bytes */
@@ -101159,7 +101159,7 @@ struct linux_efi_memreserve {
 	struct {
 		phys_addr_t        base;                 /*    16     8 */
 		phys_addr_t        size;                 /*    24     8 */
-	} entry[0]; /*    16     0 */
+	} entry[]; /*    16     0 */
 
 	/* size: 16, cachelines: 1, members: 4 */
 	/* last cacheline: 16 bytes */
@@ -113494,7 +113494,7 @@ struct netlink_policy_dump_state {
 	struct {
 		const struct nla_policy  * policy;       /*    16     8 */
 		unsigned int       maxtype;              /*    24     4 */
-	} policies[0]; /*    16     0 */
+	} policies[]; /*    16     0 */
 
 	/* size: 16, cachelines: 1, members: 4 */
 	/* sum members: 12, holes: 1, sum holes: 4 */
[acme@seventh pahole]$

But we need to find a way to discover if the costly --merge_cus needs to
be used...

For the kernel it's just a matter of looking if that CONFIG_ asking for
one of the CLANG LTO variants is present, but for pahole users wanting
to work with a LTO vmlinux this gets confusing as it crashes, perhaps I
need to count how many lookups fail, fix the segfaults and at the end
emit a warning...

OR we can look at...

[acme@five bpf]$ eu-readelf -winfo ../build/bpf_clang_thin_lto/vmlinux | grep -i producer -m1
           producer             (strp) "clang version 11.0.0 (Fedora 11.0.0-2.fc33)"
[acme@five bpf]$

oops, it seems a kernel built with clang doesn't come with the compiler
options used like when using gcc:

[acme@five bpf]$ eu-readelf -winfo ../build/v5.12.0-rc4+/vmlinux | grep -i producer -m2
           producer             (strp) "GNU AS 2.35"
           producer             (strp) "GNU C89 10.2.1 20201125 (Red Hat 10.2.1-9) -mno-sse -mno-mmx -mno-sse2 -mno-3dnow -mno-avx -m64 -mno-80387 -mno-fp-ret-in-387 -mpreferred-stack-boundary=3 -mskip-rax-setup -mtune=generic -mno-red-zone -mcmodel=kernel -mindirect-branch=thunk-extern -mindirect-branch-register -mrecord-mcount -mfentry -march=x86-64 -g -gdwarf-4 -O2 -std=gnu90 -fno-strict-aliasing -fno-common -fshort-wchar -fno-PIE -fcf-protection=none -falign-jumps=1 -falign-loops=1 -fno-asynchronous-unwind-tables -fno-jump-tables -fno-delete-null-pointer-checks -fno-allow-store-data-races -fstack-protector-strong -fno-strict-overflow -fstack-check=no -fconserve-stack -fno-stack-protector"
[acme@five bpf]$

Humm, can't we automagically detect that we need to merge the CUs and do
it if needed?

Have to go AFK now, will try to think about it while driving Pedro from
school...

Did a last test, may be unrelated:

[acme@five pahole]$ fullcircle ./tcp_ipv4.o
/home/acme/bin/fullcircle: line 40: 984531 Segmentation fault      (core dumped) ${codiff_bin} -q -s $file $o_output
[acme@five pahole]$ pahole --help | grep merge
      --merge_cus            Merge all cus (except possible types_cu)
[acme@five pahole]$

- Arnaldo
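
The only difference left in the btfdiff output above is `near[0]` versus `near[]` for trailing arrays. The thread does not establish which tool produces which spelling, so the sketch below only shows the C-level background: a zero-length trailing array (a GNU extension accepted by gcc and clang) and a C99 flexible array member are laid out identically, which would explain why every size and offset in that diff matches. The struct names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Zero-length trailing array: a GNU C extension, not standard C. */
struct rmap_zero_len {
	unsigned short used;
	unsigned short tail[0];
};

/* Flexible array member: the standard C99 spelling of the same idea. */
struct rmap_flex {
	unsigned short used;
	unsigned short tail[];
};

static inline void trailing_array_layout_check(void)
{
	/* Neither trailing array contributes to the struct size. */
	assert(sizeof(struct rmap_zero_len) == sizeof(struct rmap_flex));
	/* Both trailing arrays start at the same offset. */
	assert(offsetof(struct rmap_zero_len, tail) ==
	       offsetof(struct rmap_flex, tail));
}
```

Calling `trailing_array_layout_check()` runs both checks.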
On 3/26/21 11:19 AM, Arnaldo Carvalho de Melo wrote:
> On Fri, Mar 26, 2021 at 08:18:07AM -0700, Yonghong Song wrote:
>> [...]
>
> [...]
>
> Humm, can't we automagically detect that we need to merge the CUs and do
> it if needed?

This is a good question. In the beginning, I wanted to automatically
detect lto mode as well so we don't have to invent this option.
Since we cannot get hints from the dwarf, the only thing we can do
is to actually scan through each cu and, if somehow we cannot resolve
a tag, then try the merging-cu mechanism. This is a little bit
heavyweight. That is why I invented this option.

Now since you found gcc actually has flags in the dwarf producer tag,
which would show whether lto is used, I went to the clang side and found
that the following flag is needed in clang in order to embed flags in
the producer tag:
   -grecord-gcc-switches

So I am going to make the following changes:
  In pahole:
    - check one DW_AT_producer, if an lto flag is in the flags,
      pahole will merge cus,
    - otherwise, old way, one cu at a time.
  In Linux:
    - add flag -grecord-gcc-switches if clang lto is enabled.

Then just for vmlinux-lto, we won't need the merge_cus option.
But for other lto built binaries without -grecord-gcc-switches,
pahole will not work. Maybe we still need the --merge_cus option
eventually, but we can delay this until a later point.

Any further suggestions? I will start on a v2 based on my above
outline.

>
> Have to go AFK now, will try to think about it while driving Pedro from
> school...
>
> Did a last test, may be unrelated:
>
> [acme@five pahole]$ fullcircle ./tcp_ipv4.o
> /home/acme/bin/fullcircle: line 40: 984531 Segmentation fault      (core dumped) ${codiff_bin} -q -s $file $o_output

The .o file in an lto build is not really an elf .o, it is llvm internal
ir bitcode.

> [acme@five pahole]$ pahole --help | grep merge
>       --merge_cus            Merge all cus (except possible types_cu)
> [acme@five pahole]$
>
>
> - Arnaldo
>
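
The detection outlined above (look at a DW_AT_producer string and merge CUs only when it records an LTO flag) could look roughly like the sketch below. `producer_has_lto_flag()` is a hypothetical helper, and matching the substring "-flto" is an assumption about how the flag appears once -grecord-gcc-switches is in effect; the thread does not show such a producer string.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: report whether a DW_AT_producer string records an
 * LTO flag. Assumes the flag shows up as "-flto" (which also covers
 * "-flto=thin"); producers recorded without flags never match.
 */
static inline bool producer_has_lto_flag(const char *producer)
{
	if (producer == NULL)
		return false;
	return strstr(producer, "-flto") != NULL;
}

static inline void producer_check_examples(void)
{
	/* The clang producer from the thread carries no flags at all. */
	assert(!producer_has_lto_flag("clang version 11.0.0 (Fedora 11.0.0-2.fc33)"));
	/* Made-up producer string with recorded switches, for illustration. */
	assert(producer_has_lto_flag("clang version 11.0.0 -O2 -flto=thin"));
	assert(!producer_has_lto_flag(NULL));
}
```

As the email notes, a check of this kind cannot help with LTO binaries built without recorded switches, which is the argument for keeping an explicit --merge_cus option as a fallback.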
On Fri, Mar 26, 2021 at 4:05 PM Yonghong Song <yhs@fb.com> wrote:
>
> Now since you found gcc actually has flags in the dwarf producer tag,
> which would show whether lto is used, I went to the clang side and found
> that the following flag is needed in clang in order to embed flags in
> the producer tag:
>    -grecord-gcc-switches
...
>   In Linux:
>     - add flag -grecord-gcc-switches if clang lto is enabled.

I think that will help to make dwarf output a bit more uniform between
gcc and clang. So it's a good thing on its own.
Recording compilation flags in the debug info could be useful in
other cases too. I would pass it for both lto and non-lto builds.
On 3/26/21 4:12 PM, Alexei Starovoitov wrote:
> On Fri, Mar 26, 2021 at 4:05 PM Yonghong Song <yhs@fb.com> wrote:
>>
>> Now since you found gcc actually has flags in the dwarf producer tag,
>> which would show whether lto is used, I went to the clang side and found
>> that the following flag is needed in clang in order to embed flags in
>> the producer tag:
>>    -grecord-gcc-switches
> ...
>>   In Linux:
>>     - add flag -grecord-gcc-switches if clang lto is enabled.
>
> I think that will help to make dwarf output a bit more uniform between
> gcc and clang. So it's a good thing on its own.
> Recording compilation flags in the debug info could be useful in
> other cases too. I would pass it for both lto and non-lto builds.

Good point. Will do this.
On Wed, Mar 24, 2021 at 11:53 PM Yonghong Song <yhs@fb.com> wrote:
>
> [...]
>
> Merging cu's will create a single cu with lots of types, tags
> and functions. For example with clang thin-lto built vmlinux,
> I saw 9M entries in types table, 5.2M in tags table. The
> below are pahole wallclock time for different hashbits:
> command line: time pahole -J --merge_cus vmlinux
>       # of hashbits       wallclock time in seconds
>           15                       460
>           16                       255
>           17                       131
>           18                       97
>           19                       75
>           20                       69
>           21                       64
>           22                       62
>           23                       58
>           24                       64

What were the numbers for different hashbits without --merge_cus?

[...]
On 3/26/21 4:21 PM, Andrii Nakryiko wrote:
> On Wed, Mar 24, 2021 at 11:53 PM Yonghong Song <yhs@fb.com> wrote:
>> [...]
>
> What were the numbers for different hashbits without --merge_cus?

Without --merge_cus means non-lto vmlinux.
Just did a quick measurement: for hashbits 10 - 18, every
"pahole -J vmlinux" run takes 37s - 39s, with hashbits 10 - 15
at 37 - 38 and the rest at 38 - 39.

The number of cus for my particular vmlinux is 2915.
The total number of types among all cus is roughly 8M based on
a rough regex matching, so each cu has roughly 2K to 3K types.
So the current default setting is okay for non-lto vmlinux.

> [...]
On Fri, Mar 26, 2021 at 04:05:45PM -0700, Yonghong Song wrote:
> On 3/26/21 11:19 AM, Arnaldo Carvalho de Melo wrote:
> > [acme@five pahole]$ fullcircle ./tcp_ipv4.o
> > /home/acme/bin/fullcircle: line 40: 984531 Segmentation fault      (core dumped) ${codiff_bin} -q -s $file $o_output
>
> The .o file in lto build is not really an elf .o, it is llvm internal
> ir bitcode.

This one wasn't from a LTO build, I'll revisit this soon.

Testing v3 now.

- Arnaldo

> > [acme@five pahole]$ pahole --help | grep merge
> >       --merge_cus            Merge all cus (except possible types_cu)
> > [acme@five pahole]$
> >
> >
> > - Arnaldo
> >
diff --git a/dwarf_loader.c b/dwarf_loader.c
index dc66df0..ed4f0da 100644
--- a/dwarf_loader.c
+++ b/dwarf_loader.c
@@ -51,6 +51,7 @@ struct strings *strings;
 #endif
 
 static uint32_t hashtags__bits = 15;
+static uint32_t max_hashtags__bits = 21;
 
 uint32_t hashtags__fn(Dwarf_Off key)
 {
@@ -2484,6 +2485,85 @@ static int cus__load_debug_types(struct cus *cus, struct conf_load *conf,
 	return 0;
 }
 
+static int cus__merge_and_process_cu(struct cus *cus, struct conf_load *conf,
+				     Dwfl_Module *mod, Dwarf *dw, Elf *elf,
+				     const char *filename,
+				     const unsigned char *build_id,
+				     int build_id_len,
+				     struct dwarf_cu *type_dcu)
+{
+	uint8_t pointer_size, offset_size;
+	struct dwarf_cu *dcu = NULL;
+	Dwarf_Off off = 0, noff;
+	struct cu *cu = NULL;
+	size_t cuhl;
+
+	/* Merge all cus */
+	while (dwarf_nextcu(dw, off, &noff, &cuhl, NULL, &pointer_size,
+			    &offset_size) == 0) {
+		Dwarf_Die die_mem;
+		Dwarf_Die *cu_die = dwarf_offdie(dw, off + cuhl, &die_mem);
+
+		if (cu_die == NULL)
+			break;
+
+		if (cu == NULL) {
+			cu = cu__new("", pointer_size, build_id, build_id_len,
+				     filename);
+			if (cu == NULL || cu__set_common(cu, conf, mod, elf) != 0)
+				return DWARF_CB_ABORT;
+
+			dcu = malloc(sizeof(struct dwarf_cu));
+			if (dcu == NULL)
+				return DWARF_CB_ABORT;
+
+			/* Merged cu tends to need a lot more memory.
+			 * Let us start with max_hashtags__bits and
+			 * go down to find a proper hashtag bit value.
+			 */
+			uint32_t default_hbits = hashtags__bits;
+			for (hashtags__bits = max_hashtags__bits;
+			     hashtags__bits >= default_hbits;
+			     hashtags__bits--) {
+				if (dwarf_cu__init(dcu) == 0)
+					break;
+			}
+			if (hashtags__bits < default_hbits)
+				return DWARF_CB_ABORT;
+
+			dcu->cu = cu;
+			dcu->type_unit = type_dcu;
+			cu->priv = dcu;
+			cu->dfops = &dwarf__ops;
+			cu->language = attr_numeric(cu_die, DW_AT_language);
+		}
+
+		const uint16_t tag = dwarf_tag(cu_die);
+		if (tag != DW_TAG_compile_unit && tag != DW_TAG_type_unit) {
+			fprintf(stderr, "%s: DW_TAG_compile_unit or DW_TAG_type_unit expected got %s!\n",
+				__FUNCTION__, dwarf_tag_name(tag));
+			return DWARF_CB_ABORT;
+		}
+
+		Dwarf_Die child;
+		if (dwarf_child(cu_die, &child) == 0) {
+			if (die__process_unit(&child, cu) != 0)
+				return DWARF_CB_ABORT;
+		}
+
+		off = noff;
+	}
+
+	/* process merged cu */
+	if (cu__recode_dwarf_types(cu) != LSK__KEEPIT)
+		return DWARF_CB_ABORT;
+	if (finalize_cu_immediately(cus, cu, dcu, conf)
+	    == LSK__STOP_LOADING)
+		return DWARF_CB_ABORT;
+
+	return 0;
+}
+
 static int cus__load_module(struct cus *cus, struct conf_load *conf,
 			    Dwfl_Module *mod, Dwarf *dw, Elf *elf,
 			    const char *filename)
@@ -2518,6 +2598,15 @@ static int cus__load_module(struct cus *cus, struct conf_load *conf,
 		}
 	}
 
+	if (conf->merge_cus == true) {
+		res = cus__merge_and_process_cu(cus, conf, mod, dw, elf, filename,
+						build_id, build_id_len,
+						type_cu ? &type_dcu : NULL);
+		if (res != 0)
+			return res;
+		goto out;
+	}
+
 	while (dwarf_nextcu(dw, off, &noff, &cuhl, NULL, &pointer_size,
 			    &offset_size) == 0) {
 		Dwarf_Die die_mem;
@@ -2557,6 +2646,7 @@ static int cus__load_module(struct cus *cus, struct conf_load *conf,
 		off = noff;
 	}
 
+out:
 	if (type_lsk == LSK__DELETE)
 		cu__delete(type_cu);
 
diff --git a/dwarves.h b/dwarves.h
index 98caf1a..29b518d 100644
--- a/dwarves.h
+++ b/dwarves.h
@@ -40,6 +40,7 @@ struct conf_fprintf;
  * @extra_dbg_info - keep original debugging format extra info
  *		     (e.g. DWARF's decl_{line,file}, id, etc)
  * @fixup_silly_bitfields - Fixup silly things such as "int foo:32;"
+ * @merge_cus - Merge compile units except possible types_cu
  * @get_addr_info - wheter to load DW_AT_location and other addr info
  */
 struct conf_load {
@@ -50,6 +51,7 @@ struct conf_load {
 	bool			extra_dbg_info;
 	bool			fixup_silly_bitfields;
 	bool			get_addr_info;
+	bool			merge_cus;
 	struct conf_fprintf	*conf_fprintf;
 };
 
diff --git a/pahole.c b/pahole.c
index df6aa83..29fbe1d 100644
--- a/pahole.c
+++ b/pahole.c
@@ -827,6 +827,7 @@ ARGP_PROGRAM_VERSION_HOOK_DEF = dwarves_print_version;
 #define ARGP_btf_base		   321
 #define ARGP_btf_gen_floats	   322
 #define ARGP_btf_gen_all	   323
+#define ARGP_merge_cus		   324
 
 static const struct argp_option pahole__options[] = {
 	{
@@ -1151,6 +1152,11 @@ static const struct argp_option pahole__options[] = {
 		.key  = ARGP_numeric_version,
 		.doc  = "Print a numeric version, i.e. 119 instead of v1.19"
 	},
+	{
+		.name = "merge_cus",
+		.key  = ARGP_merge_cus,
+		.doc  = "Merge all cus (except possible types_cu)"
+	},
 	{
 		.name = NULL,
 	}
@@ -1270,6 +1276,8 @@ static error_t pahole__options_parser(int key, char *arg,
 		btf_gen_floats = true;			break;
 	case ARGP_btf_gen_all:
 		btf_gen_floats = true;			break;
+	case ARGP_merge_cus:
+		conf_load.merge_cus = true;		break;
 	default:
 		return ARGP_ERR_UNKNOWN;
 	}
This patch added an option "merge_cus", which will permit
to merge all debug info cu's into one pahole cu.
For vmlinux built with clang thin-lto or lto, there exist
cross cu type references. For example, you could have
  compile unit 1:
     tag 10:  type A
  compile unit 2:
     ...
       refer to type A (tag 10 in compile unit 1)
I only checked a few but have seen type A may be a simple type
like "unsigned char" or a complex type like an array of base types.

There are two different ways to resolve this issue:
(1). merge all compile units as one pahole cu so tags/types
     can be resolved easily, or
(2). try to do on-demand type traversal in other debuginfo cu's
     when we do die_process().
The method (2) is much more complicated so I picked method (1).
An option "merge_cus" is added to permit such an operation.

Merging cu's will create a single cu with lots of types, tags
and functions. For example with clang thin-lto built vmlinux,
I saw 9M entries in types table, 5.2M in tags table. The
below are pahole wallclock time for different hashbits:
command line: time pahole -J --merge_cus vmlinux
      # of hashbits       wallclock time in seconds
          15                       460
          16                       255
          17                       131
          18                       97
          19                       75
          20                       69
          21                       64
          22                       62
          23                       58
          24                       64

Note that the number of hashbits 24 makes performance worse
than 23. The reason could be that 23 hashbits can cover 8M
buckets (close to 9M for the number of entries in types table).
Higher number of hash bits allocates more memory and becomes
less cache efficient compared to 23 hashbits.

This patch picks # of hashbits 21 as the starting value
and will try to allocate memory based on that, if memory
allocation fails, we will go with less hashbits until
we reach hashbits 15 which is the default for
non merge-cu case.

Signed-off-by: Yonghong Song <yhs@fb.com>
---
 dwarf_loader.c | 90 ++++++++++++++++++++++++++++++++++++++++++++++++++
 dwarves.h      |  2 ++
 pahole.c       |  8 +++++
 3 files changed, 100 insertions(+)
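
The fallback described in the last paragraph of the commit message (start at 21 hashbits and step down towards 15 when allocation fails) can be sketched on its own. This is a simplified illustration rather than the patch's code: `struct bucket` and `alloc_buckets_with_fallback()` are hypothetical, and a plain calloc() stands in for whatever dwarf_cu__init() allocates.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for a hash bucket head (struct hlist_head in pahole). */
struct bucket {
	void *first;
};

/*
 * Try to allocate (1 << bits) zeroed buckets, starting at max_bits and
 * stepping down to min_bits. On success the chosen size is stored in
 * *chosen_bits; NULL means even the smallest table could not be allocated.
 */
static inline struct bucket *alloc_buckets_with_fallback(unsigned int max_bits,
							 unsigned int min_bits,
							 unsigned int *chosen_bits)
{
	assert(min_bits <= max_bits && max_bits < 31);

	for (unsigned int bits = max_bits; ; bits--) {
		struct bucket *table = calloc((size_t)1 << bits, sizeof(*table));

		if (table != NULL) {
			*chosen_bits = bits;
			return table;
		}
		if (bits == min_bits)
			break;	/* nothing smaller left to try */
	}
	return NULL;
}
```

A caller would use `alloc_buckets_with_fallback(21, 15, &bits)` and release the table with `free()`. The explicit `bits == min_bits` test before decrementing keeps the unsigned counter from wrapping when the lower bound is 0.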