diff mbox series

[2/2] kallsyms: build faster by using .incbin

Message ID 20240221202655.2423854-2-jannh@google.com (mailing list archive)
State New, archived
Headers show
Series [1/2] kallsyms: get rid of code for absolute kallsyms | expand

Commit Message

Jann Horn Feb. 21, 2024, 8:26 p.m. UTC
Currently, kallsyms builds a big assembly file (~19M with a normal
kernel config), and then the assembler has to turn that big assembly
file back into binary data, which takes around a second per kallsyms
invocation. (Normally there are two kallsyms invocations per build.)

It is much faster to instead directly output binary data, which can
be imported in an assembly file using ".incbin". This is also the
approach taken by arch/x86/boot/compressed/mkpiggy.c.
So this patch switches kallsyms to that approach.

A complication with this is that the endianness of numbers between
host and target might not match (for example, when cross-compiling);
and there seems to be no kconfig symbol that tells us what endianness
the target has.
So pass the path to the intermediate vmlinux ELF file to the kallsyms
tool, and let it parse the ELF header to figure out the target's
endianness.

I have verified that running kallsyms without these changes and
kallsyms with these changes on the same input System.map results
in identical object files.

This change reduces the time for an incremental kernel rebuild
(touch fs/ioctl.c, then re-run make) from 27.7s to 24.1s (medians
over 16 runs each) on my machine - saving around 3.6 seconds.

Signed-off-by: Jann Horn <jannh@google.com>
---
 scripts/kallsyms.c      | 196 ++++++++++++++++++++++++++++++++--------
 scripts/link-vmlinux.sh |   5 +-
 2 files changed, 159 insertions(+), 42 deletions(-)

Comments

Jann Horn Feb. 21, 2024, 8:29 p.m. UTC | #1
On Wed, Feb 21, 2024 at 9:27 PM Jann Horn <jannh@google.com> wrote:
> Currently, kallsyms builds a big assembly file (~19M with a normal
> kernel config), and then the assembler has to turn that big assembly
> file back into binary data, which takes around a second per kallsyms
> invocation. (Normally there are two kallsyms invocations per build.)
>
> It is much faster to instead directly output binary data, which can
> be imported in an assembly file using ".incbin". This is also the
> approach taken by arch/x86/boot/compressed/mkpiggy.c.
> So this patch switches kallsyms to that approach.
>
> A complication with this is that the endianness of numbers between
> host and target might not match (for example, when cross-compiling);
> and there seems to be no kconfig symbol that tells us what endianness
> the target has.
> So pass the path to the intermediate vmlinux ELF file to the kallsyms
> tool, and let it parse the ELF header to figure out the target's
> endianness.
>
> I have verified that running kallsyms without these changes and
> kallsyms with these changes on the same input System.map results
> in identical object files.
>
> This change reduces the time for an incremental kernel rebuild
> (touch fs/ioctl.c, then re-run make) from 27.7s to 24.1s (medians
> over 16 runs each) on my machine - saving around 3.6 seconds.

Ah, I found no maintainer for this file in MAINTAINERS, but now that
I'm looking at the git history, it looks like fixes have come in
through Masahiro Yamada's kbuild tree? So I'm not entirely sure
whether the maintainer for this is Masahiro Yamada or akpm.
Masahiro Yamada Feb. 22, 2024, 4:06 a.m. UTC | #2
On Thu, Feb 22, 2024 at 5:27 AM Jann Horn <jannh@google.com> wrote:
>
> Currently, kallsyms builds a big assembly file (~19M with a normal
> kernel config), and then the assembler has to turn that big assembly
> file back into binary data, which takes around a second per kallsyms
> invocation. (Normally there are two kallsyms invocations per build.)
>
> It is much faster to instead directly output binary data, which can
> be imported in an assembly file using ".incbin". This is also the
> approach taken by arch/x86/boot/compressed/mkpiggy.c.


Yes, that is a sensible case because it just wraps the binary
without any modification.




> So this patch switches kallsyms to that approach.
>
> A complication with this is that the endianness of numbers between
> host and target might not match (for example, when cross-compiling);
> and there seems to be no kconfig symbol that tells us what endianness
> the target has.



CONFIG_CPU_BIG_ENDIAN is it.



You could do this:

if is_enabled CONFIG_CPU_BIG_ENDIAN; then
        kallsymopt="${kallsymopt} --big-endian"
fi

if is_enabled CONFIG_64BIT; then
        kallsymopt="${kallsymopt} --64bit"
fi






> So pass the path to the intermediate vmlinux ELF file to the kallsyms
> tool, and let it parse the ELF header to figure out the target's
> endianness.
>
> I have verified that running kallsyms without these changes and
> kallsyms with these changes on the same input System.map results
> in identical object files.
>
> This change reduces the time for an incremental kernel rebuild
> (touch fs/ioctl.c, then re-run make) from 27.7s to 24.1s (medians
> over 16 runs each) on my machine - saving around 3.6 seconds.




This reverts bea5b74504742f1b51b815bcaf9a70bddbc49ce3

Somebody might struggle with debugging again, but I am not sure.

Arnd?



If the effort were "I invented a way to do kallsyms in
one pass instead of three", it would be so much more attractive.


I am not so sure if this grain of the optimization is exciting,
but I confirmed that a few seconds were saved for the defconfig.

I am neutral about this.



For the debugging purpose, perhaps we can add --debug option
in order to leave the possibility for
outputting the full assembly as comments.



>
> Signed-off-by: Jann Horn <jannh@google.com>
> ---
>  scripts/kallsyms.c      | 196 ++++++++++++++++++++++++++++++++--------
>  scripts/link-vmlinux.sh |   5 +-
>  2 files changed, 159 insertions(+), 42 deletions(-)
>
> diff --git a/scripts/kallsyms.c b/scripts/kallsyms.c
> index f35be95adfbe..ef03d723aded 100644
> --- a/scripts/kallsyms.c
> +++ b/scripts/kallsyms.c
> @@ -27,6 +27,10 @@
>  #include <string.h>
>  #include <ctype.h>
>  #include <limits.h>
> +#include <endian.h>
> +#include <elf.h>
> +#include <fcntl.h>
> +#include <unistd.h>
>
>  #define ARRAY_SIZE(arr) (sizeof(arr) / sizeof(arr[0]))
>
> @@ -75,7 +79,7 @@ static unsigned char best_table_len[256];
>  static void usage(void)
>  {
>         fprintf(stderr, "Usage: kallsyms [--all-symbols] [--absolute-percpu] "
> -                       "[--lto-clang] in.map > out.S\n");
> +                       "[--lto-clang] in in.map out.S out.bin\n");
>         exit(1);
>  }
>
> @@ -290,20 +294,57 @@ static void read_map(const char *in)
>         fclose(fp);
>  }
>
> +static bool is_64bit, is_little_endian;
> +static char *asm_path, *bin_path;
> +static FILE *asm_file, *bin_file;
> +static size_t bin_offset, bin_included;
> +
>  static void output_label(const char *label)
>  {
> -       printf(".globl %s\n", label);
> -       printf("\tALGN\n");
> -       printf("%s:\n", label);
> +       fprintf(asm_file, ".globl %s\n", label);
> +       fprintf(asm_file, "\tALGN\n");
> +       fprintf(asm_file, "%s:\n", label);
>  }
>
>  /* Provide proper symbols relocatability by their '_text' relativeness. */
>  static void output_address(unsigned long long addr)
>  {
>         if (_text <= addr)
> -               printf("\tPTR\t_text + %#llx\n", addr - _text);
> +               fprintf(asm_file, "\tPTR\t_text + %#llx\n", addr - _text);
>         else
> -               printf("\tPTR\t_text - %#llx\n", _text - addr);
> +               fprintf(asm_file, "\tPTR\t_text - %#llx\n", _text - addr);
> +}
> +
> +/*
> + * Include all data that has been written into bin_file since the last call to
> + * this function.
> + */
> +static void include_bin_data(void)
> +{
> +       fprintf(asm_file, ".incbin \"%s\", %zu, %zu\n", bin_path,
> +               bin_included, bin_offset - bin_included);
> +       bin_included = bin_offset;
> +}
> +
> +static void output_bin_data(const void *data, size_t len)
> +{
> +       if (fwrite(data, 1, len, bin_file) != len) {
> +               fprintf(stderr, "kallsyms: unable to write output\n");
> +               exit(EXIT_FAILURE);
> +       }
> +       bin_offset += len;
> +}
> +static void output_bin_u32(uint32_t value)
> +{
> +       uint32_t encoded = is_little_endian ? htole32(value) : htobe32(value);
> +
> +       output_bin_data(&encoded, sizeof(encoded));
> +}
> +static void output_bin_u16(uint16_t value)



You might want to insert a blank line
between functions.





> +{
> +       uint16_t encoded = is_little_endian ? htole16(value) : htobe16(value);
> +
> +       output_bin_data(&encoded, sizeof(encoded));
>  }
>
>  /* uncompress a compressed symbol. When this function is called, the best table
> @@ -384,25 +425,36 @@ static void sort_symbols_by_name(void)
>
>  static void write_src(void)
>  {
> -       unsigned int i, k, off;
> +       unsigned int i, off;
>         unsigned int best_idx[256];
>         unsigned int *markers;
>         char buf[KSYM_NAME_LEN];
>
> -       printf("#include <asm/bitsperlong.h>\n");
> -       printf("#if BITS_PER_LONG == 64\n");
> -       printf("#define PTR .quad\n");
> -       printf("#define ALGN .balign 8\n");
> -       printf("#else\n");
> -       printf("#define PTR .long\n");
> -       printf("#define ALGN .balign 4\n");
> -       printf("#endif\n");
> +       asm_file = fopen(asm_path, "w");
> +       if (!asm_file) {
> +               perror("unable to open asm output");
> +               exit(EXIT_FAILURE);
> +       }
> +       bin_file = fopen(bin_path, "w");
> +       if (!bin_file) {
> +               perror("unable to open bin output");
> +               exit(EXIT_FAILURE);
> +       }
> +
> +       fprintf(asm_file, "#include <asm/bitsperlong.h>\n");
> +       fprintf(asm_file, "#if BITS_PER_LONG == 64\n");
> +       fprintf(asm_file, "#define PTR .quad\n");
> +       fprintf(asm_file, "#define ALGN .balign 8\n");
> +       fprintf(asm_file, "#else\n");
> +       fprintf(asm_file, "#define PTR .long\n");
> +       fprintf(asm_file, "#define ALGN .balign 4\n");
> +       fprintf(asm_file, "#endif\n");




With this patch, this tool will need to be aware
whether the target is 64-bit or not.

There is no point to include <asm/bitsperlong.h>
to check BITS_PER_LONG.








>
> -       printf("\t.section .rodata, \"a\"\n");
> +       fprintf(asm_file, "\t.section .rodata, \"a\"\n");
>
>         output_label("kallsyms_num_syms");
> -       printf("\t.long\t%u\n", table_cnt);
> -       printf("\n");
> +       fprintf(asm_file, "\t.long\t%u\n", table_cnt);
> +       fprintf(asm_file, "\n");
>
>         /* table of offset markers, that give the offset in the compressed stream
>          * every 256 symbols */
> @@ -437,20 +489,23 @@ static void write_src(void)
>                 /* Encode length with ULEB128. */
>                 if (table[i]->len <= 0x7F) {
>                         /* Most symbols use a single byte for the length. */
> -                       printf("\t.byte 0x%02x", table[i]->len);
> +                       unsigned char len_encoded[1] = { table[i]->len };
> +
> +                       output_bin_data(len_encoded, sizeof(len_encoded));
>                         off += table[i]->len + 1;
>                 } else {
>                         /* "Big" symbols use two bytes. */
> -                       printf("\t.byte 0x%02x, 0x%02x",
> +                       unsigned char len_encoded[2] = {
>                                 (table[i]->len & 0x7F) | 0x80,
> -                               (table[i]->len >> 7) & 0x7F);
> +                               (table[i]->len >> 7) & 0x7F
> +                       };
> +
> +                       output_bin_data(len_encoded, sizeof(len_encoded));
>                         off += table[i]->len + 2;
>                 }
> -               for (k = 0; k < table[i]->len; k++)
> -                       printf(", 0x%02x", table[i]->sym[k]);
> -               printf("\n");
> +               output_bin_data(table[i]->sym, table[i]->len);
>         }
> -       printf("\n");
> +       include_bin_data();
>
>         /*
>          * Now that we wrote out the compressed symbol names, restore the
> @@ -463,8 +518,8 @@ static void write_src(void)
>
>         output_label("kallsyms_markers");
>         for (i = 0; i < ((table_cnt + 255) >> 8); i++)
> -               printf("\t.long\t%u\n", markers[i]);
> -       printf("\n");
> +               output_bin_u32(markers[i]);
> +       include_bin_data();
>
>         free(markers);
>
> @@ -473,15 +528,15 @@ static void write_src(void)
>         for (i = 0; i < 256; i++) {
>                 best_idx[i] = off;
>                 expand_symbol(best_table[i], best_table_len[i], buf);
> -               printf("\t.asciz\t\"%s\"\n", buf);
> +               output_bin_data(buf, strlen(buf)+1);
>                 off += strlen(buf) + 1;
>         }
> -       printf("\n");
> +       include_bin_data();
>
>         output_label("kallsyms_token_index");
>         for (i = 0; i < 256; i++)
> -               printf("\t.short\t%d\n", best_idx[i]);
> -       printf("\n");
> +               output_bin_u16(best_idx[i]);
> +       include_bin_data();
>
>         output_label("kallsyms_offsets");
>
> @@ -513,13 +568,12 @@ static void write_src(void)
>                                 table[i]->addr);
>                         exit(EXIT_FAILURE);
>                 }
> -               printf("\t.long\t%#x    /* %s */\n", (int)offset, table[i]->sym);
> +               output_bin_u32((uint32_t)offset);
>         }
> -       printf("\n");
> +       include_bin_data();
>
>         output_label("kallsyms_relative_base");
>         output_address(relative_base);
> -       printf("\n");
>
>         if (lto_clang)
>                 for (i = 0; i < table_cnt; i++)
> @@ -527,12 +581,24 @@ static void write_src(void)
>
>         sort_symbols_by_name();
>         output_label("kallsyms_seqs_of_names");
> -       for (i = 0; i < table_cnt; i++)
> -               printf("\t.byte 0x%02x, 0x%02x, 0x%02x\n",
> +       for (i = 0; i < table_cnt; i++) {
> +               unsigned char seq_encoded[3] = {
>                         (unsigned char)(table[i]->seq >> 16),
>                         (unsigned char)(table[i]->seq >> 8),
> -                       (unsigned char)(table[i]->seq >> 0));
> -       printf("\n");
> +                       (unsigned char)(table[i]->seq >> 0)
> +               };
> +               output_bin_data(seq_encoded, sizeof(seq_encoded));
> +       }
> +       include_bin_data();
> +
> +       if (fclose(asm_file)) {
> +               perror("unable to write to asm output");
> +               exit(EXIT_FAILURE);
> +       }
> +       if (fclose(bin_file)) {
> +               perror("unable to write to bin output");
> +               exit(EXIT_FAILURE);
> +       }
>  }
>
>
> @@ -795,6 +861,52 @@ static void record_relative_base(void)
>                 }
>  }
>
> +static void get_target_data_types(const char *elf_path)
> +{
> +       int elf_fd = open(elf_path, O_RDONLY);
> +       unsigned char elf_ident[EI_NIDENT];
> +
> +       if (elf_fd == -1) {
> +               perror("open ELF");
> +               exit(EXIT_FAILURE);
> +       }
> +       if (read(elf_fd, elf_ident, sizeof(elf_ident)) != sizeof(elf_ident)) {
> +               perror("read ELF header");
> +               exit(EXIT_FAILURE);
> +       }
> +       close(elf_fd);
> +
> +       if (elf_ident[EI_MAG0] != ELFMAG0 || elf_ident[EI_MAG1] != ELFMAG1 ||
> +           elf_ident[EI_MAG2] != ELFMAG2 || elf_ident[EI_MAG3] != ELFMAG3) {
> +               fprintf(stderr, "kallsyms: input ELF has invalid header\n");
> +               exit(EXIT_FAILURE);
> +       }
> +
> +       switch (elf_ident[EI_CLASS]) {
> +       case ELFCLASS32:
> +               is_64bit = false;
> +               break;
> +       case ELFCLASS64:
> +               is_64bit = true;
> +               break;
> +       default:
> +               fprintf(stderr, "kallsyms: input ELF has invalid bitness\n");
> +               exit(EXIT_FAILURE);
> +       }
> +
> +       switch (elf_ident[EI_DATA]) {
> +       case ELFDATA2LSB:
> +               is_little_endian = true;
> +               break;
> +       case ELFDATA2MSB:
> +               is_little_endian = false;
> +               break;
> +       default:
> +               fprintf(stderr, "kallsyms: input ELF has invalid endianness\n");
> +               exit(EXIT_FAILURE);
> +       }
> +}
> +
>  int main(int argc, char **argv)
>  {
>         while (1) {
> @@ -813,10 +925,14 @@ int main(int argc, char **argv)
>                         usage();
>         }
>
> -       if (optind >= argc)
> +       if (optind+4 != argc)
>                 usage();
> +       asm_path = argv[optind+2];
> +       bin_path = argv[optind+3];
> +
> +       get_target_data_types(argv[optind]);
>
> -       read_map(argv[optind]);
> +       read_map(argv[optind+1]);
>         shrink_table();
>         if (absolute_percpu)
>                 make_percpus_absolute();
> diff --git a/scripts/link-vmlinux.sh b/scripts/link-vmlinux.sh
> index 5127371d3393..1b5ff33a2d4a 100755
> --- a/scripts/link-vmlinux.sh
> +++ b/scripts/link-vmlinux.sh
> @@ -162,7 +162,7 @@ kallsyms()
>         fi
>
>         info KSYMS ${2}
> -       scripts/kallsyms ${kallsymopt} ${1} > ${2}
> +       scripts/kallsyms ${kallsymopt} ${1} ${2} ${3} ${4}
>  }
>
>  # Perform one step in kallsyms generation, including temporary linking of
> @@ -173,10 +173,11 @@ kallsyms_step()
>         kallsyms_vmlinux=.tmp_vmlinux.kallsyms${1}
>         kallsymso=${kallsyms_vmlinux}.o
>         kallsyms_S=${kallsyms_vmlinux}.S
> +       kallsyms_bin=${kallsyms_vmlinux}.bin
>
>         vmlinux_link ${kallsyms_vmlinux} "${kallsymso_prev}" ${btf_vmlinux_bin_o}
>         mksysmap ${kallsyms_vmlinux} ${kallsyms_vmlinux}.syms ${kallsymso_prev}
> -       kallsyms ${kallsyms_vmlinux}.syms ${kallsyms_S}
> +       kallsyms ${kallsyms_vmlinux} ${kallsyms_vmlinux}.syms ${kallsyms_S} ${kallsyms_bin}
>
>         info AS ${kallsyms_S}
>         ${CC} ${NOSTDINC_FLAGS} ${LINUXINCLUDE} ${KBUILD_CPPFLAGS} \
> --
> 2.44.0.rc0.258.g7320e95886-goog
>
Arnd Bergmann Feb. 22, 2024, 8:20 a.m. UTC | #3
On Thu, Feb 22, 2024, at 05:06, Masahiro Yamada wrote:
> On Thu, Feb 22, 2024 at 5:27 AM Jann Horn <jannh@google.com> wrote:
>> This change reduces the time for an incremental kernel rebuild
>> (touch fs/ioctl.c, then re-run make) from 27.7s to 24.1s (medians
>> over 16 runs each) on my machine - saving around 3.6 seconds.

Nice!

...

> This reverts bea5b74504742f1b51b815bcaf9a70bddbc49ce3
>
> Somebody might struggle with debugging again, but I am not sure.
>
> Arnd?

So far, I have not needed it again, but it's only been a year.

> If the effort were "I invented a way to do kallsyms in
> one pass instead of three", it would be so much more attractive.
>
>
> I am not so sure if this grain of the optimization is exciting,
> but I confirmed that a few seconds were saved for the defconfig.
>
> I am neutral about this.

I think the time savings are worth it, especially since this
is going to help anyone building on large machines where
the compile stage is already optimized a lot but the link
stage is limited by single-thread performance.

      Arnd
Jann Horn Feb. 22, 2024, 11:20 a.m. UTC | #4
On Thu, Feb 22, 2024 at 5:07 AM Masahiro Yamada <masahiroy@kernel.org> wrote:
> On Thu, Feb 22, 2024 at 5:27 AM Jann Horn <jannh@google.com> wrote:
> >
> > Currently, kallsyms builds a big assembly file (~19M with a normal
> > kernel config), and then the assembler has to turn that big assembly
> > file back into binary data, which takes around a second per kallsyms
> > invocation. (Normally there are two kallsyms invocations per build.)
> >
> > It is much faster to instead directly output binary data, which can
> > be imported in an assembly file using ".incbin". This is also the
> > approach taken by arch/x86/boot/compressed/mkpiggy.c.
>
>
> Yes, that is a sensible case because it just wraps the binary
> without any modification.
>
>
>
>
> > So this patch switches kallsyms to that approach.
> >
> > A complication with this is that the endianness of numbers between
> > host and target might not match (for example, when cross-compiling);
> > and there seems to be no kconfig symbol that tells us what endianness
> > the target has.
>
>
>
> CONFIG_CPU_BIG_ENDIAN is it.
>
>
>
> You could do this:
>
> if is_enabled CONFIG_CPU_BIG_ENDIAN; then
>         kallsymopt="${kallsymopt} --big-endian"
> fi
>
> if is_enabled CONFIG_64BIT; then
>         kallsymopt="${kallsymopt} --64bit"
> fi

Aah, nice, thanks, I searched for endianness kconfig flags but somehow
missed that one.

Though actually, I think further optimizations might make it necessary
to directly operate on ELF files anyway, in which case it would
probably be easier to keep using the ELF header...

> > So pass the path to the intermediate vmlinux ELF file to the kallsyms
> > tool, and let it parse the ELF header to figure out the target's
> > endianness.
> >
> > I have verified that running kallsyms without these changes and
> > kallsyms with these changes on the same input System.map results
> > in identical object files.
> >
> > This change reduces the time for an incremental kernel rebuild
> > (touch fs/ioctl.c, then re-run make) from 27.7s to 24.1s (medians
> > over 16 runs each) on my machine - saving around 3.6 seconds.
>
>
>
>
> This reverts bea5b74504742f1b51b815bcaf9a70bddbc49ce3
>
> Somebody might struggle with debugging again, but I am not sure.
>
> Arnd?
>
>
>
> If the effort were "I invented a way to do kallsyms in
> one pass instead of three", it would be so much more attractive.

Actually, I was chatting with someone about this yesterday, and I
think I have an idea on how to get rid of two link steps... I might
try out some stuff and then come back with another version of this
series afterwards.

> I am not so sure if this grain of the optimization is exciting,
> but I confirmed that a few seconds were saved for the defconfig.
>
> I am neutral about this.
>
>
>
> For the debugging purpose, perhaps we can add --debug option
> in order to leave the possibility for
> outputting the full assembly as comments.

Hm, maybe... though that also involves a lot of duplicate code...
Jann Horn Feb. 22, 2024, 2:20 p.m. UTC | #5
On Thu, Feb 22, 2024 at 12:20 PM Jann Horn <jannh@google.com> wrote:
> On Thu, Feb 22, 2024 at 5:07 AM Masahiro Yamada <masahiroy@kernel.org> wrote:
> > On Thu, Feb 22, 2024 at 5:27 AM Jann Horn <jannh@google.com> wrote:
> > >
> > > Currently, kallsyms builds a big assembly file (~19M with a normal
> > > kernel config), and then the assembler has to turn that big assembly
> > > file back into binary data, which takes around a second per kallsyms
> > > invocation. (Normally there are two kallsyms invocations per build.)
> > >
> > > It is much faster to instead directly output binary data, which can
> > > be imported in an assembly file using ".incbin". This is also the
> > > approach taken by arch/x86/boot/compressed/mkpiggy.c.
> >
> >
> > Yes, that is a sensible case because it just wraps the binary
> > without any modification.
> >
> >
> >
> >
> > > So this patch switches kallsyms to that approach.
> > >
> > > A complication with this is that the endianness of numbers between
> > > host and target might not match (for example, when cross-compiling);
> > > and there seems to be no kconfig symbol that tells us what endianness
> > > the target has.
> >
> >
> >
> > CONFIG_CPU_BIG_ENDIAN is it.
> >
> >
> >
> > You could do this:
> >
> > if is_enabled CONFIG_CPU_BIG_ENDIAN; then
> >         kallsymopt="${kallsymopt} --big-endian"
> > fi
> >
> > if is_enabled CONFIG_64BIT; then
> >         kallsymopt="${kallsymopt} --64bit"
> > fi
>
> Aah, nice, thanks, I searched for endianness kconfig flags but somehow
> missed that one.
>
> Though actually, I think further optimizations might make it necessary
> to directly operate on ELF files anyway, in which case it would
> probably be easier to keep using the ELF header...
>
> > > So pass the path to the intermediate vmlinux ELF file to the kallsyms
> > > tool, and let it parse the ELF header to figure out the target's
> > > endianness.
> > >
> > > I have verified that running kallsyms without these changes and
> > > kallsyms with these changes on the same input System.map results
> > > in identical object files.
> > >
> > > This change reduces the time for an incremental kernel rebuild
> > > (touch fs/ioctl.c, then re-run make) from 27.7s to 24.1s (medians
> > > over 16 runs each) on my machine - saving around 3.6 seconds.
> >
> >
> >
> >
> > This reverts bea5b74504742f1b51b815bcaf9a70bddbc49ce3
> >
> > Somebody might struggle with debugging again, but I am not sure.
> >
> > Arnd?
> >
> >
> >
> > If the effort were "I invented a way to do kallsyms in
> > one pass instead of three", it would be so much more attractive.
>
> Actually, I was chatting with someone about this yesterday, and I
> think I have an idea on how to get rid of two link steps... I might
> try out some stuff and then come back with another version of this
> series afterwards.

I think basically we could change kallsyms so that on the second run,
it checks if the kallsyms layout is the same as on the first run, and
if yes, directly overwrite the relevant part of vmlinux. (And adjust
the relative_base.) That would save us the final link... does that
sound like a reasonable idea?

I don't really have any good ideas for saving more than that, given
that we want to squeeze the kallsyms in between the data and bss
sections, so we can't just append it at the end of vmlinux... we could
get the symbol list from vmlinux.o instead of linking
".tmp_vmlinux.kallsyms1", but the comments in link-vmlinux.sh say that
extra linker-generated symbols might appear, and I guess we probably
don't want to miss those...
Masahiro Yamada Feb. 23, 2024, 4:26 a.m. UTC | #6
On Thu, Feb 22, 2024 at 11:21 PM Jann Horn <jannh@google.com> wrote:
>
> On Thu, Feb 22, 2024 at 12:20 PM Jann Horn <jannh@google.com> wrote:
> > On Thu, Feb 22, 2024 at 5:07 AM Masahiro Yamada <masahiroy@kernel.org> wrote:
> > > On Thu, Feb 22, 2024 at 5:27 AM Jann Horn <jannh@google.com> wrote:
> > > >
> > > > Currently, kallsyms builds a big assembly file (~19M with a normal
> > > > kernel config), and then the assembler has to turn that big assembly
> > > > file back into binary data, which takes around a second per kallsyms
> > > > invocation. (Normally there are two kallsyms invocations per build.)
> > > >
> > > > It is much faster to instead directly output binary data, which can
> > > > be imported in an assembly file using ".incbin". This is also the
> > > > approach taken by arch/x86/boot/compressed/mkpiggy.c.
> > >
> > >
> > > Yes, that is a sensible case because it just wraps the binary
> > > without any modification.
> > >
> > >
> > >
> > >
> > > > So this patch switches kallsyms to that approach.
> > > >
> > > > A complication with this is that the endianness of numbers between
> > > > host and target might not match (for example, when cross-compiling);
> > > > and there seems to be no kconfig symbol that tells us what endianness
> > > > the target has.
> > >
> > >
> > >
> > > CONFIG_CPU_BIG_ENDIAN is it.
> > >
> > >
> > >
> > > You could do this:
> > >
> > > if is_enabled CONFIG_CPU_BIG_ENDIAN; then
> > >         kallsymopt="${kallsymopt} --big-endian"
> > > fi
> > >
> > > if is_enabled CONFIG_64BIT; then
> > >         kallsymopt="${kallsymopt} --64bit"
> > > fi
> >
> > Aah, nice, thanks, I searched for endianness kconfig flags but somehow
> > missed that one.
> >
> > Though actually, I think further optimizations might make it necessary
> > to directly operate on ELF files anyway, in which case it would
> > probably be easier to keep using the ELF header...
> >
> > > > So pass the path to the intermediate vmlinux ELF file to the kallsyms
> > > > tool, and let it parse the ELF header to figure out the target's
> > > > endianness.
> > > >
> > > > I have verified that running kallsyms without these changes and
> > > > kallsyms with these changes on the same input System.map results
> > > > in identical object files.
> > > >
> > > > This change reduces the time for an incremental kernel rebuild
> > > > (touch fs/ioctl.c, then re-run make) from 27.7s to 24.1s (medians
> > > > over 16 runs each) on my machine - saving around 3.6 seconds.
> > >
> > >
> > >
> > >
> > > This reverts bea5b74504742f1b51b815bcaf9a70bddbc49ce3
> > >
> > > Somebody might struggle with debugging again, but I am not sure.
> > >
> > > Arnd?
> > >
> > >
> > >
> > > If the effort were "I invented a way to do kallsyms in
> > > one pass instead of three", it would be so much more attractive.
> >
> > Actually, I was chatting with someone about this yesterday, and I
> > think I have an idea on how to get rid of two link steps... I might
> > try out some stuff and then come back with another version of this
> > series afterwards.
>
> I think basically we could change kallsyms so that on the second run,
> it checks if the kallsyms layout is the same as on the first run, and
> if yes, directly overwrite the relevant part of vmlinux. (And adjust
> the relative_base.) That would save us the final link... does that
> sound like a reasonable idea?


I do not know how we can save the final link.

Inserting the kallsyms data into the .rodata section
would change the address of all symbols that come after.
Only the linker can sort out the address change.


>
> I don't really have any good ideas for saving more than that, given
> that we want to squeeze the kallsyms in between the data and bss
> sections, so we can't just append it at the end of vmlinux... we could
> get the symbol list from vmlinux.o instead of linking
> ".tmp_vmlinux.kallsyms1", but the comments in link-vmlinux.sh say that
> extra linker-generated symbols might appear, and I guess we probably
> don't want to miss those...


I knew it was not trivial.
If you do not have an idea, you do not need to change it.
diff mbox series

Patch

diff --git a/scripts/kallsyms.c b/scripts/kallsyms.c
index f35be95adfbe..ef03d723aded 100644
--- a/scripts/kallsyms.c
+++ b/scripts/kallsyms.c
@@ -27,6 +27,10 @@ 
 #include <string.h>
 #include <ctype.h>
 #include <limits.h>
+#include <endian.h>
+#include <elf.h>
+#include <fcntl.h>
+#include <unistd.h>
 
 #define ARRAY_SIZE(arr) (sizeof(arr) / sizeof(arr[0]))
 
@@ -75,7 +79,7 @@  static unsigned char best_table_len[256];
 static void usage(void)
 {
 	fprintf(stderr, "Usage: kallsyms [--all-symbols] [--absolute-percpu] "
-			"[--lto-clang] in.map > out.S\n");
+			"[--lto-clang] in in.map out.S out.bin\n");
 	exit(1);
 }
 
@@ -290,20 +294,57 @@  static void read_map(const char *in)
 	fclose(fp);
 }
 
+static bool is_64bit, is_little_endian;
+static char *asm_path, *bin_path;
+static FILE *asm_file, *bin_file;
+static size_t bin_offset, bin_included;
+
 static void output_label(const char *label)
 {
-	printf(".globl %s\n", label);
-	printf("\tALGN\n");
-	printf("%s:\n", label);
+	fprintf(asm_file, ".globl %s\n", label);
+	fprintf(asm_file, "\tALGN\n");
+	fprintf(asm_file, "%s:\n", label);
 }
 
 /* Provide proper symbols relocatability by their '_text' relativeness. */
 static void output_address(unsigned long long addr)
 {
 	if (_text <= addr)
-		printf("\tPTR\t_text + %#llx\n", addr - _text);
+		fprintf(asm_file, "\tPTR\t_text + %#llx\n", addr - _text);
 	else
-		printf("\tPTR\t_text - %#llx\n", _text - addr);
+		fprintf(asm_file, "\tPTR\t_text - %#llx\n", _text - addr);
+}
+
+/*
+ * Include all data that has been written into bin_file since the last call to
+ * this function.
+ */
+static void include_bin_data(void)
+{
+	fprintf(asm_file, ".incbin \"%s\", %zu, %zu\n", bin_path,
+		bin_included, bin_offset - bin_included);
+	bin_included = bin_offset;
+}
+
+static void output_bin_data(const void *data, size_t len)
+{
+	if (fwrite(data, 1, len, bin_file) != len) {
+		fprintf(stderr, "kallsyms: unable to write output\n");
+		exit(EXIT_FAILURE);
+	}
+	bin_offset += len;
+}
+static void output_bin_u32(uint32_t value)
+{
+	uint32_t encoded = is_little_endian ? htole32(value) : htobe32(value);
+
+	output_bin_data(&encoded, sizeof(encoded));
+}
+static void output_bin_u16(uint16_t value)
+{
+	uint16_t encoded = is_little_endian ? htole16(value) : htobe16(value);
+
+	output_bin_data(&encoded, sizeof(encoded));
 }
 
 /* uncompress a compressed symbol. When this function is called, the best table
@@ -384,25 +425,36 @@  static void sort_symbols_by_name(void)
 
 static void write_src(void)
 {
-	unsigned int i, k, off;
+	unsigned int i, off;
 	unsigned int best_idx[256];
 	unsigned int *markers;
 	char buf[KSYM_NAME_LEN];
 
-	printf("#include <asm/bitsperlong.h>\n");
-	printf("#if BITS_PER_LONG == 64\n");
-	printf("#define PTR .quad\n");
-	printf("#define ALGN .balign 8\n");
-	printf("#else\n");
-	printf("#define PTR .long\n");
-	printf("#define ALGN .balign 4\n");
-	printf("#endif\n");
+	asm_file = fopen(asm_path, "w");
+	if (!asm_file) {
+		perror("unable to open asm output");
+		exit(EXIT_FAILURE);
+	}
+	bin_file = fopen(bin_path, "w");
+	if (!bin_file) {
+		perror("unable to open bin output");
+		exit(EXIT_FAILURE);
+	}
+
+	fprintf(asm_file, "#include <asm/bitsperlong.h>\n");
+	fprintf(asm_file, "#if BITS_PER_LONG == 64\n");
+	fprintf(asm_file, "#define PTR .quad\n");
+	fprintf(asm_file, "#define ALGN .balign 8\n");
+	fprintf(asm_file, "#else\n");
+	fprintf(asm_file, "#define PTR .long\n");
+	fprintf(asm_file, "#define ALGN .balign 4\n");
+	fprintf(asm_file, "#endif\n");
 
-	printf("\t.section .rodata, \"a\"\n");
+	fprintf(asm_file, "\t.section .rodata, \"a\"\n");
 
 	output_label("kallsyms_num_syms");
-	printf("\t.long\t%u\n", table_cnt);
-	printf("\n");
+	fprintf(asm_file, "\t.long\t%u\n", table_cnt);
+	fprintf(asm_file, "\n");
 
 	/* table of offset markers, that give the offset in the compressed stream
 	 * every 256 symbols */
@@ -437,20 +489,23 @@  static void write_src(void)
 		/* Encode length with ULEB128. */
 		if (table[i]->len <= 0x7F) {
 			/* Most symbols use a single byte for the length. */
-			printf("\t.byte 0x%02x", table[i]->len);
+			unsigned char len_encoded[1] = { table[i]->len };
+
+			output_bin_data(len_encoded, sizeof(len_encoded));
 			off += table[i]->len + 1;
 		} else {
 			/* "Big" symbols use two bytes. */
-			printf("\t.byte 0x%02x, 0x%02x",
+			unsigned char len_encoded[2] = {
 				(table[i]->len & 0x7F) | 0x80,
-				(table[i]->len >> 7) & 0x7F);
+				(table[i]->len >> 7) & 0x7F
+			};
+
+			output_bin_data(len_encoded, sizeof(len_encoded));
 			off += table[i]->len + 2;
 		}
-		for (k = 0; k < table[i]->len; k++)
-			printf(", 0x%02x", table[i]->sym[k]);
-		printf("\n");
+		output_bin_data(table[i]->sym, table[i]->len);
 	}
-	printf("\n");
+	include_bin_data();
 
 	/*
 	 * Now that we wrote out the compressed symbol names, restore the
@@ -463,8 +518,8 @@  static void write_src(void)
 
 	output_label("kallsyms_markers");
 	for (i = 0; i < ((table_cnt + 255) >> 8); i++)
-		printf("\t.long\t%u\n", markers[i]);
-	printf("\n");
+		output_bin_u32(markers[i]);
+	include_bin_data();
 
 	free(markers);
 
@@ -473,15 +528,15 @@  static void write_src(void)
 	for (i = 0; i < 256; i++) {
 		best_idx[i] = off;
 		expand_symbol(best_table[i], best_table_len[i], buf);
-		printf("\t.asciz\t\"%s\"\n", buf);
+		output_bin_data(buf, strlen(buf)+1);
 		off += strlen(buf) + 1;
 	}
-	printf("\n");
+	include_bin_data();
 
 	output_label("kallsyms_token_index");
 	for (i = 0; i < 256; i++)
-		printf("\t.short\t%d\n", best_idx[i]);
-	printf("\n");
+		output_bin_u16(best_idx[i]);
+	include_bin_data();
 
 	output_label("kallsyms_offsets");
 
@@ -513,13 +568,12 @@  static void write_src(void)
 				table[i]->addr);
 			exit(EXIT_FAILURE);
 		}
-		printf("\t.long\t%#x	/* %s */\n", (int)offset, table[i]->sym);
+		output_bin_u32((uint32_t)offset);
 	}
-	printf("\n");
+	include_bin_data();
 
 	output_label("kallsyms_relative_base");
 	output_address(relative_base);
-	printf("\n");
 
 	if (lto_clang)
 		for (i = 0; i < table_cnt; i++)
@@ -527,12 +581,24 @@  static void write_src(void)
 
 	sort_symbols_by_name();
 	output_label("kallsyms_seqs_of_names");
-	for (i = 0; i < table_cnt; i++)
-		printf("\t.byte 0x%02x, 0x%02x, 0x%02x\n",
+	for (i = 0; i < table_cnt; i++) {
+		unsigned char seq_encoded[3] = {
 			(unsigned char)(table[i]->seq >> 16),
 			(unsigned char)(table[i]->seq >> 8),
-			(unsigned char)(table[i]->seq >> 0));
-	printf("\n");
+			(unsigned char)(table[i]->seq >> 0)
+		};
+		output_bin_data(seq_encoded, sizeof(seq_encoded));
+	}
+	include_bin_data();
+
+	if (fclose(asm_file)) {
+		perror("unable to write to asm output");
+		exit(EXIT_FAILURE);
+	}
+	if (fclose(bin_file)) {
+		perror("unable to write to bin output");
+		exit(EXIT_FAILURE);
+	}
 }
 
 
@@ -795,6 +861,52 @@  static void record_relative_base(void)
 		}
 }
 
+static void get_target_data_types(const char *elf_path)
+{
+	int elf_fd = open(elf_path, O_RDONLY);
+	unsigned char elf_ident[EI_NIDENT];
+
+	if (elf_fd == -1) {
+		perror("open ELF");
+		exit(EXIT_FAILURE);
+	}
+	if (read(elf_fd, elf_ident, sizeof(elf_ident)) != sizeof(elf_ident)) {
+		perror("read ELF header");
+		exit(EXIT_FAILURE);
+	}
+	close(elf_fd);
+
+	if (elf_ident[EI_MAG0] != ELFMAG0 || elf_ident[EI_MAG1] != ELFMAG1 ||
+	    elf_ident[EI_MAG2] != ELFMAG2 || elf_ident[EI_MAG3] != ELFMAG3) {
+		fprintf(stderr, "kallsyms: input ELF has invalid header\n");
+		exit(EXIT_FAILURE);
+	}
+
+	switch (elf_ident[EI_CLASS]) {
+	case ELFCLASS32:
+		is_64bit = false;
+		break;
+	case ELFCLASS64:
+		is_64bit = true;
+		break;
+	default:
+		fprintf(stderr, "kallsyms: input ELF has invalid bitness\n");
+		exit(EXIT_FAILURE);
+	}
+
+	switch (elf_ident[EI_DATA]) {
+	case ELFDATA2LSB:
+		is_little_endian = true;
+		break;
+	case ELFDATA2MSB:
+		is_little_endian = false;
+		break;
+	default:
+		fprintf(stderr, "kallsyms: input ELF has invalid endianness\n");
+		exit(EXIT_FAILURE);
+	}
+}
+
 int main(int argc, char **argv)
 {
 	while (1) {
@@ -813,10 +925,14 @@  int main(int argc, char **argv)
 			usage();
 	}
 
-	if (optind >= argc)
+	if (optind+4 != argc)
 		usage();
+	asm_path = argv[optind+2];
+	bin_path = argv[optind+3];
+
+	get_target_data_types(argv[optind]);
 
-	read_map(argv[optind]);
+	read_map(argv[optind+1]);
 	shrink_table();
 	if (absolute_percpu)
 		make_percpus_absolute();
diff --git a/scripts/link-vmlinux.sh b/scripts/link-vmlinux.sh
index 5127371d3393..1b5ff33a2d4a 100755
--- a/scripts/link-vmlinux.sh
+++ b/scripts/link-vmlinux.sh
@@ -162,7 +162,7 @@  kallsyms()
 	fi
 
 	info KSYMS ${2}
-	scripts/kallsyms ${kallsymopt} ${1} > ${2}
+	scripts/kallsyms ${kallsymopt} ${1} ${2} ${3} ${4}
 }
 
 # Perform one step in kallsyms generation, including temporary linking of
@@ -173,10 +173,11 @@  kallsyms_step()
 	kallsyms_vmlinux=.tmp_vmlinux.kallsyms${1}
 	kallsymso=${kallsyms_vmlinux}.o
 	kallsyms_S=${kallsyms_vmlinux}.S
+	kallsyms_bin=${kallsyms_vmlinux}.bin
 
 	vmlinux_link ${kallsyms_vmlinux} "${kallsymso_prev}" ${btf_vmlinux_bin_o}
 	mksysmap ${kallsyms_vmlinux} ${kallsyms_vmlinux}.syms ${kallsymso_prev}
-	kallsyms ${kallsyms_vmlinux}.syms ${kallsyms_S}
+	kallsyms ${kallsyms_vmlinux} ${kallsyms_vmlinux}.syms ${kallsyms_S} ${kallsyms_bin}
 
 	info AS ${kallsyms_S}
 	${CC} ${NOSTDINC_FLAGS} ${LINUXINCLUDE} ${KBUILD_CPPFLAGS} \