[0/4] kbuild: build speed improvement of CONFIG_TRIM_UNUSED_KSYMS

Message ID 20210225160247.2959903-1-masahiroy@kernel.org (mailing list archive)

Message

Masahiro Yamada Feb. 25, 2021, 4:02 p.m. UTC
Now CONFIG_TRIM_UNUSED_KSYMS is revived, but Linus is still unhappy
about the build speed.

I re-implemented this feature, and the build-time cost is now
almost unnoticeable.

I hope this makes Linus happy.



Masahiro Yamada (4):
  kbuild: fix UNUSED_KSYMS_WHITELIST for Clang LTO
  export.h: make __ksymtab_strings per-symbol section
  kbuild: separate out vmlinux.lds generation
  kbuild: re-implement CONFIG_TRIM_UNUSED_KSYMS to make it work in
    one-pass

 Makefile                          | 34 ++++++------
 arch/alpha/kernel/Makefile        |  3 +-
 arch/arc/kernel/Makefile          |  3 +-
 arch/arm/kernel/Makefile          |  3 +-
 arch/arm64/kernel/Makefile        |  3 +-
 arch/csky/kernel/Makefile         |  3 +-
 arch/h8300/kernel/Makefile        |  2 +-
 arch/hexagon/kernel/Makefile      |  3 +-
 arch/ia64/kernel/Makefile         |  3 +-
 arch/m68k/kernel/Makefile         |  2 +-
 arch/microblaze/kernel/Makefile   |  3 +-
 arch/mips/kernel/Makefile         |  3 +-
 arch/nds32/kernel/Makefile        |  3 +-
 arch/nios2/kernel/Makefile        |  2 +-
 arch/openrisc/kernel/Makefile     |  3 +-
 arch/parisc/kernel/Makefile       |  3 +-
 arch/powerpc/kernel/Makefile      |  2 +-
 arch/riscv/kernel/Makefile        |  2 +-
 arch/s390/kernel/Makefile         |  3 +-
 arch/sh/kernel/Makefile           |  3 +-
 arch/sparc/kernel/Makefile        |  2 +-
 arch/um/kernel/Makefile           |  2 +-
 arch/x86/kernel/Makefile          |  2 +-
 arch/xtensa/kernel/Makefile       |  3 +-
 include/asm-generic/export.h      | 25 +--------
 include/asm-generic/vmlinux.lds.h | 29 +++++++++--
 include/linux/export.h            | 56 +++++---------------
 init/Kconfig                      |  4 +-
 scripts/Makefile.build            |  7 +--
 scripts/adjust_autoksyms.sh       | 76 ---------------------------
 scripts/gen-keep-ksyms.sh         | 86 +++++++++++++++++++++++++++++++
 scripts/gen_autoksyms.sh          | 55 --------------------
 scripts/gen_ksymdeps.sh           | 25 ---------
 scripts/lto-used-symbollist.txt   |  5 --
 scripts/module.lds.S              | 38 ++++++++++----
 35 files changed, 210 insertions(+), 291 deletions(-)
 delete mode 100755 scripts/adjust_autoksyms.sh
 create mode 100755 scripts/gen-keep-ksyms.sh
 delete mode 100755 scripts/gen_autoksyms.sh
 delete mode 100755 scripts/gen_ksymdeps.sh
 delete mode 100644 scripts/lto-used-symbollist.txt

Comments

Nicolas Pitre Feb. 25, 2021, 5:19 p.m. UTC | #1
On Fri, 26 Feb 2021, Masahiro Yamada wrote:

> 
> Now CONFIG_TRIM_UNUSED_KSYMS is revived, but Linus is still unhappy
> about the build speed.
> 
> I re-implemented this feature, and the build time cost is now
> almost unnoticeable level.
> 
> I hope this makes Linus happy.

:-)

I'm surprised to see that Linus is using this feature. When disabled 
(the default) this should have had no impact on the build time.

This feature provides a nice security advantage by significantly 
reducing the kernel input surface. And people are also using it to
better control what third-party vendors can and cannot do with a distro
kernel, etc. But that's not the reason why I implemented this feature in
the first place.

My primary goal was to efficiently reduce the kernel binary size using 
LTO even with kernel modules enabled. Each EXPORT_SYMBOL() created a 
symbol dependency that prevented LTO from optimizing out the related 
code even though only a tiny fraction of those exported symbols were needed.

The idea behind the recursion was to catch those cases where disabling 
an exported symbol within a module would optimize out references to more 
exported symbols that, in turn, could be disabled and possibly trigger 
yet more code elimination. There is no way that can be achieved without 
extra compiler passes in a recursive manner.
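
As an illustration of why this needs repeated passes, here is a toy shell
model of the iteration. It is only a sketch: the function name
trim_passes, the symbol names, and the "a>b" edge notation are all
hypothetical, and it does not reproduce the kernel's actual scripts. An
export survives a pass if a module needs it or if code behind another
still-kept export references it; passes repeat until nothing changes.

```shell
# trim_passes ALL NEEDS EDGES   (toy model, hypothetical names)
#   ALL:   space-separated list of every exported symbol
#   NEEDS: space-separated symbols that modules reference directly
#   EDGES: space-separated "a>b" pairs: code behind export a references b
trim_passes() {
    kept=" $1 "
    passes=0
    while :; do
        passes=$((passes + 1))
        next=" "
        for s in $kept; do
            needed=0
            # Directly needed by a module?
            case " $2 " in *" $s "*) needed=1 ;; esac
            # Referenced by code behind an export that is still kept?
            for e in $3; do
                from=${e%%>*}
                to=${e##*>}
                if [ "$to" = "$s" ]; then
                    case $kept in *" $from "*) needed=1 ;; esac
                fi
            done
            [ "$needed" -eq 1 ] && next="$next$s "
        done
        # Stop once a pass removes nothing (fixed point reached).
        [ "$next" = "$kept" ] && break
        kept=$next
    done
    echo "passes=$passes kept=$(echo $kept)"
}

# "c" is dropped in pass 1; only then can "d" be dropped in pass 2;
# pass 3 confirms that nothing else changes.
trim_passes "a b c d" "a" "a>b c>d"   # prints: passes=3 kept=a b
```

In this model each pass stands in for one rebuild, which is where the
extra build time comes from.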


Nicolas
Masahiro Yamada Feb. 25, 2021, 6:57 p.m. UTC | #2
On Fri, Feb 26, 2021 at 2:20 AM Nicolas Pitre <nico@fluxnic.net> wrote:
>
> On Fri, 26 Feb 2021, Masahiro Yamada wrote:
>
> >
> > Now CONFIG_TRIM_UNUSED_KSYMS is revived, but Linus is still unhappy
> > about the build speed.
> >
> > I re-implemented this feature, and the build time cost is now
> > almost unnoticeable level.
> >
> > I hope this makes Linus happy.
>
> :-)
>
> I'm surprised to see that Linus is using this feature. When disabled
> (the default) this should have had no impact on the build time.

Linus is not using this feature, but he does run build tests.
After pulling the module subsystem pull request in this merge window,
CONFIG_TRIM_UNUSED_KSYMS was enabled by allmodconfig.


> This feature provides a nice security advantage by significantly
> reducing the kernel input surface. And people are using that also to
> better what third party vendor can and cannot do with a distro kernel,
> etc. But that's not the reason why I implemented this feature in the
> first place.
>
> My primary goal was to efficiently reduce the kernel binary size using
> LTO even with kernel modules enabled.


Clang LTO landed in this MW.

Do you think it will reduce the kernel binary size?
No, the opposite.

CONFIG_LTO_CLANG cannot trim any code even if it
is obviously unused.
Hence, it never reduces the kernel binary size.
Rather, it produces a bigger kernel.

The reason is that Clang LTO was implemented against
relocatable ELF (vmlinux.o).

I pointed out this flaw in the review process, but
it was dismissed.

This is the main reason why I did not give any Ack
(but it was merged via Kees Cook's tree).


So, the help text of this option should be revised:

          This option allows for unused exported symbols to be dropped from
          the build. In turn, this provides the compiler more opportunities
          (especially when using LTO) for optimizing the code and reducing
          binary size.  This might have some security advantages as well.

Clang LTO does the opposite of what you expect.



> Each EXPORT_SYMBOL() created a
> symbol dependency that prevented LTO from optimizing out the related
> code even though a tiny fraction of those exported symbols were needed.
>
> The idea behind the recursion was to catch those cases where disabling
> an exported symbol within a module would optimize out references to more
> exported symbols that, in turn, could be disabled and possibly trigger
> yet more code elimination. There is no way that can be achieved without
> extra compiler passes in a recursive manner.

I do not understand.

Modules are relocatable ELF.
Clang LTO cannot eliminate any code.
GCC LTO does not work with relocatable ELF
in the first place.


Are you talking about a story in a perfect world?
But, I do not know how LTO can eliminate dead code
from relocatable ELF.




- Current implementation

  Clang LTO works against vmlinux.o,
  so it is completely useless for the purpose of
  eliminating dead code.

  So, this case is a don't-care:
  TRIM_UNUSED_KSYMS removes only the metadata of EXPORT_SYMBOL,
  and no further optimization happens anyway.


- What if Clang LTO had been implemented in the final link?
   (this means LTO runs 3 times if KALLSYMS_ALL is enabled)

  With proper linker script input with /DISCARD/,
  the meta-data of EXPORT_SYMBOL() will be dropped,
  and LTO should be able to do further dead code elimination.
  So, I guess we do not need to no-op EXPORT_SYMBOL by CPP
  (unless I am missing something).
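
To make the linker-script idea concrete, here is a minimal shell sketch
that turns a list of needed symbols into KEEP() lines for a linker
script. Everything in it is an assumption made for illustration: the
function name gen_keep_lines is hypothetical, the per-symbol section
name "___ksymtab+<sym>" is assumed, and this is not presented as the
content of scripts/gen-keep-ksyms.sh. Export metadata sections that match
no KEEP() line could then be routed to /DISCARD/.

```shell
# gen_keep_lines: read symbol names on stdin, one per line, and print one
# linker-script KEEP() line per symbol (section naming is an assumption).
gen_keep_lines() {
    while IFS= read -r sym; do
        # Skip blank lines in the input list.
        [ -n "$sym" ] || continue
        printf 'KEEP(*(___ksymtab+%s))\n' "$sym"
    done
}

# Example: two symbols that some module needs.
printf 'foo\nbar\n' | gen_keep_lines
```

Generating the list once before the final link is what would let this
approach work in a single pass, with no rebuild loop.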






--
Best Regards
Masahiro Yamada
Nicolas Pitre Feb. 25, 2021, 7:24 p.m. UTC | #3
On Fri, 26 Feb 2021, Masahiro Yamada wrote:

> On Fri, Feb 26, 2021 at 2:20 AM Nicolas Pitre <nico@fluxnic.net> wrote:
> >
> > On Fri, 26 Feb 2021, Masahiro Yamada wrote:
> >
> > >
> > > Now CONFIG_TRIM_UNUSED_KSYMS is revived, but Linus is still unhappy
> > > about the build speed.
> > >
> > > I re-implemented this feature, and the build time cost is now
> > > almost unnoticeable level.
> > >
> > > I hope this makes Linus happy.
> >
> > :-)
> >
> > I'm surprised to see that Linus is using this feature. When disabled
> > (the default) this should have had no impact on the build time.
> 
> Linus is not using this feature, but does build tests.
> After pulling the module subsystem pull request in this merge window,
> CONFIG_TRIM_UNUSED_KSYMS was enabled by allmodconfig.

If CONFIG_TRIM_UNUSED_KSYMS is enabled then build time will increase.
That comes with the feature.

> > This feature provides a nice security advantage by significantly
> > reducing the kernel input surface. And people are using that also to
> > better what third party vendor can and cannot do with a distro kernel,
> > etc. But that's not the reason why I implemented this feature in the
> > first place.
> >
> > My primary goal was to efficiently reduce the kernel binary size using
> > LTO even with kernel modules enabled.
> 
> 
> Clang LTO landed in this MW.
> 
> Do you think it will reduce the kernel binary size?
> No, opposite.

LTO ought to reduce binary size. It is rather broken otherwise.
Having a global view before optimizing allows for the compiler to do 
project wide constant propagation and dead code elimination.

> CONFIG_LTO_CLANG cannot trim any code even if it
> is obviously unused.
> Hence, it never reduces the kernel binary size.
> Rather, it produces a bigger kernel.

Then what's the point?

> The reason is Clang LTO was implemented against
> relocatable ELF (vmlinux.o) .

That's not true LTO then.

> I pointed out this flaw in the review process, but
> it was dismissed.
> 
> This is the main reason why I did not give any Ack
> (but it was merged via Kees Cook's tree).

> So, the help text of this option should be revised:
> 
>           This option allows for unused exported symbols to be dropped from
>           the build. In turn, this provides the compiler more opportunities
>           (especially when using LTO) for optimizing the code and reducing
>           binary size.  This might have some security advantages as well.
> 
> Clang LTO is opposite to your expectation.

Then Clang LTO is a misnomer. That is the option to revise not this one.

> > Each EXPORT_SYMBOL() created a
> > symbol dependency that prevented LTO from optimizing out the related
> > code even though a tiny fraction of those exported symbols were needed.
> >
> > The idea behind the recursion was to catch those cases where disabling
> > an exported symbol within a module would optimize out references to more
> > exported symbols that, in turn, could be disabled and possibly trigger
> > yet more code elimination. There is no way that can be achieved without
> > extra compiler passes in a recursive manner.
> 
> I do not understand.
> 
> Modules are relocatable ELF.
> Clang LTO cannot eliminate any code.
> GCC LTO does not work with relocatable ELF
> in the first place.

I don't think I follow you here. What does relocatable ELF have to do with LTO?

I've successfully used gcc LTO on the kernel quite a while ago.

For a reference about binary size reduction with LTO and 
CONFIG_TRIM_UNUSED_KSYMS please read this article:

https://lwn.net/Articles/746780/


Nicolas
Masahiro Yamada March 9, 2021, 7:28 a.m. UTC | #4
On Fri, Feb 26, 2021 at 4:24 AM Nicolas Pitre <nico@fluxnic.net> wrote:
>
> On Fri, 26 Feb 2021, Masahiro Yamada wrote:
>
> > On Fri, Feb 26, 2021 at 2:20 AM Nicolas Pitre <nico@fluxnic.net> wrote:
> > >
> > > On Fri, 26 Feb 2021, Masahiro Yamada wrote:
> > >
> > > >
> > > > Now CONFIG_TRIM_UNUSED_KSYMS is revived, but Linus is still unhappy
> > > > about the build speed.
> > > >
> > > > I re-implemented this feature, and the build time cost is now
> > > > almost unnoticeable level.
> > > >
> > > > I hope this makes Linus happy.
> > >
> > > :-)
> > >
> > > I'm surprised to see that Linus is using this feature. When disabled
> > > (the default) this should have had no impact on the build time.
> >
> > Linus is not using this feature, but does build tests.
> > After pulling the module subsystem pull request in this merge window,
> > CONFIG_TRIM_UNUSED_KSYMS was enabled by allmodconfig.
>
> If CONFIG_TRIM_UNUSED_KSYMS is enabled then build time will increase.
> That comes with the feature.


This patch set intends to change this.
TRIM_UNUSED_KSYMS will build without additional cost,
like LD_DEAD_CODE_DATA_ELIMINATION.



>
> > > This feature provides a nice security advantage by significantly
> > > reducing the kernel input surface. And people are using that also to
> > > better what third party vendor can and cannot do with a distro kernel,
> > > etc. But that's not the reason why I implemented this feature in the
> > > first place.
> > >
> > > My primary goal was to efficiently reduce the kernel binary size using
> > > LTO even with kernel modules enabled.
> >
> >
> > Clang LTO landed in this MW.
> >
> > Do you think it will reduce the kernel binary size?
> > No, opposite.
>
> LTO ought to reduce binary size. It is rather broken otherwise.
> Having a global view before optimizing allows for the compiler to do
> project wide constant propagation and dead code elimination.
>
> > CONFIG_LTO_CLANG cannot trim any code even if it
> > is obviously unused.
> > Hence, it never reduces the kernel binary size.
> > Rather, it produces a bigger kernel.
>
> Then what's the point?


Presumably, reducing the size is not
the main interest for Googlers.


>
> > The reason is Clang LTO was implemented against
> > relocatable ELF (vmlinux.o) .
>
> That's not true LTO then.


This is the same as what I said in the review process.
:-)

https://lore.kernel.org/linux-kbuild/CAK7LNASQPOGohtUyzBM6n54pzpLN35kDXC7VbvWzX8QWUmqq9g@mail.gmail.com/




>
> > I pointed out this flaw in the review process, but
> > it was dismissed.
> >
> > This is the main reason why I did not give any Ack
> > (but it was merged via Kees Cook's tree).
>
> > So, the help text of this option should be revised:
> >
> >           This option allows for unused exported symbols to be dropped from
> >           the build. In turn, this provides the compiler more opportunities
> >           (especially when using LTO) for optimizing the code and reducing
> >           binary size.  This might have some security advantages as well.
> >
> > Clang LTO is opposite to your expectation.
>
> Then Clang LTO is a misnomer. That is the option to revise not this one.
>
> > > Each EXPORT_SYMBOL() created a
> > > symbol dependency that prevented LTO from optimizing out the related
> > > code even though a tiny fraction of those exported symbols were needed.
> > >
> > > The idea behind the recursion was to catch those cases where disabling
> > > an exported symbol within a module would optimize out references to more
> > > exported symbols that, in turn, could be disabled and possibly trigger
> > > yet more code elimination. There is no way that can be achieved without
> > > extra compiler passes in a recursive manner.
> >
> > I do not understand.
> >
> > Modules are relocatable ELF.
> > Clang LTO cannot eliminate any code.
> > GCC LTO does not work with relocatable ELF
> > in the first place.
>
> I don't think I follow you here. What relocatable ELF has to do with LTO?



What is important is that
GCC LTO is a feature of gcc, not binutils.
That is, LD_FINAL is $(CC).

GCC LTO can be implemented for the final link stage
by using $(CC) as the linker driver.
Then, it can determine which code is unreachable.
In other words, GCC LTO works only when building
the final executable.


On the other hand, a relocatable ELF is created
by $(LD) -r by combining some objects together.
The relocatable ELF can be fed to another $(LD) -r,
or the final link stage.


vmlinux is an executable ELF.
modules (*.ko files) are relocatable ELFs.


You can confirm it easily
by using the 'file' command.

masahiro@oscar:~/ref/linux$ file vmlinux
vmlinux: ELF 64-bit LSB executable, x86-64, version 1 (SYSV),
statically linked,
BuildID[sha1]=ee0cef2ff3d9f490e0f5ee1d7e74b19aa167933b, not stripped
masahiro@oscar:~/ref/linux$ file  net/ipv4/netfilter/iptable_nat.ko
net/ipv4/netfilter/iptable_nat.ko: ELF 64-bit LSB relocatable, x86-64,
version 1 (SYSV),
BuildID[sha1]=4829e82f9b9e7fd65be3c19c1cf0e16a7ddf0967, not stripped
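
If 'file' is not at hand, the same distinction can be read straight from
the ELF header. The following is a minimal sketch: the function name
elf_type is hypothetical, and it relies only on the standard ELF layout
(magic in bytes 0-3, byte order in byte 5, e_type in bytes 16-17, where
1 means relocatable, 2 executable, and 3 shared object).

```shell
# elf_type FILE: print REL, EXEC, DYN, or OTHER according to e_type,
# or NOT_ELF (with a non-zero status) if the ELF magic is missing.
elf_type() {
    f=$1
    # Bytes 0-3 must be the ELF magic: 0x7f 'E' 'L' 'F'.
    magic=$(od -An -tx1 -N4 "$f" | tr -d ' \n')
    [ "$magic" = "7f454c46" ] || { echo "NOT_ELF"; return 1; }
    # Byte 5 (EI_DATA): 1 = little-endian, 2 = big-endian.
    data=$(od -An -tu1 -j5 -N1 "$f" | tr -d ' \n')
    # Bytes 16-17 hold e_type, stored in the file's own byte order.
    set -- $(od -An -tu1 -j16 -N2 "$f")
    if [ "$data" = "2" ]; then
        hi=${1:-0}; lo=${2:-0}
    else
        lo=${1:-0}; hi=${2:-0}
    fi
    case $((hi * 256 + lo)) in
        1) echo REL ;;
        2) echo EXEC ;;
        3) echo DYN ;;
        *) echo OTHER ;;
    esac
}

# Usage (paths taken from the transcript above):
#   elf_type vmlinux
#   elf_type net/ipv4/netfilter/iptable_nat.ko
```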



Modules are not filled with addresses yet
since we do not know which memory address
the module will be loaded to.
The addresses are resolved at modprobe time.

As I said above, modules are created by $(LD) -r.
It is not possible to implement GCC LTO for modules.



In contrast, Clang LTO is a capability of $(LD).
So, it can be implemented not only for executable ELFs,
but also for relocatable ELFs.
The problem is that Clang LTO cannot determine which code is
unreachable if it is implemented for a relocatable ELF,
since that is not a final image.

Did I answer your question?





> I've successfully used gcc LTO on the kernel quite a while ago.
>
> For a reference about binary size reduction with LTO and
> CONFIG_TRIM_UNUSED_KSYMS please read this article:
>
> https://lwn.net/Articles/746780/


Thanks for the great articles.

Just out of curiosity, I think you used GCC LTO from
Andy's GitHub.


In the article, you took stm32_defconfig as an example,
but ARM does not select ARCH_SUPPORTS_LTO.

Did you add some local hacks to make LTO work
for ARM?

I tried the lto-5.8.1 branch, but
I did not even succeed in building x86 + LTO.






>
> Nicolas
Nicolas Pitre March 9, 2021, 4:49 p.m. UTC | #5
On Tue, 9 Mar 2021, Masahiro Yamada wrote:

> On Fri, Feb 26, 2021 at 4:24 AM Nicolas Pitre <nico@fluxnic.net> wrote:
> >
> > If CONFIG_TRIM_UNUSED_KSYMS is enabled then build time will increase.
> > That comes with the feature.
> 
> This patch set intends to change this.
> TRIM_UNUSED_KSYMS will build without additional cost,
> like LD_DEAD_CODE_DATA_ELIMINATION.

OK... I do see how you're going about it.

> > > Modules are relocatable ELF.
> > > Clang LTO cannot eliminate any code.
> > > GCC LTO does not work with relocatable ELF
> > > in the first place.
> >
> > I don't think I follow you here. What relocatable ELF has to do with LTO?
> 
> What is important is,
> GCC LTO is the feature of gcc, not binutils.
> That is, LD_FINAL is $(CC).

Exactly.

> GCC LTO can be implemented for the final link stage
> by using $(CC) as the linker driver.
> Then, it can determine which code is unreachable.
> In other words, GCC LTO works only when building
> the final executable.

Yes. And it does so by filling .o files with its intermediate code 
representation and not ELF code.

> On the other hand, a relocatable ELF is created
> by $(LD) -r by combining some objects together.
> The relocatable ELF can be fed to another $(LD) -r,
> or the final link stage.

You still can create relocatable ELF using LTO. But LTO stops there. 
From that point on, .o files will no longer contain data that LTO can 
use if you further combine those object files together. But until that 
point, LTO is still usable.

> As I said above, modules are created by $(LD) -r.
> It is not possible to implement GCC LTO for modules.

If I remember correctly (that was a while ago) the problem with LTO and 
the kernel had to do with the fact that every subdirectory was gathering 
object files in built-in.o using ld -r. At some point we switched to 
gathering object files into built-in.a files where no linking is taking 
place. The real linking happens in vmlinux.o where LTO may now do its 
magic.

The same is true for modules. Compiling foo_module.c into foo_module.o 
will create a .o file with LTO data rather than executable code. But 
when you create the final .o for the module then LTO takes place and 
produces the relocatable ELF executable.

> > I've successfully used gcc LTO on the kernel quite a while ago.
> >
> > For a reference about binary size reduction with LTO and
> > CONFIG_TRIM_UNUSED_KSYMS please read this article:
> >
> > https://lwn.net/Articles/746780/
> 
> Thanks for the great articles.
> 
> Just for curiosity, I think you used GCC LTO from
> Andy's GitHub.

Right. I provided the reference in the preceding article:
https://lwn.net/Articles/744507/ 

> In the article, you took stm32_defconfig as an example,
> but ARM does not select ARCH_SUPPORTS_LTO.
> 
> Did you add some local hacks to make LTO work
> for ARM?

Of course. This article was written in 2017 and no LTO support at all 
was in mainline back then. But, besides adding CONFIG_LTO, very little 
was needed to make it compile, and I did upstream most changes such as 
commit 75fea300d7, commit a85b2257a5, commit 5d48417592, commit 
19c233b79d, etc.

> I tried the lto-5.8.1 branch, but
> I did not even succeed in building x86 + LTO.

My latest working LTO branch (i.e. last time I worked on it) is much 
older than that.

Maybe people aren't very excited about LTO because it makes recompiling
the kernel take many times longer: gcc runs its optimization passes on
the whole kernel even if you modify a single file.


Nicolas