mbox series

[v2,00/28] Add support for Clang LTO

Message ID 20200903203053.3411268-1-samitolvanen@google.com
Headers show
Series Add support for Clang LTO | expand

Message

Sami Tolvanen Sept. 3, 2020, 8:30 p.m. UTC
This patch series adds support for building x86_64 and arm64 kernels
with Clang's Link Time Optimization (LTO).

In addition to performance, the primary motivation for LTO is
to allow Clang's Control-Flow Integrity (CFI) to be used in the
kernel. Google has shipped millions of Pixel devices running three
major kernel versions with LTO+CFI since 2018.

Most of the patches are build system changes for handling LLVM
bitcode, which Clang produces with LTO instead of ELF object files,
postponing ELF processing until a later stage, and ensuring initcall
ordering.

Note that patches 1-4 are not directly related to LTO, but are
needed to compile LTO kernels with ToT Clang, so I'm including them
in the series for your convenience:

 - Patches 1-3 are required for building the kernel with ToT Clang,
   and IAS, and patch 4 is needed to build allmodconfig with LTO.

 - Patches 3-4 are already in linux-next, but not yet in 5.9-rc.

---
Changes in v2:

  - Fixed -Wmissing-prototypes warnings with W=1.

  - Dropped cc-option from -fsplit-lto-unit and added .thinlto-cache
    scrubbing to make distclean.

  - Added a comment about Clang >=11 being required.

  - Added a patch to disable LTO for the arm64 KVM nVHE code.

  - Disabled objtool's noinstr validation with LTO unless enabled.

  - Included Peter's proposed objtool mcount patch in the series
    and replaced recordmcount with the objtool pass to avoid
    whitelisting relocations that are not calls.

  - Updated several commit messages with better explanations.


Arvind Sankar (2):
  x86/boot/compressed: Disable relocation relaxation
  x86/asm: Replace __force_order with memory clobber

Luca Stefani (1):
  RAS/CEC: Fix cec_init() prototype

Nick Desaulniers (1):
  lib/string.c: implement stpcpy

Peter Zijlstra (1):
  objtool: Add a pass for generating __mcount_loc

Sami Tolvanen (23):
  objtool: Don't autodetect vmlinux.o
  kbuild: add support for objtool mcount
  x86, build: use objtool mcount
  kbuild: add support for Clang LTO
  kbuild: lto: fix module versioning
  kbuild: lto: postpone objtool
  kbuild: lto: limit inlining
  kbuild: lto: merge module sections
  kbuild: lto: remove duplicate dependencies from .mod files
  init: lto: ensure initcall ordering
  init: lto: fix PREL32 relocations
  PCI: Fix PREL32 relocations for LTO
  modpost: lto: strip .lto from module names
  scripts/mod: disable LTO for empty.c
  efi/libstub: disable LTO
  drivers/misc/lkdtm: disable LTO for rodata.o
  arm64: export CC_USING_PATCHABLE_FUNCTION_ENTRY
  arm64: vdso: disable LTO
  KVM: arm64: disable LTO for the nVHE directory
  arm64: allow LTO_CLANG and THINLTO to be selected
  x86, vdso: disable LTO only for vDSO
  x86, relocs: Ignore L4_PAGE_OFFSET relocations
  x86, build: allow LTO_CLANG and THINLTO to be selected

 .gitignore                            |   1 +
 Makefile                              |  65 ++++++-
 arch/Kconfig                          |  67 +++++++
 arch/arm64/Kconfig                    |   2 +
 arch/arm64/Makefile                   |   1 +
 arch/arm64/kernel/vdso/Makefile       |   4 +-
 arch/arm64/kvm/hyp/nvhe/Makefile      |   4 +-
 arch/x86/Kconfig                      |   3 +
 arch/x86/Makefile                     |   5 +
 arch/x86/boot/compressed/Makefile     |   2 +
 arch/x86/boot/compressed/pgtable_64.c |   9 -
 arch/x86/entry/vdso/Makefile          |   5 +-
 arch/x86/include/asm/special_insns.h  |  28 +--
 arch/x86/kernel/cpu/common.c          |   4 +-
 arch/x86/tools/relocs.c               |   1 +
 drivers/firmware/efi/libstub/Makefile |   2 +
 drivers/misc/lkdtm/Makefile           |   1 +
 drivers/ras/cec.c                     |   9 +-
 include/asm-generic/vmlinux.lds.h     |  11 +-
 include/linux/init.h                  |  79 +++++++-
 include/linux/pci.h                   |  19 +-
 kernel/trace/Kconfig                  |   5 +
 lib/string.c                          |  24 +++
 scripts/Makefile.build                |  55 +++++-
 scripts/Makefile.lib                  |   6 +-
 scripts/Makefile.modfinal             |  31 ++-
 scripts/Makefile.modpost              |  26 ++-
 scripts/generate_initcall_order.pl    | 270 ++++++++++++++++++++++++++
 scripts/link-vmlinux.sh               |  94 ++++++++-
 scripts/mod/Makefile                  |   1 +
 scripts/mod/modpost.c                 |  16 +-
 scripts/mod/modpost.h                 |   9 +
 scripts/mod/sumversion.c              |   6 +-
 scripts/module-lto.lds                |  26 +++
 tools/objtool/builtin-check.c         |  13 +-
 tools/objtool/builtin.h               |   2 +-
 tools/objtool/check.c                 |  83 ++++++++
 tools/objtool/check.h                 |   1 +
 tools/objtool/objtool.h               |   1 +
 39 files changed, 883 insertions(+), 108 deletions(-)
 create mode 100755 scripts/generate_initcall_order.pl
 create mode 100644 scripts/module-lto.lds


base-commit: e28f0104343d0c132fa37f479870c9e43355fee4

Comments

Kees Cook Sept. 3, 2020, 11:34 p.m. UTC | #1
On Thu, Sep 03, 2020 at 01:30:25PM -0700, Sami Tolvanen wrote:
> This patch series adds support for building x86_64 and arm64 kernels
> with Clang's Link Time Optimization (LTO).

Tested-by: Kees Cook <keescook@chromium.org>

FWIW, this gives me a happy booting x86 kernel:

# cat /proc/version 
Linux version 5.9.0-rc3+ (kees@amarok) (clang version 12.0.0 (https://github.com/llvm/llvm-project.git db1ec04963cce70f2593e58cecac55f2e6accf52), LLD 12.0.0 (https://github.com/llvm/llvm-project.git db1ec04963cce70f2593e58cecac55f2e6accf52)) #1 SMP Thu Sep 3 15:54:14 PDT 2020
# zgrep 'LTO[_=]' /proc/config.gz
CONFIG_LTO=y
CONFIG_ARCH_SUPPORTS_LTO_CLANG=y
CONFIG_ARCH_SUPPORTS_THINLTO=y
CONFIG_THINLTO=y
# CONFIG_LTO_NONE is not set
CONFIG_LTO_CLANG=y

I'd like to find a way to get this series landing sanely. It has
dependencies on fixes/features in a few trees, and it looks like
it's been difficult to keep forward momentum on LTO while trying to
simultaneously chase changes in those trees, especially since it means
no one care carry LTO in -next without shared branches. To that end,
I'd like to find a way forward where Sami doesn't have to keep carrying
a couple dozen patches. :)

The fixes/features outside of, or partially overlapping, Masahiro's
kbuild tree appear to be:

[PATCH v2 01/28] x86/boot/compressed: Disable relocation relaxation
[PATCH v2 02/28] x86/asm: Replace __force_order with memory clobber
[PATCH v2 03/28] lib/string.c: implement stpcpy
[PATCH v2 04/28] RAS/CEC: Fix cec_init() prototype
[PATCH v2 05/28] objtool: Add a pass for generating __mcount_loc
[PATCH v2 06/28] objtool: Don't autodetect vmlinux.o
[PATCH v2 07/28] kbuild: add support for objtool mcount
[PATCH v2 08/28] x86, build: use objtool mcount
[PATCH v2 17/28] PCI: Fix PREL32 relocations for LTO
[PATCH v2 20/28] efi/libstub: disable LTO
[PATCH v2 21/28] drivers/misc/lkdtm: disable LTO for rodata.o
[PATCH v2 22/28] arm64: export CC_USING_PATCHABLE_FUNCTION_ENTRY
[PATCH v2 23/28] arm64: vdso: disable LTO 
[PATCH v2 24/28] KVM: arm64: disable LTO for the nVHE directory
[PATCH v2 25/28] arm64: allow LTO_CLANG and THINLTO to be selected
[PATCH v2 26/28] x86, vdso: disable LTO only for vDSO
[PATCH v2 27/28] x86, relocs: Ignore L4_PAGE_OFFSET relocations
[PATCH v2 28/28] x86, build: allow LTO_CLANG and THINLTO to be selected

The distinctly kbuild patches are:

[PATCH v2 09/28] kbuild: add support for Clang LTO
[PATCH v2 10/28] kbuild: lto: fix module versioning
[PATCH v2 11/28] kbuild: lto: postpone objtool
[PATCH v2 12/28] kbuild: lto: limit inlining
[PATCH v2 13/28] kbuild: lto: merge module sections
[PATCH v2 14/28] kbuild: lto: remove duplicate dependencies from .mod files
[PATCH v2 15/28] init: lto: ensure initcall ordering
[PATCH v2 16/28] init: lto: fix PREL32 relocations
[PATCH v2 18/28] modpost: lto: strip .lto from module names
[PATCH v2 19/28] scripts/mod: disable LTO for empty.c

Patch 3 is in -mm and I expect it will land in the next rc (I hope,
since it's needed universally for Clang builds).

Patch 4 is living in -tip, to appear shortly in -next, AFAICT?

I would expect 1 and 2 to appear in -tip soon, but I'm not sure?

For patches 5, 6, 7, and 8 I would expect them to normally go via -tip's
objtool tree, but getting an Ack would let them land elsewhere.

Patch 17 I'd expect to normally go via Bjorn's tree, but he's given an
Ack so it can live elsewhere without surprises. :)

Patches 19, 20, 21, 23, 24, 26 are all simple "just disable LTO"
patches.

This leaves 9-16 and 18. Patches 10, 12, 14, 16, and 18 seem mostly
"mechanical" in nature, leaving the bulk of the review on patches 9,
11, 13, and 15.

Masahiro, given the spread of dependent patches between 2 (or more?) -tip
branches and -mm, how do you want to proceed? I wonder if it might
be possible to create a shared branch to avoid merge headaches, and I
(or -tip folks, or you) could carry patches 1-8 there so patches 9 and
later could have a common base?

Thanks!
Kees Cook Sept. 3, 2020, 11:38 p.m. UTC | #2
On Thu, Sep 03, 2020 at 01:30:25PM -0700, Sami Tolvanen wrote:
> This patch series adds support for building x86_64 and arm64 kernels
> with Clang's Link Time Optimization (LTO).
> [...]
> base-commit: e28f0104343d0c132fa37f479870c9e43355fee4

And if you're not a b4 user, this tree can be found at either of these places:

https://github.com/samitolvanen/linux/commits/clang-lto

git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux.git kspp/sami/lto/v2
Nathan Chancellor Sept. 4, 2020, 4:45 a.m. UTC | #3
On Thu, Sep 03, 2020 at 04:34:09PM -0700, Kees Cook wrote:
> On Thu, Sep 03, 2020 at 01:30:25PM -0700, Sami Tolvanen wrote:
> > This patch series adds support for building x86_64 and arm64 kernels
> > with Clang's Link Time Optimization (LTO).
> 
> Tested-by: Kees Cook <keescook@chromium.org>

Tested-by: Nathan Chancellor <natechancellor@gmail.com>

I have been continuously running this series on virtualized x86_64 (WSL2
on my home workstation) and bare metal arm64 (Raspberry Pi 4) with no
major issues or regressions noticed.

> FWIW, this gives me a happy booting x86 kernel:
> 
> # cat /proc/version 
> Linux version 5.9.0-rc3+ (kees@amarok) (clang version 12.0.0 (https://github.com/llvm/llvm-project.git db1ec04963cce70f2593e58cecac55f2e6accf52), LLD 12.0.0 (https://github.com/llvm/llvm-project.git db1ec04963cce70f2593e58cecac55f2e6accf52)) #1 SMP Thu Sep 3 15:54:14 PDT 2020
> # zgrep 'LTO[_=]' /proc/config.gz
> CONFIG_LTO=y
> CONFIG_ARCH_SUPPORTS_LTO_CLANG=y
> CONFIG_ARCH_SUPPORTS_THINLTO=y
> CONFIG_THINLTO=y
> # CONFIG_LTO_NONE is not set
> CONFIG_LTO_CLANG=y
> 
> I'd like to find a way to get this series landing sanely. It has
> dependencies on fixes/features in a few trees, and it looks like
> it's been difficult to keep forward momentum on LTO while trying to
> simultaneously chase changes in those trees, especially since it means
> no one care carry LTO in -next without shared branches. To that end,
> I'd like to find a way forward where Sami doesn't have to keep carrying
> a couple dozen patches. :)
> 
> The fixes/features outside of, or partially overlapping, Masahiro's
> kbuild tree appear to be:
> 
> [PATCH v2 01/28] x86/boot/compressed: Disable relocation relaxation
> [PATCH v2 02/28] x86/asm: Replace __force_order with memory clobber
> [PATCH v2 03/28] lib/string.c: implement stpcpy
> [PATCH v2 04/28] RAS/CEC: Fix cec_init() prototype
> [PATCH v2 05/28] objtool: Add a pass for generating __mcount_loc
> [PATCH v2 06/28] objtool: Don't autodetect vmlinux.o
> [PATCH v2 07/28] kbuild: add support for objtool mcount
> [PATCH v2 08/28] x86, build: use objtool mcount
> [PATCH v2 17/28] PCI: Fix PREL32 relocations for LTO
> [PATCH v2 20/28] efi/libstub: disable LTO
> [PATCH v2 21/28] drivers/misc/lkdtm: disable LTO for rodata.o
> [PATCH v2 22/28] arm64: export CC_USING_PATCHABLE_FUNCTION_ENTRY
> [PATCH v2 23/28] arm64: vdso: disable LTO 
> [PATCH v2 24/28] KVM: arm64: disable LTO for the nVHE directory
> [PATCH v2 25/28] arm64: allow LTO_CLANG and THINLTO to be selected
> [PATCH v2 26/28] x86, vdso: disable LTO only for vDSO
> [PATCH v2 27/28] x86, relocs: Ignore L4_PAGE_OFFSET relocations
> [PATCH v2 28/28] x86, build: allow LTO_CLANG and THINLTO to be selected
> 
> The distinctly kbuild patches are:
> 
> [PATCH v2 09/28] kbuild: add support for Clang LTO
> [PATCH v2 10/28] kbuild: lto: fix module versioning
> [PATCH v2 11/28] kbuild: lto: postpone objtool
> [PATCH v2 12/28] kbuild: lto: limit inlining
> [PATCH v2 13/28] kbuild: lto: merge module sections
> [PATCH v2 14/28] kbuild: lto: remove duplicate dependencies from .mod files
> [PATCH v2 15/28] init: lto: ensure initcall ordering
> [PATCH v2 16/28] init: lto: fix PREL32 relocations
> [PATCH v2 18/28] modpost: lto: strip .lto from module names
> [PATCH v2 19/28] scripts/mod: disable LTO for empty.c
> 
> Patch 3 is in -mm and I expect it will land in the next rc (I hope,
> since it's needed universally for Clang builds).
> 
> Patch 4 is living in -tip, to appear shortly in -next, AFAICT?
> 
> I would expect 1 and 2 to appear in -tip soon, but I'm not sure?
> 
> For patches 5, 6, 7, and 8 I would expect them to normally go via -tip's
> objtool tree, but getting an Ack would let them land elsewhere.
> 
> Patch 17 I'd expect to normally go via Bjorn's tree, but he's given an
> Ack so it can live elsewhere without surprises. :)
> 
> Patches 19, 20, 21, 23, 24, 26 are all simple "just disable LTO"
> patches.
> 
> This leaves 9-16 and 18. Patches 10, 12, 14, 16, and 18 seem mostly
> "mechanical" in nature, leaving the bulk of the review on patches 9,
> 11, 13, and 15.
> 
> Masahiro, given the spread of dependent patches between 2 (or more?) -tip
> branches and -mm, how do you want to proceed? I wonder if it might
> be possible to create a shared branch to avoid merge headaches, and I
> (or -tip folks, or you) could carry patches 1-8 there so patches 9 and
> later could have a common base?
> 
> Thanks!
> 
> -- 
> Kees Cook
> 

For what it's worth, the static call series that is in -tip and about to
land in -next conflicts relatively heavy with this. There are fairly
innocuous conflicts in some objtool files but two contextual changes
are needed to keep things building. It probably makes sense for most if
not all of this to live in -tip with acks. Ideally, if the stpcpy patch
gets merged into an -rc, this can just be based on that.

check.c:556:80: error: too few arguments to function call, expected 5, have 4
        sec = elf_create_section(file->elf, "__mcount_loc", sizeof(unsigned long), idx);
              ~~~~~~~~~~~~~~~~~~                                                      ^
./elf.h:124:17: note: 'elf_create_section' declared here
struct section *elf_create_section(struct elf *elf, const char *name, unsigned int sh_flags, size_t entsize, int nr);
                ^
1 error generated.

kernel/static_call.c:438:16: error: returning 'void' from a function with incompatible result type 'int'
early_initcall(static_call_init);
~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~
include/linux/init.h:268:47: note: expanded from macro 'early_initcall'
#define early_initcall(fn)              __define_initcall(fn, early)
                                        ~~~~~~~~~~~~~~~~~~^~~~~~~~~~
include/linux/init.h:261:54: note: expanded from macro '__define_initcall'
#define __define_initcall(fn, id) ___define_initcall(fn, id, .initcall##id)
                                  ~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~
include/linux/init.h:259:20: note: expanded from macro '___define_initcall'
        __unique_initcall(fn, id, __sec, __initcall_id(fn))
        ~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
include/linux/init.h:253:22: note: expanded from macro '__unique_initcall'
        ____define_initcall(fn,                                 \
        ~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
include/linux/init.h:241:33: note: expanded from macro '____define_initcall'
        __define_initcall_stub(__stub, fn)                      \
        ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~
include/linux/init.h:226:10: note: expanded from macro '__define_initcall_stub'
                return fn();                                    \
                       ^~~~
1 error generated.

Below is what I ended up with for fixes.

Cheers,
Nathan

diff --git a/include/linux/static_call.h b/include/linux/static_call.h
index bfa2ba39be57..61034e9798d6 100644
--- a/include/linux/static_call.h
+++ b/include/linux/static_call.h
@@ -136,7 +136,7 @@ extern void arch_static_call_transform(void *site, void *tramp, void *func, bool
 
 #ifdef CONFIG_HAVE_STATIC_CALL_INLINE
 
-extern void __init static_call_init(void);
+extern int __init static_call_init(void);
 
 struct static_call_mod {
 	struct static_call_mod *next;
diff --git a/kernel/static_call.c b/kernel/static_call.c
index f8362b3f8fd5..84565c2a41b8 100644
--- a/kernel/static_call.c
+++ b/kernel/static_call.c
@@ -410,12 +410,12 @@ int static_call_text_reserved(void *start, void *end)
 	return __static_call_mod_text_reserved(start, end);
 }
 
-void __init static_call_init(void)
+int __init static_call_init(void)
 {
 	int ret;
 
 	if (static_call_initialized)
-		return;
+		return 0;
 
 	cpus_read_lock();
 	static_call_lock();
@@ -434,6 +434,7 @@ void __init static_call_init(void)
 #ifdef CONFIG_MODULES
 	register_module_notifier(&static_call_module_nb);
 #endif
+	return 0;
 }
 early_initcall(static_call_init);
 
diff --git a/tools/objtool/check.c b/tools/objtool/check.c
index d31554adcf4e..34db58110f3d 100644
--- a/tools/objtool/check.c
+++ b/tools/objtool/check.c
@@ -553,7 +553,7 @@ static int create_mcount_loc_sections(struct objtool_file *file)
 	list_for_each_entry(insn, &file->mcount_loc_list, mcount_loc_node)
 		idx++;
 
-	sec = elf_create_section(file->elf, "__mcount_loc", sizeof(unsigned long), idx);
+	sec = elf_create_section(file->elf, "__mcount_loc", 0, sizeof(unsigned long), idx);
 	if (!sec)
 		return -1;
Sedat Dilek Sept. 4, 2020, 7:53 a.m. UTC | #4
On Thu, Sep 3, 2020 at 10:30 PM 'Sami Tolvanen' via Clang Built Linux
<clang-built-linux@googlegroups.com> wrote:
>
> This patch series adds support for building x86_64 and arm64 kernels
> with Clang's Link Time Optimization (LTO).
>
> In addition to performance, the primary motivation for LTO is
> to allow Clang's Control-Flow Integrity (CFI) to be used in the
> kernel. Google has shipped millions of Pixel devices running three
> major kernel versions with LTO+CFI since 2018.
>
> Most of the patches are build system changes for handling LLVM
> bitcode, which Clang produces with LTO instead of ELF object files,
> postponing ELF processing until a later stage, and ensuring initcall
> ordering.
>
> Note that patches 1-4 are not directly related to LTO, but are
> needed to compile LTO kernels with ToT Clang, so I'm including them
> in the series for your convenience:
>
>  - Patches 1-3 are required for building the kernel with ToT Clang,
>    and IAS, and patch 4 is needed to build allmodconfig with LTO.
>
>  - Patches 3-4 are already in linux-next, but not yet in 5.9-rc.
>

I jumped to Sami's clang-cfi Git tree which includes clang-lto v2.

My LLVM toolchain is version 11.0.0.0-rc2+ more precisely git
97ac9e82002d6b12831ca2c78f739cca65a4fa05.

If this is OK, feel free to add my...

Tested-by: Sedat Dilek <sedat.dilek@gmail.com>

- Sedat -

[1] https://github.com/samitolvanen/linux/commits/clang-cfi

> ---
> Changes in v2:
>
>   - Fixed -Wmissing-prototypes warnings with W=1.
>
>   - Dropped cc-option from -fsplit-lto-unit and added .thinlto-cache
>     scrubbing to make distclean.
>
>   - Added a comment about Clang >=11 being required.
>
>   - Added a patch to disable LTO for the arm64 KVM nVHE code.
>
>   - Disabled objtool's noinstr validation with LTO unless enabled.
>
>   - Included Peter's proposed objtool mcount patch in the series
>     and replaced recordmcount with the objtool pass to avoid
>     whitelisting relocations that are not calls.
>
>   - Updated several commit messages with better explanations.
>
>
> Arvind Sankar (2):
>   x86/boot/compressed: Disable relocation relaxation
>   x86/asm: Replace __force_order with memory clobber
>
> Luca Stefani (1):
>   RAS/CEC: Fix cec_init() prototype
>
> Nick Desaulniers (1):
>   lib/string.c: implement stpcpy
>
> Peter Zijlstra (1):
>   objtool: Add a pass for generating __mcount_loc
>
> Sami Tolvanen (23):
>   objtool: Don't autodetect vmlinux.o
>   kbuild: add support for objtool mcount
>   x86, build: use objtool mcount
>   kbuild: add support for Clang LTO
>   kbuild: lto: fix module versioning
>   kbuild: lto: postpone objtool
>   kbuild: lto: limit inlining
>   kbuild: lto: merge module sections
>   kbuild: lto: remove duplicate dependencies from .mod files
>   init: lto: ensure initcall ordering
>   init: lto: fix PREL32 relocations
>   PCI: Fix PREL32 relocations for LTO
>   modpost: lto: strip .lto from module names
>   scripts/mod: disable LTO for empty.c
>   efi/libstub: disable LTO
>   drivers/misc/lkdtm: disable LTO for rodata.o
>   arm64: export CC_USING_PATCHABLE_FUNCTION_ENTRY
>   arm64: vdso: disable LTO
>   KVM: arm64: disable LTO for the nVHE directory
>   arm64: allow LTO_CLANG and THINLTO to be selected
>   x86, vdso: disable LTO only for vDSO
>   x86, relocs: Ignore L4_PAGE_OFFSET relocations
>   x86, build: allow LTO_CLANG and THINLTO to be selected
>
>  .gitignore                            |   1 +
>  Makefile                              |  65 ++++++-
>  arch/Kconfig                          |  67 +++++++
>  arch/arm64/Kconfig                    |   2 +
>  arch/arm64/Makefile                   |   1 +
>  arch/arm64/kernel/vdso/Makefile       |   4 +-
>  arch/arm64/kvm/hyp/nvhe/Makefile      |   4 +-
>  arch/x86/Kconfig                      |   3 +
>  arch/x86/Makefile                     |   5 +
>  arch/x86/boot/compressed/Makefile     |   2 +
>  arch/x86/boot/compressed/pgtable_64.c |   9 -
>  arch/x86/entry/vdso/Makefile          |   5 +-
>  arch/x86/include/asm/special_insns.h  |  28 +--
>  arch/x86/kernel/cpu/common.c          |   4 +-
>  arch/x86/tools/relocs.c               |   1 +
>  drivers/firmware/efi/libstub/Makefile |   2 +
>  drivers/misc/lkdtm/Makefile           |   1 +
>  drivers/ras/cec.c                     |   9 +-
>  include/asm-generic/vmlinux.lds.h     |  11 +-
>  include/linux/init.h                  |  79 +++++++-
>  include/linux/pci.h                   |  19 +-
>  kernel/trace/Kconfig                  |   5 +
>  lib/string.c                          |  24 +++
>  scripts/Makefile.build                |  55 +++++-
>  scripts/Makefile.lib                  |   6 +-
>  scripts/Makefile.modfinal             |  31 ++-
>  scripts/Makefile.modpost              |  26 ++-
>  scripts/generate_initcall_order.pl    | 270 ++++++++++++++++++++++++++
>  scripts/link-vmlinux.sh               |  94 ++++++++-
>  scripts/mod/Makefile                  |   1 +
>  scripts/mod/modpost.c                 |  16 +-
>  scripts/mod/modpost.h                 |   9 +
>  scripts/mod/sumversion.c              |   6 +-
>  scripts/module-lto.lds                |  26 +++
>  tools/objtool/builtin-check.c         |  13 +-
>  tools/objtool/builtin.h               |   2 +-
>  tools/objtool/check.c                 |  83 ++++++++
>  tools/objtool/check.h                 |   1 +
>  tools/objtool/objtool.h               |   1 +
>  39 files changed, 883 insertions(+), 108 deletions(-)
>  create mode 100755 scripts/generate_initcall_order.pl
>  create mode 100644 scripts/module-lto.lds
>
>
> base-commit: e28f0104343d0c132fa37f479870c9e43355fee4
> --
> 2.28.0.402.g5ffc5be6b7-goog
>
> --
> You received this message because you are subscribed to the Google Groups "Clang Built Linux" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to clang-built-linux+unsubscribe@googlegroups.com.
> To view this discussion on the web visit https://groups.google.com/d/msgid/clang-built-linux/20200903203053.3411268-1-samitolvanen%40google.com.
Peter Zijlstra Sept. 4, 2020, 8:55 a.m. UTC | #5
Please don't nest series!

Start a new thread for every posting.
Sedat Dilek Sept. 4, 2020, 9:08 a.m. UTC | #6
On Fri, Sep 4, 2020 at 10:55 AM <peterz@infradead.org> wrote:
>
>
> Please don't nest series!
>
> Start a new thread for every posting.
>

You are right Peter, my apologies.

- Sedat -
Masahiro Yamada Sept. 6, 2020, 12:24 a.m. UTC | #7
On Fri, Sep 4, 2020 at 5:30 AM Sami Tolvanen <samitolvanen@google.com> wrote:
>
> This patch series adds support for building x86_64 and arm64 kernels
> with Clang's Link Time Optimization (LTO).
>
> In addition to performance, the primary motivation for LTO is
> to allow Clang's Control-Flow Integrity (CFI) to be used in the
> kernel. Google has shipped millions of Pixel devices running three
> major kernel versions with LTO+CFI since 2018.
>
> Most of the patches are build system changes for handling LLVM
> bitcode, which Clang produces with LTO instead of ELF object files,
> postponing ELF processing until a later stage, and ensuring initcall
> ordering.
>
> Note that patches 1-4 are not directly related to LTO, but are
> needed to compile LTO kernels with ToT Clang, so I'm including them
> in the series for your convenience:
>
>  - Patches 1-3 are required for building the kernel with ToT Clang,
>    and IAS, and patch 4 is needed to build allmodconfig with LTO.
>
>  - Patches 3-4 are already in linux-next, but not yet in 5.9-rc.
>


I still do not understand how this patch set works.
(only me?)

Please let me ask fundamental questions.



I applied this series on top of Linus' tree,
and compiled for ARCH=arm64.

I compared the kernel size with/without LTO.



[1] No LTO  (arm64 defconfig, CONFIG_LTO_NONE)

$ llvm-size   vmlinux
   text    data     bss     dec     hex filename
15848692 10099449 493060 26441201 19375f1 vmlinux



[2] Clang LTO  (arm64 defconfig + CONFIG_LTO_CLANG)

$ llvm-size   vmlinux
   text    data     bss     dec     hex filename
15906864 10197445 490804 26595113 195cf29 vmlinux


I compared the size of raw binary, arch/arm64/boot/Image.
Its size increased too.



So, in my experiment, enabling CONFIG_LTO_CLANG
increases the kernel size.
Is this correct?


One more thing, could you teach me
how Clang LTO optimizes the code against
relocatable objects?



When I learned Clang LTO first, I read this document:
https://llvm.org/docs/LinkTimeOptimization.html

It is easy to confirm the final executable
does not contain foo2, foo3...



In contrast to userspace programs,
kernel modules are basically relocatable objects.

Does Clang drop unused symbols from relocatable objects?
If so, how?

I implemented an example module (see the attachment),
and checked the symbols.
Nothing was dropped.

The situation is the same for build-in
because LTO is run against vmlinux.o, which is
relocatable as well.


--
Best Regards

Masahiro Yamada
Sami Tolvanen Sept. 8, 2020, 11:46 p.m. UTC | #8
On Sun, Sep 06, 2020 at 09:24:38AM +0900, Masahiro Yamada wrote:
> On Fri, Sep 4, 2020 at 5:30 AM Sami Tolvanen <samitolvanen@google.com> wrote:
> >
> > This patch series adds support for building x86_64 and arm64 kernels
> > with Clang's Link Time Optimization (LTO).
> >
> > In addition to performance, the primary motivation for LTO is
> > to allow Clang's Control-Flow Integrity (CFI) to be used in the
> > kernel. Google has shipped millions of Pixel devices running three
> > major kernel versions with LTO+CFI since 2018.
> >
> > Most of the patches are build system changes for handling LLVM
> > bitcode, which Clang produces with LTO instead of ELF object files,
> > postponing ELF processing until a later stage, and ensuring initcall
> > ordering.
> >
> > Note that patches 1-4 are not directly related to LTO, but are
> > needed to compile LTO kernels with ToT Clang, so I'm including them
> > in the series for your convenience:
> >
> >  - Patches 1-3 are required for building the kernel with ToT Clang,
> >    and IAS, and patch 4 is needed to build allmodconfig with LTO.
> >
> >  - Patches 3-4 are already in linux-next, but not yet in 5.9-rc.
> >
> 
> 
> I still do not understand how this patch set works.
> (only me?)
> 
> Please let me ask fundamental questions.
> 
> 
> 
> I applied this series on top of Linus' tree,
> and compiled for ARCH=arm64.
> 
> I compared the kernel size with/without LTO.
> 
> 
> 
> [1] No LTO  (arm64 defconfig, CONFIG_LTO_NONE)
> 
> $ llvm-size   vmlinux
>    text    data     bss     dec     hex filename
> 15848692 10099449 493060 26441201 19375f1 vmlinux
> 
> 
> 
> [2] Clang LTO  (arm64 defconfig + CONFIG_LTO_CLANG)
> 
> $ llvm-size   vmlinux
>    text    data     bss     dec     hex filename
> 15906864 10197445 490804 26595113 195cf29 vmlinux
> 
> 
> I compared the size of raw binary, arch/arm64/boot/Image.
> Its size increased too.
> 
> 
> 
> So, in my experiment, enabling CONFIG_LTO_CLANG
> increases the kernel size.
> Is this correct?

Yes. LTO does produce larger binaries, mostly due to function
inlining between translation units, I believe. The compiler people
can probably give you a more detailed answer here. Without -mllvm
-import-instr-limit, the binaries would be even larger.

> One more thing, could you teach me
> how Clang LTO optimizes the code against
> relocatable objects?
> 
> 
> 
> When I learned Clang LTO first, I read this document:
> https://llvm.org/docs/LinkTimeOptimization.html
> 
> It is easy to confirm the final executable
> does not contain foo2, foo3...
> 
> 
> 
> In contrast to userspace programs,
> kernel modules are basically relocatable objects.
> 
> Does Clang drop unused symbols from relocatable objects?
> If so, how?

I don't think the compiler can legally drop global symbols from
relocatable objects, but it can rename and possibly even drop static
functions. This is why we need global wrappers for initcalls, for
example, to have stable symbol names.

Sami
Masahiro Yamada Sept. 10, 2020, 1:18 a.m. UTC | #9
On Wed, Sep 9, 2020 at 8:46 AM Sami Tolvanen <samitolvanen@google.com> wrote:
>
> On Sun, Sep 06, 2020 at 09:24:38AM +0900, Masahiro Yamada wrote:
> > On Fri, Sep 4, 2020 at 5:30 AM Sami Tolvanen <samitolvanen@google.com> wrote:
> > >
> > > This patch series adds support for building x86_64 and arm64 kernels
> > > with Clang's Link Time Optimization (LTO).
> > >
> > > In addition to performance, the primary motivation for LTO is
> > > to allow Clang's Control-Flow Integrity (CFI) to be used in the
> > > kernel. Google has shipped millions of Pixel devices running three
> > > major kernel versions with LTO+CFI since 2018.
> > >
> > > Most of the patches are build system changes for handling LLVM
> > > bitcode, which Clang produces with LTO instead of ELF object files,
> > > postponing ELF processing until a later stage, and ensuring initcall
> > > ordering.
> > >
> > > Note that patches 1-4 are not directly related to LTO, but are
> > > needed to compile LTO kernels with ToT Clang, so I'm including them
> > > in the series for your convenience:
> > >
> > >  - Patches 1-3 are required for building the kernel with ToT Clang,
> > >    and IAS, and patch 4 is needed to build allmodconfig with LTO.
> > >
> > >  - Patches 3-4 are already in linux-next, but not yet in 5.9-rc.
> > >
> >
> >
> > I still do not understand how this patch set works.
> > (only me?)
> >
> > Please let me ask fundamental questions.
> >
> >
> >
> > I applied this series on top of Linus' tree,
> > and compiled for ARCH=arm64.
> >
> > I compared the kernel size with/without LTO.
> >
> >
> >
> > [1] No LTO  (arm64 defconfig, CONFIG_LTO_NONE)
> >
> > $ llvm-size   vmlinux
> >    text    data     bss     dec     hex filename
> > 15848692 10099449 493060 26441201 19375f1 vmlinux
> >
> >
> >
> > [2] Clang LTO  (arm64 defconfig + CONFIG_LTO_CLANG)
> >
> > $ llvm-size   vmlinux
> >    text    data     bss     dec     hex filename
> > 15906864 10197445 490804 26595113 195cf29 vmlinux
> >
> >
> > I compared the size of raw binary, arch/arm64/boot/Image.
> > Its size increased too.
> >
> >
> >
> > So, in my experiment, enabling CONFIG_LTO_CLANG
> > increases the kernel size.
> > Is this correct?
>
> Yes. LTO does produce larger binaries, mostly due to function
> inlining between translation units, I believe. The compiler people
> can probably give you a more detailed answer here. Without -mllvm
> -import-instr-limit, the binaries would be even larger.
>
> > One more thing, could you teach me
> > how Clang LTO optimizes the code against
> > relocatable objects?
> >
> >
> >
> > When I learned Clang LTO first, I read this document:
> > https://llvm.org/docs/LinkTimeOptimization.html
> >
> > It is easy to confirm the final executable
> > does not contain foo2, foo3...
> >
> >
> >
> > In contrast to userspace programs,
> > kernel modules are basically relocatable objects.
> >
> > Does Clang drop unused symbols from relocatable objects?
> > If so, how?
>
> I don't think the compiler can legally drop global symbols from
> relocatable objects, but it can rename and possibly even drop static
> functions.


Compilers can drop static functions without LTO.
Rather, it is a compiler warning
(-Wunused-function), so the code should be cleaned up.



> This is why we need global wrappers for initcalls, for
> example, to have stable symbol names.
>
> Sami



At first, I thought the motivation of LTO
was to remove unused global symbols, and
to perform further optimization.


It is true for userspace programs.
In fact, the example of
https://llvm.org/docs/LinkTimeOptimization.html
produces a smaller binary.


In contrast, this patch set produces a bigger kernel
because LTO cannot remove any unused symbol.

So, I do not understand what the benefit is.


Is inlining beneficial?
I am not sure.


Documentation/process/coding-style.rst
"15) The inline disease"
mentions that inlining is not always
a good thing.


As a whole, I still do not understand
the motivation of this patch set.
Sami Tolvanen Sept. 10, 2020, 3:17 p.m. UTC | #10
On Thu, Sep 10, 2020 at 10:18:05AM +0900, Masahiro Yamada wrote:
> On Wed, Sep 9, 2020 at 8:46 AM Sami Tolvanen <samitolvanen@google.com> wrote:
> >
> > On Sun, Sep 06, 2020 at 09:24:38AM +0900, Masahiro Yamada wrote:
> > > On Fri, Sep 4, 2020 at 5:30 AM Sami Tolvanen <samitolvanen@google.com> wrote:
> > > >
> > > > This patch series adds support for building x86_64 and arm64 kernels
> > > > with Clang's Link Time Optimization (LTO).
> > > >
> > > > In addition to performance, the primary motivation for LTO is
> > > > to allow Clang's Control-Flow Integrity (CFI) to be used in the
> > > > kernel. Google has shipped millions of Pixel devices running three
> > > > major kernel versions with LTO+CFI since 2018.
> > > >
> > > > Most of the patches are build system changes for handling LLVM
> > > > bitcode, which Clang produces with LTO instead of ELF object files,
> > > > postponing ELF processing until a later stage, and ensuring initcall
> > > > ordering.
> > > >
> > > > Note that patches 1-4 are not directly related to LTO, but are
> > > > needed to compile LTO kernels with ToT Clang, so I'm including them
> > > > in the series for your convenience:
> > > >
> > > >  - Patches 1-3 are required for building the kernel with ToT Clang,
> > > >    and IAS, and patch 4 is needed to build allmodconfig with LTO.
> > > >
> > > >  - Patches 3-4 are already in linux-next, but not yet in 5.9-rc.
> > > >
> > >
> > >
> > > I still do not understand how this patch set works.
> > > (only me?)
> > >
> > > Please let me ask fundamental questions.
> > >
> > >
> > >
> > > I applied this series on top of Linus' tree,
> > > and compiled for ARCH=arm64.
> > >
> > > I compared the kernel size with/without LTO.
> > >
> > >
> > >
> > > [1] No LTO  (arm64 defconfig, CONFIG_LTO_NONE)
> > >
> > > $ llvm-size   vmlinux
> > >    text    data     bss     dec     hex filename
> > > 15848692 10099449 493060 26441201 19375f1 vmlinux
> > >
> > >
> > >
> > > [2] Clang LTO  (arm64 defconfig + CONFIG_LTO_CLANG)
> > >
> > > $ llvm-size   vmlinux
> > >    text    data     bss     dec     hex filename
> > > 15906864 10197445 490804 26595113 195cf29 vmlinux
> > >
> > >
> > > I compared the size of raw binary, arch/arm64/boot/Image.
> > > Its size increased too.
> > >
> > >
> > >
> > > So, in my experiment, enabling CONFIG_LTO_CLANG
> > > increases the kernel size.
> > > Is this correct?
> >
> > Yes. LTO does produce larger binaries, mostly due to function
> > inlining between translation units, I believe. The compiler people
> > can probably give you a more detailed answer here. Without -mllvm
> > -import-instr-limit, the binaries would be even larger.
> >
> > > One more thing, could you teach me
> > > how Clang LTO optimizes the code against
> > > relocatable objects?
> > >
> > >
> > >
> > > When I learned Clang LTO first, I read this document:
> > > https://llvm.org/docs/LinkTimeOptimization.html
> > >
> > > It is easy to confirm the final executable
> > > does not contain foo2, foo3...
> > >
> > >
> > >
> > > In contrast to userspace programs,
> > > kernel modules are basically relocatable objects.
> > >
> > > Does Clang drop unused symbols from relocatable objects?
> > > If so, how?
> >
> > I don't think the compiler can legally drop global symbols from
> > relocatable objects, but it can rename and possibly even drop static
> > functions.
> 
> 
> Compilers can drop static functions without LTO.
> Rather, it is a compiler warning
> (-Wunused-function), so the code should be cleaned up.
> 
> 
> 
> > This is why we need global wrappers for initcalls, for
> > example, to have stable symbol names.
> >
> > Sami
> 
> 
> 
> At first, I thought the motivation of LTO
> was to remove unused global symbols, and
> to perform further optimization.
> 
> 
> It is true for userspace programs.
> In fact, the example of
> https://llvm.org/docs/LinkTimeOptimization.html
> produces a smaller binary.
> 
> 
> In contrast, this patch set produces a bigger kernel
> because LTO cannot remove any unused symbol.
> 
> So, I do not understand what the benefit is.
> 
> 
> Is inlining beneficial?
> I am not sure.
> 
> 
> Documentation/process/coding-style.rst
> "15) The inline disease"
> mentions that inlining is not always
> a good thing.
> 
> 
> As a whole, I still do not understand
> the motivation of this patch set.

Clang produces faster code with LTO even if unused functions are not
removed, and I'm not sure how many unused globals there really are in
the kernel that aren't exported for modules. However, as I mentioned in
the cover letter, we also need LTO for Control-Flow Integrity (CFI),
which we have used in Pixel kernels for a couple of years now, and plan
to use in more Android devices in future:

  https://clang.llvm.org/docs/ControlFlowIntegrity.html

Sami
Kees Cook Sept. 10, 2020, 6:18 p.m. UTC | #11
On Thu, Sep 10, 2020 at 10:18:05AM +0900, Masahiro Yamada wrote:
> On Wed, Sep 9, 2020 at 8:46 AM Sami Tolvanen <samitolvanen@google.com> wrote:
> >
> > On Sun, Sep 06, 2020 at 09:24:38AM +0900, Masahiro Yamada wrote:
> > > On Fri, Sep 4, 2020 at 5:30 AM Sami Tolvanen <samitolvanen@google.com> wrote:
> > > >
> > > > This patch series adds support for building x86_64 and arm64 kernels
> > > > with Clang's Link Time Optimization (LTO).
> > > [...]
> > > One more thing, could you teach me
> > > how Clang LTO optimizes the code against
> > > relocatable objects?
> > >
> > > When I learned Clang LTO first, I read this document:
> > > https://llvm.org/docs/LinkTimeOptimization.html
> > >
> > > It is easy to confirm the final executable
> > > does not contain foo2, foo3...
> > >
> > > In contrast to userspace programs,
> > > kernel modules are basically relocatable objects.
> > >
> > > Does Clang drop unused symbols from relocatable objects?
> > > If so, how?
> >
> > I don't think the compiler can legally drop global symbols from
> > relocatable objects, but it can rename and possibly even drop static
> > functions.
> 
> Compilers can drop static functions without LTO.
> Rather, it is a compiler warning
> (-Wunused-function), so the code should be cleaned up.

Right -- I think you're both saying the same thing. Unused static
functions can be dropped (modulo a warning) in both regular and LTO
builds.

> At first, I thought the motivation of LTO
> was to remove unused global symbols, and
> to perform further optimization.

One of LTO's benefits is the performance optimizations, but that's not
the driving motivation for it here. The performance optimizations are
possible because LTO provides the compiler with a view of the entire
built-in portion of the kernel (i.e. not shared objects). That "visible
all at once" state is the central concern because CFI (Control Flow
Integrity, the driving motivation for this series) needs it in the same
way that the performance optimization passes need it.

i.e. to gain CFI coverage, LTO is required. Since LTO is a distinct
first step independent of CFI, it was split out to be upstreamed while
fixes for CFI continued to land independently[1]. Once LTO is landed,
CFI comes next.

> In contrast, this patch set produces a bigger kernel
> because LTO cannot remove any unused symbol.
> 
> So, I do not understand what the benefit is.
> 
> Is inlining beneficial?
> I am not sure.

This is just a side-effect of LTO. As Sami mentions, it's entirely
tunable, and that tuning was chosen based on measurements made for the
kernel being built with LTO[2].

> As a whole, I still do not understand
> the motivation of this patch set.

It is a prerequisite for CFI, and CFI has been protecting *mumble*billion
Android device kernels against code-reuse attacks for the last 2ish
years[3]. I want this available for the entire Linux ecosystem, not just
Android; it is a strong security flaw mitigation technique.

I hope that helps explain it!

-Kees


[1] for example, these are some:
    https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/log/?qt=grep&q=Control+Flow+Integrity

[2] https://lore.kernel.org/lkml/20200624203200.78870-1-samitolvanen@google.com/T/#m6b576c3af79bdacada10f21651a2b02d33a4e32e

[3] https://android-developers.googleblog.com/2018/10/control-flow-integrity-in-android-kernel.html