
[00/22] add support for Clang LTO

Message ID: 20200624203200.78870-1-samitolvanen@google.com

Message

Sami Tolvanen June 24, 2020, 8:31 p.m. UTC
This patch series adds support for building x86_64 and arm64 kernels
with Clang's Link Time Optimization (LTO).

In addition to performance, the primary motivation for LTO is to allow
Clang's Control-Flow Integrity (CFI) to be used in the kernel. Google's
Pixel devices have shipped with LTO+CFI kernels since 2018.

Most of the patches are build system changes for handling LLVM bitcode,
which Clang produces with LTO instead of ELF object files, postponing
ELF processing until a later stage, and ensuring initcall ordering.

Note that the first objtool patch in the series is already in
linux-next, but as it's needed with LTO, I'm including it here as well
to make testing easier.

Sami Tolvanen (22):
  objtool: use sh_info to find the base for .rela sections
  kbuild: add support for Clang LTO
  kbuild: lto: fix module versioning
  kbuild: lto: fix recordmcount
  kbuild: lto: postpone objtool
  kbuild: lto: limit inlining
  kbuild: lto: merge module sections
  kbuild: lto: remove duplicate dependencies from .mod files
  init: lto: ensure initcall ordering
  init: lto: fix PREL32 relocations
  pci: lto: fix PREL32 relocations
  modpost: lto: strip .lto from module names
  scripts/mod: disable LTO for empty.c
  efi/libstub: disable LTO
  drivers/misc/lkdtm: disable LTO for rodata.o
  arm64: export CC_USING_PATCHABLE_FUNCTION_ENTRY
  arm64: vdso: disable LTO
  arm64: allow LTO_CLANG and THINLTO to be selected
  x86, vdso: disable LTO only for vDSO
  x86, ftrace: disable recordmcount for ftrace_make_nop
  x86, relocs: Ignore L4_PAGE_OFFSET relocations
  x86, build: allow LTO_CLANG and THINLTO to be selected

 .gitignore                            |   1 +
 Makefile                              |  27 ++-
 arch/Kconfig                          |  65 +++++++
 arch/arm64/Kconfig                    |   2 +
 arch/arm64/Makefile                   |   1 +
 arch/arm64/kernel/vdso/Makefile       |   4 +-
 arch/x86/Kconfig                      |   2 +
 arch/x86/Makefile                     |   5 +
 arch/x86/entry/vdso/Makefile          |   5 +-
 arch/x86/kernel/ftrace.c              |   1 +
 arch/x86/tools/relocs.c               |   1 +
 drivers/firmware/efi/libstub/Makefile |   2 +
 drivers/misc/lkdtm/Makefile           |   1 +
 include/asm-generic/vmlinux.lds.h     |  12 +-
 include/linux/compiler-clang.h        |   4 +
 include/linux/compiler.h              |   2 +-
 include/linux/compiler_types.h        |   4 +
 include/linux/init.h                  |  78 +++++++-
 include/linux/pci.h                   |  15 +-
 kernel/trace/ftrace.c                 |   1 +
 lib/Kconfig.debug                     |   2 +-
 scripts/Makefile.build                |  55 +++++-
 scripts/Makefile.lib                  |   6 +-
 scripts/Makefile.modfinal             |  40 +++-
 scripts/Makefile.modpost              |  26 ++-
 scripts/generate_initcall_order.pl    | 270 ++++++++++++++++++++++++++
 scripts/link-vmlinux.sh               | 100 +++++++++-
 scripts/mod/Makefile                  |   1 +
 scripts/mod/modpost.c                 |  16 +-
 scripts/mod/modpost.h                 |   9 +
 scripts/mod/sumversion.c              |   6 +-
 scripts/module-lto.lds                |  26 +++
 scripts/recordmcount.c                |   3 +-
 tools/objtool/elf.c                   |   2 +-
 34 files changed, 737 insertions(+), 58 deletions(-)
 create mode 100755 scripts/generate_initcall_order.pl
 create mode 100644 scripts/module-lto.lds


base-commit: 26e122e97a3d0390ebec389347f64f3730fdf48f

Comments

Peter Zijlstra June 24, 2020, 9:15 p.m. UTC | #1
On Wed, Jun 24, 2020 at 01:31:38PM -0700, Sami Tolvanen wrote:
> This patch series adds support for building x86_64 and arm64 kernels
> with Clang's Link Time Optimization (LTO).
> 
> In addition to performance, the primary motivation for LTO is to allow
> Clang's Control-Flow Integrity (CFI) to be used in the kernel. Google's
> Pixel devices have shipped with LTO+CFI kernels since 2018.
> 
> Most of the patches are build system changes for handling LLVM bitcode,
> which Clang produces with LTO instead of ELF object files, postponing
> ELF processing until a later stage, and ensuring initcall ordering.
> 
> Note that the first objtool patch in the series is already in
> linux-next, but as it's needed with LTO, I'm including it here as well
> to make testing easier.

I'm very sad that yet again, memory ordering isn't addressed. LTO vastly
increases the range of the optimizer to wreck things.
Sami Tolvanen June 24, 2020, 9:30 p.m. UTC | #2
On Wed, Jun 24, 2020 at 11:15:40PM +0200, Peter Zijlstra wrote:
> On Wed, Jun 24, 2020 at 01:31:38PM -0700, Sami Tolvanen wrote:
> > This patch series adds support for building x86_64 and arm64 kernels
> > with Clang's Link Time Optimization (LTO).
> > 
> > In addition to performance, the primary motivation for LTO is to allow
> > Clang's Control-Flow Integrity (CFI) to be used in the kernel. Google's
> > Pixel devices have shipped with LTO+CFI kernels since 2018.
> > 
> > Most of the patches are build system changes for handling LLVM bitcode,
> > which Clang produces with LTO instead of ELF object files, postponing
> > ELF processing until a later stage, and ensuring initcall ordering.
> > 
> > Note that the first objtool patch in the series is already in
> > linux-next, but as it's needed with LTO, I'm including it here as well
> > to make testing easier.
> 
> I'm very sad that yet again, memory ordering isn't addressed. LTO vastly
> increases the range of the optimizer to wreck things.

I believe Will has some thoughts about this, and patches, but I'll let
him talk about it.

Sami
Nick Desaulniers June 24, 2020, 9:31 p.m. UTC | #3
On Wed, Jun 24, 2020 at 2:15 PM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Wed, Jun 24, 2020 at 01:31:38PM -0700, Sami Tolvanen wrote:
> > This patch series adds support for building x86_64 and arm64 kernels
> > with Clang's Link Time Optimization (LTO).
> >
> > In addition to performance, the primary motivation for LTO is to allow
> > Clang's Control-Flow Integrity (CFI) to be used in the kernel. Google's
> > Pixel devices have shipped with LTO+CFI kernels since 2018.
> >
> > Most of the patches are build system changes for handling LLVM bitcode,
> > which Clang produces with LTO instead of ELF object files, postponing
> > ELF processing until a later stage, and ensuring initcall ordering.
> >
> > Note that the first objtool patch in the series is already in
> > linux-next, but as it's needed with LTO, I'm including it here as well
> > to make testing easier.
>
> I'm very sad that yet again, memory ordering isn't addressed. LTO vastly
> increases the range of the optimizer to wreck things.

Hi Peter, could you expand on the issue for the folks on the thread?
I'm happy to try to hack something up in LLVM if we check that X does
or does not happen; maybe we can even come up with some concrete test
cases that can be added to LLVM's codebase?
Peter Zijlstra June 25, 2020, 8:03 a.m. UTC | #4
On Wed, Jun 24, 2020 at 02:31:36PM -0700, Nick Desaulniers wrote:
> On Wed, Jun 24, 2020 at 2:15 PM Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > On Wed, Jun 24, 2020 at 01:31:38PM -0700, Sami Tolvanen wrote:
> > > This patch series adds support for building x86_64 and arm64 kernels
> > > with Clang's Link Time Optimization (LTO).
> > >
> > > In addition to performance, the primary motivation for LTO is to allow
> > > Clang's Control-Flow Integrity (CFI) to be used in the kernel. Google's
> > > Pixel devices have shipped with LTO+CFI kernels since 2018.
> > >
> > > Most of the patches are build system changes for handling LLVM bitcode,
> > > which Clang produces with LTO instead of ELF object files, postponing
> > > ELF processing until a later stage, and ensuring initcall ordering.
> > >
> > > Note that the first objtool patch in the series is already in
> > > linux-next, but as it's needed with LTO, I'm including it here as well
> > > to make testing easier.
> >
> > I'm very sad that yet again, memory ordering isn't addressed. LTO vastly
> > increases the range of the optimizer to wreck things.
> 
> Hi Peter, could you expand on the issue for the folks on the thread?
> I'm happy to try to hack something up in LLVM if we check that X does
> or does not happen; maybe we can even come up with some concrete test
> cases that can be added to LLVM's codebase?

I'm sure Will will respond, but the basic issue is the trainwreck C11
made of dependent loads.

Anyway, here's a link to the last time this came up:

  https://lore.kernel.org/linux-arm-kernel/20171116174830.GX3624@linux.vnet.ibm.com/
Peter Zijlstra June 25, 2020, 8:24 a.m. UTC | #5
On Thu, Jun 25, 2020 at 10:03:13AM +0200, Peter Zijlstra wrote:
> On Wed, Jun 24, 2020 at 02:31:36PM -0700, Nick Desaulniers wrote:
> > On Wed, Jun 24, 2020 at 2:15 PM Peter Zijlstra <peterz@infradead.org> wrote:
> > >
> > > On Wed, Jun 24, 2020 at 01:31:38PM -0700, Sami Tolvanen wrote:
> > > > This patch series adds support for building x86_64 and arm64 kernels
> > > > with Clang's Link Time Optimization (LTO).
> > > >
> > > > In addition to performance, the primary motivation for LTO is to allow
> > > > Clang's Control-Flow Integrity (CFI) to be used in the kernel. Google's
> > > > Pixel devices have shipped with LTO+CFI kernels since 2018.
> > > >
> > > > Most of the patches are build system changes for handling LLVM bitcode,
> > > > which Clang produces with LTO instead of ELF object files, postponing
> > > > ELF processing until a later stage, and ensuring initcall ordering.
> > > >
> > > > Note that the first objtool patch in the series is already in
> > > > linux-next, but as it's needed with LTO, I'm including it here as well
> > > > to make testing easier.
> > >
> > > I'm very sad that yet again, memory ordering isn't addressed. LTO vastly
> > > increases the range of the optimizer to wreck things.
> > 
> > Hi Peter, could you expand on the issue for the folks on the thread?
> > I'm happy to try to hack something up in LLVM if we check that X does
> > or does not happen; maybe we can even come up with some concrete test
> > cases that can be added to LLVM's codebase?
> 
> I'm sure Will will respond, but the basic issue is the trainwreck C11
> made of dependent loads.
> 
> Anyway, here's a link to the last time this came up:
> 
>   https://lore.kernel.org/linux-arm-kernel/20171116174830.GX3624@linux.vnet.ibm.com/

Another good read:

  https://lore.kernel.org/lkml/20150520005510.GA23559@linux.vnet.ibm.com/

and having (partially) re-read that, I now worry intensely about things
like latch_tree_find(), cyc2ns_read_begin(), __ktime_get_fast_ns().

It looks like kernel/time/sched_clock.c uses raw_read_seqcount() which
deviates from the above patterns by, for some reason, using a primitive
that includes an extra smp_rmb().
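
For reference, the two read-side primitives differ roughly like this (a
trimmed sketch of include/linux/seqlock.h from around this time, not
the verbatim source):

	static inline unsigned raw_read_seqcount(const seqcount_t *s)
	{
		unsigned ret = READ_ONCE(s->sequence);
		smp_rmb();	/* order the sequence load before later loads */
		return ret;
	}

	static inline int raw_read_seqcount_latch(seqcount_t *s)
	{
		/* no barrier: ordering relies on the address dependency */
		return READ_ONCE(s->sequence);
	}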

And those are just the few things I could remember off the top of my head;
who knows what else is out there.
Will Deacon June 25, 2020, 8:27 a.m. UTC | #6
On Wed, Jun 24, 2020 at 02:30:14PM -0700, Sami Tolvanen wrote:
> On Wed, Jun 24, 2020 at 11:15:40PM +0200, Peter Zijlstra wrote:
> > On Wed, Jun 24, 2020 at 01:31:38PM -0700, Sami Tolvanen wrote:
> > > This patch series adds support for building x86_64 and arm64 kernels
> > > with Clang's Link Time Optimization (LTO).
> > > 
> > > In addition to performance, the primary motivation for LTO is to allow
> > > Clang's Control-Flow Integrity (CFI) to be used in the kernel. Google's
> > > Pixel devices have shipped with LTO+CFI kernels since 2018.
> > > 
> > > Most of the patches are build system changes for handling LLVM bitcode,
> > > which Clang produces with LTO instead of ELF object files, postponing
> > > ELF processing until a later stage, and ensuring initcall ordering.
> > > 
> > > Note that the first objtool patch in the series is already in
> > > linux-next, but as it's needed with LTO, I'm including it here as well
> > > to make testing easier.
> > 
> > I'm very sad that yet again, memory ordering isn't addressed. LTO vastly
> > increases the range of the optimizer to wreck things.
> 
> I believe Will has some thoughts about this, and patches, but I'll let
> him talk about it.

Thanks for reminding me! I will get patches out ASAP (I've been avoiding the
rebase from hell, but this is the motivation I need).

Will
Peter Zijlstra June 25, 2020, 8:57 a.m. UTC | #7
On Thu, Jun 25, 2020 at 10:24:33AM +0200, Peter Zijlstra wrote:
> On Thu, Jun 25, 2020 at 10:03:13AM +0200, Peter Zijlstra wrote:

> > I'm sure Will will respond, but the basic issue is the trainwreck C11
> > made of dependent loads.
> > 
> > Anyway, here's a link to the last time this came up:
> > 
> >   https://lore.kernel.org/linux-arm-kernel/20171116174830.GX3624@linux.vnet.ibm.com/
> 
> Another good read:
> 
>   https://lore.kernel.org/lkml/20150520005510.GA23559@linux.vnet.ibm.com/
> 
> and having (partially) re-read that, I now worry intensely about things
> like latch_tree_find(), cyc2ns_read_begin(), __ktime_get_fast_ns().
> 
> It looks like kernel/time/sched_clock.c uses raw_read_seqcount() which
> deviates from the above patterns by, for some reason, using a primitive
> that includes an extra smp_rmb().
> 
> And those are just the few things I could remember off the top of my head;
> who knows what else is out there.

As an example, let us consider __ktime_get_fast_ns(); the critical bit
is:

		seq = raw_read_seqcount_latch(&tkf->seq);
		tkr = tkf->base + (seq & 0x01);
		now = tkr->base;

And we hard rely on that being a dependent load, so:

  LOAD	seq, (tkf->seq)
  LOAD  tkr, tkf->base
  AND   seq, 1
  MUL   seq, sizeof(tk_read_base)
  ADD	tkr, seq
  LOAD  now, (tkr->base)

Such that we obtain 'now' as a direct dependency on 'seq'. This ensures
the loads are ordered.

A compiler can wreck this by translating it into something like:

  LOAD	seq, (tkf->seq)
  LOAD  tkr, tkf->base
  AND   seq, 1
  CMP	seq, 0
  JE	1f
  ADD	tkr, sizeof(tk_read_base)
1:
  LOAD  now, (tkr->base)

Because now the machine can speculate and load 'now' before 'seq',
breaking the ordering.
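
In C terms, the same worry looks like this (a hypothetical sketch of
compiler output, with the timekeeping structures simplified):

	/*
	 * What the source expresses: the tkr->base load carries an
	 * address dependency on seq.
	 */
	seq = raw_read_seqcount_latch(&tkf->seq);
	tkr = tkf->base + (seq & 0x01);
	now = tkr->base;

	/*
	 * What a value-aware compiler could turn it into: the address
	 * dependency becomes a control dependency, and the CPU may
	 * speculate the ->base load before the seq load is satisfied.
	 */
	seq = raw_read_seqcount_latch(&tkf->seq);
	if (seq & 0x01)
		now = tkf->base[1].base;
	else
		now = tkf->base[0].base;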
Masahiro Yamada June 28, 2020, 4:56 p.m. UTC | #8
On Thu, Jun 25, 2020 at 5:32 AM 'Sami Tolvanen' via Clang Built Linux
<clang-built-linux@googlegroups.com> wrote:
>
> This patch series adds support for building x86_64 and arm64 kernels
> with Clang's Link Time Optimization (LTO).
>
> In addition to performance, the primary motivation for LTO is to allow
> Clang's Control-Flow Integrity (CFI) to be used in the kernel. Google's
> Pixel devices have shipped with LTO+CFI kernels since 2018.
>
> Most of the patches are build system changes for handling LLVM bitcode,
> which Clang produces with LTO instead of ELF object files, postponing
> ELF processing until a later stage, and ensuring initcall ordering.
>
> Note that the first objtool patch in the series is already in
> linux-next, but as it's needed with LTO, I'm including it here as well
> to make testing easier.


I put this series on a testing branch, and the 0-day bot started
reporting some issues.

(But the 0-day bot is quieter than I expected. Perhaps it does not
turn on LLVM=1?)

I also got an error for ARCH=arm64 allyesconfig + CONFIG_LTO_CLANG=y:

$ make ARCH=arm64 LLVM=1 LLVM_IAS=1 \
      CROSS_COMPILE=~/tools/aarch64-linaro-7.5/bin/aarch64-linux-gnu- -j24

  ...

  GEN     .version
  CHK     include/generated/compile.h
  UPD     include/generated/compile.h
  CC      init/version.o
  AR      init/built-in.a
  GEN     .tmp_initcalls.lds
  GEN     .tmp_symversions.lds
  LTO     vmlinux.o
  MODPOST vmlinux.symvers
  MODINFO modules.builtin.modinfo
  GEN     modules.builtin
  LD      .tmp_vmlinux.kallsyms1
ld.lld: error: undefined symbol: __compiletime_assert_905
>>> referenced by irqbypass.c
>>>               vmlinux.o:(jeq_imm)
make: *** [Makefile:1161: vmlinux] Error 1

> Sami Tolvanen (22):
>   objtool: use sh_info to find the base for .rela sections
>   kbuild: add support for Clang LTO
>   kbuild: lto: fix module versioning
>   kbuild: lto: fix recordmcount
>   kbuild: lto: postpone objtool
>   kbuild: lto: limit inlining
>   kbuild: lto: merge module sections
>   kbuild: lto: remove duplicate dependencies from .mod files
>   init: lto: ensure initcall ordering
>   init: lto: fix PREL32 relocations
>   pci: lto: fix PREL32 relocations
>   modpost: lto: strip .lto from module names
>   scripts/mod: disable LTO for empty.c
>   efi/libstub: disable LTO
>   drivers/misc/lkdtm: disable LTO for rodata.o
>   arm64: export CC_USING_PATCHABLE_FUNCTION_ENTRY
>   arm64: vdso: disable LTO
>   arm64: allow LTO_CLANG and THINLTO to be selected
>   x86, vdso: disable LTO only for vDSO
>   x86, ftrace: disable recordmcount for ftrace_make_nop
>   x86, relocs: Ignore L4_PAGE_OFFSET relocations
>   x86, build: allow LTO_CLANG and THINLTO to be selected
>
>  .gitignore                            |   1 +
>  Makefile                              |  27 ++-
>  arch/Kconfig                          |  65 +++++++
>  arch/arm64/Kconfig                    |   2 +
>  arch/arm64/Makefile                   |   1 +
>  arch/arm64/kernel/vdso/Makefile       |   4 +-
>  arch/x86/Kconfig                      |   2 +
>  arch/x86/Makefile                     |   5 +
>  arch/x86/entry/vdso/Makefile          |   5 +-
>  arch/x86/kernel/ftrace.c              |   1 +
>  arch/x86/tools/relocs.c               |   1 +
>  drivers/firmware/efi/libstub/Makefile |   2 +
>  drivers/misc/lkdtm/Makefile           |   1 +
>  include/asm-generic/vmlinux.lds.h     |  12 +-
>  include/linux/compiler-clang.h        |   4 +
>  include/linux/compiler.h              |   2 +-
>  include/linux/compiler_types.h        |   4 +
>  include/linux/init.h                  |  78 +++++++-
>  include/linux/pci.h                   |  15 +-
>  kernel/trace/ftrace.c                 |   1 +
>  lib/Kconfig.debug                     |   2 +-
>  scripts/Makefile.build                |  55 +++++-
>  scripts/Makefile.lib                  |   6 +-
>  scripts/Makefile.modfinal             |  40 +++-
>  scripts/Makefile.modpost              |  26 ++-
>  scripts/generate_initcall_order.pl    | 270 ++++++++++++++++++++++++++
>  scripts/link-vmlinux.sh               | 100 +++++++++-
>  scripts/mod/Makefile                  |   1 +
>  scripts/mod/modpost.c                 |  16 +-
>  scripts/mod/modpost.h                 |   9 +
>  scripts/mod/sumversion.c              |   6 +-
>  scripts/module-lto.lds                |  26 +++
>  scripts/recordmcount.c                |   3 +-
>  tools/objtool/elf.c                   |   2 +-
>  34 files changed, 737 insertions(+), 58 deletions(-)
>  create mode 100755 scripts/generate_initcall_order.pl
>  create mode 100644 scripts/module-lto.lds
>
>
> base-commit: 26e122e97a3d0390ebec389347f64f3730fdf48f
> --
> 2.27.0.212.ge8ba1cc988-goog
>



--
Best Regards
Masahiro Yamada
Sami Tolvanen June 29, 2020, 11:20 p.m. UTC | #9
Hi Masahiro,

On Mon, Jun 29, 2020 at 01:56:19AM +0900, Masahiro Yamada wrote:
> On Thu, Jun 25, 2020 at 5:32 AM 'Sami Tolvanen' via Clang Built Linux
> <clang-built-linux@googlegroups.com> wrote:
> >
> > This patch series adds support for building x86_64 and arm64 kernels
> > with Clang's Link Time Optimization (LTO).
> >
> > In addition to performance, the primary motivation for LTO is to allow
> > Clang's Control-Flow Integrity (CFI) to be used in the kernel. Google's
> > Pixel devices have shipped with LTO+CFI kernels since 2018.
> >
> > Most of the patches are build system changes for handling LLVM bitcode,
> > which Clang produces with LTO instead of ELF object files, postponing
> > ELF processing until a later stage, and ensuring initcall ordering.
> >
> > Note that the first objtool patch in the series is already in
> > linux-next, but as it's needed with LTO, I'm including it here as well
> > to make testing easier.
> 
> 
> I put this series on a testing branch, and the 0-day bot started
> reporting some issues.

Yes, I'll fix those issues in v2.

> (But the 0-day bot is quieter than I expected. Perhaps it does not
> turn on LLVM=1?)

In order for it to test an LTO build, though, it would need to enable
LTO_CLANG explicitly, in addition to LLVM=1.

> I also got an error for
> ARCH=arm64 allyesconfig + CONFIG_LTO_CLANG=y
> 
> 
> 
> $ make ARCH=arm64 LLVM=1 LLVM_IAS=1 \
>       CROSS_COMPILE=~/tools/aarch64-linaro-7.5/bin/aarch64-linux-gnu- -j24
> 
>   ...
> 
>   GEN     .version
>   CHK     include/generated/compile.h
>   UPD     include/generated/compile.h
>   CC      init/version.o
>   AR      init/built-in.a
>   GEN     .tmp_initcalls.lds
>   GEN     .tmp_symversions.lds
>   LTO     vmlinux.o
>   MODPOST vmlinux.symvers
>   MODINFO modules.builtin.modinfo
>   GEN     modules.builtin
>   LD      .tmp_vmlinux.kallsyms1
> ld.lld: error: undefined symbol: __compiletime_assert_905
> >>> referenced by irqbypass.c
> >>>               vmlinux.o:(jeq_imm)
> make: *** [Makefile:1161: vmlinux] Error 1

I can reproduce this with ToT LLVM; it's the BUILD_BUG_ON_MSG(..., "value
too large for the field") in drivers/net/ethernet/netronome/nfp/bpf/jit.c --
specifically, the FIELD_FIT / __BF_FIELD_CHECK macro in ur_load_imm_any.

This compiles just fine with an earlier LLVM revision, so it could be a
relatively recent regression. I'll take a look. Thanks for catching this!
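
For anyone puzzled by the __compiletime_assert_905 symbol: the
BUILD_BUG_ON_MSG() machinery emits a call to a deliberately undefined
function, and the build only links if the optimizer can prove the
condition and delete the call. Simplified from include/linux/compiler.h:

	#define __compiletime_assert(condition, msg, prefix, suffix)	\
		do {							\
			extern void prefix ## suffix(void)		\
				__compiletime_error(msg);		\
			if (!(condition))				\
				prefix ## suffix();			\
		} while (0)

	/*
	 * If the compiler proves 'condition', the call is dead code and
	 * vanishes. If it cannot -- as apparently happens here after the
	 * LLVM change -- the call survives into vmlinux.o and the linker
	 * reports the undefined __compiletime_assert_<line> symbol.
	 */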

Sami
Marco Elver June 30, 2020, 7:19 p.m. UTC | #10
I was asked for input on this, and after a few days digging through some
history, thought I'd comment. Hope you don't mind.

On Thu, Jun 25, 2020 at 10:57AM +0200, Peter Zijlstra wrote:
> On Thu, Jun 25, 2020 at 10:24:33AM +0200, Peter Zijlstra wrote:
> > On Thu, Jun 25, 2020 at 10:03:13AM +0200, Peter Zijlstra wrote:
> > > I'm sure Will will respond, but the basic issue is the trainwreck C11
> > > made of dependent loads.
> > > 
> > > Anyway, here's a link to the last time this came up:
> > > 
> > >   https://lore.kernel.org/linux-arm-kernel/20171116174830.GX3624@linux.vnet.ibm.com/
> > 
> > Another good read:
> > 
> >   https://lore.kernel.org/lkml/20150520005510.GA23559@linux.vnet.ibm.com/
[...]
> Because now the machine can speculate and load 'now' before 'seq',
> breaking the ordering.

First of all, I agree with the concerns, but not because of LTO.

To set the stage better, and summarize the fundamental problem again:
we're in the unfortunate situation that no compiler today has a way to
_efficiently_ deal with C11's memory_order_consume
[https://lwn.net/Articles/588300/]. If we did, we could just use that
and be done with it. But, sadly, that doesn't seem possible right now --
compilers just say consume==acquire. Will suggests doing the same in the
kernel: https://lkml.kernel.org/r/20200630173734.14057-19-will@kernel.org
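
At the use site, the shape of that suggestion is roughly the following
(a conceptual sketch only, borrowing Peter's earlier
__ktime_get_fast_ns() example; see the linked patch for what is
actually proposed):

	/* Today: a plain dependent load heads the address dependency. */
	seq = raw_read_seqcount_latch(&tkf->seq);
	tkr = tkf->base + (seq & 0x01);

	/*
	 * Suggestion: give the head of the chain acquire semantics, so
	 * the ordering no longer depends on the compiler preserving
	 * the dependency.
	 */
	seq = smp_load_acquire(&tkf->seq.sequence);
	tkr = tkf->base + (seq & 0x01);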

What we're most worried about right now is the existence of compiler
transformations that could break data dependencies by e.g. turning them
into control dependencies.

If this is a real worry, I don't think LTO is the magical feature that
will uncover those optimizations. If these compiler transformations are
real, they also exist in a normal build! And if we are worried about
them, we need to stop relying on dependent load ordering across the
board; or switch to -O0 for everything. Clearly, we don't want either.

Why do we think LTO is special?

With LTO, Clang just emits LLVM bitcode instead of ELF objects, and
during the linker stage intermodular optimizations across translation
unit boundaries are done that might not be possible otherwise
[https://llvm.org/docs/LinkTimeOptimization.html]. From the memory model
side of things, if we could fully convey our intent to the compiler (the
imaginary consume), there would be no problem, because all optimization
stages from bitcode generation to the final machine code generation
after LTO know about the intended semantics. (Also, keep in mind that
LTO is _not_ doing post-link optimization of machine code binaries!)

But as far as we can tell, there is no evidence of the dreaded "data
dependency to control dependency" conversion with LTO that isn't there
in non-LTO builds, if it's even there at all. Has the data to control
dependency conversion been encountered in the wild? If not, is the
resulting reaction an overreaction? If so, we need to be careful blaming
LTO for something that it isn't even guilty of.

So, we are probably better off untangling LTO from the story:

1. LTO or no LTO does not matter. The LTO series should not get tangled
   up with memory model issues.

2. The memory model question and problems need to be answered and
   addressed separately.

Thoughts?

Thanks,
-- Marco
Peter Zijlstra June 30, 2020, 8:12 p.m. UTC | #11
On Tue, Jun 30, 2020 at 09:19:31PM +0200, Marco Elver wrote:
> I was asked for input on this, and after a few days digging through some
> history, thought I'd comment. Hope you don't mind.

Not at all, being the one that asked :-)

> First of all, I agree with the concerns, but not because of LTO.
> 
> To set the stage better, and summarize the fundamental problem again:
> we're in the unfortunate situation that no compiler today has a way to
> _efficiently_ deal with C11's memory_order_consume
> [https://lwn.net/Articles/588300/]. If we did, we could just use that
> and be done with it. But, sadly, that doesn't seem possible right now --
> compilers just say consume==acquire.

I'm not convinced C11 memory_order_consume would actually work for us,
even if it would work. That is, given:

  https://lore.kernel.org/lkml/20150520005510.GA23559@linux.vnet.ibm.com/

only pointers can have consume, but like I pointed out, we have code
that relies on dependent loads from integers.

> Will suggests doing the same in the
> kernel: https://lkml.kernel.org/r/20200630173734.14057-19-will@kernel.org

PowerPC would need a similar thing; it too will not preserve causality
for control dependencies.

> What we're most worried about right now is the existence of compiler
> transformations that could break data dependencies by e.g. turning them
> into control dependencies.

Correct.

> If this is a real worry, I don't think LTO is the magical feature that
> will uncover those optimizations. If these compiler transformations are
> real, they also exist in a normal build! 

Agreed, _however_ with the caveat that LTO could make them more common.

After all, with whole program analysis, the compiler might be able to
more easily determine that our pointer @ptr is only ever assigned the
values of &A, &B or &C, while without that visibility it would not be
able to determine this.

Once it knows @ptr has a limited number of determined values, the
conversion into control dependencies becomes much more likely.
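
As a hypothetical sketch of that conversion (not observed compiler
output):

	struct foo A, B, C;
	struct foo *gp;		/* only ever assigned &A, &B or &C */

	/* Source: the ->val load depends on the READ_ONCE(). */
	struct foo *ptr = READ_ONCE(gp);
	int x = ptr->val;

	/*
	 * With whole-program knowledge of gp's value set, the compiler
	 * could legally rewrite this as:
	 */
	struct foo *ptr = READ_ONCE(gp);
	int x = (ptr == &A) ? A.val : (ptr == &B) ? B.val : C.val;

	/*
	 * ... and the value loads no longer carry an address dependency
	 * on the pointer load.
	 */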

> And if we are worried about them, we need to stop relying on dependent
> load ordering across the board; or switch to -O0 for everything.
> Clearly, we don't want either.

Agreed.

> Why do we think LTO is special?

As argued above, whole-program analysis would make it more likely. But I
agree the fundamental problem exists independent of LTO.

> But as far as we can tell, there is no evidence of the dreaded "data
> dependency to control dependency" conversion with LTO that isn't there
> in non-LTO builds, if it's even there at all. Has the data to control
> dependency conversion been encountered in the wild? If not, is the
> resulting reaction an overreaction? If so, we need to be careful blaming
> LTO for something that it isn't even guilty of.

It is mostly paranoia; in large part driven by the fact that even if
such a conversion were to be done, it could go a very long time without
actually causing problems, and longer still for such problems to be
traced back to such an 'optimization'.

That is, the collective hurt from debugging too many ordering issues.

> So, we are probably better off untangling LTO from the story:
> 
> 1. LTO or no LTO does not matter. The LTO series should not get tangled
>    up with memory model issues.
> 
> 2. The memory model question and problems need to be answered and
>    addressed separately.
> 
> Thoughts?

How hard would it be to create something that analyzes a build and
looks for all 'dependent load -> control dependency' transformations
headed by a volatile (and/or from asm) load and issues a warning for
them?

This would give us an indication of how valuable this transformation is
for the kernel. I'm hoping/expecting it's vanishingly rare, but what do
I know.
Paul E. McKenney June 30, 2020, 8:30 p.m. UTC | #12
On Tue, Jun 30, 2020 at 10:12:43PM +0200, Peter Zijlstra wrote:
> On Tue, Jun 30, 2020 at 09:19:31PM +0200, Marco Elver wrote:
> > I was asked for input on this, and after a few days digging through some
> > history, thought I'd comment. Hope you don't mind.
> 
> Not at all, being the one that asked :-)
> 
> > First of all, I agree with the concerns, but not because of LTO.
> > 
> > To set the stage better, and summarize the fundamental problem again:
> > we're in the unfortunate situation that no compiler today has a way to
> > _efficiently_ deal with C11's memory_order_consume
> > [https://lwn.net/Articles/588300/]. If we did, we could just use that
> > and be done with it. But, sadly, that doesn't seem possible right now --
> > compilers just say consume==acquire.
> 
> I'm not convinced C11 memory_order_consume would actually work for us,
> even if it would work. That is, given:
> 
>   https://lore.kernel.org/lkml/20150520005510.GA23559@linux.vnet.ibm.com/
> 
> only pointers can have consume, but like I pointed out, we have code
> that relies on dependent loads from integers.

I agree that C11 memory_order_consume is not normally what we want,
given that it is universally promoted to memory_order_acquire.

However, dependent loads from integers are, if anything, more difficult
to defend from the compiler than are control dependencies.  This applies
doubly to integers that are used to index two-element arrays, in which
case you are just asking the compiler to destroy your dependent loads
by converting them into control dependencies.

> > Will suggests doing the same in the
> > kernel: https://lkml.kernel.org/r/20200630173734.14057-19-will@kernel.org
> 
> PowerPC would need a similar thing; it too will not preserve causality
> for control dependencies.
> 
> > What we're most worried about right now is the existence of compiler
> > transformations that could break data dependencies by e.g. turning them
> > into control dependencies.
> 
> Correct.
> 
> > If this is a real worry, I don't think LTO is the magical feature that
> > will uncover those optimizations. If these compiler transformations are
> > real, they also exist in a normal build! 
> 
> Agreed, _however_ with the caveat that LTO could make them more common.
> 
> After all, with whole program analysis, the compiler might be able to
> more easily determine that our pointer @ptr is only ever assigned the
> values of &A, &B or &C, while without that visibility it would not be
> able to determine this.
> 
> Once it knows @ptr has a limited number of determined values, the
> conversion into control dependencies becomes much more likely.

Which would of course break dependent loads.

> > And if we are worried about them, we need to stop relying on dependent
> > load ordering across the board; or switch to -O0 for everything.
> > Clearly, we don't want either.
> 
> Agreed.
> 
> > Why do we think LTO is special?
> 
> As argued above, whole-program analysis would make it more likely. But I
> agree the fundamental problem exists independent of LTO.
> 
> > But as far as we can tell, there is no evidence of the dreaded "data
> > dependency to control dependency" conversion with LTO that isn't there
> > in non-LTO builds, if it's even there at all. Has the data to control
> > dependency conversion been encountered in the wild? If not, is the
> > resulting reaction an overreaction? If so, we need to be careful blaming
> > LTO for something that it isn't even guilty of.
> 
> It is mostly paranoia; in large part driven by the fact that even if
> such a conversion were to be done, it could go a very long time without
> actually causing problems, and longer still for such problems to be
> traced back to such an 'optimization'.
> 
> That is, the collective hurt from debugging too many ordering issues.
> 
> > So, we are probably better off untangling LTO from the story:
> > 
> > 1. LTO or no LTO does not matter. The LTO series should not get tangled
> >    up with memory model issues.
> > 
> > 2. The memory model question and problems need to be answered and
> >    addressed separately.
> > 
> > Thoughts?
> 
> How hard would it be to create something that analyzes a build and
> looks for all 'dependent load -> control dependency' transformations
> headed by a volatile (and/or from asm) load and issues a warning for
> them?
> 
> This would give us an indication of how valuable this transformation is
> for the kernel. I'm hoping/expecting it's vanishingly rare, but what do
> I know.
> 

This could be quite useful!

							Thanx, Paul
Peter Zijlstra July 1, 2020, 9:10 a.m. UTC | #13
On Tue, Jun 30, 2020 at 01:30:16PM -0700, Paul E. McKenney wrote:
> On Tue, Jun 30, 2020 at 10:12:43PM +0200, Peter Zijlstra wrote:

> > I'm not convinced C11 memory_order_consume would actually work for us,
> > even if it would work. That is, given:
> > 
> >   https://lore.kernel.org/lkml/20150520005510.GA23559@linux.vnet.ibm.com/
> > 
> > only pointers can have consume, but like I pointed out, we have code
> > that relies on dependent loads from integers.
> 
> I agree that C11 memory_order_consume is not normally what we want,
> given that it is universally promoted to memory_order_acquire.
> 
> However, dependent loads from integers are, if anything, more difficult
> to defend from the compiler than are control dependencies.  This applies
> doubly to integers that are used to index two-element arrays, in which
> case you are just asking the compiler to destroy your dependent loads
> by converting them into control dependencies.

Yes, I'm aware. However, as you might know, I'm firmly in the 'C is a
glorified assembler' camp (as I expect most actual OS people are, out of
necessity if nothing else) and if I wanted a control dependency I
would've bloody well written one.

I think an optimizing compiler is awesome, but only in so far as that
optimization is actually helpful -- and yes, I just stepped into a giant
twilight zone there. That is, any optimization that has _any_
controversy should be controllable (like -fno-strict-overflow
-fno-strict-aliasing) and I'd very much like the same here.

In a larger context, I still think that eliminating speculative stores
is both necessary and sufficient to avoid out-of-thin-air. So I'd also
love to get some control on that.
Marco Elver July 1, 2020, 9:41 a.m. UTC | #14
On Tue, 30 Jun 2020 at 22:30, Paul E. McKenney <paulmck@kernel.org> wrote:
> On Tue, Jun 30, 2020 at 10:12:43PM +0200, Peter Zijlstra wrote:
> > On Tue, Jun 30, 2020 at 09:19:31PM +0200, Marco Elver wrote:

> > > First of all, I agree with the concerns, but not because of LTO.
> > >
> > > To set the stage better, and summarize the fundamental problem again:
> > > we're in the unfortunate situation that no compiler today has a way to
> > > _efficiently_ deal with C11's memory_order_consume
> > > [https://lwn.net/Articles/588300/]. If we did, we could just use that
> > > and be done with it. But, sadly, that doesn't seem possible right now --
> > > compilers just say consume==acquire.
> >
> > I'm not convinced C11 memory_order_consume would actually work for us,
> > even if it would work. That is, given:
> >
> >   https://lore.kernel.org/lkml/20150520005510.GA23559@linux.vnet.ibm.com/
> >
> > only pointers can have consume, but like I pointed out, we have code
> > that relies on dependent loads from integers.
>
> I agree that C11 memory_order_consume is not normally what we want,
> given that it is universally promoted to memory_order_acquire.
>
> However, dependent loads from integers are, if anything, more difficult
> to defend from the compiler than are control dependencies.  This applies
> doubly to integers that are used to index two-element arrays, in which
> case you are just asking the compiler to destroy your dependent loads
> by converting them into control dependencies.
>
> > > Will suggests doing the same in the
> > > kernel: https://lkml.kernel.org/r/20200630173734.14057-19-will@kernel.org
> >
> > PowerPC would need a similar thing; it too will not preserve causality
> > for control dependencies.
> >
> > > What we're most worried about right now is the existence of compiler
> > > transformations that could break data dependencies by e.g. turning them
> > > into control dependencies.
> >
> > Correct.
> >
> > > If this is a real worry, I don't think LTO is the magical feature that
> > > will uncover those optimizations. If these compiler transformations are
> > > real, they also exist in a normal build!
> >
> > Agreed, _however_ with the caveat that LTO could make them more common.
> >
> > After all, with whole program analysis, the compiler might be able to
> > more easily determine that our pointer @ptr is only ever assigned the
> > values of &A, &B or &C, while without that visibility it would not be
> > able to determine this.
> >
> > Once it knows @ptr has a limited number of determined values, the
> > conversion into control dependencies becomes much more likely.
>
> Which would of course break dependent loads.
>
> > > And if we are worried about them, we need to stop relying on dependent
> > > load ordering across the board; or switch to -O0 for everything.
> > > Clearly, we don't want either.
> >
> > Agreed.
> >
> > > Why do we think LTO is special?
> >
> > As argued above, whole-program analysis would make it more likely. But I
> > agree the fundamental problem exists independent of LTO.
> >
> > > But as far as we can tell, there is no evidence of the dreaded "data
> > > dependency to control dependency" conversion with LTO that isn't there
> > > in non-LTO builds, if it's even there at all. Has the data to control
> > > dependency conversion been encountered in the wild? If not, is the
> > > resulting reaction an overreaction? If so, we need to be careful blaming
> > > LTO for something that it isn't even guilty of.
> >
> > It is mostly paranoia; in large part driven by the fact that even if
> > such a conversion were to be done, it could go a very long time without
> > actually causing problems, and longer still for such problems to be
> > traced back to such an 'optimization'.
> >
> > That is, the collective hurt from debugging too many ordering issues.
> >
> > > So, we are probably better off untangling LTO from the story:
> > >
> > > 1. LTO or no LTO does not matter. The LTO series should not get tangled
> > >    up with memory model issues.
> > >
> > > 2. The memory model question and problems need to be answered and
> > >    addressed separately.
> > >
> > > Thoughts?
> >
> > How hard would it be to create something that analyzes a build and
> > looks for all 'dependent load -> control dependency' transformations
> > headed by a volatile (and/or from asm) load and issues a warning for
> > them?

I was thinking about this, but in the context of the "auto-promote to
acquire" which you didn't like. Issuing a warning should certainly be
simpler.

I think there is no one place where we know these transformations
happen, but rather, need to analyze the IR before transformations,
take note of all the dependent loads headed by volatile+asm, and then
run an analysis after optimizations checking the dependencies are
still there.

> > This would give us an indication of how valuable this transformation is
> > for the kernel. I'm hoping/expecting it's vanishingly rare, but what do
> > I know.
>
> This could be quite useful!

We might then even be able to say, "if you get this warning, turn on
CONFIG_ACQUIRE_READ_DEPENDENCIES" (or however the option will be
named). Or some other tricks, like automatically recompile the TU
where this happens with the option. But again, this is not something
that should specifically block LTO, because if we have this, we'll
need to turn it on for everything.

Thanks,
-- Marco
Will Deacon July 1, 2020, 10:03 a.m. UTC | #15
On Wed, Jul 01, 2020 at 11:41:17AM +0200, Marco Elver wrote:
> On Tue, 30 Jun 2020 at 22:30, Paul E. McKenney <paulmck@kernel.org> wrote:
> > On Tue, Jun 30, 2020 at 10:12:43PM +0200, Peter Zijlstra wrote:
> > > On Tue, Jun 30, 2020 at 09:19:31PM +0200, Marco Elver wrote:
> > > > So, we are probably better off untangling LTO from the story:
> > > >
> > > > 1. LTO or no LTO does not matter. The LTO series should not get tangled
> > > >    up with memory model issues.
> > > >
> > > > 2. The memory model question and problems need to be answered and
> > > >    addressed separately.
> > > >
> > > > Thoughts?
> > >
> > > How hard would it be to create something that analyzes a build and
> > > looks for all 'dependent load -> control dependency' transformations
> > > headed by a volatile (and/or from asm) load and issues a warning for
> > > them?
> 
> I was thinking about this, but in the context of the "auto-promote to
> acquire" which you didn't like. Issuing a warning should certainly be
> simpler.
> 
> I think there is no one place where we know these transformations
> happen, but rather, need to analyze the IR before transformations,
> take note of all the dependent loads headed by volatile+asm, and then
> run an analysis after optimizations checking the dependencies are
> still there.
> 
> > > This would give us an indication of how valuable this transformation is
> > > for the kernel. I'm hoping/expecting it's vanishingly rare, but what do
> > > I know.
> >
> > This could be quite useful!
> 
> We might then even be able to say, "if you get this warning, turn on
> CONFIG_ACQUIRE_READ_DEPENDENCIES" (or however the option will be
> named). Or some other tricks, like automatically recompile the TU
> where this happens with the option. But again, this is not something
> that should specifically block LTO, because if we have this, we'll
> need to turn it on for everything.

I'm not especially keen on solving this with additional config options --
all it does it further fragment the number of kernels we have to care about
and distributions really won't be in a position to know whether this should
be enabled or not. I would prefer that the build fails, and we figure out
which compiler switch we need to stop the harmful optimisation taking place.
As Peter says, it _should_ be a rare thing to see (empirically, the kernel
seems to be getting away with it so far).

The problem, as I see it, is that the C language doesn't provide us with
a way to express dependency ordering and so we're at the mercy of the
compiler when we roll our own implementation. Paul continues to fight the
good fight at committee meetings to improve the situation, but in the
meantime we'd benefit from two things:

  1. A way to disable any compiler optimisations that break our dependency
     ordering in spite of READ_ONCE()

  2. A way to detect at build time if these harmful optimisations are taking
     place

Finally, while I agree that this problem isn't limited to LTO, my fear is
that LTO provides enough information for address dependencies headed by
a READ_ONCE() to be converted to control dependencies when some common
values of the pointer can be determined by the compiler. If we can rule
this sort of thing out, then great, but in the absence of (2) I think
throwing in an acquire is a sensible safety measure. Doesn't CFI rely
on LTO to do something similar for indirect branch targets, or have I
got that totally mixed up?

Will
Peter Zijlstra July 1, 2020, 11:40 a.m. UTC | #16
On Wed, Jul 01, 2020 at 11:41:17AM +0200, Marco Elver wrote:
> On Tue, 30 Jun 2020 at 22:30, Paul E. McKenney <paulmck@kernel.org> wrote:
> > On Tue, Jun 30, 2020 at 10:12:43PM +0200, Peter Zijlstra wrote:
> > > On Tue, Jun 30, 2020 at 09:19:31PM +0200, Marco Elver wrote:

> > > > Thoughts?
> > >
> > > How hard would it be to create something that analyzes a build and
> > > looks for all 'dependent load -> control dependency' transformations
> > > headed by a volatile (and/or from asm) load and issues a warning for
> > > them?
> 
> I was thinking about this, but in the context of the "auto-promote to
> acquire" which you didn't like. Issuing a warning should certainly be
> simpler.
> 
> I think there is no one place where we know these transformations
> happen, but rather, need to analyze the IR before transformations,
> take note of all the dependent loads headed by volatile+asm, and then
> run an analysis after optimizations checking the dependencies are
> still there.

Urgh, that sounds nasty. The thing is, as I've hinted at in my other
reply, I would really like a compiler switch to disable this
optimization entirely -- knowing how relevant the transformation is, is
simply a first step towards that.

In order to control the transformation, you have to actually know where
in the optimization passes it happens.

Also, if (big if in my book) we find the optimization is actually
beneficial, we can invert the warning when using the switch and warn
about lost optimization possibilities and manually re-write the code to
use control deps.

> > > This would give us an indication of how valuable this transformation is
> > > for the kernel. I'm hoping/expecting it's vanishingly rare, but what do
> > > I know.
> >
> > This could be quite useful!
> 
> We might then even be able to say, "if you get this warning, turn on
> CONFIG_ACQUIRE_READ_DEPENDENCIES" (or however the option will be
> named). 

I was going to suggest: if this happens, employ -fno-wreck-dependencies
:-)
Paul E. McKenney July 1, 2020, 2:06 p.m. UTC | #17
On Wed, Jul 01, 2020 at 01:40:27PM +0200, Peter Zijlstra wrote:
> On Wed, Jul 01, 2020 at 11:41:17AM +0200, Marco Elver wrote:
> > On Tue, 30 Jun 2020 at 22:30, Paul E. McKenney <paulmck@kernel.org> wrote:
> > > On Tue, Jun 30, 2020 at 10:12:43PM +0200, Peter Zijlstra wrote:
> > > > On Tue, Jun 30, 2020 at 09:19:31PM +0200, Marco Elver wrote:
> 
> > > > > Thoughts?
> > > >
> > > > How hard would it be to create something that analyzes a build and
> > > > looks for all 'dependent load -> control dependency' transformations
> > > > headed by a volatile (and/or from asm) load and issues a warning for
> > > > them?
> > 
> > I was thinking about this, but in the context of the "auto-promote to
> > acquire" which you didn't like. Issuing a warning should certainly be
> > simpler.
> > 
> > I think there is no one place where we know these transformations
> > happen, but rather, need to analyze the IR before transformations,
> > take note of all the dependent loads headed by volatile+asm, and then
> > run an analysis after optimizations checking the dependencies are
> > still there.
> 
> Urgh, that sounds nasty. The thing is, as I've hinted at in my other
> reply, I would really like a compiler switch to disable this
> optimization entirely -- knowing how relevant the transformation is, is
> simply a first step towards that.
> 
> In order to control the transformation, you have to actually know where
> in the optimization passes it happens.
> 
> Also, if (big if in my book) we find the optimization is actually
> beneficial, we can invert the warning when using the switch and warn
> about lost optimization possibilities and manually re-write the code to
> use control deps.

There are lots of optimization passes and any of them might decide to
destroy dependencies.  :-(

> > > > This would give us an indication of how valuable this transformation is
> > > > for the kernel. I'm hoping/expecting it's vanishingly rare, but what do
> > > > I know.
> > >
> > > This could be quite useful!
> > 
> > We might then even be able to say, "if you get this warning, turn on
> > CONFIG_ACQUIRE_READ_DEPENDENCIES" (or however the option will be
> > named). 
> 
> I was going to suggest: if this happens, employ -fno-wreck-dependencies
> :-)

The current state in the C++ committee is that marking variables
carrying dependencies is the way forward.  This is of course not what
the Linux kernel community does, but it should not be hard to have a
-fall-variables-dependent or some such that causes all variables to be
treated as if they were marked.  Though I was hoping for only pointers.
Are they -sure- that they -absolutely- need to carry dependencies
through integers???

Anyway, the next step is to provide this functionality in one of the
major compilers.  Akshat Garg started this in GCC as a GSoC project
by duplicating "volatile" functionality with a _Dependent_ptr keyword.
Next steps would include removing "volatile" functionality not required
for dependencies.  Here is a random posting, which if I remember correctly
raised some doubts as to whether "volatile" was really carried through
everywhere it needs to be for things like LTO:

https://gcc.gnu.org/legacy-ml/gcc/2019-07/msg00139.html

What happened to this effort?  Akshat graduated and got an unrelated
job, you know, the usual.  ;-)

							Thanx, Paul
David Laight July 1, 2020, 2:20 p.m. UTC | #18
From: Peter Zijlstra
> Sent: 01 July 2020 10:11
> On Tue, Jun 30, 2020 at 01:30:16PM -0700, Paul E. McKenney wrote:
> > On Tue, Jun 30, 2020 at 10:12:43PM +0200, Peter Zijlstra wrote:
> 
> > > I'm not convinced C11 memory_order_consume would actually work for us,
> > > even if it would work. That is, given:
> > >
> > >   https://lore.kernel.org/lkml/20150520005510.GA23559@linux.vnet.ibm.com/
> > >
> > > only pointers can have consume, but like I pointed out, we have code
> > > that relies on dependent loads from integers.
> >
> > I agree that C11 memory_order_consume is not normally what we want,
> > given that it is universally promoted to memory_order_acquire.
> >
> > However, dependent loads from integers are, if anything, more difficult
> > to defend from the compiler than are control dependencies.  This applies
> > doubly to integers that are used to index two-element arrays, in which
> > case you are just asking the compiler to destroy your dependent loads
> > by converting them into control dependencies.
> 
> Yes, I'm aware. However, as you might know, I'm firmly in the 'C is a
> glorified assembler' camp (as I expect most actual OS people are, out of
> necessity if nothing else) and if I wanted a control dependency I
> would've bloody well written one.

I write in C because doing register tracking is hard :-)
I've got an hdlc implementation in C that is carefully adjusted
so that the worst case path is bounded.
I probably know every one of the 1000 instructions in it.

Would an asm statement that uses the same 'register' for input and
output but doesn't actually do anything help?
It won't generate any code, but the compiler ought to assume that
it might change the value -- so it can't do optimisations that track
the value across the call.
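
The kernel already carries a helper of roughly this shape --
OPTIMIZER_HIDE_VAR() in include/linux/compiler.h. A minimal sketch of
the idea:

	#define HIDE(var)	__asm__ ("" : "+r" (var))

	/*
	 * This emits no instructions, but the compiler must assume the
	 * asm may rewrite 'var', so it cannot track the value past this
	 * point or specialize branches on it.
	 */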

> I think an optimizing compiler is awesome, but only in so far as that
> optimization is actually helpful -- and yes, I just stepped into a giant
> twilight zone there. That is, any optimization that has _any_
> controversy should be controllable (like -fno-strict-overflow
> -fno-strict-aliasing) and I'd very much like the same here.

I'm fed up of gcc generating the code that uses SIMD instructions
for the 'tail' loop at the end of a function that is already doing
SIMD operations for the main part of the loop.
And compilers that convert a byte copy loop to 'rep movsb'.
If I'm copying 3 or 4 bytes I don't want a 40 clock overhead.

	David

Peter Zijlstra July 1, 2020, 3:05 p.m. UTC | #19
On Wed, Jul 01, 2020 at 07:06:54AM -0700, Paul E. McKenney wrote:

> The current state in the C++ committee is that marking variables
> carrying dependencies is the way forward.  This is of course not what
> the Linux kernel community does, but it should not be hard to have a
> -fall-variables-dependent or some such that causes all variables to be
> treated as if they were marked.  Though I was hoping for only pointers.
> Are they -sure- that they -absolutely- need to carry dependencies
> through integers???

What's 'need'? :-)

I'm thinking __ktime_get_fast_ns() is better off with a dependent load
than it is with an extra smp_rmb().

Yes we can stick an smp_rmb() in there, but I don't like it. Like I
wrote earlier, if I wanted a control dependency, I'd have written one.
Paul E. McKenney July 1, 2020, 4:03 p.m. UTC | #20
On Wed, Jul 01, 2020 at 05:05:12PM +0200, Peter Zijlstra wrote:
> On Wed, Jul 01, 2020 at 07:06:54AM -0700, Paul E. McKenney wrote:
> 
> > The current state in the C++ committee is that marking variables
> > carrying dependencies is the way forward.  This is of course not what
> > the Linux kernel community does, but it should not be hard to have a
> > -fall-variables-dependent or some such that causes all variables to be
> > treated as if they were marked.  Though I was hoping for only pointers.
> > Are they -sure- that they -absolutely- need to carry dependencies
> > through integers???
> 
> What's 'need'? :-)

Turning off all dependency-killing optimizations on all pointers is
likely a non-event.  Turning off all dependency-killing optimizations
on all integers is not the road to happiness.

So whatever "need" might be, it would need to be rather earthshaking.  ;-)
It is probably not -that- hard to convert to pointers, even if they
are indexing multiple arrays.

> I'm thinking __ktime_get_fast_ns() is better off with a dependent load
> than it is with an extra smp_rmb().
> 
> Yes we can stick an smp_rmb() in there, but I don't like it. Like I
> wrote earlier, if I wanted a control dependency, I'd have written one.

No argument here.

But it looks like we are going to have to tell the compiler.

							Thanx, Paul
Paul E. McKenney July 1, 2020, 4:06 p.m. UTC | #21
On Wed, Jul 01, 2020 at 02:20:13PM +0000, David Laight wrote:
> From: Peter Zijlstra
> > Sent: 01 July 2020 10:11
> > On Tue, Jun 30, 2020 at 01:30:16PM -0700, Paul E. McKenney wrote:
> > > On Tue, Jun 30, 2020 at 10:12:43PM +0200, Peter Zijlstra wrote:
> > 
> > > > I'm not convinced C11 memory_order_consume would actually work for us,
> > > > even if it would work. That is, given:
> > > >
> > > >   https://lore.kernel.org/lkml/20150520005510.GA23559@linux.vnet.ibm.com/
> > > >
> > > > only pointers can have consume, but like I pointed out, we have code
> > > > that relies on dependent loads from integers.
> > >
> > > I agree that C11 memory_order_consume is not normally what we want,
> > > given that it is universally promoted to memory_order_acquire.
> > >
> > > However, dependent loads from integers are, if anything, more difficult
> > > to defend from the compiler than are control dependencies.  This applies
> > > doubly to integers that are used to index two-element arrays, in which
> > > case you are just asking the compiler to destroy your dependent loads
> > > by converting them into control dependencies.
> > 
> > Yes, I'm aware. However, as you might know, I'm firmly in the 'C is a
> > glorified assembler' camp (as I expect most actual OS people are, out of
> > necessity if nothing else) and if I wanted a control dependency I
> > would've bloody well written one.
> 
> I write in C because doing register tracking is hard :-)
> I've got an hdlc implementation in C that is carefully adjusted
> so that the worst case path is bounded.
> I probably know every one of the 1000 instructions in it.
> 
> Would an asm statement that uses the same 'register' for input and
> output but doesn't actually do anything help?
> It won't generate any code, but the compiler ought to assume that
> it might change the value - so can't do optimisations that track
> the value across the call.

It might replace the volatile load, but there are optimizations that
apply to the downstream code as well.

Or are you suggesting periodically pushing the dependent variable
through this asm?  That might work, but it would be easier and
more maintainable to just mark the variable.

> > I think an optimizing compiler is awesome, but only in so far as that
> > optimization is actually helpful -- and yes, I just stepped into a giant
> > twilight zone there. That is, any optimization that has _any_
> > controversy should be controllable (like -fno-strict-overflow
> > -fno-strict-aliasing) and I'd very much like the same here.
> 
> I'm fed up of gcc generating the code that uses SIMD instructions
> for the 'tail' loop at the end of a function that is already doing
> SIMD operations for the main part of the loop.
> And compilers that convert a byte copy loop to 'rep movsb'.
> If I'm copying 3 or 4 bytes I don't want a 40 clock overhead.

Agreed, compilers can often be all too "helpful".  :-(

							Thanx, Paul
Peter Zijlstra July 2, 2020, 8:20 a.m. UTC | #22
On Wed, Jul 01, 2020 at 09:03:38AM -0700, Paul E. McKenney wrote:

> But it looks like we are going to have to tell the compiler.

What does the current proposal look like? I can certainly annotate the
seqcount latch users, but who knows what other code is out there....
David Laight July 2, 2020, 9:37 a.m. UTC | #23
From: Paul E. McKenney
> Sent: 01 July 2020 17:06
...
> > Would an asm statement that uses the same 'register' for input and
> > output but doesn't actually do anything help?
> > It won't generate any code, but the compiler ought to assume that
> > it might change the value - so can't do optimisations that track
> > the value across the call.
> 
> It might replace the volatile load, but there are optimizations that
> apply to the downstream code as well.
> 
> Or are you suggesting periodically pushing the dependent variable
> through this asm?  That might work, but it would be easier and
> more maintainable to just mark the variable.

Marking the variable requires compiler support.
Although what 'volatile register int foo;' means might be interesting.

So I was thinking that in the case mentioned earlier you do:
	ptr += LAUNDER(offset & 1);
to ensure the compiler didn't convert to:
	if (offset & 1) ptr++;
(Which is probably a pessimisation - the reverse is likely better.)
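
A minimal sketch of what such a macro could look like (hypothetical
LAUNDER(); the kernel's existing OPTIMIZER_HIDE_VAR() in
include/linux/compiler.h is built on the same empty-asm trick):

	/*
	 * The empty asm names __x as both input and output ("+r"): it
	 * emits no instructions, but the compiler must assume the value
	 * may have changed, so it cannot track it across this point.
	 */
	#define LAUNDER(x)				\
	({						\
		__typeof__(x) __x = (x);		\
		asm volatile("" : "+r" (__x));		\
		__x;					\
	})

With that, ptr += LAUNDER(offset & 1); should keep the addition as
written rather than becoming a branch.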

	David

Paul E. McKenney July 2, 2020, 5:59 p.m. UTC | #24
On Thu, Jul 02, 2020 at 10:20:40AM +0200, Peter Zijlstra wrote:
> On Wed, Jul 01, 2020 at 09:03:38AM -0700, Paul E. McKenney wrote:
> 
> > But it looks like we are going to have to tell the compiler.
> 
> What does the current proposal look like? I can certainly annotate the
> seqcount latch users, but who knows what other code is out there....

For pointers, yes, within the Linux kernel it is hopeless, thus the
thought of a -fall-dependent-ptr or some such that makes the compiler
pretend that each and every pointer is marked with the _Dependent_ptr
qualifier.

New non-Linux-kernel code might want to use this qualifier explicitly,
perhaps something like the following:

	_Dependent_ptr struct foo *p;  // Or maybe after the "*"?

	rcu_read_lock();
	p = rcu_dereference(gp);
	// And so on...

If a function is to take a dependent pointer as a function argument,
then the corresponding parameter needs the _Dependent_ptr marking.
Ditto for return values.

The proposal did not cover integers due to concerns about the number of
optimization passes that would need to be reviewed to make that work.
Nevertheless, using a marked integer would be safer than using an unmarked
one, and if the review can be carried out, why not?  Maybe something
like this:

	_Dependent_ptr int idx;

	rcu_read_lock();
	idx = READ_ONCE(gidx);
	d = rcuarray[idx];
	rcu_read_unlock();
	do_something_with(d);

So use of this qualifier is quite reasonable.
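
(For context, the dependency-killing transformation such a marking would
forbid looks roughly like this sketch, assuming the compiler can prove
that idx is either 0 or 1:)

	d = rcuarray[idx];	/* dependent load, as written */

	/*
	 * ...may instead be emitted as a control dependency, which no
	 * longer orders the loads on weakly ordered hardware:
	 *
	 *	if (idx)
	 *		d = rcuarray[1];
	 *	else
	 *		d = rcuarray[0];
	 */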

The prototype for GCC is here: https://github.com/AKG001/gcc/

							Thanx, Paul
Paul E. McKenney July 2, 2020, 6 p.m. UTC | #25
On Thu, Jul 02, 2020 at 09:37:26AM +0000, David Laight wrote:
> From: Paul E. McKenney
> > Sent: 01 July 2020 17:06
> ...
> > > Would an asm statement that uses the same 'register' for input and
> > > output but doesn't actually do anything help?
> > > It won't generate any code, but the compiler ought to assume that
> > > it might change the value - so can't do optimisations that track
> > > the value across the call.
> > 
> > It might replace the volatile load, but there are optimizations that
> > apply to the downstream code as well.
> > 
> > Or are you suggesting periodically pushing the dependent variable
> > through this asm?  That might work, but it would be easier and
> > more maintainable to just mark the variable.
> 
> Marking the variable requires compiler support.
> Although what 'volatile register int foo;' means might be interesting.
> 
> So I was thinking that in the case mentioned earlier you do:
> 	ptr += LAUNDER(offset & 1);
> to ensure the compiler didn't convert to:
> 	if (offset & 1) ptr++;
> (Which is probably a pessimisation - the reverse is likely better.)

Indeed, Akshat's prototype follows the "volatile" qualifier in many
ways.  https://github.com/AKG001/gcc/

							Thanx, Paul
Peter Zijlstra July 3, 2020, 1:13 p.m. UTC | #26
On Thu, Jul 02, 2020 at 10:59:48AM -0700, Paul E. McKenney wrote:
> On Thu, Jul 02, 2020 at 10:20:40AM +0200, Peter Zijlstra wrote:
> > On Wed, Jul 01, 2020 at 09:03:38AM -0700, Paul E. McKenney wrote:
> > 
> > > But it looks like we are going to have to tell the compiler.
> > 
> > What does the current proposal look like? I can certainly annotate the
> > seqcount latch users, but who knows what other code is out there....
> 
> For pointers, yes, within the Linux kernel it is hopeless, thus the
> thought of a -fall-dependent-ptr or some such that makes the compiler
> pretend that each and every pointer is marked with the _Dependent_ptr
> qualifier.
> 
> New non-Linux-kernel code might want to use this qualifier explicitly,
> perhaps something like the following:
> 
> 	_Dependent_ptr struct foo *p;  // Or maybe after the "*"?

After, as you've written it, it's a pointer to a '_Dependent struct
foo'.

> 
> 	rcu_read_lock();
> 	p = rcu_dereference(gp);
> 	// And so on...
> 
> If a function is to take a dependent pointer as a function argument,
> then the corresponding parameter needs the _Dependent_ptr marking.
> Ditto for return values.
> 
> The proposal did not cover integers due to concerns about the number of
> optimization passes that would need to be reviewed to make that work.
> Nevertheless, using a marked integer would be safer than using an unmarked
> one, and if the review can be carried out, why not?  Maybe something
> like this:
> 
> 	_Dependent_ptr int idx;
> 
> 	rcu_read_lock();
> 	idx = READ_ONCE(gidx);
> 	d = rcuarray[idx];
> 	rcu_read_unlock();
> 	do_something_with(d);
> 
> So use of this qualifier is quite reasonable.

The above usage might warrant a rename of the qualifier though, since
clearly there isn't anything ptr around.

> The prototype for GCC is here: https://github.com/AKG001/gcc/

Thanks! Those test cases are somewhat over qualified though:

       static volatile _Atomic (TYPE) * _Dependent_ptr a;     		\

Also, if C goes and specifies load dependencies, in any form, is then
not the corollary that they need to specify control dependencies? How
else can they exclude the transformation?

And of course, once we're there, can we get explicit support for control
dependencies too? :-) :-)
Peter Zijlstra July 3, 2020, 1:25 p.m. UTC | #27
On Fri, Jul 03, 2020 at 03:13:30PM +0200, Peter Zijlstra wrote:
> > The prototype for GCC is here: https://github.com/AKG001/gcc/
> 
> Thanks! Those test cases are somewhat over qualified though:
> 
>        static volatile _Atomic (TYPE) * _Dependent_ptr a;     		\

One question though; since it's a qualifier, and we've recently spent a
whole lot of effort to strip qualifiers in say READ_ONCE(), how does,
and how do we want, this qualifier to behave.

C++ has very convenient means of manipulating qualifiers, so it's not
much of a problem there, but for C it is, as we've found, really quite
cumbersome. Even with _Generic() we can't manipulate individual
qualifiers afaict.
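
(For reference, the sort of dance required today is the kernel's
__unqual_scalar_typeof(), abridged here to a sketch; the real macro
enumerates every scalar type:)

	/*
	 * _Generic() matches the unqualified type of a scalar expression
	 * and selects an unqualified literal of that type, so typeof()
	 * yields the type minus its qualifiers.
	 */
	#define __unqual_scalar_typeof(x) typeof(	\
		_Generic((x),				\
			 char: (char)0,			\
			 int: (int)0,			\
			 long: (long)0,			\
			 long long: (long long)0,	\
			 default: (x)))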
Paul E. McKenney July 3, 2020, 2:42 p.m. UTC | #28
On Fri, Jul 03, 2020 at 03:13:30PM +0200, Peter Zijlstra wrote:
> On Thu, Jul 02, 2020 at 10:59:48AM -0700, Paul E. McKenney wrote:
> > On Thu, Jul 02, 2020 at 10:20:40AM +0200, Peter Zijlstra wrote:
> > > On Wed, Jul 01, 2020 at 09:03:38AM -0700, Paul E. McKenney wrote:
> > > 
> > > > But it looks like we are going to have to tell the compiler.
> > > 
> > > What does the current proposal look like? I can certainly annotate the
> > > seqcount latch users, but who knows what other code is out there....
> > 
> > For pointers, yes, within the Linux kernel it is hopeless, thus the
> > thought of a -fall-dependent-ptr or some such that makes the compiler
> > pretend that each and every pointer is marked with the _Dependent_ptr
> > qualifier.
> > 
> > New non-Linux-kernel code might want to use this qualifier explicitly,
> > perhaps something like the following:
> > 
> > 	_Dependent_ptr struct foo *p;  // Or maybe after the "*"?
> 
> After, as you've written it, it's a pointer to a '_Dependent struct
> foo'.

Yeah, I have to look that up every time.  :-/

Thank you for checking!

> > 	rcu_read_lock();
> > 	p = rcu_dereference(gp);
> > 	// And so on...
> > 
> > If a function is to take a dependent pointer as a function argument,
> > then the corresponding parameter needs the _Dependent_ptr marking.
> > Ditto for return values.
> > 
> > The proposal did not cover integers due to concerns about the number of
> > optimization passes that would need to be reviewed to make that work.
> > Nevertheless, using a marked integer would be safer than using an unmarked
> > one, and if the review can be carried out, why not?  Maybe something
> > like this:
> > 
> > 	_Dependent_ptr int idx;
> > 
> > 	rcu_read_lock();
> > 	idx = READ_ONCE(gidx);
> > 	d = rcuarray[idx];
> > 	rcu_read_unlock();
> > 	do_something_with(d);
> > 
> > So use of this qualifier is quite reasonable.
> 
> The above usage might warrant a rename of the qualifier though, since
> clearly there isn't anything ptr around.

Given the large number of additional optimizations that need to be
suppressed in the non-pointer case, any discouragement based on the "_ptr"
at the end of the name is all to the good.

And if that line of reasoning is unconvincing, please look at the program
at the end of this email, which compiles without errors with -Wall and
gives the expected output.  ;-)

> > The prototype for GCC is here: https://github.com/AKG001/gcc/
> 
> Thanks! Those test cases are somewhat over qualified though:
> 
>        static volatile _Atomic (TYPE) * _Dependent_ptr a;     		\

Especially given that in C, _Atomic operations are implicitly volatile.
But this is likely a holdover from Akshat's implementation strategy,
which was to pattern _Dependent_ptr after the volatile keyword.

> Also, if C goes and specifies load dependencies, in any form, is then
> not the corollary that they need to specify control dependencies? How
> else can they exclude the transformation?

By requiring that any temporaries generated from variables that are
marked _Dependent_ptr also be marked _Dependent_ptr.  This is of course
one divergence of _Dependent_ptr from the volatile keyword.

> And of course, once we're there, can we get explicit support for control
> dependencies too? :-) :-)

Keep talking like this and I am going to make sure that you attend a
standards committee meeting.  If need be, by arranging for you to be
physically dragged there.  ;-)

More seriously, for control dependencies, the variable that would need
to be marked would be the program counter, which might require some
additional syntax.

							Thanx, Paul

------------------------------------------------------------------------

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int foo(int *p, int i)
{
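	/* i[p] is the same as p[i], i.e. *(p + i): array indexing commutes. */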
	return i[p];
}

int arr[] = { 0, 1, 4, 9, 16, 25, 36, 49, 64, 81, 100, 121, 144, };

int main(int argc, char *argv[])
{
	int i = atoi(argv[1]);

	printf("%d[arr] = %d\n", i, foo(arr, i));
	return 0;
}
Paul E. McKenney July 3, 2020, 2:51 p.m. UTC | #29
On Fri, Jul 03, 2020 at 03:25:23PM +0200, Peter Zijlstra wrote:
> On Fri, Jul 03, 2020 at 03:13:30PM +0200, Peter Zijlstra wrote:
> > > The prototype for GCC is here: https://github.com/AKG001/gcc/
> > 
> > Thanks! Those test cases are somewhat over qualified though:
> > 
> >        static volatile _Atomic (TYPE) * _Dependent_ptr a;     		\
> 
> One question though; since it's a qualifier, and we've recently spent a
> whole lot of effort to strip qualifiers in say READ_ONCE(), how does,
> and how do we want, this qualifier to behave.

Dereferencing a _Dependent_ptr pointer gives you something that is not
_Dependent_ptr, unless the declaration was like this:

	_Dependent_ptr _Atomic (TYPE) * _Dependent_ptr a;

And if I recall correctly, the current state is that assigning a
_Dependent_ptr variable to a non-_Dependent_ptr variable strips this
marking (though the thought was to be able to ask for a warning).

So, yes, it would be nice to be able to explicitly strip the
_Dependent_ptr, perhaps the kill_dependency() macro, which is already
in the C standard.
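
For reference, the C11 usage would be roughly the following (a sketch,
assuming gp is declared _Atomic(int *)):

	#include <stdatomic.h>

	int *p = atomic_load_explicit(&gp, memory_order_consume);
	int *q = kill_dependency(p);	/* q carries no dependency */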

> C++ has very convenient means of manipulating qualifiers, so it's not
> much of a problem there, but for C it is, as we've found, really quite
> cumbersome. Even with _Generic() we can't manipulate individual
> qualifiers afaict.

Fair point, and in C++ this is a templated class, at least in the same
sense that std::atomic<> is a templated class.

But in this case, would kill_dependency do what you want?

							Thanx, Paul
Paul E. McKenney July 6, 2020, 4:26 p.m. UTC | #30
On Fri, Jul 03, 2020 at 07:42:28AM -0700, Paul E. McKenney wrote:
> On Fri, Jul 03, 2020 at 03:13:30PM +0200, Peter Zijlstra wrote:
> > On Thu, Jul 02, 2020 at 10:59:48AM -0700, Paul E. McKenney wrote:
> > > On Thu, Jul 02, 2020 at 10:20:40AM +0200, Peter Zijlstra wrote:
> > > > On Wed, Jul 01, 2020 at 09:03:38AM -0700, Paul E. McKenney wrote:

[ . . . ]

> > Also, if C goes and specifies load dependencies, in any form, is then
> > not the corollary that they need to specify control dependencies? How
> > else can they exclude the transformation?
> 
> By requiring that any temporaries generated from variables that are
> marked _Dependent_ptr also be marked _Dependent_ptr.  This is of course
> one divergence of _Dependent_ptr from the volatile keyword.
> 
> > And of course, once we're there, can we get explicit support for control
> > dependencies too? :-) :-)
> 
> Keep talking like this and I am going to make sure that you attend a
> standards committee meeting.  If need be, by arranging for you to be
> physically dragged there.  ;-)
> 
> More seriously, for control dependencies, the variable that would need
> to be marked would be the program counter, which might require some
> additional syntax.

And perhaps more constructively, we do need to prioritize address and data
dependencies over control dependencies.  For one thing, there are a lot
more address/data dependencies in existing code than there are control
dependencies, and (sadly, perhaps more importantly) there are a lot more
people who are convinced that address/data dependencies are important.

For another (admittedly more theoretical) thing, the OOTA scenarios
stemming from control dependencies are a lot less annoying than those
from address/data dependencies.

And address/data dependencies are as far as I know vulnerable to things
like conditional-move instructions that can cause problems for control
dependencies.

Nevertheless, yes, control dependencies also need attention.

							Thanx, Paul
Peter Zijlstra July 6, 2020, 6:29 p.m. UTC | #31
On Mon, Jul 06, 2020 at 09:26:33AM -0700, Paul E. McKenney wrote:

> And perhaps more constructively, we do need to prioritize address and data
> dependencies over control dependencies.  For one thing, there are a lot
> more address/data dependencies in existing code than there are control
> dependencies, and (sadly, perhaps more importantly) there are a lot more
> people who are convinced that address/data dependencies are important.

If they do not care about their Linux OS running correctly, that is :-)

> For another (admittedly more theoretical) thing, the OOTA scenarios
> stemming from control dependencies are a lot less annoying than those
> from address/data dependencies.
> 
> And address/data dependencies are as far as I know vulnerable to things
> like conditional-move instructions that can cause problems for control
> dependencies.
> 
> Nevertheless, yes, control dependencies also need attention.

Today I added one more \o/
Paul E. McKenney July 6, 2020, 6:39 p.m. UTC | #32
On Mon, Jul 06, 2020 at 08:29:26PM +0200, Peter Zijlstra wrote:
> On Mon, Jul 06, 2020 at 09:26:33AM -0700, Paul E. McKenney wrote:
> 
> > And perhaps more constructively, we do need to prioritize address and data
> > dependencies over control dependencies.  For one thing, there are a lot
> > more address/data dependencies in existing code than there are control
> > dependencies, and (sadly, perhaps more importantly) there are a lot more
> > people who are convinced that address/data dependencies are important.
> 
> If they do not care about their Linux OS running correctly, that is :-)

Many of them really do not care at all.  In fact, some would consider
Linux failing to run as an added bonus.

> > For another (admittedly more theoretical) thing, the OOTA scenarios
> > stemming from control dependencies are a lot less annoying than those
> > from address/data dependencies.
> > 
> > And address/data dependencies are as far as I know vulnerable to things
> > like conditional-move instructions that can cause problems for control
> > dependencies.
> > 
> > Nevertheless, yes, control dependencies also need attention.
> 
> Today I added one more \o/

Just make sure you continually check to make sure that compilers
don't break it, along with the others you have added.  ;-)

							Thanx, Paul
Peter Zijlstra July 6, 2020, 7:40 p.m. UTC | #33
On Mon, Jul 06, 2020 at 11:39:33AM -0700, Paul E. McKenney wrote:
> On Mon, Jul 06, 2020 at 08:29:26PM +0200, Peter Zijlstra wrote:
> > On Mon, Jul 06, 2020 at 09:26:33AM -0700, Paul E. McKenney wrote:

> > If they do not care about their Linux OS running correctly, that is :-)
> 
> Many of them really do not care at all.  In fact, some would consider
> Linux failing to run as an added bonus.

This I think is why we have compiler people in the thread that care a
lot more.

> > > Nevertheless, yes, control dependencies also need attention.
> > 
> > Today I added one more \o/
> 
> Just make sure you continually check to make sure that compilers
> don't break it, along with the others you have added.  ;-)

There's:

kernel/locking/mcs_spinlock.h:  smp_cond_load_acquire(l, VAL);                          \
kernel/sched/core.c:                    smp_cond_load_acquire(&p->on_cpu, !VAL);
kernel/smp.c:   smp_cond_load_acquire(&csd->node.u_flags, !(VAL & CSD_FLAG_LOCK));

arch/x86/kernel/alternative.c:          atomic_cond_read_acquire(&desc.refs, !VAL);
kernel/locking/qrwlock.c:               atomic_cond_read_acquire(&lock->cnts, !(VAL & _QW_LOCKED));
kernel/locking/qrwlock.c:       atomic_cond_read_acquire(&lock->cnts, !(VAL & _QW_LOCKED));
kernel/locking/qrwlock.c:               atomic_cond_read_acquire(&lock->cnts, VAL == _QW_WAITING);
kernel/locking/qspinlock.c:             atomic_cond_read_acquire(&lock->val, !(VAL & _Q_LOCKED_MASK));
kernel/locking/qspinlock.c:     val = atomic_cond_read_acquire(&lock->val, !(VAL & _Q_LOCKED_PENDING_MASK));

include/linux/refcount.h:               smp_acquire__after_ctrl_dep();
ipc/mqueue.c:                   smp_acquire__after_ctrl_dep();
ipc/msg.c:                      smp_acquire__after_ctrl_dep();
ipc/sem.c:                      smp_acquire__after_ctrl_dep();
kernel/locking/rwsem.c:                 smp_acquire__after_ctrl_dep();
kernel/sched/core.c:    smp_acquire__after_ctrl_dep();

kernel/events/ring_buffer.c:__perf_output_begin()

And I'm fairly sure I'm forgetting some... One could argue there's too
many of them to check already.

Both GCC and CLANG had better think about it.
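
(For anyone following along, the canonical control-dependency pattern
from memory-barriers.txt is roughly the sketch below; the store is
ordered after the load only by the branch, so the compiler must neither
convert the branch into a conditional move nor hoist the store out of
it:)

	q = READ_ONCE(a);
	if (q)
		WRITE_ONCE(b, 1);	/* ordered after the READ_ONCE() */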
Paul E. McKenney July 6, 2020, 11:41 p.m. UTC | #34
On Mon, Jul 06, 2020 at 09:40:12PM +0200, Peter Zijlstra wrote:
> On Mon, Jul 06, 2020 at 11:39:33AM -0700, Paul E. McKenney wrote:
> > On Mon, Jul 06, 2020 at 08:29:26PM +0200, Peter Zijlstra wrote:
> > > On Mon, Jul 06, 2020 at 09:26:33AM -0700, Paul E. McKenney wrote:
> 
> > > If they do not care about their Linux OS running correctly, that is :-)
> > 
> > Many of them really do not care at all.  In fact, some would consider
> > Linux failing to run as an added bonus.
> 
> This I think is why we have compiler people in the thread that care a
> lot more.

Here is hoping! ;-)

> > > > Nevertheless, yes, control dependencies also need attention.
> > > 
> > > Today I added one more \o/
> > 
> > Just make sure you continually check to make sure that compilers
> > don't break it, along with the others you have added.  ;-)
> 
> There's:
> 
> kernel/locking/mcs_spinlock.h:  smp_cond_load_acquire(l, VAL);                          \
> kernel/sched/core.c:                    smp_cond_load_acquire(&p->on_cpu, !VAL);
> kernel/smp.c:   smp_cond_load_acquire(&csd->node.u_flags, !(VAL & CSD_FLAG_LOCK));
> 
> arch/x86/kernel/alternative.c:          atomic_cond_read_acquire(&desc.refs, !VAL);
> kernel/locking/qrwlock.c:               atomic_cond_read_acquire(&lock->cnts, !(VAL & _QW_LOCKED));
> kernel/locking/qrwlock.c:       atomic_cond_read_acquire(&lock->cnts, !(VAL & _QW_LOCKED));
> kernel/locking/qrwlock.c:               atomic_cond_read_acquire(&lock->cnts, VAL == _QW_WAITING);
> kernel/locking/qspinlock.c:             atomic_cond_read_acquire(&lock->val, !(VAL & _Q_LOCKED_MASK));
> kernel/locking/qspinlock.c:     val = atomic_cond_read_acquire(&lock->val, !(VAL & _Q_LOCKED_PENDING_MASK));
> 
> include/linux/refcount.h:               smp_acquire__after_ctrl_dep();
> ipc/mqueue.c:                   smp_acquire__after_ctrl_dep();
> ipc/msg.c:                      smp_acquire__after_ctrl_dep();
> ipc/sem.c:                      smp_acquire__after_ctrl_dep();
> kernel/locking/rwsem.c:                 smp_acquire__after_ctrl_dep();
> kernel/sched/core.c:    smp_acquire__after_ctrl_dep();
> 
> kernel/events/ring_buffer.c:__perf_output_begin()
> 
> And I'm fairly sure I'm forgetting some... One could argue there's too
> many of them to check already.
> 
> Both GCC and CLANG had better think about it.

That would be good!

I won't list the number of address/data dependencies given that there
are well over a thousand of them.

							Thanx, Paul
Sami Tolvanen July 7, 2020, 3:51 p.m. UTC | #35
On Mon, Jun 29, 2020 at 04:20:59PM -0700, Sami Tolvanen wrote:
> Hi Masahiro,
> 
> On Mon, Jun 29, 2020 at 01:56:19AM +0900, Masahiro Yamada wrote:
> > I also got an error for
> > ARCH=arm64 allyesconfig + CONFIG_LTO_CLANG=y
> > 
> > 
> > 
> > $ make ARCH=arm64 LLVM=1 LLVM_IAS=1
> > CROSS_COMPILE=~/tools/aarch64-linaro-7.5/bin/aarch64-linux-gnu-
> > -j24
> > 
> >   ...
> > 
> >   GEN     .version
> >   CHK     include/generated/compile.h
> >   UPD     include/generated/compile.h
> >   CC      init/version.o
> >   AR      init/built-in.a
> >   GEN     .tmp_initcalls.lds
> >   GEN     .tmp_symversions.lds
> >   LTO     vmlinux.o
> >   MODPOST vmlinux.symvers
> >   MODINFO modules.builtin.modinfo
> >   GEN     modules.builtin
> >   LD      .tmp_vmlinux.kallsyms1
> > ld.lld: error: undefined symbol: __compiletime_assert_905
> > >>> referenced by irqbypass.c
> > >>>               vmlinux.o:(jeq_imm)
> > make: *** [Makefile:1161: vmlinux] Error 1
> 
> I can reproduce this with ToT LLVM and it's BUILD_BUG_ON_MSG(..., "value
> too large for the field") in drivers/net/ethernet/netronome/nfp/bpf/jit.c.
> Specifically, the FIELD_FIT / __BF_FIELD_CHECK macro in ur_load_imm_any.
> 
> This compiles just fine with an earlier LLVM revision, so it could be a
> relatively recent regression. I'll take a look. Thanks for catching this!

After spending some time debugging this with Nick, it looks like the
error is caused by a recent optimization change in LLVM, which together
with the inlining of ur_load_imm_any into jeq_imm, changes a runtime
check in FIELD_FIT that would always fail, to a compile-time check that
breaks the build. In jeq_imm, we have:

    /* struct bpf_insn: __s32 imm */
    u64 imm = insn->imm; /* sign extend */
    ...
    if (imm >> 32) { /* non-zero only if insn->imm is negative */
    	/* inlined from ur_load_imm_any */
	u32 __imm = imm >> 32; /* therefore, always 0xffffffff */

        /*
	 * __imm has a value known at compile-time, which means
	 * __builtin_constant_p(__imm) is true and we end up with
	 * essentially this in __BF_FIELD_CHECK:
	 */
	if (__builtin_constant_p(__imm) && __imm <= 255)
	      __compiletime_assert_N();

The compile-time check comes from the following BUILD_BUG_ON_MSG:

    #define __BF_FIELD_CHECK(_mask, _reg, _val, _pfx)		\
    ...
	BUILD_BUG_ON_MSG(__builtin_constant_p(_val) ?		\
			 ~((_mask) >> __bf_shf(_mask)) & (_val) : 0, \
			 _pfx "value too large for the field"); \

While we could stop the compiler from performing this optimization by
telling it to never inline ur_load_imm_any, we feel like a better fix
might be to replace FIELD_FIT(UR_REG_IMM_MAX, imm) with a simple imm
<= UR_REG_IMM_MAX check that won't trip a compile-time assertion even
when the condition is known to fail.
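
A sketch of what that would look like in ur_load_imm_any (hypothetical
diff, assuming UR_REG_IMM_MAX is the largest encodable immediate):

	-	if (FIELD_FIT(UR_REG_IMM_MAX, imm))
	+	if (imm <= UR_REG_IMM_MAX)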

Jiong, Jakub, do you see any issues here?

Sami
Sami Tolvanen July 7, 2020, 4:05 p.m. UTC | #36
On Tue, Jul 07, 2020 at 08:51:07AM -0700, Sami Tolvanen wrote:
> After spending some time debugging this with Nick, it looks like the
> error is caused by a recent optimization change in LLVM, which together
> with the inlining of ur_load_imm_any into jeq_imm, changes a runtime
> check in FIELD_FIT that would always fail, to a compile-time check that
> breaks the build. In jeq_imm, we have:
> 
>     /* struct bpf_insn: __s32 imm */
>     u64 imm = insn->imm; /* sign extend */
>     ...
>     if (imm >> 32) { /* non-zero only if insn->imm is negative */
>     	/* inlined from ur_load_imm_any */
> 	u32 __imm = imm >> 32; /* therefore, always 0xffffffff */
> 
>         /*
> 	 * __imm has a value known at compile-time, which means
> 	 * __builtin_constant_p(__imm) is true and we end up with
> 	 * essentially this in __BF_FIELD_CHECK:
> 	 */
> 	if (__builtin_constant_p(__imm) && __imm <= 255)

Should be __imm > 255, of course, which means the compiler will generate
a call to __compiletime_assert.

> Jiong, Jakub, do you see any issues here?

(Jiong's email bounced, so removing from the recipient list.)

Sami
Jakub Kicinski July 7, 2020, 4:56 p.m. UTC | #37
On Tue, 7 Jul 2020 09:05:28 -0700 Sami Tolvanen wrote:
> On Tue, Jul 07, 2020 at 08:51:07AM -0700, Sami Tolvanen wrote:
> > After spending some time debugging this with Nick, it looks like the
> > error is caused by a recent optimization change in LLVM, which together
> > with the inlining of ur_load_imm_any into jeq_imm, changes a runtime
> > check in FIELD_FIT that would always fail, to a compile-time check that
> > breaks the build. In jeq_imm, we have:
> > 
> >     /* struct bpf_insn: __s32 imm */
> >     u64 imm = insn->imm; /* sign extend */
> >     ...
> >     if (imm >> 32) { /* non-zero only if insn->imm is negative */
> >     	/* inlined from ur_load_imm_any */
> > 	u32 __imm = imm >> 32; /* therefore, always 0xffffffff */
> > 
> >         /*
> > 	 * __imm has a value known at compile-time, which means
> > 	 * __builtin_constant_p(__imm) is true and we end up with
> > 	 * essentially this in __BF_FIELD_CHECK:
> > 	 */
> > 	if (__builtin_constant_p(__imm) && __imm <= 255)  
> 
> Should be __imm > 255, of course, which means the compiler will generate
> a call to __compiletime_assert.

I think FIELD_FIT() should not pass the value into __BF_FIELD_CHECK().

So:

diff --git a/include/linux/bitfield.h b/include/linux/bitfield.h
index 48ea093ff04c..4e035aca6f7e 100644
--- a/include/linux/bitfield.h
+++ b/include/linux/bitfield.h
@@ -77,7 +77,7 @@
  */
 #define FIELD_FIT(_mask, _val)                                         \
        ({                                                              \
-               __BF_FIELD_CHECK(_mask, 0ULL, _val, "FIELD_FIT: ");     \
+               __BF_FIELD_CHECK(_mask, 0ULL, 0ULL, "FIELD_FIT: ");     \
                !((((typeof(_mask))_val) << __bf_shf(_mask)) & ~(_mask)); \
        })
 
It's perfectly legal to pass a constant which does not fit, in which
case FIELD_FIT() should just return false, not break the build.

Right?
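
(Sketch of the behaviour after this change, with hypothetical values: an
out-of-range constant simply makes FIELD_FIT() evaluate to false instead
of tripping the compile-time assert:)

	if (!FIELD_FIT(GENMASK(7, 0), 0x1ff))	/* 0x1ff needs 9 bits */
		return -ERANGE;			/* taken: value does not fit */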
Nick Desaulniers July 7, 2020, 5:17 p.m. UTC | #38
On Tue, Jul 7, 2020 at 9:56 AM Jakub Kicinski <kuba@kernel.org> wrote:
>
> > On Tue, Jul 07, 2020 at 08:51:07AM -0700, Sami Tolvanen wrote:
> > > After spending some time debugging this with Nick, it looks like the
> > > error is caused by a recent optimization change in LLVM, which together
> > > with the inlining of ur_load_imm_any into jeq_imm, changes a runtime
> > > check in FIELD_FIT that would always fail, to a compile-time check that
> > > breaks the build. In jeq_imm, we have:
> > >
> > >     /* struct bpf_insn: __s32 imm */
> > >     u64 imm = insn->imm; /* sign extend */
> > >     ...
> > >     if (imm >> 32) { /* non-zero only if insn->imm is negative */
> > >             /* inlined from ur_load_imm_any */
> > >     u32 __imm = imm >> 32; /* therefore, always 0xffffffff */
> > >
> > >         /*
> > >      * __imm has a value known at compile-time, which means
> > >      * __builtin_constant_p(__imm) is true and we end up with
> > >      * essentially this in __BF_FIELD_CHECK:
> > >      */
> > >     if (__builtin_constant_p(__imm) && __imm > 255)
>
> I think FIELD_FIT() should not pass the value into __BF_FIELD_CHECK().
>
> So:
>
> diff --git a/include/linux/bitfield.h b/include/linux/bitfield.h
> index 48ea093ff04c..4e035aca6f7e 100644
> --- a/include/linux/bitfield.h
> +++ b/include/linux/bitfield.h
> @@ -77,7 +77,7 @@
>   */
>  #define FIELD_FIT(_mask, _val)                                         \
>         ({                                                              \
> -               __BF_FIELD_CHECK(_mask, 0ULL, _val, "FIELD_FIT: ");     \
> +               __BF_FIELD_CHECK(_mask, 0ULL, 0ULL, "FIELD_FIT: ");     \
>                 !((((typeof(_mask))_val) << __bf_shf(_mask)) & ~(_mask)); \
>         })
>
> It's perfectly legal to pass a constant which does not fit, in which
> case FIELD_FIT() should just return false, not break the build.
>
> Right?

I see the value of the __builtin_constant_p check; this is just a very
interesting case where rather than an integer literal appearing in the
source, the compiler is able to deduce that the parameter can only
have one value in one case, and allows __builtin_constant_p to
evaluate to true for it.

I had definitely asked Sami about the comment above FIELD_FIT:
"""
 76  * Return: true if @_val can fit inside @_mask, false if @_val is too big.
"""
in which FIELD_FIT doesn't return false if @_val is too big and a
compile-time constant (rather, it breaks the build).

Of the 14 expansion sites of FIELD_FIT I see in mainline, it doesn't
look like any integer literals are passed, so maybe the compile-time
checks of _val are of little value for FIELD_FIT.

So I think your suggested diff is the most concise fix.
Jakub Kicinski July 7, 2020, 5:30 p.m. UTC | #39
On Tue, 7 Jul 2020 10:17:25 -0700 Nick Desaulniers wrote:
> On Tue, Jul 7, 2020 at 9:56 AM Jakub Kicinski <kuba@kernel.org> wrote:
> >  
> > > On Tue, Jul 07, 2020 at 08:51:07AM -0700, Sami Tolvanen wrote:  
> > > > After spending some time debugging this with Nick, it looks like the
> > > > error is caused by a recent optimization change in LLVM, which together
> > > > with the inlining of ur_load_imm_any into jeq_imm, changes a runtime
> > > > check in FIELD_FIT that would always fail, to a compile-time check that
> > > > breaks the build. In jeq_imm, we have:
> > > >
> > > >     /* struct bpf_insn: __s32 imm */
> > > >     u64 imm = insn->imm; /* sign extend */
> > > >     ...
> > > >     if (imm >> 32) { /* non-zero only if insn->imm is negative */
> > > >             /* inlined from ur_load_imm_any */
> > > >     u32 __imm = imm >> 32; /* therefore, always 0xffffffff */
> > > >
> > > >         /*
> > > >      * __imm has a value known at compile-time, which means
> > > >      * __builtin_constant_p(__imm) is true and we end up with
> > > >      * essentially this in __BF_FIELD_CHECK:
> > > >      */
> > > >     if (__builtin_constant_p(__imm) && __imm > 255)  
> >
> > I think FIELD_FIT() should not pass the value into __BF_FIELD_CHECK().
> >
> > So:
> >
> > diff --git a/include/linux/bitfield.h b/include/linux/bitfield.h
> > index 48ea093ff04c..4e035aca6f7e 100644
> > --- a/include/linux/bitfield.h
> > +++ b/include/linux/bitfield.h
> > @@ -77,7 +77,7 @@
> >   */
> >  #define FIELD_FIT(_mask, _val)                                         \
> >         ({                                                              \
> > -               __BF_FIELD_CHECK(_mask, 0ULL, _val, "FIELD_FIT: ");     \
> > +               __BF_FIELD_CHECK(_mask, 0ULL, 0ULL, "FIELD_FIT: ");     \
> >                 !((((typeof(_mask))_val) << __bf_shf(_mask)) & ~(_mask)); \
> >         })
> >
> > It's perfectly legal to pass a constant which does not fit, in which
> > case FIELD_FIT() should just return false, not break the build.
> >
> > Right?  
> 
> I see the value of the __builtin_constant_p check; this is just a very
> interesting case where rather than an integer literal appearing in the
> source, the compiler is able to deduce that the parameter can only
> have one value in one case, and allows __builtin_constant_p to
> evaluate to true for it.
> 
> I had definitely asked Sami about the comment above FIELD_FIT:
> """
>  76  * Return: true if @_val can fit inside @_mask, false if @_val is
> too big.
> """
> in which FIELD_FIT doesn't return false if @_val is too big and a
> compile time constant. (Rather it breaks the build).
> 
> Of the 14 expansion sites of FIELD_FIT I see in mainline, it doesn't
> look like any integral literals are passed, so maybe the compile time
> checks of _val are of little value for FIELD_FIT.

Also I just double checked and all FIELD_FIT() uses check the return
value.

> So I think your suggested diff is the most concise fix.

Feel free to submit that officially as a patch if it fixes the build
for you; here's my sign-off:

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Paul Menzel July 11, 2020, 4:32 p.m. UTC | #40
Dear Sami,


Am 24.06.20 um 22:31 schrieb Sami Tolvanen:
> This patch series adds support for building x86_64 and arm64 kernels
> with Clang's Link Time Optimization (LTO).
> 
> In addition to performance, the primary motivation for LTO is to allow
> Clang's Control-Flow Integrity (CFI) to be used in the kernel. Google's
> Pixel devices have shipped with LTO+CFI kernels since 2018.
> 
> Most of the patches are build system changes for handling LLVM bitcode,
> which Clang produces with LTO instead of ELF object files, postponing
> ELF processing until a later stage, and ensuring initcall ordering.
> 
> Note that first objtool patch in the series is already in linux-next,
> but as it's needed with LTO, I'm including it also here to make testing
> easier.

[…]

Thank you very much for sending these changes.

Do you have a branch where your current work can be pulled from? Your
branch on GitHub [1] seems 15 months old.

Out of curiosity, I applied the changes, allowed the selection for i386 
(x86), and with Clang 1:11~++20200701093119+ffee8040534-1~exp1 from 
Debian experimental, it failed with `Invalid absolute R_386_32 
relocation: KERNEL_PAGES`:

> make -f ./scripts/Makefile.build obj=arch/x86/boot arch/x86/boot/bzImage
> make -f ./scripts/Makefile.build obj=arch/x86/boot/compressed arch/x86/boot/compressed/vmlinux
>   llvm-nm vmlinux | sed -n -e 's/^\([0-9a-fA-F]*\) [ABCDGRSTVW] \(_text\|__bss_start\|_end\)$/#define VO_\2 _AC(0x\1,UL)/p' > arch/x86/boot/compressed/../voffset.h
>   clang -Wp,-MMD,arch/x86/boot/compressed/.misc.o.d -nostdinc -isystem /usr/lib/llvm-11/lib/clang/11.0.0/include -I./arch/x86/include -I./arch/x86/include/generated  -I./include -I./arch/x86/include/uapi -I./arch/x86/include/generated/uapi -I./include/uapi -I./include/generated/uapi -include ./include/linux/kconfig.h -include ./include/linux/compiler_types.h -D__KERNEL__ -Qunused-arguments -m32 -O2 -fno-strict-aliasing -fPIE -DDISABLE_BRANCH_PROFILING -march=i386 -mno-mmx -mno-sse -ffreestanding -fno-stack-protector -Wno-address-of-packed-member -Wno-gnu -Wno-pointer-sign -fmacro-prefix-map=./= -fno-asynchronous-unwind-tables    -DKBUILD_MODFILE='"arch/x86/boot/compressed/misc"' -DKBUILD_BASENAME='"misc"' -DKBUILD_MODNAME='"misc"' -D__KBUILD_MODNAME=misc -c -o arch/x86/boot/compressed/misc.o arch/x86/boot/compressed/misc.c
>   llvm-objcopy  -R .comment -S vmlinux arch/x86/boot/compressed/vmlinux.bin
>   arch/x86/tools/relocs vmlinux > arch/x86/boot/compressed/vmlinux.relocs;arch/x86/tools/relocs --abs-relocs vmlinux
> Invalid absolute R_386_32 relocation: KERNEL_PAGES
> make[2]: *** [arch/x86/boot/compressed/Makefile:134: arch/x86/boot/compressed/vmlinux.relocs] Error 1
> make[2]: *** Deleting file 'arch/x86/boot/compressed/vmlinux.relocs'
> make[1]: *** [arch/x86/boot/Makefile:115: arch/x86/boot/compressed/vmlinux] Error 2
> make: *** [arch/x86/Makefile:268: bzImage] Error 2


Kind regards,

Paul



[1]: https://github.com/samitolvanen/linux/tree/clang-lto
Sedat Dilek July 12, 2020, 8:59 a.m. UTC | #41
On Sat, Jul 11, 2020 at 6:32 PM Paul Menzel <pmenzel@molgen.mpg.de> wrote:
>
> Dear Sami,
>
>
> Am 24.06.20 um 22:31 schrieb Sami Tolvanen:
> > This patch series adds support for building x86_64 and arm64 kernels
> > with Clang's Link Time Optimization (LTO).
> >
> > In addition to performance, the primary motivation for LTO is to allow
> > Clang's Control-Flow Integrity (CFI) to be used in the kernel. Google's
> > Pixel devices have shipped with LTO+CFI kernels since 2018.
> >
> > Most of the patches are build system changes for handling LLVM bitcode,
> > which Clang produces with LTO instead of ELF object files, postponing
> > ELF processing until a later stage, and ensuring initcall ordering.
> >
> > Note that first objtool patch in the series is already in linux-next,
> > but as it's needed with LTO, I'm including it also here to make testing
> > easier.
>
> […]
>
> Thank you very much for sending these changes.
>
> Do you have a branch where your current work can be pulled from? Your
> branch on GitHub [1] seems 15 months old.
>

Agreed it's easier to git-pull.
I have seen [1] - not sure if this is the latest version.
Alternatively, you can check patchwork LKML by searching for $submitter [2].
( You can open patch 01/22 and download the whole patch-series by
following the link "series", see [3]. )

- Sedat -

[1] https://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild.git/log/?h=lto
[2] https://lore.kernel.org/patchwork/project/lkml/list/?series=&submitter=19676
[3] https://lore.kernel.org/patchwork/series/450026/mbox/
Nathan Chancellor July 12, 2020, 6:40 p.m. UTC | #42
On Sun, Jul 12, 2020 at 10:59:17AM +0200, Sedat Dilek wrote:
> On Sat, Jul 11, 2020 at 6:32 PM Paul Menzel <pmenzel@molgen.mpg.de> wrote:
> >
> > Dear Sami,
> >
> >
> > Am 24.06.20 um 22:31 schrieb Sami Tolvanen:
> > > This patch series adds support for building x86_64 and arm64 kernels
> > > with Clang's Link Time Optimization (LTO).
> > >
> > > In addition to performance, the primary motivation for LTO is to allow
> > > Clang's Control-Flow Integrity (CFI) to be used in the kernel. Google's
> > > Pixel devices have shipped with LTO+CFI kernels since 2018.
> > >
> > > Most of the patches are build system changes for handling LLVM bitcode,
> > > which Clang produces with LTO instead of ELF object files, postponing
> > > ELF processing until a later stage, and ensuring initcall ordering.
> > >
> > > Note that first objtool patch in the series is already in linux-next,
> > > but as it's needed with LTO, I'm including it also here to make testing
> > > easier.
> >
> > […]
> >
> > Thank you very much for sending these changes.
> >
> > Do you have a branch where your current work can be pulled from? Your
> > branch on GitHub [1] seems 15 months old.
> >
> 
> Agreed it's easier to git-pull.
> I have seen [1] - not sure if this is the latest version.
> Alternatively, you can check patchwork LKML by searching for $submitter.
> ( You can open patch 01/22 and download the whole patch-series by
> following the link "series", see [3]. )
> 
> - Sedat -
> 
> [1] https://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild.git/log/?h=lto
> [2] https://lore.kernel.org/patchwork/project/lkml/list/?series=&submitter=19676
> [3] https://lore.kernel.org/patchwork/series/450026/mbox/
> 

Sami tagged this series on his GitHub:

https://github.com/samitolvanen/linux/releases/tag/lto-v1

git pull https://github.com/samitolvanen/linux lto-v1

Otherwise, he is updating the clang-cfi branch that includes both the
LTO and CFI patchsets. You can pull that and just turn on
CONFIG_LTO_CLANG.

Lastly, for the future, I would recommend grabbing b4 to easily apply
patches (specifically full series) from lore.kernel.org.

https://git.kernel.org/pub/scm/utils/b4/b4.git/
https://git.kernel.org/pub/scm/utils/b4/b4.git/tree/README.rst

You could grab this series and apply it easily by either downloading the
mbox file and following the instructions it gives for applying the mbox
file:

$ b4 am 20200624203200.78870-1-samitolvanen@google.com

or I prefer piping so that I don't have to clean up later:

$ b4 am -o - 20200624203200.78870-1-samitolvanen@google.com | git am

Cheers,
Nathan
Sami Tolvanen July 12, 2020, 11:34 p.m. UTC | #43
On Sat, Jul 11, 2020 at 9:32 AM Paul Menzel <pmenzel@molgen.mpg.de> wrote:
> Thank you very much for sending these changes.
>
> Do you have a branch where your current work can be pulled from? Your
> branch on GitHub [1] seems 15 months old.

The clang-lto branch is rebased regularly on top of Linus' tree.
GitHub just looks at the commit date of the last commit in the tree,
which isn't all that informative.

> Out of curiosity, I applied the changes, allowed the selection for i386
> (x86), and with Clang 1:11~++20200701093119+ffee8040534-1~exp1 from
> Debian experimental, it failed with `Invalid absolute R_386_32
> relocation: KERNEL_PAGES`:

I haven't looked at getting this to work on i386, which is why we only
select ARCH_SUPPORTS_LTO for x86_64. I would expect there to be a few
issues to address.

> >   arch/x86/tools/relocs vmlinux > arch/x86/boot/compressed/vmlinux.relocs;arch/x86/tools/relocs --abs-relocs vmlinux
> > Invalid absolute R_386_32 relocation: KERNEL_PAGES

KERNEL_PAGES looks like a constant, so it's probably safe to ignore
the absolute relocation in tools/relocs.c.

Sami
Sedat Dilek July 14, 2020, 9:44 a.m. UTC | #44
On Sun, Jul 12, 2020 at 8:40 PM Nathan Chancellor
<natechancellor@gmail.com> wrote:
>
> On Sun, Jul 12, 2020 at 10:59:17AM +0200, Sedat Dilek wrote:
> > On Sat, Jul 11, 2020 at 6:32 PM Paul Menzel <pmenzel@molgen.mpg.de> wrote:
> > >
> > > Dear Sami,
> > >
> > >
> > > Am 24.06.20 um 22:31 schrieb Sami Tolvanen:
> > > > This patch series adds support for building x86_64 and arm64 kernels
> > > > with Clang's Link Time Optimization (LTO).
> > > >
> > > > In addition to performance, the primary motivation for LTO is to allow
> > > > Clang's Control-Flow Integrity (CFI) to be used in the kernel. Google's
> > > > Pixel devices have shipped with LTO+CFI kernels since 2018.
> > > >
> > > > Most of the patches are build system changes for handling LLVM bitcode,
> > > > which Clang produces with LTO instead of ELF object files, postponing
> > > > ELF processing until a later stage, and ensuring initcall ordering.
> > > >
> > > > Note that first objtool patch in the series is already in linux-next,
> > > > but as it's needed with LTO, I'm including it also here to make testing
> > > > easier.
> > >
> > > […]
> > >
> > > Thank you very much for sending these changes.
> > >
> > > Do you have a branch where your current work can be pulled from? Your
> > > branch on GitHub [1] seems 15 months old.
> > >
> >
> > Agreed it's easier to git-pull.
> > I have seen [1] - not sure if this is the latest version.
> > Alternatively, you can check patchwork LKML by searching for $submitter.
> > ( You can open patch 01/22 and download the whole patch-series by
> > following the link "series", see [3]. )
> >
> > - Sedat -
> >
> > [1] https://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild.git/log/?h=lto
> > [2] https://lore.kernel.org/patchwork/project/lkml/list/?series=&submitter=19676
> > [3] https://lore.kernel.org/patchwork/series/450026/mbox/
> >
>
> Sami tagged this series on his GitHub:
>
> https://github.com/samitolvanen/linux/releases/tag/lto-v1
>
> git pull https://github.com/samitolvanen/linux lto-v1
>
> Otherwise, he is updating the clang-cfi branch that includes both the
> LTO and CFI patchsets. You can pull that and just turn on
> CONFIG_LTO_CLANG.
>
> Lastly, for the future, I would recommend grabbing b4 to easily apply
> patches (specifically full series) from lore.kernel.org.
>
> https://git.kernel.org/pub/scm/utils/b4/b4.git/
> https://git.kernel.org/pub/scm/utils/b4/b4.git/tree/README.rst
>
> You could grab this series and apply it easily by either downloading the
> mbox file and following the instructions it gives for applying the mbox
> file:
>
> $ b4 am 20200624203200.78870-1-samitolvanen@google.com
>
> or I prefer piping so that I don't have to clean up later:
>
> $ b4 am -o - 20200624203200.78870-1-samitolvanen@google.com | git am
>

It is always a pleasure to read your replies and enrich my know-how
beyond Linux-kernel hacking :-).

Thanks for the tip with "b4" tool.
Maybe we should add this to our ClangBuiltLinux wiki "Command line tips and tricks" [1]?

- Sedat -

[1] https://github.com/ClangBuiltLinux/linux/wiki/Command-line-tips-and-tricks
Paul Menzel July 14, 2020, 12:16 p.m. UTC | #45
Dear Sami,


Am 13.07.20 um 01:34 schrieb Sami Tolvanen:
> On Sat, Jul 11, 2020 at 9:32 AM Paul Menzel <pmenzel@molgen.mpg.de> wrote:
>> Thank you very much for sending these changes.
>>
>> Do you have a branch where your current work can be pulled from? Your
>> branch on GitHub [1] seems 15 months old.
> 
> The clang-lto branch is rebased regularly on top of Linus' tree.
> GitHub just looks at the commit date of the last commit in the tree,
> which isn't all that informative.

Thank you for clearing this up, and sorry for not checking myself.

>> Out of curiosity, I applied the changes, allowed the selection for i386
>> (x86), and with Clang 1:11~++20200701093119+ffee8040534-1~exp1 from
>> Debian experimental, it failed with `Invalid absolute R_386_32
>> relocation: KERNEL_PAGES`:
> 
> I haven't looked at getting this to work on i386, which is why we only
> select ARCH_SUPPORTS_LTO for x86_64. I would expect there to be a few
> issues to address.
> 
>>>    arch/x86/tools/relocs vmlinux > arch/x86/boot/compressed/vmlinux.relocs;arch/x86/tools/relocs --abs-relocs vmlinux
>>> Invalid absolute R_386_32 relocation: KERNEL_PAGES
> 
> KERNEL_PAGES looks like a constant, so it's probably safe to ignore
> the absolute relocation in tools/relocs.c.

Thank you for pointing me in the right direction. I am happy to report
that with the diff below (no idea which list the string belongs in),
Linux 5.8-rc5 with the LLVM/Clang/LTO patches on top builds and boots on
the ASRock E350M1.

```
diff --git a/arch/x86/tools/relocs.c b/arch/x86/tools/relocs.c
index 8f3bf34840cef..e91af127ed3c0 100644
--- a/arch/x86/tools/relocs.c
+++ b/arch/x86/tools/relocs.c
@@ -79,6 +79,7 @@ static const char * const sym_regex_kernel[S_NSYMTYPES] = {
 	"__end_rodata_hpage_align|"
 #endif
 	"__vvar_page|"
+	"KERNEL_PAGES|"
 	"_end)$"
 };
```


Kind regards,

Paul
Sedat Dilek July 14, 2020, 12:35 p.m. UTC | #46
On Tue, Jul 14, 2020 at 2:16 PM Paul Menzel <pmenzel@molgen.mpg.de> wrote:
>
> Dear Sami,
>
>
> Am 13.07.20 um 01:34 schrieb Sami Tolvanen:
> > On Sat, Jul 11, 2020 at 9:32 AM Paul Menzel <pmenzel@molgen.mpg.de> wrote:
> >> Thank you very much for sending these changes.
> >>
> >> Do you have a branch where your current work can be pulled from? Your
> >> branch on GitHub [1] seems 15 months old.
> >
> > The clang-lto branch is rebased regularly on top of Linus' tree.
> > GitHub just looks at the commit date of the last commit in the tree,
> > which isn't all that informative.
>
> Thank you for clearing this up, and sorry for not checking myself.
>
> >> Out of curiosity, I applied the changes, allowed the selection for i386
> >> (x86), and with Clang 1:11~++20200701093119+ffee8040534-1~exp1 from
> >> Debian experimental, it failed with `Invalid absolute R_386_32
> >> relocation: KERNEL_PAGES`:
> >
> > I haven't looked at getting this to work on i386, which is why we only
> > select ARCH_SUPPORTS_LTO for x86_64. I would expect there to be a few
> > issues to address.
> >
> >>>    arch/x86/tools/relocs vmlinux > arch/x86/boot/compressed/vmlinux.relocs;arch/x86/tools/relocs --abs-relocs vmlinux
> >>> Invalid absolute R_386_32 relocation: KERNEL_PAGES
> >
> > KERNEL_PAGES looks like a constant, so it's probably safe to ignore
> > the absolute relocation in tools/relocs.c.
>
> Thank you for pointing me to the right direction. I am happy to report,
> that with the diff below (no idea to what list to add the string), Linux
> 5.8-rc5 with the LLVM/Clang/LTO patches on top, builds and boots on the
> ASRock E350M1.
>
> ```
> diff --git a/arch/x86/tools/relocs.c b/arch/x86/tools/relocs.c
> index 8f3bf34840cef..e91af127ed3c0 100644
> --- a/arch/x86/tools/relocs.c
> +++ b/arch/x86/tools/relocs.c
> @@ -79,6 +79,7 @@ static const char * const sym_regex_kernel[S_NSYMTYPES] = {
>          "__end_rodata_hpage_align|"
>   #endif
>          "__vvar_page|"
> +       "KERNEL_PAGES|"
>          "_end)$"
>   };
> ```
>

What llvm-toolchain and version did you use?

Can you post your linux-config?

Thanks.

- Sedat -
Nick Desaulniers July 14, 2020, 5:54 p.m. UTC | #47
On Tue, Jul 14, 2020 at 2:44 AM Sedat Dilek <sedat.dilek@gmail.com> wrote:
>
> On Sun, Jul 12, 2020 at 8:40 PM Nathan Chancellor
> <natechancellor@gmail.com> wrote:
> >
> > Lastly, for the future, I would recommend grabbing b4 to easily apply
> > patches (specifically full series) from lore.kernel.org.
> >
> > https://git.kernel.org/pub/scm/utils/b4/b4.git/
> > https://git.kernel.org/pub/scm/utils/b4/b4.git/tree/README.rst
> >
> > You could grab this series and apply it easily by either downloading the
> > mbox file and following the instructions it gives for applying the mbox
> > file:
> >
> > $ b4 am 20200624203200.78870-1-samitolvanen@google.com
> >
> > or I prefer piping so that I don't have to clean up later:
> >
> > $ b4 am -o - 20200624203200.78870-1-samitolvanen@google.com | git am
> >
>
> It is always a pleasure to read your replies and enrich my know-how
> beyond Linux-kernel hacking :-).
>
> Thanks for the tip with "b4" tool.
> Might add this to our ClangBuiltLinux wiki "Command line tips and tricks"?
>
> - Sedat -
>
> [1] https://github.com/ClangBuiltLinux/linux/wiki/Command-line-tips-and-tricks

Good idea, done.