[v9] pgo: add clang's Profile Guided Optimization infrastructure

Message ID 20210407211704.367039-1-morbo@google.com (mailing list archive)
State New, archived

Commit Message

Bill Wendling April 7, 2021, 9:17 p.m. UTC
From: Sami Tolvanen <samitolvanen@google.com>

Enable the use of clang's Profile-Guided Optimization[1]. To generate a
profile, the kernel is instrumented with PGO counters, a representative
workload is run, and the raw profile data is collected from
/sys/kernel/debug/pgo/profraw.

The raw profile data must be processed by clang's "llvm-profdata" tool
before it can be used during recompilation:

  $ cp /sys/kernel/debug/pgo/profraw vmlinux.profraw
  $ llvm-profdata merge --output=vmlinux.profdata vmlinux.profraw

Multiple raw profiles may be merged during this step.

The data can now be used by the compiler:

  $ make LLVM=1 KCFLAGS=-fprofile-use=vmlinux.profdata ...

This initial submission is restricted to x86, as that's the platform we
know works. This restriction can be lifted once other platforms have
been verified to work with PGO.

Note that this method of profiling the kernel is clang-native, unlike
the clang support in kernel/gcov.

[1] https://clang.llvm.org/docs/UsersManual.html#profile-guided-optimization

Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
Co-developed-by: Bill Wendling <morbo@google.com>
Signed-off-by: Bill Wendling <morbo@google.com>
Tested-by: Nick Desaulniers <ndesaulniers@google.com>
Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
Reviewed-by: Fangrui Song <maskray@google.com>
---
v9: - [maskray] Remove explicit 'ALIGN' and 'KEEP' from PGO variables in
      vmlinux.lds.h.
v8: - Rebased on top-of-tree.
v7: - [sedat.dilek] Fix minor build failure.
v6: - Add better documentation about the locking scheme and other things.
    - Rename macros to better match the same macros in LLVM's source code.
v5: - [natechancellor] Correct padding calculation.
v4: - [ndesaulniers] Remove non-x86 Makefile changes and use "hweight64" instead
      of using our own popcount implementation.
v3: - [sedat.dilek] Added change log section.
v2: - [natechancellor] Added "__llvm_profile_instrument_memop".
    - [maskray] Corrected documentation, re PGO flags when using LTO.
---
 Documentation/dev-tools/index.rst     |   1 +
 Documentation/dev-tools/pgo.rst       | 127 +++++++++
 MAINTAINERS                           |   9 +
 Makefile                              |   3 +
 arch/Kconfig                          |   1 +
 arch/x86/Kconfig                      |   1 +
 arch/x86/boot/Makefile                |   1 +
 arch/x86/boot/compressed/Makefile     |   1 +
 arch/x86/crypto/Makefile              |   4 +
 arch/x86/entry/vdso/Makefile          |   1 +
 arch/x86/kernel/vmlinux.lds.S         |   2 +
 arch/x86/platform/efi/Makefile        |   1 +
 arch/x86/purgatory/Makefile           |   1 +
 arch/x86/realmode/rm/Makefile         |   1 +
 arch/x86/um/vdso/Makefile             |   1 +
 drivers/firmware/efi/libstub/Makefile |   1 +
 include/asm-generic/vmlinux.lds.h     |  34 +++
 kernel/Makefile                       |   1 +
 kernel/pgo/Kconfig                    |  35 +++
 kernel/pgo/Makefile                   |   5 +
 kernel/pgo/fs.c                       | 389 ++++++++++++++++++++++++++
 kernel/pgo/instrument.c               | 189 +++++++++++++
 kernel/pgo/pgo.h                      | 203 ++++++++++++++
 scripts/Makefile.lib                  |  10 +
 24 files changed, 1022 insertions(+)
 create mode 100644 Documentation/dev-tools/pgo.rst
 create mode 100644 kernel/pgo/Kconfig
 create mode 100644 kernel/pgo/Makefile
 create mode 100644 kernel/pgo/fs.c
 create mode 100644 kernel/pgo/instrument.c
 create mode 100644 kernel/pgo/pgo.h

Comments

Kees Cook April 7, 2021, 9:22 p.m. UTC | #1
On Wed, Apr 07, 2021 at 02:17:04PM -0700, 'Bill Wendling' via Clang Built Linux wrote:
> From: Sami Tolvanen <samitolvanen@google.com>
> 
> Enable the use of clang's Profile-Guided Optimization[1]. To generate a
> profile, the kernel is instrumented with PGO counters, a representative
> workload is run, and the raw profile data is collected from
> /sys/kernel/debug/pgo/profraw.
> 
> The raw profile data must be processed by clang's "llvm-profdata" tool
> before it can be used during recompilation:
> 
>   $ cp /sys/kernel/debug/pgo/profraw vmlinux.profraw
>   $ llvm-profdata merge --output=vmlinux.profdata vmlinux.profraw
> 
> Multiple raw profiles may be merged during this step.
> 
> The data can now be used by the compiler:
> 
>   $ make LLVM=1 KCFLAGS=-fprofile-use=vmlinux.profdata ...
> 
> This initial submission is restricted to x86, as that's the platform we
> know works. This restriction can be lifted once other platforms have
> been verified to work with PGO.
> 
> Note that this method of profiling the kernel is clang-native, unlike
> the clang support in kernel/gcov.
> 
> [1] https://clang.llvm.org/docs/UsersManual.html#profile-guided-optimization
> 
> Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
> Co-developed-by: Bill Wendling <morbo@google.com>
> Signed-off-by: Bill Wendling <morbo@google.com>
> Tested-by: Nick Desaulniers <ndesaulniers@google.com>
> Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
> Reviewed-by: Fangrui Song <maskray@google.com>

Thanks for sending this again! I'm looking forward to using it.

Masahiro and Andrew, unless one of you would prefer to take this in your
tree, I figure I can snag it to send to Linus.

Anyone else have feedback?

Thanks!

-Kees
Fangrui Song April 7, 2021, 9:44 p.m. UTC | #2
On Wed, Apr 7, 2021 at 2:22 PM Kees Cook <keescook@chromium.org> wrote:
>
> On Wed, Apr 07, 2021 at 02:17:04PM -0700, 'Bill Wendling' via Clang Built Linux wrote:
> > From: Sami Tolvanen <samitolvanen@google.com>
> >
> > Enable the use of clang's Profile-Guided Optimization[1]. To generate a
> > profile, the kernel is instrumented with PGO counters, a representative
> > workload is run, and the raw profile data is collected from
> > /sys/kernel/debug/pgo/profraw.
> >
> > The raw profile data must be processed by clang's "llvm-profdata" tool
> > before it can be used during recompilation:
> >
> >   $ cp /sys/kernel/debug/pgo/profraw vmlinux.profraw
> >   $ llvm-profdata merge --output=vmlinux.profdata vmlinux.profraw
> >
> > Multiple raw profiles may be merged during this step.
> >
> > The data can now be used by the compiler:
> >
> >   $ make LLVM=1 KCFLAGS=-fprofile-use=vmlinux.profdata ...
> >
> > This initial submission is restricted to x86, as that's the platform we
> > know works. This restriction can be lifted once other platforms have
> > been verified to work with PGO.
> >
> > Note that this method of profiling the kernel is clang-native, unlike
> > the clang support in kernel/gcov.
> >
> > [1] https://clang.llvm.org/docs/UsersManual.html#profile-guided-optimization
> >
> > Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
> > Co-developed-by: Bill Wendling <morbo@google.com>
> > Signed-off-by: Bill Wendling <morbo@google.com>
> > Tested-by: Nick Desaulniers <ndesaulniers@google.com>
> > Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
> > Reviewed-by: Fangrui Song <maskray@google.com>
>
> Thanks for sending this again! I'm looking forward to using it.

Yay. Quite excited about that:)

> Masahiro and Andrew, unless one of you would prefer to take this in your
> tree, I figure I can snag it to send to Linus.
>
> Anyone else have feedback?

I have carefully compared this implementation with the original one in
llvm-project/compiler-rt.
This looks great.
Also very happy about the cleaner include/asm-generic/vmlinux.lds.h now.

Just adding a note here for folks who may want to help test the
not-yet-common option LD_DEAD_CODE_DATA_ELIMINATION.
--gc-sections may not work perfectly with some advanced PGO features
before Clang 13 (not broken but probably just in an inferior state).
There were some upstream changes in this area recently and I think as
of my https://reviews.llvm.org/D97649 things should be perfect with GC
now.
This does not deserve any comment without more testing, though.

Thanks for already carrying my Reviewed-by tag.

> Thanks!
>
> -Kees
>
> --
> Kees Cook
Nathan Chancellor April 7, 2021, 9:47 p.m. UTC | #3
Hi Bill,

On Wed, Apr 07, 2021 at 02:17:04PM -0700, Bill Wendling wrote:
> From: Sami Tolvanen <samitolvanen@google.com>
> 
> Enable the use of clang's Profile-Guided Optimization[1]. To generate a
> profile, the kernel is instrumented with PGO counters, a representative
> workload is run, and the raw profile data is collected from
> /sys/kernel/debug/pgo/profraw.
> 
> The raw profile data must be processed by clang's "llvm-profdata" tool
> before it can be used during recompilation:
> 
>   $ cp /sys/kernel/debug/pgo/profraw vmlinux.profraw
>   $ llvm-profdata merge --output=vmlinux.profdata vmlinux.profraw
> 
> Multiple raw profiles may be merged during this step.
> 
> The data can now be used by the compiler:
> 
>   $ make LLVM=1 KCFLAGS=-fprofile-use=vmlinux.profdata ...
> 
> This initial submission is restricted to x86, as that's the platform we
> know works. This restriction can be lifted once other platforms have
> been verified to work with PGO.
> 
> Note that this method of profiling the kernel is clang-native, unlike
> the clang support in kernel/gcov.
> 
> [1] https://clang.llvm.org/docs/UsersManual.html#profile-guided-optimization
> 
> Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
> Co-developed-by: Bill Wendling <morbo@google.com>
> Signed-off-by: Bill Wendling <morbo@google.com>
> Tested-by: Nick Desaulniers <ndesaulniers@google.com>
> Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
> Reviewed-by: Fangrui Song <maskray@google.com>

A few small nits below. I'm not sure they warrant a v10 rather than just
some follow-up patches, so that's up to you. Regardless:

Reviewed-by: Nathan Chancellor <nathan@kernel.org>

> ---
> v9: - [maskray] Remove explicit 'ALIGN' and 'KEEP' from PGO variables in
>       vmlinux.lds.h.
> v8: - Rebased on top-of-tree.
> v7: - [sedat.dilek] Fix minor build failure.
> v6: - Add better documentation about the locking scheme and other things.
>     - Rename macros to better match the same macros in LLVM's source code.
> v5: - [natechancellor] Correct padding calculation.
> v4: - [ndesaulniers] Remove non-x86 Makefile changes and use "hweight64" instead
>       of using our own popcount implementation.
> v3: - [sedat.dilek] Added change log section.
> v2: - [natechancellor] Added "__llvm_profile_instrument_memop".
>     - [maskray] Corrected documentation, re PGO flags when using LTO.
> ---
>  Documentation/dev-tools/index.rst     |   1 +
>  Documentation/dev-tools/pgo.rst       | 127 +++++++++
>  MAINTAINERS                           |   9 +
>  Makefile                              |   3 +
>  arch/Kconfig                          |   1 +
>  arch/x86/Kconfig                      |   1 +
>  arch/x86/boot/Makefile                |   1 +
>  arch/x86/boot/compressed/Makefile     |   1 +
>  arch/x86/crypto/Makefile              |   4 +
>  arch/x86/entry/vdso/Makefile          |   1 +
>  arch/x86/kernel/vmlinux.lds.S         |   2 +
>  arch/x86/platform/efi/Makefile        |   1 +
>  arch/x86/purgatory/Makefile           |   1 +
>  arch/x86/realmode/rm/Makefile         |   1 +
>  arch/x86/um/vdso/Makefile             |   1 +
>  drivers/firmware/efi/libstub/Makefile |   1 +
>  include/asm-generic/vmlinux.lds.h     |  34 +++
>  kernel/Makefile                       |   1 +
>  kernel/pgo/Kconfig                    |  35 +++
>  kernel/pgo/Makefile                   |   5 +
>  kernel/pgo/fs.c                       | 389 ++++++++++++++++++++++++++
>  kernel/pgo/instrument.c               | 189 +++++++++++++
>  kernel/pgo/pgo.h                      | 203 ++++++++++++++
>  scripts/Makefile.lib                  |  10 +
>  24 files changed, 1022 insertions(+)
>  create mode 100644 Documentation/dev-tools/pgo.rst
>  create mode 100644 kernel/pgo/Kconfig
>  create mode 100644 kernel/pgo/Makefile
>  create mode 100644 kernel/pgo/fs.c
>  create mode 100644 kernel/pgo/instrument.c
>  create mode 100644 kernel/pgo/pgo.h
> 
> diff --git a/Documentation/dev-tools/index.rst b/Documentation/dev-tools/index.rst
> index 1b1cf4f5c9d9..6a30cd98e6f9 100644
> --- a/Documentation/dev-tools/index.rst
> +++ b/Documentation/dev-tools/index.rst
> @@ -27,6 +27,7 @@ whole; patches welcome!
>     kgdb
>     kselftest
>     kunit/index
> +   pgo
>  
>  
>  .. only::  subproject and html
> diff --git a/Documentation/dev-tools/pgo.rst b/Documentation/dev-tools/pgo.rst
> new file mode 100644
> index 000000000000..b7f11d8405b7
> --- /dev/null
> +++ b/Documentation/dev-tools/pgo.rst
> @@ -0,0 +1,127 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +
> +===============================
> +Using PGO with the Linux kernel
> +===============================
> +
> +Clang's profiling kernel support (PGO_) enables profiling of the Linux kernel
> +when building with Clang. The profiling data is exported via the ``pgo``
> +debugfs directory.
> +
> +.. _PGO: https://clang.llvm.org/docs/UsersManual.html#profile-guided-optimization
> +
> +
> +Preparation
> +===========
> +
> +Configure the kernel with:
> +
> +.. code-block:: make
> +
> +   CONFIG_DEBUG_FS=y
> +   CONFIG_PGO_CLANG=y
> +
> +Note that kernels compiled with profiling flags will be significantly larger
> +and run slower.
> +
> +Profiling data will only become accessible once debugfs has been mounted:
> +
> +.. code-block:: sh
> +
> +   mount -t debugfs none /sys/kernel/debug
> +
> +
> +Customization
> +=============
> +
> +You can enable or disable profiling for individual files and directories by
> +adding a line similar to the following to the respective kernel Makefile:
> +
> +- For a single file (e.g. main.o)
> +
> +  .. code-block:: make
> +
> +     PGO_PROFILE_main.o := y
> +
> +- For all files in one directory
> +
> +  .. code-block:: make
> +
> +     PGO_PROFILE := y
> +
> +To exclude files from being profiled use
> +
> +  .. code-block:: make
> +
> +     PGO_PROFILE_main.o := n
> +
> +and
> +
> +  .. code-block:: make
> +
> +     PGO_PROFILE := n
> +
> +Only files which are linked to the main kernel image or are compiled as kernel
> +modules are supported by this mechanism.
> +
> +
> +Files
> +=====
> +
> +The PGO kernel support creates the following files in debugfs:
> +
> +``/sys/kernel/debug/pgo``
> +	Parent directory for all PGO-related files.
> +
> +``/sys/kernel/debug/pgo/reset``
> +	Global reset file: resets all profiling data to zero when written to.
> +
> +``/sys/kernel/debug/pgo/profraw``
> +	The raw PGO data that must be processed with ``llvm-profdata``.
> +
> +
> +Workflow
> +========
> +
> +The PGO kernel can be run on the host or test machines. The data, though,
> +should be analyzed with the tools from the same Clang version that compiled
> +the kernel. Clang tolerates version skew, but matching versions is the
> +simpler option.
> +
> +The profiling data is useful for optimizing the kernel, analyzing coverage,
> +etc. Clang offers tools to perform these tasks.
> +
> +Here is an example workflow for profiling an instrumented kernel with PGO and
> +using the result to optimize the kernel:
> +
> +1) Install the kernel on the TEST machine.
> +
> +2) Reset the data counters right before running the load tests
> +
> +   .. code-block:: sh
> +
> +      $ echo 1 > /sys/kernel/debug/pgo/reset
> +
> +3) Run the load tests.
> +
> +4) Collect the raw profile data
> +
> +   .. code-block:: sh
> +
> +      $ cp -a /sys/kernel/debug/pgo/profraw /tmp/vmlinux.profraw
> +
> +5) (Optional) Download the raw profile data to the HOST machine.
> +
> +6) Process the raw profile data
> +
> +   .. code-block:: sh
> +
> +      $ llvm-profdata merge --output=vmlinux.profdata vmlinux.profraw
> +
> +   Note that multiple raw profile data files can be merged during this step.
> +
> +7) Rebuild the kernel using the profile data (PGO disabled)
> +
> +   .. code-block:: sh
> +
> +      $ make LLVM=1 KCFLAGS=-fprofile-use=vmlinux.profdata ...
> diff --git a/MAINTAINERS b/MAINTAINERS
> index c80ad735b384..742058188af2 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -14054,6 +14054,15 @@ S:	Maintained
>  F:	include/linux/personality.h
>  F:	include/uapi/linux/personality.h
>  
> +PGO BASED KERNEL PROFILING
> +M:	Sami Tolvanen <samitolvanen@google.com>
> +M:	Bill Wendling <wcw@google.com>
> +R:	Nathan Chancellor <natechancellor@gmail.com>

This should be updated to my @kernel.org address. I can send a follow-up
patch if need be.

> +R:	Nick Desaulniers <ndesaulniers@google.com>
> +S:	Supported
> +F:	Documentation/dev-tools/pgo.rst
> +F:	kernel/pgo
> +
>  PHOENIX RC FLIGHT CONTROLLER ADAPTER
>  M:	Marcus Folkesson <marcus.folkesson@gmail.com>
>  L:	linux-input@vger.kernel.org
> diff --git a/Makefile b/Makefile
> index cc77fd45ca64..6450faceb137 100644
> --- a/Makefile
> +++ b/Makefile
> @@ -660,6 +660,9 @@ endif # KBUILD_EXTMOD
>  # Defaults to vmlinux, but the arch makefile usually adds further targets
>  all: vmlinux
>  
> +CFLAGS_PGO_CLANG := -fprofile-generate
> +export CFLAGS_PGO_CLANG
> +
>  CFLAGS_GCOV	:= -fprofile-arcs -ftest-coverage \
>  	$(call cc-option,-fno-tree-loop-im) \
>  	$(call cc-disable-warning,maybe-uninitialized,)
> diff --git a/arch/Kconfig b/arch/Kconfig
> index ecfd3520b676..afd082133e0a 100644
> --- a/arch/Kconfig
> +++ b/arch/Kconfig
> @@ -1191,6 +1191,7 @@ config ARCH_HAS_ELFCORE_COMPAT
>  	bool
>  
>  source "kernel/gcov/Kconfig"
> +source "kernel/pgo/Kconfig"
>  
>  source "scripts/gcc-plugins/Kconfig"
>  
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index 2792879d398e..62be93b199ff 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -99,6 +99,7 @@ config X86
>  	select ARCH_SUPPORTS_KMAP_LOCAL_FORCE_MAP	if NR_CPUS <= 4096
>  	select ARCH_SUPPORTS_LTO_CLANG		if X86_64
>  	select ARCH_SUPPORTS_LTO_CLANG_THIN	if X86_64
> +	select ARCH_SUPPORTS_PGO_CLANG		if X86_64
>  	select ARCH_USE_BUILTIN_BSWAP
>  	select ARCH_USE_QUEUED_RWLOCKS
>  	select ARCH_USE_QUEUED_SPINLOCKS
> diff --git a/arch/x86/boot/Makefile b/arch/x86/boot/Makefile
> index fe605205b4ce..383853e32f67 100644
> --- a/arch/x86/boot/Makefile
> +++ b/arch/x86/boot/Makefile
> @@ -71,6 +71,7 @@ KBUILD_AFLAGS	:= $(KBUILD_CFLAGS) -D__ASSEMBLY__
>  KBUILD_CFLAGS	+= $(call cc-option,-fmacro-prefix-map=$(srctree)/=)
>  KBUILD_CFLAGS	+= -fno-asynchronous-unwind-tables
>  GCOV_PROFILE := n
> +PGO_PROFILE := n
>  UBSAN_SANITIZE := n
>  
>  $(obj)/bzImage: asflags-y  := $(SVGA_MODE)
> diff --git a/arch/x86/boot/compressed/Makefile b/arch/x86/boot/compressed/Makefile
> index e0bc3988c3fa..ed12ab65f606 100644
> --- a/arch/x86/boot/compressed/Makefile
> +++ b/arch/x86/boot/compressed/Makefile
> @@ -54,6 +54,7 @@ CFLAGS_sev-es.o += -I$(objtree)/arch/x86/lib/
>  
>  KBUILD_AFLAGS  := $(KBUILD_CFLAGS) -D__ASSEMBLY__
>  GCOV_PROFILE := n
> +PGO_PROFILE := n
>  UBSAN_SANITIZE :=n
>  
>  KBUILD_LDFLAGS := -m elf_$(UTS_MACHINE)
> diff --git a/arch/x86/crypto/Makefile b/arch/x86/crypto/Makefile
> index b28e36b7c96b..4b2e9620c412 100644
> --- a/arch/x86/crypto/Makefile
> +++ b/arch/x86/crypto/Makefile
> @@ -4,6 +4,10 @@
>  
>  OBJECT_FILES_NON_STANDARD := y
>  
> +# Disable PGO for curve25519-x86_64. With PGO enabled, clang runs out of
> +# registers for some of the functions.
> +PGO_PROFILE_curve25519-x86_64.o := n
> +
>  obj-$(CONFIG_CRYPTO_TWOFISH_586) += twofish-i586.o
>  twofish-i586-y := twofish-i586-asm_32.o twofish_glue.o
>  obj-$(CONFIG_CRYPTO_TWOFISH_X86_64) += twofish-x86_64.o
> diff --git a/arch/x86/entry/vdso/Makefile b/arch/x86/entry/vdso/Makefile
> index 05c4abc2fdfd..f7421e44725a 100644
> --- a/arch/x86/entry/vdso/Makefile
> +++ b/arch/x86/entry/vdso/Makefile
> @@ -180,6 +180,7 @@ quiet_cmd_vdso = VDSO    $@
>  VDSO_LDFLAGS = -shared --hash-style=both --build-id=sha1 \
>  	$(call ld-option, --eh-frame-hdr) -Bsymbolic
>  GCOV_PROFILE := n
> +PGO_PROFILE := n
>  
>  quiet_cmd_vdso_and_check = VDSO    $@
>        cmd_vdso_and_check = $(cmd_vdso); $(cmd_vdso_check)
> diff --git a/arch/x86/kernel/vmlinux.lds.S b/arch/x86/kernel/vmlinux.lds.S
> index efd9e9ea17f2..f6cab2316c46 100644
> --- a/arch/x86/kernel/vmlinux.lds.S
> +++ b/arch/x86/kernel/vmlinux.lds.S
> @@ -184,6 +184,8 @@ SECTIONS
>  
>  	BUG_TABLE
>  
> +	PGO_CLANG_DATA
> +
>  	ORC_UNWIND_TABLE
>  
>  	. = ALIGN(PAGE_SIZE);
> diff --git a/arch/x86/platform/efi/Makefile b/arch/x86/platform/efi/Makefile
> index 84b09c230cbd..5f22b31446ad 100644
> --- a/arch/x86/platform/efi/Makefile
> +++ b/arch/x86/platform/efi/Makefile
> @@ -2,6 +2,7 @@
>  OBJECT_FILES_NON_STANDARD_efi_thunk_$(BITS).o := y
>  KASAN_SANITIZE := n
>  GCOV_PROFILE := n
> +PGO_PROFILE := n
>  
>  obj-$(CONFIG_EFI) 		+= quirks.o efi.o efi_$(BITS).o efi_stub_$(BITS).o
>  obj-$(CONFIG_EFI_MIXED)		+= efi_thunk_$(BITS).o
> diff --git a/arch/x86/purgatory/Makefile b/arch/x86/purgatory/Makefile
> index 95ea17a9d20c..36f20e99da0b 100644
> --- a/arch/x86/purgatory/Makefile
> +++ b/arch/x86/purgatory/Makefile
> @@ -23,6 +23,7 @@ targets += purgatory.ro purgatory.chk
>  
>  # Sanitizer, etc. runtimes are unavailable and cannot be linked here.
>  GCOV_PROFILE	:= n
> +PGO_PROFILE	:= n
>  KASAN_SANITIZE	:= n
>  UBSAN_SANITIZE	:= n
>  KCSAN_SANITIZE	:= n
> diff --git a/arch/x86/realmode/rm/Makefile b/arch/x86/realmode/rm/Makefile
> index 83f1b6a56449..21797192f958 100644
> --- a/arch/x86/realmode/rm/Makefile
> +++ b/arch/x86/realmode/rm/Makefile
> @@ -76,4 +76,5 @@ KBUILD_CFLAGS	:= $(REALMODE_CFLAGS) -D_SETUP -D_WAKEUP \
>  KBUILD_AFLAGS	:= $(KBUILD_CFLAGS) -D__ASSEMBLY__
>  KBUILD_CFLAGS	+= -fno-asynchronous-unwind-tables
>  GCOV_PROFILE := n
> +PGO_PROFILE := n
>  UBSAN_SANITIZE := n
> diff --git a/arch/x86/um/vdso/Makefile b/arch/x86/um/vdso/Makefile
> index 5943387e3f35..54f5768f5853 100644
> --- a/arch/x86/um/vdso/Makefile
> +++ b/arch/x86/um/vdso/Makefile
> @@ -64,6 +64,7 @@ quiet_cmd_vdso = VDSO    $@
>  
>  VDSO_LDFLAGS = -fPIC -shared -Wl,--hash-style=sysv
>  GCOV_PROFILE := n
> +PGO_PROFILE := n
>  
>  #
>  # Install the unstripped copy of vdso*.so listed in $(vdso-install-y).
> diff --git a/drivers/firmware/efi/libstub/Makefile b/drivers/firmware/efi/libstub/Makefile
> index c23466e05e60..724fb389bb9d 100644
> --- a/drivers/firmware/efi/libstub/Makefile
> +++ b/drivers/firmware/efi/libstub/Makefile
> @@ -42,6 +42,7 @@ KBUILD_CFLAGS := $(filter-out $(CC_FLAGS_SCS), $(KBUILD_CFLAGS))
>  KBUILD_CFLAGS := $(filter-out $(CC_FLAGS_LTO), $(KBUILD_CFLAGS))
>  
>  GCOV_PROFILE			:= n
> +PGO_PROFILE			:= n
>  # Sanitizer runtimes are unavailable and cannot be linked here.
>  KASAN_SANITIZE			:= n
>  KCSAN_SANITIZE			:= n
> diff --git a/include/asm-generic/vmlinux.lds.h b/include/asm-generic/vmlinux.lds.h
> index 0331d5d49551..b371857097e8 100644
> --- a/include/asm-generic/vmlinux.lds.h
> +++ b/include/asm-generic/vmlinux.lds.h
> @@ -329,6 +329,39 @@
>  #define DTPM_TABLE()
>  #endif
>  
> +#ifdef CONFIG_PGO_CLANG
> +#define PGO_CLANG_DATA							\
> +	__llvm_prf_data : AT(ADDR(__llvm_prf_data) - LOAD_OFFSET) {	\
> +		__llvm_prf_start = .;					\
> +		__llvm_prf_data_start = .;				\
> +		*(__llvm_prf_data)					\
> +		__llvm_prf_data_end = .;				\
> +	}								\
> +	__llvm_prf_cnts : AT(ADDR(__llvm_prf_cnts) - LOAD_OFFSET) {	\
> +		__llvm_prf_cnts_start = .;				\
> +		*(__llvm_prf_cnts)					\
> +		__llvm_prf_cnts_end = .;				\
> +	}								\
> +	__llvm_prf_names : AT(ADDR(__llvm_prf_names) - LOAD_OFFSET) {	\
> +		__llvm_prf_names_start = .;				\
> +		*(__llvm_prf_names)					\
> +		__llvm_prf_names_end = .;				\
> +	}								\
> +	__llvm_prf_vals : AT(ADDR(__llvm_prf_vals) - LOAD_OFFSET) {	\
> +		__llvm_prf_vals_start = .;				\
> +		*(__llvm_prf_vals)					\
> +		__llvm_prf_vals_end = .;				\
> +	}								\
> +	__llvm_prf_vnds : AT(ADDR(__llvm_prf_vnds) - LOAD_OFFSET) {	\
> +		__llvm_prf_vnds_start = .;				\
> +		*(__llvm_prf_vnds)					\
> +		__llvm_prf_vnds_end = .;				\
> +		__llvm_prf_end = .;					\
> +	}
> +#else
> +#define PGO_CLANG_DATA
> +#endif
> +
>  #define KERNEL_DTB()							\
>  	STRUCT_ALIGN();							\
>  	__dtb_start = .;						\
> @@ -1106,6 +1139,7 @@
>  		CONSTRUCTORS						\
>  	}								\
>  	BUG_TABLE							\
> +	PGO_CLANG_DATA
>  
>  #define INIT_TEXT_SECTION(inittext_align)				\
>  	. = ALIGN(inittext_align);					\
> diff --git a/kernel/Makefile b/kernel/Makefile
> index 320f1f3941b7..a2a23ef2b12f 100644
> --- a/kernel/Makefile
> +++ b/kernel/Makefile
> @@ -111,6 +111,7 @@ obj-$(CONFIG_BPF) += bpf/
>  obj-$(CONFIG_KCSAN) += kcsan/
>  obj-$(CONFIG_SHADOW_CALL_STACK) += scs.o
>  obj-$(CONFIG_HAVE_STATIC_CALL_INLINE) += static_call.o
> +obj-$(CONFIG_PGO_CLANG) += pgo/
>  
>  obj-$(CONFIG_PERF_EVENTS) += events/
>  
> diff --git a/kernel/pgo/Kconfig b/kernel/pgo/Kconfig
> new file mode 100644
> index 000000000000..76a640b6cf6e
> --- /dev/null
> +++ b/kernel/pgo/Kconfig
> @@ -0,0 +1,35 @@
> +# SPDX-License-Identifier: GPL-2.0-only
> +menu "Profile Guided Optimization (PGO) (EXPERIMENTAL)"
> +
> +config ARCH_SUPPORTS_PGO_CLANG
> +	bool
> +
> +config PGO_CLANG
> +	bool "Enable clang's PGO-based kernel profiling"
> +	depends on DEBUG_FS
> +	depends on ARCH_SUPPORTS_PGO_CLANG
> +	depends on CC_IS_CLANG && CLANG_VERSION >= 120000
> +	help
> +	  This option enables clang's PGO (Profile Guided Optimization) based
> +	  code profiling to better optimize the kernel.
> +
> +	  If unsure, say N.
> +
> +	  Run a representative workload for your application on a kernel
> +	  compiled with this option and download the raw profile file from
> +	  /sys/kernel/debug/pgo/profraw. This file needs to be processed with
> +	  llvm-profdata. It may be merged with other collected raw profiles.
> +
> +	  Copy the resulting profile file into vmlinux.profdata, and enable
> +	  KCFLAGS=-fprofile-use=vmlinux.profdata to produce an optimized
> +	  kernel.
> +
> +	  Note that a kernel compiled with profiling flags will be
> +	  significantly larger and run slower. Also be sure to exclude files
> +	  from profiling which are not linked to the kernel image to prevent
> +	  linker errors.
> +
> +	  Note that the debugfs filesystem has to be mounted to access
> +	  profiling data.

It might be nice to have CONFIG_PGO_PROFILE_ALL like
CONFIG_GCOV_PROFILE_ALL so that people do not have to go sprinkle the
kernel with PGO_PROFILE definitions in the Makefile.

> +endmenu
> diff --git a/kernel/pgo/Makefile b/kernel/pgo/Makefile
> new file mode 100644
> index 000000000000..41e27cefd9a4
> --- /dev/null
> +++ b/kernel/pgo/Makefile
> @@ -0,0 +1,5 @@
> +# SPDX-License-Identifier: GPL-2.0
> +GCOV_PROFILE	:= n
> +PGO_PROFILE	:= n
> +
> +obj-y	+= fs.o instrument.o
> diff --git a/kernel/pgo/fs.c b/kernel/pgo/fs.c
> new file mode 100644
> index 000000000000..1678df3b7d64
> --- /dev/null
> +++ b/kernel/pgo/fs.c
> @@ -0,0 +1,389 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Copyright (C) 2019 Google, Inc.
> + *
> + * Author:
> + *	Sami Tolvanen <samitolvanen@google.com>
> + *
> + * This software is licensed under the terms of the GNU General Public
> + * License version 2, as published by the Free Software Foundation, and
> + * may be copied, distributed, and modified under those terms.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + * GNU General Public License for more details.
> + *
> + */
> +
> +#define pr_fmt(fmt)	"pgo: " fmt
> +
> +#include <linux/kernel.h>
> +#include <linux/debugfs.h>
> +#include <linux/fs.h>
> +#include <linux/module.h>
> +#include <linux/slab.h>
> +#include <linux/vmalloc.h>
> +#include "pgo.h"
> +
> +static struct dentry *directory;
> +
> +struct prf_private_data {
> +	void *buffer;
> +	unsigned long size;
> +};
> +
> +/*
> + * Raw profile data format:
> + *
> + *	- llvm_prf_header
> + *	- __llvm_prf_data
> + *	- __llvm_prf_cnts
> + *	- __llvm_prf_names
> + *	- zero padding to 8 bytes
> + *	- for each llvm_prf_data in __llvm_prf_data:
> + *		- llvm_prf_value_data
> + *			- llvm_prf_value_record + site count array
> + *				- llvm_prf_value_node_data
> + *				...
> + *			...
> + *		...
> + */
> +
> +static void prf_fill_header(void **buffer)
> +{
> +	struct llvm_prf_header *header = *(struct llvm_prf_header **)buffer;
> +
> +#ifdef CONFIG_64BIT
> +	header->magic = LLVM_INSTR_PROF_RAW_MAGIC_64;
> +#else
> +	header->magic = LLVM_INSTR_PROF_RAW_MAGIC_32;
> +#endif
> +	header->version = LLVM_VARIANT_MASK_IR_PROF | LLVM_INSTR_PROF_RAW_VERSION;
> +	header->data_size = prf_data_count();
> +	header->padding_bytes_before_counters = 0;
> +	header->counters_size = prf_cnts_count();
> +	header->padding_bytes_after_counters = 0;
> +	header->names_size = prf_names_count();
> +	header->counters_delta = (u64)__llvm_prf_cnts_start;
> +	header->names_delta = (u64)__llvm_prf_names_start;
> +	header->value_kind_last = LLVM_INSTR_PROF_IPVK_LAST;
> +
> +	*buffer += sizeof(*header);
> +}
> +
> +/*
> + * Copy the source into the buffer, incrementing the pointer into buffer in the
> + * process.
> + */
> +static void prf_copy_to_buffer(void **buffer, void *src, unsigned long size)
> +{
> +	memcpy(*buffer, src, size);
> +	*buffer += size;
> +}
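
The cursor convention shared by prf_fill_header() and prf_copy_to_buffer()
(write at *buffer, then advance *buffer past what was written) can be
sketched in userspace roughly as follows. This is only an illustration under
assumptions: "struct demo_header", copy_to_cursor() and demo_serialize() are
hypothetical names that do not appear in the patch, and the sketch uses
"unsigned char **" because arithmetic on "void *" is a GNU extension.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for a fixed-size header; not the patch's struct. */
struct demo_header {
	uint64_t magic;
	uint64_t payload_size;
};

/* Copy size bytes to *cursor, then advance the cursor past them. */
static void copy_to_cursor(unsigned char **cursor, const void *src, size_t size)
{
	assert(cursor != NULL && *cursor != NULL);
	memcpy(*cursor, src, size);
	*cursor += size;
}

/*
 * Write a header followed by a payload into out and return the number of
 * bytes written. The caller must supply a buffer of at least
 * sizeof(struct demo_header) + payload_size bytes.
 */
static size_t demo_serialize(unsigned char *out, const void *payload,
			     size_t payload_size)
{
	unsigned char *cursor = out;
	struct demo_header header = {
		.magic = 0x1234,	/* arbitrary demo value */
		.payload_size = payload_size,
	};

	copy_to_cursor(&cursor, &header, sizeof(header));
	copy_to_cursor(&cursor, payload, payload_size);

	return (size_t)(cursor - out);
}
```

Because every writer advances the same cursor, the final cursor position
minus the start of the buffer is the number of bytes emitted, which is a
convenient thing to compare against a precomputed buffer size.
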
> +
> +static u32 __prf_get_value_size(struct llvm_prf_data *p, u32 *value_kinds)
> +{
> +	struct llvm_prf_value_node **nodes =
> +		(struct llvm_prf_value_node **)p->values;
> +	u32 kinds = 0;
> +	u32 size = 0;
> +	unsigned int kind;
> +	unsigned int n;
> +	unsigned int s = 0;
> +
> +	for (kind = 0; kind < ARRAY_SIZE(p->num_value_sites); kind++) {
> +		unsigned int sites = p->num_value_sites[kind];
> +
> +		if (!sites)
> +			continue;
> +
> +		/* Record + site count array */
> +		size += prf_get_value_record_size(sites);
> +		kinds++;
> +
> +		if (!nodes)
> +			continue;
> +
> +		for (n = 0; n < sites; n++) {
> +			u32 count = 0;
> +			struct llvm_prf_value_node *site = nodes[s + n];
> +
> +			while (site && ++count <= U8_MAX)
> +				site = site->next;
> +
> +			size += count *
> +				sizeof(struct llvm_prf_value_node_data);
> +		}
> +
> +		s += sites;
> +	}
> +
> +	if (size)
> +		size += sizeof(struct llvm_prf_value_data);
> +
> +	if (value_kinds)
> +		*value_kinds = kinds;
> +
> +	return size;
> +}
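
The per-site loop above walks a linked list and stops once the count passes
U8_MAX, presumably because the serialized site count is a single byte. A
standalone sketch of that loop shows what "count" holds on exit. The names
"struct demo_node", count_capped() and make_chain() are hypothetical and not
part of the patch. One thing the sketch makes visible: a chain longer than
U8_MAX leaves the count at U8_MAX + 1 rather than U8_MAX. Whether the
instrumentation can ever build a chain that long is not shown in this part
of the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_U8_MAX 255u	/* same value as the kernel's U8_MAX */

/* Hypothetical singly linked node; only the link matters here. */
struct demo_node {
	struct demo_node *next;
};

/* Same loop shape as in __prf_get_value_size(). */
static uint32_t count_capped(const struct demo_node *site)
{
	uint32_t count = 0;

	while (site && ++count <= DEMO_U8_MAX)
		site = site->next;

	return count;
}

/* Link the first n elements of nodes into a chain and return its head. */
static struct demo_node *make_chain(struct demo_node *nodes, size_t n)
{
	size_t i;

	assert(nodes != NULL);
	for (i = 0; i < n; i++)
		nodes[i].next = (i + 1 < n) ? &nodes[i + 1] : NULL;

	return n ? &nodes[0] : NULL;
}
```

For a chain of 3 nodes count_capped() returns 3 and for exactly 255 nodes it
returns 255, but for 300 nodes it returns 256, which would truncate to 0 if
stored in a u8.
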
> +
> +static u32 prf_get_value_size(void)
> +{
> +	u32 size = 0;
> +	struct llvm_prf_data *p;
> +
> +	for (p = __llvm_prf_data_start; p < __llvm_prf_data_end; p++)
> +		size += __prf_get_value_size(p, NULL);
> +
> +	return size;
> +}
> +
> +/* Serialize the value profiling data of one llvm_prf_data entry. */
> +static void prf_serialize_value(struct llvm_prf_data *p, void **buffer)
> +{
> +	struct llvm_prf_value_data header;
> +	struct llvm_prf_value_node **nodes =
> +		(struct llvm_prf_value_node **)p->values;
> +	unsigned int kind;
> +	unsigned int n;
> +	unsigned int s = 0;
> +
> +	header.total_size = __prf_get_value_size(p, &header.num_value_kinds);
> +
> +	if (!header.num_value_kinds)
> +		/* Nothing to write. */
> +		return;
> +
> +	prf_copy_to_buffer(buffer, &header, sizeof(header));
> +
> +	for (kind = 0; kind < ARRAY_SIZE(p->num_value_sites); kind++) {
> +		struct llvm_prf_value_record *record;
> +		u8 *counts;
> +		unsigned int sites = p->num_value_sites[kind];
> +
> +		if (!sites)
> +			continue;
> +
> +		/* Profiling value record. */
> +		record = *(struct llvm_prf_value_record **)buffer;
> +		*buffer += prf_get_value_record_header_size();
> +
> +		record->kind = kind;
> +		record->num_value_sites = sites;
> +
> +		/* Site count array. */
> +		counts = *(u8 **)buffer;
> +		*buffer += prf_get_value_record_site_count_size(sites);
> +
> +		/*
> +		 * If we don't have nodes, we can skip updating the site count
> +		 * array, because the buffer is zero filled.
> +		 */
> +		if (!nodes)
> +			continue;
> +
> +		for (n = 0; n < sites; n++) {
> +			u32 count = 0;
> +			struct llvm_prf_value_node *site = nodes[s + n];
> +
> +			while (site && ++count <= U8_MAX) {
> +				prf_copy_to_buffer(buffer, site,
> +						   sizeof(struct llvm_prf_value_node_data));
> +				site = site->next;
> +			}
> +
> +			counts[n] = (u8)count;
> +		}
> +
> +		s += sites;
> +	}
> +}
> +
> +static void prf_serialize_values(void **buffer)
> +{
> +	struct llvm_prf_data *p;
> +
> +	for (p = __llvm_prf_data_start; p < __llvm_prf_data_end; p++)
> +		prf_serialize_value(p, buffer);
> +}
> +
> +static inline unsigned long prf_get_padding(unsigned long size)
> +{
> +	return 7 & (sizeof(u64) - size % sizeof(u64));
> +}
> +
> +static unsigned long prf_buffer_size(void)
> +{
> +	return sizeof(struct llvm_prf_header) +
> +			prf_data_size()	+
> +			prf_cnts_size() +
> +			prf_names_size() +
> +			prf_get_padding(prf_names_size()) +
> +			prf_get_value_size();
> +}
> +
> +/*
> + * Serialize the profiling data into a format LLVM's tools can understand.
> + * Note: caller *must* hold pgo_lock.
> + */
> +static int prf_serialize(struct prf_private_data *p)
> +{
> +	int err = 0;
> +	void *buffer;
> +
> +	p->size = prf_buffer_size();
> +	p->buffer = vzalloc(p->size);
> +
> +	if (!p->buffer) {
> +		err = -ENOMEM;
> +		goto out;
> +	}
> +
> +	buffer = p->buffer;
> +
> +	prf_fill_header(&buffer);
> +	prf_copy_to_buffer(&buffer, __llvm_prf_data_start,  prf_data_size());
> +	prf_copy_to_buffer(&buffer, __llvm_prf_cnts_start,  prf_cnts_size());
> +	prf_copy_to_buffer(&buffer, __llvm_prf_names_start, prf_names_size());
> +	buffer += prf_get_padding(prf_names_size());
> +
> +	prf_serialize_values(&buffer);
> +
> +out:
> +	return err;
> +}
> +
> +/* open() implementation for PGO. Creates a copy of the profiling data set. */
> +static int prf_open(struct inode *inode, struct file *file)
> +{
> +	struct prf_private_data *data;
> +	unsigned long flags;
> +	int err;
> +
> +	data = kzalloc(sizeof(*data), GFP_KERNEL);
> +	if (!data) {
> +		err = -ENOMEM;
> +		goto out;
> +	}
> +
> +	flags = prf_lock();
> +
> +	err = prf_serialize(data);
> +	if (unlikely(err)) {
> +		kfree(data);
> +		goto out_unlock;
> +	}
> +
> +	file->private_data = data;
> +
> +out_unlock:
> +	prf_unlock(flags);
> +out:
> +	return err;
> +}
> +
> +/* read() implementation for PGO. */
> +static ssize_t prf_read(struct file *file, char __user *buf, size_t count,
> +			loff_t *ppos)
> +{
> +	struct prf_private_data *data = file->private_data;
> +
> +	BUG_ON(!data);
> +
> +	return simple_read_from_buffer(buf, count, ppos, data->buffer,
> +				       data->size);
> +}
> +
> +/* release() implementation for PGO. Release resources allocated by open(). */
> +static int prf_release(struct inode *inode, struct file *file)
> +{
> +	struct prf_private_data *data = file->private_data;
> +
> +	if (data) {
> +		vfree(data->buffer);
> +		kfree(data);
> +	}
> +
> +	return 0;
> +}
> +
> +static const struct file_operations prf_fops = {
> +	.owner		= THIS_MODULE,
> +	.open		= prf_open,
> +	.read		= prf_read,
> +	.llseek		= default_llseek,
> +	.release	= prf_release
> +};
> +
> +/* write() implementation for resetting PGO's profile data. */
> +static ssize_t reset_write(struct file *file, const char __user *addr,
> +			   size_t len, loff_t *pos)
> +{
> +	struct llvm_prf_data *data;
> +
> +	memset(__llvm_prf_cnts_start, 0, prf_cnts_size());
> +
> +	for (data = __llvm_prf_data_start; data < __llvm_prf_data_end; data++) {
> +		struct llvm_prf_value_node **vnodes;
> +		u64 current_vsite_count;
> +		u32 i;
> +
> +		if (!data->values)
> +			continue;
> +
> +		current_vsite_count = 0;
> +		vnodes = (struct llvm_prf_value_node **)data->values;
> +
> +		for (i = LLVM_INSTR_PROF_IPVK_FIRST; i <= LLVM_INSTR_PROF_IPVK_LAST; i++)
> +			current_vsite_count += data->num_value_sites[i];
> +
> +		for (i = 0; i < current_vsite_count; i++) {
> +			struct llvm_prf_value_node *current_vnode = vnodes[i];
> +
> +			while (current_vnode) {
> +				current_vnode->count = 0;
> +				current_vnode = current_vnode->next;
> +			}
> +		}
> +	}
> +
> +	return len;
> +}
> +
> +static const struct file_operations prf_reset_fops = {
> +	.owner		= THIS_MODULE,
> +	.write		= reset_write,
> +	.llseek		= noop_llseek,
> +};
> +
> +/* Create debugfs entries. */
> +static int __init pgo_init(void)
> +{
> +	directory = debugfs_create_dir("pgo", NULL);
> +	if (!directory)
> +		goto err_remove;
> +
> +	if (!debugfs_create_file("profraw", 0600, directory, NULL,
> +				 &prf_fops))
> +		goto err_remove;
> +
> +	if (!debugfs_create_file("reset", 0200, directory, NULL,
> +				 &prf_reset_fops))
> +		goto err_remove;
> +
> +	return 0;
> +
> +err_remove:
> +	pr_err("initialization failed\n");
> +	return -EIO;
> +}
> +
> +/* Remove debugfs entries. */
> +static void __exit pgo_exit(void)
> +{
> +	debugfs_remove_recursive(directory);
> +}
> +
> +module_init(pgo_init);
> +module_exit(pgo_exit);
> diff --git a/kernel/pgo/instrument.c b/kernel/pgo/instrument.c
> new file mode 100644
> index 000000000000..464b3bc77431
> --- /dev/null
> +++ b/kernel/pgo/instrument.c
> @@ -0,0 +1,189 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Copyright (C) 2019 Google, Inc.
> + *
> + * Author:
> + *	Sami Tolvanen <samitolvanen@google.com>
> + *
> + * This software is licensed under the terms of the GNU General Public
> + * License version 2, as published by the Free Software Foundation, and
> + * may be copied, distributed, and modified under those terms.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + * GNU General Public License for more details.
> + *
> + */
> +
> +#define pr_fmt(fmt)	"pgo: " fmt
> +
> +#include <linux/bitops.h>
> +#include <linux/kernel.h>
> +#include <linux/export.h>
> +#include <linux/spinlock.h>
> +#include <linux/types.h>
> +#include "pgo.h"
> +
> +/*
> + * This lock guards both profile count updating and serialization of the
> + * profiling data. Keeping both of these activities separate via locking
> + * ensures that we don't try to serialize data that's only partially updated.
> + */
> +static DEFINE_SPINLOCK(pgo_lock);
> +static int current_node;
> +
> +unsigned long prf_lock(void)
> +{
> +	unsigned long flags;
> +
> +	spin_lock_irqsave(&pgo_lock, flags);
> +
> +	return flags;
> +}
> +
> +void prf_unlock(unsigned long flags)
> +{
> +	spin_unlock_irqrestore(&pgo_lock, flags);
> +}
> +
> +/*
> + * Return a newly allocated profiling value node, which holds a value tracked
> + * by the value profiler.
> + * Note: caller *must* hold pgo_lock.
> + */
> +static struct llvm_prf_value_node *allocate_node(struct llvm_prf_data *p,
> +						 u32 index, u64 value)
> +{
> +	if (&__llvm_prf_vnds_start[current_node + 1] >= __llvm_prf_vnds_end)
> +		return NULL; /* Out of nodes */
> +
> +	current_node++;
> +
> +	/* Make sure the node is entirely within the section */
> +	if (&__llvm_prf_vnds_start[current_node] >= __llvm_prf_vnds_end ||
> +	    &__llvm_prf_vnds_start[current_node + 1] > __llvm_prf_vnds_end)
> +		return NULL;
> +
> +	return &__llvm_prf_vnds_start[current_node];
> +}
> +
> +/*
> + * Counts the number of times a target value is seen.
> + *
> + * Records the target value for the index if not seen before. Otherwise,
> + * increments the counter associated w/ the target value.
> + */
> +void __llvm_profile_instrument_target(u64 target_value, void *data, u32 index);
> +void __llvm_profile_instrument_target(u64 target_value, void *data, u32 index)
> +{
> +	struct llvm_prf_data *p = (struct llvm_prf_data *)data;
> +	struct llvm_prf_value_node **counters;
> +	struct llvm_prf_value_node *curr;
> +	struct llvm_prf_value_node *min = NULL;
> +	struct llvm_prf_value_node *prev = NULL;
> +	u64 min_count = U64_MAX;
> +	u8 values = 0;
> +	unsigned long flags;
> +
> +	if (!p || !p->values)
> +		return;
> +
> +	counters = (struct llvm_prf_value_node **)p->values;
> +	curr = counters[index];
> +
> +	while (curr) {
> +		if (target_value == curr->value) {
> +			curr->count++;
> +			return;
> +		}
> +
> +		if (curr->count < min_count) {
> +			min_count = curr->count;
> +			min = curr;
> +		}
> +
> +		prev = curr;
> +		curr = curr->next;
> +		values++;
> +	}
> +
> +	if (values >= LLVM_INSTR_PROF_MAX_NUM_VAL_PER_SITE) {
> +		if (!min->count || !(--min->count)) {
> +			curr = min;
> +			curr->value = target_value;
> +			curr->count++;
> +		}
> +		return;
> +	}
> +
> +	/* Lock when updating the value node structure. */
> +	flags = prf_lock();
> +
> +	curr = allocate_node(p, index, target_value);
> +	if (!curr)
> +		goto out;
> +
> +	curr->value = target_value;
> +	curr->count++;
> +
> +	if (!counters[index])
> +		counters[index] = curr;
> +	else if (prev && !prev->next)
> +		prev->next = curr;
> +
> +out:
> +	prf_unlock(flags);
> +}
> +EXPORT_SYMBOL(__llvm_profile_instrument_target);
> +
> +/* Counts the number of times a range of target values is seen. */
> +void __llvm_profile_instrument_range(u64 target_value, void *data,
> +				     u32 index, s64 precise_start,
> +				     s64 precise_last, s64 large_value);
> +void __llvm_profile_instrument_range(u64 target_value, void *data,
> +				     u32 index, s64 precise_start,
> +				     s64 precise_last, s64 large_value)
> +{
> +	if (large_value != S64_MIN && (s64)target_value >= large_value)
> +		target_value = large_value;
> +	else if ((s64)target_value < precise_start ||
> +		 (s64)target_value > precise_last)
> +		target_value = precise_last + 1;
> +
> +	__llvm_profile_instrument_target(target_value, data, index);
> +}
> +EXPORT_SYMBOL(__llvm_profile_instrument_range);
> +
> +static u64 inst_prof_get_range_rep_value(u64 value)
> +{
> +	if (value <= 8)
> +		/* The first ranges are individually tracked, use it as is. */
> +		return value;
> +	else if (value >= 513)
> +		/* The last range is mapped to its lowest value. */
> +		return 513;
> +	else if (hweight64(value) == 1)
> +		/* If it's a power of two, use it as is. */
> +		return value;
> +
> +	/* Otherwise, take the previous power of two + 1. */
> +	return ((u64)1 << (64 - __builtin_clzll(value) - 1)) + 1;
> +}
> +
> +/*
> + * The target values are partitioned into multiple ranges. The range spec is
> + * defined in compiler-rt/include/profile/InstrProfData.inc.
> + */
> +void __llvm_profile_instrument_memop(u64 target_value, void *data,
> +				     u32 counter_index);
> +void __llvm_profile_instrument_memop(u64 target_value, void *data,
> +				     u32 counter_index)
> +{
> +	u64 rep_value;
> +
> +	/* Map the target value to the representative value of its range. */
> +	rep_value = inst_prof_get_range_rep_value(target_value);
> +	__llvm_profile_instrument_target(rep_value, data, counter_index);
> +}
> +EXPORT_SYMBOL(__llvm_profile_instrument_memop);
> diff --git a/kernel/pgo/pgo.h b/kernel/pgo/pgo.h
> new file mode 100644
> index 000000000000..ddc8d3002fe5
> --- /dev/null
> +++ b/kernel/pgo/pgo.h
> @@ -0,0 +1,203 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +/*
> + * Copyright (C) 2019 Google, Inc.
> + *
> + * Author:
> + *	Sami Tolvanen <samitolvanen@google.com>
> + *
> + * This software is licensed under the terms of the GNU General Public
> + * License version 2, as published by the Free Software Foundation, and
> + * may be copied, distributed, and modified under those terms.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + * GNU General Public License for more details.
> + *
> + */
> +
> +#ifndef _PGO_H
> +#define _PGO_H
> +
> +/*
> + * Note: These internal LLVM definitions must match the compiler version.
> + * See llvm/include/llvm/ProfileData/InstrProfData.inc in LLVM's source code.
> + */
> +
> +#define LLVM_INSTR_PROF_RAW_MAGIC_64	\
> +		((u64)255 << 56 |	\
> +		 (u64)'l' << 48 |	\
> +		 (u64)'p' << 40 |	\
> +		 (u64)'r' << 32 |	\
> +		 (u64)'o' << 24 |	\
> +		 (u64)'f' << 16 |	\
> +		 (u64)'r' << 8  |	\
> +		 (u64)129)
> +#define LLVM_INSTR_PROF_RAW_MAGIC_32	\
> +		((u64)255 << 56 |	\
> +		 (u64)'l' << 48 |	\
> +		 (u64)'p' << 40 |	\
> +		 (u64)'r' << 32 |	\
> +		 (u64)'o' << 24 |	\
> +		 (u64)'f' << 16 |	\
> +		 (u64)'R' << 8  |	\
> +		 (u64)129)
> +
> +#define LLVM_INSTR_PROF_RAW_VERSION		5
> +#define LLVM_INSTR_PROF_DATA_ALIGNMENT		8
> +#define LLVM_INSTR_PROF_IPVK_FIRST		0
> +#define LLVM_INSTR_PROF_IPVK_LAST		1
> +#define LLVM_INSTR_PROF_MAX_NUM_VAL_PER_SITE	255
> +
> +#define LLVM_VARIANT_MASK_IR_PROF	(0x1ULL << 56)
> +#define LLVM_VARIANT_MASK_CSIR_PROF	(0x1ULL << 57)
> +
> +/**
> + * struct llvm_prf_header - represents the raw profile header data structure.
> + * @magic: the magic token for the file format.
> + * @version: the version of the file format.
> + * @data_size: the number of entries in the profile data section.
> + * @padding_bytes_before_counters: the number of padding bytes before the
> + *   counters.
> + * @counters_size: the size in bytes of the LLVM profile section containing the
> + *   counters.
> + * @padding_bytes_after_counters: the number of padding bytes after the
> + *   counters.
> + * @names_size: the size in bytes of the LLVM profile section containing the
> + *   counters' names.
> + * @counters_delta: the beginning of the LLVM profile counters section.
> + * @names_delta: the beginning of the LLVM profile names section.
> + * @value_kind_last: the last profile value kind.
> + */
> +struct llvm_prf_header {
> +	u64 magic;
> +	u64 version;
> +	u64 data_size;
> +	u64 padding_bytes_before_counters;
> +	u64 counters_size;
> +	u64 padding_bytes_after_counters;
> +	u64 names_size;
> +	u64 counters_delta;
> +	u64 names_delta;
> +	u64 value_kind_last;
> +};
> +
> +/**
> + * struct llvm_prf_data - represents the per-function control structure.
> + * @name_ref: the reference to the function's name.
> + * @func_hash: the hash value of the function.
> + * @counter_ptr: a pointer to the profile counter.
> + * @function_ptr: a pointer to the function.
> + * @values: the profiling values associated with this function.
> + * @num_counters: the number of counters in the function.
> + * @num_value_sites: the number of value profile sites.
> + */
> +struct llvm_prf_data {
> +	const u64 name_ref;
> +	const u64 func_hash;
> +	const void *counter_ptr;
> +	const void *function_ptr;
> +	void *values;
> +	const u32 num_counters;
> +	const u16 num_value_sites[LLVM_INSTR_PROF_IPVK_LAST + 1];
> +} __aligned(LLVM_INSTR_PROF_DATA_ALIGNMENT);
> +
> +/**
> + * struct llvm_prf_value_node_data - represents the data part of the struct
> + *   llvm_prf_value_node data structure.
> + * @value: the value counters.
> + * @count: the counters' count.
> + */
> +struct llvm_prf_value_node_data {
> +	u64 value;
> +	u64 count;
> +};
> +
> +/**
> + * struct llvm_prf_value_node - represents an internal data structure used by
> + *   the value profiler.
> + * @value: the value counters.
> + * @count: the counters' count.
> + * @next: the next value node.
> + */
> +struct llvm_prf_value_node {
> +	u64 value;
> +	u64 count;
> +	struct llvm_prf_value_node *next;
> +};
> +
> +/**
> + * struct llvm_prf_value_data - represents the value profiling data in indexed
> + *   format.
> + * @total_size: the total size in bytes including this field.
> + * @num_value_kinds: the number of value profile kinds that have value profile
> + *   data.
> + */
> +struct llvm_prf_value_data {
> +	u32 total_size;
> +	u32 num_value_kinds;
> +};
> +
> +/**
> + * struct llvm_prf_value_record - represents the on-disk layout of the value
> + *   profile data of a particular kind for one function.
> + * @kind: the kind of the value profile record.
> + * @num_value_sites: the number of value profile sites.
> + * @site_count_array: the first element of the array that stores the number
> + *   of profiled values for each value site.
> + */
> +struct llvm_prf_value_record {
> +	u32 kind;
> +	u32 num_value_sites;
> +	u8 site_count_array[];
> +};
> +
> +#define prf_get_value_record_header_size()		\
> +	offsetof(struct llvm_prf_value_record, site_count_array)
> +#define prf_get_value_record_site_count_size(sites)	\
> +	roundup((sites), 8)
> +#define prf_get_value_record_size(sites)		\
> +	(prf_get_value_record_header_size() +		\
> +	 prf_get_value_record_site_count_size((sites)))
> +
> +/* Data sections */
> +extern struct llvm_prf_data __llvm_prf_data_start[];
> +extern struct llvm_prf_data __llvm_prf_data_end[];
> +
> +extern u64 __llvm_prf_cnts_start[];
> +extern u64 __llvm_prf_cnts_end[];
> +
> +extern char __llvm_prf_names_start[];
> +extern char __llvm_prf_names_end[];
> +
> +extern struct llvm_prf_value_node __llvm_prf_vnds_start[];
> +extern struct llvm_prf_value_node __llvm_prf_vnds_end[];
> +
> +/* Locking for vnodes */
> +extern unsigned long prf_lock(void);
> +extern void prf_unlock(unsigned long flags);
> +
> +#define __DEFINE_PRF_SIZE(s) \
> +	static inline unsigned long prf_ ## s ## _size(void)		\
> +	{								\
> +		unsigned long start =					\
> +			(unsigned long)__llvm_prf_ ## s ## _start;	\
> +		unsigned long end =					\
> +			(unsigned long)__llvm_prf_ ## s ## _end;	\
> +		return roundup(end - start,				\
> +				sizeof(__llvm_prf_ ## s ## _start[0]));	\
> +	}								\
> +	static inline unsigned long prf_ ## s ## _count(void)		\
> +	{								\
> +		return prf_ ## s ## _size() /				\
> +			sizeof(__llvm_prf_ ## s ## _start[0]);		\
> +	}
> +
> +__DEFINE_PRF_SIZE(data);
> +__DEFINE_PRF_SIZE(cnts);
> +__DEFINE_PRF_SIZE(names);
> +__DEFINE_PRF_SIZE(vnds);
> +
> +#undef __DEFINE_PRF_SIZE
> +
> +#endif /* _PGO_H */
> diff --git a/scripts/Makefile.lib b/scripts/Makefile.lib
> index 8cd67b1b6d15..d411e92dd0d6 100644
> --- a/scripts/Makefile.lib
> +++ b/scripts/Makefile.lib
> @@ -139,6 +139,16 @@ _c_flags += $(if $(patsubst n%,, \
>  		$(CFLAGS_GCOV))
>  endif
>  
> +#
> +# Enable clang's PGO profiling flags for a file or directory depending on
> +# variables PGO_PROFILE_obj.o and PGO_PROFILE.
> +#
> +ifeq ($(CONFIG_PGO_CLANG),y)
> +_c_flags += $(if $(patsubst n%,, \
> +		$(PGO_PROFILE_$(basetarget).o)$(PGO_PROFILE)y), \
> +		$(CFLAGS_PGO_CLANG))
> +endif
> +
>  #
>  # Enable address sanitizer flags for kernel except some files or directories
>  # we don't want to check (depends on variables KASAN_SANITIZE_obj.o, KASAN_SANITIZE)
> -- 
> 2.31.0.208.g409f899ff0-goog
>
Bill Wendling April 7, 2021, 9:58 p.m. UTC | #4
On Wed, Apr 7, 2021 at 2:47 PM Nathan Chancellor <nathan@kernel.org> wrote:
>
> Hi Bill,
>
> On Wed, Apr 07, 2021 at 02:17:04PM -0700, Bill Wendling wrote:
> > From: Sami Tolvanen <samitolvanen@google.com>
> >
> > Enable the use of clang's Profile-Guided Optimization[1]. To generate a
> > profile, the kernel is instrumented with PGO counters, a representative
> > workload is run, and the raw profile data is collected from
> > /sys/kernel/debug/pgo/profraw.
> >
> > The raw profile data must be processed by clang's "llvm-profdata" tool
> > before it can be used during recompilation:
> >
> >   $ cp /sys/kernel/debug/pgo/profraw vmlinux.profraw
> >   $ llvm-profdata merge --output=vmlinux.profdata vmlinux.profraw
> >
> > Multiple raw profiles may be merged during this step.
> >
> > The data can now be used by the compiler:
> >
> >   $ make LLVM=1 KCFLAGS=-fprofile-use=vmlinux.profdata ...
> >
> > This initial submission is restricted to x86, as that's the platform we
> > know works. This restriction can be lifted once other platforms have
> > been verified to work with PGO.
> >
> > Note that this method of profiling the kernel is clang-native, unlike
> > the clang support in kernel/gcov.
> >
> > [1] https://clang.llvm.org/docs/UsersManual.html#profile-guided-optimization
> >
> > Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
> > Co-developed-by: Bill Wendling <morbo@google.com>
> > Signed-off-by: Bill Wendling <morbo@google.com>
> > Tested-by: Nick Desaulniers <ndesaulniers@google.com>
> > Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
> > Reviewed-by: Fangrui Song <maskray@google.com>
>
> Few small nits below, not sure they warrant a v10 versus just some
> follow up patches, up to you. Regardless:
>
> Reviewed-by: Nathan Chancellor <nathan@kernel.org>
>
> > ---
> > v9: - [maskray] Remove explicit 'ALIGN' and 'KEEP' from PGO variables in
> >       vmlinux.lds.h.
> > v8: - Rebased on top-of-tree.
> > v7: - [sedat.dilek] Fix minor build failure.
> > v6: - Add better documentation about the locking scheme and other things.
> >     - Rename macros to better match the same macros in LLVM's source code.
> > v5: - [natechancellor] Correct padding calculation.
> > v4: - [ndesaulniers] Remove non-x86 Makefile changes and use "hweight64" instead
> >       of using our own popcount implementation.
> > v3: - [sedat.dilek] Added change log section.
> > v2: - [natechancellor] Added "__llvm_profile_instrument_memop".
> >     - [maskray] Corrected documentation, re PGO flags when using LTO.
> > ---
> >  Documentation/dev-tools/index.rst     |   1 +
> >  Documentation/dev-tools/pgo.rst       | 127 +++++++++
> >  MAINTAINERS                           |   9 +
> >  Makefile                              |   3 +
> >  arch/Kconfig                          |   1 +
> >  arch/x86/Kconfig                      |   1 +
> >  arch/x86/boot/Makefile                |   1 +
> >  arch/x86/boot/compressed/Makefile     |   1 +
> >  arch/x86/crypto/Makefile              |   4 +
> >  arch/x86/entry/vdso/Makefile          |   1 +
> >  arch/x86/kernel/vmlinux.lds.S         |   2 +
> >  arch/x86/platform/efi/Makefile        |   1 +
> >  arch/x86/purgatory/Makefile           |   1 +
> >  arch/x86/realmode/rm/Makefile         |   1 +
> >  arch/x86/um/vdso/Makefile             |   1 +
> >  drivers/firmware/efi/libstub/Makefile |   1 +
> >  include/asm-generic/vmlinux.lds.h     |  34 +++
> >  kernel/Makefile                       |   1 +
> >  kernel/pgo/Kconfig                    |  35 +++
> >  kernel/pgo/Makefile                   |   5 +
> >  kernel/pgo/fs.c                       | 389 ++++++++++++++++++++++++++
> >  kernel/pgo/instrument.c               | 189 +++++++++++++
> >  kernel/pgo/pgo.h                      | 203 ++++++++++++++
> >  scripts/Makefile.lib                  |  10 +
> >  24 files changed, 1022 insertions(+)
> >  create mode 100644 Documentation/dev-tools/pgo.rst
> >  create mode 100644 kernel/pgo/Kconfig
> >  create mode 100644 kernel/pgo/Makefile
> >  create mode 100644 kernel/pgo/fs.c
> >  create mode 100644 kernel/pgo/instrument.c
> >  create mode 100644 kernel/pgo/pgo.h
> >
> > diff --git a/Documentation/dev-tools/index.rst b/Documentation/dev-tools/index.rst
> > index 1b1cf4f5c9d9..6a30cd98e6f9 100644
> > --- a/Documentation/dev-tools/index.rst
> > +++ b/Documentation/dev-tools/index.rst
> > @@ -27,6 +27,7 @@ whole; patches welcome!
> >     kgdb
> >     kselftest
> >     kunit/index
> > +   pgo
> >
> >
> >  .. only::  subproject and html
> > diff --git a/Documentation/dev-tools/pgo.rst b/Documentation/dev-tools/pgo.rst
> > new file mode 100644
> > index 000000000000..b7f11d8405b7
> > --- /dev/null
> > +++ b/Documentation/dev-tools/pgo.rst
> > @@ -0,0 +1,127 @@
> > +.. SPDX-License-Identifier: GPL-2.0
> > +
> > +===============================
> > +Using PGO with the Linux kernel
> > +===============================
> > +
> > +Clang's profiling kernel support (PGO_) enables profiling of the Linux kernel
> > +when building with Clang. The profiling data is exported via the ``pgo``
> > +debugfs directory.
> > +
> > +.. _PGO: https://clang.llvm.org/docs/UsersManual.html#profile-guided-optimization
> > +
> > +
> > +Preparation
> > +===========
> > +
> > +Configure the kernel with:
> > +
> > +.. code-block:: make
> > +
> > +   CONFIG_DEBUG_FS=y
> > +   CONFIG_PGO_CLANG=y
> > +
> > +Note that kernels compiled with profiling flags will be significantly larger
> > +and run slower.
> > +
> > +Profiling data will only become accessible once debugfs has been mounted:
> > +
> > +.. code-block:: sh
> > +
> > +   mount -t debugfs none /sys/kernel/debug
> > +
> > +
> > +Customization
> > +=============
> > +
> > > +You can enable or disable profiling for individual files and directories by
> > +adding a line similar to the following to the respective kernel Makefile:
> > +
> > +- For a single file (e.g. main.o)
> > +
> > +  .. code-block:: make
> > +
> > +     PGO_PROFILE_main.o := y
> > +
> > +- For all files in one directory
> > +
> > +  .. code-block:: make
> > +
> > +     PGO_PROFILE := y
> > +
> > +To exclude files from being profiled use
> > +
> > +  .. code-block:: make
> > +
> > +     PGO_PROFILE_main.o := n
> > +
> > +and
> > +
> > +  .. code-block:: make
> > +
> > +     PGO_PROFILE := n
> > +
> > +Only files which are linked to the main kernel image or are compiled as kernel
> > +modules are supported by this mechanism.
> > +
> > +
> > +Files
> > +=====
> > +
> > +The PGO kernel support creates the following files in debugfs:
> > +
> > +``/sys/kernel/debug/pgo``
> > +     Parent directory for all PGO-related files.
> > +
> > +``/sys/kernel/debug/pgo/reset``
> > > +     Global reset file: resets all profile data to zero when written to.
> > +
> > > +``/sys/kernel/debug/pgo/profraw``
> > > +     The raw PGO data that must be processed with ``llvm-profdata``.
> > +
> > +
> > +Workflow
> > +========
> > +
> > > +The PGO kernel can be run on the host or test machines. The data, though,
> > > +should be analyzed with the tools from the same Clang version that compiled
> > > +the kernel. Clang is tolerant of version skew, but using the same version is
> > > +simpler.
> > +
> > +The profiling data is useful for optimizing the kernel, analyzing coverage,
> > +etc. Clang offers tools to perform these tasks.
> > +
> > +Here is an example workflow for profiling an instrumented kernel with PGO and
> > +using the result to optimize the kernel:
> > +
> > +1) Install the kernel on the TEST machine.
> > +
> > +2) Reset the data counters right before running the load tests
> > +
> > +   .. code-block:: sh
> > +
> > +      $ echo 1 > /sys/kernel/debug/pgo/reset
> > +
> > +3) Run the load tests.
> > +
> > +4) Collect the raw profile data
> > +
> > +   .. code-block:: sh
> > +
> > +      $ cp -a /sys/kernel/debug/pgo/profraw /tmp/vmlinux.profraw
> > +
> > +5) (Optional) Download the raw profile data to the HOST machine.
> > +
> > +6) Process the raw profile data
> > +
> > +   .. code-block:: sh
> > +
> > +      $ llvm-profdata merge --output=vmlinux.profdata vmlinux.profraw
> > +
> > +   Note that multiple raw profile data files can be merged during this step.
> > +
> > +7) Rebuild the kernel using the profile data (PGO disabled)
> > +
> > +   .. code-block:: sh
> > +
> > +      $ make LLVM=1 KCFLAGS=-fprofile-use=vmlinux.profdata ...
> > diff --git a/MAINTAINERS b/MAINTAINERS
> > index c80ad735b384..742058188af2 100644
> > --- a/MAINTAINERS
> > +++ b/MAINTAINERS
> > @@ -14054,6 +14054,15 @@ S:   Maintained
> >  F:   include/linux/personality.h
> >  F:   include/uapi/linux/personality.h
> >
> > +PGO BASED KERNEL PROFILING
> > +M:   Sami Tolvanen <samitolvanen@google.com>
> > +M:   Bill Wendling <wcw@google.com>
> > +R:   Nathan Chancellor <natechancellor@gmail.com>
>
> This should be updated to my @kernel.org address. I can send a follow-up
> patch if need be.
>
Sorry about that!

> > +R:   Nick Desaulniers <ndesaulniers@google.com>
> > +S:   Supported
> > +F:   Documentation/dev-tools/pgo.rst
> > +F:   kernel/pgo
> > +
> >  PHOENIX RC FLIGHT CONTROLLER ADAPTER
> >  M:   Marcus Folkesson <marcus.folkesson@gmail.com>
> >  L:   linux-input@vger.kernel.org
> > diff --git a/Makefile b/Makefile
> > index cc77fd45ca64..6450faceb137 100644
> > --- a/Makefile
> > +++ b/Makefile
> > @@ -660,6 +660,9 @@ endif # KBUILD_EXTMOD
> >  # Defaults to vmlinux, but the arch makefile usually adds further targets
> >  all: vmlinux
> >
> > +CFLAGS_PGO_CLANG := -fprofile-generate
> > +export CFLAGS_PGO_CLANG
> > +
> >  CFLAGS_GCOV  := -fprofile-arcs -ftest-coverage \
> >       $(call cc-option,-fno-tree-loop-im) \
> >       $(call cc-disable-warning,maybe-uninitialized,)
> > diff --git a/arch/Kconfig b/arch/Kconfig
> > index ecfd3520b676..afd082133e0a 100644
> > --- a/arch/Kconfig
> > +++ b/arch/Kconfig
> > @@ -1191,6 +1191,7 @@ config ARCH_HAS_ELFCORE_COMPAT
> >       bool
> >
> >  source "kernel/gcov/Kconfig"
> > +source "kernel/pgo/Kconfig"
> >
> >  source "scripts/gcc-plugins/Kconfig"
> >
> > diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> > index 2792879d398e..62be93b199ff 100644
> > --- a/arch/x86/Kconfig
> > +++ b/arch/x86/Kconfig
> > @@ -99,6 +99,7 @@ config X86
> >       select ARCH_SUPPORTS_KMAP_LOCAL_FORCE_MAP       if NR_CPUS <= 4096
> >       select ARCH_SUPPORTS_LTO_CLANG          if X86_64
> >       select ARCH_SUPPORTS_LTO_CLANG_THIN     if X86_64
> > +     select ARCH_SUPPORTS_PGO_CLANG          if X86_64
> >       select ARCH_USE_BUILTIN_BSWAP
> >       select ARCH_USE_QUEUED_RWLOCKS
> >       select ARCH_USE_QUEUED_SPINLOCKS
> > diff --git a/arch/x86/boot/Makefile b/arch/x86/boot/Makefile
> > index fe605205b4ce..383853e32f67 100644
> > --- a/arch/x86/boot/Makefile
> > +++ b/arch/x86/boot/Makefile
> > @@ -71,6 +71,7 @@ KBUILD_AFLAGS       := $(KBUILD_CFLAGS) -D__ASSEMBLY__
> >  KBUILD_CFLAGS        += $(call cc-option,-fmacro-prefix-map=$(srctree)/=)
> >  KBUILD_CFLAGS        += -fno-asynchronous-unwind-tables
> >  GCOV_PROFILE := n
> > +PGO_PROFILE := n
> >  UBSAN_SANITIZE := n
> >
> >  $(obj)/bzImage: asflags-y  := $(SVGA_MODE)
> > diff --git a/arch/x86/boot/compressed/Makefile b/arch/x86/boot/compressed/Makefile
> > index e0bc3988c3fa..ed12ab65f606 100644
> > --- a/arch/x86/boot/compressed/Makefile
> > +++ b/arch/x86/boot/compressed/Makefile
> > @@ -54,6 +54,7 @@ CFLAGS_sev-es.o += -I$(objtree)/arch/x86/lib/
> >
> >  KBUILD_AFLAGS  := $(KBUILD_CFLAGS) -D__ASSEMBLY__
> >  GCOV_PROFILE := n
> > +PGO_PROFILE := n
> >  UBSAN_SANITIZE :=n
> >
> >  KBUILD_LDFLAGS := -m elf_$(UTS_MACHINE)
> > diff --git a/arch/x86/crypto/Makefile b/arch/x86/crypto/Makefile
> > index b28e36b7c96b..4b2e9620c412 100644
> > --- a/arch/x86/crypto/Makefile
> > +++ b/arch/x86/crypto/Makefile
> > @@ -4,6 +4,10 @@
> >
> >  OBJECT_FILES_NON_STANDARD := y
> >
> > +# Disable PGO for curve25519-x86_64. With PGO enabled, clang runs out of
> > +# registers for some of the functions.
> > +PGO_PROFILE_curve25519-x86_64.o := n
> > +
> >  obj-$(CONFIG_CRYPTO_TWOFISH_586) += twofish-i586.o
> >  twofish-i586-y := twofish-i586-asm_32.o twofish_glue.o
> >  obj-$(CONFIG_CRYPTO_TWOFISH_X86_64) += twofish-x86_64.o
> > diff --git a/arch/x86/entry/vdso/Makefile b/arch/x86/entry/vdso/Makefile
> > index 05c4abc2fdfd..f7421e44725a 100644
> > --- a/arch/x86/entry/vdso/Makefile
> > +++ b/arch/x86/entry/vdso/Makefile
> > @@ -180,6 +180,7 @@ quiet_cmd_vdso = VDSO    $@
> >  VDSO_LDFLAGS = -shared --hash-style=both --build-id=sha1 \
> >       $(call ld-option, --eh-frame-hdr) -Bsymbolic
> >  GCOV_PROFILE := n
> > +PGO_PROFILE := n
> >
> >  quiet_cmd_vdso_and_check = VDSO    $@
> >        cmd_vdso_and_check = $(cmd_vdso); $(cmd_vdso_check)
> > diff --git a/arch/x86/kernel/vmlinux.lds.S b/arch/x86/kernel/vmlinux.lds.S
> > index efd9e9ea17f2..f6cab2316c46 100644
> > --- a/arch/x86/kernel/vmlinux.lds.S
> > +++ b/arch/x86/kernel/vmlinux.lds.S
> > @@ -184,6 +184,8 @@ SECTIONS
> >
> >       BUG_TABLE
> >
> > +     PGO_CLANG_DATA
> > +
> >       ORC_UNWIND_TABLE
> >
> >       . = ALIGN(PAGE_SIZE);
> > diff --git a/arch/x86/platform/efi/Makefile b/arch/x86/platform/efi/Makefile
> > index 84b09c230cbd..5f22b31446ad 100644
> > --- a/arch/x86/platform/efi/Makefile
> > +++ b/arch/x86/platform/efi/Makefile
> > @@ -2,6 +2,7 @@
> >  OBJECT_FILES_NON_STANDARD_efi_thunk_$(BITS).o := y
> >  KASAN_SANITIZE := n
> >  GCOV_PROFILE := n
> > +PGO_PROFILE := n
> >
> >  obj-$(CONFIG_EFI)            += quirks.o efi.o efi_$(BITS).o efi_stub_$(BITS).o
> >  obj-$(CONFIG_EFI_MIXED)              += efi_thunk_$(BITS).o
> > diff --git a/arch/x86/purgatory/Makefile b/arch/x86/purgatory/Makefile
> > index 95ea17a9d20c..36f20e99da0b 100644
> > --- a/arch/x86/purgatory/Makefile
> > +++ b/arch/x86/purgatory/Makefile
> > @@ -23,6 +23,7 @@ targets += purgatory.ro purgatory.chk
> >
> >  # Sanitizer, etc. runtimes are unavailable and cannot be linked here.
> >  GCOV_PROFILE := n
> > +PGO_PROFILE  := n
> >  KASAN_SANITIZE       := n
> >  UBSAN_SANITIZE       := n
> >  KCSAN_SANITIZE       := n
> > diff --git a/arch/x86/realmode/rm/Makefile b/arch/x86/realmode/rm/Makefile
> > index 83f1b6a56449..21797192f958 100644
> > --- a/arch/x86/realmode/rm/Makefile
> > +++ b/arch/x86/realmode/rm/Makefile
> > @@ -76,4 +76,5 @@ KBUILD_CFLAGS       := $(REALMODE_CFLAGS) -D_SETUP -D_WAKEUP \
> >  KBUILD_AFLAGS        := $(KBUILD_CFLAGS) -D__ASSEMBLY__
> >  KBUILD_CFLAGS        += -fno-asynchronous-unwind-tables
> >  GCOV_PROFILE := n
> > +PGO_PROFILE := n
> >  UBSAN_SANITIZE := n
> > diff --git a/arch/x86/um/vdso/Makefile b/arch/x86/um/vdso/Makefile
> > index 5943387e3f35..54f5768f5853 100644
> > --- a/arch/x86/um/vdso/Makefile
> > +++ b/arch/x86/um/vdso/Makefile
> > @@ -64,6 +64,7 @@ quiet_cmd_vdso = VDSO    $@
> >
> >  VDSO_LDFLAGS = -fPIC -shared -Wl,--hash-style=sysv
> >  GCOV_PROFILE := n
> > +PGO_PROFILE := n
> >
> >  #
> >  # Install the unstripped copy of vdso*.so listed in $(vdso-install-y).
> > diff --git a/drivers/firmware/efi/libstub/Makefile b/drivers/firmware/efi/libstub/Makefile
> > index c23466e05e60..724fb389bb9d 100644
> > --- a/drivers/firmware/efi/libstub/Makefile
> > +++ b/drivers/firmware/efi/libstub/Makefile
> > @@ -42,6 +42,7 @@ KBUILD_CFLAGS := $(filter-out $(CC_FLAGS_SCS), $(KBUILD_CFLAGS))
> >  KBUILD_CFLAGS := $(filter-out $(CC_FLAGS_LTO), $(KBUILD_CFLAGS))
> >
> >  GCOV_PROFILE                 := n
> > +PGO_PROFILE                  := n
> >  # Sanitizer runtimes are unavailable and cannot be linked here.
> >  KASAN_SANITIZE                       := n
> >  KCSAN_SANITIZE                       := n
> > diff --git a/include/asm-generic/vmlinux.lds.h b/include/asm-generic/vmlinux.lds.h
> > index 0331d5d49551..b371857097e8 100644
> > --- a/include/asm-generic/vmlinux.lds.h
> > +++ b/include/asm-generic/vmlinux.lds.h
> > @@ -329,6 +329,39 @@
> >  #define DTPM_TABLE()
> >  #endif
> >
> > +#ifdef CONFIG_PGO_CLANG
> > +#define PGO_CLANG_DATA                                                       \
> > +     __llvm_prf_data : AT(ADDR(__llvm_prf_data) - LOAD_OFFSET) {     \
> > +             __llvm_prf_start = .;                                   \
> > +             __llvm_prf_data_start = .;                              \
> > +             *(__llvm_prf_data)                                      \
> > +             __llvm_prf_data_end = .;                                \
> > +     }                                                               \
> > +     __llvm_prf_cnts : AT(ADDR(__llvm_prf_cnts) - LOAD_OFFSET) {     \
> > +             __llvm_prf_cnts_start = .;                              \
> > +             *(__llvm_prf_cnts)                                      \
> > +             __llvm_prf_cnts_end = .;                                \
> > +     }                                                               \
> > +     __llvm_prf_names : AT(ADDR(__llvm_prf_names) - LOAD_OFFSET) {   \
> > +             __llvm_prf_names_start = .;                             \
> > +             *(__llvm_prf_names)                                     \
> > +             __llvm_prf_names_end = .;                               \
> > +     }                                                               \
> > +     __llvm_prf_vals : AT(ADDR(__llvm_prf_vals) - LOAD_OFFSET) {     \
> > +             __llvm_prf_vals_start = .;                              \
> > +             *(__llvm_prf_vals)                                      \
> > +             __llvm_prf_vals_end = .;                                \
> > +     }                                                               \
> > +     __llvm_prf_vnds : AT(ADDR(__llvm_prf_vnds) - LOAD_OFFSET) {     \
> > +             __llvm_prf_vnds_start = .;                              \
> > +             *(__llvm_prf_vnds)                                      \
> > +             __llvm_prf_vnds_end = .;                                \
> > +             __llvm_prf_end = .;                                     \
> > +     }
> > +#else
> > +#define PGO_CLANG_DATA
> > +#endif
> > +
> >  #define KERNEL_DTB()                                                 \
> >       STRUCT_ALIGN();                                                 \
> >       __dtb_start = .;                                                \
> > @@ -1106,6 +1139,7 @@
> >               CONSTRUCTORS                                            \
> >       }                                                               \
> >       BUG_TABLE                                                       \
> > +     PGO_CLANG_DATA
> >
> >  #define INIT_TEXT_SECTION(inittext_align)                            \
> >       . = ALIGN(inittext_align);                                      \
> > diff --git a/kernel/Makefile b/kernel/Makefile
> > index 320f1f3941b7..a2a23ef2b12f 100644
> > --- a/kernel/Makefile
> > +++ b/kernel/Makefile
> > @@ -111,6 +111,7 @@ obj-$(CONFIG_BPF) += bpf/
> >  obj-$(CONFIG_KCSAN) += kcsan/
> >  obj-$(CONFIG_SHADOW_CALL_STACK) += scs.o
> >  obj-$(CONFIG_HAVE_STATIC_CALL_INLINE) += static_call.o
> > +obj-$(CONFIG_PGO_CLANG) += pgo/
> >
> >  obj-$(CONFIG_PERF_EVENTS) += events/
> >
> > diff --git a/kernel/pgo/Kconfig b/kernel/pgo/Kconfig
> > new file mode 100644
> > index 000000000000..76a640b6cf6e
> > --- /dev/null
> > +++ b/kernel/pgo/Kconfig
> > @@ -0,0 +1,35 @@
> > +# SPDX-License-Identifier: GPL-2.0-only
> > +menu "Profile Guided Optimization (PGO) (EXPERIMENTAL)"
> > +
> > +config ARCH_SUPPORTS_PGO_CLANG
> > +     bool
> > +
> > +config PGO_CLANG
> > +     bool "Enable clang's PGO-based kernel profiling"
> > +     depends on DEBUG_FS
> > +     depends on ARCH_SUPPORTS_PGO_CLANG
> > +     depends on CC_IS_CLANG && CLANG_VERSION >= 120000
> > +     help
> > +       This option enables clang's PGO (Profile Guided Optimization) based
> > +       code profiling to better optimize the kernel.
> > +
> > +       If unsure, say N.
> > +
> > +       Run a representative workload for your application on a kernel
> > +       compiled with this option and download the raw profile file from
> > +       /sys/kernel/debug/pgo/profraw. This file needs to be processed with
> > +       llvm-profdata. It may be merged with other collected raw profiles.
> > +
> > +       Copy the resulting profile file into vmlinux.profdata, and enable
> > +       KCFLAGS=-fprofile-use=vmlinux.profdata to produce an optimized
> > +       kernel.
> > +
> > +       Note that a kernel compiled with profiling flags will be
> > +       significantly larger and run slower. Also be sure to exclude files
> > +       from profiling which are not linked to the kernel image to prevent
> > +       linker errors.
> > +
> > +       Note that the debugfs filesystem has to be mounted to access
> > +       profiling data.
>
> It might be nice to have CONFIG_PGO_PROFILE_ALL like
> CONFIG_GCOV_PROFILE_ALL so that people do not have to go sprinkle the
> kernel with PGO_PROFILE definitions in the Makefile.
>
It seemed to me that the GCOV_PROFILE_ALL option was there to
differentiate between profiling and coverage, though I may be wrong
about that. I didn't add PGO_PROFILE_ALL because enabling PGO_CLANG
has only one use: profiling the entire kernel. It may be useful to
add PGO_PROFILE_ALL once we include coverage. Thoughts?

-bw

> > +endmenu
> > diff --git a/kernel/pgo/Makefile b/kernel/pgo/Makefile
> > new file mode 100644
> > index 000000000000..41e27cefd9a4
> > --- /dev/null
> > +++ b/kernel/pgo/Makefile
> > @@ -0,0 +1,5 @@
> > +# SPDX-License-Identifier: GPL-2.0
> > +GCOV_PROFILE := n
> > +PGO_PROFILE  := n
> > +
> > +obj-y        += fs.o instrument.o
> > diff --git a/kernel/pgo/fs.c b/kernel/pgo/fs.c
> > new file mode 100644
> > index 000000000000..1678df3b7d64
> > --- /dev/null
> > +++ b/kernel/pgo/fs.c
> > @@ -0,0 +1,389 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +/*
> > + * Copyright (C) 2019 Google, Inc.
> > + *
> > + * Author:
> > + *   Sami Tolvanen <samitolvanen@google.com>
> > + *
> > + * This software is licensed under the terms of the GNU General Public
> > + * License version 2, as published by the Free Software Foundation, and
> > + * may be copied, distributed, and modified under those terms.
> > + *
> > + * This program is distributed in the hope that it will be useful,
> > + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> > + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> > + * GNU General Public License for more details.
> > + *
> > + */
> > +
> > +#define pr_fmt(fmt)  "pgo: " fmt
> > +
> > +#include <linux/kernel.h>
> > +#include <linux/debugfs.h>
> > +#include <linux/fs.h>
> > +#include <linux/module.h>
> > +#include <linux/slab.h>
> > +#include <linux/vmalloc.h>
> > +#include "pgo.h"
> > +
> > +static struct dentry *directory;
> > +
> > +struct prf_private_data {
> > +     void *buffer;
> > +     unsigned long size;
> > +};
> > +
> > +/*
> > + * Raw profile data format:
> > + *
> > + *   - llvm_prf_header
> > + *   - __llvm_prf_data
> > + *   - __llvm_prf_cnts
> > + *   - __llvm_prf_names
> > + *   - zero padding to 8 bytes
> > + *   - for each llvm_prf_data in __llvm_prf_data:
> > + *           - llvm_prf_value_data
> > + *                   - llvm_prf_value_record + site count array
> > + *                           - llvm_prf_value_node_data
> > + *                           ...
> > + *                   ...
> > + *           ...
> > + */
> > +
> > +static void prf_fill_header(void **buffer)
> > +{
> > +     struct llvm_prf_header *header = *(struct llvm_prf_header **)buffer;
> > +
> > +#ifdef CONFIG_64BIT
> > +     header->magic = LLVM_INSTR_PROF_RAW_MAGIC_64;
> > +#else
> > +     header->magic = LLVM_INSTR_PROF_RAW_MAGIC_32;
> > +#endif
> > +     header->version = LLVM_VARIANT_MASK_IR_PROF | LLVM_INSTR_PROF_RAW_VERSION;
> > +     header->data_size = prf_data_count();
> > +     header->padding_bytes_before_counters = 0;
> > +     header->counters_size = prf_cnts_count();
> > +     header->padding_bytes_after_counters = 0;
> > +     header->names_size = prf_names_count();
> > +     header->counters_delta = (u64)__llvm_prf_cnts_start;
> > +     header->names_delta = (u64)__llvm_prf_names_start;
> > +     header->value_kind_last = LLVM_INSTR_PROF_IPVK_LAST;
> > +
> > +     *buffer += sizeof(*header);
> > +}
> > +
> > +/*
> > + * Copy the source into the buffer, incrementing the pointer into buffer in the
> > + * process.
> > + */
> > +static void prf_copy_to_buffer(void **buffer, void *src, unsigned long size)
> > +{
> > +     memcpy(*buffer, src, size);
> > +     *buffer += size;
> > +}
> > +
> > +static u32 __prf_get_value_size(struct llvm_prf_data *p, u32 *value_kinds)
> > +{
> > +     struct llvm_prf_value_node **nodes =
> > +             (struct llvm_prf_value_node **)p->values;
> > +     u32 kinds = 0;
> > +     u32 size = 0;
> > +     unsigned int kind;
> > +     unsigned int n;
> > +     unsigned int s = 0;
> > +
> > +     for (kind = 0; kind < ARRAY_SIZE(p->num_value_sites); kind++) {
> > +             unsigned int sites = p->num_value_sites[kind];
> > +
> > +             if (!sites)
> > +                     continue;
> > +
> > +             /* Record + site count array */
> > +             size += prf_get_value_record_size(sites);
> > +             kinds++;
> > +
> > +             if (!nodes)
> > +                     continue;
> > +
> > +             for (n = 0; n < sites; n++) {
> > +                     u32 count = 0;
> > +                     struct llvm_prf_value_node *site = nodes[s + n];
> > +
> > +                     while (site && ++count <= U8_MAX)
> > +                             site = site->next;
> > +
> > +                     size += count *
> > +                             sizeof(struct llvm_prf_value_node_data);
> > +             }
> > +
> > +             s += sites;
> > +     }
> > +
> > +     if (size)
> > +             size += sizeof(struct llvm_prf_value_data);
> > +
> > +     if (value_kinds)
> > +             *value_kinds = kinds;
> > +
> > +     return size;
> > +}
> > +
> > +static u32 prf_get_value_size(void)
> > +{
> > +     u32 size = 0;
> > +     struct llvm_prf_data *p;
> > +
> > +     for (p = __llvm_prf_data_start; p < __llvm_prf_data_end; p++)
> > +             size += __prf_get_value_size(p, NULL);
> > +
> > +     return size;
> > +}
> > +
> > +/* Serialize the value profiling data of one llvm_prf_data entry. */
> > +static void prf_serialize_value(struct llvm_prf_data *p, void **buffer)
> > +{
> > +     struct llvm_prf_value_data header;
> > +     struct llvm_prf_value_node **nodes =
> > +             (struct llvm_prf_value_node **)p->values;
> > +     unsigned int kind;
> > +     unsigned int n;
> > +     unsigned int s = 0;
> > +
> > +     header.total_size = __prf_get_value_size(p, &header.num_value_kinds);
> > +
> > +     if (!header.num_value_kinds)
> > +             /* Nothing to write. */
> > +             return;
> > +
> > +     prf_copy_to_buffer(buffer, &header, sizeof(header));
> > +
> > +     for (kind = 0; kind < ARRAY_SIZE(p->num_value_sites); kind++) {
> > +             struct llvm_prf_value_record *record;
> > +             u8 *counts;
> > +             unsigned int sites = p->num_value_sites[kind];
> > +
> > +             if (!sites)
> > +                     continue;
> > +
> > +             /* Profiling value record. */
> > +             record = *(struct llvm_prf_value_record **)buffer;
> > +             *buffer += prf_get_value_record_header_size();
> > +
> > +             record->kind = kind;
> > +             record->num_value_sites = sites;
> > +
> > +             /* Site count array. */
> > +             counts = *(u8 **)buffer;
> > +             *buffer += prf_get_value_record_site_count_size(sites);
> > +
> > +             /*
> > +              * If we don't have nodes, we can skip updating the site count
> > +              * array, because the buffer is zero filled.
> > +              */
> > +             if (!nodes)
> > +                     continue;
> > +
> > +             for (n = 0; n < sites; n++) {
> > +                     u32 count = 0;
> > +                     struct llvm_prf_value_node *site = nodes[s + n];
> > +
> > +                     while (site && ++count <= U8_MAX) {
> > +                             prf_copy_to_buffer(buffer, site,
> > +                                                sizeof(struct llvm_prf_value_node_data));
> > +                             site = site->next;
> > +                     }
> > +
> > +                     counts[n] = (u8)count;
> > +             }
> > +
> > +             s += sites;
> > +     }
> > +}
> > +
> > +static void prf_serialize_values(void **buffer)
> > +{
> > +     struct llvm_prf_data *p;
> > +
> > +     for (p = __llvm_prf_data_start; p < __llvm_prf_data_end; p++)
> > +             prf_serialize_value(p, buffer);
> > +}
> > +
> > +static inline unsigned long prf_get_padding(unsigned long size)
> > +{
> > +     return 7 & (sizeof(u64) - size % sizeof(u64));
> > +}
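
For anyone double-checking the padding expression above, here is a quick
userspace sketch of it (not part of the patch). pad_to_u64() and its
selftest are stand-in names of mine, and uint64_t stands in for the
kernel's u64; the expression itself is copied from prf_get_padding().

```c
#include <assert.h>
#include <stdint.h>

/* Userspace stand-in for prf_get_padding(); same expression as the patch. */
static unsigned long pad_to_u64(unsigned long size)
{
	return 7 & (sizeof(uint64_t) - size % sizeof(uint64_t));
}

/* Hypothetical self-check: size plus padding must be a multiple of 8. */
static void pad_to_u64_selftest(void)
{
	unsigned long size;

	for (size = 0; size < 64; size++) {
		unsigned long pad = pad_to_u64(size);

		assert(pad < 8);
		assert((size + pad) % 8 == 0);
	}
}
```

The "7 &" mask is what keeps an already aligned size from getting a
full 8 bytes of padding: for size % 8 == 0 the subtraction yields 8,
and 7 & 8 is 0.
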
> > +
> > +static unsigned long prf_buffer_size(void)
> > +{
> > +     return sizeof(struct llvm_prf_header) +
> > +                     prf_data_size() +
> > +                     prf_cnts_size() +
> > +                     prf_names_size() +
> > +                     prf_get_padding(prf_names_size()) +
> > +                     prf_get_value_size();
> > +}
> > +
> > +/*
> > + * Serialize the profiling data into a format LLVM's tools can understand.
> > + * Note: caller *must* hold pgo_lock.
> > + */
> > +static int prf_serialize(struct prf_private_data *p)
> > +{
> > +     int err = 0;
> > +     void *buffer;
> > +
> > +     p->size = prf_buffer_size();
> > +     p->buffer = vzalloc(p->size);
> > +
> > +     if (!p->buffer) {
> > +             err = -ENOMEM;
> > +             goto out;
> > +     }
> > +
> > +     buffer = p->buffer;
> > +
> > +     prf_fill_header(&buffer);
> > +     prf_copy_to_buffer(&buffer, __llvm_prf_data_start,  prf_data_size());
> > +     prf_copy_to_buffer(&buffer, __llvm_prf_cnts_start,  prf_cnts_size());
> > +     prf_copy_to_buffer(&buffer, __llvm_prf_names_start, prf_names_size());
> > +     buffer += prf_get_padding(prf_names_size());
> > +
> > +     prf_serialize_values(&buffer);
> > +
> > +out:
> > +     return err;
> > +}
> > +
> > +/* open() implementation for PGO. Creates a copy of the profiling data set. */
> > +static int prf_open(struct inode *inode, struct file *file)
> > +{
> > +     struct prf_private_data *data;
> > +     unsigned long flags;
> > +     int err;
> > +
> > +     data = kzalloc(sizeof(*data), GFP_KERNEL);
> > +     if (!data) {
> > +             err = -ENOMEM;
> > +             goto out;
> > +     }
> > +
> > +     flags = prf_lock();
> > +
> > +     err = prf_serialize(data);
> > +     if (unlikely(err)) {
> > +             kfree(data);
> > +             goto out_unlock;
> > +     }
> > +
> > +     file->private_data = data;
> > +
> > +out_unlock:
> > +     prf_unlock(flags);
> > +out:
> > +     return err;
> > +}
> > +
> > +/* read() implementation for PGO. */
> > +static ssize_t prf_read(struct file *file, char __user *buf, size_t count,
> > +                     loff_t *ppos)
> > +{
> > +     struct prf_private_data *data = file->private_data;
> > +
> > +     BUG_ON(!data);
> > +
> > +     return simple_read_from_buffer(buf, count, ppos, data->buffer,
> > +                                    data->size);
> > +}
> > +
> > +/* release() implementation for PGO. Release resources allocated by open(). */
> > +static int prf_release(struct inode *inode, struct file *file)
> > +{
> > +     struct prf_private_data *data = file->private_data;
> > +
> > +     if (data) {
> > +             vfree(data->buffer);
> > +             kfree(data);
> > +     }
> > +
> > +     return 0;
> > +}
> > +
> > +static const struct file_operations prf_fops = {
> > +     .owner          = THIS_MODULE,
> > +     .open           = prf_open,
> > +     .read           = prf_read,
> > +     .llseek         = default_llseek,
> > +     .release        = prf_release
> > +};
> > +
> > +/* write() implementation for resetting PGO's profile data. */
> > +static ssize_t reset_write(struct file *file, const char __user *addr,
> > +                        size_t len, loff_t *pos)
> > +{
> > +     struct llvm_prf_data *data;
> > +
> > +     memset(__llvm_prf_cnts_start, 0, prf_cnts_size());
> > +
> > +     for (data = __llvm_prf_data_start; data < __llvm_prf_data_end; data++) {
> > +             struct llvm_prf_value_node **vnodes;
> > +             u64 current_vsite_count;
> > +             u32 i;
> > +
> > +             if (!data->values)
> > +                     continue;
> > +
> > +             current_vsite_count = 0;
> > +             vnodes = (struct llvm_prf_value_node **)data->values;
> > +
> > +             for (i = LLVM_INSTR_PROF_IPVK_FIRST; i <= LLVM_INSTR_PROF_IPVK_LAST; i++)
> > +                     current_vsite_count += data->num_value_sites[i];
> > +
> > +             for (i = 0; i < current_vsite_count; i++) {
> > +                     struct llvm_prf_value_node *current_vnode = vnodes[i];
> > +
> > +                     while (current_vnode) {
> > +                             current_vnode->count = 0;
> > +                             current_vnode = current_vnode->next;
> > +                     }
> > +             }
> > +     }
> > +
> > +     return len;
> > +}
> > +
> > +static const struct file_operations prf_reset_fops = {
> > +     .owner          = THIS_MODULE,
> > +     .write          = reset_write,
> > +     .llseek         = noop_llseek,
> > +};
> > +
> > +/* Create debugfs entries. */
> > +static int __init pgo_init(void)
> > +{
> > +     directory = debugfs_create_dir("pgo", NULL);
> > +     if (!directory)
> > +             goto err_remove;
> > +
> > +     if (!debugfs_create_file("profraw", 0600, directory, NULL,
> > +                              &prf_fops))
> > +             goto err_remove;
> > +
> > +     if (!debugfs_create_file("reset", 0200, directory, NULL,
> > +                              &prf_reset_fops))
> > +             goto err_remove;
> > +
> > +     return 0;
> > +
> > +err_remove:
> > +     pr_err("initialization failed\n");
> > +     return -EIO;
> > +}
> > +
> > +/* Remove debugfs entries. */
> > +static void __exit pgo_exit(void)
> > +{
> > +     debugfs_remove_recursive(directory);
> > +}
> > +
> > +module_init(pgo_init);
> > +module_exit(pgo_exit);
> > diff --git a/kernel/pgo/instrument.c b/kernel/pgo/instrument.c
> > new file mode 100644
> > index 000000000000..464b3bc77431
> > --- /dev/null
> > +++ b/kernel/pgo/instrument.c
> > @@ -0,0 +1,189 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +/*
> > + * Copyright (C) 2019 Google, Inc.
> > + *
> > + * Author:
> > + *   Sami Tolvanen <samitolvanen@google.com>
> > + *
> > + * This software is licensed under the terms of the GNU General Public
> > + * License version 2, as published by the Free Software Foundation, and
> > + * may be copied, distributed, and modified under those terms.
> > + *
> > + * This program is distributed in the hope that it will be useful,
> > + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> > + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> > + * GNU General Public License for more details.
> > + *
> > + */
> > +
> > +#define pr_fmt(fmt)  "pgo: " fmt
> > +
> > +#include <linux/bitops.h>
> > +#include <linux/kernel.h>
> > +#include <linux/export.h>
> > +#include <linux/spinlock.h>
> > +#include <linux/types.h>
> > +#include "pgo.h"
> > +
> > +/*
> > + * This lock guards both profile count updating and serialization of the
> > + * profiling data. Keeping both of these activities separate via locking
> > + * ensures that we don't try to serialize data that's only partially updated.
> > + */
> > +static DEFINE_SPINLOCK(pgo_lock);
> > +static int current_node;
> > +
> > +unsigned long prf_lock(void)
> > +{
> > +     unsigned long flags;
> > +
> > +     spin_lock_irqsave(&pgo_lock, flags);
> > +
> > +     return flags;
> > +}
> > +
> > +void prf_unlock(unsigned long flags)
> > +{
> > +     spin_unlock_irqrestore(&pgo_lock, flags);
> > +}
> > +
> > +/*
> > + * Return a newly allocated profiling value node which will hold a value
> > + * tracked by the value profiler.
> > + * Note: caller *must* hold pgo_lock.
> > + */
> > +static struct llvm_prf_value_node *allocate_node(struct llvm_prf_data *p,
> > +                                              u32 index, u64 value)
> > +{
> > +     if (&__llvm_prf_vnds_start[current_node + 1] >= __llvm_prf_vnds_end)
> > +             return NULL; /* Out of nodes */
> > +
> > +     current_node++;
> > +
> > +     /* Make sure the node is entirely within the section */
> > +     if (&__llvm_prf_vnds_start[current_node] >= __llvm_prf_vnds_end ||
> > +         &__llvm_prf_vnds_start[current_node + 1] > __llvm_prf_vnds_end)
> > +             return NULL;
> > +
> > +     return &__llvm_prf_vnds_start[current_node];
> > +}
> > +
> > +/*
> > + * Counts the number of times a target value is seen.
> > + *
> > + * Records the target value for the index if not seen before. Otherwise,
> > + * increments the counter associated w/ the target value.
> > + */
> > +void __llvm_profile_instrument_target(u64 target_value, void *data, u32 index);
> > +void __llvm_profile_instrument_target(u64 target_value, void *data, u32 index)
> > +{
> > +     struct llvm_prf_data *p = (struct llvm_prf_data *)data;
> > +     struct llvm_prf_value_node **counters;
> > +     struct llvm_prf_value_node *curr;
> > +     struct llvm_prf_value_node *min = NULL;
> > +     struct llvm_prf_value_node *prev = NULL;
> > +     u64 min_count = U64_MAX;
> > +     u8 values = 0;
> > +     unsigned long flags;
> > +
> > +     if (!p || !p->values)
> > +             return;
> > +
> > +     counters = (struct llvm_prf_value_node **)p->values;
> > +     curr = counters[index];
> > +
> > +     while (curr) {
> > +             if (target_value == curr->value) {
> > +                     curr->count++;
> > +                     return;
> > +             }
> > +
> > +             if (curr->count < min_count) {
> > +                     min_count = curr->count;
> > +                     min = curr;
> > +             }
> > +
> > +             prev = curr;
> > +             curr = curr->next;
> > +             values++;
> > +     }
> > +
> > +     if (values >= LLVM_INSTR_PROF_MAX_NUM_VAL_PER_SITE) {
> > +             if (!min->count || !(--min->count)) {
> > +                     curr = min;
> > +                     curr->value = target_value;
> > +                     curr->count++;
> > +             }
> > +             return;
> > +     }
> > +
> > +     /* Lock when updating the value node structure. */
> > +     flags = prf_lock();
> > +
> > +     curr = allocate_node(p, index, target_value);
> > +     if (!curr)
> > +             goto out;
> > +
> > +     curr->value = target_value;
> > +     curr->count++;
> > +
> > +     if (!counters[index])
> > +             counters[index] = curr;
> > +     else if (prev && !prev->next)
> > +             prev->next = curr;
> > +
> > +out:
> > +     prf_unlock(flags);
> > +}
> > +EXPORT_SYMBOL(__llvm_profile_instrument_target);
> > +
> > +/* Counts the number of times a range of target values is seen. */
> > +void __llvm_profile_instrument_range(u64 target_value, void *data,
> > +                                  u32 index, s64 precise_start,
> > +                                  s64 precise_last, s64 large_value);
> > +void __llvm_profile_instrument_range(u64 target_value, void *data,
> > +                                  u32 index, s64 precise_start,
> > +                                  s64 precise_last, s64 large_value)
> > +{
> > +     if (large_value != S64_MIN && (s64)target_value >= large_value)
> > +             target_value = large_value;
> > +     else if ((s64)target_value < precise_start ||
> > +              (s64)target_value > precise_last)
> > +             target_value = precise_last + 1;
> > +
> > +     __llvm_profile_instrument_target(target_value, data, index);
> > +}
> > +EXPORT_SYMBOL(__llvm_profile_instrument_range);
> > +
> > +static u64 inst_prof_get_range_rep_value(u64 value)
> > +{
> > +     if (value <= 8)
> > +             /* The first ranges are individually tracked, use the value as is. */
> > +             return value;
> > +     else if (value >= 513)
> > +             /* The last range is mapped to its lowest value. */
> > +             return 513;
> > +     else if (hweight64(value) == 1)
> > +             /* If it's a power of two, use it as is. */
> > +             return value;
> > +
> > +     /* Otherwise, take the previous power of two + 1. */
> > +     return ((u64)1 << (64 - __builtin_clzll(value) - 1)) + 1;
> > +}
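
To make the bucketing above concrete, a small userspace sketch of the
same mapping (again not part of the patch). range_rep_value() is a
stand-in name, and I'm assuming __builtin_popcountll() as the userspace
equivalent of hweight64(); the thresholds are the ones in the function
above.

```c
#include <assert.h>
#include <stdint.h>

/* Userspace copy of inst_prof_get_range_rep_value() from the patch. */
static uint64_t range_rep_value(uint64_t value)
{
	if (value <= 8)
		return value;		/* tracked individually */
	if (value >= 513)
		return 513;		/* last range, lowest value */
	if (__builtin_popcountll(value) == 1)
		return value;		/* exact power of two */

	/* Previous power of two + 1. */
	return ((uint64_t)1 << (64 - __builtin_clzll(value) - 1)) + 1;
}

/* Hypothetical self-check over a few representative sizes. */
static void range_rep_value_selftest(void)
{
	assert(range_rep_value(8) == 8);
	assert(range_rep_value(12) == 9);
	assert(range_rep_value(16) == 16);
	assert(range_rep_value(100) == 65);
	assert(range_rep_value(1000) == 513);
}
```

So every size in 9..15 collapses to 9, 17..31 to 17, and so on, while
the powers of two themselves keep their own bucket.
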
> > +
> > +/*
> > + * The target values are partitioned into multiple ranges. The range spec is
> > + * defined in compiler-rt/include/profile/InstrProfData.inc.
> > + */
> > +void __llvm_profile_instrument_memop(u64 target_value, void *data,
> > +                                  u32 counter_index);
> > +void __llvm_profile_instrument_memop(u64 target_value, void *data,
> > +                                  u32 counter_index)
> > +{
> > +     u64 rep_value;
> > +
> > +     /* Map the target value to the representative value of its range. */
> > +     rep_value = inst_prof_get_range_rep_value(target_value);
> > +     __llvm_profile_instrument_target(rep_value, data, counter_index);
> > +}
> > +EXPORT_SYMBOL(__llvm_profile_instrument_memop);
> > diff --git a/kernel/pgo/pgo.h b/kernel/pgo/pgo.h
> > new file mode 100644
> > index 000000000000..ddc8d3002fe5
> > --- /dev/null
> > +++ b/kernel/pgo/pgo.h
> > @@ -0,0 +1,203 @@
> > +/* SPDX-License-Identifier: GPL-2.0 */
> > +/*
> > + * Copyright (C) 2019 Google, Inc.
> > + *
> > + * Author:
> > + *   Sami Tolvanen <samitolvanen@google.com>
> > + *
> > + * This software is licensed under the terms of the GNU General Public
> > + * License version 2, as published by the Free Software Foundation, and
> > + * may be copied, distributed, and modified under those terms.
> > + *
> > + * This program is distributed in the hope that it will be useful,
> > + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> > + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> > + * GNU General Public License for more details.
> > + *
> > + */
> > +
> > +#ifndef _PGO_H
> > +#define _PGO_H
> > +
> > +/*
> > + * Note: These internal LLVM definitions must match the compiler version.
> > + * See llvm/include/llvm/ProfileData/InstrProfData.inc in LLVM's source code.
> > + */
> > +
> > +#define LLVM_INSTR_PROF_RAW_MAGIC_64 \
> > +             ((u64)255 << 56 |       \
> > +              (u64)'l' << 48 |       \
> > +              (u64)'p' << 40 |       \
> > +              (u64)'r' << 32 |       \
> > +              (u64)'o' << 24 |       \
> > +              (u64)'f' << 16 |       \
> > +              (u64)'r' << 8  |       \
> > +              (u64)129)
> > +#define LLVM_INSTR_PROF_RAW_MAGIC_32 \
> > +             ((u64)255 << 56 |       \
> > +              (u64)'l' << 48 |       \
> > +              (u64)'p' << 40 |       \
> > +              (u64)'r' << 32 |       \
> > +              (u64)'o' << 24 |       \
> > +              (u64)'f' << 16 |       \
> > +              (u64)'R' << 8  |       \
> > +              (u64)129)
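
In case it helps when eyeballing a profraw dump, this is what the two
magic macros above evaluate to. It is a userspace sketch only: the
RAW_MAGIC() helper and function names are mine, and uint64_t stands in
for u64. The byte values are the ones spelled out in the macros.

```c
#include <assert.h>
#include <stdint.h>

/* Same byte layout as the patch's macros; 'c' is 'r' (64-bit) or 'R' (32-bit). */
#define RAW_MAGIC(c)				\
	((uint64_t)255 << 56 |			\
	 (uint64_t)'l' << 48 |			\
	 (uint64_t)'p' << 40 |			\
	 (uint64_t)'r' << 32 |			\
	 (uint64_t)'o' << 24 |			\
	 (uint64_t)'f' << 16 |			\
	 (uint64_t)(c) << 8  |			\
	 (uint64_t)129)

static uint64_t raw_magic_64(void)
{
	return RAW_MAGIC('r');
}

static uint64_t raw_magic_32(void)
{
	return RAW_MAGIC('R');
}

/* Hypothetical self-check against the hand-computed constants. */
static void raw_magic_selftest(void)
{
	assert(raw_magic_64() == 0xff6c70726f667281ULL);
	assert(raw_magic_32() == 0xff6c70726f665281ULL);
}
```

The two values differ only in the second-lowest byte ('r' vs. 'R').
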
> > +
> > +#define LLVM_INSTR_PROF_RAW_VERSION          5
> > +#define LLVM_INSTR_PROF_DATA_ALIGNMENT               8
> > +#define LLVM_INSTR_PROF_IPVK_FIRST           0
> > +#define LLVM_INSTR_PROF_IPVK_LAST            1
> > +#define LLVM_INSTR_PROF_MAX_NUM_VAL_PER_SITE 255
> > +
> > +#define LLVM_VARIANT_MASK_IR_PROF    (0x1ULL << 56)
> > +#define LLVM_VARIANT_MASK_CSIR_PROF  (0x1ULL << 57)
> > +
> > +/**
> > + * struct llvm_prf_header - represents the raw profile header data structure.
> > + * @magic: the magic token for the file format.
> > + * @version: the version of the file format.
> > + * @data_size: the number of entries in the profile data section.
> > + * @padding_bytes_before_counters: the number of padding bytes before the
> > + *   counters.
> > + * @counters_size: the size in bytes of the LLVM profile section containing the
> > + *   counters.
> > + * @padding_bytes_after_counters: the number of padding bytes after the
> > + *   counters.
> > + * @names_size: the size in bytes of the LLVM profile section containing the
> > + *   counters' names.
> > + * @counters_delta: the beginning of the LLVM profile counters section.
> > + * @names_delta: the beginning of the LLVM profile names section.
> > + * @value_kind_last: the last profile value kind.
> > + */
> > +struct llvm_prf_header {
> > +     u64 magic;
> > +     u64 version;
> > +     u64 data_size;
> > +     u64 padding_bytes_before_counters;
> > +     u64 counters_size;
> > +     u64 padding_bytes_after_counters;
> > +     u64 names_size;
> > +     u64 counters_delta;
> > +     u64 names_delta;
> > +     u64 value_kind_last;
> > +};
> > +
> > +/**
> > + * struct llvm_prf_data - represents the per-function control structure.
> > + * @name_ref: the reference to the function's name.
> > + * @func_hash: the hash value of the function.
> > + * @counter_ptr: a pointer to the profile counter.
> > + * @function_ptr: a pointer to the function.
> > + * @values: the profiling values associated with this function.
> > + * @num_counters: the number of counters in the function.
> > + * @num_value_sites: the number of value profile sites.
> > + */
> > +struct llvm_prf_data {
> > +     const u64 name_ref;
> > +     const u64 func_hash;
> > +     const void *counter_ptr;
> > +     const void *function_ptr;
> > +     void *values;
> > +     const u32 num_counters;
> > +     const u16 num_value_sites[LLVM_INSTR_PROF_IPVK_LAST + 1];
> > +} __aligned(LLVM_INSTR_PROF_DATA_ALIGNMENT);
> > +
> > +/**
> > + * struct llvm_prf_value_node_data - represents the data part of the struct
> > + *   llvm_prf_value_node data structure.
> > + * @value: the value counters.
> > + * @count: the counters' count.
> > + */
> > +struct llvm_prf_value_node_data {
> > +     u64 value;
> > +     u64 count;
> > +};
> > +
> > +/**
> > + * struct llvm_prf_value_node - represents an internal data structure used by
> > + *   the value profiler.
> > + * @value: the value counters.
> > + * @count: the counters' count.
> > + * @next: the next value node.
> > + */
> > +struct llvm_prf_value_node {
> > +     u64 value;
> > +     u64 count;
> > +     struct llvm_prf_value_node *next;
> > +};
> > +
> > +/**
> > + * struct llvm_prf_value_data - represents the value profiling data in indexed
> > + *   format.
> > + * @total_size: the total size in bytes including this field.
> > + * @num_value_kinds: the number of value profile kinds that have value profile
> > + *   data.
> > + */
> > +struct llvm_prf_value_data {
> > +     u32 total_size;
> > +     u32 num_value_kinds;
> > +};
> > +
> > +/**
> > + * struct llvm_prf_value_record - represents the on-disk layout of the value
> > + *   profile data of a particular kind for one function.
> > + * @kind: the kind of the value profile record.
> > + * @num_value_sites: the number of value profile sites.
> > + * @site_count_array: the first element of the array that stores the number
> > + *   of profiled values for each value site.
> > + */
> > +struct llvm_prf_value_record {
> > +     u32 kind;
> > +     u32 num_value_sites;
> > +     u8 site_count_array[];
> > +};
> > +
> > +#define prf_get_value_record_header_size()           \
> > +     offsetof(struct llvm_prf_value_record, site_count_array)
> > +#define prf_get_value_record_site_count_size(sites)  \
> > +     roundup((sites), 8)
> > +#define prf_get_value_record_size(sites)             \
> > +     (prf_get_value_record_header_size() +           \
> > +      prf_get_value_record_site_count_size((sites)))
> > +
> > +/* Data sections */
> > +extern struct llvm_prf_data __llvm_prf_data_start[];
> > +extern struct llvm_prf_data __llvm_prf_data_end[];
> > +
> > +extern u64 __llvm_prf_cnts_start[];
> > +extern u64 __llvm_prf_cnts_end[];
> > +
> > +extern char __llvm_prf_names_start[];
> > +extern char __llvm_prf_names_end[];
> > +
> > +extern struct llvm_prf_value_node __llvm_prf_vnds_start[];
> > +extern struct llvm_prf_value_node __llvm_prf_vnds_end[];
> > +
> > +/* Locking for vnodes */
> > +extern unsigned long prf_lock(void);
> > +extern void prf_unlock(unsigned long flags);
> > +
> > +#define __DEFINE_PRF_SIZE(s) \
> > +     static inline unsigned long prf_ ## s ## _size(void)            \
> > +     {                                                               \
> > +             unsigned long start =                                   \
> > +                     (unsigned long)__llvm_prf_ ## s ## _start;      \
> > +             unsigned long end =                                     \
> > +                     (unsigned long)__llvm_prf_ ## s ## _end;        \
> > +             return roundup(end - start,                             \
> > +                             sizeof(__llvm_prf_ ## s ## _start[0])); \
> > +     }                                                               \
> > +     static inline unsigned long prf_ ## s ## _count(void)           \
> > +     {                                                               \
> > +             return prf_ ## s ## _size() /                           \
> > +                     sizeof(__llvm_prf_ ## s ## _start[0]);          \
> > +     }
> > +
> > +__DEFINE_PRF_SIZE(data);
> > +__DEFINE_PRF_SIZE(cnts);
> > +__DEFINE_PRF_SIZE(names);
> > +__DEFINE_PRF_SIZE(vnds);
> > +
> > +#undef __DEFINE_PRF_SIZE
> > +
> > +#endif /* _PGO_H */
> > diff --git a/scripts/Makefile.lib b/scripts/Makefile.lib
> > index 8cd67b1b6d15..d411e92dd0d6 100644
> > --- a/scripts/Makefile.lib
> > +++ b/scripts/Makefile.lib
> > @@ -139,6 +139,16 @@ _c_flags += $(if $(patsubst n%,, \
> >               $(CFLAGS_GCOV))
> >  endif
> >
> > +#
> > +# Enable clang's PGO profiling flags for a file or directory depending on
> > +# variables PGO_PROFILE_obj.o and PGO_PROFILE.
> > +#
> > +ifeq ($(CONFIG_PGO_CLANG),y)
> > +_c_flags += $(if $(patsubst n%,, \
> > +             $(PGO_PROFILE_$(basetarget).o)$(PGO_PROFILE)y), \
> > +             $(CFLAGS_PGO_CLANG))
> > +endif
> > +
> >  #
> >  # Enable address sanitizer flags for kernel except some files or directories
> >  # we don't want to check (depends on variables KASAN_SANITIZE_obj.o, KASAN_SANITIZE)
> > --
> > 2.31.0.208.g409f899ff0-goog
> >
Kees Cook May 19, 2021, 9:37 p.m. UTC | #5
On Wed, Apr 07, 2021 at 02:17:04PM -0700, Bill Wendling wrote:
> From: Sami Tolvanen <samitolvanen@google.com>
> 
> Enable the use of clang's Profile-Guided Optimization[1]. To generate a
> profile, the kernel is instrumented with PGO counters, a representative
> workload is run, and the raw profile data is collected from
> /sys/kernel/debug/pgo/profraw.
> 
> The raw profile data must be processed by clang's "llvm-profdata" tool
> before it can be used during recompilation:
> 
>   $ cp /sys/kernel/debug/pgo/profraw vmlinux.profraw
>   $ llvm-profdata merge --output=vmlinux.profdata vmlinux.profraw
> 
> Multiple raw profiles may be merged during this step.
> 
> The data can now be used by the compiler:
> 
>   $ make LLVM=1 KCFLAGS=-fprofile-use=vmlinux.profdata ...
> 
> This initial submission is restricted to x86, as that's the platform we
> know works. This restriction can be lifted once other platforms have
> been verified to work with PGO.
> 
> Note that this method of profiling the kernel is clang-native, unlike
> the clang support in kernel/gcov.
> 
> [1] https://clang.llvm.org/docs/UsersManual.html#profile-guided-optimization
> 
> Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
> Co-developed-by: Bill Wendling <morbo@google.com>
> Signed-off-by: Bill Wendling <morbo@google.com>
> Tested-by: Nick Desaulniers <ndesaulniers@google.com>
> Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
> Reviewed-by: Fangrui Song <maskray@google.com>
> ---
> v9: - [maskray] Remove explicit 'ALIGN' and 'KEEP' from PGO variables in
>       vmlinux.lds.h.
> v8: - Rebased on top-of-tree.
> v7: - [sedat.dilek] Fix minor build failure.
> v6: - Add better documentation about the locking scheme and other things.
>     - Rename macros to better match the same macros in LLVM's source code.
> v5: - [natechancellor] Correct padding calculation.
> v4: - [ndesaulniers] Remove non-x86 Makefile changes and use "hweight64" instead
>       of using our own popcount implementation.
> v3: - [sedat.dilek] Added change log section.
> v2: - [natechancellor] Added "__llvm_profile_instrument_memop".
>     - [maskray] Corrected documentation, re PGO flags when using LTO.
> ---
>  Documentation/dev-tools/index.rst     |   1 +
>  Documentation/dev-tools/pgo.rst       | 127 +++++++++
>  MAINTAINERS                           |   9 +
>  Makefile                              |   3 +
>  arch/Kconfig                          |   1 +
>  arch/x86/Kconfig                      |   1 +
>  arch/x86/boot/Makefile                |   1 +
>  arch/x86/boot/compressed/Makefile     |   1 +
>  arch/x86/crypto/Makefile              |   4 +
>  arch/x86/entry/vdso/Makefile          |   1 +
>  arch/x86/kernel/vmlinux.lds.S         |   2 +
>  arch/x86/platform/efi/Makefile        |   1 +
>  arch/x86/purgatory/Makefile           |   1 +
>  arch/x86/realmode/rm/Makefile         |   1 +
>  arch/x86/um/vdso/Makefile             |   1 +
>  drivers/firmware/efi/libstub/Makefile |   1 +
>  include/asm-generic/vmlinux.lds.h     |  34 +++
>  kernel/Makefile                       |   1 +
>  kernel/pgo/Kconfig                    |  35 +++
>  kernel/pgo/Makefile                   |   5 +
>  kernel/pgo/fs.c                       | 389 ++++++++++++++++++++++++++
>  kernel/pgo/instrument.c               | 189 +++++++++++++
>  kernel/pgo/pgo.h                      | 203 ++++++++++++++
>  scripts/Makefile.lib                  |  10 +
>  24 files changed, 1022 insertions(+)
>  create mode 100644 Documentation/dev-tools/pgo.rst
>  create mode 100644 kernel/pgo/Kconfig
>  create mode 100644 kernel/pgo/Makefile
>  create mode 100644 kernel/pgo/fs.c
>  create mode 100644 kernel/pgo/instrument.c
>  create mode 100644 kernel/pgo/pgo.h
> 
> diff --git a/Documentation/dev-tools/index.rst b/Documentation/dev-tools/index.rst
> index 1b1cf4f5c9d9..6a30cd98e6f9 100644
> --- a/Documentation/dev-tools/index.rst
> +++ b/Documentation/dev-tools/index.rst
> @@ -27,6 +27,7 @@ whole; patches welcome!
>     kgdb
>     kselftest
>     kunit/index
> +   pgo
>  
>  
>  .. only::  subproject and html
> diff --git a/Documentation/dev-tools/pgo.rst b/Documentation/dev-tools/pgo.rst
> new file mode 100644
> index 000000000000..b7f11d8405b7
> --- /dev/null
> +++ b/Documentation/dev-tools/pgo.rst
> @@ -0,0 +1,127 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +
> +===============================
> +Using PGO with the Linux kernel
> +===============================
> +
> +Clang's profiling kernel support (PGO_) enables profiling of the Linux kernel
> +when building with Clang. The profiling data is exported via the ``pgo``
> +debugfs directory.
> +
> +.. _PGO: https://clang.llvm.org/docs/UsersManual.html#profile-guided-optimization
> +
> +
> +Preparation
> +===========
> +
> +Configure the kernel with:
> +
> +.. code-block:: make
> +
> +   CONFIG_DEBUG_FS=y
> +   CONFIG_PGO_CLANG=y
> +
> +Note that kernels compiled with profiling flags will be significantly larger
> +and run slower.
> +
> +Profiling data will only become accessible once debugfs has been mounted:
> +
> +.. code-block:: sh
> +
> +   mount -t debugfs none /sys/kernel/debug
> +
> +
> +Customization
> +=============
> +
> +You can enable or disable profiling for individual files and directories by
> +adding a line similar to the following to the respective kernel Makefile:
> +
> +- For a single file (e.g. main.o)
> +
> +  .. code-block:: make
> +
> +     PGO_PROFILE_main.o := y
> +
> +- For all files in one directory
> +
> +  .. code-block:: make
> +
> +     PGO_PROFILE := y
> +
> +To exclude files from being profiled use
> +
> +  .. code-block:: make
> +
> +     PGO_PROFILE_main.o := n
> +
> +and
> +
> +  .. code-block:: make
> +
> +     PGO_PROFILE := n
> +
> +Only files which are linked to the main kernel image or are compiled as kernel
> +modules are supported by this mechanism.
> +
> +
> +Files
> +=====
> +
> +The PGO kernel support creates the following files in debugfs:
> +
> +``/sys/kernel/debug/pgo``
> +	Parent directory for all PGO-related files.
> +
> +``/sys/kernel/debug/pgo/reset``
> +	Global reset file: resets all profiling data to zero when written to.
> +
> +``/sys/kernel/debug/pgo/profraw``
> +	The raw PGO data that must be processed with ``llvm-profdata``.
> +
> +
> +Workflow
> +========
> +
> +The PGO kernel can be run on the host or test machines. The data, though,
> +should be analyzed with tools from the same Clang version that was used to
> +compile the kernel. Clang is tolerant of version skew, but it's easier to use
> +the same Clang version.
> +
> +The profiling data is useful for optimizing the kernel, analyzing coverage,
> +etc. Clang offers tools to perform these tasks.
> +
> +Here is an example workflow for profiling an instrumented kernel with PGO and
> +using the result to optimize the kernel:
> +
> +1) Install the kernel on the TEST machine.
> +
> +2) Reset the data counters right before running the load tests
> +
> +   .. code-block:: sh
> +
> +      $ echo 1 > /sys/kernel/debug/pgo/reset
> +
> +3) Run the load tests.
> +
> +4) Collect the raw profile data
> +
> +   .. code-block:: sh
> +
> +      $ cp -a /sys/kernel/debug/pgo/profraw /tmp/vmlinux.profraw
> +
> +5) (Optional) Download the raw profile data to the HOST machine.
> +
> +6) Process the raw profile data
> +
> +   .. code-block:: sh
> +
> +      $ llvm-profdata merge --output=vmlinux.profdata vmlinux.profraw
> +
> +   Note that multiple raw profile data files can be merged during this step.
> +
> +7) Rebuild the kernel using the profile data (PGO disabled)
> +
> +   .. code-block:: sh
> +
> +      $ make LLVM=1 KCFLAGS=-fprofile-use=vmlinux.profdata ...
> diff --git a/MAINTAINERS b/MAINTAINERS
> index c80ad735b384..742058188af2 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -14054,6 +14054,15 @@ S:	Maintained
>  F:	include/linux/personality.h
>  F:	include/uapi/linux/personality.h
>  
> +PGO BASED KERNEL PROFILING
> +M:	Sami Tolvanen <samitolvanen@google.com>
> +M:	Bill Wendling <wcw@google.com>
> +R:	Nathan Chancellor <natechancellor@gmail.com>
> +R:	Nick Desaulniers <ndesaulniers@google.com>
> +S:	Supported
> +F:	Documentation/dev-tools/pgo.rst
> +F:	kernel/pgo
> +
>  PHOENIX RC FLIGHT CONTROLLER ADAPTER
>  M:	Marcus Folkesson <marcus.folkesson@gmail.com>
>  L:	linux-input@vger.kernel.org
> diff --git a/Makefile b/Makefile
> index cc77fd45ca64..6450faceb137 100644
> --- a/Makefile
> +++ b/Makefile
> @@ -660,6 +660,9 @@ endif # KBUILD_EXTMOD
>  # Defaults to vmlinux, but the arch makefile usually adds further targets
>  all: vmlinux
>  
> +CFLAGS_PGO_CLANG := -fprofile-generate
> +export CFLAGS_PGO_CLANG
> +
>  CFLAGS_GCOV	:= -fprofile-arcs -ftest-coverage \
>  	$(call cc-option,-fno-tree-loop-im) \
>  	$(call cc-disable-warning,maybe-uninitialized,)
> diff --git a/arch/Kconfig b/arch/Kconfig
> index ecfd3520b676..afd082133e0a 100644
> --- a/arch/Kconfig
> +++ b/arch/Kconfig
> @@ -1191,6 +1191,7 @@ config ARCH_HAS_ELFCORE_COMPAT
>  	bool
>  
>  source "kernel/gcov/Kconfig"
> +source "kernel/pgo/Kconfig"
>  
>  source "scripts/gcc-plugins/Kconfig"
>  
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index 2792879d398e..62be93b199ff 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -99,6 +99,7 @@ config X86
>  	select ARCH_SUPPORTS_KMAP_LOCAL_FORCE_MAP	if NR_CPUS <= 4096
>  	select ARCH_SUPPORTS_LTO_CLANG		if X86_64
>  	select ARCH_SUPPORTS_LTO_CLANG_THIN	if X86_64
> +	select ARCH_SUPPORTS_PGO_CLANG		if X86_64
>  	select ARCH_USE_BUILTIN_BSWAP
>  	select ARCH_USE_QUEUED_RWLOCKS
>  	select ARCH_USE_QUEUED_SPINLOCKS
> diff --git a/arch/x86/boot/Makefile b/arch/x86/boot/Makefile
> index fe605205b4ce..383853e32f67 100644
> --- a/arch/x86/boot/Makefile
> +++ b/arch/x86/boot/Makefile
> @@ -71,6 +71,7 @@ KBUILD_AFLAGS	:= $(KBUILD_CFLAGS) -D__ASSEMBLY__
>  KBUILD_CFLAGS	+= $(call cc-option,-fmacro-prefix-map=$(srctree)/=)
>  KBUILD_CFLAGS	+= -fno-asynchronous-unwind-tables
>  GCOV_PROFILE := n
> +PGO_PROFILE := n
>  UBSAN_SANITIZE := n
>  
>  $(obj)/bzImage: asflags-y  := $(SVGA_MODE)
> diff --git a/arch/x86/boot/compressed/Makefile b/arch/x86/boot/compressed/Makefile
> index e0bc3988c3fa..ed12ab65f606 100644
> --- a/arch/x86/boot/compressed/Makefile
> +++ b/arch/x86/boot/compressed/Makefile
> @@ -54,6 +54,7 @@ CFLAGS_sev-es.o += -I$(objtree)/arch/x86/lib/
>  
>  KBUILD_AFLAGS  := $(KBUILD_CFLAGS) -D__ASSEMBLY__
>  GCOV_PROFILE := n
> +PGO_PROFILE := n
>  UBSAN_SANITIZE :=n
>  
>  KBUILD_LDFLAGS := -m elf_$(UTS_MACHINE)
> diff --git a/arch/x86/crypto/Makefile b/arch/x86/crypto/Makefile
> index b28e36b7c96b..4b2e9620c412 100644
> --- a/arch/x86/crypto/Makefile
> +++ b/arch/x86/crypto/Makefile
> @@ -4,6 +4,10 @@
>  
>  OBJECT_FILES_NON_STANDARD := y
>  
> +# Disable PGO for curve25519-x86_64. With PGO enabled, clang runs out of
> +# registers for some of the functions.
> +PGO_PROFILE_curve25519-x86_64.o := n
> +
>  obj-$(CONFIG_CRYPTO_TWOFISH_586) += twofish-i586.o
>  twofish-i586-y := twofish-i586-asm_32.o twofish_glue.o
>  obj-$(CONFIG_CRYPTO_TWOFISH_X86_64) += twofish-x86_64.o
> diff --git a/arch/x86/entry/vdso/Makefile b/arch/x86/entry/vdso/Makefile
> index 05c4abc2fdfd..f7421e44725a 100644
> --- a/arch/x86/entry/vdso/Makefile
> +++ b/arch/x86/entry/vdso/Makefile
> @@ -180,6 +180,7 @@ quiet_cmd_vdso = VDSO    $@
>  VDSO_LDFLAGS = -shared --hash-style=both --build-id=sha1 \
>  	$(call ld-option, --eh-frame-hdr) -Bsymbolic
>  GCOV_PROFILE := n
> +PGO_PROFILE := n
>  
>  quiet_cmd_vdso_and_check = VDSO    $@
>        cmd_vdso_and_check = $(cmd_vdso); $(cmd_vdso_check)
> diff --git a/arch/x86/kernel/vmlinux.lds.S b/arch/x86/kernel/vmlinux.lds.S
> index efd9e9ea17f2..f6cab2316c46 100644
> --- a/arch/x86/kernel/vmlinux.lds.S
> +++ b/arch/x86/kernel/vmlinux.lds.S
> @@ -184,6 +184,8 @@ SECTIONS
>  
>  	BUG_TABLE
>  
> +	PGO_CLANG_DATA
> +
>  	ORC_UNWIND_TABLE
>  
>  	. = ALIGN(PAGE_SIZE);
> diff --git a/arch/x86/platform/efi/Makefile b/arch/x86/platform/efi/Makefile
> index 84b09c230cbd..5f22b31446ad 100644
> --- a/arch/x86/platform/efi/Makefile
> +++ b/arch/x86/platform/efi/Makefile
> @@ -2,6 +2,7 @@
>  OBJECT_FILES_NON_STANDARD_efi_thunk_$(BITS).o := y
>  KASAN_SANITIZE := n
>  GCOV_PROFILE := n
> +PGO_PROFILE := n
>  
>  obj-$(CONFIG_EFI) 		+= quirks.o efi.o efi_$(BITS).o efi_stub_$(BITS).o
>  obj-$(CONFIG_EFI_MIXED)		+= efi_thunk_$(BITS).o
> diff --git a/arch/x86/purgatory/Makefile b/arch/x86/purgatory/Makefile
> index 95ea17a9d20c..36f20e99da0b 100644
> --- a/arch/x86/purgatory/Makefile
> +++ b/arch/x86/purgatory/Makefile
> @@ -23,6 +23,7 @@ targets += purgatory.ro purgatory.chk
>  
>  # Sanitizer, etc. runtimes are unavailable and cannot be linked here.
>  GCOV_PROFILE	:= n
> +PGO_PROFILE	:= n
>  KASAN_SANITIZE	:= n
>  UBSAN_SANITIZE	:= n
>  KCSAN_SANITIZE	:= n
> diff --git a/arch/x86/realmode/rm/Makefile b/arch/x86/realmode/rm/Makefile
> index 83f1b6a56449..21797192f958 100644
> --- a/arch/x86/realmode/rm/Makefile
> +++ b/arch/x86/realmode/rm/Makefile
> @@ -76,4 +76,5 @@ KBUILD_CFLAGS	:= $(REALMODE_CFLAGS) -D_SETUP -D_WAKEUP \
>  KBUILD_AFLAGS	:= $(KBUILD_CFLAGS) -D__ASSEMBLY__
>  KBUILD_CFLAGS	+= -fno-asynchronous-unwind-tables
>  GCOV_PROFILE := n
> +PGO_PROFILE := n
>  UBSAN_SANITIZE := n
> diff --git a/arch/x86/um/vdso/Makefile b/arch/x86/um/vdso/Makefile
> index 5943387e3f35..54f5768f5853 100644
> --- a/arch/x86/um/vdso/Makefile
> +++ b/arch/x86/um/vdso/Makefile
> @@ -64,6 +64,7 @@ quiet_cmd_vdso = VDSO    $@
>  
>  VDSO_LDFLAGS = -fPIC -shared -Wl,--hash-style=sysv
>  GCOV_PROFILE := n
> +PGO_PROFILE := n
>  
>  #
>  # Install the unstripped copy of vdso*.so listed in $(vdso-install-y).
> diff --git a/drivers/firmware/efi/libstub/Makefile b/drivers/firmware/efi/libstub/Makefile
> index c23466e05e60..724fb389bb9d 100644
> --- a/drivers/firmware/efi/libstub/Makefile
> +++ b/drivers/firmware/efi/libstub/Makefile
> @@ -42,6 +42,7 @@ KBUILD_CFLAGS := $(filter-out $(CC_FLAGS_SCS), $(KBUILD_CFLAGS))
>  KBUILD_CFLAGS := $(filter-out $(CC_FLAGS_LTO), $(KBUILD_CFLAGS))
>  
>  GCOV_PROFILE			:= n
> +PGO_PROFILE			:= n
>  # Sanitizer runtimes are unavailable and cannot be linked here.
>  KASAN_SANITIZE			:= n
>  KCSAN_SANITIZE			:= n
> diff --git a/include/asm-generic/vmlinux.lds.h b/include/asm-generic/vmlinux.lds.h
> index 0331d5d49551..b371857097e8 100644
> --- a/include/asm-generic/vmlinux.lds.h
> +++ b/include/asm-generic/vmlinux.lds.h
> @@ -329,6 +329,39 @@
>  #define DTPM_TABLE()
>  #endif
>  
> +#ifdef CONFIG_PGO_CLANG
> +#define PGO_CLANG_DATA							\
> +	__llvm_prf_data : AT(ADDR(__llvm_prf_data) - LOAD_OFFSET) {	\
> +		__llvm_prf_start = .;					\
> +		__llvm_prf_data_start = .;				\
> +		*(__llvm_prf_data)					\
> +		__llvm_prf_data_end = .;				\
> +	}								\
> +	__llvm_prf_cnts : AT(ADDR(__llvm_prf_cnts) - LOAD_OFFSET) {	\
> +		__llvm_prf_cnts_start = .;				\
> +		*(__llvm_prf_cnts)					\
> +		__llvm_prf_cnts_end = .;				\
> +	}								\
> +	__llvm_prf_names : AT(ADDR(__llvm_prf_names) - LOAD_OFFSET) {	\
> +		__llvm_prf_names_start = .;				\
> +		*(__llvm_prf_names)					\
> +		__llvm_prf_names_end = .;				\
> +	}								\
> +	__llvm_prf_vals : AT(ADDR(__llvm_prf_vals) - LOAD_OFFSET) {	\
> +		__llvm_prf_vals_start = .;				\
> +		*(__llvm_prf_vals)					\
> +		__llvm_prf_vals_end = .;				\
> +	}								\
> +	__llvm_prf_vnds : AT(ADDR(__llvm_prf_vnds) - LOAD_OFFSET) {	\
> +		__llvm_prf_vnds_start = .;				\
> +		*(__llvm_prf_vnds)					\
> +		__llvm_prf_vnds_end = .;				\
> +		__llvm_prf_end = .;					\
> +	}
> +#else
> +#define PGO_CLANG_DATA
> +#endif
> +
>  #define KERNEL_DTB()							\
>  	STRUCT_ALIGN();							\
>  	__dtb_start = .;						\
> @@ -1106,6 +1139,7 @@
>  		CONSTRUCTORS						\
>  	}								\
>  	BUG_TABLE							\
> +	PGO_CLANG_DATA
>  
>  #define INIT_TEXT_SECTION(inittext_align)				\
>  	. = ALIGN(inittext_align);					\
> diff --git a/kernel/Makefile b/kernel/Makefile
> index 320f1f3941b7..a2a23ef2b12f 100644
> --- a/kernel/Makefile
> +++ b/kernel/Makefile
> @@ -111,6 +111,7 @@ obj-$(CONFIG_BPF) += bpf/
>  obj-$(CONFIG_KCSAN) += kcsan/
>  obj-$(CONFIG_SHADOW_CALL_STACK) += scs.o
>  obj-$(CONFIG_HAVE_STATIC_CALL_INLINE) += static_call.o
> +obj-$(CONFIG_PGO_CLANG) += pgo/
>  
>  obj-$(CONFIG_PERF_EVENTS) += events/
>  
> diff --git a/kernel/pgo/Kconfig b/kernel/pgo/Kconfig
> new file mode 100644
> index 000000000000..76a640b6cf6e
> --- /dev/null
> +++ b/kernel/pgo/Kconfig
> @@ -0,0 +1,35 @@
> +# SPDX-License-Identifier: GPL-2.0-only
> +menu "Profile Guided Optimization (PGO) (EXPERIMENTAL)"
> +
> +config ARCH_SUPPORTS_PGO_CLANG
> +	bool
> +
> +config PGO_CLANG
> +	bool "Enable clang's PGO-based kernel profiling"
> +	depends on DEBUG_FS
> +	depends on ARCH_SUPPORTS_PGO_CLANG
> +	depends on CC_IS_CLANG && CLANG_VERSION >= 120000
> +	help
> +	  This option enables clang's PGO (Profile Guided Optimization) based
> +	  code profiling to better optimize the kernel.
> +
> +	  If unsure, say N.
> +
> +	  Run a representative workload for your application on a kernel
> +	  compiled with this option and download the raw profile file from
> +	  /sys/kernel/debug/pgo/profraw. This file needs to be processed with
> +	  llvm-profdata. It may be merged with other collected raw profiles.
> +
> +	  Copy the resulting profile file into vmlinux.profdata, and enable
> +	  KCFLAGS=-fprofile-use=vmlinux.profdata to produce an optimized
> +	  kernel.
> +
> +	  Note that a kernel compiled with profiling flags will be
> +	  significantly larger and run slower. Also be sure to exclude files
> +	  from profiling which are not linked to the kernel image to prevent
> +	  linker errors.
> +
> +	  Note that the debugfs filesystem has to be mounted to access
> +	  profiling data.
> +
> +endmenu
> diff --git a/kernel/pgo/Makefile b/kernel/pgo/Makefile
> new file mode 100644
> index 000000000000..41e27cefd9a4
> --- /dev/null
> +++ b/kernel/pgo/Makefile
> @@ -0,0 +1,5 @@
> +# SPDX-License-Identifier: GPL-2.0
> +GCOV_PROFILE	:= n
> +PGO_PROFILE	:= n
> +
> +obj-y	+= fs.o instrument.o
> diff --git a/kernel/pgo/fs.c b/kernel/pgo/fs.c
> new file mode 100644
> index 000000000000..1678df3b7d64
> --- /dev/null
> +++ b/kernel/pgo/fs.c
> @@ -0,0 +1,389 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Copyright (C) 2019 Google, Inc.
> + *
> + * Author:
> + *	Sami Tolvanen <samitolvanen@google.com>
> + *
> + * This software is licensed under the terms of the GNU General Public
> + * License version 2, as published by the Free Software Foundation, and
> + * may be copied, distributed, and modified under those terms.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + * GNU General Public License for more details.
> + *
> + */
> +
> +#define pr_fmt(fmt)	"pgo: " fmt
> +
> +#include <linux/kernel.h>
> +#include <linux/debugfs.h>
> +#include <linux/fs.h>
> +#include <linux/module.h>
> +#include <linux/slab.h>
> +#include <linux/vmalloc.h>
> +#include "pgo.h"
> +
> +static struct dentry *directory;
> +
> +struct prf_private_data {
> +	void *buffer;
> +	unsigned long size;
> +};
> +
> +/*
> + * Raw profile data format:
> + *
> + *	- llvm_prf_header
> + *	- __llvm_prf_data
> + *	- __llvm_prf_cnts
> + *	- __llvm_prf_names
> + *	- zero padding to 8 bytes
> + *	- for each llvm_prf_data in __llvm_prf_data:
> + *		- llvm_prf_value_data
> + *			- llvm_prf_value_record + site count array
> + *				- llvm_prf_value_node_data
> + *				...
> + *			...
> + *		...
> + */
> +
> +static void prf_fill_header(void **buffer)
> +{
> +	struct llvm_prf_header *header = *(struct llvm_prf_header **)buffer;
> +
> +#ifdef CONFIG_64BIT
> +	header->magic = LLVM_INSTR_PROF_RAW_MAGIC_64;
> +#else
> +	header->magic = LLVM_INSTR_PROF_RAW_MAGIC_32;
> +#endif
> +	header->version = LLVM_VARIANT_MASK_IR_PROF | LLVM_INSTR_PROF_RAW_VERSION;
> +	header->data_size = prf_data_count();
> +	header->padding_bytes_before_counters = 0;
> +	header->counters_size = prf_cnts_count();
> +	header->padding_bytes_after_counters = 0;
> +	header->names_size = prf_names_count();
> +	header->counters_delta = (u64)__llvm_prf_cnts_start;
> +	header->names_delta = (u64)__llvm_prf_names_start;
> +	header->value_kind_last = LLVM_INSTR_PROF_IPVK_LAST;
> +
> +	*buffer += sizeof(*header);
> +}
> +
> +/*
> + * Copy the source into the buffer, incrementing the pointer into buffer in the
> + * process.
> + */
> +static void prf_copy_to_buffer(void **buffer, void *src, unsigned long size)
> +{
> +	memcpy(*buffer, src, size);
> +	*buffer += size;
> +}
> +
> +static u32 __prf_get_value_size(struct llvm_prf_data *p, u32 *value_kinds)
> +{
> +	struct llvm_prf_value_node **nodes =
> +		(struct llvm_prf_value_node **)p->values;
> +	u32 kinds = 0;
> +	u32 size = 0;
> +	unsigned int kind;
> +	unsigned int n;
> +	unsigned int s = 0;
> +
> +	for (kind = 0; kind < ARRAY_SIZE(p->num_value_sites); kind++) {
> +		unsigned int sites = p->num_value_sites[kind];
> +
> +		if (!sites)
> +			continue;
> +
> +		/* Record + site count array */
> +		size += prf_get_value_record_size(sites);
> +		kinds++;
> +
> +		if (!nodes)
> +			continue;
> +
> +		for (n = 0; n < sites; n++) {
> +			u32 count = 0;
> +			struct llvm_prf_value_node *site = nodes[s + n];
> +
> +			while (site && ++count <= U8_MAX)
> +				site = site->next;
> +
> +			size += count *
> +				sizeof(struct llvm_prf_value_node_data);
> +		}
> +
> +		s += sites;
> +	}
> +
> +	if (size)
> +		size += sizeof(struct llvm_prf_value_data);
> +
> +	if (value_kinds)
> +		*value_kinds = kinds;
> +
> +	return size;
> +}
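
For anyone following the sizing arithmetic above, here is a minimal userspace
sketch of it, assuming a function with a single value kind. The struct and
function names (value_data, value_record, value_size_one_kind, and so on) are
hypothetical mirrors of the pgo.h definitions, not part of the patch, and the
sketch ignores the per-site U8_MAX cap on list length.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace mirrors of the pgo.h structures quoted in this thread. */
struct value_data {		/* struct llvm_prf_value_data */
	uint32_t total_size;
	uint32_t num_value_kinds;
};

struct value_record {		/* struct llvm_prf_value_record */
	uint32_t kind;
	uint32_t num_value_sites;
	uint8_t site_count_array[];
};

struct value_node_data {	/* struct llvm_prf_value_node_data */
	uint64_t value;
	uint64_t count;
};

/* Record header plus the site count array, rounded up to 8 bytes. */
size_t value_record_size(size_t sites)
{
	size_t counts = (sites + 7) / 8 * 8;

	return offsetof(struct value_record, site_count_array) + counts;
}

/*
 * Bytes needed for a function with a single value kind: 'sites' value
 * sites holding 'nodes' list nodes in total. Zero sites means nothing
 * is written, matching the "if (size)" check in the patch.
 */
size_t value_size_one_kind(size_t sites, size_t nodes)
{
	if (!sites)
		return 0;

	return sizeof(struct value_data) + value_record_size(sites) +
	       nodes * sizeof(struct value_node_data);
}

/* Quick self-check of the arithmetic. */
void value_size_selfcheck(void)
{
	assert(value_record_size(1) == 16);
	assert(value_record_size(9) == 24);
	assert(value_size_one_kind(3, 3) == 72);
}
```

With three sites and three nodes in total, this gives 8 (value data header)
+ 16 (record plus padded site counts) + 48 (three nodes) = 72 bytes.
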
> +
> +static u32 prf_get_value_size(void)
> +{
> +	u32 size = 0;
> +	struct llvm_prf_data *p;
> +
> +	for (p = __llvm_prf_data_start; p < __llvm_prf_data_end; p++)
> +		size += __prf_get_value_size(p, NULL);
> +
> +	return size;
> +}
> +
> +/* Serialize the value profiling data for one function. */
> +static void prf_serialize_value(struct llvm_prf_data *p, void **buffer)
> +{
> +	struct llvm_prf_value_data header;
> +	struct llvm_prf_value_node **nodes =
> +		(struct llvm_prf_value_node **)p->values;
> +	unsigned int kind;
> +	unsigned int n;
> +	unsigned int s = 0;
> +
> +	header.total_size = __prf_get_value_size(p, &header.num_value_kinds);
> +
> +	if (!header.num_value_kinds)
> +		/* Nothing to write. */
> +		return;
> +
> +	prf_copy_to_buffer(buffer, &header, sizeof(header));
> +
> +	for (kind = 0; kind < ARRAY_SIZE(p->num_value_sites); kind++) {
> +		struct llvm_prf_value_record *record;
> +		u8 *counts;
> +		unsigned int sites = p->num_value_sites[kind];
> +
> +		if (!sites)
> +			continue;
> +
> +		/* Profiling value record. */
> +		record = *(struct llvm_prf_value_record **)buffer;
> +		*buffer += prf_get_value_record_header_size();
> +
> +		record->kind = kind;
> +		record->num_value_sites = sites;
> +
> +		/* Site count array. */
> +		counts = *(u8 **)buffer;
> +		*buffer += prf_get_value_record_site_count_size(sites);
> +
> +		/*
> +		 * If we don't have nodes, we can skip updating the site count
> +		 * array, because the buffer is zero filled.
> +		 */
> +		if (!nodes)
> +			continue;
> +
> +		for (n = 0; n < sites; n++) {
> +			u32 count = 0;
> +			struct llvm_prf_value_node *site = nodes[s + n];
> +
> +			while (site && ++count <= U8_MAX) {
> +				prf_copy_to_buffer(buffer, site,
> +						   sizeof(struct llvm_prf_value_node_data));
> +				site = site->next;
> +			}
> +
> +			counts[n] = (u8)count;
> +		}
> +
> +		s += sites;
> +	}
> +}
> +
> +static void prf_serialize_values(void **buffer)
> +{
> +	struct llvm_prf_data *p;
> +
> +	for (p = __llvm_prf_data_start; p < __llvm_prf_data_end; p++)
> +		prf_serialize_value(p, buffer);
> +}
> +
> +static inline unsigned long prf_get_padding(unsigned long size)
> +{
> +	return 7 & (sizeof(u64) - size % sizeof(u64));
> +}
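
The padding expression is easy to misread, so here is a small userspace
sketch of it. The function body is copied from the patch with u64 swapped for
uint64_t; the self-check helper is hypothetical and only there for
illustration. The "7 &" mask matters for sizes that are already aligned:
without it they would get 8 bytes of padding instead of 0.

```c
#include <assert.h>
#include <stdint.h>

/* Userspace copy of the padding expression from the patch. */
unsigned long prf_get_padding(unsigned long size)
{
	return 7 & (sizeof(uint64_t) - size % sizeof(uint64_t));
}

/* Check the two properties the serializer relies on. */
void prf_padding_selfcheck(void)
{
	unsigned long size;

	for (size = 0; size <= 64; size++) {
		unsigned long pad = prf_get_padding(size);

		/* The padded size is always a multiple of 8 bytes. */
		assert((size + pad) % sizeof(uint64_t) == 0);
		/* The padding never reaches a full 8 bytes. */
		assert(pad < sizeof(uint64_t));
	}
}
```

For example, a 13-byte names section gets 3 bytes of padding, and an 8-byte
one gets none.
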
> +
> +static unsigned long prf_buffer_size(void)
> +{
> +	return sizeof(struct llvm_prf_header) +
> +			prf_data_size()	+
> +			prf_cnts_size() +
> +			prf_names_size() +
> +			prf_get_padding(prf_names_size()) +
> +			prf_get_value_size();
> +}
> +
> +/*
> + * Serialize the profiling data into a format LLVM's tools can understand.
> + * Note: caller *must* hold pgo_lock.
> + */
> +static int prf_serialize(struct prf_private_data *p)
> +{
> +	int err = 0;
> +	void *buffer;
> +
> +	p->size = prf_buffer_size();
> +	p->buffer = vzalloc(p->size);
> +
> +	if (!p->buffer) {
> +		err = -ENOMEM;
> +		goto out;
> +	}
> +
> +	buffer = p->buffer;
> +
> +	prf_fill_header(&buffer);
> +	prf_copy_to_buffer(&buffer, __llvm_prf_data_start,  prf_data_size());
> +	prf_copy_to_buffer(&buffer, __llvm_prf_cnts_start,  prf_cnts_size());
> +	prf_copy_to_buffer(&buffer, __llvm_prf_names_start, prf_names_size());
> +	buffer += prf_get_padding(prf_names_size());
> +
> +	prf_serialize_values(&buffer);
> +
> +out:
> +	return err;
> +}
> +
> +/* open() implementation for PGO. Creates a copy of the profiling data set. */
> +static int prf_open(struct inode *inode, struct file *file)
> +{
> +	struct prf_private_data *data;
> +	unsigned long flags;
> +	int err;
> +
> +	data = kzalloc(sizeof(*data), GFP_KERNEL);
> +	if (!data) {
> +		err = -ENOMEM;
> +		goto out;
> +	}
> +
> +	flags = prf_lock();
> +
> +	err = prf_serialize(data);
> +	if (unlikely(err)) {
> +		kfree(data);
> +		goto out_unlock;
> +	}
> +
> +	file->private_data = data;
> +
> +out_unlock:
> +	prf_unlock(flags);
> +out:
> +	return err;
> +}
> +
> +/* read() implementation for PGO. */
> +static ssize_t prf_read(struct file *file, char __user *buf, size_t count,
> +			loff_t *ppos)
> +{
> +	struct prf_private_data *data = file->private_data;
> +
> +	BUG_ON(!data);

I've changed this to:

	if (WARN_ON_ONCE(!data))
		return -ENOMEM;
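
To illustrate the difference: BUG_ON() takes the machine down on a missing
private_data pointer, while the replacement warns once and fails only the
read. Below is a minimal userspace sketch of that pattern. The helper
warn_on_once_sketch() is a hypothetical stand-in for the kernel's
WARN_ON_ONCE(), which also prints a backtrace, and prf_read_sketch() only
mirrors the check at the top of prf_read().

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Same layout as struct prf_private_data in the patch. */
struct prf_private_data {
	void *buffer;
	unsigned long size;
};

/*
 * Hypothetical stand-in for WARN_ON_ONCE(): report the condition the
 * first time it is true, and always return the condition.
 */
bool warn_on_once_sketch(bool cond, bool *warned, const char *what)
{
	if (cond && !*warned) {
		*warned = true;
		fprintf(stderr, "WARNING: %s\n", what);
	}
	return cond;
}

/* Mirrors the check at the top of prf_read(); returns size on success. */
long prf_read_sketch(const struct prf_private_data *data)
{
	static bool warned;

	if (warn_on_once_sketch(!data, &warned, "!data"))
		return -ENOMEM;

	return (long)data->size;
}

/* The error is returned every time; the warning is printed only once. */
void prf_read_sketch_selfcheck(void)
{
	struct prf_private_data d = { NULL, 16 };

	assert(prf_read_sketch(NULL) == -ENOMEM);
	assert(prf_read_sketch(NULL) == -ENOMEM);
	assert(prf_read_sketch(&d) == 16);
}
```
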

> +
> +	return simple_read_from_buffer(buf, count, ppos, data->buffer,
> +				       data->size);
> +}
> +
> +/* release() implementation for PGO. Release resources allocated by open(). */
> +static int prf_release(struct inode *inode, struct file *file)
> +{
> +	struct prf_private_data *data = file->private_data;
> +
> +	if (data) {
> +		vfree(data->buffer);
> +		kfree(data);
> +	}
> +
> +	return 0;
> +}
> +
> +static const struct file_operations prf_fops = {
> +	.owner		= THIS_MODULE,
> +	.open		= prf_open,
> +	.read		= prf_read,
> +	.llseek		= default_llseek,
> +	.release	= prf_release
> +};
> +
> +/* write() implementation for resetting PGO's profile data. */
> +static ssize_t reset_write(struct file *file, const char __user *addr,
> +			   size_t len, loff_t *pos)
> +{
> +	struct llvm_prf_data *data;
> +
> +	memset(__llvm_prf_cnts_start, 0, prf_cnts_size());
> +
> +	for (data = __llvm_prf_data_start; data < __llvm_prf_data_end; data++) {
> +		struct llvm_prf_value_node **vnodes;
> +		u64 current_vsite_count;
> +		u32 i;
> +
> +		if (!data->values)
> +			continue;
> +
> +		current_vsite_count = 0;
> +		vnodes = (struct llvm_prf_value_node **)data->values;
> +
> +		for (i = LLVM_INSTR_PROF_IPVK_FIRST; i <= LLVM_INSTR_PROF_IPVK_LAST; i++)
> +			current_vsite_count += data->num_value_sites[i];
> +
> +		for (i = 0; i < current_vsite_count; i++) {
> +			struct llvm_prf_value_node *current_vnode = vnodes[i];
> +
> +			while (current_vnode) {
> +				current_vnode->count = 0;
> +				current_vnode = current_vnode->next;
> +			}
> +		}
> +	}
> +
> +	return len;
> +}
> +
> +static const struct file_operations prf_reset_fops = {
> +	.owner		= THIS_MODULE,
> +	.write		= reset_write,
> +	.llseek		= noop_llseek,
> +};
> +
> +/* Create debugfs entries. */
> +static int __init pgo_init(void)
> +{
> +	directory = debugfs_create_dir("pgo", NULL);
> +	if (!directory)
> +		goto err_remove;
> +
> +	if (!debugfs_create_file("profraw", 0600, directory, NULL,
> +				 &prf_fops))
> +		goto err_remove;
> +
> +	if (!debugfs_create_file("reset", 0200, directory, NULL,
> +				 &prf_reset_fops))
> +		goto err_remove;
> +
> +	return 0;
> +
> +err_remove:
> +	pr_err("initialization failed\n");
> +	return -EIO;
> +}
> +
> +/* Remove debugfs entries. */
> +static void __exit pgo_exit(void)
> +{
> +	debugfs_remove_recursive(directory);
> +}
> +
> +module_init(pgo_init);
> +module_exit(pgo_exit);
> diff --git a/kernel/pgo/instrument.c b/kernel/pgo/instrument.c
> new file mode 100644
> index 000000000000..464b3bc77431
> --- /dev/null
> +++ b/kernel/pgo/instrument.c
> @@ -0,0 +1,189 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Copyright (C) 2019 Google, Inc.
> + *
> + * Author:
> + *	Sami Tolvanen <samitolvanen@google.com>
> + *
> + * This software is licensed under the terms of the GNU General Public
> + * License version 2, as published by the Free Software Foundation, and
> + * may be copied, distributed, and modified under those terms.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + * GNU General Public License for more details.
> + *
> + */
> +
> +#define pr_fmt(fmt)	"pgo: " fmt
> +
> +#include <linux/bitops.h>
> +#include <linux/kernel.h>
> +#include <linux/export.h>
> +#include <linux/spinlock.h>
> +#include <linux/types.h>
> +#include "pgo.h"
> +
> +/*
> + * This lock guards both profile count updating and serialization of the
> + * profiling data. Keeping both of these activities separate via locking
> + * ensures that we don't try to serialize data that's only partially updated.
> + */
> +static DEFINE_SPINLOCK(pgo_lock);
> +static int current_node;
> +
> +unsigned long prf_lock(void)
> +{
> +	unsigned long flags;
> +
> +	spin_lock_irqsave(&pgo_lock, flags);
> +
> +	return flags;
> +}
> +
> +void prf_unlock(unsigned long flags)
> +{
> +	spin_unlock_irqrestore(&pgo_lock, flags);
> +}
> +
> +/*
> + * Return a newly allocated profiling value node which contains the tracked
> + * value by the value profiler.
> + * Note: caller *must* hold pgo_lock.
> + */
> +static struct llvm_prf_value_node *allocate_node(struct llvm_prf_data *p,
> +						 u32 index, u64 value)
> +{
> +	if (&__llvm_prf_vnds_start[current_node + 1] >= __llvm_prf_vnds_end)
> +		return NULL; /* Out of nodes */
> +
> +	current_node++;
> +
> +	/* Make sure the node is entirely within the section */
> +	if (&__llvm_prf_vnds_start[current_node] >= __llvm_prf_vnds_end ||
> +	    &__llvm_prf_vnds_start[current_node + 1] > __llvm_prf_vnds_end)
> +		return NULL;
> +
> +	return &__llvm_prf_vnds_start[current_node];
> +}
> +
> +/*
> + * Counts the number of times a target value is seen.
> + *
> + * Records the target value for the index if not seen before. Otherwise,
> + * increments the counter associated w/ the target value.
> + */
> +void __llvm_profile_instrument_target(u64 target_value, void *data, u32 index);

I've moved each of these declarations to the pgo.h file so both W=1
and checkpatch.pl stay happy.

> +void __llvm_profile_instrument_target(u64 target_value, void *data, u32 index)
> +{
> +	struct llvm_prf_data *p = (struct llvm_prf_data *)data;
> +	struct llvm_prf_value_node **counters;
> +	struct llvm_prf_value_node *curr;
> +	struct llvm_prf_value_node *min = NULL;
> +	struct llvm_prf_value_node *prev = NULL;
> +	u64 min_count = U64_MAX;
> +	u8 values = 0;
> +	unsigned long flags;
> +
> +	if (!p || !p->values)
> +		return;
> +
> +	counters = (struct llvm_prf_value_node **)p->values;
> +	curr = counters[index];
> +
> +	while (curr) {
> +		if (target_value == curr->value) {
> +			curr->count++;
> +			return;
> +		}
> +
> +		if (curr->count < min_count) {
> +			min_count = curr->count;
> +			min = curr;
> +		}
> +
> +		prev = curr;
> +		curr = curr->next;
> +		values++;
> +	}
> +
> +	if (values >= LLVM_INSTR_PROF_MAX_NUM_VAL_PER_SITE) {
> +		if (!min->count || !(--min->count)) {
> +			curr = min;
> +			curr->value = target_value;
> +			curr->count++;
> +		}
> +		return;
> +	}
> +
> +	/* Lock when updating the value node structure. */
> +	flags = prf_lock();
> +
> +	curr = allocate_node(p, index, target_value);
> +	if (!curr)
> +		goto out;
> +
> +	curr->value = target_value;
> +	curr->count++;
> +
> +	if (!counters[index])
> +		counters[index] = curr;
> +	else if (prev && !prev->next)
> +		prev->next = curr;
> +
> +out:
> +	prf_unlock(flags);
> +}
> +EXPORT_SYMBOL(__llvm_profile_instrument_target);
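
To make the update policy above easier to follow, here is a minimal
userspace sketch of it. This is not the patch's code: the names
(record_value, pool_alloc, MAX_VALS_PER_SITE, POOL_SIZE) are hypothetical,
the per-site limit is shrunk from 255 to 2 so eviction is easy to trigger,
a static array stands in for the vnds section, and locking is omitted.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_VALS_PER_SITE 2	/* hypothetical; the patch uses 255 */
#define POOL_SIZE 8		/* hypothetical stand-in for the vnds section */

struct value_node {
	uint64_t value;
	uint64_t count;
	struct value_node *next;
};

/* Zero-initialized, so fresh nodes start with count 0 and next NULL. */
static struct value_node pool[POOL_SIZE];
static size_t pool_used;

static struct value_node *pool_alloc(void)
{
	if (pool_used >= POOL_SIZE)
		return NULL;	/* out of nodes */
	return &pool[pool_used++];
}

/* Record one observation of 'target' at the value site '*site'. */
static void record_value(struct value_node **site, uint64_t target)
{
	struct value_node *curr = *site, *prev = NULL, *min = NULL;
	uint64_t min_count = UINT64_MAX;
	unsigned int values = 0;

	/* Known value: bump its counter. Also track the least-seen node. */
	while (curr) {
		if (curr->value == target) {
			curr->count++;
			return;
		}
		if (curr->count < min_count) {
			min_count = curr->count;
			min = curr;
		}
		prev = curr;
		curr = curr->next;
		values++;
	}

	/*
	 * Site is full: decay the least-seen node and reuse it for the new
	 * value only once its count has dropped to zero.
	 */
	if (values >= MAX_VALS_PER_SITE) {
		if (!min->count || !(--min->count)) {
			min->value = target;
			min->count++;
		}
		return;
	}

	/* Room left: append a fresh node to the end of the list. */
	curr = pool_alloc();
	if (!curr)
		return;
	curr->value = target;
	curr->count = 1;
	if (prev)
		prev->next = curr;
	else
		*site = curr;
}
```

With the limit at 2, a third distinct value first decrements the
least-frequent entry and only replaces it once that entry's count reaches
zero, so a frequently seen value is not displaced by a one-off.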
> +
> +/* Counts the number of times a range of target values is seen. */
> +void __llvm_profile_instrument_range(u64 target_value, void *data,
> +				     u32 index, s64 precise_start,
> +				     s64 precise_last, s64 large_value);
> +void __llvm_profile_instrument_range(u64 target_value, void *data,
> +				     u32 index, s64 precise_start,
> +				     s64 precise_last, s64 large_value)
> +{
> +	if (large_value != S64_MIN && (s64)target_value >= large_value)
> +		target_value = large_value;
> +	else if ((s64)target_value < precise_start ||
> +		 (s64)target_value > precise_last)
> +		target_value = precise_last + 1;
> +
> +	__llvm_profile_instrument_target(target_value, data, index);
> +}
> +EXPORT_SYMBOL(__llvm_profile_instrument_range);
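
The clamping step above can be isolated into a small pure function. The
following is only a userspace sketch under that assumption;
clamp_range_value is a hypothetical name, and INT64_MIN plays the role of
S64_MIN as the "no large-value bucket" sentinel.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Map a raw target value to the value that gets recorded:
 *  - values at or above large_value collapse into large_value
 *    (unless large_value is the INT64_MIN sentinel),
 *  - values outside [precise_start, precise_last] collapse into
 *    precise_last + 1,
 *  - everything else is recorded as is.
 */
static uint64_t clamp_range_value(uint64_t target, int64_t precise_start,
				  int64_t precise_last, int64_t large_value)
{
	if (large_value != INT64_MIN && (int64_t)target >= large_value)
		return (uint64_t)large_value;
	if ((int64_t)target < precise_start || (int64_t)target > precise_last)
		return (uint64_t)precise_last + 1;
	return target;
}
```

For example, with a precise range of [0, 8] and a large-value bucket at
1024, the value 100 is recorded as 9 and the value 2000 as 1024.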
> +
> +static u64 inst_prof_get_range_rep_value(u64 value)
> +{
> +	if (value <= 8)
> +		/* The first ranges are individually tracked, use it as is. */
> +		return value;
> +	else if (value >= 513)
> +		/* The last range is mapped to its lowest value. */
> +		return 513;
> +	else if (hweight64(value) == 1)
> +		/* If it's a power of two, use it as is. */
> +		return value;
> +
> +	/* Otherwise, take the previous power of two + 1. */
> +	return ((u64)1 << (64 - __builtin_clzll(value) - 1)) + 1;
> +}
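
A userspace sketch of the same bucketing may help when checking it against
InstrProfData.inc. It assumes a GCC- or clang-compatible compiler for the
two builtins, and range_rep_value is a hypothetical name;
__builtin_popcountll stands in for the kernel's hweight64().

```c
#include <assert.h>
#include <stdint.h>

/* Map a memop size to the representative value of its range. */
static uint64_t range_rep_value(uint64_t value)
{
	if (value <= 8)
		return value;		/* 0..8 are tracked individually */
	if (value >= 513)
		return 513;		/* last range maps to its lowest value */
	if (__builtin_popcountll(value) == 1)
		return value;		/* exact powers of two are kept */

	/* Otherwise: previous power of two, plus one. */
	return ((uint64_t)1 << (64 - __builtin_clzll(value) - 1)) + 1;
}
```

So 9..15 all map to 9, 17..31 map to 17, and 257..511 map to 257, while
16, 32, ..., 512 each keep their own bucket.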
> +
> +/*
> + * The target values are partitioned into multiple ranges. The range spec is
> + * defined in compiler-rt/include/profile/InstrProfData.inc.
> + */
> +void __llvm_profile_instrument_memop(u64 target_value, void *data,
> +				     u32 counter_index);
> +void __llvm_profile_instrument_memop(u64 target_value, void *data,
> +				     u32 counter_index)
> +{
> +	u64 rep_value;
> +
> +	/* Map the target value to the representative value of its range. */
> +	rep_value = inst_prof_get_range_rep_value(target_value);
> +	__llvm_profile_instrument_target(rep_value, data, counter_index);
> +}
> +EXPORT_SYMBOL(__llvm_profile_instrument_memop);
> diff --git a/kernel/pgo/pgo.h b/kernel/pgo/pgo.h
> new file mode 100644
> index 000000000000..ddc8d3002fe5
> --- /dev/null
> +++ b/kernel/pgo/pgo.h
> @@ -0,0 +1,203 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +/*
> + * Copyright (C) 2019 Google, Inc.
> + *
> + * Author:
> + *	Sami Tolvanen <samitolvanen@google.com>
> + *
> + * This software is licensed under the terms of the GNU General Public
> + * License version 2, as published by the Free Software Foundation, and
> + * may be copied, distributed, and modified under those terms.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + * GNU General Public License for more details.
> + *
> + */
> +
> +#ifndef _PGO_H
> +#define _PGO_H
> +
> +/*
> + * Note: These internal LLVM definitions must match the compiler version.
> + * See llvm/include/llvm/ProfileData/InstrProfData.inc in LLVM's source code.
> + */
> +
> +#define LLVM_INSTR_PROF_RAW_MAGIC_64	\
> +		((u64)255 << 56 |	\
> +		 (u64)'l' << 48 |	\
> +		 (u64)'p' << 40 |	\
> +		 (u64)'r' << 32 |	\
> +		 (u64)'o' << 24 |	\
> +		 (u64)'f' << 16 |	\
> +		 (u64)'r' << 8  |	\
> +		 (u64)129)
> +#define LLVM_INSTR_PROF_RAW_MAGIC_32	\
> +		((u64)255 << 56 |	\
> +		 (u64)'l' << 48 |	\
> +		 (u64)'p' << 40 |	\
> +		 (u64)'r' << 32 |	\
> +		 (u64)'o' << 24 |	\
> +		 (u64)'f' << 16 |	\
> +		 (u64)'R' << 8  |	\
> +		 (u64)129)
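
For reference, a minimal userspace sketch of how these two magic values are
assembled; the helper names are hypothetical and not part of the patch.
Each byte is shifted into place, most significant first, and the 32-bit
variant differs from the 64-bit one only in the uppercase 'R'.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: pack eight bytes, most significant byte first. */
static uint64_t pack_magic(const unsigned char b[8])
{
	uint64_t v = 0;

	for (int i = 0; i < 8; i++)
		v = (v << 8) | b[i];
	return v;
}

/* Byte sequence of LLVM_INSTR_PROF_RAW_MAGIC_64 as written in pgo.h. */
static uint64_t raw_magic_64(void)
{
	const unsigned char b[8] = { 255, 'l', 'p', 'r', 'o', 'f', 'r', 129 };

	return pack_magic(b);
}

/* Byte sequence of LLVM_INSTR_PROF_RAW_MAGIC_32 as written in pgo.h. */
static uint64_t raw_magic_32(void)
{
	const unsigned char b[8] = { 255, 'l', 'p', 'r', 'o', 'f', 'R', 129 };

	return pack_magic(b);
}
```

In hex these come out as 0xff6c70726f667281 and 0xff6c70726f665281.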
> +
> +#define LLVM_INSTR_PROF_RAW_VERSION		5
> +#define LLVM_INSTR_PROF_DATA_ALIGNMENT		8
> +#define LLVM_INSTR_PROF_IPVK_FIRST		0
> +#define LLVM_INSTR_PROF_IPVK_LAST		1
> +#define LLVM_INSTR_PROF_MAX_NUM_VAL_PER_SITE	255
> +
> +#define LLVM_VARIANT_MASK_IR_PROF	(0x1ULL << 56)
> +#define LLVM_VARIANT_MASK_CSIR_PROF	(0x1ULL << 57)
> +
> +/**
> + * struct llvm_prf_header - represents the raw profile header data structure.
> + * @magic: the magic token for the file format.
> + * @version: the version of the file format.
> + * @data_size: the number of entries in the profile data section.
> + * @padding_bytes_before_counters: the number of padding bytes before the
> + *   counters.
> + * @counters_size: the size in bytes of the LLVM profile section containing the
> + *   counters.
> + * @padding_bytes_after_counters: the number of padding bytes after the
> + *   counters.
> + * @names_size: the size in bytes of the LLVM profile section containing the
> + *   counters' names.
> + * @counters_delta: the beginning of the LLVM profile counters section.
> + * @names_delta: the beginning of the LLVM profile names section.
> + * @value_kind_last: the last profile value kind.
> + */
> +struct llvm_prf_header {
> +	u64 magic;
> +	u64 version;
> +	u64 data_size;
> +	u64 padding_bytes_before_counters;
> +	u64 counters_size;
> +	u64 padding_bytes_after_counters;
> +	u64 names_size;
> +	u64 counters_delta;
> +	u64 names_delta;
> +	u64 value_kind_last;
> +};
> +
> +/**
> + * struct llvm_prf_data - represents the per-function control structure.
> + * @name_ref: the reference to the function's name.
> + * @func_hash: the hash value of the function.
> + * @counter_ptr: a pointer to the profile counter.
> + * @function_ptr: a pointer to the function.
> + * @values: the profiling values associated with this function.
> + * @num_counters: the number of counters in the function.
> + * @num_value_sites: the number of value profile sites.
> + */
> +struct llvm_prf_data {
> +	const u64 name_ref;
> +	const u64 func_hash;
> +	const void *counter_ptr;
> +	const void *function_ptr;
> +	void *values;
> +	const u32 num_counters;
> +	const u16 num_value_sites[LLVM_INSTR_PROF_IPVK_LAST + 1];
> +} __aligned(LLVM_INSTR_PROF_DATA_ALIGNMENT);
> +
> +/**
> + * struct llvm_prf_value_node_data - represents the data part of the struct
> + *   llvm_prf_value_node data structure.
> + * @value: the value counters.
> + * @count: the counters' count.
> + */
> +struct llvm_prf_value_node_data {
> +	u64 value;
> +	u64 count;
> +};
> +
> +/**
> + * struct llvm_prf_value_node - represents an internal data structure used by
> + *   the value profiler.
> + * @value: the value counters.
> + * @count: the counters' count.
> + * @next: the next value node.
> + */
> +struct llvm_prf_value_node {
> +	u64 value;
> +	u64 count;
> +	struct llvm_prf_value_node *next;
> +};
> +
> +/**
> + * struct llvm_prf_value_data - represents the value profiling data in indexed
> + *   format.
> + * @total_size: the total size in bytes including this field.
> + * @num_value_kinds: the number of value profile kinds that have value profile
> + *   data.
> + */
> +struct llvm_prf_value_data {
> +	u32 total_size;
> +	u32 num_value_kinds;
> +};
> +
> +/**
> + * struct llvm_prf_value_record - represents the on-disk layout of the value
> + *   profile data of a particular kind for one function.
> + * @kind: the kind of the value profile record.
> + * @num_value_sites: the number of value profile sites.
> + * @site_count_array: the first element of the array that stores the number
> + *   of profiled values for each value site.
> + */
> +struct llvm_prf_value_record {
> +	u32 kind;
> +	u32 num_value_sites;
> +	u8 site_count_array[];
> +};
> +
> +#define prf_get_value_record_header_size()		\
> +	offsetof(struct llvm_prf_value_record, site_count_array)
> +#define prf_get_value_record_site_count_size(sites)	\
> +	roundup((sites), 8)
> +#define prf_get_value_record_size(sites)		\
> +	(prf_get_value_record_header_size() +		\
> +	 prf_get_value_record_site_count_size((sites)))
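
As a quick sanity check of the three size macros above, here is a userspace
sketch with the same arithmetic. The struct mirrors
llvm_prf_value_record, but the names are hypothetical and plain functions
replace the macros.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Mirrors the layout of struct llvm_prf_value_record. */
struct value_record {
	uint32_t kind;
	uint32_t num_value_sites;
	uint8_t site_count_array[];	/* one byte per value site */
};

/* Fixed part of the record: everything before the flexible array. */
static size_t record_header_size(void)
{
	return offsetof(struct value_record, site_count_array);
}

/* The per-site count bytes are padded up to a multiple of 8. */
static size_t site_count_size(size_t sites)
{
	return ((sites + 7) / 8) * 8;
}

static size_t record_size(size_t sites)
{
	return record_header_size() + site_count_size(sites);
}
```

The header is 8 bytes, so a record with 1 to 8 sites occupies 16 bytes and
a record with 9 sites occupies 24.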
> +
> +/* Data sections */
> +extern struct llvm_prf_data __llvm_prf_data_start[];
> +extern struct llvm_prf_data __llvm_prf_data_end[];
> +
> +extern u64 __llvm_prf_cnts_start[];
> +extern u64 __llvm_prf_cnts_end[];
> +
> +extern char __llvm_prf_names_start[];
> +extern char __llvm_prf_names_end[];
> +
> +extern struct llvm_prf_value_node __llvm_prf_vnds_start[];
> +extern struct llvm_prf_value_node __llvm_prf_vnds_end[];
> +
> +/* Locking for vnodes */
> +extern unsigned long prf_lock(void);
> +extern void prf_unlock(unsigned long flags);
> +
> +#define __DEFINE_PRF_SIZE(s) \
> +	static inline unsigned long prf_ ## s ## _size(void)		\
> +	{								\
> +		unsigned long start =					\
> +			(unsigned long)__llvm_prf_ ## s ## _start;	\
> +		unsigned long end =					\
> +			(unsigned long)__llvm_prf_ ## s ## _end;	\
> +		return roundup(end - start,				\
> +				sizeof(__llvm_prf_ ## s ## _start[0]));	\
> +	}								\
> +	static inline unsigned long prf_ ## s ## _count(void)		\
> +	{								\
> +		return prf_ ## s ## _size() /				\
> +			sizeof(__llvm_prf_ ## s ## _start[0]);		\
> +	}
> +
> +__DEFINE_PRF_SIZE(data);
> +__DEFINE_PRF_SIZE(cnts);
> +__DEFINE_PRF_SIZE(names);
> +__DEFINE_PRF_SIZE(vnds);
> +
> +#undef __DEFINE_PRF_SIZE
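
What each generated prf_<s>_size()/prf_<s>_count() pair computes can be
sketched in userspace as follows. The function names are hypothetical, and
plain integers stand in for the linker-provided section bounds.

```c
#include <assert.h>
#include <stddef.h>

/* Same arithmetic as the kernel's roundup() for positive integers. */
static size_t round_up_to(size_t x, size_t step)
{
	return ((x + step - 1) / step) * step;
}

/* Section size in bytes, rounded up to a whole number of elements. */
static size_t section_size(size_t start, size_t end, size_t elem_size)
{
	return round_up_to(end - start, elem_size);
}

/* Number of elements in the (rounded-up) section. */
static size_t section_count(size_t start, size_t end, size_t elem_size)
{
	return section_size(start, end, elem_size) / elem_size;
}
```

A section spanning 20 bytes with 8-byte elements is therefore reported as
24 bytes, or 3 elements.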
> +
> +#endif /* _PGO_H */
> diff --git a/scripts/Makefile.lib b/scripts/Makefile.lib
> index 8cd67b1b6d15..d411e92dd0d6 100644
> --- a/scripts/Makefile.lib
> +++ b/scripts/Makefile.lib
> @@ -139,6 +139,16 @@ _c_flags += $(if $(patsubst n%,, \
>  		$(CFLAGS_GCOV))
>  endif
>  
> +#
> +# Enable clang's PGO profiling flags for a file or directory depending on
> +# variables PGO_PROFILE_obj.o and PGO_PROFILE.
> +#
> +ifeq ($(CONFIG_PGO_CLANG),y)
> +_c_flags += $(if $(patsubst n%,, \
> +		$(PGO_PROFILE_$(basetarget).o)$(PGO_PROFILE)y), \
> +		$(CFLAGS_PGO_CLANG))
> +endif
> +
>  #
>  # Enable address sanitizer flags for kernel except some files or directories
>  # we don't want to check (depends on variables KASAN_SANITIZE_obj.o, KASAN_SANITIZE)
> -- 
> 2.31.0.208.g409f899ff0-goog
> 

I've added this patch to my -next tree now:

https://git.kernel.org/pub/scm/linux/kernel/git/kees/linux.git/commit/?h=for-next/clang/pgo&id=e1af496cbe9b4517428601a4e44fee3602dd3c15

Thanks!

-Kees
Bill Wendling May 22, 2021, 11:51 p.m. UTC | #6
On Wed, May 19, 2021 at 2:37 PM Kees Cook <keescook@chromium.org> wrote:
>
> I've added this patch to my -next tree now:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/kees/linux.git/commit/?h=for-next/clang/pgo&id=e1af496cbe9b4517428601a4e44fee3602dd3c15
>
> Thanks!
> Kees Cook

Thank you!

-bw
Nathan Chancellor May 31, 2021, 9:12 p.m. UTC | #7
On Wed, May 19, 2021 at 02:37:26PM -0700, Kees Cook wrote:
> I've added this patch to my -next tree now:
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/kees/linux.git/commit/?h=for-next/clang/pgo&id=e1af496cbe9b4517428601a4e44fee3602dd3c15
> 

Would this be appropriate to send? Someone sent some patches based on
this work so it would be nice to solidify how they will get to Linus
if/when the time comes :)

https://lore.kernel.org/r/20210528200133.459022-1-jarmo.tiitto@gmail.com/
https://lore.kernel.org/r/20210528200432.459120-1-jarmo.tiitto@gmail.com/
https://lore.kernel.org/r/20210528200821.459214-1-jarmo.tiitto@gmail.com/
https://lore.kernel.org/r/20210528201006.459292-1-jarmo.tiitto@gmail.com/
https://lore.kernel.org/r/20210528201107.459362-1-jarmo.tiitto@gmail.com/
https://lore.kernel.org/r/20210528201213.459483-1-jarmo.tiitto@gmail.com/

Cheers,
Nathan

======================================

diff --git a/MAINTAINERS b/MAINTAINERS
index c45613c30803..0d03f6ccdb70 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -14378,9 +14378,13 @@ F:	include/uapi/linux/personality.h
 PGO BASED KERNEL PROFILING
 M:	Sami Tolvanen <samitolvanen@google.com>
 M:	Bill Wendling <wcw@google.com>
+M:	Kees Cook <keescook@chromium.org>
 R:	Nathan Chancellor <nathan@kernel.org>
 R:	Nick Desaulniers <ndesaulniers@google.com>
+L:	clang-built-linux@googlegroups.com
 S:	Supported
+B:	https://github.com/ClangBuiltLinux/linux/issues
+T:	git git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux.git for-next/clang/pgo
 F:	Documentation/dev-tools/pgo.rst
 F:	kernel/pgo/
Nick Desaulniers June 1, 2021, 5:31 p.m. UTC | #8
On Mon, May 31, 2021 at 2:12 PM Nathan Chancellor <nathan@kernel.org> wrote:
> Would this be appropriate to send?
>
>
> diff --git a/MAINTAINERS b/MAINTAINERS
> index c45613c30803..0d03f6ccdb70 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -14378,9 +14378,13 @@ F:     include/uapi/linux/personality.h
>  PGO BASED KERNEL PROFILING
>  M:     Sami Tolvanen <samitolvanen@google.com>
>  M:     Bill Wendling <wcw@google.com>
> +M:     Kees Cook <keescook@chromium.org>
>  R:     Nathan Chancellor <nathan@kernel.org>
>  R:     Nick Desaulniers <ndesaulniers@google.com>
> +L:     clang-built-linux@googlegroups.com
>  S:     Supported
> +B:     https://github.com/ClangBuiltLinux/linux/issues
> +T:     git git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux.git for-next/clang/pgo
>  F:     Documentation/dev-tools/pgo.rst
>  F:     kernel/pgo/
>

I think so.
Acked-by: Nick Desaulniers <ndesaulniers@google.com>
Peter Zijlstra June 12, 2021, 4:59 p.m. UTC | #9
On Wed, Apr 07, 2021 at 02:17:04PM -0700, Bill Wendling wrote:
> From: Sami Tolvanen <samitolvanen@google.com>
> 
> Enable the use of clang's Profile-Guided Optimization[1]. To generate a
> profile, the kernel is instrumented with PGO counters, a representative
> workload is run, and the raw profile data is collected from
> /sys/kernel/debug/pgo/profraw.
> 
> The raw profile data must be processed by clang's "llvm-profdata" tool
> before it can be used during recompilation:
> 
>   $ cp /sys/kernel/debug/pgo/profraw vmlinux.profraw
>   $ llvm-profdata merge --output=vmlinux.profdata vmlinux.profraw
> 
> Multiple raw profiles may be merged during this step.
> 
> The data can now be used by the compiler:
> 
>   $ make LLVM=1 KCFLAGS=-fprofile-use=vmlinux.profdata ...
> 
> This initial submission is restricted to x86, as that's the platform we
> know works. This restriction can be lifted once other platforms have
> been verified to work with PGO.

*sigh*, and not a single x86 person on Cc, how nice :-/

> Note that this method of profiling the kernel is clang-native, unlike
> the clang support in kernel/gcov.
> 
> [1] https://clang.llvm.org/docs/UsersManual.html#profile-guided-optimization

Also, and I don't see this answered *anywhere*, why are you not using
perf for this? Your link even mentions Sampling Profilers (and I happen
to know there's been significant effort to make perf output work as
input for the PGO passes of the various compilers).

> Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
> Co-developed-by: Bill Wendling <morbo@google.com>
> Signed-off-by: Bill Wendling <morbo@google.com>
> Tested-by: Nick Desaulniers <ndesaulniers@google.com>
> Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
> Reviewed-by: Fangrui Song <maskray@google.com>
> ---
>  Documentation/dev-tools/index.rst     |   1 +
>  Documentation/dev-tools/pgo.rst       | 127 +++++++++
>  MAINTAINERS                           |   9 +
>  Makefile                              |   3 +
>  arch/Kconfig                          |   1 +
>  arch/x86/Kconfig                      |   1 +
>  arch/x86/boot/Makefile                |   1 +
>  arch/x86/boot/compressed/Makefile     |   1 +
>  arch/x86/crypto/Makefile              |   4 +
>  arch/x86/entry/vdso/Makefile          |   1 +
>  arch/x86/kernel/vmlinux.lds.S         |   2 +
>  arch/x86/platform/efi/Makefile        |   1 +
>  arch/x86/purgatory/Makefile           |   1 +
>  arch/x86/realmode/rm/Makefile         |   1 +
>  arch/x86/um/vdso/Makefile             |   1 +
>  drivers/firmware/efi/libstub/Makefile |   1 +
>  include/asm-generic/vmlinux.lds.h     |  34 +++
>  kernel/Makefile                       |   1 +
>  kernel/pgo/Kconfig                    |  35 +++
>  kernel/pgo/Makefile                   |   5 +
>  kernel/pgo/fs.c                       | 389 ++++++++++++++++++++++++++
>  kernel/pgo/instrument.c               | 189 +++++++++++++
>  kernel/pgo/pgo.h                      | 203 ++++++++++++++
>  scripts/Makefile.lib                  |  10 +
>  24 files changed, 1022 insertions(+)
>  create mode 100644 Documentation/dev-tools/pgo.rst
>  create mode 100644 kernel/pgo/Kconfig
>  create mode 100644 kernel/pgo/Makefile
>  create mode 100644 kernel/pgo/fs.c
>  create mode 100644 kernel/pgo/instrument.c
>  create mode 100644 kernel/pgo/pgo.h

> --- a/Makefile
> +++ b/Makefile
> @@ -660,6 +660,9 @@ endif # KBUILD_EXTMOD
>  # Defaults to vmlinux, but the arch makefile usually adds further targets
>  all: vmlinux
>  
> +CFLAGS_PGO_CLANG := -fprofile-generate
> +export CFLAGS_PGO_CLANG
> +
>  CFLAGS_GCOV	:= -fprofile-arcs -ftest-coverage \
>  	$(call cc-option,-fno-tree-loop-im) \
>  	$(call cc-disable-warning,maybe-uninitialized,)

And which of the many flags in noinstr disables this?

Basically I would like to NAK this whole thing until someone can
adequately explain the interaction with noinstr and why we need those
many lines of kernel code and can't simply use perf for this.
Bill Wendling June 12, 2021, 5:25 p.m. UTC | #10
On Sat, Jun 12, 2021 at 9:59 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Wed, Apr 07, 2021 at 02:17:04PM -0700, Bill Wendling wrote:
> > From: Sami Tolvanen <samitolvanen@google.com>
> >
> > Enable the use of clang's Profile-Guided Optimization[1]. To generate a
> > profile, the kernel is instrumented with PGO counters, a representative
> > workload is run, and the raw profile data is collected from
> > /sys/kernel/debug/pgo/profraw.
> >
> > The raw profile data must be processed by clang's "llvm-profdata" tool
> > before it can be used during recompilation:
> >
> >   $ cp /sys/kernel/debug/pgo/profraw vmlinux.profraw
> >   $ llvm-profdata merge --output=vmlinux.profdata vmlinux.profraw
> >
> > Multiple raw profiles may be merged during this step.
> >
> > The data can now be used by the compiler:
> >
> >   $ make LLVM=1 KCFLAGS=-fprofile-use=vmlinux.profdata ...
> >
> > This initial submission is restricted to x86, as that's the platform we
> > know works. This restriction can be lifted once other platforms have
> > been verified to work with PGO.
>
> *sigh*, and not a single x86 person on Cc, how nice :-/
>
This tool is generic and, despite the fact that it's first enabled for
x86, it contains no x86-specific code. The reason we're restricting it
to x86 is because it's the platform we tested on.

> > Note that this method of profiling the kernel is clang-native, unlike
> > the clang support in kernel/gcov.
> >
> > [1] https://clang.llvm.org/docs/UsersManual.html#profile-guided-optimization
>
> Also, and I don't see this answered *anywhere*, why are you not using
> perf for this? Your link even mentions Sampling Profilers (and I happen
> to know there's been significant effort to make perf output work as
> input for the PGO passes of the various compilers).
>
Instrumentation-based (non-sampling) profiling gives us a better
context-sensitive profile, making PGO more impactful. It's also useful
for coverage, which sampling profiles cannot provide.

> > Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
> > Co-developed-by: Bill Wendling <morbo@google.com>
> > Signed-off-by: Bill Wendling <morbo@google.com>
> > Tested-by: Nick Desaulniers <ndesaulniers@google.com>
> > Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
> > Reviewed-by: Fangrui Song <maskray@google.com>
> > ---
> >  Documentation/dev-tools/index.rst     |   1 +
> >  Documentation/dev-tools/pgo.rst       | 127 +++++++++
> >  MAINTAINERS                           |   9 +
> >  Makefile                              |   3 +
> >  arch/Kconfig                          |   1 +
> >  arch/x86/Kconfig                      |   1 +
> >  arch/x86/boot/Makefile                |   1 +
> >  arch/x86/boot/compressed/Makefile     |   1 +
> >  arch/x86/crypto/Makefile              |   4 +
> >  arch/x86/entry/vdso/Makefile          |   1 +
> >  arch/x86/kernel/vmlinux.lds.S         |   2 +
> >  arch/x86/platform/efi/Makefile        |   1 +
> >  arch/x86/purgatory/Makefile           |   1 +
> >  arch/x86/realmode/rm/Makefile         |   1 +
> >  arch/x86/um/vdso/Makefile             |   1 +
> >  drivers/firmware/efi/libstub/Makefile |   1 +
> >  include/asm-generic/vmlinux.lds.h     |  34 +++
> >  kernel/Makefile                       |   1 +
> >  kernel/pgo/Kconfig                    |  35 +++
> >  kernel/pgo/Makefile                   |   5 +
> >  kernel/pgo/fs.c                       | 389 ++++++++++++++++++++++++++
> >  kernel/pgo/instrument.c               | 189 +++++++++++++
> >  kernel/pgo/pgo.h                      | 203 ++++++++++++++
> >  scripts/Makefile.lib                  |  10 +
> >  24 files changed, 1022 insertions(+)
> >  create mode 100644 Documentation/dev-tools/pgo.rst
> >  create mode 100644 kernel/pgo/Kconfig
> >  create mode 100644 kernel/pgo/Makefile
> >  create mode 100644 kernel/pgo/fs.c
> >  create mode 100644 kernel/pgo/instrument.c
> >  create mode 100644 kernel/pgo/pgo.h
>
> > --- a/Makefile
> > +++ b/Makefile
> > @@ -660,6 +660,9 @@ endif # KBUILD_EXTMOD
> >  # Defaults to vmlinux, but the arch makefile usually adds further targets
> >  all: vmlinux
> >
> > +CFLAGS_PGO_CLANG := -fprofile-generate
> > +export CFLAGS_PGO_CLANG
> > +
> >  CFLAGS_GCOV  := -fprofile-arcs -ftest-coverage \
> >       $(call cc-option,-fno-tree-loop-im) \
> >       $(call cc-disable-warning,maybe-uninitialized,)
>
> And which of the many flags in noinstr disables this?
>
These flags aren't used with PGO. So there's no need to disable them.

> Basically I would like to NAK this whole thing until someone can
> adequately explain the interaction with noinstr and why we need those
> many lines of kernel code and can't simply use perf for this.

-bw
Peter Zijlstra June 12, 2021, 6:15 p.m. UTC | #11
On Sat, Jun 12, 2021 at 10:25:57AM -0700, Bill Wendling wrote:
> On Sat, Jun 12, 2021 at 9:59 AM Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > On Wed, Apr 07, 2021 at 02:17:04PM -0700, Bill Wendling wrote:
> > > From: Sami Tolvanen <samitolvanen@google.com>
> > >
> > > Enable the use of clang's Profile-Guided Optimization[1]. To generate a
> > > profile, the kernel is instrumented with PGO counters, a representative
> > > workload is run, and the raw profile data is collected from
> > > /sys/kernel/debug/pgo/profraw.
> > >
> > > The raw profile data must be processed by clang's "llvm-profdata" tool
> > > before it can be used during recompilation:
> > >
> > >   $ cp /sys/kernel/debug/pgo/profraw vmlinux.profraw
> > >   $ llvm-profdata merge --output=vmlinux.profdata vmlinux.profraw
> > >
> > > Multiple raw profiles may be merged during this step.
> > >
> > > The data can now be used by the compiler:
> > >
> > >   $ make LLVM=1 KCFLAGS=-fprofile-use=vmlinux.profdata ...
> > >
> > > This initial submission is restricted to x86, as that's the platform we
> > > know works. This restriction can be lifted once other platforms have
> > > been verified to work with PGO.
> >
> > *sigh*, and not a single x86 person on Cc, how nice :-/
> >
> This tool is generic and, despite the fact that it's first enabled for
> x86, it contains no x86-specific code. The reason we're restricting it
> to x86 is because it's the platform we tested on.

You're modifying a lot of x86 files, you don't think it's good to let us
know?  Worse, afaict this -fprofile-generate changes code generation,
and we definitely want to know about that.

> > >  arch/x86/Kconfig                      |   1 +
> > >  arch/x86/boot/Makefile                |   1 +
> > >  arch/x86/boot/compressed/Makefile     |   1 +
> > >  arch/x86/crypto/Makefile              |   4 +
> > >  arch/x86/entry/vdso/Makefile          |   1 +
> > >  arch/x86/kernel/vmlinux.lds.S         |   2 +
> > >  arch/x86/platform/efi/Makefile        |   1 +
> > >  arch/x86/purgatory/Makefile           |   1 +
> > >  arch/x86/realmode/rm/Makefile         |   1 +
> > >  arch/x86/um/vdso/Makefile             |   1 +


> > > +CFLAGS_PGO_CLANG := -fprofile-generate
> > > +export CFLAGS_PGO_CLANG

> > And which of the many flags in noinstr disables this?
> >
> These flags aren't used with PGO. So there's no need to disable them.

Supposedly -fprofile-generate adds instrumentation to the generated
code. noinstr *MUST* disable that. If not, this is a complete
non-starter for x86.

> > Also, and I don't see this answered *anywhere*, why are you not using
> > perf for this? Your link even mentions Sampling Profilers (and I happen
> > to know there's been significant effort to make perf output work as
> > input for the PGO passes of the various compilers).
> >
> Instrumentation-based (non-sampling) profiling gives us a better
> context-sensitive profile, making PGO more impactful. It's also useful
> for coverage, which sampling profiles cannot provide.

We've got KCOV and GCOV support already. Coverage is also not an
argument mentioned anywhere else. Coverage can go pound sand, we really
don't need a third means of getting that.

Do you have actual numbers that back up the sampling vs instrumented
argument? Having the instrumentation will affect performance which can
skew the profile just the same.

Also, sampling tends to capture the hot spots very well.
Bill Wendling June 12, 2021, 7:10 p.m. UTC | #12
On Sat, Jun 12, 2021 at 11:15 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Sat, Jun 12, 2021 at 10:25:57AM -0700, Bill Wendling wrote:
> > On Sat, Jun 12, 2021 at 9:59 AM Peter Zijlstra <peterz@infradead.org> wrote:
> > >
> > > On Wed, Apr 07, 2021 at 02:17:04PM -0700, Bill Wendling wrote:
> > > > From: Sami Tolvanen <samitolvanen@google.com>
> > > >
> > > > Enable the use of clang's Profile-Guided Optimization[1]. To generate a
> > > > profile, the kernel is instrumented with PGO counters, a representative
> > > > workload is run, and the raw profile data is collected from
> > > > /sys/kernel/debug/pgo/profraw.
> > > >
> > > > The raw profile data must be processed by clang's "llvm-profdata" tool
> > > > before it can be used during recompilation:
> > > >
> > > >   $ cp /sys/kernel/debug/pgo/profraw vmlinux.profraw
> > > >   $ llvm-profdata merge --output=vmlinux.profdata vmlinux.profraw
> > > >
> > > > Multiple raw profiles may be merged during this step.
> > > >
> > > > The data can now be used by the compiler:
> > > >
> > > >   $ make LLVM=1 KCFLAGS=-fprofile-use=vmlinux.profdata ...
> > > >
> > > > This initial submission is restricted to x86, as that's the platform we
> > > > know works. This restriction can be lifted once other platforms have
> > > > been verified to work with PGO.
> > >
> > > *sigh*, and not a single x86 person on Cc, how nice :-/
> > >
> > This tool is generic and, despite the fact that it's first enabled for
> > x86, it contains no x86-specific code. The reason we're restricting it
> > to x86 is because it's the platform we tested on.
>
> You're modifying a lot of x86 files, you don't think it's good to let us
> know?  Worse, afaict this -fprofile-generate changes code generation,
> and we definitely want to know about that.
>
I got the list of people to add from scripts/get_maintainer.pl.
The files you list below are mostly Makefile changes, so it added
the kbuild maintainers and list. There's a small change to the linker
script to add the clang PGO data section, which is defined in
"include/asm-generic/vmlinux.lds.h". Using the initial "kernel/gcov"
implementation as a guideline
(2521f2c228ad750701ba4702484e31d876dbc386), there was one Intel person
CC'ed, but he didn't sign off on it. These patches have been available
for review for months now, posted to all of the lists and CC'ed to the
people from scripts/get_maintainer.pl. Perhaps that program should be
improved?

> > > >  arch/x86/Kconfig                      |   1 +
> > > >  arch/x86/boot/Makefile                |   1 +
> > > >  arch/x86/boot/compressed/Makefile     |   1 +
> > > >  arch/x86/crypto/Makefile              |   4 +
> > > >  arch/x86/entry/vdso/Makefile          |   1 +
> > > >  arch/x86/kernel/vmlinux.lds.S         |   2 +
> > > >  arch/x86/platform/efi/Makefile        |   1 +
> > > >  arch/x86/purgatory/Makefile           |   1 +
> > > >  arch/x86/realmode/rm/Makefile         |   1 +
> > > >  arch/x86/um/vdso/Makefile             |   1 +
>
>
> > > > +CFLAGS_PGO_CLANG := -fprofile-generate
> > > > +export CFLAGS_PGO_CLANG
>
> > > And which of the many flags in noinstr disables this?
> > >
> > These flags aren't used with PGO. So there's no need to disable them.
>
> Supposedly -fprofile-generate adds instrumentation to the generated
> code. noinstr *MUST* disable that. If not, this is a complete
> non-starter for x86.

"noinstr" has "notrace", which is defined as
"__attribute__((__no_instrument_function__))", which is honored by
both gcc and clang.

> > > Also, and I don't see this answered *anywhere*, why are you not using
> > > perf for this? Your link even mentions Sampling Profilers (and I happen
> > > to know there's been significant effort to make perf output work as
> > > input for the PGO passes of the various compilers).
> > >
> > Instruction-based (non-sampling) profiling gives us a better
> > context-sensitive profile, making PGO more impactful. It's also useful
> > for coverage whereas sampling profiles cannot.
>
> We've got KCOV and GCOV support already. Coverage is also not an
> argument mentioned anywhere else. Coverage can go pound sand, we really
> don't need a third means of getting that.
>
Those aren't useful for clang-based implementations. And I like to
look forward to potential improvements.

> Do you have actual numbers that back up the sampling vs instrumented
> argument? Having the instrumentation will affect performance which can
> skew the profile just the same.
>
Instrumentation counts the number of times a branch is taken. Sampling
is at a gross level, where if the sampling time is fine enough, you
can get an idea of where the hot spots are, but it won't give you the
fine-grained information that clang finds useful. Essentially, while
sampling can "capture the hot spots very well", relying solely on
sampling is basically leaving optimization on the floor.
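
To make the difference concrete, here is a deliberately contrived toy
model in plain userspace C. It is not kernel or LLVM code, and every
name in it is made up for illustration. It counts one branch exactly,
the way an instrumentation counter would, and then estimates the same
branch from periodic samples. Real sampling profilers randomize their
period and use hardware assistance, so treat this only as a sketch of
why an estimate can drift from the exact count.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical toy model: one conditional branch executed 'iters' times. */
struct branch_profile {
	uint64_t taken;		/* times the branch was taken */
	uint64_t not_taken;	/* times it fell through */
};

/* The branch under study: taken when i is a multiple of 3. */
static int branch_taken(uint64_t i)
{
	return i % 3 == 0;
}

/* Instrumentation: bump a counter on every execution, so counts are exact. */
struct branch_profile profile_instrumented(uint64_t iters)
{
	struct branch_profile p = { 0, 0 };

	for (uint64_t i = 0; i < iters; i++) {
		if (branch_taken(i))
			p.taken++;
		else
			p.not_taken++;
	}
	return p;
}

/* Sampling: observe only every 'period'-th execution, then scale up. */
struct branch_profile profile_sampled(uint64_t iters, uint64_t period)
{
	struct branch_profile p = { 0, 0 };

	assert(period > 0);
	for (uint64_t i = 0; i < iters; i += period) {
		if (branch_taken(i))
			p.taken += period;
		else
			p.not_taken += period;
	}
	return p;
}
```

With 9000 iterations the instrumented profile is exact (3000 taken,
6000 not taken). A sampling period of 7 lands close to that, while a
period of 6 aliases with the branch pattern and reports the branch as
always taken.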

Our optimizations experts here have determined, through data of
course, that instrumentation is the best option for PGO.

> Also, sampling tends to capture the hot spots very well.


-bw
Bill Wendling June 12, 2021, 7:28 p.m. UTC | #13
On Sat, Jun 12, 2021 at 12:10 PM Bill Wendling <morbo@google.com> wrote:
> On Sat, Jun 12, 2021 at 11:15 AM Peter Zijlstra <peterz@infradead.org> wrote:
> > On Sat, Jun 12, 2021 at 10:25:57AM -0700, Bill Wendling wrote:
> > > On Sat, Jun 12, 2021 at 9:59 AM Peter Zijlstra <peterz@infradead.org> wrote:
> > > >
> > > > On Wed, Apr 07, 2021 at 02:17:04PM -0700, Bill Wendling wrote:
> > > > > From: Sami Tolvanen <samitolvanen@google.com>
> > > > >
> > > > > Enable the use of clang's Profile-Guided Optimization[1]. To generate a
> > > > > profile, the kernel is instrumented with PGO counters, a representative
> > > > > workload is run, and the raw profile data is collected from
> > > > > /sys/kernel/debug/pgo/profraw.
> > > > >
> > > > > The raw profile data must be processed by clang's "llvm-profdata" tool
> > > > > before it can be used during recompilation:
> > > > >
> > > > >   $ cp /sys/kernel/debug/pgo/profraw vmlinux.profraw
> > > > >   $ llvm-profdata merge --output=vmlinux.profdata vmlinux.profraw
> > > > >
> > > > > Multiple raw profiles may be merged during this step.
> > > > >
> > > > > The data can now be used by the compiler:
> > > > >
> > > > >   $ make LLVM=1 KCFLAGS=-fprofile-use=vmlinux.profdata ...
> > > > >
> > > > > This initial submission is restricted to x86, as that's the platform we
> > > > > know works. This restriction can be lifted once other platforms have
> > > > > been verified to work with PGO.
> > > >
> > > > *sigh*, and not a single x86 person on Cc, how nice :-/
> > > >
> > > This tool is generic and, despite the fact that it's first enabled for
> > > x86, it contains no x86-specific code. The reason we're restricting it
> > > to x86 is because it's the platform we tested on.
> >
> > You're modifying a lot of x86 files, you don't think it's good to let us
> > know?  Worse, afaict this -fprofile-generate changes code generation,
> > and we definitely want to know about that.
> >
> I got the list of people to add from scripts/get_maintainer.pl.
> The files you list below are mostly changes in Makefiles, so it added
> the kbuild maintainers and list. There's a small change to the linker
> script to add the clang PGO data section, which is defined in
> "include/asm-generic/vmlinux.lds.h". Using the "kernel/gcov" initial
> implementation as a guideline
> (2521f2c228ad750701ba4702484e31d876dbc386), there's one Intel person
> CC'ed, but he didn't sign off on it. These patches were available for
> review for months now, and posted to all of the lists and CC'ed to the
> people from scripts/get_maintainer.pl. Perhaps that program should be
> improved?
>
Correction: I see now that it lists X86 maintainers. That was somehow
missed in my initial submission. Sorry about that. Please add any
reviewers you think are necessary.

-bw
Fangrui Song June 12, 2021, 8:20 p.m. UTC | #14
On 2021-06-12, Peter Zijlstra wrote:
>On Sat, Jun 12, 2021 at 10:25:57AM -0700, Bill Wendling wrote:
>> On Sat, Jun 12, 2021 at 9:59 AM Peter Zijlstra <peterz@infradead.org> wrote:
>> > Also, and I don't see this answered *anywhere*, why are you not using
>> > perf for this? Your link even mentions Sampling Profilers (and I happen
>> > to know there's been significant effort to make perf output work as
>> > input for the PGO passes of the various compilers).
>> >
>> Instruction-based (non-sampling) profiling gives us a better
>> context-sensitive profile, making PGO more impactful. It's also useful
>> for coverage whereas sampling profiles cannot.
>
>We've got KCOV and GCOV support already. Coverage is also not an
>argument mentioned anywhere else. Coverage can go pound sand, we really
>don't need a third means of getting that.
>
>Do you have actual numbers that back up the sampling vs instrumented
>argument? Having the instrumentation will affect performance which can
>skew the profile just the same.
>
>Also, sampling tends to capture the hot spots very well.

[I don't do kernel development. My experience is user-space toolchain.]

For applications, I think instrumentation based PGO can be 1%~4% faster
than sample-based PGO (e.g. AutoFDO) on x86.

Sample-based PGO has CPU requirement (e.g. Performance Monitoring Unit).
(my gut feeling is that there may be larger gap between instrumentation
based PGO and sample-based PGO for aarch64/ppc64, even though they can
use sample-based PGO.)
Instrumentation based PGO can be ported to more architectures.

In addition, having an infrastructure for instrumentation based PGO
makes it easy to deploy newer techniques like context-sensitive PGO
(just changed compile options; it doesn't need new source level
annotation).
Peter Zijlstra June 12, 2021, 8:25 p.m. UTC | #15
On Sat, Jun 12, 2021 at 12:10:03PM -0700, Bill Wendling wrote:
> > You're modifying a lot of x86 files, you don't think it's good to let us
> > know?  Worse, afaict this -fprofile-generate changes code generation,
> > and we definitely want to know about that.
> >
> I got the list of people to add from the scripts/get_maintainer.pl.

$ ./scripts/get_maintainer.pl -f arch/x86/Makefile
Thomas Gleixner <tglx@linutronix.de> (maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT))
Ingo Molnar <mingo@redhat.com> (maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT))
Borislav Petkov <bp@alien8.de> (maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT))
x86@kernel.org (maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT))

> there's one Intel person CC'ed, but he didn't sign off on it.

Intel does not employ the main x86 maintainers; even if it did, mailing
a random Google person wouldn't get the mail to you either, would it?

> These patches were available for review for months now,

Which doesn't help if you don't Cc the right people, does it. *nobody*
has time to read LKML.

> and posted to all of the lists and CC'ed to the people from
> scripts/get_maintainer.pl. Perhaps that program should be improved?

I suspect operator error, see above.

> > Supposedly -fprofile-generate adds instrumentation to the generated
> > code. noinstr *MUST* disable that. If not, this is a complete
> > non-starter for x86.
> 
> "noinstr" has "notrace", which is defined as
> "__attribute__((__no_instrument_function__))", which is honored by
> both gcc and clang.

Yes it is, but is that sufficient in this case? It very much isn't for
KASAN, UBSAN, and a whole host of other instrumentation crud. They all
needed their own 'bugger-off' attributes.

> > We've got KCOV and GCOV support already. Coverage is also not an
> > argument mentioned anywhere else. Coverage can go pound sand, we really
> > don't need a third means of getting that.
> >
> Those aren't useful for clang-based implementations. And I like to
> look forward to potential improvements.

I look forward to fewer things doing the same over and over. The obvious
solution is of course to make clang use what we have, not the other way
around.

> > Do you have actual numbers that back up the sampling vs instrumented
> > argument? Having the instrumentation will affect performance which can
> > skew the profile just the same.
> >
> Instrumentation counts the number of times a branch is taken. Sampling
> is at a gross level, where if the sampling time is fine enough, you
> can get an idea of where the hot spots are, but it won't give you the
> fine-grained information that clang finds useful. Essentially, while
> sampling can "capture the hot spots very well", relying solely on
> sampling is basically leaving optimization on the floor.
> 
> Our optimizations experts here have determined, through data of
> course, that instrumentation is the best option for PGO.

It would be very good to post some of that data and explicit examples.
Hear-say don't carry much weight.
Peter Zijlstra June 12, 2021, 8:31 p.m. UTC | #16
On Sat, Jun 12, 2021 at 01:20:15PM -0700, Fangrui Song wrote:

> For applications, I think instrumentation based PGO can be 1%~4% faster
> than sample-based PGO (e.g. AutoFDO) on x86.

Why? What specifically is missed by sample-based? I thought that LBR
augmented samples were very useful for exactly this.

> Sample-based PGO has CPU requirement (e.g. Performance Monitoring Unit).
> (my gut feeling is that there may be larger gap between instrumentation
> based PGO and sample-based PGO for aarch64/ppc64, even though they can
> use sample-based PGO.)
> Instrumentation based PGO can be ported to more architectures.

Every architecture that cares about performance had better have a
hardware PMU. Both argh64 and ppc64 have one.

> In addition, having an infrastructure for instrumentation based PGO
> makes it easy to deploy newer techniques like context-sensitive PGO
> (just changed compile options; it doesn't need new source level
> annotation).

What's this context sensitive stuff you speak of? The link provided
earlier is devoid of useful information.
Bill Wendling June 12, 2021, 8:56 p.m. UTC | #17
On Sat, Jun 12, 2021 at 1:25 PM Peter Zijlstra <peterz@infradead.org> wrote:
> On Sat, Jun 12, 2021 at 12:10:03PM -0700, Bill Wendling wrote:
> Yes it is, but is that sufficient in this case? It very much isn't for
> KASAN, UBSAN, and a whole host of other instrumentation crud. They all
> needed their own 'bugger-off' attributes.
>
> > > We've got KCOV and GCOV support already. Coverage is also not an
> > > argument mentioned anywhere else. Coverage can go pound sand, we really
> > > don't need a third means of getting that.
> > >
> > Those aren't useful for clang-based implementations. And I like to
> > look forward to potential improvements.
>
> I look forward to fewer things doing the same over and over. The obvious
> solution is of course to make clang use what we have, not the other way
> around.
>
That is not the obvious "solution".

> > > Do you have actual numbers that back up the sampling vs instrumented
> > > argument? Having the instrumentation will affect performance which can
> > > skew the profile just the same.
> > >
> > Instrumentation counts the number of times a branch is taken. Sampling
> > is at a gross level, where if the sampling time is fine enough, you
> > can get an idea of where the hot spots are, but it won't give you the
> > fine-grained information that clang finds useful. Essentially, while
> > sampling can "capture the hot spots very well", relying solely on
> > sampling is basically leaving optimization on the floor.
> >
> > Our optimizations experts here have determined, through data of
> > course, that instrumentation is the best option for PGO.
>
> It would be very good to post some of that data and explicit examples.
> Hear-say don't carry much weight.

Should I add measurements from waving a dead chicken over my keyboard?
I heard somewhere that that works as well. Or how about a feature that
hasn't been integrated yet, like using the perf tool apparently? I'm
sure that will be worth my time. You can't just come up with a
potential, unimplemented alternative (gcov is still a thing and not
using "perf") and expect people to dance to your tune.

I could give you numbers, but they would mean nothing to you, and I
suspect that you would reject them out of hand because it may not
benefit *everything*. The nature of FDO/PGO is that it's targeted to
specific tasks.

For example, Fangrui gave you numbers, and you rejected them out of
hand. I've explained to you why instrumentation is better than
sampling (at least for clang). Fangrui gave you numbers. Let's move on
to something else.

Now, for the "noinstr" issue. I'll see if we need an additional change for that.

-bw
Bill Wendling June 12, 2021, 10:47 p.m. UTC | #18
On Sat, Jun 12, 2021 at 1:56 PM Bill Wendling <morbo@google.com> wrote:
> On Sat, Jun 12, 2021 at 1:25 PM Peter Zijlstra <peterz@infradead.org> wrote:
> > On Sat, Jun 12, 2021 at 12:10:03PM -0700, Bill Wendling wrote:
> > Yes it is, but is that sufficient in this case? It very much isn't for
> > KASAN, UBSAN, and a whole host of other instrumentation crud. They all
> > needed their own 'bugger-off' attributes.
> >
> Now, for the "noinstr" issue. I'll see if we need an additional change for that.
>
The GCOV implementation disables profiling in those directories where
instrumentation would fail. We do the same. Both clang and gcc seem to
treat the no_instrument_function attribute similarly.

-bw
Bill Wendling June 13, 2021, 6:07 p.m. UTC | #19
On Sat, Jun 12, 2021 at 3:47 PM Bill Wendling <morbo@google.com> wrote:
>
> On Sat, Jun 12, 2021 at 1:56 PM Bill Wendling <morbo@google.com> wrote:
> > On Sat, Jun 12, 2021 at 1:25 PM Peter Zijlstra <peterz@infradead.org> wrote:
> > > On Sat, Jun 12, 2021 at 12:10:03PM -0700, Bill Wendling wrote:
> > > Yes it is, but is that sufficient in this case? It very much isn't for
> > > KASAN, UBSAN, and a whole host of other instrumentation crud. They all
> > > needed their own 'bugger-off' attributes.
> > >
> > Now, for the "noinstr" issue. I'll see if we need an additional change for that.
> >
> The GCOV implementation disables profiling in those directories where
> instrumentation would fail. We do the same. Both clang and gcc seem to
> treat the no_instrument_function attribute similarly.
>
An example:

$ cat n.c
int g(int);

int __attribute__((__no_instrument_function__))
__attribute__((no_instrument_function))
no_instr(int a) {
  int sum = 0;
  for (int i = 0; i < a; i++)
    sum += g(i);
  return sum;
}

int instr(int a) {
  int sum = 0;
  for (int i = 0; i < a; i++)
    sum += g(i);
  return sum;
}

$ gcc -S -o - n.c -fprofile-arcs -ftest-coverage -O2
        .globl  no_instr
        .type   no_instr, @function
no_instr:
.LFB0:
 ...
        addq    $1, __gcov0.no_instr(%rip)
        pushq   %rbp
 ...
.L3:
 ...
        addq    $1, 8+__gcov0.no_instr(%rip)
 ...
        addq    $1, 16+__gcov0.no_instr(%rip)
 ...
        addq    $1, 16+__gcov0.no_instr(%rip)
 ...
        ret
        .globl  instr
        .type   instr, @function
instr:
.LFB1:
 ...
        addq    $1, __gcov0.instr(%rip)
 ...
        addq    $1, 8+__gcov0.instr(%rip)
 ...
        addq    $1, 16+__gcov0.instr(%rip)
 ...
        addq    $1, 16+__gcov0.instr(%rip)
 ...
        ret

$ clang -S -o - n.c -fprofile-generate -O2
        .globl  no_instr                        # -- Begin function no_instr
        .p2align        4, 0x90
        .type   no_instr,@function
no_instr:                               # @no_instr
 ...
        addq    $1, .L__profc_no_instr+8(%rip)
 ...
        movq    .L__profc_no_instr(%rip), %rax
 ...
        movq    %rax, .L__profc_no_instr(%rip)
 ...
        retq
        .globl  instr                           # -- Begin function instr
        .p2align        4, 0x90
        .type   instr,@function
instr:                                  # @instr
 ...
        addq    $1, .L__profc_instr+8(%rip)
 ...
        movq    .L__profc_instr(%rip), %rax
 ...
        movq    %rax, .L__profc_instr(%rip)
 ...
        retq
.Lfunc_end1:
Peter Zijlstra June 14, 2021, 7:51 a.m. UTC | #20
On Sat, Jun 12, 2021 at 01:56:41PM -0700, Bill Wendling wrote:
> For example, Fangrui gave you numbers, and you rejected them out of
> hand. I've explained to you why instrumentation is better than
> sampling (at least for clang). Fangrui gave you numbers. Let's move on
> to something else.

I did not dismiss them; I asked for clarification. I would like to
understand what exactly is missed by sampling based PGO data that makes
such a difference.
Peter Zijlstra June 14, 2021, 9:01 a.m. UTC | #21
On Sat, Jun 12, 2021 at 01:56:41PM -0700, Bill Wendling wrote:
> On Sat, Jun 12, 2021 at 1:25 PM Peter Zijlstra <peterz@infradead.org> wrote:
> > On Sat, Jun 12, 2021 at 12:10:03PM -0700, Bill Wendling wrote:
> > Yes it is, but is that sufficient in this case? It very much isn't for
> > KASAN, UBSAN, and a whole host of other instrumentation crud. They all
> > needed their own 'bugger-off' attributes.
> >
> > > > We've got KCOV and GCOV support already. Coverage is also not an
> > > > argument mentioned anywhere else. Coverage can go pound sand, we really
> > > > don't need a third means of getting that.
> > > >
> > > Those aren't useful for clang-based implementations. And I like to
> > > look forward to potential improvements.
> >
> > I look forward to fewer things doing the same over and over. The obvious
> > solution is of course to make clang use what we have, not the other way
> > around.
> >
> That is not the obvious "solution".

Because having GCOV, KCOV and PGO all do essentially the same thing
differently, makes heaps of sense?

I understand that the compilers actually generate radically different
instrumentation for the various cases, but essentially they're all
collecting (function/branch) arcs.

I'm thinking it might be about time to build _one_ infrastructure for
that and define a kernel arc format and call it a day.

Note that if your compiler does arcs with functions (like gcc, unlike
clang) we can also trivially augment the arcs with PMU counter data. I
once did that for userspace.
Bill Wendling June 14, 2021, 9:39 a.m. UTC | #22
On Mon, Jun 14, 2021 at 2:01 AM Peter Zijlstra <peterz@infradead.org> wrote:
> On Sat, Jun 12, 2021 at 01:56:41PM -0700, Bill Wendling wrote:
> > On Sat, Jun 12, 2021 at 1:25 PM Peter Zijlstra <peterz@infradead.org> wrote:
> > > On Sat, Jun 12, 2021 at 12:10:03PM -0700, Bill Wendling wrote:
> > > Yes it is, but is that sufficient in this case? It very much isn't for
> > > KASAN, UBSAN, and a whole host of other instrumentation crud. They all
> > > needed their own 'bugger-off' attributes.
> > >
> > > > > We've got KCOV and GCOV support already. Coverage is also not an
> > > > > argument mentioned anywhere else. Coverage can go pound sand, we really
> > > > > don't need a third means of getting that.
> > > > >
> > > > Those aren't useful for clang-based implementations. And I like to
> > > > look forward to potential improvements.
> > >
> > > I look forward to fewer things doing the same over and over. The obvious
> > > solution is of course to make clang use what we have, not the other way
> > > around.
> > >
> > That is not the obvious "solution".
>
> Because having GCOV, KCOV and PGO all do essentially the same thing
> differently, makes heaps of sense?
>
It does when you're dealing with one toolchain without access to another.

> I understand that the compilers actually generate radically different
> instrumentation for the various cases, but essentially they're all
> collecting (function/branch) arcs.
>
That's true, but there's no one format for profiling data that's
usable between all compilers. I'm not even sure there's a good way to
translate between, say, gcov and llvm's format. To make matters more
complicated, each compiler's format is tightly coupled to a specific
version of that compiler. And depending on *how* the data is collected
(e.g. sampling or instrumentation), it may not give us the full
benefit of FDO/PGO.

> I'm thinking it might be about time to build _one_ infrastructure for
> that and define a kernel arc format and call it a day.
>
That may be nice, but it's a rather large request.

> Note that if your compiler does arcs with functions (like gcc, unlike
> clang) we can also trivially augment the arcs with PMU counter data. I
> once did that for userspace.
Peter Zijlstra June 14, 2021, 9:43 a.m. UTC | #23
On Sun, Jun 13, 2021 at 11:07:26AM -0700, Bill Wendling wrote:

> > > Now, for the "noinstr" issue. I'll see if we need an additional change for that.
> > >
> > The GCOV implementation disables profiling in those directories where
> > instrumentation would fail. We do the same. Both clang and gcc seem to
> > treat the no_instrument_function attribute similarly.

Both seem to emit instrumentation, so they're both, similarly, *broken*.

noinstr *MUST* disable all compiler generated instrumentation. Also see:

  https://lkml.kernel.org/r/20210527194448.3470080-1-elver@google.com

I'll go mark GCOV support as BROKEN for x86.
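
One way a compiler could honor this is a dedicated per-function
attribute, the same approach that was needed for KASAN and UBSAN. A
minimal sketch, assuming a toolchain that implements
no_profile_instrument_function: GCC documents an attribute of that
name, and whether a given clang release has it needs checking, which is
why the sketch probes with __has_attribute. The sketch_* macro names
are hypothetical and local to this example; they are not the kernel's
definitions.

```c
#include <assert.h>

/* __has_attribute is provided by GCC and clang; fall back elsewhere. */
#ifndef __has_attribute
#define __has_attribute(x) 0
#endif

/* Hypothetical macro: expands to nothing if the attribute is unknown. */
#if __has_attribute(no_profile_instrument_function)
#define sketch_no_profile __attribute__((no_profile_instrument_function))
#else
#define sketch_no_profile
#endif

/* Hypothetical stand-in for a noinstr-like annotation. */
#define sketch_noinstr \
	__attribute__((no_instrument_function)) sketch_no_profile

/* If the attribute is honored, no profile counters are emitted here. */
sketch_noinstr int sum_to(int n)
{
	int sum = 0;

	assert(n >= 0);
	for (int i = 0; i < n; i++)
		sum += i;
	return sum;
}
```

Whether the attribute actually takes effect can only be confirmed by
inspecting the generated assembly, as in the gcc and clang listings
earlier in this thread.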
Peter Zijlstra June 14, 2021, 10:18 a.m. UTC | #24
On Mon, Jun 14, 2021 at 11:43:12AM +0200, Peter Zijlstra wrote:
> On Sun, Jun 13, 2021 at 11:07:26AM -0700, Bill Wendling wrote:
> 
> > > > Now, for the "noinstr" issue. I'll see if we need an additional change for that.
> > > >
> > > The GCOV implementation disables profiling in those directories where
> > > instrumentation would fail. We do the same. Both clang and gcc seem to
> > > treat the no_instrument_function attribute similarly.
> 
> Both seem to emit instrumentation, so they're both, similarly, *broken*.
> 
> noinstr *MUST* disable all compiler generated instrumentation. Also see:
> 
>   https://lkml.kernel.org/r/20210527194448.3470080-1-elver@google.com
> 
> I'll go mark GCOV support as BROKEN for x86.

https://lkml.kernel.org/r/YMcssV/n5IBGv4f0@hirez.programming.kicks-ass.net
Peter Zijlstra June 14, 2021, 10:44 a.m. UTC | #25
On Mon, Jun 14, 2021 at 02:39:41AM -0700, Bill Wendling wrote:
> On Mon, Jun 14, 2021 at 2:01 AM Peter Zijlstra <peterz@infradead.org> wrote:

> > Because having GCOV, KCOV and PGO all do essentially the same thing
> > differently, makes heaps of sense?
> >
> It does when you're dealing with one toolchain without access to another.

Here's a sekrit, don't tell anyone, but you can get a free copy of GCC
right here:

  https://gcc.gnu.org/

We also have this linux-toolchains list (Cc'ed now) that contains folks
from both sides.

> > I understand that the compilers actually generate radically different
> > instrumentation for the various cases, but essentially they're all
> > collecting (function/branch) arcs.
> >
> That's true, but there's no one format for profiling data that's
> usable between all compilers. I'm not even sure there's a good way to
> translate between, say, gcov and llvm's format. To make matters more
> complicated, each compiler's format is tightly coupled to a specific
> version of that compiler. And depending on *how* the data is collected
> (e.g. sampling or instrumentation), it may not give us the full
> benefit of FDO/PGO.

I'm thinking that something simple like:

struct arc {
	u64	from;
	u64	to;
	u64	nr;
	u64	cntrs[0];
};

goes a very long way. Stick a header on that says how large cntrs[] is,
and some other data (like load offset and whatnot) and you should be
good.
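
A rough userspace sketch of what such a layout could look like. The
header and its field names (nr_cntrs, nr_arcs, load_offset) are
assumptions made for this example, as are the helper functions; only
struct arc itself comes from the text above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t u64;

/* Hypothetical file header; these field names are illustrative only. */
struct arc_header {
	u64	nr_cntrs;	/* entries in each arc's cntrs[] */
	u64	nr_arcs;	/* number of arc records that follow */
	u64	load_offset;	/* e.g. offset to subtract from addresses */
};

struct arc {
	u64	from;
	u64	to;
	u64	nr;
	u64	cntrs[];	/* nr_cntrs entries, per the header */
};

/* Size in bytes of one record, given the header. */
size_t arc_record_size(const struct arc_header *hdr)
{
	return sizeof(struct arc) + hdr->nr_cntrs * sizeof(u64);
}

/* Return the i-th record in a buffer laid out as header + records. */
const struct arc *arc_at(const void *buf, u64 i)
{
	const struct arc_header *hdr = buf;
	const unsigned char *recs = (const unsigned char *)buf + sizeof(*hdr);

	assert(i < hdr->nr_arcs);
	return (const struct arc *)(recs + i * arc_record_size(hdr));
}

/* Sum the execution counts of all arcs leaving address 'from'. */
u64 arc_total_from(const void *buf, u64 from)
{
	const struct arc_header *hdr = buf;
	u64 total = 0;

	for (u64 i = 0; i < hdr->nr_arcs; i++) {
		const struct arc *a = arc_at(buf, i);

		if (a->from == from)
			total += a->nr;
	}
	return total;
}
```

Because the record size is derived from the header, a consumer that
does not understand the extra counters can still walk the arcs and
ignore cntrs[].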

Combine that with the executable image (say /proc/kcore) to recover
what's @from (call, jmp or conditional branch) and I'm thinking one
ought to be able to construct lots of useful data.

I've also been led to believe that the KCOV data format is not in fact
dependent on which toolchain is used.

> > I'm thinking it might be about time to build _one_ infrastructure for
> > that and define a kernel arc format and call it a day.
> >
> That may be nice, but it's a rather large request.

Given GCOV just died, perhaps you can look at what KCOV does and see if
that can be extended to do as you want. KCOV is actively used and
we actually tripped over all the fun little noinstr bugs at the time.
Bill Wendling June 14, 2021, 11:41 a.m. UTC | #26
On Mon, Jun 14, 2021 at 3:45 AM Peter Zijlstra <peterz@infradead.org> wrote:
> On Mon, Jun 14, 2021 at 02:39:41AM -0700, Bill Wendling wrote:
> > On Mon, Jun 14, 2021 at 2:01 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> > > Because having GCOV, KCOV and PGO all do essentially the same thing
> > > differently, makes heaps of sense?
> > >
> > It does when you're dealing with one toolchain without access to another.
>
> Here's a sekrit, don't tell anyone, but you can get a free copy of GCC
> right here:
>
>   https://gcc.gnu.org/
>
> We also have this linux-toolchains list (Cc'ed now) that contains folks
> from both sides.
>
Your sarcasm is not useful.

> > > I understand that the compilers actually generate radically different
> > > instrumentation for the various cases, but essentially they're all
> > > collecting (function/branch) arcs.
> > >
> > That's true, but there's no one format for profiling data that's
> > usable between all compilers. I'm not even sure there's a good way to
> > translate between, say, gcov and llvm's format. To make matters more
> > complicated, each compiler's format is tightly coupled to a specific
> > version of that compiler. And depending on *how* the data is collected
> > (e.g. sampling or instrumentation), it may not give us the full
> > benefit of FDO/PGO.
>
> I'm thinking that something simple like:
>
> struct arc {
>         u64     from;
>         u64     to;
>         u64     nr;
>         u64     cntrs[0];
> };
>
> goes a very long way. Stick a header on that says how large cntrs[] is,
> and some other data (like load offset and whatnot) and you should be
> good.
>
> Combine that with the executable image (say /proc/kcore) to recover
> what's @from (call, jmp or conditional branch) and I'm thinking one
> ought to be able to construct lots of useful data.
>
> I've also been led to believe that the KCOV data format is not in fact
> dependent on which toolchain is used.
>
> > > I'm thinking it might be about time to build _one_ infrastructure for
> > > that and define a kernel arc format and call it a day.
> > >
> > That may be nice, but it's a rather large request.
>
> Given GCOV just died, perhaps you can look at what KCOV does and see if
> that can be extended to do as you want. KCOV is actively used and
> we actually tripped over all the fun little noinstr bugs at the time.
Bill Wendling June 14, 2021, 11:43 a.m. UTC | #27
On Mon, Jun 14, 2021 at 3:45 AM Peter Zijlstra <peterz@infradead.org> wrote:
> On Mon, Jun 14, 2021 at 02:39:41AM -0700, Bill Wendling wrote:
> > On Mon, Jun 14, 2021 at 2:01 AM Peter Zijlstra <peterz@infradead.org> wrote:
> > > I understand that the compilers actually generate radically different
> > > instrumentation for the various cases, but essentially they're all
> > > collecting (function/branch) arcs.
> > >
> > That's true, but there's no one format for profiling data that's
> > usable between all compilers. I'm not even sure there's a good way to
> > translate between, say, gcov and llvm's format. To make matters more
> > complicated, each compiler's format is tightly coupled to a specific
> > version of that compiler. And depending on *how* the data is collected
> > (e.g. sampling or instrumentation), it may not give us the full
> > benefit of FDO/PGO.
>
> I'm thinking that something simple like:
>
> struct arc {
>         u64     from;
>         u64     to;
>         u64     nr;
>         u64     cntrs[0];
> };
>
> goes a very long way. Stick a header on that says how large cntrs[] is,
> and some other data (like load offset and whatnot) and you should be
> good.
>
> Combine that with the executable image (say /proc/kcore) to recover
> what's @from (call, jmp or conditional branch) and I'm thinking one
> ought to be able to construct lots of useful data.
>
> I've also been led to believe that the KCOV data format is not in fact
> dependent on which toolchain is used.
>
Awesome! I await your RFC on both the gcc and clang mailing lists.

-bw

> > > I'm thinking it might be about time to build _one_ infrastructure for
> > > that and define a kernel arc format and call it a day.
> > >
> > That may be nice, but it's a rather large request.
>
> Given GCOV just died, perhaps you can look at what KCOV does and see if
> that can be extended to do as you want. KCOV is actively used and
> we actually tripped over all the fun little noinstr bugs at the time.
Marco Elver June 14, 2021, 2:16 p.m. UTC | #28
On Mon, 14 Jun 2021 at 12:45, Peter Zijlstra <peterz@infradead.org> wrote:
[...]
> I've also been led to believe that the KCOV data format is not in fact
> dependent on which toolchain is used.

Correct, we use KCOV with both gcc and clang. Both gcc and clang emit
the same instrumentation for -fsanitize-coverage. Thus, the user-space
portion and interface is indeed identical:
https://www.kernel.org/doc/html/latest/dev-tools/kcov.html

> > > I'm thinking it might be about time to build _one_ infrastructure for
> > > that and define a kernel arc format and call it a day.
> > >
> > That may be nice, but it's a rather large request.
>
> Given GCOV just died, perhaps you can look at what KCOV does and see if
> that can be extended to do as you want. KCOV is actively used and
> we actually tripped over all the fun little noinstr bugs at the time.

There might be a subtle mismatch between coverage instrumentation for
testing/fuzzing and for profiling. (Disclaimer: I'm not too familiar
with Clang-PGO's requirements.) For example, while for testing/fuzzing
we may only require information if a code-path has been visited, for
profiling the "hotness" might be of interest. Therefore, the
user-space exported data format can make several trade-offs in
complexity.
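
As a toy illustration of that mismatch (plain userspace C, with all
names invented for this sketch and no relation to the real KCOV or
LLVM data formats), compare a "visited" flag per site with a hit
counter per site:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define NR_SITES 4	/* hypothetical number of instrumented code sites */

/* Coverage-style record: did execution ever reach this site? */
static bool visited[NR_SITES];

/* Profile-style record: how many times was this site reached? */
static uint64_t hits[NR_SITES];

/* Called at each instrumented site; updates both records for comparison. */
void record_site(size_t site)
{
	assert(site < NR_SITES);
	visited[site] = true;
	hits[site]++;
}

/* Number of distinct sites reached, which is what coverage reports. */
size_t sites_covered(void)
{
	size_t n = 0;

	for (size_t i = 0; i < NR_SITES; i++)
		n += visited[i];
	return n;
}

/* Hottest site: needs the counters, the visited flags cannot answer it. */
size_t hottest_site(void)
{
	size_t best = 0;

	for (size_t i = 1; i < NR_SITES; i++)
		if (hits[i] > hits[best])
			best = i;
	return best;
}
```

The flags are cheaper to maintain and enough for testing or fuzzing,
but only the counters carry the "hotness" a profile consumer would
want.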

In theory, I imagine there's a limit to how generic one could make
profiling information, because one compiler's optimizations are not
another compiler's optimizations. On the other hand, it may be doable
to collect unified profiling information for common stuff, but I guess
there's little motivation for figuring out the common ground given the
producer and consumer of the PGO data is the same compiler by design
(unlike coverage info for testing/fuzzing).

Therefore, if KCOV's exposed information does not match PGO's
requirements today, I'm not sure what realistically can be done
without turning KCOV into a monster. Because KCOV is optimized for
testing/fuzzing coverage, and I'm not sure how complex we can or want
to make it to cater to a new use-case.

My intuition is that the simpler design is to have 2 subsystems for
instrumentation-based coverage collection: one for testing/fuzzing,
and the other for profiling.

Alas, there's the problem of GCOV, which should be replaceable by KCOV
for most use cases. But it would be good to hear from a GCOV user if
there are some.

But as we learned GCOV is broken on x86 now, I see these options:

1. Remove GCOV, make KCOV the de-facto test-coverage collection
subsystem. Introduce PGO-instrumentation subsystem for profile
collection only, and make it _very_ clear that KCOV != PGO data as
hinted above. A pre-requisite is that compiler-support for PGO
instrumentation adds selective instrumentation support, likely just
making attribute no_instrument_function do the right thing.

2. Like (1) but also keep GCOV, given proper support for attribute
no_instrument_function would probably fix it (?).

3. Keep GCOV (and KCOV of course). Somehow extract PGO profiles from KCOV.

4. Somehow extract PGO profiles from GCOV, or modify kernel/gcov to do so.

Thanks.
Kees Cook June 14, 2021, 3:26 p.m. UTC | #29
On Mon, Jun 14, 2021 at 04:16:16PM +0200, 'Marco Elver' via Clang Built Linux wrote:
> On Mon, 14 Jun 2021 at 12:45, Peter Zijlstra <peterz@infradead.org> wrote:
> [...]
> > I've also been led to believe that the KCOV data format is not in fact
> > dependent on which toolchain is used.
> 
> Correct, we use KCOV with both gcc and clang. Both gcc and clang emit
> the same instrumentation for -fsanitize-coverage. Thus, the user-space
> portion and interface is indeed identical:
> https://www.kernel.org/doc/html/latest/dev-tools/kcov.html
> 
> > > > I'm thinking it might be about time to build _one_ infrastructure for
> > > > that and define a kernel arc format and call it a day.
> > > >
> > > That may be nice, but it's a rather large request.
> >
> > Given GCOV just died, perhaps you can look at what KCOV does and see if
> > that can be extended to do as you want. KCOV is actively used and
> > we actually tripped over all the fun little noinstr bugs at the time.
> 
> There might be a subtle mismatch between coverage instrumentation for
> testing/fuzzing and for profiling. (Disclaimer: I'm not too familiar
> with Clang-PGO's requirements.) For example, while for testing/fuzzing
> we may only require information if a code-path has been visited, for
> profiling the "hotness" might be of interest. Therefore, the
> user-space exported data format can make several trade-offs in
> complexity.

This has been my primary take-away: given that Clang's PGO is different
enough from the other things and provides more specific/actionable
results, I think it's justified to exist on its own separate from the
other parts.

> In theory, I imagine there's a limit to how generic one could make
> profiling information, because one compiler's optimizations are not
> another compiler's optimizations. On the other hand, it may be doable
> to collect unified profiling information for common stuff, but I guess
> there's little motivation for figuring out the common ground given the
> producer and consumer of the PGO data is the same compiler by design
> (unlike coverage info for testing/fuzzing).
> 
> Therefore, if KCOV's exposed information does not match PGO's
> requirements today, I'm not sure what realistically can be done
> without turning KCOV into a monster. Because KCOV is optimized for
> testing/fuzzing coverage, and I'm not sure how complex we can or want
> to make it to cater to a new use-case.
> 
> My intuition is that the simpler design is to have 2 subsystems for
> instrumentation-based coverage collection: one for testing/fuzzing,
> and the other for profiling.
> 
> Alas, there's the problem of GCOV, which should be replaceable by KCOV
> for most use cases. But it would be good to hear from a GCOV user if
> there are some.
> 
> But as we learned GCOV is broken on x86 now, I see these options:
> 
> 1. Remove GCOV, make KCOV the de-facto test-coverage collection
> subsystem. Introduce PGO-instrumentation subsystem for profile
> collection only, and make it _very_ clear that KCOV != PGO data as
> hinted above. A pre-requisite is that compiler-support for PGO
> instrumentation adds selective instrumentation support, likely just
> making attribute no_instrument_function do the right thing.

Right. I can't speak to GCOV, but KCOV certainly isn't PGO.

> 2. Like (1) but also keep GCOV, given proper support for attribute
> no_instrument_function would probably fix it (?).
> 
> 3. Keep GCOV (and KCOV of course). Somehow extract PGO profiles from KCOV.
> 
> 4. Somehow extract PGO profiles from GCOV, or modify kernel/gcov to do so.

If there *is* a way to "combine" these, I don't think it makes sense
to do it now. PGO has users (and is expanding[1]), and trying to
optimize the design before even landing the first version seems like a
needless obstruction, and is unlikely to address currently undiscovered
requirements.

So, AFAICT, the original blocking issue ("PGO does not respect noinstr")
is not actually an issue (noinstr contains notrace, which IS respected
by PGO[2]), I think this is fine to move forward.

-Kees

[1] https://lore.kernel.org/lkml/20210612032425.11425-1-jarmo.tiitto@gmail.com/
[2] https://lore.kernel.org/lkml/CAGG=3QVHkkJ236mCJ8Jt_6JtgYtWHV9b4aVXnoj6ypc7GOnc0A@mail.gmail.com/
Peter Zijlstra June 14, 2021, 3:35 p.m. UTC | #30
On Mon, Jun 14, 2021 at 08:26:01AM -0700, Kees Cook wrote:
> So, AFAICT, the original blocking issue ("PGO does not respect noinstr")
> is not actually an issue (noinstr contains notrace, which IS respected
> by PGO[2]), I think this is fine to move forward.

It is *NOT*: https://godbolt.org/z/9c7xdvGd9

Look at how both compilers generate instrumentation in the no_instr()
function.
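
For readers without access to the link, a reduced test case in this spirit
might look like the sketch below. It is an assumption about the shape of such
a test, not the contents of the godbolt page: notrace is defined locally as a
stand-in for the kernel macro, and with_instr() and check_same() are
hypothetical helpers. GCC documents no_instrument_function as controlling
-finstrument-functions and -pg instrumentation, so the thing to inspect is
whether counter updates still show up in no_instr() when the file is built
with -fprofile-generate:

```c
#include <assert.h>

/* Local stand-in for the kernel's notrace annotation (assumption). */
#define notrace __attribute__((no_instrument_function))

/* Marked function: the question is whether PGO counters still appear here. */
notrace int no_instr(int a)
{
	if (a > 0)
		return a + 1;
	return 0;
}

/* Unmarked control function with the same body, for comparison. */
int with_instr(int a)
{
	if (a > 0)
		return a + 1;
	return 0;
}

/* Sanity check: both functions must compute the same result. */
static void check_same(int a)
{
	assert(no_instr(a) == with_instr(a));
}
```

Comparing the generated assembly of no_instr() and with_instr() (for example
with -O2 -S) makes any remaining instrumentation in the marked function easy
to spot.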
Peter Zijlstra June 14, 2021, 3:46 p.m. UTC | #31
On Mon, Jun 14, 2021 at 08:26:01AM -0700, Kees Cook wrote:
> > 2. Like (1) but also keep GCOV, given proper support for attribute
> > no_instrument_function would probably fix it (?).
> > 
> > 3. Keep GCOV (and KCOV of course). Somehow extract PGO profiles from KCOV.
> > 
> > 4. Somehow extract PGO profiles from GCOV, or modify kernel/gcov to do so.
> 
> If there *is* a way to "combine" these, I don't think it makes sense
> to do it now. PGO has users (and is expanding[1]), and trying to
> optimize the design before even landing the first version seems like a
> needless obstruction, and is unlikely to address currently undiscovered
> requirements.

Even if that were so (and I'm not yet convinced), the current proposal
is wedded to llvm-pgo, there is no way gcc-pgo could reuse any of this
code afaict, which then means they have to create yet another variant.

Sorting this *before* the first version is exactly the right time.

Since when are we merging code when the requirements are not clear?

Just to clarify:

Nacked-by: Peter Zijlstra (Intel) <peterz@infradead.org>

For all this PGO crud.
Nick Desaulniers June 14, 2021, 4:03 p.m. UTC | #32
On Mon, Jun 14, 2021 at 8:46 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Mon, Jun 14, 2021 at 08:26:01AM -0700, Kees Cook wrote:
> > > 2. Like (1) but also keep GCOV, given proper support for attribute
> > > no_instrument_function would probably fix it (?).
> > >
> > > 3. Keep GCOV (and KCOV of course). Somehow extract PGO profiles from KCOV.
> > >
> > > 4. Somehow extract PGO profiles from GCOV, or modify kernel/gcov to do so.
> >
> > If there *is* a way to "combine" these, I don't think it makes sense
> > to do it now. PGO has users (and is expanding[1]), and trying to
> > optimize the design before even landing the first version seems like a
> > needless obstruction, and is unlikely to address currently undiscovered
> > requirements.
>
> Even if that were so (and I'm not yet convinced), the current proposal
> is wedded to llvm-pgo, there is no way gcc-pgo could reuse any of this
> code afaict, which then means they have to create yet another variant.

Similar to GCOV, the runtime support for exporting such data is
heavily compiler (and compiler version) specific, as is the data
format for compilers to consume.  We were able to reuse most of the
runtime code between GCC and Clang support in GCOV; I don't see why we
couldn't do a similar factoring of the runtime code being added to the
kernel here, should anyone care to pursue implementing PGO with GCC.
Having an implementation is a great starting point for folks looking
to extend support or to understand how to support PGO in such a bare
metal environment (one that doesn't dynamically link against
traditional compiler runtimes).
Kees Cook June 14, 2021, 4:22 p.m. UTC | #33
On Mon, Jun 14, 2021 at 05:35:45PM +0200, Peter Zijlstra wrote:
> On Mon, Jun 14, 2021 at 08:26:01AM -0700, Kees Cook wrote:
> > So, AFAICT, the original blocking issue ("PGO does not respect noinstr")
> > is not actually an issue (noinstr contains notrace, which IS respected
> > by PGO[2]), I think this is fine to move forward.
> 
> It is *NOT*: https://godbolt.org/z/9c7xdvGd9
> 
> Look at how both compilers generate instrumentation in the no_instr()
> function.

Well that's disappointing. I'll put this on hold until Clang can grow an
appropriate attribute (or similar work-around). Thanks for catching
that.
Nick Desaulniers June 14, 2021, 6:07 p.m. UTC | #34
On Mon, Jun 14, 2021 at 9:23 AM Kees Cook <keescook@chromium.org> wrote:
>
> On Mon, Jun 14, 2021 at 05:35:45PM +0200, Peter Zijlstra wrote:
> > On Mon, Jun 14, 2021 at 08:26:01AM -0700, Kees Cook wrote:
> > > So, AFAICT, the original blocking issue ("PGO does not respect noinstr")
> > > is not actually an issue (noinstr contains notrace, which IS respected
> > > by PGO[2]), I think this is fine to move forward.
> >
> > It is *NOT*: https://godbolt.org/z/9c7xdvGd9
> >
> > Look at how both compilers generate instrumentation in the no_instr()
> > function.
>
> Well that's disappointing. I'll put this on hold until Clang can grow an
> appropriate attribute (or similar work-around). Thanks for catching
> that.

Cross referencing since these two threads are related.
https://lore.kernel.org/lkml/CAKwvOdmPTi93n2L0_yQkrzLdmpxzrOR7zggSzonyaw2PGshApw@mail.gmail.com/
Nick Desaulniers June 14, 2021, 8:49 p.m. UTC | #35
On Mon, Jun 14, 2021 at 11:07 AM Nick Desaulniers
<ndesaulniers@google.com> wrote:
>
> On Mon, Jun 14, 2021 at 9:23 AM Kees Cook <keescook@chromium.org> wrote:
> >
> > On Mon, Jun 14, 2021 at 05:35:45PM +0200, Peter Zijlstra wrote:
> > > On Mon, Jun 14, 2021 at 08:26:01AM -0700, Kees Cook wrote:
> > > > So, AFAICT, the original blocking issue ("PGO does not respect noinstr")
> > > > is not actually an issue (noinstr contains notrace, which IS respected
> > > > by PGO[2]), I think this is fine to move forward.
> > >
> > > It is *NOT*: https://godbolt.org/z/9c7xdvGd9
> > >
> > > Look at how both compilers generate instrumentation in the no_instr()
> > > function.
> >
> > Well that's disappointing. I'll put this on hold until Clang can grow an
> > appropriate attribute (or similar work-around). Thanks for catching
> > that.
>
> Cross referencing since these two threads are related.
> https://lore.kernel.org/lkml/CAKwvOdmPTi93n2L0_yQkrzLdmpxzrOR7zggSzonyaw2PGshApw@mail.gmail.com/

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80223 looked appropriate
to me, so I commented on it.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80223#c6

Patches for:
PGO: https://reviews.llvm.org/D104253
GCOV: https://reviews.llvm.org/D104257
--
Thanks,
~Nick Desaulniers

Patch

diff --git a/Documentation/dev-tools/index.rst b/Documentation/dev-tools/index.rst
index 1b1cf4f5c9d9..6a30cd98e6f9 100644
--- a/Documentation/dev-tools/index.rst
+++ b/Documentation/dev-tools/index.rst
@@ -27,6 +27,7 @@  whole; patches welcome!
    kgdb
    kselftest
    kunit/index
+   pgo
 
 
 .. only::  subproject and html
diff --git a/Documentation/dev-tools/pgo.rst b/Documentation/dev-tools/pgo.rst
new file mode 100644
index 000000000000..b7f11d8405b7
--- /dev/null
+++ b/Documentation/dev-tools/pgo.rst
@@ -0,0 +1,127 @@ 
+.. SPDX-License-Identifier: GPL-2.0
+
+===============================
+Using PGO with the Linux kernel
+===============================
+
+Clang's profiling kernel support (PGO_) enables profiling of the Linux kernel
+when building with Clang. The profiling data is exported via the ``pgo``
+debugfs directory.
+
+.. _PGO: https://clang.llvm.org/docs/UsersManual.html#profile-guided-optimization
+
+
+Preparation
+===========
+
+Configure the kernel with:
+
+.. code-block:: make
+
+   CONFIG_DEBUG_FS=y
+   CONFIG_PGO_CLANG=y
+
+Note that kernels compiled with profiling flags will be significantly larger
+and run slower.
+
+Profiling data will only become accessible once debugfs has been mounted:
+
+.. code-block:: sh
+
+   mount -t debugfs none /sys/kernel/debug
+
+
+Customization
+=============
+
+You can enable or disable profiling for individual files and directories by
+adding a line similar to the following to the respective kernel Makefile:
+
+- For a single file (e.g. main.o)
+
+  .. code-block:: make
+
+     PGO_PROFILE_main.o := y
+
+- For all files in one directory
+
+  .. code-block:: make
+
+     PGO_PROFILE := y
+
+To exclude files from being profiled use
+
+  .. code-block:: make
+
+     PGO_PROFILE_main.o := n
+
+and
+
+  .. code-block:: make
+
+     PGO_PROFILE := n
+
+Only files which are linked to the main kernel image or are compiled as kernel
+modules are supported by this mechanism.
+
+
+Files
+=====
+
+The PGO kernel support creates the following files in debugfs:
+
+``/sys/kernel/debug/pgo``
+	Parent directory for all PGO-related files.
+
+``/sys/kernel/debug/pgo/reset``
+	Global reset file: resets all profile data to zero when written to.
+
+``/sys/kernel/debug/pgo/profraw``
+	The raw PGO data that must be processed with ``llvm-profdata``.
+
+
+Workflow
+========
+
+The PGO kernel can be run on the host or on test machines. The data, however,
+should be processed with the tools from the same Clang version that was used
+to compile the kernel. Clang is tolerant of version skew, but using the same
+version is easier.
+
+The profiling data is useful for optimizing the kernel, analyzing coverage,
+etc. Clang offers tools to perform these tasks.
+
+Here is an example workflow for profiling an instrumented kernel with PGO and
+using the result to optimize the kernel:
+
+1) Install the kernel on the TEST machine.
+
+2) Reset the data counters right before running the load tests
+
+   .. code-block:: sh
+
+      $ echo 1 > /sys/kernel/debug/pgo/reset
+
+3) Run the load tests.
+
+4) Collect the raw profile data
+
+   .. code-block:: sh
+
+      $ cp -a /sys/kernel/debug/pgo/profraw /tmp/vmlinux.profraw
+
+5) (Optional) Download the raw profile data to the HOST machine.
+
+6) Process the raw profile data
+
+   .. code-block:: sh
+
+      $ llvm-profdata merge --output=vmlinux.profdata vmlinux.profraw
+
+   Note that multiple raw profile data files can be merged during this step.
+
+7) Rebuild the kernel using the profile data (with CONFIG_PGO_CLANG disabled)
+
+   .. code-block:: sh
+
+      $ make LLVM=1 KCFLAGS=-fprofile-use=vmlinux.profdata ...
diff --git a/MAINTAINERS b/MAINTAINERS
index c80ad735b384..742058188af2 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -14054,6 +14054,15 @@  S:	Maintained
 F:	include/linux/personality.h
 F:	include/uapi/linux/personality.h
 
+PGO BASED KERNEL PROFILING
+M:	Sami Tolvanen <samitolvanen@google.com>
+M:	Bill Wendling <wcw@google.com>
+R:	Nathan Chancellor <natechancellor@gmail.com>
+R:	Nick Desaulniers <ndesaulniers@google.com>
+S:	Supported
+F:	Documentation/dev-tools/pgo.rst
+F:	kernel/pgo
+
 PHOENIX RC FLIGHT CONTROLLER ADAPTER
 M:	Marcus Folkesson <marcus.folkesson@gmail.com>
 L:	linux-input@vger.kernel.org
diff --git a/Makefile b/Makefile
index cc77fd45ca64..6450faceb137 100644
--- a/Makefile
+++ b/Makefile
@@ -660,6 +660,9 @@  endif # KBUILD_EXTMOD
 # Defaults to vmlinux, but the arch makefile usually adds further targets
 all: vmlinux
 
+CFLAGS_PGO_CLANG := -fprofile-generate
+export CFLAGS_PGO_CLANG
+
 CFLAGS_GCOV	:= -fprofile-arcs -ftest-coverage \
 	$(call cc-option,-fno-tree-loop-im) \
 	$(call cc-disable-warning,maybe-uninitialized,)
diff --git a/arch/Kconfig b/arch/Kconfig
index ecfd3520b676..afd082133e0a 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -1191,6 +1191,7 @@  config ARCH_HAS_ELFCORE_COMPAT
 	bool
 
 source "kernel/gcov/Kconfig"
+source "kernel/pgo/Kconfig"
 
 source "scripts/gcc-plugins/Kconfig"
 
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 2792879d398e..62be93b199ff 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -99,6 +99,7 @@  config X86
 	select ARCH_SUPPORTS_KMAP_LOCAL_FORCE_MAP	if NR_CPUS <= 4096
 	select ARCH_SUPPORTS_LTO_CLANG		if X86_64
 	select ARCH_SUPPORTS_LTO_CLANG_THIN	if X86_64
+	select ARCH_SUPPORTS_PGO_CLANG		if X86_64
 	select ARCH_USE_BUILTIN_BSWAP
 	select ARCH_USE_QUEUED_RWLOCKS
 	select ARCH_USE_QUEUED_SPINLOCKS
diff --git a/arch/x86/boot/Makefile b/arch/x86/boot/Makefile
index fe605205b4ce..383853e32f67 100644
--- a/arch/x86/boot/Makefile
+++ b/arch/x86/boot/Makefile
@@ -71,6 +71,7 @@  KBUILD_AFLAGS	:= $(KBUILD_CFLAGS) -D__ASSEMBLY__
 KBUILD_CFLAGS	+= $(call cc-option,-fmacro-prefix-map=$(srctree)/=)
 KBUILD_CFLAGS	+= -fno-asynchronous-unwind-tables
 GCOV_PROFILE := n
+PGO_PROFILE := n
 UBSAN_SANITIZE := n
 
 $(obj)/bzImage: asflags-y  := $(SVGA_MODE)
diff --git a/arch/x86/boot/compressed/Makefile b/arch/x86/boot/compressed/Makefile
index e0bc3988c3fa..ed12ab65f606 100644
--- a/arch/x86/boot/compressed/Makefile
+++ b/arch/x86/boot/compressed/Makefile
@@ -54,6 +54,7 @@  CFLAGS_sev-es.o += -I$(objtree)/arch/x86/lib/
 
 KBUILD_AFLAGS  := $(KBUILD_CFLAGS) -D__ASSEMBLY__
 GCOV_PROFILE := n
+PGO_PROFILE := n
 UBSAN_SANITIZE :=n
 
 KBUILD_LDFLAGS := -m elf_$(UTS_MACHINE)
diff --git a/arch/x86/crypto/Makefile b/arch/x86/crypto/Makefile
index b28e36b7c96b..4b2e9620c412 100644
--- a/arch/x86/crypto/Makefile
+++ b/arch/x86/crypto/Makefile
@@ -4,6 +4,10 @@ 
 
 OBJECT_FILES_NON_STANDARD := y
 
+# Disable PGO for curve25519-x86_64. With PGO enabled, clang runs out of
+# registers for some of the functions.
+PGO_PROFILE_curve25519-x86_64.o := n
+
 obj-$(CONFIG_CRYPTO_TWOFISH_586) += twofish-i586.o
 twofish-i586-y := twofish-i586-asm_32.o twofish_glue.o
 obj-$(CONFIG_CRYPTO_TWOFISH_X86_64) += twofish-x86_64.o
diff --git a/arch/x86/entry/vdso/Makefile b/arch/x86/entry/vdso/Makefile
index 05c4abc2fdfd..f7421e44725a 100644
--- a/arch/x86/entry/vdso/Makefile
+++ b/arch/x86/entry/vdso/Makefile
@@ -180,6 +180,7 @@  quiet_cmd_vdso = VDSO    $@
 VDSO_LDFLAGS = -shared --hash-style=both --build-id=sha1 \
 	$(call ld-option, --eh-frame-hdr) -Bsymbolic
 GCOV_PROFILE := n
+PGO_PROFILE := n
 
 quiet_cmd_vdso_and_check = VDSO    $@
       cmd_vdso_and_check = $(cmd_vdso); $(cmd_vdso_check)
diff --git a/arch/x86/kernel/vmlinux.lds.S b/arch/x86/kernel/vmlinux.lds.S
index efd9e9ea17f2..f6cab2316c46 100644
--- a/arch/x86/kernel/vmlinux.lds.S
+++ b/arch/x86/kernel/vmlinux.lds.S
@@ -184,6 +184,8 @@  SECTIONS
 
 	BUG_TABLE
 
+	PGO_CLANG_DATA
+
 	ORC_UNWIND_TABLE
 
 	. = ALIGN(PAGE_SIZE);
diff --git a/arch/x86/platform/efi/Makefile b/arch/x86/platform/efi/Makefile
index 84b09c230cbd..5f22b31446ad 100644
--- a/arch/x86/platform/efi/Makefile
+++ b/arch/x86/platform/efi/Makefile
@@ -2,6 +2,7 @@ 
 OBJECT_FILES_NON_STANDARD_efi_thunk_$(BITS).o := y
 KASAN_SANITIZE := n
 GCOV_PROFILE := n
+PGO_PROFILE := n
 
 obj-$(CONFIG_EFI) 		+= quirks.o efi.o efi_$(BITS).o efi_stub_$(BITS).o
 obj-$(CONFIG_EFI_MIXED)		+= efi_thunk_$(BITS).o
diff --git a/arch/x86/purgatory/Makefile b/arch/x86/purgatory/Makefile
index 95ea17a9d20c..36f20e99da0b 100644
--- a/arch/x86/purgatory/Makefile
+++ b/arch/x86/purgatory/Makefile
@@ -23,6 +23,7 @@  targets += purgatory.ro purgatory.chk
 
 # Sanitizer, etc. runtimes are unavailable and cannot be linked here.
 GCOV_PROFILE	:= n
+PGO_PROFILE	:= n
 KASAN_SANITIZE	:= n
 UBSAN_SANITIZE	:= n
 KCSAN_SANITIZE	:= n
diff --git a/arch/x86/realmode/rm/Makefile b/arch/x86/realmode/rm/Makefile
index 83f1b6a56449..21797192f958 100644
--- a/arch/x86/realmode/rm/Makefile
+++ b/arch/x86/realmode/rm/Makefile
@@ -76,4 +76,5 @@  KBUILD_CFLAGS	:= $(REALMODE_CFLAGS) -D_SETUP -D_WAKEUP \
 KBUILD_AFLAGS	:= $(KBUILD_CFLAGS) -D__ASSEMBLY__
 KBUILD_CFLAGS	+= -fno-asynchronous-unwind-tables
 GCOV_PROFILE := n
+PGO_PROFILE := n
 UBSAN_SANITIZE := n
diff --git a/arch/x86/um/vdso/Makefile b/arch/x86/um/vdso/Makefile
index 5943387e3f35..54f5768f5853 100644
--- a/arch/x86/um/vdso/Makefile
+++ b/arch/x86/um/vdso/Makefile
@@ -64,6 +64,7 @@  quiet_cmd_vdso = VDSO    $@
 
 VDSO_LDFLAGS = -fPIC -shared -Wl,--hash-style=sysv
 GCOV_PROFILE := n
+PGO_PROFILE := n
 
 #
 # Install the unstripped copy of vdso*.so listed in $(vdso-install-y).
diff --git a/drivers/firmware/efi/libstub/Makefile b/drivers/firmware/efi/libstub/Makefile
index c23466e05e60..724fb389bb9d 100644
--- a/drivers/firmware/efi/libstub/Makefile
+++ b/drivers/firmware/efi/libstub/Makefile
@@ -42,6 +42,7 @@  KBUILD_CFLAGS := $(filter-out $(CC_FLAGS_SCS), $(KBUILD_CFLAGS))
 KBUILD_CFLAGS := $(filter-out $(CC_FLAGS_LTO), $(KBUILD_CFLAGS))
 
 GCOV_PROFILE			:= n
+PGO_PROFILE			:= n
 # Sanitizer runtimes are unavailable and cannot be linked here.
 KASAN_SANITIZE			:= n
 KCSAN_SANITIZE			:= n
diff --git a/include/asm-generic/vmlinux.lds.h b/include/asm-generic/vmlinux.lds.h
index 0331d5d49551..b371857097e8 100644
--- a/include/asm-generic/vmlinux.lds.h
+++ b/include/asm-generic/vmlinux.lds.h
@@ -329,6 +329,39 @@ 
 #define DTPM_TABLE()
 #endif
 
+#ifdef CONFIG_PGO_CLANG
+#define PGO_CLANG_DATA							\
+	__llvm_prf_data : AT(ADDR(__llvm_prf_data) - LOAD_OFFSET) {	\
+		__llvm_prf_start = .;					\
+		__llvm_prf_data_start = .;				\
+		*(__llvm_prf_data)					\
+		__llvm_prf_data_end = .;				\
+	}								\
+	__llvm_prf_cnts : AT(ADDR(__llvm_prf_cnts) - LOAD_OFFSET) {	\
+		__llvm_prf_cnts_start = .;				\
+		*(__llvm_prf_cnts)					\
+		__llvm_prf_cnts_end = .;				\
+	}								\
+	__llvm_prf_names : AT(ADDR(__llvm_prf_names) - LOAD_OFFSET) {	\
+		__llvm_prf_names_start = .;				\
+		*(__llvm_prf_names)					\
+		__llvm_prf_names_end = .;				\
+	}								\
+	__llvm_prf_vals : AT(ADDR(__llvm_prf_vals) - LOAD_OFFSET) {	\
+		__llvm_prf_vals_start = .;				\
+		*(__llvm_prf_vals)					\
+		__llvm_prf_vals_end = .;				\
+	}								\
+	__llvm_prf_vnds : AT(ADDR(__llvm_prf_vnds) - LOAD_OFFSET) {	\
+		__llvm_prf_vnds_start = .;				\
+		*(__llvm_prf_vnds)					\
+		__llvm_prf_vnds_end = .;				\
+		__llvm_prf_end = .;					\
+	}
+#else
+#define PGO_CLANG_DATA
+#endif
+
 #define KERNEL_DTB()							\
 	STRUCT_ALIGN();							\
 	__dtb_start = .;						\
@@ -1106,6 +1139,7 @@ 
 		CONSTRUCTORS						\
 	}								\
 	BUG_TABLE							\
+	PGO_CLANG_DATA
 
 #define INIT_TEXT_SECTION(inittext_align)				\
 	. = ALIGN(inittext_align);					\
diff --git a/kernel/Makefile b/kernel/Makefile
index 320f1f3941b7..a2a23ef2b12f 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -111,6 +111,7 @@  obj-$(CONFIG_BPF) += bpf/
 obj-$(CONFIG_KCSAN) += kcsan/
 obj-$(CONFIG_SHADOW_CALL_STACK) += scs.o
 obj-$(CONFIG_HAVE_STATIC_CALL_INLINE) += static_call.o
+obj-$(CONFIG_PGO_CLANG) += pgo/
 
 obj-$(CONFIG_PERF_EVENTS) += events/
 
diff --git a/kernel/pgo/Kconfig b/kernel/pgo/Kconfig
new file mode 100644
index 000000000000..76a640b6cf6e
--- /dev/null
+++ b/kernel/pgo/Kconfig
@@ -0,0 +1,35 @@ 
+# SPDX-License-Identifier: GPL-2.0-only
+menu "Profile Guided Optimization (PGO) (EXPERIMENTAL)"
+
+config ARCH_SUPPORTS_PGO_CLANG
+	bool
+
+config PGO_CLANG
+	bool "Enable clang's PGO-based kernel profiling"
+	depends on DEBUG_FS
+	depends on ARCH_SUPPORTS_PGO_CLANG
+	depends on CC_IS_CLANG && CLANG_VERSION >= 120000
+	help
+	  This option enables clang's PGO (Profile Guided Optimization) based
+	  code profiling to better optimize the kernel.
+
+	  If unsure, say N.
+
+	  Run a representative workload for your application on a kernel
+	  compiled with this option and download the raw profile file from
+	  /sys/kernel/debug/pgo/profraw. This file needs to be processed with
+	  llvm-profdata. It may be merged with other collected raw profiles.
+
+	  Copy the resulting profile file into vmlinux.profdata, and enable
+	  KCFLAGS=-fprofile-use=vmlinux.profdata to produce an optimized
+	  kernel.
+
+	  Note that a kernel compiled with profiling flags will be
+	  significantly larger and run slower. Also be sure to exclude files
+	  from profiling which are not linked to the kernel image to prevent
+	  linker errors.
+
+	  Note that the debugfs filesystem has to be mounted to access
+	  profiling data.
+
+endmenu
diff --git a/kernel/pgo/Makefile b/kernel/pgo/Makefile
new file mode 100644
index 000000000000..41e27cefd9a4
--- /dev/null
+++ b/kernel/pgo/Makefile
@@ -0,0 +1,5 @@ 
+# SPDX-License-Identifier: GPL-2.0
+GCOV_PROFILE	:= n
+PGO_PROFILE	:= n
+
+obj-y	+= fs.o instrument.o
diff --git a/kernel/pgo/fs.c b/kernel/pgo/fs.c
new file mode 100644
index 000000000000..1678df3b7d64
--- /dev/null
+++ b/kernel/pgo/fs.c
@@ -0,0 +1,389 @@ 
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (C) 2019 Google, Inc.
+ *
+ * Author:
+ *	Sami Tolvanen <samitolvanen@google.com>
+ *
+ * This software is licensed under the terms of the GNU General Public
+ * License version 2, as published by the Free Software Foundation, and
+ * may be copied, distributed, and modified under those terms.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ */
+
+#define pr_fmt(fmt)	"pgo: " fmt
+
+#include <linux/kernel.h>
+#include <linux/debugfs.h>
+#include <linux/fs.h>
+#include <linux/module.h>
+#include <linux/slab.h>
+#include <linux/vmalloc.h>
+#include "pgo.h"
+
+static struct dentry *directory;
+
+struct prf_private_data {
+	void *buffer;
+	unsigned long size;
+};
+
+/*
+ * Raw profile data format:
+ *
+ *	- llvm_prf_header
+ *	- __llvm_prf_data
+ *	- __llvm_prf_cnts
+ *	- __llvm_prf_names
+ *	- zero padding to 8 bytes
+ *	- for each llvm_prf_data in __llvm_prf_data:
+ *		- llvm_prf_value_data
+ *			- llvm_prf_value_record + site count array
+ *				- llvm_prf_value_node_data
+ *				...
+ *			...
+ *		...
+ */
+
+static void prf_fill_header(void **buffer)
+{
+	struct llvm_prf_header *header = *(struct llvm_prf_header **)buffer;
+
+#ifdef CONFIG_64BIT
+	header->magic = LLVM_INSTR_PROF_RAW_MAGIC_64;
+#else
+	header->magic = LLVM_INSTR_PROF_RAW_MAGIC_32;
+#endif
+	header->version = LLVM_VARIANT_MASK_IR_PROF | LLVM_INSTR_PROF_RAW_VERSION;
+	header->data_size = prf_data_count();
+	header->padding_bytes_before_counters = 0;
+	header->counters_size = prf_cnts_count();
+	header->padding_bytes_after_counters = 0;
+	header->names_size = prf_names_count();
+	header->counters_delta = (u64)__llvm_prf_cnts_start;
+	header->names_delta = (u64)__llvm_prf_names_start;
+	header->value_kind_last = LLVM_INSTR_PROF_IPVK_LAST;
+
+	*buffer += sizeof(*header);
+}
+
+/*
+ * Copy the source into the buffer, incrementing the pointer into buffer in the
+ * process.
+ */
+static void prf_copy_to_buffer(void **buffer, void *src, unsigned long size)
+{
+	memcpy(*buffer, src, size);
+	*buffer += size;
+}
+
+static u32 __prf_get_value_size(struct llvm_prf_data *p, u32 *value_kinds)
+{
+	struct llvm_prf_value_node **nodes =
+		(struct llvm_prf_value_node **)p->values;
+	u32 kinds = 0;
+	u32 size = 0;
+	unsigned int kind;
+	unsigned int n;
+	unsigned int s = 0;
+
+	for (kind = 0; kind < ARRAY_SIZE(p->num_value_sites); kind++) {
+		unsigned int sites = p->num_value_sites[kind];
+
+		if (!sites)
+			continue;
+
+		/* Record + site count array */
+		size += prf_get_value_record_size(sites);
+		kinds++;
+
+		if (!nodes)
+			continue;
+
+		for (n = 0; n < sites; n++) {
+			u32 count = 0;
+			struct llvm_prf_value_node *site = nodes[s + n];
+
+			for (; site && count < U8_MAX; count++)
+				site = site->next;
+
+			size += count *
+				sizeof(struct llvm_prf_value_node_data);
+		}
+
+		s += sites;
+	}
+
+	if (size)
+		size += sizeof(struct llvm_prf_value_data);
+
+	if (value_kinds)
+		*value_kinds = kinds;
+
+	return size;
+}
+
+static u32 prf_get_value_size(void)
+{
+	u32 size = 0;
+	struct llvm_prf_data *p;
+
+	for (p = __llvm_prf_data_start; p < __llvm_prf_data_end; p++)
+		size += __prf_get_value_size(p, NULL);
+
+	return size;
+}
+
+/* Serialize the value profiling data of a single llvm_prf_data entry. */
+static void prf_serialize_value(struct llvm_prf_data *p, void **buffer)
+{
+	struct llvm_prf_value_data header;
+	struct llvm_prf_value_node **nodes =
+		(struct llvm_prf_value_node **)p->values;
+	unsigned int kind;
+	unsigned int n;
+	unsigned int s = 0;
+
+	header.total_size = __prf_get_value_size(p, &header.num_value_kinds);
+
+	if (!header.num_value_kinds)
+		/* Nothing to write. */
+		return;
+
+	prf_copy_to_buffer(buffer, &header, sizeof(header));
+
+	for (kind = 0; kind < ARRAY_SIZE(p->num_value_sites); kind++) {
+		struct llvm_prf_value_record *record;
+		u8 *counts;
+		unsigned int sites = p->num_value_sites[kind];
+
+		if (!sites)
+			continue;
+
+		/* Profiling value record. */
+		record = *(struct llvm_prf_value_record **)buffer;
+		*buffer += prf_get_value_record_header_size();
+
+		record->kind = kind;
+		record->num_value_sites = sites;
+
+		/* Site count array. */
+		counts = *(u8 **)buffer;
+		*buffer += prf_get_value_record_site_count_size(sites);
+
+		/*
+		 * If we don't have nodes, we can skip updating the site count
+		 * array, because the buffer is zero filled.
+		 */
+		if (!nodes)
+			continue;
+
+		for (n = 0; n < sites; n++) {
+			u32 count = 0;
+			struct llvm_prf_value_node *site = nodes[s + n];
+
+			for (; site && count < U8_MAX; count++) {
+				prf_copy_to_buffer(buffer, site,
+						   sizeof(struct llvm_prf_value_node_data));
+				site = site->next;
+			}
+
+			counts[n] = (u8)count;
+		}
+
+		s += sites;
+	}
+}
+
+static void prf_serialize_values(void **buffer)
+{
+	struct llvm_prf_data *p;
+
+	for (p = __llvm_prf_data_start; p < __llvm_prf_data_end; p++)
+		prf_serialize_value(p, buffer);
+}
+
+static inline unsigned long prf_get_padding(unsigned long size)
+{
+	return 7 & (sizeof(u64) - size % sizeof(u64));
+}
+
+static unsigned long prf_buffer_size(void)
+{
+	return sizeof(struct llvm_prf_header) +
+			prf_data_size()	+
+			prf_cnts_size() +
+			prf_names_size() +
+			prf_get_padding(prf_names_size()) +
+			prf_get_value_size();
+}
+
+/*
+ * Serialize the profiling data into a format LLVM's tools can understand.
+ * Note: caller *must* hold pgo_lock.
+ */
+static int prf_serialize(struct prf_private_data *p)
+{
+	int err = 0;
+	void *buffer;
+
+	p->size = prf_buffer_size();
+	p->buffer = vzalloc(p->size);
+
+	if (!p->buffer) {
+		err = -ENOMEM;
+		goto out;
+	}
+
+	buffer = p->buffer;
+
+	prf_fill_header(&buffer);
+	prf_copy_to_buffer(&buffer, __llvm_prf_data_start,  prf_data_size());
+	prf_copy_to_buffer(&buffer, __llvm_prf_cnts_start,  prf_cnts_size());
+	prf_copy_to_buffer(&buffer, __llvm_prf_names_start, prf_names_size());
+	buffer += prf_get_padding(prf_names_size());
+
+	prf_serialize_values(&buffer);
+
+out:
+	return err;
+}
+
+/* open() implementation for PGO. Creates a copy of the profiling data set. */
+static int prf_open(struct inode *inode, struct file *file)
+{
+	struct prf_private_data *data;
+	unsigned long flags;
+	int err;
+
+	data = kzalloc(sizeof(*data), GFP_KERNEL);
+	if (!data) {
+		err = -ENOMEM;
+		goto out;
+	}
+
+	flags = prf_lock();
+
+	err = prf_serialize(data);
+	if (unlikely(err)) {
+		kfree(data);
+		goto out_unlock;
+	}
+
+	file->private_data = data;
+
+out_unlock:
+	prf_unlock(flags);
+out:
+	return err;
+}
+
+/* read() implementation for PGO. */
+static ssize_t prf_read(struct file *file, char __user *buf, size_t count,
+			loff_t *ppos)
+{
+	struct prf_private_data *data = file->private_data;
+
+	BUG_ON(!data);
+
+	return simple_read_from_buffer(buf, count, ppos, data->buffer,
+				       data->size);
+}
+
+/* release() implementation for PGO. Release resources allocated by open(). */
+static int prf_release(struct inode *inode, struct file *file)
+{
+	struct prf_private_data *data = file->private_data;
+
+	if (data) {
+		vfree(data->buffer);
+		kfree(data);
+	}
+
+	return 0;
+}
+
+static const struct file_operations prf_fops = {
+	.owner		= THIS_MODULE,
+	.open		= prf_open,
+	.read		= prf_read,
+	.llseek		= default_llseek,
+	.release	= prf_release,
+};
+
+/* write() implementation for resetting PGO's profile data. */
+static ssize_t reset_write(struct file *file, const char __user *addr,
+			   size_t len, loff_t *pos)
+{
+	struct llvm_prf_data *data;
+
+	memset(__llvm_prf_cnts_start, 0, prf_cnts_size());
+
+	for (data = __llvm_prf_data_start; data < __llvm_prf_data_end; data++) {
+		struct llvm_prf_value_node **vnodes;
+		u64 current_vsite_count;
+		u32 i;
+
+		if (!data->values)
+			continue;
+
+		current_vsite_count = 0;
+		vnodes = (struct llvm_prf_value_node **)data->values;
+
+		for (i = LLVM_INSTR_PROF_IPVK_FIRST; i <= LLVM_INSTR_PROF_IPVK_LAST; i++)
+			current_vsite_count += data->num_value_sites[i];
+
+		for (i = 0; i < current_vsite_count; i++) {
+			struct llvm_prf_value_node *current_vnode = vnodes[i];
+
+			while (current_vnode) {
+				current_vnode->count = 0;
+				current_vnode = current_vnode->next;
+			}
+		}
+	}
+
+	return len;
+}
+
+static const struct file_operations prf_reset_fops = {
+	.owner		= THIS_MODULE,
+	.write		= reset_write,
+	.llseek		= noop_llseek,
+};
+
+/* Create debugfs entries. */
+static int __init pgo_init(void)
+{
+	directory = debugfs_create_dir("pgo", NULL);
+	if (!directory)
+		goto err_remove;
+
+	if (!debugfs_create_file("profraw", 0600, directory, NULL,
+				 &prf_fops))
+		goto err_remove;
+
+	if (!debugfs_create_file("reset", 0200, directory, NULL,
+				 &prf_reset_fops))
+		goto err_remove;
+
+	return 0;
+
+err_remove:
+	pr_err("initialization failed\n");
+	return -EIO;
+}
+
+/* Remove debugfs entries. */
+static void __exit pgo_exit(void)
+{
+	debugfs_remove_recursive(directory);
+}
+
+module_init(pgo_init);
+module_exit(pgo_exit);
diff --git a/kernel/pgo/instrument.c b/kernel/pgo/instrument.c
new file mode 100644
index 000000000000..464b3bc77431
--- /dev/null
+++ b/kernel/pgo/instrument.c
@@ -0,0 +1,189 @@ 
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (C) 2019 Google, Inc.
+ *
+ * Author:
+ *	Sami Tolvanen <samitolvanen@google.com>
+ *
+ * This software is licensed under the terms of the GNU General Public
+ * License version 2, as published by the Free Software Foundation, and
+ * may be copied, distributed, and modified under those terms.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ */
+
+#define pr_fmt(fmt)	"pgo: " fmt
+
+#include <linux/bitops.h>
+#include <linux/kernel.h>
+#include <linux/export.h>
+#include <linux/spinlock.h>
+#include <linux/types.h>
+#include "pgo.h"
+
+/*
+ * This lock guards both profile count updating and serialization of the
+ * profiling data. Keeping both of these activities separate via locking
+ * ensures that we don't try to serialize data that's only partially updated.
+ */
+static DEFINE_SPINLOCK(pgo_lock);
+static int current_node;
+
+unsigned long prf_lock(void)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&pgo_lock, flags);
+
+	return flags;
+}
+
+void prf_unlock(unsigned long flags)
+{
+	spin_unlock_irqrestore(&pgo_lock, flags);
+}
+
+/*
+ * Return a newly allocated profiling value node that holds a value tracked
+ * by the value profiler.
+ * Note: caller *must* hold pgo_lock.
+ */
+static struct llvm_prf_value_node *allocate_node(struct llvm_prf_data *p,
+						 u32 index, u64 value)
+{
+	if (&__llvm_prf_vnds_start[current_node + 1] >= __llvm_prf_vnds_end)
+		return NULL; /* Out of nodes */
+
+	current_node++;
+
+	/* Make sure the node is entirely within the section */
+	if (&__llvm_prf_vnds_start[current_node] >= __llvm_prf_vnds_end ||
+	    &__llvm_prf_vnds_start[current_node + 1] > __llvm_prf_vnds_end)
+		return NULL;
+
+	return &__llvm_prf_vnds_start[current_node];
+}
+
+/*
+ * Counts the number of times a target value is seen.
+ *
+ * Records the target value for the index if not seen before. Otherwise,
+ * increments the counter associated with the target value.
+ */
+void __llvm_profile_instrument_target(u64 target_value, void *data, u32 index);
+void __llvm_profile_instrument_target(u64 target_value, void *data, u32 index)
+{
+	struct llvm_prf_data *p = (struct llvm_prf_data *)data;
+	struct llvm_prf_value_node **counters;
+	struct llvm_prf_value_node *curr;
+	struct llvm_prf_value_node *min = NULL;
+	struct llvm_prf_value_node *prev = NULL;
+	u64 min_count = U64_MAX;
+	u8 values = 0;
+	unsigned long flags;
+
+	if (!p || !p->values)
+		return;
+
+	counters = (struct llvm_prf_value_node **)p->values;
+	curr = counters[index];
+
+	while (curr) {
+		if (target_value == curr->value) {
+			curr->count++;
+			return;
+		}
+
+		if (curr->count < min_count) {
+			min_count = curr->count;
+			min = curr;
+		}
+
+		prev = curr;
+		curr = curr->next;
+		values++;
+	}
+
+	if (values >= LLVM_INSTR_PROF_MAX_NUM_VAL_PER_SITE) {
+		if (!min->count || !(--min->count)) {
+			curr = min;
+			curr->value = target_value;
+			curr->count++;
+		}
+		return;
+	}
+
+	/* Lock when updating the value node structure. */
+	flags = prf_lock();
+
+	curr = allocate_node(p, index, target_value);
+	if (!curr)
+		goto out;
+
+	curr->value = target_value;
+	curr->count++;
+
+	if (!counters[index])
+		counters[index] = curr;
+	else if (prev && !prev->next)
+		prev->next = curr;
+
+out:
+	prf_unlock(flags);
+}
+EXPORT_SYMBOL(__llvm_profile_instrument_target);
+
+/* Counts the number of times a range of target values is seen. */
+void __llvm_profile_instrument_range(u64 target_value, void *data,
+				     u32 index, s64 precise_start,
+				     s64 precise_last, s64 large_value);
+void __llvm_profile_instrument_range(u64 target_value, void *data,
+				     u32 index, s64 precise_start,
+				     s64 precise_last, s64 large_value)
+{
+	if (large_value != S64_MIN && (s64)target_value >= large_value)
+		target_value = large_value;
+	else if ((s64)target_value < precise_start ||
+		 (s64)target_value > precise_last)
+		target_value = precise_last + 1;
+
+	__llvm_profile_instrument_target(target_value, data, index);
+}
+EXPORT_SYMBOL(__llvm_profile_instrument_range);
+
+static u64 inst_prof_get_range_rep_value(u64 value)
+{
+	if (value <= 8)
+		/* The first ranges are individually tracked; use the value as is. */
+		return value;
+	else if (value >= 513)
+		/* The last range is mapped to its lowest value. */
+		return 513;
+	else if (hweight64(value) == 1)
+		/* If it's a power of two, use it as is. */
+		return value;
+
+	/* Otherwise, use the previous power of two + 1. */
+	return ((u64)1 << (64 - __builtin_clzll(value) - 1)) + 1;
+}
+
+/*
+ * The target values are partitioned into multiple ranges. The range spec is
+ * defined in compiler-rt/include/profile/InstrProfData.inc.
+ */
+void __llvm_profile_instrument_memop(u64 target_value, void *data,
+				     u32 counter_index);
+void __llvm_profile_instrument_memop(u64 target_value, void *data,
+				     u32 counter_index)
+{
+	u64 rep_value;
+
+	/* Map the target value to the representative value of its range. */
+	rep_value = inst_prof_get_range_rep_value(target_value);
+	__llvm_profile_instrument_target(rep_value, data, counter_index);
+}
+EXPORT_SYMBOL(__llvm_profile_instrument_memop);
diff --git a/kernel/pgo/pgo.h b/kernel/pgo/pgo.h
new file mode 100644
index 000000000000..ddc8d3002fe5
--- /dev/null
+++ b/kernel/pgo/pgo.h
@@ -0,0 +1,203 @@ 
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright (C) 2019 Google, Inc.
+ *
+ * Author:
+ *	Sami Tolvanen <samitolvanen@google.com>
+ *
+ * This software is licensed under the terms of the GNU General Public
+ * License version 2, as published by the Free Software Foundation, and
+ * may be copied, distributed, and modified under those terms.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ */
+
+#ifndef _PGO_H
+#define _PGO_H
+
+/*
+ * Note: These internal LLVM definitions must match the compiler version.
+ * See llvm/include/llvm/ProfileData/InstrProfData.inc in LLVM's source code.
+ */
+
+#define LLVM_INSTR_PROF_RAW_MAGIC_64	\
+		((u64)255 << 56 |	\
+		 (u64)'l' << 48 |	\
+		 (u64)'p' << 40 |	\
+		 (u64)'r' << 32 |	\
+		 (u64)'o' << 24 |	\
+		 (u64)'f' << 16 |	\
+		 (u64)'r' << 8  |	\
+		 (u64)129)
+#define LLVM_INSTR_PROF_RAW_MAGIC_32	\
+		((u64)255 << 56 |	\
+		 (u64)'l' << 48 |	\
+		 (u64)'p' << 40 |	\
+		 (u64)'r' << 32 |	\
+		 (u64)'o' << 24 |	\
+		 (u64)'f' << 16 |	\
+		 (u64)'R' << 8  |	\
+		 (u64)129)
+
+#define LLVM_INSTR_PROF_RAW_VERSION		5
+#define LLVM_INSTR_PROF_DATA_ALIGNMENT		8
+#define LLVM_INSTR_PROF_IPVK_FIRST		0
+#define LLVM_INSTR_PROF_IPVK_LAST		1
+#define LLVM_INSTR_PROF_MAX_NUM_VAL_PER_SITE	255
+
+#define LLVM_VARIANT_MASK_IR_PROF	(0x1ULL << 56)
+#define LLVM_VARIANT_MASK_CSIR_PROF	(0x1ULL << 57)
+
+/**
+ * struct llvm_prf_header - represents the raw profile header data structure.
+ * @magic: the magic token for the file format.
+ * @version: the version of the file format.
+ * @data_size: the number of entries in the profile data section.
+ * @padding_bytes_before_counters: the number of padding bytes before the
+ *   counters.
+ * @counters_size: the size in bytes of the LLVM profile section containing the
+ *   counters.
+ * @padding_bytes_after_counters: the number of padding bytes after the
+ *   counters.
+ * @names_size: the size in bytes of the LLVM profile section containing the
+ *   counters' names.
+ * @counters_delta: the beginning of the LLVM profile counters section.
+ * @names_delta: the beginning of the LLVM profile names section.
+ * @value_kind_last: the last profile value kind.
+ */
+struct llvm_prf_header {
+	u64 magic;
+	u64 version;
+	u64 data_size;
+	u64 padding_bytes_before_counters;
+	u64 counters_size;
+	u64 padding_bytes_after_counters;
+	u64 names_size;
+	u64 counters_delta;
+	u64 names_delta;
+	u64 value_kind_last;
+};
+
+/**
+ * struct llvm_prf_data - represents the per-function control structure.
+ * @name_ref: the reference to the function's name.
+ * @func_hash: the hash value of the function.
+ * @counter_ptr: a pointer to the profile counter.
+ * @function_ptr: a pointer to the function.
+ * @values: the profiling values associated with this function.
+ * @num_counters: the number of counters in the function.
+ * @num_value_sites: the number of value profile sites.
+ */
+struct llvm_prf_data {
+	const u64 name_ref;
+	const u64 func_hash;
+	const void *counter_ptr;
+	const void *function_ptr;
+	void *values;
+	const u32 num_counters;
+	const u16 num_value_sites[LLVM_INSTR_PROF_IPVK_LAST + 1];
+} __aligned(LLVM_INSTR_PROF_DATA_ALIGNMENT);
+
+/**
+ * struct llvm_prf_value_node_data - represents the data part of the struct
+ *   llvm_prf_value_node data structure.
+ * @value: the profiled target value.
+ * @count: the number of times the value was seen.
+ */
+struct llvm_prf_value_node_data {
+	u64 value;
+	u64 count;
+};
+
+/**
+ * struct llvm_prf_value_node - represents an internal data structure used by
+ *   the value profiler.
+ * @value: the profiled target value.
+ * @count: the number of times the value was seen.
+ * @next: the next value node.
+ */
+struct llvm_prf_value_node {
+	u64 value;
+	u64 count;
+	struct llvm_prf_value_node *next;
+};
+
+/**
+ * struct llvm_prf_value_data - represents the value profiling data in indexed
+ *   format.
+ * @total_size: the total size in bytes including this field.
+ * @num_value_kinds: the number of value profile kinds that have value profile
+ *   data.
+ */
+struct llvm_prf_value_data {
+	u32 total_size;
+	u32 num_value_kinds;
+};
+
+/**
+ * struct llvm_prf_value_record - represents the on-disk layout of the value
+ *   profile data of a particular kind for one function.
+ * @kind: the kind of the value profile record.
+ * @num_value_sites: the number of value profile sites.
+ * @site_count_array: the first element of the array that stores the number
+ *   of profiled values for each value site.
+ */
+struct llvm_prf_value_record {
+	u32 kind;
+	u32 num_value_sites;
+	u8 site_count_array[];
+};
+
+#define prf_get_value_record_header_size()		\
+	offsetof(struct llvm_prf_value_record, site_count_array)
+#define prf_get_value_record_site_count_size(sites)	\
+	roundup((sites), 8)
+#define prf_get_value_record_size(sites)		\
+	(prf_get_value_record_header_size() +		\
+	 prf_get_value_record_site_count_size((sites)))
+
+/* Data sections */
+extern struct llvm_prf_data __llvm_prf_data_start[];
+extern struct llvm_prf_data __llvm_prf_data_end[];
+
+extern u64 __llvm_prf_cnts_start[];
+extern u64 __llvm_prf_cnts_end[];
+
+extern char __llvm_prf_names_start[];
+extern char __llvm_prf_names_end[];
+
+extern struct llvm_prf_value_node __llvm_prf_vnds_start[];
+extern struct llvm_prf_value_node __llvm_prf_vnds_end[];
+
+/* Locking for vnodes */
+extern unsigned long prf_lock(void);
+extern void prf_unlock(unsigned long flags);
+
+#define __DEFINE_PRF_SIZE(s) \
+	static inline unsigned long prf_ ## s ## _size(void)		\
+	{								\
+		unsigned long start =					\
+			(unsigned long)__llvm_prf_ ## s ## _start;	\
+		unsigned long end =					\
+			(unsigned long)__llvm_prf_ ## s ## _end;	\
+		return roundup(end - start,				\
+				sizeof(__llvm_prf_ ## s ## _start[0]));	\
+	}								\
+	static inline unsigned long prf_ ## s ## _count(void)		\
+	{								\
+		return prf_ ## s ## _size() /				\
+			sizeof(__llvm_prf_ ## s ## _start[0]);		\
+	}
+
+__DEFINE_PRF_SIZE(data);
+__DEFINE_PRF_SIZE(cnts);
+__DEFINE_PRF_SIZE(names);
+__DEFINE_PRF_SIZE(vnds);
+
+#undef __DEFINE_PRF_SIZE
+
+#endif /* _PGO_H */
diff --git a/scripts/Makefile.lib b/scripts/Makefile.lib
index 8cd67b1b6d15..d411e92dd0d6 100644
--- a/scripts/Makefile.lib
+++ b/scripts/Makefile.lib
@@ -139,6 +139,16 @@  _c_flags += $(if $(patsubst n%,, \
 		$(CFLAGS_GCOV))
 endif
 
+#
+# Enable clang's PGO profiling flags for a file or directory depending on
+# variables PGO_PROFILE_obj.o and PGO_PROFILE.
+#
+ifeq ($(CONFIG_PGO_CLANG),y)
+_c_flags += $(if $(patsubst n%,, \
+		$(PGO_PROFILE_$(basetarget).o)$(PGO_PROFILE)y), \
+		$(CFLAGS_PGO_CLANG))
+endif
+
 #
 # Enable address sanitizer flags for kernel except some files or directories
 # we don't want to check (depends on variables KASAN_SANITIZE_obj.o, KASAN_SANITIZE)