Message ID | 20241115051107.3374417-1-yabinc@google.com (mailing list archive) |
---|---|
State | New |
Series | arm64: Allow CONFIG_AUTOFDO_CLANG to be selected |
Hi Yabin,

Thanks for working to enable this on ARM64 and test the performance on Android!
Please see my comments below.

Thanks,

-Rong

On Thu, Nov 14, 2024 at 9:11 PM Yabin Cui <yabinc@google.com> wrote:
>
> Select ARCH_SUPPORTS_AUTOFDO_CLANG to allow AUTOFDO_CLANG to be
> selected.
>
> On ARM64, ETM traces can be recorded and converted to AutoFDO profiles.
> Experiments on Android show 4% improvement in cold app startup time
> and 13% improvement in binder benchmarks.
>
> Signed-off-by: Yabin Cui <yabinc@google.com>
> ---
>  Documentation/dev-tools/autofdo.rst | 18 +++++++++++++++++-
>  arch/arm64/Kconfig                  |  1 +
>  2 files changed, 18 insertions(+), 1 deletion(-)
>
> diff --git a/Documentation/dev-tools/autofdo.rst b/Documentation/dev-tools/autofdo.rst
> index 1f0a451e9ccd..f0952e3e8490 100644
> --- a/Documentation/dev-tools/autofdo.rst
> +++ b/Documentation/dev-tools/autofdo.rst
> @@ -55,7 +55,7 @@ process consists of the following steps:
>     workload to gather execution frequency data. This data is
>     collected using hardware sampling, via perf. AutoFDO is most
>     effective on platforms supporting advanced PMU features like
> -   LBR on Intel machines.
> +   LBR on Intel machines, ETM traces on ARM machines.
>
>  #. AutoFDO profile generation: Perf output file is converted to
>     the AutoFDO profile via offline tools.
> @@ -141,6 +141,22 @@ Here is an example workflow for AutoFDO kernel:
>
>        $ perf record --pfm-events RETIRED_TAKEN_BRANCH_INSTRUCTIONS:k -a -N -b -c <count> -o <perf_file> -- <loadtest>
>
> +   - For ARM platforms:

The instructions for SPE might be different.
Can we change to "- For ARM platforms with ETM trace:"

> +
> +     Follow the instructions in the `Linaro OpenCSD document
> +     https://github.com/Linaro/OpenCSD/blob/master/decoder/tests/auto-fdo/autofdo.md`_
> +     to record ETM traces for AutoFDO::
> +
> +      $ perf record -e cs_etm/@tmc_etr0/k -a -o <etm_perf_file> -- <loadtest>
> +      $ perf inject -i <etm_perf_file> -o <perf_file> --itrace=i500009il
> +
> +     For ARM platforms running Android, follow the instructions in the
> +     `Android simpleperf document
> +     <https://android.googlesource.com/platform/system/extras/+/main/simpleperf/doc/collect_etm_data_for_autofdo.md>`_
> +     to record ETM traces for AutoFDO::

The instructions in "Step 3: Convert ETM data to AutoFDO profile"
currently use create_llvm_prof to generate a "binary" profile format.
This is incompatible with the default FSAFDO format used for the kernel,
which requires an "extbinary" format.

To correct this, please update the instructions to include the flag
"-format extbinary" in the create_llvm_prof command.

Using a non-FSAFDO profile with FSAFDO can negatively impact performance.
Therefore, I recommend rerunning the test with the updated flag to
potentially achieve better results.

> +
> +      $ simpleperf record -e cs-etm:k -a -o <perf_file> -- <loadtest>
> +
>  4) (Optional) Download the raw perf file to the host machine.
>
>  5) To generate an AutoFDO profile, two offline tools are available:
> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> index fd9df6dcc593..c3814df5e391 100644
> --- a/arch/arm64/Kconfig
> +++ b/arch/arm64/Kconfig
> @@ -103,6 +103,7 @@ config ARM64
>  	select ARCH_SUPPORTS_PER_VMA_LOCK
>  	select ARCH_SUPPORTS_HUGE_PFNMAP if TRANSPARENT_HUGEPAGE
>  	select ARCH_SUPPORTS_RT
> +	select ARCH_SUPPORTS_AUTOFDO_CLANG
>  	select ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
>  	select ARCH_WANT_COMPAT_IPC_PARSE_VERSION if COMPAT
>  	select ARCH_WANT_DEFAULT_BPF_JIT
> --
> 2.47.0.338.g60cca15819-goog
>
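
The profile-format request in the review can be illustrated with a small dry-run sketch. It only prints a create_llvm_prof command line that asks for the extbinary format; it does not run the tool. The helper name build_profile_cmd and the file names are hypothetical, the `--format=extbinary` spelling is assumed to be equivalent to the reviewer's "-format extbinary", and any other options a particular flow needs (for example, selecting the input profiler type) are deliberately left out.

```shell
#!/bin/sh
# Dry-run sketch: print (do not execute) a create_llvm_prof command line that
# requests the extbinary output format, as asked for in the review.
# build_profile_cmd and all file names are hypothetical placeholders.
build_profile_cmd() {
    vmlinux=$1      # kernel image the profile is generated against
    perf_file=$2    # perf data from "perf inject" or simpleperf
    out_file=$3     # AutoFDO profile to write
    printf 'create_llvm_prof --binary=%s --profile=%s --format=extbinary --out=%s\n' \
        "$vmlinux" "$perf_file" "$out_file"
}

build_profile_cmd vmlinux perf.data kernel.afdo
```

Printing the command rather than running it keeps the sketch independent of whether create_llvm_prof is installed; the printed line can be reviewed and adapted before use.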
diff --git a/Documentation/dev-tools/autofdo.rst b/Documentation/dev-tools/autofdo.rst
index 1f0a451e9ccd..f0952e3e8490 100644
--- a/Documentation/dev-tools/autofdo.rst
+++ b/Documentation/dev-tools/autofdo.rst
@@ -55,7 +55,7 @@ process consists of the following steps:
    workload to gather execution frequency data. This data is
    collected using hardware sampling, via perf. AutoFDO is most
    effective on platforms supporting advanced PMU features like
-   LBR on Intel machines.
+   LBR on Intel machines, ETM traces on ARM machines.
 
 #. AutoFDO profile generation: Perf output file is converted to
    the AutoFDO profile via offline tools.
@@ -141,6 +141,22 @@ Here is an example workflow for AutoFDO kernel:
 
       $ perf record --pfm-events RETIRED_TAKEN_BRANCH_INSTRUCTIONS:k -a -N -b -c <count> -o <perf_file> -- <loadtest>
 
+   - For ARM platforms:
+
+     Follow the instructions in the `Linaro OpenCSD document
+     https://github.com/Linaro/OpenCSD/blob/master/decoder/tests/auto-fdo/autofdo.md`_
+     to record ETM traces for AutoFDO::
+
+      $ perf record -e cs_etm/@tmc_etr0/k -a -o <etm_perf_file> -- <loadtest>
+      $ perf inject -i <etm_perf_file> -o <perf_file> --itrace=i500009il
+
+     For ARM platforms running Android, follow the instructions in the
+     `Android simpleperf document
+     <https://android.googlesource.com/platform/system/extras/+/main/simpleperf/doc/collect_etm_data_for_autofdo.md>`_
+     to record ETM traces for AutoFDO::
+
+      $ simpleperf record -e cs-etm:k -a -o <perf_file> -- <loadtest>
+
 4) (Optional) Download the raw perf file to the host machine.
 
 5) To generate an AutoFDO profile, two offline tools are available:
diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index fd9df6dcc593..c3814df5e391 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -103,6 +103,7 @@ config ARM64
 	select ARCH_SUPPORTS_PER_VMA_LOCK
 	select ARCH_SUPPORTS_HUGE_PFNMAP if TRANSPARENT_HUGEPAGE
 	select ARCH_SUPPORTS_RT
+	select ARCH_SUPPORTS_AUTOFDO_CLANG
 	select ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
 	select ARCH_WANT_COMPAT_IPC_PARSE_VERSION if COMPAT
 	select ARCH_WANT_DEFAULT_BPF_JIT
Select ARCH_SUPPORTS_AUTOFDO_CLANG to allow AUTOFDO_CLANG to be
selected.

On ARM64, ETM traces can be recorded and converted to AutoFDO profiles.
Experiments on Android show 4% improvement in cold app startup time
and 13% improvement in binder benchmarks.

Signed-off-by: Yabin Cui <yabinc@google.com>
---
 Documentation/dev-tools/autofdo.rst | 18 +++++++++++++++++-
 arch/arm64/Kconfig                  |  1 +
 2 files changed, 18 insertions(+), 1 deletion(-)