[v6,00/10] Enable haltpoll on arm64

Message ID 20240726201332.626395-1-ankur.a.arora@oracle.com (mailing list archive)

Ankur Arora July 26, 2024, 8:13 p.m. UTC
This patchset enables the cpuidle-haltpoll driver and its namesake
governor on arm64. This is of particular interest for KVM guests,
where it reduces IPC latencies.

Comparing idle switching latencies on an arm64 KVM guest with 
perf bench sched pipe:

                                     usecs/op       %stdev   

  no haltpoll (baseline)               13.48       +-  5.19%
  with haltpoll                         6.84       +- 22.07%


No change in performance for a similar test on x86:

                                     usecs/op        %stdev   

  haltpoll w/ cpu_relax() (baseline)     4.75      +-  1.76%
  haltpoll w/ smp_cond_load_relaxed()    4.78      +-  2.31%

Both sets of tests were on otherwise idle systems with guest VCPUs
pinned to specific PCPUs. One reason for the higher stdev on arm64
is that trapping of the WFE instruction by the host KVM is contingent
on the number of tasks on the runqueue.


The patch series is organized in three parts: 

 - patch 1 reorganizes the poll_idle() loop, switching to
   smp_cond_load_relaxed() in the polling loop.
   Relatedly, patches 2 and 3 rework the config option
   ARCH_HAS_CPU_RELAX, renaming it to ARCH_HAS_OPTIMIZED_POLL.

 - patches 4-6 reorganize the haltpoll selection and init logic
   to allow architecture code to select it. 

 - and finally, patches 7-10 add the bits for arm64 support.
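
As a rough illustration of the polling pattern behind patch 1, here is a
userspace sketch. It is not the kernel code: poll_until_set(), now_ns()
and RELAX_COUNT are hypothetical names, and a C11 relaxed atomic load
stands in for smp_cond_load_relaxed(). The sketch assumes the clock is
only read once every RELAX_COUNT iterations so that the time check stays
off the fast path.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical: number of flag checks between two reads of the clock. */
#define RELAX_COUNT 200

/* Monotonic time in nanoseconds. */
static uint64_t now_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/*
 * Poll *flag until it becomes non-zero or limit_ns elapses.
 * Returns true if the flag was observed set, false on timeout.
 */
static bool poll_until_set(atomic_int *flag, uint64_t limit_ns)
{
	uint64_t start = now_ns();
	unsigned int loops = 0;

	assert(flag != NULL);

	for (;;) {
		/* Relaxed load: only the value matters, not ordering. */
		if (atomic_load_explicit(flag, memory_order_relaxed))
			return true;

		/* Check the time limit only every RELAX_COUNT iterations. */
		if (++loops < RELAX_COUNT)
			continue;
		loops = 0;

		if (now_ns() - start > limit_ns)
			return false;
	}
}
```

On hardware that can wait on a memory location, the load in the loop is
where an architecture-optimized wait would replace the busy spin; the
sketch above always spins.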


What is still missing: this series largely completes the haltpoll side
of functionality for arm64. There are, however, a few related areas
that still need to be thrashed out:

 - WFET support: WFE on arm64 does not guarantee that poll_idle()
   would terminate within halt_poll_ns. Using WFET would address this.
 - KVM_NO_POLL support on arm64
 - KVM TWED support on arm64: allow the host to limit time spent in
   WFE.


Changelog:

v6:

 - reordered the patches to keep poll_idle() and ARCH_HAS_OPTIMIZED_POLL
   changes together (comment from Christoph Lameter)
 - fleshes out the commit messages a bit more (comments from Christoph
   Lameter, Sudeep Holla)
 - also rework selection of cpuidle-haltpoll. Now selected based
   on the architectural selection of ARCH_CPUIDLE_HALTPOLL.
 - moved back to arch_haltpoll_want() (comment from Joao Martins)
   Also, arch_haltpoll_want() now takes the force parameter and is
   now responsible for the complete selection (or not) of haltpoll.
 - fixes the build breakage on i386
 - fixes the cpuidle-haltpoll module breakage on arm64 (comment from
   Tomohiro Misono, Haris Okanovic)
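
To make the arch_haltpoll_want(force) item above concrete, here is a
minimal sketch of a selection hook of that shape. It is an assumption
for illustration only, not the code in this series: struct guest_caps,
its fields and haltpoll_want() are hypothetical names, and the actual
per-architecture conditions may differ.

```c
#include <stdbool.h>

/* Hypothetical summary of what the platform reports; not a kernel type. */
struct guest_caps {
	bool is_guest;        /* running under a hypervisor */
	bool poll_supported;  /* architecture can poll efficiently */
};

/*
 * Single decision point: returns true if haltpoll should be selected.
 * In this sketch the force parameter overrides the detected state.
 */
static bool haltpoll_want(const struct guest_caps *caps, bool force)
{
	if (force)
		return true;

	return caps->is_guest && caps->poll_supported;
}
```

The point of the shape is that the driver asks one question and the
architecture code owns the whole answer, including how force is treated.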


v5:
 - rework the poll_idle() loop around smp_cond_load_relaxed() (review
   comment from Tomohiro Misono.)
 - also rework selection of cpuidle-haltpoll. Now selected based
   on the architectural selection of ARCH_CPUIDLE_HALTPOLL.
 - arch_haltpoll_supported() (renamed from arch_haltpoll_want()) on
   arm64 now depends on the event-stream being enabled.
 - limit POLL_IDLE_RELAX_COUNT on arm64 (review comment from Haris Okanovic)
 - ARCH_HAS_CPU_RELAX is now renamed to ARCH_HAS_OPTIMIZED_POLL.

v4 changes from v3:
 - change 7/8 per Rafael input: drop the parens and use ret for the final check
 - add 8/8 which renames the guard for building poll_state

v3 changes from v2:
 - fix 1/7 per Petr Mladek - remove ARCH_HAS_CPU_RELAX from arch/x86/Kconfig
 - add Acked-by from Rafael Wysocki on 2/7

v2 changes from v1:
 - added patch 7 where we replace cpu_relax() with smp_cond_load_relaxed() per PeterZ
   (this improves by 50% at least the CPU cycles consumed in the tests above:
   10,716,881,137 now vs 14,503,014,257 before)
 - removed the ifdef from patch 1 per RafaelW

Please review.

Ankur Arora (5):
  cpuidle: rename ARCH_HAS_CPU_RELAX to ARCH_HAS_OPTIMIZED_POLL
  cpuidle-haltpoll: condition on ARCH_CPUIDLE_HALTPOLL
  arm64: idle: export arch_cpu_idle
  arm64: support cpuidle-haltpoll
  cpuidle/poll_state: limit POLL_IDLE_RELAX_COUNT on arm64

Joao Martins (4):
  Kconfig: move ARCH_HAS_OPTIMIZED_POLL to arch/Kconfig
  cpuidle-haltpoll: define arch_haltpoll_want()
  governors/haltpoll: drop kvm_para_available() check
  arm64: define TIF_POLLING_NRFLAG

Mihai Carabas (1):
  cpuidle/poll_state: poll via smp_cond_load_relaxed()

 arch/Kconfig                              |  3 +++
 arch/arm64/Kconfig                        | 10 ++++++++++
 arch/arm64/include/asm/cpuidle_haltpoll.h |  9 +++++++++
 arch/arm64/include/asm/thread_info.h      |  2 ++
 arch/arm64/kernel/cpuidle.c               | 23 +++++++++++++++++++++++
 arch/arm64/kernel/idle.c                  |  1 +
 arch/x86/Kconfig                          |  5 ++---
 arch/x86/include/asm/cpuidle_haltpoll.h   |  1 +
 arch/x86/kernel/kvm.c                     | 13 +++++++++++++
 drivers/acpi/processor_idle.c             |  4 ++--
 drivers/cpuidle/Kconfig                   |  5 ++---
 drivers/cpuidle/Makefile                  |  2 +-
 drivers/cpuidle/cpuidle-haltpoll.c        | 12 +-----------
 drivers/cpuidle/governors/haltpoll.c      |  6 +-----
 drivers/cpuidle/poll_state.c              | 21 ++++++++++++++++-----
 drivers/idle/Kconfig                      |  1 +
 include/linux/cpuidle.h                   |  2 +-
 include/linux/cpuidle_haltpoll.h          |  5 +++++
 18 files changed, 94 insertions(+), 31 deletions(-)
 create mode 100644 arch/arm64/include/asm/cpuidle_haltpoll.h

Comments

Tomohiro Misono (Fujitsu) Aug. 9, 2024, 6:02 a.m. UTC | #1
> Subject: [PATCH v6 00/10] Enable haltpoll on arm64
> 
> This patchset enables the cpuidle-haltpoll driver and its namesake
> governor on arm64. This is specifically interesting for KVM guests by
> reducing IPC latencies.
> 
> Comparing idle switching latencies on an arm64 KVM guest with
> perf bench sched pipe:
> 
>                                      usecs/op       %stdev
> 
>   no haltpoll (baseline)               13.48       +-  5.19%
>   with haltpoll                         6.84       +- 22.07%

I got similar results with a VM on a Grace machine (patches applied to 6.10).

[default]
# cat /sys/devices/system/cpu/cpuidle/current_driver
none
# perf bench sched pipe
# Running 'sched/pipe' benchmark:
# Executed 1000000 pipe operations between two processes

     Total time: 23.832 [sec]

      23.832644 usecs/op
          41959 ops/sec

[With "cpuidle-haltpoll.force=1" commandline]
# cat /sys/devices/system/cpu/cpuidle/current_driver
haltpoll
# perf bench sched pipe
# Running 'sched/pipe' benchmark:
# Executed 1000000 pipe operations between two processes

     Total time: 6.340 [sec]

       6.340116 usecs/op
         157725 ops/sec

Tested-by: Misono Tomohiro <misono.tomohiro@fujitsu.com>
Regards,
Tomohiro


Ankur Arora Aug. 12, 2024, 10:43 p.m. UTC | #2
Tomohiro Misono (Fujitsu) <misono.tomohiro@fujitsu.com> writes:

>> Subject: [PATCH v6 00/10] Enable haltpoll on arm64
>>
>> This patchset enables the cpuidle-haltpoll driver and its namesake
>> governor on arm64. This is specifically interesting for KVM guests by
>> reducing IPC latencies.
>>
>> Comparing idle switching latencies on an arm64 KVM guest with
>> perf bench sched pipe:
>>
>>                                      usecs/op       %stdev
>>
>>   no haltpoll (baseline)               13.48       +-  5.19%
>>   with haltpoll                         6.84       +- 22.07%
>
> I got similar results with VM on Grace machine (applied to 6.10).

Great. Thanks for testing.

> [default]
> # cat /sys/devices/system/cpu/cpuidle/current_driver
> none
> # perf bench sched pipe
> # Running 'sched/pipe' benchmark:
> # Executed 1000000 pipe operations between two processes
>
>      Total time: 23.832 [sec]
>
>       23.832644 usecs/op
>           41959 ops/sec
>
> [With "cpuidle-haltpoll.force=1" commandline]
> # cat /sys/devices/system/cpu/cpuidle/current_driver
> haltpoll
> # perf bench sched pipe
> # Running 'sched/pipe' benchmark:
> # Executed 1000000 pipe operations between two processes
>
>      Total time: 6.340 [sec]
>
>        6.340116 usecs/op
>          157725 ops/sec
>
> Tested-by: Misono Tomohiro <misono.tomohiro@fujitsu.com>

Thanks!

--
ankur