
[v8,00/11] Enable haltpoll on arm64

Message ID 20240925232425.2763385-1-ankur.a.arora@oracle.com (mailing list archive)

Message

Ankur Arora Sept. 25, 2024, 11:24 p.m. UTC
This patchset enables the cpuidle-haltpoll driver and its namesake
governor on arm64. This is particularly interesting for KVM guests,
where polling in idle reduces IPC latencies.

Comparing idle switching latencies on an arm64 KVM guest with 
perf bench sched pipe:

                                     usecs/op       %stdev   

  no haltpoll (baseline)               13.48       +-  5.19%
  with haltpoll                         6.84       +- 22.07%


No change in performance for a similar test on x86:

                                     usecs/op        %stdev   

  haltpoll w/ cpu_relax() (baseline)     4.75      +-  1.76%
  haltpoll w/ smp_cond_load_relaxed()    4.78      +-  2.31%

Both sets of tests were run on otherwise idle systems with guest VCPUs
pinned to specific PCPUs. One reason for the higher stdev on arm64 is
that whether the host KVM traps the WFE instruction is contingent on
the number of tasks on the runqueue.

Tomohiro Misono and Haris Okanovic also report similar latency
improvements on Grace and Graviton systems (for v7) [1] [2].

The patch series is organized in three parts: 

 - patch 1 reorganizes the poll_idle() loop, switching to
   smp_cond_load_relaxed() in the polling loop.
   Relatedly, patches 2 and 3 rename the config option ARCH_HAS_CPU_RELAX
   to ARCH_HAS_OPTIMIZED_POLL.

 - patches 4-6 reorganize the haltpoll selection and init logic
   to allow architecture code to select it. 

 - and finally, patches 7-11 add the bits for arm64 support.

What is still missing: this series largely completes the haltpoll side
of the functionality for arm64. There are, however, a few related areas
that still need to be worked out:

 - WFET support: WFE on arm64 does not guarantee that poll_idle()
   terminates within halt_poll_ns. Using WFET would address this.
 - KVM_NO_POLL support on arm64
 - KVM TWED support on arm64: allow the host to limit time spent in
   WFE.


Changelog:

v8: No logic changes. Largely a respin of v7, with changes
noted below:

 - move selection of ARCH_HAS_OPTIMIZED_POLL on arm64 to its
   own patch.
   (patch-9 "arm64: select ARCH_HAS_OPTIMIZED_POLL")
   
 - address comments simplifying arm64 support (Will Deacon)
   (patch-11 "arm64: support cpuidle-haltpoll")

v7: No significant logic changes. Mostly a respin of v6.

 - minor cleanup in poll_idle() (Christoph Lameter)
 - fixes conflicts due to code movement in arch/arm64/kernel/cpuidle.c
   (Tomohiro Misono)

v6:

 - reordered the patches to keep poll_idle() and ARCH_HAS_OPTIMIZED_POLL
   changes together (comment from Christoph Lameter)
 - fleshes out the commit messages a bit more (comments from Christoph
   Lameter, Sudeep Holla)
 - also rework selection of cpuidle-haltpoll. Now selected based
   on the architectural selection of ARCH_CPUIDLE_HALTPOLL.
 - moved back to arch_haltpoll_want() (comment from Joao Martins)
   Also, arch_haltpoll_want() now takes the force parameter and is
   now responsible for the complete selection (or not) of haltpoll.
 - fixes the build breakage on i386
 - fixes the cpuidle-haltpoll module breakage on arm64 (comment from
   Tomohiro Misono, Haris Okanovic)


v5:
 - rework the poll_idle() loop around smp_cond_load_relaxed() (review
   comment from Tomohiro Misono.)
 - also rework selection of cpuidle-haltpoll. Now selected based
   on the architectural selection of ARCH_CPUIDLE_HALTPOLL.
 - arch_haltpoll_supported() (renamed from arch_haltpoll_want()) on
   arm64 now depends on the event-stream being enabled.
 - limit POLL_IDLE_RELAX_COUNT on arm64 (review comment from Haris Okanovic)
 - ARCH_HAS_CPU_RELAX is now renamed to ARCH_HAS_OPTIMIZED_POLL.

v4 changes from v3:
 - change 7/8 per Rafael's input: drop the parens and use ret for the final check
 - add 8/8 which renames the guard for building poll_state

v3 changes from v2:
 - fix 1/7 per Petr Mladek - remove ARCH_HAS_CPU_RELAX from arch/x86/Kconfig
 - add Acked-by from Rafael Wysocki on 2/7

v2 changes from v1:
 - added patch 7 where we replace cpu_relax() with smp_cond_load_relaxed() per PeterZ
   (this improves by 50% at least the CPU cycles consumed in the tests above:
   10,716,881,137 now vs 14,503,014,257 before)
 - removed the ifdef from patch 1 per RafaelW

Please review.

[1] https://lore.kernel.org/lkml/TY3PR01MB111481E9B0AF263ACC8EA5D4AE5BA2@TY3PR01MB11148.jpnprd01.prod.outlook.com/
[2] https://lore.kernel.org/lkml/104d0ec31cb45477e27273e089402d4205ee4042.camel@amazon.com/

Ankur Arora (6):
  cpuidle: rename ARCH_HAS_CPU_RELAX to ARCH_HAS_OPTIMIZED_POLL
  cpuidle-haltpoll: condition on ARCH_CPUIDLE_HALTPOLL
  arm64: idle: export arch_cpu_idle
  arm64: select ARCH_HAS_OPTIMIZED_POLL
  cpuidle/poll_state: limit POLL_IDLE_RELAX_COUNT on arm64
  arm64: support cpuidle-haltpoll

Joao Martins (4):
  Kconfig: move ARCH_HAS_OPTIMIZED_POLL to arch/Kconfig
  cpuidle-haltpoll: define arch_haltpoll_want()
  governors/haltpoll: drop kvm_para_available() check
  arm64: define TIF_POLLING_NRFLAG

Mihai Carabas (1):
  cpuidle/poll_state: poll via smp_cond_load_relaxed()

 arch/Kconfig                              |  3 +++
 arch/arm64/Kconfig                        |  7 +++++++
 arch/arm64/include/asm/cpuidle_haltpoll.h | 24 +++++++++++++++++++++++
 arch/arm64/include/asm/thread_info.h      |  2 ++
 arch/arm64/kernel/idle.c                  |  1 +
 arch/x86/Kconfig                          |  5 ++---
 arch/x86/include/asm/cpuidle_haltpoll.h   |  1 +
 arch/x86/kernel/kvm.c                     | 13 ++++++++++++
 drivers/acpi/processor_idle.c             |  4 ++--
 drivers/cpuidle/Kconfig                   |  5 ++---
 drivers/cpuidle/Makefile                  |  2 +-
 drivers/cpuidle/cpuidle-haltpoll.c        | 12 +-----------
 drivers/cpuidle/governors/haltpoll.c      |  6 +-----
 drivers/cpuidle/poll_state.c              | 22 +++++++++++++++------
 drivers/idle/Kconfig                      |  1 +
 include/linux/cpuidle.h                   |  2 +-
 include/linux/cpuidle_haltpoll.h          |  5 +++++
 17 files changed, 83 insertions(+), 32 deletions(-)
 create mode 100644 arch/arm64/include/asm/cpuidle_haltpoll.h

Comments

Lifeng Zheng Oct. 9, 2024, 2:37 a.m. UTC | #1
On 2024/9/26 7:24, Ankur Arora wrote:
> This patchset enables the cpuidle-haltpoll driver and its namesake
> governor on arm64. This is specifically interesting for KVM guests by
> reducing IPC latencies.
> 
> Comparing idle switching latencies on an arm64 KVM guest with 
> perf bench sched pipe:
> 
>                                      usecs/op       %stdev   
> 
>   no haltpoll (baseline)               13.48       +-  5.19%
>   with haltpoll                         6.84       +- 22.07%
> 
> 
> No change in performance for a similar test on x86:
> 
>                                      usecs/op        %stdev   
> 
>   haltpoll w/ cpu_relax() (baseline)     4.75      +-  1.76%
>   haltpoll w/ smp_cond_load_relaxed()    4.78      +-  2.31%
> 
> Both sets of tests were on otherwise idle systems with guest VCPUs
> pinned to specific PCPUs. One reason for the higher stdev on arm64
> is that trapping of the WFE instruction by the host KVM is contingent
> on the number of tasks on the runqueue.
> 
> Tomohiro Misono and Haris Okanovic also report similar latency
> improvements on Grace and Graviton systems (for v7) [1] [2].
> 
> The patch series is organized in three parts: 
> 
>  - patch 1, reorganizes the poll_idle() loop, switching to
>    smp_cond_load_relaxed() in the polling loop.
>    Relatedly patches 2, 3 mangle the config option ARCH_HAS_CPU_RELAX,
>    renaming it to ARCH_HAS_OPTIMIZED_POLL.
> 
>  - patches 4-6 reorganize the haltpoll selection and init logic
>    to allow architecture code to select it. 
> 
>  - and finally, patches 7-11 add the bits for arm64 support.
> 
> What is still missing: this series largely completes the haltpoll side
> of functionality for arm64. There are, however, a few related areas
> that still need to be threshed out:
> 
>  - WFET support: WFE on arm64 does not guarantee that poll_idle()
>    would terminate in halt_poll_ns. Using WFET would address this.
>  - KVM_NO_POLL support on arm64
>  - KVM TWED support on arm64: allow the host to limit time spent in
>    WFE.
> 
> 
> Changelog:
> 
> v8: No logic changes. Largely respin of v7, with changes
> noted below:
> 
>  - move selection of ARCH_HAS_OPTIMIZED_POLL on arm64 to its
>    own patch.
>    (patch-9 "arm64: select ARCH_HAS_OPTIMIZED_POLL")
>    
>  - address comments simplifying arm64 support (Will Deacon)
>    (patch-11 "arm64: support cpuidle-haltpoll")
> 
> v7: No significant logic changes. Mostly a respin of v6.
> 
>  - minor cleanup in poll_idle() (Christoph Lameter)
>  - fixes conflicts due to code movement in arch/arm64/kernel/cpuidle.c
>    (Tomohiro Misono)
> 
> v6:
> 
>  - reordered the patches to keep poll_idle() and ARCH_HAS_OPTIMIZED_POLL
>    changes together (comment from Christoph Lameter)
>  - threshes out the commit messages a bit more (comments from Christoph
>    Lameter, Sudeep Holla)
>  - also rework selection of cpuidle-haltpoll. Now selected based
>    on the architectural selection of ARCH_CPUIDLE_HALTPOLL.
>  - moved back to arch_haltpoll_want() (comment from Joao Martins)
>    Also, arch_haltpoll_want() now takes the force parameter and is
>    now responsible for the complete selection (or not) of haltpoll.
>  - fixes the build breakage on i386
>  - fixes the cpuidle-haltpoll module breakage on arm64 (comment from
>    Tomohiro Misono, Haris Okanovic)
> 
> 
> v5:
>  - rework the poll_idle() loop around smp_cond_load_relaxed() (review
>    comment from Tomohiro Misono.)
>  - also rework selection of cpuidle-haltpoll. Now selected based
>    on the architectural selection of ARCH_CPUIDLE_HALTPOLL.
>  - arch_haltpoll_supported() (renamed from arch_haltpoll_want()) on
>    arm64 now depends on the event-stream being enabled.
>  - limit POLL_IDLE_RELAX_COUNT on arm64 (review comment from Haris Okanovic)
>  - ARCH_HAS_CPU_RELAX is now renamed to ARCH_HAS_OPTIMIZED_POLL.
> 
> v4 changes from v3:
>  - change 7/8 per Rafael input: drop the parens and use ret for the final check
>  - add 8/8 which renames the guard for building poll_state
> 
> v3 changes from v2:
>  - fix 1/7 per Petr Mladek - remove ARCH_HAS_CPU_RELAX from arch/x86/Kconfig
>  - add Ack-by from Rafael Wysocki on 2/7
> 
> v2 changes from v1:
>  - added patch 7 where we change cpu_relax with smp_cond_load_relaxed per PeterZ
>    (this improves by 50% at least the CPU cycles consumed in the tests above:
>    10,716,881,137 now vs 14,503,014,257 before)
>  - removed the ifdef from patch 1 per RafaelW
> 
> Please review.
> 
> [1] https://lore.kernel.org/lkml/TY3PR01MB111481E9B0AF263ACC8EA5D4AE5BA2@TY3PR01MB11148.jpnprd01.prod.outlook.com/
> [2] https://lore.kernel.org/lkml/104d0ec31cb45477e27273e089402d4205ee4042.camel@amazon.com/
> 
> Ankur Arora (6):
>   cpuidle: rename ARCH_HAS_CPU_RELAX to ARCH_HAS_OPTIMIZED_POLL
>   cpuidle-haltpoll: condition on ARCH_CPUIDLE_HALTPOLL
>   arm64: idle: export arch_cpu_idle
>   arm64: select ARCH_HAS_OPTIMIZED_POLL
>   cpuidle/poll_state: limit POLL_IDLE_RELAX_COUNT on arm64
>   arm64: support cpuidle-haltpoll
> 
> Joao Martins (4):
>   Kconfig: move ARCH_HAS_OPTIMIZED_POLL to arch/Kconfig
>   cpuidle-haltpoll: define arch_haltpoll_want()
>   governors/haltpoll: drop kvm_para_available() check
>   arm64: define TIF_POLLING_NRFLAG
> 
> Mihai Carabas (1):
>   cpuidle/poll_state: poll via smp_cond_load_relaxed()
> 
>  arch/Kconfig                              |  3 +++
>  arch/arm64/Kconfig                        |  7 +++++++
>  arch/arm64/include/asm/cpuidle_haltpoll.h | 24 +++++++++++++++++++++++
>  arch/arm64/include/asm/thread_info.h      |  2 ++
>  arch/arm64/kernel/idle.c                  |  1 +
>  arch/x86/Kconfig                          |  5 ++---
>  arch/x86/include/asm/cpuidle_haltpoll.h   |  1 +
>  arch/x86/kernel/kvm.c                     | 13 ++++++++++++
>  drivers/acpi/processor_idle.c             |  4 ++--
>  drivers/cpuidle/Kconfig                   |  5 ++---
>  drivers/cpuidle/Makefile                  |  2 +-
>  drivers/cpuidle/cpuidle-haltpoll.c        | 12 +-----------
>  drivers/cpuidle/governors/haltpoll.c      |  6 +-----
>  drivers/cpuidle/poll_state.c              | 22 +++++++++++++++------
>  drivers/idle/Kconfig                      |  1 +
>  include/linux/cpuidle.h                   |  2 +-
>  include/linux/cpuidle_haltpoll.h          |  5 +++++
>  17 files changed, 83 insertions(+), 32 deletions(-)
>  create mode 100644 arch/arm64/include/asm/cpuidle_haltpoll.h
> 

Hi Ankur,

Thanks for the patches!

We have tested these patches on our machine, with an adaptation for ACPI
LPI states rather than C-states.

Including the polling state, there are three states to enter. Comparing
idle switching latencies of the different states with perf bench sched pipe:

                                     usecs/op       %stdev   

  state0(polling state)                7.36       +-  0.35%
  state1                               8.78       +-  0.46%
  state2                              77.32       +-  5.50%
  
It turns out that it works on our machine.

Tested-by: Lifeng Zheng <zhenglifeng1@huawei.com>

The adaptation for ACPI LPI states is shown below as a patch. Feel free
to include this patch as part of your series, or I can send it out after
your series is merged.

From: Lifeng Zheng <zhenglifeng1@huawei.com>

ACPI: processor_idle: Support polling state for LPI

Initialize an optional polling state besides LPI states.

Wrap up a new enter method to correctly reflect the actual entered state
when the polling state is enabled.

Signed-off-by: Lifeng Zheng <zhenglifeng1@huawei.com>
Reviewed-by: Jie Zhan <zhanjie9@hisilicon.com>
---
 drivers/acpi/processor_idle.c | 39 ++++++++++++++++++++++++++++++-----
 1 file changed, 34 insertions(+), 5 deletions(-)

diff --git a/drivers/acpi/processor_idle.c b/drivers/acpi/processor_idle.c
index 44096406d65d..d154b5d77328 100644
--- a/drivers/acpi/processor_idle.c
+++ b/drivers/acpi/processor_idle.c
@@ -1194,20 +1194,46 @@ static int acpi_idle_lpi_enter(struct cpuidle_device *dev,
 	return -EINVAL;
 }
 
+/* To correctly reflect the entered state if the poll state is enabled. */
+static int acpi_idle_lpi_enter_with_poll_state(struct cpuidle_device *dev,
+			       struct cpuidle_driver *drv, int index)
+{
+	int entered_state;
+
+	if (unlikely(index < 1))
+		return -EINVAL;
+
+	entered_state = acpi_idle_lpi_enter(dev, drv, index - 1);
+	if (entered_state < 0)
+		return entered_state;
+
+	return entered_state + 1;
+}
+
 static int acpi_processor_setup_lpi_states(struct acpi_processor *pr)
 {
-	int i;
+	int i, count;
 	struct acpi_lpi_state *lpi;
 	struct cpuidle_state *state;
 	struct cpuidle_driver *drv = &acpi_idle_driver;
+	typeof(state->enter) enter_method;
 
 	if (!pr->flags.has_lpi)
 		return -EOPNOTSUPP;
 
+	if (IS_ENABLED(CONFIG_ARCH_HAS_OPTIMIZED_POLL)) {
+		cpuidle_poll_state_init(drv);
+		count = 1;
+		enter_method = acpi_idle_lpi_enter_with_poll_state;
+	} else {
+		count = 0;
+		enter_method = acpi_idle_lpi_enter;
+	}
+
 	for (i = 0; i < pr->power.count && i < CPUIDLE_STATE_MAX; i++) {
 		lpi = &pr->power.lpi_states[i];
 
-		state = &drv->states[i];
+		state = &drv->states[count];
 		snprintf(state->name, CPUIDLE_NAME_LEN, "LPI-%d", i);
 		strscpy(state->desc, lpi->desc, CPUIDLE_DESC_LEN);
 		state->exit_latency = lpi->wake_latency;
@@ -1215,11 +1241,14 @@ static int acpi_processor_setup_lpi_states(struct acpi_processor *pr)
 		state->flags |= arch_get_idle_state_flags(lpi->arch_flags);
 		if (i != 0 && lpi->entry_method == ACPI_CSTATE_FFH)
 			state->flags |= CPUIDLE_FLAG_RCU_IDLE;
-		state->enter = acpi_idle_lpi_enter;
-		drv->safe_state_index = i;
+		state->enter = enter_method;
+		drv->safe_state_index = count;
+		count++;
+		if (count == CPUIDLE_STATE_MAX)
+			break;
 	}
 
-	drv->state_count = i;
+	drv->state_count = count;
 
 	return 0;
 }
Christoph Lameter (Ampere) Oct. 14, 2024, 10:54 p.m. UTC | #2
Various versions of this patchset have been extensively tested at Ampere
with numerous benchmarks by our performance testing groups.

Please merge this. We will continue to contribute to this patchset.
Ankur Arora Oct. 15, 2024, 1:53 a.m. UTC | #3
zhenglifeng (A) <zhenglifeng1@huawei.com> writes:

> On 2024/9/26 7:24, Ankur Arora wrote:
>> This patchset enables the cpuidle-haltpoll driver and its namesake
>> governor on arm64. This is specifically interesting for KVM guests by
>> reducing IPC latencies.
>>
>> Comparing idle switching latencies on an arm64 KVM guest with
>> perf bench sched pipe:
>>
>>                                      usecs/op       %stdev
>>
>>   no haltpoll (baseline)               13.48       +-  5.19%
>>   with haltpoll                         6.84       +- 22.07%
>>
>>
>> No change in performance for a similar test on x86:
>>
>>                                      usecs/op        %stdev
>>
>>   haltpoll w/ cpu_relax() (baseline)     4.75      +-  1.76%
>>   haltpoll w/ smp_cond_load_relaxed()    4.78      +-  2.31%
>>
>> Both sets of tests were on otherwise idle systems with guest VCPUs
>> pinned to specific PCPUs. One reason for the higher stdev on arm64
>> is that trapping of the WFE instruction by the host KVM is contingent
>> on the number of tasks on the runqueue.
>>
>> Tomohiro Misono and Haris Okanovic also report similar latency
>> improvements on Grace and Graviton systems (for v7) [1] [2].
>>
>> The patch series is organized in three parts:
>>
>>  - patch 1, reorganizes the poll_idle() loop, switching to
>>    smp_cond_load_relaxed() in the polling loop.
>>    Relatedly patches 2, 3 mangle the config option ARCH_HAS_CPU_RELAX,
>>    renaming it to ARCH_HAS_OPTIMIZED_POLL.
>>
>>  - patches 4-6 reorganize the haltpoll selection and init logic
>>    to allow architecture code to select it.
>>
>>  - and finally, patches 7-11 add the bits for arm64 support.
>>
>> What is still missing: this series largely completes the haltpoll side
>> of functionality for arm64. There are, however, a few related areas
>> that still need to be threshed out:
>>
>>  - WFET support: WFE on arm64 does not guarantee that poll_idle()
>>    would terminate in halt_poll_ns. Using WFET would address this.
>>  - KVM_NO_POLL support on arm64
>>  - KVM TWED support on arm64: allow the host to limit time spent in
>>    WFE.
>>
>>
>> Changelog:
>>
>> v8: No logic changes. Largely respin of v7, with changes
>> noted below:
>>
>>  - move selection of ARCH_HAS_OPTIMIZED_POLL on arm64 to its
>>    own patch.
>>    (patch-9 "arm64: select ARCH_HAS_OPTIMIZED_POLL")
>>
>>  - address comments simplifying arm64 support (Will Deacon)
>>    (patch-11 "arm64: support cpuidle-haltpoll")
>>
>> v7: No significant logic changes. Mostly a respin of v6.
>>
>>  - minor cleanup in poll_idle() (Christoph Lameter)
>>  - fixes conflicts due to code movement in arch/arm64/kernel/cpuidle.c
>>    (Tomohiro Misono)
>>
>> v6:
>>
>>  - reordered the patches to keep poll_idle() and ARCH_HAS_OPTIMIZED_POLL
>>    changes together (comment from Christoph Lameter)
>>  - threshes out the commit messages a bit more (comments from Christoph
>>    Lameter, Sudeep Holla)
>>  - also rework selection of cpuidle-haltpoll. Now selected based
>>    on the architectural selection of ARCH_CPUIDLE_HALTPOLL.
>>  - moved back to arch_haltpoll_want() (comment from Joao Martins)
>>    Also, arch_haltpoll_want() now takes the force parameter and is
>>    now responsible for the complete selection (or not) of haltpoll.
>>  - fixes the build breakage on i386
>>  - fixes the cpuidle-haltpoll module breakage on arm64 (comment from
>>    Tomohiro Misono, Haris Okanovic)
>>
>>
>> v5:
>>  - rework the poll_idle() loop around smp_cond_load_relaxed() (review
>>    comment from Tomohiro Misono.)
>>  - also rework selection of cpuidle-haltpoll. Now selected based
>>    on the architectural selection of ARCH_CPUIDLE_HALTPOLL.
>>  - arch_haltpoll_supported() (renamed from arch_haltpoll_want()) on
>>    arm64 now depends on the event-stream being enabled.
>>  - limit POLL_IDLE_RELAX_COUNT on arm64 (review comment from Haris Okanovic)
>>  - ARCH_HAS_CPU_RELAX is now renamed to ARCH_HAS_OPTIMIZED_POLL.
>>
>> v4 changes from v3:
>>  - change 7/8 per Rafael input: drop the parens and use ret for the final check
>>  - add 8/8 which renames the guard for building poll_state
>>
>> v3 changes from v2:
>>  - fix 1/7 per Petr Mladek - remove ARCH_HAS_CPU_RELAX from arch/x86/Kconfig
>>  - add Ack-by from Rafael Wysocki on 2/7
>>
>> v2 changes from v1:
>>  - added patch 7 where we change cpu_relax with smp_cond_load_relaxed per PeterZ
>>    (this improves by 50% at least the CPU cycles consumed in the tests above:
>>    10,716,881,137 now vs 14,503,014,257 before)
>>  - removed the ifdef from patch 1 per RafaelW
>>
>> Please review.
>>
>> [1] https://lore.kernel.org/lkml/TY3PR01MB111481E9B0AF263ACC8EA5D4AE5BA2@TY3PR01MB11148.jpnprd01.prod.outlook.com/
>> [2] https://lore.kernel.org/lkml/104d0ec31cb45477e27273e089402d4205ee4042.camel@amazon.com/
>>
>> Ankur Arora (6):
>>   cpuidle: rename ARCH_HAS_CPU_RELAX to ARCH_HAS_OPTIMIZED_POLL
>>   cpuidle-haltpoll: condition on ARCH_CPUIDLE_HALTPOLL
>>   arm64: idle: export arch_cpu_idle
>>   arm64: select ARCH_HAS_OPTIMIZED_POLL
>>   cpuidle/poll_state: limit POLL_IDLE_RELAX_COUNT on arm64
>>   arm64: support cpuidle-haltpoll
>>
>> Joao Martins (4):
>>   Kconfig: move ARCH_HAS_OPTIMIZED_POLL to arch/Kconfig
>>   cpuidle-haltpoll: define arch_haltpoll_want()
>>   governors/haltpoll: drop kvm_para_available() check
>>   arm64: define TIF_POLLING_NRFLAG
>>
>> Mihai Carabas (1):
>>   cpuidle/poll_state: poll via smp_cond_load_relaxed()
>>
>>  arch/Kconfig                              |  3 +++
>>  arch/arm64/Kconfig                        |  7 +++++++
>>  arch/arm64/include/asm/cpuidle_haltpoll.h | 24 +++++++++++++++++++++++
>>  arch/arm64/include/asm/thread_info.h      |  2 ++
>>  arch/arm64/kernel/idle.c                  |  1 +
>>  arch/x86/Kconfig                          |  5 ++---
>>  arch/x86/include/asm/cpuidle_haltpoll.h   |  1 +
>>  arch/x86/kernel/kvm.c                     | 13 ++++++++++++
>>  drivers/acpi/processor_idle.c             |  4 ++--
>>  drivers/cpuidle/Kconfig                   |  5 ++---
>>  drivers/cpuidle/Makefile                  |  2 +-
>>  drivers/cpuidle/cpuidle-haltpoll.c        | 12 +-----------
>>  drivers/cpuidle/governors/haltpoll.c      |  6 +-----
>>  drivers/cpuidle/poll_state.c              | 22 +++++++++++++++------
>>  drivers/idle/Kconfig                      |  1 +
>>  include/linux/cpuidle.h                   |  2 +-
>>  include/linux/cpuidle_haltpoll.h          |  5 +++++
>>  17 files changed, 83 insertions(+), 32 deletions(-)
>>  create mode 100644 arch/arm64/include/asm/cpuidle_haltpoll.h
>>
>
> Hi Ankur,
>
> Thanks for the patches!
>
> We have tested these patches on our machine, with an adaptation of ACPI LPI
> states rather than c-states.
>
> Include polling state, there would be three states to get in. Comparing idle
> switching latencies of different state with perf bench sched pipe:
>
>                                      usecs/op       %stdev
>
>   state0(polling state)                7.36       +-  0.35%
>   state1                               8.78       +-  0.46%
>   state2                              77.32       +-  5.50%
>
> It turns out that it works on our machine.
>
> Tested-by: Lifeng Zheng <zhenglifeng1@huawei.com>

Great. Thanks Lifeng.

> The adaptation of ACPI LPI states is shown below as a patch. Feel free to
> include this patch as part of your series, or I can also send it out after
> your series being merged.

Ah, so polling for the regular ACPI driver. From a quick look the
patch looks good but this series is mostly focused on haltpoll so I
think this patch can go in after.

Please Cc me when you send it.

Thanks
Ankur

> From: Lifeng Zheng <zhenglifeng1@huawei.com>
>
> ACPI: processor_idle: Support polling state for LPI
>
> Initialize an optional polling state besides LPI states.
>
> Wrap up a new enter method to correctly reflect the actual entered state
> when the polling state is enabled.
>
> Signed-off-by: Lifeng Zheng <zhenglifeng1@huawei.com>
> Reviewed-by: Jie Zhan <zhanjie9@hisilicon.com>
> ---
>  drivers/acpi/processor_idle.c | 39 ++++++++++++++++++++++++++++++-----
>  1 file changed, 34 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/acpi/processor_idle.c b/drivers/acpi/processor_idle.c
> index 44096406d65d..d154b5d77328 100644
> --- a/drivers/acpi/processor_idle.c
> +++ b/drivers/acpi/processor_idle.c
> @@ -1194,20 +1194,46 @@ static int acpi_idle_lpi_enter(struct cpuidle_device *dev,
>  	return -EINVAL;
>  }
>
> +/* To correctly reflect the entered state if the poll state is enabled. */
> +static int acpi_idle_lpi_enter_with_poll_state(struct cpuidle_device *dev,
> +			       struct cpuidle_driver *drv, int index)
> +{
> +	int entered_state;
> +
> +	if (unlikely(index < 1))
> +		return -EINVAL;
> +
> +	entered_state = acpi_idle_lpi_enter(dev, drv, index - 1);
> +	if (entered_state < 0)
> +		return entered_state;
> +
> +	return entered_state + 1;
> +}
> +
>  static int acpi_processor_setup_lpi_states(struct acpi_processor *pr)
>  {
> -	int i;
> +	int i, count;
>  	struct acpi_lpi_state *lpi;
>  	struct cpuidle_state *state;
>  	struct cpuidle_driver *drv = &acpi_idle_driver;
> +	typeof(state->enter) enter_method;
>
>  	if (!pr->flags.has_lpi)
>  		return -EOPNOTSUPP;
>
> +	if (IS_ENABLED(CONFIG_ARCH_HAS_OPTIMIZED_POLL)) {
> +		cpuidle_poll_state_init(drv);
> +		count = 1;
> +		enter_method = acpi_idle_lpi_enter_with_poll_state;
> +	} else {
> +		count = 0;
> +		enter_method = acpi_idle_lpi_enter;
> +	}
> +
>  	for (i = 0; i < pr->power.count && i < CPUIDLE_STATE_MAX; i++) {
>  		lpi = &pr->power.lpi_states[i];
>
> -		state = &drv->states[i];
> +		state = &drv->states[count];
>  		snprintf(state->name, CPUIDLE_NAME_LEN, "LPI-%d", i);
>  		strscpy(state->desc, lpi->desc, CPUIDLE_DESC_LEN);
>  		state->exit_latency = lpi->wake_latency;
> @@ -1215,11 +1241,14 @@ static int acpi_processor_setup_lpi_states(struct acpi_processor *pr)
>  		state->flags |= arch_get_idle_state_flags(lpi->arch_flags);
>  		if (i != 0 && lpi->entry_method == ACPI_CSTATE_FFH)
>  			state->flags |= CPUIDLE_FLAG_RCU_IDLE;
> -		state->enter = acpi_idle_lpi_enter;
> -		drv->safe_state_index = i;
> +		state->enter = enter_method;
> +		drv->safe_state_index = count;
> +		count++;
> +		if (count == CPUIDLE_STATE_MAX)
> +			break;
>  	}
>
> -	drv->state_count = i;
> +	drv->state_count = count;
>
>  	return 0;
>  }


--
ankur
Marc Zyngier Oct. 15, 2024, 12:36 p.m. UTC | #4
On Thu, 26 Sep 2024 00:24:14 +0100,
Ankur Arora <ankur.a.arora@oracle.com> wrote:
> 
> This patchset enables the cpuidle-haltpoll driver and its namesake
> governor on arm64. This is specifically interesting for KVM guests by
> reducing IPC latencies.
> 
> Comparing idle switching latencies on an arm64 KVM guest with 
> perf bench sched pipe:
> 
>                                      usecs/op       %stdev   
> 
>   no haltpoll (baseline)               13.48       +-  5.19%
>   with haltpoll                         6.84       +- 22.07%
> 
> 
> No change in performance for a similar test on x86:
> 
>                                      usecs/op        %stdev   
> 
>   haltpoll w/ cpu_relax() (baseline)     4.75      +-  1.76%
>   haltpoll w/ smp_cond_load_relaxed()    4.78      +-  2.31%
> 
> Both sets of tests were on otherwise idle systems with guest VCPUs
> pinned to specific PCPUs. One reason for the higher stdev on arm64
> is that trapping of the WFE instruction by the host KVM is contingent
> on the number of tasks on the runqueue.

Sorry to state the obvious, but if the variable trapping of WFI/WFE is
the cause of your trouble, why don't you simply turn it off (see
0b5afe05377d for the details)? Given that you pin your vcpus to
physical CPUs, there is no need for any trapping.

	M.
Ankur Arora Oct. 16, 2024, 9:55 p.m. UTC | #5
Marc Zyngier <maz@kernel.org> writes:

> On Thu, 26 Sep 2024 00:24:14 +0100,
> Ankur Arora <ankur.a.arora@oracle.com> wrote:
>>
>> This patchset enables the cpuidle-haltpoll driver and its namesake
>> governor on arm64. This is specifically interesting for KVM guests by
>> reducing IPC latencies.
>>
>> Comparing idle switching latencies on an arm64 KVM guest with
>> perf bench sched pipe:
>>
>>                                      usecs/op       %stdev
>>
>>   no haltpoll (baseline)               13.48       +-  5.19%
>>   with haltpoll                         6.84       +- 22.07%
>>
>>
>> No change in performance for a similar test on x86:
>>
>>                                      usecs/op        %stdev
>>
>>   haltpoll w/ cpu_relax() (baseline)     4.75      +-  1.76%
>>   haltpoll w/ smp_cond_load_relaxed()    4.78      +-  2.31%
>>
>> Both sets of tests were on otherwise idle systems with guest VCPUs
>> pinned to specific PCPUs. One reason for the higher stdev on arm64
>> is that trapping of the WFE instruction by the host KVM is contingent
>> on the number of tasks on the runqueue.
>
> Sorry to state the obvious, but if that's the variable trapping of
> WFI/WFE is the cause of your trouble, why don't you simply turn it off
> (see 0b5afe05377d for the details)? Given that you pin your vcpus to
> physical CPUs, there is no need for any trapping.

Good point. Thanks. That should help reduce the guessing games around
the variance in these tests.

--
ankur
Marc Zyngier Oct. 17, 2024, 8:19 a.m. UTC | #6
On Wed, 16 Oct 2024 22:55:09 +0100,
Ankur Arora <ankur.a.arora@oracle.com> wrote:
> 
> 
> Marc Zyngier <maz@kernel.org> writes:
> 
> > On Thu, 26 Sep 2024 00:24:14 +0100,
> > Ankur Arora <ankur.a.arora@oracle.com> wrote:
> >>
> >> This patchset enables the cpuidle-haltpoll driver and its namesake
> >> governor on arm64. This is specifically interesting for KVM guests by
> >> reducing IPC latencies.
> >>
> >> Comparing idle switching latencies on an arm64 KVM guest with
> >> perf bench sched pipe:
> >>
> >>                                      usecs/op       %stdev
> >>
> >>   no haltpoll (baseline)               13.48       +-  5.19%
> >>   with haltpoll                         6.84       +- 22.07%
> >>
> >>
> >> No change in performance for a similar test on x86:
> >>
> >>                                      usecs/op        %stdev
> >>
> >>   haltpoll w/ cpu_relax() (baseline)     4.75      +-  1.76%
> >>   haltpoll w/ smp_cond_load_relaxed()    4.78      +-  2.31%
> >>
> >> Both sets of tests were on otherwise idle systems with guest VCPUs
> >> pinned to specific PCPUs. One reason for the higher stdev on arm64
> >> is that trapping of the WFE instruction by the host KVM is contingent
> >> on the number of tasks on the runqueue.
> >
> > Sorry to state the obvious, but if that's the variable trapping of
> > WFI/WFE is the cause of your trouble, why don't you simply turn it off
> > (see 0b5afe05377d for the details)? Given that you pin your vcpus to
> > physical CPUs, there is no need for any trapping.
> 
> Good point. Thanks. That should help reduce the guessing games around
> the variance in these tests.

I'd be interested to find out whether there is still some benefit in
this series once you disable the WFx trapping heuristics.

Thanks,

	M.
Ankur Arora Oct. 17, 2024, 6:35 p.m. UTC | #7
Marc Zyngier <maz@kernel.org> writes:

> On Wed, 16 Oct 2024 22:55:09 +0100,
> Ankur Arora <ankur.a.arora@oracle.com> wrote:
>>
>>
>> Marc Zyngier <maz@kernel.org> writes:
>>
>> > On Thu, 26 Sep 2024 00:24:14 +0100,
>> > Ankur Arora <ankur.a.arora@oracle.com> wrote:
>> >>
>> >> This patchset enables the cpuidle-haltpoll driver and its namesake
>> >> governor on arm64. This is specifically interesting for KVM guests by
>> >> reducing IPC latencies.
>> >>
>> >> Comparing idle switching latencies on an arm64 KVM guest with
>> >> perf bench sched pipe:
>> >>
>> >>                                      usecs/op       %stdev
>> >>
>> >>   no haltpoll (baseline)               13.48       +-  5.19%
>> >>   with haltpoll                         6.84       +- 22.07%
>> >>
>> >>
>> >> No change in performance for a similar test on x86:
>> >>
>> >>                                      usecs/op        %stdev
>> >>
>> >>   haltpoll w/ cpu_relax() (baseline)     4.75      +-  1.76%
>> >>   haltpoll w/ smp_cond_load_relaxed()    4.78      +-  2.31%
>> >>
>> >> Both sets of tests were on otherwise idle systems with guest VCPUs
>> >> pinned to specific PCPUs. One reason for the higher stdev on arm64
>> >> is that trapping of the WFE instruction by the host KVM is contingent
>> >> on the number of tasks on the runqueue.
>> >
>> > Sorry to state the obvious, but if the variable trapping of
>> > WFI/WFE is the cause of your trouble, why don't you simply turn it off
>> > (see 0b5afe05377d for the details)? Given that you pin your vcpus to
>> > physical CPUs, there is no need for any trapping.
>>
>> Good point. Thanks. That should help reduce the guessing games around
>> the variance in these tests.
>
> I'd be interested to find out whether there is still some benefit in
> this series once you disable the WFx trapping heuristics.

The benefit of polling in idle is more than just avoiding the cost of
trapping and re-entering. The other benefit is that remote wakeups can
now be done just by setting need-resched, instead of sending an IPI
and incurring the cost of handling the interrupt on the receiver
side.

But let me get you some numbers with that.

--
ankur
Ankur Arora Oct. 22, 2024, 10:01 p.m. UTC | #8
Ankur Arora <ankur.a.arora@oracle.com> writes:

> Marc Zyngier <maz@kernel.org> writes:
>
>> On Wed, 16 Oct 2024 22:55:09 +0100,
>> Ankur Arora <ankur.a.arora@oracle.com> wrote:
>>>
>>>
>>> Marc Zyngier <maz@kernel.org> writes:
>>>
>>> > On Thu, 26 Sep 2024 00:24:14 +0100,
>>> > Ankur Arora <ankur.a.arora@oracle.com> wrote:
>>> >>
>>> >> This patchset enables the cpuidle-haltpoll driver and its namesake
>>> >> governor on arm64. This is specifically interesting for KVM guests by
>>> >> reducing IPC latencies.
>>> >>
>>> >> Comparing idle switching latencies on an arm64 KVM guest with
>>> >> perf bench sched pipe:
>>> >>
>>> >>                                      usecs/op       %stdev
>>> >>
>>> >>   no haltpoll (baseline)               13.48       +-  5.19%
>>> >>   with haltpoll                         6.84       +- 22.07%
>>> >>
>>> >>
>>> >> No change in performance for a similar test on x86:
>>> >>
>>> >>                                      usecs/op        %stdev
>>> >>
>>> >>   haltpoll w/ cpu_relax() (baseline)     4.75      +-  1.76%
>>> >>   haltpoll w/ smp_cond_load_relaxed()    4.78      +-  2.31%
>>> >>
>>> >> Both sets of tests were on otherwise idle systems with guest VCPUs
>>> >> pinned to specific PCPUs. One reason for the higher stdev on arm64
>>> >> is that trapping of the WFE instruction by the host KVM is contingent
>>> >> on the number of tasks on the runqueue.
>>> >
>>> > Sorry to state the obvious, but if the variable trapping of
>>> > WFI/WFE is the cause of your trouble, why don't you simply turn it off
>>> > (see 0b5afe05377d for the details)? Given that you pin your vcpus to
>>> > physical CPUs, there is no need for any trapping.
>>>
>>> Good point. Thanks. That should help reduce the guessing games around
>>> the variance in these tests.
>>
>> I'd be interested to find out whether there is still some benefit in
>> this series once you disable the WFx trapping heuristics.
>
> The benefit of polling in idle is more than just avoiding the cost of
> trapping and re-entering. The other benefit is that remote wakeups can
> now be done just by setting need-resched, instead of sending an IPI
> and incurring the cost of handling the interrupt on the receiver
> side.
>
> But let me get you some numbers with that.

So, I ran the sched-pipe test with processes on VCPUs 4 and 5 with
kvm-arm.wfi_trap_policy=notrap.

# perf stat -r 5 --cpu 4,5 -e task-clock,cycles,instructions,sched:sched_wake_idle_without_ipi \
  perf bench sched pipe -l 1000000 -c 4

# No haltpoll (and, no TIF_POLLING_NRFLAG):

 Performance counter stats for 'CPU(s) 4,5' (5 runs):

         25,229.57 msec task-clock                       #    2.000 CPUs utilized               ( +-  7.75% )
    45,821,250,284      cycles                           #    1.816 GHz                         ( +- 10.07% )
    26,557,496,665      instructions                     #    0.58  insn per cycle              ( +-  0.21% )
                 0      sched:sched_wake_idle_without_ipi #    0.000 /sec

            12.615 +- 0.977 seconds time elapsed  ( +-  7.75% )


# Haltpoll:

 Performance counter stats for 'CPU(s) 4,5' (5 runs):

         15,131.58 msec task-clock                       #    2.000 CPUs utilized               ( +- 10.00% )
    34,158,188,839      cycles                           #    2.257 GHz                         ( +-  6.91% )
    20,824,950,916      instructions                     #    0.61  insn per cycle              ( +-  0.09% )
         1,983,822      sched:sched_wake_idle_without_ipi #  131.105 K/sec                       ( +-  0.78% )

             7.566 +- 0.756 seconds time elapsed  ( +- 10.00% )

We get a decent boost just because we are executing ~20% fewer
instructions. I'm not sure how CPU frequency scaling works in a VM,
but we also run at a higher frequency.
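For reference, the ~20% figure falls straight out of the counters above
(a back-of-the-envelope check on the quoted perf stat output, nothing
more):

```python
# Instruction counts and elapsed times copied from the perf stat runs
# above (no haltpoll vs. haltpoll).
no_haltpoll_insns = 26_557_496_665
haltpoll_insns = 20_824_950_916
no_haltpoll_secs = 12.615
haltpoll_secs = 7.566

insn_reduction = 1 - haltpoll_insns / no_haltpoll_insns
speedup = no_haltpoll_secs / haltpoll_secs

print(f"instructions reduced by {insn_reduction:.1%}")  # 21.6%
print(f"elapsed time speedup: {speedup:.2f}x")          # 1.67x
```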

--
ankur
Okanovic, Haris Nov. 5, 2024, 6:30 p.m. UTC | #9
Hi Ankur, Catalin,

How about the following series based on a refactor of arm64's delay()?
Does it address your earlier concerns?

delay() already implements wfet() and falls back to wfe() with an
event stream or a cpu_relax() loop. I refactored it to poll an address
and wrapped it in a new platform-agnostic smp_vcond_load_relaxed()
macro. More details are in the following commit log.

Regards,
Haris Okanovic
AWS Graviton Software
Ankur Arora Nov. 5, 2024, 6:49 p.m. UTC | #10
Haris Okanovic <harisokn@amazon.com> writes:

> Hi Ankur, Catalin,
>
> How about the following series based on a refactor of arm64's delay()?
> Does it address your earlier concerns?
>
> delay() already implements wfet() and falls back to wfe() w/ evstream
> or a cpu_relax loop. I refactored it to poll an address, and wrapped in
> a new platform-agnostic smp_vcond_load_relaxed() macro. More details in
> the following commit log.

Haven't looked at your series too closely but it looks quite a bit
different from the version I was working on.

Let me send out my version as well in the next few days.

--
ankur