
[v3,1/2] x86/mwait-idle: enable interrupts before C1 on Xeons

Message ID: 379483c7-fe7d-16ee-454f-8f8dd001dc48@suse.com
State: New, archived
Series: x86/mwait-idle: updates from Linux (and more)

Commit Message

Jan Beulich Jan. 27, 2022, 3:13 p.m. UTC
From: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>

Enable local interrupts before requesting C1 on the last two generations
of Intel Xeon platforms: Sky Lake, Cascade Lake, Cooper Lake, Ice Lake.
This decreases average C1 interrupt latency by about 5-10%, as measured
with the 'wult' tool.

The '->enter()' function of the driver enters C-states with local
interrupts disabled by executing the 'monitor' and 'mwait' pair of
instructions. If an interrupt happens, the CPU exits the C-state and
continues executing instructions after 'mwait'. It does not jump to
the interrupt handler, because local interrupts are disabled. The
cpuidle subsystem enables interrupts a bit later, after doing some
housekeeping.
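
For illustration, a minimal sketch of that sequence (the generic
MONITOR/MWAIT idiom; 'monitor'/'mwait' stand in for the arch
wrappers here and are not the literal driver code):

	local_irq_disable();
	monitor(&dummy, 0, 0);		/* arm the monitor on a cache line */
	/* ECX bit 0: interrupts are break events even while masked */
	mwait(cstate_hint, MWAIT_ECX_INTERRUPT_BREAK);
	/* wakeup resumes here; the interrupt handler has not run yet */
	/* ... cpuidle housekeeping ... */
	local_irq_enable();		/* pending interrupt vectored only now */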

With this patch, we enable local interrupts before requesting C1. In
this case, if the CPU wakes up because of an interrupt, it will jump
to the interrupt handler right away. The cpuidle housekeeping will be
done after the pending interrupt(s) are handled.

Enabling interrupts before entering a C-state has a measurable impact
for faster C-states, like C1. Deeper, but slower C-states like C6 do
not really benefit from this sort of change, because their latency is
a lot higher compared to the delay added by cpuidle housekeeping.

This change was also tested with cyclictest and dbench. In the case of Ice
Lake, the average cyclictest latency decreased by 5.1%, and the average
'dbench' throughput increased by about 0.8%. Both tests were run for 4
hours with only C1 enabled (all other idle states, including 'POLL',
were disabled). CPU frequency was pinned to HFM, and uncore frequency
was pinned to the maximum value. The other platforms had similar
single-digit percentage improvements.

It is worth noting that this patch affects 'cpuidle' statistics a tiny
bit.  Before this patch, C1 residency did not include the interrupt
handling time, but with this patch, it will include it. This is similar
to what happens in the case of the 'POLL' state, which also runs with
interrupts enabled.

Suggested-by: Len Brown <len.brown@intel.com>
Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
[Linux commit: c227233ad64c77e57db738ab0e46439db71822a3]

We don't have a pointer into cpuidle_state_table[] readily available.
To compensate, propagate the flag into struct acpi_processor_cx.

Unlike Linux we want to
- disable IRQs again after MWAITing, as subsequently invoked functions
  assume so,
- avoid enabling IRQs if cstate_restore_tsc() is not a no-op, to avoid
  interfering with, in particular, the time rendezvous.
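
Putting both adjustments together, the resulting entry path boils
down to the following (condensed from the hunks below; not the
verbatim patch):

	if (cpu_is_haltable(cpu)) {
		if (cx->irq_enable_early)	/* C1 on SKX/ICX, and only with NONSTOP_TSC */
			local_irq_enable();

		mwait_idle_with_hints(cx->address, MWAIT_ECX_INTERRUPT_BREAK);

		local_irq_disable();		/* subsequently invoked functions assume IRQs off */
	}

	after = alternative_call(cpuidle_get_tick);
	cstate_restore_tsc();			/* a no-op here, due to the NONSTOP_TSC check */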

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
RFC: I'm not entirely certain that we want to take this, i.e. whether
     we're as concerned about interrupt latency.
RFC: I was going back and forth between putting the local_irq_enable()
     ahead of or after cpu_is_haltable().
---
v3: Propagate flag to struct acpi_processor_cx. Don't set flag when TSC
    may stop while in a C-state.
v2: New.

Comments

Roger Pau Monné Feb. 1, 2022, 10:46 a.m. UTC | #1
On Thu, Jan 27, 2022 at 04:13:21PM +0100, Jan Beulich wrote:
> From: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
> 
> Enable local interrupts before requesting C1 on the last two generations
> of Intel Xeon platforms: Sky Lake, Cascade Lake, Cooper Lake, Ice Lake.
> This decreases average C1 interrupt latency by about 5-10%, as measured
> with the 'wult' tool.
> 
> The '->enter()' function of the driver enters C-states with local
> interrupts disabled by executing the 'monitor' and 'mwait' pair of
> instructions. If an interrupt happens, the CPU exits the C-state and
> continues executing instructions after 'mwait'. It does not jump to
> the interrupt handler, because local interrupts are disabled. The
> cpuidle subsystem enables interrupts a bit later, after doing some
> housekeeping.
> 
> With this patch, we enable local interrupts before requesting C1. In
> this case, if the CPU wakes up because of an interrupt, it will jump
> to the interrupt handler right away. The cpuidle housekeeping will be
> done after the pending interrupt(s) are handled.
> 
> Enabling interrupts before entering a C-state has a measurable impact
> for faster C-states, like C1. Deeper, but slower C-states like C6 do
> not really benefit from this sort of change, because their latency is
> a lot higher compared to the delay added by cpuidle housekeeping.
> 
> This change was also tested with cyclictest and dbench. In the case of Ice
> Lake, the average cyclictest latency decreased by 5.1%, and the average
> 'dbench' throughput increased by about 0.8%. Both tests were run for 4
> hours with only C1 enabled (all other idle states, including 'POLL',
> were disabled). CPU frequency was pinned to HFM, and uncore frequency
> was pinned to the maximum value. The other platforms had similar
> single-digit percentage improvements.
> 
> It is worth noting that this patch affects 'cpuidle' statistics a tiny
> bit.  Before this patch, C1 residency did not include the interrupt
> handling time, but with this patch, it will include it. This is similar
> to what happens in the case of the 'POLL' state, which also runs with
> interrupts enabled.
> 
> Suggested-by: Len Brown <len.brown@intel.com>
> Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
> [Linux commit: c227233ad64c77e57db738ab0e46439db71822a3]
> 
> We don't have a pointer into cpuidle_state_table[] readily available.
> To compensate, propagate the flag into struct acpi_processor_cx.
> 
> Unlike Linux we want to
> - disable IRQs again after MWAITing, as subsequently invoked functions
>   assume so,
> - avoid enabling IRQs if cstate_restore_tsc() is not a no-op, to avoid
>   interfering with, in particular, the time rendezvous.
> 
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

Acked-by: Roger Pau Monné <roger.pau@citrix.com>

> ---
> RFC: I'm not entirely certain that we want to take this, i.e. whether
>      we're as concerned about interrupt latency.

I would assume taking this would make it easier for you to pick
further patches.

> RFC: I was going back and forth between putting the local_irq_enable()
>      ahead of or after cpu_is_haltable().
> ---
> v3: Propagate flag to struct acpi_processor_cx. Don't set flag when TSC
>     may stop while in a C-state.
> v2: New.
> 
> --- a/xen/arch/x86/cpu/mwait-idle.c
> +++ b/xen/arch/x86/cpu/mwait-idle.c
> @@ -108,6 +108,11 @@ static const struct cpuidle_state {
>  
>  #define CPUIDLE_FLAG_DISABLED		0x1
>  /*
> + * Enable interrupts before entering the C-state. On some platforms and for
> + * some C-states, this may measurably decrease interrupt latency.
> + */
> +#define CPUIDLE_FLAG_IRQ_ENABLE		0x8000
> +/*
>   * Set this flag for states where the HW flushes the TLB for us
>   * and so we don't need cross-calls to keep it consistent.
>   * If this flag is set, SW flushes the TLB, so even if the
> @@ -539,7 +544,7 @@ static struct cpuidle_state __read_mostl
>  static struct cpuidle_state __read_mostly skx_cstates[] = {
>  	{
>  		.name = "C1",
> -		.flags = MWAIT2flg(0x00),
> +		.flags = MWAIT2flg(0x00) | CPUIDLE_FLAG_IRQ_ENABLE,
>  		.exit_latency = 2,
>  		.target_residency = 2,
>  	},
> @@ -561,7 +566,7 @@ static struct cpuidle_state __read_mostl
>  static const struct cpuidle_state icx_cstates[] = {
>         {
>                 .name = "C1",
> -               .flags = MWAIT2flg(0x00),
> +               .flags = MWAIT2flg(0x00) | CPUIDLE_FLAG_IRQ_ENABLE,
>                 .exit_latency = 1,
>                 .target_residency = 1,
>         },
> @@ -842,9 +847,15 @@ static void mwait_idle(void)
>  
>  	update_last_cx_stat(power, cx, before);
>  
> -	if (cpu_is_haltable(cpu))
> +	if (cpu_is_haltable(cpu)) {
> +		if (cx->irq_enable_early)
> +			local_irq_enable();
> +
>  		mwait_idle_with_hints(cx->address, MWAIT_ECX_INTERRUPT_BREAK);
>  
> +		local_irq_disable();
> +	}
> +
>  	after = alternative_call(cpuidle_get_tick);
>  
>  	cstate_restore_tsc();
> @@ -1335,6 +1346,11 @@ static int mwait_idle_cpu_init(struct no
>  		cx->latency = cpuidle_state_table[cstate].exit_latency;
>  		cx->target_residency =
>  			cpuidle_state_table[cstate].target_residency;
> +		if ((cpuidle_state_table[cstate].flags &
> +		     CPUIDLE_FLAG_IRQ_ENABLE) &&
> +		    /* cstate_restore_tsc() needs to be a no-op */
> +		    boot_cpu_has(X86_FEATURE_NONSTOP_TSC))
> +			cx->irq_enable_early = true;
>  
>  		dev->count++;
>  	}
> --- a/xen/include/xen/cpuidle.h
> +++ b/xen/include/xen/cpuidle.h
> @@ -42,6 +42,7 @@ struct acpi_processor_cx
>      u8 idx;
>      u8 type;         /* ACPI_STATE_Cn */
>      u8 entry_method; /* ACPI_CSTATE_EM_xxx */
> +    bool irq_enable_early;

Should you use a bit field here and limit the field to :1 in
expectation of maybe adding more flags at a later point?
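
For illustration, that would be along the lines of (a sketch of the
suggested alternative only, not part of the patch):

    bool irq_enable_early:1;

keeping the remaining bits of the byte available for future flags.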

Thanks, Roger.
Jan Beulich Feb. 1, 2022, 11:32 a.m. UTC | #2
On 01.02.2022 11:46, Roger Pau Monné wrote:
> On Thu, Jan 27, 2022 at 04:13:21PM +0100, Jan Beulich wrote:
>> From: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
>>
>> Enable local interrupts before requesting C1 on the last two generations
>> of Intel Xeon platforms: Sky Lake, Cascade Lake, Cooper Lake, Ice Lake.
>> This decreases average C1 interrupt latency by about 5-10%, as measured
>> with the 'wult' tool.
>>
>> The '->enter()' function of the driver enters C-states with local
>> interrupts disabled by executing the 'monitor' and 'mwait' pair of
>> instructions. If an interrupt happens, the CPU exits the C-state and
>> continues executing instructions after 'mwait'. It does not jump to
>> the interrupt handler, because local interrupts are disabled. The
>> cpuidle subsystem enables interrupts a bit later, after doing some
>> housekeeping.
>>
>> With this patch, we enable local interrupts before requesting C1. In
>> this case, if the CPU wakes up because of an interrupt, it will jump
>> to the interrupt handler right away. The cpuidle housekeeping will be
>> done after the pending interrupt(s) are handled.
>>
>> Enabling interrupts before entering a C-state has a measurable impact
>> for faster C-states, like C1. Deeper, but slower C-states like C6 do
>> not really benefit from this sort of change, because their latency is
>> a lot higher compared to the delay added by cpuidle housekeeping.
>>
>> This change was also tested with cyclictest and dbench. In the case of Ice
>> Lake, the average cyclictest latency decreased by 5.1%, and the average
>> 'dbench' throughput increased by about 0.8%. Both tests were run for 4
>> hours with only C1 enabled (all other idle states, including 'POLL',
>> were disabled). CPU frequency was pinned to HFM, and uncore frequency
>> was pinned to the maximum value. The other platforms had similar
>> single-digit percentage improvements.
>>
>> It is worth noting that this patch affects 'cpuidle' statistics a tiny
>> bit.  Before this patch, C1 residency did not include the interrupt
>> handling time, but with this patch, it will include it. This is similar
>> to what happens in the case of the 'POLL' state, which also runs with
>> interrupts enabled.
>>
>> Suggested-by: Len Brown <len.brown@intel.com>
>> Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
>> [Linux commit: c227233ad64c77e57db738ab0e46439db71822a3]
>>
>> We don't have a pointer into cpuidle_state_table[] readily available.
>> To compensate, propagate the flag into struct acpi_processor_cx.
>>
>> Unlike Linux we want to
>> - disable IRQs again after MWAITing, as subsequently invoked functions
>>   assume so,
>> - avoid enabling IRQs if cstate_restore_tsc() is not a no-op, to avoid
>>   interfering with, in particular, the time rendezvous.
>>
>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> 
> Acked-by: Roger Pau Monné <roger.pau@citrix.com>

Thanks.

>> ---
>> RFC: I'm not entirely certain that we want to take this, i.e. whether
>>      we're as concerned about interrupt latency.
> 
> I would assume taking this would make it easier for you to pick
> further patches.

At least a little, yes.

>> --- a/xen/include/xen/cpuidle.h
>> +++ b/xen/include/xen/cpuidle.h
>> @@ -42,6 +42,7 @@ struct acpi_processor_cx
>>      u8 idx;
>>      u8 type;         /* ACPI_STATE_Cn */
>>      u8 entry_method; /* ACPI_CSTATE_EM_xxx */
>> +    bool irq_enable_early;
> 
> Should you use a bit field here and limit the field to :1 in
> expectation of maybe adding more flags at a later point?

Well, I had considered doing so, but then thought we can easily
switch at such a future time, leaving it consistently without
bit fields for now. All the more so since there's still a 32-bit
padding field left to put stuff in.

Jan

Patch

--- a/xen/arch/x86/cpu/mwait-idle.c
+++ b/xen/arch/x86/cpu/mwait-idle.c
@@ -108,6 +108,11 @@  static const struct cpuidle_state {
 
 #define CPUIDLE_FLAG_DISABLED		0x1
 /*
+ * Enable interrupts before entering the C-state. On some platforms and for
+ * some C-states, this may measurably decrease interrupt latency.
+ */
+#define CPUIDLE_FLAG_IRQ_ENABLE		0x8000
+/*
  * Set this flag for states where the HW flushes the TLB for us
  * and so we don't need cross-calls to keep it consistent.
  * If this flag is set, SW flushes the TLB, so even if the
@@ -539,7 +544,7 @@  static struct cpuidle_state __read_mostl
 static struct cpuidle_state __read_mostly skx_cstates[] = {
 	{
 		.name = "C1",
-		.flags = MWAIT2flg(0x00),
+		.flags = MWAIT2flg(0x00) | CPUIDLE_FLAG_IRQ_ENABLE,
 		.exit_latency = 2,
 		.target_residency = 2,
 	},
@@ -561,7 +566,7 @@  static struct cpuidle_state __read_mostl
 static const struct cpuidle_state icx_cstates[] = {
        {
                .name = "C1",
-               .flags = MWAIT2flg(0x00),
+               .flags = MWAIT2flg(0x00) | CPUIDLE_FLAG_IRQ_ENABLE,
                .exit_latency = 1,
                .target_residency = 1,
        },
@@ -842,9 +847,15 @@  static void mwait_idle(void)
 
 	update_last_cx_stat(power, cx, before);
 
-	if (cpu_is_haltable(cpu))
+	if (cpu_is_haltable(cpu)) {
+		if (cx->irq_enable_early)
+			local_irq_enable();
+
 		mwait_idle_with_hints(cx->address, MWAIT_ECX_INTERRUPT_BREAK);
 
+		local_irq_disable();
+	}
+
 	after = alternative_call(cpuidle_get_tick);
 
 	cstate_restore_tsc();
@@ -1335,6 +1346,11 @@  static int mwait_idle_cpu_init(struct no
 		cx->latency = cpuidle_state_table[cstate].exit_latency;
 		cx->target_residency =
 			cpuidle_state_table[cstate].target_residency;
+		if ((cpuidle_state_table[cstate].flags &
+		     CPUIDLE_FLAG_IRQ_ENABLE) &&
+		    /* cstate_restore_tsc() needs to be a no-op */
+		    boot_cpu_has(X86_FEATURE_NONSTOP_TSC))
+			cx->irq_enable_early = true;
 
 		dev->count++;
 	}
--- a/xen/include/xen/cpuidle.h
+++ b/xen/include/xen/cpuidle.h
@@ -42,6 +42,7 @@  struct acpi_processor_cx
     u8 idx;
     u8 type;         /* ACPI_STATE_Cn */
     u8 entry_method; /* ACPI_CSTATE_EM_xxx */
+    bool irq_enable_early;
     u32 address;
     u32 latency;
     u32 target_residency;