diff mbox

x86 idle: repair large-server 50-watt idle-power regression

Message ID baff264285f6e585df757d58b17788feabc68918.1387403066.git.len.brown@intel.com (mailing list archive)
State Not Applicable, archived
Headers show

Commit Message

Len Brown Dec. 18, 2013, 9:44 p.m. UTC
From: Len Brown <len.brown@intel.com>

Linux 3.10 changed the timing of how thread_info->flags is touched:

	x86: Use generic idle loop
	(7d1a941731fabf27e5fb6edbebb79fe856edb4e5)

This caused Intel NHM-EX and WSM-EX servers to experience a large number
of immediate MONITOR/MWAIT break wakeups, which caused cpuidle to demote
from deep C-states to shallow C-states, which caused these platforms
to experience a significant increase in idle power.

Note that this issue was already present before the commit above,
however, it wasn't seen often enough to be noticed in power measurements.

Here we extend an errata workaround from the Core2 EX "Dunnington"
to extend to NHM-EX and WSM-EX, to prevent these immediate
returns from MWAIT, reducing idle power on these platforms.

While only acpi_idle ran on Dunnington, intel_idle
may also run on these two newer systems.
As of today, there are no other models that are known
to need this tweak.

ref: https://lkml.org/lkml/2013/12/7/22
Signed-off-by: Len Brown <len.brown@intel.com>
Cc: <stable@vger.kernel.org> # 3.12.x, 3.11.x, 3.10.x
---
 arch/x86/kernel/cpu/intel.c | 3 ++-
 drivers/idle/intel_idle.c   | 3 +++
 2 files changed, 5 insertions(+), 1 deletion(-)

Comments

Ingo Molnar Dec. 19, 2013, 12:22 p.m. UTC | #1
* Len Brown <lenb@kernel.org> wrote:

> From: Len Brown <len.brown@intel.com>
> 
> Linux 3.10 changed the timing of how thread_info->flags is touched:
> 
> 	x86: Use generic idle loop
> 	(7d1a941731fabf27e5fb6edbebb79fe856edb4e5)
> 
> This caused Intel NHM-EX and WSM-EX servers to experience a large number
> of immediate MONITOR/MWAIT break wakeups, which caused cpuidle to demote
> from deep C-states to shallow C-states, which caused these platforms
> to experience a significant increase in idle power.
> 
> Note that this issue was already present before the commit above,
> however, it wasn't seen often enough to be noticed in power measurements.
> 
> Here we extend an errata workaround from the Core2 EX "Dunnington"
> to extend to NHM-EX and WSM-EX, to prevent these immediate
> returns from MWAIT, reducing idle power on these platforms.
> 
> While only acpi_idle ran on Dunnington, intel_idle
> may also run on these two newer systems.
> As of today, there are no other models that are known
> to need this tweak.
> 
> ref: https://lkml.org/lkml/2013/12/7/22
> Signed-off-by: Len Brown <len.brown@intel.com>
> Cc: <stable@vger.kernel.org> # 3.12.x, 3.11.x, 3.10.x
> ---
>  arch/x86/kernel/cpu/intel.c | 3 ++-
>  drivers/idle/intel_idle.c   | 3 +++
>  2 files changed, 5 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/kernel/cpu/intel.c b/arch/x86/kernel/cpu/intel.c
> index dc1ec0d..ea04b34 100644
> --- a/arch/x86/kernel/cpu/intel.c
> +++ b/arch/x86/kernel/cpu/intel.c
> @@ -387,7 +387,8 @@ static void init_intel(struct cpuinfo_x86 *c)
>  			set_cpu_cap(c, X86_FEATURE_PEBS);
>  	}
>  
> -	if (c->x86 == 6 && c->x86_model == 29 && cpu_has_clflush)
> +	if (c->x86 == 6 && cpu_has_clflush &&
> +	    (c->x86_model == 29 || c->x86_model == 46 || c->x86_model == 47))
>  		set_cpu_cap(c, X86_FEATURE_CLFLUSH_MONITOR);
>  
>  #ifdef CONFIG_X86_64
> diff --git a/drivers/idle/intel_idle.c b/drivers/idle/intel_idle.c
> index 92d1206..f80b700 100644
> --- a/drivers/idle/intel_idle.c
> +++ b/drivers/idle/intel_idle.c
> @@ -377,6 +377,9 @@ static int intel_idle(struct cpuidle_device *dev,
>  
>  	if (!current_set_polling_and_test()) {
>  
> +		if (this_cpu_has(X86_FEATURE_CLFLUSH_MONITOR))
> +			clflush((void *)&current_thread_info()->flags);
> +
>  		__monitor((void *)&current_thread_info()->flags, 0, 0);

I don't think either of these casts to '(void *)' is needed, both the 
clflush() and __monitor() will take pointers.

Looks good to me otherwise - except that maybe the best way to 
represent this quirk would be for the CLFLUSH+MONITOR sequence to be a 
single 'instruction' which is patched in dynamically during bootup, 
using our usual alternatives framework.

On non-affected CPUs a NOP would remain in place of the CLFLUSH, 
eliminating the branch above.

So the whole thing could be thought of as a slightly more complex 
'monitor' instruction - not exposing the quirk details to actual usage 
sites.

Thanks,

	Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
H. Peter Anvin Dec. 19, 2013, 2:40 p.m. UTC | #2
... or just use static_cpu_has() maybe?

Ingo Molnar <mingo@kernel.org> wrote:
>
>* Len Brown <lenb@kernel.org> wrote:
>
>> From: Len Brown <len.brown@intel.com>
>> 
>> Linux 3.10 changed the timing of how thread_info->flags is touched:
>> 
>> 	x86: Use generic idle loop
>> 	(7d1a941731fabf27e5fb6edbebb79fe856edb4e5)
>> 
>> This caused Intel NHM-EX and WSM-EX servers to experience a large
>number
>> of immediate MONITOR/MWAIT break wakeups, which caused cpuidle to
>demote
>> from deep C-states to shallow C-states, which caused these platforms
>> to experience a significant increase in idle power.
>> 
>> Note that this issue was already present before the commit above,
>> however, it wasn't seen often enough to be noticed in power
>measurements.
>> 
>> Here we extend an errata workaround from the Core2 EX "Dunnington"
>> to extend to NHM-EX and WSM-EX, to prevent these immediate
>> returns from MWAIT, reducing idle power on these platforms.
>> 
>> While only acpi_idle ran on Dunnington, intel_idle
>> may also run on these two newer systems.
>> As of today, there are no other models that are known
>> to need this tweak.
>> 
>> ref: https://lkml.org/lkml/2013/12/7/22
>> Signed-off-by: Len Brown <len.brown@intel.com>
>> Cc: <stable@vger.kernel.org> # 3.12.x, 3.11.x, 3.10.x
>> ---
>>  arch/x86/kernel/cpu/intel.c | 3 ++-
>>  drivers/idle/intel_idle.c   | 3 +++
>>  2 files changed, 5 insertions(+), 1 deletion(-)
>> 
>> diff --git a/arch/x86/kernel/cpu/intel.c
>b/arch/x86/kernel/cpu/intel.c
>> index dc1ec0d..ea04b34 100644
>> --- a/arch/x86/kernel/cpu/intel.c
>> +++ b/arch/x86/kernel/cpu/intel.c
>> @@ -387,7 +387,8 @@ static void init_intel(struct cpuinfo_x86 *c)
>>  			set_cpu_cap(c, X86_FEATURE_PEBS);
>>  	}
>>  
>> -	if (c->x86 == 6 && c->x86_model == 29 && cpu_has_clflush)
>> +	if (c->x86 == 6 && cpu_has_clflush &&
>> +	    (c->x86_model == 29 || c->x86_model == 46 || c->x86_model ==
>47))
>>  		set_cpu_cap(c, X86_FEATURE_CLFLUSH_MONITOR);
>>  
>>  #ifdef CONFIG_X86_64
>> diff --git a/drivers/idle/intel_idle.c b/drivers/idle/intel_idle.c
>> index 92d1206..f80b700 100644
>> --- a/drivers/idle/intel_idle.c
>> +++ b/drivers/idle/intel_idle.c
>> @@ -377,6 +377,9 @@ static int intel_idle(struct cpuidle_device *dev,
>>  
>>  	if (!current_set_polling_and_test()) {
>>  
>> +		if (this_cpu_has(X86_FEATURE_CLFLUSH_MONITOR))
>> +			clflush((void *)&current_thread_info()->flags);
>> +
>>  		__monitor((void *)&current_thread_info()->flags, 0, 0);
>
>I don't think either of these casts to '(void *)' is needed, both the 
>clflush() and __monitor() will take pointers.
>
>Looks good to me otherwise - except that maybe the best way to 
>represent this quirk would be for the CLFLUSH+MONITOR sequence to be a 
>single 'instruction' which is patched in dynamically during bootup, 
>using our usual alternatives framework.
>
>On non-affected CPUs a NOP would remain in place of the CLFLUSH, 
>eliminating the branch above.
>
>So the whole thing could be thought of as a slightly more complex 
>'monitor' instruction - not exposing the quirk details to actual usage 
>sites.
>
>Thanks,
>
>	Ingo
Borislav Petkov Dec. 19, 2013, 3:45 p.m. UTC | #3
On Thu, Dec 19, 2013 at 06:40:41AM -0800, H. Peter Anvin wrote:
> ... or just use static_cpu_has() maybe?

Right, if we can get the compiler to generate the shortest CLFLUSH of 3
bytes with register indirect addressing for the operand, then stomping
over those 3 bytes with the alternatives would be the fastest solution.

If, of course, three NOPs which get discarded in the front end are
cheaper than the near JMP from static_cpu_has.
H. Peter Anvin Dec. 19, 2013, 3:55 p.m. UTC | #4
On 12/19/2013 04:22 AM, Ingo Molnar wrote:
>> diff --git a/drivers/idle/intel_idle.c b/drivers/idle/intel_idle.c
>> index 92d1206..f80b700 100644
>> --- a/drivers/idle/intel_idle.c
>> +++ b/drivers/idle/intel_idle.c
>> @@ -377,6 +377,9 @@ static int intel_idle(struct cpuidle_device *dev,
>>  
>>  	if (!current_set_polling_and_test()) {
>>  
>> +		if (this_cpu_has(X86_FEATURE_CLFLUSH_MONITOR))
>> +			clflush((void *)&current_thread_info()->flags);
>> +
>>  		__monitor((void *)&current_thread_info()->flags, 0, 0);
> 
> I don't think either of these casts to '(void *)' is needed, both the 
> clflush() and __monitor() will take pointers.

__monitor() currently doesn't, which is idiotic.

	-hpa

--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Ingo Molnar Dec. 19, 2013, 4:02 p.m. UTC | #5
* H. Peter Anvin <hpa@zytor.com> wrote:

> On 12/19/2013 04:22 AM, Ingo Molnar wrote:
> >> diff --git a/drivers/idle/intel_idle.c b/drivers/idle/intel_idle.c
> >> index 92d1206..f80b700 100644
> >> --- a/drivers/idle/intel_idle.c
> >> +++ b/drivers/idle/intel_idle.c
> >> @@ -377,6 +377,9 @@ static int intel_idle(struct cpuidle_device *dev,
> >>  
> >>  	if (!current_set_polling_and_test()) {
> >>  
> >> +		if (this_cpu_has(X86_FEATURE_CLFLUSH_MONITOR))
> >> +			clflush((void *)&current_thread_info()->flags);
> >> +
> >>  		__monitor((void *)&current_thread_info()->flags, 0, 0);
> > 
> > I don't think either of these casts to '(void *)' is needed, both the 
> > clflush() and __monitor() will take pointers.
> 
> __monitor() currently doesn't, which is idiotic.

Hm, __monitor() seems to take a void *:

 arch/x86/include/asm/processor.h:static inline void __monitor(const void *eax, unsigned long ecx,

So writing:

		if (this_cpu_has(X86_FEATURE_CLFLUSH_MONITOR))
			clflush(&current_thread_info()->flags);

  		__monitor(&current_thread_info()->flags, 0, 0);

ought to work just fine.

Thanks,

	Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
H. Peter Anvin Dec. 19, 2013, 4:09 p.m. UTC | #6
On 12/19/2013 08:02 AM, Ingo Molnar wrote:
>>
>> __monitor() currently doesn't, which is idiotic.
> 
> Hm, __monitor() seems to take a void *:
> 
>  arch/x86/include/asm/processor.h:static inline void __monitor(const void *eax, unsigned long ecx,
> 
> So writing:
> 
> 		if (this_cpu_has(X86_FEATURE_CLFLUSH_MONITOR))
> 			clflush(&current_thread_info()->flags);
> 
>   		__monitor(&current_thread_info()->flags, 0, 0);
> 

Yes, brain failure on my part.

	-hpa


--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/arch/x86/kernel/cpu/intel.c b/arch/x86/kernel/cpu/intel.c
index dc1ec0d..ea04b34 100644
--- a/arch/x86/kernel/cpu/intel.c
+++ b/arch/x86/kernel/cpu/intel.c
@@ -387,7 +387,8 @@  static void init_intel(struct cpuinfo_x86 *c)
 			set_cpu_cap(c, X86_FEATURE_PEBS);
 	}
 
-	if (c->x86 == 6 && c->x86_model == 29 && cpu_has_clflush)
+	if (c->x86 == 6 && cpu_has_clflush &&
+	    (c->x86_model == 29 || c->x86_model == 46 || c->x86_model == 47))
 		set_cpu_cap(c, X86_FEATURE_CLFLUSH_MONITOR);
 
 #ifdef CONFIG_X86_64
diff --git a/drivers/idle/intel_idle.c b/drivers/idle/intel_idle.c
index 92d1206..f80b700 100644
--- a/drivers/idle/intel_idle.c
+++ b/drivers/idle/intel_idle.c
@@ -377,6 +377,9 @@  static int intel_idle(struct cpuidle_device *dev,
 
 	if (!current_set_polling_and_test()) {
 
+		if (this_cpu_has(X86_FEATURE_CLFLUSH_MONITOR))
+			clflush((void *)&current_thread_info()->flags);
+
 		__monitor((void *)&current_thread_info()->flags, 0, 0);
 		smp_mb();
 		if (!need_resched())