[1/2] PM / arch: x86: Rework the MSR_IA32_ENERGY_PERF_BIAS handling
diff mbox series

Message ID 1605148.8jT99SsvVP@aspire.rjw.lan
State Accepted, archived
Delegated to: Rafael Wysocki
Headers show
Series
  • PM / arch: x86: MSR_IA32_ENERGY_PERF_BIAS handling fixes and sysfs i/f
Related show

Commit Message

Rafael J. Wysocki March 21, 2019, 10:18 p.m. UTC
From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

The current handling of MSR_IA32_ENERGY_PERF_BIAS in the kernel is
problematic, because it may cause changes made by user space to that
MSR (with the help of the x86_energy_perf_policy tool, for example)
to be lost every time a CPU goes offline and then back online as well
as during system-wide power management transitions into sleep states
and back into the working state.

The first problem is that if the current EPB value for a CPU going
online is 0 ('performance'), the kernel will change it to 6 ('normal')
regardless of whether or not this is the first bring-up of that CPU.
That also happens during system-wide resume from sleep states
(including, but not limited to, hibernation).  However, the EPB may
have been adjusted by user space this way and the kernel should not
blindly override that setting.

The second problem is that if the platform firmware resets the EPB
values for any CPUs during system-wide resume from a sleep state,
the kernel will not restore their previous EPB values that may
have been set by user space before the preceding system-wide
suspend transition.  Again, that behavior may at least be confusing
from the user space perspective.

In order to address these issues, rework the handling of
MSR_IA32_ENERGY_PERF_BIAS so that the EPB value is saved on CPU
offline and restored on CPU online as well as (for the boot CPU)
during the syscore stages of system-wide suspend and resume
transitions, respectively.

However, retain the policy by which the EPB is set to 6 ('normal')
on the first bring-up of each CPU if its initial value is 0, based
on the observation that 0 may mean 'not initialized' just as well as
'performance' in that case.

While at it, move the MSR_IA32_ENERGY_PERF_BIAS handling code into
a separate file and document it in Documentation/admin-guide.

Fixes: abe48b108247 (x86, intel, power: Initialize MSR_IA32_ENERGY_PERF_BIAS)
Fixes: b51ef52df71c (x86/cpu: Restore MSR_IA32_ENERGY_PERF_BIAS after resume)
Reported-by: Thomas Renninger <trenn@suse.de>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
---

This basically is a resend of https://patchwork.kernel.org/patch/10862699/,
except that EPB_MASK and EPB_SAVED are ULL values now and the year in the
copyright notice for the new file is 2019.

---
 Documentation/admin-guide/pm/intel_epb.rst     |    6 +
 Documentation/admin-guide/pm/working-state.rst |    1 
 arch/x86/kernel/cpu/Makefile                   |    2 
 arch/x86/kernel/cpu/common.c                   |   17 ---
 arch/x86/kernel/cpu/cpu.h                      |    1 
 arch/x86/kernel/cpu/intel.c                    |   34 ------
 arch/x86/kernel/cpu/intel_epb.c                |  131 +++++++++++++++++++++++++
 include/linux/cpuhotplug.h                     |    1 
 8 files changed, 140 insertions(+), 53 deletions(-)

Comments

Hannes Reinecke March 22, 2019, 9:03 a.m. UTC | #1
On 3/21/19 11:18 PM, Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> 
> The current handling of MSR_IA32_ENERGY_PERF_BIAS in the kernel is
> problematic, because it may cause changes made by user space to that
> MSR (with the help of the x86_energy_perf_policy tool, for example)
> to be lost every time a CPU goes offline and then back online as well
> as during system-wide power management transitions into sleep states
> and back into the working state.
> 
> The first problem is that if the current EPB value for a CPU going
> online is 0 ('performance'), the kernel will change it to 6 ('normal')
> regardless of whether or not this is the first bring-up of that CPU.
> That also happens during system-wide resume from sleep states
> (including, but not limited to, hibernation).  However, the EPB may
> have been adjusted by user space this way and the kernel should not
> blindly override that setting.
> 
> The second problem is that if the platform firmware resets the EPB
> values for any CPUs during system-wide resume from a sleep state,
> the kernel will not restore their previous EPB values that may
> have been set by user space before the preceding system-wide
> suspend transition.  Again, that behavior may at least be confusing
> from the user space perspective.
> 
> In order to address these issues, rework the handling of
> MSR_IA32_ENERGY_PERF_BIAS so that the EPB value is saved on CPU
> offline and restored on CPU online as well as (for the boot CPU)
> during the syscore stages of system-wide suspend and resume
> transitions, respectively.
> 
> However, retain the policy by which the EPB is set to 6 ('normal')
> on the first bring-up of each CPU if its initial value is 0, based
> on the observation that 0 may mean 'not initialized' just as well as
> 'performance' in that case.
> 
> While at it, move the MSR_IA32_ENERGY_PERF_BIAS handling code into
> a separate file and document it in Documentation/admin-guide.
> 
> Fixes: abe48b108247 (x86, intel, power: Initialize MSR_IA32_ENERGY_PERF_BIAS)
> Fixes: b51ef52df71c (x86/cpu: Restore MSR_IA32_ENERGY_PERF_BIAS after resume)
> Reported-by: Thomas Renninger <trenn@suse.de>
> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> ---
> 
Reviewed-by: Hannes Reinecke <hare@suse.com>

Cheers,

Hannes
Borislav Petkov March 22, 2019, 2:28 p.m. UTC | #2
On Thu, Mar 21, 2019 at 11:18:01PM +0100, Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> 
> The current handling of MSR_IA32_ENERGY_PERF_BIAS in the kernel is
> problematic, because it may cause changes made by user space to that
> MSR (with the help of the x86_energy_perf_policy tool, for example)

One more reason to control MSR accesses from userspace. I'm working on
a series to even completely forbid accesses to some MSRs over /dev/msr
so I think accessing MSR_IA32_ENERGY_PERF_BIAS solely over the new
interface in patch 2 would be much better.

So, you're carrying those and you'd like to have an ACK from me?

Btw, a couple of nitpicks below.

> Index: linux-pm/arch/x86/kernel/cpu/intel_epb.c
> ===================================================================
> --- /dev/null
> +++ linux-pm/arch/x86/kernel/cpu/intel_epb.c
> @@ -0,0 +1,131 @@
> +// SPDX-License-Identifier: GPL-2.0

...

> +static DEFINE_PER_CPU(u8, saved_epb);
> +
> +#define EPB_MASK	0x0fULL
> +#define EPB_SAVED	0x10ULL
> +
> +static int intel_epb_save(void)

I'd drop that "intel_epb_" prefix from those static functions, but your
call...

> +{
> +	u64 epb;
> +
> +	rdmsrl(MSR_IA32_ENERGY_PERF_BIAS, epb);
> +	/*
> +	 * Ensure that saved_epb will always be nonzero after this write even if
> +	 * the EPB value read from the MSR is 0.
> +	 */
> +	this_cpu_write(saved_epb, (epb & EPB_MASK) | EPB_SAVED);
> +
> +	return 0;
> +}

...

> Index: linux-pm/Documentation/admin-guide/pm/intel_epb.rst
> ===================================================================
> --- /dev/null
> +++ linux-pm/Documentation/admin-guide/pm/intel_epb.rst

WARNING: Missing or malformed SPDX-License-Identifier tag in line 1
#345: FILE: Documentation/admin-guide/pm/intel_epb.rst:1:
+======================================

> @@ -0,0 +1,6 @@
> +======================================
> +Intel Performance and Energy Bias Hint
> +======================================
> +
> +.. kernel-doc:: arch/x86/kernel/cpu/intel_epb.c
> +   :doc: overview
> Index: linux-pm/Documentation/admin-guide/pm/working-state.rst
> ===================================================================
> --- linux-pm.orig/Documentation/admin-guide/pm/working-state.rst
> +++ linux-pm/Documentation/admin-guide/pm/working-state.rst
> @@ -8,3 +8,4 @@ Working-State Power Management
>     cpuidle
>     cpufreq
>     intel_pstate
> +   intel_epb
>
Thomas Gleixner March 22, 2019, 2:31 p.m. UTC | #3
On Fri, 22 Mar 2019, Borislav Petkov wrote:
> On Thu, Mar 21, 2019 at 11:18:01PM +0100, Rafael J. Wysocki wrote:
> > Index: linux-pm/Documentation/admin-guide/pm/intel_epb.rst
> > ===================================================================
> > --- /dev/null
> > +++ linux-pm/Documentation/admin-guide/pm/intel_epb.rst
> 
> WARNING: Missing or malformed SPDX-License-Identifier tag in line 1
> #345: FILE: Documentation/admin-guide/pm/intel_epb.rst:1:
> +======================================

We have no proper decision/recommendation about documentation
licensing. That's being worked on.

Thanks,

	tglx
Borislav Petkov March 22, 2019, 2:35 p.m. UTC | #4
On Fri, Mar 22, 2019 at 03:31:54PM +0100, Thomas Gleixner wrote:
> We have no proper decision/recommendation about documentation
> licensing. That's being worked on.

Ok, I'm ignoring this checkpatch warning from now on.

Thx.
Thomas Gleixner March 22, 2019, 4:12 p.m. UTC | #5
On Fri, 22 Mar 2019, Borislav Petkov wrote:

> On Fri, Mar 22, 2019 at 03:31:54PM +0100, Thomas Gleixner wrote:
> > We have no proper decision/recommendation about documentation
> > licensing. That's being worked on.
> 
> Ok, I'm ignoring this checkpatch warning from now on.

Only for Documentation/* files please.
Thomas Renninger March 22, 2019, 4:27 p.m. UTC | #6
Thanks Rafael for your quick look at and all the time you spend for this!

A /sys userspace knob will certainly not be enough for us.
You'll need a tool installed fixing this.
powertop on laptops or tuned on servers or a well hidden bootup quirk or 
whatsoever.

The patch I sent with this part:

+       if (acpi_gbl_FADT.preferred_profile == PM_PERFORMANCE_SERVER ||
+           acpi_gbl_FADT.preferred_profile == PM_ENTERPRISE_SERVER)
+               return;

and not touching the EBP value then should at least match most of
our users and OEMs who want a "performance" setting out of the box and
set this on purpose.

Even nicer would be compile option to not touch this at all.

On Thursday, March 21, 2019 11:18:01 PM CET Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
 
...

> + *
> + * Second, on many systems the initial EPB value coming from the platform
> + * firmware is 0 ('performance') and at least on some of them that is
> because + * the platform firmware does not initialize EPB

Why does the CPU not initialize this value to 6?
That would allow OEMs/BIOS to also suggest an init value for the system.
We should try to get microcode people or whoever is in charge to initialize
this value "properly" if Intel thinks 6 is the correct init value.

> at all with the
> assumption that + * the OS will do that anyway.  That sometimes is
> problematic, as it may cause + * the system battery to drain too fast, for
> example, so it is better to adjust + * it on CPU bring-up and if the
> initial EPB value for a given CPU is 0, the + * kernel changes it to 6
> ('normal').

I have an idea to let the kernel more decide about such policies.
It's a nice example that it makes sense to let the kernel set default values.
But not unconditionally, according to what the system is intended to do.

I wanted to do this for quite some time.., I hopefully find the time and
motivation now.

Thanks Rafael.
Sorry for the somewhat rude sounding previous mail, that was not on purpose.
You helped me quite a lot in the past and you obviously still do.

         Thomas
Borislav Petkov March 22, 2019, 4:43 p.m. UTC | #7
On Fri, Mar 22, 2019 at 05:27:43PM +0100, Thomas Renninger wrote:
> Thanks Rafael for your quick look at and all the time you spend for this!
> 
> A /sys userspace knob will certainly not be enough for us.
> You'll need a tool installed fixing this.

We won't pick up the second patch converting to the sysfs interface of
course - only the first one. The second patch is beginning the slow
phasing out of the direct /dev/msr access.
Joe Perches March 22, 2019, 4:52 p.m. UTC | #8
On Fri, 2019-03-22 at 17:12 +0100, Thomas Gleixner wrote:
> On Fri, 22 Mar 2019, Borislav Petkov wrote:
> 
> > On Fri, Mar 22, 2019 at 03:31:54PM +0100, Thomas Gleixner wrote:
> > > We have no proper decision/recommendation about documentation
> > > licensing. That's being worked on.
> > 
> > Ok, I'm ignoring this checkpatch warning from now on.
> 
> Only for Documentation/* files please.

Perhaps
---
 scripts/checkpatch.pl | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/scripts/checkpatch.pl b/scripts/checkpatch.pl
index d0001fd1112d..00c1457fe79b 100755
--- a/scripts/checkpatch.pl
+++ b/scripts/checkpatch.pl
@@ -3058,7 +3058,8 @@ sub process {
 					     "Improper SPDX comment style for '$realfile', please use '$comment' instead\n" . $herecurr);
 				}
 
-				if ($comment !~ /^$/ &&
+				if ($realfile !~ m@^Documentation/@ &&
+				    $comment !~ /^$/ &&
 				    $rawline !~ m@^\+\Q$comment\E SPDX-License-Identifier: @) {
 					WARN("SPDX_LICENSE_TAG",
 					     "Missing or malformed SPDX-License-Identifier tag in line $checklicenseline\n" . $herecurr);
Rafael J. Wysocki March 25, 2019, 10:06 a.m. UTC | #9
On Fri, Mar 22, 2019 at 3:28 PM Borislav Petkov <bp@alien8.de> wrote:
>
> On Thu, Mar 21, 2019 at 11:18:01PM +0100, Rafael J. Wysocki wrote:
> > From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> >
> > The current handling of MSR_IA32_ENERGY_PERF_BIAS in the kernel is
> > problematic, because it may cause changes made by user space to that
> > MSR (with the help of the x86_energy_perf_policy tool, for example)
>
> One more reason to control MSR accesses from userspace. I'm working on
> a series to even completely forbid accesses to some MSRs over /dev/msr
> so I think accessing MSR_IA32_ENERGY_PERF_BIAS solely over the new
> interface in patch 2 would be much better.
>
> So, you're carrying those and you'd like to have an ACK from me?
>
> Btw, a couple of nitpicks below.
>
> > Index: linux-pm/arch/x86/kernel/cpu/intel_epb.c
> > ===================================================================
> > --- /dev/null
> > +++ linux-pm/arch/x86/kernel/cpu/intel_epb.c
> > @@ -0,0 +1,131 @@
> > +// SPDX-License-Identifier: GPL-2.0
>
> ...
>
> > +static DEFINE_PER_CPU(u8, saved_epb);
> > +
> > +#define EPB_MASK     0x0fULL
> > +#define EPB_SAVED    0x10ULL
> > +
> > +static int intel_epb_save(void)
>
> I'd drop that "intel_epb_" prefix from those static functions, but your
> call...

They help indexing tools (elixir.bootlin.com and similar) a bit, so
I'd rather retain them.

> > +{
> > +     u64 epb;
> > +
> > +     rdmsrl(MSR_IA32_ENERGY_PERF_BIAS, epb);
> > +     /*
> > +      * Ensure that saved_epb will always be nonzero after this write even if
> > +      * the EPB value read from the MSR is 0.
> > +      */
> > +     this_cpu_write(saved_epb, (epb & EPB_MASK) | EPB_SAVED);
> > +
> > +     return 0;
> > +}
>
> ...
>
> > Index: linux-pm/Documentation/admin-guide/pm/intel_epb.rst
> > ===================================================================
> > --- /dev/null
> > +++ linux-pm/Documentation/admin-guide/pm/intel_epb.rst
>
> WARNING: Missing or malformed SPDX-License-Identifier tag in line 1
> #345: FILE: Documentation/admin-guide/pm/intel_epb.rst:1:

Well, this is documentation, so ...
Borislav Petkov March 25, 2019, 11:31 a.m. UTC | #10
On Thu, Mar 21, 2019 at 11:18:01PM +0100, Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> 
> The current handling of MSR_IA32_ENERGY_PERF_BIAS in the kernel is
> problematic, because it may cause changes made by user space to that
> MSR (with the help of the x86_energy_perf_policy tool, for example)
> to be lost every time a CPU goes offline and then back online as well
> as during system-wide power management transitions into sleep states
> and back into the working state.
> 
> The first problem is that if the current EPB value for a CPU going
> online is 0 ('performance'), the kernel will change it to 6 ('normal')
> regardless of whether or not this is the first bring-up of that CPU.
> That also happens during system-wide resume from sleep states
> (including, but not limited to, hibernation).  However, the EPB may
> have been adjusted by user space this way and the kernel should not
> blindly override that setting.
> 
> The second problem is that if the platform firmware resets the EPB
> values for any CPUs during system-wide resume from a sleep state,
> the kernel will not restore their previous EPB values that may
> have been set by user space before the preceding system-wide
> suspend transition.  Again, that behavior may at least be confusing
> from the user space perspective.
> 
> In order to address these issues, rework the handling of
> MSR_IA32_ENERGY_PERF_BIAS so that the EPB value is saved on CPU
> offline and restored on CPU online as well as (for the boot CPU)
> during the syscore stages of system-wide suspend and resume
> transitions, respectively.
> 
> However, retain the policy by which the EPB is set to 6 ('normal')
> on the first bring-up of each CPU if its initial value is 0, based
> on the observation that 0 may mean 'not initialized' just as well as
> 'performance' in that case.
> 
> While at it, move the MSR_IA32_ENERGY_PERF_BIAS handling code into
> a separate file and document it in Documentation/admin-guide.
> 
> Fixes: abe48b108247 (x86, intel, power: Initialize MSR_IA32_ENERGY_PERF_BIAS)
> Fixes: b51ef52df71c (x86/cpu: Restore MSR_IA32_ENERGY_PERF_BIAS after resume)
> Reported-by: Thomas Renninger <trenn@suse.de>
> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

Acked-by: Borislav Petkov <bp@suse.de>

Patch
diff mbox series

Index: linux-pm/include/linux/cpuhotplug.h
===================================================================
--- linux-pm.orig/include/linux/cpuhotplug.h
+++ linux-pm/include/linux/cpuhotplug.h
@@ -147,6 +147,7 @@  enum cpuhp_state {
 	CPUHP_AP_X86_VDSO_VMA_ONLINE,
 	CPUHP_AP_IRQ_AFFINITY_ONLINE,
 	CPUHP_AP_ARM_MVEBU_SYNC_CLOCKS,
+	CPUHP_AP_X86_INTEL_EPB_ONLINE,
 	CPUHP_AP_PERF_ONLINE,
 	CPUHP_AP_PERF_X86_ONLINE,
 	CPUHP_AP_PERF_X86_UNCORE_ONLINE,
Index: linux-pm/arch/x86/kernel/cpu/intel.c
===================================================================
--- linux-pm.orig/arch/x86/kernel/cpu/intel.c
+++ linux-pm/arch/x86/kernel/cpu/intel.c
@@ -596,36 +596,6 @@  detect_keyid_bits:
 	c->x86_phys_bits -= keyid_bits;
 }
 
-static void init_intel_energy_perf(struct cpuinfo_x86 *c)
-{
-	u64 epb;
-
-	/*
-	 * Initialize MSR_IA32_ENERGY_PERF_BIAS if not already initialized.
-	 * (x86_energy_perf_policy(8) is available to change it at run-time.)
-	 */
-	if (!cpu_has(c, X86_FEATURE_EPB))
-		return;
-
-	rdmsrl(MSR_IA32_ENERGY_PERF_BIAS, epb);
-	if ((epb & 0xF) != ENERGY_PERF_BIAS_PERFORMANCE)
-		return;
-
-	pr_warn_once("ENERGY_PERF_BIAS: Set to 'normal', was 'performance'\n");
-	pr_warn_once("ENERGY_PERF_BIAS: View and update with x86_energy_perf_policy(8)\n");
-	epb = (epb & ~0xF) | ENERGY_PERF_BIAS_NORMAL;
-	wrmsrl(MSR_IA32_ENERGY_PERF_BIAS, epb);
-}
-
-static void intel_bsp_resume(struct cpuinfo_x86 *c)
-{
-	/*
-	 * MSR_IA32_ENERGY_PERF_BIAS is lost across suspend/resume,
-	 * so reinitialize it properly like during bootup:
-	 */
-	init_intel_energy_perf(c);
-}
-
 static void init_cpuid_fault(struct cpuinfo_x86 *c)
 {
 	u64 msr;
@@ -763,8 +733,6 @@  static void init_intel(struct cpuinfo_x8
 	if (cpu_has(c, X86_FEATURE_TME))
 		detect_tme(c);
 
-	init_intel_energy_perf(c);
-
 	init_intel_misc_features(c);
 }
 
@@ -1023,9 +991,7 @@  static const struct cpu_dev intel_cpu_de
 	.c_detect_tlb	= intel_detect_tlb,
 	.c_early_init   = early_init_intel,
 	.c_init		= init_intel,
-	.c_bsp_resume	= intel_bsp_resume,
 	.c_x86_vendor	= X86_VENDOR_INTEL,
 };
 
 cpu_dev_register(intel_cpu_dev);
-
Index: linux-pm/arch/x86/kernel/cpu/common.c
===================================================================
--- linux-pm.orig/arch/x86/kernel/cpu/common.c
+++ linux-pm/arch/x86/kernel/cpu/common.c
@@ -1864,23 +1864,6 @@  void cpu_init(void)
 }
 #endif
 
-static void bsp_resume(void)
-{
-	if (this_cpu->c_bsp_resume)
-		this_cpu->c_bsp_resume(&boot_cpu_data);
-}
-
-static struct syscore_ops cpu_syscore_ops = {
-	.resume		= bsp_resume,
-};
-
-static int __init init_cpu_syscore(void)
-{
-	register_syscore_ops(&cpu_syscore_ops);
-	return 0;
-}
-core_initcall(init_cpu_syscore);
-
 /*
  * The microcode loader calls this upon late microcode load to recheck features,
  * only when microcode has been updated. Caller holds microcode_mutex and CPU
Index: linux-pm/arch/x86/kernel/cpu/cpu.h
===================================================================
--- linux-pm.orig/arch/x86/kernel/cpu/cpu.h
+++ linux-pm/arch/x86/kernel/cpu/cpu.h
@@ -14,7 +14,6 @@  struct cpu_dev {
 	void		(*c_init)(struct cpuinfo_x86 *);
 	void		(*c_identify)(struct cpuinfo_x86 *);
 	void		(*c_detect_tlb)(struct cpuinfo_x86 *);
-	void		(*c_bsp_resume)(struct cpuinfo_x86 *);
 	int		c_x86_vendor;
 #ifdef CONFIG_X86_32
 	/* Optional vendor specific routine to obtain the cache size. */
Index: linux-pm/arch/x86/kernel/cpu/Makefile
===================================================================
--- linux-pm.orig/arch/x86/kernel/cpu/Makefile
+++ linux-pm/arch/x86/kernel/cpu/Makefile
@@ -28,7 +28,7 @@  obj-y			+= cpuid-deps.o
 obj-$(CONFIG_PROC_FS)	+= proc.o
 obj-$(CONFIG_X86_FEATURE_NAMES) += capflags.o powerflags.o
 
-obj-$(CONFIG_CPU_SUP_INTEL)		+= intel.o intel_pconfig.o
+obj-$(CONFIG_CPU_SUP_INTEL)		+= intel.o intel_pconfig.o intel_epb.o
 obj-$(CONFIG_CPU_SUP_AMD)		+= amd.o
 obj-$(CONFIG_CPU_SUP_HYGON)		+= hygon.o
 obj-$(CONFIG_CPU_SUP_CYRIX_32)		+= cyrix.o
Index: linux-pm/arch/x86/kernel/cpu/intel_epb.c
===================================================================
--- /dev/null
+++ linux-pm/arch/x86/kernel/cpu/intel_epb.c
@@ -0,0 +1,131 @@ 
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Intel Performance and Energy Bias Hint support.
+ *
+ * Copyright (C) 2019 Intel Corporation
+ *
+ * Author:
+ *	Rafael J. Wysocki <rafael.j.wysocki@intel.com>
+ */
+
+#include <linux/cpuhotplug.h>
+#include <linux/kernel.h>
+#include <linux/syscore_ops.h>
+
+#include <asm/cpufeature.h>
+#include <asm/msr.h>
+
+/**
+ * DOC: overview
+ *
+ * The Performance and Energy Bias Hint (EPB) allows software to specify its
+ * preference with respect to the power-performance tradeoffs present in the
+ * processor.  Generally, the EPB is expected to be set by user space through
+ * the generic MSR interface (with the help of the x86_energy_perf_policy tool),
+ * but there are two reasons for the kernel to touch it.
+ *
+ * First, there are systems where the platform firmware resets the EPB during
+ * system-wide transitions from sleep states back into the working state
+ * effectively causing the previous EPB updates by user space to be lost.
+ * Thus the kernel needs to save the current EPB values for all CPUs during
+ * system-wide transitions to sleep states and restore them on the way back to
+ * the working state.  That can be achieved by saving EPB for secondary CPUs
+ * when they are taken offline during transitions into system sleep states and
+ * for the boot CPU in a syscore suspend operation, so that it can be restored
+ * for the boot CPU in a syscore resume operation and for the other CPUs when
+ * they are brought back online.  However, CPUs that are already offline when
+ * a system-wide PM transition is started are not taken offline again, but their
+ * EPB values may still be reset by the platform firmware during the transition,
+ * so in fact it is necessary to save the EPB of any CPU taken offline and to
+ * restore it when the given CPU goes back online at all times.
+ *
+ * Second, on many systems the initial EPB value coming from the platform
+ * firmware is 0 ('performance') and at least on some of them that is because
+ * the platform firmware does not initialize EPB at all with the assumption that
+ * the OS will do that anyway.  That sometimes is problematic, as it may cause
+ * the system battery to drain too fast, for example, so it is better to adjust
+ * it on CPU bring-up and if the initial EPB value for a given CPU is 0, the
+ * kernel changes it to 6 ('normal').
+ */
+
+static DEFINE_PER_CPU(u8, saved_epb);
+
+#define EPB_MASK	0x0fULL
+#define EPB_SAVED	0x10ULL
+
+static int intel_epb_save(void)
+{
+	u64 epb;
+
+	rdmsrl(MSR_IA32_ENERGY_PERF_BIAS, epb);
+	/*
+	 * Ensure that saved_epb will always be nonzero after this write even if
+	 * the EPB value read from the MSR is 0.
+	 */
+	this_cpu_write(saved_epb, (epb & EPB_MASK) | EPB_SAVED);
+
+	return 0;
+}
+
+static void intel_epb_restore(void)
+{
+	u64 val = this_cpu_read(saved_epb);
+	u64 epb;
+
+	rdmsrl(MSR_IA32_ENERGY_PERF_BIAS, epb);
+	if (val) {
+		val &= EPB_MASK;
+	} else {
+		/*
+		 * Because intel_epb_save() has not run for the current CPU yet,
+		 * it is going online for the first time, so if its EPB value is
+		 * 0 ('performance') at this point, assume that it has not been
+		 * initialized by the platform firmware and set it to 6
+		 * ('normal').
+		 */
+		val = epb & EPB_MASK;
+		if (val == ENERGY_PERF_BIAS_PERFORMANCE) {
+			val = ENERGY_PERF_BIAS_NORMAL;
+			pr_warn_once("ENERGY_PERF_BIAS: Set to 'normal', was 'performance'\n");
+		}
+	}
+	wrmsrl(MSR_IA32_ENERGY_PERF_BIAS, (epb & ~EPB_MASK) | val);
+}
+
+static struct syscore_ops intel_epb_syscore_ops = {
+	.suspend = intel_epb_save,
+	.resume = intel_epb_restore,
+};
+
+static int intel_epb_online(unsigned int cpu)
+{
+	intel_epb_restore();
+	return 0;
+}
+
+static int intel_epb_offline(unsigned int cpu)
+{
+	return intel_epb_save();
+}
+
+static __init int intel_epb_init(void)
+{
+	int ret;
+
+	if (!boot_cpu_has(X86_FEATURE_EPB))
+		return -ENODEV;
+
+	ret = cpuhp_setup_state(CPUHP_AP_X86_INTEL_EPB_ONLINE,
+				"x86/intel/epb:online", intel_epb_online,
+				intel_epb_offline);
+	if (ret < 0)
+		goto err_out_online;
+
+	register_syscore_ops(&intel_epb_syscore_ops);
+	return 0;
+
+err_out_online:
+	cpuhp_remove_state(CPUHP_AP_X86_INTEL_EPB_ONLINE);
+	return ret;
+}
+subsys_initcall(intel_epb_init);
Index: linux-pm/Documentation/admin-guide/pm/intel_epb.rst
===================================================================
--- /dev/null
+++ linux-pm/Documentation/admin-guide/pm/intel_epb.rst
@@ -0,0 +1,6 @@ 
+======================================
+Intel Performance and Energy Bias Hint
+======================================
+
+.. kernel-doc:: arch/x86/kernel/cpu/intel_epb.c
+   :doc: overview
Index: linux-pm/Documentation/admin-guide/pm/working-state.rst
===================================================================
--- linux-pm.orig/Documentation/admin-guide/pm/working-state.rst
+++ linux-pm/Documentation/admin-guide/pm/working-state.rst
@@ -8,3 +8,4 @@  Working-State Power Management
    cpuidle
    cpufreq
    intel_pstate
+   intel_epb