diff mbox series

[v2,3/3] intel_idle: add C0.2 state for Sapphire Rapids Xeon

Message ID 20230310122110.895093-4-dedekind1@gmail.com (mailing list archive)
State Changes Requested, archived
Series Sapphire Rapids C0.x idle states support

Commit Message

Artem Bityutskiy March 10, 2023, 12:21 p.m. UTC
From: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>

Add Sapphire Rapids Xeon C0.2 state support. This state has a lower exit
latency compared to C1, and saves energy compared to POLL (in the range of
5-20%).

This patch also improves performance (e.g., as measured by 'hackbench'),
because idle CPU power savings in C0.2 increase the busy CPU's power budget
and therefore improve its turbo boost.

Suggested-by: Len Brown <len.brown@intel.com>
Suggested-by: Arjan Van De Ven <arjan.van.de.ven@intel.com>
Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
---
 drivers/idle/intel_idle.c | 58 ++++++++++++++++++++++++++++++++++++++-
 1 file changed, 57 insertions(+), 1 deletion(-)

Comments

Peter Zijlstra March 20, 2023, 2:50 p.m. UTC | #1
On Fri, Mar 10, 2023 at 02:21:10PM +0200, Artem Bityutskiy wrote:
> +/**
> + * umwait_limit_init - initialize time limit value for 'umwait'.
> + *
> + * C0.1 and C0.2 (later C0.x) idle states are requested via the 'umwait'
> + * instruction. The 'umwait' instruction requires the "deadline" - the TSC
> + * counter value to break out of C0.x (unless it broke out because of an
> + * interrupt or some other event).
> + *
> + * The deadline is specified as an absolute TSC value, and it is calculated as
> + * current TSC value + 'umwait_limit'. This function initializes the
> + * 'umwait_limit' variable to the count of TSC cycles per tick. Motivation:
> + *   * the tick is not disabled for shallow states like C0.x, so idle will
> + *     not last longer than a tick anyway
> + *   * limit idle time to give cpuidle a chance to re-evaluate its C-state
> + *     selection decision and possibly select a deeper C-state.
> + */
> +static void __init umwait_limit_init(void)
> +{
> +	umwait_limit = (u64)TICK_NSEC * tsc_khz;
> +	do_div(umwait_limit, MICRO);
> +}

Would it not make sense to put this limit in the MSR instead? By
randomly increasing the MSR limit you also change userspace behaviour vs
NOHZ_FULL.

That was part of the reason why Andy insisted on having the MSR low.
Andy Lutomirski March 20, 2023, 6:27 p.m. UTC | #2
On Mon, Mar 20, 2023, at 7:50 AM, Peter Zijlstra wrote:
> On Fri, Mar 10, 2023 at 02:21:10PM +0200, Artem Bityutskiy wrote:
>> +/**
>> + * umwait_limit_init - initialize time limit value for 'umwait'.
>> + *
>> + * C0.1 and C0.2 (later C0.x) idle states are requested via the 'umwait'
>> + * instruction. The 'umwait' instruction requires the "deadline" - the TSC
>> + * counter value to break out of C0.x (unless it broke out because of an
>> + * interrupt or some other event).
>> + *
>> + * The deadline is specified as an absolute TSC value, and it is calculated as
>> + * current TSC value + 'umwait_limit'. This function initializes the
>> + * 'umwait_limit' variable to the count of TSC cycles per tick. Motivation:
>> + *   * the tick is not disabled for shallow states like C0.x, so idle will
>> + *     not last longer than a tick anyway
>> + *   * limit idle time to give cpuidle a chance to re-evaluate its C-state
>> + *     selection decision and possibly select a deeper C-state.
>> + */
>> +static void __init umwait_limit_init(void)
>> +{
>> +	umwait_limit = (u64)TICK_NSEC * tsc_khz;
>> +	do_div(umwait_limit, MICRO);
>> +}
>
> Would it not make sense to put this limit in the MSR instead? By
> randomly increasing the MSR limit you also change userspace behaviour vs
> NOHZ_FULL.
>
> That was part of the reason why Andy insisted on having the MSR low.

This is all busted.

UMWAIT has a U for *user mode*.  We have the MSR set to a small value because USER waits are a big can of worms, and long user waits don't actually behave in any particularly intelligent manner unless that core is dedicated to just one user task, and no virt is involved, and the user code involved is extremely careful.

But now UMWAIT got extended in a way to make it useful for the kernel, but it's controlled by the same MSR.  And this is busted.  What we want is for CPL0 UMWAIT to ignore the MSR or use a different MSR (for virt, sigh, except that this whole mechanism is presumably still useless on virt).  Or for a different instruction to be used from the kernel, maybe spelled MWAIT.

Can we please get some hardware folks to stop randomly adding features and start thinking about the fact that real users involve a kernel, in virt and bare metal, and user code, generally running in a preemptive kernel, sometimes under virt, and to FIGURE OUT WHAT THESE FEATURES SHOULD DO IN THESE CONTEXTS!

In the mean time, I assume that this stuff is baked into a CPU coming soon to real users, and there is no correct way to program it.  But we could set a small limit and eat a small power penalty if C0.2 transitions are really that fast.

Also, this series needs to be tested on virt.  Because UMWAIT, if it works at all on virt, is going to have all manner of odd consequences due to the fact that the hypervisor hasn't the faintest clue what's going on because there's no feedback.  For all that UIPI is nasty and half-baked, at least it tries to notify the next privilege level up as to what's going on.  Explicit wakeups virtualize much better than cacheline monitors.
Andy Lutomirski March 20, 2023, 8:29 p.m. UTC | #3
On Mon, Mar 20, 2023, at 11:27 AM, Andy Lutomirski wrote:
> On Mon, Mar 20, 2023, at 7:50 AM, Peter Zijlstra wrote:
>> On Fri, Mar 10, 2023 at 02:21:10PM +0200, Artem Bityutskiy wrote:
>>> +/**
>>> + * umwait_limit_init - initialize time limit value for 'umwait'.
>>> + *
>>> + * C0.1 and C0.2 (later C0.x) idle states are requested via the 'umwait'
>>> + * instruction. The 'umwait' instruction requires the "deadline" - the TSC
>>> + * counter value to break out of C0.x (unless it broke out because of an
>>> + * interrupt or some other event).
>>> + *
>>> + * The deadline is specified as an absolute TSC value, and it is calculated as
>>> + * current TSC value + 'umwait_limit'. This function initializes the
>>> + * 'umwait_limit' variable to the count of TSC cycles per tick. Motivation:
>>> + *   * the tick is not disabled for shallow states like C0.x, so idle will
>>> + *     not last longer than a tick anyway
>>> + *   * limit idle time to give cpuidle a chance to re-evaluate its C-state
>>> + *     selection decision and possibly select a deeper C-state.
>>> + */
>>> +static void __init umwait_limit_init(void)
>>> +{
>>> +	umwait_limit = (u64)TICK_NSEC * tsc_khz;
>>> +	do_div(umwait_limit, MICRO);
>>> +}
>>
>> Would it not make sense to put this limit in the MSR instead? By
>> randomly increasing the MSR limit you also change userspace behaviour vs
>> NOHZ_FULL.
>>
>> That was part of the reason why Andy insisted on having the MSR low.
>
> This is all busted.
>
> UMWAIT has a U for *user mode*.  We have the MSR set to a small value 
> because USER waits are a big can of worms, and long user waits don't 
> actually behave in any particularly intelligent manner unless that core 
> is dedicated to just one user task, and no virt is involved, and the 
> user code involved is extremely careful.
>
> But now UMWAIT got extended in a way to make it useful for the kernel, 
> but it's controlled by the same MSR.  And this is busted.  What we want 
> is for CPL0 UMWAIT to ignore the MSR or use a different MSR (for virt, 
> sigh, except that this whole mechanism is presumably still useless on 
> virt).  Or for a different instruction to be used from the kernel, 
> maybe spelled MWAIT.
>
> Can we please get some hardware folks to stop randomly adding features 
> and start thinking about the fact that real users involve a kernel, in 
> virt and bare metal, and user code, generally running in a preemptive 
> kernel, sometimes under virt, and to FIGURE OUT WHAT THESE FEATURES 
> SHOULD DO IN THESE CONTEXTS!
>
> In the mean time, I assume that this stuff is baked into a CPU coming 
> soon to real users, and there is no correct way to program it.  But we 
> could set a small limit and eat a small power penalty if C0.2 
> transitions are really that fast.
>
> Also, this series needs to be tested on virt.  Because UMWAIT, if it 
> works at all on virt, is going to have all manner of odd consequences 
> due to the fact that the hypervisor hasn't the faintest clue what's 
> going on because there's no feedback.  For all that UIPI is nasty and 
> half-baked, at least it tries to notify the next privilege level up as 
> to what's going on.  Explicit wakeups virtualize much better than 
> cacheline monitors.

At the very least, we need to know whether increasing the UMWAIT limit has a real benefit.  Because UMWAIT is really just a fancy busy wait, and it will still work with the low limit.  What happens if we just drop that part of this patch?
Andy Lutomirski March 20, 2023, 8:32 p.m. UTC | #4
On Mon, Mar 20, 2023, at 11:27 AM, Andy Lutomirski wrote:

> Also, this series needs to be tested on virt.  Because UMWAIT, if it 
> works at all on virt, is going to have all manner of odd consequences 
> due to the fact that the hypervisor hasn't the faintest clue what's 
> going on because there's no feedback.  For all that UIPI is nasty and 
> half-baked, at least it tries to notify the next privilege level up as 
> to what's going on.  Explicit wakeups virtualize much better than 
> cacheline monitors.

Sorry to keep replying to myself.  -ETOOLITTLESLEEP.

This needs more than testing on virt.  It needs explicit documentation and handling of virt, so we don't end up using UMWAIT on virt and doing something utterly daft, like busy-waiting instead of properly going to sleep, and not noticing because few people are actually testing on virt on a CPU that has this ability right now.

(Also, there's a surprising ability to thoroughly break idle without anyone reporting it for an impressively long time.  The system still serves cute cat photos, so it doesn't end up on the dashboard!)
Peter Zijlstra March 22, 2023, 10:18 a.m. UTC | #5
On Mon, Mar 20, 2023 at 11:27:54AM -0700, Andy Lutomirski wrote:

> This is all busted.

Well, yes.

> UMWAIT has a U for *user mode*.  We have the MSR set to a small value
> because USER waits are a big can of worms, and long user waits don't
> actually behave in any particularly intelligent manner unless that
> core is dedicated to just one user task, and no virt is involved, and
> the user code involved is extremely careful.

Idem for virt. [U]MWAIT only really works for virt when the CPU is
dedicated to the one (vcpu) task.

> But now UMWAIT got extended in a way to make it useful for the kernel,
> but it's controlled by the same MSR.  And this is busted.  What we
> want is for CPL0 UMWAIT to ignore the MSR or use a different MSR (for
> virt, sigh, except that this whole mechanism is presumably still
> useless on virt).  Or for a different instruction to be used from the
> kernel, maybe spelled MWAIT.

Yes, CPL0 usage should not be subject to the same limit. I'm not sure if
there's a good argument to have a different limit on virt vs random
other userspace.

> Also, this series needs to be tested on virt.  Because UMWAIT, if it
> works at all on virt, is going to have all manner of odd consequences
> due to the fact that the hypervisor hasn't the faintest clue what's
> going on because there's no feedback.  For all that UIPI is nasty and
> half-baked, at least it tries to notify the next privilege level up as
> to what's going on.  Explicit wakeups virtualize much better than
> cacheline monitors.

Virt is supposedly a big trumpeted use case for UMWAIT because VMMs
(rightfully) restrict MWAIT. At the same time, you *REALLY* do not want
your vcpu task doing long UMWAITs when there's other vcpu threads
waiting to do real work, so the MSR value *SHOULD* be really low.

Increasing this value randomly is bad.. increasing it beyond 1 tick is
abysmal.

Unless there are dedicated vcpu:cpu mappings, in which case KVM should
already be exposing MWAIT anyway.
Artem Bityutskiy March 29, 2023, 7:32 a.m. UTC | #6
There is a lot of feedback. Let me summarize a bit.

1. The C0.x time limit is controlled by MSR 0xE1 (IA32_UMWAIT_CONTROL[31:2]). This
limit applies to both CPL0 and CPL3. Your feedback is that this MSR should be
ignored in CPL0, or there should be a different MSR for CPL3.

Interesting point. I am discussing this with HW folks internally now and trying
to figure it out.


2. Peter Zijlstra asked earlier why the C0.x states are not available via 'mwait'.

Also a good point. Similarly, I am now discussing this with HW engineers and
trying to figure it out.


3. What happens if you do not increase the global limit in
IA32_UMWAIT_CONTROL[31:2]? Maybe just drop that part of the patch.

I am taking this approach now, and am measuring and benchmarking the system.


4. Test this in a virtualized environment.

Will do.


5. Then there were several references to virtualization, and this is the part of
your feedback I did not understand. I admit I do not know much about
virtualization.

Below are a few questions. I apologize in advance if they are naive;
please bear with me.


Question #1.

Userspace applications can do many strange things, for example, just busy-looping
on a variable waiting for it to change.

This is not social behavior. It may be a good idea for special apps with a
dedicated CPU, as you pointed out, e.g., DPDK or other latency-sensitive apps,
but it is a bad idea for a general app.

However, we can't control apps in general. If they want to busy-loop, they will.
If someone buys a VM, they may decide that they paid for it and can do whatever
they want.

Now take this sort of anti-social app and replace the busy-loop with umwait or
tpause. You get the same result, but it saves energy. So it is an optimization,
a good thing.

What am I missing?

Maybe you implied that umwait should be designed in a way that the hypervisor
could take over the core when the app or guest kernel uses it. Then the
hypervisor could do something else meanwhile.

But I think it would increase the C0.x latency observed by umwait users. That
would make umwait not useful for those few apps that do have a good reason for
using umwait (like DPDK).


Question #2.

Why can't we just set this global IA32_UMWAIT_CONTROL[31:2] limit to 0, which
means "forever", "no limit"?

Any user of tpause or umwait must have a loop checking for the "end of wait"
condition. Inside this loop, there is a umwait or tpause (as an optimization
over busy-looping).
busy-looping).

Both tpause and umwait break out on interrupts. The scheduler will preempt the
user task when its time slice is over or there is something more important to
run. When the task starts again, it will continue waiting in its loop, doing
tpause or umwait inside the loop.

What is the problem I am missing?

Question #3:

You wrote:

> Also, this series needs to be tested on virt.  Because UMWAIT, if it works at
> all on virt, is going to have all manner of odd consequences due to the fact
> that the hypervisor hasn't the faintest clue what's going on because there's
> no feedback.

What feedback would the hypervisor need? And what would it do with it?

Is your expectation that the hypervisor will do something else on the CPU, like
run another VM, while the original VM is umwait'ing? If so, the umwait latency
will be way longer than sub-1us...

Thanks in advance!
Andy Lutomirski April 10, 2023, 5:24 p.m. UTC | #7
On Wed, Mar 29, 2023, at 12:32 AM, Artem Bityutskiy wrote:
> There is a lot of feedback. Let me summarize a bit.
>
> 1. C0.x time limit is controlled by MSR 0xE1 - IA32_UMWAIT_CONTROL[31:2]. This
> limit applies to both CPL0 and CPL3. Your feedback is that this MSR should be
> ignored in CPL0, or there should be a different MSR for CPL3.
>
> Interesting point. I am discussing this with HW folks internally now and trying
> to figure it out.
>
>
> 2. Peter Zijlstra said earlier: why C0.x states are not available via 'mwait'.
>
> Also good point. Similarly, I am now discussing this with HW engineers and
> trying to figure it out.
>
>
> 3. What happens if you do not increase the global limit in
> IA32_UMWAIT_CONTROL[31:2]? Maybe just drop that patch.
>
> I am taking this approach now. Measuring and benchmarking the system now.
>
>
> 4. Test this in virtual environment.
>
> Will do.
>
>
> 5. Then there were several references to virtualization, and this is the part of
> your feedback I did not understand. I admit I do not know much about
> virtualization.
>
> Below are few questions. I apologize in advance because if they are naive,
> please, bear with me.
>
>
> Question #1.
>
> Userspace applications can do many strange things. For example, just busy-loop
> on a variable waiting for it to change.
>
> Not social behavior. Maybe a good idea for special apps, having a dedicated CPU, as
> you pointed. E.g., DPDK or other latency-sensitive apps. Bad idea for a general
> app.
>
> However, we can't control apps, in general. If they want to busy-loop they will.
> If someone buys a VM, they may decide that they paid for it and do whatever
> they want.
>
> Now take this sort of anti-social app. Replace the busy-loop with umwait or
> tpause. You get the same result, but it saves energy. So it is an optimization,
> good thing.
>
> What am I missing?

You're not missing anything extremely critical, but there are some less critical issues.

Dredging up an old email:

https://lore.kernel.org/all/CALCETrVJsCAWYSnUE+Ju_VmZfZBUBwUq-uFjV9=Vy+wddtJVCw@mail.gmail.com/

First, UMWAIT doesn't have a wakeup mechanism that can notify a supervisor (or hypervisor) of wakeups.  So if a task is UMWAITing and gets interrupted, the scheduler cannot tell when it should schedule it back in.  UIPI can do this (poorly), but UMWAIT doesn't even try.

So, with that caveat, UMWAIT is, as you are noting, a busy-loop that happens to be somewhat more power efficient than a plain loop or a bunch of REP NOPs.  And this is just fine for the top of the stack of supervisor things -- a kernel on bare metal, a hypervisor that is not itself nested, etc.  It's *also* fine for a well-designed assisted polling loop, paravirt style -- if a thread knows that the thread (vCPU, etc) that will wake it is running, then polling briefly may be a very good idea.  But it should be brief, and it should coordinate with a real wait/wake mechanism.

On top of all this, UMWAIT has the somewhat unfortunate effect that it sets CF according to whether the MSR deadline elapsed.  So if I do:

UMONITOR
UMWAIT (deadline = far future)

and the MSR is set to a large value, I can do this many, many times (possibly infinite) without ever seeing CF=1.  But if the MSR is set to a low value, I'll get CF=1 fairly often.

So setting the MSR to a low value prevents anyone from deluding themselves that UMWAIT is anything other than a busy-wait that can be interrupted based on a TSC timeout or an interrupt or whatever on a very regular basis.

--Andy

Patch

diff --git a/drivers/idle/intel_idle.c b/drivers/idle/intel_idle.c
index 938c17f25d94..0d0e45de610e 100644
--- a/drivers/idle/intel_idle.c
+++ b/drivers/idle/intel_idle.c
@@ -51,11 +51,13 @@ 
 #include <linux/notifier.h>
 #include <linux/cpu.h>
 #include <linux/moduleparam.h>
+#include <linux/units.h>
 #include <asm/cpu_device_id.h>
 #include <asm/intel-family.h>
 #include <asm/nospec-branch.h>
 #include <asm/mwait.h>
 #include <asm/msr.h>
+#include <asm/tsc.h>
 #include <asm/fpu/api.h>
 
 #define INTEL_IDLE_VERSION "0.5.1"
@@ -73,6 +75,8 @@  static struct cpuidle_device __percpu *intel_idle_cpuidle_devices;
 
 static unsigned long auto_demotion_disable_flags;
 
+static u64 umwait_limit;
+
 static enum {
 	C1E_PROMOTION_PRESERVE,
 	C1E_PROMOTION_ENABLE,
@@ -225,6 +229,27 @@  static __cpuidle int intel_idle_s2idle(struct cpuidle_device *dev,
 	return 0;
 }
 
+/**
+ * intel_idle_umwait_irq - Request C0.x using the 'umwait' instruction.
+ * @dev: cpuidle device of the target CPU.
+ * @drv: cpuidle driver (assumed to point to intel_idle_driver).
+ * @index: Target idle state index.
+ *
+ * Request C0.1 or C0.2 using 'umwait' instruction with interrupts enabled.
+ */
+static __cpuidle int intel_idle_umwait_irq(struct cpuidle_device *dev,
+					   struct cpuidle_driver *drv,
+					   int index)
+{
+	u32 state = flg2MWAIT(drv->states[index].flags);
+
+	raw_local_irq_enable();
+	umwait_idle(rdtsc() + umwait_limit, state);
+	raw_local_irq_disable();
+
+	return index;
+}
+
 /*
  * States are indexed by the cstate number,
  * which is also the index into the MWAIT hint array.
@@ -968,6 +993,13 @@  static struct cpuidle_state adl_n_cstates[] __initdata = {
 };
 
 static struct cpuidle_state spr_cstates[] __initdata = {
+	{
+		.name = "C0.2",
+		.desc = "UMWAIT C0.2",
+		.flags = MWAIT2flg(TPAUSE_C02_STATE) | CPUIDLE_FLAG_IRQ_ENABLE,
+		.exit_latency_ns = 100,
+		.target_residency_ns = 100,
+		.enter = &intel_idle_umwait_irq, },
 	{
 		.name = "C1",
 		.desc = "MWAIT 0x00",
@@ -1894,7 +1926,8 @@  static void __init intel_idle_init_cstates_icpu(struct cpuidle_driver *drv)
 		/* Structure copy. */
 		drv->states[drv->state_count] = cpuidle_state_table[cstate];
 
-		if ((cpuidle_state_table[cstate].flags & CPUIDLE_FLAG_IRQ_ENABLE) || force_irq_on) {
+		if (cpuidle_state_table[cstate].enter == intel_idle &&
+		    ((cpuidle_state_table[cstate].flags & CPUIDLE_FLAG_IRQ_ENABLE) || force_irq_on)) {
 			printk("intel_idle: forced intel_idle_irq for state %d\n", cstate);
 			drv->states[drv->state_count].enter = intel_idle_irq;
 		}
@@ -1926,6 +1959,28 @@  static void __init intel_idle_init_cstates_icpu(struct cpuidle_driver *drv)
 	}
 }
 
+/**
+ * umwait_limit_init - initialize time limit value for 'umwait'.
+ *
+ * C0.1 and C0.2 (later C0.x) idle states are requested via the 'umwait'
+ * instruction. The 'umwait' instruction requires the "deadline" - the TSC
+ * counter value to break out of C0.x (unless it broke out because of an
+ * interrupt or some other event).
+ *
+ * The deadline is specified as an absolute TSC value, and it is calculated as
+ * current TSC value + 'umwait_limit'. This function initializes the
+ * 'umwait_limit' variable to the count of TSC cycles per tick. Motivation:
+ *   * the tick is not disabled for shallow states like C0.x, so idle will
+ *     not last longer than a tick anyway
+ *   * limit idle time to give cpuidle a chance to re-evaluate its C-state
+ *     selection decision and possibly select a deeper C-state.
+ */
+static void __init umwait_limit_init(void)
+{
+	umwait_limit = (u64)TICK_NSEC * tsc_khz;
+	do_div(umwait_limit, MICRO);
+}
+
 /**
  * intel_idle_cpuidle_driver_init - Create the list of available idle states.
  * @drv: cpuidle driver structure to initialize.
@@ -1933,6 +1988,7 @@  static void __init intel_idle_init_cstates_icpu(struct cpuidle_driver *drv)
 static void __init intel_idle_cpuidle_driver_init(struct cpuidle_driver *drv)
 {
 	cpuidle_poll_state_init(drv);
+	umwait_limit_init();
 
 	if (disabled_states_mask & BIT(0))
 		drv->states[0].flags |= CPUIDLE_FLAG_OFF;