diff mbox

[RFC/RFT,4/7] cpuidle: menu: Split idle duration prediction from state selection

Message ID 2332986.m9oRvTSu8E@aspire.rjw.lan (mailing list archive)
State RFC, archived
Headers show

Commit Message

Rafael J. Wysocki March 4, 2018, 10:26 p.m. UTC
From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

In order to address the issue with short idle duration predictions
by the idle governor after the tick has been stopped, prepare the
menu governor code for reordering with respect to the timekeeping
code that stops the tick.

Use the observation that menu_select() can be split into two
functions, one predicting the idle duration and one selecting the
idle state, and rework it accordingly.

Introduce menu_predict() to predict the idle duration and (for now)
call it from menu_select().  The next set of changes will use
menu_predict() as a new governor callback.

This change is not expected to alter the functionality of the code.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
---
 drivers/cpuidle/governors/menu.c |   68 +++++++++++++++++++++------------------
 1 file changed, 38 insertions(+), 30 deletions(-)

Comments

Peter Zijlstra March 5, 2018, 11:38 a.m. UTC | #1
On Sun, Mar 04, 2018 at 11:26:24PM +0100, Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> 
> In order to address the issue with short idle duration predictions
> by the idle governor after the tick has been stopped, prepare the
> menu governor code for reordering with respect to the timekeeping
> code that stops the tick.
> 
> Use the observation that menu_select() can be split into two
> functions, one predicting the idle duration and one selecting the
> idle state, and rework it accordingly.

I actually think this is the wrong way around.

We really should be predicting state not duration. Yes the duration
thing is an intermediate value, but I don't think it makes any sense
what so ever to preserve that in the predictor. The end result is the
idle state, we should aim for that.

As per:

  https://lkml.org/lkml/2017/7/18/615

there are definite advantages to _not_ preserving duration information
beyond the state boundaries.
Rafael J. Wysocki March 5, 2018, 11:47 a.m. UTC | #2
On Mon, Mar 5, 2018 at 12:38 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Sun, Mar 04, 2018 at 11:26:24PM +0100, Rafael J. Wysocki wrote:
>> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>>
>> In order to address the issue with short idle duration predictions
>> by the idle governor after the tick has been stopped, prepare the
>> menu governor code for reordering with respect to the timekeeping
>> code that stops the tick.
>>
>> Use the observation that menu_select() can be split into two
>> functions, one predicting the idle duration and one selecting the
>> idle state, and rework it accordingly.
>
> I actually think this is the wrong way around.

In the meantime I concluded that this patch and the next one are
overdesign really.  I have a replacement for the two already. :-)

The only thing I need is the expected idle duration that can be
returned from ->select too.

> We really should be predicting state not duration. Yes the duration
> thing is an intermediate value, but I don't think it makes any sense
> what so ever to preserve that in the predictor. The end result is the
> idle state, we should aim for that.
>
> As per:
>
>   https://lkml.org/lkml/2017/7/18/615
>
> there are definite advantages to _not_ preserving duration information
> beyond the state boundaries.

Well, OK

The reason why I need the predicted idle duration is because the
target residency of the selected state may be below the tick period
duration and if this is the deepest state available, we still want to
stop the tick if the predicted idle duration is long.

IOW, the target residency of the selected state doesn't tell you how
much time you should expect to be idle in general.
Peter Zijlstra March 5, 2018, 12:50 p.m. UTC | #3
On Mon, Mar 05, 2018 at 12:47:23PM +0100, Rafael J. Wysocki wrote:
> On Mon, Mar 5, 2018 at 12:38 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> > We really should be predicting state not duration. Yes the duration
> > thing is an intermediate value, but I don't think it makes any sense
> > what so ever to preserve that in the predictor. The end result is the
> > idle state, we should aim for that.
> >
> > As per:
> >
> >   https://lkml.org/lkml/2017/7/18/615
> >
> > there are definite advantages to _not_ preserving duration information
> > beyond the state boundaries.
> 
> Well, OK
> 
> The reason why I need the predicted idle duration is because the
> target residency of the selected state may be below the tick period
> duration and if this is the deepest state available, we still want to
> stop the tick if the predicted idle duration is long.

Right, so in that case we'd split the deepest state and mark the
resulting smaller state as not disabling the tick and the resulting
larger state as disabling the tick.

So suppose your deepest state is < TICK_USEC, then we introduce a copy
of that state, modify the boundary to be TICK_USEC and set the 'disable
tick for this state' thing to true.

> IOW, the target residency of the selected state doesn't tell you how
> much time you should expect to be idle in general.

Right, but I think that measure isn't of primary relevance. What we want
to know is: 'should I stop the tick' and 'what C state do I go to'.

In order to answer those questions we need durations as input, but I
don't think we should preserve durations throughout. The scheme from the
above link reduces to N states in order to deal with arbitrary
distributions, only the actual states -- ie boundaries where our answers
changes -- are relevant, anything inside those boundaries would lead to
the exact same answer anyway.
Rafael J. Wysocki March 5, 2018, 1:05 p.m. UTC | #4
On Mon, Mar 5, 2018 at 1:50 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Mon, Mar 05, 2018 at 12:47:23PM +0100, Rafael J. Wysocki wrote:
>> On Mon, Mar 5, 2018 at 12:38 PM, Peter Zijlstra <peterz@infradead.org> wrote:
>> > We really should be predicting state not duration. Yes the duration
>> > thing is an intermediate value, but I don't think it makes any sense
>> > what so ever to preserve that in the predictor. The end result is the
>> > idle state, we should aim for that.
>> >
>> > As per:
>> >
>> >   https://lkml.org/lkml/2017/7/18/615
>> >
>> > there are definite advantages to _not_ preserving duration information
>> > beyond the state boundaries.
>>
>> Well, OK
>>
>> The reason why I need the predicted idle duration is because the
>> target residency of the selected state may be below the tick period
>> duration and if this is the deepest state available, we still want to
>> stop the tick if the predicted idle duration is long.
>
> Right, so in that case we'd split the deepest state and mark the
> resulting smaller state as not disabling the tick and the resulting
> larger state as disabling the tick.
>
> So suppose your deepest state is < TICK_USEC, then we introduce a copy
> of that state, modify the boundary to be TICK_USEC and set the 'disable
> tick for this state' thing to true.
>
>> IOW, the target residency of the selected state doesn't tell you how
>> much time you should expect to be idle in general.
>
> Right, but I think that measure isn't of primary relevance. What we want
> to know is: 'should I stop the tick' and 'what C state do I go to'.
>
> In order to answer those questions we need durations as input, but I
> don't think we should preserve durations throughout. The scheme from the
> above link reduces to N states in order to deal with arbitrary
> distributions, only the actual states -- ie boundaries where our answers
> changes -- are relevant, anything inside those boundaries would lead to
> the exact same answer anyway.

I generally agree here, but I'm not convinced about flagging the
states, splitting them and so on.

Maybe just return a "nohz" indicator from cpuidle_select() in addition
to the state index and make the decision in the governor?
Peter Zijlstra March 5, 2018, 1:53 p.m. UTC | #5
On Mon, Mar 05, 2018 at 02:05:10PM +0100, Rafael J. Wysocki wrote:
> On Mon, Mar 5, 2018 at 1:50 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> > On Mon, Mar 05, 2018 at 12:47:23PM +0100, Rafael J. Wysocki wrote:

> >> IOW, the target residency of the selected state doesn't tell you how
> >> much time you should expect to be idle in general.
> >
> > Right, but I think that measure isn't of primary relevance. What we want
> > to know is: 'should I stop the tick' and 'what C state do I go to'.
> >
> > In order to answer those questions we need durations as input, but I
> > don't think we should preserve durations throughout. The scheme from the
> > above link reduces to N states in order to deal with arbitrary
> > distributions, only the actual states -- ie boundaries where our answers
> > changes -- are relevant, anything inside those boundaries would lead to
> > the exact same answer anyway.
> 
> I generally agree here, but I'm not convinced about flagging the
> states, splitting them and so on.

I think linking them like that makes sense, but I can see room for
discussion...

> Maybe just return a "nohz" indicator from cpuidle_select() in addition
> to the state index and make the decision in the governor?

Much better option than returning a duration :-)
Aubrey Li March 6, 2018, 2:15 a.m. UTC | #6
On 2018/3/5 21:53, Peter Zijlstra wrote:
> On Mon, Mar 05, 2018 at 02:05:10PM +0100, Rafael J. Wysocki wrote:
>> On Mon, Mar 5, 2018 at 1:50 PM, Peter Zijlstra <peterz@infradead.org> wrote:
>>> On Mon, Mar 05, 2018 at 12:47:23PM +0100, Rafael J. Wysocki wrote:
> 
>>>> IOW, the target residency of the selected state doesn't tell you how
>>>> much time you should expect to be idle in general.
>>>
>>> Right, but I think that measure isn't of primary relevance. What we want
>>> to know is: 'should I stop the tick' and 'what C state do I go to'.

I understood the benefit of mapping duration to state number, is duration <->
state number mapping a generic solution to all arches? 

Back to the user's concern is, "I'm running a latency sensitive application, and
I want idle switching ASAP". So I think the user may not care about what C state
to go into, that is, even if a deeper state has chance to go, the user striving
for a higher workload score may still not want it?

>>>
>>> In order to answer those questions we need durations as input, but I
>>> don't think we should preserve durations throughout. The scheme from the
>>> above link reduces to N states in order to deal with arbitrary
>>> distributions, only the actual states -- ie boundaries where our answers
>>> changes -- are relevant, anything inside those boundaries would lead to
>>> the exact same answer anyway.
>>
>> I generally agree here, but I'm not convinced about flagging the
>> states, splitting them and so on.
> 
> I think linking them like that makes sense, but I can see room for
> discussion...
> 
>> Maybe just return a "nohz" indicator from cpuidle_select() in addition
>> to the state index and make the decision in the governor?
> 
> Much better option than returning a duration :-)
>
So what does "nohz = disable and state index = deepest" mean? This combination
does not make sense for performance only purpose?

Thanks,
-Aubrey
Peter Zijlstra March 6, 2018, 8:45 a.m. UTC | #7
On Tue, Mar 06, 2018 at 10:15:10AM +0800, Li, Aubrey wrote:
> On 2018/3/5 21:53, Peter Zijlstra wrote:
> > On Mon, Mar 05, 2018 at 02:05:10PM +0100, Rafael J. Wysocki wrote:
> >> On Mon, Mar 5, 2018 at 1:50 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> >>> On Mon, Mar 05, 2018 at 12:47:23PM +0100, Rafael J. Wysocki wrote:
> > 
> >>>> IOW, the target residency of the selected state doesn't tell you how
> >>>> much time you should expect to be idle in general.
> >>>
> >>> Right, but I think that measure isn't of primary relevance. What we want
> >>> to know is: 'should I stop the tick' and 'what C state do I go to'.
> 
> I understood the benefit of mapping duration to state number, is duration <->
> state number mapping a generic solution to all arches?

Yes, all platforms have a limited set of possible idle states.

> Back to the user's concern is, "I'm running a latency sensitive application, and
> I want idle switching ASAP". So I think the user may not care about what C state
> to go into, that is, even if a deeper state has chance to go, the user striving
> for a higher workload score may still not want it?

The user caring about performance very much cares about the actual idle
state too, exit latency for deeper states is horrific and will screw
them up just as much as the whole nohz timer reprogramming does.

We can basically view the whole nohz thing as an additional entry/exit
latency for the idle state, which is why I don't think its weird to
couple them.

> >> Maybe just return a "nohz" indicator from cpuidle_select() in addition
> >> to the state index and make the decision in the governor?
> > 
> > Much better option than returning a duration :-)
> >
> So what does "nohz = disable and state index = deepest" mean? This combination
> does not make sense for performance only purpose?

I tend to agree with you that the state space allowed by a separate
variable is larger than required, but it's significantly smaller than
preserving 'time' so I can live with it.
Aubrey Li March 6, 2018, 2:07 p.m. UTC | #8
On 2018/3/6 16:45, Peter Zijlstra wrote:
> On Tue, Mar 06, 2018 at 10:15:10AM +0800, Li, Aubrey wrote:
>> On 2018/3/5 21:53, Peter Zijlstra wrote:
>>> On Mon, Mar 05, 2018 at 02:05:10PM +0100, Rafael J. Wysocki wrote:
>>>> On Mon, Mar 5, 2018 at 1:50 PM, Peter Zijlstra <peterz@infradead.org> wrote:
>>>>> On Mon, Mar 05, 2018 at 12:47:23PM +0100, Rafael J. Wysocki wrote:
>>>
>>>>>> IOW, the target residency of the selected state doesn't tell you how
>>>>>> much time you should expect to be idle in general.
>>>>>
>>>>> Right, but I think that measure isn't of primary relevance. What we want
>>>>> to know is: 'should I stop the tick' and 'what C state do I go to'.
>>
>> I understood the benefit of mapping duration to state number, is duration <->
>> state number mapping a generic solution to all arches?
> 
> Yes, all platforms have a limited set of possible idle states.
> 
>> Back to the user's concern is, "I'm running a latency sensitive application, and
>> I want idle switching ASAP". So I think the user may not care about what C state
>> to go into, that is, even if a deeper state has chance to go, the user striving
>> for a higher workload score may still not want it?
> 
> The user caring about performance very much cares about the actual idle
> state too, exit latency for deeper states is horrific and will screw
> them up just as much as the whole nohz timer reprogramming does.
> 
> We can basically view the whole nohz thing as an additional entry/exit
> latency for the idle state, which is why I don't think its weird to
> couple them.

okay, I see your point now. Thanks!

> 
>>>> Maybe just return a "nohz" indicator from cpuidle_select() in addition
>>>> to the state index and make the decision in the governor?
>>>
>>> Much better option than returning a duration :-)
>>>
>> So what does "nohz = disable and state index = deepest" mean? This combination
>> does not make sense for performance only purpose?
> 
> I tend to agree with you that the state space allowed by a separate
> variable is larger than required, but it's significantly smaller than
> preserving 'time' so I can live with it.
>
diff mbox

Patch

Index: linux-pm/drivers/cpuidle/governors/menu.c
===================================================================
--- linux-pm.orig/drivers/cpuidle/governors/menu.c
+++ linux-pm/drivers/cpuidle/governors/menu.c
@@ -130,6 +130,7 @@  struct menu_device {
 	unsigned int	correction_factor[BUCKETS];
 	unsigned int	intervals[INTERVALS];
 	int		interval_ptr;
+	int		latency_req;
 };
 
 
@@ -275,36 +276,30 @@  again:
 	goto again;
 }
 
-/**
- * menu_select - selects the next idle state to enter
- * @drv: cpuidle driver containing state data
- * @dev: the CPU
- */
-static int menu_select(struct cpuidle_driver *drv, struct cpuidle_device *dev)
+static void menu_predict(struct cpuidle_driver *drv, struct cpuidle_device *dev)
 {
 	struct menu_device *data = this_cpu_ptr(&menu_devices);
 	struct device *device = get_cpu_device(dev->cpu);
-	int latency_req = pm_qos_request(PM_QOS_CPU_DMA_LATENCY);
-	int i;
-	int first_idx;
-	int idx;
+	int resume_latency = dev_pm_qos_raw_read_value(device);
 	unsigned int interactivity_req;
 	unsigned int expected_interval;
 	unsigned long nr_iowaiters, cpu_load;
-	int resume_latency = dev_pm_qos_raw_read_value(device);
 
 	if (data->needs_update) {
 		menu_update(drv, dev);
 		data->needs_update = 0;
 	}
 
-	if (resume_latency < latency_req &&
+	data->latency_req = pm_qos_request(PM_QOS_CPU_DMA_LATENCY);
+	if (resume_latency < data->latency_req &&
 	    resume_latency != PM_QOS_RESUME_LATENCY_NO_CONSTRAINT)
-		latency_req = resume_latency;
+		data->latency_req = resume_latency;
 
 	/* Special case when user has set very strict latency requirement */
-	if (unlikely(latency_req == 0))
-		return 0;
+	if (unlikely(data->latency_req == 0)) {
+		data->predicted_us = 0;
+		return;
+	}
 
 	/* determine the expected residency time, round up */
 	data->next_timer_us = ktime_to_us(tick_nohz_get_sleep_length());
@@ -324,6 +319,32 @@  static int menu_select(struct cpuidle_dr
 	expected_interval = get_typical_interval(data);
 	expected_interval = min(expected_interval, data->next_timer_us);
 
+	/*
+	 * Use the lowest expected idle interval to pick the idle state.
+	 */
+	data->predicted_us = min(data->predicted_us, expected_interval);
+
+	/*
+	 * Use the performance multiplier and the user-configurable
+	 * latency_req to determine the maximum exit latency.
+	 */
+	interactivity_req = data->predicted_us / performance_multiplier(nr_iowaiters, cpu_load);
+	if (data->latency_req > interactivity_req)
+		data->latency_req = interactivity_req;
+}
+
+/**
+ * menu_select - selects the next idle state to enter
+ * @drv: cpuidle driver containing state data
+ * @dev: the CPU
+ */
+static int menu_select(struct cpuidle_driver *drv, struct cpuidle_device *dev)
+{
+	struct menu_device *data = this_cpu_ptr(&menu_devices);
+	int first_idx, idx, i;
+
+	menu_predict(drv, dev);
+
 	first_idx = 0;
 	if (drv->states[0].flags & CPUIDLE_FLAG_POLLING) {
 		struct cpuidle_state *s = &drv->states[1];
@@ -336,25 +357,12 @@  static int menu_select(struct cpuidle_dr
 		 */
 		polling_threshold = max_t(unsigned int, 20, s->target_residency);
 		if (data->next_timer_us > polling_threshold &&
-		    latency_req > s->exit_latency && !s->disabled &&
+		    data->latency_req > s->exit_latency && !s->disabled &&
 		    !dev->states_usage[1].disable)
 			first_idx = 1;
 	}
 
 	/*
-	 * Use the lowest expected idle interval to pick the idle state.
-	 */
-	data->predicted_us = min(data->predicted_us, expected_interval);
-
-	/*
-	 * Use the performance multiplier and the user-configurable
-	 * latency_req to determine the maximum exit latency.
-	 */
-	interactivity_req = data->predicted_us / performance_multiplier(nr_iowaiters, cpu_load);
-	if (latency_req > interactivity_req)
-		latency_req = interactivity_req;
-
-	/*
 	 * Find the idle state with the lowest power while satisfying
 	 * our constraints.
 	 */
@@ -369,7 +377,7 @@  static int menu_select(struct cpuidle_dr
 			idx = i; /* first enabled state */
 		if (s->target_residency > data->predicted_us)
 			break;
-		if (s->exit_latency > latency_req)
+		if (s->exit_latency > data->latency_req)
 			break;
 
 		idx = i;