cpuidle: use high confidence factors only when considering polling

Message ID 20160316121400.680a6a46@annuminas.surriel.com (mailing list archive)
State Accepted, archived
Delegated to: Rafael Wysocki

Commit Message

Rik van Riel March 16, 2016, 4:14 p.m. UTC
The menu governor uses five different factors to pick the
idle state:
- the user configured latency_req
- the time until the next timer (next_timer_us)
- the typical sleep interval, as measured recently
- an estimate of sleep time by dividing next_timer_us by an observed factor
- a load corrected version of the above, divided again by load

Only the first three items are known with enough confidence that
we can use them to consider polling, instead of an actual CPU
idle state, because the cost of being wrong about polling can be
excessive power use.

The latter two are used in the menu governor's main selection
loop, and can result in choosing a shallower idle state when
the system is expected to be busy again soon.

This pushes a busy system in the "performance" direction of
the performance<>power tradeoff, when choosing between idle
states, but stays more strictly on the "power" state when
deciding between polling and C1.

Signed-off-by: Rik van Riel <riel@redhat.com>
---
 drivers/cpuidle/governors/menu.c | 42 +++++++++++++++++++++++-----------------
 1 file changed, 24 insertions(+), 18 deletions(-)
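
The two-stage decision described in the commit message can be sketched in
standalone C as follows. This is a simplified userspace model, not the
kernel code: struct idle_state, pick_state() and all of its parameters are
hypothetical names, state 0 is assumed to be polling and state 1 to be C1,
and the checks for disabled states are left out.

```c
#include <limits.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for struct cpuidle_state. */
struct idle_state {
	unsigned int target_residency;	/* microseconds */
	unsigned int exit_latency;	/* microseconds */
};

static unsigned int min_uint(unsigned int a, unsigned int b)
{
	return a < b ? a : b;
}

/*
 * states[0] is polling, states[1] is C1, deeper states follow.
 * typical_us:   recently measured typical interval, UINT_MAX if none
 * predicted_us: next_timer_us scaled by the observed correction factor
 * multiplier:   load based performance multiplier (assumed >= 1)
 */
static size_t pick_state(const struct idle_state *states, size_t count,
			 unsigned int latency_req, unsigned int next_timer_us,
			 unsigned int typical_us, unsigned int predicted_us,
			 unsigned int multiplier)
{
	unsigned int expected_interval = min_uint(typical_us, next_timer_us);
	unsigned int interactivity_req;
	size_t idx = 0, i;

	if (multiplier == 0)
		multiplier = 1;

	/* Stage 1: polling vs. C1, using only the high confidence factors. */
	if (count > 1 &&
	    expected_interval > states[1].target_residency &&
	    latency_req > states[1].exit_latency)
		idx = 1;

	/* Stage 2: the lower confidence estimates only limit the depth. */
	predicted_us = min_uint(predicted_us, expected_interval);
	interactivity_req = predicted_us / multiplier;
	latency_req = min_uint(latency_req, interactivity_req);

	for (i = idx + 1; i < count; i++) {
		if (states[i].target_residency > predicted_us)
			continue;
		if (states[i].exit_latency > latency_req)
			continue;
		idx = i;
	}
	return idx;
}
```

In this model a large multiplier (a busy system) lowers latency_req and so
keeps the loop from reaching the deepest states, but it has no influence on
the stage 1 choice between polling and C1.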


--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Rafael J. Wysocki March 18, 2016, 12:45 a.m. UTC | #1
On Wednesday, March 16, 2016 12:14:00 PM Rik van Riel wrote:
> The menu governor uses five different factors to pick the
> idle state:
> - the user configured latency_req
> - the time until the next timer (next_timer_us)
> - the typical sleep interval, as measured recently
> - an estimate of sleep time by dividing next_timer_us by an observed factor
> - a load corrected version of the above, divided again by load
> 
> Only the first three items are known with enough confidence that
> we can use them to consider polling, instead of an actual CPU
> idle state, because the cost of being wrong about polling can be
> excessive power use.
> 
> The latter two are used in the menu governor's main selection
> loop, and can result in choosing a shallower idle state when
> the system is expected to be busy again soon.
> 
> This pushes a busy system in the "performance" direction of
> the performance<>power tradeoff, when choosing between idle
> states, but stays more strictly on the "power" state when
> deciding between polling and C1.
> 
> Signed-off-by: Rik van Riel <riel@redhat.com>

Applied, thanks!

Doug Smythies March 18, 2016, 6:32 a.m. UTC | #2
Sorry for the delay in my reply / test. The patch e-mail went
to my junk folder for some reason.

On 2016.03.17 17:46 Rafael J. Wysocki wrote:
> On Wednesday, March 16, 2016 12:14:00 PM Rik van Riel wrote:
>> The menu governor uses five different factors to pick the
>> idle state:
>> - the user configured latency_req
>> - the time until the next timer (next_timer_us)
>> - the typical sleep interval, as measured recently
>> - an estimate of sleep time by dividing next_timer_us by an observed factor
>> - a load corrected version of the above, divided again by load
>> 
>> Only the first three items are known with enough confidence that
>> we can use them to consider polling, instead of an actual CPU
>> idle state, because the cost of being wrong about polling can be
>> excessive power use.
>> 
>> The latter two are used in the menu governor's main selection
>> loop, and can result in choosing a shallower idle state when
>> the system is expected to be busy again soon.
>> 
>> This pushes a busy system in the "performance" direction of
>> the performance<>power tradeoff, when choosing between idle
>> states, but stays more strictly on the "power" state when
>> deciding between polling and C1.
>> 
>> Signed-off-by: Rik van Riel <riel@redhat.com>

For my part of it, this patch seems to be not O.K.
(reference rvr5 = this patch)

Aggregate idle for the 2000 second test. All in minutes.
(old tests re-stated)

State   k45rc7-rjw10   k45rc7-rjw10-reverted   k45rc7-rjw10-rvr5
0.00           18.07                    0.92               18.67
1.00           12.35                   19.51               12.82
2.00            3.96                    4.28                3.97
3.00            1.55                    1.53                1.58
4.00          138.96                  141.99              143.80

total         174.90                  168.24              180.84

Energy:
>> Kernel 4.5-rc7-rjw10: 61983 Joules
>> Kernel 4.5-rc7-rjw10-reverted: 48409 Joules (test 2 was 55040 Joules)
Kernel 4.5-rc7-rjw10-rvr5: 62243 Joules

I did acquire trace data with this test, but haven't post processed it yet.

... Doug


Rafael J. Wysocki March 18, 2016, 1:11 p.m. UTC | #3
On Fri, Mar 18, 2016 at 7:32 AM, Doug Smythies <dsmythies@telus.net> wrote:
> Sorry for the delay in my reply / test. The patch e-mail went
> to my junk folder for some reason.
>
> On 2016.03.17 17:46 Rafael J. Wysocki wrote:
>> On Wednesday, March 16, 2016 12:14:00 PM Rik van Riel wrote:
>>> The menu governor uses five different factors to pick the
>>> idle state:
>>> - the user configured latency_req
>>> - the time until the next timer (next_timer_us)
>>> - the typical sleep interval, as measured recently
>>> - an estimate of sleep time by dividing next_timer_us by an observed factor
>>> - a load corrected version of the above, divided again by load
>>>
>>> Only the first three items are known with enough confidence that
>>> we can use them to consider polling, instead of an actual CPU
>>> idle state, because the cost of being wrong about polling can be
>>> excessive power use.
>>>
>>> The latter two are used in the menu governor's main selection
>>> loop, and can result in choosing a shallower idle state when
>>> the system is expected to be busy again soon.
>>>
>>> This pushes a busy system in the "performance" direction of
>>> the performance<>power tradeoff, when choosing between idle
>>> states, but stays more strictly on the "power" state when
>>> deciding between polling and C1.
>>>
>>> Signed-off-by: Rik van Riel <riel@redhat.com>
>
> For my part of it, this patch seems to be not O.K.
> (reference rvr5 = this patch)
>
> Aggregate idle for the 2000 second test. All in minutes.
> (old tests re-stated)
>
> State   k45rc7-rjw10   k45rc7-rjw10-reverted   k45rc7-rjw10-rvr5
> 0.00           18.07                    0.92               18.67
> 1.00           12.35                   19.51               12.82
> 2.00            3.96                    4.28                3.97
> 3.00            1.55                    1.53                1.58
> 4.00          138.96                  141.99              143.80
>
> total         174.90                  168.24              180.84
>
> Energy:
>>> Kernel 4.5-rc7-rjw10: 61983 Joules
>>> Kernel 4.5-rc7-rjw10-reverted: 48409 Joules (test 2 was 55040 Joules)
> Kernel 4.5-rc7-rjw10-rvr5: 62243 Joules
>
> I did acquire trace data with this test, but haven't post processed it yet.

I'm wondering what happens if you replace the expected_interval in the
"expected_interval >
drv->states[CPUIDLE_DRIVER_STATE_START].target_residency" test with
data->next_timer_us (with Rik's patch applied, of course).  Can
you please try doing that?
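
To make the suggested experiment concrete, the two variants of the test can
be written side by side. This is only a minimal sketch: the helper names
c1_ok_typical() and c1_ok_timer() are hypothetical, plain integers stand in
for the kernel structures, and the latency and disabled-state checks are
left out.

```c
#include <stdbool.h>

/* Test as in Rik's patch: min(typical interval, next timer). */
static bool c1_ok_typical(unsigned int typical_us, unsigned int next_timer_us,
			  unsigned int c1_target_residency)
{
	unsigned int expected_interval =
		typical_us < next_timer_us ? typical_us : next_timer_us;

	return expected_interval > c1_target_residency;
}

/* Suggested variant: only the time until the next timer is considered. */
static bool c1_ok_timer(unsigned int next_timer_us,
			unsigned int c1_target_residency)
{
	return next_timer_us > c1_target_residency;
}
```

With illustrative values (typical interval 1 us, next timer 10000 us, C1
target residency 2 us) the first test rejects C1 and leaves the CPU polling,
while the second one selects C1.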
Doug Smythies March 18, 2016, 6:32 p.m. UTC | #4
On 2016.03.18 06:12 Rafael J. Wysocki wrote:
> On Fri, Mar 18, 2016 at 7:32 AM, Doug Smythies <dsmythies@telus.net> wrote:
>
>> For my part of it, this patch seems to be not O.K.
>> (reference rvr5 = this patch)
>>
>> Aggregate idle for the 2000 second test. All in minutes.
>> (old tests re-stated)
>>
>> State   k45rc7-rjw10   k45rc7-rjw10-reverted   k45rc7-rjw10-rvr5
>> 0.00           18.07                    0.92               18.67
>> 1.00           12.35                   19.51               12.82
>> 2.00            3.96                    4.28                3.97
>> 3.00            1.55                    1.53                1.58
>> 4.00          138.96                  141.99              143.80
>>
>> total         174.90                  168.24              180.84
>>
>> Energy:
>>>> Kernel 4.5-rc7-rjw10: 61983 Joules
>>>> Kernel 4.5-rc7-rjw10-reverted: 48409 Joules (test 2 was 55040 Joules)
>> Kernel 4.5-rc7-rjw10-rvr5: 62243 Joules
>>
>> I did acquire trace data with this test, but haven't post processed it yet.
>
> I'm wondering what happens if you replace the expected_interval in the
> "expected_interval >
> drv->states[CPUIDLE_DRIVER_STATE_START].target_residency" test with
> data->next_timer_us (with Rik's patch applied, of course).  Can
> you please try doing that?

O.K. my reference: rvr6 is the above modification to rvr5
It works as well as "reverted".

State	k45rc7-rjw10-rvr6 (mins)
0.00	0.87
1.00	24.20
2.00	4.05
3.00	1.72
4.00	147.50

total	178.34

Energy:
Kernel 4.5-rc7-rjw10-rvr6: 55864 Joules
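
The energy figures in this thread can be compared with a small helper. This
is just an illustration of the arithmetic; energy_saving_percent() is a
hypothetical name and not part of any test tool used here.

```c
/*
 * Energy saved relative to a baseline, in percent.
 * A negative result means more energy was used than in the baseline.
 */
static double energy_saving_percent(double baseline_j, double measured_j)
{
	return 100.0 * (baseline_j - measured_j) / baseline_j;
}
```

For example, energy_saving_percent(61983, 55864) gives roughly 9.9 for rvr6
against the unmodified rjw10 kernel, and energy_saving_percent(61983, 62243)
gives roughly -0.4 for rvr5.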

Trace data (very crude summary):
Kernel 4.5-rc7-rjw10-rvr5: ~3049 long durations at high CPU load (idle state 0)
Kernel 4.5-rc7-rjw10-rvr6: ~183 long durations at high, but less, CPU load (not all idle state 0)

... Doug


Patch

diff --git a/drivers/cpuidle/governors/menu.c b/drivers/cpuidle/governors/menu.c
index 0742b3296673..fba867d917f7 100644
--- a/drivers/cpuidle/governors/menu.c
+++ b/drivers/cpuidle/governors/menu.c
@@ -196,7 +196,7 @@  static void menu_update(struct cpuidle_driver *drv, struct cpuidle_device *dev);
  * of points is below a threshold. If it is... then use the
  * average of these 8 points as the estimated value.
  */
-static void get_typical_interval(struct menu_device *data)
+static unsigned int get_typical_interval(struct menu_device *data)
 {
 	int i, divisor;
 	unsigned int max, thresh;
@@ -254,9 +254,7 @@  static void get_typical_interval(struct menu_device *data)
 		stddev = int_sqrt(stddev);
 		if (((avg > stddev * 6) && (divisor * 4 >= INTERVALS * 3))
 							|| stddev <= 20) {
-			if (data->next_timer_us > avg)
-				data->predicted_us = avg;
-			return;
+			return avg;
 		}
 	}
 
@@ -270,7 +268,7 @@  static void get_typical_interval(struct menu_device *data)
 	 * with sporadic activity with a bunch of short pauses.
 	 */
 	if ((divisor * 4) <= INTERVALS * 3)
-		return;
+		return UINT_MAX;
 
 	thresh = max - 1;
 	goto again;
@@ -287,6 +285,7 @@  static int menu_select(struct cpuidle_driver *drv, struct cpuidle_device *dev)
 	int latency_req = pm_qos_request(PM_QOS_CPU_DMA_LATENCY);
 	int i;
 	unsigned int interactivity_req;
+	unsigned int expected_interval;
 	unsigned long nr_iowaiters, cpu_load;
 
 	if (data->needs_update) {
@@ -313,32 +312,39 @@  static int menu_select(struct cpuidle_driver *drv, struct cpuidle_device *dev)
 					 data->correction_factor[data->bucket],
 					 RESOLUTION * DECAY);
 
-	get_typical_interval(data);
-
-	/*
-	 * Performance multiplier defines a minimum predicted idle
-	 * duration / latency ratio. Adjust the latency limit if
-	 * necessary.
-	 */
-	interactivity_req = data->predicted_us / performance_multiplier(nr_iowaiters, cpu_load);
-	if (latency_req > interactivity_req)
-		latency_req = interactivity_req;
+	expected_interval = get_typical_interval(data);
+	expected_interval = min(expected_interval, data->next_timer_us);
 
 	if (CPUIDLE_DRIVER_STATE_START > 0) {
 		data->last_state_idx = CPUIDLE_DRIVER_STATE_START - 1;
 		/*
 		 * We want to default to C1 (hlt), not to busy polling
-		 * unless the timer is happening really really soon.
+		 * unless the timer is happening really really soon, or
+		 * C1's exit latency exceeds the user configured limit.
 		 */
-		if (interactivity_req > 20 &&
+		if (expected_interval > drv->states[CPUIDLE_DRIVER_STATE_START].target_residency &&
+		    latency_req > drv->states[CPUIDLE_DRIVER_STATE_START].exit_latency &&
 		    !drv->states[CPUIDLE_DRIVER_STATE_START].disabled &&
-			dev->states_usage[CPUIDLE_DRIVER_STATE_START].disable == 0)
+		    !dev->states_usage[CPUIDLE_DRIVER_STATE_START].disable)
 			data->last_state_idx = CPUIDLE_DRIVER_STATE_START;
 	} else {
 		data->last_state_idx = CPUIDLE_DRIVER_STATE_START;
 	}
 
 	/*
+	 * Use the lowest expected idle interval to pick the idle state.
+	 */
+	data->predicted_us = min(data->predicted_us, expected_interval);
+
+	/*
+	 * Use the performance multiplier and the user-configurable
+	 * latency_req to determine the maximum exit latency.
+	 */
+	interactivity_req = data->predicted_us / performance_multiplier(nr_iowaiters, cpu_load);
+	if (latency_req > interactivity_req)
+		latency_req = interactivity_req;
+
+	/*
 	 * Find the idle state with the lowest power while satisfying
 	 * our constraints.
 	 */