[RFC] sched,idle: teach select_idle_sibling about idle states

Message ID 20141002131548.6cd377d5@cuia.bos.redhat.com (mailing list archive)
State Not Applicable, archived

Commit Message

Rik van Riel Oct. 2, 2014, 5:15 p.m. UTC
On Tue, 30 Sep 2014 19:15:00 -0400 (EDT)
Nicolas Pitre <nicolas.pitre@linaro.org> wrote:
> On Tue, 30 Sep 2014, Rik van Riel wrote:

> > The main thing it does not cover is already running tasks that
> > get woken up again, since select_idle_sibling() covers everything
> > except for newly forked and newly executed tasks.
> 
> True. Now that you bring this up, I remember that Peter mentioned it as 
> well.
> 
> > I am looking at adding similar logic to select_idle_sibling()
> 
> OK thanks.

This patch is ugly. I have not bothered cleaning it up, because it
causes a regression with hackbench. Apparently for hackbench (and
potentially other sync wakeups), locality is more important than
idleness.

We may need to add a third clause before the search, to ensure
target gets selected if neither target nor i is idle and the
wakeup is synchronous. Something along the lines of:

    if (sync_wakeup && cpu_rq(target)->nr_running == 1)
	return target;

I still need to run tests with other workloads, too.

Another consideration is that search costs with this patch
are potentially much increased. I suspect we may want to simply
propagate the load on each sched_group up the tree hierarchically,
with delta accounting and propagating the info upwards only when
the delta is significant, like done in __update_tg_runnable_avg.
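
A rough user-space sketch of the delta-accounting idea, to make the
proposal concrete. Everything here is hypothetical: struct group_node,
group_load_add() and the 1/8 threshold are made up for illustration and
are not kernel code. A change is folded into the parent only when the
unreported delta is large relative to what was last reported, so small
fluctuations never touch the upper levels:

```c
#include <stdlib.h>

/* Hypothetical model of one level in the sched_group tree. */
struct group_node {
	struct group_node *parent;	/* NULL at the root */
	long load;			/* load as known at this level */
	long reported;			/* load last pushed to the parent */
};

/* Assumed threshold: propagate when the delta exceeds 1/8 of 'reported'. */
#define PROPAGATE_DIVISOR 8

/*
 * Apply a load change at a node, then push the accumulated delta
 * upward for as long as it is significant at each level.
 */
static void group_load_add(struct group_node *node, long change)
{
	node->load += change;

	while (node->parent) {
		long delta = node->load - node->reported;

		/* Small drift: stop here and leave the ancestors stale. */
		if (labs(delta) <= labs(node->reported) / PROPAGATE_DIVISOR)
			break;

		node->parent->load += delta;
		node->reported = node->load;
		node = node->parent;
	}
}
```

With a leaf that last reported 100, a change of +5 stays local, while a
further +20 pushes the accumulated 25 to the parent.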

---8<---

Subject: sched,idle: teach select_idle_sibling about idle states

Change select_idle_sibling to take cpu idle exit latency into
account.  First preference is to select the cpu with the lowest
exit latency from a completely idle sched_group inside the CPU;
if that is not available, we pick the CPU with the lowest exit
latency in any sched_group.

This increases the total search time of select_idle_sibling,
we may want to look into propagating load info up the sched_group
tree in some way. That information would also be useful to prevent
the wake_affine logic from causing a load imbalance between
sched_groups.

It is not clear when locality (from staying on the old CPU) beats
a lower idle exit latency. Having information on whether the CPU
drops content from the CPU caches in certain idle states would
help with that, but with multiple CPUs bound together in the same
physical CPU core, the hardware often does not do what we tell it,
anyway...

Signed-off-by: Rik van Riel <riel@redhat.com>
---
 kernel/sched/fair.c | 47 +++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 41 insertions(+), 6 deletions(-)


--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Mike Galbraith Oct. 3, 2014, 6:04 a.m. UTC | #1
On Thu, 2014-10-02 at 13:15 -0400, Rik van Riel wrote:

> This patch is ugly. I have not bothered cleaning it up, because it
> causes a regression with hackbench. Apparently for hackbench (and
> potentially other sync wakeups), locality is more important than
> idleness.
> 
> We may need to add a third clause before the search, to ensure
> target gets selected if neither target nor i is idle and the
> wakeup is synchronous. Something along the lines of:
> 
>     if (sync_wakeup && cpu_rq(target)->nr_running == 1)
> 	return target;

I recommend you forget that trusting the sync hint ever sprang to mind;
it is often a big fat lie.

-Mike

Mike Galbraith Oct. 3, 2014, 6:23 a.m. UTC | #2
On Thu, 2014-10-02 at 13:15 -0400, Rik van Riel wrote:

> Subject: sched,idle: teach select_idle_sibling about idle states
> 
> Change select_idle_sibling to take cpu idle exit latency into
> account.  First preference is to select the cpu with the lowest
> exit latency from a completely idle sched_group inside the CPU;
> if that is not available, we pick the CPU with the lowest exit
> latency in any sched_group.
> 
> This increases the total search time of select_idle_sibling,
> we may want to look into propagating load info up the sched_group
> tree in some way. That information would also be useful to prevent
> the wake_affine logic from causing a load imbalance between
> sched_groups.

A generic boo hiss aimed in the general direction of all of this let's
go look at every possibility on every wakeup stuff.  Less is more.

-Mike

Peter Zijlstra Oct. 3, 2014, 7:50 a.m. UTC | #3
On Fri, Oct 03, 2014 at 08:23:04AM +0200, Mike Galbraith wrote:
> On Thu, 2014-10-02 at 13:15 -0400, Rik van Riel wrote:
> 
> > Subject: sched,idle: teach select_idle_sibling about idle states
> > 
> > Change select_idle_sibling to take cpu idle exit latency into
> > account.  First preference is to select the cpu with the lowest
> > exit latency from a completely idle sched_group inside the CPU;
> > if that is not available, we pick the CPU with the lowest exit
> > latency in any sched_group.
> > 
> > This increases the total search time of select_idle_sibling,
> > we may want to look into propagating load info up the sched_group
> > tree in some way. That information would also be useful to prevent
> > the wake_affine logic from causing a load imbalance between
> > sched_groups.
> 
> A generic boo hiss aimed in the general direction of all of this let's
> go look at every possibility on every wakeup stuff.  Less is more.

I hear you, can you see actual slowdown with the patch? While the worst
case doesn't change, it does make the average case equal to the worst
case iteration -- where we previously would average out at inspecting
half the CPUs before finding an idle one, we'd now always inspect all of
them in order to compare all idle ones on their properties.
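
To put that difference in concrete terms, here is a flattened toy model.
The helpers below are hypothetical and ignore the sched_group structure
the real loop walks; they only count how many CPUs each strategy looks
at. The first stops at the first idle CPU, the second has to visit every
CPU to find the one with the lowest exit latency:

```c
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

/* Stop at the first idle CPU; *inspected is the number of CPUs looked at. */
static int scan_first_idle(const bool *idle, size_t ncpus, size_t *inspected)
{
	size_t i;

	for (i = 0; i < ncpus; i++) {
		if (idle[i]) {
			*inspected = i + 1;
			return (int)i;
		}
	}
	*inspected = ncpus;
	return -1;
}

/* Compare all idle CPUs on exit latency; always looks at every CPU. */
static int scan_shallowest(const bool *idle, const unsigned int *exit_latency,
			   size_t ncpus, size_t *inspected)
{
	unsigned int min_latency = UINT_MAX;
	int best = -1;
	size_t i;

	for (i = 0; i < ncpus; i++) {
		if (idle[i] && exit_latency[i] < min_latency) {
			min_latency = exit_latency[i];
			best = (int)i;
		}
	}
	*inspected = ncpus;
	return best;
}
```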

Also, with the latest generation of Haswell Xeons having 18 cores (36
threads) this is one massively painful loop for sure.

I'm just not sure what to do about it.. I suppose we can artificially
split it into smaller groups, but I bet that'll hurt some, but if we can
show it gains more we might still be able to do it. The only real
problem is actual numbers/workloads (isn't it always) :/

One thing I suppose we could try is keeping a 'busy' flag at the
llc domain which is set when all CPUs are busy (we'll clear it from
new_idle); that way we can avoid the entire iteration if we know it's
pointless.
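
A minimal sketch of how such a flag could behave, as a single-threaded
user-space model. The names struct llc_domain, find_idle_cpu() and
cpu_enters_idle() are hypothetical, and the sketch assumes the flag gets
set by a scan that comes up empty; a real version would also have to
deal with concurrent updates and the cost of bouncing the flag's cache
line around:

```c
#include <stdbool.h>

/* Hypothetical stand-in for the per-LLC scheduling domain. */
struct llc_domain {
	const bool *cpu_idle;	/* per-CPU idle state, owned by the caller */
	int nr_cpus;
	bool busy;		/* hint: the last full scan found nothing idle */
};

/* Wakeup path: return an idle CPU, or fall back to target. */
static int find_idle_cpu(struct llc_domain *llc, int target)
{
	int i;

	if (llc->busy)
		return target;		/* skip the iteration entirely */

	for (i = 0; i < llc->nr_cpus; i++)
		if (llc->cpu_idle[i])
			return i;

	llc->busy = true;		/* nothing idle: remember that */
	return target;
}

/* Idle entry path: some CPU just went idle, so scanning is useful again. */
static void cpu_enters_idle(struct llc_domain *llc)
{
	llc->busy = false;
}
```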

Hmm...
Mike Galbraith Oct. 3, 2014, 1:05 p.m. UTC | #4
On Fri, 2014-10-03 at 09:50 +0200, Peter Zijlstra wrote: 
> On Fri, Oct 03, 2014 at 08:23:04AM +0200, Mike Galbraith wrote:

> > A generic boo hiss aimed in the general direction of all of this let's
> > go look at every possibility on every wakeup stuff.  Less is more.
> 
> I hear you, can you see actual slowdown with the patch? While the worst
> case doesn't change, it does make the average case equal to the worst
> case iteration -- where we previously would average out at inspecting
> half the CPUs before finding an idle one, we'd now always inspect all of
> them in order to compare all idle ones on their properties.
> 
> Also, with the latest generation of Haswell Xeons having 18 cores (36
> threads) this is one massively painful loop for sure.

Yeah, the things are getting too damn big.  I didn't try the patch and
measure anything, my gut instantly said "nope, not worth it".
  
> I'm just not sure what to do about it.. I suppose we can artificially
> split it into smaller groups, but I bet that'll hurt some, but if we can
> show it gains more we might still be able to do it. The only real
> problem is actual numbers/workloads (isn't it always) :/
> 
> One thing I suppose we could try is keeping a 'busy' flag at the
> llc domain which is set when all CPUs are busy (we'll clear it from
> new_idle); that way we can avoid the entire iteration if we know it's
> pointless.

On one of those huge packages, heck, even on an 8 core, that could save a
substantial number of busy box cycles.

-Mike

Rik van Riel Oct. 3, 2014, 2:28 p.m. UTC | #5
On 10/03/2014 03:50 AM, Peter Zijlstra wrote:
> On Fri, Oct 03, 2014 at 08:23:04AM +0200, Mike Galbraith wrote:
>> On Thu, 2014-10-02 at 13:15 -0400, Rik van Riel wrote:
>> 
>>> Subject: sched,idle: teach select_idle_sibling about idle
>>> states
>>> 
>>> Change select_idle_sibling to take cpu idle exit latency into 
>>> account.  First preference is to select the cpu with the
>>> lowest exit latency from a completely idle sched_group inside
>>> the CPU; if that is not available, we pick the CPU with the
>>> lowest exit latency in any sched_group.
>>> 
>>> This increases the total search time of select_idle_sibling, we
>>> may want to look into propagating load info up the sched_group 
>>> tree in some way. That information would also be useful to
>>> prevent the wake_affine logic from causing a load imbalance
>>> between sched_groups.
>> 
>> A generic boo hiss aimed in the general direction of all of this
>> let's go look at every possibility on every wakeup stuff.  Less
>> is more.
> 
> I hear you, can you see actual slowdown with the patch? While the
> worst case doesn't change, it does make the average case equal to
> the worst case iteration -- where we previously would average out
> at inspecting half the CPUs before finding an idle one, we'd now
> always inspect all of them in order to compare all idle ones on
> their properties.
> 
> Also, with the latest generation of Haswell Xeons having 18 cores
> (36 threads) this is one massively painful loop for sure.

We have 3 different goals when selecting a runqueue for a task:
1) locality: get the task running close to where it has stuff cached
2) work preserving: get the task running ASAP, and preferably on a
   fully idle core
3) idle state latency: place the task on a CPU that can start running
   it ASAP

We may also consider the interplay of the above 3 to have an impact on
4) power use: pack tasks on some CPUs so other CPUs can go into deeper
   idle states

The current implementation is a "compromise" between (1) and (2),
with a strong preference for (2), falling back to (1) if no fully
idle core is found.

My ugly hack isn't any better, trading off (1) in order to be better
at (2) and (3). Whether it even affects (4) remains to be seen.

I know my patch is probably unacceptable, but I do think it is important
that we talk about the problem, and hopefully agree on exactly what the
problem is that we want to solve.

One big question in my mind is, when is locality more important, and
when is work preserving more important?  Do we have an answer to that
question?

The current code has the potential to be quite painful on systems with
a large number of cores per chip, so we will have to change things
anyway...

-- 
All rights reversed
Peter Zijlstra Oct. 3, 2014, 2:46 p.m. UTC | #6
On Fri, Oct 03, 2014 at 10:28:42AM -0400, Rik van Riel wrote:
> We have 3 different goals when selecting a runqueue for a task:
> 1) locality: get the task running close to where it has stuff cached
> 2) work preserving: get the task running ASAP, and preferably on a
>    fully idle core
> 3) idle state latency: place the task on a CPU that can start running
>    it ASAP

3 can also be considered part of power aware, seeing how it will try and
let CPUs reach their deep idle potential.

> We may also consider the interplay of the above 3 to have an impact on
> 4) power use: pack tasks on some CPUs so other CPUs can go into deeper
>    idle states
> 
> The current implementation is a "compromise" between (1) and (2),
> with a strong preference for (2), falling back to (1) if no fully
> idle core is found.
> 
> My ugly hack isn't any better, trading off (1) in order to be better
> at (2) and (3). Whether it even affects (4) remains to be seen.
> 
> I know my patch is probably unacceptable, but I do think it is important
> that we talk about the problem, and hopefully agree on exactly what the
> problem is that we want to solve.

Yeah, we've been through this several times, it basically boils down to
the amount of fail vs win on 'various' workloads. The endless problem is
of course that the fail vs win ratio is entirely workload dependent and
as ever there is no comprehensive set.

The last time this came up was when Mike tried to do his cache buddy
idea, which basically reduced things to only looking at 2 cpus. That
made some things fly and some things tank.

> One big question in my mind is, when is locality more important, and
> when is work preserving more important?  Do we have an answer to that
> question?

Typically 2) is important when there's lots of short running tasks
around, any queueing typically destroys throughput in that case.

> The current code has the potential to be quite painful on systems with
> a large number of cores per chip, so we will have to change things
> anyway...

What I said.. so far we've failed at coming up with anything sane
though, so far we've found that 2 cpus is too small a slice to look at
and we're fairly sure 18/36 is too large :-)
Rik van Riel Oct. 3, 2014, 3:37 p.m. UTC | #7
On 10/03/2014 10:46 AM, Peter Zijlstra wrote:
> On Fri, Oct 03, 2014 at 10:28:42AM -0400, Rik van Riel wrote:

>> The current code has the potential to be quite painful on systems
>> with a large number of cores per chip, so we will have to change
>> things anyway...
> 
> What I said.. so far we've failed at coming up with anything sane 
> though, so far we've found that 2 cpus is too small a slice to look
> at and we're fairly sure 18/36 is too large :-)

Some more brainstorming points...

1) We should probably (lazily/batched?) propagate load information
   up the sched_group tree.  This will be useful for wake_affine,
   load_balancing, find_idlest_cpu, and select_idle_sibling

2) With both find_idlest_cpu and select_idle_sibling walking down
   the tree from the LLC level, they could probably share code

3) Counting both blocked and runnable load may give better long
   term stability of loads, resulting in a reduction in work
   preserving behaviour, but an improvement in locality - this
   could be worthwhile, but it is hard to say in advance

4) We can be pretty sure that CPU makers are not going to stop
   at a mere 18 cores. We need to subdivide things below the LLC
   level, turning select_idle_sibling and find_idlest_cpu into
   a tree walk.

   This means whatever selection criteria are used by these need
   to be propagated up the sched_group tree. This, in turn, means
   we probably need to restrict ourselves to things that do not get
   changed/updated too often.
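
As a sketch of what point 4 could look like, here is a user-space model
of a tree walk driven by a summary that has been propagated upward. The
names struct topo_node and find_idle_cpu_tree(), and the choice of an
idle-CPU count as the propagated criterion, are assumptions made for
illustration only. The walk descends into the first child whose summary
says it contains an idle CPU, and gives up if the summary turns out to
be stale:

```c
#include <stddef.h>

#define MAX_CHILDREN 4

/* Hypothetical node in a subdivided topology below the LLC. */
struct topo_node {
	struct topo_node *child[MAX_CHILDREN];
	size_t nr_children;	/* 0 for a leaf, which stands for one CPU */
	int cpu;		/* valid for leaves only */
	unsigned int nr_idle;	/* idle CPUs in this subtree, propagated up */
};

/* Walk down from 'node'; return an idle CPU or -1 if none is found. */
static int find_idle_cpu_tree(const struct topo_node *node)
{
	while (node && node->nr_idle > 0) {
		const struct topo_node *next = NULL;
		size_t i;

		if (node->nr_children == 0)
			return node->cpu;

		/* Only descend into a subtree that claims to have idle CPUs. */
		for (i = 0; i < node->nr_children; i++) {
			if (node->child[i]->nr_idle > 0) {
				next = node->child[i];
				break;
			}
		}
		node = next;	/* NULL if the summary at this level was stale */
	}
	return -1;
}
```

The cost per wakeup is then bounded by the tree depth times the fan-out
rather than by the number of CPUs, at the price of keeping nr_idle
reasonably up to date.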

Am I overlooking anything?

-- 
All rights reversed
Nicolas Pitre Oct. 3, 2014, 6:52 p.m. UTC | #8
On Fri, 3 Oct 2014, Rik van Riel wrote:

> We have 3 different goals when selecting a runqueue for a task:
> 1) locality: get the task running close to where it has stuff cached
> 2) work preserving: get the task running ASAP, and preferably on a
>    fully idle core
> 3) idle state latency: place the task on a CPU that can start running
>    it ASAP
> 
> We may also consider the interplay of the above 3 to have an impact on
> 4) power use: pack tasks on some CPUs so other CPUs can go into deeper
>    idle states

In my mind the actual choice is between (1) and (2).  Once you decided 
on (2) then obviously you should imply (3) all the time. And by having 
(3) then (4) should be a natural side effect by not selecting idle CPUs 
randomly.

By selecting (1) you already have (4).

The deficient part right now is (3) as a consequence of (2).  Fixing 
(3) should not have to affect (1).

> The current implementation is a "compromise" between (1) and (2),
> with a strong preference for (2), falling back to (1) if no fully
> idle core is found.
> 
> My ugly hack isn't any better, trading off (1) in order to be better
> at (2) and (3). Whether it even affects (4) remains to be seen.

(4) is greatly influenced by (3) on mobile platforms, especially those 
with a cluster topology.  This might not be as significant on server 
type systems, although performance should benefit as well from the 
smaller wake-up latency.

On a mobile system losing 10% performance to save 20% on power usage 
might be an excellent compromise.  Maybe not so on a server system where 
performance is everything.


Nicolas
Peter Zijlstra Oct. 9, 2014, 4:04 p.m. UTC | #9
On Fri, Oct 03, 2014 at 11:37:31AM -0400, Rik van Riel wrote:
> Some more brainstorming points...
> 
> 1) We should probably (lazily/batched?) propagate load information
>    up the sched_group tree.  This will be useful for wake_affine,
>    load_balancing, find_idlest_cpu, and select_idle_sibling
> 
> 2) With both find_idlest_cpu and select_idle_sibling walking down
>    the tree from the LLC level, they could probably share code
> 
> 3) Counting both blocked and runnable load may give better long
>    term stability of loads, resulting in a reduction in work
>    preserving behaviour, but an improvement in locality - this
>    could be worthwhile, but it is hard to say in advance
> 
> 4) We can be pretty sure that CPU makers are not going to stop
>    at a mere 18 cores. We need to subdivide things below the LLC
>    level, turning select_idle_sibling and find_idlest_cpu into
>    a tree walk.
> 
>    This means whatever selection criteria are used by these need
>    to be propagated up the sched_group tree. This, in turn, means
>    we probably need to restrict ourselves to things that do not get
>    changed/updated too often.
> 
> Am I overlooking anything?

Well, we can certainly try something like that; but your last point
seems like a contradiction, seeing how _the_ important point for
select_idle_sibling() is the actual idle state, and that per definition
is something that can change/update often.

But yes, the only viable option is some artificial breakup of the
topology and we can indeed try and bridge the gap with some caching.
--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 10a5a28..12540cd 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4465,41 +4465,76 @@  static int select_idle_sibling(struct task_struct *p, int target)
 {
 	struct sched_domain *sd;
 	struct sched_group *sg;
+	unsigned int min_exit_latency_thread = UINT_MAX;
+	unsigned int min_exit_latency_core = UINT_MAX;
+	int shallowest_idle_thread = -1;
+	int shallowest_idle_core = -1;
 	int i = task_cpu(p);
 
+	/* target always has some code running and is not in an idle state */
 	if (idle_cpu(target))
 		return target;
 
 	/*
 	 * If the prevous cpu is cache affine and idle, don't be stupid.
+	 * XXX: does i's exit latency exceed sysctl_sched_migration_cost?
 	 */
 	if (i != target && cpus_share_cache(i, target) && idle_cpu(i))
 		return i;
 
 	/*
 	 * Otherwise, iterate the domains and find an elegible idle cpu.
+	 * First preference is finding a totally idle core with a thread
+	 * in a shallow idle state; second preference is whatever idle
+	 * thread has the shallowest idle state anywhere.
 	 */
 	sd = rcu_dereference(per_cpu(sd_llc, target));
 	for_each_lower_domain(sd) {
 		sg = sd->groups;
 		do {
+			unsigned int min_sg_exit_latency = UINT_MAX;
+			int shallowest_sg_idle_thread = -1;
+			bool all_idle = true;
+
 			if (!cpumask_intersects(sched_group_cpus(sg),
 						tsk_cpus_allowed(p)))
 				goto next;
 
 			for_each_cpu(i, sched_group_cpus(sg)) {
-				if (i == target || !idle_cpu(i))
-					goto next;
+				struct rq *rq;
+				struct cpuidle_state *idle;
+
+				if (i == target || !idle_cpu(i)) {
+					all_idle = false;
+					continue;
+				}
+
+				rq = cpu_rq(i);
+				idle = idle_get_state(rq);
+
+				if (idle && idle->exit_latency < min_sg_exit_latency) {
+					min_sg_exit_latency = idle->exit_latency;
+					shallowest_sg_idle_thread = i;
+				}
+			}
+
+			if (all_idle && min_sg_exit_latency < min_exit_latency_core) {
+				shallowest_idle_core = shallowest_sg_idle_thread;
+				min_exit_latency_core = min_sg_exit_latency;
+			} else if (min_sg_exit_latency < min_exit_latency_thread) {
+				shallowest_idle_thread = shallowest_sg_idle_thread;
+				min_exit_latency_thread = min_sg_exit_latency;
 			}
 
-			target = cpumask_first_and(sched_group_cpus(sg),
-					tsk_cpus_allowed(p));
-			goto done;
 next:
 			sg = sg->next;
 		} while (sg != sd->groups);
 	}
-done:
+	if (shallowest_idle_core >= 0)
+		target = shallowest_idle_core;
+	else if (shallowest_idle_thread >= 0)
+		target = shallowest_idle_thread;
+
 	return target;
 }