Message ID | 20141002131548.6cd377d5@cuia.bos.redhat.com (mailing list archive) |
---|---|
State | Not Applicable, archived |
On Thu, 2014-10-02 at 13:15 -0400, Rik van Riel wrote:
> This patch is ugly. I have not bothered cleaning it up, because it
> causes a regression with hackbench. Apparently for hackbench (and
> potentially other sync wakeups), locality is more important than
> idleness.
>
> We may need to add a third clause before the search, something
> along the lines of, to ensure target gets selected if neither
> target or i are idle and the wakeup is synchronous...
>
> if (sync_wakeup && cpu_of(target)->nr_running == 1)
> 	return target;

I recommend you forget that trusting sync hint ever sprang to mind, it
is often a big fat lie.

-Mike

--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

On Thu, 2014-10-02 at 13:15 -0400, Rik van Riel wrote:
> Subject: sched,idle: teach select_idle_sibling about idle states
>
> Change select_idle_sibling to take cpu idle exit latency into
> account. First preference is to select the cpu with the lowest
> exit latency from a completely idle sched_group inside the CPU;
> if that is not available, we pick the CPU with the lowest exit
> latency in any sched_group.
>
> This increases the total search time of select_idle_sibling,
> we may want to look into propagating load info up the sched_group
> tree in some way. That information would also be useful to prevent
> the wake_affine logic from causing a load imbalance between
> sched_groups.

A generic boo hiss aimed in the general direction of all of this let's
go look at every possibility on every wakeup stuff. Less is more.

-Mike

On Fri, Oct 03, 2014 at 08:23:04AM +0200, Mike Galbraith wrote:
> On Thu, 2014-10-02 at 13:15 -0400, Rik van Riel wrote:
>
> > Subject: sched,idle: teach select_idle_sibling about idle states
> >
> > Change select_idle_sibling to take cpu idle exit latency into
> > account. First preference is to select the cpu with the lowest
> > exit latency from a completely idle sched_group inside the CPU;
> > if that is not available, we pick the CPU with the lowest exit
> > latency in any sched_group.
> >
> > This increases the total search time of select_idle_sibling,
> > we may want to look into propagating load info up the sched_group
> > tree in some way. That information would also be useful to prevent
> > the wake_affine logic from causing a load imbalance between
> > sched_groups.
>
> A generic boo hiss aimed in the general direction of all of this let's
> go look at every possibility on every wakeup stuff. Less is more.

I hear you, can you see actual slowdown with the patch? While the worst
case doesn't change, it does make the average case equal to the worst
case iteration -- where we previously would average out at inspecting
half the CPUs before finding an idle one, we'd now always inspect all of
them in order to compare all idle ones on their properties.

Also, with the latest generation of Haswell Xeons having 18 cores (36
threads) this is one massively painful loop for sure.

I'm just not sure what to do about it.. I suppose we can artificially
split it into smaller groups, but I bet that'll hurt some, but if we can
show it gains more we might still be able to do it. The only real
problem is actual numbers/workloads (isn't it always) :/

One thing I suppose we could try is keeping a 'busy' flag at the
llc domain which is set when all CPUs are busy (we'll clear it from
new_idle) that way we can avoid the entire iteration if we know it's
pointless.

Hmm...

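The 'busy' flag suggested above can be modelled in userspace roughly as follows. This is a minimal sketch under stated assumptions, not kernel code: `struct llc_model`, `llc_set_idle()` and `llc_select_idle()` are hypothetical names, and the flag is derived from a plain idle counter, which a real implementation would likely avoid because a shared counter updated on every idle transition is itself a contended cache line.

```c
#include <assert.h>
#include <stdbool.h>

#define LLC_MAX_CPUS 64

/* Hypothetical userspace stand-in for one last-level-cache domain. */
struct llc_model {
	int nr_cpus;
	bool cpu_idle[LLC_MAX_CPUS];
	int nr_idle;		/* idle CPUs, updated on every transition */
	bool all_busy;		/* the 'busy' flag: true when no CPU is idle */
	unsigned long scans;	/* per-CPU inspections done so far */
};

static void llc_init(struct llc_model *llc, int nr_cpus)
{
	assert(nr_cpus > 0 && nr_cpus <= LLC_MAX_CPUS);
	llc->nr_cpus = nr_cpus;
	llc->nr_idle = nr_cpus;
	llc->all_busy = false;
	llc->scans = 0;
	for (int i = 0; i < nr_cpus; i++)
		llc->cpu_idle[i] = true;
}

/* Record an idle/busy transition and keep the flag in sync. */
static void llc_set_idle(struct llc_model *llc, int cpu, bool idle)
{
	assert(cpu >= 0 && cpu < llc->nr_cpus);
	if (llc->cpu_idle[cpu] == idle)
		return;
	llc->cpu_idle[cpu] = idle;
	llc->nr_idle += idle ? 1 : -1;
	llc->all_busy = (llc->nr_idle == 0);
}

/* Return an idle CPU other than target, or target if there is none. */
static int llc_select_idle(struct llc_model *llc, int target)
{
	if (llc->all_busy)
		return target;	/* skip the whole iteration */

	for (int i = 0; i < llc->nr_cpus; i++) {
		llc->scans++;
		if (i != target && llc->cpu_idle[i])
			return i;
	}
	return target;
}
```

With every CPU marked busy, `llc_select_idle()` returns `target` without touching `cpu_idle[]` at all, which is the saving being discussed; once any CPU goes idle the flag clears and the scan runs as before.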
On Fri, 2014-10-03 at 09:50 +0200, Peter Zijlstra wrote:
> On Fri, Oct 03, 2014 at 08:23:04AM +0200, Mike Galbraith wrote:

> > A generic boo hiss aimed in the general direction of all of this let's
> > go look at every possibility on every wakeup stuff. Less is more.
>
> I hear you, can you see actual slowdown with the patch? While the worst
> case doesn't change, it does make the average case equal to the worst
> case iteration -- where we previously would average out at inspecting
> half the CPUs before finding an idle one, we'd now always inspect all of
> them in order to compare all idle ones on their properties.
>
> Also, with the latest generation of Haswell Xeons having 18 cores (36
> threads) this is one massively painful loop for sure.

Yeah, the things are getting too damn big. I didn't try the patch and
measure anything, my gut instantly said "nope, not worth it".

> I'm just not sure what to do about it.. I suppose we can artificially
> split it into smaller groups, but I bet that'll hurt some, but if we can
> show it gains more we might still be able to do it. The only real
> problem is actual numbers/workloads (isn't it always) :/
>
> One thing I suppose we could try is keeping a 'busy' flag at the
> llc domain which is set when all CPUs are busy (we'll clear it from
> new_idle) that way we can avoid the entire iteration if we know it's
> pointless.

On one of those huge packages, heck, even on an 8 core that could save a
substantial number of busy box cycles.

-Mike

On 10/03/2014 03:50 AM, Peter Zijlstra wrote:
> On Fri, Oct 03, 2014 at 08:23:04AM +0200, Mike Galbraith wrote:
>> On Thu, 2014-10-02 at 13:15 -0400, Rik van Riel wrote:
>>
>>> Subject: sched,idle: teach select_idle_sibling about idle
>>> states
>>>
>>> Change select_idle_sibling to take cpu idle exit latency into
>>> account. First preference is to select the cpu with the
>>> lowest exit latency from a completely idle sched_group inside
>>> the CPU; if that is not available, we pick the CPU with the
>>> lowest exit latency in any sched_group.
>>>
>>> This increases the total search time of select_idle_sibling, we
>>> may want to look into propagating load info up the sched_group
>>> tree in some way. That information would also be useful to
>>> prevent the wake_affine logic from causing a load imbalance
>>> between sched_groups.
>>
>> A generic boo hiss aimed in the general direction of all of this
>> let's go look at every possibility on every wakeup stuff. Less
>> is more.
>
> I hear you, can you see actual slowdown with the patch? While the
> worst case doesn't change, it does make the average case equal to
> the worst case iteration -- where we previously would average out
> at inspecting half the CPUs before finding an idle one, we'd now
> always inspect all of them in order to compare all idle ones on
> their properties.
>
> Also, with the latest generation of Haswell Xeons having 18 cores
> (36 threads) this is one massively painful loop for sure.

We have 3 different goals when selecting a runqueue for a task:
1) locality: get the task running close to where it has stuff cached
2) work preserving: get the task running ASAP, and preferably on a
   fully idle core
3) idle state latency: place the task on a CPU that can start running
   it ASAP

We may also consider the interplay of the above 3 to have an impact on
4) power use: pack tasks on some CPUs so other CPUs can go into deeper
   idle states

The current implementation is a "compromise" between (1) and (2),
with a strong preference for (2), falling back to (1) if no fully
idle core is found.

My ugly hack isn't any better, trading off (1) in order to be better
at (2) and (3). Whether it even affects (4) remains to be seen.

I know my patch is probably unacceptable, but I do think it is important
that we talk about the problem, and hopefully agree on exactly what the
problem is that we want to solve.

One big question in my mind is, when is locality more important, and
when is work preserving more important? Do we have an answer to that
question?

The current code has the potential to be quite painful on systems with
a large number of cores per chip, so we will have to change things
anyway...

--
All rights reversed

On Fri, Oct 03, 2014 at 10:28:42AM -0400, Rik van Riel wrote:
> We have 3 different goals when selecting a runqueue for a task:
> 1) locality: get the task running close to where it has stuff cached
> 2) work preserving: get the task running ASAP, and preferably on a
>    fully idle core
> 3) idle state latency: place the task on a CPU that can start running
>    it ASAP

3 can also be considered part of power aware, seeing how it will try and
let CPUs reach their deep idle potential.

> We may also consider the interplay of the above 3 to have an impact on
> 4) power use: pack tasks on some CPUs so other CPUs can go into deeper
>    idle states
>
> The current implementation is a "compromise" between (1) and (2),
> with a strong preference for (2), falling back to (1) if no fully
> idle core is found.
>
> My ugly hack isn't any better, trading off (1) in order to be better
> at (2) and (3). Whether it even affects (4) remains to be seen.
>
> I know my patch is probably unacceptable, but I do think it is important
> that we talk about the problem, and hopefully agree on exactly what the
> problem is that we want to solve.

Yeah, we've been through this several times, it basically boils down to
the amount of fail vs win on 'various' workloads. The endless problem is
of course that the fail vs win ratio is entirely workload dependent and
as ever there is no comprehensive set.

The last time this came up was when Mike tried to do his cache buddy
idea, which basically reduced things to only looking at 2 cpus. That
makes some things fly and some things tank.

> One big question in my mind is, when is locality more important, and
> when is work preserving more important? Do we have an answer to that
> question?

Typically 2) is important when there's lots of short running tasks
around, any queueing typically destroys throughput in that case.

> The current code has the potential to be quite painful on systems with
> a large number of cores per chip, so we will have to change things
> anyway...

What I said.. so far we've failed at coming up with anything sane
though, so far we've found that 2 cpus is too small a slice to look at
and we're fairly sure 18/36 is too large :-)

On 10/03/2014 10:46 AM, Peter Zijlstra wrote:
> On Fri, Oct 03, 2014 at 10:28:42AM -0400, Rik van Riel wrote:

>> The current code has the potential to be quite painful on systems
>> with a large number of cores per chip, so we will have to change
>> things anyway...
>
> What I said.. so far we've failed at coming up with anything sane
> though, so far we've found that 2 cpus is too small a slice to look
> at and we're fairly sure 18/36 is too large :-)

Some more brainstorming points...

1) We should probably (lazily/batched?) propagate load information
   up the sched_group tree. This will be useful for wake_affine,
   load_balancing, find_idlest_cpu, and select_idle_sibling

2) With both find_idlest_cpu and select_idle_sibling walking down
   the tree from the LLC level, they could probably share code

3) Counting both blocked and runnable load may give better long
   term stability of loads, resulting in a reduction in work
   preserving behaviour, but an improvement in locality - this
   could be worthwhile, but it is hard to say in advance

4) We can be pretty sure that CPU makers are not going to stop
   at a mere 18 cores. We need to subdivide things below the LLC
   level, turning select_idle_sibling and find_idlest_cpu into
   a tree walk.

This means whatever selection criteria are used by these need
to be propagated up the sched_group tree. This, in turn, means
we probably need to restrict ourselves to things that do not get
changed/updated too often.

Am I overlooking anything?

--
All rights reversed

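Points 1) and 4) above, propagating a selection criterion up the group tree and then walking down it, could look something like the following userspace sketch. Everything in it is an assumption made for illustration: `struct group_node` and its helpers are hypothetical, the only propagated criterion is a count of idle CPUs, and updates are applied eagerly on each transition rather than lazily or in batches as the email suggests.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_CHILDREN 4

/* Hypothetical node in a group tree; leaves represent single CPUs. */
struct group_node {
	struct group_node *parent;
	struct group_node *child[MAX_CHILDREN];
	int nr_children;	/* 0 for a leaf */
	int cpu;		/* CPU number for a leaf, -1 otherwise */
	int nr_idle;		/* cached count of idle CPUs at or below */
};

static void node_init(struct group_node *n, int cpu)
{
	n->parent = NULL;
	n->nr_children = 0;
	n->cpu = cpu;		/* pass -1 for an interior group */
	n->nr_idle = 0;		/* every CPU starts out busy */
}

/* Attach child before any idle transitions are recorded below it. */
static void node_add_child(struct group_node *parent, struct group_node *child)
{
	assert(parent->nr_children < MAX_CHILDREN);
	assert(child->nr_idle == 0);
	parent->child[parent->nr_children++] = child;
	child->parent = parent;
}

/* Propagate a leaf's idle transition to the root: O(tree depth). */
static void leaf_set_idle(struct group_node *leaf, int idle)
{
	int delta;

	assert(leaf->nr_children == 0);
	delta = (idle ? 1 : 0) - leaf->nr_idle;
	for (struct group_node *n = leaf; n && delta; n = n->parent)
		n->nr_idle += delta;
}

/* Walk down from root, following the first child that has an idle CPU. */
static int find_idle_cpu(const struct group_node *root)
{
	const struct group_node *n = root;

	if (n->nr_idle == 0)
		return -1;	/* nothing idle anywhere below */

	while (n->nr_children > 0) {
		const struct group_node *next = NULL;

		for (int i = 0; i < n->nr_children; i++) {
			if (n->child[i]->nr_idle > 0) {
				next = n->child[i];
				break;
			}
		}
		if (!next)
			return -1;	/* counts out of sync */
		n = next;
	}
	return n->cpu;
}
```

The lookup cost becomes proportional to tree depth times fan-out instead of the number of CPUs, at the price of an upward walk on every idle transition. How often that upward walk runs is the cost being weighed when the email asks for criteria that do not change too often.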
On Fri, 3 Oct 2014, Rik van Riel wrote:

> We have 3 different goals when selecting a runqueue for a task:
> 1) locality: get the task running close to where it has stuff cached
> 2) work preserving: get the task running ASAP, and preferably on a
>    fully idle core
> 3) idle state latency: place the task on a CPU that can start running
>    it ASAP
>
> We may also consider the interplay of the above 3 to have an impact on
> 4) power use: pack tasks on some CPUs so other CPUs can go into deeper
>    idle states

In my mind the actual choice is between (1) and (2). Once you decided
on (2) then obviously you should imply (3) all the time. And by having
(3) then (4) should be a natural side effect by not selecting idle CPUs
randomly.

By selecting (1) you already have (4).

The deficient part right now is (3) as a consequence of (2). Fixing
(3) should not have to affect (1).

> The current implementation is a "compromise" between (1) and (2),
> with a strong preference for (2), falling back to (1) if no fully
> idle core is found.
>
> My ugly hack isn't any better, trading off (1) in order to be better
> at (2) and (3). Whether it even affects (4) remains to be seen.

(4) is greatly influenced by (3) on mobile platforms, especially those
with a cluster topology. This might not be as significant on server
type systems, although performance should benefit as well from the
smaller wake-up latency.

On a mobile system losing 10% performance to save 20% on power usage
might be an excellent compromise. Maybe not so on a server system where
performance is everything.

Nicolas

On Fri, Oct 03, 2014 at 11:37:31AM -0400, Rik van Riel wrote:
> Some more brainstorming points...
>
> 1) We should probably (lazily/batched?) propagate load information
>    up the sched_group tree. This will be useful for wake_affine,
>    load_balancing, find_idlest_cpu, and select_idle_sibling
>
> 2) With both find_idlest_cpu and select_idle_sibling walking down
>    the tree from the LLC level, they could probably share code
>
> 3) Counting both blocked and runnable load may give better long
>    term stability of loads, resulting in a reduction in work
>    preserving behaviour, but an improvement in locality - this
>    could be worthwhile, but it is hard to say in advance
>
> 4) We can be pretty sure that CPU makers are not going to stop
>    at a mere 18 cores. We need to subdivide things below the LLC
>    level, turning select_idle_sibling and find_idlest_cpu into
>    a tree walk.
>
> This means whatever selection criteria are used by these need
> to be propagated up the sched_group tree. This, in turn, means
> we probably need to restrict ourselves to things that do not get
> changed/updated too often.
>
> Am I overlooking anything?

Well, we can certainly try something like that; but your last point
seems like a contradiction; seeing how _the_ important point for
select_idle_sibling() is the actual idle state, and that per definition
is something that can change/update often.

But yes, the only viable option is some artificial breakup of the
topology and we can indeed try and bridge the gap with some caching.

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 10a5a28..12540cd 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4465,41 +4465,76 @@ static int select_idle_sibling(struct task_struct *p, int target)
 {
 	struct sched_domain *sd;
 	struct sched_group *sg;
+	unsigned int min_exit_latency_thread = UINT_MAX;
+	unsigned int min_exit_latency_core = UINT_MAX;
+	int shallowest_idle_thread = -1;
+	int shallowest_idle_core = -1;
 	int i = task_cpu(p);

+	/* target always has some code running and is not in an idle state */
 	if (idle_cpu(target))
 		return target;

 	/*
 	 * If the prevous cpu is cache affine and idle, don't be stupid.
+	 * XXX: does i's exit latency exceed sysctl_sched_migration_cost?
 	 */
 	if (i != target && cpus_share_cache(i, target) && idle_cpu(i))
 		return i;

 	/*
 	 * Otherwise, iterate the domains and find an elegible idle cpu.
+	 * First preference is finding a totally idle core with a thread
+	 * in a shallow idle state; second preference is whatever idle
+	 * thread has the shallowest idle state anywhere.
 	 */
 	sd = rcu_dereference(per_cpu(sd_llc, target));
 	for_each_lower_domain(sd) {
 		sg = sd->groups;
 		do {
+			unsigned int min_sg_exit_latency = UINT_MAX;
+			int shallowest_sg_idle_thread = -1;
+			bool all_idle = true;
+
 			if (!cpumask_intersects(sched_group_cpus(sg),
 						tsk_cpus_allowed(p)))
 				goto next;

 			for_each_cpu(i, sched_group_cpus(sg)) {
-				if (i == target || !idle_cpu(i))
-					goto next;
+				struct rq *rq;
+				struct cpuidle_state *idle;
+
+				if (i == target || !idle_cpu(i)) {
+					all_idle = false;
+					continue;
+				}
+
+				rq = cpu_rq(i);
+				idle = idle_get_state(rq);
+
+				if (idle && idle->exit_latency < min_sg_exit_latency) {
+					min_sg_exit_latency = idle->exit_latency;
+					shallowest_sg_idle_thread = i;
+				}
+			}
+
+			if (all_idle && min_sg_exit_latency < min_exit_latency_core) {
+				shallowest_idle_core = shallowest_sg_idle_thread;
+				min_exit_latency_core = min_sg_exit_latency;
+			} else if (min_sg_exit_latency < min_exit_latency_thread) {
+				shallowest_idle_thread = shallowest_sg_idle_thread;
+				min_exit_latency_thread = min_sg_exit_latency;
 			}

-			target = cpumask_first_and(sched_group_cpus(sg),
-					tsk_cpus_allowed(p));
-			goto done;
 next:
 			sg = sg->next;
 		} while (sg != sd->groups);
 	}
-done:
+	if (shallowest_idle_core >= 0)
+		target = shallowest_idle_core;
+	else if (shallowest_idle_thread >= 0)
+		target = shallowest_idle_thread;
+
 	return target;
 }
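
The two-level preference in the patch above (a fully idle core first, otherwise the shallowest idle thread anywhere) can be tried out in userspace with a small model. This is a sketch under assumptions, not the kernel function: `struct cpu_model` and `pick_shallowest_idle()` are hypothetical, cores are a flat array with a fixed `THREADS_PER_CORE`, every idle CPU is assumed to report an exit latency (the patch skips CPUs for which `idle_get_state()` returns NULL), and the early returns and affinity-mask check are left out.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

#define THREADS_PER_CORE 2

/* Hypothetical per-CPU state for the model. */
struct cpu_model {
	bool idle;
	unsigned int exit_latency;	/* only meaningful when idle */
};

/*
 * cpus[] holds nr_cores * THREADS_PER_CORE entries, siblings adjacent.
 * Returns the chosen CPU, or target when no other CPU is idle.
 */
static int pick_shallowest_idle(const struct cpu_model *cpus, int nr_cores,
				int target)
{
	unsigned int min_core = UINT_MAX, min_thread = UINT_MAX;
	int best_core = -1, best_thread = -1;

	assert(nr_cores > 0);

	for (int core = 0; core < nr_cores; core++) {
		unsigned int min_sg = UINT_MAX;
		int best_sg = -1;
		bool all_idle = true;

		for (int t = 0; t < THREADS_PER_CORE; t++) {
			int i = core * THREADS_PER_CORE + t;

			if (i == target || !cpus[i].idle) {
				all_idle = false;
				continue;
			}
			if (cpus[i].exit_latency < min_sg) {
				min_sg = cpus[i].exit_latency;
				best_sg = i;
			}
		}

		/* Same branch structure as the patch. */
		if (all_idle && min_sg < min_core) {
			best_core = best_sg;
			min_core = min_sg;
		} else if (min_sg < min_thread) {
			best_thread = best_sg;
			min_thread = min_sg;
		}
	}

	if (best_core >= 0)
		return best_core;
	if (best_thread >= 0)
		return best_thread;
	return target;
}
```

Unlike the code being replaced, which stops at the first fully idle group, the loop always visits every CPU; that is the average-case cost raised earlier in the thread.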