
[RFC,v3,5/6] sched: pack the idle load balance

Message ID 1363955155-18382-6-git-send-email-vincent.guittot@linaro.org (mailing list archive)
State New, archived

Commit Message

Vincent Guittot March 22, 2013, 12:25 p.m. UTC
Look for an idle CPU close to the pack buddy CPU whenever possible.
The goal is to prevent the wake up of a CPU which doesn't share the power
domain of the pack buddy CPU.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Reviewed-by: Morten Rasmussen <morten.rasmussen@arm.com>
---
 kernel/sched/fair.c |   18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

Comments

Peter Zijlstra March 26, 2013, 12:52 p.m. UTC | #1
On Fri, 2013-03-22 at 13:25 +0100, Vincent Guittot wrote:
> Look for an idle CPU close to the pack buddy CPU whenever possible.
> The goal is to prevent the wake up of a CPU which doesn't share the
> power
> domain of the pack buddy CPU.
> 
> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> Reviewed-by: Morten Rasmussen <morten.rasmussen@arm.com>
> ---
>  kernel/sched/fair.c |   18 ++++++++++++++++++
>  1 file changed, 18 insertions(+)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index b636199..52a7736 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5455,7 +5455,25 @@ static struct {
>  
>  static inline int find_new_ilb(int call_cpu)
>  {
> +       struct sched_domain *sd;
>         int ilb = cpumask_first(nohz.idle_cpus_mask);
> +       int buddy = per_cpu(sd_pack_buddy, call_cpu);
> +
> +       /*
> +        * If we have a pack buddy CPU, we try to run load balance on
> a CPU
> +        * that is close to the buddy.
> +        */
> +       if (buddy != -1)
> +               for_each_domain(buddy, sd) {
> +                       if (sd->flags & SD_SHARE_CPUPOWER)
> +                               continue;
> +
> +                       ilb = cpumask_first_and(sched_domain_span(sd),
> +                                       nohz.idle_cpus_mask);
> +
> +                       if (ilb < nr_cpu_ids)
> +                               break;
> +               }

/me hands you a fresh bucket of curlies, no reason to skimp on them.

But ha! here's your NO_HZ link.. but does the above DTRT and ensure
that the ILB is a little core when possible?
Vincent Guittot March 26, 2013, 2:03 p.m. UTC | #2
On 26 March 2013 13:52, Peter Zijlstra <peterz@infradead.org> wrote:
> On Fri, 2013-03-22 at 13:25 +0100, Vincent Guittot wrote:
>> Look for an idle CPU close to the pack buddy CPU whenever possible.
>> The goal is to prevent the wake up of a CPU which doesn't share the
>> power
>> domain of the pack buddy CPU.
>>
>> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
>> Reviewed-by: Morten Rasmussen <morten.rasmussen@arm.com>
>> ---
>>  kernel/sched/fair.c |   18 ++++++++++++++++++
>>  1 file changed, 18 insertions(+)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index b636199..52a7736 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -5455,7 +5455,25 @@ static struct {
>>
>>  static inline int find_new_ilb(int call_cpu)
>>  {
>> +       struct sched_domain *sd;
>>         int ilb = cpumask_first(nohz.idle_cpus_mask);
>> +       int buddy = per_cpu(sd_pack_buddy, call_cpu);
>> +
>> +       /*
>> +        * If we have a pack buddy CPU, we try to run load balance on
>> a CPU
>> +        * that is close to the buddy.
>> +        */
>> +       if (buddy != -1)
>> +               for_each_domain(buddy, sd) {
>> +                       if (sd->flags & SD_SHARE_CPUPOWER)
>> +                               continue;
>> +
>> +                       ilb = cpumask_first_and(sched_domain_span(sd),
>> +                                       nohz.idle_cpus_mask);
>> +
>> +                       if (ilb < nr_cpu_ids)
>> +                               break;
>> +               }
>
> /me hands you a fresh bucket of curlies, no reason to skimp on them.

ok
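
Something like this, I suppose (untested, just adding the braces):

	if (buddy != -1) {
		for_each_domain(buddy, sd) {
			if (sd->flags & SD_SHARE_CPUPOWER)
				continue;

			/* idle CPU as close as possible to the buddy */
			ilb = cpumask_first_and(sched_domain_span(sd),
					nohz.idle_cpus_mask);

			if (ilb < nr_cpu_ids)
				break;
		}
	}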

>
> But ha! here's your NO_HZ link.. but does the above DTRT and ensure
> that the ILB is a little core when possible?

The loop looks for an idle CPU as close as possible to the buddy CPU
and the buddy CPU is the first CPU that has been chosen. So if your buddy is
a little and there is an idle little, the ILB will be this idle
little.

>
>
Peter Zijlstra March 26, 2013, 2:42 p.m. UTC | #3
On Tue, 2013-03-26 at 15:03 +0100, Vincent Guittot wrote:
> > But ha! here's your NO_HZ link.. but does the above DTRT and ensure
> > that the ILB is a little core when possible?
> 
> The loop looks for an idle CPU as close as possible to the buddy CPU
> and the buddy CPU is the first CPU that has been chosen. So if your buddy is
> a little and there is an idle little, the ILB will be this idle
> little.

Earlier you wrote:

>       | Cluster 0   | Cluster 1   |
>       | CPU0 | CPU1 | CPU2 | CPU3 |
> -----------------------------------
> buddy | CPU0 | CPU0 | CPU0 | CPU2 |

So extrapolating that to a 4+4 big-little you'd get something like:

      |   little  A9  ||   big A15     |
      | 0 | 1 | 2 | 3 || 4 | 5 | 6 | 7 |
------+---+---+---+---++---+---+---+---+
buddy | 0 | 0 | 0 | 0 || 0 | 4 | 4 | 4 |

Right?

So supposing the current ILB is 6, we'll only check 4, not 0-3, even
though there might be a perfectly idle cpu in there.

Also, your scheme fails to pack when cpus 0,4 are filled, even when
there's idle cores around.

If we'd use the ILB as packing cpu, we would simply select a next pack
target once the old one fills up.
Vincent Guittot March 26, 2013, 3:55 p.m. UTC | #4
On 26 March 2013 15:42, Peter Zijlstra <peterz@infradead.org> wrote:
> On Tue, 2013-03-26 at 15:03 +0100, Vincent Guittot wrote:
>> > But ha! here's your NO_HZ link.. but does the above DTRT and ensure
>> > that the ILB is a little core when possible?
>>
>> The loop looks for an idle CPU as close as possible to the buddy CPU
>> and the buddy CPU is the first CPU that has been chosen. So if your buddy is
>> a little and there is an idle little, the ILB will be this idle
>> little.
>
> Earlier you wrote:
>
>>       | Cluster 0   | Cluster 1   |
>>       | CPU0 | CPU1 | CPU2 | CPU3 |
>> -----------------------------------
>> buddy | CPU0 | CPU0 | CPU0 | CPU2 |
>
> So extrapolating that to a 4+4 big-little you'd get something like:
>
>       |   little  A9  ||   big A15     |
>       | 0 | 1 | 2 | 3 || 4 | 5 | 6 | 7 |
> ------+---+---+---+---++---+---+---+---+
> buddy | 0 | 0 | 0 | 0 || 0 | 4 | 4 | 4 |
>
> Right?

yes

>
> So supposing the current ILB is 6, we'll only check 4, not 0-3, even
> though there might be a perfectly idle cpu in there.

We will check 4,5,7 at MC level in order to pack in the group of A15
(because they are not sharing the same power domain). If none of them
are idle, we will look at CPU level and will check CPUs 0-3.

>
> Also, your scheme fails to pack when cpus 0,4 are filled, even when
> there's idle cores around.

The primary target is to pack the tasks only when the system is not
busy, so you get a power improvement without a performance decrease. The
is_light_task function returns false and the is_buddy_busy function
returns true before the buddy is fully loaded, and the scheduler then
falls back to the default behavior, which spreads tasks and races to
idle.

We can extend the buddy CPU and the packing mechanism to fill one CPU
before filling another buddy, but that is not always the best choice for
performance and/or power, and thus it would require a knob to select
this full packing mode.

>
> If we'd use the ILB as packing cpu, we would simply select a next pack
> target once the old one fills up.

Yes, we will be able to pack the long-running tasks, and the wake-up
path will take care of the short tasks.

>
alex.shi March 27, 2013, 4:56 a.m. UTC | #5
On 03/26/2013 11:55 PM, Vincent Guittot wrote:
>> > So extrapolating that to a 4+4 big-little you'd get something like:
>> >
>> >       |   little  A9  ||   big A15     |
>> >       | 0 | 1 | 2 | 3 || 4 | 5 | 6 | 7 |
>> > ------+---+---+---+---++---+---+---+---+
>> > buddy | 0 | 0 | 0 | 0 || 0 | 4 | 4 | 4 |
>> >
>> > Right?
> yes
> 
>> >
>> > So supposing the current ILB is 6, we'll only check 4, not 0-3, even
>> > though there might be a perfectly idle cpu in there.
> We will check 4,5,7 at MC level in order to pack in the group of A15
> (because they are not sharing the same power domain). If none of them
> are idle, we will look at CPU level and will check CPUs 0-3.

So you increase a fixed step here.
> 
>> >
>> > Also, your scheme fails to pack when cpus 0,4 are filled, even when
>> > there's idle cores around.
> The primary target is to pack the tasks only when we are in a not busy
> system so you will have a power improvement without performance
> decrease. is_light_task function returns false and  is_buddy_busy
> function true before the buddy is fully loaded and the scheduler will
> fall back into the default behavior which spreads tasks and races to
> idle.
> 
> We can extend the buddy CPU and the packing mechanism to fill one CPU
> before filling another buddy but it's not always the best choice for
> performance and/or power and thus it will imply to have a knob to
> select this full packing mode.

Using just one buddy to pack tasks for a whole level of CPUs definitely
has a scalability problem. That is not good for power saving in most
scenarios.
Vincent Guittot March 27, 2013, 8:05 a.m. UTC | #6
On 27 March 2013 05:56, Alex Shi <alex.shi@intel.com> wrote:
> On 03/26/2013 11:55 PM, Vincent Guittot wrote:
>>> > So extrapolating that to a 4+4 big-little you'd get something like:
>>> >
>>> >       |   little  A9  ||   big A15     |
>>> >       | 0 | 1 | 2 | 3 || 4 | 5 | 6 | 7 |
>>> > ------+---+---+---+---++---+---+---+---+
>>> > buddy | 0 | 0 | 0 | 0 || 0 | 4 | 4 | 4 |
>>> >
>>> > Right?
>> yes
>>
>>> >
>>> > So supposing the current ILB is 6, we'll only check 4, not 0-3, even
>>> > though there might be a perfectly idle cpu in there.
>> We will check 4,5,7 at MC level in order to pack in the group of A15
>> (because they are not sharing the same power domain). If none of them
>> are idle, we will look at CPU level and will check CPUs 0-3.
>
> So you increase a fixed step here.

I have modified the find_new_ilb function to look for the best idle
CPU instead of just picking the first CPU of idle_cpus_mask.
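
Roughly along these lines (a sketch of the direction, not the exact code
I will post):

static inline int find_new_ilb(int call_cpu)
{
	struct sched_domain *sd;
	int ilb = nr_cpu_ids;
	int buddy = per_cpu(sd_pack_buddy, call_cpu);

	/* Walk up the buddy's domain levels and take the closest idle CPU. */
	if (buddy != -1) {
		for_each_domain(buddy, sd) {
			if (sd->flags & SD_SHARE_CPUPOWER)
				continue;

			ilb = cpumask_first_and(sched_domain_span(sd),
					nohz.idle_cpus_mask);
			if (ilb < nr_cpu_ids)
				break;
		}
	}

	/* Only fall back to the first idle CPU if nothing close was found. */
	if (ilb >= nr_cpu_ids)
		ilb = cpumask_first(nohz.idle_cpus_mask);

	if (ilb < nr_cpu_ids && idle_cpu(ilb))
		return ilb;

	return nr_cpu_ids;
}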

>>
>>> >
>>> > Also, your scheme fails to pack when cpus 0,4 are filled, even when
>>> > there's idle cores around.
>> The primary target is to pack the tasks only when we are in a not busy
>> system so you will have a power improvement without performance
>> decrease. is_light_task function returns false and  is_buddy_busy
>> function true before the buddy is fully loaded and the scheduler will
>> fall back into the default behavior which spreads tasks and races to
>> idle.
>>
>> We can extend the buddy CPU and the packing mechanism to fill one CPU
>> before filling another buddy but it's not always the best choice for
>> performance and/or power and thus it will imply to have a knob to
>> select this full packing mode.
>
> Just one buddy to pack tasks for whole level cpus definitely has
> scalability problem. That is not good for powersaving in most of scenarios.
>

This patch doesn't aim to pack all kinds of tasks in all scenarios, but
only the small tasks that run for less than 10 ms, and only when the CPU
is not already too busy with other tasks, so you don't have to cope with
long wake-up latency or a performance regression, and only one CPU will
be powered up for these background activities. Nevertheless, I can
extend the packing of small tasks to pack all tasks in as few CPUs as
possible in any scenario. This would imply choosing a new buddy CPU when
the previous one is full during the ILB selection, for example, and
adding a knob to select this mode, which will modify the performance of
the system. But the primary target is not to have a knob and not to
reduce performance in most scenarios.

Regards,
Vincent

>
> --
> Thanks Alex
alex.shi March 27, 2013, 8:47 a.m. UTC | #7
>>>>> So supposing the current ILB is 6, we'll only check 4, not 0-3, even
>>>>> though there might be a perfectly idle cpu in there.
>>> We will check 4,5,7 at MC level in order to pack in the group of A15
>>> (because they are not sharing the same power domain). If none of them
>>> are idle, we will look at CPU level and will check CPUs 0-3.
>>
>> So you increase a fixed step here.
> 
> I have modified the find_new_ilb function to look for the best idle
> CPU instead of just picking the first CPU of idle_cpus_mask.

That's better.
But using a fixed buddy is still not flexible, and it involves more
checking in this time-critical balancing path.
Consider that in most SMP systems the CPUs are equal, so any other CPU
can play the role of the buddy in your design. That means having no
fixed buddy CPU is better, like in my version of packing.

> 
>>>
>>>>>
>>>>> Also, your scheme fails to pack when cpus 0,4 are filled, even when
>>>>> there's idle cores around.
>>> The primary target is to pack the tasks only when we are in a not busy
>>> system so you will have a power improvement without performance
>>> decrease. is_light_task function returns false and  is_buddy_busy
>>> function true before the buddy is fully loaded and the scheduler will
>>> fall back into the default behavior which spreads tasks and races to
>>> idle.
>>>
>>> We can extend the buddy CPU and the packing mechanism to fill one CPU
>>> before filling another buddy but it's not always the best choice for
>>> performance and/or power and thus it will imply to have a knob to
>>> select this full packing mode.
>>
>> Just one buddy to pack tasks for whole level cpus definitely has
>> scalability problem. That is not good for powersaving in most of scenarios.
>>
> 
> This patch doesn't want to pack all kind of tasks in all scenario but
> only the small tasks that run less that 10ms and when the CPU is not
> already too busy with other tasks so you don't have to cope with long
> wake up latency and performance regression and only one CPU will be
> powered up for these background activities. Nevertheless, I can extend
> the packing small tasks to pack all tasks in any scenario in as few
> CPUs as possible. This will imply to choose a new buddy CPU when the
> previous one is full during the ILB selection as an example and to add
> a knob to select this mode which will modify the performance of the
> system. But the primary target is not to have a knob and not to reduce
> performance in most of scenario.

Arguing about the performance/power balance does not make much sense
without a detailed scenario. We just want to find a flexible compromise.
But a fixed buddy CPU is not flexible, and it may miss many possible
power-saving scenarios on x86 systems. For example, if 2 SMT CPUs can
handle all the tasks, we don't need to wake another core; or if 2 cores
in one socket can handle the tasks, we also don't need to wake up
another socket.
> 
> Regards,
> Vincent
> 
>>
>> --
>> Thanks Alex
Peter Zijlstra March 27, 2013, 8:49 a.m. UTC | #8
On Wed, 2013-03-27 at 12:56 +0800, Alex Shi wrote:

> Just one buddy to pack tasks for whole level cpus definitely has
> scalability problem.

Right, but note we already have this scalability problem in the form of
the ILB. Some people were working on sorting that but then someone
changed jobs and I think it fell in some deep and dark crack.

Venki, Suresh?
Vincent Guittot March 27, 2013, 10:30 a.m. UTC | #9
On 27 March 2013 09:47, Alex Shi <alex.shi@intel.com> wrote:
>
>>>>>> So supposing the current ILB is 6, we'll only check 4, not 0-3, even
>>>>>> though there might be a perfectly idle cpu in there.
>>>> We will check 4,5,7 at MC level in order to pack in the group of A15
>>>> (because they are not sharing the same power domain). If none of them
>>>> are idle, we will look at CPU level and will check CPUs 0-3.
>>>
>>> So you increase a fixed step here.
>>
>> I have modified the find_new_ilb function to look for the best idle
>> CPU instead of just picking the first CPU of idle_cpus_mask.
>
> That's better.
> But using a fixed buddy is still not flexible, and involve more checking
> in this time critical balancing.
> Consider the most of SMP system, cpu is equal, so any of other cpu can
> play the role of buddy in your design. That means no buddy cpu is
> better, like my version packing.
>
>>
>>>>
>>>>>>
>>>>>> Also, your scheme fails to pack when cpus 0,4 are filled, even when
>>>>>> there's idle cores around.
>>>> The primary target is to pack the tasks only when we are in a not busy
>>>> system so you will have a power improvement without performance
>>>> decrease. is_light_task function returns false and  is_buddy_busy
>>>> function true before the buddy is fully loaded and the scheduler will
>>>> fall back into the default behavior which spreads tasks and races to
>>>> idle.
>>>>
>>>> We can extend the buddy CPU and the packing mechanism to fill one CPU
>>>> before filling another buddy but it's not always the best choice for
>>>> performance and/or power and thus it will imply to have a knob to
>>>> select this full packing mode.
>>>
>>> Just one buddy to pack tasks for whole level cpus definitely has
>>> scalability problem. That is not good for powersaving in most of scenarios.
>>>
>>
>> This patch doesn't want to pack all kind of tasks in all scenario but
>> only the small tasks that run less that 10ms and when the CPU is not
>> already too busy with other tasks so you don't have to cope with long
>> wake up latency and performance regression and only one CPU will be
>> powered up for these background activities. Nevertheless, I can extend
>> the packing small tasks to pack all tasks in any scenario in as few
>> CPUs as possible. This will imply to choose a new buddy CPU when the
>> previous one is full during the ILB selection as an example and to add
>> a knob to select this mode which will modify the performance of the
>> system. But the primary target is not to have a knob and not to reduce
>> performance in most of scenario.
>
> Arguing the performance/power balance does no much sense without
> detailed scenario. We just want to seek a flexible compromise way.
> But fixed buddy cpu is not flexible. and it may lose many possible
> powersaving fit scenarios on x86 system. Like if 2 SMT cpu can handle
> all tasks, we don't need to wake another core. or if 2 cores in one
> socket can handle tasks, we also don't need to wakeup another socket.

Using 2 SMT CPUs for all tasks implies accepting latency and sharing
resources like cache and memory bandwidth, so it means that you also
accept some potential performance decrease, which implies that someone
must select this mode with a knob.
The primary goal of the patchset is not to select between power saving
and performance but to stay in performance mode. We pack the small
tasks on one CPU so performance will not decrease, but the low-load
scenario will consume less power. Then I can add another step which
will be more aggressive about power saving, with a potential cost in
performance, and in this case the buddy CPU will be updated dynamically
according to the system load.

>>
>> Regards,
>> Vincent
>>
>>>
>>> --
>>> Thanks Alex
>
>
> --
> Thanks Alex
alex.shi March 27, 2013, 1:32 p.m. UTC | #10
On 03/27/2013 06:30 PM, Vincent Guittot wrote:
>> Arguing the performance/power balance does no much sense without
>> > detailed scenario. We just want to seek a flexible compromise way.
>> > But fixed buddy cpu is not flexible. and it may lose many possible
>> > powersaving fit scenarios on x86 system. Like if 2 SMT cpu can handle
>> > all tasks, we don't need to wake another core. or if 2 cores in one
>> > socket can handle tasks, we also don't need to wakeup another socket.
> Using 2 SMT for all tasks implies to accept latency and to share
> resources like cache and memory bandwidth so it means that you also
> accept some potential performance decrease which implies that someone
> must select this mode with a knob.
> The primary goal of the patchset is not to select between powersaving
> and performance but to stay in performance mode. We pack the small
> tasks in one CPU so the performance will not decrease but the low load
> scenario will consume less power. Then, I can add another step which
> will be more power saving aggressive with a potential cost of
> performance and in this case the buddy CPU will be updated dynamically
> according to the system load
> 

Prediction of small-task behavior is often wrong, so for performance
purposes, packing tasks is a bad idea.
Vincent Guittot April 5, 2013, 11:08 a.m. UTC | #11
Peter,

After some thoughts about your comments, I can update the buddy CPU
during ILB or periodic LB to a new idle core and extend the packing
mechanism. Does this additional mechanism sound better to you?
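
Roughly, the idea would be (a hypothetical sketch, not code from this
series; update_packing_buddy() is an invented name, and is_buddy_busy()
is the helper discussed above):

static void update_packing_buddy(int cpu)
{
	int buddy = per_cpu(sd_pack_buddy, cpu);
	int idle;

	/* Keep the current buddy while it still has spare capacity. */
	if (buddy != -1 && !is_buddy_busy(buddy))
		return;

	/* Otherwise promote an idle core to be the new packing target. */
	idle = cpumask_first(nohz.idle_cpus_mask);
	if (idle < nr_cpu_ids)
		per_cpu(sd_pack_buddy, cpu) = idle;
}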

Vincent

On 26 March 2013 15:42, Peter Zijlstra <peterz@infradead.org> wrote:
> On Tue, 2013-03-26 at 15:03 +0100, Vincent Guittot wrote:
>> > But ha! here's your NO_HZ link.. but does the above DTRT and ensure
>> > that the ILB is a little core when possible?
>>
>> The loop looks for an idle CPU as close as possible to the buddy CPU
>> and the buddy CPU is the first CPU that has been chosen. So if your buddy is
>> a little and there is an idle little, the ILB will be this idle
>> little.
>
> Earlier you wrote:
>
>>       | Cluster 0   | Cluster 1   |
>>       | CPU0 | CPU1 | CPU2 | CPU3 |
>> -----------------------------------
>> buddy | CPU0 | CPU0 | CPU0 | CPU2 |
>
> So extrapolating that to a 4+4 big-little you'd get something like:
>
>       |   little  A9  ||   big A15     |
>       | 0 | 1 | 2 | 3 || 4 | 5 | 6 | 7 |
> ------+---+---+---+---++---+---+---+---+
> buddy | 0 | 0 | 0 | 0 || 0 | 4 | 4 | 4 |
>
> Right?
>
> So supposing the current ILB is 6, we'll only check 4, not 0-3, even
> though there might be a perfectly idle cpu in there.
>
> Also, your scheme fails to pack when cpus 0,4 are filled, even when
> there's idle cores around.
>
> If we'd use the ILB as packing cpu, we would simply select a next pack
> target once the old one fills up.
>
preeti April 22, 2013, 5:45 a.m. UTC | #12
Hi Vincent,

On 04/05/2013 04:38 PM, Vincent Guittot wrote:
> Peter,
> 
> After some thoughts about your comments, I can update the buddy CPU
> during ILB or periodic LB to a new idle core and extend the packing
> mechanism. Does this additional mechanism sound better to you?

If the primary goal of this patchset is to pack small tasks in fewer
power domains then why not see if the power aware scheduler patchset by
Alex does the same for you? The reason being:

a. The power aware scheduler also checks if a task is small enough to be
packed on a cpu which has just enough capacity to take on that
task (the leader cpu). This cpu belongs to a scheduler group which is
nearly full (the group_leader), so we end up packing tasks.

b. The overhead of assigning a buddy cpu gets eliminated because the
best cpu for packing is decided during wake up.

c. This is a scalable solution because if the leader cpu is busy, then
any other idle cpu from that group_leader is chosen. Eventually you end
up packing anyway.

The reason that I am suggesting this is that we could unify the power
awareness of the scheduler under one umbrella. And I believe that the
current power aware scheduler patchset is flexible enough to do this and
that we must cash in on it.

Thanks

Regards
Preeti U Murthy
> 
> Vincent
> 
> On 26 March 2013 15:42, Peter Zijlstra <peterz@infradead.org> wrote:
>> On Tue, 2013-03-26 at 15:03 +0100, Vincent Guittot wrote:
>>>> But ha! here's your NO_HZ link.. but does the above DTRT and ensure
>>>> that the ILB is a little core when possible?
>>>
>>> The loop looks for an idle CPU as close as possible to the buddy CPU
>>> and the buddy CPU is the first CPU that has been chosen. So if your buddy is
>>> a little and there is an idle little, the ILB will be this idle
>>> little.
>>
>> Earlier you wrote:
>>
>>>       | Cluster 0   | Cluster 1   |
>>>       | CPU0 | CPU1 | CPU2 | CPU3 |
>>> -----------------------------------
>>> buddy | CPU0 | CPU0 | CPU0 | CPU2 |
>>
>> So extrapolating that to a 4+4 big-little you'd get something like:
>>
>>       |   little  A9  ||   big A15     |
>>       | 0 | 1 | 2 | 3 || 4 | 5 | 6 | 7 |
>> ------+---+---+---+---++---+---+---+---+
>> buddy | 0 | 0 | 0 | 0 || 0 | 4 | 4 | 4 |
>>
>> Right?
>>
>> So supposing the current ILB is 6, we'll only check 4, not 0-3, even
>> though there might be a perfectly idle cpu in there.
>>
>> Also, your scheme fails to pack when cpus 0,4 are filled, even when
>> there's idle cores around.
>>
>> If we'd use the ILB as packing cpu, we would simply select a next pack
>> target once the old one fills up.
>>
>
alex.shi April 23, 2013, 2:23 a.m. UTC | #13
Thank you, Preeti and Vincent, for discussing the power aware scheduler
in detail! I believe this open discussion will help us reach a more
comprehensive solution. :)

> Hi Preeti,
> 
> I have had a look at Alex's patches but I have some concerns with them:
> -There is no notion of power domain, which is quite important when we
> speak about power saving IMHO. Packing tasks is only interesting if the
> idle CPUs can reach a useful low power state independently from busy
> CPUs. Architectures have different low power state capabilities, which
> must be taken into account. In addition, you can have systems which have
> CPUs with better power efficiency, and this kind of system is not taken
> into account.

I agree with you on this point, and I like what you've done to add a new
flag in the sched domain. It also makes it easy for the scheduler to pick
up new ideas in balancing.
BTW, currently my balancing tries to pack tasks per SMT; maybe packing
tasks per CPU horsepower would be more compatible with other archs?

> -There is some computation of statistics over a potentially large number
> of cpus and groups at each task wake-up. This overhead concerns me, and
> such an amount of computation should only be done when we have more time,
> such as during the periodic load balance.

Usually, some computation is far lighter than a task migration. If the
computation helps reduce possible future migrations, it will save a lot.
With the current code, I observed that fork balancing can distribute
tasks well in the powersaving policy. That means the computation is
worth it.

> -There are some heuristics that will be hard to tune:
>  *powersaving balance period set as 8*max_interval
>  *power saving can do some performance load balance if there was no
> performance load balance in the last 32 balances with no more than 4
> perf balance in the last 64 balance

Do you have other tuning opinions on the numbers? I am glad to hear any
input.
>  *sched_burst_threshold

I found it useful on the 3.8 kernel when aim7 caused very imbalanced
wakeups. But now aim7 has calmed down after the lock-stealing rwsem was
used in the kernel; we may need to re-evaluate this on a future version.
> 
> I'm going to send a proposal for a more aggressive and scalable mode of
> my patches which will take care of my concerns. Let's see how this new
> patchset can fit with Alex's ones
preeti April 23, 2013, 4:36 a.m. UTC | #14
Hi Vincent,

Thank you very much for bringing out the differences between your
goals and the working of the power aware scheduler patchset. This was
essential for us to understand the various requirements of a power
aware scheduler. After you post the patchset we can try to evaluate
these points again.

Thanks

Regards
Preeti U Murthy

On 04/23/2013 01:27 AM, Vincent Guittot wrote:
> On Monday, 22 April 2013, Preeti U Murthy <preeti@linux.vnet.ibm.com> wrote:
>> Hi Vincent,
>>
>> On 04/05/2013 04:38 PM, Vincent Guittot wrote:
>>> Peter,
>>>
>>> After some thoughts about your comments, I can update the buddy CPU
>>> during ILB or periodic LB to a new idle core and extend the packing
>>> mechanism. Does this additional mechanism sound better to you?
>>
> 
> Hi Preeti,
> 
> I have had a look at Alex's patches but I have some concerns with them:
> -There is no notion of power domain, which is quite important when we speak
> about power saving IMHO. Packing tasks is only interesting if the idle CPUs
> can reach a useful low power state independently from busy CPUs.
> Architectures have different low power state capabilities, which must be
> taken into account. In addition, you can have systems which have CPUs with
> better power efficiency, and this kind of system is not taken into account.
> -There is some computation of statistics over a potentially large number of
> cpus and groups at each task wake-up. This overhead concerns me, and such an
> amount of computation should only be done when we have more time, such as
> during the periodic load balance.
> -There are some heuristics that will be hard to tune:
>  *powersaving balance period set as 8*max_interval
>  *power saving can do some performance load balance if there was no
> performance load balance in the last 32 balances with no more than 4 perf
> balance in the last 64 balance
>  *sched_burst_threshold
> 
> I'm going to send a proposal for a more aggressive and scalable mode of my
> patches which will take care of my concerns. Let's see how this new patchset
> can fit with Alex's ones
> 
> Regards,
> Vincent
> 
>> If the primary goal of this patchset is to pack small tasks in fewer
>> power domains then why not see if the power aware scheduler patchset by
>> Alex does the same for you? The reason being:
>>
>> a.The power aware scheduler also checks if a task is small enough to be
>> packed on a cpu which has just enough capacity to take on that
>> task(leader cpu). This cpu belongs to a scheduler group which is nearly
>> full(group_leader),so we end up packing tasks.
>>
>> b.The overhead of assigning a buddy cpu gets eliminated because the best
>> cpu for packing is decided during wake up.
>>
>> c.This is a scalable solution because if the leader cpu is busy,then any
>> other idle cpu from that group_leader is chosen.Eventually you end up
>> packing anyway.
>>
>> The reason that I am suggesting this is that we could unify the power
>> awareness of the scheduler under one umbrella.And i believe that the
>> current power aware scheduler patchset is flexible enough to do this and
>> that we must cash in on it.
>>
>> Thanks
>>
>> Regards
>> Preeti U Murthy
>>>
>>> Vincent
>>>
>>> On 26 March 2013 15:42, Peter Zijlstra <peterz@infradead.org> wrote:
>>>> On Tue, 2013-03-26 at 15:03 +0100, Vincent Guittot wrote:
>>>>>> But ha! here's your NO_HZ link.. but does the above DTRT and ensure
>>>>>> that the ILB is a little core when possible?
>>>>>
>>>>> The loop looks for an idle CPU as close as possible to the buddy CPU
>>>>> and the buddy CPU is the first CPU that has been chosen. So if your buddy is
>>>>> a little and there is an idle little, the ILB will be this idle
>>>>> little.
>>>>
>>>> Earlier you wrote:
>>>>
>>>>>       | Cluster 0   | Cluster 1   |
>>>>>       | CPU0 | CPU1 | CPU2 | CPU3 |
>>>>> -----------------------------------
>>>>> buddy | CPU0 | CPU0 | CPU0 | CPU2 |
>>>>
>>>> So extrapolating that to a 4+4 big-little you'd get something like:
>>>>
>>>>       |   little  A9  ||   big A15     |
>>>>       | 0 | 1 | 2 | 3 || 4 | 5 | 6 | 7 |
>>>> ------+---+---+---+---++---+---+---+---+
>>>> buddy | 0 | 0 | 0 | 0 || 0 | 4 | 4 | 4 |
>>>>
>>>> Right?
>>>>
>>>> So supposing the current ILB is 6, we'll only check 4, not 0-3, even
>>>> though there might be a perfectly idle cpu in there.
>>>>
>>>> Also, your scheme fails to pack when cpus 0,4 are filled, even when
>>>> there's idle cores around.
>>>>
>>>> If we'd use the ILB as packing cpu, we would simply select a next pack
>>>> target once the old one fills up.
>>>>
>>>
>>
>>
>
preeti April 23, 2013, 4:57 a.m. UTC | #15
Hi Alex,

I have one point below.

On 04/23/2013 07:53 AM, Alex Shi wrote:
> Thank you, Preeti and Vincent, for discussing the power aware scheduler
> in detail! I believe this open discussion will help us reach a more
> comprehensive solution. :)
> 
>> Hi Preeti,
>>
>> I have had a look at Alex's patches but I have some concerns with them:
>> -There is no notion of power domain, which is quite important when we
>> speak about power saving IMHO. Packing tasks is only interesting if the
>> idle CPUs can reach a useful low power state independently from busy
>> CPUs. Architectures have different low power state capabilities, which
>> must be taken into account. In addition, you can have systems which have
>> CPUs with better power efficiency, and this kind of system is not taken
>> into account.
> 
> I agree with you on this point, and I like what you've done to add a new
> flag in the sched domain. It also makes it easy for the scheduler to pick
> up new ideas in balancing.
> BTW, currently my balancing tries to pack tasks per SMT; maybe packing
> tasks per CPU horsepower would be more compatible with other archs?

Correct me if I am wrong, but the scheduler today does not compare the
task load to the destination cpu power before moving the task to the
destination cpu. This could be during:

1. Load balancing: in move_tasks(), only the imbalance is verified
against the task load before moving tasks; it does not necessarily check
whether the destination cpu has enough cpu power to handle these tasks.

2. select_task_rq_fair(): for a forked task, the idlest cpu in the group
leader is found during power-save balance (I am focusing only on the
power-save policy) and is returned as the destination cpu for the forked
task. But I feel we need to check whether that idle cpu has the cpu
power to handle the task load.
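
Something like the below check at both of these points is what I have in
mind (purely illustrative, not existing kernel code; the helper name and
the capacity arithmetic are my assumptions):

static bool dst_cpu_has_capacity(struct task_struct *p, int cpu)
{
	/* Compute capacity of the destination cpu... */
	unsigned long cap = power_of(cpu);
	/* ...versus the load it is already carrying plus the new task. */
	unsigned long used = cpu_rq(cpu)->cfs.runnable_load_avg;

	return task_h_load(p) + used <= cap;
}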

The reason I am bringing up this point is a use case which we might
need to handle in the power aware scheduler going ahead: big.LITTLE
cpus. We would ideally want the short-running tasks on the LITTLE cpus
and the long-running tasks on the big cpus.

While the power aware scheduler strives to pack tasks, it should not end
up packing long-running tasks on LITTLE cpus. Keeping short-running
tasks off the big cpus is the next step of course, but at least we
should not throttle the long-running tasks by scheduling them on LITTLE
cpus.

Thanks

Regards
Preeti U Murthy
Arjan van de Ven April 23, 2013, 3:30 p.m. UTC | #16
On 4/22/2013 7:23 PM, Alex Shi wrote:
> Thank you, Preeti and Vincent, for discussing the power aware scheduler
> in detail! I believe this open discussion will help us reach a more
> comprehensive solution. :)
>
>> Hi Preeti,
>>
>> I have had a look at Alex's patches but I have some concerns with them:
>> -There is no notion of power domain, which is quite important when we
>> speak about power saving IMHO. Packing tasks is only interesting if the
>> idle CPUs can reach a useful low power state independently from busy
>> CPUs. Architectures have different low power state capabilities, which
>> must be taken into account. In addition, you can have systems which have
>> CPUs with better power efficiency, and this kind of system is not taken
>> into account.
>
> I agree with you on this point, and I like what you've done to add a new
> flag in the sched domain.

For x86 we should not be setting such flag then; we don't have a way for some cpu packages to
go to an extra deep power state if they're completely idle.
(this afaik is true for both Intel and AMD)
Peter Zijlstra April 26, 2013, 10:54 a.m. UTC | #17
On Tue, Apr 23, 2013 at 08:30:58AM -0700, Arjan van de Ven wrote:
> For x86 we should not be setting such flag then; we don't have a way for some cpu packages to
> go to an extra deep power state if they're completely idle.
> (this afaik is true for both Intel and AMD)

You say 'some'; this implies we do for others, right?

We can dynamically set the required flags in the arch setup when it makes sense.
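
Something like the below, say (a sketch only; arch_pkg_can_powergate()
is a made-up helper and the flag name is purely illustrative):

static int arch_sd_pkg_flags(void)
{
	/*
	 * Sketch: if the package cannot power-gate independently, its
	 * CPUs effectively share one power domain, so advertise that
	 * to the scheduler.  Both names here are invented.
	 */
	if (!arch_pkg_can_powergate())
		return SD_SHARE_POWERDOMAIN;	/* illustrative flag name */

	return 0;
}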

Patch

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b636199..52a7736 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5455,7 +5455,25 @@  static struct {
 
 static inline int find_new_ilb(int call_cpu)
 {
+	struct sched_domain *sd;
 	int ilb = cpumask_first(nohz.idle_cpus_mask);
+	int buddy = per_cpu(sd_pack_buddy, call_cpu);
+
+	/*
+	 * If we have a pack buddy CPU, we try to run load balance on a CPU
+	 * that is close to the buddy.
+	 */
+	if (buddy != -1)
+		for_each_domain(buddy, sd) {
+			if (sd->flags & SD_SHARE_CPUPOWER)
+				continue;
+
+			ilb = cpumask_first_and(sched_domain_span(sd),
+					nohz.idle_cpus_mask);
+
+			if (ilb < nr_cpu_ids)
+				break;
+		}
 
 	if (ilb < nr_cpu_ids && idle_cpu(ilb))
 		return ilb;