
[RFC,v3,3/6] sched: pack small tasks

Message ID 1363955155-18382-4-git-send-email-vincent.guittot@linaro.org (mailing list archive)
State New, archived

Commit Message

Vincent Guittot March 22, 2013, 12:25 p.m. UTC
During the creation of sched_domain, we define a pack buddy CPU for each CPU
when one is available. We want to pack at all levels where a group of CPUs can
be power gated independently of the others.
On a system that can't power gate a group of CPUs independently, the flag is
set at all sched_domain levels and the buddy is set to -1. This is the default
behavior.
On a dual-cluster / dual-core system which can power gate each core and
cluster independently, the buddy configuration will be:

      | Cluster 0   | Cluster 1   |
      | CPU0 | CPU1 | CPU2 | CPU3 |
-----------------------------------
buddy | CPU0 | CPU0 | CPU0 | CPU2 |

Small tasks tend to slip out of the periodic load balance, so the best place
to migrate them is at their wake up. The decision is O(1) as
we only check against one buddy CPU.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Reviewed-by: Morten Rasmussen <morten.rasmussen@arm.com>
---
 kernel/sched/core.c  |    1 +
 kernel/sched/fair.c  |  115 ++++++++++++++++++++++++++++++++++++++++++++++++++
 kernel/sched/sched.h |    5 +++
 3 files changed, 121 insertions(+)
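
For illustration only (not part of the patch): a minimal, self-contained C model of
the buddy table above and of the O(1) wake-up decision it enables. The array, the
helper name and the flags passed to it are simplified stand-ins for the kernel's
per-CPU data and helpers, not the actual implementation.

#include <stdio.h>

#define NR_CPUS 4

/* Stand-in for the per-CPU sd_pack_buddy variable; values follow the table above. */
static const int sd_pack_buddy[NR_CPUS] = { 0, 0, 0, 2 };

/*
 * O(1) wake-up decision: only the waking CPU's single buddy is examined,
 * no sched_domain or sched_group is scanned at wake-up time.
 */
static int pick_wakeup_cpu(int cpu, int task_is_light, int buddy_is_busy)
{
	int buddy = sd_pack_buddy[cpu];

	if (buddy == -1)			/* no packing possible on this system */
		return cpu;
	if (task_is_light && !buddy_is_busy)	/* pack the small task on its buddy */
		return buddy;
	return cpu;				/* fall back to the default placement */
}

int main(void)
{
	int cpu;

	for (cpu = 0; cpu < NR_CPUS; cpu++)
		printf("light task waking on CPU%d -> CPU%d\n",
		       cpu, pick_wakeup_cpu(cpu, 1, 0));
	return 0;
}

Running it prints CPU0 for CPU0, CPU1 and CPU2, and CPU2 for CPU3, matching the
buddy table above.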

Comments

Peter Zijlstra March 26, 2013, 12:26 p.m. UTC | #1
On Fri, 2013-03-22 at 13:25 +0100, Vincent Guittot wrote:
> +static bool is_buddy_busy(int cpu)
> +{
> +       struct rq *rq = cpu_rq(cpu);
> +
> +       /*
> +        * A busy buddy is a CPU with a high load or a small load with a lot of
> +        * running tasks.
> +        */
> +       return (rq->avg.runnable_avg_sum >
> +                       (rq->avg.runnable_avg_period / (rq->nr_running + 2)));
> +}

Why does the comment talk about load but we don't see it in the
equation. Also, why does nr_running matter at all? I thought we'd
simply bother with utilization, if fully utilized we're done etc..
Peter Zijlstra March 26, 2013, 12:37 p.m. UTC | #2
On Fri, 2013-03-22 at 13:25 +0100, Vincent Guittot wrote:
> +static bool is_light_task(struct task_struct *p)
> +{
> +       /* A light task runs less than 20% in average */
> +       return ((p->se.avg.runnable_avg_sum  * 5) <
> +                       (p->se.avg.runnable_avg_period));
> +}

OK, so we have a 'problem' here, we initialize runnable_avg_* to 0, but
we want to 'assume' a fresh task is fully 'loaded'. IIRC Alex ran into
this as well.

PJT, do you have any sane solution for this, I forgot what the result
of the last discussion was -- was there any?
Peter Zijlstra March 26, 2013, 12:46 p.m. UTC | #3
On Fri, 2013-03-22 at 13:25 +0100, Vincent Guittot wrote:
> During the creation of sched_domain, we define a pack buddy CPU for
> each CPU
> when one is available. We want to pack at all levels where a group of
> CPU can
> be power gated independently from others.
> On a system that can't power gate a group of CPUs independently, the
> flag is
> set at all sched_domain level and the buddy is set to -1. This is the
> default
> behavior.
> On a dual clusters / dual cores system which can power gate each core
> and
> cluster independently, the buddy configuration will be :
> 
>       | Cluster 0   | Cluster 1   |
>       | CPU0 | CPU1 | CPU2 | CPU3 |
> -----------------------------------
> buddy | CPU0 | CPU0 | CPU0 | CPU2 |

I suppose this is adequate for the 'small' systems you currently have;
but given that Samsung is already bragging with its 'octo'-core Exynos
5 (4+4 big-little thing) does this solution scale?

Isn't this basically related to picking the NO_HZ cpu; if the system
isn't fully symmetric with its power gates you want the NO_HZ cpu to be
the 'special' cpu. If it is symmetric we really don't care which core
is left 'running' and we can even select a new pack cpu from the idle
cores once the old one is fully utilized.

Re-using (or integrating) with NO_HZ has the dual advantage that you'll
make NO_HZ do the right thing for big-little (you typically want a
little core to be the one staying 'awake'), and once someone makes NO_HZ
scale, this all gets to scale along with it.
Vincent Guittot March 26, 2013, 1 p.m. UTC | #4
On 26 March 2013 13:37, Peter Zijlstra <peterz@infradead.org> wrote:
> On Fri, 2013-03-22 at 13:25 +0100, Vincent Guittot wrote:
>> +static bool is_light_task(struct task_struct *p)
>> +{
>> +       /* A light task runs less than 20% in average */
>> +       return ((p->se.avg.runnable_avg_sum  * 5) <
>> +                       (p->se.avg.runnable_avg_period));
>> +}
>
> OK, so we have a 'problem' here, we initialize runnable_avg_* to 0, but
> we want to 'assume' a fresh task is fully 'loaded'. IIRC Alex ran into
> this as well.
Hi Peter,

Packing of small tasks is only applied at wake up and not during fork
or exec, so the runnable_avg_* should already have been initialized. As you
mentioned, we assume that a fresh task is fully loaded and let the
default scheduler behavior select a target CPU.

Vincent

>
> PJT, do you have any sane solution for this, I forgot what the result
> of the last discussion was -- was there any?
>
Vincent Guittot March 26, 2013, 1:53 p.m. UTC | #5
On 26 March 2013 13:46, Peter Zijlstra <peterz@infradead.org> wrote:
> On Fri, 2013-03-22 at 13:25 +0100, Vincent Guittot wrote:
>> During the creation of sched_domain, we define a pack buddy CPU for
>> each CPU
>> when one is available. We want to pack at all levels where a group of
>> CPU can
>> be power gated independently from others.
>> On a system that can't power gate a group of CPUs independently, the
>> flag is
>> set at all sched_domain level and the buddy is set to -1. This is the
>> default
>> behavior.
>> On a dual clusters / dual cores system which can power gate each core
>> and
>> cluster independently, the buddy configuration will be :
>>
>>       | Cluster 0   | Cluster 1   |
>>       | CPU0 | CPU1 | CPU2 | CPU3 |
>> -----------------------------------
>> buddy | CPU0 | CPU0 | CPU0 | CPU2 |
>
> I suppose this is adequate for the 'small' systems you currently have;
> but given that Samsung is already bragging with its 'octo'-core Exynos
> 5 (4+4 big-little thing) does this solution scale?

The packing is only done at the MC and CPU levels to minimize the number of
transitions.

>
> Isn't this basically related to picking the NO_HZ cpu; if the system
> isn't fully symmetric with its power gates you want the NO_HZ cpu to be
> the 'special' cpu. If it is symmetric we really don't care which core
> is left 'running' and we can even select a new pack cpu from the idle
> cores once the old one is fully utilized.

I agree that on a symmetric system we don't really care about which
core is selected, but we want to use the same one whenever possible to
prevent ping-ponging between several cores or groups of cores, which is
power consuming. By forcing a NOHZ cpu, your background activity will
smoothly pack on this CPU and will not be spread across your system.
When a CPU is fully loaded, we are no longer in a low CPU load use case
and the periodic load balance can handle the situation and select a new
target CPU which is close to the buddy CPU.

>
> Re-using (or integrating) with NO_HZ has the dual advantage that you'll
> make NO_HZ do the right thing for big-little (you typically want a
> little core to be the one staying 'awake' and once someone makes NO_HZ
> scale this all gets to scale along with it.
>

I think you have answered this question in your comment on
patch 5, haven't you?

Vincent
Arjan van de Ven March 26, 2013, 3:29 p.m. UTC | #6
> Isn't this basically related to picking the NO_HZ cpu; if the system
> isn't fully symmetric with its power gates you want the NO_HZ cpu to be
> the 'special' cpu. If it is symmetric we really don't care which core
> is left 'running' and we can even select a new pack cpu from the idle
> cores once the old one is fully utilized.

you don't really care much, sure, but there are some advantages to sorting "all the way left",
e.g. to linux cpu 0.
Some tasks only run there, and interrupts tend to be favored to that cpu as well on x86.
preeti March 27, 2013, 4:33 a.m. UTC | #7
Hi Peter,

On 03/26/2013 06:07 PM, Peter Zijlstra wrote:
> On Fri, 2013-03-22 at 13:25 +0100, Vincent Guittot wrote:
>> +static bool is_light_task(struct task_struct *p)
>> +{
>> +       /* A light task runs less than 20% in average */
>> +       return ((p->se.avg.runnable_avg_sum  * 5) <
>> +                       (p->se.avg.runnable_avg_period));
>> +}
> 
> OK, so we have a 'problem' here, we initialize runnable_avg_* to 0, but
> we want to 'assume' a fresh task is fully 'loaded'. IIRC Alex ran into
> this as well.
> 
> PJT, do you have any sane solution for this, I forgot what the result
> of the last discussion was -- was there any?

The conclusion after the last discussion between PJT and Alex was that the
load contribution of a fresh task should be set to "full" during __sched_fork().

Setting task->se.avg.load_avg_contrib = task->se.load.weight during
__sched_fork() is reflected in the latest power-aware scheduler patchset
by Alex.

Thanks

Regards
Preeti U Murthy
>
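
For concreteness, a stand-alone sketch of the initialization described above. The
struct layout, the helper name and the 1024 window are assumptions for illustration;
only the idea (a fresh task starts with a full runnable history and contributes its
full weight) comes from the discussion.

#include <stdio.h>

/* Simplified stand-ins for the se.avg and se.load fields of a task. */
struct avg_model {
	unsigned long runnable_avg_sum;
	unsigned long runnable_avg_period;
	unsigned long load_avg_contrib;
};

struct task_model {
	unsigned long load_weight;	/* stand-in for se.load.weight */
	struct avg_model avg;		/* stand-in for se.avg */
};

/* Treat a freshly forked task as fully loaded until it builds real history. */
static void init_fresh_task_load(struct task_model *p)
{
	p->avg.runnable_avg_period = 1024;			/* assumed full window */
	p->avg.runnable_avg_sum = p->avg.runnable_avg_period;	/* 100% runnable */
	p->avg.load_avg_contrib = p->load_weight;		/* full weight contribution */
}

int main(void)
{
	struct task_model t = { .load_weight = 1024 };

	init_fresh_task_load(&t);
	printf("contrib=%lu sum=%lu period=%lu\n",
	       t.avg.load_avg_contrib, t.avg.runnable_avg_sum,
	       t.avg.runnable_avg_period);
	return 0;
}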
alex.shi March 27, 2013, 4:48 a.m. UTC | #8
On 03/27/2013 12:33 PM, Preeti U Murthy wrote:
> Hi Peter,
> 
> On 03/26/2013 06:07 PM, Peter Zijlstra wrote:
>> On Fri, 2013-03-22 at 13:25 +0100, Vincent Guittot wrote:
>>> +static bool is_light_task(struct task_struct *p)
>>> +{
>>> +       /* A light task runs less than 20% in average */
>>> +       return ((p->se.avg.runnable_avg_sum  * 5) <
>>> +                       (p->se.avg.runnable_avg_period));
>>> +}
>>
>> OK, so we have a 'problem' here, we initialize runnable_avg_* to 0, but
>> we want to 'assume' a fresh task is fully 'loaded'. IIRC Alex ran into
>> this as well.
>>
>> PJT, do you have any sane solution for this, I forgot what the result
>> of the last discussion was -- was there any?
> 
> The conclusion after last discussion between PJT and Alex was that the
> load contribution of a fresh task be set to "full" during "__sched_fork()".
> 
> task->se.avg.load_avg_contrib = task->se.load.weight during
> __sched_fork() is reflected in the latest power aware scheduler patchset
> by Alex.

Yes, the newly forked runnable load was set to full utilisation in the V5
power-aware scheduling patchset. PJT, Mike and I all agree on this. PJT just
discussed how to give the full load to a newly forked task, and we reached
agreement for my coming V6 power-aware scheduling patchset.

> 
> Thanks
> 
> Regards
> Preeti U Murthy
>>
>
Peter Zijlstra March 27, 2013, 8:46 a.m. UTC | #9
On Tue, 2013-03-26 at 08:29 -0700, Arjan van de Ven wrote:
> > Isn't this basically related to picking the NO_HZ cpu; if the system
> > isn't fully symmetric with its power gates you want the NO_HZ cpu to be
> > the 'special' cpu. If it is symmetric we really don't care which core
> > is left 'running' and we can even select a new pack cpu from the idle
> > cores once the old one is fully utilized.
> 
> you don't really care much sure, but there's some advantages for sorting "all the way left",
> e.g. to linux cpu 0.
> Some tasks only run there, and interrupts tend to be favored to that cpu as well on x86.

Right, and I suspect all the big-little nonsense will have the little
cores on low numbers as well (is this architected or can a creative
licensee screw us over?)

So find_new_ilb() already does cpumask_first(), so it has a strong
leftmost preference. We just need to make sure it indeed does the right
thing and doesn't have some unintended side effect.
Peter Zijlstra March 27, 2013, 8:51 a.m. UTC | #10
On Wed, 2013-03-27 at 12:48 +0800, Alex Shi wrote:
> 
> Yes, the new forked runnable load was set as full utilisation in V5
> power aware scheduling. PJT, Mike and I both agree on this. PJT just
> discussion how to give the full load to new forked task. and we get
> agreement in my coming V6 power aware scheduling patchset.

Great! That means I can mark the v5 thread as read! :-)
Vincent Guittot March 27, 2013, 8:54 a.m. UTC | #11
On 27 March 2013 09:46, Peter Zijlstra <peterz@infradead.org> wrote:
> On Tue, 2013-03-26 at 08:29 -0700, Arjan van de Ven wrote:
>> > Isn't this basically related to picking the NO_HZ cpu; if the system
>> > isn't fully symmetric with its power gates you want the NO_HZ cpu to be
>> > the 'special' cpu. If it is symmetric we really don't care which core
>> > is left 'running' and we can even select a new pack cpu from the idle
>> > cores once the old one is fully utilized.
>>
>> you don't really care much sure, but there's some advantages for sorting "all the way left",
>> e.g. to linux cpu 0.
>> Some tasks only run there, and interrupts tend to be favored to that cpu as well on x86.
>
> Right, and I suspect all the big-little nonsense will have the little
> cores on low numbers as well (is this architected or can a creative
> licensee screw us over?)

It's not mandatory to have little cores on low numbers even if it's advised

>
> So find_new_ilb() already does cpumask_first(), so it has a strong
> leftmost preference. We just need to make sure it indeed does the right
> thing and doesn't have some unintended side effect.
>
Peter Zijlstra March 27, 2013, 9 a.m. UTC | #12
On Wed, 2013-03-27 at 09:54 +0100, Vincent Guittot wrote:
> It's not mandatory to have little cores on low numbers even if it's
> advised

ARGH!
preeti March 27, 2013, 10:21 a.m. UTC | #13
Hi,

On 03/26/2013 05:56 PM, Peter Zijlstra wrote:
> On Fri, 2013-03-22 at 13:25 +0100, Vincent Guittot wrote:
>> +static bool is_buddy_busy(int cpu)
>> +{
>> +       struct rq *rq = cpu_rq(cpu);
>> +
>> +       /*
>> +        * A busy buddy is a CPU with a high load or a small load with a lot of
>> +        * running tasks.
>> +        */
>> +       return (rq->avg.runnable_avg_sum >
>> +                       (rq->avg.runnable_avg_period / (rq->nr_running + 2)));
>> +}
> 
> Why does the comment talk about load but we don't see it in the
> equation. Also, why does nr_running matter at all? I thought we'd
> simply bother with utilization, if fully utilized we're done etc..
> 

Peter, let's say the run-queue has 50% utilization and is running 2
tasks, and we wish to find out if it is busy. We would compare this
metric with the cpu power, which let's say is 100.

rq->util * 100 < cpu_of(rq)->power.

In the above scenario would we declare the cpu _not_busy? Or would we do
the following:

(rq->util * 100) * #nr_running <  cpu_of(rq)->power and conclude that it
is just enough _busy_ to not take on more processes?


@Vincent: Yes the comment above needs to be fixed. A busy buddy is a CPU
with *high rq utilization*, as far as the equation goes.

Regards
Preeti U Murthy
Vincent Guittot March 27, 2013, 11 a.m. UTC | #14
On 27 March 2013 11:21, Preeti U Murthy <preeti@linux.vnet.ibm.com> wrote:
> Hi,
>
> On 03/26/2013 05:56 PM, Peter Zijlstra wrote:
>> On Fri, 2013-03-22 at 13:25 +0100, Vincent Guittot wrote:
>>> +static bool is_buddy_busy(int cpu)
>>> +{
>>> +       struct rq *rq = cpu_rq(cpu);
>>> +
>>> +       /*
>>> +        * A busy buddy is a CPU with a high load or a small load with a lot of
>>> +        * running tasks.
>>> +        */
>>> +       return (rq->avg.runnable_avg_sum >
>>> +                       (rq->avg.runnable_avg_period / (rq->nr_running + 2)));
>>> +}
>>
>> Why does the comment talk about load but we don't see it in the
>> equation. Also, why does nr_running matter at all? I thought we'd
>> simply bother with utilization, if fully utilized we're done etc..
>>
>
> Peter, lets say the run-queue has 50% utilization and is running 2
> tasks. And we wish to find out if it is busy. We would compare this
> metric with the cpu power, which lets say is 100.
>
> rq->util * 100 < cpu_of(rq)->power.

I don't use cpu_of(rq)->power in the definition of busyness.

>
> In the above scenario would we declare the cpu _not_busy? Or would we do
> the following:

In the above scenario, the CPU is busy

By load, I mean: 100 * avg.runnable_avg_sum / avg.runnable_avg_period.
In addition, I take into account the number of tasks already in the
runqueue in order to define the busyness of a CPU. A CPU with a load
of 50% and no task in its runqueue is not busy at this time and
we can migrate tasks onto it, but if the CPU already has 2 tasks in its
runqueue, a newly woken-up task will have to share the CPU
with the other tasks, so we consider that the CPU is already busy and we
fall back to the default behavior. The equation considers that a CPU
is not busy if:
100 * avg.runnable_avg_sum / avg.runnable_avg_period < 100 / (nr_running + 2)
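
To make the threshold concrete, here is a small stand-alone model of the check with
illustrative numbers (a CPU at 50% average running time, i.e. sum = 512 over a
period of 1024); it is a sketch, not the kernel code:

#include <stdio.h>
#include <stdbool.h>

/* Model of the test above: busy when sum > period / (nr_running + 2). */
static bool buddy_busy(unsigned long sum, unsigned long period, unsigned int nr_running)
{
	return sum > period / (nr_running + 2);
}

int main(void)
{
	const unsigned long sum = 512, period = 1024;	/* 50% average running time */
	unsigned int nr;

	for (nr = 0; nr <= 2; nr++)
		printf("nr_running=%u -> threshold=%u%% -> %s\n",
		       nr, 100 / (nr + 2),
		       buddy_busy(sum, period, nr) ? "busy" : "not busy");
	return 0;
}

With no task queued, the 50% CPU is still reported not busy; with one or two tasks
already queued it is reported busy, matching the description above.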

>
> (rq->util * 100) * #nr_running <  cpu_of(rq)->power and conclude that it
> is just enough _busy_ to not take on more processes?
>
>
> @Vincent: Yes the comment above needs to be fixed. A busy buddy is a CPU
> with *high rq utilization*, as far as the equation goes.

I can update the comment. Is the comment below clearer?

/*
 * A busy buddy is a CPU with a high average running time, or a small
 * average running time but a lot of running tasks in its runqueue which
 * are already sharing this CPU time.
 */

Vincent

>
> Regards
> Preeti U Murthy
>
Catalin Marinas March 27, 2013, 11:18 a.m. UTC | #15
On Wed, Mar 27, 2013 at 09:00:20AM +0000, Peter Zijlstra wrote:
> On Wed, 2013-03-27 at 09:54 +0100, Vincent Guittot wrote:
> > It's not mandatory to have little cores on low numbers even if it's
> > advised
> 
> ARGH!

I haven't followed this thread closely, so just a random comment from
me. An argument from some is that they want to boot Linux on the big CPU
so that boot is quicker. The most convenient way is to have the big CPU at 0.
Peter Zijlstra March 27, 2013, 2:13 p.m. UTC | #16
On Wed, 2013-03-27 at 11:18 +0000, Catalin Marinas wrote:
> On Wed, Mar 27, 2013 at 09:00:20AM +0000, Peter Zijlstra wrote:
> > On Wed, 2013-03-27 at 09:54 +0100, Vincent Guittot wrote:
> > > It's not mandatory to have little cores on low numbers even if it's
> > > advised
> > 
> > ARGH!
> 
> I haven't followed this thread closely, so just a random comment from
> me. An argument from some is that they want to boot Linux on the big CPU
> to be quicker. The most convenient is to have the big CPU at 0.

I suppose that's almost sensible ;-) I just despair at the amount of
variation that's allowed.

I'm guessing that swapping cpus in the bootloader or someplace really
early is equally hard in that we (Linux) assume we boot on cpu 0 or
something like that?
Nicolas Pitre March 27, 2013, 3:37 p.m. UTC | #17
On Wed, 27 Mar 2013, Vincent Guittot wrote:

> On 27 March 2013 09:46, Peter Zijlstra <peterz@infradead.org> wrote:
> > On Tue, 2013-03-26 at 08:29 -0700, Arjan van de Ven wrote:
> >> > Isn't this basically related to picking the NO_HZ cpu; if the system
> >> > isn't fully symmetric with its power gates you want the NO_HZ cpu to be
> >> > the 'special' cpu. If it is symmetric we really don't care which core
> >> > is left 'running' and we can even select a new pack cpu from the idle
> >> > cores once the old one is fully utilized.
> >>
> >> you don't really care much sure, but there's some advantages for sorting "all the way left",
> >> e.g. to linux cpu 0.
> >> Some tasks only run there, and interrupts tend to be favored to that cpu as well on x86.
> >
> > Right, and I suspect all the big-little nonsense will have the little
> > cores on low numbers as well (is this architected or can a creative
> > licensee screw us over?)
> 
> It's not mandatory to have little cores on low numbers even if it's advised

We can trivially move things around in the logical CPU mapping if that
simplifies things. However, the boot CPU might not be CPU0 in that
case, which might be less trivial.


Nicolas
Catalin Marinas March 27, 2013, 4:36 p.m. UTC | #18
On Wed, Mar 27, 2013 at 02:13:54PM +0000, Peter Zijlstra wrote:
> On Wed, 2013-03-27 at 11:18 +0000, Catalin Marinas wrote:
> > On Wed, Mar 27, 2013 at 09:00:20AM +0000, Peter Zijlstra wrote:
> > > On Wed, 2013-03-27 at 09:54 +0100, Vincent Guittot wrote:
> > > > It's not mandatory to have little cores on low numbers even if it's
> > > > advised
> > > 
> > > ARGH!
> > 
> > I haven't followed this thread closely, so just a random comment from
> > me. An argument from some is that they want to boot Linux on the big CPU
> > to be quicker. The most convenient is to have the big CPU at 0.
> 
> I suppose that's almost sensible ;-) I just despair at the amount of
> variation that's allowed.
> 
> I'm guessing that swapping cpus in the bootloader or someplace really
> early is equally hard in that we (Linux) assume we boot on cpu 0 or
> something like that?

I think it's worth trying (by changing the CPU topology in the DT). At a
quick look, I don't see anything hard-coded in the kernel boot sequence.
It uses smp_processor_id() which translates to
current_thread_info()->cpu on ARM. I'm not sure how early we need this
but it's probably after DT parsing, so we could set 'cpu' to a non-zero
value for the booting processor. There are a few tweaks in the arch/arm
code with cpu_logical_map setup (which maps between
smp_processor_id and the actual hardware CPU id and assumes 0 is the
booting CPU).

So if the above works, the scheduler guys can mandate that little CPUs
are always first and for ARM it would be a matter of getting the right
CPU topology in the DT (independent of what hw vendors think of CPU
topology) and booting Linux on CPU 4 etc.
Nicolas Pitre March 27, 2013, 5:18 p.m. UTC | #19
On Wed, 27 Mar 2013, Catalin Marinas wrote:

> So if the above works, the scheduler guys can mandate that little CPUs
> are always first and for ARM it would be a matter of getting the right
> CPU topology in the DT (independent of what hw vendors think of CPU
> topology) and booting Linux on CPU 4 etc.

Just a note about that: if the scheduler mandates little CPUs first,
that should _not_ have any implications on the DT content. DT is not
about encoding Linux-specific implementation details. It is simple
enough to tweak the CPU logical map at run time when enumerating CPUs.


Nicolas
Vincent Guittot March 27, 2013, 5:20 p.m. UTC | #20
On 27 March 2013 17:36, Catalin Marinas <catalin.marinas@arm.com> wrote:
> On Wed, Mar 27, 2013 at 02:13:54PM +0000, Peter Zijlstra wrote:
>> On Wed, 2013-03-27 at 11:18 +0000, Catalin Marinas wrote:
>> > On Wed, Mar 27, 2013 at 09:00:20AM +0000, Peter Zijlstra wrote:
>> > > On Wed, 2013-03-27 at 09:54 +0100, Vincent Guittot wrote:
>> > > > It's not mandatory to have little cores on low numbers even if it's
>> > > > advised
>> > >
>> > > ARGH!
>> >
>> > I haven't followed this thread closely, so just a random comment from
>> > me. An argument from some is that they want to boot Linux on the big CPU
>> > to be quicker. The most convenient is to have the big CPU at 0.
>>
>> I suppose that's almost sensible ;-) I just despair at the amount of
>> variation that's allowed.
>>
>> I'm guessing that swapping cpus in the bootloader or someplace really
>> early is equally hard in that we (Linux) assume we boot on cpu 0 or
>> something like that?
>
> I think it's worth trying (by changing the CPU topology in the DT). At a
> quick look, I don't see anything hard-coded in the kernel boot sequence.
> It uses smp_processor_id() which translates to
> current_thread_info()->cpu on ARM. I'm not sure how early we need this
> but it's probably after DT parsing, so we could set 'cpu' to a non-zero
> value for the booting processor. There are a few tweaks in the arch/arm
> code code with cpu_logical_map setup (which maps between
> smp_processor_id and the actual hardware CPU id and assumes 0 is the
> booting CPU).

There are a few other places in the code that make the assumption that
the 1st online CPU is the booting CPU, such as disable_nonboot_cpus().

>
> So if the above works, the scheduler guys can mandate that little CPUs
> are always first and for ARM it would be a matter of getting the right
> CPU topology in the DT (independent of what hw vendors think of CPU
> topology) and booting Linux on CPU 4 etc.
>
> --
> Catalin
Catalin Marinas March 27, 2013, 5:37 p.m. UTC | #21
On Wed, Mar 27, 2013 at 05:18:53PM +0000, Nicolas Pitre wrote:
> On Wed, 27 Mar 2013, Catalin Marinas wrote:
> 
> > So if the above works, the scheduler guys can mandate that little CPUs
> > are always first and for ARM it would be a matter of getting the right
> > CPU topology in the DT (independent of what hw vendors think of CPU
> > topology) and booting Linux on CPU 4 etc.
> 
> Just a note about that: if the scheduler mandates little CPUs first, 
> that should _not_ have any implications on the DT content.  DT is not 
> about encoding Linux specific implementation details.  It is simple 
> enough to tweak the CPU logical map at run time when enumeratiing CPUs.

You are right, though a simpler way (hack) to tweak the cpu_logical_map
is to change the DT ;).

But the problem is that the kernel doesn't know which CPU is big and
which is little, unless you specify this in some way via the DT. It can
be the cpu nodes order or some other means.
Catalin Marinas March 27, 2013, 6:01 p.m. UTC | #22
On Wed, Mar 27, 2013 at 05:20:28PM +0000, Vincent Guittot wrote:
> On 27 March 2013 17:36, Catalin Marinas <catalin.marinas@arm.com> wrote:
> > On Wed, Mar 27, 2013 at 02:13:54PM +0000, Peter Zijlstra wrote:
> >> On Wed, 2013-03-27 at 11:18 +0000, Catalin Marinas wrote:
> >> > On Wed, Mar 27, 2013 at 09:00:20AM +0000, Peter Zijlstra wrote:
> >> > > On Wed, 2013-03-27 at 09:54 +0100, Vincent Guittot wrote:
> >> > > > It's not mandatory to have little cores on low numbers even if it's
> >> > > > advised
> >> > >
> >> > > ARGH!
> >> >
> >> > I haven't followed this thread closely, so just a random comment from
> >> > me. An argument from some is that they want to boot Linux on the big CPU
> >> > to be quicker. The most convenient is to have the big CPU at 0.
> >>
> >> I suppose that's almost sensible ;-) I just despair at the amount of
> >> variation that's allowed.
> >>
> >> I'm guessing that swapping cpus in the bootloader or someplace really
> >> early is equally hard in that we (Linux) assume we boot on cpu 0 or
> >> something like that?
> >
> > I think it's worth trying (by changing the CPU topology in the DT). At a
> > quick look, I don't see anything hard-coded in the kernel boot sequence.
> > It uses smp_processor_id() which translates to
> > current_thread_info()->cpu on ARM. I'm not sure how early we need this
> > but it's probably after DT parsing, so we could set 'cpu' to a non-zero
> > value for the booting processor. There are a few tweaks in the arch/arm
> > code code with cpu_logical_map setup (which maps between
> > smp_processor_id and the actual hardware CPU id and assumes 0 is the
> > booting CPU).
> 
> The are few other places in the code that makes the assumption that
> the 1st online CPU is the booting CPU like the disable_nonboot_cpus

Another thing - boot_cpu_init() is using smp_processor_id() before we
get the setup_arch() call where we parse the DT.
Peter Zijlstra April 26, 2013, 10:18 a.m. UTC | #23
On Wed, Mar 27, 2013 at 03:51:51PM +0530, Preeti U Murthy wrote:
> Hi,
> 
> On 03/26/2013 05:56 PM, Peter Zijlstra wrote:
> > On Fri, 2013-03-22 at 13:25 +0100, Vincent Guittot wrote:
> >> +static bool is_buddy_busy(int cpu)
> >> +{
> >> +       struct rq *rq = cpu_rq(cpu);
> >> +
> >> +       /*
> >> +        * A busy buddy is a CPU with a high load or a small load with a lot of
> >> +        * running tasks.
> >> +        */
> >> +       return (rq->avg.runnable_avg_sum >
> >> +                       (rq->avg.runnable_avg_period / (rq->nr_running + 2)));
> >> +}
> > 
> > Why does the comment talk about load but we don't see it in the
> > equation. Also, why does nr_running matter at all? I thought we'd
> > simply bother with utilization, if fully utilized we're done etc..
> > 
> 
> Peter, lets say the run-queue has 50% utilization and is running 2
> tasks. And we wish to find out if it is busy. We would compare this
> metric with the cpu power, which lets say is 100.
> 
> rq->util * 100 < cpu_of(rq)->power.
> 
> In the above scenario would we declare the cpu _not_busy? Or would we do
> the following:
> 
> (rq->util * 100) * #nr_running <  cpu_of(rq)->power and conclude that it
> is just enough _busy_ to not take on more processes?

That is just confused... ->power doesn't have anything to do with a per-cpu
measure. ->power is an inter-cpu measure of relative compute capacity.

Mixing in nr_running confuses things even more; it doesn't matter how many
tasks it takes to push utilization up to 100%; once it's there the cpu simply
cannot run more.

So colour me properly confused..
Peter Zijlstra April 26, 2013, 10:30 a.m. UTC | #24
On Wed, Mar 27, 2013 at 12:00:40PM +0100, Vincent Guittot wrote:
> On 27 March 2013 11:21, Preeti U Murthy <preeti@linux.vnet.ibm.com> wrote:
> > Hi,
> >
> > On 03/26/2013 05:56 PM, Peter Zijlstra wrote:
> >> On Fri, 2013-03-22 at 13:25 +0100, Vincent Guittot wrote:
> >>> +static bool is_buddy_busy(int cpu)
> >>> +{
> >>> +       struct rq *rq = cpu_rq(cpu);
> >>> +
> >>> +       /*
> >>> +        * A busy buddy is a CPU with a high load or a small load with a lot of
> >>> +        * running tasks.
> >>> +        */
> >>> +       return (rq->avg.runnable_avg_sum >
> >>> +                       (rq->avg.runnable_avg_period / (rq->nr_running + 2)));
> >>> +}
> >>
> >> Why does the comment talk about load but we don't see it in the
> >> equation. Also, why does nr_running matter at all? I thought we'd
> >> simply bother with utilization, if fully utilized we're done etc..
> 
> By load, I mean : 100 * avg.runnable_avg_sum / avg.runnable_avg_period
> In addition, i take into account the number of tasks already in the
> runqueue in order to define the business of a CPU. A CPU with a load
> of 50% without any tasks in the runqeue in not busy at this time and
> we can migrate tasks on it but if the CPU already has 2 tasks in its
> runqueue, it means that newly wake up task will have to share the CPU
> with other tasks so we consider that the CPU is already busy and we
> will fall back to default behavior. The equation considers that a CPU
> is not busy if
> 100 * avg.runnable_avg_sum / avg.runnable_avg_period < 100 / (nr_running + 2)

I'm still somewhat confused by all this. So raising nr_running will lower the
required utilization to be considered busy. Suppose we have 50 tasks running,
all at 1% utilization (bulk wakeup); we'd consider the cpu busy, even though it's
picking its nose for half the time.


I'm assuming it's meant to limit process latency or so? Why are you concerned
with that? This seems like an arbitrary power vs performance tweak without
solid evidence it's needed or even wanted.
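
For reference, the arithmetic behind the 50-task example above, using the same
formula (illustrative numbers):

  utilization ~= 50 tasks * 1%       = ~50% of the period
  threshold    = period / (50 + 2)   = ~1.9% of the period
  50% > 1.9%  -> is_buddy_busy() returns true, despite ~50% idle time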
preeti April 26, 2013, 10:32 a.m. UTC | #25
Hi Peter,

On 04/26/2013 03:48 PM, Peter Zijlstra wrote:
> On Wed, Mar 27, 2013 at 03:51:51PM +0530, Preeti U Murthy wrote:
>> Hi,
>>
>> On 03/26/2013 05:56 PM, Peter Zijlstra wrote:
>>> On Fri, 2013-03-22 at 13:25 +0100, Vincent Guittot wrote:
>>>> +static bool is_buddy_busy(int cpu)
>>>> +{
>>>> +       struct rq *rq = cpu_rq(cpu);
>>>> +
>>>> +       /*
>>>> +        * A busy buddy is a CPU with a high load or a small load with a lot of
>>>> +        * running tasks.
>>>> +        */
>>>> +       return (rq->avg.runnable_avg_sum >
>>>> +                       (rq->avg.runnable_avg_period / (rq->nr_running + 2)));
>>>> +}
>>>
>>> Why does the comment talk about load but we don't see it in the
>>> equation. Also, why does nr_running matter at all? I thought we'd
>>> simply bother with utilization, if fully utilized we're done etc..
>>>
>>
>> Peter, lets say the run-queue has 50% utilization and is running 2
>> tasks. And we wish to find out if it is busy. We would compare this
>> metric with the cpu power, which lets say is 100.
>>
>> rq->util * 100 < cpu_of(rq)->power.
>>
>> In the above scenario would we declare the cpu _not_busy? Or would we do
>> the following:
>>
>> (rq->util * 100) * #nr_running <  cpu_of(rq)->power and conclude that it
>> is just enough _busy_ to not take on more processes?
> 
> That is just confused... ->power doesn't have anything to do with a per-cpu
> measure. ->power is a inter-cpu measure of relative compute capacity.

Ok.

> 
> Mixing in nr_running confuses things even more; it doesn't matter how many
> tasks it takes to push utilization up to 100%; once its there the cpu simply
> cannot run more.

True, this is from the perspective of the CPU. But won't the tasks on
this CPU get throttled if you find the utilization of this CPU < 100%
and decide to put more tasks on it?

Regards
Preeti U Murthy
Vincent Guittot April 26, 2013, 11:34 a.m. UTC | #26
On 26 April 2013 12:30, Peter Zijlstra <peterz@infradead.org> wrote:
> On Wed, Mar 27, 2013 at 12:00:40PM +0100, Vincent Guittot wrote:
>> On 27 March 2013 11:21, Preeti U Murthy <preeti@linux.vnet.ibm.com> wrote:
>> > Hi,
>> >
>> > On 03/26/2013 05:56 PM, Peter Zijlstra wrote:
>> >> On Fri, 2013-03-22 at 13:25 +0100, Vincent Guittot wrote:
>> >>> +static bool is_buddy_busy(int cpu)
>> >>> +{
>> >>> +       struct rq *rq = cpu_rq(cpu);
>> >>> +
>> >>> +       /*
>> >>> +        * A busy buddy is a CPU with a high load or a small load with a lot of
>> >>> +        * running tasks.
>> >>> +        */
>> >>> +       return (rq->avg.runnable_avg_sum >
>> >>> +                       (rq->avg.runnable_avg_period / (rq->nr_running + 2)));
>> >>> +}
>> >>
>> >> Why does the comment talk about load but we don't see it in the
>> >> equation. Also, why does nr_running matter at all? I thought we'd
>> >> simply bother with utilization, if fully utilized we're done etc..
>>
>> By load, I mean : 100 * avg.runnable_avg_sum / avg.runnable_avg_period
>> In addition, i take into account the number of tasks already in the
>> runqueue in order to define the business of a CPU. A CPU with a load
>> of 50% without any tasks in the runqeue in not busy at this time and
>> we can migrate tasks on it but if the CPU already has 2 tasks in its
>> runqueue, it means that newly wake up task will have to share the CPU
>> with other tasks so we consider that the CPU is already busy and we
>> will fall back to default behavior. The equation considers that a CPU
>> is not busy if
>> 100 * avg.runnable_avg_sum / avg.runnable_avg_period < 100 / (nr_running + 2)
>
> I'm still somewhat confused by all this. So raising nr_running will lower the
> required utilization to be considered busy. Suppose we have 50 tasks running,
> all at 1% utilization (bulk wakeup) we'd consider the cpu busy, even though its
> picking its nose for half the time.
>
>
> I'm assuming it's mean to limit process latency or so? Why are you concerned
> with that? This seems like an arbitrary power vs performance tweak without
> solid effidence its needed or even wanted.

Yes, the goal was to limit the wake-up latency: this version was only
trying to modify the scheduler behavior when the system was not busy, in
order to pack small tasks like background activities without decreasing
performance, so we were concerned about wakeup latency.

The new version proposes a more aggressive mode that packs all tasks
until the CPUs become full.

Vincent

Patch

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index b827e0c..21c35ce 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5662,6 +5662,7 @@  cpu_attach_domain(struct sched_domain *sd, struct root_domain *rd, int cpu)
 	rcu_assign_pointer(rq->sd, sd);
 	destroy_sched_domains(tmp, cpu);
 
+	update_packing_domain(cpu);
 	update_top_cache_domain(cpu);
 }
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 9c2f726..021c7b7 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -160,6 +160,76 @@  void sched_init_granularity(void)
 	update_sysctl();
 }
 
+
+#ifdef CONFIG_SMP
+/*
+ * Save the id of the optimal CPU that should be used to pack small tasks
+ * The value -1 is used when no buddy has been found
+ */
+DEFINE_PER_CPU(int, sd_pack_buddy);
+
+/*
+ * Look for the best buddy CPU that can be used to pack small tasks.
+ * We make the assumption that it isn't worth packing on CPUs that share the
+ * same power line. We look for the 1st sched_domain without the
+ * SD_SHARE_POWERDOMAIN flag. Then we look for the sched_group with the lowest
+ * power per core, based on the assumption that its power efficiency is
+ * better.
+ */
+void update_packing_domain(int cpu)
+{
+	struct sched_domain *sd;
+	int id = -1;
+
+	sd = highest_flag_domain(cpu, SD_SHARE_POWERDOMAIN);
+	if (!sd)
+		sd = rcu_dereference_check_sched_domain(cpu_rq(cpu)->sd);
+	else
+		sd = sd->parent;
+
+	while (sd && (sd->flags & SD_LOAD_BALANCE)
+		&& !(sd->flags & SD_SHARE_POWERDOMAIN)) {
+		struct sched_group *sg = sd->groups;
+		struct sched_group *pack = sg;
+		struct sched_group *tmp;
+
+		/*
+		 * The sched_domain of a CPU points to the local sched_group
+		 * and the 1st CPU of this local group is a good candidate
+		 */
+		id = cpumask_first(sched_group_cpus(pack));
+
+		/* loop the sched groups to find the best one */
+		for (tmp = sg->next; tmp != sg; tmp = tmp->next) {
+			if (tmp->sgp->power * pack->group_weight >
+					pack->sgp->power * tmp->group_weight)
+				continue;
+
+			if ((tmp->sgp->power * pack->group_weight ==
+					pack->sgp->power * tmp->group_weight)
+			 && (cpumask_first(sched_group_cpus(tmp)) >= id))
+				continue;
+
+			/* we have found a better group */
+			pack = tmp;
+
+			/* Take the 1st CPU of the new group */
+			id = cpumask_first(sched_group_cpus(pack));
+		}
+
+		/* Look for another CPU than itself */
+		if (id != cpu)
+			break;
+
+		sd = sd->parent;
+	}
+
+	pr_debug("CPU%d packing on CPU%d\n", cpu, id);
+	per_cpu(sd_pack_buddy, cpu) = id;
+}
+
+#endif /* CONFIG_SMP */
+
 #if BITS_PER_LONG == 32
 # define WMULT_CONST	(~0UL)
 #else
@@ -3291,6 +3361,47 @@  done:
 	return target;
 }
 
+static bool is_buddy_busy(int cpu)
+{
+	struct rq *rq = cpu_rq(cpu);
+
+	/*
+	 * A busy buddy is a CPU with a high load or a small load with a lot of
+	 * running tasks.
+	 */
+	return (rq->avg.runnable_avg_sum >
+			(rq->avg.runnable_avg_period / (rq->nr_running + 2)));
+}
+
+static bool is_light_task(struct task_struct *p)
+{
+	/* A light task runs less than 20% of the time on average */
+	return ((p->se.avg.runnable_avg_sum  * 5) <
+			(p->se.avg.runnable_avg_period));
+}
+
+static int check_pack_buddy(int cpu, struct task_struct *p)
+{
+	int buddy = per_cpu(sd_pack_buddy, cpu);
+
+	/* No pack buddy for this CPU */
+	if (buddy == -1)
+		return false;
+
+	/* buddy is not an allowed CPU */
+	if (!cpumask_test_cpu(buddy, tsk_cpus_allowed(p)))
+		return false;
+
+	/*
+	 * If the task is a small one and the buddy is not overloaded,
+	 * we use buddy cpu
+	 */
+	if (!is_light_task(p) || is_buddy_busy(buddy))
+		return false;
+
+	return true;
+}
+
 /*
  * sched_balance_self: balance the current task (running on cpu) in domains
  * that have the 'flag' flag set. In practice, this is SD_BALANCE_FORK and
@@ -3319,6 +3430,10 @@  select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
 		if (cpumask_test_cpu(cpu, tsk_cpus_allowed(p)))
 			want_affine = 1;
 		new_cpu = prev_cpu;
+
+		/* We pack only at wake up, not for a new task */
+		if (check_pack_buddy(new_cpu, p))
+			return per_cpu(sd_pack_buddy, new_cpu);
 	}
 
 	rcu_read_lock();
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 7f36024f..96b164d 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -872,6 +872,7 @@  extern const struct sched_class idle_sched_class;
 
 extern void trigger_load_balance(struct rq *rq, int cpu);
 extern void idle_balance(int this_cpu, struct rq *this_rq);
+extern void update_packing_domain(int cpu);
 
 #else	/* CONFIG_SMP */
 
@@ -879,6 +880,10 @@  static inline void idle_balance(int cpu, struct rq *rq)
 {
 }
 
+static inline void update_packing_domain(int cpu)
+{
+}
+
 #endif
 
 extern void sysrq_sched_debug_show(void);