Message ID | 1363955155-18382-4-git-send-email-vincent.guittot@linaro.org (mailing list archive) |
---|---|
State | New, archived |
On Fri, 2013-03-22 at 13:25 +0100, Vincent Guittot wrote:
> +static bool is_buddy_busy(int cpu)
> +{
> +	struct rq *rq = cpu_rq(cpu);
> +
> +	/*
> +	 * A busy buddy is a CPU with a high load or a small load with a lot of
> +	 * running tasks.
> +	 */
> +	return (rq->avg.runnable_avg_sum >
> +		(rq->avg.runnable_avg_period / (rq->nr_running + 2)));
> +}

Why does the comment talk about load when we don't see it in the
equation? Also, why does nr_running matter at all? I thought we'd
simply bother with utilization: if fully utilized, we're done, etc.
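For illustration, the quoted inequality can be exercised outside the
kernel. Below is a minimal standalone sketch; the struct and the numbers
are hypothetical stand-ins for the rq fields above, not part of the
patch:

#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-ins for the rq fields used in the patch. */
struct rq_sample {
	unsigned long runnable_avg_sum;    /* time spent runnable */
	unsigned long runnable_avg_period; /* total tracked time */
	unsigned int  nr_running;          /* tasks on the runqueue */
};

/* Same inequality as the patch: the busy threshold shrinks as
 * nr_running grows, so a loaded queue is declared busy sooner. */
static bool is_buddy_busy_sample(const struct rq_sample *rq)
{
	return rq->runnable_avg_sum >
	       rq->runnable_avg_period / (rq->nr_running + 2);
}

int main(void)
{
	struct rq_sample empty  = { 50, 100, 0 }; /* 50% util, no queued tasks */
	struct rq_sample queued = { 50, 100, 2 }; /* 50% util, 2 queued tasks  */

	/* 50 > 100/2 is false; 50 > 100/4 is true */
	printf("empty: %d, queued: %d\n",
	       is_buddy_busy_sample(&empty), is_buddy_busy_sample(&queued));
	return 0;
}

With an empty runqueue a CPU at 50% utilization is still considered
available; with two queued tasks the same 50% crosses the threshold,
which is the behaviour Vincent defends later in the thread.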
On Fri, 2013-03-22 at 13:25 +0100, Vincent Guittot wrote:
> +static bool is_light_task(struct task_struct *p)
> +{
> +	/* A light task runs less than 20% on average */
> +	return ((p->se.avg.runnable_avg_sum * 5) <
> +		(p->se.avg.runnable_avg_period));
> +}

OK, so we have a 'problem' here: we initialize runnable_avg_* to 0, but
we want to 'assume' a fresh task is fully 'loaded'. IIRC Alex ran into
this as well.

PJT, do you have any sane solution for this? I forgot what the result
of the last discussion was -- was there any?
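The 20% figure in the comment is just the factor of 5 inverted. A quick
worked check with hypothetical averages:

#include <stdbool.h>
#include <stdio.h>

int main(void)
{
	/* Hypothetical runnable averages around the 20% boundary. */
	unsigned long period    = 600;
	unsigned long sum_light = 100;	/* ~16.7% runnable */
	unsigned long sum_edge  = 120;	/* exactly 20% runnable */

	bool light = (sum_light * 5) < period;	/* 500 < 600 -> light     */
	bool edge  = (sum_edge * 5) < period;	/* 600 < 600 -> not light */

	printf("light: %d, edge: %d\n", light, edge);
	return 0;
}

Note that a task at exactly 20% already fails the test, so "less than
20%" is the precise reading.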
On Fri, 2013-03-22 at 13:25 +0100, Vincent Guittot wrote:
> During the creation of sched_domain, we define a pack buddy CPU for each CPU
> when one is available. We want to pack at all levels where a group of CPUs
> can be power gated independently from others.
> On a system that can't power gate a group of CPUs independently, the flag is
> set at all sched_domain levels and the buddy is set to -1. This is the
> default behavior.
> On a dual-cluster / dual-core system which can power gate each core and
> cluster independently, the buddy configuration will be:
>
>       | Cluster 0   | Cluster 1   |
>       | CPU0 | CPU1 | CPU2 | CPU3 |
> -----------------------------------
> buddy | CPU0 | CPU0 | CPU0 | CPU2 |

I suppose this is adequate for the 'small' systems you currently have;
but given that Samsung is already bragging about its 'octo'-core Exynos
5 (a 4+4 big-LITTLE thing), does this solution scale?

Isn't this basically related to picking the NO_HZ cpu? If the system
isn't fully symmetric with its power gates you want the NO_HZ cpu to be
the 'special' cpu. If it is symmetric we really don't care which core
is left 'running', and we can even select a new pack cpu from the idle
cores once the old one is fully utilized.

Re-using (or integrating) with NO_HZ has the dual advantage that you'll
make NO_HZ do the right thing for big-LITTLE (you typically want a
little core to be the one staying 'awake'), and once someone makes
NO_HZ scale, this all gets to scale along with it.
On 26 March 2013 13:37, Peter Zijlstra <peterz@infradead.org> wrote:
> On Fri, 2013-03-22 at 13:25 +0100, Vincent Guittot wrote:
>> +static bool is_light_task(struct task_struct *p)
>> +{
>> +	/* A light task runs less than 20% on average */
>> +	return ((p->se.avg.runnable_avg_sum * 5) <
>> +		(p->se.avg.runnable_avg_period));
>> +}
>
> OK, so we have a 'problem' here: we initialize runnable_avg_* to 0, but
> we want to 'assume' a fresh task is fully 'loaded'. IIRC Alex ran into
> this as well.

Hi Peter,

The packing of small tasks is only applied at wakeup, not during fork
or exec, so the runnable_avg_* fields should already have been
initialized. As you mentioned, we assume that a fresh task is fully
loaded and let the default scheduler behavior select a target CPU.

Vincent

> PJT, do you have any sane solution for this? I forgot what the result
> of the last discussion was -- was there any?
On 26 March 2013 13:46, Peter Zijlstra <peterz@infradead.org> wrote:
> On Fri, 2013-03-22 at 13:25 +0100, Vincent Guittot wrote:
>> During the creation of sched_domain, we define a pack buddy CPU for each CPU
>> when one is available. We want to pack at all levels where a group of CPUs
>> can be power gated independently from others.
>> On a system that can't power gate a group of CPUs independently, the flag is
>> set at all sched_domain levels and the buddy is set to -1. This is the
>> default behavior.
>> On a dual-cluster / dual-core system which can power gate each core and
>> cluster independently, the buddy configuration will be:
>>
>>       | Cluster 0   | Cluster 1   |
>>       | CPU0 | CPU1 | CPU2 | CPU3 |
>> -----------------------------------
>> buddy | CPU0 | CPU0 | CPU0 | CPU2 |
>
> I suppose this is adequate for the 'small' systems you currently have;
> but given that Samsung is already bragging about its 'octo'-core Exynos
> 5 (a 4+4 big-LITTLE thing), does this solution scale?

The packing is only done at the MC and CPU levels to minimize the
number of transitions.

> Isn't this basically related to picking the NO_HZ cpu? If the system
> isn't fully symmetric with its power gates you want the NO_HZ cpu to be
> the 'special' cpu. If it is symmetric we really don't care which core
> is left 'running', and we can even select a new pack cpu from the idle
> cores once the old one is fully utilized.

I agree that on a symmetric system we don't really care which core is
selected, but we want to use the same one whenever possible to prevent
a ping-pong between several cores or groups of cores, which is power
consuming. By forcing a NO_HZ cpu, your background activity will
smoothly pack on this CPU and will not be spread across the system.
When a CPU is fully loaded, we are no longer in a low-load use case,
and the periodic load balance can handle the situation and select a
new target CPU close to the buddy CPU.

> Re-using (or integrating) with NO_HZ has the dual advantage that you'll
> make NO_HZ do the right thing for big-LITTLE (you typically want a
> little core to be the one staying 'awake'), and once someone makes
> NO_HZ scale, this all gets to scale along with it.

I think you have answered this question in your comment on patch 5,
haven't you?

Vincent
> Isn't this basically related to picking the NO_HZ cpu? If the system
> isn't fully symmetric with its power gates you want the NO_HZ cpu to be
> the 'special' cpu. If it is symmetric we really don't care which core
> is left 'running', and we can even select a new pack cpu from the idle
> cores once the old one is fully utilized.

You don't really care much, sure, but there are some advantages to
sorting "all the way left", e.g. to Linux cpu 0. Some tasks only run
there, and interrupts tend to be favored to that cpu as well on x86.
Hi Peter,

On 03/26/2013 06:07 PM, Peter Zijlstra wrote:
> On Fri, 2013-03-22 at 13:25 +0100, Vincent Guittot wrote:
>> +static bool is_light_task(struct task_struct *p)
>> +{
>> +	/* A light task runs less than 20% on average */
>> +	return ((p->se.avg.runnable_avg_sum * 5) <
>> +		(p->se.avg.runnable_avg_period));
>> +}
>
> OK, so we have a 'problem' here: we initialize runnable_avg_* to 0, but
> we want to 'assume' a fresh task is fully 'loaded'. IIRC Alex ran into
> this as well.
>
> PJT, do you have any sane solution for this? I forgot what the result
> of the last discussion was -- was there any?

The conclusion of the last discussion between PJT and Alex was that the
load contribution of a fresh task should be set to "full" during
__sched_fork().

Setting task->se.avg.load_avg_contrib = task->se.load.weight during
__sched_fork() is reflected in the latest power-aware scheduler
patchset by Alex.

Thanks

Regards
Preeti U Murthy
On 03/27/2013 12:33 PM, Preeti U Murthy wrote:
> Hi Peter,
>
> On 03/26/2013 06:07 PM, Peter Zijlstra wrote:
>> On Fri, 2013-03-22 at 13:25 +0100, Vincent Guittot wrote:
>>> +static bool is_light_task(struct task_struct *p)
>>> [...]
>>
>> OK, so we have a 'problem' here: we initialize runnable_avg_* to 0, but
>> we want to 'assume' a fresh task is fully 'loaded'. IIRC Alex ran into
>> this as well.
>>
>> PJT, do you have any sane solution for this? I forgot what the result
>> of the last discussion was -- was there any?
>
> The conclusion of the last discussion between PJT and Alex was that the
> load contribution of a fresh task should be set to "full" during
> __sched_fork().
>
> Setting task->se.avg.load_avg_contrib = task->se.load.weight during
> __sched_fork() is reflected in the latest power-aware scheduler
> patchset by Alex.

Yes, the runnable load of a newly forked task is set to full
utilisation in the V5 power-aware scheduling series; PJT, Mike and I
all agree on this. PJT discussed just how to give the full load to a
newly forked task, and we reached agreement for my coming V6
power-aware scheduling patchset.

> Thanks
>
> Regards
> Preeti U Murthy
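A minimal sketch of the agreed initialization, assuming the field names
quoted in this thread; the stand-in structs and the helper name are
hypothetical, and Alex's actual patch may touch more state:

/* Hypothetical stand-ins mirroring se.load.weight and se.avg. */
struct load_weight_sample { unsigned long weight; };
struct sched_avg_sample   { unsigned long load_avg_contrib; };
struct se_sample {
	struct load_weight_sample load;
	struct sched_avg_sample   avg;
};

/* At __sched_fork() time, credit the fresh task with its full load
 * so it is not mistaken for a light task before it has any history. */
static void seed_fresh_task_load(struct se_sample *se)
{
	se->avg.load_avg_contrib = se->load.weight;
}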
On Tue, 2013-03-26 at 08:29 -0700, Arjan van de Ven wrote:
> > Isn't this basically related to picking the NO_HZ cpu? If the system
> > isn't fully symmetric with its power gates you want the NO_HZ cpu to be
> > the 'special' cpu. If it is symmetric we really don't care which core
> > is left 'running', and we can even select a new pack cpu from the idle
> > cores once the old one is fully utilized.
>
> You don't really care much, sure, but there are some advantages to
> sorting "all the way left", e.g. to Linux cpu 0. Some tasks only run
> there, and interrupts tend to be favored to that cpu as well on x86.

Right, and I suspect all the big-LITTLE nonsense will have the little
cores on low numbers as well (is this architected, or can a creative
licensee screw us over?).

So find_new_ilb() already does cpumask_first(), so it has a strong
leftmost preference. We just need to make sure it indeed does the right
thing and doesn't have some unintended side effect.
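For reference, the find_new_ilb() of that era looks roughly like the
following (paraphrased from the kernel/sched/fair.c of the time;
details may differ), which is where the leftmost preference comes from:

/* Rough shape of the era's ilb selection: cpumask_first() returns the
 * lowest-numbered idle CPU, hence the strong leftmost preference. */
static inline int find_new_ilb(int call_cpu)
{
	int ilb = cpumask_first(nohz.idle_cpus_mask);

	if (ilb < nr_cpu_ids && idle_cpu(ilb))
		return ilb;

	return nr_cpu_ids;
}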
On Wed, 2013-03-27 at 12:48 +0800, Alex Shi wrote:
> Yes, the runnable load of a newly forked task is set to full
> utilisation in the V5 power-aware scheduling series; PJT, Mike and I
> all agree on this. PJT discussed just how to give the full load to a
> newly forked task, and we reached agreement for my coming V6
> power-aware scheduling patchset.

Great! That means I can mark the v5 thread read! :-)
On 27 March 2013 09:46, Peter Zijlstra <peterz@infradead.org> wrote:
> On Tue, 2013-03-26 at 08:29 -0700, Arjan van de Ven wrote:
>> You don't really care much, sure, but there are some advantages to
>> sorting "all the way left", e.g. to Linux cpu 0. Some tasks only run
>> there, and interrupts tend to be favored to that cpu as well on x86.
>
> Right, and I suspect all the big-LITTLE nonsense will have the little
> cores on low numbers as well (is this architected, or can a creative
> licensee screw us over?).

It's not mandatory to have little cores on low numbers, even if it's
advised.

> So find_new_ilb() already does cpumask_first(), so it has a strong
> leftmost preference. We just need to make sure it indeed does the right
> thing and doesn't have some unintended side effect.
On Wed, 2013-03-27 at 09:54 +0100, Vincent Guittot wrote:
> It's not mandatory to have little cores on low numbers, even if it's
> advised.

ARGH!
Hi,

On 03/26/2013 05:56 PM, Peter Zijlstra wrote:
> On Fri, 2013-03-22 at 13:25 +0100, Vincent Guittot wrote:
>> +static bool is_buddy_busy(int cpu)
>> +{
>> +	struct rq *rq = cpu_rq(cpu);
>> +
>> +	/*
>> +	 * A busy buddy is a CPU with a high load or a small load with a lot of
>> +	 * running tasks.
>> +	 */
>> +	return (rq->avg.runnable_avg_sum >
>> +		(rq->avg.runnable_avg_period / (rq->nr_running + 2)));
>> +}
>
> Why does the comment talk about load when we don't see it in the
> equation? Also, why does nr_running matter at all? I thought we'd
> simply bother with utilization: if fully utilized, we're done, etc.

Peter, let's say the run-queue has 50% utilization and is running 2
tasks, and we wish to find out if it is busy. We would compare this
metric with the cpu power, which let's say is 100:

	rq->util * 100 < cpu_of(rq)->power

In the above scenario, would we declare the cpu _not_busy? Or would we
do the following and conclude that it is just enough _busy_ to not take
on more processes?

	(rq->util * 100) * #nr_running < cpu_of(rq)->power

@Vincent: Yes, the comment above needs to be fixed. A busy buddy is a
CPU with *high rq utilization*, as far as the equation goes.

Regards
Preeti U Murthy
On 27 March 2013 11:21, Preeti U Murthy <preeti@linux.vnet.ibm.com> wrote:
> Hi,
>
> On 03/26/2013 05:56 PM, Peter Zijlstra wrote:
>> On Fri, 2013-03-22 at 13:25 +0100, Vincent Guittot wrote:
>>> +static bool is_buddy_busy(int cpu)
>>> +{
>>> [...]
>>> +	return (rq->avg.runnable_avg_sum >
>>> +		(rq->avg.runnable_avg_period / (rq->nr_running + 2)));
>>> +}
>>
>> Why does the comment talk about load when we don't see it in the
>> equation? Also, why does nr_running matter at all? I thought we'd
>> simply bother with utilization: if fully utilized, we're done, etc.
>
> Peter, let's say the run-queue has 50% utilization and is running 2
> tasks, and we wish to find out if it is busy. We would compare this
> metric with the cpu power, which let's say is 100:
>
> 	rq->util * 100 < cpu_of(rq)->power

I don't use cpu_of(rq)->power in the definition of busyness.

> In the above scenario, would we declare the cpu _not_busy? Or would we
> do the following and conclude that it is just enough _busy_ to not take
> on more processes?

In the above scenario, the CPU is busy.

By load, I mean: 100 * avg.runnable_avg_sum / avg.runnable_avg_period.
In addition, I take into account the number of tasks already in the
runqueue in order to define the busyness of a CPU. A CPU with a load of
50% without any tasks in the runqueue is not busy at this time, and we
can migrate tasks to it; but if the CPU already has 2 tasks in its
runqueue, a newly woken task would have to share the CPU with the other
tasks, so we consider the CPU already busy and fall back to the default
behavior. The equation considers that a CPU is not busy if

	100 * avg.runnable_avg_sum / avg.runnable_avg_period
		< 100 / (nr_running + 2)

> 	(rq->util * 100) * #nr_running < cpu_of(rq)->power
>
> @Vincent: Yes, the comment above needs to be fixed. A busy buddy is a
> CPU with *high rq utilization*, as far as the equation goes.

I can update the comment. Is the comment below clearer?

	/*
	 * A busy buddy is a CPU with a high average running time, or a
	 * small average running time but a lot of running tasks in its
	 * runqueue which are already sharing this CPU's time.
	 */

Vincent
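Spelling that threshold out for small queue lengths makes the behaviour
concrete:

	nr_running = 0  ->  busy above 100 / 2 = 50% utilization
	nr_running = 1  ->  busy above 100 / 3 ~ 33% utilization
	nr_running = 2  ->  busy above 100 / 4 = 25% utilization

so Vincent's 50%-load example sits exactly on the (not busy) edge with
an empty runqueue and is clearly busy once two tasks are queued.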
On Wed, Mar 27, 2013 at 09:00:20AM +0000, Peter Zijlstra wrote:
> On Wed, 2013-03-27 at 09:54 +0100, Vincent Guittot wrote:
> > It's not mandatory to have little cores on low numbers, even if it's
> > advised.
>
> ARGH!

I haven't followed this thread closely, so just a random comment from
me. An argument from some is that they want to boot Linux on the big
CPU so that booting is quicker. The most convenient way is to have the
big CPU at 0.
On Wed, 2013-03-27 at 11:18 +0000, Catalin Marinas wrote:
> On Wed, Mar 27, 2013 at 09:00:20AM +0000, Peter Zijlstra wrote:
> > On Wed, 2013-03-27 at 09:54 +0100, Vincent Guittot wrote:
> > > It's not mandatory to have little cores on low numbers, even if it's
> > > advised.
> >
> > ARGH!
>
> I haven't followed this thread closely, so just a random comment from
> me. An argument from some is that they want to boot Linux on the big
> CPU so that booting is quicker. The most convenient way is to have the
> big CPU at 0.

I suppose that's almost sensible ;-) I just despair at the amount of
variation that's allowed.

I'm guessing that swapping cpus in the bootloader or someplace really
early is equally hard, in that we (Linux) assume we boot on cpu 0 or
something like that?
On Wed, 27 Mar 2013, Vincent Guittot wrote:
> On 27 March 2013 09:46, Peter Zijlstra <peterz@infradead.org> wrote:
> > Right, and I suspect all the big-LITTLE nonsense will have the little
> > cores on low numbers as well (is this architected, or can a creative
> > licensee screw us over?).
>
> It's not mandatory to have little cores on low numbers, even if it's
> advised.

We can trivially move things around in the logical CPU mapping if that
simplifies things. However, the boot CPU might not be CPU0 in that
case, which might be less trivial.

Nicolas
On Wed, Mar 27, 2013 at 02:13:54PM +0000, Peter Zijlstra wrote:
> On Wed, 2013-03-27 at 11:18 +0000, Catalin Marinas wrote:
> > I haven't followed this thread closely, so just a random comment from
> > me. An argument from some is that they want to boot Linux on the big
> > CPU so that booting is quicker. The most convenient way is to have the
> > big CPU at 0.
>
> I suppose that's almost sensible ;-) I just despair at the amount of
> variation that's allowed.
>
> I'm guessing that swapping cpus in the bootloader or someplace really
> early is equally hard, in that we (Linux) assume we boot on cpu 0 or
> something like that?

I think it's worth trying (by changing the CPU topology in the DT). At
a quick look, I don't see anything hard-coded in the kernel boot
sequence. It uses smp_processor_id(), which translates to
current_thread_info()->cpu on ARM. I'm not sure how early we need this,
but it's probably after DT parsing, so we could set 'cpu' to a non-zero
value for the booting processor. There are a few tweaks in the arch/arm
code with the cpu_logical_map setup (which maps between
smp_processor_id() and the actual hardware CPU id and assumes 0 is the
booting CPU).

So if the above works, the scheduler guys can mandate that little CPUs
are always first, and for ARM it would be a matter of getting the right
CPU topology in the DT (independent of what hw vendors think of CPU
topology) and booting Linux on CPU 4, etc.
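A rough sketch of the logical-map tweak being discussed, assuming an
arm-style array where logical_map[logical_cpu] holds the hardware CPU
id; the helper and the choice of target slot are illustrative, not from
this thread:

/* Move the boot CPU's hardware id into a chosen logical slot, e.g.
 * logical CPU 4 on a 4+4 big-LITTLE system, so the little cores keep
 * the low logical numbers. (Illustrative only.) */
static void move_boot_cpu(u32 *logical_map, unsigned int nr_cpus,
			  u32 boot_hwid, unsigned int want)
{
	unsigned int i;

	for (i = 0; i < nr_cpus; i++) {
		if (logical_map[i] == boot_hwid) {
			u32 tmp = logical_map[want];

			logical_map[want] = boot_hwid;
			logical_map[i] = tmp;
			break;
		}
	}
}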
On Wed, 27 Mar 2013, Catalin Marinas wrote:
> So if the above works, the scheduler guys can mandate that little CPUs
> are always first, and for ARM it would be a matter of getting the right
> CPU topology in the DT (independent of what hw vendors think of CPU
> topology) and booting Linux on CPU 4, etc.

Just a note about that: if the scheduler mandates little CPUs first,
that should _not_ have any implications on the DT content. DT is not
about encoding Linux-specific implementation details. It is simple
enough to tweak the CPU logical map at run time when enumerating CPUs.

Nicolas
On 27 March 2013 17:36, Catalin Marinas <catalin.marinas@arm.com> wrote:
> On Wed, Mar 27, 2013 at 02:13:54PM +0000, Peter Zijlstra wrote:
>> I'm guessing that swapping cpus in the bootloader or someplace really
>> early is equally hard, in that we (Linux) assume we boot on cpu 0 or
>> something like that?
>
> I think it's worth trying (by changing the CPU topology in the DT). At
> a quick look, I don't see anything hard-coded in the kernel boot
> sequence. It uses smp_processor_id(), which translates to
> current_thread_info()->cpu on ARM. I'm not sure how early we need this,
> but it's probably after DT parsing, so we could set 'cpu' to a non-zero
> value for the booting processor. There are a few tweaks in the arch/arm
> code with the cpu_logical_map setup (which maps between
> smp_processor_id() and the actual hardware CPU id and assumes 0 is the
> booting CPU).

There are a few other places in the code that make the assumption that
the 1st online CPU is the boot CPU, like disable_nonboot_cpus().

> So if the above works, the scheduler guys can mandate that little CPUs
> are always first, and for ARM it would be a matter of getting the right
> CPU topology in the DT (independent of what hw vendors think of CPU
> topology) and booting Linux on CPU 4, etc.
On Wed, Mar 27, 2013 at 05:18:53PM +0000, Nicolas Pitre wrote:
> On Wed, 27 Mar 2013, Catalin Marinas wrote:
>
> > So if the above works, the scheduler guys can mandate that little CPUs
> > are always first, and for ARM it would be a matter of getting the right
> > CPU topology in the DT (independent of what hw vendors think of CPU
> > topology) and booting Linux on CPU 4, etc.
>
> Just a note about that: if the scheduler mandates little CPUs first,
> that should _not_ have any implications on the DT content. DT is not
> about encoding Linux-specific implementation details. It is simple
> enough to tweak the CPU logical map at run time when enumerating CPUs.

You are right, though a simpler way (hack) to tweak the cpu_logical_map
is to change the DT ;). But the problem is that the kernel doesn't know
which CPU is big and which is little, unless you specify this in some
way via the DT. It can be the cpu node order or some other means.
On Wed, Mar 27, 2013 at 05:20:28PM +0000, Vincent Guittot wrote:
> On 27 March 2013 17:36, Catalin Marinas <catalin.marinas@arm.com> wrote:
> > I think it's worth trying (by changing the CPU topology in the DT). At
> > a quick look, I don't see anything hard-coded in the kernel boot
> > sequence. It uses smp_processor_id(), which translates to
> > current_thread_info()->cpu on ARM. I'm not sure how early we need this,
> > but it's probably after DT parsing, so we could set 'cpu' to a non-zero
> > value for the booting processor. There are a few tweaks in the arch/arm
> > code with the cpu_logical_map setup (which maps between
> > smp_processor_id() and the actual hardware CPU id and assumes 0 is the
> > booting CPU).
>
> There are a few other places in the code that make the assumption that
> the 1st online CPU is the boot CPU, like disable_nonboot_cpus().

Another thing: boot_cpu_init() uses smp_processor_id() before we get to
the setup_arch() call where we parse the DT.
On Wed, Mar 27, 2013 at 03:51:51PM +0530, Preeti U Murthy wrote:
> Hi,
>
> On 03/26/2013 05:56 PM, Peter Zijlstra wrote:
> > On Fri, 2013-03-22 at 13:25 +0100, Vincent Guittot wrote:
> >> +static bool is_buddy_busy(int cpu)
> >> +{
> >> [...]
> >> +	return (rq->avg.runnable_avg_sum >
> >> +		(rq->avg.runnable_avg_period / (rq->nr_running + 2)));
> >> +}
> >
> > Why does the comment talk about load when we don't see it in the
> > equation? Also, why does nr_running matter at all? I thought we'd
> > simply bother with utilization: if fully utilized, we're done, etc.
>
> Peter, let's say the run-queue has 50% utilization and is running 2
> tasks, and we wish to find out if it is busy. We would compare this
> metric with the cpu power, which let's say is 100:
>
> 	rq->util * 100 < cpu_of(rq)->power
>
> In the above scenario, would we declare the cpu _not_busy? Or would we
> do the following and conclude that it is just enough _busy_ to not take
> on more processes?
>
> 	(rq->util * 100) * #nr_running < cpu_of(rq)->power

That is just confused... ->power doesn't have anything to do with a
per-cpu measure. ->power is an inter-cpu measure of relative compute
capacity.

Mixing in nr_running confuses things even more; it doesn't matter how
many tasks it takes to push utilization up to 100%; once it's there,
the cpu simply cannot run more.

So colour me properly confused...
On Wed, Mar 27, 2013 at 12:00:40PM +0100, Vincent Guittot wrote:
> On 27 March 2013 11:21, Preeti U Murthy <preeti@linux.vnet.ibm.com> wrote:
> > On 03/26/2013 05:56 PM, Peter Zijlstra wrote:
> >> On Fri, 2013-03-22 at 13:25 +0100, Vincent Guittot wrote:
> >>> +static bool is_buddy_busy(int cpu)
> >>> +{
> >>> [...]
> >>> +	return (rq->avg.runnable_avg_sum >
> >>> +		(rq->avg.runnable_avg_period / (rq->nr_running + 2)));
> >>> +}
> >>
> >> Why does the comment talk about load when we don't see it in the
> >> equation? Also, why does nr_running matter at all? I thought we'd
> >> simply bother with utilization: if fully utilized, we're done, etc.
>
> By load, I mean: 100 * avg.runnable_avg_sum / avg.runnable_avg_period.
> In addition, I take into account the number of tasks already in the
> runqueue in order to define the busyness of a CPU. A CPU with a load of
> 50% without any tasks in the runqueue is not busy at this time, and we
> can migrate tasks to it; but if the CPU already has 2 tasks in its
> runqueue, a newly woken task would have to share the CPU with the other
> tasks, so we consider the CPU already busy and fall back to the default
> behavior. The equation considers that a CPU is not busy if
>
> 	100 * avg.runnable_avg_sum / avg.runnable_avg_period
> 		< 100 / (nr_running + 2)

I'm still somewhat confused by all this. So raising nr_running will
lower the required utilization to be considered busy. Suppose we have
50 tasks running, all at 1% utilization (bulk wakeup): we'd consider
the cpu busy, even though it's picking its nose for half the time.

I'm assuming it's meant to limit process latency or so? Why are you
concerned with that? This seems like an arbitrary power vs performance
tweak without solid evidence it's needed or even wanted.
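Working Peter's example through the quoted inequality: 50 tasks at 1%
utilization each sum to roughly 50% of the period, while the busy
threshold has shrunk to 100 / (50 + 2) ~ 1.9%, so the CPU is declared
busy even though it sits idle about half the time.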
Hi Peter,

On 04/26/2013 03:48 PM, Peter Zijlstra wrote:
> On Wed, Mar 27, 2013 at 03:51:51PM +0530, Preeti U Murthy wrote:
>> Peter, let's say the run-queue has 50% utilization and is running 2
>> tasks, and we wish to find out if it is busy. We would compare this
>> metric with the cpu power, which let's say is 100:
>>
>> 	rq->util * 100 < cpu_of(rq)->power
>>
>> In the above scenario, would we declare the cpu _not_busy? Or would we
>> do the following and conclude that it is just enough _busy_ to not take
>> on more processes?
>>
>> 	(rq->util * 100) * #nr_running < cpu_of(rq)->power
>
> That is just confused... ->power doesn't have anything to do with a
> per-cpu measure. ->power is an inter-cpu measure of relative compute
> capacity.

Ok.

> Mixing in nr_running confuses things even more; it doesn't matter how
> many tasks it takes to push utilization up to 100%; once it's there,
> the cpu simply cannot run more.

True, this is from the perspective of the CPU. But will the tasks on
this CPU not get throttled if you find the utilization of this CPU
< 100% and decide to put more tasks on it?

Regards
Preeti U Murthy
On 26 April 2013 12:30, Peter Zijlstra <peterz@infradead.org> wrote:
> On Wed, Mar 27, 2013 at 12:00:40PM +0100, Vincent Guittot wrote:
>> By load, I mean: 100 * avg.runnable_avg_sum / avg.runnable_avg_period.
>> In addition, I take into account the number of tasks already in the
>> runqueue in order to define the busyness of a CPU. [...] The equation
>> considers that a CPU is not busy if
>>
>> 	100 * avg.runnable_avg_sum / avg.runnable_avg_period
>> 		< 100 / (nr_running + 2)
>
> I'm still somewhat confused by all this. So raising nr_running will
> lower the required utilization to be considered busy. Suppose we have
> 50 tasks running, all at 1% utilization (bulk wakeup): we'd consider
> the cpu busy, even though it's picking its nose for half the time.
>
> I'm assuming it's meant to limit process latency or so? Why are you
> concerned with that? This seems like an arbitrary power vs performance
> tweak without solid evidence it's needed or even wanted.

Yes, the goal was to limit wakeup latency. This version only tried to
modify the scheduler behavior when the system was not busy, in order to
pack small tasks like background activities without decreasing
performance, so we were concerned about wakeup latency. The new version
proposes a more aggressive mode that packs all tasks until the CPUs
become full.

Vincent
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index b827e0c..21c35ce 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5662,6 +5662,7 @@ cpu_attach_domain(struct sched_domain *sd, struct root_domain *rd, int cpu)
 	rcu_assign_pointer(rq->sd, sd);
 	destroy_sched_domains(tmp, cpu);
 
+	update_packing_domain(cpu);
 	update_top_cache_domain(cpu);
 }
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 9c2f726..021c7b7 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -160,6 +160,76 @@ void sched_init_granularity(void)
 	update_sysctl();
 }
 
+
+#ifdef CONFIG_SMP
+/*
+ * Save the id of the optimal CPU that should be used to pack small tasks.
+ * The value -1 is used when no buddy has been found.
+ */
+DEFINE_PER_CPU(int, sd_pack_buddy);
+
+/*
+ * Look for the best buddy CPU that can be used to pack small tasks.
+ * We make the assumption that it isn't worth packing on CPUs that share the
+ * same power line. We look for the 1st sched_domain without the
+ * SD_SHARE_POWERDOMAIN flag. Then we look for the sched_group with the lowest
+ * power per core, based on the assumption that its power efficiency is
+ * better.
+ */
+void update_packing_domain(int cpu)
+{
+	struct sched_domain *sd;
+	int id = -1;
+
+	sd = highest_flag_domain(cpu, SD_SHARE_POWERDOMAIN);
+	if (!sd)
+		sd = rcu_dereference_check_sched_domain(cpu_rq(cpu)->sd);
+	else
+		sd = sd->parent;
+
+	while (sd && (sd->flags & SD_LOAD_BALANCE)
+		&& !(sd->flags & SD_SHARE_POWERDOMAIN)) {
+		struct sched_group *sg = sd->groups;
+		struct sched_group *pack = sg;
+		struct sched_group *tmp;
+
+		/*
+		 * The sched_domain of a CPU points to the local sched_group,
+		 * and the 1st CPU of this local group is a good candidate.
+		 */
+		id = cpumask_first(sched_group_cpus(pack));
+
+		/* loop over the sched groups to find the best one */
+		for (tmp = sg->next; tmp != sg; tmp = tmp->next) {
+			if (tmp->sgp->power * pack->group_weight >
+					pack->sgp->power * tmp->group_weight)
+				continue;
+
+			if ((tmp->sgp->power * pack->group_weight ==
+					pack->sgp->power * tmp->group_weight)
+			 && (cpumask_first(sched_group_cpus(tmp)) >= id))
+				continue;
+
+			/* we have found a better group */
+			pack = tmp;
+
+			/* Take the 1st CPU of the new group */
+			id = cpumask_first(sched_group_cpus(pack));
+		}
+
+		/* Look for another CPU than itself */
+		if (id != cpu)
+			break;
+
+		sd = sd->parent;
+	}
+
+	pr_debug("CPU%d packing on CPU%d\n", cpu, id);
+	per_cpu(sd_pack_buddy, cpu) = id;
+}
+
+#endif /* CONFIG_SMP */
+
 #if BITS_PER_LONG == 32
 # define WMULT_CONST	(~0UL)
 #else
@@ -3291,6 +3361,47 @@ done:
 	return target;
 }
 
+static bool is_buddy_busy(int cpu)
+{
+	struct rq *rq = cpu_rq(cpu);
+
+	/*
+	 * A busy buddy is a CPU with a high load or a small load with a lot of
+	 * running tasks.
+	 */
+	return (rq->avg.runnable_avg_sum >
+			(rq->avg.runnable_avg_period / (rq->nr_running + 2)));
+}
+
+static bool is_light_task(struct task_struct *p)
+{
+	/* A light task runs less than 20% on average */
+	return ((p->se.avg.runnable_avg_sum * 5) <
+			(p->se.avg.runnable_avg_period));
+}
+
+static int check_pack_buddy(int cpu, struct task_struct *p)
+{
+	int buddy = per_cpu(sd_pack_buddy, cpu);
+
+	/* No pack buddy for this CPU */
+	if (buddy == -1)
+		return false;
+
+	/* buddy is not an allowed CPU */
+	if (!cpumask_test_cpu(buddy, tsk_cpus_allowed(p)))
+		return false;
+
+	/*
+	 * If the task is a small one and the buddy is not overloaded,
+	 * we use the buddy cpu
+	 */
+	if (!is_light_task(p) || is_buddy_busy(buddy))
+		return false;
+
+	return true;
+}
+
 /*
  * sched_balance_self: balance the current task (running on cpu) in domains
  * that have the 'flag' flag set. In practice, this is SD_BALANCE_FORK and
@@ -3319,6 +3430,10 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
 		if (cpumask_test_cpu(cpu, tsk_cpus_allowed(p)))
 			want_affine = 1;
 		new_cpu = prev_cpu;
+
+		/* We pack only at wake up and not for a new task */
+		if (check_pack_buddy(new_cpu, p))
+			return per_cpu(sd_pack_buddy, new_cpu);
 	}
 
 	rcu_read_lock();
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 7f36024f..96b164d 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -872,6 +872,7 @@ extern const struct sched_class idle_sched_class;
 
 extern void trigger_load_balance(struct rq *rq, int cpu);
 extern void idle_balance(int this_cpu, struct rq *this_rq);
+extern void update_packing_domain(int cpu);
 
 #else	/* CONFIG_SMP */
 
@@ -879,6 +880,10 @@ static inline void idle_balance(int cpu, struct rq *rq)
 {
 }
 
+static inline void update_packing_domain(int cpu)
+{
+}
+
 #endif
 
 extern void sysrq_sched_debug_show(void);
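Tracing update_packing_domain() above through the dual-cluster example
quoted earlier (CPU0-1 in cluster 0, CPU2-3 in cluster 1, each core and
each cluster power-gatable) reproduces the buddy table. CPU1 starts at
the core level, where the tie between the equal-power single-CPU groups
is broken toward the lowest first CPU, giving id = CPU0 != CPU1, so its
buddy is CPU0; CPU3 likewise gets CPU2. CPU2 picks itself at the core
level, so it climbs to the cluster level, where the tie-break selects
cluster 0 and yields buddy CPU0. CPU0 picks itself at every level and
so keeps id = CPU0.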