Message ID | 1308007897-17013-7-git-send-email-glommer@redhat.com (mailing list archive)
---|---
State | New, archived
Headers | show
On Mon, 13 Jun 2011, Glauber Costa wrote:
> This is a first proposal for using steal time information
> to influence the scheduler. There are a lot of optimizations
> and fine grained adjustments to be done, but it is working reasonably
> so far for me (mostly)
>
> With this patch (and some host pinnings to demonstrate the situation),
> two vcpus with very different steal time (Say 80 % vs 1 %) will not get
> an even distribution of processes. This is a situation that can naturally
> arise, specially in overcommited scenarios. Previosly, the guest scheduler
> would wrongly think that all cpus have the same ability to run processes,
> lowering the overall throughput.
>
> Signed-off-by: Glauber Costa <glommer@redhat.com>
> CC: Rik van Riel <riel@redhat.com>
> CC: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
> CC: Peter Zijlstra <peterz@infradead.org>
> CC: Avi Kivity <avi@redhat.com>
> CC: Anthony Liguori <aliguori@us.ibm.com>
> CC: Eric B Munson <emunson@mgebm.net>

Tested-by: Eric B Munson <emunson@mgebm.net>
On 06/14/2011 07:31 AM, Glauber Costa wrote:
> This is a first proposal for using steal time information
> to influence the scheduler. There are a lot of optimizations
> and fine grained adjustments to be done, but it is working reasonably
> so far for me (mostly)
>
> With this patch (and some host pinnings to demonstrate the situation),
> two vcpus with very different steal time (Say 80 % vs 1 %) will not get
> an even distribution of processes. This is a situation that can naturally
> arise, specially in overcommited scenarios. Previosly, the guest scheduler
> would wrongly think that all cpus have the same ability to run processes,
> lowering the overall throughput.

One typo in the commit log:
s/Previosly/Previously
On Mon, 2011-06-13 at 19:31 -0400, Glauber Costa wrote:
> @@ -1981,12 +1987,29 @@ static void update_rq_clock_task(struct rq *rq, s64 delta)
>
> 	rq->prev_irq_time += irq_delta;
> 	delta -= irq_delta;
> +#endif
> +#ifdef CONFIG_PARAVIRT_TIME_ACCOUNTING
> +	if (static_branch((&paravirt_steal_rq_enabled))) {

Why is that a different variable from the touch_steal_time() one?

> +
> +		steal = paravirt_steal_clock(cpu_of(rq));
> +		steal -= rq->prev_steal_time_acc;
> +
> +		rq->prev_steal_time_acc += steal;

You have this addition in the wrong place, when you clip:

> +		if (steal > delta)
> +			steal = delta;

you just lost your steal delta, so the addition to prev_steal_time_acc
needs to be after the clip.

> +
> +		delta -= steal;
> +	}
> +#endif
> +
> 	rq->clock_task += delta;
>
> -	if (irq_delta && sched_feat(NONIRQ_POWER))
> -		sched_rt_avg_update(rq, irq_delta);
> +	if ((irq_delta + steal) && sched_feat(NONTASK_POWER))
> +		sched_rt_avg_update(rq, irq_delta + steal);
> }

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
On 06/14/2011 07:42 AM, Peter Zijlstra wrote:
> On Mon, 2011-06-13 at 19:31 -0400, Glauber Costa wrote:
>> @@ -1981,12 +1987,29 @@ static void update_rq_clock_task(struct rq *rq, s64 delta)
>>
>> 	rq->prev_irq_time += irq_delta;
>> 	delta -= irq_delta;
>> +#endif
>> +#ifdef CONFIG_PARAVIRT_TIME_ACCOUNTING
>> +	if (static_branch((&paravirt_steal_rq_enabled))) {
>
> Why is that a different variable from the touch_steal_time() one?

because they track different things, touch_steal_time() and
update_rq_clock() are called from different places at different
situations.

If we advance prev_steal_time in touch_steal_time(), and later on call
update_rq_clock_task(), we won't discount the time already flushed from
the rq_clock. Conversely, if we call update_rq_clock_task(), and only
then arrive at touch_steal_time, we won't account steal time properly.

update_rq_clock_task() is called whenever update_rq_clock() is called.
touch_steal_time is called every tick. If there is a causal relation
between them that would allow us to track it in a single location, I
fail to realize.

>> +
>> +		steal = paravirt_steal_clock(cpu_of(rq));
>> +		steal -= rq->prev_steal_time_acc;
>> +
>> +		rq->prev_steal_time_acc += steal;
>
> You have this addition in the wrong place, when you clip:

I begin by disagreeing

>> +		if (steal > delta)
>> +			steal = delta;
>
> you just lost your steal delta, so the addition to prev_steal_time_acc
> needs to be after the clip.

Unlike irq time, steal time can be extremely huge. Just think of a
virtual machine that got interrupted for a very long time. We'd have
steal >> delta, leading to steal == delta for a big number of iterations.
That would affect cpu power for an extended period of time, not
reflecting present situation, just the past. So I like to think of delta
as a hard cap for steal time.

Obviously, I am open to debate.

>> +
>> +		delta -= steal;
>> +	}
>> +#endif
>> +
>> 	rq->clock_task += delta;
>>
>> -	if (irq_delta && sched_feat(NONIRQ_POWER))
>> -		sched_rt_avg_update(rq, irq_delta);
>> +	if ((irq_delta + steal) && sched_feat(NONTASK_POWER))
>> +		sched_rt_avg_update(rq, irq_delta + steal);
>> }
On Tue, 2011-06-14 at 22:26 -0300, Glauber Costa wrote:
> On 06/14/2011 07:42 AM, Peter Zijlstra wrote:
> > On Mon, 2011-06-13 at 19:31 -0400, Glauber Costa wrote:
> >> @@ -1981,12 +1987,29 @@ static void update_rq_clock_task(struct rq *rq, s64 delta)
> >>
> >> 	rq->prev_irq_time += irq_delta;
> >> 	delta -= irq_delta;
> >> +#endif
> >> +#ifdef CONFIG_PARAVIRT_TIME_ACCOUNTING
> >> +	if (static_branch((&paravirt_steal_rq_enabled))) {
> >
> > Why is that a different variable from the touch_steal_time() one?
>
> because they track different things, touch_steal_time() and
> update_rq_clock() are called from different places at different
> situations.
>
> If we advance prev_steal_time in touch_steal_time(), and later on call
> update_rq_clock_task(), we won't discount the time already flushed from
> the rq_clock. Conversely, if we call update_rq_clock_task(), and only
> then arrive at touch_steal_time, we won't account steal time properly.

But that's about prev_steal_time vs prev_steal_time_acc, I agree those
should be different.

> update_rq_clock_task() is called whenever update_rq_clock() is called.
> touch_steal_time is called every tick. If there is a causal relation
> between them that would allow us to track it in a single location, I
> fail to realize.

Both are steal time muck, I was wondering why we'd want to do one and
not the other when we have a high res stealtime clock.

> >> +
> >> +		steal = paravirt_steal_clock(cpu_of(rq));
> >> +		steal -= rq->prev_steal_time_acc;
> >> +
> >> +		rq->prev_steal_time_acc += steal;
> >
> > You have this addition in the wrong place, when you clip:
>
> I begin by disagreeing
>
> >> +		if (steal > delta)
> >> +			steal = delta;
> >
> > you just lost your steal delta, so the addition to prev_steal_time_acc
> > needs to be after the clip.
>
> Unlike irq time, steal time can be extremely huge. Just think of a
> virtual machine that got interrupted for a very long time. We'd have
> steal >> delta, leading to steal == delta for a big number of iterations.
> That would affect cpu power for an extended period of time, not
> reflecting present situation, just the past. So I like to think of delta
> as a hard cap for steal time.
>
> Obviously, I am open to debate.

I'm failing to see how this would happen, if the virtual machine wasn't
scheduled for a long long while, delta would be huge too.

But suppose it does happen, wouldn't it be likely that the virtual
machine would receive similar bad service in the near future? Making the
total accounting relevant.
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index da34972..b26f312 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -512,6 +512,18 @@ menuconfig PARAVIRT_GUEST
 
 if PARAVIRT_GUEST
 
+config PARAVIRT_TIME_ACCOUNTING
+	bool "Paravirtual steal time accounting"
+	select PARAVIRT
+	default n
+	---help---
+	  Select this option to enable fine granularity task steal time
+	  accounting. Time spent executing other tasks in parallel with
+	  the current vCPU is discounted from the vCPU power. To account for
+	  that, there can be a small performance impact.
+
+	  If in doubt, say N here.
+
 source "arch/x86/xen/Kconfig"
 
 config KVM_CLOCK
diff --git a/kernel/sched.c b/kernel/sched.c
index 154cb14..d6b2749 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -530,6 +530,9 @@ struct rq {
 	u64 prev_irq_time;
 #endif
 	u64 prev_steal_time;
+#ifdef CONFIG_PARAVIRT_TIME_ACCOUNTING
+	u64 prev_steal_time_acc;
+#endif
 
 	/* calc_load related fields */
 	unsigned long calc_load_update;
@@ -1955,10 +1958,13 @@ void account_system_vtime(struct task_struct *curr)
 }
 EXPORT_SYMBOL_GPL(account_system_vtime);
 
+#endif /* CONFIG_IRQ_TIME_ACCOUNTING */
+
 static void update_rq_clock_task(struct rq *rq, s64 delta)
 {
-	s64 irq_delta;
+	s64 irq_delta = 0, steal = 0;
 
+#ifdef CONFIG_IRQ_TIME_ACCOUNTING
 	irq_delta = irq_time_read(cpu_of(rq)) - rq->prev_irq_time;
 
 	/*
@@ -1981,12 +1987,29 @@ static void update_rq_clock_task(struct rq *rq, s64 delta)
 
 	rq->prev_irq_time += irq_delta;
 	delta -= irq_delta;
+#endif
+#ifdef CONFIG_PARAVIRT_TIME_ACCOUNTING
+	if (static_branch((&paravirt_steal_rq_enabled))) {
+
+		steal = paravirt_steal_clock(cpu_of(rq));
+		steal -= rq->prev_steal_time_acc;
+
+		rq->prev_steal_time_acc += steal;
+
+		if (steal > delta)
+			steal = delta;
+
+		delta -= steal;
+	}
+#endif
+
 	rq->clock_task += delta;
 
-	if (irq_delta && sched_feat(NONIRQ_POWER))
-		sched_rt_avg_update(rq, irq_delta);
+	if ((irq_delta + steal) && sched_feat(NONTASK_POWER))
+		sched_rt_avg_update(rq, irq_delta + steal);
 }
 
+#ifdef CONFIG_IRQ_TIME_ACCOUNTING
 static int irqtime_account_hi_update(void)
 {
 	struct cpu_usage_stat *cpustat = &kstat_this_cpu.cpustat;
@@ -2021,12 +2044,7 @@ static int irqtime_account_si_update(void)
 
 #define sched_clock_irqtime	(0)
 
-static void update_rq_clock_task(struct rq *rq, s64 delta)
-{
-	rq->clock_task += delta;
-}
-
-#endif /* CONFIG_IRQ_TIME_ACCOUNTING */
+#endif
 
 #include "sched_idletask.c"
 #include "sched_fair.c"
diff --git a/kernel/sched_features.h b/kernel/sched_features.h
index be40f73..ca3b025 100644
--- a/kernel/sched_features.h
+++ b/kernel/sched_features.h
@@ -61,9 +61,9 @@ SCHED_FEAT(LB_BIAS, 1)
 SCHED_FEAT(OWNER_SPIN, 1)
 
 /*
- * Decrement CPU power based on irq activity
+ * Decrement CPU power based on time not spent running tasks
  */
-SCHED_FEAT(NONIRQ_POWER, 1)
+SCHED_FEAT(NONTASK_POWER, 1)
 
 /*
  * Queue remote wakeups on the target CPU and process them
This is a first proposal for using steal time information
to influence the scheduler. There are a lot of optimizations
and fine grained adjustments to be done, but it is working reasonably
so far for me (mostly)

With this patch (and some host pinnings to demonstrate the situation),
two vcpus with very different steal time (Say 80 % vs 1 %) will not get
an even distribution of processes. This is a situation that can naturally
arise, specially in overcommited scenarios. Previosly, the guest scheduler
would wrongly think that all cpus have the same ability to run processes,
lowering the overall throughput.

Signed-off-by: Glauber Costa <glommer@redhat.com>
CC: Rik van Riel <riel@redhat.com>
CC: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
CC: Peter Zijlstra <peterz@infradead.org>
CC: Avi Kivity <avi@redhat.com>
CC: Anthony Liguori <aliguori@us.ibm.com>
CC: Eric B Munson <emunson@mgebm.net>
---
 arch/x86/Kconfig        |   12 ++++++++++++
 kernel/sched.c          |   36 +++++++++++++++++++++++++++---------
 kernel/sched_features.h |    4 ++--
 3 files changed, 41 insertions(+), 11 deletions(-)