
[6/7] KVM-GST: adjust scheduler cpu power

Message ID 1308007897-17013-7-git-send-email-glommer@redhat.com (mailing list archive)
State New, archived

Commit Message

Glauber Costa June 13, 2011, 11:31 p.m. UTC
This is a first proposal for using steal time information
to influence the scheduler. There are a lot of optimizations
and fine-grained adjustments still to be done, but it has been
working reasonably well for me so far (mostly).

With this patch (and some host pinning to demonstrate the situation),
two vcpus with very different steal time (say 80% vs 1%) will not get
an even distribution of processes. This is a situation that can naturally
arise, especially in overcommitted scenarios. Previously, the guest scheduler
would wrongly think that all cpus have the same ability to run processes,
lowering the overall throughput.
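
For reference, the mechanism that turns this accumulated non-task time
into lower cpu power is the existing scale_rt_power() path:
sched_rt_avg_update() feeds rq->rt_avg, which scale_rt_power() subtracts
from the time available to tasks. Roughly (abbreviated from the
kernel/sched.c of this era; details may differ):

    static unsigned long scale_rt_power(int cpu)
    {
            struct rq *rq = cpu_rq(cpu);
            u64 total, available;

            total = sched_avg_period() + (rq->clock - rq->age_stamp);
            /* with this patch, rt_avg also carries irq + steal time */
            available = total - rq->rt_avg;

            if (unlikely((s64)total < SCHED_LOAD_SCALE))
                    total = SCHED_LOAD_SCALE;

            total >>= SCHED_LOAD_SHIFT;

            return div_u64(available, total);
    }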

Signed-off-by: Glauber Costa <glommer@redhat.com>
CC: Rik van Riel <riel@redhat.com>
CC: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
CC: Peter Zijlstra <peterz@infradead.org>
CC: Avi Kivity <avi@redhat.com>
CC: Anthony Liguori <aliguori@us.ibm.com>
CC: Eric B Munson <emunson@mgebm.net>
---
 arch/x86/Kconfig        |   12 ++++++++++++
 kernel/sched.c          |   36 +++++++++++++++++++++++++++---------
 kernel/sched_features.h |    4 ++--
 3 files changed, 41 insertions(+), 11 deletions(-)

Comments

Eric B Munson June 14, 2011, 1:21 a.m. UTC | #1
On Mon, 13 Jun 2011, Glauber Costa wrote:

> This is a first proposal for using steal time information
> to influence the scheduler. There are a lot of optimizations
> and fine grained adjustments to be done, but it is working reasonably
> so far for me (mostly)
> 
> With this patch (and some host pinnings to demonstrate the situation),
> two vcpus with very different steal time (Say 80 % vs 1 %) will not get
> an even distribution of processes. This is a situation that can naturally
> arise, specially in overcommited scenarios. Previosly, the guest scheduler
> would wrongly think that all cpus have the same ability to run processes,
> lowering the overall throughput.
> 
> Signed-off-by: Glauber Costa <glommer@redhat.com>
> CC: Rik van Riel <riel@redhat.com>
> CC: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
> CC: Peter Zijlstra <peterz@infradead.org>
> CC: Avi Kivity <avi@redhat.com>
> CC: Anthony Liguori <aliguori@us.ibm.com>
> CC: Eric B Munson <emunson@mgebm.net>

Tested-by: Eric B Munson <emunson@mgebm.net>
Asias He June 14, 2011, 2:47 a.m. UTC | #2
On 06/14/2011 07:31 AM, Glauber Costa wrote:
> This is a first proposal for using steal time information
> to influence the scheduler. There are a lot of optimizations
> and fine grained adjustments to be done, but it is working reasonably
> so far for me (mostly)
> 
> With this patch (and some host pinnings to demonstrate the situation),
> two vcpus with very different steal time (Say 80 % vs 1 %) will not get
> an even distribution of processes. This is a situation that can naturally
> arise, specially in overcommited scenarios. Previosly, the guest scheduler
> would wrongly think that all cpus have the same ability to run processes,
> lowering the overall throughput.

One typo in the commit log:

s/Previosly/Previously
Peter Zijlstra June 14, 2011, 10:42 a.m. UTC | #3
On Mon, 2011-06-13 at 19:31 -0400, Glauber Costa wrote:
> @@ -1981,12 +1987,29 @@ static void update_rq_clock_task(struct rq
> *rq, s64 delta)
>  
>         rq->prev_irq_time += irq_delta;
>         delta -= irq_delta;
> +#endif
> +#ifdef CONFIG_PARAVIRT_TIME_ACCOUNTING
> +       if (static_branch((&paravirt_steal_rq_enabled))) {

Why is that a different variable from the touch_steal_time() one?

> +
> +               steal = paravirt_steal_clock(cpu_of(rq));
> +               steal -= rq->prev_steal_time_acc;
> +
> +               rq->prev_steal_time_acc += steal;

You have this addition in the wrong place, when you clip:

> +               if (steal > delta)
> +                       steal = delta;

you just lost your steal delta, so the addition to prev_steal_time_acc
needs to be after the clip.
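
Something like this, i.e. only fold the consumed amount into
prev_steal_time_acc so the remainder stays pending for the next update
(a sketch):

    steal = paravirt_steal_clock(cpu_of(rq));
    steal -= rq->prev_steal_time_acc;

    if (steal > delta)
            steal = delta;

    /* account only what we consumed; the rest remains pending */
    rq->prev_steal_time_acc += steal;

    delta -= steal;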

> +               delta -= steal;
> +       }
> +#endif
> +
>         rq->clock_task += delta;
>  
> -       if (irq_delta && sched_feat(NONIRQ_POWER))
> -               sched_rt_avg_update(rq, irq_delta);
> +       if ((irq_delta + steal) && sched_feat(NONTASK_POWER))
> +               sched_rt_avg_update(rq, irq_delta + steal);
>  } 
Glauber Costa June 15, 2011, 1:26 a.m. UTC | #4
On 06/14/2011 07:42 AM, Peter Zijlstra wrote:
> On Mon, 2011-06-13 at 19:31 -0400, Glauber Costa wrote:
>> @@ -1981,12 +1987,29 @@ static void update_rq_clock_task(struct rq
>> *rq, s64 delta)
>>
>>          rq->prev_irq_time += irq_delta;
>>          delta -= irq_delta;
>> +#endif
>> +#ifdef CONFIG_PARAVIRT_TIME_ACCOUNTING
>> +       if (static_branch((&paravirt_steal_rq_enabled))) {
>
> Why is that a different variable from the touch_steal_time() one?

Because they track different things: touch_steal_time() and
update_rq_clock() are called from different places in different
situations.

If we advance prev_steal_time in touch_steal_time() and later on call
update_rq_clock_task(), we won't discount the time already flushed from
the rq_clock. Conversely, if we call update_rq_clock_task() and only
then arrive at touch_steal_time(), we won't account steal time properly.

update_rq_clock_task() is called whenever update_rq_clock() is called.
touch_steal_time() is called every tick. If there is a causal relation
between them that would allow us to track both in a single location, I
fail to see it.
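
So each site keeps its own cursor into the same monotonic steal clock.
As a sketch (the helper is hypothetical, for illustration only; the
field names follow the patch):

    static u64 steal_delta(u64 *cursor, int cpu)
    {
            u64 now = paravirt_steal_clock(cpu);
            u64 delta = now - *cursor;

            *cursor = now;
            return delta;
    }

    /* touch_steal_time():     steal_delta(&rq->prev_steal_time, cpu)     */
    /* update_rq_clock_task(): steal_delta(&rq->prev_steal_time_acc, cpu) */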

>> +
>> +               steal = paravirt_steal_clock(cpu_of(rq));
>> +               steal -= rq->prev_steal_time_acc;
>> +
>> +               rq->prev_steal_time_acc += steal;
>
> You have this addition in the wrong place, when you clip:

I begin by disagreeing.
>> +               if (steal>  delta)
>> +                       steal = delta;
>
> you just lost your steal delta, so the addition to prev_steal_time_acc
> needs to be after the clip.
Unlike irq time, steal time can be extremely large. Just think of a
virtual machine that got interrupted for a very long time: we'd have
steal >> delta, leading to steal == delta for a large number of
iterations. That would depress cpu power for an extended period of time,
reflecting not the present situation but the past. So I like to think of
delta as a hard cap for steal time.
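
For concreteness, a userspace toy (made-up numbers, not kernel code)
showing how long steal would stay pinned at delta if only the clipped
amount were accounted each update:

    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
            uint64_t backlog = 10000000000ULL; /* 10s of pending steal, in ns */
            uint64_t delta   = 1000000;        /* one 1ms rq clock delta */

            /* draining at most delta per update keeps steal == delta
             * for backlog/delta consecutive updates */
            printf("%llu updates at full clip\n",
                   (unsigned long long)(backlog / delta));
            return 0;
    }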

Obviously, I am open to debate.
>
>> +               delta -= steal;
>> +       }
>> +#endif
>> +
>>          rq->clock_task += delta;
>>
>> -       if (irq_delta&&  sched_feat(NONIRQ_POWER))
>> -               sched_rt_avg_update(rq, irq_delta);
>> +       if ((irq_delta + steal)&&  sched_feat(NONTASK_POWER))
>> +               sched_rt_avg_update(rq, irq_delta + steal);
>>   }

Peter Zijlstra June 20, 2011, 2:41 p.m. UTC | #5
On Tue, 2011-06-14 at 22:26 -0300, Glauber Costa wrote:
> On 06/14/2011 07:42 AM, Peter Zijlstra wrote:
> > On Mon, 2011-06-13 at 19:31 -0400, Glauber Costa wrote:
> >> @@ -1981,12 +1987,29 @@ static void update_rq_clock_task(struct rq
> >> *rq, s64 delta)
> >>
> >>          rq->prev_irq_time += irq_delta;
> >>          delta -= irq_delta;
> >> +#endif
> >> +#ifdef CONFIG_PARAVIRT_TIME_ACCOUNTING
> >> +       if (static_branch((&paravirt_steal_rq_enabled))) {
> >
> > Why is that a different variable from the touch_steal_time() one?
> 
> because they track different things, touch_steal_time() and 
> update_rq_clock() are
> called from different places at different situations.
> 
> If we advance prev_steal_time in touch_steal_time(), and later on call 
> update_rq_clock_task(), we won't discount the time already flushed from 
> the rq_clock. Conversely, if we call update_rq_clock_task(), and only 
> then arrive at touch_steal_time, we won't account steal time properly.

But that's about prev_steal_time vs prev_steal_time_acc; I agree those
should be different.

> update_rq_clock_task() is called whenever update_rq_clock() is called.
> touch_steal_time is called every tick. If there is a causal relation 
> between them that would allow us to track it in a single location, I 
> fail to realize.

Both are steal time muck; I was wondering why we'd want to do one and
not the other when we have a high-res steal time clock.

> >> +
> >> +               steal = paravirt_steal_clock(cpu_of(rq));
> >> +               steal -= rq->prev_steal_time_acc;
> >> +
> >> +               rq->prev_steal_time_acc += steal;
> >
> > You have this addition in the wrong place, when you clip:
> 
> I begin by disagreeing
> >> +               if (steal>  delta)
> >> +                       steal = delta;
> >
> > you just lost your steal delta, so the addition to prev_steal_time_acc
> > needs to be after the clip.

> Unlike irq time, steal time can be extremely huge. Just think of a 
> virtual machine that got interrupted for a very long time. We'd have 
> steal >> delta, leading to steal == delta for a big number of iterations.
> That would affect cpu power for an extended period of time, not
> reflecting present situation, just the past. So I like to think of delta
> as a hard cap for steal time.
> 
> Obviously, I am open to debate.

I'm failing to see how this would happen: if the virtual machine wasn't
scheduled for a long while, delta would be huge too. But suppose it does
happen; wouldn't it be likely that the virtual machine would receive
similarly bad service in the near future, making the total accounting
relevant?

Patch

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index da34972..b26f312 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -512,6 +512,18 @@  menuconfig PARAVIRT_GUEST
 
 if PARAVIRT_GUEST
 
+config PARAVIRT_TIME_ACCOUNTING
+	bool "Paravirtual steal time accounting"
+	select PARAVIRT
+	default n
+	---help---
+	  Select this option to enable fine granularity task steal time 
+	  accounting. Time spent executing other tasks in parallel with
+	  the current vCPU is discounted from the vCPU power. To account for
+	  that, there can be a small performance impact.
+
+	  If in doubt, say N here.
+
 source "arch/x86/xen/Kconfig"
 
 config KVM_CLOCK
diff --git a/kernel/sched.c b/kernel/sched.c
index 154cb14..d6b2749 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -530,6 +530,9 @@  struct rq {
 	u64 prev_irq_time;
 #endif
 	u64 prev_steal_time;
+#ifdef CONFIG_PARAVIRT_TIME_ACCOUNTING
+	u64 prev_steal_time_acc;
+#endif
 
 	/* calc_load related fields */
 	unsigned long calc_load_update;
@@ -1955,10 +1958,13 @@  void account_system_vtime(struct task_struct *curr)
 }
 EXPORT_SYMBOL_GPL(account_system_vtime);
 
+#endif /* CONFIG_IRQ_TIME_ACCOUNTING */
+
 static void update_rq_clock_task(struct rq *rq, s64 delta)
 {
-	s64 irq_delta;
+	s64 irq_delta = 0, steal = 0;
 
+#ifdef CONFIG_IRQ_TIME_ACCOUNTING
 	irq_delta = irq_time_read(cpu_of(rq)) - rq->prev_irq_time;
 
 	/*
@@ -1981,12 +1987,29 @@  static void update_rq_clock_task(struct rq *rq, s64 delta)
 
 	rq->prev_irq_time += irq_delta;
 	delta -= irq_delta;
+#endif
+#ifdef CONFIG_PARAVIRT_TIME_ACCOUNTING
+	if (static_branch((&paravirt_steal_rq_enabled))) {
+
+		steal = paravirt_steal_clock(cpu_of(rq));
+		steal -= rq->prev_steal_time_acc;
+
+		rq->prev_steal_time_acc += steal;
+
+		if (steal > delta)
+			steal = delta;
+
+		delta -= steal;
+	}
+#endif
+
 	rq->clock_task += delta;
 
-	if (irq_delta && sched_feat(NONIRQ_POWER))
-		sched_rt_avg_update(rq, irq_delta);
+	if ((irq_delta + steal) && sched_feat(NONTASK_POWER))
+		sched_rt_avg_update(rq, irq_delta + steal);
 }
 
+#ifdef CONFIG_IRQ_TIME_ACCOUNTING
 static int irqtime_account_hi_update(void)
 {
 	struct cpu_usage_stat *cpustat = &kstat_this_cpu.cpustat;
@@ -2021,12 +2044,7 @@  static int irqtime_account_si_update(void)
 
 #define sched_clock_irqtime	(0)
 
-static void update_rq_clock_task(struct rq *rq, s64 delta)
-{
-	rq->clock_task += delta;
-}
-
-#endif /* CONFIG_IRQ_TIME_ACCOUNTING */
+#endif
 
 #include "sched_idletask.c"
 #include "sched_fair.c"
diff --git a/kernel/sched_features.h b/kernel/sched_features.h
index be40f73..ca3b025 100644
--- a/kernel/sched_features.h
+++ b/kernel/sched_features.h
@@ -61,9 +61,9 @@  SCHED_FEAT(LB_BIAS, 1)
 SCHED_FEAT(OWNER_SPIN, 1)
 
 /*
- * Decrement CPU power based on irq activity
+ * Decrement CPU power based on time not spent running tasks
  */
-SCHED_FEAT(NONIRQ_POWER, 1)
+SCHED_FEAT(NONTASK_POWER, 1)
 
 /*
  * Queue remote wakeups on the target CPU and process them