| Message ID | 1308262856-5779-6-git-send-email-glommer@redhat.com (mailing list archive) |
|---|---|
| State | New, archived |
On Thu, 16 Jun 2011, Glauber Costa wrote:

> This patch accounts steal time time in kernel/sched.
> I kept it from last proposal, because I still see advantages
> in it: Doing it here will give us easier access from scheduler
> variables such as the cpu rq. The next patch shows an example of
> usage for it.
>
> Since functions like account_idle_time() can be called from
> multiple places, not only account_process_tick(), steal time
> grabbing is repeated in each account function separatedely.
>
> Signed-off-by: Glauber Costa <glommer@redhat.com>
> CC: Rik van Riel <riel@redhat.com>
> CC: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
> CC: Peter Zijlstra <peterz@infradead.org>
> CC: Avi Kivity <avi@redhat.com>
> CC: Anthony Liguori <aliguori@us.ibm.com>
> CC: Eric B Munson <emunson@mgebm.net>

Tested-by: Eric B Munson <emunson@mgebm.net>
On 06/17/2011 01:20 AM, Glauber Costa wrote:
> This patch accounts steal time time in kernel/sched.
> I kept it from last proposal, because I still see advantages
> in it: Doing it here will give us easier access from scheduler
> variables such as the cpu rq. The next patch shows an example of
> usage for it.
>
> Since functions like account_idle_time() can be called from
> multiple places, not only account_process_tick(), steal time
> grabbing is repeated in each account function separatedely.
>
> /*
> + * We have to at flush steal time information every time something else
> + * is accounted. Since the accounting functions are all visible to the rest
> + * of the kernel, it gets tricky to do them in one place. This helper function
> + * helps us.
> + *
> + * When the system is idle, the concept of steal time does not apply. We just
> + * tell the underlying hypervisor that we grabbed the data, but skip steal time
> + * accounting
> + */
> +static inline bool touch_steal_time(int is_idle)
> +{
> +	u64 steal, st = 0;
> +
> +	if (static_branch(&paravirt_steal_enabled)) {
> +
> +		steal = paravirt_steal_clock(smp_processor_id());
> +
> +		steal -= this_rq()->prev_steal_time;
> +		if (is_idle) {
> +			this_rq()->prev_steal_time += steal;
> +			return false;
> +		}
> +
> +		while (steal >= TICK_NSEC) {
> +			/*
> +			 * Inline assembly required to prevent the compiler
> +			 * optimising this loop into a divmod call.
> +			 * See __iter_div_u64_rem() for another example of this.
> +			 */

Why not use said function?

> +			asm("" : "+rm" (steal));
> +
> +			steal -= TICK_NSEC;
> +			this_rq()->prev_steal_time += TICK_NSEC;
> +			st++;

Suppose a live migration or SIGSTOP causes lots of steal time. How long
will we spend here?

> +		}
> +
> +		account_steal_time(st);
> +		return !!st;

!! !needed, you're returning a bool.

> +	}
> +	return false;
> +}
> +

I'll need Peter's (or another sched maintainer's) review to apply this.
On 06/19/2011 07:04 AM, Avi Kivity wrote:
> On 06/17/2011 01:20 AM, Glauber Costa wrote:
>> This patch accounts steal time time in kernel/sched.
>> I kept it from last proposal, because I still see advantages
>> in it: Doing it here will give us easier access from scheduler
>> variables such as the cpu rq. The next patch shows an example of
>> usage for it.
>>
>> Since functions like account_idle_time() can be called from
>> multiple places, not only account_process_tick(), steal time
>> grabbing is repeated in each account function separatedely.
>>
>> /*
>> + * We have to at flush steal time information every time something else
>> + * is accounted. Since the accounting functions are all visible to the rest
>> + * of the kernel, it gets tricky to do them in one place. This helper function
>> + * helps us.
>> + *
>> + * When the system is idle, the concept of steal time does not apply. We just
>> + * tell the underlying hypervisor that we grabbed the data, but skip steal time
>> + * accounting
>> + */
>> +static inline bool touch_steal_time(int is_idle)
>> +{
>> +	u64 steal, st = 0;
>> +
>> +	if (static_branch(&paravirt_steal_enabled)) {
>> +
>> +		steal = paravirt_steal_clock(smp_processor_id());
>> +
>> +		steal -= this_rq()->prev_steal_time;
>> +		if (is_idle) {
>> +			this_rq()->prev_steal_time += steal;
>> +			return false;
>> +		}
>> +
>> +		while (steal >= TICK_NSEC) {
>> +			/*
>> +			 * Inline assembly required to prevent the compiler
>> +			 * optimising this loop into a divmod call.
>> +			 * See __iter_div_u64_rem() for another example of this.
>> +			 */
>
> Why not use said function?

because here we want to do work during each loop. The said function
would have to be adapted for that, possibly using a macro, to run
arbitrary code during each loop iteration, in a way that I don't think
it is worthy given the current number of callers (2 counting this new
one)

>> +			asm("" : "+rm" (steal));
>> +
>> +			steal -= TICK_NSEC;
>> +			this_rq()->prev_steal_time += TICK_NSEC;
>> +			st++;
>
> Suppose a live migration or SIGSTOP causes lots of steal time. How long
> will we spend here?

Silly me. I actually used this same argument with Peter to cap it with
"delta" in the next patch in this series. So I think you are 100 %
right. Here, however, we do want to account all that time, I believe.

How about we do a slow division if we're > 10 sec (unlikely), and
account everything as steal time in this scenario ?

>> +		}
>> +
>> +		account_steal_time(st);
>> +		return !!st;
>
> !! !needed, you're returning a bool.

ah, sure thing.

>> +	}
>> +	return false;
>> +}
>> +
>
> I'll need Peter's (or another sched maintainer's) review to apply this.
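For reference, the helper under discussion, __iter_div_u64_rem() from include/linux/math64.h, is roughly the following open-coded loop (quoted from memory, so treat it as a sketch rather than the exact source). As Glauber notes, it only subtracts and counts; there is no hook for per-iteration work such as updating prev_steal_time:

static __always_inline u32
__iter_div_u64_rem(u64 dividend, u32 divisor, u64 *remainder)
{
	u32 ret = 0;

	while (dividend >= divisor) {
		/* The asm() below prevents the compiler from
		 * optimising this loop into a divmod call. */
		asm("" : "+rm"(dividend));

		dividend -= divisor;
		ret++;
	}

	*remainder = dividend;

	return ret;
}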
On 06/20/2011 05:38 AM, Glauber Costa wrote:
> On 06/19/2011 07:04 AM, Avi Kivity wrote:
>> On 06/17/2011 01:20 AM, Glauber Costa wrote:
>>> This patch accounts steal time time in kernel/sched.
>>> I kept it from last proposal, because I still see advantages
>>> in it: Doing it here will give us easier access from scheduler
>>> variables such as the cpu rq. The next patch shows an example of
>>> usage for it.
>>>
>>> Since functions like account_idle_time() can be called from
>>> multiple places, not only account_process_tick(), steal time
>>> grabbing is repeated in each account function separatedely.
>>>
>>> /*
>>> + * We have to at flush steal time information every time something else
>>> + * is accounted. Since the accounting functions are all visible to the rest
>>> + * of the kernel, it gets tricky to do them in one place. This helper
>>> + * function helps us.
>>> + *
>>> + * When the system is idle, the concept of steal time does not apply.
>>> + * We just tell the underlying hypervisor that we grabbed the data, but
>>> + * skip steal time accounting
>>> + */
>>> +static inline bool touch_steal_time(int is_idle)
>>> +{
>>> +	u64 steal, st = 0;
>>> +
>>> +	if (static_branch(&paravirt_steal_enabled)) {
>>> +
>>> +		steal = paravirt_steal_clock(smp_processor_id());
>>> +
>>> +		steal -= this_rq()->prev_steal_time;
>>> +		if (is_idle) {
>>> +			this_rq()->prev_steal_time += steal;
>>> +			return false;
>>> +		}
>>> +
>>> +		while (steal >= TICK_NSEC) {
>>> +			/*
>>> +			 * Inline assembly required to prevent the compiler
>>> +			 * optimising this loop into a divmod call.
>>> +			 * See __iter_div_u64_rem() for another example of this.
>>> +			 */
>>
>> Why not use said function?
>
> because here we want to do work during each loop. The said function
> would have to be adapted for that, possibly using a macro, to run
> arbitrary code during each loop iteration, in a way that I don't think
> it is worthy given the current number of callers (2 counting this new
> one)

You mean adding to prev_steal_time? That can be done outside the loop.

>>> +			asm("" : "+rm" (steal));
>>> +
>>> +			steal -= TICK_NSEC;
>>> +			this_rq()->prev_steal_time += TICK_NSEC;
>>> +			st++;
>>
>> Suppose a live migration or SIGSTOP causes lots of steal time. How long
>> will we spend here?
>
> Silly me. I actually used this same argument with Peter to cap it with
> "delta" in the next patch in this series. So I think you are 100 %
> right. Here, however, we do want to account all that time, I believe.
>
> How about we do a slow division if we're > 10 sec (unlikely), and
> account everything as steal time in this scenario ?

Okay. Division would be faster for a lot less than 10s though.
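Concretely, a minimal sketch of the direction being converged on (one division instead of the subtract loop, with prev_steal_time updated outside any loop) could look like the following. This is illustrative only, not the posted patch; it assumes div_u64() from linux/math64.h plus the same static_branch()/paravirt_steal_clock() plumbing the patch introduces:

static inline bool touch_steal_time(int is_idle)
{
	u64 steal, ticks;

	if (!static_branch(&paravirt_steal_enabled))
		return false;

	steal = paravirt_steal_clock(smp_processor_id()) -
		this_rq()->prev_steal_time;

	if (is_idle) {
		/* Idle: consume the data but skip steal time accounting. */
		this_rq()->prev_steal_time += steal;
		return false;
	}

	/*
	 * A single divide handles arbitrarily large deltas (live migration,
	 * SIGSTOP) in constant time instead of one loop iteration per tick.
	 */
	ticks = div_u64(steal, TICK_NSEC);
	if (!ticks)
		return false;

	/* Flush whole ticks only; the sub-tick remainder waits for next call. */
	this_rq()->prev_steal_time += ticks * TICK_NSEC;
	account_steal_time(ticks);
	return true;
}

As in the posted loop, prev_steal_time advances only by whole ticks, so the sub-tick remainder is carried over rather than lost.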
diff --git a/kernel/sched.c b/kernel/sched.c
index 3f2e502..fa983c6 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -75,6 +75,7 @@
 #include <asm/tlb.h>
 #include <asm/irq_regs.h>
 #include <asm/mutex.h>
+#include <asm/paravirt.h>
 
 #include "sched_cpupri.h"
 #include "workqueue_sched.h"
@@ -528,6 +529,7 @@ struct rq {
 #ifdef CONFIG_IRQ_TIME_ACCOUNTING
 	u64 prev_irq_time;
 #endif
+	u64 prev_steal_time;
 
 	/* calc_load related fields */
 	unsigned long calc_load_update;
@@ -3705,6 +3707,49 @@ unsigned long long thread_group_sched_runtime(struct task_struct *p)
 }
 
 /*
+ * We have to at flush steal time information every time something else
+ * is accounted. Since the accounting functions are all visible to the rest
+ * of the kernel, it gets tricky to do them in one place. This helper function
+ * helps us.
+ *
+ * When the system is idle, the concept of steal time does not apply. We just
+ * tell the underlying hypervisor that we grabbed the data, but skip steal time
+ * accounting
+ */
+static inline bool touch_steal_time(int is_idle)
+{
+	u64 steal, st = 0;
+
+	if (static_branch(&paravirt_steal_enabled)) {
+
+		steal = paravirt_steal_clock(smp_processor_id());
+
+		steal -= this_rq()->prev_steal_time;
+		if (is_idle) {
+			this_rq()->prev_steal_time += steal;
+			return false;
+		}
+
+		while (steal >= TICK_NSEC) {
+			/*
+			 * Inline assembly required to prevent the compiler
+			 * optimising this loop into a divmod call.
+			 * See __iter_div_u64_rem() for another example of this.
+			 */
+			asm("" : "+rm" (steal));
+
+			steal -= TICK_NSEC;
+			this_rq()->prev_steal_time += TICK_NSEC;
+			st++;
+		}
+
+		account_steal_time(st);
+		return !!st;
+	}
+	return false;
+}
+
+/*
  * Account user cpu time to a process.
  * @p: the process that the cpu time gets accounted to
  * @cputime: the cpu time spent in user space since the last update
@@ -3716,6 +3761,9 @@ void account_user_time(struct task_struct *p, cputime_t cputime,
 	struct cpu_usage_stat *cpustat = &kstat_this_cpu.cpustat;
 	cputime64_t tmp;
 
+	if (touch_steal_time(0))
+		return;
+
 	/* Add user time to process. */
 	p->utime = cputime_add(p->utime, cputime);
 	p->utimescaled = cputime_add(p->utimescaled, cputime_scaled);
@@ -3802,6 +3850,9 @@ void account_system_time(struct task_struct *p, int hardirq_offset,
 	struct cpu_usage_stat *cpustat = &kstat_this_cpu.cpustat;
 	cputime64_t *target_cputime64;
 
+	if (touch_steal_time(0))
+		return;
+
 	if ((p->flags & PF_VCPU) && (irq_count() - hardirq_offset == 0)) {
 		account_guest_time(p, cputime, cputime_scaled);
 		return;
@@ -3839,6 +3890,8 @@ void account_idle_time(cputime_t cputime)
 	cputime64_t cputime64 = cputime_to_cputime64(cputime);
 	struct rq *rq = this_rq();
 
+	touch_steal_time(1);
+
 	if (atomic_read(&rq->nr_iowait) > 0)
 		cpustat->iowait = cputime64_add(cpustat->iowait, cputime64);
 	else
This patch accounts steal time in kernel/sched. I kept it from the last
proposal, because I still see advantages in it: doing it here gives us easier
access to scheduler variables such as the cpu rq. The next patch shows an
example of usage for it.

Since functions like account_idle_time() can be called from multiple places,
not only account_process_tick(), steal time grabbing is repeated in each
accounting function separately.

Signed-off-by: Glauber Costa <glommer@redhat.com>
CC: Rik van Riel <riel@redhat.com>
CC: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
CC: Peter Zijlstra <peterz@infradead.org>
CC: Avi Kivity <avi@redhat.com>
CC: Anthony Liguori <aliguori@us.ibm.com>
CC: Eric B Munson <emunson@mgebm.net>
---
 kernel/sched.c |   53 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 53 insertions(+), 0 deletions(-)