sched/cputime: add steal clock warps handling during cpu hotplug

Message ID 1464868639-8924-1-git-send-email-wanpeng.li@hotmail.com (mailing list archive)
State New, archived

Commit Message

Wanpeng Li June 2, 2016, 11:57 a.m. UTC
From: Wanpeng Li <wanpeng.li@hotmail.com>

I observed that st sometimes spikes to 100% for an instant and idle then
reads 100%, even though there is a cpu hog on the guest cpu, after the
cpu comes back from hotplug (N.B. both guest and host run the latest
4.7-rc1; this cannot always be readily reproduced). I added a trace to
capture it, shown below:

cpuhp/1-12    [001] d.h1   167.461657: account_process_tick: steal = 1291385514, prev_steal_time = 0         
cpuhp/1-12    [001] d.h1   167.461659: account_process_tick: steal_jiffies = 1291          
<idle>-0     [001] d.h1   167.462663: account_process_tick: steal = 18732255, prev_steal_time = 1291000000          
<idle>-0     [001] d.h1   167.462664: account_process_tick: steal_jiffies = 18446744072437

The steal clock warps and steal_jiffies then overflows. This patch aligns
prev_steal_time to the new steal clock timestamp to avoid the overflow, so
that steal time accounting can continue to work.
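
The numbers in the trace can be reproduced with plain unsigned 64-bit
arithmetic. What follows is a minimal user-space sketch, not the kernel
code: model_steal_tick() and struct steal_model are hypothetical names, and
the figure of 1,000,000 ns per jiffy (HZ=1000) is an assumption inferred
from the trace, where a steal clock of 1291385514 ns yields 1291 jiffies.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: HZ=1000, i.e. 1,000,000 ns per jiffy (inferred from the trace). */
#define NSEC_PER_JIFFY_ASSUMED UINT64_C(1000000)

/* Hypothetical stand-in for the per-runqueue state used by the accounting. */
struct steal_model {
	uint64_t prev_steal_time;	/* ns already accounted */
};

/*
 * Model of accounting without a warp check: subtract unconditionally,
 * convert the delta to whole jiffies, and advance prev_steal_time by
 * that many jiffies' worth of nanoseconds.
 */
static uint64_t model_steal_tick(struct steal_model *m, uint64_t steal_clock_ns)
{
	uint64_t delta, jiffies;

	assert(m != NULL);

	/* Wraps around modulo 2^64 if the steal clock went backwards. */
	delta = steal_clock_ns - m->prev_steal_time;
	jiffies = delta / NSEC_PER_JIFFY_ASSUMED;

	m->prev_steal_time += jiffies * NSEC_PER_JIFFY_ASSUMED;
	return jiffies;
}
```

Feeding the two clock readings from the trace (1291385514, then 18732255)
into this model, starting from prev_steal_time = 0, gives 1291 jiffies and
then 18446744072437 jiffies, the same values the trace reports. The second
value is (2^64 - 1272267745) / 1000000 rounded down.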

Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim <rkrcmar@redhat.com>
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
---
 kernel/sched/cputime.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

Comments

Peter Zijlstra June 2, 2016, noon UTC | #1
On Thu, Jun 02, 2016 at 07:57:19PM +0800, Wanpeng Li wrote:
> From: Wanpeng Li <wanpeng.li@hotmail.com>
> 
> I observed that sometimes st is 100% instantaneous, then idle is 100% 
> even if there is a cpu hog on the guest cpu after the cpu hotplug comes 
> back(N.B. both guest and host are latest 4.7-rc1, this can not always 
> be readily reproduced). I add trace to capture it as below:
> 
> cpuhp/1-12    [001] d.h1   167.461657: account_process_tick: steal = 1291385514, prev_steal_time = 0         
> cpuhp/1-12    [001] d.h1   167.461659: account_process_tick: steal_jiffies = 1291          
> <idle>-0     [001] d.h1   167.462663: account_process_tick: steal = 18732255, prev_steal_time = 1291000000          
> <idle>-0     [001] d.h1   167.462664: account_process_tick: steal_jiffies = 18446744072437
> 
> The steal clock warps and then steal_jiffies overflow, this patch align 
> prev_steal_time to the new steal clock timestamp, in order to avoid 
> overflow and st stuff can continue to work.

I would rather suggest fixing the steal clock thing to not jump like
that; is that at all possible?
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Rik van Riel June 2, 2016, 1:59 p.m. UTC | #2
On Thu, 2016-06-02 at 14:00 +0200, Peter Zijlstra wrote:
> On Thu, Jun 02, 2016 at 07:57:19PM +0800, Wanpeng Li wrote:
> > 
> > From: Wanpeng Li <wanpeng.li@hotmail.com>
> > 
> > I observed that sometimes st is 100% instantaneous, then idle is
> > 100% 
> > even if there is a cpu hog on the guest cpu after the cpu hotplug
> > comes 
> > back(N.B. both guest and host are latest 4.7-rc1, this can not
> > always 
> > be readily reproduced). I add trace to capture it as below:
> > 
> > cpuhp/1-12    [001] d.h1   167.461657: account_process_tick: steal
> > = 1291385514, prev_steal_time = 0         
> > cpuhp/1-12    [001] d.h1   167.461659: account_process_tick:
> > steal_jiffies = 1291          
> > <idle>-0     [001] d.h1   167.462663: account_process_tick: steal =
> > 18732255, prev_steal_time = 1291000000          
> > <idle>-0     [001] d.h1   167.462664: account_process_tick:
> > steal_jiffies = 18446744072437
> > 
> > The steal clock warps and then steal_jiffies overflow, this patch
> > align 
> > prev_steal_time to the new steal clock timestamp, in order to
> > avoid 
> > overflow and st stuff can continue to work.
> I would rather suggest fixing the steal clock thing to not jump like
> that; is that at all possible?

Not always possible, I suspect.

If a guest is saved to disk and later restored (eg. after
a host reboot), or live migrated to another host, I would
expect to get totally disjoint steal time statistics from
the "new run" of the guest (which is the same run of the
guest OS).

In fact, this code may also need to deal with the case
where steal time suddenly increases by a ludicrous amount,
and ignore those events, too.

A safe threshold might be to only apply steal times that
are positive and smaller than one second (as long as nohz_full
has the one second timer tick left), ignoring intervals that
are negative or longer than a second, and using those to sync
up the guest with the host.
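
The threshold described above could look roughly like the following. It is
a user-space sketch under stated assumptions, not a tested kernel patch:
filter_steal_delta(), struct steal_state and STEAL_MAX_NS are hypothetical
names, a delta of exactly zero is treated as nothing to account, and the
sketch keeps nanosecond granularity instead of converting to jiffies.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* One second in nanoseconds: the upper bound suggested above. */
#define STEAL_MAX_NS UINT64_C(1000000000)

/* Hypothetical stand-in for the per-CPU bookkeeping. */
struct steal_state {
	uint64_t prev_steal_ns;	/* steal clock value last accounted or synced to */
};

/*
 * Returns true and stores the interval in *delta_ns when it is positive
 * and shorter than one second. Otherwise (the clock went backwards, did
 * not move, or jumped by a second or more) the sample is ignored and the
 * state is resynchronised to the new clock value.
 */
static bool filter_steal_delta(struct steal_state *s, uint64_t steal_clock_ns,
			       uint64_t *delta_ns)
{
	assert(s != NULL && delta_ns != NULL);

	if (steal_clock_ns > s->prev_steal_ns &&
	    steal_clock_ns - s->prev_steal_ns < STEAL_MAX_NS) {
		*delta_ns = steal_clock_ns - s->prev_steal_ns;
		s->prev_steal_ns = steal_clock_ns;
		return true;
	}

	/* Warp or implausible jump: resync and account nothing. */
	s->prev_steal_ns = steal_clock_ns;
	*delta_ns = 0;
	return false;
}
```

With this shape, both a backwards warp and a sudden huge forward jump end
up on the same resync path, so neither can feed a bogus interval into the
steal time statistics.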
Wanpeng Li June 3, 2016, 5:34 a.m. UTC | #3
2016-06-02 21:59 GMT+08:00 Rik van Riel <riel@redhat.com>:
> On Thu, 2016-06-02 at 14:00 +0200, Peter Zijlstra wrote:
>> On Thu, Jun 02, 2016 at 07:57:19PM +0800, Wanpeng Li wrote:
>> >
>> > From: Wanpeng Li <wanpeng.li@hotmail.com>
>> >
>> > I observed that sometimes st is 100% instantaneous, then idle is
>> > 100%
>> > even if there is a cpu hog on the guest cpu after the cpu hotplug
>> > comes
>> > back(N.B. both guest and host are latest 4.7-rc1, this can not
>> > always
>> > be readily reproduced). I add trace to capture it as below:
>> >
>> > cpuhp/1-12    [001] d.h1   167.461657: account_process_tick: steal
>> > = 1291385514, prev_steal_time = 0
>> > cpuhp/1-12    [001] d.h1   167.461659: account_process_tick:
>> > steal_jiffies = 1291
>> > <idle>-0     [001] d.h1   167.462663: account_process_tick: steal =
>> > 18732255, prev_steal_time = 1291000000
>> > <idle>-0     [001] d.h1   167.462664: account_process_tick:
>> > steal_jiffies = 18446744072437
>> >
>> > The steal clock warps and then steal_jiffies overflow, this patch
>> > align
>> > prev_steal_time to the new steal clock timestamp, in order to
>> > avoid
>> > overflow and st stuff can continue to work.
>> I would rather suggest fixing the steal clock thing to not jump like
>> that; is that at all possible?
>
> Not always possible, I suspect.
>
> If a guest is saved to disk and later restored (eg. after
> a host reboot), or live migrated to another host, I would
> expect to get totally disjoint steal time statistics from
> the "new run" of the guest (which is the same run of the
> guest OS).
>
> In fact, this code may also need to deal with the case
> where steal time suddenly increases by a ludicrous amount,
> and ignore those events, too.
>
> A safe threshold might be to only apply steal times that
> are positive and smaller than one second (as long as nohz_full
> has the one second timer tick left), ignoring intervals that
> are negative or longer than a second, and using those to sync
> up the guest with the host.

Good point, thanks for your review, Rik. :) I am sending out v2 to do that.

Regards,
Wanpeng Li

Patch

diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 75f98c5..d0eebc3 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -265,7 +265,13 @@  static __always_inline bool steal_account_process_tick(void)
 		unsigned long steal_jiffies;
 
 		steal = paravirt_steal_clock(smp_processor_id());
-		steal -= this_rq()->prev_steal_time;
+		if (likely(steal > this_rq()->prev_steal_time))
+			steal -= this_rq()->prev_steal_time;
+		else {
+			/* steal clock warp */
+			this_rq()->prev_steal_time = steal;
+			return false;
+		}
 
 		/*
 		 * steal is in nsecs but our caller is expecting steal