
x86/time: use correct (local) time stamp in constant-TSC calibration fast path

Message ID 575976AE02000078000F3695@prv-mh.provo.novell.com (mailing list archive)
State New, archived

Commit Message

Jan Beulich June 9, 2016, 12:01 p.m. UTC
This looks like a copy and paste mistake in commit 1b6a99892d ("x86:
Simpler time handling when TSC is constant across all power saving
states"), responsible for occasional many-microsecond cross-CPU skew of
what NOW() returns.

Also improve the correlation between local TSC and stime stamps
obtained at the end of the two calibration handlers: Compute the stime
one from the TSC one, instead of doing another rdtsc() for that
computation.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
As to 4.7 inclusion: This of course looks like a pretty blatant mistake
that has been there for many years (and hence many releases). There's
certainly non-zero risk that I'm overlooking something here (despite
Joao apparently having come to the same conclusion), so I can't really
make up my mind on whether to request this patch to be put there right
away, or rather have it linger in -unstable for a while.

--- a/xen/arch/x86/time.c
+++ b/xen/arch/x86/time.c
@@ -998,7 +998,7 @@ static void local_time_calibration(void)
         /* Atomically read cpu_calibration struct and write cpu_time struct. */
         local_irq_disable();
         t->local_tsc_stamp    = c->local_tsc_stamp;
-        t->stime_local_stamp  = c->stime_master_stamp;
+        t->stime_local_stamp  = c->stime_local_stamp;
         t->stime_master_stamp = c->stime_master_stamp;
         local_irq_enable();
         update_vcpu_system_time(current);
@@ -1275,7 +1275,7 @@ static void time_calibration_tsc_rendezv
     }
 
     c->local_tsc_stamp = rdtsc();
-    c->stime_local_stamp = get_s_time();
+    c->stime_local_stamp = get_s_time_fixed(c->local_tsc_stamp);
     c->stime_master_stamp = r->master_stime;
 
     raise_softirq(TIME_CALIBRATE_SOFTIRQ);
@@ -1305,7 +1305,7 @@ static void time_calibration_std_rendezv
     }
 
     c->local_tsc_stamp = rdtsc();
-    c->stime_local_stamp = get_s_time();
+    c->stime_local_stamp = get_s_time_fixed(c->local_tsc_stamp);
     c->stime_master_stamp = r->master_stime;
 
     raise_softirq(TIME_CALIBRATE_SOFTIRQ);
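
For context on the helper used in the hunks above: get_s_time_fixed() derives
system time from a caller-supplied TSC value, so the rendezvous can pair the
stime stamp with the TSC value it just read instead of issuing a second
rdtsc(). What follows is a minimal, self-contained sketch of that idea, not
the actual Xen implementation; the cpu_time field names follow the diff,
while the explicit cpu_time argument and the kHz-based scaling are
simplifications assumed for illustration.

#include <stdint.h>
#include <stdio.h>

typedef int64_t s_time_t;            /* system time in nanoseconds */

/* Illustrative per-CPU calibration state; field names follow the diff above. */
struct cpu_time {
    uint64_t local_tsc_stamp;        /* TSC read at the last calibration     */
    s_time_t stime_local_stamp;      /* system time at the last calibration  */
    uint64_t tsc_khz;                /* assumed: TSC frequency in kHz        */
};

/*
 * Sketch of a get_s_time_fixed()-style helper: system time is the stamp
 * taken at the last calibration plus the scaled TSC delta since then.
 * (Real code scales with a shift/mul_frac pair rather than a division.)
 */
static s_time_t get_s_time_fixed(const struct cpu_time *t, uint64_t at_tsc)
{
    uint64_t delta = at_tsc - t->local_tsc_stamp;

    return t->stime_local_stamp + (s_time_t)(delta * 1000000u / t->tsc_khz);
}

/*
 * The rendezvous hunks above use the pattern
 *     c->local_tsc_stamp   = rdtsc();
 *     c->stime_local_stamp = get_s_time_fixed(c->local_tsc_stamp);
 * which keeps the (TSC, stime) pair self-consistent; a plain get_s_time()
 * would read the TSC a second time, a few cycles later, and pair a
 * slightly newer stime with the older TSC stamp.
 */
int main(void)
{
    struct cpu_time t = {
        .local_tsc_stamp   = 1000000,
        .stime_local_stamp = 0,
        .tsc_khz           = 2400000,    /* hypothetical 2.4 GHz TSC */
    };

    /* 2,400,000 cycles after the last calibration => 1,000,000 ns. */
    printf("%lld ns\n", (long long)get_s_time_fixed(&t, 1000000 + 2400000));
    return 0;
}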

Comments

Andrew Cooper June 9, 2016, 12:10 p.m. UTC | #1
On 09/06/16 13:01, Jan Beulich wrote:
> This looks like a copy and paste mistake in commit 1b6a99892d ("x86:
> Simpler time handling when TSC is constant across all power saving
> states"), responsible for occasional many-microsecond cross-CPU skew of
> what NOW() returns.
>
> Also improve the correlation between local TSC and stime stamps
> obtained at the end of the two calibration handlers: Compute the stime
> one from the TSC one, instead of doing another rdtsc() for that
> computation.
>
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>

That does indeed look like a copy/paste mistake.  I would be tempted to
leave this in -unstable for a while.

~Andrew
Joao Martins June 9, 2016, 12:11 p.m. UTC | #2
On 06/09/2016 01:01 PM, Jan Beulich wrote:
> This looks like a copy and paste mistake in commit 1b6a99892d ("x86:
> Simpler time handling when TSC is constant across all power saving
> states"), responsible for occasional many-microsecond cross-CPU skew of
> what NOW() returns.
> 
> Also improve the correlation between local TSC and stime stamps
> obtained at the end of the two calibration handlers: Compute the stime
> one from the TSC one, instead of doing another rdtsc() for that
> computation.
> 
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> ---
> As to 4.7 inclusion: This of course looks like a pretty blatant mistake
> that has been there for many years (and hence many releases). There's
> certainly non-zero risk that I'm overlooking something here (despite
> Joao apparently having come to the same conclusion), so I can't really
> make up my mind on whether to request this patch to be put there right
> away, or rather have it linger in -unstable for a while.
> 
Initially I thought of this as a fix too, but then wouldn't having
t->stime_local_stamp be c->stime_local_stamp render the platform timer reads
done during calibration unused? Unless we changed update_vcpu_system_time to
use stime_master_stamp instead? The stime_master_stamp field isn't used
anywhere other than in the dom0-injected cpu_frequency_change, or when
seeding the cpu_time struct at boot in init_percpu_time (and the already
mentioned use in local_time_calibration)?
init_percpu_time also takes a separate read of the platform timer per CPU;
that value could probably instead be inherited from a single read done on the
boot processor and written on the remaining CPUs, so that all CPUs would start
from the same stamp (a minimal sketch of that idea follows the quoted patch
below).
IOW - this sounds like we are turning stime into something purely TSC-based
except when initially seeding each cpu_time?

> --- a/xen/arch/x86/time.c
> +++ b/xen/arch/x86/time.c
> @@ -998,7 +998,7 @@ static void local_time_calibration(void)
>          /* Atomically read cpu_calibration struct and write cpu_time struct. */
>          local_irq_disable();
>          t->local_tsc_stamp    = c->local_tsc_stamp;
> -        t->stime_local_stamp  = c->stime_master_stamp;
> +        t->stime_local_stamp  = c->stime_local_stamp;
>          t->stime_master_stamp = c->stime_master_stamp;
>          local_irq_enable();
>          update_vcpu_system_time(current);
> @@ -1275,7 +1275,7 @@ static void time_calibration_tsc_rendezv
>      }
>  
>      c->local_tsc_stamp = rdtsc();
> -    c->stime_local_stamp = get_s_time();
> +    c->stime_local_stamp = get_s_time_fixed(c->local_tsc_stamp);
>      c->stime_master_stamp = r->master_stime;
>  
>      raise_softirq(TIME_CALIBRATE_SOFTIRQ);
> @@ -1305,7 +1305,7 @@ static void time_calibration_std_rendezv
>      }
>  
>      c->local_tsc_stamp = rdtsc();
> -    c->stime_local_stamp = get_s_time();
> +    c->stime_local_stamp = get_s_time_fixed(c->local_tsc_stamp);
>      c->stime_master_stamp = r->master_stime;
>  
>      raise_softirq(TIME_CALIBRATE_SOFTIRQ);
> 
> 
>
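
To make the inheritance idea from the comment above concrete, here is a rough
sketch. This is not existing Xen code; the seed variables, the stub timer
reads, and the function names are assumptions made purely for illustration,
and the scheme only makes sense where the TSCs are synchronised across CPUs.

#include <stdint.h>

typedef int64_t s_time_t;

struct cpu_time_seed {
    uint64_t local_tsc_stamp;
    s_time_t stime_local_stamp;
    s_time_t stime_master_stamp;
};

/* Trivial stand-ins so the sketch is self-contained; real code would use
 * rdtsc() and a platform timer read (HPET, ACPI PM timer, PIT). */
static uint64_t read_tsc(void)              { return 123456789ull; }
static s_time_t read_platform_time_ns(void) { return 1000000000;  }

/* One (TSC, stime) pair captured once, on the boot CPU. */
static uint64_t boot_tsc_seed;
static s_time_t boot_stime_seed;

static void capture_boot_time_seed(void)
{
    boot_tsc_seed   = read_tsc();
    boot_stime_seed = read_platform_time_ns();   /* single platform read */
}

/*
 * Secondary CPUs inherit the boot CPU's stamp instead of each taking their
 * own platform-timer read, so every cpu_time starts from the same pair.
 */
static void init_percpu_time_inherited(struct cpu_time_seed *t)
{
    t->local_tsc_stamp    = boot_tsc_seed;
    t->stime_local_stamp  = boot_stime_seed;
    t->stime_master_stamp = boot_stime_seed;
}

int main(void)
{
    struct cpu_time_seed cpu0, cpu1;

    capture_boot_time_seed();
    init_percpu_time_inherited(&cpu0);
    init_percpu_time_inherited(&cpu1);

    /* Both CPUs now share exactly the same starting stamp. */
    return cpu0.stime_local_stamp == cpu1.stime_local_stamp ? 0 : 1;
}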
Wei Liu June 9, 2016, 12:12 p.m. UTC | #3
On Thu, Jun 09, 2016 at 06:01:18AM -0600, Jan Beulich wrote:
> This looks like a copy and paste mistake in commit 1b6a99892d ("x86:
> Simpler time handling when TSC is constant across all power saving
> states"), responsible for occasional many-microsecond cross-CPU skew of
> what NOW() returns.
> 
> Also improve the correlation between local TSC and stime stamps
> obtained at the end of the two calibration handlers: Compute the stime
> one from the TSC one, instead of doing another rdtsc() for that
> computation.
> 
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> ---
> As to 4.7 inclusion: This of course looks like a pretty blatant mistake
> that has been there for many years (and hence many releases). There's
> certainly non-zero risk that I'm overlooking something here (despite
> Joao apparently having come to the same conclusion), so I can't really
> make up my mind on whether to request this patch to be put there right
> away, or rather have it linger in -unstable for a while.
> 

Leave it in unstable for a while please.

I don't want to delay the release any further.

Wei.
Jan Beulich June 9, 2016, 12:57 p.m. UTC | #4
>>> On 09.06.16 at 14:11, <joao.m.martins@oracle.com> wrote:
> On 06/09/2016 01:01 PM, Jan Beulich wrote:
>> This looks like a copy and paste mistake in commit 1b6a99892d ("x86:
>> Simpler time handling when TSC is constant across all power saving
>> states"), responsible for occasional many-microsecond cross-CPU skew of
>> what NOW() returns.
>> 
>> Also improve the correlation between local TSC and stime stamps
>> obtained at the end of the two calibration handlers: Compute the stime
>> one from the TSC one, instead of doing another rdtsc() for that
>> computation.
>> 
>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>> ---
>> As to 4.7 inclusion: This of course looks like a pretty blatant mistake
>> that has been there for many years (and hence many releases). There's
>> certainly non-zero risk that I'm overlooking something here (despite
>> Joao apparently having come to the same conclusion), so I can't really
>> make up my mind on whether to request this patch to be put there right
>> away, or rather have it linger in -unstable for a while.
>> 
> Initially I thought of this as a fix too, but then wouldn't having
> t->stime_local_stamp be c->stime_local_stamp, render no use to the
> platform timer reads done on calibration? Unless we would change
> update_vcpu_system to use stime_master_stamp instead?
> stime_master_stamp field isn't used anywhere other than the dom0 injected
> cpu_frequency_change or when at boot seeding the cpu_time struct on
> init_percpu_time (and the already mentioned use on local_time_calibration) ?
> init_percpu_time also takes a different read of the platform timer per
> cpu and could probably be inherited by a read done on the boot processor
> and written on remaining CPUs, so that all would start from the same stamp.
> IOW - this sounds like time we are turning stime to be totally TSC except
> when initially seeding each cpu_time?

Did you also look at the "slow" path in local_time_calibration()? That
does use the master stamp. So in effect for the fast path the patch
changes the situation from c->stime_local_stamp being effectively
unused to c->stime_master_stamp being so. In the former case, if
that really hadn't been a typo, deleting the write of that field from
time_calibration_std_rendezvous() would have made sense, as
get_s_time() certainly is more overhead than the simple memory
read and write needed for keeping c->stime_master_stamp up to
date (despite not being used).

Jan
Joao Martins June 9, 2016, 3 p.m. UTC | #5
On 06/09/2016 01:57 PM, Jan Beulich wrote:
>>>> On 09.06.16 at 14:11, <joao.m.martins@oracle.com> wrote:
>> On 06/09/2016 01:01 PM, Jan Beulich wrote:
>>> This looks like a copy and paste mistake in commit 1b6a99892d ("x86:
>>> Simpler time handling when TSC is constant across all power saving
>>> states"), responsible for occasional many-microsecond cross-CPU skew of
>>> what NOW() returns.
>>>
>>> Also improve the correlation between local TSC and stime stamps
>>> obtained at the end of the two calibration handlers: Compute the stime
>>> one from the TSC one, instead of doing another rdtsc() for that
>>> computation.
>>>
>>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>>> ---
>>> As to 4.7 inclusion: This of course looks like a pretty blatant mistake
>>> that has been there for many years (and hence many releases). There's
>>> certainly non-zero risk that I'm overlooking something here (despite
>>> Joao apparently having come to the same conclusion), so I can't really
>>> make up my mind on whether to request this patch to be put there right
>>> away, or rather have it linger in -unstable for a while.
>>>
>> Initially I thought of this as a fix too, but then wouldn't having
>> t->stime_local_stamp be c->stime_local_stamp, render no use to the
>> platform timer reads done on calibration? Unless we would change
>> update_vcpu_system to use stime_master_stamp instead?
>> stime_master_stamp field isn't used anywhere other than the dom0 injected
>> cpu_frequency_change or when at boot seeding the cpu_time struct on
>> init_percpu_time (and the already mentioned use on local_time_calibration) ?
>> init_percpu_time also takes a different read of the platform timer per
>> cpu and could probably be inherited by a read done on the boot processor
>> and written on remaining CPUs, so that all would start from the same stamp.
>> IOW - this sounds like time we are turning stime to be totally TSC except
>> when initially seeding each cpu_time?
> 
> Did you also look at the "slow" path in local_time_calibration()? That
> does use the master stamp.
Yeah, I am aware of its usage there - my comment was only about the fast path. I
think the latter is the most common case too.

> So in effect for the fast path the patch
> changes the situation from c->stime_local_stamp being effectively
> unused to c->stime_master_stamp being so. In the former case, if
> that really hadn't been a typo, deleting the write of that field from
> time_calibration_std_rendezvous() would have made sense, as
> get_s_time() certainly is more overhead than the simple memory
> read and write needed for keeping c->stime_master_stamp up to
> date (despite not being used).
I agree, but what I meant previously was more of a concern: CPU 0 is doing an
expensive read_platform_time (e.g. HPET, supposedly on the order of microseconds,
plus a non-contended lock) to set a stime_master_stamp that doesn't get used at all -
effectively not using the clocksource set initially at boot.

What if verify_tsc_reliability clears out X86_FEATURE_TSC_RELIABLE when the warp
test fails? The calibration function is set early on, right after interrupts are
enabled, while the time warp check happens later, when all CPUs are up. So on
problematic platforms it's possible that std_rendezvous is used with a constant TSC
that is still deemed unreliable. We would still keep incrementing deltas at roughly
the same time, but in effect with this change stime_local_stamp would be TSC-only
based, thus leading to warps with an unreliable TSC? And there's also the CPU
hotplug/onlining case that we once discussed.

Joao
Jan Beulich June 9, 2016, 3:52 p.m. UTC | #6
>>> On 09.06.16 at 17:00, <joao.m.martins@oracle.com> wrote:
> On 06/09/2016 01:57 PM, Jan Beulich wrote:
>>>>> On 09.06.16 at 14:11, <joao.m.martins@oracle.com> wrote:
>> So in effect for the fast path the patch
>> changes the situation from c->stime_local_stamp being effectively
>> unused to c->stime_master_stamp being so. In the former case, if
>> that really hadn't been a typo, deleting the write of that field from
>> time_calibration_std_rendezvous() would have made sense, as
>> get_s_time() certainly is more overhead than the simple memory
>> read and write needed for keeping c->stime_master_stamp up to
>> date (despite not being used).
> I agree, but what I meant previously was more of a concern meaning: CPU 0
> is doing an expensive read_platform_time (e.g. HPET supposedly microseconds
> order, plus a non-contended lock) to set stime_master_stamp that doesn't
> get used at all - effectively not using the clocksource set initially at
> boot.

Yeah, there's likely some cleanup potential here, but of course we
need to be pretty careful about not doing something that may be
needed by some code paths but not others. But if you think you
can help the situation without harming maintainability, feel free to
go ahead.

> What if verify_tsc_reliability clears out X86_FEATURE_TSC_RELIABLE when
> failing the warp test? The calibration function is set early on right
> after interrupts are enabled and the time warp check later on when all
> CPUs are up. So on problematic platforms it's possible that
> std_rendezvous is used with a constant TSC but still deemed unreliable.
> We still keep incrementing deltas at roughly about the same time, but in
> effect with this change the stime_local_stamp would be TSC-only based
> thus leading to warps with an unreliable TSC? And there's also the CPU
> hotplug/onlining case that we once discussed.

I agree that we're likely in trouble in such a case. But for the
moment I'd be glad if we could get the "normal" case to work right.

Jan
Joao Martins June 9, 2016, 6:19 p.m. UTC | #7
[changing Dario's address to citrix.com as it was bouncing for me]

On 06/09/2016 04:52 PM, Jan Beulich wrote:
>>>> On 09.06.16 at 17:00, <joao.m.martins@oracle.com> wrote:
>> On 06/09/2016 01:57 PM, Jan Beulich wrote:
>>>>>> On 09.06.16 at 14:11, <joao.m.martins@oracle.com> wrote:
>>> So in effect for the fast path the patch
>>> changes the situation from c->stime_local_stamp being effectively
>>> unused to c->stime_master_stamp being so. In the former case, if
>>> that really hadn't been a typo, deleting the write of that field from
>>> time_calibration_std_rendezvous() would have made sense, as
>>> get_s_time() certainly is more overhead than the simple memory
>>> read and write needed for keeping c->stime_master_stamp up to
>>> date (despite not being used).
>> I agree, but what I meant previously was more of a concern meaning: CPU 0
>> is doing an expensive read_platform_time (e.g. HPET supposedly microseconds
>> order, plus a non-contended lock) to set stime_master_stamp that doesn't
>> get used at all - effectively not using the clocksource set initially at
>> boot.
> 
> Yeah, there's likely some cleanup potential here, but of course we
> need to be pretty careful about not doing something that may be
> needed by some code paths but not others. But if you think you
> can help the situation without harming maintainability, feel free to
> go ahead.
> 
OK, makes sense. I'll likely already do some of it in my related series.

>> What if verify_tsc_reliability clears out X86_FEATURE_TSC_RELIABLE when
>> failing the warp test? The calibration function is set early on right
>> after interrupts are enabled and the time warp check later on when all
>> CPUs are up. So on problematic platforms it's possible that
>> std_rendezvous is used with a constant TSC but still deemed unreliable.
>> We still keep incrementing deltas at roughly about the same time, but in
>> effect with this change the stime_local_stamp would be TSC-only based
>> thus leading to warps with an unreliable TSC? And there's also the CPU
>> hotplug/onlining case that we once discussed.
> 
> I agree that we're likely in trouble in such a case. But for the
> moment I'd be glad if we could get the "normal" case to work right.
> 
OK. Apologies for the noise; I was just pointing out things that I tried, some of
which were also discussed in the PVCLOCK_TSC_STABLE_BIT series, although it hadn't
crossed my mind that Xen's own idea of time could be a little broken. IMO adding
another clocksource for TSC would be more correct if we are only using the TSC
(with its associated limitations made explicit to the user), rather than riding on
the back of whatever other clocksource is in use. But it wouldn't cover the normal
case :( unless set manually.

NB: Guests, on the other hand, aren't affected, since they take care of keeping the
latest stamp when different vCPUs slightly diverge.

Joao
Jan Beulich June 10, 2016, 6:59 a.m. UTC | #8
>>> On 09.06.16 at 20:19, <joao.m.martins@oracle.com> wrote:
> [changing Dario address to citrix.com as it was bouncing for me ]
> 
> On 06/09/2016 04:52 PM, Jan Beulich wrote:
>>>>> On 09.06.16 at 17:00, <joao.m.martins@oracle.com> wrote:
>>> On 06/09/2016 01:57 PM, Jan Beulich wrote:
>>>>>>> On 09.06.16 at 14:11, <joao.m.martins@oracle.com> wrote:
>>>> So in effect for the fast path the patch
>>>> changes the situation from c->stime_local_stamp being effectively
>>>> unused to c->stime_master_stamp being so. In the former case, if
>>>> that really hadn't been a typo, deleting the write of that field from
>>>> time_calibration_std_rendezvous() would have made sense, as
>>>> get_s_time() certainly is more overhead than the simple memory
>>>> read and write needed for keeping c->stime_master_stamp up to
>>>> date (despite not being used).
>>> I agree, but what I meant previously was more of a concern meaning: CPU 0
>>> is doing an expensive read_platform_time (e.g. HPET supposedly microseconds
>>> order, plus a non-contended lock) to set stime_master_stamp that doesn't
>>> get used at all - effectively not using the clocksource set initially at
>>> boot.
>> 
>> Yeah, there's likely some cleanup potential here, but of course we
>> need to be pretty careful about not doing something that may be
>> needed by some code paths but not others. But if you think you
>> can help the situation without harming maintainability, feel free to
>> go ahead.
>> 
> OK, makes sense. I'll likely already do some of it in my related series.

Actually, since local time gets seeded from platform time in
init_percpu_time(), I don't think we can do away with
maintaining platform time.

Jan
Jan Beulich June 10, 2016, 9:29 a.m. UTC | #9
>>> On 10.06.16 at 08:59, <JBeulich@suse.com> wrote:
> Actually, since local time gets seeded from platform time in
> init_percpu_time(), I don't think we can do away with
> maintaining platform time.

And it looks like this seeding is where many of the remaining backwards
deltas are coming from: while on local_time_calibration()'s slow path
we use the platform time as reference (and hence the seeding is fine),
on the fast path we don't, so using the platform time in
init_percpu_time() to initialize local stime introduces a discontinuity.

Jan
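
A small worked example of the discontinuity described above, with purely
hypothetical numbers: if the platform timer and the TSC-derived local time
have drifted apart by the time a secondary CPU is brought up, seeding that
CPU's local stime from the platform time makes its NOW() start offset from
the already-calibrated CPUs, and the TSC-only fast path never pulls it back
towards the platform reference.

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    /* Hypothetical state at the moment a secondary CPU comes online. */
    int64_t tsc_derived_now = 5000000000LL;         /* ns, NOW() on calibrated CPUs */
    int64_t platform_now    = 5000000000LL - 3000;  /* ns, platform timer lags 3us  */

    /* init_percpu_time() seeds the new CPU's local stime from platform time... */
    int64_t new_cpu_stime_seed = platform_now;

    /*
     * ...so a reader that samples NOW() on a calibrated CPU and then on the
     * freshly seeded one can see time step backwards by the drift amount.
     */
    printf("apparent cross-CPU backwards step: %lld ns\n",
           (long long)(tsc_derived_now - new_cpu_stime_seed));
    return 0;
}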
Joao Martins June 10, 2016, 5:07 p.m. UTC | #10
On 06/10/2016 10:29 AM, Jan Beulich wrote:
>>>> On 10.06.16 at 08:59, <JBeulich@suse.com> wrote:
>> Actually, since local time gets seeded from platform time in
>> init_percpu_time(), I don't think we can do away with
>> maintaining platform time.
>
Yeah, I agree. But my previous message was aimed at the improvement
potential of the rendezvous after this patch (for the fast path). Platform
timer overflow could be another example, though probably a bit borderline,
as certain clocksources have very short intervals.

> And it looks like this seeding is where much of the remaining backwards
> deltas are coming from: While on local_time_calibration()'s slow path
> we use the platform time as reference (and hence the seeding is fine),
> on the fast path we don't, and hence using the platform time in
> init_percpu_time() to initialize local stime introduces a discontinuity.

I'll also do some testing on your patch that addresses this on Monday (today
is a holiday in my country).

Joao