Message ID | d0f1f249-293c-5a7f-4b6c-1caeb275e7b9@suse.com |
---|---|
State | Superseded |
Series | x86/time: calibration rendezvous adjustments |
On 29.01.2021 17:20, Jan Beulich wrote:
> @@ -1696,6 +1696,21 @@ static void time_calibration_tsc_rendezv
>                  r->master_stime = read_platform_stime(NULL);
>                  r->master_tsc_stamp = rdtsc_ordered();
>              }
> +            else if ( r->master_tsc_stamp < r->max_tsc_stamp )
> +            {
> +                /*
> +                 * We want to avoid moving the TSC backwards for any CPU.
> +                 * Use the largest value observed anywhere on the first
> +                 * iteration and bump up our previously recorded system
> +                 * time accordingly.
> +                 */
> +                uint64_t delta = r->max_tsc_stamp - r->master_tsc_stamp;
> +
> +                r->master_stime += scale_delta(delta,
> +                                               &this_cpu(cpu_time).tsc_scale);
> +                r->master_tsc_stamp = r->max_tsc_stamp;
> +            }

I went too far here - adjusting ->master_stime like this is a mistake.
Especially in extreme cases like Claudemir's, this can lead to the
read_platform_stime() visible in context above reading a value behind
the previously recorded one, making NOW() move backwards (temporarily).

Instead, I think I will want to move the call to read_platform_stime()
to the last iteration, such that the gap between the point in time when
it is taken and the point in time the TSCs start counting from their
new values gets minimized. I intend that change to also do away with
the unnecessary reading back of the TSC in
time_calibration_rendezvous_tail() - we already know the closest TSC
value we can get hold of (without calculations), namely the one we
wrote a few cycles back.

Jan
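For concreteness, a minimal sketch of the direction described above - an
assumed shape only, not the actual follow-up patch; it borrows the names
from the hunks below and collapses the per-CPU branches of
time_calibration_tsc_rendezvous():

    /*
     * Sketch only (assumed shape of the planned rework): take the
     * platform timestamp on the final iteration, right before the new
     * TSC values get written, so the gap between the two is minimal
     * and no TSC read-back is needed afterwards in
     * time_calibration_rendezvous_tail().
     */
    for ( i = 4; i >= 0; i-- )
    {
        /* ... rendezvous and max_tsc_stamp collection as before ... */

        if ( i == 0 )
        {
            if ( smp_processor_id() == 0 )
                r->master_stime = read_platform_stime(NULL);
            write_tsc(r->master_tsc_stamp);
        }
    }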
--- a/xen/arch/x86/time.c
+++ b/xen/arch/x86/time.c
@@ -1658,7 +1658,7 @@ struct calibration_rendezvous {
     cpumask_t cpu_calibration_map;
     atomic_t semaphore;
     s_time_t master_stime;
-    u64 master_tsc_stamp;
+    uint64_t master_tsc_stamp, max_tsc_stamp;
 };
 
 static void
@@ -1696,6 +1696,21 @@ static void time_calibration_tsc_rendezv
                 r->master_stime = read_platform_stime(NULL);
                 r->master_tsc_stamp = rdtsc_ordered();
             }
+            else if ( r->master_tsc_stamp < r->max_tsc_stamp )
+            {
+                /*
+                 * We want to avoid moving the TSC backwards for any CPU.
+                 * Use the largest value observed anywhere on the first
+                 * iteration and bump up our previously recorded system
+                 * time accordingly.
+                 */
+                uint64_t delta = r->max_tsc_stamp - r->master_tsc_stamp;
+
+                r->master_stime += scale_delta(delta,
+                                               &this_cpu(cpu_time).tsc_scale);
+                r->master_tsc_stamp = r->max_tsc_stamp;
+            }
+
             atomic_inc(&r->semaphore);
 
             if ( i == 0 )
@@ -1711,6 +1726,17 @@ static void time_calibration_tsc_rendezv
             while ( atomic_read(&r->semaphore) < total_cpus )
                 cpu_relax();
 
+            if ( _r )
+            {
+                uint64_t tsc = rdtsc_ordered(), cur;
+
+                while ( tsc > (cur = r->max_tsc_stamp) )
+                    if ( cmpxchg(&r->max_tsc_stamp, cur, tsc) == cur )
+                        break;
+
+                _r = NULL;
+            }
+
             if ( i == 0 )
                 write_tsc(r->master_tsc_stamp);
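The if ( _r ) block in the last hunk is a lock-free maximum: each AP
publishes its first-iteration TSC reading into max_tsc_stamp through a
cmpxchg() retry loop, and NULLs its copy of the _r argument so the
update happens on the first iteration only. The same idiom in
standalone, runnable form, using C11 atomics in place of Xen's
cmpxchg() (illustration only, not Xen code):

    #include <inttypes.h>
    #include <stdatomic.h>
    #include <stdint.h>
    #include <stdio.h>

    static _Atomic uint64_t max_tsc_stamp;

    /* Record tsc into max_tsc_stamp unless a larger value is there. */
    static void note_tsc(uint64_t tsc)
    {
        uint64_t cur = atomic_load(&max_tsc_stamp);

        /* A failed CAS reloads cur, so the bound gets re-checked. */
        while ( tsc > cur &&
                !atomic_compare_exchange_weak(&max_tsc_stamp, &cur, tsc) )
            ;
    }

    int main(void)
    {
        note_tsc(100);
        note_tsc(42);   /* smaller value: leaves the maximum untouched */
        note_tsc(250);
        printf("max = %" PRIu64 "\n", atomic_load(&max_tsc_stamp));
        return 0;
    }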
While doing this for small amounts may be okay, the unconditional use
of CPU0's value here has been found to be a problem when the boot time
TSC of the BSP was behind that of all APs by more than a second. In
particular, because get_s_time_fixed() produces insane output when the
calculated delta is negative, we can't allow this to happen.

On the first iteration have all other CPUs sort out the highest TSC
value any one of them has read. On the second iteration, if that
maximum is higher than CPU0's, update CPU0's recorded value from the
one taken in the first iteration, along with the system time. Use the
resulting value on the last iteration to write everyone's TSCs.

Reported-by: Claudemir Todo Bom <claudemir@todobom.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
Since CPU0 reads its TSC last on the first iteration, if TSCs were
perfectly sync-ed there shouldn't ever be a need to update. However,
even on the TSC-reliable system I first tested this on (using
"tsc=skewed" to get this rendezvous function into use in the first
place), updates by up to several thousand clocks did happen. I wonder
whether this points at some problem with the approach that I'm not
(yet) seeing.

Considering the sufficiently modern CPU the affected system is using,
I suspect it wouldn't even need to turn off TSC_RELIABLE, if only
there wasn't the boot time skew. Hence another approach might be to
fix this boot time skew. Of course, to recognize whether the TSCs then
still aren't in sync, we'd need to run tsc_check_reliability()
sufficiently long after that adjustment. The above and the desire to
have the change tested by the reporter are the reasons for the RFC.

As per the comment ahead of it, the original purpose of the function
was to deal with TSCs halted in deep C states. While this probably
explains why only forward moves were ever expected, I don't see how
this could have been reliable in case CPU0 was deep-sleeping for a
sufficiently long time. My only guess here is a hidden assumption of
CPU0 never being idle for long enough.
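The get_s_time_fixed() hazard mentioned at the top comes down to plain
unsigned wraparound: the TSC delta there is computed in a 64-bit
unsigned type, so a TSC reading behind the recorded stamp becomes a
huge positive delta rather than a negative one, and the scaled result
is a wildly wrong system time. A standalone illustration with made-up
numbers (not Xen code):

    #include <inttypes.h>
    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        uint64_t stamp = 3000000000ULL; /* recorded TSC stamp */
        uint64_t tsc   = 1000000000ULL; /* a CPU whose TSC is behind */
        uint64_t delta = tsc - stamp;   /* wraps to ~1.8e19, not -2e9 */

        printf("delta = %" PRIu64 " (instead of -2000000000)\n", delta);
        return 0;
    }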