Message ID | 65c123fe-c8e7-b9cf-4dea-904bf28170a7@suse.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | x86: further improve timer freq calibration accuracy | expand |
On Mon, Feb 14, 2022 at 10:24:49AM +0100, Jan Beulich wrote: > Calibration logic assumes that the platform timer (HPET or ACPI PM > timer) and the TSC are read at about the same time. This assumption may > not hold when a long latency event (e.g. SMI or NMI) occurs between the > two reads. Reduce the risk of reading uncorrelated values by doing at > least four pairs of reads, using the tuple where the delta between the > enclosing TSC reads was smallest. From the fourth iteration onwards bail > if the new TSC delta isn't better (smaller) than the best earlier one. > > Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> > --- > When running virtualized, scheduling in the host would also constitute > long latency events. I wonder whether, to compensate for that, we'd want > more than 3 "base" iterations, as I would expect scheduling events to > occur more frequently than e.g. SMI (and with a higher probability of > multiple ones occurring in close succession). That's hard to tell, maybe we should make the base iteration count settable from the command line? > --- > v3: Fix 24-bit PM timer wrapping between the two read_pt_and_tsc() > invocations. > v2: Use helper functions to fold duplicate code. > > --- a/xen/arch/x86/time.c > +++ b/xen/arch/x86/time.c > @@ -287,9 +287,47 @@ static char *freq_string(u64 freq) > return s; > } > > -static uint64_t adjust_elapsed(uint64_t elapsed, uint32_t actual, > - uint32_t target) > +static uint32_t __init read_pt_and_tsc(uint64_t *tsc, > + const struct platform_timesource *pts) > { > + uint64_t tsc_prev = *tsc = rdtsc_ordered(), tsc_min = ~0; > + uint32_t best = best; > + unsigned int i; > + > + for ( i = 0; ; ++i ) > + { > + uint32_t pt = pts->read_counter(); > + uint64_t tsc_cur = rdtsc_ordered(); > + uint64_t tsc_delta = tsc_cur - tsc_prev; > + > + if ( tsc_delta < tsc_min ) > + { > + tsc_min = tsc_delta; > + *tsc = tsc_cur; > + best = pt; > + } > + else if ( i > 2 ) > + break; > + > + tsc_prev = tsc_cur; > + } > + > + return best; > +} > + > +static uint64_t __init calibrate_tsc(const struct platform_timesource *pts) > +{ > + uint64_t start, end, elapsed; > + unsigned int count = read_pt_and_tsc(&start, pts); > + unsigned int target = CALIBRATE_VALUE(pts->frequency), actual; > + unsigned int mask = (uint32_t)~0 >> (32 - pts->counter_bits); Just to be on the safe side you might want to add an assert that counter_bits <= 32. Thanks, Roger.
On 11.03.2022 13:03, Roger Pau Monné wrote: > On Mon, Feb 14, 2022 at 10:24:49AM +0100, Jan Beulich wrote: >> Calibration logic assumes that the platform timer (HPET or ACPI PM >> timer) and the TSC are read at about the same time. This assumption may >> not hold when a long latency event (e.g. SMI or NMI) occurs between the >> two reads. Reduce the risk of reading uncorrelated values by doing at >> least four pairs of reads, using the tuple where the delta between the >> enclosing TSC reads was smallest. From the fourth iteration onwards bail >> if the new TSC delta isn't better (smaller) than the best earlier one. >> >> Signed-off-by: Jan Beulich <jbeulich@suse.com> > > Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> Thanks. >> --- >> When running virtualized, scheduling in the host would also constitute >> long latency events. I wonder whether, to compensate for that, we'd want >> more than 3 "base" iterations, as I would expect scheduling events to >> occur more frequently than e.g. SMI (and with a higher probability of >> multiple ones occurring in close succession). > > That's hard to tell, maybe we should make the base iteration count > settable from the command line? As a last resort (if people observe problems) - maybe. It's not clear to me though on what basis an admin would choose another value. Jan
--- a/xen/arch/x86/time.c +++ b/xen/arch/x86/time.c @@ -287,9 +287,47 @@ static char *freq_string(u64 freq) return s; } -static uint64_t adjust_elapsed(uint64_t elapsed, uint32_t actual, - uint32_t target) +static uint32_t __init read_pt_and_tsc(uint64_t *tsc, + const struct platform_timesource *pts) { + uint64_t tsc_prev = *tsc = rdtsc_ordered(), tsc_min = ~0; + uint32_t best = best; + unsigned int i; + + for ( i = 0; ; ++i ) + { + uint32_t pt = pts->read_counter(); + uint64_t tsc_cur = rdtsc_ordered(); + uint64_t tsc_delta = tsc_cur - tsc_prev; + + if ( tsc_delta < tsc_min ) + { + tsc_min = tsc_delta; + *tsc = tsc_cur; + best = pt; + } + else if ( i > 2 ) + break; + + tsc_prev = tsc_cur; + } + + return best; +} + +static uint64_t __init calibrate_tsc(const struct platform_timesource *pts) +{ + uint64_t start, end, elapsed; + unsigned int count = read_pt_and_tsc(&start, pts); + unsigned int target = CALIBRATE_VALUE(pts->frequency), actual; + unsigned int mask = (uint32_t)~0 >> (32 - pts->counter_bits); + + while ( ((pts->read_counter() - count) & mask) < target ) + continue; + + actual = (read_pt_and_tsc(&end, pts) - count) & mask; + elapsed = end - start; + if ( likely(actual > target) ) { /* @@ -395,8 +433,7 @@ static u64 read_hpet_count(void) static int64_t __init init_hpet(struct platform_timesource *pts) { - uint64_t hpet_rate, start; - uint32_t count, target, elapsed; + uint64_t hpet_rate; /* * Allow HPET to be setup, but report a frequency of 0 so it's not selected * as a timer source. This is required so it can be used in legacy @@ -467,13 +504,7 @@ static int64_t __init init_hpet(struct p pts->frequency = hpet_rate; - count = hpet_read32(HPET_COUNTER); - start = rdtsc_ordered(); - target = CALIBRATE_VALUE(hpet_rate); - while ( (elapsed = hpet_read32(HPET_COUNTER) - count) < target ) - continue; - - return adjust_elapsed(rdtsc_ordered() - start, elapsed, target); + return calibrate_tsc(pts); } static void resume_hpet(struct platform_timesource *pts) @@ -508,22 +539,12 @@ static u64 read_pmtimer_count(void) static s64 __init init_pmtimer(struct platform_timesource *pts) { - uint64_t start; - uint32_t count, target, mask, elapsed; - if ( !pmtmr_ioport || (pmtmr_width != 24 && pmtmr_width != 32) ) return 0; pts->counter_bits = pmtmr_width; - mask = 0xffffffff >> (32 - pmtmr_width); - - count = inl(pmtmr_ioport); - start = rdtsc_ordered(); - target = CALIBRATE_VALUE(ACPI_PM_FREQUENCY); - while ( (elapsed = (inl(pmtmr_ioport) - count) & mask) < target ) - continue; - return adjust_elapsed(rdtsc_ordered() - start, elapsed, target); + return calibrate_tsc(pts); } static struct platform_timesource __initdata plt_pmtimer =
Calibration logic assumes that the platform timer (HPET or ACPI PM timer) and the TSC are read at about the same time. This assumption may not hold when a long latency event (e.g. SMI or NMI) occurs between the two reads. Reduce the risk of reading uncorrelated values by doing at least four pairs of reads, using the tuple where the delta between the enclosing TSC reads was smallest. From the fourth iteration onwards bail if the new TSC delta isn't better (smaller) than the best earlier one. Signed-off-by: Jan Beulich <jbeulich@suse.com> --- When running virtualized, scheduling in the host would also constitute long latency events. I wonder whether, to compensate for that, we'd want more than 3 "base" iterations, as I would expect scheduling events to occur more frequently than e.g. SMI (and with a higher probability of multiple ones occurring in close succession). --- v3: Fix 24-bit PM timer wrapping between the two read_pt_and_tsc() invocations. v2: Use helper functions to fold duplicate code.