Message ID: 26b71f94-d1c7-d906-5b2a-4e7994d6f7c0@suse.com
State: Superseded
Series: x86/time: calibration rendezvous adjustments
On Mon, Feb 01, 2021 at 01:43:04PM +0100, Jan Beulich wrote:
> The (stime,tsc) tuple is the basis for extrapolation by get_s_time().
> Therefore the two better get taken as close to one another as possible.
> This means two things: First, reading platform time is too early when
> done on the first iteration. The closest we can get is on the last
> iteration, immediately before telling other CPUs to write their TSCs
> (and then also writing CPU0's). While at first glance it may seem not
> overly relevant when exactly platform time is read (when assuming that
> only stime is ever relevant anywhere, and hence the association with
> the precise TSC values is of lower interest), both CPU frequency
> changes and the effects of SMT make it unpredictable (between
> individual rendezvous instances) how long the loop iterations will
> take. This will in turn lead to a higher error than necessary in how
> close to linear stime movement we can get.
>
> Second, re-reading the TSC for local recording increases the overall
> error as well, when we already know a more precise value - the one
> just written.
>
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

Acked-by: Roger Pau Monné <roger.pau@citrix.com>

I've been thinking this all seems doomed when Xen runs in a virtualized
environment, and should likely be disabled. There's no point in trying
to sync the TSC over multiple vCPUs, as the scheduling delay between
them will likely skew any calculations.

Thanks,
Roger.
On 05.02.2021 17:15, Roger Pau Monné wrote:
> On Mon, Feb 01, 2021 at 01:43:04PM +0100, Jan Beulich wrote:
>> [...]
>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>
> Acked-by: Roger Pau Monné <roger.pau@citrix.com>

Thanks.

> I've been thinking this all seems doomed when Xen runs in a virtualized
> environment, and should likely be disabled. There's no point in trying
> to sync the TSC over multiple vCPUs, as the scheduling delay between
> them will likely skew any calculations.

We may want to consider forcing the equivalent of "clocksource=tsc" in
that case. Otoh a well behaved hypervisor underneath shouldn't lead to
us finding a need to clear TSC_RELIABLE, at which point this logic
wouldn't get engaged in the first place.

Jan
On Mon, Feb 08, 2021 at 11:56:01AM +0100, Jan Beulich wrote:
> On 05.02.2021 17:15, Roger Pau Monné wrote:
> > [...]
> > I've been thinking this all seems doomed when Xen runs in a virtualized
> > environment, and should likely be disabled. There's no point in trying
> > to sync the TSC over multiple vCPUs, as the scheduling delay between
> > them will likely skew any calculations.
>
> We may want to consider forcing the equivalent of "clocksource=tsc" in
> that case. Otoh a well behaved hypervisor underneath shouldn't lead to
> us finding a need to clear TSC_RELIABLE, at which point this logic
> wouldn't get engaged in the first place.

I got the impression that on a loaded system guests with a non-trivial
amount of vCPUs might have trouble getting them all scheduled close
enough together for the rendezvous not to report a big skew, and thus
disable TSC_RELIABLE?

Thanks,
Roger.
On 08.02.2021 12:05, Roger Pau Monné wrote:
> On Mon, Feb 08, 2021 at 11:56:01AM +0100, Jan Beulich wrote:
>> On 05.02.2021 17:15, Roger Pau Monné wrote:
>>> I've been thinking this all seems doomed when Xen runs in a virtualized
>>> environment, and should likely be disabled. There's no point in trying
>>> to sync the TSC over multiple vCPUs, as the scheduling delay between
>>> them will likely skew any calculations.
>>
>> We may want to consider forcing the equivalent of "clocksource=tsc" in
>> that case. Otoh a well behaved hypervisor underneath shouldn't lead to
>> us finding a need to clear TSC_RELIABLE, at which point this logic
>> wouldn't get engaged in the first place.
>
> I got the impression that on a loaded system guests with a non-trivial
> amount of vCPUs might have trouble getting them all scheduled close
> enough together for the rendezvous not to report a big skew, and thus
> disable TSC_RELIABLE?

No, check_tsc_warp() / tsc_check_reliability() don't have a problem
there. Every CPU reads the shared "most advanced" stamp before reading
its local one, so it doesn't matter how large the gaps are. In fact the
possible bad effect is the other way around - if the scheduling effects
are too heavy, we may mistakenly consider TSCs reliable when they
aren't.

A problem of the kind you describe exists in the actual rendezvous
function. And actually any problem of this kind can, on a smaller
scale, already be observed with SMT, because the individual
hyperthreads of a core can't possibly all run at the same time.

As occurs to me only now, I think we can improve accuracy some (in
particular on big systems) by making sure struct calibration_rendezvous's
master_tsc_stamp is not sharing its cache line with semaphore and
master_stime. The latter get written by (at least) the BSP, while
master_tsc_stamp is stable after the 2nd loop iteration. Hence on the
3rd and 4th iterations we could even prefetch it to reduce the delay on
the last one.

Jan
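For illustration, the warp check works roughly like this - a minimal
sketch with simplified names and locking, not the actual Xen code:

/*
 * Sketch of the check_tsc_warp() idea: under a lock, read the most
 * advanced stamp published so far, then take the local TSC reading.
 * A local reading behind an already published stamp proves the TSCs
 * disagree, no matter how far apart in time the CPUs get scheduled;
 * large scheduling gaps can only hide a warp, never fabricate one.
 */
static DEFINE_SPINLOCK(sync_lock);
static uint64_t last_tsc;      /* most advanced stamp published so far */
static uint64_t tsc_max_warp;  /* worst backwards step observed */

static void check_warp_once(void)
{
    uint64_t prev, now;

    spin_lock(&sync_lock);
    prev = last_tsc;               /* shared stamp first ... */
    now = rdtsc_ordered();         /* ... then the local TSC */
    if ( now > prev )
        last_tsc = now;
    else if ( prev - now > tsc_max_warp )
        tsc_max_warp = prev - now; /* local TSC is behind: a warp */
    spin_unlock(&sync_lock);
}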
On Mon, Feb 08, 2021 at 12:50:09PM +0100, Jan Beulich wrote:
> On 08.02.2021 12:05, Roger Pau Monné wrote:
> > I got the impression that on a loaded system guests with a non-trivial
> > amount of vCPUs might have trouble getting them all scheduled close
> > enough together for the rendezvous not to report a big skew, and thus
> > disable TSC_RELIABLE?
>
> No, check_tsc_warp() / tsc_check_reliability() don't have a problem
> there. Every CPU reads the shared "most advanced" stamp before reading
> its local one, so it doesn't matter how large the gaps are. In fact the
> possible bad effect is the other way around - if the scheduling effects
> are too heavy, we may mistakenly consider TSCs reliable when they
> aren't.
>
> A problem of the kind you describe exists in the actual rendezvous
> function. And actually any problem of this kind can, on a smaller
> scale, already be observed with SMT, because the individual
> hyperthreads of a core can't possibly all run at the same time.

Indeed, I got confused between tsc_check_reliability() and the actual
rendezvous function. So IMO the adjustments done by the rendezvous are
likely pointless when running virtualized, due to the likely inability
to schedule all the vCPUs at once to execute the rendezvous.

> As occurs to me only now, I think we can improve accuracy some (in
> particular on big systems) by making sure struct calibration_rendezvous's
> master_tsc_stamp is not sharing its cache line with semaphore and
> master_stime. The latter get written by (at least) the BSP, while
> master_tsc_stamp is stable after the 2nd loop iteration. Hence on the
> 3rd and 4th iterations we could even prefetch it to reduce the delay on
> the last one.

Seems like a possibility indeed.

Thanks,
Roger.
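The layout change Jan floats above might look like this - a
hypothetical sketch only, not a posted patch; the field set follows the
discussion, and __cacheline_aligned is the usual alignment attribute
from xen/cache.h:

/*
 * Hypothetical layout sketch: master_tsc_stamp is stable after the 2nd
 * loop iteration, so moving it onto its own cache line keeps the APs'
 * reads of it (and any prefetch on the 3rd/4th iterations) from
 * contending with the line the BSP keeps dirtying through semaphore
 * and master_stime updates.
 */
struct calibration_rendezvous {
    cpumask_t cpu_calibration_map;
    atomic_t semaphore;      /* bumped by every CPU, every iteration */
    s_time_t master_stime;   /* written by the BSP */
    uint64_t master_tsc_stamp __cacheline_aligned;
};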
--- a/xen/arch/x86/time.c
+++ b/xen/arch/x86/time.c
@@ -1662,11 +1662,12 @@ struct calibration_rendezvous {
 };
 
 static void
-time_calibration_rendezvous_tail(const struct calibration_rendezvous *r)
+time_calibration_rendezvous_tail(const struct calibration_rendezvous *r,
+                                 uint64_t tsc)
 {
     struct cpu_time_stamp *c = &this_cpu(cpu_calibration);
 
-    c->local_tsc = rdtsc_ordered();
+    c->local_tsc = tsc;
     c->local_stime = get_s_time_fixed(c->local_tsc);
     c->master_stime = r->master_stime;
 
@@ -1691,11 +1692,11 @@ static void time_calibration_tsc_rendezv
             while ( atomic_read(&r->semaphore) != (total_cpus - 1) )
                 cpu_relax();
 
-            if ( r->master_stime == 0 )
-            {
-                r->master_stime = read_platform_stime(NULL);
+            if ( r->master_tsc_stamp == 0 )
                 r->master_tsc_stamp = rdtsc_ordered();
-            }
+            else if ( i == 0 )
+                r->master_stime = read_platform_stime(NULL);
+
             atomic_inc(&r->semaphore);
 
             if ( i == 0 )
@@ -1720,7 +1721,7 @@ static void time_calibration_tsc_rendezv
         }
     }
 
-    time_calibration_rendezvous_tail(r);
+    time_calibration_rendezvous_tail(r, r->master_tsc_stamp);
 }
 
 /* Ordinary rendezvous function which does not modify TSC values. */
@@ -1745,7 +1746,7 @@ static void time_calibration_std_rendezv
         smp_rmb(); /* receive signal /then/ read r->master_stime */
     }
 
-    time_calibration_rendezvous_tail(r);
+    time_calibration_rendezvous_tail(r, rdtsc_ordered());
 }
 
 /*
The (stime,tsc) tuple is the basis for extrapolation by get_s_time().
Therefore the two better get taken as close to one another as possible.
This means two things: First, reading platform time is too early when
done on the first iteration. The closest we can get is on the last
iteration, immediately before telling other CPUs to write their TSCs
(and then also writing CPU0's). While at first glance it may seem not
overly relevant when exactly platform time is read (when assuming that
only stime is ever relevant anywhere, and hence the association with
the precise TSC values is of lower interest), both CPU frequency
changes and the effects of SMT make it unpredictable (between
individual rendezvous instances) how long the loop iterations will
take. This will in turn lead to a higher error than necessary in how
close to linear stime movement we can get.

Second, re-reading the TSC for local recording increases the overall
error as well, when we already know a more precise value - the one
just written.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v2: New.
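For reference, the extrapolation the first sentence refers to works
roughly as follows - a simplified rendition of get_s_time_fixed() from
xen/arch/x86/time.c, not a verbatim copy:

/*
 * Roughly how get_s_time_fixed() extrapolates from the recorded
 * (stime,tsc) stamp: any gap between when local_stime and local_tsc
 * were taken turns into a fixed offset error in every later reading,
 * which is why the patch takes the two as close together as possible.
 */
s_time_t get_s_time_fixed(uint64_t at_tsc)
{
    const struct cpu_time *t = &this_cpu(cpu_time);
    uint64_t tsc = at_tsc ?: rdtsc_ordered();

    return t->stamp.local_stime +
           scale_delta(tsc - t->stamp.local_tsc, &t->tsc_scale);
}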