[1/3] lib/vdso: Delay mask application in do_hres()
diff mbox series

Message ID 20190625161804.38713-1-vincenzo.frascino@arm.com
State New
Headers show
Series
  • [1/3] lib/vdso: Delay mask application in do_hres()
Related show

Commit Message

Vincenzo Frascino June 25, 2019, 4:18 p.m. UTC
do_hres() in the vDSO generic library masks the hw counter value
immediately after reading it.

Postpone the mask application after checking if the syscall fallback is
enabled, in order to be able to detect a possible fallback for the
architectures that have masks smaller than ULLONG_MAX.

Signed-off-by: Vincenzo Frascino <vincenzo.frascino@arm.com>
---
 lib/vdso/gettimeofday.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

Comments

Thomas Gleixner June 25, 2019, 5:02 p.m. UTC | #1
On Tue, 25 Jun 2019, Vincenzo Frascino wrote:

CC+ Andy

> do_hres() in the vDSO generic library masks the hw counter value
> immediately after reading it.
> 
> Postpone the mask application after checking if the syscall fallback is
> enabled, in order to be able to detect a possible fallback for the
> architectures that have masks smaller than ULLONG_MAX.

Right. This only worked on x86 because the mask is there ULLONG_MAX for all
VDSO capable clocksources, i.e. that ever worked just by chance.

As we talked about that already yesterday, I tested this on a couple of
machines and as expected the outcome is uarch dependent. Minimal deviations
to both sides and some machines do not show any change at all. I doubt it's
possible to come up with a solution which makes all uarchs go faster
magically.

Though, thinking about it, we could remove the mask operation completely on
X86. /me runs tests

Thanks,

	tglx


> Signed-off-by: Vincenzo Frascino <vincenzo.frascino@arm.com>
> ---
>  lib/vdso/gettimeofday.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/lib/vdso/gettimeofday.c b/lib/vdso/gettimeofday.c
> index ef28cc5d7bff..ee1221ba1d32 100644
> --- a/lib/vdso/gettimeofday.c
> +++ b/lib/vdso/gettimeofday.c
> @@ -35,12 +35,12 @@ static int do_hres(const struct vdso_data *vd, clockid_t clk,
>  
>  	do {
>  		seq = vdso_read_begin(vd);
> -		cycles = __arch_get_hw_counter(vd->clock_mode) &
> -			vd->mask;
> +		cycles = __arch_get_hw_counter(vd->clock_mode);
>  		ns = vdso_ts->nsec;
>  		last = vd->cycle_last;
>  		if (unlikely((s64)cycles < 0))
>  			return clock_gettime_fallback(clk, ts);
> +		cycles &= vd->mask;
>  		if (cycles > last)
>  			ns += (cycles - last) * vd->mult;
>  		ns >>= vd->shift;
> -- 
> 2.22.0
> 
>
Thomas Gleixner June 25, 2019, 6:27 p.m. UTC | #2
On Tue, 25 Jun 2019, Thomas Gleixner wrote:

> On Tue, 25 Jun 2019, Vincenzo Frascino wrote:
> 
> CC+ Andy
> 
> > do_hres() in the vDSO generic library masks the hw counter value
> > immediately after reading it.
> > 
> > Postpone the mask application after checking if the syscall fallback is
> > enabled, in order to be able to detect a possible fallback for the
> > architectures that have masks smaller than ULLONG_MAX.
> 
> Right. This only worked on x86 because the mask is there ULLONG_MAX for all
> VDSO capable clocksources, i.e. that ever worked just by chance.
> 
> As we talked about that already yesterday, I tested this on a couple of
> machines and as expected the outcome is uarch dependent. Minimal deviations
> to both sides and some machines do not show any change at all. I doubt it's
> possible to come up with a solution which makes all uarchs go faster
> magically.
> 
> Though, thinking about it, we could remove the mask operation completely on
> X86. /me runs tests

Unsurprisingly the results vary. Two uarchs do not care, but they did not
care about moving the mask either. The other two gain performance and the
last one falls back to the state before moving the mask. So in general it
looks like a worthwhile optimization.

Thanks,

	tglx
Andy Lutomirski June 25, 2019, 8:15 p.m. UTC | #3
On Tue, Jun 25, 2019 at 11:27 AM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> On Tue, 25 Jun 2019, Thomas Gleixner wrote:
>
> > On Tue, 25 Jun 2019, Vincenzo Frascino wrote:
> >
> > CC+ Andy
> >
> > > do_hres() in the vDSO generic library masks the hw counter value
> > > immediately after reading it.
> > >
> > > Postpone the mask application after checking if the syscall fallback is
> > > enabled, in order to be able to detect a possible fallback for the
> > > architectures that have masks smaller than ULLONG_MAX.
> >
> > Right. This only worked on x86 because the mask is there ULLONG_MAX for all
> > VDSO capable clocksources, i.e. that ever worked just by chance.
> >
> > As we talked about that already yesterday, I tested this on a couple of
> > machines and as expected the outcome is uarch dependent. Minimal deviations
> > to both sides and some machines do not show any change at all. I doubt it's
> > possible to come up with a solution which makes all uarchs go faster
> > magically.
> >
> > Though, thinking about it, we could remove the mask operation completely on
> > X86. /me runs tests
>
> Unsurprisingly the results vary. Two uarchs do not care, but they did not
> care about moving the mask either. The other two gain performance and the
> last one falls back to the state before moving the mask. So in general it
> looks like a worthwhile optimization.
>

At one point, I contemplated a different approach: have the "get the
counter" routine return 0 and then do if (unlikely(cycles <= last))
goto fallback.  This will remove one branch from the hot path.  I got
dubious results when I tried benchmarking it, probably because the
branch in question was always correctly predicted.
Thomas Gleixner June 25, 2019, 10:24 p.m. UTC | #4
On Tue, 25 Jun 2019, Andy Lutomirski wrote:
> On Tue, Jun 25, 2019 at 11:27 AM Thomas Gleixner <tglx@linutronix.de> wrote:
> >
> > On Tue, 25 Jun 2019, Thomas Gleixner wrote:
> >
> > > On Tue, 25 Jun 2019, Vincenzo Frascino wrote:
> > >
> > > CC+ Andy
> > >
> > > > do_hres() in the vDSO generic library masks the hw counter value
> > > > immediately after reading it.
> > > >
> > > > Postpone the mask application after checking if the syscall fallback is
> > > > enabled, in order to be able to detect a possible fallback for the
> > > > architectures that have masks smaller than ULLONG_MAX.
> > >
> > > Right. This only worked on x86 because the mask is there ULLONG_MAX for all
> > > VDSO capable clocksources, i.e. that ever worked just by chance.
> > >
> > > As we talked about that already yesterday, I tested this on a couple of
> > > machines and as expected the outcome is uarch dependent. Minimal deviations
> > > to both sides and some machines do not show any change at all. I doubt it's
> > > possible to come up with a solution which makes all uarchs go faster
> > > magically.
> > >
> > > Though, thinking about it, we could remove the mask operation completely on
> > > X86. /me runs tests
> >
> > Unsurprisingly the results vary. Two uarchs do not care, but they did not
> > care about moving the mask either. The other two gain performance and the
> > last one falls back to the state before moving the mask. So in general it
> > looks like a worthwhile optimization.
> >
> 
> At one point, I contemplated a different approach: have the "get the
> counter" routine return 0 and then do if (unlikely(cycles <= last))
> goto fallback.  This will remove one branch from the hot path.  I got
> dubious results when I tried benchmarking it, probably because the
> branch in question was always correctly predicted.

Just tried and it's the same thing. One drops, one does not care and one
gains. Did not test the other two as they are asleep already. There is no
universal cure for this I fear. I even tried a uarch optimized build a few
days ago which came out worse than the generic one...

The issue in that code path is the fencing of the TSC read. That seems to
screw up every uarch in a different way.

If you have no objections I'll queue this change (moving the mask) along
with the other two ARM64 ones to unbreak the fallback path for these errata
inflicted machines.

Thanks,

	tglx
Thomas Gleixner June 26, 2019, 6:38 a.m. UTC | #5
On Tue, 25 Jun 2019, Thomas Gleixner wrote:
> On Tue, 25 Jun 2019, Vincenzo Frascino wrote:
> > do_hres() in the vDSO generic library masks the hw counter value
> > immediately after reading it.
> > 
> > Postpone the mask application after checking if the syscall fallback is
> > enabled, in order to be able to detect a possible fallback for the
> > architectures that have masks smaller than ULLONG_MAX.
> 
> Right. This only worked on x86 because the mask is there ULLONG_MAX for all
> VDSO capable clocksources, i.e. that ever worked just by chance.

But it's actually worse than that:

> > +		cycles &= vd->mask;
> >  		if (cycles > last)
> >  			ns += (cycles - last) * vd->mult;
> >  		ns >>= vd->shift;

This is broken for any clocksource which can legitimately wrap around. The
core timekeeping does the right thing:

     		 (cycles - last) & mask

That makes sure that a wraparound is correctly handled. With the above the
wrap around would be ignored due to

     	    if (cycles > last)

Stupid me. I should have added big fat comments to the x86 vdso why this
all works correctly and only correctly for the x86 crud. That was part of
squeezing the last cycles out of the vdso.

Sorry for not noticing earlier. Working on a fix.

Thanks,

	tglx
Vincenzo Frascino June 26, 2019, 9:25 a.m. UTC | #6
Hi Thomas,

On 26/06/2019 07:38, Thomas Gleixner wrote:
> On Tue, 25 Jun 2019, Thomas Gleixner wrote:
>> On Tue, 25 Jun 2019, Vincenzo Frascino wrote:
>>> do_hres() in the vDSO generic library masks the hw counter value
>>> immediately after reading it.
>>>
>>> Postpone the mask application after checking if the syscall fallback is
>>> enabled, in order to be able to detect a possible fallback for the
>>> architectures that have masks smaller than ULLONG_MAX.
>>
>> Right. This only worked on x86 because the mask is there ULLONG_MAX for all
>> VDSO capable clocksources, i.e. that ever worked just by chance.
> 
> But it's actually worse than that:
> 
>>> +		cycles &= vd->mask;
>>>  		if (cycles > last)
>>>  			ns += (cycles - last) * vd->mult;
>>>  		ns >>= vd->shift;
> 
> This is broken for any clocksource which can legitimately wrap around. The
> core timekeeping does the right thing:
> 
>      		 (cycles - last) & mask
> 
> That makes sure that a wraparound is correctly handled. With the above the
> wrap around would be ignored due to
> 
>      	    if (cycles > last)
> 

You are right. Thanks for spotting it.


...
Vincenzo Frascino June 26, 2019, 11:08 a.m. UTC | #7
Hi Thomas,

On 26/06/2019 11:02, Thomas Gleixner wrote:
> The x86 vdso implementation on which the generic vdso library is based on
> has subtle (unfortunately undocumented) twists:
> 
>  1) The code assumes that the clocksource mask is U64_MAX which means that
>     no bits are masked. Which is true for any valid x86 VDSO clocksource.
>     Stupidly it still did the mask operation for no reason and at the wrong
>     place right after reading the clocksource.
> 
>  2) It contains a sanity check to catch the case where slightly
>     unsynchronized TSC values can be overserved which would cause the delta
>     calculation to make a huge jump. It therefore checks whether the
>     current TSC value is larger than the value on which the current
>     conversion is based on. If it's not larger the base value is used to
>     prevent time jumps.
> 
> #1 Is not only stupid for the X86 case because it does the masking for no
> reason it is also completely wrong for clocksources with a smaller mask
> which can legitimately wrap around during a conversion period. The core
> timekeeping code does it correct by applying the mask after the delta
> calculation:
> 
> 	(now - base) & mask
> 
> #2 is equally broken for clocksources which have smaller masks and can wrap
> around during a conversion period because there the now > base check is
> just wrong and causes stale time stamps and time going backwards issues.
> 
> Unbreak it by:
> 
>   1) Removing the mask operation from the clocksource read which makes the
>      fallback detection work for all clocksources
> 
>   2) Replacing the conditional delta calculation with a overrideable inline
>      function.
> 
> #2 could reuse clocksource_delta() from the timekeeping code but that
> results in a significant performance hit for the x86 VSDO. The timekeeping
> core code must have the non optimized version as it has to operate
> correctly with clocksources which have smaller masks as well to handle the
> case where TSC is discarded as timekeeper clocksource and replaced by HPET
> or pmtimer. For the VDSO there is no replacement clocksource. If TSC is
> unusable the syscall is enforced which does the right thing.
> 
> To accomodate to the needs of various architectures provide an overrideable
> inline function which defaults to the regular delta calculation with
> masking:
> 
> 	(now - base) & mask
> 
> Override it for x86 with the non-masking and checking version.
> 
> This unbreaks the ARM64 syscall fallback operation, allows to use
> clocksources with arbitrary width and preserves the performance
> optimization for x86.
> 
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

A part a typo that leads to compilation errors on non-x86 platforms the rest
looks fine by me.

I tested it on arm64 and behaves correctly.

With this:

Reviewed-by: Vincenzo Frascino <vincenzo.frascino@arm.com>

> ---
>  arch/x86/include/asm/vdso/gettimeofday.h |   27 +++++++++++++++++++++++++++
>  lib/vdso/gettimeofday.c                  |   19 +++++++++++++++----
>  2 files changed, 42 insertions(+), 4 deletions(-)
> 
> --- a/arch/x86/include/asm/vdso/gettimeofday.h
> +++ b/arch/x86/include/asm/vdso/gettimeofday.h
> @@ -229,6 +229,33 @@ static __always_inline const struct vdso
>  	return __vdso_data;
>  }
>  
> +/*
> + * x86 specific delta calculation.
> + *
> + * The regular implementation assumes that clocksource reads are globally
> + * monotonic. The TSC can be slightly off across sockets which can cause
> + * the regular delta calculation (@cycles - @last) to return a huge time
> + * jump.
> + *
> + * Therefore it needs to be verified that @cycles are greater than
> + * @last. If not then use @last, which is the base time of the current
> + * conversion period.
> + *
> + * This variant also removes the masking of the subtraction because the
> + * clocksource mask of all VDSO capable clocksources on x86 is U64_MAX
> + * which would result in a pointless operation. The compiler cannot
> + * optimize it away as the mask comes from the vdso data and is not compile
> + * time constant.
> + */
> +static __always_inline
> +u64 vdso_calc_delta(u64 cycles, u64 last, u64 mask, u32 mult)
> +{
> +	if (cycles > last)
> +		return (cycles - last) * mult;
> +	return 0;
> +}
> +#define vdso_calc_delta vdso_calc_delta
> +
>  #endif /* !__ASSEMBLY__ */
>  
>  #endif /* __ASM_VDSO_GETTIMEOFDAY_H */
> --- a/lib/vdso/gettimeofday.c
> +++ b/lib/vdso/gettimeofday.c
> @@ -26,6 +26,18 @@
>  #include <asm/vdso/gettimeofday.h>
>  #endif /* ENABLE_COMPAT_VDSO */
>  
> +#ifndef vdso_calc_delta
> +/*
> + * Default implementation which works for all sane clocksources. That
> + * obviously excludes x86/TSC.
> + */
> +static __always_inline
> +u64 vdso_calc_delta(u64 cycles, u64 last, u64 mask, u32 mult)
> +{
> +	return ((cyles - last) & mask) * mult;

Typo here:

s/cyles/cycles/

> +}
> +#endif
> +
>  static int do_hres(const struct vdso_data *vd, clockid_t clk,
>  		   struct __kernel_timespec *ts)
>  {
> @@ -35,14 +47,13 @@ static int do_hres(const struct vdso_dat
>  
>  	do {
>  		seq = vdso_read_begin(vd);
> -		cycles = __arch_get_hw_counter(vd->clock_mode) &
> -			vd->mask;
> +		cycles = __arch_get_hw_counter(vd->clock_mode);
>  		ns = vdso_ts->nsec;
>  		last = vd->cycle_last;
>  		if (unlikely((s64)cycles < 0))
>  			return clock_gettime_fallback(clk, ts);
> -		if (cycles > last)
> -			ns += (cycles - last) * vd->mult;
> +
> +		ns += vdso_calc_delta(cycles, last, vd->mask, vd->mult);
>  		ns >>= vd->shift;
>  		sec = vdso_ts->sec;
>  	} while (unlikely(vdso_read_retry(vd, seq)));
>

Patch
diff mbox series

diff --git a/lib/vdso/gettimeofday.c b/lib/vdso/gettimeofday.c
index ef28cc5d7bff..ee1221ba1d32 100644
--- a/lib/vdso/gettimeofday.c
+++ b/lib/vdso/gettimeofday.c
@@ -35,12 +35,12 @@  static int do_hres(const struct vdso_data *vd, clockid_t clk,
 
 	do {
 		seq = vdso_read_begin(vd);
-		cycles = __arch_get_hw_counter(vd->clock_mode) &
-			vd->mask;
+		cycles = __arch_get_hw_counter(vd->clock_mode);
 		ns = vdso_ts->nsec;
 		last = vd->cycle_last;
 		if (unlikely((s64)cycles < 0))
 			return clock_gettime_fallback(clk, ts);
+		cycles &= vd->mask;
 		if (cycles > last)
 			ns += (cycles - last) * vd->mult;
 		ns >>= vd->shift;