Message ID | 20190625161804.38713-1-vincenzo.frascino@arm.com
---|---
State | Not Applicable
Series | [1/3] lib/vdso: Delay mask application in do_hres()
On Tue, 25 Jun 2019, Vincenzo Frascino wrote:

CC+ Andy

> do_hres() in the vDSO generic library masks the hw counter value
> immediately after reading it.
>
> Postpone the mask application until after checking whether the syscall
> fallback is enabled, so that a possible fallback can be detected on
> architectures that have masks smaller than ULLONG_MAX.

Right. This only worked on x86 because the mask there is ULLONG_MAX for
all VDSO capable clocksources, i.e. it only ever worked by chance.

As we talked about already yesterday, I tested this on a couple of
machines and, as expected, the outcome is uarch dependent: minimal
deviations in both directions, and some machines do not show any change at
all. I doubt it's possible to come up with a solution which magically
makes all uarchs go faster.

Though, thinking about it, we could remove the mask operation completely
on x86. /me runs tests

Thanks,

	tglx

> Signed-off-by: Vincenzo Frascino <vincenzo.frascino@arm.com>
> ---
>  lib/vdso/gettimeofday.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/lib/vdso/gettimeofday.c b/lib/vdso/gettimeofday.c
> index ef28cc5d7bff..ee1221ba1d32 100644
> --- a/lib/vdso/gettimeofday.c
> +++ b/lib/vdso/gettimeofday.c
> @@ -35,12 +35,12 @@ static int do_hres(const struct vdso_data *vd, clockid_t clk,
>
>  	do {
>  		seq = vdso_read_begin(vd);
> -		cycles = __arch_get_hw_counter(vd->clock_mode) &
> -			 vd->mask;
> +		cycles = __arch_get_hw_counter(vd->clock_mode);
>  		ns = vdso_ts->nsec;
>  		last = vd->cycle_last;
>  		if (unlikely((s64)cycles < 0))
>  			return clock_gettime_fallback(clk, ts);
> +		cycles &= vd->mask;
>  		if (cycles > last)
>  			ns += (cycles - last) * vd->mult;
>  		ns >>= vd->shift;
> --
> 2.22.0
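A minimal userspace sketch of why the ordering matters. It assumes, purely
for illustration, a clocksource whose mask is 56 bits wide and an
__arch_get_hw_counter() that signals "use the syscall fallback" by
returning a value that is negative when interpreted as s64, which is what
the (s64)cycles < 0 check in the diff above relies on. This is not kernel
code.

	#include <stdint.h>
	#include <stdio.h>

	#define FALLBACK_SENTINEL ((uint64_t)1 << 63)        /* negative as int64_t   */
	#define COUNTER_MASK      (((uint64_t)1 << 56) - 1)  /* mask < ULLONG_MAX     */

	int main(void)
	{
		uint64_t raw = FALLBACK_SENTINEL;    /* arch asks for the fallback */

		/* Masking before the check clears the sign bit, so the
		 * fallback can never be detected. */
		uint64_t masked = raw & COUNTER_MASK;
		printf("mask first : fallback seen = %d\n", (int64_t)masked < 0);

		/* Checking before masking, as the patch does, keeps the
		 * sentinel intact. */
		printf("check first: fallback seen = %d\n", (int64_t)raw < 0);

		return 0;
	}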
On Tue, 25 Jun 2019, Thomas Gleixner wrote:
> On Tue, 25 Jun 2019, Vincenzo Frascino wrote:
>
> CC+ Andy
>
> > do_hres() in the vDSO generic library masks the hw counter value
> > immediately after reading it.
> >
> > Postpone the mask application until after checking whether the syscall
> > fallback is enabled, so that a possible fallback can be detected on
> > architectures that have masks smaller than ULLONG_MAX.
>
> Right. This only worked on x86 because the mask there is ULLONG_MAX for
> all VDSO capable clocksources, i.e. it only ever worked by chance.
>
> As we talked about already yesterday, I tested this on a couple of
> machines and, as expected, the outcome is uarch dependent: minimal
> deviations in both directions, and some machines do not show any change
> at all. I doubt it's possible to come up with a solution which magically
> makes all uarchs go faster.
>
> Though, thinking about it, we could remove the mask operation completely
> on x86. /me runs tests

Unsurprisingly the results vary. Two uarchs do not care, but they did not
care about moving the mask either. The other two gain performance, and the
last one falls back to the state before moving the mask. So in general it
looks like a worthwhile optimization.

Thanks,

	tglx
On Tue, Jun 25, 2019 at 11:27 AM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> On Tue, 25 Jun 2019, Thomas Gleixner wrote:
>
> > On Tue, 25 Jun 2019, Vincenzo Frascino wrote:
> >
> > CC+ Andy
> >
> > > do_hres() in the vDSO generic library masks the hw counter value
> > > immediately after reading it.
> > >
> > > Postpone the mask application until after checking whether the syscall
> > > fallback is enabled, so that a possible fallback can be detected on
> > > architectures that have masks smaller than ULLONG_MAX.
> >
> > Right. This only worked on x86 because the mask there is ULLONG_MAX for
> > all VDSO capable clocksources, i.e. it only ever worked by chance.
> >
> > As we talked about already yesterday, I tested this on a couple of
> > machines and, as expected, the outcome is uarch dependent: minimal
> > deviations in both directions, and some machines do not show any change
> > at all. I doubt it's possible to come up with a solution which magically
> > makes all uarchs go faster.
> >
> > Though, thinking about it, we could remove the mask operation completely
> > on x86. /me runs tests
>
> Unsurprisingly the results vary. Two uarchs do not care, but they did not
> care about moving the mask either. The other two gain performance, and the
> last one falls back to the state before moving the mask. So in general it
> looks like a worthwhile optimization.
>

At one point, I contemplated a different approach: have the "get the
counter" routine return 0 and then do

	if (unlikely(cycles <= last))
		goto fallback;

This would remove one branch from the hot path. I got dubious results when
I tried benchmarking it, probably because the branch in question was always
correctly predicted.
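A rough userspace model of the alternative Andy describes, under the
assumption that the counter read returns 0 whenever the VDSO has to fall
back to the syscall; the single cycles <= last test then covers both the
fallback case and the rare case where the counter has not advanced past
cycle_last. Illustration only, not the code that was merged.

	#include <stdint.h>
	#include <stdio.h>

	/* Model of a counter read that returns 0 when the clocksource is
	 * not usable from the VDSO, instead of a negative sentinel. */
	static uint64_t read_counter(int vdso_capable, uint64_t hw)
	{
		return vdso_capable ? hw : 0;
	}

	/* One branch decides both "take the syscall fallback" and
	 * "counter did not move past cycle_last": 0 is always <= last. */
	static const char *classify(uint64_t cycles, uint64_t last)
	{
		if (cycles <= last)
			return "fallback path";
		return "hot path";
	}

	int main(void)
	{
		uint64_t last = 1000;

		printf("%s\n", classify(read_counter(1, 1500), last)); /* hot path    */
		printf("%s\n", classify(read_counter(1,  900), last)); /* behind last */
		printf("%s\n", classify(read_counter(0, 1500), last)); /* not capable */
		return 0;
	}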
On Tue, 25 Jun 2019, Andy Lutomirski wrote:
> On Tue, Jun 25, 2019 at 11:27 AM Thomas Gleixner <tglx@linutronix.de> wrote:
> >
> > On Tue, 25 Jun 2019, Thomas Gleixner wrote:
> >
> > > On Tue, 25 Jun 2019, Vincenzo Frascino wrote:
> > >
> > > CC+ Andy
> > >
> > > > do_hres() in the vDSO generic library masks the hw counter value
> > > > immediately after reading it.
> > > >
> > > > Postpone the mask application until after checking whether the syscall
> > > > fallback is enabled, so that a possible fallback can be detected on
> > > > architectures that have masks smaller than ULLONG_MAX.
> > >
> > > Right. This only worked on x86 because the mask there is ULLONG_MAX for
> > > all VDSO capable clocksources, i.e. it only ever worked by chance.
> > >
> > > As we talked about already yesterday, I tested this on a couple of
> > > machines and, as expected, the outcome is uarch dependent: minimal
> > > deviations in both directions, and some machines do not show any change
> > > at all. I doubt it's possible to come up with a solution which magically
> > > makes all uarchs go faster.
> > >
> > > Though, thinking about it, we could remove the mask operation completely
> > > on x86. /me runs tests
> >
> > Unsurprisingly the results vary. Two uarchs do not care, but they did not
> > care about moving the mask either. The other two gain performance, and the
> > last one falls back to the state before moving the mask. So in general it
> > looks like a worthwhile optimization.
> >
>
> At one point, I contemplated a different approach: have the "get the
> counter" routine return 0 and then do
>
> 	if (unlikely(cycles <= last))
> 		goto fallback;
>
> This would remove one branch from the hot path. I got dubious results when
> I tried benchmarking it, probably because the branch in question was always
> correctly predicted.

Just tried, and it's the same thing: one drops, one does not care and one
gains. Did not test the other two as they are asleep already. There is no
universal cure for this, I fear. I even tried a uarch optimized build a few
days ago which came out worse than the generic one...

The issue in that code path is the fencing of the TSC read. That seems to
screw up every uarch in a different way.

If you have no objections I'll queue this change (moving the mask) along
with the other two ARM64 ones to unbreak the fallback path for these
errata-afflicted machines.

Thanks,

	tglx
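The "fencing of the TSC read" mentioned above refers to the ordering
barrier x86 needs around RDTSC. Below is a small userspace sketch in the
spirit of the kernel's rdtsc_ordered(), i.e. a fence in front of the TSC
read so it cannot be hoisted past earlier loads; it is not the kernel
implementation, and it assumes an x86-64 build with GCC/Clang intrinsics.

	#include <stdint.h>
	#include <stdio.h>
	#include <x86intrin.h>

	static inline uint64_t fenced_rdtsc(void)
	{
		_mm_lfence();	/* keep the TSC read ordered w.r.t. earlier loads */
		return __rdtsc();
	}

	int main(void)
	{
		uint64_t a = fenced_rdtsc();
		uint64_t b = fenced_rdtsc();

		printf("delta: %llu TSC ticks\n", (unsigned long long)(b - a));
		return 0;
	}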
On Tue, 25 Jun 2019, Thomas Gleixner wrote:
> On Tue, 25 Jun 2019, Vincenzo Frascino wrote:
> > do_hres() in the vDSO generic library masks the hw counter value
> > immediately after reading it.
> >
> > Postpone the mask application until after checking whether the syscall
> > fallback is enabled, so that a possible fallback can be detected on
> > architectures that have masks smaller than ULLONG_MAX.
>
> Right. This only worked on x86 because the mask there is ULLONG_MAX for
> all VDSO capable clocksources, i.e. it only ever worked by chance.

But it's actually worse than that:

> > +		cycles &= vd->mask;
> >  		if (cycles > last)
> >  			ns += (cycles - last) * vd->mult;
> >  		ns >>= vd->shift;

This is broken for any clocksource which can legitimately wrap around. The
core timekeeping does the right thing:

	(cycles - last) & mask

That makes sure a wraparound is handled correctly. With the above, the
wraparound is ignored because of

	if (cycles > last)

Stupid me. I should have added big fat comments to the x86 vdso explaining
why this all works correctly, and only correctly, for the x86 crud. That
was part of squeezing the last cycles out of the vdso.

Sorry for not noticing earlier. Working on a fix.

Thanks,

	tglx
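A short userspace illustration of the wraparound point, assuming for
concreteness a 56-bit clocksource. The modular delta (cycles - last) & mask,
as used by the core timekeeping, yields the right elapsed count across a
wrap, while comparing the masked values with "cycles > last" silently drops
it.

	#include <inttypes.h>
	#include <stdint.h>
	#include <stdio.h>

	#define MASK (((uint64_t)1 << 56) - 1)   /* 56-bit clocksource, for illustration */

	int main(void)
	{
		uint64_t last   = MASK - 10;     /* just before the counter wraps */
		uint64_t cycles = 5;             /* just after it wrapped         */

		/* Core-timekeeping style: modular subtraction handles the
		 * wrap and reports 16 elapsed cycles. */
		printf("(cycles - last) & mask : %" PRIu64 "\n",
		       (cycles - last) & MASK);

		/* The vdso style quoted above: the wrapped counter compares
		 * as "behind" last, so the elapsed time is ignored. */
		if (cycles > last)
			printf("cycles > last          : %" PRIu64 "\n",
			       cycles - last);
		else
			printf("cycles > last          : delta dropped\n");

		return 0;
	}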
Hi Thomas,

On 26/06/2019 07:38, Thomas Gleixner wrote:
> On Tue, 25 Jun 2019, Thomas Gleixner wrote:
>> On Tue, 25 Jun 2019, Vincenzo Frascino wrote:
>>> do_hres() in the vDSO generic library masks the hw counter value
>>> immediately after reading it.
>>>
>>> Postpone the mask application until after checking whether the syscall
>>> fallback is enabled, so that a possible fallback can be detected on
>>> architectures that have masks smaller than ULLONG_MAX.
>>
>> Right. This only worked on x86 because the mask there is ULLONG_MAX for
>> all VDSO capable clocksources, i.e. it only ever worked by chance.
>
> But it's actually worse than that:
>
>>> +		cycles &= vd->mask;
>>>  		if (cycles > last)
>>>  			ns += (cycles - last) * vd->mult;
>>>  		ns >>= vd->shift;
>
> This is broken for any clocksource which can legitimately wrap around. The
> core timekeeping does the right thing:
>
> 	(cycles - last) & mask
>
> That makes sure a wraparound is handled correctly. With the above, the
> wraparound is ignored because of
>
> 	if (cycles > last)
>

You are right. Thanks for spotting it.

...
do_hres() in the vDSO generic library masks the hw counter value
immediately after reading it.

Postpone the mask application until after checking whether the syscall
fallback is enabled, so that a possible fallback can be detected on
architectures that have masks smaller than ULLONG_MAX.

Signed-off-by: Vincenzo Frascino <vincenzo.frascino@arm.com>
---
 lib/vdso/gettimeofday.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/lib/vdso/gettimeofday.c b/lib/vdso/gettimeofday.c
index ef28cc5d7bff..ee1221ba1d32 100644
--- a/lib/vdso/gettimeofday.c
+++ b/lib/vdso/gettimeofday.c
@@ -35,12 +35,12 @@ static int do_hres(const struct vdso_data *vd, clockid_t clk,
 
 	do {
 		seq = vdso_read_begin(vd);
-		cycles = __arch_get_hw_counter(vd->clock_mode) &
-			 vd->mask;
+		cycles = __arch_get_hw_counter(vd->clock_mode);
 		ns = vdso_ts->nsec;
 		last = vd->cycle_last;
 		if (unlikely((s64)cycles < 0))
 			return clock_gettime_fallback(clk, ts);
+		cycles &= vd->mask;
 		if (cycles > last)
 			ns += (cycles - last) * vd->mult;
 		ns >>= vd->shift;