Message ID | 20240912-mgtime-v2-1-54db84afb7a7@kernel.org (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | [v2] timekeeping: move multigrain timestamp floor handling into timekeeper | expand |
On Thu, Sep 12, 2024 at 11:02 AM Jeff Layton <jlayton@kernel.org> wrote: > > The kernel test robot reported a performance hit in some will-it-scale > tests due to the multigrain timestamp patches. My own testing showed > about a 7% drop in performance on the pipe1_threads test, and the data > showed that coarse_ctime() was slowing down current_time(). So, you provided some useful detail about why coarse_ctime() was slow in your reply earlier, but it would be good to preserve that in the commit message here. > Move the multigrain timestamp floor tracking word into timekeeper.c. Add > two new public interfaces: The first fills a timespec64 with the later > of the coarse-grained clock and the floor time, and the second gets a > fine-grained time and tries to swap it into the floor and fills a > timespec64 with the result. > > The first function returns an opaque cookie that is suitable for passing > to the second, which will use it as the "old" value in the cmpxchg. The cookie usage isn't totally clear to me right off. It feels a bit more subtle then I'd expect. > diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c > index 5391e4167d60..bb039c9d525e 100644 > --- a/kernel/time/timekeeping.c > +++ b/kernel/time/timekeeping.c > @@ -114,6 +114,13 @@ static struct tk_fast tk_fast_raw ____cacheline_aligned = { > .base[1] = FAST_TK_INIT, > }; > > +/* > + * This represents the latest fine-grained time that we have handed out as a > + * timestamp on the system. Tracked as a monotonic ktime_t, and converted to the > + * realtime clock on an as-needed basis. > + */ > +static __cacheline_aligned_in_smp atomic64_t mg_floor; > + So I do really like this general approach of having an internal floor value combined with special coarse/fine grained assessors that work with the floor, so we're not impacting the normal hotpath logic (basically I was writing up a suggestion to this effect to the thread with Arnd when I realized you had follow up patch in my inbox). > static inline void tk_normalize_xtime(struct timekeeper *tk) > { > while (tk->tkr_mono.xtime_nsec >= ((u64)NSEC_PER_SEC << tk->tkr_mono.shift)) { > @@ -2394,6 +2401,76 @@ void ktime_get_coarse_real_ts64(struct timespec64 *ts) > } > EXPORT_SYMBOL(ktime_get_coarse_real_ts64); > > +/** > + * ktime_get_coarse_real_ts64_mg - get later of coarse grained time or floor > + * @ts: timespec64 to be filled > + * > + * Adjust floor to realtime and compare it to the coarse time. Fill > + * @ts with the latest one. Returns opaque cookie suitable to pass > + * to ktime_get_real_ts64_mg. > + */ > +u64 ktime_get_coarse_real_ts64_mg(struct timespec64 *ts) > +{ > + struct timekeeper *tk = &tk_core.timekeeper; > + u64 floor = atomic64_read(&mg_floor); > + ktime_t f_real, offset, coarse; > + unsigned int seq; > + > + WARN_ON(timekeeping_suspended); > + > + do { > + seq = read_seqcount_begin(&tk_core.seq); > + *ts = tk_xtime(tk); > + offset = *offsets[TK_OFFS_REAL]; > + } while (read_seqcount_retry(&tk_core.seq, seq)); > + > + coarse = timespec64_to_ktime(*ts); > + f_real = ktime_add(floor, offset); > + if (ktime_after(f_real, coarse)) > + *ts = ktime_to_timespec64(f_real); > + return floor; > +} > +EXPORT_SYMBOL_GPL(ktime_get_coarse_real_ts64_mg); Generally this looks ok to me. > +/** > + * ktime_get_real_ts64_mg - attempt to update floor value and return result > + * @ts: pointer to the timespec to be set > + * @cookie: opaque cookie from earlier call to ktime_get_coarse_real_ts64_mg() > + * > + * Get a current monotonic fine-grained time value and attempt to swap > + * it into the floor using @cookie as the "old" value. @ts will be > + * filled with the resulting floor value, regardless of the outcome of > + * the swap. > + */ Again this cookie argument usage and the behavior of this function isn't very clear to me. > +void ktime_get_real_ts64_mg(struct timespec64 *ts, u64 cookie) > +{ > + struct timekeeper *tk = &tk_core.timekeeper; > + ktime_t offset, mono, old = (ktime_t)cookie; > + unsigned int seq; > + u64 nsecs; > + > + WARN_ON(timekeeping_suspended); > + > + do { > + seq = read_seqcount_begin(&tk_core.seq); > + > + ts->tv_sec = tk->xtime_sec; > + mono = tk->tkr_mono.base; > + nsecs = timekeeping_get_ns(&tk->tkr_mono); > + offset = *offsets[TK_OFFS_REAL]; > + } while (read_seqcount_retry(&tk_core.seq, seq)); > + > + mono = ktime_add_ns(mono, nsecs); > + if (atomic64_try_cmpxchg(&mg_floor, &old, mono)) { > + ts->tv_nsec = 0; > + timespec64_add_ns(ts, nsecs); > + } else { > + *ts = ktime_to_timespec64(ktime_add(old, offset)); > + } > + > +} > +EXPORT_SYMBOL(ktime_get_real_ts64_mg); So initially I was expecting this to look something like (sorry for the whitespace damage here): { do { seq = read_seqcount_begin(&tk_core.seq); ts->tv_sec = tk->xtime_sec; mono = tk->tkr_mono.base; nsecs = timekeeping_get_ns(&tk->tkr_mono); offset = *offsets[TK_OFFS_REAL]; } while (read_seqcount_retry(&tk_core.seq, seq)); mono = ktime_add_ns(mono, nsecs); do { old = atomic64_read(&mg_floor); if (floor >= mono) break; } while(!atomic64_try_cmpxchg(&mg_floor, old, mono); ts->tv_nsec = 0; timespec64_add_ns(ts, nsecs); } Where you read the tk data, atomically update the floor (assuming it's not in the future) and then return the finegrained value, not needing to manage a cookie value. But instead, it seems like if something has happened since the cookie value was saved (another cpu getting a fine grained timestamp), your ktime_get_real_ts64_mg() will fall back to returning the same coarse grained time saved to the cookie, as if no time had past? It seems like that could cause problems: cpu1 cpu2 -------------------------------------------------------------------------- t2a = ktime_get_coarse_real_ts64_mg t1a = ktime_get_coarse_real_ts64_mg() t1b = ktime_get_real_ts64_mg(t1a) t2b = ktime_get_real_ts64_mg(t2a) Where t2b will seem to be before t1b, even though it happened afterwards. thanks -john
On Thu, Sep 12, 2024 at 1:11 PM John Stultz <jstultz@google.com> wrote: > So initially I was expecting this to look something like (sorry for > the whitespace damage here): > { > do { > seq = read_seqcount_begin(&tk_core.seq); > ts->tv_sec = tk->xtime_sec; > mono = tk->tkr_mono.base; > nsecs = timekeeping_get_ns(&tk->tkr_mono); > offset = *offsets[TK_OFFS_REAL]; > } while (read_seqcount_retry(&tk_core.seq, seq)); > > mono = ktime_add_ns(mono, nsecs); > do { > old = atomic64_read(&mg_floor); > if (floor >= mono) > break; Apologies, that should be if (old >= mono) break; thanks -john
On Thu, 2024-09-12 at 13:11 -0700, John Stultz wrote: > On Thu, Sep 12, 2024 at 11:02 AM Jeff Layton <jlayton@kernel.org> wrote: > > > > The kernel test robot reported a performance hit in some will-it-scale > > tests due to the multigrain timestamp patches. My own testing showed > > about a 7% drop in performance on the pipe1_threads test, and the data > > showed that coarse_ctime() was slowing down current_time(). > > So, you provided some useful detail about why coarse_ctime() was slow > in your reply earlier, but it would be good to preserve that in the > commit message here. > > > > Move the multigrain timestamp floor tracking word into timekeeper.c. Add > > two new public interfaces: The first fills a timespec64 with the later > > of the coarse-grained clock and the floor time, and the second gets a > > fine-grained time and tries to swap it into the floor and fills a > > timespec64 with the result. > > > > The first function returns an opaque cookie that is suitable for passing > > to the second, which will use it as the "old" value in the cmpxchg. > > The cookie usage isn't totally clear to me right off. It feels a bit > more subtle then I'd expect. > > > > diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c > > index 5391e4167d60..bb039c9d525e 100644 > > --- a/kernel/time/timekeeping.c > > +++ b/kernel/time/timekeeping.c > > @@ -114,6 +114,13 @@ static struct tk_fast tk_fast_raw ____cacheline_aligned = { > > .base[1] = FAST_TK_INIT, > > }; > > > > +/* > > + * This represents the latest fine-grained time that we have handed out as a > > + * timestamp on the system. Tracked as a monotonic ktime_t, and converted to the > > + * realtime clock on an as-needed basis. > > + */ > > +static __cacheline_aligned_in_smp atomic64_t mg_floor; > > + > > So I do really like this general approach of having an internal floor > value combined with special coarse/fine grained assessors that work > with the floor, so we're not impacting the normal hotpath logic > (basically I was writing up a suggestion to this effect to the thread > with Arnd when I realized you had follow up patch in my inbox). > > > > static inline void tk_normalize_xtime(struct timekeeper *tk) > > { > > while (tk->tkr_mono.xtime_nsec >= ((u64)NSEC_PER_SEC << tk->tkr_mono.shift)) { > > @@ -2394,6 +2401,76 @@ void ktime_get_coarse_real_ts64(struct timespec64 *ts) > > } > > EXPORT_SYMBOL(ktime_get_coarse_real_ts64); > > > > +/** > > + * ktime_get_coarse_real_ts64_mg - get later of coarse grained time or floor > > + * @ts: timespec64 to be filled > > + * > > + * Adjust floor to realtime and compare it to the coarse time. Fill > > + * @ts with the latest one. Returns opaque cookie suitable to pass > > + * to ktime_get_real_ts64_mg. > > + */ > > +u64 ktime_get_coarse_real_ts64_mg(struct timespec64 *ts) > > +{ > > + struct timekeeper *tk = &tk_core.timekeeper; > > + u64 floor = atomic64_read(&mg_floor); > > + ktime_t f_real, offset, coarse; > > + unsigned int seq; > > + > > + WARN_ON(timekeeping_suspended); > > + > > + do { > > + seq = read_seqcount_begin(&tk_core.seq); > > + *ts = tk_xtime(tk); > > + offset = *offsets[TK_OFFS_REAL]; > > + } while (read_seqcount_retry(&tk_core.seq, seq)); > > + > > + coarse = timespec64_to_ktime(*ts); > > + f_real = ktime_add(floor, offset); > > + if (ktime_after(f_real, coarse)) > > + *ts = ktime_to_timespec64(f_real); > > + return floor; > > +} > > +EXPORT_SYMBOL_GPL(ktime_get_coarse_real_ts64_mg); > > Generally this looks ok to me. > > > > +/** > > + * ktime_get_real_ts64_mg - attempt to update floor value and return result > > + * @ts: pointer to the timespec to be set > > + * @cookie: opaque cookie from earlier call to ktime_get_coarse_real_ts64_mg() > > + * > > + * Get a current monotonic fine-grained time value and attempt to swap > > + * it into the floor using @cookie as the "old" value. @ts will be > > + * filled with the resulting floor value, regardless of the outcome of > > + * the swap. > > + */ > > Again this cookie argument usage and the behavior of this function > isn't very clear to me. > > > +void ktime_get_real_ts64_mg(struct timespec64 *ts, u64 cookie) > > +{ > > + struct timekeeper *tk = &tk_core.timekeeper; > > + ktime_t offset, mono, old = (ktime_t)cookie; > > + unsigned int seq; > > + u64 nsecs; > > + > > + WARN_ON(timekeeping_suspended); > > + > > + do { > > + seq = read_seqcount_begin(&tk_core.seq); > > + > > + ts->tv_sec = tk->xtime_sec; > > + mono = tk->tkr_mono.base; > > + nsecs = timekeeping_get_ns(&tk->tkr_mono); > > + offset = *offsets[TK_OFFS_REAL]; > > + } while (read_seqcount_retry(&tk_core.seq, seq)); > > + > > + mono = ktime_add_ns(mono, nsecs); > > + if (atomic64_try_cmpxchg(&mg_floor, &old, mono)) { > > + ts->tv_nsec = 0; > > + timespec64_add_ns(ts, nsecs); > > + } else { > > + *ts = ktime_to_timespec64(ktime_add(old, offset)); > > + } > > + > > +} > > +EXPORT_SYMBOL(ktime_get_real_ts64_mg); > > > So initially I was expecting this to look something like (sorry for > the whitespace damage here): > { > do { > seq = read_seqcount_begin(&tk_core.seq); > ts->tv_sec = tk->xtime_sec; > mono = tk->tkr_mono.base; > nsecs = timekeeping_get_ns(&tk->tkr_mono); > offset = *offsets[TK_OFFS_REAL]; > } while (read_seqcount_retry(&tk_core.seq, seq)); > > mono = ktime_add_ns(mono, nsecs); > do { > old = atomic64_read(&mg_floor); > if (floor >= mono) > break; > } while(!atomic64_try_cmpxchg(&mg_floor, old, mono); > ts->tv_nsec = 0; > timespec64_add_ns(ts, nsecs); > } > > Where you read the tk data, atomically update the floor (assuming it's > not in the future) and then return the finegrained value, not needing > to manage a cookie value. > > But instead, it seems like if something has happened since the cookie > value was saved (another cpu getting a fine grained timestamp), your > ktime_get_real_ts64_mg() will fall back to returning the same coarse > grained time saved to the cookie, as if no time had past? > > It seems like that could cause problems: > > cpu1 cpu2 > -------------------------------------------------------------------------- > t2a = ktime_get_coarse_real_ts64_mg > t1a = ktime_get_coarse_real_ts64_mg() > t1b = ktime_get_real_ts64_mg(t1a) > > t2b = ktime_get_real_ts64_mg(t2a) > > Where t2b will seem to be before t1b, even though it happened afterwards. > Ahh no, the subtle thing about atomic64_try_cmpxchg is that it overwrites "old" with the value that was currently there in the event that the cmp fails. So, the try_cmpxchg there will either swap the new value into place, or if it was updated in the meantime, "old" will now refer to the value that's currently in the floor word. Either is fine in this case, so we don't need to retry anything.
On Thu, Sep 12, 2024 at 1:18 PM Jeff Layton <jlayton@kernel.org> wrote: > On Thu, 2024-09-12 at 13:11 -0700, John Stultz wrote: > > On Thu, Sep 12, 2024 at 11:02 AM Jeff Layton <jlayton@kernel.org> wrote: > > But instead, it seems like if something has happened since the cookie > > value was saved (another cpu getting a fine grained timestamp), your > > ktime_get_real_ts64_mg() will fall back to returning the same coarse > > grained time saved to the cookie, as if no time had past? > > > > It seems like that could cause problems: > > > > cpu1 cpu2 > > -------------------------------------------------------------------------- > > t2a = ktime_get_coarse_real_ts64_mg > > t1a = ktime_get_coarse_real_ts64_mg() > > t1b = ktime_get_real_ts64_mg(t1a) > > > > t2b = ktime_get_real_ts64_mg(t2a) > > > > Where t2b will seem to be before t1b, even though it happened afterwards. > > > > Ahh no, the subtle thing about atomic64_try_cmpxchg is that it > overwrites "old" with the value that was currently there in the event > that the cmp fails. Ah, ok. Thank you for the explanation there! > So, the try_cmpxchg there will either swap the new value into place, or > if it was updated in the meantime, "old" will now refer to the value > that's currently in the floor word. Either is fine in this case, so we > don't need to retry anything. Though if cpu2 then made another call to ktime_get_coarse_real_ts64_mg(), the value returned there will be the same as t1b? and would be before t2b? thanks -john
On Thu, Sep 12, 2024 at 1:33 PM John Stultz <jstultz@google.com> wrote: > > On Thu, Sep 12, 2024 at 1:18 PM Jeff Layton <jlayton@kernel.org> wrote: > > On Thu, 2024-09-12 at 13:11 -0700, John Stultz wrote: > > > On Thu, Sep 12, 2024 at 11:02 AM Jeff Layton <jlayton@kernel.org> wrote: > > > But instead, it seems like if something has happened since the cookie > > > value was saved (another cpu getting a fine grained timestamp), your > > > ktime_get_real_ts64_mg() will fall back to returning the same coarse > > > grained time saved to the cookie, as if no time had past? > > > > > > It seems like that could cause problems: > > > > > > cpu1 cpu2 > > > -------------------------------------------------------------------------- > > > t2a = ktime_get_coarse_real_ts64_mg > > > t1a = ktime_get_coarse_real_ts64_mg() > > > t1b = ktime_get_real_ts64_mg(t1a) > > > > > > t2b = ktime_get_real_ts64_mg(t2a) > > > > > > Where t2b will seem to be before t1b, even though it happened afterwards. > > > > > > > Ahh no, the subtle thing about atomic64_try_cmpxchg is that it > > overwrites "old" with the value that was currently there in the event > > that the cmp fails. > > Ah, ok. Thank you for the explanation there! > > > So, the try_cmpxchg there will either swap the new value into place, or > > if it was updated in the meantime, "old" will now refer to the value > > that's currently in the floor word. Either is fine in this case, so we > > don't need to retry anything. > > > Though if cpu2 then made another call to > ktime_get_coarse_real_ts64_mg(), the value returned there will be the > same as t1b? and would be before t2b? Oh, no. Apologies again, as I see t2b would be the same as t1b as well. Ok. -john
On Thu 12-09-24 14:02:52, Jeff Layton wrote: > The kernel test robot reported a performance hit in some will-it-scale > tests due to the multigrain timestamp patches. My own testing showed > about a 7% drop in performance on the pipe1_threads test, and the data > showed that coarse_ctime() was slowing down current_time(). > > Move the multigrain timestamp floor tracking word into timekeeper.c. Add > two new public interfaces: The first fills a timespec64 with the later > of the coarse-grained clock and the floor time, and the second gets a > fine-grained time and tries to swap it into the floor and fills a > timespec64 with the result. > > The first function returns an opaque cookie that is suitable for passing > to the second, which will use it as the "old" value in the cmpxchg. > > With this patch on top of the multigrain series, the will-it-scale > pipe1_threads microbenchmark shows these averages on my test rig: > > v6.11-rc7: 103561295 (baseline) > v6.11-rc7 + mgtime + this: 101357203 (~2% performance drop) > > Reported-by: kernel test robot <oliver.sang@intel.com> > Closes: https://lore.kernel.org/oe-lkp/202409091303.31b2b713-oliver.sang@intel.com > Suggested-by: Arnd Bergmann <arnd@kernel.org> > Signed-off-by: Jeff Layton <jlayton@kernel.org> One question regarding the cookie handling as well :) > diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c > index 5391e4167d60..bb039c9d525e 100644 > --- a/kernel/time/timekeeping.c > +++ b/kernel/time/timekeeping.c > @@ -114,6 +114,13 @@ static struct tk_fast tk_fast_raw ____cacheline_aligned = { > .base[1] = FAST_TK_INIT, > }; > > +/* > + * This represents the latest fine-grained time that we have handed out as a > + * timestamp on the system. Tracked as a monotonic ktime_t, and converted to the > + * realtime clock on an as-needed basis. > + */ > +static __cacheline_aligned_in_smp atomic64_t mg_floor; > + > static inline void tk_normalize_xtime(struct timekeeper *tk) > { > while (tk->tkr_mono.xtime_nsec >= ((u64)NSEC_PER_SEC << tk->tkr_mono.shift)) { > @@ -2394,6 +2401,76 @@ void ktime_get_coarse_real_ts64(struct timespec64 *ts) > } > EXPORT_SYMBOL(ktime_get_coarse_real_ts64); > > +/** > + * ktime_get_coarse_real_ts64_mg - get later of coarse grained time or floor > + * @ts: timespec64 to be filled > + * > + * Adjust floor to realtime and compare it to the coarse time. Fill > + * @ts with the latest one. Returns opaque cookie suitable to pass > + * to ktime_get_real_ts64_mg. > + */ > +u64 ktime_get_coarse_real_ts64_mg(struct timespec64 *ts) > +{ > + struct timekeeper *tk = &tk_core.timekeeper; > + u64 floor = atomic64_read(&mg_floor); > + ktime_t f_real, offset, coarse; > + unsigned int seq; > + > + WARN_ON(timekeeping_suspended); > + > + do { > + seq = read_seqcount_begin(&tk_core.seq); > + *ts = tk_xtime(tk); > + offset = *offsets[TK_OFFS_REAL]; > + } while (read_seqcount_retry(&tk_core.seq, seq)); > + > + coarse = timespec64_to_ktime(*ts); > + f_real = ktime_add(floor, offset); > + if (ktime_after(f_real, coarse)) > + *ts = ktime_to_timespec64(f_real); > + return floor; > +} > +EXPORT_SYMBOL_GPL(ktime_get_coarse_real_ts64_mg); > + > +/** > + * ktime_get_real_ts64_mg - attempt to update floor value and return result > + * @ts: pointer to the timespec to be set > + * @cookie: opaque cookie from earlier call to ktime_get_coarse_real_ts64_mg() > + * > + * Get a current monotonic fine-grained time value and attempt to swap > + * it into the floor using @cookie as the "old" value. @ts will be > + * filled with the resulting floor value, regardless of the outcome of > + * the swap. > + */ > +void ktime_get_real_ts64_mg(struct timespec64 *ts, u64 cookie) > +{ > + struct timekeeper *tk = &tk_core.timekeeper; > + ktime_t offset, mono, old = (ktime_t)cookie; > + unsigned int seq; > + u64 nsecs; So what would be the difference if we did instead: old = atomic64_read(&mg_floor); and not bother with the cookie? AFAIU this could result in somewhat more updates to mg_floor (the contention on the mg_floor cacheline would be the same but there would be more invalidates of the cacheline). OTOH these updates can happen only if max(current_coarse_time, mg_floor) == inode->i_ctime which is presumably rare? What is your concern that I'm missing? Honza > + > + WARN_ON(timekeeping_suspended); > + > + do { > + seq = read_seqcount_begin(&tk_core.seq); > + > + ts->tv_sec = tk->xtime_sec; > + mono = tk->tkr_mono.base; > + nsecs = timekeeping_get_ns(&tk->tkr_mono); > + offset = *offsets[TK_OFFS_REAL]; > + } while (read_seqcount_retry(&tk_core.seq, seq)); > + > + mono = ktime_add_ns(mono, nsecs); > + if (atomic64_try_cmpxchg(&mg_floor, &old, mono)) { > + ts->tv_nsec = 0; > + timespec64_add_ns(ts, nsecs); > + } else { > + *ts = ktime_to_timespec64(ktime_add(old, offset)); > + } > + > +} > +EXPORT_SYMBOL(ktime_get_real_ts64_mg); > +
On Fri, 2024-09-13 at 13:26 +0200, Jan Kara wrote: > On Thu 12-09-24 14:02:52, Jeff Layton wrote: > > The kernel test robot reported a performance hit in some will-it-scale > > tests due to the multigrain timestamp patches. My own testing showed > > about a 7% drop in performance on the pipe1_threads test, and the data > > showed that coarse_ctime() was slowing down current_time(). > > > > Move the multigrain timestamp floor tracking word into timekeeper.c. Add > > two new public interfaces: The first fills a timespec64 with the later > > of the coarse-grained clock and the floor time, and the second gets a > > fine-grained time and tries to swap it into the floor and fills a > > timespec64 with the result. > > > > The first function returns an opaque cookie that is suitable for passing > > to the second, which will use it as the "old" value in the cmpxchg. > > > > With this patch on top of the multigrain series, the will-it-scale > > pipe1_threads microbenchmark shows these averages on my test rig: > > > > v6.11-rc7: 103561295 (baseline) > > v6.11-rc7 + mgtime + this: 101357203 (~2% performance drop) > > > > Reported-by: kernel test robot <oliver.sang@intel.com> > > Closes: https://lore.kernel.org/oe-lkp/202409091303.31b2b713-oliver.sang@intel.com > > Suggested-by: Arnd Bergmann <arnd@kernel.org> > > Signed-off-by: Jeff Layton <jlayton@kernel.org> > > One question regarding the cookie handling as well :) > > > diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c > > index 5391e4167d60..bb039c9d525e 100644 > > --- a/kernel/time/timekeeping.c > > +++ b/kernel/time/timekeeping.c > > @@ -114,6 +114,13 @@ static struct tk_fast tk_fast_raw ____cacheline_aligned = { > > .base[1] = FAST_TK_INIT, > > }; > > > > +/* > > + * This represents the latest fine-grained time that we have handed out as a > > + * timestamp on the system. Tracked as a monotonic ktime_t, and converted to the > > + * realtime clock on an as-needed basis. > > + */ > > +static __cacheline_aligned_in_smp atomic64_t mg_floor; > > + > > static inline void tk_normalize_xtime(struct timekeeper *tk) > > { > > while (tk->tkr_mono.xtime_nsec >= ((u64)NSEC_PER_SEC << tk->tkr_mono.shift)) { > > @@ -2394,6 +2401,76 @@ void ktime_get_coarse_real_ts64(struct timespec64 *ts) > > } > > EXPORT_SYMBOL(ktime_get_coarse_real_ts64); > > > > +/** > > + * ktime_get_coarse_real_ts64_mg - get later of coarse grained time or floor > > + * @ts: timespec64 to be filled > > + * > > + * Adjust floor to realtime and compare it to the coarse time. Fill > > + * @ts with the latest one. Returns opaque cookie suitable to pass > > + * to ktime_get_real_ts64_mg. > > + */ > > +u64 ktime_get_coarse_real_ts64_mg(struct timespec64 *ts) > > +{ > > + struct timekeeper *tk = &tk_core.timekeeper; > > + u64 floor = atomic64_read(&mg_floor); > > + ktime_t f_real, offset, coarse; > > + unsigned int seq; > > + > > + WARN_ON(timekeeping_suspended); > > + > > + do { > > + seq = read_seqcount_begin(&tk_core.seq); > > + *ts = tk_xtime(tk); > > + offset = *offsets[TK_OFFS_REAL]; > > + } while (read_seqcount_retry(&tk_core.seq, seq)); > > + > > + coarse = timespec64_to_ktime(*ts); > > + f_real = ktime_add(floor, offset); > > + if (ktime_after(f_real, coarse)) > > + *ts = ktime_to_timespec64(f_real); > > + return floor; > > +} > > +EXPORT_SYMBOL_GPL(ktime_get_coarse_real_ts64_mg); > > + > > +/** > > + * ktime_get_real_ts64_mg - attempt to update floor value and return result > > + * @ts: pointer to the timespec to be set > > + * @cookie: opaque cookie from earlier call to ktime_get_coarse_real_ts64_mg() > > + * > > + * Get a current monotonic fine-grained time value and attempt to swap > > + * it into the floor using @cookie as the "old" value. @ts will be > > + * filled with the resulting floor value, regardless of the outcome of > > + * the swap. > > + */ > > +void ktime_get_real_ts64_mg(struct timespec64 *ts, u64 cookie) > > +{ > > + struct timekeeper *tk = &tk_core.timekeeper; > > + ktime_t offset, mono, old = (ktime_t)cookie; > > + unsigned int seq; > > + u64 nsecs; > > So what would be the difference if we did instead: > > old = atomic64_read(&mg_floor); > > and not bother with the cookie? AFAIU this could result in somewhat more > updates to mg_floor (the contention on the mg_floor cacheline would be the > same but there would be more invalidates of the cacheline). OTOH these > updates can happen only if max(current_coarse_time, mg_floor) == > inode->i_ctime which is presumably rare? What is your concern that I'm > missing? > My main concern is the "somewhat more updates to mg_floor". mg_floor is a global variable, so one of my main goals is to minimize the updates to it. There is no correctness issue in doing what you're saying above (AFAICT anyway), but the window of time between when we fetch the current floor and try to do the swap will be smaller, and we'll end up doing more swaps as a result. Do you have any objection to adding the cookie to this API? > > > + > > + WARN_ON(timekeeping_suspended); > > + > > + do { > > + seq = read_seqcount_begin(&tk_core.seq); > > + > > + ts->tv_sec = tk->xtime_sec; > > + mono = tk->tkr_mono.base; > > + nsecs = timekeeping_get_ns(&tk->tkr_mono); > > + offset = *offsets[TK_OFFS_REAL]; > > + } while (read_seqcount_retry(&tk_core.seq, seq)); > > + > > + mono = ktime_add_ns(mono, nsecs); > > + if (atomic64_try_cmpxchg(&mg_floor, &old, mono)) { > > + ts->tv_nsec = 0; > > + timespec64_add_ns(ts, nsecs); > > + } else { > > + *ts = ktime_to_timespec64(ktime_add(old, offset)); > > + } > > + > > +} > > +EXPORT_SYMBOL(ktime_get_real_ts64_mg); > > +
On Fri, Sep 13, 2024 at 5:01 AM Jeff Layton <jlayton@kernel.org> wrote: > On Fri, 2024-09-13 at 13:26 +0200, Jan Kara wrote: > > So what would be the difference if we did instead: > > > > old = atomic64_read(&mg_floor); > > > > and not bother with the cookie? AFAIU this could result in somewhat more > > updates to mg_floor (the contention on the mg_floor cacheline would be the > > same but there would be more invalidates of the cacheline). OTOH these > > updates can happen only if max(current_coarse_time, mg_floor) == > > inode->i_ctime which is presumably rare? What is your concern that I'm > > missing? > > > > My main concern is the "somewhat more updates to mg_floor". mg_floor is > a global variable, so one of my main goals is to minimize the updates > to it. There is no correctness issue in doing what you're saying above > (AFAICT anyway), but the window of time between when we fetch the > current floor and try to do the swap will be smaller, and we'll end up > doing more swaps as a result. Would it be worth quantifying that cost? > Do you have any objection to adding the cookie to this API? My main concern is it is just a bit subtle. I found it hard to grok (though I can be pretty dim sometimes, so maybe that doesn't count for much :) It seems if it were misused, the fine-grained accessor could constantly return coarse grained results when called repeatedly with a very stale cookie. Further, the point about avoiding "too many" mg_floor writes is a little fuzzy. It feels almost like folks would need to use the cookie update as a tuning knob to balance the granularity of their timestamps against the cost of the global mg_floor writes. So this probably needs some clear comments to make it more obvious. thanks -john
On Fri, 2024-09-13 at 11:43 -0700, John Stultz wrote: > On Fri, Sep 13, 2024 at 5:01 AM Jeff Layton <jlayton@kernel.org> wrote: > > On Fri, 2024-09-13 at 13:26 +0200, Jan Kara wrote: > > > So what would be the difference if we did instead: > > > > > > old = atomic64_read(&mg_floor); > > > > > > and not bother with the cookie? AFAIU this could result in somewhat more > > > updates to mg_floor (the contention on the mg_floor cacheline would be the > > > same but there would be more invalidates of the cacheline). OTOH these > > > updates can happen only if max(current_coarse_time, mg_floor) == > > > inode->i_ctime which is presumably rare? What is your concern that I'm > > > missing? > > > > > > > My main concern is the "somewhat more updates to mg_floor". mg_floor is > > a global variable, so one of my main goals is to minimize the updates > > to it. There is no correctness issue in doing what you're saying above > > (AFAICT anyway), but the window of time between when we fetch the > > current floor and try to do the swap will be smaller, and we'll end up > > doing more swaps as a result. > > Would it be worth quantifying that cost? > There's a patch in the larger set that adds some percpu counters to count events. One of them was successful floor value swaps. I dropped that particular counter from the v7 set, but we could resurrect it. > > Do you have any objection to adding the cookie to this API? > > My main concern is it is just a bit subtle. I found it hard to grok > (though I can be pretty dim sometimes, so maybe that doesn't count for > much :) > It seems if it were misused, the fine-grained accessor could > constantly return coarse grained results when called repeatedly with a > very stale cookie. > > Further, the point about avoiding "too many" mg_floor writes is a > little fuzzy. It feels almost like folks would need to use the cookie > update as a tuning knob to balance the granularity of their timestamps > against the cost of the global mg_floor writes. So this probably needs > some clear comments to make it more obvious. > Fair points. I don't have any hard numbers around it. I'm mainly just trying to do what I can to keep the floor swaps to an absolute minimum. This is a global value after all so we really are better off avoiding cache invalidations. That said, passing the cookie like this would only open the window a small amount. I can certainly drop that part of the interface. In the big scheme of things I doubt it'll make much difference in performance, and if it does we can always bring it back. If that sounds OK, I'll send a v8 (after some testing). I have some comment updates I'd like to add as well.
On Fri 13-09-24 08:01:28, Jeff Layton wrote: > On Fri, 2024-09-13 at 13:26 +0200, Jan Kara wrote: > > On Thu 12-09-24 14:02:52, Jeff Layton wrote: > > > +/** > > > + * ktime_get_real_ts64_mg - attempt to update floor value and return result > > > + * @ts: pointer to the timespec to be set > > > + * @cookie: opaque cookie from earlier call to ktime_get_coarse_real_ts64_mg() > > > + * > > > + * Get a current monotonic fine-grained time value and attempt to swap > > > + * it into the floor using @cookie as the "old" value. @ts will be > > > + * filled with the resulting floor value, regardless of the outcome of > > > + * the swap. > > > + */ > > > +void ktime_get_real_ts64_mg(struct timespec64 *ts, u64 cookie) > > > +{ > > > + struct timekeeper *tk = &tk_core.timekeeper; > > > + ktime_t offset, mono, old = (ktime_t)cookie; > > > + unsigned int seq; > > > + u64 nsecs; > > > > So what would be the difference if we did instead: > > > > old = atomic64_read(&mg_floor); > > > > and not bother with the cookie? AFAIU this could result in somewhat more > > updates to mg_floor (the contention on the mg_floor cacheline would be the > > same but there would be more invalidates of the cacheline). OTOH these > > updates can happen only if max(current_coarse_time, mg_floor) == > > inode->i_ctime which is presumably rare? What is your concern that I'm > > missing? > > > > My main concern is the "somewhat more updates to mg_floor". mg_floor is > a global variable, so one of my main goals is to minimize the updates > to it. There is no correctness issue in doing what you're saying above > (AFAICT anyway), but the window of time between when we fetch the > current floor and try to do the swap will be smaller, and we'll end up > doing more swaps as a result. > > Do you have any objection to adding the cookie to this API? No objection as such but as John said, I had also some trouble understanding what the cookie value is about and what are the constraints in using it. So if we can live without cookie, it would be a simplification of the API. If the cooking indeed brings noticeable performance benefit, we just need to document that the cookie is about performance and how to use it to get good performance. Honza
diff --git a/fs/inode.c b/fs/inode.c index f0fbfd470d8e..3c7e16935bac 100644 --- a/fs/inode.c +++ b/fs/inode.c @@ -65,13 +65,6 @@ static unsigned int i_hash_shift __ro_after_init; static struct hlist_head *inode_hashtable __ro_after_init; static __cacheline_aligned_in_smp DEFINE_SPINLOCK(inode_hash_lock); -/* - * This represents the latest fine-grained time that we have handed out as a - * timestamp on the system. Tracked as a monotonic value, and converted to the - * realtime clock on an as-needed basis. - */ -static __cacheline_aligned_in_smp atomic64_t ctime_floor; - /* * Empty aops. Can be used for the cases where the user does not * define any of the address_space operations. @@ -2255,25 +2248,6 @@ int file_remove_privs(struct file *file) } EXPORT_SYMBOL(file_remove_privs); -/** - * coarse_ctime - return the current coarse-grained time - * @floor: current (monotonic) ctime_floor value - * - * Get the coarse-grained time, and then determine whether to - * return it or the current floor value. Returns the later of the - * floor and coarse grained timestamps, converted to realtime - * clock value. - */ -static ktime_t coarse_ctime(ktime_t floor) -{ - ktime_t coarse = ktime_get_coarse(); - - /* If coarse time is already newer, return that */ - if (!ktime_after(floor, coarse)) - return ktime_get_coarse_real(); - return ktime_mono_to_real(floor); -} - /** * current_time - Return FS time (possibly fine-grained) * @inode: inode. @@ -2284,11 +2258,11 @@ static ktime_t coarse_ctime(ktime_t floor) */ struct timespec64 current_time(struct inode *inode) { - ktime_t floor = atomic64_read(&ctime_floor); - ktime_t now = coarse_ctime(floor); - struct timespec64 now_ts = ktime_to_timespec64(now); + struct timespec64 now; u32 cns; + ktime_get_coarse_real_ts64_mg(&now); + if (!is_mgtime(inode)) goto out; @@ -2299,11 +2273,11 @@ struct timespec64 current_time(struct inode *inode) * If there is no apparent change, then * get a fine-grained timestamp. */ - if (now_ts.tv_nsec == (cns & ~I_CTIME_QUERIED)) - ktime_get_real_ts64(&now_ts); + if (now.tv_nsec == (cns & ~I_CTIME_QUERIED)) + ktime_get_real_ts64(&now); } out: - return timestamp_truncate(now_ts, inode); + return timestamp_truncate(now, inode); } EXPORT_SYMBOL(current_time); @@ -2745,7 +2719,7 @@ EXPORT_SYMBOL(timestamp_truncate); * * Set the inode's ctime to the current value for the inode. Returns the * current value that was assigned. If this is not a multigrain inode, then we - * just set it to whatever the coarse_ctime is. + * set it to the later of the coarse time and floor value. * * If it is multigrain, then we first see if the coarse-grained timestamp is * distinct from what we have. If so, then we'll just use that. If we have to @@ -2756,16 +2730,16 @@ EXPORT_SYMBOL(timestamp_truncate); */ struct timespec64 inode_set_ctime_current(struct inode *inode) { - ktime_t now, floor = atomic64_read(&ctime_floor); - struct timespec64 now_ts; + struct timespec64 now; u32 cns, cur; + u64 cookie; - now = coarse_ctime(floor); + cookie = ktime_get_coarse_real_ts64_mg(&now); /* Just return that if this is not a multigrain fs */ if (!is_mgtime(inode)) { - now_ts = timestamp_truncate(ktime_to_timespec64(now), inode); - inode_set_ctime_to_ts(inode, now_ts); + now = timestamp_truncate(now, inode); + inode_set_ctime_to_ts(inode, now); goto out; } @@ -2776,44 +2750,27 @@ struct timespec64 inode_set_ctime_current(struct inode *inode) */ cns = smp_load_acquire(&inode->i_ctime_nsec); if (cns & I_CTIME_QUERIED) { - ktime_t ctime = ktime_set(inode->i_ctime_sec, cns & ~I_CTIME_QUERIED); - - if (!ktime_after(now, ctime)) { - ktime_t old, fine; + struct timespec64 ctime = { .tv_sec = inode->i_ctime_sec, + .tv_nsec = cns & ~I_CTIME_QUERIED }; - /* Get a fine-grained time */ - fine = ktime_get(); - mgtime_counter_inc(mg_fine_stamps); - - /* - * If the cmpxchg works, we take the new floor value. If - * not, then that means that someone else changed it after we - * fetched it but before we got here. That value is just - * as good, so keep it. - */ - old = floor; - if (atomic64_try_cmpxchg(&ctime_floor, &old, fine)) - mgtime_counter_inc(mg_floor_swaps); - else - fine = old; - now = ktime_mono_to_real(fine); - } + if (timespec64_compare(&now, &ctime) <= 0) + ktime_get_real_ts64_mg(&now, cookie); } mgtime_counter_inc(mg_ctime_updates); - now_ts = timestamp_truncate(ktime_to_timespec64(now), inode); - cur = cns; + now = timestamp_truncate(now, inode); /* No need to cmpxchg if it's exactly the same */ - if (cns == now_ts.tv_nsec && inode->i_ctime_sec == now_ts.tv_sec) { - trace_ctime_xchg_skip(inode, &now_ts); + if (cns == now.tv_nsec && inode->i_ctime_sec == now.tv_sec) { + trace_ctime_xchg_skip(inode, &now); goto out; } + cur = cns; retry: /* Try to swap the nsec value into place. */ - if (try_cmpxchg(&inode->i_ctime_nsec, &cur, now_ts.tv_nsec)) { + if (try_cmpxchg(&inode->i_ctime_nsec, &cur, now.tv_nsec)) { /* If swap occurred, then we're (mostly) done */ - inode->i_ctime_sec = now_ts.tv_sec; - trace_ctime_ns_xchg(inode, cns, now_ts.tv_nsec, cur); + inode->i_ctime_sec = now.tv_sec; + trace_ctime_ns_xchg(inode, cns, now.tv_nsec, cur); mgtime_counter_inc(mg_ctime_swaps); } else { /* @@ -2827,11 +2784,11 @@ struct timespec64 inode_set_ctime_current(struct inode *inode) goto retry; } /* Otherwise, keep the existing ctime */ - now_ts.tv_sec = inode->i_ctime_sec; - now_ts.tv_nsec = cur & ~I_CTIME_QUERIED; + now.tv_sec = inode->i_ctime_sec; + now.tv_nsec = cur & ~I_CTIME_QUERIED; } out: - return now_ts; + return now; } EXPORT_SYMBOL(inode_set_ctime_current); @@ -2854,8 +2811,7 @@ EXPORT_SYMBOL(inode_set_ctime_current); */ struct timespec64 inode_set_ctime_deleg(struct inode *inode, struct timespec64 update) { - ktime_t now, floor = atomic64_read(&ctime_floor); - struct timespec64 now_ts, cur_ts; + struct timespec64 now, cur_ts; u32 cur, old; /* pairs with try_cmpxchg below */ @@ -2867,12 +2823,11 @@ struct timespec64 inode_set_ctime_deleg(struct inode *inode, struct timespec64 u if (timespec64_compare(&update, &cur_ts) <= 0) return cur_ts; - now = coarse_ctime(floor); - now_ts = ktime_to_timespec64(now); + ktime_get_coarse_real_ts64_mg(&now); /* Clamp the update to "now" if it's in the future */ - if (timespec64_compare(&update, &now_ts) > 0) - update = now_ts; + if (timespec64_compare(&update, &now) > 0) + update = now; update = timestamp_truncate(update, inode); diff --git a/include/linux/timekeeping.h b/include/linux/timekeeping.h index fc12a9ba2c88..cf2293158c65 100644 --- a/include/linux/timekeeping.h +++ b/include/linux/timekeeping.h @@ -45,6 +45,10 @@ extern void ktime_get_real_ts64(struct timespec64 *tv); extern void ktime_get_coarse_ts64(struct timespec64 *ts); extern void ktime_get_coarse_real_ts64(struct timespec64 *ts); +/* Multigrain timestamp interfaces */ +extern u64 ktime_get_coarse_real_ts64_mg(struct timespec64 *ts); +extern void ktime_get_real_ts64_mg(struct timespec64 *ts, u64 cookie); + void getboottime64(struct timespec64 *ts); /* diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c index 5391e4167d60..bb039c9d525e 100644 --- a/kernel/time/timekeeping.c +++ b/kernel/time/timekeeping.c @@ -114,6 +114,13 @@ static struct tk_fast tk_fast_raw ____cacheline_aligned = { .base[1] = FAST_TK_INIT, }; +/* + * This represents the latest fine-grained time that we have handed out as a + * timestamp on the system. Tracked as a monotonic ktime_t, and converted to the + * realtime clock on an as-needed basis. + */ +static __cacheline_aligned_in_smp atomic64_t mg_floor; + static inline void tk_normalize_xtime(struct timekeeper *tk) { while (tk->tkr_mono.xtime_nsec >= ((u64)NSEC_PER_SEC << tk->tkr_mono.shift)) { @@ -2394,6 +2401,76 @@ void ktime_get_coarse_real_ts64(struct timespec64 *ts) } EXPORT_SYMBOL(ktime_get_coarse_real_ts64); +/** + * ktime_get_coarse_real_ts64_mg - get later of coarse grained time or floor + * @ts: timespec64 to be filled + * + * Adjust floor to realtime and compare it to the coarse time. Fill + * @ts with the latest one. Returns opaque cookie suitable to pass + * to ktime_get_real_ts64_mg. + */ +u64 ktime_get_coarse_real_ts64_mg(struct timespec64 *ts) +{ + struct timekeeper *tk = &tk_core.timekeeper; + u64 floor = atomic64_read(&mg_floor); + ktime_t f_real, offset, coarse; + unsigned int seq; + + WARN_ON(timekeeping_suspended); + + do { + seq = read_seqcount_begin(&tk_core.seq); + *ts = tk_xtime(tk); + offset = *offsets[TK_OFFS_REAL]; + } while (read_seqcount_retry(&tk_core.seq, seq)); + + coarse = timespec64_to_ktime(*ts); + f_real = ktime_add(floor, offset); + if (ktime_after(f_real, coarse)) + *ts = ktime_to_timespec64(f_real); + return floor; +} +EXPORT_SYMBOL_GPL(ktime_get_coarse_real_ts64_mg); + +/** + * ktime_get_real_ts64_mg - attempt to update floor value and return result + * @ts: pointer to the timespec to be set + * @cookie: opaque cookie from earlier call to ktime_get_coarse_real_ts64_mg() + * + * Get a current monotonic fine-grained time value and attempt to swap + * it into the floor using @cookie as the "old" value. @ts will be + * filled with the resulting floor value, regardless of the outcome of + * the swap. + */ +void ktime_get_real_ts64_mg(struct timespec64 *ts, u64 cookie) +{ + struct timekeeper *tk = &tk_core.timekeeper; + ktime_t offset, mono, old = (ktime_t)cookie; + unsigned int seq; + u64 nsecs; + + WARN_ON(timekeeping_suspended); + + do { + seq = read_seqcount_begin(&tk_core.seq); + + ts->tv_sec = tk->xtime_sec; + mono = tk->tkr_mono.base; + nsecs = timekeeping_get_ns(&tk->tkr_mono); + offset = *offsets[TK_OFFS_REAL]; + } while (read_seqcount_retry(&tk_core.seq, seq)); + + mono = ktime_add_ns(mono, nsecs); + if (atomic64_try_cmpxchg(&mg_floor, &old, mono)) { + ts->tv_nsec = 0; + timespec64_add_ns(ts, nsecs); + } else { + *ts = ktime_to_timespec64(ktime_add(old, offset)); + } + +} +EXPORT_SYMBOL(ktime_get_real_ts64_mg); + void ktime_get_coarse_ts64(struct timespec64 *ts) { struct timekeeper *tk = &tk_core.timekeeper;
The kernel test robot reported a performance hit in some will-it-scale tests due to the multigrain timestamp patches. My own testing showed about a 7% drop in performance on the pipe1_threads test, and the data showed that coarse_ctime() was slowing down current_time(). Move the multigrain timestamp floor tracking word into timekeeper.c. Add two new public interfaces: The first fills a timespec64 with the later of the coarse-grained clock and the floor time, and the second gets a fine-grained time and tries to swap it into the floor and fills a timespec64 with the result. The first function returns an opaque cookie that is suitable for passing to the second, which will use it as the "old" value in the cmpxchg. With this patch on top of the multigrain series, the will-it-scale pipe1_threads microbenchmark shows these averages on my test rig: v6.11-rc7: 103561295 (baseline) v6.11-rc7 + mgtime + this: 101357203 (~2% performance drop) Reported-by: kernel test robot <oliver.sang@intel.com> Closes: https://lore.kernel.org/oe-lkp/202409091303.31b2b713-oliver.sang@intel.com Suggested-by: Arnd Bergmann <arnd@kernel.org> Signed-off-by: Jeff Layton <jlayton@kernel.org> --- This version moves more of the floor handling into the timekeeper code. Some of the earlier concern was about mixing different time value types in the same interface. This sort of still does that, but it does it using an opaque cookie value instead, which I'm hoping will make the interfaces a little cleaner. Using an opaque cookie also means that we can change the underlying implementation at will, without breaking the callers. If this approach looks OK, it's probably best for me to just respin the entire series and have Christian drop the old and pick up the new. That should reduce some of the unnecessary churn in fs/inode.c. --- Changes in v2: - move floor handling completely into timekeeper code - add new interfaces for multigrain timestamp handling - Link to v1: https://lore.kernel.org/r/20240911-mgtime-v1-1-e4aedf1d0d15@kernel.org --- fs/inode.c | 105 +++++++++++++------------------------------- include/linux/timekeeping.h | 4 ++ kernel/time/timekeeping.c | 77 ++++++++++++++++++++++++++++++++ 3 files changed, 111 insertions(+), 75 deletions(-) --- base-commit: 18348a38102a4efca57186740afb33f08e5f4609 change-id: 20240912-mgtime-c1280600a9a3 Best regards,