
vfs: replace current_kernel_time64 with ktime equivalent

Message ID 20180620150138.49380-1-arnd@arndb.de (mailing list archive)
State New, archived

Commit Message

Arnd Bergmann June 20, 2018, 3:01 p.m. UTC
current_time is one of the few callers of current_kernel_time64(), which
is a wrapper around ktime_get_coarse_real_ts64(). Call the latter
directly for consistency with the rest of the kernel, which is moving to
the ktime_get_ family of time accessors.

An open question is whether we may want to actually call the more
accurate ktime_get_real_ts64() for file systems that save high-resolution
timestamps in their on-disk format. This would add a small but measurable
overhead to each update of the inode timestamps, but would let inode
timestamps actually have a usable resolution better than one jiffy (1 to
10 milliseconds normally).

I traced the original addition of the current_kernel_time() call to set
the nanosecond fields back to linux-2.5.48, where Andi Kleen added a
patch with subject "nanosecond stat timefields". This adds the original
call to current_kernel_time and the truncation to the resolution of the
file system, but makes no mention of the intended accuracy.  At the time,
we had a do_gettimeofday() interface that on some architectures could
return a microsecond-resolution timestamp, but there was no interface
for getting an accurate timestamp in nanosecond resolution, neither inside
the kernel nor from user space. This makes me suspect that the use of
coarse timestamps was never really a conscious decision but instead
a result of whatever API was available 16 years ago.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
---
 fs/inode.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

Comments

Andi Kleen June 20, 2018, 3:40 p.m. UTC | #1
Arnd Bergmann <arnd@arndb.de> writes:
>
> I traced the original addition of the current_kernel_time() call to set
> the nanosecond fields back to linux-2.5.48, where Andi Kleen added a
> patch with subject "nanosecond stat timefields". This adds the original
> call to current_kernel_time and the truncation to the resolution of the
> file system, but makes no mention of the intended accuracy.  At the time,
> we had a do_gettimeofday() interface that on some architectures could
> return a microsecond-resolution timestamp, but there was no interface
> for getting an accurate timestamp in nanosecond resolution, neither inside
> the kernel nor from user space. This makes me suspect that the use of
> coarse timestamps was never really a conscious decision but instead
> a result of whatever API was available 16 years ago.

Kind of. VFS/system calls are expensive enough that you need multiple
microseconds in and out, so microsecond resolution was considered good
enough.

Also if you do this change you really need to do some benchmarks,
especially on setups without lazy atime. This might potentially
cause a lot more inode flushes.

-Andi
Arnd Bergmann June 20, 2018, 4:14 p.m. UTC | #2
On Wed, Jun 20, 2018 at 5:40 PM, Andi Kleen <ak@linux.intel.com> wrote:
> Arnd Bergmann <arnd@arndb.de> writes:
>>
>> I traced the original addition of the current_kernel_time() call to set
>> the nanosecond fields back to linux-2.5.48, where Andi Kleen added a
>> patch with subject "nanosecond stat timefields". This adds the original
>> call to current_kernel_time and the truncation to the resolution of the
>> file system, but makes no mention of the intended accuracy.  At the time,
>> we had a do_gettimeofday() interface that on some architectures could
>> return a microsecond-resolution timestamp, but there was no interface
>> for getting an accurate timestamp in nanosecond resolution, neither inside
>> the kernel nor from user space. This makes me suspect that the use of
>> coarse timestamps was never really a conscious decision but instead
>> a result of whatever API was available 16 years ago.
>
> Kind of. VFS/system calls are expensive enough that you need multiple
> microseconds in and out, so microsecond resolution was considered good
> enough.

To clarify: current_kernel_time() uses at most millisecond resolution rather
than microsecond, as tkr_mono.xtime_nsec only gets updated during the
timer tick.
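
As a quick cross-check from user space, clock_getres() reports the
tick period for CLOCK_REALTIME_COARSE; a minimal sketch (the value in
the comment assumes HZ=250):

#include <stdio.h>
#include <time.h>

int main(void)
{
        struct timespec res;

        /* the coarse clock only advances on the timer tick */
        clock_getres(CLOCK_REALTIME_COARSE, &res);
        printf("coarse resolution: %ld ns\n", res.tv_nsec); /* 4000000 at HZ=250 */
        return 0;
}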

Has that time scale changed over the past 16 years as CPUs got faster
(and system call entry times slowed down again with recent changes)?

I tried a simple test in the shell, on tmpfs here, and saw:

$ for i in `seq -w 100000` ; do > $i ; done
$ stat * | less | grep Modify | uniq -c | head
    601 Modify: 2018-06-20 18:04:48.794314629 +0200
    920 Modify: 2018-06-20 18:04:48.798314691 +0200
    936 Modify: 2018-06-20 18:04:48.802314753 +0200
    937 Modify: 2018-06-20 18:04:48.806314816 +0200
    901 Modify: 2018-06-20 18:04:48.810314878 +0200
    929 Modify: 2018-06-20 18:04:48.814314940 +0200
    931 Modify: 2018-06-20 18:04:48.818315002 +0200
    894 Modify: 2018-06-20 18:04:48.822315064 +0200
    952 Modify: 2018-06-20 18:04:48.826315128 +0200
    898 Modify: 2018-06-20 18:04:48.830315190 +0200

which indicates that the result of ktime_get_coarse_real_ts64()
gets updated every four milliseconds here (matching the
CONFIG_HZ_250 setting in my running kernel), and that
we can create around 900 files during that time that each
get the same timestamp (strace shows 10 system calls for
each new file). Trying the same on btrfs, I get around 260
files per jiffy.

> Also if you do this change you really need to do some benchmarks,
> especially on setups without lazy atime. This might potentially
> cause a lot more inode flushes.

Good point. On the other hand, there may be some reasons to
do it even if there is a noticeable overhead, in cases where we
actually want hires timestamps, so perhaps this could be
a mount option.

     Arnd
Andi Kleen June 20, 2018, 4:19 p.m. UTC | #3
Arnd Bergmann <arnd@arndb.de> writes:
>
> To clarify: current_kernel_time() uses at most millisecond resolution rather
> than microsecond, as tkr_mono.xtime_nsec only gets updated during the
> timer tick.

Ah you're right. I remember now: the motivation was to make sure there
is basically no overhead. In some setups the full gtod can be rather
slow, particularly if it falls back to some crappy timer.

I think it would be ok if it falls back to jiffies if TSC or a similar
fast timer doesn't work. But the function you're using likely
doesn't do that?

> Has that time scale changed over the past 16 years as CPUs got faster
> (and system call entry times slowed down again with recent changes)?

Maybe a bit, but not substantially.



-Andi
Arnd Bergmann June 20, 2018, 7:35 p.m. UTC | #4
On Wed, Jun 20, 2018 at 6:19 PM, Andi Kleen <ak@linux.intel.com> wrote:
> Arnd Bergmann <arnd@arndb.de> writes:
>>
>> To clarify: current_kernel_time() uses at most millisecond resolution rather
>> than microsecond, as tkr_mono.xtime_nsec only gets updated during the
>> timer tick.
>
> Ah you're right. I remember now: the motivation was to make sure there
> is basically no overhead. In some setups the full gtod can be rather
> slow, particularly if it falls back to some crappy timer.
>
> I think it would be ok if it falls back to jiffies if TSC or a similar
> fast timer doesn't work. But the function you're using likely
> doesn't do that?

My patch as posted just uses ktime_get_coarse_real_ts64(), which
doesn't ever access the hires clocksource; the change is just cosmetic
so far.

The timekeeping and clocksource core code (maintainers added to Cc)
doesn't yet export an API that we can use to determine whether the
clocksource is "fast" or not, but I would expect that we can decide
to add that if needed.
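
If we added one, it might be no more than a predicate along these
lines (purely hypothetical, nothing like this exists today):

/* hypothetical: true if the current clocksource is cheap to read */
bool clocksource_is_fast(void);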

This is also something that definitely changed over the years since
your patch was originally added. Back then, the x86 TSC probably
wasn't reliable enough to depend on, but now I would guess that
very few x86 machines in production use still care.
we used to have all kinds of clocksource drivers with varying
characteristics, but nowadays the embedded market is dominated
by ARMv7VE (Cortex-A7/A15/A17) or ARMv8, which are required
to have a fast clocksource (drivers/clocksource/arm_arch_timer.c),
and a lot of the others have it too (risc-v, modern mips, all ppc32,
most ARM Cortex-A9, ...).
The traditional non-x86 architectures (s390, powerpc, sparc) that
are still being used have of course had low-latency clocksource
access for a much longer time.

This means we're probably fine with a compile-time option that
distros can choose to enable depending on what classes of hardware
they are targeting, like:

struct timespec64 current_time(struct inode *inode)
{
        struct timespec64 now;
        u64 gran = inode->i_sb->s_time_gran;

        if (IS_ENABLED(CONFIG_HIRES_INODE_TIMES) &&
            gran <= NSEC_PER_JIFFY)
                  ktime_get_real_ts64(&now);
        else
                  ktime_get_coarse_real_ts64(&now);

        return timespec64_trunc(now, gran);
}

With that implementation, we could still let file systems choose
to get coarse timestamps by tuning the granularity in the
superblock s_time_gran, which would result in nice round
tv_nsec values that represent the actual accuracy.
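
For reference, the truncation itself stays cheap either way; a
simplified sketch along the lines of the existing timespec64_trunc()
helper in fs/inode.c:

struct timespec64 timespec64_trunc(struct timespec64 t, unsigned gran)
{
        /* avoid the division in the common cases of 1ns and 1s granularity */
        if (gran == NSEC_PER_SEC)
                t.tv_nsec = 0;
        else if (gran > 1 && gran < NSEC_PER_SEC)
                t.tv_nsec -= t.tv_nsec % gran;
        return t;
}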

Obviously this still needs performance testing on various bits
of real hardware, but I can imagine that the overhead is rather
small on hardware from the past five years.

     Arnd
Dave Chinner June 21, 2018, 8:23 p.m. UTC | #5
On Wed, Jun 20, 2018 at 05:01:24PM +0200, Arnd Bergmann wrote:
> current_time is one of the few callers of current_kernel_time64(), which
> is a wrapper around ktime_get_coarse_real_ts64(). Call the latter
> directly for consistency with the rest of the kernel, which is moving to
> the ktime_get_ family of time accessors.
> 
> An open question is whether we may want to actually call the more
> accurate ktime_get_real_ts64() for file systems that save high-resolution
> timestamps in their on-disk format. This would add a small but measurable
> overhead to each update of the inode timestamps, but would let inode
> timestamps actually have a usable resolution better than one jiffy (1 to
> 10 milliseconds normally).
> 
> I traced the original addition of the current_kernel_time() call to set
> the nanosecond fields back to linux-2.5.48, where Andi Kleen added a
> patch with subject "nanosecond stat timefields". This adds the original
> call to current_kernel_time and the truncation to the resolution of the
> file system, but makes no mention of the intended accuracy.  At the time,
> we had a do_gettimeofday() interface that on some architectures could
> return a microsecond-resolution timestamp, but there was no interface
> for getting an accurate timestamp in nanosecond resolution, neither inside
> the kernel nor from user space. This makes me suspect that the use of
> coarse timestamps was never really a conscious decision but instead
> a result of whatever API was available 16 years ago.
> 
> Signed-off-by: Arnd Bergmann <arnd@arndb.de>
> ---
>  fs/inode.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/inode.c b/fs/inode.c
> index 2c300e981796..e27bd9334939 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -2133,7 +2133,9 @@ EXPORT_SYMBOL(timespec64_trunc);
>   */
>  struct timespec64 current_time(struct inode *inode)
>  {
> -	struct timespec64 now = current_kernel_time64();
> +	struct timespec64 now;
> +
> +	ktime_get_coarse_real_ts64(&now);

Can I just say as a filesystem dev who has no idea at all about
kernel timer implementations: this is an awful API change.  There
are hundreds of callers of current_time(), so I'm not going to be
the only person looking at this function who has no clue about WTF
"ktime_get_coarse_real" actually means or does. Further, this
function is not documented, and jumps straight into internal time
implementation stuff, so I'm lost straight away if somebody asks me
"what does that function do"?. i.e. I have *no clue* what this
function returns or why this code uses it.

i.e. the function goes from an obvious self documenting name that
has been blessed as the current kernel timestamp to something that
only people who work on the time subsystem understand and know when
to use. It might make sense to you, but it sucks for everyone
else....

Keep the wrapper, please. Change it to ktime_get_current(), if you
really must change the function namespace...

Cheers,

Dave.
Arnd Bergmann June 22, 2018, 1:24 p.m. UTC | #6
On Thu, Jun 21, 2018 at 10:23 PM, Dave Chinner <david@fromorbit.com> wrote:
> On Wed, Jun 20, 2018 at 05:01:24PM +0200, Arnd Bergmann wrote:

>> diff --git a/fs/inode.c b/fs/inode.c
>> index 2c300e981796..e27bd9334939 100644
>> --- a/fs/inode.c
>> +++ b/fs/inode.c
>> @@ -2133,7 +2133,9 @@ EXPORT_SYMBOL(timespec64_trunc);
>>   */
>>  struct timespec64 current_time(struct inode *inode)
>>  {
>> -     struct timespec64 now = current_kernel_time64();
>> +     struct timespec64 now;
>> +
>> +     ktime_get_coarse_real_ts64(&now);
>
> Can I just say as a filesystem dev who has no idea at all about
> kernel timer implementations: this is an awful API change.  There
> are hundreds of callers of current_time(), so I'm not going to be
> the only person looking at this function who has no clue about WTF
> "ktime_get_coarse_real" actually means or does. Further, this
> function is not documented, and jumps straight into internal time
> implementation stuff, so I'm lost straight away if somebody asks me
> "what does that function do"?. i.e. I have *no clue* what this
> function returns or why this code uses it.

You definitely have a point about the documentation. I meant to
fix that as part of the recent rework of the timekeeping.h header
but haven't finished it, partly because the header is still being
changed as we get rid of the older interfaces.

> i.e. the function goes from an obvious self documenting name that
> has been blessed as the current kernel timestamp to something that
> only people who work on the time subsystem understand and know when
> to use. It might make sense to you, but it sucks for everyone
> else....
>
> Keep the wrapper, please. Change it to ktime_get_current(), if you
> really must change the function namespace...

The thing about current_kernel_time/current_kernel_time64/
ktime_get_coarse_real_ts64 is that it hasn't been a good choice
as a default time accessor for a long time; pretty much everything
outside of current_time() has a better option:

- For measuring time intervals, you want a monotonic time source,
  not a 'real' one, to prevent time from going backwards during leap
  seconds or settimeofday() adjustments (see the sketch after this
  list).
- For getting a timestamp as fast as possible, using a timespec64
  based interface adds extra overhead compared to 'jiffies' or
  ktime_t (64-bit nanoseconds), especially when passing the
  struct as the return value.
- As mentioned before, almost all machines these days have a
  fast clocksource device, so getting a 'coarse' timestamp is
  barely faster than an accurate one.
- Having an interface with (at best) millisecond precision but
  nanosecond resolution can be confusing, as it suggests much
  more precision than we can give (the 'coarse' in the name is
  meant to clarify that).
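
To illustrate the first point, a minimal sketch of measuring an
elapsed time in kernel code with the monotonic clock; do_something()
is just a placeholder:

        ktime_t start;
        s64 delta_ns;

        start = ktime_get();
        do_something();
        /* ktime_get() is monotonic, so the delta cannot go negative */
        delta_ns = ktime_to_ns(ktime_sub(ktime_get(), start));
        pr_info("do_something() took %lld ns\n", delta_ns);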

While changing over all device drivers from the old-style interfaces
with 32-bit time_t (current_kernel_time, get_seconds,
do_gettimeofday and getnstimeofday, getrawmonotonic,
get_monotonic_coarse, get_monotonic_boottime), we tried to
also address those problems, using ktime_get() as the preferred
interface, or an interface with a longer name where we had
specific reasons.

Outside of file system timestamps, there are only two other
files left that ask for a timestamp that is both 'coarse' and
'realtime', and both only do that when user space specifically
asks for that combination.

       Arnd
Arnd Bergmann June 25, 2018, 1:42 p.m. UTC | #7
On Wed, Jun 20, 2018 at 9:35 PM, Arnd Bergmann <arnd@arndb.de> wrote:
> On Wed, Jun 20, 2018 at 6:19 PM, Andi Kleen <ak@linux.intel.com> wrote:
>> Arnd Bergmann <arnd@arndb.de> writes:
>>>
>>> To clarify: current_kernel_time() uses at most millisecond resolution rather
>>> than microsecond, as tkr_mono.xtime_nsec only gets updated during the
>>> timer tick.
>>
>> Ah you're right. I remember now: the motivation was to make sure there
>> is basically no overhead. In some setups the full gtod can be rather
>> slow, particularly if it falls back to some crappy timer.
>
> This means we're probably fine with a compile-time option that
> distros can choose to enable depending on what classes of hardware
> they are targeting, like:
>
> struct timespec64 current_time(struct inode *inode)
> {
>         struct timespec64 now;
>         u64 gran = inode->i_sb->s_time_gran;
>
>         if (IS_ENABLED(CONFIG_HIRES_INODE_TIMES) &&
>             gran <= NSEC_PER_JIFFY)
>                   ktime_get_real_ts64(&now);
>         else
>                   ktime_get_coarse_real_ts64(&now);
>
>         return timespec64_trunc(now, gran);
> }
>
> With that implementation, we could still let file systems choose
> to get coarse timestamps by tuning the granularity in the
> superblock s_time_gran, which would result in nice round
> tv_nsec values that represent the actual accuracy.

I've done some simple tests and found that on a variety of
x86, arm32 and arm64 CPUs, it takes between 70 and 100
CPU cycles to read the TSC and add it to the coarse
clock, e.g. on a 3.1GHz Ryzen, using the little test program
below:

vdso hires:       37.18ns
vdso coarse:       6.44ns
sysc hires:      161.62ns
sysc coarse:     133.87ns

On the same machine, it takes around 400ns (1240 cycles)
to write one byte into a tmpfs file with pwrite(). Adding 5% to
10% overhead for accurate timestamps would definitely be
noticed, so I guess we wouldn't enable that unconditionally,
but could do it as an opt-in mount option if someone had a
use case.

       Arnd

---
/* measure times for high-resolution clocksource access from userspace */
#include <stdio.h>
#include <time.h>
#include <unistd.h>
#include <stdbool.h>
#include <sys/syscall.h>

static int do_clock_gettime(clockid_t clkid, struct timespec *tp, bool vdso)
{
        /* either the (possibly vDSO-accelerated) libc call or a forced syscall */
        if (vdso)
                return clock_gettime(clkid, tp);

        return syscall(__NR_clock_gettime, clkid, tp);
}

static int loop1sec(int clkid, bool vdso)
{
        int i;
        struct timespec t, start;

        /* count how many readouts fit into one second of wall time */
        do_clock_gettime(clkid, &start, vdso);
        i = 0;
        do {
                do_clock_gettime(clkid, &t, vdso);
                i++;
        } while (t.tv_sec == start.tv_sec || t.tv_nsec < start.tv_nsec);

        return i;
}

int main(void)
{
        printf("vdso hires:     %7.2fns\n",
               1000000000.0 / loop1sec(CLOCK_REALTIME, true));
        printf("vdso coarse:    %7.2fns\n",
               1000000000.0 / loop1sec(CLOCK_REALTIME_COARSE, true));
        printf("sysc hires:     %7.2fns\n",
               1000000000.0 / loop1sec(CLOCK_REALTIME, false));
        printf("sysc coarse:    %7.2fns\n",
               1000000000.0 / loop1sec(CLOCK_REALTIME_COARSE, false));

        return 0;
}
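
(For reference, the program builds with something like
"gcc -O2 bench.c -o bench"; the file name is arbitrary, and glibc
versions before 2.17 additionally need -lrt for clock_gettime().)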
Dave Chinner June 26, 2018, 12:24 a.m. UTC | #8
On Fri, Jun 22, 2018 at 03:24:48PM +0200, Arnd Bergmann wrote:
> On Thu, Jun 21, 2018 at 10:23 PM, Dave Chinner <david@fromorbit.com> wrote:
> > On Wed, Jun 20, 2018 at 05:01:24PM +0200, Arnd Bergmann wrote:
> 
> >> diff --git a/fs/inode.c b/fs/inode.c
> >> index 2c300e981796..e27bd9334939 100644
> >> --- a/fs/inode.c
> >> +++ b/fs/inode.c
> >> @@ -2133,7 +2133,9 @@ EXPORT_SYMBOL(timespec64_trunc);
> >>   */
> >>  struct timespec64 current_time(struct inode *inode)
> >>  {
> >> -     struct timespec64 now = current_kernel_time64();
> >> +     struct timespec64 now;
> >> +
> >> +     ktime_get_coarse_real_ts64(&now);
> >
> > Can I just say as a filesystem dev who has no idea at all about
> > kernel timer implementations: this is an awful API change.  There
> > are hundreds of callers of current_time(), so I'm not going to be
> > the only person looking at this function who has no clue about WTF
> > "ktime_get_coarse_real" actually means or does. Further, this
> > function is not documented, and jumps straight into internal time
> > implementation stuff, so I'm lost straight away if somebody asks me
> > "what does that function do"?. i.e. I have *no clue* what this
> > function returns or why this code uses it.
> 
> You definitely have a point about the documentation. I meant to
> fix that as part of the recent rework of the timekeeping.h header
> but haven't finished it, partly because the header is still being
> changed as we get rid of the older interfaces.

The interface documentation should be introduced with the new
interfaces, not left for later as that leaves people like me with no
fucking clue about what the changes actually mean or why they are
being done.  Perhaps you'd have done better to explain this API as
an internal implementation of clock_gettime(CLOCK_REALTIME_COARSE),
which is clearly documented in the man page as:

"Use when you need very fast, but not fine-grained timestamps."

Put that comment on ktime_get_coarse_real_ts64(), and almost all the
questions about "WTF does this whacky function do?" go away....
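
Even a minimal kerneldoc sketch along those lines would help:

/**
 * ktime_get_coarse_real_ts64 - get coarse CLOCK_REALTIME timestamp
 * @ts: pointer to the timespec64 to be filled
 *
 * In-kernel equivalent of clock_gettime(CLOCK_REALTIME_COARSE):
 * the wall clock time as of the last timer tick. Use when you need
 * very fast, but not fine-grained timestamps.
 */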

Cheers,

Dave.
Arnd Bergmann June 26, 2018, 4:08 p.m. UTC | #9
On Tue, Jun 26, 2018 at 2:24 AM, Dave Chinner <david@fromorbit.com> wrote:
> On Fri, Jun 22, 2018 at 03:24:48PM +0200, Arnd Bergmann wrote:
>> On Thu, Jun 21, 2018 at 10:23 PM, Dave Chinner <david@fromorbit.com> wrote:
>> > On Wed, Jun 20, 2018 at 05:01:24PM +0200, Arnd Bergmann wrote:

>> You definitely have a point about the documentation. I meant to
>> fix that as part of the recent rework of the timekeeping.h header
>> but haven't finished it, partly because the header is still being
>> changed as we get rid of the older interfaces.
>
> The interface documentation should be introduced with the new
> interfaces, not left for later as that leaves people like me with no
> fucking clue about what the changes actually mean or why they are
> being done.  Perhaps you'd have done better to explain this API as
> an internal implementation of clock_gettime(CLOCK_REALTIME_COARSE),
> which is clearly documented in the man page as:
>
> "Use when you need very fast, but not fine-grained timestamps."
>
> Put that comment on ktime_get_coarse_real_ts64(), and almost all the
> questions about "WTF does this whacky function do?" go away....

I've now tried to come up with coherent documentation and sent a patch
for that. For this version, I ended up not documenting each of the
ktime_get() functions separately but instead added a new file in the
core-api documentation and a reference to that from the header. This
seemed easier to understand than duplicating the same text for over
30 very similar interfaces.

       Arnd

Patch

diff --git a/fs/inode.c b/fs/inode.c
index 2c300e981796..e27bd9334939 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -2133,7 +2133,9 @@ EXPORT_SYMBOL(timespec64_trunc);
  */
 struct timespec64 current_time(struct inode *inode)
 {
-	struct timespec64 now = current_kernel_time64();
+	struct timespec64 now;
+
+	ktime_get_coarse_real_ts64(&now);
 
 	if (unlikely(!inode->i_sb)) {
 		WARN(1, "current_time() called with uninitialized super_block in the inode");