Message ID: 20230921-umgekehrt-buden-a8718451ef7c@brauner
State: New, archived
Series: [GIT,PULL,v2] timestamp fixes
The pull request you sent on Thu, 21 Sep 2023 13:20:46 +0200:
> git@gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs tags/v6.6-rc3.vfs.ctime.revert
has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/b5cbe7c00aa0f7a81ec40c007f81a3e9c84581e3
Thank you!
On Thu, 2023-09-21 at 12:46 -0700, Linus Torvalds wrote:
> On Thu, 21 Sept 2023 at 12:28, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > And that's ok when we're talking about times that are kernel running times and we have a couple of centuries to say "ok, we'll need to make it be a bigger type",
>
> Note that the "couple of centuries" here is mostly the machine uptime, not necessarily "we'll need to change the time in the year 2292".
>

Right. On-disk formats are really a different matter anyway, so that value is only relevant within a single running instance.

> Although we do also have "ktime_get_real()" which is encoding the whole "nanoseconds since 1970". That *will* break in 2292.
>

Still pretty much SEP, unless we all end up as cyborgs after this.

> Anyway, regardless, I am *not* suggesting that ktime_t would be useful for filesystems, because of this issue.
>
> I *do* suspect that we might consider a "tenth of a microsecond", though.
>
> Resolution-wise, it's pretty much in the "system call time" order of magnitude, and if we have Linux filesystems around in the year-31k, I'll happily consider it to be a SEP thing at that point ("somebody else's problem").
>

FWIW, I'm reworking the multigrain ctime patches for internal consumers. As part of that, when we present multigrain timestamps to userland via statx, we'll truncate them at a granularity of (NSEC_PER_SEC / HZ). So, we could easily do that today since we're already going to be truncating off more than that for external uses.

Having a single word to deal with would sure be simpler too, particularly since we're using atomic operations here. I'll have to think about it. The first step is to get all of the timestamp handling wrappers in place anyway.

Cheers,
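A minimal sketch of the statx-side truncation described above, assuming the granularity is expressed in nanoseconds (e.g. NSEC_PER_SEC / HZ); the helper name is hypothetical and this is not the actual multigrain patch:

/*
 * Truncate a timestamp to a coarser granularity before presenting it
 * to userland; gran_ns would be something like NSEC_PER_SEC / HZ.
 * Hypothetical helper, for illustration only.
 */
static inline struct timespec64 statx_truncate_ts(struct timespec64 ts, u32 gran_ns)
{
	ts.tv_nsec -= ts.tv_nsec % gran_ns;	/* drop the sub-granularity part */
	return ts;
}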
On Thu, Sep 21, 2023 at 12:28:13PM -0700, Linus Torvalds wrote:
> On Thu, 21 Sept 2023 at 11:51, Jeff Layton <jlayton@kernel.org> wrote:
> >
> > We have many, many inodes though, and 12 bytes per adds up!
>
> That was my thinking, but honestly, who knows what other alignment issues might eat up some - or all - of the theoretical 12 bytes.
>
> It might be, for example, that the inode is already some aligned size, and that the allocation alignment means that the size wouldn't *really* shrink at all.
>
> So I just want to make clear that I think the 12 bytes isn't necessarily there. Maybe you'd get it, maybe it would be hidden by other things.

I think all filesystem developers appreciate it when struct inode shrinks; it's usually embedded in the filesystem's own inode structure along with additional data, so the size only grows. I'm on a mission to squeeze btrfs_inode under 1024 bytes so that it fits better into slab pages; currently it's about 1100 bytes. 1024 is within reach, but it gets harder to find potential space savings.
On Thu, Sep 21, 2023 at 11:24:43AM -0700, Linus Torvalds wrote:
> On Thu, 21 Sept 2023 at 04:21, Christian Brauner <brauner@kernel.org> wrote:
> >
> >   git@gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs tags/v6.6-rc3.vfs.ctime.revert
>
> So for some reason pr-tracker-bot doesn't seem to have reacted to this pull request, but it's in my tree now.

I think vger was somehow backed up again. We've been hit kinda hard by this recently. I've asked Konstantin whether he'd move us to the new kernel.org managed mailing infra (which will happen eventually anyway), so hopefully we'll have fewer of these delays in the very near future...
> We have many, many inodes though, and 12 bytes per adds up!
>
> I'm on board with the idea, but...that's likely to be as big a patch
> series as the ctime overhaul was. In fact, it'll touch a lot of the same
> code.

Hm, I really think that isn't an issue. Let the series be big. Half of the meaningful fs patches explode as soon as you change any i_op/f_op anyway. If it meaningfully cleans something up, then it's worth it.

Hm, we really need automated testing on vfs.git soon so we can run xfstests for all major filesystems automatically, instead of me having to kick it off manually every time.
> I'll have to think about it. The first step is to get all of the
> timestamp handling wrappers in place anyway.
I think it'd be best if we start off with converting the other times in
struct inode to accessors and leave the question of whether timestamps
until 2292 are enough to solve for later. I don't think torn timestamp
reads are all that pressing an issue, and mixing both things up might
just stall what is otherwise already a worthy cleanup.
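A minimal sketch of what such accessors could look like, assuming the split sec/nsec fields Linus proposes elsewhere in this thread; the helper names are illustrative, not a settled API:

/*
 * Illustrative accessors, assuming i_atime has been split into
 * i_atime_sec / i_atime_nsec fields in struct inode.
 */
static inline struct timespec64 inode_get_atime(const struct inode *inode)
{
	return (struct timespec64) {
		.tv_sec  = inode->i_atime_sec,
		.tv_nsec = inode->i_atime_nsec,
	};
}

static inline struct timespec64 inode_set_atime_to_ts(struct inode *inode,
						      struct timespec64 ts)
{
	inode->i_atime_sec = ts.tv_sec;
	inode->i_atime_nsec = ts.tv_nsec;
	return ts;
}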
On Sat, Sep 23, 2023 at 3:43 AM Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
> On Thu, 21 Sept 2023 at 11:51, Jeff Layton <jlayton@kernel.org> wrote:
> >
> > We have many, many inodes though, and 12 bytes per adds up!
>
> That was my thinking, but honestly, who knows what other alignment issues might eat up some - or all - of the theoretical 12 bytes.
>
> It might be, for example, that the inode is already some aligned size, and that the allocation alignment means that the size wouldn't *really* shrink at all.
>
> So I just want to make clear that I think the 12 bytes isn't necessarily there. Maybe you'd get it, maybe it would be hidden by other things.
>
> My biggest impetus was really that whole abuse of a type that I already disliked for other reasons.
>
> > I'm on board with the idea, but...that's likely to be as big a patch
> > series as the ctime overhaul was. In fact, it'll touch a lot of the same
> > code. I can take a stab at that in the near future though.
>
> Yea, it's likely to be fairly big and invasive. That was one of the reasons for my suggested "inode_time()" macro hack: using the macro argument concatenation is really a hack to "gather" the pieces based on name, and while it's odd and not a very typical kernel model, I think doing it that way might allow the conversion to be slightly less painful.
>
> You'd obviously have to have the same kind of thing for assignment.
>
> Without that kind of name-based hack, you'd have to create all these random helper functions that just do the same thing over and over for the different times, which seems really annoying.
>
> > Since we're on the subject...another thing that bothers me with all of
> > the timestamp handling is that we don't currently try to mitigate "torn
> > reads" across the two different words. It seems like you could fetch a
> > tv_sec value and then get a tv_nsec value that represents an entirely
> > different timestamp if there are stores between them.
>
> Hmm. I think that's an issue that we have always had in theory, and have ignored because it's simply not problematic in practice, and fixing it is *hugely* painful.
>
> I suspect we'd have to use some kind of sequence lock for it (to make reads be cheap), and while it's _possible_ that having the separate accessor functions for reading/writing those times might help things out, I suspect the reading/writing happens for the different times (ie atime/mtime/ctime) together often enough that you might want to have the locking done at an outer level, and _not_ do it at the accessor level.
>
> So I suspect this is a completely separate issue (ie even an accessor doesn't make the "hugely painful" go away). And probably not worth worrying about *unless* somebody decides that they really really care about the race.
>
> That said, one thing that *could* help is if people decide that the right format for inode times is to just have one 64-bit word that has "sufficient resolution". That's what we did for "kernel time", ie "ktime_t" is a 64-bit nanosecond count, and by being just a single value, it avoids not just the horrible padding with 'struct timespec64', it is also dense _and_ can be accessed as one atomic value.

Just pointing out that xfs has already changed its on-disk timestamp format to this model (a.k.a. bigtime), but the in-core inode still uses timespec64 of course. The vfs can take inspiration from this model.

> Sadly, that "sufficient resolution" couldn't be nanoseconds, because 64-bit nanoseconds isn't enough of a spread. It's fine for the kernel time, because 2**63 nanoseconds is 292 years, so it moved the "year 2038" problem to "year 2262".

Note that xfs_bigtime_to_unix(XFS_BIGTIME_TIME_MAX) is in year 2486, not year 2262, because there was no need to use the 64 bits to go backwards to year 1678.

> And that's ok when we're talking about times that are kernel running times and we have a couple of centuries to say "ok, we'll need to make it be a bigger type", but when you save the values to disk, things are different. I suspect filesystem people are *not* willing to deal with a "year 2262" issue.
>

Apparently, they are willing to handle the "year 2486" issue ;)

> But if we were to say that "a tenth of a microsecond resolution is sufficient for inode timestamps", then suddenly 64 bits is *enormous*. So we could do a
>
>         // tenths of a microsecond since Jan 1, 1970
>         typedef s64 fstime_t;
>
> and have a nice dense timestamp format with reasonable - but not nanosecond - accuracy. Now that 292 year range has become 29,247 years, and filesystem people *might* find the "year-31k" problem acceptable.
>
> I happen to think that "100ns timestamp resolution on files is sufficient" is a very reasonable statement, but I suspect that we'll still find lots of people who say "that's completely unacceptable" both to that resolution, and to the 31k-year problem.
>

I am guessing that you are aware of the Windows/SMB FILETIME standard, which is 64-bit in 100ns units (since 1601). So the 31k-year "problem" is very widespread already.

But the resolution change is counter to the purpose of multigrain timestamps - if two syscalls updated the same or two different inodes within a 100ns tick, apparently there are some workloads that care to know about it, and the fs needs to store this information persistently.

Thanks,
Amir.
On Fri, 22 Sept 2023 at 23:36, Amir Goldstein <amir73il@gmail.com> wrote:
>
> Apparently, they are willing to handle the "year 2486" issue ;)

Well, we could certainly do the same at the VFS layer.

But I suspect 10ns resolution is entirely overkill, since on a lot of platforms you don't even have timers with that resolution.

I feel like 100ns is a much more reasonable resolution, and is quite close to a single system call (think "one thousand cycles at 10GHz").

> But the resolution change is counter to the purpose of multigrain
> timestamps - if two syscalls updated the same or two different inodes
> within a 100ns tick, apparently, there are some workloads that
> care to know about it and fs needs to store this information persistently.

Those workloads are broken garbage, and we should *not* use that kind of sh*t to decide on VFS internals.

Honestly, if the main reason for the multigrain resolution is something like that, I think we should forget about MG *entirely*. Somebody needs to be told to get their act together.

We have *never* guaranteed nanosecond resolution on timestamps, and I think we should put our foot down and say that we never will.

Partly because we have platforms where that kind of timer resolution just does not exist.

Partly because it's stupid to expect that kind of resolution anyway.

And partly because any load that assumes that kind of resolution is already broken.

End result: we should ABSOLUTELY NOT have as a target to support some insane resolution. 100ns resolution for file access times is - and I'll happily go down in history for saying this - enough for anybody.

If you need finer resolution than that, you'd better do it yourself in user space.

And no, this is not a "but some day we'll have terahertz CPUs and 100ns is an eternity". Moore's law is dead, we're not going to see terahertz CPUs, and people who say "but quantum" have bought into a technological fairytale.

100ns is plenty, and has the advantage of having a very safe range.

That said, we don't have to do powers-of-ten. In fact, in many ways, it would probably be a good idea to think of the fractional seconds in powers of two. That tends to make it cheaper to do conversions, without having to do a full 64-bit divide (a constant divide turns into a fancy multiply, but it's still painful on 32-bit architectures).

So, for example, we could easily make the format be a fixed-point format with "sign bit, 38-bit seconds, 25-bit fractional seconds", which gives us about 30ns resolution and a range of almost 9000 years. Which is nice, in how it covers all of written history and all four-digit years (we'd keep the 1970 base).

And 30ns resolution really *is* pretty much the limit of a single system call. I could *wish* we had system calls that fast, or CPUs that fast. Not the case right now, and sadly doesn't seem to be the case in the foreseeable future - if ever - either. It would be a really good problem to have.
And the nice thing about that would be that conversion to timespec64 would be fairly straightforward:

struct timespec64 to_timespec(fstime_t fstime)
{
	struct timespec64 res;
	unsigned int frac;

	frac = fstime & 0x1ffffffu;
	res.tv_sec = fstime >> 25;
	res.tv_nsec = frac * 1000000000ull >> 25;
	return res;
}

fstime_t to_fstime(struct timespec64 a)
{
	fstime_t sec = (fstime_t) a.tv_sec << 25;
	unsigned frac = a.tv_nsec;

	frac = ((unsigned long long) a.tv_nsec << 25) / 1000000000ull;
	return sec | frac;
}

and both of those generate good code (that large divide by a constant in to_fstime() is not great, but the compiler can turn it into a multiply).

The above could be improved upon (nicer rounding and overflow handling, and a few modifications to generate even nicer code), but it's not horrendous as-is. On x86-64, to_timespec becomes a very reasonable

	movq	%rdi, %rax
	andl	$33554431, %edi
	imulq	$1000000000, %rdi, %rdx
	sarq	$25, %rax
	shrq	$25, %rdx

and to some degree that's the critical function (that code would show up in 'stat()').

Of course, I might have screwed up the above conversion functions, they are untested garbage, but they look close enough to being in the right ballpark.

Anyway, we really need to push back at any crazies who say "I want nanosecond resolution, because I'm special and my mother said so".

              Linus
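For what it's worth, the conversion sketch above can be exercised in userspace with minimal stand-ins for the kernel types; the fstime_t and timespec64 definitions below are local assumptions for the test only:

#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Local stand-ins for the kernel types used in the sketch above. */
typedef int64_t fstime_t;
struct timespec64 { int64_t tv_sec; long tv_nsec; };

static struct timespec64 to_timespec(fstime_t fstime)
{
	struct timespec64 res;
	unsigned int frac = fstime & 0x1ffffffu;

	res.tv_sec = fstime >> 25;
	res.tv_nsec = (unsigned long long)frac * 1000000000ull >> 25;
	return res;
}

static fstime_t to_fstime(struct timespec64 a)
{
	fstime_t sec = (fstime_t)a.tv_sec << 25;
	unsigned int frac = ((unsigned long long)a.tv_nsec << 25) / 1000000000ull;

	return sec | frac;
}

int main(void)
{
	struct timespec64 in = { .tv_sec = 1695459200, .tv_nsec = 123456789 };
	struct timespec64 out = to_timespec(to_fstime(in));

	/* 2^-25 s resolution: the round trip may lose up to ~30ns. */
	assert(out.tv_sec == in.tv_sec);
	assert(labs(in.tv_nsec - out.tv_nsec) <= 30);
	return 0;
}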
On Sat, Sep 23, 2023 at 10:48:51AM -0700, Linus Torvalds wrote:
>
> I feel like 100ns is a much more reasonable resolution, and is quite
> close to a single system call (think "one thousand cycles at 10GHz").

FWIW, UUIDs (which originally came from Apollo Domain/OS in the 1980's, before getting adopted by OSF/DCE, and then by Linux and Microsoft) use a 100ns granularity. And the smart folks at Apollo figured this out some 4 decades ago, and *no*, they didn't use units of a single nanosecond. :-)

100ns granularity is also what ext4 uses for our on-disk format --- 2**30 is just enough to cover 100ns granularity (with only 7% of wasted number space), and those two bits are enough for us to encode timestamps into 2446 using a 64-bit timestamp (and what we do past 2446 is pretty much something I'm happy to let someone else deal with, as I expect I'll be long dead by then). (And if someone does happen to invent some kind of life-extension technology, I'm happy to fix it up... later. :-)

> That said, we don't have to do powers-of-ten. In fact, in many ways,
> it would probably be a good idea to think of the fractional seconds in
> powers of two. That tends to make it cheaper to do conversions,
> without having to do a full 64-bit divide (a constant divide turns
> into a fancy multiply, but it's still painful on 32-bit
> architectures).

It depends on what conversion we need to do. If we're converting to userspace's timespec64 data structure, which is denominated in nanoseconds, it's actually much easier to use decimal 100ns units:

#define EXT4_EPOCH_BITS 2
#define EXT4_EPOCH_MASK ((1 << EXT4_EPOCH_BITS) - 1)
#define EXT4_NSEC_MASK  (~0UL << EXT4_EPOCH_BITS)

static inline __le32 ext4_encode_extra_time(struct timespec64 *time)
{
	u32 extra = ((time->tv_sec - (s32)time->tv_sec) >> 32) & EXT4_EPOCH_MASK;

	return cpu_to_le32(extra | (time->tv_nsec << EXT4_EPOCH_BITS));
}

static inline void ext4_decode_extra_time(struct timespec64 *time,
					  __le32 extra)
{
	if (unlikely(extra & cpu_to_le32(EXT4_EPOCH_MASK)))
		time->tv_sec += (u64)(le32_to_cpu(extra) & EXT4_EPOCH_MASK) << 32;
	time->tv_nsec = (le32_to_cpu(extra) & EXT4_NSEC_MASK) >> EXT4_EPOCH_BITS;
}

> Of course, I might have screwed up the above conversion functions,
> they are untested garbage, but they look close enough to being in the
> right ballpark.

We actually have kunit tests for ext4_encode_extra_time() and ext4_decode_extra_time(), mainly because people *have* screwed it up when making architecture-specific optimizations or when making global sweeps of VFS code. :-)

					- Ted
On Sat, 23 Sept 2023 at 12:30, Theodore Ts'o <tytso@mit.edu> wrote:
>
> It depends on what conversion we need to do. If we're converting to
> userspace's timespec64 data structure, which is denominated in
> nanoseconds, it's actually much easier to use decimal 100ns units:

Actually, your data format seems to be a mix of "decimal nanoseconds" and then a power-of-two seconds (ie bit shift).

Except it looks like ext4 actually does full nanosecond resolution (30 bits for nanoseconds, 34 bits for seconds). Thus the "only a couple of hundred years of range".

And yes, that's probably close to optimal. It makes it harder to do *math* on those dates (because you have seconds and 100ns as separate fields), but for file timestamps that's likely not a real issue. It was for "ktime_t", where with timers etc the whole "subtract and add times" thing happens *all* the time, but for file timestamps you basically have "set time" together with possibly comparing them (and you can do comparisons without even splitting the fields if you lay things out reasonably - which ext4 doesn't seem to have done).

So yeah, I think that would be a fine 'fstime_t' for the kernel. Except we'd do it without the EXT4_EPOCH_MASK conditionals, and I think it would be better to have a bigger range for seconds. If you give the seconds field three extra bits, you're comfortable in the "thousands of years", and you still have 27 bits that can encode a decimal "100ns".

It means that when you convert a fstime_t to timespec64 at stat() time, you'd have to do a 32-bit "multiply by 100" to get the actual nanosecond field, but that's cheap everywhere (and obviously the shift-and-masks to get the separate fields out of the 64-bit value).

              Linus
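A minimal sketch of the packed layout described here (37 bits of seconds plus 27 bits of fraction counted in decimal 100ns units); the names, the exact bit split, and the reuse of the kernel's timespec64 are assumptions for illustration, not an agreed-on format:

/* 27 fraction bits comfortably hold 10,000,000 units of 100ns per second. */
#define FSTIME_FRAC_BITS	27
#define FSTIME_FRAC_MASK	((1 << FSTIME_FRAC_BITS) - 1)

static inline struct timespec64 fstime_to_timespec64(s64 t)
{
	return (struct timespec64) {
		.tv_sec  = t >> FSTIME_FRAC_BITS,
		.tv_nsec = (t & FSTIME_FRAC_MASK) * 100,	/* 100ns units -> ns */
	};
}

static inline s64 timespec64_to_fstime(struct timespec64 ts)
{
	return ((s64)ts.tv_sec << FSTIME_FRAC_BITS) |
	       (s64)(ts.tv_nsec / 100);				/* truncate to 100ns */
}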
On Sat, Sep 23, 2023 at 8:49 PM Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
> On Fri, 22 Sept 2023 at 23:36, Amir Goldstein <amir73il@gmail.com> wrote:
> >
> > Apparently, they are willing to handle the "year 2486" issue ;)
>
> Well, we could certainly do the same at the VFS layer.
>
> But I suspect 10ns resolution is entirely overkill, since on a lot of platforms you don't even have timers with that resolution.
>
> I feel like 100ns is a much more reasonable resolution, and is quite close to a single system call (think "one thousand cycles at 10GHz").
>

You are right. 100ns is enough resolution for fs timestamps, but:

1. We cannot truncate existing timestamps on existing files that are stored in 1ns format. Even if the 100ns remainder is garbage, applications depend on tv_nsec not changing, so if the vfs truncates to 100ns, filesystems will need to carry the remainder anyway.

2. The whole idea behind multigrain time is that the 100ns remainder space is available in on-disk timestamp formats and is not going away, so better not to let it go to waste on garbage and instead use it for the "change cookie".

> > But the resolution change is counter to the purpose of multigrain
> > timestamps - if two syscalls updated the same or two different inodes
> > within a 100ns tick, apparently, there are some workloads that
> > care to know about it and fs needs to store this information persistently.
>
> Those workloads are broken garbage, and we should *not* use that kind of sh*t to decide on VFS internals.
>

Sorry, I phrased it completely wrong. The workloads don't expect 1ns resolution. The workloads just compare timestamps of objects and expect some sane not-before ordering rules. If user visible timestamps are truncated to 100ns, all is good.

> End result: we should ABSOLUTELY NOT have as a target to support some
> insane resolution.

This is not the target. I think the target is explained well in the new v8 revision [1] of the patch set, which does exactly what you promote - it hides the insane resolution in fs timestamps from users.

Thanks,
Amir.

[1] https://lore.kernel.org/linux-fsdevel/CAOQ4uxjGofKT2LcM_YmoMYsH42ETpB5kxQkh+21nCbYc=w1nEg@mail.gmail.com/T/#mbc87e67faf71cc58c6424335333935bf7740d49e
On Sat, Sep 23, 2023 at 01:03:56PM -0700, Linus Torvalds wrote:
>
> Except it looks like ext4 actually does full nanosecond resolution (30
> bits for nanoseconds, 34 bits for seconds). Thus the "only a couple of
> hundred years of range".

Hmm, yeah, sorry, I misremembered. We did talk about possibly going with 100ns, but if I recall correctly, I think there was a desire that an arbitrary timespec64 should be encodable into an on-disk timestamp and back again, and hundreds of years of range was considered Good Enough (tm).

> Except we'd do it without the EXT4_EPOCH_MASK conditionals, and I
> think it would be better to have a bigger range for seconds. If you
> give the seconds field three extra bits, you're comfortable in the
> "thousands of years", and you still have 27 bits that can encode a
> decimal "100ns".

I might be screwing up my math, but I believe 24 bits should be enough to encode 10,000,000 units of 100ns (it's enough for 16,777,216), which should be sufficient. What am I missing?

As far as how many seconds are needed, that's an area where people of good will can disagree. Given that I don't really believe a machine is going to be up for decades before we need to reboot and update the kernel to address zero days, and LTS kernels are going to be supported for two years going forward, if what we're talking about is the in-memory time type, my personal opinion is that hundreds of years is plenty, since it's not hard to change the encoding later.

Cheers,

					- Ted
On Sat, 23 Sept 2023 at 15:08, Theodore Ts'o <tytso@mit.edu> wrote:
>
> I might be screwing up my math, but I believe 24 bits should be enough
> to encode 10,000,000 units of 100ns (it's enough for 16,777,216), which
> should be sufficient. What am I missing?

You're missing me just being stupid, having a brain-fart, and doing the math for 10ns despite telling everybody that 100ns should be the reasonable thing.

Duh.

              Linus
On Thu, Sep 21, 2023 at 10:02 PM Jeff Layton <jlayton@kernel.org> wrote:
>
> On Thu, 2023-09-21 at 11:24 -0700, Linus Torvalds wrote:
> > On Thu, 21 Sept 2023 at 04:21, Christian Brauner <brauner@kernel.org> wrote:
> > >
> > >   git@gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs tags/v6.6-rc3.vfs.ctime.revert
> >
> > So for some reason pr-tracker-bot doesn't seem to have reacted to this pull request, but it's in my tree now.
> >
> > I *do* have one reaction to all of this: now that you have made "i_ctime" be something that cannot be accessed directly (and renamed it to "__i_ctime"), would you mind horribly going all the way, and do the same for i_atime and i_mtime too?
> >
> > The reason I ask is that I *really* despise "struct timespec64" as a type.
> >
> > I despise it inherently, but I despise it even more when you really use it as another type entirely, and are hiding bits in there.
> >
> > I despise it because "tv_sec" obviously needs to be 64-bit, but then "tv_nsec" is this horrible abomination. It's defined as "long", which is all kinds of crazy. It's bogus and historical.
> >
> > And it's wasteful.
> >
> > Now, in the case of i_ctime, you took advantage of that waste by using one (of the potentially 2..34!) unused bits for that "fine-granularity" flag.
> >
> > But even when you do that, there's up to 33 other bits just lying around, wasting space in a very central data structure.
> >
> > So it would actually be much better to explode the 'struct timespec64' into explicit 64-bit seconds field, and an explicit 32-bit field with two bits reserved. And not even next to each other, because they don't pack well in general.
> >
> > So instead of
> >
> >         struct timespec64       i_atime;
> >         struct timespec64       i_mtime;
> >         struct timespec64       __i_ctime;
> >
> > where that last one needs accessors to access, just make them *all* have helper accessors, and make it be
> >
> >         u64     i_atime_sec;
> >         u64     i_mtime_sec;
> >         u64     i_ctime_sec;
> >         u32     i_atime_nsec;
> >         u32     i_mtime_nsec;
> >         u32     i_ctime_nsec;
> >
> > and now that 'struct inode' should shrink by 12 bytes.
> >
>
> I like it.
>
> > Then do this:
> >
> >    #define inode_time(x) \
> >         (struct timespec64) { x##_sec, x##_nsec }
> >
> > and you can now create a timespec64 by just doing
> >
> >         inode_time(inode->i_atime)
> >
> > or something like that (to help create those accessor functions).
> >
> > Now, I agree that 12 bytes in the disaster that is 'struct inode' is likely a drop in the ocean. We have tons of big things in there (ie several list_heads, a whole 'struct address_space' etc etc), so it's only twelve bytes out of hundreds.
> >
> > But dammit, that 'timespec64' really is ugly, and now that you're hiding bits in it and it's no longer *really* a 'timespec64', I feel like it would be better to do it right, and not mis-use a type that has other semantics, and has other problems.
> >
>
> We have many, many inodes though, and 12 bytes per adds up!
>
> I'm on board with the idea, but...that's likely to be as big a patch series as the ctime overhaul was. In fact, it'll touch a lot of the same code. I can take a stab at that in the near future though.
>
> Since we're on the subject...another thing that bothers me with all of the timestamp handling is that we don't currently try to mitigate "torn reads" across the two different words. It seems like you could fetch a tv_sec value and then get a tv_nsec value that represents an entirely different timestamp if there are stores between them.
>
> Should we be concerned about this? I suppose we could do something with a seqlock, though I'd really want to avoid locking on the write side.

As luck would have it, if my calculations are correct, on x86-64 and with CONFIG_FS_POSIX_ACL=y, CONFIG_SECURITY=y (as they are on distro kernels), __i_ctime is exactly on split cache lines, and maybe even worse (?), __i_ctime.tv_nsec and the QUERIED bit are on the same cache line as i_lock :-/

If we reorder the inode timestamps with i_size, we improve the situation for this specific and very common configuration. Maybe.

Thanks,
Amir.
> As luck would have it, if my calculations are correct, on x86-64 and with

I hope you're aware that you can use pahole to get struct layouts and that you don't have to manually calculate this...

> CONFIG_FS_POSIX_ACL=y, CONFIG_SECURITY=y (as they are on
> distro kernels), __i_ctime is exactly on split cache lines and maybe even

(Make sure that your kernel doesn't use randomize_layout...)

tv_nsec has always been on a split cacheline. We've optimized struct file a little bit in the same way before, as there were actual regressions in some workloads. I can put it into one of the vfs.git perf branches that are tested by LKP to see if there are any actual performance changes.

5.15:

        /* --- cacheline 1 boundary (64 bytes) --- */
        long unsigned int          i_ino;                /*    64     8 */
        union {
                const unsigned int i_nlink;              /*    72     4 */
                unsigned int       __i_nlink;            /*    72     4 */
        };                                               /*    72     4 */
        /* typedef dev_t -> __kernel_dev_t -> u32 -> __u32 */
        unsigned int               i_rdev;               /*    76     4 */
        /* typedef loff_t -> __kernel_loff_t */
        long long int              i_size;               /*    80     8 */
        struct timespec64 {
                /* typedef time64_t -> __s64 */
                long long int      tv_sec;               /*    88     8 */
                long int           tv_nsec;              /*    96     8 */
        } i_atime;                                       /*    88    16 */
        struct timespec64 {
                /* typedef time64_t -> __s64 */
                long long int      tv_sec;               /*   104     8 */
                long int           tv_nsec;              /*   112     8 */
        } i_mtime;                                       /*   104    16 */
        struct timespec64 {
                /* typedef time64_t -> __s64 */
                long long int      tv_sec;               /*   120     8 */
                /* --- cacheline 2 boundary (128 bytes) --- */
                long int           tv_nsec;              /*   128     8 */
        } i_ctime;                                       /*   120    16 */
        /* typedef spinlock_t */
        struct spinlock {

6.6:

        /* --- cacheline 1 boundary (64 bytes) --- */
        long unsigned int          i_ino;                /*    64     8 */
        union {
                const unsigned int i_nlink;              /*    72     4 */
                unsigned int       __i_nlink;            /*    72     4 */
        };                                               /*    72     4 */
        /* typedef dev_t -> __kernel_dev_t -> u32 -> __u32 */
        unsigned int               i_rdev;               /*    76     4 */
        /* typedef loff_t -> __kernel_loff_t */
        long long int              i_size;               /*    80     8 */
        struct timespec64 {
                /* typedef time64_t -> __s64 */
                long long int      tv_sec;               /*    88     8 */
                long int           tv_nsec;              /*    96     8 */
        } i_atime;                                       /*    88    16 */
        struct timespec64 {
                /* typedef time64_t -> __s64 */
                long long int      tv_sec;               /*   104     8 */
                long int           tv_nsec;              /*   112     8 */
        } i_mtime;                                       /*   104    16 */
        struct timespec64 {
                /* typedef time64_t -> __s64 */
                long long int      tv_sec;               /*   120     8 */
                /* --- cacheline 2 boundary (128 bytes) --- */
                long int           tv_nsec;              /*   128     8 */
        } __i_ctime;                                     /*   120    16 */
> > Those workloads are broken garbage, and we should *not* use that kind
> > of sh*t to decide on VFS internals.
>
> Sorry, I phrased it completely wrong.

Thanks for clearing this up. I had just formulated my own reply but I'll happily delete it. :)

> The workloads don't expect 1ns resolution.

Yes, they don't. In the revert explanation I just used that number to illustrate the general ordering problem. The workload that surfaced the issue is just plain weird of course, but it points to a more general ordering problem.

> The workloads just compare timestamps of objects and expect some sane
> not-before ordering rules.

Yes.

> If user visible timestamps are truncated to 100ns all is good.

Yes.
On Sat, 2023-09-23 at 10:48 -0700, Linus Torvalds wrote:
> On Fri, 22 Sept 2023 at 23:36, Amir Goldstein <amir73il@gmail.com> wrote:
> >
> > Apparently, they are willing to handle the "year 2486" issue ;)
>
> Well, we could certainly do the same at the VFS layer.
>
> But I suspect 10ns resolution is entirely overkill, since on a lot of platforms you don't even have timers with that resolution.
>
> I feel like 100ns is a much more reasonable resolution, and is quite close to a single system call (think "one thousand cycles at 10GHz").
>
> > But the resolution change is counter to the purpose of multigrain
> > timestamps - if two syscalls updated the same or two different inodes
> > within a 100ns tick, apparently, there are some workloads that
> > care to know about it and fs needs to store this information persistently.
>
> Those workloads are broken garbage, and we should *not* use that kind of sh*t to decide on VFS internals.
>
> Honestly, if the main reason for the multigrain resolution is something like that, I think we should forget about MG *entirely*. Somebody needs to be told to get their act together.
>

As I noted in the other thread, the primary reason for this was to fix XFS's change cookie without having to rev the on-disk format.

If we could also present fine-grained timestamps to userland and nfsd, then that would also fix a lot of cache-coherency problems with NFSv3, and may also help some workloads which depend on comparing timestamps between files. That'd be a wonderful bonus, but I'm not going to lose too much sleep if we can't make that work.

> We have *never* guaranteed nanosecond resolution on timestamps, and I think we should put our foot down and say that we never will.
>
> Partly because we have platforms where that kind of timer resolution just does not exist.
>
> Partly because it's stupid to expect that kind of resolution anyway.
>
> And partly because any load that assumes that kind of resolution is already broken.
>
> End result: we should ABSOLUTELY NOT have as a target to support some insane resolution.
>
> 100ns resolution for file access times is - and I'll happily go down in history for saying this - enough for anybody.
>
> If you need finer resolution than that, you'd better do it yourself in user space.
>
> And no, this is not a "but some day we'll have terahertz CPUs and 100ns is an eternity". Moore's law is dead, we're not going to see terahertz CPUs, and people who say "but quantum" have bought into a technological fairytale.
>
> 100ns is plenty, and has the advantage of having a very safe range.
>

The catch here is that we have at least some testcases that do things like set specific values in the mtime and atime, and then test that the same value is retrievable. Are we OK with breaking those? If we can always say that the stored resolution is X and that even values that are explicitly set get truncated, then the v8 set I sent on Friday may be ok.

Of course, that set truncates the values at jiffies granularity (~4ms on my box). That's well above 100ns, so it's possible that's too coarse for us to handwave this problem away.

> That said, we don't have to do powers-of-ten. In fact, in many ways, it would probably be a good idea to think of the fractional seconds in powers of two. That tends to make it cheaper to do conversions, without having to do a full 64-bit divide (a constant divide turns into a fancy multiply, but it's still painful on 32-bit architectures).
>
> So, for example, we could easily make the format be a fixed-point format with "sign bit, 38-bit seconds, 25-bit fractional seconds", which gives us about 30ns resolution, and a range of almost 9000 years. Which is nice, in how it covers all of written history and all four-digit years (we'd keep the 1970 base).
>
> And 30ns resolution really *is* pretty much the limit of a single system call. I could *wish* we had system calls that fast, or CPUs that fast. Not the case right now, and sadly doesn't seem to be the case in the foreseeable future - if ever - either. It would be a really good problem to have.
>
> And the nice thing about that would be that conversion to timespec64 would be fairly straightforward:
>
>    struct timespec64 to_timespec(fstime_t fstime)
>    {
>         struct timespec64 res;
>         unsigned int frac;
>
>         frac = fstime & 0x1ffffffu;
>         res.tv_sec = fstime >> 25;
>         res.tv_nsec = frac * 1000000000ull >> 25;
>         return res;
>    }
>
>    fstime_t to_fstime(struct timespec64 a)
>    {
>         fstime_t sec = (fstime_t) a.tv_sec << 25;
>         unsigned frac = a.tv_nsec;
>
>         frac = ((unsigned long long) a.tv_nsec << 25) / 1000000000ull;
>         return sec | frac;
>    }
>
> and both of those generate good code (that large divide by a constant in to_fstime() is not great, but the compiler can turn it into a multiply).
>
> The above could be improved upon (nicer rounding and overflow handling, and a few modifications to generate even nicer code), but it's not horrendous as-is. On x86-64, to_timespec becomes a very reasonable
>
>         movq    %rdi, %rax
>         andl    $33554431, %edi
>         imulq   $1000000000, %rdi, %rdx
>         sarq    $25, %rax
>         shrq    $25, %rdx
>
> and to some degree that's the critical function (that code would show up in 'stat()').
>
> Of course, I might have screwed up the above conversion functions, they are untested garbage, but they look close enough to being in the right ballpark.
>
> Anyway, we really need to push back at any crazies who say "I want nanosecond resolution, because I'm special and my mother said so".
>

Yeah, if we're going to establish a floor granularity for timestamps above 1ns, then making it a power-of-two factor would probably be a good thing. These calculations are done a _lot_, so we really do want them to be efficient.
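A small sketch of why a power-of-two fraction keeps the hot paths cheap, assuming the packed "seconds << 25 | 25-bit fraction" layout quoted above; the helper names and granularity choice are illustrative only:

/* Floor a packed timestamp to a 2^-15 s (~30us) granularity: one mask, no divide. */
static inline s64 fstime_floor(s64 t)
{
	return t & ~((s64)(1 << 10) - 1);
}

/* Compare two packed timestamps while ignoring the low fraction bits. */
static inline bool fstime_same_coarse(s64 a, s64 b)
{
	return (a >> 10) == (b >> 10);
}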
On Mon, 25 Sept 2023 at 04:23, Jeff Layton <jlayton@kernel.org> wrote:
>
> The catch here is that we have at least some testcases that do things
> like set specific values in the mtime and atime, and then test that the
> same value is retrievable.

Yeah, I'm sure that happens. But as you say, we already have per-filesystem granularity issues that mean that there is some non-ns granularity that those tests have to deal with. Unless they currently just work on one or two filesystems, of course.

> Of course, that set truncates the values at jiffies granularity (~4ms on
> my box). That's well above 100ns, so it's possible that's too coarse for
> us to handwave this problem away.

Note that depending on or enforcing some kind of jiffies granularity is *particularly* bad, because it's basically a random value. It will depend on architecture and on configuration.

On some architectures it's a fixed value (some have it at 100, which is, I think, the original x86 case), on others it's "configurable", but not really (ie alpha is "configurable" in our Kconfig, but the _hardware_ typically has a fixed clock tick at 1024 Hz, but then there are platforms that are different, and then there's Qemu that likes a lower frequency to avoid overhead etc etc).

And then we have the "modern default", which is to ask the user at config time if they want a 100 / 250 / 300 / 1000 HZ value, and the actual hw clock tick may be much more dynamic than that.

Anyway, what I'm saying is just that we should *not* limit granularity to anything that has to do with jiffies. Yes, that's still a real thing in that it's a "cheap read of the current time", but it should never be seen as any kind of vfs granularity.

And yes, HZ will be in the "roughly 100-1000" range, so if we're talking granularities that are microseconds or finer, then at most you'll have rounding issues - and since any HZ rounding issues should only be for the cases where we set the time to "now", rounding shouldn't be an issue anyway, since it's not a time that is specified by user space.

End result: try to avoid anything to do with HZ in filesystem code, unless it's literally about jiffies (which should typically be mostly about any timeouts a filesystem may have, not about timestamps themselves).

              Linus