Message ID | 20230928110554.34758-2-jlayton@kernel.org (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | fs: new accessor methods for atime and mtime | expand |
On Thu, Sep 28, 2023, at 07:05, Jeff Layton wrote: > This shaves 8 bytes off struct inode, according to pahole. > > Signed-off-by: Jeff Layton <jlayton@kernel.org> FWIW, this is similar to the approach that Deepa suggested back in 2016: https://lore.kernel.org/lkml/1452144972-15802-3-git-send-email-deepa.kernel@gmail.com/ It was NaKed at the time because of the added complexity, though it would have been much easier to do it then, as we had to touch all the timespec references anyway. The approach still seems ok to me, but I'm not sure it's worth doing it now if we didn't do it then. Arnd
On Thu, 2023-09-28 at 11:48 -0400, Arnd Bergmann wrote: > On Thu, Sep 28, 2023, at 07:05, Jeff Layton wrote: > > This shaves 8 bytes off struct inode, according to pahole. > > > > Signed-off-by: Jeff Layton <jlayton@kernel.org> > > FWIW, this is similar to the approach that Deepa suggested > back in 2016: > > https://lore.kernel.org/lkml/1452144972-15802-3-git-send-email-deepa.kernel@gmail.com/ > > It was NaKed at the time because of the added complexity, > though it would have been much easier to do it then, > as we had to touch all the timespec references anyway. > > The approach still seems ok to me, but I'm not sure it's worth > doing it now if we didn't do it then. > I remember seeing those patches go by. I don't remember that change being NaK'ed, but I wasn't paying close attention at the time Looking at it objectively now, I think it's worth it to recover 8 bytes per inode and open a 4 byte hole that Amir can use to grow the i_fsnotify_mask. We might even able to shave off another 12 bytes eventually if we can move to a single 64-bit word per timestamp. It is a lot of churn though.
On Thu, 2023-09-28 at 07:05 -0400, Jeff Layton wrote: > This shaves 8 bytes off struct inode, according to pahole. > > Signed-off-by: Jeff Layton <jlayton@kernel.org> > --- > include/linux/fs.h | 32 +++++++++++++++++++++++--------- > 1 file changed, 23 insertions(+), 9 deletions(-) > > diff --git a/include/linux/fs.h b/include/linux/fs.h > index 831657011036..de902ff2938b 100644 > --- a/include/linux/fs.h > +++ b/include/linux/fs.h > @@ -671,9 +671,12 @@ struct inode { > }; > dev_t i_rdev; > loff_t i_size; > - struct timespec64 __i_atime; /* use inode_*_atime accessors */ > - struct timespec64 __i_mtime; /* use inode_*_mtime accessors */ > - struct timespec64 __i_ctime; /* use inode_*_ctime accessors */ > + time64_t i_atime_sec; > + time64_t i_mtime_sec; > + time64_t i_ctime_sec; > + u32 i_atime_nsec; > + u32 i_mtime_nsec; > + u32 i_ctime_nsec; > spinlock_t i_lock; /* i_blocks, i_bytes, maybe i_size */ > unsigned short i_bytes; > u8 i_blkbits; > @@ -1519,7 +1522,9 @@ struct timespec64 inode_set_ctime_current(struct inode *inode); > */ > static inline struct timespec64 inode_get_ctime(const struct inode *inode) > { > - return inode->__i_ctime; > + struct timespec64 ts = { .tv_sec = inode->i_ctime_sec, > + .tv_nsec = inode->i_ctime_nsec }; > + return ts; > } > > > > /** > @@ -1532,7 +1537,8 @@ static inline struct timespec64 inode_get_ctime(const struct inode *inode) > static inline struct timespec64 inode_set_ctime_to_ts(struct inode *inode, > struct timespec64 ts) > { > - inode->__i_ctime = ts; > + inode->i_ctime_sec = ts.tv_sec; > + inode->i_ctime_nsec = ts.tv_sec; Bug above and in the other inode_set_?time_to_ts() functions. This isn't setting the nsec field correctly. > return ts; > } > > > > @@ -1555,13 +1561,17 @@ static inline struct timespec64 inode_set_ctime(struct inode *inode, > > static inline struct timespec64 inode_get_atime(const struct inode *inode) > { > - return inode->__i_atime; > + struct timespec64 ts = { .tv_sec = inode->i_atime_sec, > + .tv_nsec = inode->i_atime_nsec }; > + > + return ts; > } > > static inline struct timespec64 inode_set_atime_to_ts(struct inode *inode, > struct timespec64 ts) > { > - inode->__i_atime = ts; > + inode->i_atime_sec = ts.tv_sec; > + inode->i_atime_nsec = ts.tv_sec; > return ts; > } > > @@ -1575,13 +1585,17 @@ static inline struct timespec64 inode_set_atime(struct inode *inode, > > static inline struct timespec64 inode_get_mtime(const struct inode *inode) > { > - return inode->__i_mtime; > + struct timespec64 ts = { .tv_sec = inode->i_mtime_sec, > + .tv_nsec = inode->i_mtime_nsec }; > + > + return ts; > } > > static inline struct timespec64 inode_set_mtime_to_ts(struct inode *inode, > struct timespec64 ts) > { > - inode->__i_mtime = ts; > + inode->i_atime_sec = ts.tv_sec; > + inode->i_atime_nsec = ts.tv_sec; Doh! s/atime/mtime/ in the above lines. > return ts; > } > Both bugs are fixed in my tree.
On Thu, Sep 28, 2023 at 01:06:03PM -0400, Jeff Layton wrote: > On Thu, 2023-09-28 at 11:48 -0400, Arnd Bergmann wrote: > > On Thu, Sep 28, 2023, at 07:05, Jeff Layton wrote: > > > This shaves 8 bytes off struct inode, according to pahole. > > > > > > Signed-off-by: Jeff Layton <jlayton@kernel.org> > > > > FWIW, this is similar to the approach that Deepa suggested > > back in 2016: > > > > https://lore.kernel.org/lkml/1452144972-15802-3-git-send-email-deepa.kernel@gmail.com/ > > > > It was NaKed at the time because of the added complexity, > > though it would have been much easier to do it then, > > as we had to touch all the timespec references anyway. > > > > The approach still seems ok to me, but I'm not sure it's worth > > doing it now if we didn't do it then. > > > > I remember seeing those patches go by. I don't remember that change > being NaK'ed, but I wasn't paying close attention at the time > > Looking at it objectively now, I think it's worth it to recover 8 bytes > per inode and open a 4 byte hole that Amir can use to grow the > i_fsnotify_mask. We might even able to shave off another 12 bytes > eventually if we can move to a single 64-bit word per timestamp. I don't think you can, since btrfs timestamps utilize s64 seconds counting in both directions from the Unix epoch. They also support ns resolution: struct btrfs_timespec { __le64 sec; __le32 nsec; } __attribute__ ((__packed__)); --D > It is a lot of churn though. > -- > Jeff Layton <jlayton@kernel.org>
On Thu, 2023-09-28 at 10:19 -0700, Darrick J. Wong wrote: > On Thu, Sep 28, 2023 at 01:06:03PM -0400, Jeff Layton wrote: > > On Thu, 2023-09-28 at 11:48 -0400, Arnd Bergmann wrote: > > > On Thu, Sep 28, 2023, at 07:05, Jeff Layton wrote: > > > > This shaves 8 bytes off struct inode, according to pahole. > > > > > > > > Signed-off-by: Jeff Layton <jlayton@kernel.org> > > > > > > FWIW, this is similar to the approach that Deepa suggested > > > back in 2016: > > > > > > https://lore.kernel.org/lkml/1452144972-15802-3-git-send-email-deepa.kernel@gmail.com/ > > > > > > It was NaKed at the time because of the added complexity, > > > though it would have been much easier to do it then, > > > as we had to touch all the timespec references anyway. > > > > > > The approach still seems ok to me, but I'm not sure it's worth > > > doing it now if we didn't do it then. > > > > > > > I remember seeing those patches go by. I don't remember that change > > being NaK'ed, but I wasn't paying close attention at the time > > > > Looking at it objectively now, I think it's worth it to recover 8 bytes > > per inode and open a 4 byte hole that Amir can use to grow the > > i_fsnotify_mask. We might even able to shave off another 12 bytes > > eventually if we can move to a single 64-bit word per timestamp. > > I don't think you can, since btrfs timestamps utilize s64 seconds > counting in both directions from the Unix epoch. They also support ns > resolution: > > struct btrfs_timespec { > __le64 sec; > __le32 nsec; > } __attribute__ ((__packed__)); > Correct. We'd lose some fidelity in currently stored timestamps, but as Linus and Ted pointed out, anything below ~100ns granularity is effectively just noise, as that's the floor overhead for calling into the kernel. It's hard to argue that any application needs that sort of timestamp resolution, at least with contemporary hardware. Doing that would mean that tests that store specific values in the atime/mtime and expect to be able to fetch exactly that value back would break though, so we'd have to be OK with that if we want to try it. The good news is that it's relatively easy to experiment with new ways to store timestamps with these wrappers in place.
On Thu, Sep 28, 2023, at 13:40, Jeff Layton wrote: > On Thu, 2023-09-28 at 10:19 -0700, Darrick J. Wong wrote: >> >> > I remember seeing those patches go by. I don't remember that change >> > being NaK'ed, but I wasn't paying close attention at the time >> > >> > Looking at it objectively now, I think it's worth it to recover 8 bytes >> > per inode and open a 4 byte hole that Amir can use to grow the >> > i_fsnotify_mask. We might even able to shave off another 12 bytes >> > eventually if we can move to a single 64-bit word per timestamp. >> >> I don't think you can, since btrfs timestamps utilize s64 seconds >> counting in both directions from the Unix epoch. They also support ns >> resolution: >> >> struct btrfs_timespec { >> __le64 sec; >> __le32 nsec; >> } __attribute__ ((__packed__)); >> > > Correct. We'd lose some fidelity in currently stored timestamps, but as > Linus and Ted pointed out, anything below ~100ns granularity is > effectively just noise, as that's the floor overhead for calling into > the kernel. It's hard to argue that any application needs that sort of > timestamp resolution, at least with contemporary hardware. There are probably applications that have come up with creative ways to use the timestamp fields of file systems that 94 bits of data, with both the MSB of the seconds and the LSB of the nanoseconds carrying information that they expect to be preserved. Dropping any information in the nanoseconds other than the top two bits would trivially change the 'ls -t' output when two files have the same timestamp in one kernel but slightly different timestamps in another one. For large values of 'tv_sec', there are fewer obvious things that break, but if current kernels are able to retrieve arbitrary times that were stored with utimensat(), then we should probably make sure future kernels can see the same. Arnd
On Thu, Sep 28, 2023 at 01:40:55PM -0400, Jeff Layton wrote: > > Correct. We'd lose some fidelity in currently stored timestamps, but as > Linus and Ted pointed out, anything below ~100ns granularity is > effectively just noise, as that's the floor overhead for calling into > the kernel. It's hard to argue that any application needs that sort of > timestamp resolution, at least with contemporary hardware. > > Doing that would mean that tests that store specific values in the > atime/mtime and expect to be able to fetch exactly that value back would > break though, so we'd have to be OK with that if we want to try it. The > good news is that it's relatively easy to experiment with new ways to > store timestamps with these wrappers in place. The reason why we store 1ns granularity in ext4's on-disk format (and accept that we only support times only a couple of centuries into the future, as opposed shooting for an on-disk format good for several millennia :-), was in case there was userspace that might try to store a very fine-grained timestamp and want to be able to get it back bit-for-bit identical. For example, what if someone was trying to implement some kind of steganographic scheme where they going store a secret message (or more likely, a 256-bit AES key) in the nanosecond fields of the file's {c,m,a,cr}time timestamps, "hiding in plain sight". Not that I think that we have to support something like that, since the field is for *timestamps* not cryptographic bits, so if we break someone who is doing that, do we care? I don't think anyone will complain about breaking the userspace API --- especially since if, say, the CIA was using this for their spies' drop boxes, they probably wouldn't want to admit it. :-) - Ted
On Thu, 28 Sept 2023 at 14:28, Theodore Ts'o <tytso@mit.edu> wrote: > > I don't think anyone will complain about breaking the userspace API > --- especially since if, say, the CIA was using this for their spies' > drop boxes, they probably wouldn't want to admit it. :-) Well, you will find that real apps do kind of of care. Just to take a very real example, "git" will very much notice time granularity issues and care - because git will cache the 'stat' times in the index. So if you get a different stat time (because the vfs layer has changed some granularity), git will then have to check the files carefully again and update the index. You can simulate this "re-check all files" with something like this: $ time git diff real 0m0.040s user 0m0.035s sys 0m0.264s $ rm .git/index && git read-tree HEAD $ time git diff real 0m9.595s user 0m7.287s sys 0m2.810s so the difference between just doing a "look, index information matches current 'stat' information" and "oops, index does not have the stat data" is "40 milliseconds" vs "10 seconds". That's a big difference, and you'd see that each time the granularity changes. But then once the index file has been updated, it's back to the good case. So yes, real programs to cache stat information, and it matters for performance. But I don't think any actual reasonable program will have *correctness* issues, though - because there are certainly filesystems out there that don't do nanosecond resolution (and other operations like copying trees around will obviously also change times). Anybody doing steganography in the timestamps is already not going to have a great time, really. Linus
On Thu, Sep 28, 2023 at 8:19 PM Darrick J. Wong <djwong@kernel.org> wrote: > > On Thu, Sep 28, 2023 at 01:06:03PM -0400, Jeff Layton wrote: > > On Thu, 2023-09-28 at 11:48 -0400, Arnd Bergmann wrote: > > > On Thu, Sep 28, 2023, at 07:05, Jeff Layton wrote: > > > > This shaves 8 bytes off struct inode, according to pahole. > > > > > > > > Signed-off-by: Jeff Layton <jlayton@kernel.org> > > > > > > FWIW, this is similar to the approach that Deepa suggested > > > back in 2016: > > > > > > https://lore.kernel.org/lkml/1452144972-15802-3-git-send-email-deepa.kernel@gmail.com/ > > > > > > It was NaKed at the time because of the added complexity, > > > though it would have been much easier to do it then, > > > as we had to touch all the timespec references anyway. > > > > > > The approach still seems ok to me, but I'm not sure it's worth > > > doing it now if we didn't do it then. > > > > > > > I remember seeing those patches go by. I don't remember that change > > being NaK'ed, but I wasn't paying close attention at the time > > > > Looking at it objectively now, I think it's worth it to recover 8 bytes > > per inode and open a 4 byte hole that Amir can use to grow the > > i_fsnotify_mask. We might even able to shave off another 12 bytes > > eventually if we can move to a single 64-bit word per timestamp. > > I don't think you can, since btrfs timestamps utilize s64 seconds > counting in both directions from the Unix epoch. They also support ns > resolution: > > struct btrfs_timespec { > __le64 sec; > __le32 nsec; > } __attribute__ ((__packed__)); > > --D > Sure we can. That's what btrfs_inode is for. vfs inode also does not store i_otime (birth time) and there is even a precedent of vfs/btrfs variable size mismatch: /* full 64 bit generation number, struct vfs_inode doesn't have a big * enough field for this. */ u64 generation; If we decide that vfs should use "bigtime", btrfs pre-historic timestamps are not a show stopper. Thanks, Amir.
On Fri, Sep 29, 2023 at 3:19 AM Linus Torvalds <torvalds@linux-foundation.org> wrote: ... > So yes, real programs to cache stat information, and it matters for performance. > > But I don't think any actual reasonable program will have > *correctness* issues, though - I beg to disagree. > because there are certainly filesystems > out there that don't do nanosecond resolution (and other operations > like copying trees around will obviously also change times). > > Anybody doing steganography in the timestamps is already not going to > have a great time, really. > Your thesis implies that all applications are portable across different filesystems and all applications are expected to cope with copying trees around. There are applications that work on specific filesystems and those applications are very much within sanity if they expect that past observed values of nsec will not to change if the file was not changed. But even if we agree that will "only" hurt performance, your example of performance hit (10s of git diff) is nowhere close to the performance hit of invalidating the mtime cache of billions of files at once (i.e. after kernel upgrade), which means that rsync-like programs need to re-read all the data from remote locations. I am not saying that filesystems cannot decide to *stop storing nsec granularity* from this day forth, but like btrfs pre-historic timestamps, those fs have an obligation to preserve existing metadata, unless users opted to throw it away. OTOH, it is perfectly fine if the vfs wants to stop providing sub 100ns services to filesystems. It's just going to be the fs problem and the preserved pre-historic/fine-grained time on existing files would only need to be provided in getattr(). It does not need to be in __i_mtime. Thanks, Amir.
Jeff Layton <jlayton@kernel.org> wrote: > Correct. We'd lose some fidelity in currently stored timestamps, but as > Linus and Ted pointed out, anything below ~100ns granularity is > effectively just noise, as that's the floor overhead for calling into > the kernel. It's hard to argue that any application needs that sort of > timestamp resolution, at least with contemporary hardware. Albeit with the danger of making Steve French very happy;-), would it make sense to switch internally to Microsoft-style 64-bit timestamps with their 100ns granularity? David
> It is a lot of churn though.
I think that i_{a,c,m}time shouldn't be accessed directly by
filesystems same as no filesystem should really access i_{g,u}id which
we also provide i_{g,u}id_{read,write}() accessors for. The mode is
another example where really most often should use helpers because of all
the set*id stripping that we need to do (and the bugs that we had
because of this...).
The interdependency between ctime and mtime is enough to hide this in
accessors. The other big advantage is simply grepability. So really I
would like to see this change even without the type switch.
In other words, there's no need to lump the two changes together. Do the
conversion part and we can argue about the switch to discrete integers
separately.
The other adavantage is that we have a cycle to see any possible
regression from the conversion.
Thoughts anyone?
On Fri, 2023-09-29 at 11:44 +0200, Christian Brauner wrote: > > It is a lot of churn though. > > I think that i_{a,c,m}time shouldn't be accessed directly by > filesystems same as no filesystem should really access i_{g,u}id which > we also provide i_{g,u}id_{read,write}() accessors for. The mode is > another example where really most often should use helpers because of all > the set*id stripping that we need to do (and the bugs that we had > because of this...). > > The interdependency between ctime and mtime is enough to hide this in > accessors. The other big advantage is simply grepability. So really I > would like to see this change even without the type switch. > > In other words, there's no need to lump the two changes together. Do the > conversion part and we can argue about the switch to discrete integers > separately. > > The other adavantage is that we have a cycle to see any possible > regression from the conversion. > > Thoughts anyone? That works for me, and sort of what I was planning anyway. I mostly just did the change to timestamp storage to see what it would look like afterward. FWIW, I'm planning to do a v2 patchbomb early next week, with the changes that Chuck suggested (specific helpers for fetching the _sec and _nsec fields). For now, I'll drop the change from timespec64 to discrete fields. We can do that in a separate follow-on set.
On Thu, 28 Sept 2023 at 20:50, Amir Goldstein <amir73il@gmail.com> wrote: > > OTOH, it is perfectly fine if the vfs wants to stop providing sub 100ns > services to filesystems. It's just going to be the fs problem and the > preserved pre-historic/fine-grained time on existing files would only > need to be provided in getattr(). It does not need to be in __i_mtime. Hmm. That sounds technically sane, but for one thing: if the aim is to try to do (a) atomic timestamp access (b) shrink the inode then having the filesystem maintain its own timestamp for fine-grained data will break both of those goals. Yes, we'd make 'struct inode' smaller if we pack the times into one 64-bit entity, but if btrfs responds by adding mtime fields to "struct btrfs_inode", we lost the size advantage and only made things worse. And if ->getattr() then reads those fields without locking (and we definitely don't want locking in that path), then we lost the atomicity thing too. So no. A "but the filesystem can maintain finer granularity" model is not acceptable, I think. If we do require nanoseconds for compatibility, what we could possibly do is say "we guarantee nanosecond values for *legacy* dates", and say that future dates use 100ns resolution. We'd define "legacy dates" to be the traditional 32-bit signed time_t. So with a 64-bit fstime_t, we'd have the "legacy format": - top 32 bits are seconds, bottom 32 bits are ns which gives us that ns format. Then, because only 30 bits are needed for nanosecond resolution, we use the top two bits of that ns field as flags. '00' means that legacy format, and '01' would mean "we're not doing nanosecond resolution, we're doing 64ns resolution, and the low 6 bits of the ns field are actually bits 32-37 of the seconds field". That still gives us some extensibility (unless the multi-grain code still wants to use the other top bit), and it gives us 40 bits of seconds, which is quite a lot. And all the conversion functions will be simple bit field manipulations, so there are no expensive ops here. Anyway, I agree with the "let's introduce the accessor functions first, we can do the 'pack into one word' decisions later". Linus
On Fri, Sep 29, 2023 at 3:06 AM David Howells via samba-technical <samba-technical@lists.samba.org> wrote: > > > Jeff Layton <jlayton@kernel.org> wrote: > > > Correct. We'd lose some fidelity in currently stored timestamps, but as > > Linus and Ted pointed out, anything below ~100ns granularity is > > effectively just noise, as that's the floor overhead for calling into > > the kernel. It's hard to argue that any application needs that sort of > > timestamp resolution, at least with contemporary hardware. > > Albeit with the danger of making Steve French very happy;-), would it make > sense to switch internally to Microsoft-style 64-bit timestamps with their > 100ns granularity? 100ns granularity does seem to make sense and IIRC was used by various DCE standards in the 90s and 2000s (not just used for SMB2/SMB3 protocol and various Windows filesystems)
diff --git a/include/linux/fs.h b/include/linux/fs.h index 831657011036..de902ff2938b 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -671,9 +671,12 @@ struct inode { }; dev_t i_rdev; loff_t i_size; - struct timespec64 __i_atime; /* use inode_*_atime accessors */ - struct timespec64 __i_mtime; /* use inode_*_mtime accessors */ - struct timespec64 __i_ctime; /* use inode_*_ctime accessors */ + time64_t i_atime_sec; + time64_t i_mtime_sec; + time64_t i_ctime_sec; + u32 i_atime_nsec; + u32 i_mtime_nsec; + u32 i_ctime_nsec; spinlock_t i_lock; /* i_blocks, i_bytes, maybe i_size */ unsigned short i_bytes; u8 i_blkbits; @@ -1519,7 +1522,9 @@ struct timespec64 inode_set_ctime_current(struct inode *inode); */ static inline struct timespec64 inode_get_ctime(const struct inode *inode) { - return inode->__i_ctime; + struct timespec64 ts = { .tv_sec = inode->i_ctime_sec, + .tv_nsec = inode->i_ctime_nsec }; + return ts; } /** @@ -1532,7 +1537,8 @@ static inline struct timespec64 inode_get_ctime(const struct inode *inode) static inline struct timespec64 inode_set_ctime_to_ts(struct inode *inode, struct timespec64 ts) { - inode->__i_ctime = ts; + inode->i_ctime_sec = ts.tv_sec; + inode->i_ctime_nsec = ts.tv_sec; return ts; } @@ -1555,13 +1561,17 @@ static inline struct timespec64 inode_set_ctime(struct inode *inode, static inline struct timespec64 inode_get_atime(const struct inode *inode) { - return inode->__i_atime; + struct timespec64 ts = { .tv_sec = inode->i_atime_sec, + .tv_nsec = inode->i_atime_nsec }; + + return ts; } static inline struct timespec64 inode_set_atime_to_ts(struct inode *inode, struct timespec64 ts) { - inode->__i_atime = ts; + inode->i_atime_sec = ts.tv_sec; + inode->i_atime_nsec = ts.tv_sec; return ts; } @@ -1575,13 +1585,17 @@ static inline struct timespec64 inode_set_atime(struct inode *inode, static inline struct timespec64 inode_get_mtime(const struct inode *inode) { - return inode->__i_mtime; + struct timespec64 ts = { .tv_sec = inode->i_mtime_sec, + .tv_nsec = inode->i_mtime_nsec }; + + return ts; } static inline struct timespec64 inode_set_mtime_to_ts(struct inode *inode, struct timespec64 ts) { - inode->__i_mtime = ts; + inode->i_atime_sec = ts.tv_sec; + inode->i_atime_nsec = ts.tv_sec; return ts; }
This shaves 8 bytes off struct inode, according to pahole. Signed-off-by: Jeff Layton <jlayton@kernel.org> --- include/linux/fs.h | 32 +++++++++++++++++++++++--------- 1 file changed, 23 insertions(+), 9 deletions(-)