Message ID: 20240526-ctime-ktime-v1-1-016ca6c1e19a@kernel.org
State: New
Series: [RFC] fs: turn inode->i_ctime into a ktime_t
On Sun, May 26, 2024 at 08:20:16AM -0400, Jeff Layton wrote:
>
> Switch the __i_ctime fields to a single ktime_t. Move the i_generation
> down above i_fsnotify_mask and then move the i_version into the
> resulting 8 byte hole. This shrinks struct inode by 8 bytes total, and
> should improve the cache footprint as the i_version and __i_ctime are
> usually updated together.

So first of all, this patch is a bit confusing because the patch
doesn't change __i_ctime, but rather i_ctime_sec and i_ctime_nsec, and
Linus's tree doesn't have those fields. That's because the base commit
in the patch, a6f48ee9b741, isn't in Linus's tree, and apparently this
patch is dependent on "fs: switch timespec64 fields in inode to
discrete integers"[1].

[1] https://lore.kernel.org/all/20240517-amtime-v1-1-7b804ca4be8f@kernel.org/

> The one downside I can see to switching to a ktime_t is that if someone
> has a filesystem with files on it that has ctimes outside the ktime_t
> range (before ~1678 AD or after ~2262 AD), we won't be able to display
> them properly in stat() without some special treatment. I'm operating
> under the assumption that this is not a practical problem.

There are two additional possible problems with this. The first is
that if there is a maliciously fuzzed file system with ctimes outside
of the ktime_t range, this will almost certainly trigger UBSAN
warnings, which will result in Syzkaller security bugs reported to
file system developers. This can be fixed by explicitly clamping the
range whenever converting to ktime_t in include/linux/fs.h, but that
leads to another problem.

The second problem is if the file system converts its on-disk inode
to the in-memory struct inode, and then converts it from the in-memory
back to the on-disk inode format, and the timestamp is outside of the
ktime_t range, this could result in the on-disk inode having its ctime
field getting corrupted.
Now, *most* of the time, whenever the inode needs to be written back
to disk, the ctime field will be changed anyway. However, there are
changes that don't result in userspace-visible changes but involve
internal file system changes (for example, in the case of an online
repair or defrag, or a COW snap), where the file system doesn't set
ctime --- and in that case it's possible that this might result in the
ctime field getting messed up.

We could argue that ctime fields outside of the ktime_t range should
never, ever happen (except for malicious fuzz testing by systems like
syzkaller), and so it could be argued that we don't care(tm), but it's
worth at least a mention in the comments and commit description. Of
course, a file system which *did* care could work around the problem
by having its own copy of ctime in the file-specific inode, but this
would come at the cost of the space saving benefits of this commit.

I suspect what I'd do is to add a manual check for an out of range
ctime on-disk, log a warning, and then clamp the ctime to the maximum
ktime_t value, which is what would be returned by stat(2), and then
write that clamped value back to disk if the ctime field doesn't get
set to the current time before it needs to be written back to disk.

Cheers,

					- Ted
On Sun, 2024-05-26 at 11:31 -0400, Theodore Ts'o wrote:
> On Sun, May 26, 2024 at 08:20:16AM -0400, Jeff Layton wrote:
> >
> > Switch the __i_ctime fields to a single ktime_t. Move the i_generation
> > down above i_fsnotify_mask and then move the i_version into the
> > resulting 8 byte hole. This shrinks struct inode by 8 bytes total, and
> > should improve the cache footprint as the i_version and __i_ctime are
> > usually updated together.
>
> So first of all, this patch is a bit confusing because the patch
> doesn't change __i_ctime, but rather i_ctime_sec and i_ctime_nsec, and
> Linus's tree doesn't have those fields. That's because the base commit
> in the patch, a6f48ee9b741, isn't in Linus's tree, and apparently this
> patch is dependent on "fs: switch timespec64 fields in inode to
> discrete integers"[1].
>
> [1] https://lore.kernel.org/all/20240517-amtime-v1-1-7b804ca4be8f@kernel.org/
>

My bad. I should have mentioned that in the changelog, and I'll clean
up the title.

> > The one downside I can see to switching to a ktime_t is that if someone
> > has a filesystem with files on it that has ctimes outside the ktime_t
> > range (before ~1678 AD or after ~2262 AD), we won't be able to display
> > them properly in stat() without some special treatment. I'm operating
> > under the assumption that this is not a practical problem.
>
> There are two additional possible problems with this. The first is
> that if there is a maliciously fuzzed file system with ctimes outside
> of the ktime_t range, this will almost certainly trigger UBSAN
> warnings, which will result in Syzkaller security bugs reported to
> file system developers. This can be fixed by explicitly clamping the
> range whenever converting to ktime_t in include/linux/fs.h, but that
> leads to another problem.
>

There should be no undefined behavior since this is using ktime_set.
> The second problem is if the file system converts its on-disk inode
> to the in-memory struct inode, and then converts it from the in-memory
> back to the on-disk inode format, and the timestamp is outside of the
> ktime_t range, this could result in the on-disk inode having its ctime
> field getting corrupted. Now, *most* of the time, whenever the inode
> needs to be written back to disk, the ctime field will be changed
> anyway. However, there are changes that don't result in
> userspace-visible changes but involve internal file system changes
> (for example, in the case of an online repair or defrag, or a COW
> snap), where the file system doesn't set ctime --- and in that case
> it's possible that this might result in the ctime field getting
> messed up.
>
> We could argue that ctime fields outside of the ktime_t range should
> never, ever happen (except for malicious fuzz testing by systems like
> syzkaller), and so it could be argued that we don't care(tm), but it's
> worth at least a mention in the comments and commit description. Of
> course, a file system which *did* care could work around the problem
> by having its own copy of ctime in the file-specific inode, but this
> would come at the cost of the space saving benefits of this commit.
>

My assumption was that we'd not overwrite an on-disk inode unless we
were going to stamp the ctime as well. That's not 100% true (as you
point out). I guess we have to decide whether we care about preserving
the results with this sort of intentional fuzz testing.

> I suspect what I'd do is to add a manual check for an out of range
> ctime on-disk, log a warning, and then clamp the ctime to the maximum
> ktime_t value, which is what would be returned by stat(2), and then
> write that clamped value back to disk if the ctime field doesn't get
> set to the current time before it needs to be written back to disk.
ktime_set does this:

	if (unlikely(secs >= KTIME_SEC_MAX))
		return KTIME_MAX;

So, this should already clamp the value to KTIME_MAX, at least as long
as the seconds value is normalized when called. We certainly could have
this do a pr_warn or pr_notice too if the value comes back as
KTIME_MAX.

Thanks for taking a look, Ted!
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 639885621608..6b9ed7dff6d5 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -662,11 +662,10 @@ struct inode {
 	loff_t			i_size;
 	time64_t		i_atime_sec;
 	time64_t		i_mtime_sec;
-	time64_t		i_ctime_sec;
 	u32			i_atime_nsec;
 	u32			i_mtime_nsec;
-	u32			i_ctime_nsec;
-	u32			i_generation;
+	ktime_t			__i_ctime;
+	atomic64_t		i_version;
 	spinlock_t		i_lock;	/* i_blocks, i_bytes, maybe i_size */
 	unsigned short		i_bytes;
 	u8			i_blkbits;
@@ -701,7 +700,6 @@ struct inode {
 		struct hlist_head	i_dentry;
 		struct rcu_head		i_rcu;
 	};
-	atomic64_t		i_version;
 	atomic64_t		i_sequence; /* see futex */
 	atomic_t		i_count;
 	atomic_t		i_dio_count;
@@ -724,6 +722,8 @@ struct inode {
 	};
 
+	u32			i_generation;
+
 #ifdef CONFIG_FSNOTIFY
 	__u32			i_fsnotify_mask; /* all events this inode cares about */
 	/* 32-bit hole reserved for expanding i_fsnotify_mask */
@@ -1608,29 +1608,25 @@ static inline struct timespec64 inode_set_mtime(struct inode *inode,
 	return inode_set_mtime_to_ts(inode, ts);
 }
 
-static inline time64_t inode_get_ctime_sec(const struct inode *inode)
+static inline struct timespec64 inode_get_ctime(const struct inode *inode)
 {
-	return inode->i_ctime_sec;
+	return ktime_to_timespec64(inode->__i_ctime);
 }
 
-static inline long inode_get_ctime_nsec(const struct inode *inode)
+static inline time64_t inode_get_ctime_sec(const struct inode *inode)
 {
-	return inode->i_ctime_nsec;
+	return inode_get_ctime(inode).tv_sec;
 }
 
-static inline struct timespec64 inode_get_ctime(const struct inode *inode)
+static inline long inode_get_ctime_nsec(const struct inode *inode)
 {
-	struct timespec64 ts = { .tv_sec = inode_get_ctime_sec(inode),
-				 .tv_nsec = inode_get_ctime_nsec(inode) };
-
-	return ts;
+	return inode_get_ctime(inode).tv_nsec;
 }
 
 static inline struct timespec64 inode_set_ctime_to_ts(struct inode *inode,
 						      struct timespec64 ts)
 {
-	inode->i_ctime_sec = ts.tv_sec;
-	inode->i_ctime_nsec = ts.tv_nsec;
+	inode->__i_ctime = ktime_set(ts.tv_sec, ts.tv_nsec);
 	return ts;
 }
The ctime is not settable to arbitrary values. It always comes from the
system clock, so we'll never stamp an inode with a value that can't be
represented there. If we disregard people setting their system clock
past the year 2262, there is no reason we can't replace the ctime
fields with a ktime_t.

Switch the __i_ctime fields to a single ktime_t. Move the i_generation
down above i_fsnotify_mask and then move the i_version into the
resulting 8 byte hole. This shrinks struct inode by 8 bytes total, and
should improve the cache footprint as the i_version and __i_ctime are
usually updated together.

The one downside I can see to switching to a ktime_t is that if someone
has a filesystem with files on it that has ctimes outside the ktime_t
range (before ~1678 AD or after ~2262 AD), we won't be able to display
them properly in stat() without some special treatment. I'm operating
under the assumption that this is not a practical problem.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
I've been looking at this as part of trying to resurrect the multigrain
timestamp work, as Linus mentioned it in passing in an earlier
discussion, but then Willy threw down the gauntlet. Thoughts?
---
 include/linux/fs.h | 26 +++++++++++---------------
 1 file changed, 11 insertions(+), 15 deletions(-)
---
base-commit: a6f48ee9b741a6da6a939aa5c58d879327f452e1
change-id: 20240526-ctime-ktime-d4152de81153

Best regards,