[RFC] fs: turn inode->i_ctime into a ktime_t

Message ID: 20240526-ctime-ktime-v1-1-016ca6c1e19a@kernel.org
State: New
Series: [RFC] fs: turn inode->i_ctime into a ktime_t

Commit Message

Jeff Layton May 26, 2024, 12:20 p.m. UTC
The ctime is not settable to arbitrary values. It always comes from the
system clock, so we'll never stamp an inode with a value that can't be
represented there. If we disregard people setting their system clock
past the year 2262, there is no reason we can't replace the ctime fields
with a ktime_t.

Switch the __i_ctime fields to a single ktime_t. Move the i_generation
down above i_fsnotify_mask and then move the i_version into the
resulting 8 byte hole. This shrinks struct inode by 8 bytes total, and
should improve the cache footprint as the i_version and __i_ctime are
usually updated together.

The one downside I can see to switching to a ktime_t is that if someone
has a filesystem with files on it that have ctimes outside the ktime_t
range (before ~1678 AD or after ~2262 AD), we won't be able to display
them properly in stat() without some special treatment. I'm operating
under the assumption that this is not a practical problem.
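
(For reference, those bounds fall out of ktime_t being a signed 64-bit
count of nanoseconds relative to the Unix epoch. A quick
back-of-the-envelope:

	KTIME_MAX = 2^63 - 1 ns ~= 9.22e9 s ~= 292.3 years

puts the representable window at roughly 1677-09-21 through 2262-04-11.)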

Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
I've been looking at this as part of trying to resurrect the multigrain
timestamp work, as Linus mentioned it in passing in an earlier
discussion, but then Willy threw down the gauntlet.

Thoughts?
---
 include/linux/fs.h | 26 +++++++++++---------------
 1 file changed, 11 insertions(+), 15 deletions(-)


---
base-commit: a6f48ee9b741a6da6a939aa5c58d879327f452e1
change-id: 20240526-ctime-ktime-d4152de81153

Best regards,
Jeff Layton <jlayton@kernel.org>

Comments

Theodore Ts'o May 26, 2024, 3:31 p.m. UTC | #1
On Sun, May 26, 2024 at 08:20:16AM -0400, Jeff Layton wrote:
> 
> Switch the __i_ctime fields to a single ktime_t. Move the i_generation
> down above i_fsnotify_mask and then move the i_version into the
> resulting 8 byte hole. This shrinks struct inode by 8 bytes total, and
> should improve the cache footprint as the i_version and __i_ctime are
> usually updated together.

So first of all, this patch is a bit confusing because the patch
doesn't change __i_ctime, but rather i_ctime_sec and i_ctime_nsec, and
Linus's tree doesn't have those fields.  That's because the base
commit in the patch, a6f48ee9b741, isn't in Linus's tree, and
apparently this patch is dependent on "fs: switch timespec64 fields in
inode to discrete integers"[1].

[1] https://lore.kernel.org/all/20240517-amtime-v1-1-7b804ca4be8f@kernel.org/

> The one downside I can see to switching to a ktime_t is that if someone
> has a filesystem with files on it that have ctimes outside the ktime_t
> range (before ~1678 AD or after ~2262 AD), we won't be able to display
> them properly in stat() without some special treatment. I'm operating
> under the assumption that this is not a practical problem.

There are two additional possible problems with this.  The first is
that if there is a maliciously fuzzed file system with ctimes outside
of the ktime_t range, this will almost certainly trigger UBSAN
warnings, which will result in syzkaller security bugs being reported
to file system developers.  This can be fixed by explicitly checking
and clamping the range whenever converting to ktime_t in
include/linux/fs.h, but that leads to another problem.

The second problem is that if the file system converts its on-disk
inode to the in-memory struct inode, then converts it back from the
in-memory to the on-disk inode format, and the timestamp is outside of
the ktime_t range, the on-disk inode could end up with its ctime field
corrupted.  Now, *most* of the time, whenever the inode needs to be
written back to disk, the ctime field will have been changed anyway.
However, there are changes that don't result in userspace-visible
changes but do involve internal file system changes (for example, in
the case of an online repair or defrag, or a COW snapshot), where the
file system doesn't set the ctime --- and it's possible that this
might leave the ctime field mangled.

We could argue that ctime fields outside of the ktime_t range should
never, ever happen (except for malicious fuzz testing by systems like
syzkaller), and so we don't care(tm), but it's worth at least a
mention in the comments and commit description.  Of course, a file
system which *did* care could work around the problem by keeping its
own copy of the ctime in the file-system-specific inode, but that
would come at the cost of the space-saving benefits of this commit.

I suspect what I'd do is add a manual check for an out-of-range ctime
on disk, log a warning, and then clamp the ctime to the maximum
ktime_t value, which is what would be returned by stat(2), and then
write that clamped value back to disk if the ctime field doesn't get
set to the current time before it needs to be written back.
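
A minimal sketch of what that could look like when a filesystem decodes
its on-disk inode (hypothetical helper and message text, just to show
the shape of the check):

	#include <linux/fs.h>
	#include <linux/ktime.h>
	#include <linux/printk.h>

	/*
	 * Illustrative only: warn if the on-disk ctime can't be represented
	 * in ktime_t.  ktime_set() (called via inode_set_ctime()) already
	 * saturates at KTIME_MAX, so the clamped value is what stat(2)
	 * reports and what would get written back later.
	 */
	static void example_decode_ctime(struct inode *inode,
					 time64_t sec, u32 nsec)
	{
		if (unlikely(sec >= KTIME_SEC_MAX))
			pr_warn_once("inode %lu: on-disk ctime beyond ktime_t range, clamping\n",
				     inode->i_ino);
		inode_set_ctime(inode, sec, nsec);
	}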

Cheers,

					- Ted
Jeff Layton May 26, 2024, 3:55 p.m. UTC | #2
On Sun, 2024-05-26 at 11:31 -0400, Theodore Ts'o wrote:
> On Sun, May 26, 2024 at 08:20:16AM -0400, Jeff Layton wrote:
> > 
> > Switch the __i_ctime fields to a single ktime_t. Move the i_generation
> > down above i_fsnotify_mask and then move the i_version into the
> > resulting 8 byte hole. This shrinks struct inode by 8 bytes total, and
> > should improve the cache footprint as the i_version and __i_ctime are
> > usually updated together.
> 
> So first of all, this patch is a bit confusing because the patch
> doesn't change __i_ctime, but rather i_ctime_sec and i_ctime_nsec, and
> Linus's tree doesn't have those fields.  That's because the base
> commit in the patch, a6f48ee9b741, isn't in Linus's tree, and
> apparently this patch is dependent on "fs: switch timespec64 fields in
> inode to discrete integers"[1].
> 
> [1] https://lore.kernel.org/all/20240517-amtime-v1-1-7b804ca4be8f@kernel.org/
> 

My bad. I should have mentioned that in the changelog, and I'll clean
up the title.

> > The one downside I can see to switching to a ktime_t is that if someone
> > has a filesystem with files on it that have ctimes outside the ktime_t
> > range (before ~1678 AD or after ~2262 AD), we won't be able to display
> > them properly in stat() without some special treatment. I'm operating
> > under the assumption that this is not a practical problem.
> 
> There are two additional possible problems with this.  The first is
> that if there is a maliciously fuzzed file system with ctimes outside
> of the ktime_t range, this will almost certainly trigger UBSAN
> warnings, which will result in syzkaller security bugs being reported
> to file system developers.  This can be fixed by explicitly checking
> and clamping the range whenever converting to ktime_t in
> include/linux/fs.h, but that leads to another problem.
>

There should be no undefined behavior since this is using ktime_set.

> The second problem is that if the file system converts its on-disk
> inode to the in-memory struct inode, then converts it back from the
> in-memory to the on-disk inode format, and the timestamp is outside of
> the ktime_t range, the on-disk inode could end up with its ctime field
> corrupted.  Now, *most* of the time, whenever the inode needs to be
> written back to disk, the ctime field will have been changed anyway.
> However, there are changes that don't result in userspace-visible
> changes but do involve internal file system changes (for example, in
> the case of an online repair or defrag, or a COW snapshot), where the
> file system doesn't set the ctime --- and it's possible that this
> might leave the ctime field mangled.
> 
>
> We could argue that ctime fields outside of the ktime_t range should
> never, ever happen (except for malicious fuzz testing by systems like
> syzkaller), and so we don't care(tm), but it's worth at least a
> mention in the comments and commit description.  Of course, a file
> system which *did* care could work around the problem by keeping its
> own copy of the ctime in the file-system-specific inode, but that
> would come at the cost of the space-saving benefits of this commit.
>


My assumption was that we'd not overwrite an on-disk inode unless we
were going to stamp the ctime as well. That's not 100% true (as you
point out). I guess we have to decide whether we care about preserving
the results with this sort of intentional fuzz testing.

>
> I suspect what I'd do is add a manual check for an out-of-range ctime
> on disk, log a warning, and then clamp the ctime to the maximum
> ktime_t value, which is what would be returned by stat(2), and then
> write that clamped value back to disk if the ctime field doesn't get
> set to the current time before it needs to be written back.
> 

ktime_set does this:

        if (unlikely(secs >= KTIME_SEC_MAX))
                return KTIME_MAX;

So, this should already clamp the value to KTIME_MAX, at least as long
as the seconds value is normalized when called. We certainly could have
this do a pr_warn or pr_notice too if the value comes back as
KTIME_MAX.
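
Something like this, perhaps (untested sketch on top of this patch; the
rate-limited warning and its text are just illustrative):

	static inline struct timespec64 inode_set_ctime_to_ts(struct inode *inode,
							      struct timespec64 ts)
	{
		inode->__i_ctime = ktime_set(ts.tv_sec, ts.tv_nsec);

		/* ktime_set() saturates, so KTIME_MAX means we clamped */
		if (unlikely(inode->__i_ctime == KTIME_MAX))
			pr_warn_ratelimited("inode %lu: ctime clamped to KTIME_MAX\n",
					    inode->i_ino);
		return ts;
	}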

Thanks for taking a look, Ted!

Patch

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 639885621608..6b9ed7dff6d5 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -662,11 +662,10 @@  struct inode {
 	loff_t			i_size;
 	time64_t		i_atime_sec;
 	time64_t		i_mtime_sec;
-	time64_t		i_ctime_sec;
 	u32			i_atime_nsec;
 	u32			i_mtime_nsec;
-	u32			i_ctime_nsec;
-	u32			i_generation;
+	ktime_t			__i_ctime;
+	atomic64_t		i_version;
 	spinlock_t		i_lock;	/* i_blocks, i_bytes, maybe i_size */
 	unsigned short          i_bytes;
 	u8			i_blkbits;
@@ -701,7 +700,6 @@  struct inode {
 		struct hlist_head	i_dentry;
 		struct rcu_head		i_rcu;
 	};
-	atomic64_t		i_version;
 	atomic64_t		i_sequence; /* see futex */
 	atomic_t		i_count;
 	atomic_t		i_dio_count;
@@ -724,6 +722,8 @@  struct inode {
 	};
 
 
+	u32			i_generation;
+
 #ifdef CONFIG_FSNOTIFY
 	__u32			i_fsnotify_mask; /* all events this inode cares about */
 	/* 32-bit hole reserved for expanding i_fsnotify_mask */
@@ -1608,29 +1608,25 @@  static inline struct timespec64 inode_set_mtime(struct inode *inode,
 	return inode_set_mtime_to_ts(inode, ts);
 }
 
-static inline time64_t inode_get_ctime_sec(const struct inode *inode)
+static inline struct timespec64 inode_get_ctime(const struct inode *inode)
 {
-	return inode->i_ctime_sec;
+	return ktime_to_timespec64(inode->__i_ctime);
 }
 
-static inline long inode_get_ctime_nsec(const struct inode *inode)
+static inline time64_t inode_get_ctime_sec(const struct inode *inode)
 {
-	return inode->i_ctime_nsec;
+	return inode_get_ctime(inode).tv_sec;
 }
 
-static inline struct timespec64 inode_get_ctime(const struct inode *inode)
+static inline long inode_get_ctime_nsec(const struct inode *inode)
 {
-	struct timespec64 ts = { .tv_sec  = inode_get_ctime_sec(inode),
-				 .tv_nsec = inode_get_ctime_nsec(inode) };
-
-	return ts;
+	return inode_get_ctime(inode).tv_nsec;
 }
 
 static inline struct timespec64 inode_set_ctime_to_ts(struct inode *inode,
 						      struct timespec64 ts)
 {
-	inode->i_ctime_sec = ts.tv_sec;
-	inode->i_ctime_nsec = ts.tv_nsec;
+	inode->__i_ctime = ktime_set(ts.tv_sec, ts.tv_nsec);
 	return ts;
 }