diff mbox series

[v4,2/9] fs: add infrastructure for multigrain inode i_m/ctime

Message ID 20230518114742.128950-3-jlayton@kernel.org (mailing list archive)
State New, archived
Headers show
Series fs: implement multigrain timestamps | expand

Commit Message

Jeff Layton May 18, 2023, 11:47 a.m. UTC
The VFS always uses coarse-grained timestamp updates for filling out the
ctime and mtime after a change. This has the benefit of allowing
filesystems to optimize away a lot metadata updates, down to around 1
per jiffy, even when a file is under heavy writes.

Unfortunately, this has always been an issue when we're exporting via
NFSv3, which relies on timestamps to validate caches. Even with NFSv4, a
lot of exported filesystems don't properly support a change attribute
and are subject to the same problems with timestamp granularity. Other
applications have similar issues (e.g backup applications).

Switching to always using fine-grained timestamps would improve the
situation, but that becomes rather expensive, as the underlying
filesystem will have to log a lot more metadata updates.

What we need is a way to only use fine-grained timestamps when they are
being actively queried.

The kernel always stores normalized ctime values, so only the first 30
bits of the tv_nsec field are ever used. Whenever the mtime changes, the
ctime must also change.

Use the 31st bit of the ctime tv_nsec field to indicate that something
has queried the inode for the i_mtime or i_ctime. When this flag is set,
on the next timestamp update, the kernel can fetch a fine-grained
timestamp instead of the usual coarse-grained one.

This patch adds the infrastructure this scheme. Filesytems can opt
into it by setting the FS_MULTIGRAIN_TS flag in the fstype.

Later patches will convert individual filesystems over to use it.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
 fs/inode.c         | 48 ++++++++++++++++++++++++++++-----
 fs/stat.c          | 41 ++++++++++++++++++++++++++--
 include/linux/fs.h | 66 +++++++++++++++++++++++++++++++++++++++++++++-
 3 files changed, 145 insertions(+), 10 deletions(-)

Comments

Jan Kara May 23, 2023, 10:02 a.m. UTC | #1
On Thu 18-05-23 07:47:35, Jeff Layton wrote:
> The VFS always uses coarse-grained timestamp updates for filling out the
> ctime and mtime after a change. This has the benefit of allowing
> filesystems to optimize away a lot metadata updates, down to around 1
> per jiffy, even when a file is under heavy writes.
> 
> Unfortunately, this has always been an issue when we're exporting via
> NFSv3, which relies on timestamps to validate caches. Even with NFSv4, a
> lot of exported filesystems don't properly support a change attribute
> and are subject to the same problems with timestamp granularity. Other
> applications have similar issues (e.g backup applications).
> 
> Switching to always using fine-grained timestamps would improve the
> situation, but that becomes rather expensive, as the underlying
> filesystem will have to log a lot more metadata updates.
> 
> What we need is a way to only use fine-grained timestamps when they are
> being actively queried.
> 
> The kernel always stores normalized ctime values, so only the first 30
> bits of the tv_nsec field are ever used. Whenever the mtime changes, the
> ctime must also change.
> 
> Use the 31st bit of the ctime tv_nsec field to indicate that something
> has queried the inode for the i_mtime or i_ctime. When this flag is set,
> on the next timestamp update, the kernel can fetch a fine-grained
> timestamp instead of the usual coarse-grained one.
> 
> This patch adds the infrastructure this scheme. Filesytems can opt
> into it by setting the FS_MULTIGRAIN_TS flag in the fstype.
> 
> Later patches will convert individual filesystems over to use it.
> 
> Signed-off-by: Jeff Layton <jlayton@kernel.org>

So there are two things I dislike about this series because I think they
are fragile:

1) If we have a filesystem supporting multigrain ts and someone
accidentally directly uses the value of inode->i_ctime, he can get bogus
value (with QUERIED flag). This mistake is very easy to do. So I think we
should rename i_ctime to something like __i_ctime and always use accessor
function for it.

2) As I already commented in a previous version of the series, the scheme
with just one flag for both ctime and mtime and flag getting cleared in
current_time() relies on the fact that filesystems always do an equivalent
of:

	inode->i_mtime = inode->i_ctime = current_time();

Otherwise we can do coarse grained update where we should have done a fine
grained one. Filesystems often update timestamps like this but not
universally. Grepping shows some instances where only inode->i_mtime is set
from current_time() e.g. in autofs or bfs. Again a mistake that is rather
easy to make and results in subtle issues. I think this would be also
nicely solved by renaming i_ctime to __i_ctime and using a function to set
ctime. Mtime could then be updated with inode->i_mtime = ctime_peek().

I understand this is quite some churn but a very mechanical one that could
be just done with Coccinelle and a few manual fixups. So IMHO it is worth
the more robust result.

Some more nits below.

> +/**
> + * current_mg_time - Return FS time (possibly fine-grained)
> + * @inode: inode.
> + *
> + * Return the current time truncated to the time granularity supported by
> + * the fs, as suitable for a ctime/mtime change. If the ctime is flagged
> + * as having been QUERIED, get a fine-grained timestamp.
> + */

The comment should also mention that QUERIED flag is cleared from the ctime.

> +static struct timespec64 current_mg_time(struct inode *inode)
> +{
> +	struct timespec64 now;
> +	atomic_long_t *pnsec = (atomic_long_t *)&inode->i_ctime.tv_nsec;
> +	long nsec = atomic_long_fetch_andnot(I_CTIME_QUERIED, pnsec);
> +
> +	if (nsec & I_CTIME_QUERIED) {
> +		ktime_get_real_ts64(&now);
> +	} else {
> +		struct timespec64 ctime;
> +
> +		ktime_get_coarse_real_ts64(&now);
> +
> +		/*
> +		 * If we've recently fetched a fine-grained timestamp
> +		 * then the coarse-grained one may still be earlier than the
> +		 * existing one. Just keep the existing ctime if so.
> +		 */
> +		ctime = ctime_peek(inode);
> +		if (timespec64_compare(&ctime, &now) > 0)
> +			now = ctime;
> +	}
> +
> +	return now;
> +}
> +

...

> +/**
> + * ctime_nsec_peek - peek at (but don't query) the ctime tv_nsec field
> + * @inode: inode to fetch the ctime from
> + *
> + * Grab the current ctime tv_nsec field from the inode, mask off the
> + * I_CTIME_QUERIED flag and return it. This is mostly intended for use by
> + * internal consumers of the ctime that aren't concerned with ensuring a
> + * fine-grained update on the next change (e.g. when preparing to store
> + * the value in the backing store for later retrieval).
> + *
> + * This is safe to call regardless of whether the underlying filesystem
> + * is using multigrain timestamps.
> + */
> +static inline long ctime_nsec_peek(const struct inode *inode)
> +{
> +	return inode->i_ctime.tv_nsec &~ I_CTIME_QUERIED;

This is somewhat unusual spacing. I'd use:

	inode->i_ctime.tv_nsec & ~I_CTIME_QUERIED

> +}
> +
> +/**
> + * ctime_peek - peek at (but don't query) the ctime
> + * @inode: inode to fetch the ctime from
> + *
> + * Grab the current ctime from the inode, sans I_CTIME_QUERIED flag. For
> + * use by internal consumers that don't require a fine-grained update on
> + * the next change.
> + *
> + * This is safe to call regardless of whether the underlying filesystem
> + * is using multigrain timestamps.
> + */
> +static inline struct timespec64 ctime_peek(const struct inode *inode)
> +{
> +	struct timespec64 ctime;
> +
> +	ctime.tv_sec = inode->i_ctime.tv_sec;
> +	ctime.tv_nsec = ctime_nsec_peek(inode);
> +
> +	return ctime;
> +}

Given this is in a header that gets included in a lot of places, maybe we
should call it like inode_ctime_peek() or inode_ctime_get() to reduce
chances of a name clash?

								Honza
Jan Kara May 23, 2023, 10:17 a.m. UTC | #2
On Tue 23-05-23 12:02:40, Jan Kara wrote:
> On Thu 18-05-23 07:47:35, Jeff Layton wrote:
> > The VFS always uses coarse-grained timestamp updates for filling out the
> > ctime and mtime after a change. This has the benefit of allowing
> > filesystems to optimize away a lot metadata updates, down to around 1
> > per jiffy, even when a file is under heavy writes.
> > 
> > Unfortunately, this has always been an issue when we're exporting via
> > NFSv3, which relies on timestamps to validate caches. Even with NFSv4, a
> > lot of exported filesystems don't properly support a change attribute
> > and are subject to the same problems with timestamp granularity. Other
> > applications have similar issues (e.g backup applications).
> > 
> > Switching to always using fine-grained timestamps would improve the
> > situation, but that becomes rather expensive, as the underlying
> > filesystem will have to log a lot more metadata updates.
> > 
> > What we need is a way to only use fine-grained timestamps when they are
> > being actively queried.
> > 
> > The kernel always stores normalized ctime values, so only the first 30
> > bits of the tv_nsec field are ever used. Whenever the mtime changes, the
> > ctime must also change.
> > 
> > Use the 31st bit of the ctime tv_nsec field to indicate that something
> > has queried the inode for the i_mtime or i_ctime. When this flag is set,
> > on the next timestamp update, the kernel can fetch a fine-grained
> > timestamp instead of the usual coarse-grained one.
> > 
> > This patch adds the infrastructure this scheme. Filesytems can opt
> > into it by setting the FS_MULTIGRAIN_TS flag in the fstype.
> > 
> > Later patches will convert individual filesystems over to use it.
> > 
> > Signed-off-by: Jeff Layton <jlayton@kernel.org>
> 
> So there are two things I dislike about this series because I think they
> are fragile:
> 
> 1) If we have a filesystem supporting multigrain ts and someone
> accidentally directly uses the value of inode->i_ctime, he can get bogus
> value (with QUERIED flag). This mistake is very easy to do. So I think we
> should rename i_ctime to something like __i_ctime and always use accessor
> function for it.
> 
> 2) As I already commented in a previous version of the series, the scheme
> with just one flag for both ctime and mtime and flag getting cleared in
> current_time() relies on the fact that filesystems always do an equivalent
> of:
> 
> 	inode->i_mtime = inode->i_ctime = current_time();
> 
> Otherwise we can do coarse grained update where we should have done a fine
> grained one. Filesystems often update timestamps like this but not
> universally. Grepping shows some instances where only inode->i_mtime is set
> from current_time() e.g. in autofs or bfs. Again a mistake that is rather
> easy to make and results in subtle issues. I think this would be also
> nicely solved by renaming i_ctime to __i_ctime and using a function to set
> ctime. Mtime could then be updated with inode->i_mtime = ctime_peek().
> 
> I understand this is quite some churn but a very mechanical one that could
> be just done with Coccinelle and a few manual fixups. So IMHO it is worth
> the more robust result.

Also as I'm thinking about it your current scheme is slightly racy. Suppose
the filesystem does:

CPU1					CPU2

					statx()
inode->i_ctime = current_time()
  current_mg_time()
    nsec = atomic_long_fetch_andnot(QUERIED, &inode->i_ctime.tv_nsec)
					  nsec = atomic_long_fetch_or(QUERIED, &inode->i_ctime.tv_nsec)
    if (nsec & QUERIED) - not set
      ktime_get_coarse_real_ts64(&now)
    return timestamp_truncate(now, inode);
- QUERIED flag in the inode->i_ctime gets overwritten by the assignment
  => we need not update ctime due to granularity although it was queried

One more reason to use explicit function to update inode->i_ctime ;)

								Honza
Christian Brauner May 23, 2023, 10:38 a.m. UTC | #3
On Tue, May 23, 2023 at 12:02:40PM +0200, Jan Kara wrote:
> On Thu 18-05-23 07:47:35, Jeff Layton wrote:
> > The VFS always uses coarse-grained timestamp updates for filling out the
> > ctime and mtime after a change. This has the benefit of allowing
> > filesystems to optimize away a lot metadata updates, down to around 1
> > per jiffy, even when a file is under heavy writes.
> > 
> > Unfortunately, this has always been an issue when we're exporting via
> > NFSv3, which relies on timestamps to validate caches. Even with NFSv4, a
> > lot of exported filesystems don't properly support a change attribute
> > and are subject to the same problems with timestamp granularity. Other
> > applications have similar issues (e.g backup applications).
> > 
> > Switching to always using fine-grained timestamps would improve the
> > situation, but that becomes rather expensive, as the underlying
> > filesystem will have to log a lot more metadata updates.
> > 
> > What we need is a way to only use fine-grained timestamps when they are
> > being actively queried.
> > 
> > The kernel always stores normalized ctime values, so only the first 30
> > bits of the tv_nsec field are ever used. Whenever the mtime changes, the
> > ctime must also change.
> > 
> > Use the 31st bit of the ctime tv_nsec field to indicate that something
> > has queried the inode for the i_mtime or i_ctime. When this flag is set,
> > on the next timestamp update, the kernel can fetch a fine-grained
> > timestamp instead of the usual coarse-grained one.
> > 
> > This patch adds the infrastructure this scheme. Filesytems can opt
> > into it by setting the FS_MULTIGRAIN_TS flag in the fstype.
> > 
> > Later patches will convert individual filesystems over to use it.
> > 
> > Signed-off-by: Jeff Layton <jlayton@kernel.org>
> 
> So there are two things I dislike about this series because I think they
> are fragile:
> 
> 1) If we have a filesystem supporting multigrain ts and someone
> accidentally directly uses the value of inode->i_ctime, he can get bogus
> value (with QUERIED flag). This mistake is very easy to do. So I think we
> should rename i_ctime to something like __i_ctime and always use accessor
> function for it.
> 
> 2) As I already commented in a previous version of the series, the scheme
> with just one flag for both ctime and mtime and flag getting cleared in
> current_time() relies on the fact that filesystems always do an equivalent
> of:
> 
> 	inode->i_mtime = inode->i_ctime = current_time();
> 
> Otherwise we can do coarse grained update where we should have done a fine
> grained one. Filesystems often update timestamps like this but not
> universally. Grepping shows some instances where only inode->i_mtime is set
> from current_time() e.g. in autofs or bfs. Again a mistake that is rather
> easy to make and results in subtle issues. I think this would be also
> nicely solved by renaming i_ctime to __i_ctime and using a function to set
> ctime. Mtime could then be updated with inode->i_mtime = ctime_peek().
> 
> I understand this is quite some churn but a very mechanical one that could
> be just done with Coccinelle and a few manual fixups. So IMHO it is worth
> the more robust result.

Yeah, these are all good points.

> 
> Some more nits below.
> 
> > +/**
> > + * current_mg_time - Return FS time (possibly fine-grained)
> > + * @inode: inode.
> > + *
> > + * Return the current time truncated to the time granularity supported by
> > + * the fs, as suitable for a ctime/mtime change. If the ctime is flagged
> > + * as having been QUERIED, get a fine-grained timestamp.
> > + */
> 
> The comment should also mention that QUERIED flag is cleared from the ctime.
> 
> > +static struct timespec64 current_mg_time(struct inode *inode)
> > +{
> > +	struct timespec64 now;
> > +	atomic_long_t *pnsec = (atomic_long_t *)&inode->i_ctime.tv_nsec;
> > +	long nsec = atomic_long_fetch_andnot(I_CTIME_QUERIED, pnsec);
> > +
> > +	if (nsec & I_CTIME_QUERIED) {
> > +		ktime_get_real_ts64(&now);
> > +	} else {
> > +		struct timespec64 ctime;
> > +
> > +		ktime_get_coarse_real_ts64(&now);
> > +
> > +		/*
> > +		 * If we've recently fetched a fine-grained timestamp
> > +		 * then the coarse-grained one may still be earlier than the
> > +		 * existing one. Just keep the existing ctime if so.
> > +		 */
> > +		ctime = ctime_peek(inode);
> > +		if (timespec64_compare(&ctime, &now) > 0)
> > +			now = ctime;
> > +	}
> > +
> > +	return now;
> > +}
> > +
> 
> ...
> 
> > +/**
> > + * ctime_nsec_peek - peek at (but don't query) the ctime tv_nsec field
> > + * @inode: inode to fetch the ctime from
> > + *
> > + * Grab the current ctime tv_nsec field from the inode, mask off the
> > + * I_CTIME_QUERIED flag and return it. This is mostly intended for use by
> > + * internal consumers of the ctime that aren't concerned with ensuring a
> > + * fine-grained update on the next change (e.g. when preparing to store
> > + * the value in the backing store for later retrieval).
> > + *
> > + * This is safe to call regardless of whether the underlying filesystem
> > + * is using multigrain timestamps.
> > + */
> > +static inline long ctime_nsec_peek(const struct inode *inode)
> > +{
> > +	return inode->i_ctime.tv_nsec &~ I_CTIME_QUERIED;
> 
> This is somewhat unusual spacing. I'd use:
> 
> 	inode->i_ctime.tv_nsec & ~I_CTIME_QUERIED
> 
> > +}
> > +
> > +/**
> > + * ctime_peek - peek at (but don't query) the ctime
> > + * @inode: inode to fetch the ctime from
> > + *
> > + * Grab the current ctime from the inode, sans I_CTIME_QUERIED flag. For
> > + * use by internal consumers that don't require a fine-grained update on
> > + * the next change.
> > + *
> > + * This is safe to call regardless of whether the underlying filesystem
> > + * is using multigrain timestamps.
> > + */
> > +static inline struct timespec64 ctime_peek(const struct inode *inode)
> > +{
> > +	struct timespec64 ctime;
> > +
> > +	ctime.tv_sec = inode->i_ctime.tv_sec;
> > +	ctime.tv_nsec = ctime_nsec_peek(inode);
> > +
> > +	return ctime;
> > +}
> 
> Given this is in a header that gets included in a lot of places, maybe we
> should call it like inode_ctime_peek() or inode_ctime_get() to reduce
> chances of a name clash?

I think I mentioned this in an earlier comment. Independent of this
series, it would be kinda nice if we could start moving stuff out of
fs.h so we end up with a finer grained split of fs.h.
Jeff Layton May 23, 2023, 10:40 a.m. UTC | #4
On Tue, 2023-05-23 at 12:02 +0200, Jan Kara wrote:
> On Thu 18-05-23 07:47:35, Jeff Layton wrote:
> > The VFS always uses coarse-grained timestamp updates for filling out the
> > ctime and mtime after a change. This has the benefit of allowing
> > filesystems to optimize away a lot metadata updates, down to around 1
> > per jiffy, even when a file is under heavy writes.
> > 
> > Unfortunately, this has always been an issue when we're exporting via
> > NFSv3, which relies on timestamps to validate caches. Even with NFSv4, a
> > lot of exported filesystems don't properly support a change attribute
> > and are subject to the same problems with timestamp granularity. Other
> > applications have similar issues (e.g backup applications).
> > 
> > Switching to always using fine-grained timestamps would improve the
> > situation, but that becomes rather expensive, as the underlying
> > filesystem will have to log a lot more metadata updates.
> > 
> > What we need is a way to only use fine-grained timestamps when they are
> > being actively queried.
> > 
> > The kernel always stores normalized ctime values, so only the first 30
> > bits of the tv_nsec field are ever used. Whenever the mtime changes, the
> > ctime must also change.
> > 
> > Use the 31st bit of the ctime tv_nsec field to indicate that something
> > has queried the inode for the i_mtime or i_ctime. When this flag is set,
> > on the next timestamp update, the kernel can fetch a fine-grained
> > timestamp instead of the usual coarse-grained one.
> > 
> > This patch adds the infrastructure this scheme. Filesytems can opt
> > into it by setting the FS_MULTIGRAIN_TS flag in the fstype.
> > 
> > Later patches will convert individual filesystems over to use it.
> > 
> > Signed-off-by: Jeff Layton <jlayton@kernel.org>
> 
> So there are two things I dislike about this series because I think they
> are fragile:
> 
> 1) If we have a filesystem supporting multigrain ts and someone
> accidentally directly uses the value of inode->i_ctime, he can get bogus
> value (with QUERIED flag). This mistake is very easy to do. So I think we
> should rename i_ctime to something like __i_ctime and always use accessor
> function for it.
> 

We could do this, but it'll be quite invasive. We'd have to change any
place that touches i_ctime (and there are a lot of them), even on
filesystems that are not being converted.

> 2) As I already commented in a previous version of the series, the scheme
> with just one flag for both ctime and mtime and flag getting cleared in
> current_time() relies on the fact that filesystems always do an equivalent
> of:
> 
> 	inode->i_mtime = inode->i_ctime = current_time();
> 
> Otherwise we can do coarse grained update where we should have done a fine
> grained one. Filesystems often update timestamps like this but not
> universally. Grepping shows some instances where only inode->i_mtime is set
> from current_time() e.g. in autofs or bfs. Again a mistake that is rather
> easy to make and results in subtle issues. I think this would be also
> nicely solved by renaming i_ctime to __i_ctime and using a function to set
> ctime. Mtime could then be updated with inode->i_mtime = ctime_peek().
>
> I understand this is quite some churn but a very mechanical one that could
> be just done with Coccinelle and a few manual fixups. So IMHO it is worth
> the more robust result.

AFAICT, under POSIX, you must _always_ set the ctime when you set the
mtime, but the reverse is not true. That's why keeping the flag in the
ctime makes sense. If we're updating the mtime, then we necessarily must
update the ctime.

> Some more nits below.
> 
> > +/**
> > + * current_mg_time - Return FS time (possibly fine-grained)
> > + * @inode: inode.
> > + *
> > + * Return the current time truncated to the time granularity supported by
> > + * the fs, as suitable for a ctime/mtime change. If the ctime is flagged
> > + * as having been QUERIED, get a fine-grained timestamp.
> > + */
> 
> The comment should also mention that QUERIED flag is cleared from the ctime.
> 

Fair point. I can fix that up.

> > +static struct timespec64 current_mg_time(struct inode *inode)
> > +{
> > +	struct timespec64 now;
> > +	atomic_long_t *pnsec = (atomic_long_t *)&inode->i_ctime.tv_nsec;
> > +	long nsec = atomic_long_fetch_andnot(I_CTIME_QUERIED, pnsec);
> > +
> > +	if (nsec & I_CTIME_QUERIED) {
> > +		ktime_get_real_ts64(&now);
> > +	} else {
> > +		struct timespec64 ctime;
> > +
> > +		ktime_get_coarse_real_ts64(&now);
> > +
> > +		/*
> > +		 * If we've recently fetched a fine-grained timestamp
> > +		 * then the coarse-grained one may still be earlier than the
> > +		 * existing one. Just keep the existing ctime if so.
> > +		 */
> > +		ctime = ctime_peek(inode);
> > +		if (timespec64_compare(&ctime, &now) > 0)
> > +			now = ctime;
> > +	}
> > +
> > +	return now;
> > +}
> > +
> 
> ...
> 
> > +/**
> > + * ctime_nsec_peek - peek at (but don't query) the ctime tv_nsec field
> > + * @inode: inode to fetch the ctime from
> > + *
> > + * Grab the current ctime tv_nsec field from the inode, mask off the
> > + * I_CTIME_QUERIED flag and return it. This is mostly intended for use by
> > + * internal consumers of the ctime that aren't concerned with ensuring a
> > + * fine-grained update on the next change (e.g. when preparing to store
> > + * the value in the backing store for later retrieval).
> > + *
> > + * This is safe to call regardless of whether the underlying filesystem
> > + * is using multigrain timestamps.
> > + */
> > +static inline long ctime_nsec_peek(const struct inode *inode)
> > +{
> > +	return inode->i_ctime.tv_nsec &~ I_CTIME_QUERIED;
> 
> This is somewhat unusual spacing. I'd use:
> 
> 	inode->i_ctime.tv_nsec & ~I_CTIME_QUERIED
> 

Yeah, I don't know what happened there. I'll fix that up.

> > +}
> > +
> > +/**
> > + * ctime_peek - peek at (but don't query) the ctime
> > + * @inode: inode to fetch the ctime from
> > + *
> > + * Grab the current ctime from the inode, sans I_CTIME_QUERIED flag. For
> > + * use by internal consumers that don't require a fine-grained update on
> > + * the next change.
> > + *
> > + * This is safe to call regardless of whether the underlying filesystem
> > + * is using multigrain timestamps.
> > + */
> > +static inline struct timespec64 ctime_peek(const struct inode *inode)
> > +{
> > +	struct timespec64 ctime;
> > +
> > +	ctime.tv_sec = inode->i_ctime.tv_sec;
> > +	ctime.tv_nsec = ctime_nsec_peek(inode);
> > +
> > +	return ctime;
> > +}
> 
> Given this is in a header that gets included in a lot of places, maybe we
> should call it like inode_ctime_peek() or inode_ctime_get() to reduce
> chances of a name clash?

I'd be fine with that, but "ctime" sort of implies inode->i_ctime to me.
We don't really use that nomenclature elsewhere.
Jeff Layton May 23, 2023, 10:56 a.m. UTC | #5
On Tue, 2023-05-23 at 12:17 +0200, Jan Kara wrote:
> On Tue 23-05-23 12:02:40, Jan Kara wrote:
> > On Thu 18-05-23 07:47:35, Jeff Layton wrote:
> > > The VFS always uses coarse-grained timestamp updates for filling out the
> > > ctime and mtime after a change. This has the benefit of allowing
> > > filesystems to optimize away a lot metadata updates, down to around 1
> > > per jiffy, even when a file is under heavy writes.
> > > 
> > > Unfortunately, this has always been an issue when we're exporting via
> > > NFSv3, which relies on timestamps to validate caches. Even with NFSv4, a
> > > lot of exported filesystems don't properly support a change attribute
> > > and are subject to the same problems with timestamp granularity. Other
> > > applications have similar issues (e.g backup applications).
> > > 
> > > Switching to always using fine-grained timestamps would improve the
> > > situation, but that becomes rather expensive, as the underlying
> > > filesystem will have to log a lot more metadata updates.
> > > 
> > > What we need is a way to only use fine-grained timestamps when they are
> > > being actively queried.
> > > 
> > > The kernel always stores normalized ctime values, so only the first 30
> > > bits of the tv_nsec field are ever used. Whenever the mtime changes, the
> > > ctime must also change.
> > > 
> > > Use the 31st bit of the ctime tv_nsec field to indicate that something
> > > has queried the inode for the i_mtime or i_ctime. When this flag is set,
> > > on the next timestamp update, the kernel can fetch a fine-grained
> > > timestamp instead of the usual coarse-grained one.
> > > 
> > > This patch adds the infrastructure this scheme. Filesytems can opt
> > > into it by setting the FS_MULTIGRAIN_TS flag in the fstype.
> > > 
> > > Later patches will convert individual filesystems over to use it.
> > > 
> > > Signed-off-by: Jeff Layton <jlayton@kernel.org>
> > 
> > So there are two things I dislike about this series because I think they
> > are fragile:
> > 
> > 1) If we have a filesystem supporting multigrain ts and someone
> > accidentally directly uses the value of inode->i_ctime, he can get bogus
> > value (with QUERIED flag). This mistake is very easy to do. So I think we
> > should rename i_ctime to something like __i_ctime and always use accessor
> > function for it.
> > 
> > 2) As I already commented in a previous version of the series, the scheme
> > with just one flag for both ctime and mtime and flag getting cleared in
> > current_time() relies on the fact that filesystems always do an equivalent
> > of:
> > 
> > 	inode->i_mtime = inode->i_ctime = current_time();
> > 
> > Otherwise we can do coarse grained update where we should have done a fine
> > grained one. Filesystems often update timestamps like this but not
> > universally. Grepping shows some instances where only inode->i_mtime is set
> > from current_time() e.g. in autofs or bfs. Again a mistake that is rather
> > easy to make and results in subtle issues. I think this would be also
> > nicely solved by renaming i_ctime to __i_ctime and using a function to set
> > ctime. Mtime could then be updated with inode->i_mtime = ctime_peek().
> > 
> > I understand this is quite some churn but a very mechanical one that could
> > be just done with Coccinelle and a few manual fixups. So IMHO it is worth
> > the more robust result.
> 
> Also as I'm thinking about it your current scheme is slightly racy. Suppose
> the filesystem does:
> 
> CPU1					CPU2
> 
> 					statx()
> inode->i_ctime = current_time()
>   current_mg_time()
>     nsec = atomic_long_fetch_andnot(QUERIED, &inode->i_ctime.tv_nsec)
> 					  nsec = atomic_long_fetch_or(QUERIED, &inode->i_ctime.tv_nsec)
>     if (nsec & QUERIED) - not set
>       ktime_get_coarse_real_ts64(&now)
>     return timestamp_truncate(now, inode);
> - QUERIED flag in the inode->i_ctime gets overwritten by the assignment
>   => we need not update ctime due to granularity although it was queried
> 
> One more reason to use explicit function to update inode->i_ctime ;)

When we store the new time in the i_ctime field, the flag gets cleared
because at that point we're storing a new (unseen) time.

However, you're correct: if the i_ctime in your above example starts at
the same value that is currently being returned by
ktime_get_coarse_real_ts64, then we'll lose the flag set in statx.

I think the right fix there would be to not update the ctime at all if
it's a coarse grained time, and the value wouldn't have an apparent
change to an observer. That would leave the flag intact.

That does mean we'd need to move to a function that does clock fetch and
assigns it to i_ctime in one go (like you suggest). Something like:

    inode_update_ctime(inode);

How we do that with atomic operations over two values (the tv_sec and
tv_nsec) is a bit tricky. I'll have to think about it.

Christian, given Jan's concerns do you want to drop this series for now
and let me respin it?

Thanks,
Christian Brauner May 23, 2023, 11:01 a.m. UTC | #6
On Tue, May 23, 2023 at 06:56:11AM -0400, Jeff Layton wrote:
> On Tue, 2023-05-23 at 12:17 +0200, Jan Kara wrote:
> > On Tue 23-05-23 12:02:40, Jan Kara wrote:
> > > On Thu 18-05-23 07:47:35, Jeff Layton wrote:
> > > > The VFS always uses coarse-grained timestamp updates for filling out the
> > > > ctime and mtime after a change. This has the benefit of allowing
> > > > filesystems to optimize away a lot metadata updates, down to around 1
> > > > per jiffy, even when a file is under heavy writes.
> > > > 
> > > > Unfortunately, this has always been an issue when we're exporting via
> > > > NFSv3, which relies on timestamps to validate caches. Even with NFSv4, a
> > > > lot of exported filesystems don't properly support a change attribute
> > > > and are subject to the same problems with timestamp granularity. Other
> > > > applications have similar issues (e.g backup applications).
> > > > 
> > > > Switching to always using fine-grained timestamps would improve the
> > > > situation, but that becomes rather expensive, as the underlying
> > > > filesystem will have to log a lot more metadata updates.
> > > > 
> > > > What we need is a way to only use fine-grained timestamps when they are
> > > > being actively queried.
> > > > 
> > > > The kernel always stores normalized ctime values, so only the first 30
> > > > bits of the tv_nsec field are ever used. Whenever the mtime changes, the
> > > > ctime must also change.
> > > > 
> > > > Use the 31st bit of the ctime tv_nsec field to indicate that something
> > > > has queried the inode for the i_mtime or i_ctime. When this flag is set,
> > > > on the next timestamp update, the kernel can fetch a fine-grained
> > > > timestamp instead of the usual coarse-grained one.
> > > > 
> > > > This patch adds the infrastructure this scheme. Filesytems can opt
> > > > into it by setting the FS_MULTIGRAIN_TS flag in the fstype.
> > > > 
> > > > Later patches will convert individual filesystems over to use it.
> > > > 
> > > > Signed-off-by: Jeff Layton <jlayton@kernel.org>
> > > 
> > > So there are two things I dislike about this series because I think they
> > > are fragile:
> > > 
> > > 1) If we have a filesystem supporting multigrain ts and someone
> > > accidentally directly uses the value of inode->i_ctime, he can get bogus
> > > value (with QUERIED flag). This mistake is very easy to do. So I think we
> > > should rename i_ctime to something like __i_ctime and always use accessor
> > > function for it.
> > > 
> > > 2) As I already commented in a previous version of the series, the scheme
> > > with just one flag for both ctime and mtime and flag getting cleared in
> > > current_time() relies on the fact that filesystems always do an equivalent
> > > of:
> > > 
> > > 	inode->i_mtime = inode->i_ctime = current_time();
> > > 
> > > Otherwise we can do coarse grained update where we should have done a fine
> > > grained one. Filesystems often update timestamps like this but not
> > > universally. Grepping shows some instances where only inode->i_mtime is set
> > > from current_time() e.g. in autofs or bfs. Again a mistake that is rather
> > > easy to make and results in subtle issues. I think this would be also
> > > nicely solved by renaming i_ctime to __i_ctime and using a function to set
> > > ctime. Mtime could then be updated with inode->i_mtime = ctime_peek().
> > > 
> > > I understand this is quite some churn but a very mechanical one that could
> > > be just done with Coccinelle and a few manual fixups. So IMHO it is worth
> > > the more robust result.
> > 
> > Also as I'm thinking about it your current scheme is slightly racy. Suppose
> > the filesystem does:
> > 
> > CPU1					CPU2
> > 
> > 					statx()
> > inode->i_ctime = current_time()
> >   current_mg_time()
> >     nsec = atomic_long_fetch_andnot(QUERIED, &inode->i_ctime.tv_nsec)
> > 					  nsec = atomic_long_fetch_or(QUERIED, &inode->i_ctime.tv_nsec)
> >     if (nsec & QUERIED) - not set
> >       ktime_get_coarse_real_ts64(&now)
> >     return timestamp_truncate(now, inode);
> > - QUERIED flag in the inode->i_ctime gets overwritten by the assignment
> >   => we need not update ctime due to granularity although it was queried
> > 
> > One more reason to use explicit function to update inode->i_ctime ;)
> 
> When we store the new time in the i_ctime field, the flag gets cleared
> because at that point we're storing a new (unseen) time.
> 
> However, you're correct: if the i_ctime in your above example starts at
> the same value that is currently being returned by
> ktime_get_coarse_real_ts64, then we'll lose the flag set in statx.
> 
> I think the right fix there would be to not update the ctime at all if
> it's a coarse grained time, and the value wouldn't have an apparent
> change to an observer. That would leave the flag intact.
> 
> That does mean we'd need to move to a function that does clock fetch and
> assigns it to i_ctime in one go (like you suggest). Something like:
> 
>     inode_update_ctime(inode);
> 
> How we do that with atomic operations over two values (the tv_sec and
> tv_nsec) is a bit tricky. I'll have to think about it.
> 
> Christian, given Jan's concerns do you want to drop this series for now
> and let me respin it?

I deliberately put it into a vfs.unstable.* branch. I would leave it
there until you send a new one then drop it. If we get lucky the bots
that run on -next will have time to report potential perf issues while
it's not currently causing conflicts.
Jeff Layton May 23, 2023, 11:15 a.m. UTC | #7
On Tue, 2023-05-23 at 13:01 +0200, Christian Brauner wrote:
> On Tue, May 23, 2023 at 06:56:11AM -0400, Jeff Layton wrote:
> > On Tue, 2023-05-23 at 12:17 +0200, Jan Kara wrote:
> > > On Tue 23-05-23 12:02:40, Jan Kara wrote:
> > > > On Thu 18-05-23 07:47:35, Jeff Layton wrote:
> > > > > The VFS always uses coarse-grained timestamp updates for filling out the
> > > > > ctime and mtime after a change. This has the benefit of allowing
> > > > > filesystems to optimize away a lot metadata updates, down to around 1
> > > > > per jiffy, even when a file is under heavy writes.
> > > > > 
> > > > > Unfortunately, this has always been an issue when we're exporting via
> > > > > NFSv3, which relies on timestamps to validate caches. Even with NFSv4, a
> > > > > lot of exported filesystems don't properly support a change attribute
> > > > > and are subject to the same problems with timestamp granularity. Other
> > > > > applications have similar issues (e.g backup applications).
> > > > > 
> > > > > Switching to always using fine-grained timestamps would improve the
> > > > > situation, but that becomes rather expensive, as the underlying
> > > > > filesystem will have to log a lot more metadata updates.
> > > > > 
> > > > > What we need is a way to only use fine-grained timestamps when they are
> > > > > being actively queried.
> > > > > 
> > > > > The kernel always stores normalized ctime values, so only the first 30
> > > > > bits of the tv_nsec field are ever used. Whenever the mtime changes, the
> > > > > ctime must also change.
> > > > > 
> > > > > Use the 31st bit of the ctime tv_nsec field to indicate that something
> > > > > has queried the inode for the i_mtime or i_ctime. When this flag is set,
> > > > > on the next timestamp update, the kernel can fetch a fine-grained
> > > > > timestamp instead of the usual coarse-grained one.
> > > > > 
> > > > > This patch adds the infrastructure this scheme. Filesytems can opt
> > > > > into it by setting the FS_MULTIGRAIN_TS flag in the fstype.
> > > > > 
> > > > > Later patches will convert individual filesystems over to use it.
> > > > > 
> > > > > Signed-off-by: Jeff Layton <jlayton@kernel.org>
> > > > 
> > > > So there are two things I dislike about this series because I think they
> > > > are fragile:
> > > > 
> > > > 1) If we have a filesystem supporting multigrain ts and someone
> > > > accidentally directly uses the value of inode->i_ctime, he can get bogus
> > > > value (with QUERIED flag). This mistake is very easy to do. So I think we
> > > > should rename i_ctime to something like __i_ctime and always use accessor
> > > > function for it.
> > > > 
> > > > 2) As I already commented in a previous version of the series, the scheme
> > > > with just one flag for both ctime and mtime and flag getting cleared in
> > > > current_time() relies on the fact that filesystems always do an equivalent
> > > > of:
> > > > 
> > > > 	inode->i_mtime = inode->i_ctime = current_time();
> > > > 
> > > > Otherwise we can do coarse grained update where we should have done a fine
> > > > grained one. Filesystems often update timestamps like this but not
> > > > universally. Grepping shows some instances where only inode->i_mtime is set
> > > > from current_time() e.g. in autofs or bfs. Again a mistake that is rather
> > > > easy to make and results in subtle issues. I think this would be also
> > > > nicely solved by renaming i_ctime to __i_ctime and using a function to set
> > > > ctime. Mtime could then be updated with inode->i_mtime = ctime_peek().
> > > > 
> > > > I understand this is quite some churn but a very mechanical one that could
> > > > be just done with Coccinelle and a few manual fixups. So IMHO it is worth
> > > > the more robust result.
> > > 
> > > Also as I'm thinking about it your current scheme is slightly racy. Suppose
> > > the filesystem does:
> > > 
> > > CPU1					CPU2
> > > 
> > > 					statx()
> > > inode->i_ctime = current_time()
> > >   current_mg_time()
> > >     nsec = atomic_long_fetch_andnot(QUERIED, &inode->i_ctime.tv_nsec)
> > > 					  nsec = atomic_long_fetch_or(QUERIED, &inode->i_ctime.tv_nsec)
> > >     if (nsec & QUERIED) - not set
> > >       ktime_get_coarse_real_ts64(&now)
> > >     return timestamp_truncate(now, inode);
> > > - QUERIED flag in the inode->i_ctime gets overwritten by the assignment
> > >   => we need not update ctime due to granularity although it was queried
> > > 
> > > One more reason to use explicit function to update inode->i_ctime ;)
> > 
> > When we store the new time in the i_ctime field, the flag gets cleared
> > because at that point we're storing a new (unseen) time.
> > 
> > However, you're correct: if the i_ctime in your above example starts at
> > the same value that is currently being returned by
> > ktime_get_coarse_real_ts64, then we'll lose the flag set in statx.
> > 
> > I think the right fix there would be to not update the ctime at all if
> > it's a coarse grained time, and the value wouldn't have an apparent
> > change to an observer. That would leave the flag intact.
> > 
> > That does mean we'd need to move to a function that does clock fetch and
> > assigns it to i_ctime in one go (like you suggest). Something like:
> > 
> >     inode_update_ctime(inode);
> > 
> > How we do that with atomic operations over two values (the tv_sec and
> > tv_nsec) is a bit tricky. I'll have to think about it.
> > 
> > Christian, given Jan's concerns do you want to drop this series for now
> > and let me respin it?
> 
> I deliberately put it into a vfs.unstable.* branch. I would leave it
> there until you send a new one then drop it. If we get lucky the bots
> that run on -next will have time to report potential perf issues while
> it's not currently causing conflicts.

Sounds good to me. Thanks!
Jan Kara May 23, 2023, 12:46 p.m. UTC | #8
On Tue 23-05-23 06:40:08, Jeff Layton wrote:
> On Tue, 2023-05-23 at 12:02 +0200, Jan Kara wrote:
> > On Thu 18-05-23 07:47:35, Jeff Layton wrote:
> > > The VFS always uses coarse-grained timestamp updates for filling out the
> > > ctime and mtime after a change. This has the benefit of allowing
> > > filesystems to optimize away a lot metadata updates, down to around 1
> > > per jiffy, even when a file is under heavy writes.
> > > 
> > > Unfortunately, this has always been an issue when we're exporting via
> > > NFSv3, which relies on timestamps to validate caches. Even with NFSv4, a
> > > lot of exported filesystems don't properly support a change attribute
> > > and are subject to the same problems with timestamp granularity. Other
> > > applications have similar issues (e.g backup applications).
> > > 
> > > Switching to always using fine-grained timestamps would improve the
> > > situation, but that becomes rather expensive, as the underlying
> > > filesystem will have to log a lot more metadata updates.
> > > 
> > > What we need is a way to only use fine-grained timestamps when they are
> > > being actively queried.
> > > 
> > > The kernel always stores normalized ctime values, so only the first 30
> > > bits of the tv_nsec field are ever used. Whenever the mtime changes, the
> > > ctime must also change.
> > > 
> > > Use the 31st bit of the ctime tv_nsec field to indicate that something
> > > has queried the inode for the i_mtime or i_ctime. When this flag is set,
> > > on the next timestamp update, the kernel can fetch a fine-grained
> > > timestamp instead of the usual coarse-grained one.
> > > 
> > > This patch adds the infrastructure this scheme. Filesytems can opt
> > > into it by setting the FS_MULTIGRAIN_TS flag in the fstype.
> > > 
> > > Later patches will convert individual filesystems over to use it.
> > > 
> > > Signed-off-by: Jeff Layton <jlayton@kernel.org>
> > 
> > So there are two things I dislike about this series because I think they
> > are fragile:
> > 
> > 1) If we have a filesystem supporting multigrain ts and someone
> > accidentally directly uses the value of inode->i_ctime, he can get bogus
> > value (with QUERIED flag). This mistake is very easy to do. So I think we
> > should rename i_ctime to something like __i_ctime and always use accessor
> > function for it.
> > 
> 
> We could do this, but it'll be quite invasive. We'd have to change any
> place that touches i_ctime (and there are a lot of them), even on
> filesystems that are not being converted.

Yes, that's why I suggested Coccinelle to deal with this.

> > 2) As I already commented in a previous version of the series, the scheme
> > with just one flag for both ctime and mtime and flag getting cleared in
> > current_time() relies on the fact that filesystems always do an equivalent
> > of:
> > 
> > 	inode->i_mtime = inode->i_ctime = current_time();
> > 
> > Otherwise we can do coarse grained update where we should have done a fine
> > grained one. Filesystems often update timestamps like this but not
> > universally. Grepping shows some instances where only inode->i_mtime is set
> > from current_time() e.g. in autofs or bfs. Again a mistake that is rather
> > easy to make and results in subtle issues. I think this would be also
> > nicely solved by renaming i_ctime to __i_ctime and using a function to set
> > ctime. Mtime could then be updated with inode->i_mtime = ctime_peek().
> >
> > I understand this is quite some churn but a very mechanical one that could
> > be just done with Coccinelle and a few manual fixups. So IMHO it is worth
> > the more robust result.
> 
> AFAICT, under POSIX, you must _always_ set the ctime when you set the
> mtime, but the reverse is not true. That's why keeping the flag in the
> ctime makes sense. If we're updating the mtime, then we necessarily must
> update the ctime.

Yes, but technically speaking:

	inode->i_mtime = current_time();
	inode->i_ctime = current_time();

follows these requirements but is broken with your changes in unobvious
way. And if mtime update is hidden in some function or condition, it is not
even that suboptimal coding pattern.

> > > +}
> > > +
> > > +/**
> > > + * ctime_peek - peek at (but don't query) the ctime
> > > + * @inode: inode to fetch the ctime from
> > > + *
> > > + * Grab the current ctime from the inode, sans I_CTIME_QUERIED flag. For
> > > + * use by internal consumers that don't require a fine-grained update on
> > > + * the next change.
> > > + *
> > > + * This is safe to call regardless of whether the underlying filesystem
> > > + * is using multigrain timestamps.
> > > + */
> > > +static inline struct timespec64 ctime_peek(const struct inode *inode)
> > > +{
> > > +	struct timespec64 ctime;
> > > +
> > > +	ctime.tv_sec = inode->i_ctime.tv_sec;
> > > +	ctime.tv_nsec = ctime_nsec_peek(inode);
> > > +
> > > +	return ctime;
> > > +}
> > 
> > Given this is in a header that gets included in a lot of places, maybe we
> > should call it like inode_ctime_peek() or inode_ctime_get() to reduce
> > chances of a name clash?
> 
> I'd be fine with that, but "ctime" sort of implies inode->i_ctime to me.
> We don't really use that nomenclature elsewhere.

Yes, I don't insist here but we do have 'ctime' e.g. in IPC code
(sem_ctime, shm_ctime, msg_ctime), we have ctime in md layer, ctime(3) is
also a glibc function. So it is not *completely* impossible the clash
happens but we can deal with it when it happens.

								Honza
Jeff Layton June 13, 2023, 1:09 p.m. UTC | #9
On Tue, 2023-05-23 at 14:46 +0200, Jan Kara wrote:
> On Tue 23-05-23 06:40:08, Jeff Layton wrote:
> > On Tue, 2023-05-23 at 12:02 +0200, Jan Kara wrote:
> > > 
> > > So there are two things I dislike about this series because I think they
> > > are fragile:
> > > 
> > > 1) If we have a filesystem supporting multigrain ts and someone
> > > accidentally directly uses the value of inode->i_ctime, he can get bogus
> > > value (with QUERIED flag). This mistake is very easy to do. So I think we
> > > should rename i_ctime to something like __i_ctime and always use accessor
> > > function for it.
> > > 
> > 
> > We could do this, but it'll be quite invasive. We'd have to change any
> > place that touches i_ctime (and there are a lot of them), even on
> > filesystems that are not being converted.
> 
> Yes, that's why I suggested Coccinelle to deal with this.


I've done the work to convert all of the accesses of i_ctime into
accessor functions in the kernel. The current state of it is here:

   
https://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux.git/commit/?h=ctime

As expected, it touches a lot of code, all over the place. So far I have
most of the conversion in one giant patch, and I need to split it up
(probably per-subsystem).

What's the best way to feed this change into mainline? Should I try to
get subsystem maintainers to pick these up, or are we better off feeding
this in via a separate branch?

For reference, the diffstat for the big conversion patch is below:

 arch/powerpc/platforms/cell/spufs/inode.c |  2 +-
 arch/s390/hypfs/inode.c                   |  4 +-
 drivers/android/binderfs.c                |  8 ++--
 drivers/infiniband/hw/qib/qib_fs.c        |  4 +-
 drivers/misc/ibmasm/ibmasmfs.c            |  2 +-
 drivers/misc/ibmvmc.c                     |  2 +-
 drivers/usb/core/devio.c                  | 16 ++++----
 drivers/usb/gadget/function/f_fs.c        |  6 +--
 drivers/usb/gadget/legacy/inode.c         |  3 +-
 fs/9p/vfs_inode.c                         |  6 ++-
 fs/9p/vfs_inode_dotl.c                    | 11 +++---
 fs/adfs/inode.c                           |  4 +-
 fs/affs/amigaffs.c                        |  6 +--
 fs/affs/inode.c                           | 17 ++++----
 fs/afs/dynroot.c                          |  2 +-
 fs/afs/inode.c                            |  6 +--
 fs/attr.c                                 |  2 +-
 fs/autofs/inode.c                         |  2 +-
 fs/autofs/root.c                          |  6 +--
 fs/bad_inode.c                            |  3 +-
 fs/befs/linuxvfs.c                        |  2 +-
 fs/bfs/dir.c                              | 16 ++++----
 fs/bfs/inode.c                            |  6 +--
 fs/binfmt_misc.c                          |  3 +-
 fs/btrfs/delayed-inode.c                  | 10 +++--
 fs/btrfs/file.c                           | 24 ++++-------
 fs/btrfs/inode.c                          | 66 ++++++++++++------------
-------
 fs/btrfs/ioctl.c                          |  2 +-
 fs/btrfs/reflink.c                        |  7 ++--
 fs/btrfs/transaction.c                    |  3 +-
 fs/btrfs/tree-log.c                       |  4 +-
 fs/btrfs/xattr.c                          |  4 +-
 fs/ceph/acl.c                             |  2 +-
 fs/ceph/caps.c                            |  2 +-
 fs/ceph/inode.c                           | 17 ++++----
 fs/ceph/snap.c                            |  2 +-
 fs/ceph/xattr.c                           |  2 +-
 fs/coda/coda_linux.c                      |  2 +-
 fs/coda/dir.c                             |  2 +-
 fs/coda/file.c                            |  2 +-
 fs/coda/inode.c                           |  2 +-
 fs/configfs/inode.c                       |  6 +--
 fs/cramfs/inode.c                         |  2 +-
 fs/debugfs/inode.c                        |  2 +-
 fs/devpts/inode.c                         |  6 +--
 fs/ecryptfs/inode.c                       |  2 +-
 fs/efivarfs/file.c                        |  2 +-
 fs/efivarfs/inode.c                       |  2 +-
 fs/efs/inode.c                            |  5 ++-
 fs/erofs/inode.c                          | 16 ++++----
 fs/exfat/file.c                           |  4 +-
 fs/exfat/inode.c                          |  6 +--
 fs/exfat/namei.c                          | 29 +++++++-------
 fs/exfat/super.c                          |  4 +-
 fs/ext2/acl.c                             |  2 +-
 fs/ext2/dir.c                             |  6 +--
 fs/ext2/ialloc.c                          |  2 +-
 fs/ext2/inode.c                           | 11 +++---
 fs/ext2/ioctl.c                           |  4 +-
 fs/ext2/namei.c                           |  8 ++--
 fs/ext2/super.c                           |  2 +-
 fs/ext2/xattr.c                           |  2 +-
 fs/ext4/acl.c                             |  2 +-
 fs/ext4/ext4.h                            | 20 ++++++++++
 fs/ext4/extents.c                         | 12 +++---
 fs/ext4/ialloc.c                          |  2 +-
 fs/ext4/inline.c                          |  4 +-
 fs/ext4/inode.c                           | 16 ++++----
 fs/ext4/ioctl.c                           |  9 +++--
 fs/ext4/namei.c                           | 26 ++++++------
 fs/ext4/super.c                           |  2 +-
 fs/ext4/xattr.c                           |  6 +--
 fs/f2fs/dir.c                             |  8 ++--
 fs/f2fs/f2fs.h                            |  5 ++-
 fs/f2fs/file.c                            | 16 ++++----
 fs/f2fs/inline.c                          |  2 +-
 fs/f2fs/inode.c                           | 10 ++---
 fs/f2fs/namei.c                           | 12 +++---
 fs/f2fs/recovery.c                        |  4 +-
 fs/f2fs/super.c                           |  2 +-
 fs/f2fs/xattr.c                           |  2 +-
 fs/fat/inode.c                            |  8 ++--
 fs/fat/misc.c                             |  7 +++-
 fs/freevxfs/vxfs_inode.c                  |  4 +-
 fs/fuse/control.c                         |  2 +-
 fs/fuse/dir.c                             |  8 ++--
 fs/fuse/inode.c                           | 18 +++++----
 fs/gfs2/acl.c                             |  2 +-
 fs/gfs2/bmap.c                            | 11 +++---
 fs/gfs2/dir.c                             | 15 +++----
 fs/gfs2/file.c                            |  2 +-
 fs/gfs2/glops.c                           |  4 +-
 fs/gfs2/inode.c                           |  8 ++--
 fs/gfs2/super.c                           |  4 +-
 fs/gfs2/xattr.c                           |  8 ++--
 fs/hfs/catalog.c                          |  8 ++--
 fs/hfs/dir.c                              |  2 +-
 fs/hfs/inode.c                            | 13 +++---
 fs/hfs/sysdep.c                           |  2 +-
 fs/hfsplus/catalog.c                      |  8 ++--
 fs/hfsplus/dir.c                          |  6 +--
 fs/hfsplus/inode.c                        | 14 +++----
 fs/hostfs/hostfs_kern.c                   |  5 ++-
 fs/hpfs/dir.c                             |  8 ++--
 fs/hpfs/inode.c                           |  6 +--
 fs/hpfs/namei.c                           | 26 ++++++------
 fs/hpfs/super.c                           |  5 ++-
 fs/hugetlbfs/inode.c                      | 12 +++---
 fs/inode.c                                | 12 ++++--
 fs/isofs/inode.c                          |  4 +-
 fs/isofs/rock.c                           | 16 ++++----
 fs/jffs2/dir.c                            | 19 ++++-----
 fs/jffs2/file.c                           |  3 +-
 fs/jffs2/fs.c                             | 10 ++---
 fs/jffs2/os-linux.h                       |  2 +-
 fs/jfs/acl.c                              |  2 +-
 fs/jfs/inode.c                            |  2 +-
 fs/jfs/ioctl.c                            |  2 +-
 fs/jfs/jfs_imap.c                         |  8 ++--
 fs/jfs/jfs_inode.c                        |  4 +-
 fs/jfs/namei.c                            | 25 ++++++------
 fs/jfs/super.c                            |  2 +-
 fs/jfs/xattr.c                            |  2 +-
 fs/kernfs/inode.c                         |  4 +-
 fs/libfs.c                                | 32 ++++++++-------
 fs/minix/bitmap.c                         |  2 +-
 fs/minix/dir.c                            |  6 +--
 fs/minix/inode.c                          | 11 +++---
 fs/minix/itree_common.c                   |  4 +-
 fs/minix/namei.c                          |  6 +--
 fs/nfs/callback_proc.c                    |  2 +-
 fs/nfs/fscache.h                          |  4 +-
 fs/nfs/inode.c                            | 21 +++++-----
 fs/nfsd/nfsctl.c                          |  2 +-
 fs/nfsd/nfsfh.c                           |  4 +-
 fs/nfsd/vfs.c                             |  2 +-
 fs/nilfs2/dir.c                           |  6 +--
 fs/nilfs2/inode.c                         | 12 +++---
 fs/nilfs2/ioctl.c                         |  2 +-
 fs/nilfs2/namei.c                         |  8 ++--
 fs/nsfs.c                                 |  2 +-
 fs/ntfs/inode.c                           | 15 +++----
 fs/ntfs/mft.c                             |  3 +-
 fs/ntfs3/file.c                           |  6 +--
 fs/ntfs3/frecord.c                        |  4 +-
 fs/ntfs3/inode.c                          | 14 ++++---
 fs/ntfs3/namei.c                          | 10 ++---
 fs/ntfs3/xattr.c                          |  4 +-
 fs/ocfs2/acl.c                            |  6 +--
 fs/ocfs2/alloc.c                          |  6 +--
 fs/ocfs2/aops.c                           |  2 +-
 fs/ocfs2/dir.c                            |  8 ++--
 fs/ocfs2/dlmfs/dlmfs.c                    |  4 +-
 fs/ocfs2/dlmglue.c                        | 10 +++--
 fs/ocfs2/file.c                           | 16 ++++----
 fs/ocfs2/inode.c                          | 14 ++++---
 fs/ocfs2/move_extents.c                   |  6 +--
 fs/ocfs2/namei.c                          | 22 ++++++-----
 fs/ocfs2/refcounttree.c                   | 14 +++----
 fs/ocfs2/xattr.c                          |  6 +--
 fs/omfs/dir.c                             |  4 +-
 fs/omfs/inode.c                           | 10 ++---
 fs/openpromfs/inode.c                     |  4 +-
 fs/orangefs/namei.c                       |  2 +-
 fs/orangefs/orangefs-utils.c              |  6 +--
 fs/overlayfs/file.c                       |  7 +++-
 fs/overlayfs/util.c                       |  2 +-
 fs/pipe.c                                 |  2 +-
 fs/posix_acl.c                            |  2 +-
 fs/proc/base.c                            |  2 +-
 fs/proc/inode.c                           |  2 +-
 fs/proc/proc_sysctl.c                     |  2 +-
 fs/proc/self.c                            |  2 +-
 fs/proc/thread_self.c                     |  2 +-
 fs/pstore/inode.c                         |  4 +-
 fs/qnx4/inode.c                           |  4 +-
 fs/qnx6/inode.c                           |  4 +-
 fs/ramfs/inode.c                          |  6 +--
 fs/reiserfs/inode.c                       | 14 +++----
 fs/reiserfs/ioctl.c                       |  4 +-
 fs/reiserfs/namei.c                       | 21 +++++-----
 fs/reiserfs/stree.c                       |  4 +-
 fs/reiserfs/super.c                       |  2 +-
 fs/reiserfs/xattr.c                       |  5 ++-
 fs/reiserfs/xattr_acl.c                   |  2 +-
 fs/romfs/super.c                          |  4 +-
 fs/smb/client/file.c                      |  4 +-
 fs/smb/client/fscache.h                   |  5 ++-
 fs/smb/client/inode.c                     | 15 ++++---
 fs/smb/client/smb2ops.c                   |  2 +-
 fs/smb/server/smb2pdu.c                   |  8 ++--
 fs/squashfs/inode.c                       |  2 +-
 fs/stack.c                                |  2 +-
 fs/stat.c                                 |  2 +-
 fs/sysv/dir.c                             |  6 +--
 fs/sysv/ialloc.c                          |  2 +-
 fs/sysv/inode.c                           |  6 +--
 fs/sysv/itree.c                           |  4 +-
 fs/sysv/namei.c                           |  6 +--
 fs/tracefs/inode.c                        |  2 +-
 fs/ubifs/debug.c                          |  4 +-
 fs/ubifs/dir.c                            | 39 +++++++++---------
 fs/ubifs/file.c                           | 16 ++++----
 fs/ubifs/ioctl.c                          |  2 +-
 fs/ubifs/journal.c                        |  4 +-
 fs/ubifs/super.c                          |  4 +-
 fs/ubifs/xattr.c                          |  6 +--
 fs/udf/ialloc.c                           |  2 +-
 fs/udf/inode.c                            | 17 ++++----
 fs/udf/namei.c                            | 24 +++++------
 fs/ufs/dir.c                              |  6 +--
 fs/ufs/ialloc.c                           |  2 +-
 fs/ufs/inode.c                            | 23 ++++++-----
 fs/ufs/namei.c                            |  8 ++--
 fs/vboxsf/utils.c                         |  4 +-
 fs/xfs/libxfs/xfs_inode_buf.c             |  4 +-
 fs/xfs/libxfs/xfs_trans_inode.c           |  2 +-
 fs/xfs/xfs_acl.c                          |  2 +-
 fs/xfs/xfs_bmap_util.c                    |  6 ++-
 fs/xfs/xfs_inode.c                        |  2 +-
 fs/xfs/xfs_inode_item.c                   |  2 +-
 fs/xfs/xfs_iops.c                         |  4 +-
 fs/xfs/xfs_itable.c                       |  4 +-
 fs/zonefs/super.c                         |  8 ++--
 include/linux/fs.h                        |  1 +
 include/linux/fs_stack.h                  |  2 +-
 ipc/mqueue.c                              | 20 +++++-----
 kernel/bpf/inode.c                        |  4 +-
 mm/shmem.c                                | 28 +++++++------
 net/sunrpc/rpc_pipe.c                     |  2 +-
 security/apparmor/apparmorfs.c            |  6 +--
 security/apparmor/policy_unpack.c         |  4 +-
 security/inode.c                          |  2 +-
 security/selinux/selinuxfs.c              |  2 +-
 234 files changed, 851 insertions(+), 808 deletions(-)
Jeff Layton June 13, 2023, 7:47 p.m. UTC | #10
On Tue, 2023-05-23 at 12:17 +0200, Jan Kara wrote:
> On Tue 23-05-23 12:02:40, Jan Kara wrote:
> > On Thu 18-05-23 07:47:35, Jeff Layton wrote:
> > > The VFS always uses coarse-grained timestamp updates for filling out the
> > > ctime and mtime after a change. This has the benefit of allowing
> > > filesystems to optimize away a lot metadata updates, down to around 1
> > > per jiffy, even when a file is under heavy writes.
> > > 
> > > Unfortunately, this has always been an issue when we're exporting via
> > > NFSv3, which relies on timestamps to validate caches. Even with NFSv4, a
> > > lot of exported filesystems don't properly support a change attribute
> > > and are subject to the same problems with timestamp granularity. Other
> > > applications have similar issues (e.g backup applications).
> > > 
> > > Switching to always using fine-grained timestamps would improve the
> > > situation, but that becomes rather expensive, as the underlying
> > > filesystem will have to log a lot more metadata updates.
> > > 
> > > What we need is a way to only use fine-grained timestamps when they are
> > > being actively queried.
> > > 
> > > The kernel always stores normalized ctime values, so only the first 30
> > > bits of the tv_nsec field are ever used. Whenever the mtime changes, the
> > > ctime must also change.
> > > 
> > > Use the 31st bit of the ctime tv_nsec field to indicate that something
> > > has queried the inode for the i_mtime or i_ctime. When this flag is set,
> > > on the next timestamp update, the kernel can fetch a fine-grained
> > > timestamp instead of the usual coarse-grained one.
> > > 
> > > This patch adds the infrastructure this scheme. Filesytems can opt
> > > into it by setting the FS_MULTIGRAIN_TS flag in the fstype.
> > > 
> > > Later patches will convert individual filesystems over to use it.
> > > 
> > > Signed-off-by: Jeff Layton <jlayton@kernel.org>
> > 
> > So there are two things I dislike about this series because I think they
> > are fragile:
> > 
> > 1) If we have a filesystem supporting multigrain ts and someone
> > accidentally directly uses the value of inode->i_ctime, he can get bogus
> > value (with QUERIED flag). This mistake is very easy to do. So I think we
> > should rename i_ctime to something like __i_ctime and always use accessor
> > function for it.
> > 
> > 2) As I already commented in a previous version of the series, the scheme
> > with just one flag for both ctime and mtime and flag getting cleared in
> > current_time() relies on the fact that filesystems always do an equivalent
> > of:
> > 
> > 	inode->i_mtime = inode->i_ctime = current_time();
> > 
> > Otherwise we can do coarse grained update where we should have done a fine
> > grained one. Filesystems often update timestamps like this but not
> > universally. Grepping shows some instances where only inode->i_mtime is set
> > from current_time() e.g. in autofs or bfs. Again a mistake that is rather
> > easy to make and results in subtle issues. I think this would be also
> > nicely solved by renaming i_ctime to __i_ctime and using a function to set
> > ctime. Mtime could then be updated with inode->i_mtime = ctime_peek().
> > 
> > I understand this is quite some churn but a very mechanical one that could
> > be just done with Coccinelle and a few manual fixups. So IMHO it is worth
> > the more robust result.
> 
> Also as I'm thinking about it your current scheme is slightly racy. Suppose
> the filesystem does:
> 
> CPU1					CPU2
> 
> 					statx()
> inode->i_ctime = current_time()
>   current_mg_time()
>     nsec = atomic_long_fetch_andnot(QUERIED, &inode->i_ctime.tv_nsec)
> 					  nsec = atomic_long_fetch_or(QUERIED, &inode->i_ctime.tv_nsec)
>     if (nsec & QUERIED) - not set
>       ktime_get_coarse_real_ts64(&now)
>     return timestamp_truncate(now, inode);
> - QUERIED flag in the inode->i_ctime gets overwritten by the assignment
>   => we need not update ctime due to granularity although it was queried
> 
> One more reason to use explicit function to update inode->i_ctime ;)

Thinking about this some more, I think we can fix the race you pointed
out by just not clearing the queried flag when we fetch the
i_ctime.tv_nsec field when we're updating.

So, instead of atomic_long_fetch_andnot, we'd just want to use an
atomic_long_fetch there, and just let the eventual assignment of
inode->__i_ctime.tv_nsec be what clears the flag.

Any task that goes to update the time during the interim window will
fetch a fine-grained time, but that's what we want anyway.

Since you bring up races though, there are a couple of other things we
should be aware of. Note that both problems exist today too:

1) it's possible for two tasks to race in such a way that the ctime goes
backward. There's no synchronization between tasks doing the updating,
so an older time can overwrite a newer one. I think you'd need a pretty
tight race to observe this though.

2) it's possible to fetch a "torn" timestamp out of the inode.
timespec64 is two words, and we don't order its loads or stores. We
could consider adding a seqcount_t and some barriers and fixing it that
way. I'm not sure it's worth it though, given that we haven't worried
about this in the past.

For now, I propose that we ignore both problems, unless and until we can
prove that they are more than theoretical.
Christian Brauner June 14, 2023, 6:29 a.m. UTC | #11
On Tue, Jun 13, 2023 at 09:09:29AM -0400, Jeff Layton wrote:
> On Tue, 2023-05-23 at 14:46 +0200, Jan Kara wrote:
> > On Tue 23-05-23 06:40:08, Jeff Layton wrote:
> > > On Tue, 2023-05-23 at 12:02 +0200, Jan Kara wrote:
> > > > 
> > > > So there are two things I dislike about this series because I think they
> > > > are fragile:
> > > > 
> > > > 1) If we have a filesystem supporting multigrain ts and someone
> > > > accidentally directly uses the value of inode->i_ctime, he can get bogus
> > > > value (with QUERIED flag). This mistake is very easy to do. So I think we
> > > > should rename i_ctime to something like __i_ctime and always use accessor
> > > > function for it.
> > > > 
> > > 
> > > We could do this, but it'll be quite invasive. We'd have to change any
> > > place that touches i_ctime (and there are a lot of them), even on
> > > filesystems that are not being converted.
> > 
> > Yes, that's why I suggested Coccinelle to deal with this.
> 
> 
> I've done the work to convert all of the accesses of i_ctime into
> accessor functions in the kernel. The current state of it is here:
> 
>    
> https://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux.git/commit/?h=ctime
> 
> As expected, it touches a lot of code, all over the place. So far I have
> most of the conversion in one giant patch, and I need to split it up
> (probably per-subsystem).

Yeah, you have time since it'll be v6.6 material.

> 
> What's the best way to feed this change into mainline? Should I try to
> get subsystem maintainers to pick these up, or are we better off feeding
> this in via a separate branch?

I would prefer if we send them all through the vfs tree since trickle
down conversions are otherwise very painful and potentially very slow.
diff mbox series

Patch

diff --git a/fs/inode.c b/fs/inode.c
index 577799b7855f..24769e08fbaa 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -2029,6 +2029,7 @@  EXPORT_SYMBOL(file_remove_privs);
 static int inode_needs_update_time(struct inode *inode, struct timespec64 *now)
 {
 	int sync_it = 0;
+	struct timespec64 ctime;
 
 	/* First try to exhaust all avenues to not sync */
 	if (IS_NOCMTIME(inode))
@@ -2037,7 +2038,8 @@  static int inode_needs_update_time(struct inode *inode, struct timespec64 *now)
 	if (!timespec64_equal(&inode->i_mtime, now))
 		sync_it = S_MTIME;
 
-	if (!timespec64_equal(&inode->i_ctime, now))
+	ctime = ctime_peek(inode);
+	if (!timespec64_equal(&ctime, now))
 		sync_it |= S_CTIME;
 
 	if (IS_I_VERSION(inode) && inode_iversion_need_inc(inode))
@@ -2431,6 +2433,40 @@  struct timespec64 timestamp_truncate(struct timespec64 t, struct inode *inode)
 }
 EXPORT_SYMBOL(timestamp_truncate);
 
+/**
+ * current_mg_time - Return FS time (possibly fine-grained)
+ * @inode: inode.
+ *
+ * Return the current time truncated to the time granularity supported by
+ * the fs, as suitable for a ctime/mtime change. If the ctime is flagged
+ * as having been QUERIED, get a fine-grained timestamp.
+ */
+static struct timespec64 current_mg_time(struct inode *inode)
+{
+	struct timespec64 now;
+	atomic_long_t *pnsec = (atomic_long_t *)&inode->i_ctime.tv_nsec;
+	long nsec = atomic_long_fetch_andnot(I_CTIME_QUERIED, pnsec);
+
+	if (nsec & I_CTIME_QUERIED) {
+		ktime_get_real_ts64(&now);
+	} else {
+		struct timespec64 ctime;
+
+		ktime_get_coarse_real_ts64(&now);
+
+		/*
+		 * If we've recently fetched a fine-grained timestamp
+		 * then the coarse-grained one may still be earlier than the
+		 * existing one. Just keep the existing ctime if so.
+		 */
+		ctime = ctime_peek(inode);
+		if (timespec64_compare(&ctime, &now) > 0)
+			now = ctime;
+	}
+
+	return now;
+}
+
 /**
  * current_time - Return FS time
  * @inode: inode.
@@ -2445,12 +2481,10 @@  struct timespec64 current_time(struct inode *inode)
 {
 	struct timespec64 now;
 
-	ktime_get_coarse_real_ts64(&now);
-
-	if (unlikely(!inode->i_sb)) {
-		WARN(1, "current_time() called with uninitialized super_block in the inode");
-		return now;
-	}
+	if (is_multigrain_ts(inode))
+		now = current_mg_time(inode);
+	else
+		ktime_get_coarse_real_ts64(&now);
 
 	return timestamp_truncate(now, inode);
 }
diff --git a/fs/stat.c b/fs/stat.c
index 9b513a142a56..74d8283cc5c6 100644
--- a/fs/stat.c
+++ b/fs/stat.c
@@ -26,6 +26,38 @@ 
 #include "internal.h"
 #include "mount.h"
 
+/**
+ * fill_multigrain_cmtime - Fill in the mtime and ctime and flag ctime as QUERIED
+ * @request_mask: STATX_* values requested
+ * @inode: inode from which to grab the c/mtime
+ * @stat: where to store the resulting values
+ *
+ * Given @inode, grab the ctime and mtime out if it and store the result
+ * in @stat. When fetching the value, flag it as queried so the next write
+ * will use a fine-grained timestamp.
+ */
+void fill_multigrain_cmtime(u32 request_mask, struct inode *inode,
+			    struct kstat *stat)
+{
+	atomic_long_t *pnsec = (atomic_long_t *)&inode->i_ctime.tv_nsec;
+
+	/* If neither time was requested, then don't report them */
+	if (!(request_mask & (STATX_CTIME|STATX_MTIME))) {
+		stat->result_mask &= ~(STATX_CTIME|STATX_MTIME);
+		return;
+	}
+
+	stat->mtime = inode->i_mtime;
+	stat->ctime.tv_sec = inode->i_ctime.tv_sec;
+	/*
+	 * Atomically set the QUERIED flag and fetch the new value with
+	 * the flag masked off.
+	 */
+	stat->ctime.tv_nsec = atomic_long_fetch_or(I_CTIME_QUERIED, pnsec) &
+					~I_CTIME_QUERIED;
+}
+EXPORT_SYMBOL(fill_multigrain_cmtime);
+
 /**
  * generic_fillattr - Fill in the basic attributes from the inode struct
  * @idmap:	idmap of the mount the inode was found from
@@ -58,11 +90,16 @@  void generic_fillattr(struct mnt_idmap *idmap, u32 request_mask,
 	stat->rdev = inode->i_rdev;
 	stat->size = i_size_read(inode);
 	stat->atime = inode->i_atime;
-	stat->mtime = inode->i_mtime;
-	stat->ctime = inode->i_ctime;
 	stat->blksize = i_blocksize(inode);
 	stat->blocks = inode->i_blocks;
 
+	if (is_multigrain_ts(inode)) {
+		fill_multigrain_cmtime(request_mask, inode, stat);
+	} else {
+		stat->mtime = inode->i_mtime;
+		stat->ctime = inode->i_ctime;
+	}
+
 	if ((request_mask & STATX_CHANGE_COOKIE) && IS_I_VERSION(inode)) {
 		stat->result_mask |= STATX_CHANGE_COOKIE;
 		stat->change_cookie = inode_query_iversion(inode);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index d5896f90093a..1f670cf1edbd 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1474,7 +1474,7 @@  static inline bool fsuidgid_has_mapping(struct super_block *sb,
 	       kgid_has_mapping(fs_userns, kgid);
 }
 
-extern struct timespec64 current_time(struct inode *inode);
+struct timespec64 current_time(struct inode *inode);
 
 /*
  * Snapshotting support.
@@ -2212,6 +2212,7 @@  struct file_system_type {
 #define FS_USERNS_MOUNT		8	/* Can be mounted by userns root */
 #define FS_DISALLOW_NOTIFY_PERM	16	/* Disable fanotify permission events */
 #define FS_ALLOW_IDMAP         32      /* FS has been updated to handle vfs idmappings. */
+#define FS_MULTIGRAIN_TS	64	/* Filesystem uses multigrain timestamps */
 #define FS_RENAME_DOES_D_MOVE	32768	/* FS will handle d_move() during rename() internally. */
 	int (*init_fs_context)(struct fs_context *);
 	const struct fs_parameter_spec *parameters;
@@ -2235,6 +2236,67 @@  struct file_system_type {
 
 #define MODULE_ALIAS_FS(NAME) MODULE_ALIAS("fs-" NAME)
 
+/*
+ * Multigrain timestamps
+ *
+ * Conditionally use fine-grained ctime and mtime timestamps when there
+ * are users actively observing them via getattr. The primary use-case
+ * for this is NFS clients that use the ctime to distinguish between
+ * different states of the file, and that are often fooled by multiple
+ * operations that occur in the same coarse-grained timer tick.
+ */
+static inline bool is_multigrain_ts(const struct inode *inode)
+{
+	return inode->i_sb->s_type->fs_flags & FS_MULTIGRAIN_TS;
+}
+
+/*
+ * The kernel always keeps normalized struct timespec64 values in the ctime,
+ * which means that only the first 30 bits of the value are used. Use the
+ * 31st bit of the ctime's tv_nsec field as a flag to indicate that the value
+ * has been queried since it was last updated.
+ */
+#define I_CTIME_QUERIED		(1L<<30)
+
+/**
+ * ctime_nsec_peek - peek at (but don't query) the ctime tv_nsec field
+ * @inode: inode to fetch the ctime from
+ *
+ * Grab the current ctime tv_nsec field from the inode, mask off the
+ * I_CTIME_QUERIED flag and return it. This is mostly intended for use by
+ * internal consumers of the ctime that aren't concerned with ensuring a
+ * fine-grained update on the next change (e.g. when preparing to store
+ * the value in the backing store for later retrieval).
+ *
+ * This is safe to call regardless of whether the underlying filesystem
+ * is using multigrain timestamps.
+ */
+static inline long ctime_nsec_peek(const struct inode *inode)
+{
+	return inode->i_ctime.tv_nsec &~ I_CTIME_QUERIED;
+}
+
+/**
+ * ctime_peek - peek at (but don't query) the ctime
+ * @inode: inode to fetch the ctime from
+ *
+ * Grab the current ctime from the inode, sans I_CTIME_QUERIED flag. For
+ * use by internal consumers that don't require a fine-grained update on
+ * the next change.
+ *
+ * This is safe to call regardless of whether the underlying filesystem
+ * is using multigrain timestamps.
+ */
+static inline struct timespec64 ctime_peek(const struct inode *inode)
+{
+	struct timespec64 ctime;
+
+	ctime.tv_sec = inode->i_ctime.tv_sec;
+	ctime.tv_nsec = ctime_nsec_peek(inode);
+
+	return ctime;
+}
+
 extern struct dentry *mount_bdev(struct file_system_type *fs_type,
 	int flags, const char *dev_name, void *data,
 	int (*fill_super)(struct super_block *, void *, int));
@@ -2857,6 +2919,8 @@  extern void page_put_link(void *);
 extern int page_symlink(struct inode *inode, const char *symname, int len);
 extern const struct inode_operations page_symlink_inode_operations;
 extern void kfree_link(void *);
+void fill_multigrain_cmtime(u32 request_mask, struct inode *inode,
+			    struct kstat *stat);
 void generic_fillattr(struct mnt_idmap *, u32, struct inode *, struct kstat *);
 void generic_fill_statx_attr(struct inode *inode, struct kstat *stat);
 extern int vfs_getattr_nosec(const struct path *, struct kstat *, u32, unsigned int);