diff mbox series

[v8,RESEND,2/8] fs: clarify when the i_version counter must be updated

Message ID 20230124193025.185781-3-jlayton@kernel.org (mailing list archive)
State New, archived
Headers show
Series fs: clean up internal i_version handling | expand

Commit Message

Jeff Layton Jan. 24, 2023, 7:30 p.m. UTC
The i_version field in the kernel has had different semantics over
the decades, but NFSv4 has certain expectations. Update the comments
in iversion.h to describe when the i_version must change.

Cc: Colin Walters <walters@verbum.org>
Cc: NeilBrown <neilb@suse.de>
Cc: Trond Myklebust <trondmy@hammerspace.com>
Cc: Dave Chinner <david@fromorbit.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
 include/linux/iversion.h | 21 +++++++++++++++++++--
 1 file changed, 19 insertions(+), 2 deletions(-)

Comments

Jan Kara Jan. 25, 2023, 4:06 p.m. UTC | #1
On Tue 24-01-23 14:30:19, Jeff Layton wrote:
> The i_version field in the kernel has had different semantics over
> the decades, but NFSv4 has certain expectations. Update the comments
> in iversion.h to describe when the i_version must change.
> 
> Cc: Colin Walters <walters@verbum.org>
> Cc: NeilBrown <neilb@suse.de>
> Cc: Trond Myklebust <trondmy@hammerspace.com>
> Cc: Dave Chinner <david@fromorbit.com>
> Signed-off-by: Jeff Layton <jlayton@kernel.org>

Looks good to me. But one note below:

> diff --git a/include/linux/iversion.h b/include/linux/iversion.h
> index 6755d8b4f20b..fced8115a5f4 100644
> --- a/include/linux/iversion.h
> +++ b/include/linux/iversion.h
> @@ -9,8 +9,25 @@
>   * ---------------------------
>   * The change attribute (i_version) is mandated by NFSv4 and is mostly for
>   * knfsd, but is also used for other purposes (e.g. IMA). The i_version must
> - * appear different to observers if there was a change to the inode's data or
> - * metadata since it was last queried.
> + * appear larger to observers if there was an explicit change to the inode's
> + * data or metadata since it was last queried.
> + *
> + * An explicit change is one that would ordinarily result in a change to the
> + * inode status change time (aka ctime). i_version must appear to change, even
> + * if the ctime does not (since the whole point is to avoid missing updates due
> + * to timestamp granularity). If POSIX or other relevant spec mandates that the
> + * ctime must change due to an operation, then the i_version counter must be
> + * incremented as well.
> + *
> + * Making the i_version update completely atomic with the operation itself would
> + * be prohibitively expensive. Traditionally the kernel has updated the times on
> + * directories after an operation that changes its contents. For regular files,
> + * the ctime is usually updated before the data is copied into the cache for a
> + * write. This means that there is a window of time when an observer can
> + * associate a new timestamp with old file contents. Since the purpose of the
> + * i_version is to allow for better cache coherency, the i_version must always
> + * be updated after the results of the operation are visible. Updating it before
> + * and after a change is also permitted.

This sounds good but it is not the case for any of the current filesystems, is
it? Perhaps the documentation should mention this so that people are not
confused?

								Honza
Christian Brauner Jan. 26, 2023, 9:01 a.m. UTC | #2
On Tue, Jan 24, 2023 at 02:30:19PM -0500, Jeff Layton wrote:
> The i_version field in the kernel has had different semantics over
> the decades, but NFSv4 has certain expectations. Update the comments
> in iversion.h to describe when the i_version must change.
> 
> Cc: Colin Walters <walters@verbum.org>
> Cc: NeilBrown <neilb@suse.de>
> Cc: Trond Myklebust <trondmy@hammerspace.com>
> Cc: Dave Chinner <david@fromorbit.com>
> Signed-off-by: Jeff Layton <jlayton@kernel.org>
> ---

Looks good to me,
Reviewed-by: Christian Brauner <brauner@kernel.org>
Jeff Layton Jan. 26, 2023, 10:54 a.m. UTC | #3
On Wed, 2023-01-25 at 17:06 +0100, Jan Kara wrote:
> On Tue 24-01-23 14:30:19, Jeff Layton wrote:
> > The i_version field in the kernel has had different semantics over
> > the decades, but NFSv4 has certain expectations. Update the comments
> > in iversion.h to describe when the i_version must change.
> > 
> > Cc: Colin Walters <walters@verbum.org>
> > Cc: NeilBrown <neilb@suse.de>
> > Cc: Trond Myklebust <trondmy@hammerspace.com>
> > Cc: Dave Chinner <david@fromorbit.com>
> > Signed-off-by: Jeff Layton <jlayton@kernel.org>
> 
> Looks good to me. But one note below:
> 
> > diff --git a/include/linux/iversion.h b/include/linux/iversion.h
> > index 6755d8b4f20b..fced8115a5f4 100644
> > --- a/include/linux/iversion.h
> > +++ b/include/linux/iversion.h
> > @@ -9,8 +9,25 @@
> >   * ---------------------------
> >   * The change attribute (i_version) is mandated by NFSv4 and is mostly for
> >   * knfsd, but is also used for other purposes (e.g. IMA). The i_version must
> > - * appear different to observers if there was a change to the inode's data or
> > - * metadata since it was last queried.
> > + * appear larger to observers if there was an explicit change to the inode's
> > + * data or metadata since it was last queried.
> > + *
> > + * An explicit change is one that would ordinarily result in a change to the
> > + * inode status change time (aka ctime). i_version must appear to change, even
> > + * if the ctime does not (since the whole point is to avoid missing updates due
> > + * to timestamp granularity). If POSIX or other relevant spec mandates that the
> > + * ctime must change due to an operation, then the i_version counter must be
> > + * incremented as well.
> > + *
> > + * Making the i_version update completely atomic with the operation itself would
> > + * be prohibitively expensive. Traditionally the kernel has updated the times on
> > + * directories after an operation that changes its contents. For regular files,
> > + * the ctime is usually updated before the data is copied into the cache for a
> > + * write. This means that there is a window of time when an observer can
> > + * associate a new timestamp with old file contents. Since the purpose of the
> > + * i_version is to allow for better cache coherency, the i_version must always
> > + * be updated after the results of the operation are visible. Updating it before
> > + * and after a change is also permitted.
> 
> This sounds good but it is not the case for any of the current filesystems, is
> it? Perhaps the documentation should mention this so that people are not
> confused?
> 
> 								Honza

Correct. Currently, all filesystems change the times and version before
a write instead of after. I'm hoping that situation will change soon
though, as I've been working on a patchset to fix this for tmpfs, ext4
and btrfs.

If you still want to see something for this though, what would you
suggest for verbiage?

Thanks,
Jan Kara Jan. 26, 2023, 11:36 a.m. UTC | #4
On Thu 26-01-23 05:54:16, Jeff Layton wrote:
> On Wed, 2023-01-25 at 17:06 +0100, Jan Kara wrote:
> > On Tue 24-01-23 14:30:19, Jeff Layton wrote:
> > > The i_version field in the kernel has had different semantics over
> > > the decades, but NFSv4 has certain expectations. Update the comments
> > > in iversion.h to describe when the i_version must change.
> > > 
> > > Cc: Colin Walters <walters@verbum.org>
> > > Cc: NeilBrown <neilb@suse.de>
> > > Cc: Trond Myklebust <trondmy@hammerspace.com>
> > > Cc: Dave Chinner <david@fromorbit.com>
> > > Signed-off-by: Jeff Layton <jlayton@kernel.org>
> > 
> > Looks good to me. But one note below:
> > 
> > > diff --git a/include/linux/iversion.h b/include/linux/iversion.h
> > > index 6755d8b4f20b..fced8115a5f4 100644
> > > --- a/include/linux/iversion.h
> > > +++ b/include/linux/iversion.h
> > > @@ -9,8 +9,25 @@
> > >   * ---------------------------
> > >   * The change attribute (i_version) is mandated by NFSv4 and is mostly for
> > >   * knfsd, but is also used for other purposes (e.g. IMA). The i_version must
> > > - * appear different to observers if there was a change to the inode's data or
> > > - * metadata since it was last queried.
> > > + * appear larger to observers if there was an explicit change to the inode's
> > > + * data or metadata since it was last queried.
> > > + *
> > > + * An explicit change is one that would ordinarily result in a change to the
> > > + * inode status change time (aka ctime). i_version must appear to change, even
> > > + * if the ctime does not (since the whole point is to avoid missing updates due
> > > + * to timestamp granularity). If POSIX or other relevant spec mandates that the
> > > + * ctime must change due to an operation, then the i_version counter must be
> > > + * incremented as well.
> > > + *
> > > + * Making the i_version update completely atomic with the operation itself would
> > > + * be prohibitively expensive. Traditionally the kernel has updated the times on
> > > + * directories after an operation that changes its contents. For regular files,
> > > + * the ctime is usually updated before the data is copied into the cache for a
> > > + * write. This means that there is a window of time when an observer can
> > > + * associate a new timestamp with old file contents. Since the purpose of the
> > > + * i_version is to allow for better cache coherency, the i_version must always
> > > + * be updated after the results of the operation are visible. Updating it before
> > > + * and after a change is also permitted.
> > 
> > This sounds good but it is not the case for any of the current filesystems, is
> > it? Perhaps the documentation should mention this so that people are not
> > confused?
> 
> Correct. Currently, all filesystems change the times and version before
> a write instead of after. I'm hoping that situation will change soon
> though, as I've been working on a patchset to fix this for tmpfs, ext4
> and btrfs.

That is good but we'll see how long it takes to get merged. AFAIR it is not
a complete nobrainer ;)

> If you still want to see something for this though, what would you
> suggest for verbiage?

Sure:

... the i_version must a be updated after the results of the operation are
visible (note that none of the filesystems currently do this, it is a work
in progress to fix this).

And once your patches are merged, you can also delete this note :).

								Honza
Jeff Layton Jan. 26, 2023, 12:02 p.m. UTC | #5
On Thu, 2023-01-26 at 12:36 +0100, Jan Kara wrote:
> On Thu 26-01-23 05:54:16, Jeff Layton wrote:
> > On Wed, 2023-01-25 at 17:06 +0100, Jan Kara wrote:
> > > On Tue 24-01-23 14:30:19, Jeff Layton wrote:
> > > > The i_version field in the kernel has had different semantics over
> > > > the decades, but NFSv4 has certain expectations. Update the comments
> > > > in iversion.h to describe when the i_version must change.
> > > > 
> > > > Cc: Colin Walters <walters@verbum.org>
> > > > Cc: NeilBrown <neilb@suse.de>
> > > > Cc: Trond Myklebust <trondmy@hammerspace.com>
> > > > Cc: Dave Chinner <david@fromorbit.com>
> > > > Signed-off-by: Jeff Layton <jlayton@kernel.org>
> > > 
> > > Looks good to me. But one note below:
> > > 
> > > > diff --git a/include/linux/iversion.h b/include/linux/iversion.h
> > > > index 6755d8b4f20b..fced8115a5f4 100644
> > > > --- a/include/linux/iversion.h
> > > > +++ b/include/linux/iversion.h
> > > > @@ -9,8 +9,25 @@
> > > >   * ---------------------------
> > > >   * The change attribute (i_version) is mandated by NFSv4 and is mostly for
> > > >   * knfsd, but is also used for other purposes (e.g. IMA). The i_version must
> > > > - * appear different to observers if there was a change to the inode's data or
> > > > - * metadata since it was last queried.
> > > > + * appear larger to observers if there was an explicit change to the inode's
> > > > + * data or metadata since it was last queried.
> > > > + *
> > > > + * An explicit change is one that would ordinarily result in a change to the
> > > > + * inode status change time (aka ctime). i_version must appear to change, even
> > > > + * if the ctime does not (since the whole point is to avoid missing updates due
> > > > + * to timestamp granularity). If POSIX or other relevant spec mandates that the
> > > > + * ctime must change due to an operation, then the i_version counter must be
> > > > + * incremented as well.
> > > > + *
> > > > + * Making the i_version update completely atomic with the operation itself would
> > > > + * be prohibitively expensive. Traditionally the kernel has updated the times on
> > > > + * directories after an operation that changes its contents. For regular files,
> > > > + * the ctime is usually updated before the data is copied into the cache for a
> > > > + * write. This means that there is a window of time when an observer can
> > > > + * associate a new timestamp with old file contents. Since the purpose of the
> > > > + * i_version is to allow for better cache coherency, the i_version must always
> > > > + * be updated after the results of the operation are visible. Updating it before
> > > > + * and after a change is also permitted.
> > > 
> > > This sounds good but it is not the case for any of the current filesystems, is
> > > it? Perhaps the documentation should mention this so that people are not
> > > confused?
> > 
> > Correct. Currently, all filesystems change the times and version before
> > a write instead of after. I'm hoping that situation will change soon
> > though, as I've been working on a patchset to fix this for tmpfs, ext4
> > and btrfs.
> 
> That is good but we'll see how long it takes to get merged. AFAIR it is not
> a complete nobrainer ;)
> 
> > If you still want to see something for this though, what would you
> > suggest for verbiage?
> 
> Sure:
> 
> ... the i_version must a be updated after the results of the operation are
> visible (note that none of the filesystems currently do this, it is a work
> in progress to fix this).
> 
> And once your patches are merged, you can also delete this note :).
> 
> 								Honza

Sounds good, I folded something similar to that into the patch and
pushed it into the branch I'm feeding into linux-next. I won't bother
re-posting for just that though:

diff --git a/include/linux/iversion.h b/include/linux/iversion.h
index fced8115a5f4..f174ff1b59ee 100644
--- a/include/linux/iversion.h
+++ b/include/linux/iversion.h
@@ -27,7 +27,8 @@
  * associate a new timestamp with old file contents. Since the purpose of the
  * i_version is to allow for better cache coherency, the i_version must always
  * be updated after the results of the operation are visible. Updating it before
- * and after a change is also permitted.
+ * and after a change is also permitted. (Note that no filesystems currently do
+ * this. Fixing that is a work-in-progress).
  *
  * Observers see the i_version as a 64-bit number that never decreases. If it
  * remains the same since it was last checked, then nothing has changed in the

Thanks!
diff mbox series

Patch

diff --git a/include/linux/iversion.h b/include/linux/iversion.h
index 6755d8b4f20b..fced8115a5f4 100644
--- a/include/linux/iversion.h
+++ b/include/linux/iversion.h
@@ -9,8 +9,25 @@ 
  * ---------------------------
  * The change attribute (i_version) is mandated by NFSv4 and is mostly for
  * knfsd, but is also used for other purposes (e.g. IMA). The i_version must
- * appear different to observers if there was a change to the inode's data or
- * metadata since it was last queried.
+ * appear larger to observers if there was an explicit change to the inode's
+ * data or metadata since it was last queried.
+ *
+ * An explicit change is one that would ordinarily result in a change to the
+ * inode status change time (aka ctime). i_version must appear to change, even
+ * if the ctime does not (since the whole point is to avoid missing updates due
+ * to timestamp granularity). If POSIX or other relevant spec mandates that the
+ * ctime must change due to an operation, then the i_version counter must be
+ * incremented as well.
+ *
+ * Making the i_version update completely atomic with the operation itself would
+ * be prohibitively expensive. Traditionally the kernel has updated the times on
+ * directories after an operation that changes its contents. For regular files,
+ * the ctime is usually updated before the data is copied into the cache for a
+ * write. This means that there is a window of time when an observer can
+ * associate a new timestamp with old file contents. Since the purpose of the
+ * i_version is to allow for better cache coherency, the i_version must always
+ * be updated after the results of the operation are visible. Updating it before
+ * and after a change is also permitted.
  *
  * Observers see the i_version as a 64-bit number that never decreases. If it
  * remains the same since it was last checked, then nothing has changed in the