diff mbox series

iversion: update comments with info about atime updates

Message ID 20220822133309.86005-1-jlayton@kernel.org (mailing list archive)
State New, archived
Headers show
Series iversion: update comments with info about atime updates | expand

Commit Message

Jeff Layton Aug. 22, 2022, 1:33 p.m. UTC
Add an explicit paragraph codifying that atime updates due to reads
should not be counted against the i_version counter. None of the
existing subsystems that use the i_version want those counted, and
there is an easy workaround for those that do.

Cc: NeilBrown <neilb@suse.de>
Cc: Trond Myklebust <trondmy@hammerspace.com>
Cc: Dave Chinner <david@fromorbit.com>
Link: https://lore.kernel.org/linux-xfs/166086932784.5425.17134712694961326033@noble.neil.brown.name/#t
Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
 include/linux/iversion.h | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

Comments

Mimi Zohar Aug. 22, 2022, 3:40 p.m. UTC | #1
On Mon, 2022-08-22 at 09:33 -0400, Jeff Layton wrote:
> Add an explicit paragraph codifying that atime updates due to reads
> should not be counted against the i_version counter. None of the
> existing subsystems that use the i_version want those counted, and
> there is an easy workaround for those that do.
> 
> Cc: NeilBrown <neilb@suse.de>
> Cc: Trond Myklebust <trondmy@hammerspace.com>
> Cc: Dave Chinner <david@fromorbit.com>
> Link: https://lore.kernel.org/linux-xfs/166086932784.5425.17134712694961326033@noble.neil.brown.name/#t
> Signed-off-by: Jeff Layton <jlayton@kernel.org>
> ---
>  include/linux/iversion.h | 10 ++++++++--
>  1 file changed, 8 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/iversion.h b/include/linux/iversion.h
> index 3bfebde5a1a6..da6cc1cc520a 100644
> --- a/include/linux/iversion.h
> +++ b/include/linux/iversion.h
> @@ -9,8 +9,8 @@
>   * ---------------------------
>   * The change attribute (i_version) is mandated by NFSv4 and is mostly for
>   * knfsd, but is also used for other purposes (e.g. IMA). The i_version must
> - * appear different to observers if there was a change to the inode's data or
> - * metadata since it was last queried.
> + * appear different to observers if there was an explicit change to the inode's
> + * data or metadata since it was last queried.
>   *
>   * Observers see the i_version as a 64-bit number that never decreases. If it
>   * remains the same since it was last checked, then nothing has changed in the
> @@ -18,6 +18,12 @@
>   * anything about the nature or magnitude of the changes from the value, only
>   * that the inode has changed in some fashion.
>   *
> + * Note that atime updates due to reads or similar activity do _not_ represent
> + * an explicit change to the inode. If the only change is to the atime and it

Thanks, Jeff.  The ext4 patch increments i_version on file metadata
changes.  Could the wording here be more explicit to reflect changes
based on either inode data or metadata changes?

thanks,

Mimi

> + * wasn't set via utimes() or a similar mechanism, then i_version should not be
> + * incremented. If an observer cares about atime updates, it should plan to
> + * fetch and store them in conjunction with the i_version.
> + *
>   * Not all filesystems properly implement the i_version counter. Subsystems that
>   * want to use i_version field on an inode should first check whether the
>   * filesystem sets the SB_I_VERSION flag (usually via the IS_I_VERSION macro).
Jeff Layton Aug. 22, 2022, 4:22 p.m. UTC | #2
On Mon, 2022-08-22 at 11:40 -0400, Mimi Zohar wrote:
> On Mon, 2022-08-22 at 09:33 -0400, Jeff Layton wrote:
> > Add an explicit paragraph codifying that atime updates due to reads
> > should not be counted against the i_version counter. None of the
> > existing subsystems that use the i_version want those counted, and
> > there is an easy workaround for those that do.
> > 
> > Cc: NeilBrown <neilb@suse.de>
> > Cc: Trond Myklebust <trondmy@hammerspace.com>
> > Cc: Dave Chinner <david@fromorbit.com>
> > Link: https://lore.kernel.org/linux-xfs/166086932784.5425.17134712694961326033@noble.neil.brown.name/#t
> > Signed-off-by: Jeff Layton <jlayton@kernel.org>
> > ---
> >  include/linux/iversion.h | 10 ++++++++--
> >  1 file changed, 8 insertions(+), 2 deletions(-)
> > 
> > diff --git a/include/linux/iversion.h b/include/linux/iversion.h
> > index 3bfebde5a1a6..da6cc1cc520a 100644
> > --- a/include/linux/iversion.h
> > +++ b/include/linux/iversion.h
> > @@ -9,8 +9,8 @@
> >   * ---------------------------
> >   * The change attribute (i_version) is mandated by NFSv4 and is mostly for
> >   * knfsd, but is also used for other purposes (e.g. IMA). The i_version must
> > - * appear different to observers if there was a change to the inode's data or
> > - * metadata since it was last queried.
> > + * appear different to observers if there was an explicit change to the inode's
> > + * data or metadata since it was last queried.
> >   *
> >   * Observers see the i_version as a 64-bit number that never decreases. If it
> >   * remains the same since it was last checked, then nothing has changed in the
> > @@ -18,6 +18,12 @@
> >   * anything about the nature or magnitude of the changes from the value, only
> >   * that the inode has changed in some fashion.
> >   *
> > + * Note that atime updates due to reads or similar activity do _not_ represent
> > + * an explicit change to the inode. If the only change is to the atime and it
> 
> Thanks, Jeff.  The ext4 patch increments i_version on file metadata
> changes.  Could the wording here be more explicit to reflect changes
> based on either inode data or metadata changes?
> 
> 

Thanks Mimi,

Care to suggest some wording?

The main issue we have is that ext4 and xfs both increment i_version on
atime updates due to reads. I have patches in flight to fix those, but
going forward, we want to ensure that i_version gets incremented on all
changes _except_ for atime updates.

The best wording we have at the moment is what Trond suggested, which is
to classify the changes to the inode as "explicit" (someone or something
made a deliberate change to the inode) and "implicit" (the change to the
inode was due to activity such as reads that don't actually change
anything).

Is there a better way to describe this?

> > + * wasn't set via utimes() or a similar mechanism, then i_version should not be
> > + * incremented. If an observer cares about atime updates, it should plan to
> > + * fetch and store them in conjunction with the i_version.
> > + *
> >   * Not all filesystems properly implement the i_version counter. Subsystems that
> >   * want to use i_version field on an inode should first check whether the
> >   * filesystem sets the SB_I_VERSION flag (usually via the IS_I_VERSION macro).
> 
>
Mimi Zohar Aug. 22, 2022, 5:39 p.m. UTC | #3
On Mon, 2022-08-22 at 12:22 -0400, Jeff Layton wrote:
> On Mon, 2022-08-22 at 11:40 -0400, Mimi Zohar wrote:
> > On Mon, 2022-08-22 at 09:33 -0400, Jeff Layton wrote:
> > > Add an explicit paragraph codifying that atime updates due to reads
> > > should not be counted against the i_version counter. None of the
> > > existing subsystems that use the i_version want those counted, and
> > > there is an easy workaround for those that do.
> > > 
> > > Cc: NeilBrown <neilb@suse.de>
> > > Cc: Trond Myklebust <trondmy@hammerspace.com>
> > > Cc: Dave Chinner <david@fromorbit.com>
> > > Link: https://lore.kernel.org/linux-xfs/166086932784.5425.17134712694961326033@noble.neil.brown.name/#t
> > > Signed-off-by: Jeff Layton <jlayton@kernel.org>
> > > ---
> > >  include/linux/iversion.h | 10 ++++++++--
> > >  1 file changed, 8 insertions(+), 2 deletions(-)
> > > 
> > > diff --git a/include/linux/iversion.h b/include/linux/iversion.h
> > > index 3bfebde5a1a6..da6cc1cc520a 100644
> > > --- a/include/linux/iversion.h
> > > +++ b/include/linux/iversion.h
> > > @@ -9,8 +9,8 @@
> > >   * ---------------------------
> > >   * The change attribute (i_version) is mandated by NFSv4 and is mostly for
> > >   * knfsd, but is also used for other purposes (e.g. IMA). The i_version must
> > > - * appear different to observers if there was a change to the inode's data or
> > > - * metadata since it was last queried.
> > > + * appear different to observers if there was an explicit change to the inode's
> > > + * data or metadata since it was last queried.
> > >   *
> > >   * Observers see the i_version as a 64-bit number that never decreases. If it
> > >   * remains the same since it was last checked, then nothing has changed in the
> > > @@ -18,6 +18,12 @@
> > >   * anything about the nature or magnitude of the changes from the value, only
> > >   * that the inode has changed in some fashion.
> > >   *
> > > + * Note that atime updates due to reads or similar activity do _not_ represent
> > > + * an explicit change to the inode. If the only change is to the atime and it
> > 
> > Thanks, Jeff.  The ext4 patch increments i_version on file metadata
> > changes.  Could the wording here be more explicit to reflect changes
> > based on either inode data or metadata changes?b
> > 
> > 
> 
> Thanks Mimi,
> 
> Care to suggest some wording?
> 
> The main issue we have is that ext4 and xfs both increment i_version on
> atime updates due to reads. I have patches in flight to fix those, but
> going forward, we want to ensure that i_version gets incremented on all
> changes _except_ for atime updates.
> 
> The best wording we have at the moment is what Trond suggested, which is
> to classify the changes to the inode as "explicit" (someone or something
> made a deliberate change to the inode) and "implicit" (the change to the
> inode was due to activity such as reads that don't actually change
> anything).
> 
> Is there a better way to describe this?

"explicit change to the inode" probably implies both the inode file
data and metadata, but let's call it out by saying "an explicit change
to either the inode data or metadata".

> 
> > > + * wasn't set via utimes() or a similar mechanism, then i_version should not be
> > > + * incremented. If an observer cares about atime updates, it should plan to
> > > + * fetch and store them in conjunction with the i_version.
> > > + *
> > >   * Not all filesystems properly implement the i_version counter. Subsystems that
> > >   * want to use i_version field on an inode should first check whether the
> > >   * filesystem sets the SB_I_VERSION flag (usually via the IS_I_VERSION macro).
> > 
> > 
>
Jeff Layton Aug. 22, 2022, 6:22 p.m. UTC | #4
On Mon, 2022-08-22 at 13:39 -0400, Mimi Zohar wrote:
> On Mon, 2022-08-22 at 12:22 -0400, Jeff Layton wrote:
> > On Mon, 2022-08-22 at 11:40 -0400, Mimi Zohar wrote:
> > > On Mon, 2022-08-22 at 09:33 -0400, Jeff Layton wrote:
> > > > Add an explicit paragraph codifying that atime updates due to reads
> > > > should not be counted against the i_version counter. None of the
> > > > existing subsystems that use the i_version want those counted, and
> > > > there is an easy workaround for those that do.
> > > > 
> > > > Cc: NeilBrown <neilb@suse.de>
> > > > Cc: Trond Myklebust <trondmy@hammerspace.com>
> > > > Cc: Dave Chinner <david@fromorbit.com>
> > > > Link: https://lore.kernel.org/linux-xfs/166086932784.5425.17134712694961326033@noble.neil.brown.name/#t
> > > > Signed-off-by: Jeff Layton <jlayton@kernel.org>
> > > > ---
> > > >  include/linux/iversion.h | 10 ++++++++--
> > > >  1 file changed, 8 insertions(+), 2 deletions(-)
> > > > 
> > > > diff --git a/include/linux/iversion.h b/include/linux/iversion.h
> > > > index 3bfebde5a1a6..da6cc1cc520a 100644
> > > > --- a/include/linux/iversion.h
> > > > +++ b/include/linux/iversion.h
> > > > @@ -9,8 +9,8 @@
> > > >   * ---------------------------
> > > >   * The change attribute (i_version) is mandated by NFSv4 and is mostly for
> > > >   * knfsd, but is also used for other purposes (e.g. IMA). The i_version must
> > > > - * appear different to observers if there was a change to the inode's data or
> > > > - * metadata since it was last queried.
> > > > + * appear different to observers if there was an explicit change to the inode's
> > > > + * data or metadata since it was last queried.
> > > >   *
> > > >   * Observers see the i_version as a 64-bit number that never decreases. If it
> > > >   * remains the same since it was last checked, then nothing has changed in the
> > > > @@ -18,6 +18,12 @@
> > > >   * anything about the nature or magnitude of the changes from the value, only
> > > >   * that the inode has changed in some fashion.
> > > >   *
> > > > + * Note that atime updates due to reads or similar activity do _not_ represent
> > > > + * an explicit change to the inode. If the only change is to the atime and it
> > > 
> > > Thanks, Jeff.  The ext4 patch increments i_version on file metadata
> > > changes.  Could the wording here be more explicit to reflect changes
> > > based on either inode data or metadata changes?b
> > > 
> > > 
> > 
> > Thanks Mimi,
> > 
> > Care to suggest some wording?
> > 
> > The main issue we have is that ext4 and xfs both increment i_version on
> > atime updates due to reads. I have patches in flight to fix those, but
> > going forward, we want to ensure that i_version gets incremented on all
> > changes _except_ for atime updates.
> > 
> > The best wording we have at the moment is what Trond suggested, which is
> > to classify the changes to the inode as "explicit" (someone or something
> > made a deliberate change to the inode) and "implicit" (the change to the
> > inode was due to activity such as reads that don't actually change
> > anything).
> > 
> > Is there a better way to describe this?
> 
> "explicit change to the inode" probably implies both the inode file
> data and metadata, but let's call it out by saying "an explicit change
> to either the inode data or metadata".
> 
> > 
> > > > + * wasn't set via utimes() or a similar mechanism, then i_version should not be
> > > > + * incremented. If an observer cares about atime updates, it should plan to
> > > > + * fetch and store them in conjunction with the i_version.
> > > > + *
> > > >   * Not all filesystems properly implement the i_version counter. Subsystems that
> > > >   * want to use i_version field on an inode should first check whether the
> > > >   * filesystem sets the SB_I_VERSION flag (usually via the IS_I_VERSION macro).
> > > 
> > > 
> > 
> 
> 

Thanks Mimi,

Here's what I have now. I'll plan to send a v2 patch once others have
had a chance to comment as well.

-- Jeff

diff --git a/include/linux/iversion.h b/include/linux/iversion.h
index 3bfebde5a1a6..524abd372100 100644
--- a/include/linux/iversion.h
+++ b/include/linux/iversion.h
@@ -9,8 +9,8 @@
  * ---------------------------
  * The change attribute (i_version) is mandated by NFSv4 and is mostly for
  * knfsd, but is also used for other purposes (e.g. IMA). The i_version must
- * appear different to observers if there was a change to the inode's data or
- * metadata since it was last queried.
+ * appear different to observers if there was an explicit change to the inode's
+ * data or metadata since it was last queried.
  *
  * Observers see the i_version as a 64-bit number that never decreases. If it
  * remains the same since it was last checked, then nothing has changed in the
@@ -18,6 +18,13 @@
  * anything about the nature or magnitude of the changes from the value, only
  * that the inode has changed in some fashion.
  *
+ * Note that atime updates due to reads or similar activity do not represent
+ * an explicit change to the inode data or metadata. If the only change is to
+ * the atime and it wasn't set via utimes() or a similar mechanism, then
+ * i_version should not be incremented. If an observer cares about atime
+ * updates, it should plan to fetch and store them in conjunction with the
+ * i_version.
+ *
  * Not all filesystems properly implement the i_version counter. Subsystems that
  * want to use i_version field on an inode should first check whether the
  * filesystem sets the SB_I_VERSION flag (usually via the IS_I_VERSION macro).
NeilBrown Aug. 22, 2022, 10:42 p.m. UTC | #5
On Mon, 22 Aug 2022, Jeff Layton wrote:
> Add an explicit paragraph codifying that atime updates due to reads
> should not be counted against the i_version counter. None of the
> existing subsystems that use the i_version want those counted, and
> there is an easy workaround for those that do.
> 
> Cc: NeilBrown <neilb@suse.de>
> Cc: Trond Myklebust <trondmy@hammerspace.com>
> Cc: Dave Chinner <david@fromorbit.com>
> Link: https://lore.kernel.org/linux-xfs/166086932784.5425.17134712694961326033@noble.neil.brown.name/#t
> Signed-off-by: Jeff Layton <jlayton@kernel.org>
> ---
>  include/linux/iversion.h | 10 ++++++++--
>  1 file changed, 8 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/iversion.h b/include/linux/iversion.h
> index 3bfebde5a1a6..da6cc1cc520a 100644
> --- a/include/linux/iversion.h
> +++ b/include/linux/iversion.h
> @@ -9,8 +9,8 @@
>   * ---------------------------
>   * The change attribute (i_version) is mandated by NFSv4 and is mostly for
>   * knfsd, but is also used for other purposes (e.g. IMA). The i_version must
> - * appear different to observers if there was a change to the inode's data or
> - * metadata since it was last queried.
> + * appear different to observers if there was an explicit change to the inode's
> + * data or metadata since it was last queried.

Should rename change the i_version?
It does not explicitly change data or metadata, though it seems to
implicitly change the ctime.

>   *
>   * Observers see the i_version as a 64-bit number that never decreases. If it
>   * remains the same since it was last checked, then nothing has changed in the
> @@ -18,6 +18,12 @@
>   * anything about the nature or magnitude of the changes from the value, only
>   * that the inode has changed in some fashion.
>   *
> + * Note that atime updates due to reads or similar activity do _not_ represent
> + * an explicit change to the inode. If the only change is to the atime and it
> + * wasn't set via utimes() or a similar mechanism, then i_version should not be
> + * incremented. If an observer cares about atime updates, it should plan to
> + * fetch and store them in conjunction with the i_version.
> + *

If an implicit atime update happened to make the atime go backwards
(possible, but not common), the updating i_version should be permitted,
and possibly should be preferred.

NeilBrown


>   * Not all filesystems properly implement the i_version counter. Subsystems that
>   * want to use i_version field on an inode should first check whether the
>   * filesystem sets the SB_I_VERSION flag (usually via the IS_I_VERSION macro).
> -- 
> 2.37.2
> 
>
Trond Myklebust Aug. 22, 2022, 11:26 p.m. UTC | #6
On Tue, 2022-08-23 at 08:42 +1000, NeilBrown wrote:
> On Mon, 22 Aug 2022, Jeff Layton wrote:
> > Add an explicit paragraph codifying that atime updates due to reads
> > should not be counted against the i_version counter. None of the
> > existing subsystems that use the i_version want those counted, and
> > there is an easy workaround for those that do.
> > 
> > Cc: NeilBrown <neilb@suse.de>
> > Cc: Trond Myklebust <trondmy@hammerspace.com>
> > Cc: Dave Chinner <david@fromorbit.com>
> > Link:
> > https://lore.kernel.org/linux-xfs/166086932784.5425.17134712694961326033@noble.neil.brown.name/#t
> > Signed-off-by: Jeff Layton <jlayton@kernel.org>
> > ---
> >  include/linux/iversion.h | 10 ++++++++--
> >  1 file changed, 8 insertions(+), 2 deletions(-)
> > 
> > diff --git a/include/linux/iversion.h b/include/linux/iversion.h
> > index 3bfebde5a1a6..da6cc1cc520a 100644
> > --- a/include/linux/iversion.h
> > +++ b/include/linux/iversion.h
> > @@ -9,8 +9,8 @@
> >   * ---------------------------
> >   * The change attribute (i_version) is mandated by NFSv4 and is
> > mostly for
> >   * knfsd, but is also used for other purposes (e.g. IMA). The
> > i_version must
> > - * appear different to observers if there was a change to the
> > inode's data or
> > - * metadata since it was last queried.
> > + * appear different to observers if there was an explicit change
> > to the inode's
> > + * data or metadata since it was last queried.
> 
> Should rename change the i_version?
> It does not explicitly change data or metadata, though it seems to
> implicitly change the ctime.

Actually, POSIX only requires that the mtime and ctime change on the
source and target directory. There is no requirement that the ctime
change on the file itself, although such a change is permitted by the
spec in order to allow for existing filesystem implementations.

https://pubs.opengroup.org/onlinepubs/9699919799/functions/rename.html

I'd prefer not changing the i_version on the file on rename, but could
live with an implementation that copies the ctime behaviour.

> 
> >   *
> >   * Observers see the i_version as a 64-bit number that never
> > decreases. If it
> >   * remains the same since it was last checked, then nothing has
> > changed in the
> > @@ -18,6 +18,12 @@
> >   * anything about the nature or magnitude of the changes from the
> > value, only
> >   * that the inode has changed in some fashion.
> >   *
> > + * Note that atime updates due to reads or similar activity do
> > _not_ represent
> > + * an explicit change to the inode. If the only change is to the
> > atime and it
> > + * wasn't set via utimes() or a similar mechanism, then i_version
> > should not be
> > + * incremented. If an observer cares about atime updates, it
> > should plan to
> > + * fetch and store them in conjunction with the i_version.
> > + *
> 
> If an implicit atime update happened to make the atime go backwards
> (possible, but not common), the updating i_version should be
> permitted,
> and possibly should be preferred.
> 

Maybe.

> NeilBrown
> 
> 
> >   * Not all filesystems properly implement the i_version counter.
> > Subsystems that
> >   * want to use i_version field on an inode should first check
> > whether the
> >   * filesystem sets the SB_I_VERSION flag (usually via the
> > IS_I_VERSION macro).
> > -- 
> > 2.37.2
> > 
> >
Dave Chinner Aug. 22, 2022, 11:32 p.m. UTC | #7
On Mon, Aug 22, 2022 at 02:22:20PM -0400, Jeff Layton wrote:
> diff --git a/include/linux/iversion.h b/include/linux/iversion.h
> index 3bfebde5a1a6..524abd372100 100644
> --- a/include/linux/iversion.h
> +++ b/include/linux/iversion.h
> @@ -9,8 +9,8 @@
>   * ---------------------------
>   * The change attribute (i_version) is mandated by NFSv4 and is mostly for
>   * knfsd, but is also used for other purposes (e.g. IMA). The i_version must
> - * appear different to observers if there was a change to the inode's data or
> - * metadata since it was last queried.
> + * appear different to observers if there was an explicit change to the inode's
> + * data or metadata since it was last queried.
>   *
>   * Observers see the i_version as a 64-bit number that never decreases. If it
>   * remains the same since it was last checked, then nothing has changed in the
> @@ -18,6 +18,13 @@
>   * anything about the nature or magnitude of the changes from the value, only
>   * that the inode has changed in some fashion.
>   *
> + * Note that atime updates due to reads or similar activity do not represent

What does "or similar activity" mean?

This whole atime vs iversion issue is arising because "or any
metadata change" was interpretted literally by filesystems to mean
"any metadata change" without caveats or exclusions. Now you're both
changing that definition and making things *worse* by adding
explicit wiggle-room for future changes in behaviour to persistent
change counter behaviour.

iversion is going to be exposed to userspace, so we *can't change
the definition in future* because behaviour is bound by "changes may
break userspace apps" constraints. IOWs, we cannot justify changes
in behaviour with "but there are only in-kernel users" like is being
done for this change.

It's only a matter of time before someone is going to complain about
the fact that filesystems bump the change counter for internal
metadata modifications as they make changes to the persistent state
of data and metadata. These existing behaviours will almost
certainly causes visible NFS quirks due to unexpected
iversion bumps.

In case you didn't realise, XFS can bump iversion 500+ times for a
single 1MB write() on a 4kB block size filesytem, and only one of
them is initial write() system call that copies the data into the
page cache. The other 500+ are all the extent allocation and
manipulation transactions that we might run when persisting the data
to disk tens of seconds later. This is how iversion on XFS has
behaved for the past decade.

Right now, both ext4 and XFS conform to the exact definition that is
in this file. Trond has outlines that NFS wants iversion to behave
exactly like c/mtime changes, but that is fundamentally different to
how both ext4 and XFS have implemented the persistent change
counters.

IOWs, if we are going to start randomly excluding specific metadata
from the iversion API, then we need a full definition of exactly
what operations are supposed to trigger an iversion bump.
API-design-by-application-deficiency-whack-a-mole is not appropriate
for persistent structures, nor is it appropriate for information
that is going to be exposed to userspace via the kernel ABI.

Therefore, before we change behaviour in the filesystems or expose
it to userspace, we need to define *exactly* what changes are
covered by iversion. Once we have that definition, then we can
modify the filesytems appropriately to follow the required
definition and only change the persistent iversion counter when the
definition says we should change it.

While Trond's description is a useful definition from the
application perspective - and I'm not opposed to making it behave
like that (i.e. iversion only bumped with c/mtime changes) - it
requires a fundamentally different implementation in filesystems.

We must no longer capture *every metadata change* as we are
currently doing as per the current specification, and instead only
capture *application metadata changes* only.  i.e.  we are going to
need an on-disk format change of some kind because the two
behaviours are not compatible.

Further, if iversion is going to be extended to userspace, we most
definitely need to decouple it from our internal "change on every
metadata change" behaviour. We change how we persist metadata to
disk over time and if we don't abstract that away from the
persistent change counter we expose to userspace then this will lead
to random user visible changes in iversion behaviour in future.

Any way I look at it, we're at the point where filesystems now need
a real, fixed definition of what changes require an iversion bump
and which don't. The other option is that move the iversion bumping
completely up into the VFS where it happens consistently for every
filesystem whether they persist it or not. We also need a set of
test code for fstests that check that iversion performs as expected
and to the specification that the VFS defines for statx.

Either way we chose, one of these options are the only way that we
will end up with a consistent implementation of a change counter
across all the filesystems. And, quite frankly, that's exactly is
needed if we are going to present this information to userspace
forever more.

-Dave.
Jeff Layton Aug. 23, 2022, 11:21 a.m. UTC | #8
On Tue, 2022-08-23 at 09:32 +1000, Dave Chinner wrote:
> On Mon, Aug 22, 2022 at 02:22:20PM -0400, Jeff Layton wrote:
> > diff --git a/include/linux/iversion.h b/include/linux/iversion.h
> > index 3bfebde5a1a6..524abd372100 100644
> > --- a/include/linux/iversion.h
> > +++ b/include/linux/iversion.h
> > @@ -9,8 +9,8 @@
> >   * ---------------------------
> >   * The change attribute (i_version) is mandated by NFSv4 and is mostly for
> >   * knfsd, but is also used for other purposes (e.g. IMA). The i_version must
> > - * appear different to observers if there was a change to the inode's data or
> > - * metadata since it was last queried.
> > + * appear different to observers if there was an explicit change to the inode's
> > + * data or metadata since it was last queried.
> >   *
> >   * Observers see the i_version as a 64-bit number that never decreases. If it
> >   * remains the same since it was last checked, then nothing has changed in the
> > @@ -18,6 +18,13 @@
> >   * anything about the nature or magnitude of the changes from the value, only
> >   * that the inode has changed in some fashion.
> >   *
> > + * Note that atime updates due to reads or similar activity do not represent
> 
> What does "or similar activity" mean?
> 

Some examples:

- readdir() in a directory
- readlink() on symlink
- mmap reads

...basically, things that access data without materially changing it.

> This whole atime vs iversion issue is arising because "or any
> metadata change" was interpretted literally by filesystems to mean
> "any metadata change" without caveats or exclusions. Now you're both
> changing that definition and making things *worse* by adding
> explicit wiggle-room for future changes in behaviour to persistent
> change counter behaviour.
> 
> iversion is going to be exposed to userspace, so we *can't change
> the definition in future* because behaviour is bound by "changes may
> break userspace apps" constraints. IOWs, we cannot justify changes
> in behaviour with "but there are only in-kernel users" like is being
> done for this change.
> 
> It's only a matter of time before someone is going to complain about
> the fact that filesystems bump the change counter for internal
> metadata modifications as they make changes to the persistent state
> of data and metadata. These existing behaviours will almost
> certainly causes visible NFS quirks due to unexpected
> iversion bumps.
> 
> In case you didn't realise, XFS can bump iversion 500+ times for a
> single 1MB write() on a 4kB block size filesytem, and only one of
> them is initial write() system call that copies the data into the
> page cache. The other 500+ are all the extent allocation and
> manipulation transactions that we might run when persisting the data
> to disk tens of seconds later. This is how iversion on XFS has
> behaved for the past decade.
> 

Bumping the change count multiple times internally for a single change
is not a problem. From the comments in iversion.h:

 * Observers see the i_version as a 64-bit number that never decreases. If it
 * remains the same since it was last checked, then nothing has changed in the
 * inode. If it's different then something has changed. Observers cannot infer
 * anything about the nature or magnitude of the changes from the value, only
 * that the inode has changed in some fashion.

Bumping it once or multiple times still conforms to how we have this
defined.

> Right now, both ext4 and XFS conform to the exact definition that is
> in this file. Trond has outlines that NFS wants iversion to behave
> exactly like c/mtime changes, but that is fundamentally different to
> how both ext4 and XFS have implemented the persistent change
> counters.
> 
> IOWs, if we are going to start randomly excluding specific metadata
> from the iversion API, then we need a full definition of exactly
> what operations are supposed to trigger an iversion bump.
> API-design-by-application-deficiency-whack-a-mole is not appropriate
> for persistent structures, nor is it appropriate for information
> that is going to be exposed to userspace via the kernel ABI.
> 
> Therefore, before we change behaviour in the filesystems or expose
> it to userspace, we need to define *exactly* what changes are
> covered by iversion. Once we have that definition, then we can
> modify the filesytems appropriately to follow the required
> definition and only change the persistent iversion counter when the
> definition says we should change it.
> 
> While Trond's description is a useful definition from the
> application perspective - and I'm not opposed to making it behave
> like that (i.e. iversion only bumped with c/mtime changes) - it
> requires a fundamentally different implementation in filesystems.
> 
> We must no longer capture *every metadata change* as we are
> currently doing as per the current specification, and instead only
> capture *application metadata changes* only.  i.e.  we are going to
> need an on-disk format change of some kind because the two
> behaviours are not compatible.
> 
> Further, if iversion is going to be extended to userspace, we most
> definitely need to decouple it from our internal "change on every
> metadata change" behaviour. We change how we persist metadata to
> disk over time and if we don't abstract that away from the
> persistent change counter we expose to userspace then this will lead
> to random user visible changes in iversion behaviour in future.
> 
> Any way I look at it, we're at the point where filesystems now need
> a real, fixed definition of what changes require an iversion bump
> and which don't. The other option is that move the iversion bumping
> completely up into the VFS where it happens consistently for every
> filesystem whether they persist it or not. We also need a set of
> test code for fstests that check that iversion performs as expected
> and to the specification that the VFS defines for statx.
> 

I have an initial test for this that we can also build on on later. It
does require the patch to add support for STATX_INO_VERSION however.

> Either way we chose, one of these options are the only way that we
> will end up with a consistent implementation of a change counter
> across all the filesystems. And, quite frankly, that's exactly is
> needed if we are going to present this information to userspace
> forever more.
> 

I agree that we need a real definition of what changes should be
represented in this value. My intent was to add that to the statx
manpage once we had gotten a little further along, but it won't hurt to
hash that out first.

I do not intend to exhaustively list all possible activities that should
and shouldn't update the i_version. It's as difficult to put together a
comprehensive list of what activities should and shouldn't change the
i_version as it is to list what activities should and shouldn't cause
the mtime/ctime/atime to change. The list is also going to constantly
change as our interfaces change.

What may be best is to just define this value in terms of how timestamps
get updated, since POSIX already specifies that. Section 4.9 in the
POSIX spec discusses file time updates:

    https://pubs.opengroup.org/onlinepubs/9699919799/

It says:

"Each function or utility in POSIX.1-2017 that reads or writes data
(even if the data does not change) or performs an operation to change
file status (even if the file status does not change) indicates which of
the appropriate timestamps shall be marked for update."

So, we can refer to that and simply say:

"If the function updates the mtime or ctime on the inode, then the
i_version should be incremented. If only the atime is being updated,
then the i_version should not be incremented. The exception to this rule
is explicit atime updates via utimes() or similar mechanism, which
should result in the i_version being incremented."
NeilBrown Aug. 23, 2022, 11:38 a.m. UTC | #9
On Tue, 23 Aug 2022, Jeff Layton wrote:
> So, we can refer to that and simply say:
> 
> "If the function updates the mtime or ctime on the inode, then the
> i_version should be incremented. If only the atime is being updated,
> then the i_version should not be incremented. The exception to this rule
> is explicit atime updates via utimes() or similar mechanism, which
> should result in the i_version being incremented."

Is that exception needed?  utimes() updates ctime.

https://man7.org/linux/man-pages/man2/utimes.2.html

doesn't say that, but

https://pubs.opengroup.org/onlinepubs/007904875/functions/utimes.html

does, as does the code.

NeilBrown
Jeff Layton Aug. 23, 2022, 11:51 a.m. UTC | #10
On Tue, 2022-08-23 at 21:38 +1000, NeilBrown wrote:
> On Tue, 23 Aug 2022, Jeff Layton wrote:
> > So, we can refer to that and simply say:
> > 
> > "If the function updates the mtime or ctime on the inode, then the
> > i_version should be incremented. If only the atime is being updated,
> > then the i_version should not be incremented. The exception to this rule
> > is explicit atime updates via utimes() or similar mechanism, which
> > should result in the i_version being incremented."
> 
> Is that exception needed?  utimes() updates ctime.
> 
> https://man7.org/linux/man-pages/man2/utimes.2.html
> 
> doesn't say that, but
> 
> https://pubs.opengroup.org/onlinepubs/007904875/functions/utimes.html
> 
> does, as does the code.
> 

Oh, good point! I think we can leave that out. Even better!
NeilBrown Aug. 23, 2022, 10:24 p.m. UTC | #11
On Tue, 23 Aug 2022, Jeff Layton wrote:
> On Tue, 2022-08-23 at 21:38 +1000, NeilBrown wrote:
> > On Tue, 23 Aug 2022, Jeff Layton wrote:
> > > So, we can refer to that and simply say:
> > > 
> > > "If the function updates the mtime or ctime on the inode, then the
> > > i_version should be incremented. If only the atime is being updated,
> > > then the i_version should not be incremented. The exception to this rule
> > > is explicit atime updates via utimes() or similar mechanism, which
> > > should result in the i_version being incremented."
> > 
> > Is that exception needed?  utimes() updates ctime.
> > 
> > https://man7.org/linux/man-pages/man2/utimes.2.html
> > 
> > doesn't say that, but
> > 
> > https://pubs.opengroup.org/onlinepubs/007904875/functions/utimes.html
> > 
> > does, as does the code.
> > 
> 
> Oh, good point! I think we can leave that out. Even better!

Further, implicit mtime updates (file_update_time()) also update ctime.
So all you need is
   If the function updates the ctime, then i_version should be
   incremented.

and I have to ask - why not just use the ctime?  Why have another number
that is parallel?

Timestamps are updated at HZ (ktime_get_course) which is at most every
millisecond.
xfs stores nanosecond resolution, so about 20 bits are currently wasted.
We could put a counter like i_version in there that only increments
after it is viewed, then we can get all the precision we need but with
exactly ctime semantics.

The 64 change-id could comprise
 35 bits of seconds (nearly a millenium)
 16 bits of sub-seconds (just in case a higher precision time was wanted
                         one day)
 13 bits of counter. - 8192 changes per tick

The value exposed in i_ctime would hide the counter and just show the
timestamp portion of what the filesystem stores.  This would ensure we
never get changes on different files that happen in one order leaving
timestamps with the reversed order (the timestamps could be the same,
but that is expected).

This scheme could be made to handle a sustained update rate of 1
increment every 8 nanoseconds (if the counter were allowed to overflow
into unused bits of the sub-second field).  This is one ever 24 CPU
cycles.  Incrementing a counter and making it visible to all CPUs can
probably be done in 24 cycles.  Accessing it and setting the "seen" flag
as well might just fit with faster memory.  Getting any other useful
work done while maintaining that rate on a single file seems unlikely.

NeilBrown
Dave Chinner Aug. 23, 2022, 11:25 p.m. UTC | #12
On Tue, Aug 23, 2022 at 07:21:36AM -0400, Jeff Layton wrote:
> On Tue, 2022-08-23 at 09:32 +1000, Dave Chinner wrote:
> > On Mon, Aug 22, 2022 at 02:22:20PM -0400, Jeff Layton wrote:
> > > diff --git a/include/linux/iversion.h b/include/linux/iversion.h
> > > index 3bfebde5a1a6..524abd372100 100644
> > > --- a/include/linux/iversion.h
> > > +++ b/include/linux/iversion.h
> > > @@ -9,8 +9,8 @@
> > >   * ---------------------------
> > >   * The change attribute (i_version) is mandated by NFSv4 and is mostly for
> > >   * knfsd, but is also used for other purposes (e.g. IMA). The i_version must
> > > - * appear different to observers if there was a change to the inode's data or
> > > - * metadata since it was last queried.
> > > + * appear different to observers if there was an explicit change to the inode's
> > > + * data or metadata since it was last queried.
> > >   *
> > >   * Observers see the i_version as a 64-bit number that never decreases. If it
> > >   * remains the same since it was last checked, then nothing has changed in the
> > > @@ -18,6 +18,13 @@
> > >   * anything about the nature or magnitude of the changes from the value, only
> > >   * that the inode has changed in some fashion.
> > >   *
> > > + * Note that atime updates due to reads or similar activity do not represent
> > 
> > What does "or similar activity" mean?
> > 
> 
> Some examples:
> 
> - readdir() in a directory
> - readlink() on symlink
> - mmap reads
> 
> ...basically, things that access data without materially changing it.

What happens if we have buffered dirty data in the page cache, and a
DIO read is done at that location?

This doesn't materially change data, but it forces writeback of the
cached data, and that means XFS will bump iversion because of the
data writeback changing inode metadata.

i can think of several scenarios where a pure data access operation
that does not materially change user data but will cause an iversion
change because those access operations imply some level of data
persistence.

> > In case you didn't realise, XFS can bump iversion 500+ times for a
> > single 1MB write() on a 4kB block size filesytem, and only one of
> > them is initial write() system call that copies the data into the
> > page cache. The other 500+ are all the extent allocation and
> > manipulation transactions that we might run when persisting the data
> > to disk tens of seconds later. This is how iversion on XFS has
> > behaved for the past decade.
> > 
> 
> Bumping the change count multiple times internally for a single change
> is not a problem. From the comments in iversion.h:
> 
>  * Observers see the i_version as a 64-bit number that never decreases. If it
>  * remains the same since it was last checked, then nothing has changed in the
>  * inode. If it's different then something has changed. Observers cannot infer
>  * anything about the nature or magnitude of the changes from the value, only
>  * that the inode has changed in some fashion.
> 
> Bumping it once or multiple times still conforms to how we have this
> defined.

Sure, it conforms to this piece of the specification. But the
temporal aspect of the filesystem bumping iversion due to background
operations most definitely conflicts with the new definition of
iversion only changing when operations that change c/mtime are
performed.

i.e.  if we're to take the "iversion changes only when c/mtime is
changed" definition at face value, then the filesystem is not free
to modify iversion when it modifies metadata during background
writeback. It's not free to bump iversion during fsync(). i.e. it's
not free to bump iversion on any operation that implies data
writeback is necessary.

That makes the existing XFS di_changecount implementation
incompatible with the newly redefined iversion semantics being
pushed and wanting to be exposed to userspace. If the filesystem
implementation can't meet the specification of the change attribute
being exposed to userspace then we *must not expose that information
to userspace*.

This restriction does not prevent us from using our existing
iversion implementation for NFS and IMA because the worst that
happens is users occasionally have to refetch information from the
server as has already been occurring for the past decade or so.
Indeed, there's an argument to be made that the periodic IMA
revalidation that relatime + iversion causes for the data at rest in
the page cache is actually a good security practice and not a
behaviour that we should be trying to optimise away.

All I want from this process is a *solid definition* of what
iversion is supposed to represent and what will be exposed to
userspace and the ability for the filesystem to decide itself
whether to expose it's persistent change counter to userspace. Us
filesystem developers can take it from there to provide a change
attribute that conforms to the contract we form with userspace by
exposing this information to statx().

> > Either way we chose, one of these options are the only way that we
> > will end up with a consistent implementation of a change counter
> > across all the filesystems. And, quite frankly, that's exactly is
> > needed if we are going to present this information to userspace
> > forever more.
> > 
> 
> I agree that we need a real definition of what changes should be
> represented in this value. My intent was to add that to the statx
> manpage once we had gotten a little further along, but it won't hurt to
> hash that out first.

How have so many experienced engineers forgotten basic engineering
processes they were taught as an undergrad? i.e. that requirements
and specification come first, then the implementation is derived
from the specification?

And why do they keep "forgetting" this over and over again?

> I do not intend to exhaustively list all possible activities that should
> and shouldn't update the i_version. It's as difficult to put together a
> comprehensive list of what activities should and shouldn't change the
> i_version as it is to list what activities should and shouldn't cause
> the mtime/ctime/atime to change. The list is also going to constantly
> change as our interfaces change.

If this change attribute is not going to specified in a way that
userspace cannot rely on it's behaviour not changing in incompatible
ways, then it should not be exposed to userspace at all. Both
userspace and the filesystems need an unambiguous definition so that
userspace applications can rely on the behaviour that the kernel
and filesystems guarantee will be provided.

> What may be best is to just define this value in terms of how timestamps
> get updated, since POSIX already specifies that. Section 4.9 in the
> POSIX spec discusses file time updates:
> 
>     https://pubs.opengroup.org/onlinepubs/9699919799/
> 
> It says:
> 
> "Each function or utility in POSIX.1-2017 that reads or writes data
> (even if the data does not change) or performs an operation to change
> file status (even if the file status does not change) indicates which of
> the appropriate timestamps shall be marked for update."
> 
> So, we can refer to that and simply say:
> 
> "If the function updates the mtime or ctime on the inode, then the
> i_version should be incremented. If only the atime is being updated,
> then the i_version should not be incremented. The exception to this rule
> is explicit atime updates via utimes() or similar mechanism, which
> should result in the i_version being incremented."

I'd almost be fine with that definition for iversion being exposed
to userspace, but it doesn't say anything about metadata changes
that don't change c/mtime. i.e. this needs to define iversion as
"_Only_ operations that modify user data and/or c/mtime on the inode
should increment the change attribute", not leave it ambiguous as to
whether other operations can bump the change attribute or not.

Of course, this new iversion deinition is most definitely
incompatible with the existing specification of the XFS persistent
change attribute.....

-Dave.
Dave Chinner Aug. 23, 2022, 11:28 p.m. UTC | #13
On Wed, Aug 24, 2022 at 08:24:47AM +1000, NeilBrown wrote:
> On Tue, 23 Aug 2022, Jeff Layton wrote:
> > On Tue, 2022-08-23 at 21:38 +1000, NeilBrown wrote:
> > > On Tue, 23 Aug 2022, Jeff Layton wrote:
> > > > So, we can refer to that and simply say:
> > > > 
> > > > "If the function updates the mtime or ctime on the inode, then the
> > > > i_version should be incremented. If only the atime is being updated,
> > > > then the i_version should not be incremented. The exception to this rule
> > > > is explicit atime updates via utimes() or similar mechanism, which
> > > > should result in the i_version being incremented."
> > > 
> > > Is that exception needed?  utimes() updates ctime.
> > > 
> > > https://man7.org/linux/man-pages/man2/utimes.2.html
> > > 
> > > doesn't say that, but
> > > 
> > > https://pubs.opengroup.org/onlinepubs/007904875/functions/utimes.html
> > > 
> > > does, as does the code.
> > > 
> > 
> > Oh, good point! I think we can leave that out. Even better!
> 
> Further, implicit mtime updates (file_update_time()) also update ctime.
> So all you need is
>    If the function updates the ctime, then i_version should be
>    incremented.
> 
> and I have to ask - why not just use the ctime?  Why have another number
> that is parallel?
> 
> Timestamps are updated at HZ (ktime_get_course) which is at most every
> millisecond.

Kernel time, and therefore timestamps, can go backwards.

-Dave.
NeilBrown Aug. 23, 2022, 11:42 p.m. UTC | #14
On Wed, 24 Aug 2022, Dave Chinner wrote:
> On Wed, Aug 24, 2022 at 08:24:47AM +1000, NeilBrown wrote:
> > On Tue, 23 Aug 2022, Jeff Layton wrote:
> > > On Tue, 2022-08-23 at 21:38 +1000, NeilBrown wrote:
> > > > On Tue, 23 Aug 2022, Jeff Layton wrote:
> > > > > So, we can refer to that and simply say:
> > > > > 
> > > > > "If the function updates the mtime or ctime on the inode, then the
> > > > > i_version should be incremented. If only the atime is being updated,
> > > > > then the i_version should not be incremented. The exception to this rule
> > > > > is explicit atime updates via utimes() or similar mechanism, which
> > > > > should result in the i_version being incremented."
> > > > 
> > > > Is that exception needed?  utimes() updates ctime.
> > > > 
> > > > https://man7.org/linux/man-pages/man2/utimes.2.html
> > > > 
> > > > doesn't say that, but
> > > > 
> > > > https://pubs.opengroup.org/onlinepubs/007904875/functions/utimes.html
> > > > 
> > > > does, as does the code.
> > > > 
> > > 
> > > Oh, good point! I think we can leave that out. Even better!
> > 
> > Further, implicit mtime updates (file_update_time()) also update ctime.
> > So all you need is
> >    If the function updates the ctime, then i_version should be
> >    incremented.
> > 
> > and I have to ask - why not just use the ctime?  Why have another number
> > that is parallel?
> > 
> > Timestamps are updated at HZ (ktime_get_course) which is at most every
> > millisecond.
> 
> Kernel time, and therefore timestamps, can go backwards.

Yes, and when that happens you get to keep both halves...

For NFSv4 I really don't think that matters.  If it happened every day,
that might be a problem.  Even if it happens as a consequence of normal
operations it might be a problem.  But it can only happen if something
goes wrong.

Mostly, NFSv4 only needs to changeid to change.  If the kernel time goes
backwards it is possible that a changeid will repeat, though unlikely.
It is even possible that a client will see the first and second
instances of that repeat, and assume there is no change in between - but
that is astronomically unlikely.  "touch"ing the file or remounting will
fix that.

When a write delegation is in force (which Linux doesn't currently offer
and no-one seems to care about, but maybe one day), the client is
allowed to update the changeid, and when the delegation is returned, the
server is supposed to ensure the new changeid is at least the last one
assigned by the client.  This is the only reason that it is defined as
being monotonic (rather than just "non-repeating") - so the client and
server can change it in the same way.

So while kernel time going backwards is theoretically less than ideal,
it is not practically a problem.

NeilBrown
Trond Myklebust Aug. 24, 2022, 12:11 a.m. UTC | #15
On Wed, 2022-08-24 at 09:42 +1000, NeilBrown wrote:
> On Wed, 24 Aug 2022, Dave Chinner wrote:
> > On Wed, Aug 24, 2022 at 08:24:47AM +1000, NeilBrown wrote:
> > > On Tue, 23 Aug 2022, Jeff Layton wrote:
> > > > On Tue, 2022-08-23 at 21:38 +1000, NeilBrown wrote:
> > > > > On Tue, 23 Aug 2022, Jeff Layton wrote:
> > > > > > So, we can refer to that and simply say:
> > > > > > 
> > > > > > "If the function updates the mtime or ctime on the inode,
> > > > > > then the
> > > > > > i_version should be incremented. If only the atime is being
> > > > > > updated,
> > > > > > then the i_version should not be incremented. The exception
> > > > > > to this rule
> > > > > > is explicit atime updates via utimes() or similar
> > > > > > mechanism, which
> > > > > > should result in the i_version being incremented."
> > > > > 
> > > > > Is that exception needed?  utimes() updates ctime.
> > > > > 
> > > > > https://man7.org/linux/man-pages/man2/utimes.2.html
> > > > > 
> > > > > doesn't say that, but
> > > > > 
> > > > > https://pubs.opengroup.org/onlinepubs/007904875/functions/utimes.html
> > > > > 
> > > > > does, as does the code.
> > > > > 
> > > > 
> > > > Oh, good point! I think we can leave that out. Even better!
> > > 
> > > Further, implicit mtime updates (file_update_time()) also update
> > > ctime.
> > > So all you need is
> > >    If the function updates the ctime, then i_version should be
> > >    incremented.
> > > 
> > > and I have to ask - why not just use the ctime?  Why have another
> > > number
> > > that is parallel?
> > > 
> > > Timestamps are updated at HZ (ktime_get_course) which is at most
> > > every
> > > millisecond.
> > 
> > Kernel time, and therefore timestamps, can go backwards.
> 
> Yes, and when that happens you get to keep both halves...
> 
> For NFSv4 I really don't think that matters.  If it happened every
> day,
> that might be a problem.  Even if it happens as a consequence of
> normal
> operations it might be a problem.  But it can only happen if
> something
> goes wrong.
> 
> Mostly, NFSv4 only needs to changeid to change.  If the kernel time
> goes
> backwards it is possible that a changeid will repeat, though
> unlikely.
> It is even possible that a client will see the first and second
> instances of that repeat, and assume there is no change in between -
> but
> that is astronomically unlikely.  "touch"ing the file or remounting
> will
> fix that.
> 
> When a write delegation is in force (which Linux doesn't currently
> offer
> and no-one seems to care about, but maybe one day), the client is
> allowed to update the changeid, and when the delegation is returned,
> the
> server is supposed to ensure the new changeid is at least the last
> one
> assigned by the client.  This is the only reason that it is defined
> as
> being monotonic (rather than just "non-repeating") - so the client
> and
> server can change it in the same way.
> 

Sort of... Monotonicity of the change id is not a requirement, even for
that case.

The exact NFSv4 requirement is that if the server uses a callback to
ask the client for the file attribute information, then the client is
supposed to bump the server-supplied change id value by 1 unit if it is
caching writes. The intention is simply to ensure that there is a
notification mechanism to allow the server to know that writes are
being cached (Note: I've no idea why we didn't just add a separate flag
for that in the callback reply).

> So while kernel time going backwards is theoretically less than
> ideal,
> it is not practically a problem.
> 

I agree with that. As long as this results in few collisions, so that
value uniqueness is guaranteed then applications (NFS or other) will
have a way to determine if the filesystem object has changed since they
last looked at it.

IOW: I'm not saying this is the perfect way to implement an
application-visible change attribute, but it would be a massive
improvement over ctime, and might be a practical way to go about it.
Jeff Layton Aug. 24, 2022, 12:45 p.m. UTC | #16
On Wed, 2022-08-24 at 09:25 +1000, Dave Chinner wrote:
> On Tue, Aug 23, 2022 at 07:21:36AM -0400, Jeff Layton wrote:
> > On Tue, 2022-08-23 at 09:32 +1000, Dave Chinner wrote:
> > > On Mon, Aug 22, 2022 at 02:22:20PM -0400, Jeff Layton wrote:
> > > > diff --git a/include/linux/iversion.h b/include/linux/iversion.h
> > > > index 3bfebde5a1a6..524abd372100 100644
> > > > --- a/include/linux/iversion.h
> > > > +++ b/include/linux/iversion.h
> > > > @@ -9,8 +9,8 @@
> > > >   * ---------------------------
> > > >   * The change attribute (i_version) is mandated by NFSv4 and is mostly for
> > > >   * knfsd, but is also used for other purposes (e.g. IMA). The i_version must
> > > > - * appear different to observers if there was a change to the inode's data or
> > > > - * metadata since it was last queried.
> > > > + * appear different to observers if there was an explicit change to the inode's
> > > > + * data or metadata since it was last queried.
> > > >   *
> > > >   * Observers see the i_version as a 64-bit number that never decreases. If it
> > > >   * remains the same since it was last checked, then nothing has changed in the
> > > > @@ -18,6 +18,13 @@
> > > >   * anything about the nature or magnitude of the changes from the value, only
> > > >   * that the inode has changed in some fashion.
> > > >   *
> > > > + * Note that atime updates due to reads or similar activity do not represent
> > > 
> > > What does "or similar activity" mean?
> > > 
> > 
> > Some examples:
> > 
> > - readdir() in a directory
> > - readlink() on symlink
> > - mmap reads
> > 
> > ...basically, things that access data without materially changing it.
> 
> What happens if we have buffered dirty data in the page cache, and a
> DIO read is done at that location?
> 
> This doesn't materially change data, but it forces writeback of the
> cached data, and that means XFS will bump iversion because of the
> data writeback changing inode metadata.
> 

Ideally, the i_version should not change in this case.

> i can think of several scenarios where a pure data access operation
> that does not materially change user data but will cause an iversion
> change because those access operations imply some level of data
> persistence.
> 
> > > In case you didn't realise, XFS can bump iversion 500+ times for a
> > > single 1MB write() on a 4kB block size filesytem, and only one of
> > > them is initial write() system call that copies the data into the
> > > page cache. The other 500+ are all the extent allocation and
> > > manipulation transactions that we might run when persisting the data
> > > to disk tens of seconds later. This is how iversion on XFS has
> > > behaved for the past decade.
> > > 
> > 
> > Bumping the change count multiple times internally for a single change
> > is not a problem. From the comments in iversion.h:
> > 
> >  * Observers see the i_version as a 64-bit number that never decreases. If it
> >  * remains the same since it was last checked, then nothing has changed in the
> >  * inode. If it's different then something has changed. Observers cannot infer
> >  * anything about the nature or magnitude of the changes from the value, only
> >  * that the inode has changed in some fashion.
> > 
> > Bumping it once or multiple times still conforms to how we have this
> > defined.
> 
> Sure, it conforms to this piece of the specification. But the
> temporal aspect of the filesystem bumping iversion due to background
> operations most definitely conflicts with the new definition of
> iversion only changing when operations that change c/mtime are
> performed.
> 
> i.e.  if we're to take the "iversion changes only when c/mtime is
> changed" definition at face value, then the filesystem is not free
> to modify iversion when it modifies metadata during background
> writeback. It's not free to bump iversion during fsync(). i.e. it's
> not free to bump iversion on any operation that implies data
> writeback is necessary.
> 
> That makes the existing XFS di_changecount implementation
> incompatible with the newly redefined iversion semantics being
> pushed and wanting to be exposed to userspace. If the filesystem
> implementation can't meet the specification of the change attribute
> being exposed to userspace then we *must not expose that information
> to userspace*.
> 
> This restriction does not prevent us from using our existing
> iversion implementation for NFS and IMA because the worst that
> happens is users occasionally have to refetch information from the
> server as has already been occurring for the past decade or so.
> Indeed, there's an argument to be made that the periodic IMA
> revalidation that relatime + iversion causes for the data at rest in
> the page cache is actually a good security practice and not a
> behaviour that we should be trying to optimise away.
> 
> All I want from this process is a *solid definition* of what
> iversion is supposed to represent and what will be exposed to
> userspace and the ability for the filesystem to decide itself
> whether to expose it's persistent change counter to userspace. Us
> filesystem developers can take it from there to provide a change
> attribute that conforms to the contract we form with userspace by
> exposing this information to statx().
> 
> > > Either way we chose, one of these options are the only way that we
> > > will end up with a consistent implementation of a change counter
> > > across all the filesystems. And, quite frankly, that's exactly is
> > > needed if we are going to present this information to userspace
> > > forever more.
> > > 
> > 
> > I agree that we need a real definition of what changes should be
> > represented in this value. My intent was to add that to the statx
> > manpage once we had gotten a little further along, but it won't hurt to
> > hash that out first.
> 
> How have so many experienced engineers forgotten basic engineering
> processes they were taught as an undergrad? i.e. that requirements
> and specification come first, then the implementation is derived
> from the specification?
> 
> And why do they keep "forgetting" this over and over again?
> 

The sanctimonious comments are really unnecessary.

YOU are the person who asked me to write testcases for this. The only
reasonable way to do that is to expose this attribute to userland.

I would certainly have approached all of this differently had I been
implementing the i_version counter from scratch. The time to write a
specification for i_version was when it was created (>20 years ago).
That predates my involvement in Linux kernel development. I'm doing what
I can to remedy it now. Be patient, please.

> > I do not intend to exhaustively list all possible activities that should
> > and shouldn't update the i_version. It's as difficult to put together a
> > comprehensive list of what activities should and shouldn't change the
> > i_version as it is to list what activities should and shouldn't cause
> > the mtime/ctime/atime to change. The list is also going to constantly
> > change as our interfaces change.
> 
> If this change attribute is not going to specified in a way that
> userspace cannot rely on it's behaviour not changing in incompatible
> ways, then it should not be exposed to userspace at all. Both
> userspace and the filesystems need an unambiguous definition so that
> userspace applications can rely on the behaviour that the kernel
> and filesystems guarantee will be provided.
> 
> > What may be best is to just define this value in terms of how timestamps
> > get updated, since POSIX already specifies that. Section 4.9 in the
> > POSIX spec discusses file time updates:
> > 
> >     https://pubs.opengroup.org/onlinepubs/9699919799/
> > 
> > It says:
> > 
> > "Each function or utility in POSIX.1-2017 that reads or writes data
> > (even if the data does not change) or performs an operation to change
> > file status (even if the file status does not change) indicates which of
> > the appropriate timestamps shall be marked for update."
> > 
> > So, we can refer to that and simply say:
> > 
> > "If the function updates the mtime or ctime on the inode, then the
> > i_version should be incremented. If only the atime is being updated,
> > then the i_version should not be incremented. The exception to this rule
> > is explicit atime updates via utimes() or similar mechanism, which
> > should result in the i_version being incremented."
> 
> I'd almost be fine with that definition for iversion being exposed
> to userspace, but it doesn't say anything about metadata changes
> that don't change c/mtime. i.e. this needs to define iversion as
> "_Only_ operations that modify user data and/or c/mtime on the inode
> should increment the change attribute", not leave it ambiguous as to
> whether other operations can bump the change attribute or not.
> 

Good, this is probably how we'll end up defining it. Doing anything else
is going to be too difficult, I think.

> Of course, this new iversion deinition is most definitely
> incompatible with the existing specification of the XFS persistent
> change attribute.....
> 

We could also allow for a conformant implementation to have i_version
bumps even when something "invisible" happens, and just mention that the
consumer of it (i.e. application or subsystem) must allow for that
possibility.

With that, a change in i_version would mean that something _might_ have
changed instead of something having definitely changed. Ideally, an
implementation wouldn't do that, since doing so would likely have
measurable performance performance impact.

If we did that, it would mean that the xfs implementation currently
conforms to the proposed spec. Then it would just be a matter of trying
to optimize away the i_version bumps that occur during read-like
activity.

I think it's probably best to define this as loosely as possible so that
we can make it easier for a broad range of filesystems to implement it.
All of the existing consumers can (and do) work with i_version today,
it's just that the extra i_version bumps are terrible for performance
(particularly with NFS).

If we get a spurious bump just occasionally, then it's not too awful
(particularly when the file is being written to anyway, and we'll need
to dump the cache for it). The biggest pain point is i_version being
updated on all atime updates though.
Jeff Layton Aug. 24, 2022, 12:53 p.m. UTC | #17
On Wed, 2022-08-24 at 08:24 +1000, NeilBrown wrote:
> On Tue, 23 Aug 2022, Jeff Layton wrote:
> > On Tue, 2022-08-23 at 21:38 +1000, NeilBrown wrote:
> > > On Tue, 23 Aug 2022, Jeff Layton wrote:
> > > > So, we can refer to that and simply say:
> > > > 
> > > > "If the function updates the mtime or ctime on the inode, then the
> > > > i_version should be incremented. If only the atime is being updated,
> > > > then the i_version should not be incremented. The exception to this rule
> > > > is explicit atime updates via utimes() or similar mechanism, which
> > > > should result in the i_version being incremented."
> > > 
> > > Is that exception needed? utimes() updates ctime.
> > > 
> > > https://man7.org/linux/man-pages/man2/utimes.2.html
> > > 
> > > doesn't say that, but
> > > 
> > > https://pubs.opengroup.org/onlinepubs/007904875/functions/utimes.html
> > > 
> > > does, as does the code.
> > > 
> > 
> > Oh, good point! I think we can leave that out. Even better!
> 
> Further, implicit mtime updates (file_update_time()) also update ctime.
> So all you need is
>  If the function updates the ctime, then i_version should be
>  incremented.
> 
> and I have to ask - why not just use the ctime? Why have another number
> that is parallel?
> 
> Timestamps are updated at HZ (ktime_get_course) which is at most every
> millisecond.
> xfs stores nanosecond resolution, so about 20 bits are currently wasted.
> We could put a counter like i_version in there that only increments
> after it is viewed, then we can get all the precision we need but with
> exactly ctime semantics.
> 
> The 64 change-id could comprise
>  35 bits of seconds (nearly a millenium)
>  16 bits of sub-seconds (just in case a higher precision time was wanted
>  one day)
>  13 bits of counter. - 8192 changes per tick

We'd need a "seen" flag too, so maybe only 4096 changes per tick...

> 
> The value exposed in i_ctime would hide the counter and just show the
> timestamp portion of what the filesystem stores. This would ensure we
> never get changes on different files that happen in one order leaving
> timestamps with the reversed order (the timestamps could be the same,
> but that is expected).
> 
> This scheme could be made to handle a sustained update rate of 1
> increment every 8 nanoseconds (if the counter were allowed to overflow
> into unused bits of the sub-second field). This is one ever 24 CPU
> cycles. Incrementing a counter and making it visible to all CPUs can
> probably be done in 24 cycles. Accessing it and setting the "seen" flag
> as well might just fit with faster memory. Getting any other useful
> work done while maintaining that rate on a single file seems unlikely.

This is an interesting idea.

So, for NFSv4 you'd just mask off the counter bits (and "seen" flag) to
get the ctime, and for the change attribute we'd just mask off the
"seen" flag and put it all in there.

 * Implementing that for all filesystems would be a huge project though.
   If we were implementing the i_version counter from scratch, I'd
   probably do something along these lines. Given that we already have
   an existing i_version counter, would there be any real benefit to
   pursuing this avenue instead?
NeilBrown Aug. 25, 2022, 12:17 a.m. UTC | #18
On Wed, 24 Aug 2022, Jeff Layton wrote:
> On Wed, 2022-08-24 at 08:24 +1000, NeilBrown wrote:
> > On Tue, 23 Aug 2022, Jeff Layton wrote:
> > > On Tue, 2022-08-23 at 21:38 +1000, NeilBrown wrote:
> > > > On Tue, 23 Aug 2022, Jeff Layton wrote:
> > > > > So, we can refer to that and simply say:
> > > > > 
> > > > > "If the function updates the mtime or ctime on the inode, then the
> > > > > i_version should be incremented. If only the atime is being updated,
> > > > > then the i_version should not be incremented. The exception to this rule
> > > > > is explicit atime updates via utimes() or similar mechanism, which
> > > > > should result in the i_version being incremented."
> > > > 
> > > > Is that exception needed? utimes() updates ctime.
> > > > 
> > > > https://man7.org/linux/man-pages/man2/utimes.2.html
> > > > 
> > > > doesn't say that, but
> > > > 
> > > > https://pubs.opengroup.org/onlinepubs/007904875/functions/utimes.html
> > > > 
> > > > does, as does the code.
> > > > 
> > > 
> > > Oh, good point! I think we can leave that out. Even better!
> > 
> > Further, implicit mtime updates (file_update_time()) also update ctime.
> > So all you need is
> >  If the function updates the ctime, then i_version should be
> >  incremented.
> > 
> > and I have to ask - why not just use the ctime? Why have another number
> > that is parallel?
> > 
> > Timestamps are updated at HZ (ktime_get_course) which is at most every
> > millisecond.
> > xfs stores nanosecond resolution, so about 20 bits are currently wasted.
> > We could put a counter like i_version in there that only increments
> > after it is viewed, then we can get all the precision we need but with
> > exactly ctime semantics.
> > 
> > The 64 change-id could comprise
> >  35 bits of seconds (nearly a millenium)
> >  16 bits of sub-seconds (just in case a higher precision time was wanted
> >  one day)
> >  13 bits of counter. - 8192 changes per tick
> 
> We'd need a "seen" flag too, so maybe only 4096 changes per tick...

The "seen" flag does not need to be visible to NFSv4.
Nor does it need to be appear on storage.

Though it may still be easier to include it with the counter bits.

> 
> > 
> > The value exposed in i_ctime would hide the counter and just show the
> > timestamp portion of what the filesystem stores. This would ensure we
> > never get changes on different files that happen in one order leaving
> > timestamps with the reversed order (the timestamps could be the same,
> > but that is expected).
> > 
> > This scheme could be made to handle a sustained update rate of 1
> > increment every 8 nanoseconds (if the counter were allowed to overflow
> > into unused bits of the sub-second field). This is one ever 24 CPU
> > cycles. Incrementing a counter and making it visible to all CPUs can
> > probably be done in 24 cycles. Accessing it and setting the "seen" flag
> > as well might just fit with faster memory. Getting any other useful
> > work done while maintaining that rate on a single file seems unlikely.
> 
> This is an interesting idea.
> 
> So, for NFSv4 you'd just mask off the counter bits (and "seen" flag) to
> get the ctime, and for the change attribute we'd just mask off the
> "seen" flag and put it all in there.

Obviously it isn't just NFSv4 that needs the ctime, it is also the
vfs...

I imagine that the counter would be separate in the in-memory inode.  It
would be split out when read from storage, and merge in when written to
storage.

> 
>  * Implementing that for all filesystems would be a huge project though.
>    If we were implementing the i_version counter from scratch, I'd
>    probably do something along these lines. Given that we already have
>    an existing i_version counter, would there be any real benefit to
>    pursuing this avenue instead?

i_version is currently only supported by btrfs, ext4, and xfs.  Plus
cephfs which has its own internal ideas.
So "all filesystems" isn't needed.  Let's just start with xfs.

All we need is for xfs store in ->i_version a value that meets the
semantics that we specify for ->i_version.
So we need to change xfs to use somewhere else to store its internal
counter that is used for forensics, and then arrange that ->i_version
stores the ctime combined with a counter that resets whenever the ctime
changes.
I think most of this would be done in xfs_vn_update_time(), but probably
some changes would be needed in iversion.h to provide useful support.

If ext4's current use of i_version provides the semantics that we need,
there would be no need to change it.  Ditto for btrfs.

NeilBrown
diff mbox series

Patch

diff --git a/include/linux/iversion.h b/include/linux/iversion.h
index 3bfebde5a1a6..da6cc1cc520a 100644
--- a/include/linux/iversion.h
+++ b/include/linux/iversion.h
@@ -9,8 +9,8 @@ 
  * ---------------------------
  * The change attribute (i_version) is mandated by NFSv4 and is mostly for
  * knfsd, but is also used for other purposes (e.g. IMA). The i_version must
- * appear different to observers if there was a change to the inode's data or
- * metadata since it was last queried.
+ * appear different to observers if there was an explicit change to the inode's
+ * data or metadata since it was last queried.
  *
  * Observers see the i_version as a 64-bit number that never decreases. If it
  * remains the same since it was last checked, then nothing has changed in the
@@ -18,6 +18,12 @@ 
  * anything about the nature or magnitude of the changes from the value, only
  * that the inode has changed in some fashion.
  *
+ * Note that atime updates due to reads or similar activity do _not_ represent
+ * an explicit change to the inode. If the only change is to the atime and it
+ * wasn't set via utimes() or a similar mechanism, then i_version should not be
+ * incremented. If an observer cares about atime updates, it should plan to
+ * fetch and store them in conjunction with the i_version.
+ *
  * Not all filesystems properly implement the i_version counter. Subsystems that
  * want to use i_version field on an inode should first check whether the
  * filesystem sets the SB_I_VERSION flag (usually via the IS_I_VERSION macro).