xfs: fix i_version handling in xfs

Message ID	20220816131736.42615-1-jlayton@kernel.org (mailing list archive)
State	New, archived
Headers	show Return-Path: <linux-fsdevel-owner@kernel.org> From: Jeff Layton <jlayton@kernel.org> To: djwong@kernel.org Cc: linux-xfs@vger.kernel.org, linux-fsdevel@vger.kernel.org, "Darrick J . Wong" <darrick.wong@oracle.com>, Dave Chinner <david@fromorbit.com> Subject: [PATCH] xfs: fix i_version handling in xfs Date: Tue, 16 Aug 2022 09:17:36 -0400 Message-Id: <20220816131736.42615-1-jlayton@kernel.org> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Precedence: bulk
Series	xfs: fix i_version handling in xfs \| expand xfs: fix i_version handling in xfs

Jeffrey Layton Aug. 16, 2022, 1:17 p.m. UTC

The i_version in xfs_trans_log_inode is bumped for any inode update,
including atime-only updates due to reads. We don't want to record those
in the i_version, as they don't represent "real" changes. Remove that
callsite.

In xfs_vn_update_time, if S_VERSION is flagged, then attempt to bump the
i_version and turn on XFS_ILOG_CORE if it happens. In
xfs_trans_ichgtime, update the i_version if the mtime or ctime are being
updated.

Cc: Darrick J. Wong <darrick.wong@oracle.com>
Cc: Dave Chinner <david@fromorbit.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
 fs/xfs/libxfs/xfs_trans_inode.c | 17 +++--------------
 fs/xfs/xfs_iops.c               |  4 ++++
 2 files changed, 7 insertions(+), 14 deletions(-)

Darrick J. Wong Aug. 16, 2022, 3:43 p.m. UTC | #1

On Tue, Aug 16, 2022 at 09:17:36AM -0400, Jeff Layton wrote:
> The i_version in xfs_trans_log_inode is bumped for any inode update,
> including atime-only updates due to reads. We don't want to record those
> in the i_version, as they don't represent "real" changes. Remove that
> callsite.
> 
> In xfs_vn_update_time, if S_VERSION is flagged, then attempt to bump the
> i_version and turn on XFS_ILOG_CORE if it happens. In
> xfs_trans_ichgtime, update the i_version if the mtime or ctime are being
> updated.

What about operations that don't touch the mtime but change the file
metadata anyway?  There are a few of those, like the blockgc garbage
collector, deduperange, and the defrag tool.

Zooming out a bit -- what does i_version signal, concretely?  I thought
it was used by nfs (and maybe ceph?) to signal to clients that the file
on the server has moved on, and the client needs to invalidate its
caches.  I thought afs had a similar generation counter, though it's
only used to cache file data, not metadata?  Does an i_version change
cause all of them to invalidate caches, or is there more behavior I
don't know about?

Does that mean that we should bump i_version for any file data or
attribute that could be queried or observed by userspace?  In which case
I suppose this change is still correct, even if it relaxes i_version
updates from "any change to the inode whatsoever" to "any change that
would bump mtime".  Unless FIEMAP is part of "attributes observed by
userspace".

(The other downside I can see is that now we have to remember to bump
timestamps for every new file operation we add, unlike the current code
which is centrally located in xfs_trans_log_inode.)

--D

> Cc: Darrick J. Wong <darrick.wong@oracle.com>
> Cc: Dave Chinner <david@fromorbit.com>
> Signed-off-by: Jeff Layton <jlayton@kernel.org>
> ---
>  fs/xfs/libxfs/xfs_trans_inode.c | 17 +++--------------
>  fs/xfs/xfs_iops.c               |  4 ++++
>  2 files changed, 7 insertions(+), 14 deletions(-)
> 
> diff --git a/fs/xfs/libxfs/xfs_trans_inode.c b/fs/xfs/libxfs/xfs_trans_inode.c
> index 8b5547073379..78bf7f491462 100644
> --- a/fs/xfs/libxfs/xfs_trans_inode.c
> +++ b/fs/xfs/libxfs/xfs_trans_inode.c
> @@ -71,6 +71,8 @@ xfs_trans_ichgtime(
>  		inode->i_ctime = tv;
>  	if (flags & XFS_ICHGTIME_CREATE)
>  		ip->i_crtime = tv;
> +	if (flags & (XFS_ICHGTIME_MOD|XFS_ICHGTIME_CHG))
> +		inode_inc_iversion(inode);
>  }
>  
>  /*
> @@ -116,20 +118,7 @@ xfs_trans_log_inode(
>  		spin_unlock(&inode->i_lock);
>  	}
>  
> -	/*
> -	 * First time we log the inode in a transaction, bump the inode change
> -	 * counter if it is configured for this to occur. While we have the
> -	 * inode locked exclusively for metadata modification, we can usually
> -	 * avoid setting XFS_ILOG_CORE if no one has queried the value since
> -	 * the last time it was incremented. If we have XFS_ILOG_CORE already
> -	 * set however, then go ahead and bump the i_version counter
> -	 * unconditionally.
> -	 */
> -	if (!test_and_set_bit(XFS_LI_DIRTY, &iip->ili_item.li_flags)) {
> -		if (IS_I_VERSION(inode) &&
> -		    inode_maybe_inc_iversion(inode, flags & XFS_ILOG_CORE))
> -			iversion_flags = XFS_ILOG_CORE;
> -	}
> +	set_bit(XFS_LI_DIRTY, &iip->ili_item.li_flags);
>  
>  	/*
>  	 * If we're updating the inode core or the timestamps and it's possible
> diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
> index 45518b8c613c..162e044c7f56 100644
> --- a/fs/xfs/xfs_iops.c
> +++ b/fs/xfs/xfs_iops.c
> @@ -718,6 +718,7 @@ xfs_setattr_nonsize(
>  	}
>  
>  	setattr_copy(mnt_userns, inode, iattr);
> +	inode_inc_iversion(inode);
>  	xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
>  
>  	XFS_STATS_INC(mp, xs_ig_attrchg);
> @@ -943,6 +944,7 @@ xfs_setattr_size(
>  
>  	ASSERT(!(iattr->ia_valid & (ATTR_UID | ATTR_GID)));
>  	setattr_copy(mnt_userns, inode, iattr);
> +	inode_inc_iversion(inode);
>  	xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
>  
>  	XFS_STATS_INC(mp, xs_ig_attrchg);
> @@ -1047,6 +1049,8 @@ xfs_vn_update_time(
>  		inode->i_mtime = *now;
>  	if (flags & S_ATIME)
>  		inode->i_atime = *now;
> +	if ((flags & S_VERSION) && inode_maybe_inc_iversion(inode, false))
> +		log_flags |= XFS_ILOG_CORE;
>  
>  	xfs_trans_ijoin(tp, ip, XFS_ILOCK_EXCL);
>  	xfs_trans_log_inode(tp, ip, log_flags);
> -- 
> 2.37.2
>

Jeffrey Layton Aug. 16, 2022, 3:58 p.m. UTC | #2

On Tue, 2022-08-16 at 08:43 -0700, Darrick J. Wong wrote:
> On Tue, Aug 16, 2022 at 09:17:36AM -0400, Jeff Layton wrote:
> > The i_version in xfs_trans_log_inode is bumped for any inode update,
> > including atime-only updates due to reads. We don't want to record those
> > in the i_version, as they don't represent "real" changes. Remove that
> > callsite.
> > 
> > In xfs_vn_update_time, if S_VERSION is flagged, then attempt to bump the
> > i_version and turn on XFS_ILOG_CORE if it happens. In
> > xfs_trans_ichgtime, update the i_version if the mtime or ctime are being
> > updated.
> 
> What about operations that don't touch the mtime but change the file
> metadata anyway?  There are a few of those, like the blockgc garbage
> collector, deduperange, and the defrag tool.
> 

Do those change the c/mtime at all?

It's possible we're missing some places that should change the i_version
as well. We may need some more call sites.

> Zooming out a bit -- what does i_version signal, concretely?  I thought
> it was used by nfs (and maybe ceph?) to signal to clients that the file
> on the server has moved on, and the client needs to invalidate its
> caches.  I thought afs had a similar generation counter, though it's
> only used to cache file data, not metadata?  Does an i_version change
> cause all of them to invalidate caches, or is there more behavior I
> don't know about?
> 

For NFS, it indicates a change to the change attribute indicates that
there has been a change to the data or metadata for the file. atime
changes due to reads are specifically exempted from this, but we do bump
the i_version if someone (e.g.) changes the atime via utimes(). 

The NFS client will generally invalidate its caches for the inode when
it notices a change attribute change.

FWIW, AFS may not meet this standard since it doesn't generally
increment the counter on metadata changes. It may turn out that we don't
want to expose this to the AFS client due to that (or maybe come up with
some way to indicate this difference).

> Does that mean that we should bump i_version for any file data or
> attribute that could be queried or observed by userspace?  In which case
> I suppose this change is still correct, even if it relaxes i_version
> updates from "any change to the inode whatsoever" to "any change that
> would bump mtime".  Unless FIEMAP is part of "attributes observed by
> userspace".
> 
> (The other downside I can see is that now we have to remember to bump
> timestamps for every new file operation we add, unlike the current code
> which is centrally located in xfs_trans_log_inode.)
> 

The main reason for the change attribute in NFS was that NFSv3 is
plagued with cache-coherency problems due to coarse-grained timestamp
granularity. It was conceived as a way to indicate that the inode had
changed without relying on timestamps.

In practice, we want to bump the i_version counter whenever the ctime or
mtime would be changed.

> --D
> 
> > Cc: Darrick J. Wong <darrick.wong@oracle.com>
> > Cc: Dave Chinner <david@fromorbit.com>
> > Signed-off-by: Jeff Layton <jlayton@kernel.org>
> > ---
> >  fs/xfs/libxfs/xfs_trans_inode.c | 17 +++--------------
> >  fs/xfs/xfs_iops.c               |  4 ++++
> >  2 files changed, 7 insertions(+), 14 deletions(-)
> > 
> > diff --git a/fs/xfs/libxfs/xfs_trans_inode.c b/fs/xfs/libxfs/xfs_trans_inode.c
> > index 8b5547073379..78bf7f491462 100644
> > --- a/fs/xfs/libxfs/xfs_trans_inode.c
> > +++ b/fs/xfs/libxfs/xfs_trans_inode.c
> > @@ -71,6 +71,8 @@ xfs_trans_ichgtime(
> >  		inode->i_ctime = tv;
> >  	if (flags & XFS_ICHGTIME_CREATE)
> >  		ip->i_crtime = tv;
> > +	if (flags & (XFS_ICHGTIME_MOD|XFS_ICHGTIME_CHG))
> > +		inode_inc_iversion(inode);
> >  }
> >  
> >  /*
> > @@ -116,20 +118,7 @@ xfs_trans_log_inode(
> >  		spin_unlock(&inode->i_lock);
> >  	}
> >  
> > -	/*
> > -	 * First time we log the inode in a transaction, bump the inode change
> > -	 * counter if it is configured for this to occur. While we have the
> > -	 * inode locked exclusively for metadata modification, we can usually
> > -	 * avoid setting XFS_ILOG_CORE if no one has queried the value since
> > -	 * the last time it was incremented. If we have XFS_ILOG_CORE already
> > -	 * set however, then go ahead and bump the i_version counter
> > -	 * unconditionally.
> > -	 */
> > -	if (!test_and_set_bit(XFS_LI_DIRTY, &iip->ili_item.li_flags)) {
> > -		if (IS_I_VERSION(inode) &&
> > -		    inode_maybe_inc_iversion(inode, flags & XFS_ILOG_CORE))
> > -			iversion_flags = XFS_ILOG_CORE;
> > -	}
> > +	set_bit(XFS_LI_DIRTY, &iip->ili_item.li_flags);
> >  
> >  	/*
> >  	 * If we're updating the inode core or the timestamps and it's possible
> > diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
> > index 45518b8c613c..162e044c7f56 100644
> > --- a/fs/xfs/xfs_iops.c
> > +++ b/fs/xfs/xfs_iops.c
> > @@ -718,6 +718,7 @@ xfs_setattr_nonsize(
> >  	}
> >  
> >  	setattr_copy(mnt_userns, inode, iattr);
> > +	inode_inc_iversion(inode);
> >  	xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
> >  
> >  	XFS_STATS_INC(mp, xs_ig_attrchg);
> > @@ -943,6 +944,7 @@ xfs_setattr_size(
> >  
> >  	ASSERT(!(iattr->ia_valid & (ATTR_UID | ATTR_GID)));
> >  	setattr_copy(mnt_userns, inode, iattr);
> > +	inode_inc_iversion(inode);
> >  	xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
> >  
> >  	XFS_STATS_INC(mp, xs_ig_attrchg);
> > @@ -1047,6 +1049,8 @@ xfs_vn_update_time(
> >  		inode->i_mtime = *now;
> >  	if (flags & S_ATIME)
> >  		inode->i_atime = *now;
> > +	if ((flags & S_VERSION) && inode_maybe_inc_iversion(inode, false))
> > +		log_flags |= XFS_ILOG_CORE;
> >  
> >  	xfs_trans_ijoin(tp, ip, XFS_ILOCK_EXCL);
> >  	xfs_trans_log_inode(tp, ip, log_flags);
> > -- 
> > 2.37.2
> >

David Wysochanski Aug. 16, 2022, 5:14 p.m. UTC | #3

On Tue, Aug 16, 2022 at 9:19 AM Jeff Layton <jlayton@kernel.org> wrote:
>
> The i_version in xfs_trans_log_inode is bumped for any inode update,
> including atime-only updates due to reads. We don't want to record those
> in the i_version, as they don't represent "real" changes. Remove that
> callsite.
>
> In xfs_vn_update_time, if S_VERSION is flagged, then attempt to bump the
> i_version and turn on XFS_ILOG_CORE if it happens. In
> xfs_trans_ichgtime, update the i_version if the mtime or ctime are being
> updated.
>
> Cc: Darrick J. Wong <darrick.wong@oracle.com>
> Cc: Dave Chinner <david@fromorbit.com>
> Signed-off-by: Jeff Layton <jlayton@kernel.org>
> ---
>  fs/xfs/libxfs/xfs_trans_inode.c | 17 +++--------------
>  fs/xfs/xfs_iops.c               |  4 ++++
>  2 files changed, 7 insertions(+), 14 deletions(-)
>
> diff --git a/fs/xfs/libxfs/xfs_trans_inode.c b/fs/xfs/libxfs/xfs_trans_inode.c
> index 8b5547073379..78bf7f491462 100644
> --- a/fs/xfs/libxfs/xfs_trans_inode.c
> +++ b/fs/xfs/libxfs/xfs_trans_inode.c
> @@ -71,6 +71,8 @@ xfs_trans_ichgtime(
>                 inode->i_ctime = tv;
>         if (flags & XFS_ICHGTIME_CREATE)
>                 ip->i_crtime = tv;
> +       if (flags & (XFS_ICHGTIME_MOD|XFS_ICHGTIME_CHG))
> +               inode_inc_iversion(inode);
>  }
>
>  /*
> @@ -116,20 +118,7 @@ xfs_trans_log_inode(
>                 spin_unlock(&inode->i_lock);
>         }
>
> -       /*
> -        * First time we log the inode in a transaction, bump the inode change
> -        * counter if it is configured for this to occur. While we have the
> -        * inode locked exclusively for metadata modification, we can usually
> -        * avoid setting XFS_ILOG_CORE if no one has queried the value since
> -        * the last time it was incremented. If we have XFS_ILOG_CORE already
> -        * set however, then go ahead and bump the i_version counter
> -        * unconditionally.
> -        */
> -       if (!test_and_set_bit(XFS_LI_DIRTY, &iip->ili_item.li_flags)) {
> -               if (IS_I_VERSION(inode) &&
> -                   inode_maybe_inc_iversion(inode, flags & XFS_ILOG_CORE))
> -                       iversion_flags = XFS_ILOG_CORE;
> -       }
> +       set_bit(XFS_LI_DIRTY, &iip->ili_item.li_flags);
>
>         /*
>          * If we're updating the inode core or the timestamps and it's possible
> diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
> index 45518b8c613c..162e044c7f56 100644
> --- a/fs/xfs/xfs_iops.c
> +++ b/fs/xfs/xfs_iops.c
> @@ -718,6 +718,7 @@ xfs_setattr_nonsize(
>         }
>
>         setattr_copy(mnt_userns, inode, iattr);
> +       inode_inc_iversion(inode);
>         xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
>
>         XFS_STATS_INC(mp, xs_ig_attrchg);
> @@ -943,6 +944,7 @@ xfs_setattr_size(
>
>         ASSERT(!(iattr->ia_valid & (ATTR_UID | ATTR_GID)));
>         setattr_copy(mnt_userns, inode, iattr);
> +       inode_inc_iversion(inode);
>         xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
>
>         XFS_STATS_INC(mp, xs_ig_attrchg);
> @@ -1047,6 +1049,8 @@ xfs_vn_update_time(
>                 inode->i_mtime = *now;
>         if (flags & S_ATIME)
>                 inode->i_atime = *now;
> +       if ((flags & S_VERSION) && inode_maybe_inc_iversion(inode, false))
> +               log_flags |= XFS_ILOG_CORE;
>
>         xfs_trans_ijoin(tp, ip, XFS_ILOCK_EXCL);
>         xfs_trans_log_inode(tp, ip, log_flags);
> --
> 2.37.2
>

I have a test (details below) that shows an open issue with NFSv4.x +
fscache where an xfs exported filesystem would trigger unnecessary
over the wire READs after a umount/mount cycle of the NFS mount.  I
previously tracked this down to atime updates, but never followed
through on any patch.  Now that Jeff worked it out and this patch is
under review, I built 5.19 vanilla, retested, then built 5.19 + this
patch and verified the problem is fixed.
You can add:
Tested-by: Dave Wysochanski <dwysocha@redhat.com>



# ./t0_bz1913591.sh 4.1 xfs relatime
Setting NFS vers=4.1 filesystem to xfs and mount options relatime,rw
 0. On NFS server, setup export with xfs filesystem on loop device
/dev/loop0 /export/dir1 xfs
rw,seclabel,relatime,attr2,inode64,logbufs=8,logbsize=32k,noquota 0 0
 1. On NFS client, install and enable cachefilesd
 2. On NFS client, mount -o vers=4.1,fsc 127.0.0.1:/export/dir1 /mnt
 3. On NFS client, dd if=/dev/zero of=/mnt/file1.bin bs=4096 count=1
 4. On NFS client, echo 3 > /proc/sys/vm/drop_caches
 5. On NFS client, dd if=/mnt/file1.bin of=/dev/null (read into fscache)
 6. On NFS client, umount /mnt
 7. On NFS client, mount -o vers=4.1,fsc 127.0.0.1:/export/dir1 /mnt
 8. On NFS client, repeat steps 4-5 (read from fscache)
 9. On NFS client, check for READ ops (1st number) > 0 in /proc/self/mountstats
Found 4200 NFS READ and READ_PLUS ops in /proc/self/mountstats > 0
                READ: 1 1 0 220 4200 0 0 1 0
           READ_PLUS: 0 0 0 0 0 0 0 0 0
FAILED TEST ./t0_bz1913591.sh on kernel 5.19.0 with NFS vers=4.1
exported filesystem xfs options relatime,rw


# ./t0_bz1913591.sh 4.1 xfs relatime
Setting NFS vers=4.1 filesystem to xfs and mount options relatime,rw
 0. On NFS server, setup export with xfs filesystem on loop device
/dev/loop0 /export/dir1 xfs
rw,seclabel,relatime,attr2,inode64,logbufs=8,logbsize=32k,noquota 0 0
 1. On NFS client, install and enable cachefilesd
 2. On NFS client, mount -o vers=4.1,fsc 127.0.0.1:/export/dir1 /mnt
 3. On NFS client, dd if=/dev/zero of=/mnt/file1.bin bs=4096 count=1
 4. On NFS client, echo 3 > /proc/sys/vm/drop_caches
 5. On NFS client, dd if=/mnt/file1.bin of=/dev/null (read into fscache)
 6. On NFS client, umount /mnt
 7. On NFS client, mount -o vers=4.1,fsc 127.0.0.1:/export/dir1 /mnt
 8. On NFS client, repeat steps 4-5 (read from fscache)
 9. On NFS client, check for READ ops (1st number) > 0 in /proc/self/mountstats
10. On NFS client, check /proc/fs/fscache/stats fscache reads incrementing
PASSED TEST ./t0_bz1913591.sh on kernel 5.19.0i_version+ with NFS
vers=4.1 exported filesystem xfs options relatime,rw

Dave Chinner Aug. 16, 2022, 10:42 p.m. UTC | #4

On Tue, Aug 16, 2022 at 11:58:06AM -0400, Jeff Layton wrote:
> On Tue, 2022-08-16 at 08:43 -0700, Darrick J. Wong wrote:
> > On Tue, Aug 16, 2022 at 09:17:36AM -0400, Jeff Layton wrote:
> > > The i_version in xfs_trans_log_inode is bumped for any inode update,
> > > including atime-only updates due to reads. We don't want to record those
> > > in the i_version, as they don't represent "real" changes. Remove that
> > > callsite.
> > > 
> > > In xfs_vn_update_time, if S_VERSION is flagged, then attempt to bump the
> > > i_version and turn on XFS_ILOG_CORE if it happens. In
> > > xfs_trans_ichgtime, update the i_version if the mtime or ctime are being
> > > updated.
> > 
> > What about operations that don't touch the mtime but change the file
> > metadata anyway?  There are a few of those, like the blockgc garbage
> > collector, deduperange, and the defrag tool.
> > 
> 
> Do those change the c/mtime at all?
> 
> It's possible we're missing some places that should change the i_version
> as well. We may need some more call sites.
> 
> > Zooming out a bit -- what does i_version signal, concretely?  I thought
> > it was used by nfs (and maybe ceph?) to signal to clients that the file
> > on the server has moved on, and the client needs to invalidate its
> > caches.  I thought afs had a similar generation counter, though it's
> > only used to cache file data, not metadata?  Does an i_version change
> > cause all of them to invalidate caches, or is there more behavior I
> > don't know about?
> > 
> 
> For NFS, it indicates a change to the change attribute indicates that
> there has been a change to the data or metadata for the file. atime
> changes due to reads are specifically exempted from this, but we do bump
> the i_version if someone (e.g.) changes the atime via utimes(). 

We have relatime behaviour to optimise away unnecessary atime
updates on reads.  Trying to explicitly exclude i_version from atime
updates in one filesystem just because NFS doesn't need that
information seems ....  misguided.  The -on disk- i_version
field behaviour is defined by the filesystem implementation, not the
NFS requirements.

> The NFS client will generally invalidate its caches for the inode when
> it notices a change attribute change.
>
> FWIW, AFS may not meet this standard since it doesn't generally
> increment the counter on metadata changes. It may turn out that we don't
> want to expose this to the AFS client due to that (or maybe come up with
> some way to indicate this difference).

In XFS, we've defined the on-disk i_version field to mean
"increments with any persistent inode data or metadata change",
regardless of what the high level applications that use i_version
might actually require.

That some network filesystem might only need a subset of the
metadata to be covered by i_version is largely irrelevant - if we
don't cover every persistent inode metadata change with i_version,
then applications that *need* stuff like atime change notification
can't be supported.

> > Does that mean that we should bump i_version for any file data or
> > attribute that could be queried or observed by userspace?  In which case
> > I suppose this change is still correct, even if it relaxes i_version
> > updates from "any change to the inode whatsoever" to "any change that
> > would bump mtime".  Unless FIEMAP is part of "attributes observed by
> > userspace".
> > 
> > (The other downside I can see is that now we have to remember to bump
> > timestamps for every new file operation we add, unlike the current code
> > which is centrally located in xfs_trans_log_inode.)
> > 
> 
> The main reason for the change attribute in NFS was that NFSv3 is
> plagued with cache-coherency problems due to coarse-grained timestamp
> granularity. It was conceived as a way to indicate that the inode had
> changed without relying on timestamps.

Yes, and the most important design consideration for a filesystem is
that it -must be persistent-. The constraints on i_version are much
stricter than timestamps, and they are directly related to how the
filesystem persists metadata changes, not how metadata is changed or
accessed in memory.

> In practice, we want to bump the i_version counter whenever the ctime or
> mtime would be changed.

What about O_NOCMTIME modifications? What about lazytime
filesystems? These explicilty avoid or delay persisten c/mtime
updates, and that means bumping i_version only based on c/mtime
updates cannot be relied on. i_version is supposed to track user
visible data and metadata changes, *not timestamp updates*.

> > > Cc: Darrick J. Wong <darrick.wong@oracle.com>
> > > Cc: Dave Chinner <david@fromorbit.com>
> > > Signed-off-by: Jeff Layton <jlayton@kernel.org>
> > > ---
> > >  fs/xfs/libxfs/xfs_trans_inode.c | 17 +++--------------
> > >  fs/xfs/xfs_iops.c               |  4 ++++
> > >  2 files changed, 7 insertions(+), 14 deletions(-)
> > > 
> > > diff --git a/fs/xfs/libxfs/xfs_trans_inode.c b/fs/xfs/libxfs/xfs_trans_inode.c
> > > index 8b5547073379..78bf7f491462 100644
> > > --- a/fs/xfs/libxfs/xfs_trans_inode.c
> > > +++ b/fs/xfs/libxfs/xfs_trans_inode.c
> > > @@ -71,6 +71,8 @@ xfs_trans_ichgtime(
> > >  		inode->i_ctime = tv;
> > >  	if (flags & XFS_ICHGTIME_CREATE)
> > >  		ip->i_crtime = tv;
> > > +	if (flags & (XFS_ICHGTIME_MOD|XFS_ICHGTIME_CHG))
> > > +		inode_inc_iversion(inode);
> > >  }

That looks wrong - this is not the only path through XFS that
modifies timestamps, and I have to ask why this needs to be an
explicit i_version bump given that nobody may have looked at
i_version since the last time it was updated?.

What about xfs_fs_dirty_inode() when we actually persist lazytime
in-memory timestamp updates? We didn't bump i_version when setting
I_DIRTY_TIME, and this patch now removes the mechanism that is used
to bump iversion if it is needed when we persist those lazytime
updates.....

> > >  /*
> > > @@ -116,20 +118,7 @@ xfs_trans_log_inode(
> > >  		spin_unlock(&inode->i_lock);
> > >  	}
> > >  
> > > -	/*
> > > -	 * First time we log the inode in a transaction, bump the inode change
> > > -	 * counter if it is configured for this to occur. While we have the
> > > -	 * inode locked exclusively for metadata modification, we can usually
> > > -	 * avoid setting XFS_ILOG_CORE if no one has queried the value since
> > > -	 * the last time it was incremented. If we have XFS_ILOG_CORE already
> > > -	 * set however, then go ahead and bump the i_version counter
> > > -	 * unconditionally.
> > > -	 */
> > > -	if (!test_and_set_bit(XFS_LI_DIRTY, &iip->ili_item.li_flags)) {
> > > -		if (IS_I_VERSION(inode) &&
> > > -		    inode_maybe_inc_iversion(inode, flags & XFS_ILOG_CORE))
> > > -			iversion_flags = XFS_ILOG_CORE;
> > > -	}
> > > +	set_bit(XFS_LI_DIRTY, &iip->ili_item.li_flags);

.... and this removes the sweep that captures in-memory timestamp
and i_version peeks between any persistent inode metadata
modifications that have been made, regardless of whether i_version
has already been bumped for them or not.

IOws, this seems to rely on every future inode modification in XFS
calling xfs_trans_ichgtime() to bump i_version to sweep previous VFS
in-memory timestamp updates that this inode modification captures
and persists to disk.

This seems fragile and error prone - it's relying on the
developers always getting timestamp and iversion updates correct,
rather the code always guaranteeing that it captures timestamp and
iversion updates without any extra effort.

Hence, I don't think that trying to modify how filesystems persist
and maintain i_version coherency because NFS "doesn't need i_version
to cover atime updates" is the wrong approach. On-disk i_version
coherency has to work for more than just one NFS implementation
(especially now i_version will be exported to userspace!). 
Persistent atime updates are already optimised away by relatime, and
so I think that any further atime filtering is largely a NFS
application layer problem and not something that should be solved by
changing the on-disk definition of back end filesystem structure
persistence.

Cheers,

Dave.

Dave Chinner Aug. 16, 2022, 11:37 p.m. UTC | #5

On Tue, Aug 16, 2022 at 01:14:55PM -0400, David Wysochanski wrote:
> On Tue, Aug 16, 2022 at 9:19 AM Jeff Layton <jlayton@kernel.org> wrote:
> >
> > The i_version in xfs_trans_log_inode is bumped for any inode update,
> > including atime-only updates due to reads. We don't want to record those
> > in the i_version, as they don't represent "real" changes. Remove that
> > callsite.
> >
> > In xfs_vn_update_time, if S_VERSION is flagged, then attempt to bump the
> > i_version and turn on XFS_ILOG_CORE if it happens. In
> > xfs_trans_ichgtime, update the i_version if the mtime or ctime are being
> > updated.
> >
> > Cc: Darrick J. Wong <darrick.wong@oracle.com>
> > Cc: Dave Chinner <david@fromorbit.com>
> > Signed-off-by: Jeff Layton <jlayton@kernel.org>
> > ---
> >  fs/xfs/libxfs/xfs_trans_inode.c | 17 +++--------------
> >  fs/xfs/xfs_iops.c               |  4 ++++
> >  2 files changed, 7 insertions(+), 14 deletions(-)
> >
> > diff --git a/fs/xfs/libxfs/xfs_trans_inode.c b/fs/xfs/libxfs/xfs_trans_inode.c
> > index 8b5547073379..78bf7f491462 100644
> > --- a/fs/xfs/libxfs/xfs_trans_inode.c
> > +++ b/fs/xfs/libxfs/xfs_trans_inode.c
> > @@ -71,6 +71,8 @@ xfs_trans_ichgtime(
> >                 inode->i_ctime = tv;
> >         if (flags & XFS_ICHGTIME_CREATE)
> >                 ip->i_crtime = tv;
> > +       if (flags & (XFS_ICHGTIME_MOD|XFS_ICHGTIME_CHG))
> > +               inode_inc_iversion(inode);
> >  }
> >
> >  /*
> > @@ -116,20 +118,7 @@ xfs_trans_log_inode(
> >                 spin_unlock(&inode->i_lock);
> >         }
> >
> > -       /*
> > -        * First time we log the inode in a transaction, bump the inode change
> > -        * counter if it is configured for this to occur. While we have the
> > -        * inode locked exclusively for metadata modification, we can usually
> > -        * avoid setting XFS_ILOG_CORE if no one has queried the value since
> > -        * the last time it was incremented. If we have XFS_ILOG_CORE already
> > -        * set however, then go ahead and bump the i_version counter
> > -        * unconditionally.
> > -        */
> > -       if (!test_and_set_bit(XFS_LI_DIRTY, &iip->ili_item.li_flags)) {
> > -               if (IS_I_VERSION(inode) &&
> > -                   inode_maybe_inc_iversion(inode, flags & XFS_ILOG_CORE))
> > -                       iversion_flags = XFS_ILOG_CORE;
> > -       }
> > +       set_bit(XFS_LI_DIRTY, &iip->ili_item.li_flags);
> >
> >         /*
> >          * If we're updating the inode core or the timestamps and it's possible
> > diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
> > index 45518b8c613c..162e044c7f56 100644
> > --- a/fs/xfs/xfs_iops.c
> > +++ b/fs/xfs/xfs_iops.c
> > @@ -718,6 +718,7 @@ xfs_setattr_nonsize(
> >         }
> >
> >         setattr_copy(mnt_userns, inode, iattr);
> > +       inode_inc_iversion(inode);
> >         xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
> >
> >         XFS_STATS_INC(mp, xs_ig_attrchg);
> > @@ -943,6 +944,7 @@ xfs_setattr_size(
> >
> >         ASSERT(!(iattr->ia_valid & (ATTR_UID | ATTR_GID)));
> >         setattr_copy(mnt_userns, inode, iattr);
> > +       inode_inc_iversion(inode);
> >         xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
> >
> >         XFS_STATS_INC(mp, xs_ig_attrchg);
> > @@ -1047,6 +1049,8 @@ xfs_vn_update_time(
> >                 inode->i_mtime = *now;
> >         if (flags & S_ATIME)
> >                 inode->i_atime = *now;
> > +       if ((flags & S_VERSION) && inode_maybe_inc_iversion(inode, false))
> > +               log_flags |= XFS_ILOG_CORE;
> >
> >         xfs_trans_ijoin(tp, ip, XFS_ILOCK_EXCL);
> >         xfs_trans_log_inode(tp, ip, log_flags);
> > --
> > 2.37.2
> >
> 
> I have a test (details below) that shows an open issue with NFSv4.x +
> fscache where an xfs exported filesystem would trigger unnecessary
> over the wire READs after a umount/mount cycle of the NFS mount.  I
> previously tracked this down to atime updates, but never followed
> through on any patch.  Now that Jeff worked it out and this patch is
> under review, I built 5.19 vanilla, retested, then built 5.19 + this
> patch and verified the problem is fixed.

And so the question that needs to be answered is "why isn't relatime
working for this workload to avoid unnecessary atime updates"?

Which then makes me ask "what's changing atime on the server between
client side reads"?

Which then makes me wonder "what's actually changing iversion on the
server?" because I don't think atime is the issue here.

I suspect that Jeff's patch is affecting this test case by removing
iversion updates when the data is written back on the server. i.e.
delayed allocation and unwritten extent conversion will no longer
bump iversion when they log the inode metadata changes associated
with extent allocation to store the data being written.  There may
be other places where Jeff's patch removes implicit iversion
updates, too, so it may not be writeback that is the issue here.

How that impacts on the observed behaviour is dependent on things I
don't know, like what cachefiles is doing in the background,
especially across NFS client unmount/mount cycles. However, this all
makes me think the "atime is updated" behaviour is an observed
symptom of something else changing iversion and/or cmtime between
reads from the server...

Cheers,

Dave.

Dave Chinner Aug. 16, 2022, 11:57 p.m. UTC | #6

On Wed, Aug 17, 2022 at 08:42:57AM +1000, Dave Chinner wrote:
> On Tue, Aug 16, 2022 at 11:58:06AM -0400, Jeff Layton wrote:
> > On Tue, 2022-08-16 at 08:43 -0700, Darrick J. Wong wrote:
> > > On Tue, Aug 16, 2022 at 09:17:36AM -0400, Jeff Layton wrote:
> > > > @@ -116,20 +118,7 @@ xfs_trans_log_inode(
> > > >  		spin_unlock(&inode->i_lock);
> > > >  	}
> > > >  
> > > > -	/*
> > > > -	 * First time we log the inode in a transaction, bump the inode change
> > > > -	 * counter if it is configured for this to occur. While we have the
> > > > -	 * inode locked exclusively for metadata modification, we can usually
> > > > -	 * avoid setting XFS_ILOG_CORE if no one has queried the value since
> > > > -	 * the last time it was incremented. If we have XFS_ILOG_CORE already
> > > > -	 * set however, then go ahead and bump the i_version counter
> > > > -	 * unconditionally.
> > > > -	 */
> > > > -	if (!test_and_set_bit(XFS_LI_DIRTY, &iip->ili_item.li_flags)) {
> > > > -		if (IS_I_VERSION(inode) &&
> > > > -		    inode_maybe_inc_iversion(inode, flags & XFS_ILOG_CORE))
> > > > -			iversion_flags = XFS_ILOG_CORE;
> > > > -	}
> > > > +	set_bit(XFS_LI_DIRTY, &iip->ili_item.li_flags);
> 
> .... and this removes the sweep that captures in-memory timestamp
> and i_version peeks between any persistent inode metadata
> modifications that have been made, regardless of whether i_version
> has already been bumped for them or not.

Which, BTW, breaks the iversion update for xfs_fs_commit_blocks()
which the pNFS server calls to inform the filesystem that the pNFS
client has finished writing data to a mapped region.

This function runs unwritten extent conversion (making the data
externally visible) and takes timestamps from the pNFS server. It
then persists all these changes, meaning that there will be
externally visible data, metadata and timestamp updates persisted to
disk by the pNFS server without an iversion update occurring.

This iversion stuff is .... complex. It's also really easy to get
wrong, and that's even before we start trying to optimise away stuff
like timestamp updates....

Cheers,

Dave.

Jeffrey Layton Aug. 17, 2022, 12:02 p.m. UTC | #7

On Wed, 2022-08-17 at 08:42 +1000, Dave Chinner wrote:
> On Tue, Aug 16, 2022 at 11:58:06AM -0400, Jeff Layton wrote:
> > On Tue, 2022-08-16 at 08:43 -0700, Darrick J. Wong wrote:
> > > On Tue, Aug 16, 2022 at 09:17:36AM -0400, Jeff Layton wrote:
> > > > The i_version in xfs_trans_log_inode is bumped for any inode update,
> > > > including atime-only updates due to reads. We don't want to record those
> > > > in the i_version, as they don't represent "real" changes. Remove that
> > > > callsite.
> > > > 
> > > > In xfs_vn_update_time, if S_VERSION is flagged, then attempt to bump the
> > > > i_version and turn on XFS_ILOG_CORE if it happens. In
> > > > xfs_trans_ichgtime, update the i_version if the mtime or ctime are being
> > > > updated.
> > > 
> > > What about operations that don't touch the mtime but change the file
> > > metadata anyway?  There are a few of those, like the blockgc garbage
> > > collector, deduperange, and the defrag tool.
> > > 
> > 
> > Do those change the c/mtime at all?
> > 
> > It's possible we're missing some places that should change the i_version
> > as well. We may need some more call sites.
> > 
> > > Zooming out a bit -- what does i_version signal, concretely?  I thought
> > > it was used by nfs (and maybe ceph?) to signal to clients that the file
> > > on the server has moved on, and the client needs to invalidate its
> > > caches.  I thought afs had a similar generation counter, though it's
> > > only used to cache file data, not metadata?  Does an i_version change
> > > cause all of them to invalidate caches, or is there more behavior I
> > > don't know about?
> > > 
> > 
> > For NFS, it indicates a change to the change attribute indicates that
> > there has been a change to the data or metadata for the file. atime
> > changes due to reads are specifically exempted from this, but we do bump
> > the i_version if someone (e.g.) changes the atime via utimes(). 
> 
> We have relatime behaviour to optimise away unnecessary atime
> updates on reads.  Trying to explicitly exclude i_version from atime
> updates in one filesystem just because NFS doesn't need that
> information seems ....  misguided.  The -on disk- i_version
> field behaviour is defined by the filesystem implementation, not the
> NFS requirements.
> 

-o relatime does not fix this.

> > The NFS client will generally invalidate its caches for the inode when
> > it notices a change attribute change.
> > 
> > FWIW, AFS may not meet this standard since it doesn't generally
> > increment the counter on metadata changes. It may turn out that we don't
> > want to expose this to the AFS client due to that (or maybe come up with
> > some way to indicate this difference).
> 
> In XFS, we've defined the on-disk i_version field to mean
> "increments with any persistent inode data or metadata change",
> regardless of what the high level applications that use i_version
> might actually require.
> 
> That some network filesystem might only need a subset of the
> metadata to be covered by i_version is largely irrelevant - if we
> don't cover every persistent inode metadata change with i_version,
> then applications that *need* stuff like atime change notification
> can't be supported.
> 
> > > Does that mean that we should bump i_version for any file data or
> > > attribute that could be queried or observed by userspace?  In which case
> > > I suppose this change is still correct, even if it relaxes i_version
> > > updates from "any change to the inode whatsoever" to "any change that
> > > would bump mtime".  Unless FIEMAP is part of "attributes observed by
> > > userspace".
> > > 
> > > (The other downside I can see is that now we have to remember to bump
> > > timestamps for every new file operation we add, unlike the current code
> > > which is centrally located in xfs_trans_log_inode.)
> > > 
> > 
> > The main reason for the change attribute in NFS was that NFSv3 is
> > plagued with cache-coherency problems due to coarse-grained timestamp
> > granularity. It was conceived as a way to indicate that the inode had
> > changed without relying on timestamps.
> 
> Yes, and the most important design consideration for a filesystem is
> that it -must be persistent-. The constraints on i_version are much
> stricter than timestamps, and they are directly related to how the
> filesystem persists metadata changes, not how metadata is changed or
> accessed in memory.
> 
> > In practice, we want to bump the i_version counter whenever the ctime or
> > mtime would be changed.
> 
> What about O_NOCMTIME modifications? What about lazytime
> filesystems? These explicilty avoid or delay persisten c/mtime
> updates, and that means bumping i_version only based on c/mtime
> updates cannot be relied on. i_version is supposed to track user
> visible data and metadata changes, *not timestamp updates*.
> 

I was speaking more about the sorts of activity that should result in
the i_version being changed, not about tracking timestamp updates. IOW,
if some activity would cause the mtime or ctime to change, then we want
to also bump the i_version.

Specifically, for NOCMTIME, I think we'd still want the i_version to
change since that option is about timestamps and not i_version.

> > > > Cc: Darrick J. Wong <darrick.wong@oracle.com>
> > > > Cc: Dave Chinner <david@fromorbit.com>
> > > > Signed-off-by: Jeff Layton <jlayton@kernel.org>
> > > > ---
> > > >  fs/xfs/libxfs/xfs_trans_inode.c | 17 +++--------------
> > > >  fs/xfs/xfs_iops.c               |  4 ++++
> > > >  2 files changed, 7 insertions(+), 14 deletions(-)
> > > > 
> > > > diff --git a/fs/xfs/libxfs/xfs_trans_inode.c b/fs/xfs/libxfs/xfs_trans_inode.c
> > > > index 8b5547073379..78bf7f491462 100644
> > > > --- a/fs/xfs/libxfs/xfs_trans_inode.c
> > > > +++ b/fs/xfs/libxfs/xfs_trans_inode.c
> > > > @@ -71,6 +71,8 @@ xfs_trans_ichgtime(
> > > >  		inode->i_ctime = tv;
> > > >  	if (flags & XFS_ICHGTIME_CREATE)
> > > >  		ip->i_crtime = tv;
> > > > +	if (flags & (XFS_ICHGTIME_MOD|XFS_ICHGTIME_CHG))
> > > > +		inode_inc_iversion(inode);
> > > >  }
> 
> That looks wrong - this is not the only path through XFS that
> modifies timestamps, and I have to ask why this needs to be an
> explicit i_version bump given that nobody may have looked at
> i_version since the last time it was updated?.
> 
> What about xfs_fs_dirty_inode() when we actually persist lazytime
> in-memory timestamp updates? We didn't bump i_version when setting
> I_DIRTY_TIME, and this patch now removes the mechanism that is used
> to bump iversion if it is needed when we persist those lazytime
> updates.....
> 
> > > >  /*
> > > > @@ -116,20 +118,7 @@ xfs_trans_log_inode(
> > > >  		spin_unlock(&inode->i_lock);
> > > >  	}
> > > >  
> > > > -	/*
> > > > -	 * First time we log the inode in a transaction, bump the inode change
> > > > -	 * counter if it is configured for this to occur. While we have the
> > > > -	 * inode locked exclusively for metadata modification, we can usually
> > > > -	 * avoid setting XFS_ILOG_CORE if no one has queried the value since
> > > > -	 * the last time it was incremented. If we have XFS_ILOG_CORE already
> > > > -	 * set however, then go ahead and bump the i_version counter
> > > > -	 * unconditionally.
> > > > -	 */
> > > > -	if (!test_and_set_bit(XFS_LI_DIRTY, &iip->ili_item.li_flags)) {
> > > > -		if (IS_I_VERSION(inode) &&
> > > > -		    inode_maybe_inc_iversion(inode, flags & XFS_ILOG_CORE))
> > > > -			iversion_flags = XFS_ILOG_CORE;
> > > > -	}
> > > > +	set_bit(XFS_LI_DIRTY, &iip->ili_item.li_flags);
> 
> .... and this removes the sweep that captures in-memory timestamp
> and i_version peeks between any persistent inode metadata
> modifications that have been made, regardless of whether i_version
> has already been bumped for them or not.
> 
> IOws, this seems to rely on every future inode modification in XFS
> calling xfs_trans_ichgtime() to bump i_version to sweep previous VFS
> in-memory timestamp updates that this inode modification captures
> and persists to disk.
> 
> This seems fragile and error prone - it's relying on the
> developers always getting timestamp and iversion updates correct,
> rather the code always guaranteeing that it captures timestamp and
> iversion updates without any extra effort.
> 
> Hence, I don't think that trying to modify how filesystems persist
> and maintain i_version coherency because NFS "doesn't need i_version
> to cover atime updates" is the wrong approach. On-disk i_version
> coherency has to work for more than just one NFS implementation
> (especially now i_version will be exported to userspace!). 
> Persistent atime updates are already optimised away by relatime, and
> so I think that any further atime filtering is largely a NFS
> application layer problem and not something that should be solved by
> changing the on-disk definition of back end filesystem structure
> persistence.
> 

Fair enough. xfs is not really in my wheelhouse so take this as a patch
that helps illustrate the problem, rather than a serious submission.

There are two consumers of the i_version counter today: the kernel NFS
server and IMA. Having the i_version reflect atime updates due to reads
harms both of those use-cases with unneeded cache invalidations on NFS
and extra measurements on IMA. It would also be problematic for userland
NFS servers such as ganesha if we end up exposing this to userland.

atime updates are really a special case when it comes to metadata (and I
maintain that they are one of the worst ideas in POSIX). The way you're
choosing to define i_version doesn't really work properly for any
current or projected use case. I'd like to see that changed.

If the concern is fragility of the code going forward, then maybe we can
go with a different approach. Would it be possible to just have
xfs_trans_log_inode skip bumping the i_version when the log transaction
is _only_ for an atime update due to reads? Maybe we could add a new
XFS_ILOG_ATIME_UPDATE flag or something and set it the appropriate
codepaths?

Jeffrey Layton Aug. 17, 2022, 12:10 p.m. UTC | #8

On Wed, 2022-08-17 at 09:37 +1000, Dave Chinner wrote:
> On Tue, Aug 16, 2022 at 01:14:55PM -0400, David Wysochanski wrote:
> > On Tue, Aug 16, 2022 at 9:19 AM Jeff Layton <jlayton@kernel.org> wrote:
> > > 
> > > The i_version in xfs_trans_log_inode is bumped for any inode update,
> > > including atime-only updates due to reads. We don't want to record those
> > > in the i_version, as they don't represent "real" changes. Remove that
> > > callsite.
> > > 
> > > In xfs_vn_update_time, if S_VERSION is flagged, then attempt to bump the
> > > i_version and turn on XFS_ILOG_CORE if it happens. In
> > > xfs_trans_ichgtime, update the i_version if the mtime or ctime are being
> > > updated.
> > > 
> > > Cc: Darrick J. Wong <darrick.wong@oracle.com>
> > > Cc: Dave Chinner <david@fromorbit.com>
> > > Signed-off-by: Jeff Layton <jlayton@kernel.org>
> > > ---
> > >  fs/xfs/libxfs/xfs_trans_inode.c | 17 +++--------------
> > >  fs/xfs/xfs_iops.c               |  4 ++++
> > >  2 files changed, 7 insertions(+), 14 deletions(-)
> > > 
> > > diff --git a/fs/xfs/libxfs/xfs_trans_inode.c b/fs/xfs/libxfs/xfs_trans_inode.c
> > > index 8b5547073379..78bf7f491462 100644
> > > --- a/fs/xfs/libxfs/xfs_trans_inode.c
> > > +++ b/fs/xfs/libxfs/xfs_trans_inode.c
> > > @@ -71,6 +71,8 @@ xfs_trans_ichgtime(
> > >                 inode->i_ctime = tv;
> > >         if (flags & XFS_ICHGTIME_CREATE)
> > >                 ip->i_crtime = tv;
> > > +       if (flags & (XFS_ICHGTIME_MOD|XFS_ICHGTIME_CHG))
> > > +               inode_inc_iversion(inode);
> > >  }
> > > 
> > >  /*
> > > @@ -116,20 +118,7 @@ xfs_trans_log_inode(
> > >                 spin_unlock(&inode->i_lock);
> > >         }
> > > 
> > > -       /*
> > > -        * First time we log the inode in a transaction, bump the inode change
> > > -        * counter if it is configured for this to occur. While we have the
> > > -        * inode locked exclusively for metadata modification, we can usually
> > > -        * avoid setting XFS_ILOG_CORE if no one has queried the value since
> > > -        * the last time it was incremented. If we have XFS_ILOG_CORE already
> > > -        * set however, then go ahead and bump the i_version counter
> > > -        * unconditionally.
> > > -        */
> > > -       if (!test_and_set_bit(XFS_LI_DIRTY, &iip->ili_item.li_flags)) {
> > > -               if (IS_I_VERSION(inode) &&
> > > -                   inode_maybe_inc_iversion(inode, flags & XFS_ILOG_CORE))
> > > -                       iversion_flags = XFS_ILOG_CORE;
> > > -       }
> > > +       set_bit(XFS_LI_DIRTY, &iip->ili_item.li_flags);
> > > 
> > >         /*
> > >          * If we're updating the inode core or the timestamps and it's possible
> > > diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
> > > index 45518b8c613c..162e044c7f56 100644
> > > --- a/fs/xfs/xfs_iops.c
> > > +++ b/fs/xfs/xfs_iops.c
> > > @@ -718,6 +718,7 @@ xfs_setattr_nonsize(
> > >         }
> > > 
> > >         setattr_copy(mnt_userns, inode, iattr);
> > > +       inode_inc_iversion(inode);
> > >         xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
> > > 
> > >         XFS_STATS_INC(mp, xs_ig_attrchg);
> > > @@ -943,6 +944,7 @@ xfs_setattr_size(
> > > 
> > >         ASSERT(!(iattr->ia_valid & (ATTR_UID | ATTR_GID)));
> > >         setattr_copy(mnt_userns, inode, iattr);
> > > +       inode_inc_iversion(inode);
> > >         xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
> > > 
> > >         XFS_STATS_INC(mp, xs_ig_attrchg);
> > > @@ -1047,6 +1049,8 @@ xfs_vn_update_time(
> > >                 inode->i_mtime = *now;
> > >         if (flags & S_ATIME)
> > >                 inode->i_atime = *now;
> > > +       if ((flags & S_VERSION) && inode_maybe_inc_iversion(inode, false))
> > > +               log_flags |= XFS_ILOG_CORE;
> > > 
> > >         xfs_trans_ijoin(tp, ip, XFS_ILOCK_EXCL);
> > >         xfs_trans_log_inode(tp, ip, log_flags);
> > > --
> > > 2.37.2
> > > 
> > 
> > I have a test (details below) that shows an open issue with NFSv4.x +
> > fscache where an xfs exported filesystem would trigger unnecessary
> > over the wire READs after a umount/mount cycle of the NFS mount.  I
> > previously tracked this down to atime updates, but never followed
> > through on any patch.  Now that Jeff worked it out and this patch is
> > under review, I built 5.19 vanilla, retested, then built 5.19 + this
> > patch and verified the problem is fixed.
> 
> And so the question that needs to be answered is "why isn't relatime
> working for this workload to avoid unnecessary atime updates"?
> 
> Which then makes me ask "what's changing atime on the server between
> client side reads"?
> 
> Which then makes me wonder "what's actually changing iversion on the
> server?" because I don't think atime is the issue here.
> 
> I suspect that Jeff's patch is affecting this test case by removing
> iversion updates when the data is written back on the server. i.e.
> delayed allocation and unwritten extent conversion will no longer
> bump iversion when they log the inode metadata changes associated
> with extent allocation to store the data being written.  There may
> be other places where Jeff's patch removes implicit iversion
> updates, too, so it may not be writeback that is the issue here.
> 
> How that impacts on the observed behaviour is dependent on things I
> don't know, like what cachefiles is doing in the background,
> especially across NFS client unmount/mount cycles. However, this all
> makes me think the "atime is updated" behaviour is an observed
> symptom of something else changing iversion and/or cmtime between
> reads from the server...
> 

You may be right here.

What I see with both noatime and relatime is that the first read after a
write to a file results in the i_version being incremented, but then it
doesn't change on subsequent reads. Write to the file again, and the
i_version will get incremented again on the next read and then not again
until there is a write.

Dave Chinner Aug. 17, 2022, 9:57 p.m. UTC | #9

On Wed, Aug 17, 2022 at 08:10:22AM -0400, Jeff Layton wrote:
> On Wed, 2022-08-17 at 09:37 +1000, Dave Chinner wrote:
> > On Tue, Aug 16, 2022 at 01:14:55PM -0400, David Wysochanski wrote:
> > > On Tue, Aug 16, 2022 at 9:19 AM Jeff Layton <jlayton@kernel.org> wrote:
> > > I have a test (details below) that shows an open issue with NFSv4.x +
> > > fscache where an xfs exported filesystem would trigger unnecessary
> > > over the wire READs after a umount/mount cycle of the NFS mount.  I
> > > previously tracked this down to atime updates, but never followed
> > > through on any patch.  Now that Jeff worked it out and this patch is
> > > under review, I built 5.19 vanilla, retested, then built 5.19 + this
> > > patch and verified the problem is fixed.
> > 
> > And so the question that needs to be answered is "why isn't relatime
> > working for this workload to avoid unnecessary atime updates"?
> > 
> > Which then makes me ask "what's changing atime on the server between
> > client side reads"?
> > 
> > Which then makes me wonder "what's actually changing iversion on the
> > server?" because I don't think atime is the issue here.
> > 
> > I suspect that Jeff's patch is affecting this test case by removing
> > iversion updates when the data is written back on the server. i.e.
> > delayed allocation and unwritten extent conversion will no longer
> > bump iversion when they log the inode metadata changes associated
> > with extent allocation to store the data being written.  There may
> > be other places where Jeff's patch removes implicit iversion
> > updates, too, so it may not be writeback that is the issue here.
> > 
> > How that impacts on the observed behaviour is dependent on things I
> > don't know, like what cachefiles is doing in the background,
> > especially across NFS client unmount/mount cycles. However, this all
> > makes me think the "atime is updated" behaviour is an observed
> > symptom of something else changing iversion and/or cmtime between
> > reads from the server...
> > 
> 
> You may be right here.
> 
> What I see with both noatime and relatime is that the first read after a
> write to a file results in the i_version being incremented, but then it
> doesn't change on subsequent reads. Write to the file again, and the
> i_version will get incremented again on the next read and then not again
> until there is a write.

Yup, that is exactly the behaviour relatime encodes at the VFS. THe
data write updates mtime (or is it ctime? doesn't matter, though)
and the very next atime access check done by the VFS sees that the
mtime is more recent than the atime, so it asks the filesystem to do
an atime update.

Cheers,

Dave.

NeilBrown Aug. 18, 2022, 12:34 a.m. UTC | #10

On Wed, 17 Aug 2022, Dave Chinner wrote:
> 
> In XFS, we've defined the on-disk i_version field to mean
> "increments with any persistent inode data or metadata change",
> regardless of what the high level applications that use i_version
> might actually require.
> 
> That some network filesystem might only need a subset of the
> metadata to be covered by i_version is largely irrelevant - if we
> don't cover every persistent inode metadata change with i_version,
> then applications that *need* stuff like atime change notification
> can't be supported.

So what you are saying is that the i_version provided by XFS does not
match the changeid semantics required by NFSv4.  Fair enough.  I guess
we shouldn't use the one to implement the other then.

Maybe we should just go back to using ctime.  ctime is *exactly* what
NFSv4 wants, as long as its granularity is sufficient to catch every
single change.  Presumably XFS doesn't try to ensure this.  How hard
would it be to get any ctime update to add at least one nanosecond?
This would be enabled by a mount option, or possibly be a direct request
from nfsd.

<rant>NFSv4 changeid is really one of the more horrible parts of the
design</rant>

NeilBrown

Dave Chinner Aug. 18, 2022, 1:07 a.m. UTC | #11

On Wed, Aug 17, 2022 at 08:02:23AM -0400, Jeff Layton wrote:
> On Wed, 2022-08-17 at 08:42 +1000, Dave Chinner wrote:
> > On Tue, Aug 16, 2022 at 11:58:06AM -0400, Jeff Layton wrote:
> > > On Tue, 2022-08-16 at 08:43 -0700, Darrick J. Wong wrote:
> > > > On Tue, Aug 16, 2022 at 09:17:36AM -0400, Jeff Layton wrote:
> > > > > The i_version in xfs_trans_log_inode is bumped for any inode update,
> > > > > including atime-only updates due to reads. We don't want to record those
> > > > > in the i_version, as they don't represent "real" changes. Remove that
> > > > > callsite.
> > > > > 
> > > > > In xfs_vn_update_time, if S_VERSION is flagged, then attempt to bump the
> > > > > i_version and turn on XFS_ILOG_CORE if it happens. In
> > > > > xfs_trans_ichgtime, update the i_version if the mtime or ctime are being
> > > > > updated.
> > > > 
> > > > What about operations that don't touch the mtime but change the file
> > > > metadata anyway?  There are a few of those, like the blockgc garbage
> > > > collector, deduperange, and the defrag tool.
> > > > 
> > > 
> > > Do those change the c/mtime at all?
> > > 
> > > It's possible we're missing some places that should change the i_version
> > > as well. We may need some more call sites.
> > > 
> > > > Zooming out a bit -- what does i_version signal, concretely?  I thought
> > > > it was used by nfs (and maybe ceph?) to signal to clients that the file
> > > > on the server has moved on, and the client needs to invalidate its
> > > > caches.  I thought afs had a similar generation counter, though it's
> > > > only used to cache file data, not metadata?  Does an i_version change
> > > > cause all of them to invalidate caches, or is there more behavior I
> > > > don't know about?
> > > > 
> > > 
> > > For NFS, it indicates a change to the change attribute indicates that
> > > there has been a change to the data or metadata for the file. atime
> > > changes due to reads are specifically exempted from this, but we do bump
> > > the i_version if someone (e.g.) changes the atime via utimes(). 
> > 
> > We have relatime behaviour to optimise away unnecessary atime
> > updates on reads.  Trying to explicitly exclude i_version from atime
> > updates in one filesystem just because NFS doesn't need that
> > information seems ....  misguided.  The -on disk- i_version
> > field behaviour is defined by the filesystem implementation, not the
> > NFS requirements.
> 
> -o relatime does not fix this.

So why not fix -o relatime to handle this? That way the fix works
for all filesystems and doesn't require hacking around what the VFS
has told us to do in every filesystem.

i.e. the VFS told us to update atime, we updated atime and that is a
persistent metadata change to the inode. Hence a filesystem with a
persistent change counter has to bump the change counter because
we've been asked by the VFS to make a persistent metadata change.

If you want atime updates to not make persistent changes to on disk
metadata, then change the relatime implementation so that it doesn't
ask the filesystem to update the on-disk atime.

Essentially, what I'm hearing is that NFS wants atime updates to
behave like lazytime, not like relatime. With lazytime, atime always
gets updated in memory, but it is not written back to the filesystem
until a timeout or some other modification is made to the inode or
file data. THe filesystem doesn't bump iversion until the timestamp
gets written back in the lazytime case.

IOWs, we already have a mechanism in the kernel for making atime
updates behave exactly as NFS wants: -o lazytime.

> > > The NFS client will generally invalidate its caches for the inode when
> > > it notices a change attribute change.
> > > 
> > > FWIW, AFS may not meet this standard since it doesn't generally
> > > increment the counter on metadata changes. It may turn out that we don't
> > > want to expose this to the AFS client due to that (or maybe come up with
> > > some way to indicate this difference).
> > 
> > In XFS, we've defined the on-disk i_version field to mean
> > "increments with any persistent inode data or metadata change",
> > regardless of what the high level applications that use i_version
> > might actually require.
> > 
> > That some network filesystem might only need a subset of the
> > metadata to be covered by i_version is largely irrelevant - if we
> > don't cover every persistent inode metadata change with i_version,
> > then applications that *need* stuff like atime change notification
> > can't be supported.
> > 
> > > > Does that mean that we should bump i_version for any file data or
> > > > attribute that could be queried or observed by userspace?  In which case
> > > > I suppose this change is still correct, even if it relaxes i_version
> > > > updates from "any change to the inode whatsoever" to "any change that
> > > > would bump mtime".  Unless FIEMAP is part of "attributes observed by
> > > > userspace".
> > > > 
> > > > (The other downside I can see is that now we have to remember to bump
> > > > timestamps for every new file operation we add, unlike the current code
> > > > which is centrally located in xfs_trans_log_inode.)
> > > > 
> > > 
> > > The main reason for the change attribute in NFS was that NFSv3 is
> > > plagued with cache-coherency problems due to coarse-grained timestamp
> > > granularity. It was conceived as a way to indicate that the inode had
> > > changed without relying on timestamps.
> > 
> > Yes, and the most important design consideration for a filesystem is
> > that it -must be persistent-. The constraints on i_version are much
> > stricter than timestamps, and they are directly related to how the
> > filesystem persists metadata changes, not how metadata is changed or
> > accessed in memory.
> > 
> > > In practice, we want to bump the i_version counter whenever the ctime or
> > > mtime would be changed.
> > 
> > What about O_NOCMTIME modifications? What about lazytime
> > filesystems? These explicilty avoid or delay persisten c/mtime
> > updates, and that means bumping i_version only based on c/mtime
> > updates cannot be relied on. i_version is supposed to track user
> > visible data and metadata changes, *not timestamp updates*.
> 
> I was speaking more about the sorts of activity that should result in
> the i_version being changed, not about tracking timestamp updates. IOW,
> if some activity would cause the mtime or ctime to change, then we want
> to also bump the i_version.
> 
> Specifically, for NOCMTIME, I think we'd still want the i_version to
> change since that option is about timestamps and not i_version.

Exactly my point: this is what XFS currently does. It is also what
your proposed changes break by tying i_version updates to c/mtime
updates.

> > Hence, I don't think that trying to modify how filesystems persist
> > and maintain i_version coherency because NFS "doesn't need i_version
> > to cover atime updates" is the wrong approach. On-disk i_version
> > coherency has to work for more than just one NFS implementation
> > (especially now i_version will be exported to userspace!). 
> > Persistent atime updates are already optimised away by relatime, and
> > so I think that any further atime filtering is largely a NFS
> > application layer problem and not something that should be solved by
> > changing the on-disk definition of back end filesystem structure
> > persistence.
> > 
> 
> Fair enough. xfs is not really in my wheelhouse so take this as a patch
> that helps illustrate the problem, rather than a serious submission.
> 
> There are two consumers of the i_version counter today: the kernel NFS
> server and IMA. Having the i_version reflect atime updates due to reads
> harms both of those use-cases with unneeded cache invalidations on NFS
> and extra measurements on IMA. It would also be problematic for userland
> NFS servers such as ganesha if we end up exposing this to userland.

So you're wanting define an exact behaviour for atime vs I_VERSION
where atime doesn't bump iversion.

However, the definition we baked into the XFS on-disk format is that
the on-disk iversion filed changes for every persistent change made
to the inode. That includes atime updates. We did this for two
reasons - the first is so that we could support any application that
needed change detection, and the second was that knowing how many
changes have occurred to an inode is extremely useful for forensic
purposes (especially given the ability to use O_NOCMTIME to modify
file data).

> atime updates are really a special case when it comes to metadata (and I
> maintain that they are one of the worst ideas in POSIX). The way you're
> choosing to define i_version doesn't really work properly for any
> current or projected use case. I'd like to see that changed.

We chose to do that a decade ago, knowing that it is the
responsibility of the VFS to avoid unnecessary atime updates, not
the responsibility of the filesystem. That was the whole point of
introducing the relatime functionality: fix the problem at the VFS,
not have to work around generic atime behaviour in every filesystem.

> If the concern is fragility of the code going forward, then maybe we can
> go with a different approach. Would it be possible to just have
> xfs_trans_log_inode skip bumping the i_version when the log transaction
> is _only_ for an atime update due to reads? Maybe we could add a new
> XFS_ILOG_ATIME_UPDATE flag or something and set it the appropriate
> codepaths?

No, I don't think this is something we should be hacking around in
individual filesystems. If the VFS tells us we should be updating
atime, we should be updating it and bumping persistent change
counters because *that's what the VFS asked us to do*.

IOWs, if NFS wants atime to be considered "in memory only" as per
the lazytime behaviour, then that behaviour needs to be
supported/changed at the VFS, not at the individual fileystem level.

You could add a subset of SB_LAZYTIME functionality just for atime
updates, handle that entirely within the existing lazytime
infrastructure. THis means the VFS supports non-persistent,
best-effort, non-persistent atime updates directly. The VFS will not
ack filesystems to persist atime updates the moment they are made,
it will tell the filesystem via ->dirty_inode() that it needs to
persist the in-memory timestamps.

This gives all the filesystems the same atime behaviour. If the VFS
is going to expose the persistent change counter to userspace, we
need to have consistent, VFS enforced rules on what is coverred by
the change counter. If atime is not covered by the ipersistent
change counter, then the VFS must not ask the filesystems to persist
atime changes every time atime changes.

Then all filesystems will avoid on-disk atime updates as much as
possible, whilst still gathering them correctly when other
modifications are made.

Until we've got a definition of what this persistent change counter
actually describes and guarantees for userspace, changing filesystem
implementations is premature. And if it is decided that atime is not
going to be considered a "persistent change" by itself, then that
behaviour needs to be encoded at the VFS, not in individual
filesystems....

Cheers,

Dave.

Trond Myklebust Aug. 18, 2022, 1:11 a.m. UTC | #12

On Wed, 2022-08-17 at 08:42 +1000, Dave Chinner wrote:
> 
> In XFS, we've defined the on-disk i_version field to mean
> "increments with any persistent inode data or metadata change",
> regardless of what the high level applications that use i_version
> might actually require.
> 
> That some network filesystem might only need a subset of the
> metadata to be covered by i_version is largely irrelevant - if we
> don't cover every persistent inode metadata change with i_version,
> then applications that *need* stuff like atime change notification
> can't be supported.

OK, I'll bite...

What real world application are we talking about here, and why can't it
just read both the atime + i_version if it cares?

The value of the change attribute lies in the fact that it gives you
ctime semantics without the time resolution limitation.
i.e. if the change attribute has changed, then you know that someone
has explicitly modified either the file data or the file metadata (with
the emphasis being on the word "explicitly").
Implicit changes such as the mtime change due to a write are reflected
only because they are necessarily also accompanied by an explicit
change to the data contents of the file.
Implicit changes, such as the atime changes due to a read are not
reflected in the change attribute because there is no explicit change
being made by an application.

Any implicit change in the metadata can be derived by just reading the
attribute in question: there is only 1 (atime) and it is supposed to be
monotonically increasing, hence is dead simple to detect.

Dave Chinner Aug. 18, 2022, 1:32 a.m. UTC | #13

On Thu, Aug 18, 2022 at 10:34:40AM +1000, NeilBrown wrote:
> On Wed, 17 Aug 2022, Dave Chinner wrote:
> > 
> > In XFS, we've defined the on-disk i_version field to mean
> > "increments with any persistent inode data or metadata change",
> > regardless of what the high level applications that use i_version
> > might actually require.
> > 
> > That some network filesystem might only need a subset of the
> > metadata to be covered by i_version is largely irrelevant - if we
> > don't cover every persistent inode metadata change with i_version,
> > then applications that *need* stuff like atime change notification
> > can't be supported.
> 
> So what you are saying is that the i_version provided by XFS does not
> match the changeid semantics required by NFSv4.  Fair enough.  I guess
> we shouldn't use the one to implement the other then.

True, but there's more nuance to it than that. We can already
provide the NFSv4 requirements without filesystem modification -
just mount the filesytsem with "-o lazytime".

There's *always* a difference between what the filesystem implements
and what an application require. The filesystem implements what the
VFS asks it to do, not what the application needs. The VFS provides
the functionality they applications require, not the filesystems.

atime update filtering has long been VFS functionality - we do it
there so behaviour is common across all filesystems. Therefore, if
the NFS application wants atime to always behave as if lazytime is
enabled, that filtering should be done at the VFS atime filtering
layer.

Why should we duplicate VFS atime update filtering in every
filesystem for the default relatime behaviour when the VFS lazytime
filter already provides this filtering function to everyone?

> Maybe we should just go back to using ctime.  ctime is *exactly* what
> NFSv4 wants, as long as its granularity is sufficient to catch every
> single change.  Presumably XFS doesn't try to ensure this.  How hard
> would it be to get any ctime update to add at least one nanosecond?
> This would be enabled by a mount option, or possibly be a direct request
> from nfsd.

We can't rely on ctime to be changed during a modification because
O_NOCMTIME exists to enable "user invisible" modifications to be
made. On XFS these still bump iversion, so while they are invisible
to the user, they are still tracked by the filesystem and anything
that wants to know if the inode data/metadata changed.

Cheers,

Dave.

NeilBrown Aug. 18, 2022, 1:52 a.m. UTC | #14

On Thu, 18 Aug 2022, Dave Chinner wrote:
> 
> > Maybe we should just go back to using ctime.  ctime is *exactly* what
> > NFSv4 wants, as long as its granularity is sufficient to catch every
> > single change.  Presumably XFS doesn't try to ensure this.  How hard
> > would it be to get any ctime update to add at least one nanosecond?
> > This would be enabled by a mount option, or possibly be a direct request
> > from nfsd.
> 
> We can't rely on ctime to be changed during a modification because
> O_NOCMTIME exists to enable "user invisible" modifications to be
> made. On XFS these still bump iversion, so while they are invisible
> to the user, they are still tracked by the filesystem and anything
> that wants to know if the inode data/metadata changed.
> 

O_NOCMTIME isn't mentioned in the man page, so it doesn't exist :-(

If they are "user invisible", should they then also be "NFS invisible"?
I think so.
As I understand it, the purpose of O_NOCMTIME is to allow optimisations
- do a lot of writes, then update the mtime, thus reducing latency.  I
think it is perfectly reasonable for all of that to be invisible to NFS.

NeilBrown

Trond Myklebust Aug. 18, 2022, 2:22 a.m. UTC | #15

On Thu, 2022-08-18 at 11:52 +1000, NeilBrown wrote:
> On Thu, 18 Aug 2022, Dave Chinner wrote:
> > 
> > > Maybe we should just go back to using ctime.  ctime is *exactly*
> > > what
> > > NFSv4 wants, as long as its granularity is sufficient to catch
> > > every
> > > single change.  Presumably XFS doesn't try to ensure this.  How
> > > hard
> > > would it be to get any ctime update to add at least one
> > > nanosecond?
> > > This would be enabled by a mount option, or possibly be a direct
> > > request
> > > from nfsd.
> > 
> > We can't rely on ctime to be changed during a modification because
> > O_NOCMTIME exists to enable "user invisible" modifications to be
> > made. On XFS these still bump iversion, so while they are invisible
> > to the user, they are still tracked by the filesystem and anything
> > that wants to know if the inode data/metadata changed.
> > 
> 
> O_NOCMTIME isn't mentioned in the man page, so it doesn't exist :-(
> 
> If they are "user invisible", should they then also be "NFS
> invisible"?
> I think so.
> As I understand it, the purpose of O_NOCMTIME is to allow
> optimisations
> - do a lot of writes, then update the mtime, thus reducing latency. 
> I
> think it is perfectly reasonable for all of that to be invisible to
> NFS.

The point is that you can always detect an implicit metadata change by
just reading the attribute that changed. The case of an explicit change
is different, because the application might be changing the value back
to a previous setting. The entire value of ctime is that it allows you
to know not to trust any caches when this might be the case because it
records the fact that there was an explicit data or metadata change
somewhere along the line.

By discarding that information about explicit vs implicit changes, XFS
is making the i_version less useful to applications that need to cache
data and/or metadata. So the real question is: for which real world
applications is this behaviour adding value that could not be derived
through other means?

Dave Chinner Aug. 18, 2022, 3 a.m. UTC | #16

On Thu, Aug 18, 2022 at 11:52:12AM +1000, NeilBrown wrote:
> On Thu, 18 Aug 2022, Dave Chinner wrote:
> > 
> > > Maybe we should just go back to using ctime.  ctime is *exactly* what
> > > NFSv4 wants, as long as its granularity is sufficient to catch every
> > > single change.  Presumably XFS doesn't try to ensure this.  How hard
> > > would it be to get any ctime update to add at least one nanosecond?
> > > This would be enabled by a mount option, or possibly be a direct request
> > > from nfsd.
> > 
> > We can't rely on ctime to be changed during a modification because
> > O_NOCMTIME exists to enable "user invisible" modifications to be
> > made. On XFS these still bump iversion, so while they are invisible
> > to the user, they are still tracked by the filesystem and anything
> > that wants to know if the inode data/metadata changed.
> > 
> 
> O_NOCMTIME isn't mentioned in the man page, so it doesn't exist :-(
> 
> If they are "user invisible", should they then also be "NFS invisible"?
> I think so.

Maybe, but now you're making big assumptions about what is being
done by those operations. Userspace can write whatever it likes,
nothing says that O_NOCMTIME can't change user visible data or
metadata.

> As I understand it, the purpose of O_NOCMTIME is to allow optimisations
> - do a lot of writes, then update the mtime, thus reducing latency.  I
> think it is perfectly reasonable for all of that to be invisible to NFS.

O_NOCMTIME is used by things like HSMs, file defragmenters,
deduplication tools, backup programs, etc to be able to read/write
data and manipulate file layout without modifying user visible
timestamps. i.e. users shouldn't notice that the online defragmenter
defragmented their file. Backup programs shouldn't notice the
defragmenter defragmented the file. 

But having uses of it that don't change user visible data does not
mean it can't be used for changing user visible data. Hence we made
the defensive assumption that O_NOCMTIME was a mechanism that could
be used to hide data changes from forensic analysis. With that in
mind, it was important that the change counter captured changes made
even when O_NOCMTIME was specified to leave behind a breadcrumb to
indicate unexpected changes may had been made to the file.

Yeah, we had lots of different requirements for the XFS on-disk
change counter when we were considering adding it. NFSv4 was one of
the least demanding and least defined requirements; it's taken a
*decade* for this atime issue to be noticed, so I really don't think
there's anything wrong with how XFs has implemented persistent
change counters.

What it tells me is that the VFS needs more appropriate atime
filtering for NFSv4's change attribute requirements....

Cheers,

Dave.

Dave Chinner Aug. 18, 2022, 3:37 a.m. UTC | #17

On Thu, Aug 18, 2022 at 01:11:09AM +0000, Trond Myklebust wrote:
> On Wed, 2022-08-17 at 08:42 +1000, Dave Chinner wrote:
> > 
> > In XFS, we've defined the on-disk i_version field to mean
> > "increments with any persistent inode data or metadata change",
> > regardless of what the high level applications that use i_version
> > might actually require.
> > 
> > That some network filesystem might only need a subset of the
> > metadata to be covered by i_version is largely irrelevant - if we
> > don't cover every persistent inode metadata change with i_version,
> > then applications that *need* stuff like atime change notification
> > can't be supported.
> 
> OK, I'll bite...
> 
> What real world application are we talking about here, and why can't it
> just read both the atime + i_version if it cares?

The whole point of i_version is that the aplication can skip the
storage and comparison of individual metadata fields to determine if
anythign changed. If you're going to store multiple fields and
compare them all in addition to the change attribute, then what is
the change attribute actually gaining you?

> The value of the change attribute lies in the fact that it gives you
> ctime semantics without the time resolution limitation.
> i.e. if the change attribute has changed, then you know that someone
> has explicitly modified either the file data or the file metadata (with
> the emphasis being on the word "explicitly").
> Implicit changes such as the mtime change due to a write are reflected
> only because they are necessarily also accompanied by an explicit
> change to the data contents of the file.
> Implicit changes, such as the atime changes due to a read are not
> reflected in the change attribute because there is no explicit change
> being made by an application.

That's the *NFSv4 requirements*, not what people were asking XFS to
support in a persistent change attribute 10-15 years ago. What NFS
required was just one of the inputs at the time, and what we
implemented has kept NFSv4 happy for the past decade. I've mentioned
other requirements elsewhere in the thread.

The problem we're talking about here is essentially a relatime
filtering issue; it's triggering an filesystem update because the
first access after a modification triggers an on-disk atime update
rahter than just storing it in memory.

This is not a filesystem issue - the VFS controls when the on-disk
updates occur, and that what NFSv4 appears to need changed.
If NFS doesn't want the filesystem to bump change counters for
on-disk atime updates, then it should be asking the VFS to keep the
atime updates in memory. e.g. use lazytime semantics.

This way, every filesystem will have the same behaviour, regardless
of how they track/store persistent change count metadata.

Cheers,

Dave.

Trond Myklebust Aug. 18, 2022, 4:15 a.m. UTC | #18

On Thu, 2022-08-18 at 13:37 +1000, Dave Chinner wrote:
> On Thu, Aug 18, 2022 at 01:11:09AM +0000, Trond Myklebust wrote:
> > On Wed, 2022-08-17 at 08:42 +1000, Dave Chinner wrote:
> > > 
> > > In XFS, we've defined the on-disk i_version field to mean
> > > "increments with any persistent inode data or metadata change",
> > > regardless of what the high level applications that use i_version
> > > might actually require.
> > > 
> > > That some network filesystem might only need a subset of the
> > > metadata to be covered by i_version is largely irrelevant - if we
> > > don't cover every persistent inode metadata change with
> > > i_version,
> > > then applications that *need* stuff like atime change
> > > notification
> > > can't be supported.
> > 
> > OK, I'll bite...
> > 
> > What real world application are we talking about here, and why
> > can't it
> > just read both the atime + i_version if it cares?
> 
> The whole point of i_version is that the aplication can skip the
> storage and comparison of individual metadata fields to determine if
> anythign changed. If you're going to store multiple fields and
> compare them all in addition to the change attribute, then what is
> the change attribute actually gaining you?

Information that is not contained in the fields themselves. Such as
metadata about the fact that they were explicitly changed by an
application.

> 
> > The value of the change attribute lies in the fact that it gives
> > you
> > ctime semantics without the time resolution limitation.
> > i.e. if the change attribute has changed, then you know that
> > someone
> > has explicitly modified either the file data or the file metadata
> > (with
> > the emphasis being on the word "explicitly").
> > Implicit changes such as the mtime change due to a write are
> > reflected
> > only because they are necessarily also accompanied by an explicit
> > change to the data contents of the file.
> > Implicit changes, such as the atime changes due to a read are not
> > reflected in the change attribute because there is no explicit
> > change
> > being made by an application.
> 
> That's the *NFSv4 requirements*, not what people were asking XFS to
> support in a persistent change attribute 10-15 years ago. What NFS
> required was just one of the inputs at the time, and what we
> implemented has kept NFSv4 happy for the past decade. I've mentioned
> other requirements elsewhere in the thread

NFS can work with a change attribute that tells it to invalidate its
cache on every read. The only side effect will be that the performance
graph will act as if you were filtering it through a cow's digestive
system...

> The problem we're talking about here is essentially a relatime
> filtering issue; it's triggering an filesystem update because the
> first access after a modification triggers an on-disk atime update
> rahter than just storing it in memory.

No. It's not... You appear to be discarding valuable information about
why the attributes changed. I've been asking you for reasons why, and
you're avoiding the question.

> This is not a filesystem issue - the VFS controls when the on-disk
> updates occur, and that what NFSv4 appears to need changed.
> If NFS doesn't want the filesystem to bump change counters for
> on-disk atime updates, then it should be asking the VFS to keep the
> atime updates in memory. e.g. use lazytime semantics.
> 
> This way, every filesystem will have the same behaviour, regardless
> of how they track/store persistent change count metadata.

Right now, the i_version updates are not exported via any common API,
so any piss poor performance side-effects of the implementation are
pretty much limited to the kernel users (NFS and... ???)

Who do you expect to use this attribute if it were to be exported via
statx() as Jeff is proposing, and why is the XFS behaviour appropriate?
It already differs from the behaviour of both btrfs and NFS, so the
argument that this will magically consolidate behaviour can be ignored.

Jeffrey Layton Aug. 18, 2022, 11 a.m. UTC | #19

On Thu, 2022-08-18 at 10:34 +1000, NeilBrown wrote:
> On Wed, 17 Aug 2022, Dave Chinner wrote:
> > 
> > In XFS, we've defined the on-disk i_version field to mean
> > "increments with any persistent inode data or metadata change",
> > regardless of what the high level applications that use i_version
> > might actually require.
> > 
> > That some network filesystem might only need a subset of the
> > metadata to be covered by i_version is largely irrelevant - if we
> > don't cover every persistent inode metadata change with i_version,
> > then applications that *need* stuff like atime change notification
> > can't be supported.
> 
> So what you are saying is that the i_version provided by XFS does not
> match the changeid semantics required by NFSv4.  Fair enough.  I guess
> we shouldn't use the one to implement the other then.
> 
> Maybe we should just go back to using ctime.  ctime is *exactly* what
> NFSv4 wants, as long as its granularity is sufficient to catch every
> single change.  Presumably XFS doesn't try to ensure this.  How hard
> would it be to get any ctime update to add at least one nanosecond?
> This would be enabled by a mount option, or possibly be a direct request
> from nfsd.
> 

I think that would be an unfortunate outcome, but if we can't stop xfs
from bumping the i_version on atime updates, then we may have no choice
but to do so. I suppose we could add a fetch_iversion for xfs that takes
it back to using the ctime.

> <rant>NFSv4 changeid is really one of the more horrible parts of the
> design</rant>
> 

Hah! I was telling Tom Talpey yesterday that I thought that the change
counter was one of the best ideas in NFSv4 and that we should be trying
to get all filesystems to implement it correctly.

The part that does suck about the design is that the original specs
weren't specific enough about its behavior. I think that's been somewhat
remedied in more recent RFCs though.

Jeffrey Layton Aug. 18, 2022, 11:03 a.m. UTC | #20

On Thu, 2022-08-18 at 04:15 +0000, Trond Myklebust wrote:
> On Thu, 2022-08-18 at 13:37 +1000, Dave Chinner wrote:
> > On Thu, Aug 18, 2022 at 01:11:09AM +0000, Trond Myklebust wrote:
> > > On Wed, 2022-08-17 at 08:42 +1000, Dave Chinner wrote:
> > > > 
> > > > In XFS, we've defined the on-disk i_version field to mean
> > > > "increments with any persistent inode data or metadata change",
> > > > regardless of what the high level applications that use i_version
> > > > might actually require.
> > > > 
> > > > That some network filesystem might only need a subset of the
> > > > metadata to be covered by i_version is largely irrelevant - if we
> > > > don't cover every persistent inode metadata change with
> > > > i_version,
> > > > then applications that *need* stuff like atime change
> > > > notification
> > > > can't be supported.
> > > 
> > > OK, I'll bite...
> > > 
> > > What real world application are we talking about here, and why
> > > can't it
> > > just read both the atime + i_version if it cares?
> > 
> > The whole point of i_version is that the aplication can skip the
> > storage and comparison of individual metadata fields to determine if
> > anythign changed. If you're going to store multiple fields and
> > compare them all in addition to the change attribute, then what is
> > the change attribute actually gaining you?
> 
> Information that is not contained in the fields themselves. Such as
> metadata about the fact that they were explicitly changed by an
> application.
> 
> > 
> > > The value of the change attribute lies in the fact that it gives
> > > you
> > > ctime semantics without the time resolution limitation.
> > > i.e. if the change attribute has changed, then you know that
> > > someone
> > > has explicitly modified either the file data or the file metadata
> > > (with
> > > the emphasis being on the word "explicitly").
> > > Implicit changes such as the mtime change due to a write are
> > > reflected
> > > only because they are necessarily also accompanied by an explicit
> > > change to the data contents of the file.
> > > Implicit changes, such as the atime changes due to a read are not
> > > reflected in the change attribute because there is no explicit
> > > change
> > > being made by an application.
> > 
> > That's the *NFSv4 requirements*, not what people were asking XFS to
> > support in a persistent change attribute 10-15 years ago. What NFS
> > required was just one of the inputs at the time, and what we
> > implemented has kept NFSv4 happy for the past decade. I've mentioned
> > other requirements elsewhere in the thread
> 
> NFS can work with a change attribute that tells it to invalidate its
> cache on every read. The only side effect will be that the performance
> graph will act as if you were filtering it through a cow's digestive
> system...
> 
> > The problem we're talking about here is essentially a relatime
> > filtering issue; it's triggering an filesystem update because the
> > first access after a modification triggers an on-disk atime update
> > rahter than just storing it in memory.
> 
> No. It's not... You appear to be discarding valuable information about
> why the attributes changed. I've been asking you for reasons why, and
> you're avoiding the question.
> 
> > This is not a filesystem issue - the VFS controls when the on-disk
> > updates occur, and that what NFSv4 appears to need changed.
> > If NFS doesn't want the filesystem to bump change counters for
> > on-disk atime updates, then it should be asking the VFS to keep the
> > atime updates in memory. e.g. use lazytime semantics.
> > 
> > This way, every filesystem will have the same behaviour, regardless
> > of how they track/store persistent change count metadata.
> 
> Right now, the i_version updates are not exported via any common API,
> so any piss poor performance side-effects of the implementation are
> pretty much limited to the kernel users (NFS and... ???)
> 
> Who do you expect to use this attribute if it were to be exported via
> statx() as Jeff is proposing, and why is the XFS behaviour appropriate?
> It already differs from the behaviour of both btrfs and NFS, so the
> argument that this will magically consolidate behaviour can be ignored.
> 

Thanks Trond,

That's been exactly the point I've been trying to make. The _only_
consumers of i_version at this time are the kernel's NFS server and IMA.
Both of them will still work with the i_version being updated due to
atime updates, but their performance suffers.

The change I'm proposing should bring xfs in line with other providers
of i_version as well. btrfs already behaves correctly, and I have a
proposed patch for ext4 which should fix it. The ext4 devs seem amenable
to it so far.

Dave, you keep talking about the xfs i_version counter as if there are
other applications already relying on its behavior, but I don't see how
that can be. There is no way for userland applications to fetch the
counter currently.

Jeffrey Layton Aug. 18, 2022, 11:12 a.m. UTC | #21

On Thu, 2022-08-18 at 11:07 +1000, Dave Chinner wrote:
> On Wed, Aug 17, 2022 at 08:02:23AM -0400, Jeff Layton wrote:
> > On Wed, 2022-08-17 at 08:42 +1000, Dave Chinner wrote:
> > > On Tue, Aug 16, 2022 at 11:58:06AM -0400, Jeff Layton wrote:
> > > > On Tue, 2022-08-16 at 08:43 -0700, Darrick J. Wong wrote:
> > > > > On Tue, Aug 16, 2022 at 09:17:36AM -0400, Jeff Layton wrote:
> > > > > > The i_version in xfs_trans_log_inode is bumped for any inode update,
> > > > > > including atime-only updates due to reads. We don't want to record those
> > > > > > in the i_version, as they don't represent "real" changes. Remove that
> > > > > > callsite.
> > > > > > 
> > > > > > In xfs_vn_update_time, if S_VERSION is flagged, then attempt to bump the
> > > > > > i_version and turn on XFS_ILOG_CORE if it happens. In
> > > > > > xfs_trans_ichgtime, update the i_version if the mtime or ctime are being
> > > > > > updated.
> > > > > 
> > > > > What about operations that don't touch the mtime but change the file
> > > > > metadata anyway?  There are a few of those, like the blockgc garbage
> > > > > collector, deduperange, and the defrag tool.
> > > > > 
> > > > 
> > > > Do those change the c/mtime at all?
> > > > 
> > > > It's possible we're missing some places that should change the i_version
> > > > as well. We may need some more call sites.
> > > > 
> > > > > Zooming out a bit -- what does i_version signal, concretely?  I thought
> > > > > it was used by nfs (and maybe ceph?) to signal to clients that the file
> > > > > on the server has moved on, and the client needs to invalidate its
> > > > > caches.  I thought afs had a similar generation counter, though it's
> > > > > only used to cache file data, not metadata?  Does an i_version change
> > > > > cause all of them to invalidate caches, or is there more behavior I
> > > > > don't know about?
> > > > > 
> > > > 
> > > > For NFS, it indicates a change to the change attribute indicates that
> > > > there has been a change to the data or metadata for the file. atime
> > > > changes due to reads are specifically exempted from this, but we do bump
> > > > the i_version if someone (e.g.) changes the atime via utimes(). 
> > > 
> > > We have relatime behaviour to optimise away unnecessary atime
> > > updates on reads.  Trying to explicitly exclude i_version from atime
> > > updates in one filesystem just because NFS doesn't need that
> > > information seems ....  misguided.  The -on disk- i_version
> > > field behaviour is defined by the filesystem implementation, not the
> > > NFS requirements.
> > 
> > -o relatime does not fix this.
> 
> So why not fix -o relatime to handle this? That way the fix works
> for all filesystems and doesn't require hacking around what the VFS
> has told us to do in every filesystem.
> 
> i.e. the VFS told us to update atime, we updated atime and that is a
> persistent metadata change to the inode. Hence a filesystem with a
> persistent change counter has to bump the change counter because
> we've been asked by the VFS to make a persistent metadata change.
> 
> If you want atime updates to not make persistent changes to on disk
> metadata, then change the relatime implementation so that it doesn't
> ask the filesystem to update the on-disk atime.
> 
> Essentially, what I'm hearing is that NFS wants atime updates to
> behave like lazytime, not like relatime. With lazytime, atime always
> gets updated in memory, but it is not written back to the filesystem
> until a timeout or some other modification is made to the inode or
> file data. THe filesystem doesn't bump iversion until the timestamp
> gets written back in the lazytime case.
> 
> IOWs, we already have a mechanism in the kernel for making atime
> updates behave exactly as NFS wants: -o lazytime.
> 

No, that won't help. Both lazytime and relatime don't help anything
since they don't address the fundamental problem, which is that the
i_version changes due to atime updates. They only affect when the atime
gets updated (or when it goes to disk).



> > > > The NFS client will generally invalidate its caches for the inode when
> > > > it notices a change attribute change.
> > > > 
> > > > FWIW, AFS may not meet this standard since it doesn't generally
> > > > increment the counter on metadata changes. It may turn out that we don't
> > > > want to expose this to the AFS client due to that (or maybe come up with
> > > > some way to indicate this difference).
> > > 
> > > In XFS, we've defined the on-disk i_version field to mean
> > > "increments with any persistent inode data or metadata change",
> > > regardless of what the high level applications that use i_version
> > > might actually require.
> > > 
> > > That some network filesystem might only need a subset of the
> > > metadata to be covered by i_version is largely irrelevant - if we
> > > don't cover every persistent inode metadata change with i_version,
> > > then applications that *need* stuff like atime change notification
> > > can't be supported.
> > > 
> > > > > Does that mean that we should bump i_version for any file data or
> > > > > attribute that could be queried or observed by userspace?  In which case
> > > > > I suppose this change is still correct, even if it relaxes i_version
> > > > > updates from "any change to the inode whatsoever" to "any change that
> > > > > would bump mtime".  Unless FIEMAP is part of "attributes observed by
> > > > > userspace".
> > > > > 
> > > > > (The other downside I can see is that now we have to remember to bump
> > > > > timestamps for every new file operation we add, unlike the current code
> > > > > which is centrally located in xfs_trans_log_inode.)
> > > > > 
> > > > 
> > > > The main reason for the change attribute in NFS was that NFSv3 is
> > > > plagued with cache-coherency problems due to coarse-grained timestamp
> > > > granularity. It was conceived as a way to indicate that the inode had
> > > > changed without relying on timestamps.
> > > 
> > > Yes, and the most important design consideration for a filesystem is
> > > that it -must be persistent-. The constraints on i_version are much
> > > stricter than timestamps, and they are directly related to how the
> > > filesystem persists metadata changes, not how metadata is changed or
> > > accessed in memory.
> > > 
> > > > In practice, we want to bump the i_version counter whenever the ctime or
> > > > mtime would be changed.
> > > 
> > > What about O_NOCMTIME modifications? What about lazytime
> > > filesystems? These explicilty avoid or delay persisten c/mtime
> > > updates, and that means bumping i_version only based on c/mtime
> > > updates cannot be relied on. i_version is supposed to track user
> > > visible data and metadata changes, *not timestamp updates*.
> > 
> > I was speaking more about the sorts of activity that should result in
> > the i_version being changed, not about tracking timestamp updates. IOW,
> > if some activity would cause the mtime or ctime to change, then we want
> > to also bump the i_version.
> > 
> > Specifically, for NOCMTIME, I think we'd still want the i_version to
> > change since that option is about timestamps and not i_version.
> 
> Exactly my point: this is what XFS currently does. It is also what
> your proposed changes break by tying i_version updates to c/mtime
> updates.
> 
> > > Hence, I don't think that trying to modify how filesystems persist
> > > and maintain i_version coherency because NFS "doesn't need i_version
> > > to cover atime updates" is the wrong approach. On-disk i_version
> > > coherency has to work for more than just one NFS implementation
> > > (especially now i_version will be exported to userspace!). 
> > > Persistent atime updates are already optimised away by relatime, and
> > > so I think that any further atime filtering is largely a NFS
> > > application layer problem and not something that should be solved by
> > > changing the on-disk definition of back end filesystem structure
> > > persistence.
> > > 
> > 
> > Fair enough. xfs is not really in my wheelhouse so take this as a patch
> > that helps illustrate the problem, rather than a serious submission.
> > 
> > There are two consumers of the i_version counter today: the kernel NFS
> > server and IMA. Having the i_version reflect atime updates due to reads
> > harms both of those use-cases with unneeded cache invalidations on NFS
> > and extra measurements on IMA. It would also be problematic for userland
> > NFS servers such as ganesha if we end up exposing this to userland.
> 
> So you're wanting define an exact behaviour for atime vs I_VERSION
> where atime doesn't bump iversion.
> 
> However, the definition we baked into the XFS on-disk format is that
> the on-disk iversion filed changes for every persistent change made
> to the inode. That includes atime updates. We did this for two
> reasons - the first is so that we could support any application that
> needed change detection, and the second was that knowing how many
> changes have occurred to an inode is extremely useful for forensic
> purposes (especially given the ability to use O_NOCMTIME to modify
> file data).
> 
> > atime updates are really a special case when it comes to metadata (and I
> > maintain that they are one of the worst ideas in POSIX). The way you're
> > choosing to define i_version doesn't really work properly for any
> > current or projected use case. I'd like to see that changed.
> 
> We chose to do that a decade ago, knowing that it is the
> responsibility of the VFS to avoid unnecessary atime updates, not
> the responsibility of the filesystem. That was the whole point of
> introducing the relatime functionality: fix the problem at the VFS,
> not have to work around generic atime behaviour in every filesystem.
> 
> > If the concern is fragility of the code going forward, then maybe we can
> > go with a different approach. Would it be possible to just have
> > xfs_trans_log_inode skip bumping the i_version when the log transaction
> > is _only_ for an atime update due to reads? Maybe we could add a new
> > XFS_ILOG_ATIME_UPDATE flag or something and set it the appropriate
> > codepaths?
> 
> No, I don't think this is something we should be hacking around in
> individual filesystems. If the VFS tells us we should be updating
> atime, we should be updating it and bumping persistent change
> counters because *that's what the VFS asked us to do*.
> 
> IOWs, if NFS wants atime to be considered "in memory only" as per
> the lazytime behaviour, then that behaviour needs to be
> supported/changed at the VFS, not at the individual fileystem level.
> 
> You could add a subset of SB_LAZYTIME functionality just for atime
> updates, handle that entirely within the existing lazytime
> infrastructure. THis means the VFS supports non-persistent,
> best-effort, non-persistent atime updates directly. The VFS will not
> ack filesystems to persist atime updates the moment they are made,
> it will tell the filesystem via ->dirty_inode() that it needs to
> persist the in-memory timestamps.
> 
> This gives all the filesystems the same atime behaviour. If the VFS
> is going to expose the persistent change counter to userspace, we
> need to have consistent, VFS enforced rules on what is coverred by
> the change counter. If atime is not covered by the ipersistent
> change counter, then the VFS must not ask the filesystems to persist
> atime changes every time atime changes.
> 
> Then all filesystems will avoid on-disk atime updates as much as
> possible, whilst still gathering them correctly when other
> modifications are made.
> 
> Until we've got a definition of what this persistent change counter
> actually describes and guarantees for userspace, changing filesystem
> implementations is premature. And if it is decided that atime is not
> going to be considered a "persistent change" by itself, then that
> behaviour needs to be encoded at the VFS, not in individual
> filesystems....
> 
> Cheers,
> 
> Dave.

NeilBrown Aug. 18, 2022, 11:43 p.m. UTC | #22

On Thu, 18 Aug 2022, Jeff Layton wrote:
> On Thu, 2022-08-18 at 10:34 +1000, NeilBrown wrote:
> > On Wed, 17 Aug 2022, Dave Chinner wrote:
> > > 
> > > In XFS, we've defined the on-disk i_version field to mean
> > > "increments with any persistent inode data or metadata change",
> > > regardless of what the high level applications that use i_version
> > > might actually require.
> > > 
> > > That some network filesystem might only need a subset of the
> > > metadata to be covered by i_version is largely irrelevant - if we
> > > don't cover every persistent inode metadata change with i_version,
> > > then applications that *need* stuff like atime change notification
> > > can't be supported.
> > 
> > So what you are saying is that the i_version provided by XFS does not
> > match the changeid semantics required by NFSv4.  Fair enough.  I guess
> > we shouldn't use the one to implement the other then.
> > 
> > Maybe we should just go back to using ctime.  ctime is *exactly* what
> > NFSv4 wants, as long as its granularity is sufficient to catch every
> > single change.  Presumably XFS doesn't try to ensure this.  How hard
> > would it be to get any ctime update to add at least one nanosecond?
> > This would be enabled by a mount option, or possibly be a direct request
> > from nfsd.
> > 
> 
> I think that would be an unfortunate outcome, but if we can't stop xfs
> from bumping the i_version on atime updates, then we may have no choice
> but to do so. I suppose we could add a fetch_iversion for xfs that takes
> it back to using the ctime.

"unfortunate" for who I wonder..

I think Trond's argument about not needing implicit updates to be
reflected i_version is sound - as the effective i_version can be
constructed from stored i_version plus all attributes, if you ever want
an effective i_version that covers all implicit changes to attributes.
However I doubt Dave will be convinced.

> 
> > <rant>NFSv4 changeid is really one of the more horrible parts of the
> > design</rant>
> > 
> 
> Hah! I was telling Tom Talpey yesterday that I thought that the change
> counter was one of the best ideas in NFSv4 and that we should be trying
> to get all filesystems to implement it correctly.
> 
> The part that does suck about the design is that the original specs
> weren't specific enough about its behavior. I think that's been somewhat
> remedied in more recent RFCs though.

That's enough bate for me to expand my rant...
The problem with changeid is that it imposes on the filesystem.  You
CANNOT provide a compliant NFSv4 services on a filesystem which has 1
second resolution of time stamps and no internal i_version.  When you
are designing a protocol - particularly one where interoperability is a
focus - making it impossible to export some common (at the time)
filesystems correctly is crazy.

And not at all necessary.  There is a much better way.

When reporting the ctime of a file, the server could add a Bool which is
true if that time is "now" to within the resolution of the timestamp.
If you see a "now" ctime, then you cannot use timestamps to validate any
cached data.  i.e.  a "now" ctime is different to every other ctime,
even another "now" ctime with the same timestamp.
If the filesystem has high resolution timestamps, it will never say
"now" and you get the same semantics as changeid.  If timestamps are low
resolution, then you cannot cache a file while it is changing (unless
you get a read delegation), but you can accurately cache a file as long
as it doesn't change - which is all that changeid gives you anyway.

NeilBrown

> -- 
> Jeff Layton <jlayton@kernel.org>
>

NeilBrown Aug. 19, 2022, 12:35 a.m. UTC | #23

On Thu, 18 Aug 2022, Dave Chinner wrote:
> On Thu, Aug 18, 2022 at 11:52:12AM +1000, NeilBrown wrote:
> > On Thu, 18 Aug 2022, Dave Chinner wrote:
> > > 
> > > > Maybe we should just go back to using ctime.  ctime is *exactly* what
> > > > NFSv4 wants, as long as its granularity is sufficient to catch every
> > > > single change.  Presumably XFS doesn't try to ensure this.  How hard
> > > > would it be to get any ctime update to add at least one nanosecond?
> > > > This would be enabled by a mount option, or possibly be a direct request
> > > > from nfsd.
> > > 
> > > We can't rely on ctime to be changed during a modification because
> > > O_NOCMTIME exists to enable "user invisible" modifications to be
> > > made. On XFS these still bump iversion, so while they are invisible
> > > to the user, they are still tracked by the filesystem and anything
> > > that wants to know if the inode data/metadata changed.
> > > 
> > 
> > O_NOCMTIME isn't mentioned in the man page, so it doesn't exist :-(
> > 
> > If they are "user invisible", should they then also be "NFS invisible"?
> > I think so.
> 
> Maybe, but now you're making big assumptions about what is being
> done by those operations. Userspace can write whatever it likes,
> nothing says that O_NOCMTIME can't change user visible data or
> metadata.

Nope.  The only assumption I'm making is that if the ctime/mtime don't
change, then it is safe to trust any cached content.  I think that is
broadly assumed in the Posix world.  Anyone who uses O_NOCMTIME must
understand the risks (not currently documented ....) and it must be
assumed they will handled them properly.  We cannot allow the addition
of O_NOCMTIME to make us think "ctime and mtime don't mean what they
used to, we cannot trust them any more".

> But having uses of it that don't change user visible data does not
> mean it can't be used for changing user visible data. Hence we made
> the defensive assumption that O_NOCMTIME was a mechanism that could
> be used to hide data changes from forensic analysis. With that in
> mind, it was important that the change counter captured changes made
> even when O_NOCMTIME was specified to leave behind a breadcrumb to
> indicate unexpected changes may had been made to the file.

Having a breadcrumb seems reasonable.  Calling that breadcrumb "i_version"
might be questionable - though specifications seem to be vague so this
decision is probably defensible.

> 
> Yeah, we had lots of different requirements for the XFS on-disk
> change counter when we were considering adding it. NFSv4 was one of
> the least demanding and least defined requirements; it's taken a
> *decade* for this atime issue to be noticed, so I really don't think
> there's anything wrong with how XFs has implemented persistent
> change counters.
> 
> What it tells me is that the VFS needs more appropriate atime
> filtering for NFSv4's change attribute requirements....

I don't agree with that last point.  I think "atime == mtime" and 
"atime > mtime" are distinctly different states which should recorded.

I think Trond's' observation about implicit updates is on-point.
There is no need to include implicit atime updates in i_version.  If
anyone cares about those they can combine i_version and atime into a
single value.  If that value changes, then something changed, possibly
an implicit atime update.  

Excluding implicit atime updates makes i_version strictly more useful.
It doesn't lose any value and does gain some.

NeilBrown

Dave Chinner Aug. 23, 2022, 12:05 a.m. UTC | #24

On Thu, Aug 18, 2022 at 07:03:42AM -0400, Jeff Layton wrote:
> Dave, you keep talking about the xfs i_version counter as if there are
> other applications already relying on its behavior, but I don't see how
> that can be. There is no way for userland applications to fetch the
> counter currently.

You miss the point entirely: the behaviour is defined by the on-disk
format (the di_changecount filed), not the applications that are
using the kernel internal iversion API.

Just because there are no in-kernel users of the di_changecount
field in the XFS inode, it does not mean that it doesn't get used by
applications. Forensic analysis tools that walk filesystem images.

Did you not notice that xfs_trans_log_inode() forces an iversion
update if the inode core is marked for update:

	inode_maybe_inc_iversion(inode, flags & XFS_ILOG_CORE))
					^^^^^^^^^^^^^^^^^^^^^^
					this?

So every modification to the inode core (which almost all inode
modifications will do, except for pure timestamp updates) will bump
the iversion regardless of whether it was queried or not.

I use this information all the time for forensic analysis of broken
filesystem images. There are forensic tools that use expose this
information from filesystem images (e.g. xfs_db) so that we can use
it for forensic analysis.

See the problem? On-disk format di_changecount != NFS change
attribute. We can implement the NFS change attribute with the
existing di_changecount field, but if you want to constrain the
definition of how the NFS change attribute is calculated, we can't
necessarily implement that in di_changecount without changing the
on-disk format definition.

And, yes, this has implications for iversion being exposed to
userspace via statx(), as I've mentioned in reply to the v2 patch
you've posted. iversion is persistent information - you can't just
redefine it's behaviour without some fairly serious knock-on
effects for the subsystems that provide the persistent storage...

Cheers,

Dave.

Trond Myklebust Aug. 23, 2022, 1:33 a.m. UTC | #25

On Tue, 2022-08-23 at 10:05 +1000, Dave Chinner wrote:
> On Thu, Aug 18, 2022 at 07:03:42AM -0400, Jeff Layton wrote:
> > Dave, you keep talking about the xfs i_version counter as if there
> > are
> > other applications already relying on its behavior, but I don't see
> > how
> > that can be. There is no way for userland applications to fetch the
> > counter currently.
> 
> You miss the point entirely: the behaviour is defined by the on-disk
> format (the di_changecount filed), not the applications that are
> using the kernel internal iversion API.
> 
> Just because there are no in-kernel users of the di_changecount
> field in the XFS inode, it does not mean that it doesn't get used by
> applications. Forensic analysis tools that walk filesystem images.
> 
> Did you not notice that xfs_trans_log_inode() forces an iversion
> update if the inode core is marked for update:
> 
>         inode_maybe_inc_iversion(inode, flags & XFS_ILOG_CORE))
>                                         ^^^^^^^^^^^^^^^^^^^^^^
>                                         this?
> 
> So every modification to the inode core (which almost all inode
> modifications will do, except for pure timestamp updates) will bump
> the iversion regardless of whether it was queried or not.
> 
> I use this information all the time for forensic analysis of broken
> filesystem images. There are forensic tools that use expose this
> information from filesystem images (e.g. xfs_db) so that we can use
> it for forensic analysis.
> 
> See the problem? On-disk format di_changecount != NFS change
> attribute. We can implement the NFS change attribute with the
> existing di_changecount field, but if you want to constrain the
> definition of how the NFS change attribute is calculated, we can't
> necessarily implement that in di_changecount without changing the
> on-disk format definition.
> 
> And, yes, this has implications for iversion being exposed to
> userspace via statx(), as I've mentioned in reply to the v2 patch
> you've posted. iversion is persistent information - you can't just
> redefine it's behaviour without some fairly serious knock-on
> effects for the subsystems that provide the persistent storage...
> 

That still doesn't explain why a regular application would want to use
that definition of a version counter.
Your use case of a forensic tool is not a generic use case. It is very
limited to something which needs access to detailed knowledge of the
internals of your transactional filesystem. This isn't even something
that needs to be exposed to the VFS layer through the i_version if the
users are all going to be internal to XFS.

What is being requested was initially driven (yes, over 20 years ago
now, as you've pointed out) by the use case of caching applications, of
which NFS is only one. There are other applications that can benefit
from it. Pretty much anything that is currently using the ctime and/or
mtime, wants a standard that overcomes the time resolution issues that
increasingly affect storage as performance improves. That means
applications such as build utilities, backup/restore utilities, search
utilities, etc... Even 'git' could make use of it.

However none of the above applications need the 'all metadata
transactions' definition that you appear to need for your forensics use
case. All will suffer a performance loss with such a definition. Worst
of all, the application behaviour would vary wildly depending on the
combination of mount options being used (noatime vs nodiratime vs...).

So the point isn't about 'redefining the behaviour of i_version'. It is
about delivering a statx() API that meets the requirements of a number
of applications. If XFS is going to opt in to that API, then it needs
to be able to deliver an attribute that meets the same requirements.
Otherwise, it could choose to opt out.

xfs: fix i_version handling in xfs

Commit Message

Comments

Patch