Message ID | 20220816131736.42615-1-jlayton@kernel.org (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | xfs: fix i_version handling in xfs | expand |
On Tue, Aug 16, 2022 at 09:17:36AM -0400, Jeff Layton wrote: > The i_version in xfs_trans_log_inode is bumped for any inode update, > including atime-only updates due to reads. We don't want to record those > in the i_version, as they don't represent "real" changes. Remove that > callsite. > > In xfs_vn_update_time, if S_VERSION is flagged, then attempt to bump the > i_version and turn on XFS_ILOG_CORE if it happens. In > xfs_trans_ichgtime, update the i_version if the mtime or ctime are being > updated. What about operations that don't touch the mtime but change the file metadata anyway? There are a few of those, like the blockgc garbage collector, deduperange, and the defrag tool. Zooming out a bit -- what does i_version signal, concretely? I thought it was used by nfs (and maybe ceph?) to signal to clients that the file on the server has moved on, and the client needs to invalidate its caches. I thought afs had a similar generation counter, though it's only used to cache file data, not metadata? Does an i_version change cause all of them to invalidate caches, or is there more behavior I don't know about? Does that mean that we should bump i_version for any file data or attribute that could be queried or observed by userspace? In which case I suppose this change is still correct, even if it relaxes i_version updates from "any change to the inode whatsoever" to "any change that would bump mtime". Unless FIEMAP is part of "attributes observed by userspace". (The other downside I can see is that now we have to remember to bump timestamps for every new file operation we add, unlike the current code which is centrally located in xfs_trans_log_inode.) --D > Cc: Darrick J. Wong <darrick.wong@oracle.com> > Cc: Dave Chinner <david@fromorbit.com> > Signed-off-by: Jeff Layton <jlayton@kernel.org> > --- > fs/xfs/libxfs/xfs_trans_inode.c | 17 +++-------------- > fs/xfs/xfs_iops.c | 4 ++++ > 2 files changed, 7 insertions(+), 14 deletions(-) > > diff --git a/fs/xfs/libxfs/xfs_trans_inode.c b/fs/xfs/libxfs/xfs_trans_inode.c > index 8b5547073379..78bf7f491462 100644 > --- a/fs/xfs/libxfs/xfs_trans_inode.c > +++ b/fs/xfs/libxfs/xfs_trans_inode.c > @@ -71,6 +71,8 @@ xfs_trans_ichgtime( > inode->i_ctime = tv; > if (flags & XFS_ICHGTIME_CREATE) > ip->i_crtime = tv; > + if (flags & (XFS_ICHGTIME_MOD|XFS_ICHGTIME_CHG)) > + inode_inc_iversion(inode); > } > > /* > @@ -116,20 +118,7 @@ xfs_trans_log_inode( > spin_unlock(&inode->i_lock); > } > > - /* > - * First time we log the inode in a transaction, bump the inode change > - * counter if it is configured for this to occur. While we have the > - * inode locked exclusively for metadata modification, we can usually > - * avoid setting XFS_ILOG_CORE if no one has queried the value since > - * the last time it was incremented. If we have XFS_ILOG_CORE already > - * set however, then go ahead and bump the i_version counter > - * unconditionally. > - */ > - if (!test_and_set_bit(XFS_LI_DIRTY, &iip->ili_item.li_flags)) { > - if (IS_I_VERSION(inode) && > - inode_maybe_inc_iversion(inode, flags & XFS_ILOG_CORE)) > - iversion_flags = XFS_ILOG_CORE; > - } > + set_bit(XFS_LI_DIRTY, &iip->ili_item.li_flags); > > /* > * If we're updating the inode core or the timestamps and it's possible > diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c > index 45518b8c613c..162e044c7f56 100644 > --- a/fs/xfs/xfs_iops.c > +++ b/fs/xfs/xfs_iops.c > @@ -718,6 +718,7 @@ xfs_setattr_nonsize( > } > > setattr_copy(mnt_userns, inode, iattr); > + inode_inc_iversion(inode); > xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE); > > XFS_STATS_INC(mp, xs_ig_attrchg); > @@ -943,6 +944,7 @@ xfs_setattr_size( > > ASSERT(!(iattr->ia_valid & (ATTR_UID | ATTR_GID))); > setattr_copy(mnt_userns, inode, iattr); > + inode_inc_iversion(inode); > xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE); > > XFS_STATS_INC(mp, xs_ig_attrchg); > @@ -1047,6 +1049,8 @@ xfs_vn_update_time( > inode->i_mtime = *now; > if (flags & S_ATIME) > inode->i_atime = *now; > + if ((flags & S_VERSION) && inode_maybe_inc_iversion(inode, false)) > + log_flags |= XFS_ILOG_CORE; > > xfs_trans_ijoin(tp, ip, XFS_ILOCK_EXCL); > xfs_trans_log_inode(tp, ip, log_flags); > -- > 2.37.2 >
On Tue, 2022-08-16 at 08:43 -0700, Darrick J. Wong wrote: > On Tue, Aug 16, 2022 at 09:17:36AM -0400, Jeff Layton wrote: > > The i_version in xfs_trans_log_inode is bumped for any inode update, > > including atime-only updates due to reads. We don't want to record those > > in the i_version, as they don't represent "real" changes. Remove that > > callsite. > > > > In xfs_vn_update_time, if S_VERSION is flagged, then attempt to bump the > > i_version and turn on XFS_ILOG_CORE if it happens. In > > xfs_trans_ichgtime, update the i_version if the mtime or ctime are being > > updated. > > What about operations that don't touch the mtime but change the file > metadata anyway? There are a few of those, like the blockgc garbage > collector, deduperange, and the defrag tool. > Do those change the c/mtime at all? It's possible we're missing some places that should change the i_version as well. We may need some more call sites. > Zooming out a bit -- what does i_version signal, concretely? I thought > it was used by nfs (and maybe ceph?) to signal to clients that the file > on the server has moved on, and the client needs to invalidate its > caches. I thought afs had a similar generation counter, though it's > only used to cache file data, not metadata? Does an i_version change > cause all of them to invalidate caches, or is there more behavior I > don't know about? > For NFS, it indicates a change to the change attribute indicates that there has been a change to the data or metadata for the file. atime changes due to reads are specifically exempted from this, but we do bump the i_version if someone (e.g.) changes the atime via utimes(). The NFS client will generally invalidate its caches for the inode when it notices a change attribute change. FWIW, AFS may not meet this standard since it doesn't generally increment the counter on metadata changes. It may turn out that we don't want to expose this to the AFS client due to that (or maybe come up with some way to indicate this difference). > Does that mean that we should bump i_version for any file data or > attribute that could be queried or observed by userspace? In which case > I suppose this change is still correct, even if it relaxes i_version > updates from "any change to the inode whatsoever" to "any change that > would bump mtime". Unless FIEMAP is part of "attributes observed by > userspace". > > (The other downside I can see is that now we have to remember to bump > timestamps for every new file operation we add, unlike the current code > which is centrally located in xfs_trans_log_inode.) > The main reason for the change attribute in NFS was that NFSv3 is plagued with cache-coherency problems due to coarse-grained timestamp granularity. It was conceived as a way to indicate that the inode had changed without relying on timestamps. In practice, we want to bump the i_version counter whenever the ctime or mtime would be changed. > --D > > > Cc: Darrick J. Wong <darrick.wong@oracle.com> > > Cc: Dave Chinner <david@fromorbit.com> > > Signed-off-by: Jeff Layton <jlayton@kernel.org> > > --- > > fs/xfs/libxfs/xfs_trans_inode.c | 17 +++-------------- > > fs/xfs/xfs_iops.c | 4 ++++ > > 2 files changed, 7 insertions(+), 14 deletions(-) > > > > diff --git a/fs/xfs/libxfs/xfs_trans_inode.c b/fs/xfs/libxfs/xfs_trans_inode.c > > index 8b5547073379..78bf7f491462 100644 > > --- a/fs/xfs/libxfs/xfs_trans_inode.c > > +++ b/fs/xfs/libxfs/xfs_trans_inode.c > > @@ -71,6 +71,8 @@ xfs_trans_ichgtime( > > inode->i_ctime = tv; > > if (flags & XFS_ICHGTIME_CREATE) > > ip->i_crtime = tv; > > + if (flags & (XFS_ICHGTIME_MOD|XFS_ICHGTIME_CHG)) > > + inode_inc_iversion(inode); > > } > > > > /* > > @@ -116,20 +118,7 @@ xfs_trans_log_inode( > > spin_unlock(&inode->i_lock); > > } > > > > - /* > > - * First time we log the inode in a transaction, bump the inode change > > - * counter if it is configured for this to occur. While we have the > > - * inode locked exclusively for metadata modification, we can usually > > - * avoid setting XFS_ILOG_CORE if no one has queried the value since > > - * the last time it was incremented. If we have XFS_ILOG_CORE already > > - * set however, then go ahead and bump the i_version counter > > - * unconditionally. > > - */ > > - if (!test_and_set_bit(XFS_LI_DIRTY, &iip->ili_item.li_flags)) { > > - if (IS_I_VERSION(inode) && > > - inode_maybe_inc_iversion(inode, flags & XFS_ILOG_CORE)) > > - iversion_flags = XFS_ILOG_CORE; > > - } > > + set_bit(XFS_LI_DIRTY, &iip->ili_item.li_flags); > > > > /* > > * If we're updating the inode core or the timestamps and it's possible > > diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c > > index 45518b8c613c..162e044c7f56 100644 > > --- a/fs/xfs/xfs_iops.c > > +++ b/fs/xfs/xfs_iops.c > > @@ -718,6 +718,7 @@ xfs_setattr_nonsize( > > } > > > > setattr_copy(mnt_userns, inode, iattr); > > + inode_inc_iversion(inode); > > xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE); > > > > XFS_STATS_INC(mp, xs_ig_attrchg); > > @@ -943,6 +944,7 @@ xfs_setattr_size( > > > > ASSERT(!(iattr->ia_valid & (ATTR_UID | ATTR_GID))); > > setattr_copy(mnt_userns, inode, iattr); > > + inode_inc_iversion(inode); > > xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE); > > > > XFS_STATS_INC(mp, xs_ig_attrchg); > > @@ -1047,6 +1049,8 @@ xfs_vn_update_time( > > inode->i_mtime = *now; > > if (flags & S_ATIME) > > inode->i_atime = *now; > > + if ((flags & S_VERSION) && inode_maybe_inc_iversion(inode, false)) > > + log_flags |= XFS_ILOG_CORE; > > > > xfs_trans_ijoin(tp, ip, XFS_ILOCK_EXCL); > > xfs_trans_log_inode(tp, ip, log_flags); > > -- > > 2.37.2 > >
On Tue, Aug 16, 2022 at 9:19 AM Jeff Layton <jlayton@kernel.org> wrote: > > The i_version in xfs_trans_log_inode is bumped for any inode update, > including atime-only updates due to reads. We don't want to record those > in the i_version, as they don't represent "real" changes. Remove that > callsite. > > In xfs_vn_update_time, if S_VERSION is flagged, then attempt to bump the > i_version and turn on XFS_ILOG_CORE if it happens. In > xfs_trans_ichgtime, update the i_version if the mtime or ctime are being > updated. > > Cc: Darrick J. Wong <darrick.wong@oracle.com> > Cc: Dave Chinner <david@fromorbit.com> > Signed-off-by: Jeff Layton <jlayton@kernel.org> > --- > fs/xfs/libxfs/xfs_trans_inode.c | 17 +++-------------- > fs/xfs/xfs_iops.c | 4 ++++ > 2 files changed, 7 insertions(+), 14 deletions(-) > > diff --git a/fs/xfs/libxfs/xfs_trans_inode.c b/fs/xfs/libxfs/xfs_trans_inode.c > index 8b5547073379..78bf7f491462 100644 > --- a/fs/xfs/libxfs/xfs_trans_inode.c > +++ b/fs/xfs/libxfs/xfs_trans_inode.c > @@ -71,6 +71,8 @@ xfs_trans_ichgtime( > inode->i_ctime = tv; > if (flags & XFS_ICHGTIME_CREATE) > ip->i_crtime = tv; > + if (flags & (XFS_ICHGTIME_MOD|XFS_ICHGTIME_CHG)) > + inode_inc_iversion(inode); > } > > /* > @@ -116,20 +118,7 @@ xfs_trans_log_inode( > spin_unlock(&inode->i_lock); > } > > - /* > - * First time we log the inode in a transaction, bump the inode change > - * counter if it is configured for this to occur. While we have the > - * inode locked exclusively for metadata modification, we can usually > - * avoid setting XFS_ILOG_CORE if no one has queried the value since > - * the last time it was incremented. If we have XFS_ILOG_CORE already > - * set however, then go ahead and bump the i_version counter > - * unconditionally. > - */ > - if (!test_and_set_bit(XFS_LI_DIRTY, &iip->ili_item.li_flags)) { > - if (IS_I_VERSION(inode) && > - inode_maybe_inc_iversion(inode, flags & XFS_ILOG_CORE)) > - iversion_flags = XFS_ILOG_CORE; > - } > + set_bit(XFS_LI_DIRTY, &iip->ili_item.li_flags); > > /* > * If we're updating the inode core or the timestamps and it's possible > diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c > index 45518b8c613c..162e044c7f56 100644 > --- a/fs/xfs/xfs_iops.c > +++ b/fs/xfs/xfs_iops.c > @@ -718,6 +718,7 @@ xfs_setattr_nonsize( > } > > setattr_copy(mnt_userns, inode, iattr); > + inode_inc_iversion(inode); > xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE); > > XFS_STATS_INC(mp, xs_ig_attrchg); > @@ -943,6 +944,7 @@ xfs_setattr_size( > > ASSERT(!(iattr->ia_valid & (ATTR_UID | ATTR_GID))); > setattr_copy(mnt_userns, inode, iattr); > + inode_inc_iversion(inode); > xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE); > > XFS_STATS_INC(mp, xs_ig_attrchg); > @@ -1047,6 +1049,8 @@ xfs_vn_update_time( > inode->i_mtime = *now; > if (flags & S_ATIME) > inode->i_atime = *now; > + if ((flags & S_VERSION) && inode_maybe_inc_iversion(inode, false)) > + log_flags |= XFS_ILOG_CORE; > > xfs_trans_ijoin(tp, ip, XFS_ILOCK_EXCL); > xfs_trans_log_inode(tp, ip, log_flags); > -- > 2.37.2 > I have a test (details below) that shows an open issue with NFSv4.x + fscache where an xfs exported filesystem would trigger unnecessary over the wire READs after a umount/mount cycle of the NFS mount. I previously tracked this down to atime updates, but never followed through on any patch. Now that Jeff worked it out and this patch is under review, I built 5.19 vanilla, retested, then built 5.19 + this patch and verified the problem is fixed. You can add: Tested-by: Dave Wysochanski <dwysocha@redhat.com> # ./t0_bz1913591.sh 4.1 xfs relatime Setting NFS vers=4.1 filesystem to xfs and mount options relatime,rw 0. On NFS server, setup export with xfs filesystem on loop device /dev/loop0 /export/dir1 xfs rw,seclabel,relatime,attr2,inode64,logbufs=8,logbsize=32k,noquota 0 0 1. On NFS client, install and enable cachefilesd 2. On NFS client, mount -o vers=4.1,fsc 127.0.0.1:/export/dir1 /mnt 3. On NFS client, dd if=/dev/zero of=/mnt/file1.bin bs=4096 count=1 4. On NFS client, echo 3 > /proc/sys/vm/drop_caches 5. On NFS client, dd if=/mnt/file1.bin of=/dev/null (read into fscache) 6. On NFS client, umount /mnt 7. On NFS client, mount -o vers=4.1,fsc 127.0.0.1:/export/dir1 /mnt 8. On NFS client, repeat steps 4-5 (read from fscache) 9. On NFS client, check for READ ops (1st number) > 0 in /proc/self/mountstats Found 4200 NFS READ and READ_PLUS ops in /proc/self/mountstats > 0 READ: 1 1 0 220 4200 0 0 1 0 READ_PLUS: 0 0 0 0 0 0 0 0 0 FAILED TEST ./t0_bz1913591.sh on kernel 5.19.0 with NFS vers=4.1 exported filesystem xfs options relatime,rw # ./t0_bz1913591.sh 4.1 xfs relatime Setting NFS vers=4.1 filesystem to xfs and mount options relatime,rw 0. On NFS server, setup export with xfs filesystem on loop device /dev/loop0 /export/dir1 xfs rw,seclabel,relatime,attr2,inode64,logbufs=8,logbsize=32k,noquota 0 0 1. On NFS client, install and enable cachefilesd 2. On NFS client, mount -o vers=4.1,fsc 127.0.0.1:/export/dir1 /mnt 3. On NFS client, dd if=/dev/zero of=/mnt/file1.bin bs=4096 count=1 4. On NFS client, echo 3 > /proc/sys/vm/drop_caches 5. On NFS client, dd if=/mnt/file1.bin of=/dev/null (read into fscache) 6. On NFS client, umount /mnt 7. On NFS client, mount -o vers=4.1,fsc 127.0.0.1:/export/dir1 /mnt 8. On NFS client, repeat steps 4-5 (read from fscache) 9. On NFS client, check for READ ops (1st number) > 0 in /proc/self/mountstats 10. On NFS client, check /proc/fs/fscache/stats fscache reads incrementing PASSED TEST ./t0_bz1913591.sh on kernel 5.19.0i_version+ with NFS vers=4.1 exported filesystem xfs options relatime,rw
On Tue, Aug 16, 2022 at 11:58:06AM -0400, Jeff Layton wrote: > On Tue, 2022-08-16 at 08:43 -0700, Darrick J. Wong wrote: > > On Tue, Aug 16, 2022 at 09:17:36AM -0400, Jeff Layton wrote: > > > The i_version in xfs_trans_log_inode is bumped for any inode update, > > > including atime-only updates due to reads. We don't want to record those > > > in the i_version, as they don't represent "real" changes. Remove that > > > callsite. > > > > > > In xfs_vn_update_time, if S_VERSION is flagged, then attempt to bump the > > > i_version and turn on XFS_ILOG_CORE if it happens. In > > > xfs_trans_ichgtime, update the i_version if the mtime or ctime are being > > > updated. > > > > What about operations that don't touch the mtime but change the file > > metadata anyway? There are a few of those, like the blockgc garbage > > collector, deduperange, and the defrag tool. > > > > Do those change the c/mtime at all? > > It's possible we're missing some places that should change the i_version > as well. We may need some more call sites. > > > Zooming out a bit -- what does i_version signal, concretely? I thought > > it was used by nfs (and maybe ceph?) to signal to clients that the file > > on the server has moved on, and the client needs to invalidate its > > caches. I thought afs had a similar generation counter, though it's > > only used to cache file data, not metadata? Does an i_version change > > cause all of them to invalidate caches, or is there more behavior I > > don't know about? > > > > For NFS, it indicates a change to the change attribute indicates that > there has been a change to the data or metadata for the file. atime > changes due to reads are specifically exempted from this, but we do bump > the i_version if someone (e.g.) changes the atime via utimes(). We have relatime behaviour to optimise away unnecessary atime updates on reads. Trying to explicitly exclude i_version from atime updates in one filesystem just because NFS doesn't need that information seems .... misguided. The -on disk- i_version field behaviour is defined by the filesystem implementation, not the NFS requirements. > The NFS client will generally invalidate its caches for the inode when > it notices a change attribute change. > > FWIW, AFS may not meet this standard since it doesn't generally > increment the counter on metadata changes. It may turn out that we don't > want to expose this to the AFS client due to that (or maybe come up with > some way to indicate this difference). In XFS, we've defined the on-disk i_version field to mean "increments with any persistent inode data or metadata change", regardless of what the high level applications that use i_version might actually require. That some network filesystem might only need a subset of the metadata to be covered by i_version is largely irrelevant - if we don't cover every persistent inode metadata change with i_version, then applications that *need* stuff like atime change notification can't be supported. > > Does that mean that we should bump i_version for any file data or > > attribute that could be queried or observed by userspace? In which case > > I suppose this change is still correct, even if it relaxes i_version > > updates from "any change to the inode whatsoever" to "any change that > > would bump mtime". Unless FIEMAP is part of "attributes observed by > > userspace". > > > > (The other downside I can see is that now we have to remember to bump > > timestamps for every new file operation we add, unlike the current code > > which is centrally located in xfs_trans_log_inode.) > > > > The main reason for the change attribute in NFS was that NFSv3 is > plagued with cache-coherency problems due to coarse-grained timestamp > granularity. It was conceived as a way to indicate that the inode had > changed without relying on timestamps. Yes, and the most important design consideration for a filesystem is that it -must be persistent-. The constraints on i_version are much stricter than timestamps, and they are directly related to how the filesystem persists metadata changes, not how metadata is changed or accessed in memory. > In practice, we want to bump the i_version counter whenever the ctime or > mtime would be changed. What about O_NOCMTIME modifications? What about lazytime filesystems? These explicilty avoid or delay persisten c/mtime updates, and that means bumping i_version only based on c/mtime updates cannot be relied on. i_version is supposed to track user visible data and metadata changes, *not timestamp updates*. > > > Cc: Darrick J. Wong <darrick.wong@oracle.com> > > > Cc: Dave Chinner <david@fromorbit.com> > > > Signed-off-by: Jeff Layton <jlayton@kernel.org> > > > --- > > > fs/xfs/libxfs/xfs_trans_inode.c | 17 +++-------------- > > > fs/xfs/xfs_iops.c | 4 ++++ > > > 2 files changed, 7 insertions(+), 14 deletions(-) > > > > > > diff --git a/fs/xfs/libxfs/xfs_trans_inode.c b/fs/xfs/libxfs/xfs_trans_inode.c > > > index 8b5547073379..78bf7f491462 100644 > > > --- a/fs/xfs/libxfs/xfs_trans_inode.c > > > +++ b/fs/xfs/libxfs/xfs_trans_inode.c > > > @@ -71,6 +71,8 @@ xfs_trans_ichgtime( > > > inode->i_ctime = tv; > > > if (flags & XFS_ICHGTIME_CREATE) > > > ip->i_crtime = tv; > > > + if (flags & (XFS_ICHGTIME_MOD|XFS_ICHGTIME_CHG)) > > > + inode_inc_iversion(inode); > > > } That looks wrong - this is not the only path through XFS that modifies timestamps, and I have to ask why this needs to be an explicit i_version bump given that nobody may have looked at i_version since the last time it was updated?. What about xfs_fs_dirty_inode() when we actually persist lazytime in-memory timestamp updates? We didn't bump i_version when setting I_DIRTY_TIME, and this patch now removes the mechanism that is used to bump iversion if it is needed when we persist those lazytime updates..... > > > /* > > > @@ -116,20 +118,7 @@ xfs_trans_log_inode( > > > spin_unlock(&inode->i_lock); > > > } > > > > > > - /* > > > - * First time we log the inode in a transaction, bump the inode change > > > - * counter if it is configured for this to occur. While we have the > > > - * inode locked exclusively for metadata modification, we can usually > > > - * avoid setting XFS_ILOG_CORE if no one has queried the value since > > > - * the last time it was incremented. If we have XFS_ILOG_CORE already > > > - * set however, then go ahead and bump the i_version counter > > > - * unconditionally. > > > - */ > > > - if (!test_and_set_bit(XFS_LI_DIRTY, &iip->ili_item.li_flags)) { > > > - if (IS_I_VERSION(inode) && > > > - inode_maybe_inc_iversion(inode, flags & XFS_ILOG_CORE)) > > > - iversion_flags = XFS_ILOG_CORE; > > > - } > > > + set_bit(XFS_LI_DIRTY, &iip->ili_item.li_flags); .... and this removes the sweep that captures in-memory timestamp and i_version peeks between any persistent inode metadata modifications that have been made, regardless of whether i_version has already been bumped for them or not. IOws, this seems to rely on every future inode modification in XFS calling xfs_trans_ichgtime() to bump i_version to sweep previous VFS in-memory timestamp updates that this inode modification captures and persists to disk. This seems fragile and error prone - it's relying on the developers always getting timestamp and iversion updates correct, rather the code always guaranteeing that it captures timestamp and iversion updates without any extra effort. Hence, I don't think that trying to modify how filesystems persist and maintain i_version coherency because NFS "doesn't need i_version to cover atime updates" is the wrong approach. On-disk i_version coherency has to work for more than just one NFS implementation (especially now i_version will be exported to userspace!). Persistent atime updates are already optimised away by relatime, and so I think that any further atime filtering is largely a NFS application layer problem and not something that should be solved by changing the on-disk definition of back end filesystem structure persistence. Cheers, Dave.
On Tue, Aug 16, 2022 at 01:14:55PM -0400, David Wysochanski wrote: > On Tue, Aug 16, 2022 at 9:19 AM Jeff Layton <jlayton@kernel.org> wrote: > > > > The i_version in xfs_trans_log_inode is bumped for any inode update, > > including atime-only updates due to reads. We don't want to record those > > in the i_version, as they don't represent "real" changes. Remove that > > callsite. > > > > In xfs_vn_update_time, if S_VERSION is flagged, then attempt to bump the > > i_version and turn on XFS_ILOG_CORE if it happens. In > > xfs_trans_ichgtime, update the i_version if the mtime or ctime are being > > updated. > > > > Cc: Darrick J. Wong <darrick.wong@oracle.com> > > Cc: Dave Chinner <david@fromorbit.com> > > Signed-off-by: Jeff Layton <jlayton@kernel.org> > > --- > > fs/xfs/libxfs/xfs_trans_inode.c | 17 +++-------------- > > fs/xfs/xfs_iops.c | 4 ++++ > > 2 files changed, 7 insertions(+), 14 deletions(-) > > > > diff --git a/fs/xfs/libxfs/xfs_trans_inode.c b/fs/xfs/libxfs/xfs_trans_inode.c > > index 8b5547073379..78bf7f491462 100644 > > --- a/fs/xfs/libxfs/xfs_trans_inode.c > > +++ b/fs/xfs/libxfs/xfs_trans_inode.c > > @@ -71,6 +71,8 @@ xfs_trans_ichgtime( > > inode->i_ctime = tv; > > if (flags & XFS_ICHGTIME_CREATE) > > ip->i_crtime = tv; > > + if (flags & (XFS_ICHGTIME_MOD|XFS_ICHGTIME_CHG)) > > + inode_inc_iversion(inode); > > } > > > > /* > > @@ -116,20 +118,7 @@ xfs_trans_log_inode( > > spin_unlock(&inode->i_lock); > > } > > > > - /* > > - * First time we log the inode in a transaction, bump the inode change > > - * counter if it is configured for this to occur. While we have the > > - * inode locked exclusively for metadata modification, we can usually > > - * avoid setting XFS_ILOG_CORE if no one has queried the value since > > - * the last time it was incremented. If we have XFS_ILOG_CORE already > > - * set however, then go ahead and bump the i_version counter > > - * unconditionally. > > - */ > > - if (!test_and_set_bit(XFS_LI_DIRTY, &iip->ili_item.li_flags)) { > > - if (IS_I_VERSION(inode) && > > - inode_maybe_inc_iversion(inode, flags & XFS_ILOG_CORE)) > > - iversion_flags = XFS_ILOG_CORE; > > - } > > + set_bit(XFS_LI_DIRTY, &iip->ili_item.li_flags); > > > > /* > > * If we're updating the inode core or the timestamps and it's possible > > diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c > > index 45518b8c613c..162e044c7f56 100644 > > --- a/fs/xfs/xfs_iops.c > > +++ b/fs/xfs/xfs_iops.c > > @@ -718,6 +718,7 @@ xfs_setattr_nonsize( > > } > > > > setattr_copy(mnt_userns, inode, iattr); > > + inode_inc_iversion(inode); > > xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE); > > > > XFS_STATS_INC(mp, xs_ig_attrchg); > > @@ -943,6 +944,7 @@ xfs_setattr_size( > > > > ASSERT(!(iattr->ia_valid & (ATTR_UID | ATTR_GID))); > > setattr_copy(mnt_userns, inode, iattr); > > + inode_inc_iversion(inode); > > xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE); > > > > XFS_STATS_INC(mp, xs_ig_attrchg); > > @@ -1047,6 +1049,8 @@ xfs_vn_update_time( > > inode->i_mtime = *now; > > if (flags & S_ATIME) > > inode->i_atime = *now; > > + if ((flags & S_VERSION) && inode_maybe_inc_iversion(inode, false)) > > + log_flags |= XFS_ILOG_CORE; > > > > xfs_trans_ijoin(tp, ip, XFS_ILOCK_EXCL); > > xfs_trans_log_inode(tp, ip, log_flags); > > -- > > 2.37.2 > > > > I have a test (details below) that shows an open issue with NFSv4.x + > fscache where an xfs exported filesystem would trigger unnecessary > over the wire READs after a umount/mount cycle of the NFS mount. I > previously tracked this down to atime updates, but never followed > through on any patch. Now that Jeff worked it out and this patch is > under review, I built 5.19 vanilla, retested, then built 5.19 + this > patch and verified the problem is fixed. And so the question that needs to be answered is "why isn't relatime working for this workload to avoid unnecessary atime updates"? Which then makes me ask "what's changing atime on the server between client side reads"? Which then makes me wonder "what's actually changing iversion on the server?" because I don't think atime is the issue here. I suspect that Jeff's patch is affecting this test case by removing iversion updates when the data is written back on the server. i.e. delayed allocation and unwritten extent conversion will no longer bump iversion when they log the inode metadata changes associated with extent allocation to store the data being written. There may be other places where Jeff's patch removes implicit iversion updates, too, so it may not be writeback that is the issue here. How that impacts on the observed behaviour is dependent on things I don't know, like what cachefiles is doing in the background, especially across NFS client unmount/mount cycles. However, this all makes me think the "atime is updated" behaviour is an observed symptom of something else changing iversion and/or cmtime between reads from the server... Cheers, Dave.
On Wed, Aug 17, 2022 at 08:42:57AM +1000, Dave Chinner wrote: > On Tue, Aug 16, 2022 at 11:58:06AM -0400, Jeff Layton wrote: > > On Tue, 2022-08-16 at 08:43 -0700, Darrick J. Wong wrote: > > > On Tue, Aug 16, 2022 at 09:17:36AM -0400, Jeff Layton wrote: > > > > @@ -116,20 +118,7 @@ xfs_trans_log_inode( > > > > spin_unlock(&inode->i_lock); > > > > } > > > > > > > > - /* > > > > - * First time we log the inode in a transaction, bump the inode change > > > > - * counter if it is configured for this to occur. While we have the > > > > - * inode locked exclusively for metadata modification, we can usually > > > > - * avoid setting XFS_ILOG_CORE if no one has queried the value since > > > > - * the last time it was incremented. If we have XFS_ILOG_CORE already > > > > - * set however, then go ahead and bump the i_version counter > > > > - * unconditionally. > > > > - */ > > > > - if (!test_and_set_bit(XFS_LI_DIRTY, &iip->ili_item.li_flags)) { > > > > - if (IS_I_VERSION(inode) && > > > > - inode_maybe_inc_iversion(inode, flags & XFS_ILOG_CORE)) > > > > - iversion_flags = XFS_ILOG_CORE; > > > > - } > > > > + set_bit(XFS_LI_DIRTY, &iip->ili_item.li_flags); > > .... and this removes the sweep that captures in-memory timestamp > and i_version peeks between any persistent inode metadata > modifications that have been made, regardless of whether i_version > has already been bumped for them or not. Which, BTW, breaks the iversion update for xfs_fs_commit_blocks() which the pNFS server calls to inform the filesystem that the pNFS client has finished writing data to a mapped region. This function runs unwritten extent conversion (making the data externally visible) and takes timestamps from the pNFS server. It then persists all these changes, meaning that there will be externally visible data, metadata and timestamp updates persisted to disk by the pNFS server without an iversion update occurring. This iversion stuff is .... complex. It's also really easy to get wrong, and that's even before we start trying to optimise away stuff like timestamp updates.... Cheers, Dave.
On Wed, 2022-08-17 at 08:42 +1000, Dave Chinner wrote: > On Tue, Aug 16, 2022 at 11:58:06AM -0400, Jeff Layton wrote: > > On Tue, 2022-08-16 at 08:43 -0700, Darrick J. Wong wrote: > > > On Tue, Aug 16, 2022 at 09:17:36AM -0400, Jeff Layton wrote: > > > > The i_version in xfs_trans_log_inode is bumped for any inode update, > > > > including atime-only updates due to reads. We don't want to record those > > > > in the i_version, as they don't represent "real" changes. Remove that > > > > callsite. > > > > > > > > In xfs_vn_update_time, if S_VERSION is flagged, then attempt to bump the > > > > i_version and turn on XFS_ILOG_CORE if it happens. In > > > > xfs_trans_ichgtime, update the i_version if the mtime or ctime are being > > > > updated. > > > > > > What about operations that don't touch the mtime but change the file > > > metadata anyway? There are a few of those, like the blockgc garbage > > > collector, deduperange, and the defrag tool. > > > > > > > Do those change the c/mtime at all? > > > > It's possible we're missing some places that should change the i_version > > as well. We may need some more call sites. > > > > > Zooming out a bit -- what does i_version signal, concretely? I thought > > > it was used by nfs (and maybe ceph?) to signal to clients that the file > > > on the server has moved on, and the client needs to invalidate its > > > caches. I thought afs had a similar generation counter, though it's > > > only used to cache file data, not metadata? Does an i_version change > > > cause all of them to invalidate caches, or is there more behavior I > > > don't know about? > > > > > > > For NFS, it indicates a change to the change attribute indicates that > > there has been a change to the data or metadata for the file. atime > > changes due to reads are specifically exempted from this, but we do bump > > the i_version if someone (e.g.) changes the atime via utimes(). > > We have relatime behaviour to optimise away unnecessary atime > updates on reads. Trying to explicitly exclude i_version from atime > updates in one filesystem just because NFS doesn't need that > information seems .... misguided. The -on disk- i_version > field behaviour is defined by the filesystem implementation, not the > NFS requirements. > -o relatime does not fix this. > > The NFS client will generally invalidate its caches for the inode when > > it notices a change attribute change. > > > > FWIW, AFS may not meet this standard since it doesn't generally > > increment the counter on metadata changes. It may turn out that we don't > > want to expose this to the AFS client due to that (or maybe come up with > > some way to indicate this difference). > > In XFS, we've defined the on-disk i_version field to mean > "increments with any persistent inode data or metadata change", > regardless of what the high level applications that use i_version > might actually require. > > That some network filesystem might only need a subset of the > metadata to be covered by i_version is largely irrelevant - if we > don't cover every persistent inode metadata change with i_version, > then applications that *need* stuff like atime change notification > can't be supported. > > > > Does that mean that we should bump i_version for any file data or > > > attribute that could be queried or observed by userspace? In which case > > > I suppose this change is still correct, even if it relaxes i_version > > > updates from "any change to the inode whatsoever" to "any change that > > > would bump mtime". Unless FIEMAP is part of "attributes observed by > > > userspace". > > > > > > (The other downside I can see is that now we have to remember to bump > > > timestamps for every new file operation we add, unlike the current code > > > which is centrally located in xfs_trans_log_inode.) > > > > > > > The main reason for the change attribute in NFS was that NFSv3 is > > plagued with cache-coherency problems due to coarse-grained timestamp > > granularity. It was conceived as a way to indicate that the inode had > > changed without relying on timestamps. > > Yes, and the most important design consideration for a filesystem is > that it -must be persistent-. The constraints on i_version are much > stricter than timestamps, and they are directly related to how the > filesystem persists metadata changes, not how metadata is changed or > accessed in memory. > > > In practice, we want to bump the i_version counter whenever the ctime or > > mtime would be changed. > > What about O_NOCMTIME modifications? What about lazytime > filesystems? These explicilty avoid or delay persisten c/mtime > updates, and that means bumping i_version only based on c/mtime > updates cannot be relied on. i_version is supposed to track user > visible data and metadata changes, *not timestamp updates*. > I was speaking more about the sorts of activity that should result in the i_version being changed, not about tracking timestamp updates. IOW, if some activity would cause the mtime or ctime to change, then we want to also bump the i_version. Specifically, for NOCMTIME, I think we'd still want the i_version to change since that option is about timestamps and not i_version. > > > > Cc: Darrick J. Wong <darrick.wong@oracle.com> > > > > Cc: Dave Chinner <david@fromorbit.com> > > > > Signed-off-by: Jeff Layton <jlayton@kernel.org> > > > > --- > > > > fs/xfs/libxfs/xfs_trans_inode.c | 17 +++-------------- > > > > fs/xfs/xfs_iops.c | 4 ++++ > > > > 2 files changed, 7 insertions(+), 14 deletions(-) > > > > > > > > diff --git a/fs/xfs/libxfs/xfs_trans_inode.c b/fs/xfs/libxfs/xfs_trans_inode.c > > > > index 8b5547073379..78bf7f491462 100644 > > > > --- a/fs/xfs/libxfs/xfs_trans_inode.c > > > > +++ b/fs/xfs/libxfs/xfs_trans_inode.c > > > > @@ -71,6 +71,8 @@ xfs_trans_ichgtime( > > > > inode->i_ctime = tv; > > > > if (flags & XFS_ICHGTIME_CREATE) > > > > ip->i_crtime = tv; > > > > + if (flags & (XFS_ICHGTIME_MOD|XFS_ICHGTIME_CHG)) > > > > + inode_inc_iversion(inode); > > > > } > > That looks wrong - this is not the only path through XFS that > modifies timestamps, and I have to ask why this needs to be an > explicit i_version bump given that nobody may have looked at > i_version since the last time it was updated?. > > What about xfs_fs_dirty_inode() when we actually persist lazytime > in-memory timestamp updates? We didn't bump i_version when setting > I_DIRTY_TIME, and this patch now removes the mechanism that is used > to bump iversion if it is needed when we persist those lazytime > updates..... > > > > > /* > > > > @@ -116,20 +118,7 @@ xfs_trans_log_inode( > > > > spin_unlock(&inode->i_lock); > > > > } > > > > > > > > - /* > > > > - * First time we log the inode in a transaction, bump the inode change > > > > - * counter if it is configured for this to occur. While we have the > > > > - * inode locked exclusively for metadata modification, we can usually > > > > - * avoid setting XFS_ILOG_CORE if no one has queried the value since > > > > - * the last time it was incremented. If we have XFS_ILOG_CORE already > > > > - * set however, then go ahead and bump the i_version counter > > > > - * unconditionally. > > > > - */ > > > > - if (!test_and_set_bit(XFS_LI_DIRTY, &iip->ili_item.li_flags)) { > > > > - if (IS_I_VERSION(inode) && > > > > - inode_maybe_inc_iversion(inode, flags & XFS_ILOG_CORE)) > > > > - iversion_flags = XFS_ILOG_CORE; > > > > - } > > > > + set_bit(XFS_LI_DIRTY, &iip->ili_item.li_flags); > > .... and this removes the sweep that captures in-memory timestamp > and i_version peeks between any persistent inode metadata > modifications that have been made, regardless of whether i_version > has already been bumped for them or not. > > IOws, this seems to rely on every future inode modification in XFS > calling xfs_trans_ichgtime() to bump i_version to sweep previous VFS > in-memory timestamp updates that this inode modification captures > and persists to disk. > > This seems fragile and error prone - it's relying on the > developers always getting timestamp and iversion updates correct, > rather the code always guaranteeing that it captures timestamp and > iversion updates without any extra effort. > > Hence, I don't think that trying to modify how filesystems persist > and maintain i_version coherency because NFS "doesn't need i_version > to cover atime updates" is the wrong approach. On-disk i_version > coherency has to work for more than just one NFS implementation > (especially now i_version will be exported to userspace!). > Persistent atime updates are already optimised away by relatime, and > so I think that any further atime filtering is largely a NFS > application layer problem and not something that should be solved by > changing the on-disk definition of back end filesystem structure > persistence. > Fair enough. xfs is not really in my wheelhouse so take this as a patch that helps illustrate the problem, rather than a serious submission. There are two consumers of the i_version counter today: the kernel NFS server and IMA. Having the i_version reflect atime updates due to reads harms both of those use-cases with unneeded cache invalidations on NFS and extra measurements on IMA. It would also be problematic for userland NFS servers such as ganesha if we end up exposing this to userland. atime updates are really a special case when it comes to metadata (and I maintain that they are one of the worst ideas in POSIX). The way you're choosing to define i_version doesn't really work properly for any current or projected use case. I'd like to see that changed. If the concern is fragility of the code going forward, then maybe we can go with a different approach. Would it be possible to just have xfs_trans_log_inode skip bumping the i_version when the log transaction is _only_ for an atime update due to reads? Maybe we could add a new XFS_ILOG_ATIME_UPDATE flag or something and set it the appropriate codepaths?
On Wed, 2022-08-17 at 09:37 +1000, Dave Chinner wrote: > On Tue, Aug 16, 2022 at 01:14:55PM -0400, David Wysochanski wrote: > > On Tue, Aug 16, 2022 at 9:19 AM Jeff Layton <jlayton@kernel.org> wrote: > > > > > > The i_version in xfs_trans_log_inode is bumped for any inode update, > > > including atime-only updates due to reads. We don't want to record those > > > in the i_version, as they don't represent "real" changes. Remove that > > > callsite. > > > > > > In xfs_vn_update_time, if S_VERSION is flagged, then attempt to bump the > > > i_version and turn on XFS_ILOG_CORE if it happens. In > > > xfs_trans_ichgtime, update the i_version if the mtime or ctime are being > > > updated. > > > > > > Cc: Darrick J. Wong <darrick.wong@oracle.com> > > > Cc: Dave Chinner <david@fromorbit.com> > > > Signed-off-by: Jeff Layton <jlayton@kernel.org> > > > --- > > > fs/xfs/libxfs/xfs_trans_inode.c | 17 +++-------------- > > > fs/xfs/xfs_iops.c | 4 ++++ > > > 2 files changed, 7 insertions(+), 14 deletions(-) > > > > > > diff --git a/fs/xfs/libxfs/xfs_trans_inode.c b/fs/xfs/libxfs/xfs_trans_inode.c > > > index 8b5547073379..78bf7f491462 100644 > > > --- a/fs/xfs/libxfs/xfs_trans_inode.c > > > +++ b/fs/xfs/libxfs/xfs_trans_inode.c > > > @@ -71,6 +71,8 @@ xfs_trans_ichgtime( > > > inode->i_ctime = tv; > > > if (flags & XFS_ICHGTIME_CREATE) > > > ip->i_crtime = tv; > > > + if (flags & (XFS_ICHGTIME_MOD|XFS_ICHGTIME_CHG)) > > > + inode_inc_iversion(inode); > > > } > > > > > > /* > > > @@ -116,20 +118,7 @@ xfs_trans_log_inode( > > > spin_unlock(&inode->i_lock); > > > } > > > > > > - /* > > > - * First time we log the inode in a transaction, bump the inode change > > > - * counter if it is configured for this to occur. While we have the > > > - * inode locked exclusively for metadata modification, we can usually > > > - * avoid setting XFS_ILOG_CORE if no one has queried the value since > > > - * the last time it was incremented. If we have XFS_ILOG_CORE already > > > - * set however, then go ahead and bump the i_version counter > > > - * unconditionally. > > > - */ > > > - if (!test_and_set_bit(XFS_LI_DIRTY, &iip->ili_item.li_flags)) { > > > - if (IS_I_VERSION(inode) && > > > - inode_maybe_inc_iversion(inode, flags & XFS_ILOG_CORE)) > > > - iversion_flags = XFS_ILOG_CORE; > > > - } > > > + set_bit(XFS_LI_DIRTY, &iip->ili_item.li_flags); > > > > > > /* > > > * If we're updating the inode core or the timestamps and it's possible > > > diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c > > > index 45518b8c613c..162e044c7f56 100644 > > > --- a/fs/xfs/xfs_iops.c > > > +++ b/fs/xfs/xfs_iops.c > > > @@ -718,6 +718,7 @@ xfs_setattr_nonsize( > > > } > > > > > > setattr_copy(mnt_userns, inode, iattr); > > > + inode_inc_iversion(inode); > > > xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE); > > > > > > XFS_STATS_INC(mp, xs_ig_attrchg); > > > @@ -943,6 +944,7 @@ xfs_setattr_size( > > > > > > ASSERT(!(iattr->ia_valid & (ATTR_UID | ATTR_GID))); > > > setattr_copy(mnt_userns, inode, iattr); > > > + inode_inc_iversion(inode); > > > xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE); > > > > > > XFS_STATS_INC(mp, xs_ig_attrchg); > > > @@ -1047,6 +1049,8 @@ xfs_vn_update_time( > > > inode->i_mtime = *now; > > > if (flags & S_ATIME) > > > inode->i_atime = *now; > > > + if ((flags & S_VERSION) && inode_maybe_inc_iversion(inode, false)) > > > + log_flags |= XFS_ILOG_CORE; > > > > > > xfs_trans_ijoin(tp, ip, XFS_ILOCK_EXCL); > > > xfs_trans_log_inode(tp, ip, log_flags); > > > -- > > > 2.37.2 > > > > > > > I have a test (details below) that shows an open issue with NFSv4.x + > > fscache where an xfs exported filesystem would trigger unnecessary > > over the wire READs after a umount/mount cycle of the NFS mount. I > > previously tracked this down to atime updates, but never followed > > through on any patch. Now that Jeff worked it out and this patch is > > under review, I built 5.19 vanilla, retested, then built 5.19 + this > > patch and verified the problem is fixed. > > And so the question that needs to be answered is "why isn't relatime > working for this workload to avoid unnecessary atime updates"? > > Which then makes me ask "what's changing atime on the server between > client side reads"? > > Which then makes me wonder "what's actually changing iversion on the > server?" because I don't think atime is the issue here. > > I suspect that Jeff's patch is affecting this test case by removing > iversion updates when the data is written back on the server. i.e. > delayed allocation and unwritten extent conversion will no longer > bump iversion when they log the inode metadata changes associated > with extent allocation to store the data being written. There may > be other places where Jeff's patch removes implicit iversion > updates, too, so it may not be writeback that is the issue here. > > How that impacts on the observed behaviour is dependent on things I > don't know, like what cachefiles is doing in the background, > especially across NFS client unmount/mount cycles. However, this all > makes me think the "atime is updated" behaviour is an observed > symptom of something else changing iversion and/or cmtime between > reads from the server... > You may be right here. What I see with both noatime and relatime is that the first read after a write to a file results in the i_version being incremented, but then it doesn't change on subsequent reads. Write to the file again, and the i_version will get incremented again on the next read and then not again until there is a write.
On Wed, Aug 17, 2022 at 08:10:22AM -0400, Jeff Layton wrote: > On Wed, 2022-08-17 at 09:37 +1000, Dave Chinner wrote: > > On Tue, Aug 16, 2022 at 01:14:55PM -0400, David Wysochanski wrote: > > > On Tue, Aug 16, 2022 at 9:19 AM Jeff Layton <jlayton@kernel.org> wrote: > > > I have a test (details below) that shows an open issue with NFSv4.x + > > > fscache where an xfs exported filesystem would trigger unnecessary > > > over the wire READs after a umount/mount cycle of the NFS mount. I > > > previously tracked this down to atime updates, but never followed > > > through on any patch. Now that Jeff worked it out and this patch is > > > under review, I built 5.19 vanilla, retested, then built 5.19 + this > > > patch and verified the problem is fixed. > > > > And so the question that needs to be answered is "why isn't relatime > > working for this workload to avoid unnecessary atime updates"? > > > > Which then makes me ask "what's changing atime on the server between > > client side reads"? > > > > Which then makes me wonder "what's actually changing iversion on the > > server?" because I don't think atime is the issue here. > > > > I suspect that Jeff's patch is affecting this test case by removing > > iversion updates when the data is written back on the server. i.e. > > delayed allocation and unwritten extent conversion will no longer > > bump iversion when they log the inode metadata changes associated > > with extent allocation to store the data being written. There may > > be other places where Jeff's patch removes implicit iversion > > updates, too, so it may not be writeback that is the issue here. > > > > How that impacts on the observed behaviour is dependent on things I > > don't know, like what cachefiles is doing in the background, > > especially across NFS client unmount/mount cycles. However, this all > > makes me think the "atime is updated" behaviour is an observed > > symptom of something else changing iversion and/or cmtime between > > reads from the server... > > > > You may be right here. > > What I see with both noatime and relatime is that the first read after a > write to a file results in the i_version being incremented, but then it > doesn't change on subsequent reads. Write to the file again, and the > i_version will get incremented again on the next read and then not again > until there is a write. Yup, that is exactly the behaviour relatime encodes at the VFS. THe data write updates mtime (or is it ctime? doesn't matter, though) and the very next atime access check done by the VFS sees that the mtime is more recent than the atime, so it asks the filesystem to do an atime update. Cheers, Dave.
On Wed, 17 Aug 2022, Dave Chinner wrote: > > In XFS, we've defined the on-disk i_version field to mean > "increments with any persistent inode data or metadata change", > regardless of what the high level applications that use i_version > might actually require. > > That some network filesystem might only need a subset of the > metadata to be covered by i_version is largely irrelevant - if we > don't cover every persistent inode metadata change with i_version, > then applications that *need* stuff like atime change notification > can't be supported. So what you are saying is that the i_version provided by XFS does not match the changeid semantics required by NFSv4. Fair enough. I guess we shouldn't use the one to implement the other then. Maybe we should just go back to using ctime. ctime is *exactly* what NFSv4 wants, as long as its granularity is sufficient to catch every single change. Presumably XFS doesn't try to ensure this. How hard would it be to get any ctime update to add at least one nanosecond? This would be enabled by a mount option, or possibly be a direct request from nfsd. <rant>NFSv4 changeid is really one of the more horrible parts of the design</rant> NeilBrown
On Wed, Aug 17, 2022 at 08:02:23AM -0400, Jeff Layton wrote: > On Wed, 2022-08-17 at 08:42 +1000, Dave Chinner wrote: > > On Tue, Aug 16, 2022 at 11:58:06AM -0400, Jeff Layton wrote: > > > On Tue, 2022-08-16 at 08:43 -0700, Darrick J. Wong wrote: > > > > On Tue, Aug 16, 2022 at 09:17:36AM -0400, Jeff Layton wrote: > > > > > The i_version in xfs_trans_log_inode is bumped for any inode update, > > > > > including atime-only updates due to reads. We don't want to record those > > > > > in the i_version, as they don't represent "real" changes. Remove that > > > > > callsite. > > > > > > > > > > In xfs_vn_update_time, if S_VERSION is flagged, then attempt to bump the > > > > > i_version and turn on XFS_ILOG_CORE if it happens. In > > > > > xfs_trans_ichgtime, update the i_version if the mtime or ctime are being > > > > > updated. > > > > > > > > What about operations that don't touch the mtime but change the file > > > > metadata anyway? There are a few of those, like the blockgc garbage > > > > collector, deduperange, and the defrag tool. > > > > > > > > > > Do those change the c/mtime at all? > > > > > > It's possible we're missing some places that should change the i_version > > > as well. We may need some more call sites. > > > > > > > Zooming out a bit -- what does i_version signal, concretely? I thought > > > > it was used by nfs (and maybe ceph?) to signal to clients that the file > > > > on the server has moved on, and the client needs to invalidate its > > > > caches. I thought afs had a similar generation counter, though it's > > > > only used to cache file data, not metadata? Does an i_version change > > > > cause all of them to invalidate caches, or is there more behavior I > > > > don't know about? > > > > > > > > > > For NFS, it indicates a change to the change attribute indicates that > > > there has been a change to the data or metadata for the file. atime > > > changes due to reads are specifically exempted from this, but we do bump > > > the i_version if someone (e.g.) changes the atime via utimes(). > > > > We have relatime behaviour to optimise away unnecessary atime > > updates on reads. Trying to explicitly exclude i_version from atime > > updates in one filesystem just because NFS doesn't need that > > information seems .... misguided. The -on disk- i_version > > field behaviour is defined by the filesystem implementation, not the > > NFS requirements. > > -o relatime does not fix this. So why not fix -o relatime to handle this? That way the fix works for all filesystems and doesn't require hacking around what the VFS has told us to do in every filesystem. i.e. the VFS told us to update atime, we updated atime and that is a persistent metadata change to the inode. Hence a filesystem with a persistent change counter has to bump the change counter because we've been asked by the VFS to make a persistent metadata change. If you want atime updates to not make persistent changes to on disk metadata, then change the relatime implementation so that it doesn't ask the filesystem to update the on-disk atime. Essentially, what I'm hearing is that NFS wants atime updates to behave like lazytime, not like relatime. With lazytime, atime always gets updated in memory, but it is not written back to the filesystem until a timeout or some other modification is made to the inode or file data. THe filesystem doesn't bump iversion until the timestamp gets written back in the lazytime case. IOWs, we already have a mechanism in the kernel for making atime updates behave exactly as NFS wants: -o lazytime. > > > The NFS client will generally invalidate its caches for the inode when > > > it notices a change attribute change. > > > > > > FWIW, AFS may not meet this standard since it doesn't generally > > > increment the counter on metadata changes. It may turn out that we don't > > > want to expose this to the AFS client due to that (or maybe come up with > > > some way to indicate this difference). > > > > In XFS, we've defined the on-disk i_version field to mean > > "increments with any persistent inode data or metadata change", > > regardless of what the high level applications that use i_version > > might actually require. > > > > That some network filesystem might only need a subset of the > > metadata to be covered by i_version is largely irrelevant - if we > > don't cover every persistent inode metadata change with i_version, > > then applications that *need* stuff like atime change notification > > can't be supported. > > > > > > Does that mean that we should bump i_version for any file data or > > > > attribute that could be queried or observed by userspace? In which case > > > > I suppose this change is still correct, even if it relaxes i_version > > > > updates from "any change to the inode whatsoever" to "any change that > > > > would bump mtime". Unless FIEMAP is part of "attributes observed by > > > > userspace". > > > > > > > > (The other downside I can see is that now we have to remember to bump > > > > timestamps for every new file operation we add, unlike the current code > > > > which is centrally located in xfs_trans_log_inode.) > > > > > > > > > > The main reason for the change attribute in NFS was that NFSv3 is > > > plagued with cache-coherency problems due to coarse-grained timestamp > > > granularity. It was conceived as a way to indicate that the inode had > > > changed without relying on timestamps. > > > > Yes, and the most important design consideration for a filesystem is > > that it -must be persistent-. The constraints on i_version are much > > stricter than timestamps, and they are directly related to how the > > filesystem persists metadata changes, not how metadata is changed or > > accessed in memory. > > > > > In practice, we want to bump the i_version counter whenever the ctime or > > > mtime would be changed. > > > > What about O_NOCMTIME modifications? What about lazytime > > filesystems? These explicilty avoid or delay persisten c/mtime > > updates, and that means bumping i_version only based on c/mtime > > updates cannot be relied on. i_version is supposed to track user > > visible data and metadata changes, *not timestamp updates*. > > I was speaking more about the sorts of activity that should result in > the i_version being changed, not about tracking timestamp updates. IOW, > if some activity would cause the mtime or ctime to change, then we want > to also bump the i_version. > > Specifically, for NOCMTIME, I think we'd still want the i_version to > change since that option is about timestamps and not i_version. Exactly my point: this is what XFS currently does. It is also what your proposed changes break by tying i_version updates to c/mtime updates. > > Hence, I don't think that trying to modify how filesystems persist > > and maintain i_version coherency because NFS "doesn't need i_version > > to cover atime updates" is the wrong approach. On-disk i_version > > coherency has to work for more than just one NFS implementation > > (especially now i_version will be exported to userspace!). > > Persistent atime updates are already optimised away by relatime, and > > so I think that any further atime filtering is largely a NFS > > application layer problem and not something that should be solved by > > changing the on-disk definition of back end filesystem structure > > persistence. > > > > Fair enough. xfs is not really in my wheelhouse so take this as a patch > that helps illustrate the problem, rather than a serious submission. > > There are two consumers of the i_version counter today: the kernel NFS > server and IMA. Having the i_version reflect atime updates due to reads > harms both of those use-cases with unneeded cache invalidations on NFS > and extra measurements on IMA. It would also be problematic for userland > NFS servers such as ganesha if we end up exposing this to userland. So you're wanting define an exact behaviour for atime vs I_VERSION where atime doesn't bump iversion. However, the definition we baked into the XFS on-disk format is that the on-disk iversion filed changes for every persistent change made to the inode. That includes atime updates. We did this for two reasons - the first is so that we could support any application that needed change detection, and the second was that knowing how many changes have occurred to an inode is extremely useful for forensic purposes (especially given the ability to use O_NOCMTIME to modify file data). > atime updates are really a special case when it comes to metadata (and I > maintain that they are one of the worst ideas in POSIX). The way you're > choosing to define i_version doesn't really work properly for any > current or projected use case. I'd like to see that changed. We chose to do that a decade ago, knowing that it is the responsibility of the VFS to avoid unnecessary atime updates, not the responsibility of the filesystem. That was the whole point of introducing the relatime functionality: fix the problem at the VFS, not have to work around generic atime behaviour in every filesystem. > If the concern is fragility of the code going forward, then maybe we can > go with a different approach. Would it be possible to just have > xfs_trans_log_inode skip bumping the i_version when the log transaction > is _only_ for an atime update due to reads? Maybe we could add a new > XFS_ILOG_ATIME_UPDATE flag or something and set it the appropriate > codepaths? No, I don't think this is something we should be hacking around in individual filesystems. If the VFS tells us we should be updating atime, we should be updating it and bumping persistent change counters because *that's what the VFS asked us to do*. IOWs, if NFS wants atime to be considered "in memory only" as per the lazytime behaviour, then that behaviour needs to be supported/changed at the VFS, not at the individual fileystem level. You could add a subset of SB_LAZYTIME functionality just for atime updates, handle that entirely within the existing lazytime infrastructure. THis means the VFS supports non-persistent, best-effort, non-persistent atime updates directly. The VFS will not ack filesystems to persist atime updates the moment they are made, it will tell the filesystem via ->dirty_inode() that it needs to persist the in-memory timestamps. This gives all the filesystems the same atime behaviour. If the VFS is going to expose the persistent change counter to userspace, we need to have consistent, VFS enforced rules on what is coverred by the change counter. If atime is not covered by the ipersistent change counter, then the VFS must not ask the filesystems to persist atime changes every time atime changes. Then all filesystems will avoid on-disk atime updates as much as possible, whilst still gathering them correctly when other modifications are made. Until we've got a definition of what this persistent change counter actually describes and guarantees for userspace, changing filesystem implementations is premature. And if it is decided that atime is not going to be considered a "persistent change" by itself, then that behaviour needs to be encoded at the VFS, not in individual filesystems.... Cheers, Dave.
On Wed, 2022-08-17 at 08:42 +1000, Dave Chinner wrote: > > In XFS, we've defined the on-disk i_version field to mean > "increments with any persistent inode data or metadata change", > regardless of what the high level applications that use i_version > might actually require. > > That some network filesystem might only need a subset of the > metadata to be covered by i_version is largely irrelevant - if we > don't cover every persistent inode metadata change with i_version, > then applications that *need* stuff like atime change notification > can't be supported. OK, I'll bite... What real world application are we talking about here, and why can't it just read both the atime + i_version if it cares? The value of the change attribute lies in the fact that it gives you ctime semantics without the time resolution limitation. i.e. if the change attribute has changed, then you know that someone has explicitly modified either the file data or the file metadata (with the emphasis being on the word "explicitly"). Implicit changes such as the mtime change due to a write are reflected only because they are necessarily also accompanied by an explicit change to the data contents of the file. Implicit changes, such as the atime changes due to a read are not reflected in the change attribute because there is no explicit change being made by an application. Any implicit change in the metadata can be derived by just reading the attribute in question: there is only 1 (atime) and it is supposed to be monotonically increasing, hence is dead simple to detect.
On Thu, Aug 18, 2022 at 10:34:40AM +1000, NeilBrown wrote: > On Wed, 17 Aug 2022, Dave Chinner wrote: > > > > In XFS, we've defined the on-disk i_version field to mean > > "increments with any persistent inode data or metadata change", > > regardless of what the high level applications that use i_version > > might actually require. > > > > That some network filesystem might only need a subset of the > > metadata to be covered by i_version is largely irrelevant - if we > > don't cover every persistent inode metadata change with i_version, > > then applications that *need* stuff like atime change notification > > can't be supported. > > So what you are saying is that the i_version provided by XFS does not > match the changeid semantics required by NFSv4. Fair enough. I guess > we shouldn't use the one to implement the other then. True, but there's more nuance to it than that. We can already provide the NFSv4 requirements without filesystem modification - just mount the filesytsem with "-o lazytime". There's *always* a difference between what the filesystem implements and what an application require. The filesystem implements what the VFS asks it to do, not what the application needs. The VFS provides the functionality they applications require, not the filesystems. atime update filtering has long been VFS functionality - we do it there so behaviour is common across all filesystems. Therefore, if the NFS application wants atime to always behave as if lazytime is enabled, that filtering should be done at the VFS atime filtering layer. Why should we duplicate VFS atime update filtering in every filesystem for the default relatime behaviour when the VFS lazytime filter already provides this filtering function to everyone? > Maybe we should just go back to using ctime. ctime is *exactly* what > NFSv4 wants, as long as its granularity is sufficient to catch every > single change. Presumably XFS doesn't try to ensure this. How hard > would it be to get any ctime update to add at least one nanosecond? > This would be enabled by a mount option, or possibly be a direct request > from nfsd. We can't rely on ctime to be changed during a modification because O_NOCMTIME exists to enable "user invisible" modifications to be made. On XFS these still bump iversion, so while they are invisible to the user, they are still tracked by the filesystem and anything that wants to know if the inode data/metadata changed. Cheers, Dave.
On Thu, 18 Aug 2022, Dave Chinner wrote: > > > Maybe we should just go back to using ctime. ctime is *exactly* what > > NFSv4 wants, as long as its granularity is sufficient to catch every > > single change. Presumably XFS doesn't try to ensure this. How hard > > would it be to get any ctime update to add at least one nanosecond? > > This would be enabled by a mount option, or possibly be a direct request > > from nfsd. > > We can't rely on ctime to be changed during a modification because > O_NOCMTIME exists to enable "user invisible" modifications to be > made. On XFS these still bump iversion, so while they are invisible > to the user, they are still tracked by the filesystem and anything > that wants to know if the inode data/metadata changed. > O_NOCMTIME isn't mentioned in the man page, so it doesn't exist :-( If they are "user invisible", should they then also be "NFS invisible"? I think so. As I understand it, the purpose of O_NOCMTIME is to allow optimisations - do a lot of writes, then update the mtime, thus reducing latency. I think it is perfectly reasonable for all of that to be invisible to NFS. NeilBrown
On Thu, 2022-08-18 at 11:52 +1000, NeilBrown wrote: > On Thu, 18 Aug 2022, Dave Chinner wrote: > > > > > Maybe we should just go back to using ctime. ctime is *exactly* > > > what > > > NFSv4 wants, as long as its granularity is sufficient to catch > > > every > > > single change. Presumably XFS doesn't try to ensure this. How > > > hard > > > would it be to get any ctime update to add at least one > > > nanosecond? > > > This would be enabled by a mount option, or possibly be a direct > > > request > > > from nfsd. > > > > We can't rely on ctime to be changed during a modification because > > O_NOCMTIME exists to enable "user invisible" modifications to be > > made. On XFS these still bump iversion, so while they are invisible > > to the user, they are still tracked by the filesystem and anything > > that wants to know if the inode data/metadata changed. > > > > O_NOCMTIME isn't mentioned in the man page, so it doesn't exist :-( > > If they are "user invisible", should they then also be "NFS > invisible"? > I think so. > As I understand it, the purpose of O_NOCMTIME is to allow > optimisations > - do a lot of writes, then update the mtime, thus reducing latency. > I > think it is perfectly reasonable for all of that to be invisible to > NFS. The point is that you can always detect an implicit metadata change by just reading the attribute that changed. The case of an explicit change is different, because the application might be changing the value back to a previous setting. The entire value of ctime is that it allows you to know not to trust any caches when this might be the case because it records the fact that there was an explicit data or metadata change somewhere along the line. By discarding that information about explicit vs implicit changes, XFS is making the i_version less useful to applications that need to cache data and/or metadata. So the real question is: for which real world applications is this behaviour adding value that could not be derived through other means?
On Thu, Aug 18, 2022 at 11:52:12AM +1000, NeilBrown wrote: > On Thu, 18 Aug 2022, Dave Chinner wrote: > > > > > Maybe we should just go back to using ctime. ctime is *exactly* what > > > NFSv4 wants, as long as its granularity is sufficient to catch every > > > single change. Presumably XFS doesn't try to ensure this. How hard > > > would it be to get any ctime update to add at least one nanosecond? > > > This would be enabled by a mount option, or possibly be a direct request > > > from nfsd. > > > > We can't rely on ctime to be changed during a modification because > > O_NOCMTIME exists to enable "user invisible" modifications to be > > made. On XFS these still bump iversion, so while they are invisible > > to the user, they are still tracked by the filesystem and anything > > that wants to know if the inode data/metadata changed. > > > > O_NOCMTIME isn't mentioned in the man page, so it doesn't exist :-( > > If they are "user invisible", should they then also be "NFS invisible"? > I think so. Maybe, but now you're making big assumptions about what is being done by those operations. Userspace can write whatever it likes, nothing says that O_NOCMTIME can't change user visible data or metadata. > As I understand it, the purpose of O_NOCMTIME is to allow optimisations > - do a lot of writes, then update the mtime, thus reducing latency. I > think it is perfectly reasonable for all of that to be invisible to NFS. O_NOCMTIME is used by things like HSMs, file defragmenters, deduplication tools, backup programs, etc to be able to read/write data and manipulate file layout without modifying user visible timestamps. i.e. users shouldn't notice that the online defragmenter defragmented their file. Backup programs shouldn't notice the defragmenter defragmented the file. But having uses of it that don't change user visible data does not mean it can't be used for changing user visible data. Hence we made the defensive assumption that O_NOCMTIME was a mechanism that could be used to hide data changes from forensic analysis. With that in mind, it was important that the change counter captured changes made even when O_NOCMTIME was specified to leave behind a breadcrumb to indicate unexpected changes may had been made to the file. Yeah, we had lots of different requirements for the XFS on-disk change counter when we were considering adding it. NFSv4 was one of the least demanding and least defined requirements; it's taken a *decade* for this atime issue to be noticed, so I really don't think there's anything wrong with how XFs has implemented persistent change counters. What it tells me is that the VFS needs more appropriate atime filtering for NFSv4's change attribute requirements.... Cheers, Dave.
On Thu, Aug 18, 2022 at 01:11:09AM +0000, Trond Myklebust wrote: > On Wed, 2022-08-17 at 08:42 +1000, Dave Chinner wrote: > > > > In XFS, we've defined the on-disk i_version field to mean > > "increments with any persistent inode data or metadata change", > > regardless of what the high level applications that use i_version > > might actually require. > > > > That some network filesystem might only need a subset of the > > metadata to be covered by i_version is largely irrelevant - if we > > don't cover every persistent inode metadata change with i_version, > > then applications that *need* stuff like atime change notification > > can't be supported. > > OK, I'll bite... > > What real world application are we talking about here, and why can't it > just read both the atime + i_version if it cares? The whole point of i_version is that the aplication can skip the storage and comparison of individual metadata fields to determine if anythign changed. If you're going to store multiple fields and compare them all in addition to the change attribute, then what is the change attribute actually gaining you? > The value of the change attribute lies in the fact that it gives you > ctime semantics without the time resolution limitation. > i.e. if the change attribute has changed, then you know that someone > has explicitly modified either the file data or the file metadata (with > the emphasis being on the word "explicitly"). > Implicit changes such as the mtime change due to a write are reflected > only because they are necessarily also accompanied by an explicit > change to the data contents of the file. > Implicit changes, such as the atime changes due to a read are not > reflected in the change attribute because there is no explicit change > being made by an application. That's the *NFSv4 requirements*, not what people were asking XFS to support in a persistent change attribute 10-15 years ago. What NFS required was just one of the inputs at the time, and what we implemented has kept NFSv4 happy for the past decade. I've mentioned other requirements elsewhere in the thread. The problem we're talking about here is essentially a relatime filtering issue; it's triggering an filesystem update because the first access after a modification triggers an on-disk atime update rahter than just storing it in memory. This is not a filesystem issue - the VFS controls when the on-disk updates occur, and that what NFSv4 appears to need changed. If NFS doesn't want the filesystem to bump change counters for on-disk atime updates, then it should be asking the VFS to keep the atime updates in memory. e.g. use lazytime semantics. This way, every filesystem will have the same behaviour, regardless of how they track/store persistent change count metadata. Cheers, Dave.
On Thu, 2022-08-18 at 13:37 +1000, Dave Chinner wrote: > On Thu, Aug 18, 2022 at 01:11:09AM +0000, Trond Myklebust wrote: > > On Wed, 2022-08-17 at 08:42 +1000, Dave Chinner wrote: > > > > > > In XFS, we've defined the on-disk i_version field to mean > > > "increments with any persistent inode data or metadata change", > > > regardless of what the high level applications that use i_version > > > might actually require. > > > > > > That some network filesystem might only need a subset of the > > > metadata to be covered by i_version is largely irrelevant - if we > > > don't cover every persistent inode metadata change with > > > i_version, > > > then applications that *need* stuff like atime change > > > notification > > > can't be supported. > > > > OK, I'll bite... > > > > What real world application are we talking about here, and why > > can't it > > just read both the atime + i_version if it cares? > > The whole point of i_version is that the aplication can skip the > storage and comparison of individual metadata fields to determine if > anythign changed. If you're going to store multiple fields and > compare them all in addition to the change attribute, then what is > the change attribute actually gaining you? Information that is not contained in the fields themselves. Such as metadata about the fact that they were explicitly changed by an application. > > > The value of the change attribute lies in the fact that it gives > > you > > ctime semantics without the time resolution limitation. > > i.e. if the change attribute has changed, then you know that > > someone > > has explicitly modified either the file data or the file metadata > > (with > > the emphasis being on the word "explicitly"). > > Implicit changes such as the mtime change due to a write are > > reflected > > only because they are necessarily also accompanied by an explicit > > change to the data contents of the file. > > Implicit changes, such as the atime changes due to a read are not > > reflected in the change attribute because there is no explicit > > change > > being made by an application. > > That's the *NFSv4 requirements*, not what people were asking XFS to > support in a persistent change attribute 10-15 years ago. What NFS > required was just one of the inputs at the time, and what we > implemented has kept NFSv4 happy for the past decade. I've mentioned > other requirements elsewhere in the thread NFS can work with a change attribute that tells it to invalidate its cache on every read. The only side effect will be that the performance graph will act as if you were filtering it through a cow's digestive system... > The problem we're talking about here is essentially a relatime > filtering issue; it's triggering an filesystem update because the > first access after a modification triggers an on-disk atime update > rahter than just storing it in memory. No. It's not... You appear to be discarding valuable information about why the attributes changed. I've been asking you for reasons why, and you're avoiding the question. > This is not a filesystem issue - the VFS controls when the on-disk > updates occur, and that what NFSv4 appears to need changed. > If NFS doesn't want the filesystem to bump change counters for > on-disk atime updates, then it should be asking the VFS to keep the > atime updates in memory. e.g. use lazytime semantics. > > This way, every filesystem will have the same behaviour, regardless > of how they track/store persistent change count metadata. Right now, the i_version updates are not exported via any common API, so any piss poor performance side-effects of the implementation are pretty much limited to the kernel users (NFS and... ???) Who do you expect to use this attribute if it were to be exported via statx() as Jeff is proposing, and why is the XFS behaviour appropriate? It already differs from the behaviour of both btrfs and NFS, so the argument that this will magically consolidate behaviour can be ignored.
On Thu, 2022-08-18 at 10:34 +1000, NeilBrown wrote: > On Wed, 17 Aug 2022, Dave Chinner wrote: > > > > In XFS, we've defined the on-disk i_version field to mean > > "increments with any persistent inode data or metadata change", > > regardless of what the high level applications that use i_version > > might actually require. > > > > That some network filesystem might only need a subset of the > > metadata to be covered by i_version is largely irrelevant - if we > > don't cover every persistent inode metadata change with i_version, > > then applications that *need* stuff like atime change notification > > can't be supported. > > So what you are saying is that the i_version provided by XFS does not > match the changeid semantics required by NFSv4. Fair enough. I guess > we shouldn't use the one to implement the other then. > > Maybe we should just go back to using ctime. ctime is *exactly* what > NFSv4 wants, as long as its granularity is sufficient to catch every > single change. Presumably XFS doesn't try to ensure this. How hard > would it be to get any ctime update to add at least one nanosecond? > This would be enabled by a mount option, or possibly be a direct request > from nfsd. > I think that would be an unfortunate outcome, but if we can't stop xfs from bumping the i_version on atime updates, then we may have no choice but to do so. I suppose we could add a fetch_iversion for xfs that takes it back to using the ctime. > <rant>NFSv4 changeid is really one of the more horrible parts of the > design</rant> > Hah! I was telling Tom Talpey yesterday that I thought that the change counter was one of the best ideas in NFSv4 and that we should be trying to get all filesystems to implement it correctly. The part that does suck about the design is that the original specs weren't specific enough about its behavior. I think that's been somewhat remedied in more recent RFCs though.
On Thu, 2022-08-18 at 04:15 +0000, Trond Myklebust wrote: > On Thu, 2022-08-18 at 13:37 +1000, Dave Chinner wrote: > > On Thu, Aug 18, 2022 at 01:11:09AM +0000, Trond Myklebust wrote: > > > On Wed, 2022-08-17 at 08:42 +1000, Dave Chinner wrote: > > > > > > > > In XFS, we've defined the on-disk i_version field to mean > > > > "increments with any persistent inode data or metadata change", > > > > regardless of what the high level applications that use i_version > > > > might actually require. > > > > > > > > That some network filesystem might only need a subset of the > > > > metadata to be covered by i_version is largely irrelevant - if we > > > > don't cover every persistent inode metadata change with > > > > i_version, > > > > then applications that *need* stuff like atime change > > > > notification > > > > can't be supported. > > > > > > OK, I'll bite... > > > > > > What real world application are we talking about here, and why > > > can't it > > > just read both the atime + i_version if it cares? > > > > The whole point of i_version is that the aplication can skip the > > storage and comparison of individual metadata fields to determine if > > anythign changed. If you're going to store multiple fields and > > compare them all in addition to the change attribute, then what is > > the change attribute actually gaining you? > > Information that is not contained in the fields themselves. Such as > metadata about the fact that they were explicitly changed by an > application. > > > > > > The value of the change attribute lies in the fact that it gives > > > you > > > ctime semantics without the time resolution limitation. > > > i.e. if the change attribute has changed, then you know that > > > someone > > > has explicitly modified either the file data or the file metadata > > > (with > > > the emphasis being on the word "explicitly"). > > > Implicit changes such as the mtime change due to a write are > > > reflected > > > only because they are necessarily also accompanied by an explicit > > > change to the data contents of the file. > > > Implicit changes, such as the atime changes due to a read are not > > > reflected in the change attribute because there is no explicit > > > change > > > being made by an application. > > > > That's the *NFSv4 requirements*, not what people were asking XFS to > > support in a persistent change attribute 10-15 years ago. What NFS > > required was just one of the inputs at the time, and what we > > implemented has kept NFSv4 happy for the past decade. I've mentioned > > other requirements elsewhere in the thread > > NFS can work with a change attribute that tells it to invalidate its > cache on every read. The only side effect will be that the performance > graph will act as if you were filtering it through a cow's digestive > system... > > > The problem we're talking about here is essentially a relatime > > filtering issue; it's triggering an filesystem update because the > > first access after a modification triggers an on-disk atime update > > rahter than just storing it in memory. > > No. It's not... You appear to be discarding valuable information about > why the attributes changed. I've been asking you for reasons why, and > you're avoiding the question. > > > This is not a filesystem issue - the VFS controls when the on-disk > > updates occur, and that what NFSv4 appears to need changed. > > If NFS doesn't want the filesystem to bump change counters for > > on-disk atime updates, then it should be asking the VFS to keep the > > atime updates in memory. e.g. use lazytime semantics. > > > > This way, every filesystem will have the same behaviour, regardless > > of how they track/store persistent change count metadata. > > Right now, the i_version updates are not exported via any common API, > so any piss poor performance side-effects of the implementation are > pretty much limited to the kernel users (NFS and... ???) > > Who do you expect to use this attribute if it were to be exported via > statx() as Jeff is proposing, and why is the XFS behaviour appropriate? > It already differs from the behaviour of both btrfs and NFS, so the > argument that this will magically consolidate behaviour can be ignored. > Thanks Trond, That's been exactly the point I've been trying to make. The _only_ consumers of i_version at this time are the kernel's NFS server and IMA. Both of them will still work with the i_version being updated due to atime updates, but their performance suffers. The change I'm proposing should bring xfs in line with other providers of i_version as well. btrfs already behaves correctly, and I have a proposed patch for ext4 which should fix it. The ext4 devs seem amenable to it so far. Dave, you keep talking about the xfs i_version counter as if there are other applications already relying on its behavior, but I don't see how that can be. There is no way for userland applications to fetch the counter currently.
On Thu, 2022-08-18 at 11:07 +1000, Dave Chinner wrote: > On Wed, Aug 17, 2022 at 08:02:23AM -0400, Jeff Layton wrote: > > On Wed, 2022-08-17 at 08:42 +1000, Dave Chinner wrote: > > > On Tue, Aug 16, 2022 at 11:58:06AM -0400, Jeff Layton wrote: > > > > On Tue, 2022-08-16 at 08:43 -0700, Darrick J. Wong wrote: > > > > > On Tue, Aug 16, 2022 at 09:17:36AM -0400, Jeff Layton wrote: > > > > > > The i_version in xfs_trans_log_inode is bumped for any inode update, > > > > > > including atime-only updates due to reads. We don't want to record those > > > > > > in the i_version, as they don't represent "real" changes. Remove that > > > > > > callsite. > > > > > > > > > > > > In xfs_vn_update_time, if S_VERSION is flagged, then attempt to bump the > > > > > > i_version and turn on XFS_ILOG_CORE if it happens. In > > > > > > xfs_trans_ichgtime, update the i_version if the mtime or ctime are being > > > > > > updated. > > > > > > > > > > What about operations that don't touch the mtime but change the file > > > > > metadata anyway? There are a few of those, like the blockgc garbage > > > > > collector, deduperange, and the defrag tool. > > > > > > > > > > > > > Do those change the c/mtime at all? > > > > > > > > It's possible we're missing some places that should change the i_version > > > > as well. We may need some more call sites. > > > > > > > > > Zooming out a bit -- what does i_version signal, concretely? I thought > > > > > it was used by nfs (and maybe ceph?) to signal to clients that the file > > > > > on the server has moved on, and the client needs to invalidate its > > > > > caches. I thought afs had a similar generation counter, though it's > > > > > only used to cache file data, not metadata? Does an i_version change > > > > > cause all of them to invalidate caches, or is there more behavior I > > > > > don't know about? > > > > > > > > > > > > > For NFS, it indicates a change to the change attribute indicates that > > > > there has been a change to the data or metadata for the file. atime > > > > changes due to reads are specifically exempted from this, but we do bump > > > > the i_version if someone (e.g.) changes the atime via utimes(). > > > > > > We have relatime behaviour to optimise away unnecessary atime > > > updates on reads. Trying to explicitly exclude i_version from atime > > > updates in one filesystem just because NFS doesn't need that > > > information seems .... misguided. The -on disk- i_version > > > field behaviour is defined by the filesystem implementation, not the > > > NFS requirements. > > > > -o relatime does not fix this. > > So why not fix -o relatime to handle this? That way the fix works > for all filesystems and doesn't require hacking around what the VFS > has told us to do in every filesystem. > > i.e. the VFS told us to update atime, we updated atime and that is a > persistent metadata change to the inode. Hence a filesystem with a > persistent change counter has to bump the change counter because > we've been asked by the VFS to make a persistent metadata change. > > If you want atime updates to not make persistent changes to on disk > metadata, then change the relatime implementation so that it doesn't > ask the filesystem to update the on-disk atime. > > Essentially, what I'm hearing is that NFS wants atime updates to > behave like lazytime, not like relatime. With lazytime, atime always > gets updated in memory, but it is not written back to the filesystem > until a timeout or some other modification is made to the inode or > file data. THe filesystem doesn't bump iversion until the timestamp > gets written back in the lazytime case. > > IOWs, we already have a mechanism in the kernel for making atime > updates behave exactly as NFS wants: -o lazytime. > No, that won't help. Both lazytime and relatime don't help anything since they don't address the fundamental problem, which is that the i_version changes due to atime updates. They only affect when the atime gets updated (or when it goes to disk). > > > > The NFS client will generally invalidate its caches for the inode when > > > > it notices a change attribute change. > > > > > > > > FWIW, AFS may not meet this standard since it doesn't generally > > > > increment the counter on metadata changes. It may turn out that we don't > > > > want to expose this to the AFS client due to that (or maybe come up with > > > > some way to indicate this difference). > > > > > > In XFS, we've defined the on-disk i_version field to mean > > > "increments with any persistent inode data or metadata change", > > > regardless of what the high level applications that use i_version > > > might actually require. > > > > > > That some network filesystem might only need a subset of the > > > metadata to be covered by i_version is largely irrelevant - if we > > > don't cover every persistent inode metadata change with i_version, > > > then applications that *need* stuff like atime change notification > > > can't be supported. > > > > > > > > Does that mean that we should bump i_version for any file data or > > > > > attribute that could be queried or observed by userspace? In which case > > > > > I suppose this change is still correct, even if it relaxes i_version > > > > > updates from "any change to the inode whatsoever" to "any change that > > > > > would bump mtime". Unless FIEMAP is part of "attributes observed by > > > > > userspace". > > > > > > > > > > (The other downside I can see is that now we have to remember to bump > > > > > timestamps for every new file operation we add, unlike the current code > > > > > which is centrally located in xfs_trans_log_inode.) > > > > > > > > > > > > > The main reason for the change attribute in NFS was that NFSv3 is > > > > plagued with cache-coherency problems due to coarse-grained timestamp > > > > granularity. It was conceived as a way to indicate that the inode had > > > > changed without relying on timestamps. > > > > > > Yes, and the most important design consideration for a filesystem is > > > that it -must be persistent-. The constraints on i_version are much > > > stricter than timestamps, and they are directly related to how the > > > filesystem persists metadata changes, not how metadata is changed or > > > accessed in memory. > > > > > > > In practice, we want to bump the i_version counter whenever the ctime or > > > > mtime would be changed. > > > > > > What about O_NOCMTIME modifications? What about lazytime > > > filesystems? These explicilty avoid or delay persisten c/mtime > > > updates, and that means bumping i_version only based on c/mtime > > > updates cannot be relied on. i_version is supposed to track user > > > visible data and metadata changes, *not timestamp updates*. > > > > I was speaking more about the sorts of activity that should result in > > the i_version being changed, not about tracking timestamp updates. IOW, > > if some activity would cause the mtime or ctime to change, then we want > > to also bump the i_version. > > > > Specifically, for NOCMTIME, I think we'd still want the i_version to > > change since that option is about timestamps and not i_version. > > Exactly my point: this is what XFS currently does. It is also what > your proposed changes break by tying i_version updates to c/mtime > updates. > > > > Hence, I don't think that trying to modify how filesystems persist > > > and maintain i_version coherency because NFS "doesn't need i_version > > > to cover atime updates" is the wrong approach. On-disk i_version > > > coherency has to work for more than just one NFS implementation > > > (especially now i_version will be exported to userspace!). > > > Persistent atime updates are already optimised away by relatime, and > > > so I think that any further atime filtering is largely a NFS > > > application layer problem and not something that should be solved by > > > changing the on-disk definition of back end filesystem structure > > > persistence. > > > > > > > Fair enough. xfs is not really in my wheelhouse so take this as a patch > > that helps illustrate the problem, rather than a serious submission. > > > > There are two consumers of the i_version counter today: the kernel NFS > > server and IMA. Having the i_version reflect atime updates due to reads > > harms both of those use-cases with unneeded cache invalidations on NFS > > and extra measurements on IMA. It would also be problematic for userland > > NFS servers such as ganesha if we end up exposing this to userland. > > So you're wanting define an exact behaviour for atime vs I_VERSION > where atime doesn't bump iversion. > > However, the definition we baked into the XFS on-disk format is that > the on-disk iversion filed changes for every persistent change made > to the inode. That includes atime updates. We did this for two > reasons - the first is so that we could support any application that > needed change detection, and the second was that knowing how many > changes have occurred to an inode is extremely useful for forensic > purposes (especially given the ability to use O_NOCMTIME to modify > file data). > > > atime updates are really a special case when it comes to metadata (and I > > maintain that they are one of the worst ideas in POSIX). The way you're > > choosing to define i_version doesn't really work properly for any > > current or projected use case. I'd like to see that changed. > > We chose to do that a decade ago, knowing that it is the > responsibility of the VFS to avoid unnecessary atime updates, not > the responsibility of the filesystem. That was the whole point of > introducing the relatime functionality: fix the problem at the VFS, > not have to work around generic atime behaviour in every filesystem. > > > If the concern is fragility of the code going forward, then maybe we can > > go with a different approach. Would it be possible to just have > > xfs_trans_log_inode skip bumping the i_version when the log transaction > > is _only_ for an atime update due to reads? Maybe we could add a new > > XFS_ILOG_ATIME_UPDATE flag or something and set it the appropriate > > codepaths? > > No, I don't think this is something we should be hacking around in > individual filesystems. If the VFS tells us we should be updating > atime, we should be updating it and bumping persistent change > counters because *that's what the VFS asked us to do*. > > IOWs, if NFS wants atime to be considered "in memory only" as per > the lazytime behaviour, then that behaviour needs to be > supported/changed at the VFS, not at the individual fileystem level. > > You could add a subset of SB_LAZYTIME functionality just for atime > updates, handle that entirely within the existing lazytime > infrastructure. THis means the VFS supports non-persistent, > best-effort, non-persistent atime updates directly. The VFS will not > ack filesystems to persist atime updates the moment they are made, > it will tell the filesystem via ->dirty_inode() that it needs to > persist the in-memory timestamps. > > This gives all the filesystems the same atime behaviour. If the VFS > is going to expose the persistent change counter to userspace, we > need to have consistent, VFS enforced rules on what is coverred by > the change counter. If atime is not covered by the ipersistent > change counter, then the VFS must not ask the filesystems to persist > atime changes every time atime changes. > > Then all filesystems will avoid on-disk atime updates as much as > possible, whilst still gathering them correctly when other > modifications are made. > > Until we've got a definition of what this persistent change counter > actually describes and guarantees for userspace, changing filesystem > implementations is premature. And if it is decided that atime is not > going to be considered a "persistent change" by itself, then that > behaviour needs to be encoded at the VFS, not in individual > filesystems.... > > Cheers, > > Dave.
On Thu, 18 Aug 2022, Jeff Layton wrote: > On Thu, 2022-08-18 at 10:34 +1000, NeilBrown wrote: > > On Wed, 17 Aug 2022, Dave Chinner wrote: > > > > > > In XFS, we've defined the on-disk i_version field to mean > > > "increments with any persistent inode data or metadata change", > > > regardless of what the high level applications that use i_version > > > might actually require. > > > > > > That some network filesystem might only need a subset of the > > > metadata to be covered by i_version is largely irrelevant - if we > > > don't cover every persistent inode metadata change with i_version, > > > then applications that *need* stuff like atime change notification > > > can't be supported. > > > > So what you are saying is that the i_version provided by XFS does not > > match the changeid semantics required by NFSv4. Fair enough. I guess > > we shouldn't use the one to implement the other then. > > > > Maybe we should just go back to using ctime. ctime is *exactly* what > > NFSv4 wants, as long as its granularity is sufficient to catch every > > single change. Presumably XFS doesn't try to ensure this. How hard > > would it be to get any ctime update to add at least one nanosecond? > > This would be enabled by a mount option, or possibly be a direct request > > from nfsd. > > > > I think that would be an unfortunate outcome, but if we can't stop xfs > from bumping the i_version on atime updates, then we may have no choice > but to do so. I suppose we could add a fetch_iversion for xfs that takes > it back to using the ctime. "unfortunate" for who I wonder.. I think Trond's argument about not needing implicit updates to be reflected i_version is sound - as the effective i_version can be constructed from stored i_version plus all attributes, if you ever want an effective i_version that covers all implicit changes to attributes. However I doubt Dave will be convinced. > > > <rant>NFSv4 changeid is really one of the more horrible parts of the > > design</rant> > > > > Hah! I was telling Tom Talpey yesterday that I thought that the change > counter was one of the best ideas in NFSv4 and that we should be trying > to get all filesystems to implement it correctly. > > The part that does suck about the design is that the original specs > weren't specific enough about its behavior. I think that's been somewhat > remedied in more recent RFCs though. That's enough bate for me to expand my rant... The problem with changeid is that it imposes on the filesystem. You CANNOT provide a compliant NFSv4 services on a filesystem which has 1 second resolution of time stamps and no internal i_version. When you are designing a protocol - particularly one where interoperability is a focus - making it impossible to export some common (at the time) filesystems correctly is crazy. And not at all necessary. There is a much better way. When reporting the ctime of a file, the server could add a Bool which is true if that time is "now" to within the resolution of the timestamp. If you see a "now" ctime, then you cannot use timestamps to validate any cached data. i.e. a "now" ctime is different to every other ctime, even another "now" ctime with the same timestamp. If the filesystem has high resolution timestamps, it will never say "now" and you get the same semantics as changeid. If timestamps are low resolution, then you cannot cache a file while it is changing (unless you get a read delegation), but you can accurately cache a file as long as it doesn't change - which is all that changeid gives you anyway. NeilBrown > -- > Jeff Layton <jlayton@kernel.org> >
On Thu, 18 Aug 2022, Dave Chinner wrote: > On Thu, Aug 18, 2022 at 11:52:12AM +1000, NeilBrown wrote: > > On Thu, 18 Aug 2022, Dave Chinner wrote: > > > > > > > Maybe we should just go back to using ctime. ctime is *exactly* what > > > > NFSv4 wants, as long as its granularity is sufficient to catch every > > > > single change. Presumably XFS doesn't try to ensure this. How hard > > > > would it be to get any ctime update to add at least one nanosecond? > > > > This would be enabled by a mount option, or possibly be a direct request > > > > from nfsd. > > > > > > We can't rely on ctime to be changed during a modification because > > > O_NOCMTIME exists to enable "user invisible" modifications to be > > > made. On XFS these still bump iversion, so while they are invisible > > > to the user, they are still tracked by the filesystem and anything > > > that wants to know if the inode data/metadata changed. > > > > > > > O_NOCMTIME isn't mentioned in the man page, so it doesn't exist :-( > > > > If they are "user invisible", should they then also be "NFS invisible"? > > I think so. > > Maybe, but now you're making big assumptions about what is being > done by those operations. Userspace can write whatever it likes, > nothing says that O_NOCMTIME can't change user visible data or > metadata. Nope. The only assumption I'm making is that if the ctime/mtime don't change, then it is safe to trust any cached content. I think that is broadly assumed in the Posix world. Anyone who uses O_NOCMTIME must understand the risks (not currently documented ....) and it must be assumed they will handled them properly. We cannot allow the addition of O_NOCMTIME to make us think "ctime and mtime don't mean what they used to, we cannot trust them any more". > But having uses of it that don't change user visible data does not > mean it can't be used for changing user visible data. Hence we made > the defensive assumption that O_NOCMTIME was a mechanism that could > be used to hide data changes from forensic analysis. With that in > mind, it was important that the change counter captured changes made > even when O_NOCMTIME was specified to leave behind a breadcrumb to > indicate unexpected changes may had been made to the file. Having a breadcrumb seems reasonable. Calling that breadcrumb "i_version" might be questionable - though specifications seem to be vague so this decision is probably defensible. > > Yeah, we had lots of different requirements for the XFS on-disk > change counter when we were considering adding it. NFSv4 was one of > the least demanding and least defined requirements; it's taken a > *decade* for this atime issue to be noticed, so I really don't think > there's anything wrong with how XFs has implemented persistent > change counters. > > What it tells me is that the VFS needs more appropriate atime > filtering for NFSv4's change attribute requirements.... I don't agree with that last point. I think "atime == mtime" and "atime > mtime" are distinctly different states which should recorded. I think Trond's' observation about implicit updates is on-point. There is no need to include implicit atime updates in i_version. If anyone cares about those they can combine i_version and atime into a single value. If that value changes, then something changed, possibly an implicit atime update. Excluding implicit atime updates makes i_version strictly more useful. It doesn't lose any value and does gain some. NeilBrown
On Thu, Aug 18, 2022 at 07:03:42AM -0400, Jeff Layton wrote: > Dave, you keep talking about the xfs i_version counter as if there are > other applications already relying on its behavior, but I don't see how > that can be. There is no way for userland applications to fetch the > counter currently. You miss the point entirely: the behaviour is defined by the on-disk format (the di_changecount filed), not the applications that are using the kernel internal iversion API. Just because there are no in-kernel users of the di_changecount field in the XFS inode, it does not mean that it doesn't get used by applications. Forensic analysis tools that walk filesystem images. Did you not notice that xfs_trans_log_inode() forces an iversion update if the inode core is marked for update: inode_maybe_inc_iversion(inode, flags & XFS_ILOG_CORE)) ^^^^^^^^^^^^^^^^^^^^^^ this? So every modification to the inode core (which almost all inode modifications will do, except for pure timestamp updates) will bump the iversion regardless of whether it was queried or not. I use this information all the time for forensic analysis of broken filesystem images. There are forensic tools that use expose this information from filesystem images (e.g. xfs_db) so that we can use it for forensic analysis. See the problem? On-disk format di_changecount != NFS change attribute. We can implement the NFS change attribute with the existing di_changecount field, but if you want to constrain the definition of how the NFS change attribute is calculated, we can't necessarily implement that in di_changecount without changing the on-disk format definition. And, yes, this has implications for iversion being exposed to userspace via statx(), as I've mentioned in reply to the v2 patch you've posted. iversion is persistent information - you can't just redefine it's behaviour without some fairly serious knock-on effects for the subsystems that provide the persistent storage... Cheers, Dave.
On Tue, 2022-08-23 at 10:05 +1000, Dave Chinner wrote: > On Thu, Aug 18, 2022 at 07:03:42AM -0400, Jeff Layton wrote: > > Dave, you keep talking about the xfs i_version counter as if there > > are > > other applications already relying on its behavior, but I don't see > > how > > that can be. There is no way for userland applications to fetch the > > counter currently. > > You miss the point entirely: the behaviour is defined by the on-disk > format (the di_changecount filed), not the applications that are > using the kernel internal iversion API. > > Just because there are no in-kernel users of the di_changecount > field in the XFS inode, it does not mean that it doesn't get used by > applications. Forensic analysis tools that walk filesystem images. > > Did you not notice that xfs_trans_log_inode() forces an iversion > update if the inode core is marked for update: > > inode_maybe_inc_iversion(inode, flags & XFS_ILOG_CORE)) > ^^^^^^^^^^^^^^^^^^^^^^ > this? > > So every modification to the inode core (which almost all inode > modifications will do, except for pure timestamp updates) will bump > the iversion regardless of whether it was queried or not. > > I use this information all the time for forensic analysis of broken > filesystem images. There are forensic tools that use expose this > information from filesystem images (e.g. xfs_db) so that we can use > it for forensic analysis. > > See the problem? On-disk format di_changecount != NFS change > attribute. We can implement the NFS change attribute with the > existing di_changecount field, but if you want to constrain the > definition of how the NFS change attribute is calculated, we can't > necessarily implement that in di_changecount without changing the > on-disk format definition. > > And, yes, this has implications for iversion being exposed to > userspace via statx(), as I've mentioned in reply to the v2 patch > you've posted. iversion is persistent information - you can't just > redefine it's behaviour without some fairly serious knock-on > effects for the subsystems that provide the persistent storage... > That still doesn't explain why a regular application would want to use that definition of a version counter. Your use case of a forensic tool is not a generic use case. It is very limited to something which needs access to detailed knowledge of the internals of your transactional filesystem. This isn't even something that needs to be exposed to the VFS layer through the i_version if the users are all going to be internal to XFS. What is being requested was initially driven (yes, over 20 years ago now, as you've pointed out) by the use case of caching applications, of which NFS is only one. There are other applications that can benefit from it. Pretty much anything that is currently using the ctime and/or mtime, wants a standard that overcomes the time resolution issues that increasingly affect storage as performance improves. That means applications such as build utilities, backup/restore utilities, search utilities, etc... Even 'git' could make use of it. However none of the above applications need the 'all metadata transactions' definition that you appear to need for your forensics use case. All will suffer a performance loss with such a definition. Worst of all, the application behaviour would vary wildly depending on the combination of mount options being used (noatime vs nodiratime vs...). So the point isn't about 'redefining the behaviour of i_version'. It is about delivering a statx() API that meets the requirements of a number of applications. If XFS is going to opt in to that API, then it needs to be able to deliver an attribute that meets the same requirements. Otherwise, it could choose to opt out.
diff --git a/fs/xfs/libxfs/xfs_trans_inode.c b/fs/xfs/libxfs/xfs_trans_inode.c index 8b5547073379..78bf7f491462 100644 --- a/fs/xfs/libxfs/xfs_trans_inode.c +++ b/fs/xfs/libxfs/xfs_trans_inode.c @@ -71,6 +71,8 @@ xfs_trans_ichgtime( inode->i_ctime = tv; if (flags & XFS_ICHGTIME_CREATE) ip->i_crtime = tv; + if (flags & (XFS_ICHGTIME_MOD|XFS_ICHGTIME_CHG)) + inode_inc_iversion(inode); } /* @@ -116,20 +118,7 @@ xfs_trans_log_inode( spin_unlock(&inode->i_lock); } - /* - * First time we log the inode in a transaction, bump the inode change - * counter if it is configured for this to occur. While we have the - * inode locked exclusively for metadata modification, we can usually - * avoid setting XFS_ILOG_CORE if no one has queried the value since - * the last time it was incremented. If we have XFS_ILOG_CORE already - * set however, then go ahead and bump the i_version counter - * unconditionally. - */ - if (!test_and_set_bit(XFS_LI_DIRTY, &iip->ili_item.li_flags)) { - if (IS_I_VERSION(inode) && - inode_maybe_inc_iversion(inode, flags & XFS_ILOG_CORE)) - iversion_flags = XFS_ILOG_CORE; - } + set_bit(XFS_LI_DIRTY, &iip->ili_item.li_flags); /* * If we're updating the inode core or the timestamps and it's possible diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c index 45518b8c613c..162e044c7f56 100644 --- a/fs/xfs/xfs_iops.c +++ b/fs/xfs/xfs_iops.c @@ -718,6 +718,7 @@ xfs_setattr_nonsize( } setattr_copy(mnt_userns, inode, iattr); + inode_inc_iversion(inode); xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE); XFS_STATS_INC(mp, xs_ig_attrchg); @@ -943,6 +944,7 @@ xfs_setattr_size( ASSERT(!(iattr->ia_valid & (ATTR_UID | ATTR_GID))); setattr_copy(mnt_userns, inode, iattr); + inode_inc_iversion(inode); xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE); XFS_STATS_INC(mp, xs_ig_attrchg); @@ -1047,6 +1049,8 @@ xfs_vn_update_time( inode->i_mtime = *now; if (flags & S_ATIME) inode->i_atime = *now; + if ((flags & S_VERSION) && inode_maybe_inc_iversion(inode, false)) + log_flags |= XFS_ILOG_CORE; xfs_trans_ijoin(tp, ip, XFS_ILOCK_EXCL); xfs_trans_log_inode(tp, ip, log_flags);
The i_version in xfs_trans_log_inode is bumped for any inode update, including atime-only updates due to reads. We don't want to record those in the i_version, as they don't represent "real" changes. Remove that callsite. In xfs_vn_update_time, if S_VERSION is flagged, then attempt to bump the i_version and turn on XFS_ILOG_CORE if it happens. In xfs_trans_ichgtime, update the i_version if the mtime or ctime are being updated. Cc: Darrick J. Wong <darrick.wong@oracle.com> Cc: Dave Chinner <david@fromorbit.com> Signed-off-by: Jeff Layton <jlayton@kernel.org> --- fs/xfs/libxfs/xfs_trans_inode.c | 17 +++-------------- fs/xfs/xfs_iops.c | 4 ++++ 2 files changed, 7 insertions(+), 14 deletions(-)