diff mbox series

[2/3] nfsd: move change attribute generation to filesystem

Message ID 1611084297-27352-3-git-send-email-bfields@redhat.com (mailing list archive)
State New
Headers show
Series NFS change attribute patches | expand

Commit Message

J. Bruce Fields Jan. 19, 2021, 7:24 p.m. UTC
From: "J. Bruce Fields" <bfields@redhat.com>

After this, only filesystems lacking change attribute support will leave
the fetch_iversion export op NULL.

This seems cleaner to me, and will allow some minor optimizations in the
nfsd code.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
---
 fs/btrfs/export.c        |  2 ++
 fs/ext4/super.c          |  9 +++++++++
 fs/nfsd/nfsfh.h          | 25 +++----------------------
 fs/xfs/xfs_export.c      | 10 ++++++++++
 include/linux/iversion.h | 26 ++++++++++++++++++++++++++
 5 files changed, 50 insertions(+), 22 deletions(-)

Comments

Christoph Hellwig Jan. 20, 2021, 8:46 a.m. UTC | #1
On Tue, Jan 19, 2021 at 02:24:56PM -0500, J. Bruce Fields wrote:
> From: "J. Bruce Fields" <bfields@redhat.com>
> 
> After this, only filesystems lacking change attribute support will leave
> the fetch_iversion export op NULL.
> 
> This seems cleaner to me, and will allow some minor optimizations in the
> nfsd code.

Another indirect call just to fetch the change attribute (which happens
a lot IIRC) does not seem very optimal to me.  And the fact that we need
three duplicate implementations also is not very nice.
J. Bruce Fields Jan. 21, 2021, 8:27 p.m. UTC | #2
On Wed, Jan 20, 2021 at 08:46:38AM +0000, Christoph Hellwig wrote:
> On Tue, Jan 19, 2021 at 02:24:56PM -0500, J. Bruce Fields wrote:
> > From: "J. Bruce Fields" <bfields@redhat.com>
> > 
> > After this, only filesystems lacking change attribute support will leave
> > the fetch_iversion export op NULL.
> > 
> > This seems cleaner to me, and will allow some minor optimizations in the
> > nfsd code.
> 
> Another indirect call just to fetch the change attribute (which happens
> a lot IIRC) does not seem very optimal to me.

In the next patch we're removing an fh_getattr (vfs_getattr) in the case
we call the new op, so that's typically a net decrease in indirect
calls.

Though maybe we could use a flag here and do without either.

> And the fact that we need three duplicate implementations also is not
> very nice.

Ext4 and xfs are identical, btrfs is a little different since it doesn't
consult I_VERSION.  (And then there's nfs, which uses the origin
server's i_version if it can.)

I also have a vague idea that some filesystem-specific improvements
might be possible.  (E.g., if a filesystem had some kind of boot count
in the superblock, maybe that would be a better way to prevent the
change attribute from going backwards on reboot than the thing
generic_fetch_iversion is currently doing with ctime.  But I have no
concrete plan there, maybe I'm dreaming.)

--b.
Christoph Hellwig Jan. 22, 2021, 8:20 a.m. UTC | #3
On Thu, Jan 21, 2021 at 03:27:56PM -0500, J. Bruce Fields wrote:
> > Another indirect call just to fetch the change attribute (which happens
> > a lot IIRC) does not seem very optimal to me.
> 
> In the next patch we're removing an fh_getattr (vfs_getattr) in the case
> we call the new op, so that's typically a net decrease in indirect
> calls.
> 
> Though maybe we could use a flag here and do without either.
> 
> > And the fact that we need three duplicate implementations also is not
> > very nice.
> 
> Ext4 and xfs are identical, btrfs is a little different since it doesn't
> consult I_VERSION.  (And then there's nfs, which uses the origin
> server's i_version if it can.)

I'd much prefer to just keep consulting the I_VERSION flag and only
add the new op as an override for the NFS export.

> 
> I also have a vague idea that some filesystem-specific improvements
> might be possible.  (E.g., if a filesystem had some kind of boot count
> in the superblock, maybe that would be a better way to prevent the
> change attribute from going backwards on reboot than the thing
> generic_fetch_iversion is currently doing with ctime.  But I have no
> concrete plan there, maybe I'm dreaming.)

Even without the ctime i_version never goes backward, what is the
problem here?
J. Bruce Fields Jan. 22, 2021, 2:47 p.m. UTC | #4
On Fri, Jan 22, 2021 at 08:20:59AM +0000, Christoph Hellwig wrote:
> On Thu, Jan 21, 2021 at 03:27:56PM -0500, J. Bruce Fields wrote:
> > I also have a vague idea that some filesystem-specific improvements
> > might be possible.  (E.g., if a filesystem had some kind of boot count
> > in the superblock, maybe that would be a better way to prevent the
> > change attribute from going backwards on reboot than the thing
> > generic_fetch_iversion is currently doing with ctime.  But I have no
> > concrete plan there, maybe I'm dreaming.)
> 
> Even without the ctime i_version never goes backward, what is the
> problem here?

Suppose a modification bumps the change attribute, a client reads
the new value of the change attribute before it's committed to disk,
then the server crashes.  After the server comes back up, the client
requests the change attribute again and sees an older value.

That's actually not too bad.  What I'd mainly like to avoid is
incrementing the change attribute further and risking reuse of an old
value for a different new state of the file.

Which is why generic_fetch_iversion is adding the ctime in the higher
bits.

So we depend on good time for correctness.  Trond has had some concerns
about that.

Something like a boot counter might be more reliable.

Another interesting case is after restore from backup.

--b.
Christoph Hellwig Jan. 22, 2021, 5:32 p.m. UTC | #5
On Fri, Jan 22, 2021 at 09:47:53AM -0500, J. Bruce Fields wrote:
> > > I also have a vague idea that some filesystem-specific improvements
> > > might be possible.  (E.g., if a filesystem had some kind of boot count
> > > in the superblock, maybe that would be a better way to prevent the
> > > change attribute from going backwards on reboot than the thing
> > > generic_fetch_iversion is currently doing with ctime.  But I have no
> > > concrete plan there, maybe I'm dreaming.)
> > 
> > Even without the ctime i_version never goes backward, what is the
> > problem here?
> 
> Suppose a modification bumps the change attribute, a client reads
> the new value of the change attribute before it's committed to disk,
> then the server crashes.  After the server comes back up, the client
> requests the change attribute again and sees an older value.

So all metadata operations kicked off by nfsd are synchronous due
to ->commit_metadata/sync_inode_metadata, so this could only happen
for operations not kicked off by nfsd.  More importanly ctime will
also be lost as i_version and the ctime are commited together.

> 
> That's actually not too bad.  What I'd mainly like to avoid is
> incrementing the change attribute further and risking reuse of an old
> value for a different new state of the file.

Ok but that is an issue if we need to deal with changes that did not
come in through NFSD.
J. Bruce Fields Jan. 22, 2021, 6:46 p.m. UTC | #6
On Fri, Jan 22, 2021 at 05:32:48PM +0000, Christoph Hellwig wrote:
> On Fri, Jan 22, 2021 at 09:47:53AM -0500, J. Bruce Fields wrote:
> > > > I also have a vague idea that some filesystem-specific improvements
> > > > might be possible.  (E.g., if a filesystem had some kind of boot count
> > > > in the superblock, maybe that would be a better way to prevent the
> > > > change attribute from going backwards on reboot than the thing
> > > > generic_fetch_iversion is currently doing with ctime.  But I have no
> > > > concrete plan there, maybe I'm dreaming.)
> > > 
> > > Even without the ctime i_version never goes backward, what is the
> > > problem here?
> > 
> > Suppose a modification bumps the change attribute, a client reads
> > the new value of the change attribute before it's committed to disk,
> > then the server crashes.  After the server comes back up, the client
> > requests the change attribute again and sees an older value.
> 
> So all metadata operations kicked off by nfsd are synchronous due
> to ->commit_metadata/sync_inode_metadata, so this could only happen
> for operations not kicked off by nfsd.

Plain old nfs write also bumps the change attribute.

> More importanly ctime will
> also be lost as i_version and the ctime are commited together.

That's not the reason for using ctime.

The ctime is so that change attributes due to new changes that happen
after the boot will not collide with change attributes used before.

So:

	0. file has change attribute i.
	1. write bumps change attribute to i+1.
	2. client fetches new data and change attribute i+1.
	3. server crashes and comes back up.
	4. a new write with different data bumps change attribute to i+1.
	5. client fetches change attribute again, incorrectly concludes
	   its data cache is still correct.

Including the ctime ensures the change attributes at step 2 and 4 are no
longer the same.

> > That's actually not too bad.  What I'd mainly like to avoid is
> > incrementing the change attribute further and risking reuse of an old
> > value for a different new state of the file.
> 
> Ok but that is an issue if we need to deal with changes that did not
> come in through NFSD.

Those matter too, of course.

--b.
Bruce Fields Jan. 29, 2021, 7:25 p.m. UTC | #7
On Fri, Jan 22, 2021 at 08:20:59AM +0000, Christoph Hellwig wrote:
> On Thu, Jan 21, 2021 at 03:27:56PM -0500, J. Bruce Fields wrote:
> > > Another indirect call just to fetch the change attribute (which happens
> > > a lot IIRC) does not seem very optimal to me.
> > 
> > In the next patch we're removing an fh_getattr (vfs_getattr) in the case
> > we call the new op, so that's typically a net decrease in indirect
> > calls.
> > 
> > Though maybe we could use a flag here and do without either.
> > 
> > > And the fact that we need three duplicate implementations also is not
> > > very nice.
> > 
> > Ext4 and xfs are identical, btrfs is a little different since it doesn't
> > consult I_VERSION.  (And then there's nfs, which uses the origin
> > server's i_version if it can.)
> 
> I'd much prefer to just keep consulting the I_VERSION flag and only
> add the new op as an override for the NFS export.

OK, sorry for the delay, I'm not wedded to that second patch; I can just
replace the test for the new export op by a test for either that or
SB_I_VERSION, and it ends up being the same.  Will follow up in a
moment.....

--b.
diff mbox series

Patch

diff --git a/fs/btrfs/export.c b/fs/btrfs/export.c
index 1d4c2397d0d6..9f9fe24b29cc 100644
--- a/fs/btrfs/export.c
+++ b/fs/btrfs/export.c
@@ -7,6 +7,7 @@ 
 #include "btrfs_inode.h"
 #include "print-tree.h"
 #include "export.h"
+#include <linux/iversion.h>
 
 #define BTRFS_FID_SIZE_NON_CONNECTABLE (offsetof(struct btrfs_fid, \
 						 parent_objectid) / 4)
@@ -278,4 +279,5 @@  const struct export_operations btrfs_export_ops = {
 	.fh_to_parent	= btrfs_fh_to_parent,
 	.get_parent	= btrfs_get_parent,
 	.get_name	= btrfs_get_name,
+	.fetch_iversion	= generic_fetch_iversion,
 };
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 21121787c874..4a37b7fc55b6 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -1616,11 +1616,20 @@  static const struct super_operations ext4_sops = {
 	.bdev_try_to_free_page = bdev_try_to_free_page,
 };
 
+static u64 ext4_fetch_iversion(struct inode *inode)
+{
+	if (IS_I_VERSION(inode))
+		return generic_fetch_iversion(inode);
+	else
+		return time_to_chattr(&inode->i_ctime);
+}
+
 static const struct export_operations ext4_export_ops = {
 	.fh_to_dentry = ext4_fh_to_dentry,
 	.fh_to_parent = ext4_fh_to_parent,
 	.get_parent = ext4_get_parent,
 	.commit_metadata = ext4_nfs_commit_metadata,
+	.fetch_iversion = ext4_fetch_iversion,
 };
 
 enum {
diff --git a/fs/nfsd/nfsfh.h b/fs/nfsd/nfsfh.h
index f58933519f38..257ba5646239 100644
--- a/fs/nfsd/nfsfh.h
+++ b/fs/nfsd/nfsfh.h
@@ -52,8 +52,8 @@  typedef struct svc_fh {
 	struct timespec64	fh_pre_mtime;	/* mtime before oper */
 	struct timespec64	fh_pre_ctime;	/* ctime before oper */
 	/*
-	 * pre-op nfsv4 change attr: note must check IS_I_VERSION(inode)
-	 *  to find out if it is valid.
+	 * pre-op nfsv4 change attr: note must check for fetch_iversion
+	 * op to find out if it is valid.
 	 */
 	u64			fh_pre_change;
 
@@ -251,31 +251,12 @@  fh_clear_wcc(struct svc_fh *fhp)
 	fhp->fh_pre_saved = false;
 }
 
-/*
- * We could use i_version alone as the change attribute.  However,
- * i_version can go backwards after a reboot.  On its own that doesn't
- * necessarily cause a problem, but if i_version goes backwards and then
- * is incremented again it could reuse a value that was previously used
- * before boot, and a client who queried the two values might
- * incorrectly assume nothing changed.
- *
- * By using both ctime and the i_version counter we guarantee that as
- * long as time doesn't go backwards we never reuse an old value.
- */
 static inline u64 nfsd4_change_attribute(struct kstat *stat,
 					 struct inode *inode)
 {
 	if (inode->i_sb->s_export_op->fetch_iversion)
 		return inode->i_sb->s_export_op->fetch_iversion(inode);
-	else if (IS_I_VERSION(inode)) {
-		u64 chattr;
-
-		chattr =  stat->ctime.tv_sec;
-		chattr <<= 30;
-		chattr += stat->ctime.tv_nsec;
-		chattr += inode_query_iversion(inode);
-		return chattr;
-	} else
+	else
 		return time_to_chattr(&stat->ctime);
 }
 
diff --git a/fs/xfs/xfs_export.c b/fs/xfs/xfs_export.c
index 465fd9e048d4..4a08b65c32aa 100644
--- a/fs/xfs/xfs_export.c
+++ b/fs/xfs/xfs_export.c
@@ -16,6 +16,7 @@ 
 #include "xfs_inode_item.h"
 #include "xfs_icache.h"
 #include "xfs_pnfs.h"
+#include <linux/iversion.h>
 
 /*
  * Note that we only accept fileids which are long enough rather than allow
@@ -223,6 +224,14 @@  xfs_fs_nfs_commit_metadata(
 	return xfs_log_force_inode(XFS_I(inode));
 }
 
+static u64 xfs_fetch_iversion(struct inode *inode)
+{
+	if (IS_I_VERSION(inode))
+		return generic_fetch_iversion(inode);
+	else
+		return time_to_chattr(&inode->i_ctime);
+}
+
 const struct export_operations xfs_export_operations = {
 	.encode_fh		= xfs_fs_encode_fh,
 	.fh_to_dentry		= xfs_fs_fh_to_dentry,
@@ -234,4 +243,5 @@  const struct export_operations xfs_export_operations = {
 	.map_blocks		= xfs_fs_map_blocks,
 	.commit_blocks		= xfs_fs_commit_blocks,
 #endif
+	.fetch_iversion		= xfs_fetch_iversion,
 };
diff --git a/include/linux/iversion.h b/include/linux/iversion.h
index 3bfebde5a1a6..ded74523c8a6 100644
--- a/include/linux/iversion.h
+++ b/include/linux/iversion.h
@@ -328,6 +328,32 @@  inode_query_iversion(struct inode *inode)
 	return cur >> I_VERSION_QUERIED_SHIFT;
 }
 
+/*
+ * We could use i_version alone as the NFSv4 change attribute.  However,
+ * i_version can go backwards after a reboot.  On its own that doesn't
+ * necessarily cause a problem, but if i_version goes backwards and then
+ * is incremented again it could reuse a value that was previously used
+ * before boot, and a client who queried the two values might
+ * incorrectly assume nothing changed.
+ *
+ * By using both ctime and the i_version counter we guarantee that as
+ * long as time doesn't go backwards we never reuse an old value.
+ *
+ * A filesystem that has an on-disk boot counter or similar might prefer
+ * to use that to avoid the risk of the change attribute going backwards
+ * if system time is set backwards.
+ */
+static inline u64 generic_fetch_iversion(struct inode *inode)
+{
+	u64 chattr;
+
+	chattr =  inode->i_ctime.tv_sec;
+	chattr <<= 30;
+	chattr += inode->i_ctime.tv_nsec;
+	chattr += inode_query_iversion(inode);
+	return chattr;
+}
+
 /*
  * For filesystems without any sort of change attribute, the best we can
  * do is fake one up from the ctime: