[RFC,v1,30/30] fs: convert i_version counter over to an atomic64_t

Message ID 1482339827-7882-31-git-send-email-jlayton@redhat.com
State New

Commit Message

Jeff Layton Dec. 21, 2016, 5:03 p.m. UTC
The spinlock is only used to serialize callers that want to increment
the counter. We can achieve the same thing with an atomic64_t and
get the i_lock out of this codepath.

Drop the I_VERS_BUMP flag, and instead, borrow the most significant bit
in the counter to use as the flag. With this change, we can stop taking
the i_lock in this codepath, and can use atomics instead to manage the
thing.

On the query side, if the flag is already set, then we just return the
counter value. Otherwise, we set the flag in our in-memory copy and use
cmpxchg to swap it into place if it hasn't changed. If it has, then we
use the value from the cmpxchg as the new "old" value and try again.

When we go to bump the thing, we fetch the value and check the flag bit.
If it's clear then we don't need to do anything if the update isn't
being forced.

If we do need to update, then we clear the flag in our in-memory copy
and bump the counter (handling any overflow into the flag bit by
resetting the counter to zero). We then do a cmpxchg to swap the updated
value into place if it hasn't changed. If it has changed, then we use
the value we got back from cmpxchg to try again.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
---
 include/linux/fs.h | 82 ++++++++++++++++++++++++++++++++----------------------
 1 file changed, 48 insertions(+), 34 deletions(-)
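
For illustration only, the two retry loops described in the commit message can be sketched in userspace with C11 atomics. This is not the kernel code: `struct iv`, `iv_get()`, `iv_inc()` and `IV_QUERIED` are hypothetical stand-ins for the inode field, `inode_get_iversion()`, `inode_inc_iversion()` and `INODE_I_VERSION_QUERIED`, and `atomic_compare_exchange_weak()` stands in for `atomic64_cmpxchg()`.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Userspace stand-in for INODE_I_VERSION_QUERIED (assumption: top bit). */
#define IV_QUERIED (1ULL << 63)

/* Hypothetical stand-in for the i_version field of struct inode. */
struct iv {
	_Atomic uint64_t v;
};

/* Query side: return the counter and make sure the flag is set. */
static uint64_t iv_get(struct iv *iv)
{
	uint64_t cur = atomic_load(&iv->v);

	while (!(cur & IV_QUERIED)) {
		/* On failure, cur is refreshed with the value actually found. */
		if (atomic_compare_exchange_weak(&iv->v, &cur, cur | IV_QUERIED))
			break;
	}
	return cur & ~IV_QUERIED;
}

/* Bump side: returns true if the counter was incremented. */
static bool iv_inc(struct iv *iv, bool force)
{
	uint64_t cur = atomic_load(&iv->v);
	uint64_t next;

	do {
		/* Nobody has looked since the last bump: nothing to do. */
		if (!force && !(cur & IV_QUERIED))
			return false;

		/* Clear the flag in the local copy and bump the counter. */
		next = (cur & ~IV_QUERIED) + 1;

		/* Overflowed into the flag bit? Wrap the counter to zero. */
		if (next == IV_QUERIED)
			next = 0;
	} while (!atomic_compare_exchange_weak(&iv->v, &cur, next));

	return true;
}
```

In this sketch an unforced bump is a plain load and compare when the flag is clear, so writers that nobody is observing never write to the shared word.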

Comments

Amir Goldstein Dec. 22, 2016, 8:38 a.m. UTC | #1
On Wed, Dec 21, 2016 at 7:03 PM, Jeff Layton <jlayton@redhat.com> wrote:
> The spinlock is only used to serialize callers that want to increment
> the counter. We can achieve the same thing with an atomic64_t and
> get the i_lock out of this codepath.
>

Cool work! See some nits and suggestions below.

> +/*
> + * We borrow the top bit in the i_version to use as a flag to tell us whether
> + * it has been queried since we last bumped it. If it has, then we must bump
> + * it and set the flag. Note that this means that we have to handle wrapping
> + * manually.
> + */
> +#define INODE_I_VERSION_QUERIED                (1ULL<<63)
> +
>  /**
>   * inode_set_iversion - set i_version to a particular value
>   * @inode: inode to set
> @@ -1976,7 +1980,7 @@ static inline void inode_dec_link_count(struct inode *inode)
>  static inline void
>  inode_set_iversion(struct inode *inode, const u64 new)
>  {
> -       inode->i_version = new;
> +       atomic64_set(&inode->i_version, new);
>  }
>

Maybe this needs an overflow sanity check, !(new & INODE_I_VERSION_QUERIED)?
See API change suggestion below.
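
As a rough sketch of the kind of check meant here (the helper name `iv_value_ok()` and the `IV_QUERIED` macro are made up for illustration and are not part of the patch), a value is acceptable only if it fits in the low 63 bits:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Userspace stand-in for INODE_I_VERSION_QUERIED (assumption: top bit). */
#define IV_QUERIED (1ULL << 63)

/* Hypothetical helper: true if v does not collide with the flag bit. */
static bool iv_value_ok(uint64_t v)
{
	return !(v & IV_QUERIED);
}
```

In the kernel, such a predicate would more likely be wrapped in something like WARN_ON_ONCE() inside inode_set_iversion() than exposed as a separate function, but that placement is only an assumption.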


>  /**
> @@ -2010,16 +2011,26 @@ inode_set_iversion_read(struct inode *inode, const u64 new)
>  static inline bool
>  inode_inc_iversion(struct inode *inode, bool force)
>  {
> -       bool ret = false;
> +       u64 cur, old, new;
> +
> +       cur = (u64)atomic64_read(&inode->i_version);
> +       for (;;) {
> +               /* If flag is clear then we needn't do anything */
> +               if (!force && !(cur & INODE_I_VERSION_QUERIED))
> +                       return false;
> +
> +               new = (cur & ~INODE_I_VERSION_QUERIED) + 1;
> +
> +               /* Did we overflow into flag bit? Reset to 0 if so. */
> +               if (unlikely(new == INODE_I_VERSION_QUERIED))
> +                       new = 0;
>

Did you consider changing f_version type and the signature of the new
i_version API to set/get s64 instead of u64?

It makes a bit more sense from an API user's perspective to know that
the valid range for version is >= 0.

file->f_version is not the only struct member used to store and compare
i_version. nfs and xfs have other struct members for that, but even
if not all of those members are changed to type s64, the explicit cast
to (s64) and back to (u64) will serve as good documentation in
the code of the valid range of version in the new API.

>  /**
> @@ -2080,7 +2099,7 @@ inode_get_iversion(struct inode *inode)
>  static inline s64
>  inode_cmp_iversion(const struct inode *inode, const u64 old)
>  {
> -       return (s64)inode->i_version - (s64)old;
> +       return (s64)(atomic64_read(&inode->i_version) << 1) - (s64)(old << 1);
>  }
>

IMO, it is better for the API to define that 'old' is a valid value
returned from inode_get_iversion*, and therefore should not have the
MSB set. Unless the reason you chose to shift those two values is that
it is cheaper than masking INODE_I_VERSION_QUERIED?


Cheers,
Amir.
Jeff Layton Dec. 22, 2016, 1:27 p.m. UTC | #2
On Thu, 2016-12-22 at 10:38 +0200, Amir Goldstein wrote:
> On Wed, Dec 21, 2016 at 7:03 PM, Jeff Layton <jlayton@redhat.com> wrote:
> > 
> > The spinlock is only used to serialize callers that want to increment
> > the counter. We can achieve the same thing with an atomic64_t and
> > get the i_lock out of this codepath.
> > 
> 
> Cool work! See some nits and suggestions below.
> 
> > 
> > +/*
> > + * We borrow the top bit in the i_version to use as a flag to tell us whether
> > + * it has been queried since we last bumped it. If it has, then we must bump
> > + * it and set the flag. Note that this means that we have to handle wrapping
> > + * manually.
> > + */
> > +#define INODE_I_VERSION_QUERIED                (1ULL<<63)
> > +
> >  /**
> >   * inode_set_iversion - set i_version to a particular value
> >   * @inode: inode to set
> > @@ -1976,7 +1980,7 @@ static inline void inode_dec_link_count(struct inode *inode)
> >  static inline void
> >  inode_set_iversion(struct inode *inode, const u64 new)
> >  {
> > -       inode->i_version = new;
> > +       atomic64_set(&inode->i_version, new);
> >  }
> > 
> 
> Maybe needs an overflow sanity check !(new & INODE_I_VERSION_QUERIED)??
> See API change suggestion below.
> 
> 

Possibly. Note that in some cases (when the i_version is stored on
disk and persists across a remount), we need to ensure that we set this
flag when the inode is read in from disk. It's always possible that we
got a query for it and then crashed, so we always set the flag just in
case.

> > 
> >  /**
> > @@ -2010,16 +2011,26 @@ inode_set_iversion_read(struct inode *inode, const u64 new)
> >  static inline bool
> >  inode_inc_iversion(struct inode *inode, bool force)
> >  {
> > -       bool ret = false;
> > +       u64 cur, old, new;
> > +
> > +       cur = (u64)atomic64_read(&inode->i_version);
> > +       for (;;) {
> > +               /* If flag is clear then we needn't do anything */
> > +               if (!force && !(cur & INODE_I_VERSION_QUERIED))
> > +                       return false;
> > +
> > +               new = (cur & ~INODE_I_VERSION_QUERIED) + 1;
> > +
> > +               /* Did we overflow into flag bit? Reset to 0 if so. */
> > +               if (unlikely(new == INODE_I_VERSION_QUERIED))
> > +                       new = 0;
> > 
> 
> Did you consider changing f_version type and the signature of the new
> i_version API to set/get s64 instead of u64?
> 
> It makes a bit more sense from API users perspective to know that
> the valid range for version is >=0.
> 
> file->f_version is not the only struct member used to store&compare
> i_version. nfs and xfs have other struct members for that, but even
> if all those members are not changed to type s64, the explicit cast
> to (s64) and back to (u64) will serve as a good documentation in
> the code about the valid range of version in the new API.
> 

This API is definitely not set in stone. That said, we have to consider
that there are really three classes of filesystems here:

1) ones that treat i_version as an opaque value: Mostly AFS and NFS,
as they get this value from the server. These both can also use the
entire u64 field, so we need to ensure that we don't monkey with the
flag bit on them.

2) filesystems that just use it internally: These don't set MS_I_VERSION
and mostly use it to detect directory changes that occur during readdir.
i_version is initialized to some value (0 or 1) when the struct inode is
allocated, and is bumped on directory changes.

3) filesystems where the kernel manages it completely: these set
MS_I_VERSION and the kernel handles bumping it on writes. Currently,
this is btrfs, ext4 and xfs. These are persistent across remounts as
well.

So, we have to ensure that this API encompasses all 3 of these use
cases.

> >  /**
> > @@ -2080,7 +2099,7 @@ inode_get_iversion(struct inode *inode)
> >  static inline s64
> >  inode_cmp_iversion(const struct inode *inode, const u64 old)
> >  {
> > -       return (s64)inode->i_version - (s64)old;
> > +       return (s64)(atomic64_read(&inode->i_version) << 1) - (s64)(old << 1);
> >  }
> > 
> 
> IMO, it is better for the API to determine that 'old' is valid a value
> returned from
> inode_get_iversion* and therefore should not have the MSB set.
> Unless the reason you chose to shift those 2 values is because it is cheaper
> then masking INODE_I_VERSION_QUERIED??
> 
> 

No, we need to do that in order to handle wraparound correctly. We want
this check to work something like the time_before/after macros in the
kernel that handle jiffies wraparound.

So, the sign returned here matters, as positive values indicate that the
current one is "newer" than the old one. That's the main reason for the
shift here.

Note that this should be documented here too; I plan to add that for
the next revision.
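
To make the wraparound point concrete, here is a small userspace sketch of the shifted comparison. `iv_cmp()` and `IV_QUERIED` are hypothetical names. Unlike the kernel expression, the subtraction is done on unsigned values and converted afterwards, because signed overflow is undefined in standard C; the conversion assumes the usual two's complement behaviour of gcc and clang.

```c
#include <assert.h>
#include <stdint.h>

/* Userspace stand-in for INODE_I_VERSION_QUERIED (assumption: top bit). */
#define IV_QUERIED (1ULL << 63)

/*
 * Compare two raw i_version words. Shifting left by one discards the
 * flag bit and moves the 63-bit counter to the top of the word, so the
 * difference wraps at the same point the counter does. Only the sign of
 * the result is meaningful: > 0 means cur is newer than old.
 */
static int64_t iv_cmp(uint64_t cur, uint64_t old)
{
	return (int64_t)((cur << 1) - (old << 1));
}
```

With a plain mask instead of the shift, a counter that had just wrapped to 0 would compare as older than the largest 63-bit value; with the shift it compares as newer, in the same way time_after() treats jiffies.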

Thanks for the comments so far!
NeilBrown March 4, 2017, midnight UTC | #3
On Wed, Dec 21 2016, Jeff Layton wrote:

>  
> +/*
> + * We borrow the top bit in the i_version to use as a flag to tell us whether
> + * it has been queried since we last bumped it. If it has, then we must bump
> + * it and set the flag. Note that this means that we have to handle wrapping
> + * manually.
> + */
> +#define INODE_I_VERSION_QUERIED		(1ULL<<63)
> +

I would prefer that the least significant bit were used, rather than the
most significant.

Partly, this is because the "queried" state is less significant than
the "number has changed" state.
But mostly, this would mean we wouldn't need inode_cmp_iversion() at all.
We could just use "<" or ">=" or whatever.
The number returned by inode_get_iversion() would always be even (or
maybe odd) and wrapping (after the end of time) would "just work".
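
A minimal userspace sketch of that layout, using C11 atomics, might look like the following. All names (`struct ivl`, `ivl_get()`, `ivl_inc()`, `IVL_QUERIED`, `IVL_STEP`) are invented for illustration; the flag lives in bit 0 and the counter advances in steps of two, so reported values are always even.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define IVL_QUERIED 1ULL	/* hypothetical: flag in the least significant bit */
#define IVL_STEP    2ULL	/* counter increments skip over the flag bit */

/* Hypothetical stand-in for the i_version field of struct inode. */
struct ivl {
	_Atomic uint64_t v;
};

/* Query side: return the (even) counter value and leave the flag set. */
static uint64_t ivl_get(struct ivl *iv)
{
	uint64_t cur = atomic_load(&iv->v);

	/* Only write to the shared word if the flag is not already set. */
	if (!(cur & IVL_QUERIED))
		cur = atomic_fetch_or(&iv->v, IVL_QUERIED);
	return cur & ~IVL_QUERIED;
}

/* Bump side: returns true if the counter was incremented. */
static bool ivl_inc(struct ivl *iv, bool force)
{
	uint64_t cur = atomic_load(&iv->v);
	uint64_t next;

	do {
		if (!force && !(cur & IVL_QUERIED))
			return false;
		/* Clearing bit 0 and adding 2 wraps to 0 with no special case. */
		next = (cur & ~IVL_QUERIED) + IVL_STEP;
	} while (!atomic_compare_exchange_weak(&iv->v, &cur, next));

	return true;
}
```

In this sketch the manual overflow check disappears because the counter occupies the top 63 bits and ordinary unsigned arithmetic wraps it.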

Thanks,
NeilBrown

Patch

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 917557faa8e8..401e38d76171 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -621,7 +621,7 @@  struct inode {
 		struct hlist_head	i_dentry;
 		struct rcu_head		i_rcu;
 	};
-	u64			i_version;
+	atomic64_t		i_version;
 	atomic_t		i_count;
 	atomic_t		i_dio_count;
 	atomic_t		i_writecount;
@@ -1909,9 +1909,6 @@  static inline bool HAS_UNMAPPED_ID(struct inode *inode)
  *			wb stat updates to grab mapping->tree_lock.  See
  *			inode_switch_wb_work_fn() for details.
  *
- * I_VERS_BUMP		inode->i_version counter must be bumped on the next
- * 			change. See the inode_*_iversion functions.
- *
  * Q: What is the difference between I_WILL_FREE and I_FREEING?
  */
 #define I_DIRTY_SYNC		(1 << 0)
@@ -1932,7 +1929,6 @@  static inline bool HAS_UNMAPPED_ID(struct inode *inode)
 #define __I_DIRTY_TIME_EXPIRED	12
 #define I_DIRTY_TIME_EXPIRED	(1 << __I_DIRTY_TIME_EXPIRED)
 #define I_WB_SWITCH		(1 << 13)
-#define I_VERS_BUMP		(1 << 14)
 
 #define I_DIRTY (I_DIRTY_SYNC | I_DIRTY_DATASYNC | I_DIRTY_PAGES)
 #define I_DIRTY_ALL (I_DIRTY | I_DIRTY_TIME)
@@ -1965,6 +1961,14 @@  static inline void inode_dec_link_count(struct inode *inode)
 	mark_inode_dirty(inode);
 }
 
+/*
+ * We borrow the top bit in the i_version to use as a flag to tell us whether
+ * it has been queried since we last bumped it. If it has, then we must bump
+ * it and set the flag. Note that this means that we have to handle wrapping
+ * manually.
+ */
+#define INODE_I_VERSION_QUERIED		(1ULL<<63)
+
 /**
  * inode_set_iversion - set i_version to a particular value
  * @inode: inode to set
@@ -1976,7 +1980,7 @@  static inline void inode_dec_link_count(struct inode *inode)
 static inline void
 inode_set_iversion(struct inode *inode, const u64 new)
 {
-	inode->i_version = new;
+	atomic64_set(&inode->i_version, new);
 }
 
 /**
@@ -1992,10 +1996,7 @@  inode_set_iversion(struct inode *inode, const u64 new)
 static inline void
 inode_set_iversion_read(struct inode *inode, const u64 new)
 {
-	spin_lock(&inode->i_lock);
-	inode_set_iversion(inode, new);
-	inode->i_state |= I_VERS_BUMP;
-	spin_unlock(&inode->i_lock);
+	inode_set_iversion(inode, new | INODE_I_VERSION_QUERIED);
 }
 
 /**
@@ -2010,16 +2011,26 @@  inode_set_iversion_read(struct inode *inode, const u64 new)
 static inline bool
 inode_inc_iversion(struct inode *inode, bool force)
 {
-	bool ret = false;
+	u64 cur, old, new;
+
+	cur = (u64)atomic64_read(&inode->i_version);
+	for (;;) {
+		/* If flag is clear then we needn't do anything */
+		if (!force && !(cur & INODE_I_VERSION_QUERIED))
+			return false;
+
+		new = (cur & ~INODE_I_VERSION_QUERIED) + 1;
+
+		/* Did we overflow into flag bit? Reset to 0 if so. */
+		if (unlikely(new == INODE_I_VERSION_QUERIED))
+			new = 0;
 
-	spin_lock(&inode->i_lock);
-	if (force || (inode->i_state & I_VERS_BUMP)) {
-		inode->i_version++;
-		inode->i_state &= ~I_VERS_BUMP;
-		ret = true;
+		old = atomic64_cmpxchg(&inode->i_version, cur, new);
+		if (likely(old == cur))
+			break;
+		cur = old;
 	}
-	spin_unlock(&inode->i_lock);
-	return ret;
+	return true;
 }
 
 /**
@@ -2027,8 +2038,9 @@  inode_inc_iversion(struct inode *inode, bool force)
  * @inode: inode to be updated
  *
  * Increment the i_version field in the inode. This version is usable
- * when there is some other sort of lock in play that would prevent
- * concurrent increments (typically inode->i_rwsem for write).
+ * when there is some other sort of lock in play (e.g. i_rwsem for write)
+ * that would prevent concurrent incrementors, and is typically used on
+ * directories or other non-regular files.
  */
 static inline void
 inode_inc_iversion_locked(struct inode *inode)
@@ -2047,7 +2059,7 @@  inode_inc_iversion_locked(struct inode *inode)
 static inline u64
 inode_get_iversion_raw(const struct inode *inode)
 {
-	return inode->i_version;
+	return atomic64_read(&inode->i_version) & ~INODE_I_VERSION_QUERIED;
 }
 
 /**
@@ -2060,13 +2072,20 @@  inode_get_iversion_raw(const struct inode *inode)
 static inline u64
 inode_get_iversion(struct inode *inode)
 {
-	u64 ret;
+	u64 cur, old, new;
 
-	spin_lock(&inode->i_lock);
-	inode->i_state |= I_VERS_BUMP;
-	ret = inode->i_version;
-	spin_unlock(&inode->i_lock);
-	return ret;
+	cur = atomic64_read(&inode->i_version);
+	for (;;) {
+		if (cur & INODE_I_VERSION_QUERIED)
+			return (cur & ~INODE_I_VERSION_QUERIED);
+
+		new = (cur | INODE_I_VERSION_QUERIED);
+		old = atomic64_cmpxchg(&inode->i_version, cur, new);
+		if (old == cur)
+			break;
+		cur = old;
+	}
+	return cur;
 }
 
 /**
@@ -2080,7 +2099,7 @@  inode_get_iversion(struct inode *inode)
 static inline s64
 inode_cmp_iversion(const struct inode *inode, const u64 old)
 {
-	return (s64)inode->i_version - (s64)old;
+	return (s64)(atomic64_read(&inode->i_version) << 1) - (s64)(old << 1);
 }
 
 /**
@@ -2093,12 +2112,7 @@  inode_cmp_iversion(const struct inode *inode, const u64 old)
 static inline bool
 inode_iversion_need_inc(struct inode *inode)
 {
-	bool ret;
-
-	spin_lock(&inode->i_lock);
-	ret = inode->i_state & I_VERS_BUMP;
-	spin_unlock(&inode->i_lock);
-	return ret;
+	return atomic64_read(&inode->i_version) & INODE_I_VERSION_QUERIED;
 }
 
 enum file_time_flags {