[RFCv2,2/7] fs: use RCU for free_super() vs. __sb_start_write()
diff mbox

Message ID 20150625001605.50C9BE41@viggo.jf.intel.com
State New
Headers show

Commit Message

Dave Hansen June 25, 2015, 12:16 a.m. UTC
From: Dave Hansen <dave.hansen@linux.intel.com>

Currently, __sb_start_write() and freeze_super() can race with
each other.  __sb_start_write() uses a smp_mb() to ensure that
freeze_super() can see its write to sb->s_writers.counter and
that it can see freeze_super()'s update to sb->s_writers.frozen.
This all seems to work fine.

But, this smp_mb() makes __sb_start_write() the single hottest
function in the kernel if I sit in a loop and do tiny write()s to
tmpfs over and over.  This is on a very small 2-core system, so
it will only get worse on larger systems.

This _seems_ like an ideal case for RCU.  __sb_start_write() is
the RCU read-side and is in a very fast, performance-sensitive
path.  freeze_super() is the RCU writer and is in an extremely
rare non-performance-sensitive path.

Instead of doing and smp_wmb() in __sb_start_write(), we do
rcu_read_lock().  This ensures that a CPU doing freeze_super()
can not proceed past its synchronize_rcu() until the grace
period has ended and the 's_writers.frozen = SB_FREEZE_WRITE'
is visible to __sb_start_write().

One question here: Does the work that __sb_start_write() does in
a previous grace period becomes visible to freeze_super() after
its call to synchronize_rcu()?  It _seems_ like it should, but it
seems backwards to me since __sb_start_write() is the "reader" in
this case.

This patch increases the number of writes/second that I can do
by 5.6%.

Does anybody see any holes with this?

Cc: Jan Kara <jack@suse.cz>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
---

 b/fs/super.c |   63 ++++++++++++++++++++++++++++++++++++-----------------------
 1 file changed, 39 insertions(+), 24 deletions(-)

Comments

Jan Kara June 26, 2015, 12:59 p.m. UTC | #1
On Wed 24-06-15 17:16:05, Dave Hansen wrote:
> 
> From: Dave Hansen <dave.hansen@linux.intel.com>
> 
> Currently, __sb_start_write() and freeze_super() can race with
> each other.  __sb_start_write() uses a smp_mb() to ensure that
> freeze_super() can see its write to sb->s_writers.counter and
> that it can see freeze_super()'s update to sb->s_writers.frozen.
> This all seems to work fine.
> 
> But, this smp_mb() makes __sb_start_write() the single hottest
> function in the kernel if I sit in a loop and do tiny write()s to
> tmpfs over and over.  This is on a very small 2-core system, so
> it will only get worse on larger systems.
> 
> This _seems_ like an ideal case for RCU.  __sb_start_write() is
> the RCU read-side and is in a very fast, performance-sensitive
> path.  freeze_super() is the RCU writer and is in an extremely
> rare non-performance-sensitive path.
> 
> Instead of doing and smp_wmb() in __sb_start_write(), we do
> rcu_read_lock().  This ensures that a CPU doing freeze_super()
> can not proceed past its synchronize_rcu() until the grace
> period has ended and the 's_writers.frozen = SB_FREEZE_WRITE'
> is visible to __sb_start_write().
> 
> One question here: Does the work that __sb_start_write() does in
> a previous grace period becomes visible to freeze_super() after
> its call to synchronize_rcu()?  It _seems_ like it should, but it
> seems backwards to me since __sb_start_write() is the "reader" in
> this case.

I believe yes. Because all accesses (be it reads or writes) must finish
before the current RCU period finishes. And synchronize_rcu() must make
sure that any code (loads / stores) after it execute only after the RCU
period has finished...

The patch looks good to me. You can add:
Reviewed-by: Jan Kara <jack@suse.com>

								Honza

> 
> This patch increases the number of writes/second that I can do
> by 5.6%.
> 
> Does anybody see any holes with this?
> 
> Cc: Jan Kara <jack@suse.cz>
> Cc: Alexander Viro <viro@zeniv.linux.org.uk>
> Cc: linux-fsdevel@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org
> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> Cc: Tim Chen <tim.c.chen@linux.intel.com>
> Cc: Andi Kleen <ak@linux.intel.com>
> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
> ---
> 
>  b/fs/super.c |   63 ++++++++++++++++++++++++++++++++++++-----------------------
>  1 file changed, 39 insertions(+), 24 deletions(-)
> 
> diff -puN fs/super.c~rcu-__sb_start_write fs/super.c
> --- a/fs/super.c~rcu-__sb_start_write	2015-06-24 17:14:34.939125713 -0700
> +++ b/fs/super.c	2015-06-24 17:14:34.942125847 -0700
> @@ -1190,27 +1190,21 @@ static void acquire_freeze_lock(struct s
>   */
>  int __sb_start_write(struct super_block *sb, int level, bool wait)
>  {
> -retry:
> -	if (unlikely(sb->s_writers.frozen >= level)) {
> +	rcu_read_lock();
> +	while (unlikely(sb->s_writers.frozen >= level)) {
> +		rcu_read_unlock();
>  		if (!wait)
>  			return 0;
>  		wait_event(sb->s_writers.wait_unfrozen,
>  			   sb->s_writers.frozen < level);
> +		rcu_read_lock();
>  	}
>  
>  #ifdef CONFIG_LOCKDEP
>  	acquire_freeze_lock(sb, level, !wait, _RET_IP_);
>  #endif
>  	percpu_counter_inc(&sb->s_writers.counter[level-1]);
> -	/*
> -	 * Make sure counter is updated before we check for frozen.
> -	 * freeze_super() first sets frozen and then checks the counter.
> -	 */
> -	smp_mb();
> -	if (unlikely(sb->s_writers.frozen >= level)) {
> -		__sb_end_write(sb, level);
> -		goto retry;
> -	}
> +	rcu_read_unlock();
>  	return 1;
>  }
>  EXPORT_SYMBOL(__sb_start_write);
> @@ -1254,6 +1248,29 @@ static void sb_wait_write(struct super_b
>  	} while (writers);
>  }
>  
> +static void __thaw_super(struct super_block *sb)
> +{
> +	sb->s_writers.frozen = SB_UNFROZEN;
> +	/*
> +	 * RCU protects us against races where we are taking
> +	 * s_writers.frozen in to a less permissive state.  When
> +	 * that happens, __sb_start_write() might not yet have
> +	 * seen our write and might still increment
> +	 * s_writers.counter.
> +	 *
> +	 * Here, however, we are transitioning to a _more_
> +	 * permissive state.  The filesystem is frozen and no
> +	 * writes to s_writers.counter are being permitted.
> +	 *
> +	 * A smp_wmb() is sufficient here because we just need
> +	 * to ensure that new calls __sb_start_write() are
> +	 * allowed, not that _concurrent_ calls have finished.
> +	 */
> +	smp_wmb();
> +	wake_up(&sb->s_writers.wait_unfrozen);
> +	deactivate_locked_super(sb);
> +}
> +
>  /**
>   * freeze_super - lock the filesystem and force it into a consistent state
>   * @sb: the super to lock
> @@ -1312,7 +1329,13 @@ int freeze_super(struct super_block *sb)
>  
>  	/* From now on, no new normal writers can start */
>  	sb->s_writers.frozen = SB_FREEZE_WRITE;
> -	smp_wmb();
> +	/*
> +	 * After we synchronize_rcu(), we have ensured that everyone
> +	 * who reads sb->s_writers.frozen under rcu_read_lock() can
> +	 * now see our update.  This pretty much means that
> +	 * __sb_start_write() will not allow any new writers.
> +	 */
> +	synchronize_rcu();
>  
>  	/* Release s_umount to preserve sb_start_write -> s_umount ordering */
>  	up_write(&sb->s_umount);
> @@ -1322,7 +1345,7 @@ int freeze_super(struct super_block *sb)
>  	/* Now we go and block page faults... */
>  	down_write(&sb->s_umount);
>  	sb->s_writers.frozen = SB_FREEZE_PAGEFAULT;
> -	smp_wmb();
> +	synchronize_rcu();
>  
>  	sb_wait_write(sb, SB_FREEZE_PAGEFAULT);
>  
> @@ -1331,7 +1354,7 @@ int freeze_super(struct super_block *sb)
>  
>  	/* Now wait for internal filesystem counter */
>  	sb->s_writers.frozen = SB_FREEZE_FS;
> -	smp_wmb();
> +	synchronize_rcu();
>  	sb_wait_write(sb, SB_FREEZE_FS);
>  
>  	if (sb->s_op->freeze_fs) {
> @@ -1339,11 +1362,7 @@ int freeze_super(struct super_block *sb)
>  		if (ret) {
>  			printk(KERN_ERR
>  				"VFS:Filesystem freeze failed\n");
> -			sb->s_writers.frozen = SB_UNFROZEN;
> -			smp_wmb();
> -			wake_up(&sb->s_writers.wait_unfrozen);
> -			deactivate_locked_super(sb);
> -			return ret;
> +			__thaw_super(sb);
>  		}
>  	}
>  	/*
> @@ -1386,11 +1405,7 @@ int thaw_super(struct super_block *sb)
>  	}
>  
>  out:
> -	sb->s_writers.frozen = SB_UNFROZEN;
> -	smp_wmb();
> -	wake_up(&sb->s_writers.wait_unfrozen);
> -	deactivate_locked_super(sb);
> -
> +	__thaw_super(sb);
>  	return 0;
>  }
>  EXPORT_SYMBOL(thaw_super);
> _

Patch
diff mbox

diff -puN fs/super.c~rcu-__sb_start_write fs/super.c
--- a/fs/super.c~rcu-__sb_start_write	2015-06-24 17:14:34.939125713 -0700
+++ b/fs/super.c	2015-06-24 17:14:34.942125847 -0700
@@ -1190,27 +1190,21 @@  static void acquire_freeze_lock(struct s
  */
 int __sb_start_write(struct super_block *sb, int level, bool wait)
 {
-retry:
-	if (unlikely(sb->s_writers.frozen >= level)) {
+	rcu_read_lock();
+	while (unlikely(sb->s_writers.frozen >= level)) {
+		rcu_read_unlock();
 		if (!wait)
 			return 0;
 		wait_event(sb->s_writers.wait_unfrozen,
 			   sb->s_writers.frozen < level);
+		rcu_read_lock();
 	}
 
 #ifdef CONFIG_LOCKDEP
 	acquire_freeze_lock(sb, level, !wait, _RET_IP_);
 #endif
 	percpu_counter_inc(&sb->s_writers.counter[level-1]);
-	/*
-	 * Make sure counter is updated before we check for frozen.
-	 * freeze_super() first sets frozen and then checks the counter.
-	 */
-	smp_mb();
-	if (unlikely(sb->s_writers.frozen >= level)) {
-		__sb_end_write(sb, level);
-		goto retry;
-	}
+	rcu_read_unlock();
 	return 1;
 }
 EXPORT_SYMBOL(__sb_start_write);
@@ -1254,6 +1248,29 @@  static void sb_wait_write(struct super_b
 	} while (writers);
 }
 
+static void __thaw_super(struct super_block *sb)
+{
+	sb->s_writers.frozen = SB_UNFROZEN;
+	/*
+	 * RCU protects us against races where we are taking
+	 * s_writers.frozen in to a less permissive state.  When
+	 * that happens, __sb_start_write() might not yet have
+	 * seen our write and might still increment
+	 * s_writers.counter.
+	 *
+	 * Here, however, we are transitioning to a _more_
+	 * permissive state.  The filesystem is frozen and no
+	 * writes to s_writers.counter are being permitted.
+	 *
+	 * A smp_wmb() is sufficient here because we just need
+	 * to ensure that new calls __sb_start_write() are
+	 * allowed, not that _concurrent_ calls have finished.
+	 */
+	smp_wmb();
+	wake_up(&sb->s_writers.wait_unfrozen);
+	deactivate_locked_super(sb);
+}
+
 /**
  * freeze_super - lock the filesystem and force it into a consistent state
  * @sb: the super to lock
@@ -1312,7 +1329,13 @@  int freeze_super(struct super_block *sb)
 
 	/* From now on, no new normal writers can start */
 	sb->s_writers.frozen = SB_FREEZE_WRITE;
-	smp_wmb();
+	/*
+	 * After we synchronize_rcu(), we have ensured that everyone
+	 * who reads sb->s_writers.frozen under rcu_read_lock() can
+	 * now see our update.  This pretty much means that
+	 * __sb_start_write() will not allow any new writers.
+	 */
+	synchronize_rcu();
 
 	/* Release s_umount to preserve sb_start_write -> s_umount ordering */
 	up_write(&sb->s_umount);
@@ -1322,7 +1345,7 @@  int freeze_super(struct super_block *sb)
 	/* Now we go and block page faults... */
 	down_write(&sb->s_umount);
 	sb->s_writers.frozen = SB_FREEZE_PAGEFAULT;
-	smp_wmb();
+	synchronize_rcu();
 
 	sb_wait_write(sb, SB_FREEZE_PAGEFAULT);
 
@@ -1331,7 +1354,7 @@  int freeze_super(struct super_block *sb)
 
 	/* Now wait for internal filesystem counter */
 	sb->s_writers.frozen = SB_FREEZE_FS;
-	smp_wmb();
+	synchronize_rcu();
 	sb_wait_write(sb, SB_FREEZE_FS);
 
 	if (sb->s_op->freeze_fs) {
@@ -1339,11 +1362,7 @@  int freeze_super(struct super_block *sb)
 		if (ret) {
 			printk(KERN_ERR
 				"VFS:Filesystem freeze failed\n");
-			sb->s_writers.frozen = SB_UNFROZEN;
-			smp_wmb();
-			wake_up(&sb->s_writers.wait_unfrozen);
-			deactivate_locked_super(sb);
-			return ret;
+			__thaw_super(sb);
 		}
 	}
 	/*
@@ -1386,11 +1405,7 @@  int thaw_super(struct super_block *sb)
 	}
 
 out:
-	sb->s_writers.frozen = SB_UNFROZEN;
-	smp_wmb();
-	wake_up(&sb->s_writers.wait_unfrozen);
-	deactivate_locked_super(sb);
-
+	__thaw_super(sb);
 	return 0;
 }
 EXPORT_SYMBOL(thaw_super);