[6/6] xfs: lockless buffer lookup

Message ID: 20220627060841.244226-7-david@fromorbit.com
State: Superseded
Series: xfs: lockless buffer lookups

Commit Message

Dave Chinner June 27, 2022, 6:08 a.m. UTC
From: Dave Chinner <dchinner@redhat.com>

Now that we have a standalone fast path for buffer lookup, we can
easily convert it to use rcu lookups. When we continually hammer the
buffer cache with trylock lookups, we end up with a huge amount of
lock contention on the per-ag buffer hash locks:

-   92.71%     0.05%  [kernel]                  [k] xfs_inodegc_worker
   - 92.67% xfs_inodegc_worker
      - 92.13% xfs_inode_unlink
         - 91.52% xfs_inactive_ifree
            - 85.63% xfs_read_agi
               - 85.61% xfs_trans_read_buf_map
                  - 85.59% xfs_buf_read_map
                     - xfs_buf_get_map
                        - 85.55% xfs_buf_find
                           - 72.87% _raw_spin_lock
                              - do_raw_spin_lock
                                   71.86% __pv_queued_spin_lock_slowpath
                           - 8.74% xfs_buf_rele
                              - 7.88% _raw_spin_lock
                                 - 7.88% do_raw_spin_lock
                                      7.63% __pv_queued_spin_lock_slowpath
                           - 1.70% xfs_buf_trylock
                              - 1.68% down_trylock
                                 - 1.41% _raw_spin_lock_irqsave
                                    - 1.39% do_raw_spin_lock
                                         __pv_queued_spin_lock_slowpath
                           - 0.76% _raw_spin_unlock
                                0.75% do_raw_spin_unlock

This is basically hammering the pag->pag_buf_lock from lots of CPUs
doing trylocks at the same time. Most of the buffer trylock
operations ultimately fail after we've done the lookup, so we're
really hammering the buf hash lock whilst making no progress.

We can also see significant spinlock traffic on the same lock just
under normal operation when lots of tasks are accessing metadata
from the same AG, so let's avoid all this by converting the lookup
fast path to leverage the rhashtable's ability to do rcu protected
lookups.

We avoid races with the buffer release path by using
atomic_inc_not_zero() on the buffer hold count. Any buffer that is
in the LRU will have a non-zero count, thereby allowing the lockless
fast path to be taken in most cache hit situations. If the buffer
hold count is zero, then it is likely going through the release path
so in that case we fall back to the existing lookup miss slow path.

The slow path will then do an atomic lookup and insert under the
buffer hash lock and hence serialise correctly against buffer
release freeing the buffer.

The use of rcu protected lookups means that buffer handles now need
to be freed by RCU callbacks (same as inodes). We still free the
buffer pages before the RCU callback - we won't be trying to access
them at all on a buffer that has zero references - but we need the
buffer handle itself to be present for the entire rcu protected read
side to detect a zero hold count correctly.
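
For reference, the converted fast path boils down to the following
pattern (condensed from the xfs_buf_find_fast() hunk below; the
xfs_buf_find_lock() handling after a successful lookup is elided):

	rcu_read_lock();
	bp = rhashtable_lookup(&pag->pag_buf_hash, map, xfs_buf_hash_params);
	if (!bp || !atomic_inc_not_zero(&bp->b_hold)) {
		/*
		 * Cache miss, or the buffer is being torn down with a
		 * zero hold count - fall back to the locked slow path.
		 */
		rcu_read_unlock();
		return -ENOENT;
	}
	rcu_read_unlock();

with the matching change on the free side being:

	/* Defer freeing the buffer handle until after a grace period. */
	call_rcu(&bp->b_rcu, xfs_buf_free_callback);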

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_buf.c | 22 +++++++++++++++-------
 fs/xfs/xfs_buf.h |  1 +
 2 files changed, 16 insertions(+), 7 deletions(-)

Comments

Christoph Hellwig June 29, 2022, 7:41 a.m. UTC | #1
Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong June 29, 2022, 10:04 p.m. UTC | #2
On Mon, Jun 27, 2022 at 04:08:41PM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Now that we have a standalone fast path for buffer lookup, we can
> easily convert it to use rcu lookups. When we continually hammer the
> buffer cache with trylock lookups, we end up with a huge amount of
> lock contention on the per-ag buffer hash locks:
> 
> -   92.71%     0.05%  [kernel]                  [k] xfs_inodegc_worker
>    - 92.67% xfs_inodegc_worker
>       - 92.13% xfs_inode_unlink
>          - 91.52% xfs_inactive_ifree
>             - 85.63% xfs_read_agi
>                - 85.61% xfs_trans_read_buf_map
>                   - 85.59% xfs_buf_read_map
>                      - xfs_buf_get_map
>                         - 85.55% xfs_buf_find
>                            - 72.87% _raw_spin_lock
>                               - do_raw_spin_lock
>                                    71.86% __pv_queued_spin_lock_slowpath
>                            - 8.74% xfs_buf_rele
>                               - 7.88% _raw_spin_lock
>                                  - 7.88% do_raw_spin_lock
>                                       7.63% __pv_queued_spin_lock_slowpath
>                            - 1.70% xfs_buf_trylock
>                               - 1.68% down_trylock
>                                  - 1.41% _raw_spin_lock_irqsave
>                                     - 1.39% do_raw_spin_lock
>                                          __pv_queued_spin_lock_slowpath
>                            - 0.76% _raw_spin_unlock
>                                 0.75% do_raw_spin_unlock
> 
> This is basically hammering the pag->pag_buf_lock from lots of CPUs
> doing trylocks at the same time. Most of the buffer trylock
> operations ultimately fail after we've done the lookup, so we're
> really hammering the buf hash lock whilst making no progress.
> 
> We can also see significant spinlock traffic on the same lock just
> under normal operation when lots of tasks are accessing metadata
> from the same AG, so let's avoid all this by converting the lookup
> fast path to leverage the rhashtable's ability to do rcu protected
> lookups.
> 
> We avoid races with the buffer release path by using
> atomic_inc_not_zero() on the buffer hold count. Any buffer that is
> in the LRU will have a non-zero count, thereby allowing the lockless
> fast path to be taken in most cache hit situations. If the buffer
> hold count is zero, then it is likely going through the release path
> so in that case we fall back to the existing lookup miss slow path.
> 
> The slow path will then do an atomic lookup and insert under the
> buffer hash lock and hence serialise correctly against buffer
> release freeing the buffer.
> 
> The use of rcu protected lookups means that buffer handles now need
> to be freed by RCU callbacks (same as inodes). We still free the
> buffer pages before the RCU callback - we won't be trying to access
> them at all on a buffer that has zero references - but we need the
> buffer handle itself to be present for the entire rcu protected read
> side to detect a zero hold count correctly.

Hmm, so what still uses pag_buf_lock?  Are we still using it to
serialize xfs_buf_rele calls?

--D

> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_buf.c | 22 +++++++++++++++-------
>  fs/xfs/xfs_buf.h |  1 +
>  2 files changed, 16 insertions(+), 7 deletions(-)
> 
> diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> index 3461ef3ebc1c..14a2d2d6a4e0 100644
> --- a/fs/xfs/xfs_buf.c
> +++ b/fs/xfs/xfs_buf.c
> @@ -294,6 +294,16 @@ xfs_buf_free_pages(
>  	bp->b_flags &= ~_XBF_PAGES;
>  }
>  
> +static void
> +xfs_buf_free_callback(
> +	struct callback_head	*cb)
> +{
> +	struct xfs_buf		*bp = container_of(cb, struct xfs_buf, b_rcu);
> +
> +	xfs_buf_free_maps(bp);
> +	kmem_cache_free(xfs_buf_cache, bp);
> +}
> +
>  static void
>  xfs_buf_free(
>  	struct xfs_buf		*bp)
> @@ -307,8 +317,7 @@ xfs_buf_free(
>  	else if (bp->b_flags & _XBF_KMEM)
>  		kmem_free(bp->b_addr);
>  
> -	xfs_buf_free_maps(bp);
> -	kmem_cache_free(xfs_buf_cache, bp);
> +	call_rcu(&bp->b_rcu, xfs_buf_free_callback);
>  }
>  
>  static int
> @@ -567,14 +576,13 @@ xfs_buf_find_fast(
>  	struct xfs_buf          *bp;
>  	int			error;
>  
> -	spin_lock(&pag->pag_buf_lock);
> +	rcu_read_lock();
>  	bp = rhashtable_lookup(&pag->pag_buf_hash, map, xfs_buf_hash_params);
> -	if (!bp) {
> -		spin_unlock(&pag->pag_buf_lock);
> +	if (!bp || !atomic_inc_not_zero(&bp->b_hold)) {
> +		rcu_read_unlock();
>  		return -ENOENT;
>  	}
> -	atomic_inc(&bp->b_hold);
> -	spin_unlock(&pag->pag_buf_lock);
> +	rcu_read_unlock();
>  
>  	error = xfs_buf_find_lock(bp, flags);
>  	if (error) {
> diff --git a/fs/xfs/xfs_buf.h b/fs/xfs/xfs_buf.h
> index 58e9034d51bd..02b3c1635ec3 100644
> --- a/fs/xfs/xfs_buf.h
> +++ b/fs/xfs/xfs_buf.h
> @@ -196,6 +196,7 @@ struct xfs_buf {
>  	int			b_last_error;
>  
>  	const struct xfs_buf_ops	*b_ops;
> +	struct rcu_head		b_rcu;
>  };
>  
>  /* Finding and Reading Buffers */
> -- 
> 2.36.1
>
Dave Chinner July 7, 2022, 12:36 p.m. UTC | #3
On Wed, Jun 29, 2022 at 03:04:41PM -0700, Darrick J. Wong wrote:
> On Mon, Jun 27, 2022 at 04:08:41PM +1000, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > Now that we have a standalone fast path for buffer lookup, we can
> > easily convert it to use rcu lookups. When we continually hammer the
> > buffer cache with trylock lookups, we end up with a huge amount of
> > lock contention on the per-ag buffer hash locks:
> > 
> > -   92.71%     0.05%  [kernel]                  [k] xfs_inodegc_worker
> >    - 92.67% xfs_inodegc_worker
> >       - 92.13% xfs_inode_unlink
> >          - 91.52% xfs_inactive_ifree
> >             - 85.63% xfs_read_agi
> >                - 85.61% xfs_trans_read_buf_map
> >                   - 85.59% xfs_buf_read_map
> >                      - xfs_buf_get_map
> >                         - 85.55% xfs_buf_find
> >                            - 72.87% _raw_spin_lock
> >                               - do_raw_spin_lock
> >                                    71.86% __pv_queued_spin_lock_slowpath
> >                            - 8.74% xfs_buf_rele
> >                               - 7.88% _raw_spin_lock
> >                                  - 7.88% do_raw_spin_lock
> >                                       7.63% __pv_queued_spin_lock_slowpath
> >                            - 1.70% xfs_buf_trylock
> >                               - 1.68% down_trylock
> >                                  - 1.41% _raw_spin_lock_irqsave
> >                                     - 1.39% do_raw_spin_lock
> >                                          __pv_queued_spin_lock_slowpath
> >                            - 0.76% _raw_spin_unlock
> >                                 0.75% do_raw_spin_unlock
> > 
> > This is basically hammering the pag->pag_buf_lock from lots of CPUs
> > doing trylocks at the same time. Most of the buffer trylock
> > operations ultimately fail after we've done the lookup, so we're
> > really hammering the buf hash lock whilst making no progress.
> > 
> > We can also see significant spinlock traffic on the same lock just
> > under normal operation when lots of tasks are accessing metadata
> > from the same AG, so let's avoid all this by converting the lookup
> > fast path to leverage the rhashtable's ability to do rcu protected
> > lookups.
> > 
> > We avoid races with the buffer release path by using
> > atomic_inc_not_zero() on the buffer hold count. Any buffer that is
> > in the LRU will have a non-zero count, thereby allowing the lockless
> > fast path to be taken in most cache hit situations. If the buffer
> > hold count is zero, then it is likely going through the release path
> > so in that case we fall back to the existing lookup miss slow path.
> > 
> > The slow path will then do an atomic lookup and insert under the
> > buffer hash lock and hence serialise correctly against buffer
> > release freeing the buffer.
> > 
> > The use of rcu protected lookups means that buffer handles now need
> > to be freed by RCU callbacks (same as inodes). We still free the
> > buffer pages before the RCU callback - we won't be trying to access
> > them at all on a buffer that has zero references - but we need the
> > buffer handle itself to be present for the entire rcu protected read
> > side to detect a zero hold count correctly.
> 
> Hmm, so what still uses pag_buf_lock?  Are we still using it to
> serialize xfs_buf_rele calls?

slow path lookup/insert and xfs_buf_rele calls.

The only buffers we are allowing lockless lookups on are those with
at least one reference. With the LRU holding a reference, that means
the buffer is still actively cached or referenced by something
else, so it won't disappear from under us. If the ref count is zero,
then it means the buffer is being removed from the cache, so we
need to go the slow way and take the pag_buf_lock to serialise the
lookup against the release of the unreferenced buffer we found in
the cache.
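
To make that serialisation concrete: the final reference drop in
xfs_buf_rele() takes the lock via atomic_dec_and_lock(), so the hold
count can only ever hit zero with the pag_buf_lock held. Heavily
condensed (LRU, stale and perag reference handling elided), that
side looks like:

	/* Count reaches zero and pag_buf_lock is taken atomically. */
	if (atomic_dec_and_lock(&bp->b_hold, &pag->pag_buf_lock)) {
		/*
		 * No new reference can appear now: the lockless fast
		 * path refuses zero-count buffers, and the slow path
		 * needs pag_buf_lock, which we hold.
		 */
		rhashtable_remove_fast(&pag->pag_buf_hash, &bp->b_rhash_head,
				       xfs_buf_hash_params);
		spin_unlock(&pag->pag_buf_lock);
		xfs_buf_free(bp);	/* handle freed via call_rcu() now */
	}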

Cheers,

Dave.
Darrick J. Wong July 7, 2022, 5:55 p.m. UTC | #4
On Thu, Jul 07, 2022 at 10:36:33PM +1000, Dave Chinner wrote:
> On Wed, Jun 29, 2022 at 03:04:41PM -0700, Darrick J. Wong wrote:
> > On Mon, Jun 27, 2022 at 04:08:41PM +1000, Dave Chinner wrote:
> > > From: Dave Chinner <dchinner@redhat.com>
> > > 
> > > Now that we have a standalone fast path for buffer lookup, we can
> > > easily convert it to use rcu lookups. When we continually hammer the
> > > buffer cache with trylock lookups, we end up with a huge amount of
> > > lock contention on the per-ag buffer hash locks:
> > > 
> > > -   92.71%     0.05%  [kernel]                  [k] xfs_inodegc_worker
> > >    - 92.67% xfs_inodegc_worker
> > >       - 92.13% xfs_inode_unlink
> > >          - 91.52% xfs_inactive_ifree
> > >             - 85.63% xfs_read_agi
> > >                - 85.61% xfs_trans_read_buf_map
> > >                   - 85.59% xfs_buf_read_map
> > >                      - xfs_buf_get_map
> > >                         - 85.55% xfs_buf_find
> > >                            - 72.87% _raw_spin_lock
> > >                               - do_raw_spin_lock
> > >                                    71.86% __pv_queued_spin_lock_slowpath
> > >                            - 8.74% xfs_buf_rele
> > >                               - 7.88% _raw_spin_lock
> > >                                  - 7.88% do_raw_spin_lock
> > >                                       7.63% __pv_queued_spin_lock_slowpath
> > >                            - 1.70% xfs_buf_trylock
> > >                               - 1.68% down_trylock
> > >                                  - 1.41% _raw_spin_lock_irqsave
> > >                                     - 1.39% do_raw_spin_lock
> > >                                          __pv_queued_spin_lock_slowpath
> > >                            - 0.76% _raw_spin_unlock
> > >                                 0.75% do_raw_spin_unlock
> > > 
> > > This is basically hammering the pag->pag_buf_lock from lots of CPUs
> > > doing trylocks at the same time. Most of the buffer trylock
> > > operations ultimately fail after we've done the lookup, so we're
> > > really hammering the buf hash lock whilst making no progress.
> > > 
> > > We can also see significant spinlock traffic on the same lock just
> > > under normal operation when lots of tasks are accessing metadata
> > > from the same AG, so let's avoid all this by converting the lookup
> > > fast path to leverage the rhashtable's ability to do rcu protected
> > > lookups.
> > > 
> > > We avoid races with the buffer release path by using
> > > atomic_inc_not_zero() on the buffer hold count. Any buffer that is
> > > in the LRU will have a non-zero count, thereby allowing the lockless
> > > fast path to be taken in most cache hit situations. If the buffer
> > > hold count is zero, then it is likely going through the release path
> > > so in that case we fall back to the existing lookup miss slow path.
> > > 
> > > The slow path will then do an atomic lookup and insert under the
> > > buffer hash lock and hence serialise correctly against buffer
> > > release freeing the buffer.
> > > 
> > > The use of rcu protected lookups means that buffer handles now need
> > > to be freed by RCU callbacks (same as inodes). We still free the
> > > buffer pages before the RCU callback - we won't be trying to access
> > > them at all on a buffer that has zero references - but we need the
> > > buffer handle itself to be present for the entire rcu protected read
> > > side to detect a zero hold count correctly.
> > 
> > Hmm, so what still uses pag_buf_lock?  Are we still using it to
> > serialize xfs_buf_rele calls?
> 
> slow path lookup/insert and xfs_buf_rele calls.
> 
> The only buffers we are allowing lockless lookups on are those with
> at least one reference. With the LRU holding a reference, that means
> the buffer is still actively cached or referenced by something
> else, so it won't disappear from under us. If the ref count is zero,
> then it means the buffer is being removed from the cache, so we
> need to go the slow way and take the pag_buf_lock to serialise the
> lookup against the release of the unreferenced buffer we found in
> the cache.

Ah, right, got it.
Reviewed-by: Darrick J. Wong <djwong@kernel.org>

--D


> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
Christoph Hellwig July 11, 2022, 5:16 a.m. UTC | #5
On Thu, Jul 07, 2022 at 10:36:33PM +1000, Dave Chinner wrote:
> > 
> > Hmm, so what still uses pag_buf_lock?  Are we still using it to
> > serialize xfs_buf_rele calls?
> 
> slow path lookup/insert and xfs_buf_rele calls.

Note that we might be able to eventually remove the locking in rele
if we switch the lookup slow path to use atomic_inc_not_zero as
well.  But it would need a very careful audit.
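
To illustrate the direction (a hypothetical sketch only, not a tested
change - validating it against the LRU and XBF_STALE interactions is
exactly what that careful audit would have to do): if every lookup,
fast and slow, only ever took its reference via atomic_inc_not_zero(),
a hold count that reaches zero could never be resurrected, and the
final drop would no longer need to happen under pag_buf_lock - only
the hash removal itself would:

	if (atomic_dec_and_test(&bp->b_hold)) {
		/*
		 * Nothing can take a new reference once the count is
		 * zero; the lock is only needed for the removal itself.
		 * (What a concurrent slow-path lookup that still finds
		 * this entry should do is part of the audit.)
		 */
		spin_lock(&pag->pag_buf_lock);
		rhashtable_remove_fast(&pag->pag_buf_hash, &bp->b_rhash_head,
				       xfs_buf_hash_params);
		spin_unlock(&pag->pag_buf_lock);
		xfs_buf_free(bp);
	}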

Patch

diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index 3461ef3ebc1c..14a2d2d6a4e0 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -294,6 +294,16 @@ xfs_buf_free_pages(
 	bp->b_flags &= ~_XBF_PAGES;
 }
 
+static void
+xfs_buf_free_callback(
+	struct callback_head	*cb)
+{
+	struct xfs_buf		*bp = container_of(cb, struct xfs_buf, b_rcu);
+
+	xfs_buf_free_maps(bp);
+	kmem_cache_free(xfs_buf_cache, bp);
+}
+
 static void
 xfs_buf_free(
 	struct xfs_buf		*bp)
@@ -307,8 +317,7 @@ xfs_buf_free(
 	else if (bp->b_flags & _XBF_KMEM)
 		kmem_free(bp->b_addr);
 
-	xfs_buf_free_maps(bp);
-	kmem_cache_free(xfs_buf_cache, bp);
+	call_rcu(&bp->b_rcu, xfs_buf_free_callback);
 }
 
 static int
@@ -567,14 +576,13 @@ xfs_buf_find_fast(
 	struct xfs_buf          *bp;
 	int			error;
 
-	spin_lock(&pag->pag_buf_lock);
+	rcu_read_lock();
 	bp = rhashtable_lookup(&pag->pag_buf_hash, map, xfs_buf_hash_params);
-	if (!bp) {
-		spin_unlock(&pag->pag_buf_lock);
+	if (!bp || !atomic_inc_not_zero(&bp->b_hold)) {
+		rcu_read_unlock();
 		return -ENOENT;
 	}
-	atomic_inc(&bp->b_hold);
-	spin_unlock(&pag->pag_buf_lock);
+	rcu_read_unlock();
 
 	error = xfs_buf_find_lock(bp, flags);
 	if (error) {
diff --git a/fs/xfs/xfs_buf.h b/fs/xfs/xfs_buf.h
index 58e9034d51bd..02b3c1635ec3 100644
--- a/fs/xfs/xfs_buf.h
+++ b/fs/xfs/xfs_buf.h
@@ -196,6 +196,7 @@ struct xfs_buf {
 	int			b_last_error;
 
 	const struct xfs_buf_ops	*b_ops;
+	struct rcu_head		b_rcu;
 };
 
 /* Finding and Reading Buffers */