diff mbox

[2/2] xfs: switch buffer cache entries to RCU freeing

Message ID 1476821653-2595-3-git-send-email-dev@lynxeye.de
State New, archived
Headers show

Commit Message

Lucas Stach Oct. 18, 2016, 8:14 p.m. UTC
The buffer cache hash as the indexing data structure into the buffer
cache is already protected by RCU. By freeing the entries itself by
RCU we can get rid of the buffer cache lock altogether.

Signed-off-by: Lucas Stach <dev@lynxeye.de>
---
 fs/xfs/xfs_buf.c   | 41 +++++++++++++++++++++++++++--------------
 fs/xfs/xfs_buf.h   |  2 ++
 fs/xfs/xfs_mount.h |  1 -
 3 files changed, 29 insertions(+), 15 deletions(-)

Comments

Lucas Stach Oct. 22, 2016, 6:52 p.m. UTC | #1
Am Mittwoch, den 19.10.2016, 09:43 +1100 schrieb Dave Chinner:
> On Tue, Oct 18, 2016 at 10:14:13PM +0200, Lucas Stach wrote:
> > 
> > The buffer cache hash as the indexing data structure into the
> > buffer
> > cache is already protected by RCU. By freeing the entries itself by
> > RCU we can get rid of the buffer cache lock altogether.
> > 
> > Signed-off-by: Lucas Stach <dev@lynxeye.de>
> > ---
> >  fs/xfs/xfs_buf.c   | 41 +++++++++++++++++++++++++++--------------
> >  fs/xfs/xfs_buf.h   |  2 ++
> >  fs/xfs/xfs_mount.h |  1 -
> >  3 files changed, 29 insertions(+), 15 deletions(-)
> > 
> > diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> > index 50c5b01..3ee0a3d1 100644
> > --- a/fs/xfs/xfs_buf.c
> > +++ b/fs/xfs/xfs_buf.c
> > @@ -326,6 +326,16 @@ xfs_buf_free(
> >  	kmem_zone_free(xfs_buf_zone, bp);
> >  }
> >  
> > +STATIC void
> > +__xfs_buf_rcu_free(
> > +	struct rcu_head	*head)
> > +{
> > +	xfs_buf_t *bp = container_of(head, struct xfs_buf,
> > rcu_head);
> > +
> > +	ASSERT(atomic_read(&bp->b_hold) == 0);
> > +	xfs_buf_free(bp);
> > +}
> > +
> >  /*
> >   * Allocates all the pages for buffer in question and builds it's
> > page list.
> >   */
> > @@ -491,6 +501,14 @@ _xfs_buf_cmp(
> >  	BUILD_BUG_ON(offsetof(struct xfs_buf_cmp_arg, blkno) !=
> > 0);
> >  
> >  	if (bp->b_bn == cmp_arg->blkno) {
> > +		/*
> > +		 * Skip matches with a hold count of zero, as they
> > are about to
> > +		 * be freed by RCU. Continue searching as another
> > valid entry
> > +		 * might have already been inserted into the hash.
> > +		 */
> > +		if (unlikely(atomic_read(&bp->b_hold) == 0))
> > +			return 1;
> 
> This is an invalid assumption. Buffers transition through a hold
> count of zero when they get moved to the LRU list, which then adds
> a reference for the LRU list so the hold count goes back above zero.
> The pag_buf_lock serialised this 1->0->1 hold count change so it
> wasn't ever seen by lookups or concurrent xfs_buf_rele() calls.
> 
> Removing the pag_buf_lock around freeing of buffers and lookup
> exposes this transition to lookup and so b_hold cannot be used like
> this to determine if a buffer is being freed or not.
> 
> Like I said, there is some really subtle stuff in the buffer
> lifecycle management....
> 
> Hmmmm - because this is now all RCU-ised, we are going to need some
> form of memory barrier to ensure that whatever we use to detect
> buffers in the RCU grace period is race free. I suspect the easiest
> thing to do is to use the bp->b_lock spinlock to set a noticable
> "buffer being freed" state in xfs_buf_rele() that we can check in
> lookup under the spinlock. That way the lock provides the memory
> barrier we need and it also provides serialisation against
> transient hold count values in xfs_buf_rele() similar to what the
> pag_buf_lock used to provide....
> 
> (FYI: the internal object spinlock pattern is what we use for the
> RCU lookup vs freeing serialisation in the XFS inode cache....)
> 

Thanks. Using the internal spinlock seems like a good idea.

> > 
> > @@ -580,14 +597,16 @@ _xfs_buf_find(
> >  	pag = xfs_perag_get(btp->bt_mount,
> >  			    xfs_daddr_to_agno(btp->bt_mount,
> > cmp_arg.blkno));
> >  
> > +	rcu_read_lock();
> >  	/* lookup buf in pag hash */
> > -	spin_lock(&pag->pag_buf_lock);
> >  	bp = rhashtable_lookup_fast(&pag->pag_buf_hash, &cmp_arg,
> >  				    xfs_buf_hash_params);
> > -	if (bp) {
> > -		atomic_inc(&bp->b_hold);
> > +
> > +	/* if the hold count is zero the buffer is about to be
> > freed by RCU */
> > +	if (bp && atomic_inc_not_zero(&bp->b_hold))
> >  		goto found;
> > -	}
> > +
> > +	rcu_read_unlock();
> 
> This has the same problem with transient 0 hold counts in
> xfs_buf_rele(). I suspect that we need to take the hold reference in
> the xfs_buf_rhash_compare() function where we need to take the
> bp->b_lock() to check for the buffer being in the rcu grace
> period. If it's not being freed, we should take a reference right
> then so that it doesn't get freed between the lookup check and the
> buffer being returned...
> 
I don't think this will work with rhashtables. The internal workings of
those depend on being able to call the compare function multiple times
on a single lookup, so you can't do anything in the lookup that changes
the state of the object.

I think the best we can do is to skip buffers that are already in
"going to be freed" state in the lookup (without taking the bp-
>b_lock).
Then take the lock to increase the hold count and recheck if the object
transitioned into the "about to be freed" state between our lookup and
the time we got the lock. If the object is marked from removal at this
point we need to retry the lookup, I think.

Regards,
Lucas
--
To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Dave Chinner Oct. 24, 2016, 2:37 a.m. UTC | #2
On Sat, Oct 22, 2016 at 08:52:44PM +0200, Lucas Stach wrote:
> Am Mittwoch, den 19.10.2016, 09:43 +1100 schrieb Dave Chinner:
> > On Tue, Oct 18, 2016 at 10:14:13PM +0200, Lucas Stach wrote:
> > >  			    xfs_daddr_to_agno(btp->bt_mount,
> > > cmp_arg.blkno));
> > >  
> > > +	rcu_read_lock();
> > >  	/* lookup buf in pag hash */
> > > -	spin_lock(&pag->pag_buf_lock);
> > >  	bp = rhashtable_lookup_fast(&pag->pag_buf_hash, &cmp_arg,
> > >  				    xfs_buf_hash_params);
> > > -	if (bp) {
> > > -		atomic_inc(&bp->b_hold);
> > > +
> > > +	/* if the hold count is zero the buffer is about to be
> > > freed by RCU */
> > > +	if (bp && atomic_inc_not_zero(&bp->b_hold))
> > >  		goto found;
> > > -	}
> > > +
> > > +	rcu_read_unlock();
> > 
> > This has the same problem with transient 0 hold counts in
> > xfs_buf_rele(). I suspect that we need to take the hold reference in
> > the xfs_buf_rhash_compare() function where we need to take the
> > bp->b_lock() to check for the buffer being in the rcu grace
> > period. If it's not being freed, we should take a reference right
> > then so that it doesn't get freed between the lookup check and the
> > buffer being returned...
> > 
> I don't think this will work with rhashtables. The internal workings of
> those depend on being able to call the compare function multiple times
> on a single lookup, so you can't do anything in the lookup that changes
> the state of the object.

That's not documented anywhere iin the code that I can find. :(
It would be nice to have important API details like that documented
somewhere obvious....

> I think the best we can do is to skip buffers that are already in
> "going to be freed" state in the lookup (without taking the bp-
> >b_lock).

Let's get our terminology straight here to avoid confusion. You're
talking about checking in the compare function here, right? As
opposed to checking the object returned by the lookup function?

> Then take the lock to increase the hold count and recheck if the object
> transitioned into the "about to be freed" state between our lookup and
> the time we got the lock.

.... between the check in the compare function and the time we got
the lock in the object returned?

If so, There is no need to increase the hold count before we check
the freeing state once we have the lock - freeing will serialise
against the lock we take on the buffer object the lookup returns
before it can drop the last reference and free it. Hence as long as
we increment the hold count before we drop the lock we are good...

> If the object is marked from removal at this
> point we need to retry the lookup, I think.

Yup, seems reasonable. Probably best to hook up the xb_miss_locked
stat in this case, too...

Cheers,

Dave.
diff mbox

Patch

diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index 50c5b01..3ee0a3d1 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -326,6 +326,16 @@  xfs_buf_free(
 	kmem_zone_free(xfs_buf_zone, bp);
 }
 
+STATIC void
+__xfs_buf_rcu_free(
+	struct rcu_head	*head)
+{
+	xfs_buf_t *bp = container_of(head, struct xfs_buf, rcu_head);
+
+	ASSERT(atomic_read(&bp->b_hold) == 0);
+	xfs_buf_free(bp);
+}
+
 /*
  * Allocates all the pages for buffer in question and builds it's page list.
  */
@@ -491,6 +501,14 @@  _xfs_buf_cmp(
 	BUILD_BUG_ON(offsetof(struct xfs_buf_cmp_arg, blkno) != 0);
 
 	if (bp->b_bn == cmp_arg->blkno) {
+		/*
+		 * Skip matches with a hold count of zero, as they are about to
+		 * be freed by RCU. Continue searching as another valid entry
+		 * might have already been inserted into the hash.
+		 */
+		if (unlikely(atomic_read(&bp->b_hold) == 0))
+			return 1;
+
 		if (unlikely(bp->b_length != cmp_arg->numblks)) {
 			/*
 			 * found a block number match. If the range doesn't
@@ -522,7 +540,6 @@  static const struct rhashtable_params xfs_buf_hash_params = {
 int xfs_buf_hash_init(
 	struct xfs_perag *pag)
 {
-	spin_lock_init(&pag->pag_buf_lock);
 	return rhashtable_init(&pag->pag_buf_hash, &xfs_buf_hash_params);
 }
 
@@ -580,14 +597,16 @@  _xfs_buf_find(
 	pag = xfs_perag_get(btp->bt_mount,
 			    xfs_daddr_to_agno(btp->bt_mount, cmp_arg.blkno));
 
+	rcu_read_lock();
 	/* lookup buf in pag hash */
-	spin_lock(&pag->pag_buf_lock);
 	bp = rhashtable_lookup_fast(&pag->pag_buf_hash, &cmp_arg,
 				    xfs_buf_hash_params);
-	if (bp) {
-		atomic_inc(&bp->b_hold);
+
+	/* if the hold count is zero the buffer is about to be freed by RCU */
+	if (bp && atomic_inc_not_zero(&bp->b_hold))
 		goto found;
-	}
+
+	rcu_read_unlock();
 
 	/* No match found */
 	if (new_bp) {
@@ -596,16 +615,14 @@  _xfs_buf_find(
 		rhashtable_insert_fast(&pag->pag_buf_hash,
 				       &new_bp->b_rhash_head,
 				       xfs_buf_hash_params);
-		spin_unlock(&pag->pag_buf_lock);
 	} else {
 		XFS_STATS_INC(btp->bt_mount, xb_miss_locked);
-		spin_unlock(&pag->pag_buf_lock);
 		xfs_perag_put(pag);
 	}
 	return new_bp;
 
 found:
-	spin_unlock(&pag->pag_buf_lock);
+	rcu_read_unlock();
 	xfs_perag_put(pag);
 
 	if (!xfs_buf_trylock(bp)) {
@@ -956,7 +973,6 @@  xfs_buf_rele(
 	xfs_buf_t		*bp)
 {
 	struct xfs_perag	*pag = bp->b_pag;
-	bool			release;
 	bool			freebuf = false;
 
 	trace_xfs_buf_rele(bp, _RET_IP_);
@@ -975,9 +991,8 @@  xfs_buf_rele(
 
 	ASSERT(atomic_read(&bp->b_hold) > 0);
 
-	release = atomic_dec_and_lock(&bp->b_hold, &pag->pag_buf_lock);
 	spin_lock(&bp->b_lock);
-	if (!release) {
+	if (!atomic_dec_and_test(&bp->b_hold)) {
 		/*
 		 * Drop the in-flight state if the buffer is already on the LRU
 		 * and it holds the only reference. This is racy because we
@@ -1001,7 +1016,6 @@  xfs_buf_rele(
 			bp->b_state &= ~XFS_BSTATE_DISPOSE;
 			atomic_inc(&bp->b_hold);
 		}
-		spin_unlock(&pag->pag_buf_lock);
 	} else {
 		/*
 		 * most of the time buffers will already be removed from the
@@ -1018,7 +1032,6 @@  xfs_buf_rele(
 		ASSERT(!(bp->b_flags & _XBF_DELWRI_Q));
 		rhashtable_remove_fast(&pag->pag_buf_hash, &bp->b_rhash_head,
 				       xfs_buf_hash_params);
-		spin_unlock(&pag->pag_buf_lock);
 		xfs_perag_put(pag);
 		freebuf = true;
 	}
@@ -1027,7 +1040,7 @@  xfs_buf_rele(
 	spin_unlock(&bp->b_lock);
 
 	if (freebuf)
-		xfs_buf_free(bp);
+		call_rcu(&bp->rcu_head, __xfs_buf_rcu_free);
 }
 
 
diff --git a/fs/xfs/xfs_buf.h b/fs/xfs/xfs_buf.h
index 37943ac..3f99ea7 100644
--- a/fs/xfs/xfs_buf.h
+++ b/fs/xfs/xfs_buf.h
@@ -210,6 +210,8 @@  typedef struct xfs_buf {
 
 	const struct xfs_buf_ops	*b_ops;
 
+	struct rcu_head		rcu_head;
+
 #ifdef XFS_BUF_LOCK_TRACKING
 	int			b_last_holder;
 #endif
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index 84f7852..1116909 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -393,7 +393,6 @@  typedef struct xfs_perag {
 	unsigned long	pag_ici_reclaim_cursor;	/* reclaim restart point */
 
 	/* buffer cache index */
-	spinlock_t	pag_buf_lock;	/* lock for pag_buf_hash */
 	struct rhashtable pag_buf_hash;
 
 	/* for rcu-safe freeing */