
[3/5] xfs: remove a superfluous hash lookup when inserting new buffers

Message ID 20220403120119.235457-4-hch@lst.de (mailing list archive)
State Superseded
Series [1/5] xfs: add a flags argument to xfs_buf_get

Commit Message

Christoph Hellwig April 3, 2022, 12:01 p.m. UTC
xfs_buf_get_map has a bit of a strange structure where the xfs_buf_find
helper is called twice before we actually insert a new buffer on a cache
miss.  Given that the rhashtable has an interface to insert a new entry
and return the found one on a conflict we can easily get rid of the
double lookup by using that.  To get a straight code path this requires
folding the xfs_buf_find helper into xfs_buf_get_map, but the counter
side we can now easily move the logic for the slow path cache insert
into a new helper.
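
In outline (illustrative only; the exact code is in the patch below),
the single rhashtable call collapses the lookup/insert pair into one
tristate result:

	bp = rhashtable_lookup_get_insert_fast(&pag->pag_buf_hash,
			&new_bp->b_rhash_head, xfs_buf_hash_params);
	if (IS_ERR(bp)) {
		/* insertion failed, e.g. -ENOMEM */
		error = PTR_ERR(bp);
	} else if (bp) {
		/* conflict: an existing buffer is already hashed, use it */
	} else {
		/* no conflict: new_bp is now in the cache */
		bp = new_bp;
	}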

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/xfs/xfs_buf.c | 207 +++++++++++++++++++++++------------------------
 1 file changed, 99 insertions(+), 108 deletions(-)

Comments

Dave Chinner April 3, 2022, 11:04 p.m. UTC | #1
On Sun, Apr 03, 2022 at 02:01:17PM +0200, Christoph Hellwig wrote:
> xfs_buf_get_map has a bit of a strange structure where the xfs_buf_find
> helper is called twice before we actually insert a new buffer on a cache
> miss.  Given that the rhashtable has an interface to insert a new entry
> and return the found one on a conflict we can easily get rid of the
> double lookup by using that.

We can do that without completely rewriting this code.

        spin_lock(&pag->pag_buf_lock);
	if (!new_bp) {
		bp = rhashtable_lookup_fast(&pag->pag_buf_hash, &cmap,
					    xfs_buf_hash_params);
		if (!bp) {
			error = -ENOENT;
			goto not_found;
		}
	} else {
		bp = rhashtable_lookup_get_insert_fast(&pag->pag_buf_hash,
				&new_bp->b_rhash_head, xfs_buf_hash_params);
		if (IS_ERR(bp)) {
			error = PTR_ERR(bp);
			goto not_found;
		}
		if (!bp) {
			/*
			 * Inserted new buffer, it keeps the perag reference until
			 * it is freed.
			 */
			new_bp->b_pag = pag;
			spin_unlock(&pag->pag_buf_lock);
			*found_bp = new_bp;
			return 0;
		}
	}

	atomic_inc(&bp->b_hold);
	spin_unlock(&pag->pag_buf_lock);
	xfs_perag_put(pag);

	<lock buffer>

	*found_bp = bp;
	return 0;

not_found:
	spin_unlock(&pag->pag_buf_lock);
	xfs_perag_put(pag);
	return error;
}

And now we have the existing code with the optimised rhashtable
insertion. I'd much prefer it be separated out like this so that the
rhashtable usage change is a separate, bisectable commit from all the
other changes.

I'm also not sold on moving where we allocate the buffer as done in
this patch, because:

> +static int
> +xfs_buf_insert(
> +	struct xfs_buftarg	*btp,
> +	struct xfs_buf_map	*map,
> +	int			nmaps,
> +	xfs_buf_flags_t		flags,
> +	struct xfs_perag	*pag,
> +	struct xfs_buf		**bpp)
> +{
> +	int			error;
> +	struct xfs_buf		*bp;
> +
> +	error = _xfs_buf_alloc(btp, map, nmaps, flags, &bp);
> +	if (error)
> +		return error;
> +
> +	/*
> +	 * For buffers that fit entirely within a single page, first attempt to
> +	 * allocate the memory from the heap to minimise memory usage. If we
> +	 * can't get heap memory for these small buffers, we fall back to using
> +	 * the page allocator.
> +	 */
> +	if (BBTOB(bp->b_length) >= PAGE_SIZE ||
> +	    xfs_buf_alloc_kmem(bp, flags) < 0) {
> +		error = xfs_buf_alloc_pages(bp, flags);
> +		if (error)
> +			goto out_free_buf;
> +	}
> +
> +	/* the buffer keeps the perag reference until it is freed */
> +	bp->b_pag = pag;
> +
> +	spin_lock(&pag->pag_buf_lock);
> +	*bpp = rhashtable_lookup_get_insert_fast(&pag->pag_buf_hash,
> +			&bp->b_rhash_head, xfs_buf_hash_params);
> +	if (IS_ERR(*bpp)) {
> +		error = PTR_ERR(*bpp);
> +		goto out_unlock;
> +	}
> +	if (*bpp) {
> +		/* found an existing buffer */
> +		atomic_inc(&(*bpp)->b_hold);
> +		error = -EEXIST;
> +		goto out_unlock;
> +	}
> +	spin_unlock(&pag->pag_buf_lock);
> +	*bpp = bp;
> +	return 0;

The return cases of this function end up being a bit of a mess. We can return:

 - error = 0 and a locked buffer in *bpp
 - error = -EEXIST and an unlocked buffer in *bpp
 - error != 0 and a modified *bpp pointer
 - error != 0 and an unmodified *bpp pointer

So we end up with a bunch of logic here to separate out the return
cases into different error values, then have to add logic to the
caller to handle the different return cases.

And if we look at the new caller:

> +	if (unlikely(!bp)) {
> +		if (flags & XBF_NOALLOC) {
> +			XFS_STATS_INC(mp, xb_miss_locked);
> +			xfs_perag_put(pag);
> +			return -ENOENT;
> +		}
> +
> +		error = xfs_buf_insert(btp, map, nmaps, flags, pag, &bp);
> +		if (!error)
> +			goto finish;
> +		if (error != -EEXIST) {
> +			xfs_perag_put(pag);
> +			return error;
> +		}
> +	}
>  	xfs_perag_put(pag);

	<lock the buffer>

It took me quite some time to realise this wasn't buggy - it looks
for all the world like the "!error" case here fails to lock the
buffer and leaks the perag reference. It isn't at all clear that the
new buffer is inserted locked into the hash and that it steals the
caller's perag reference, whilst the -EEXIST case does neither yet
still returns a buffer.

Hence while I like the idea of changing up how we allocate the
buffer on lookup failure, I'm not sold on this interface yet. I think
splitting the rhashtable insert optimisation from the allocation
rework might help clean this up further.

Cheers,

Dave.
Christoph Hellwig April 5, 2022, 3 p.m. UTC | #2
On Mon, Apr 04, 2022 at 09:04:52AM +1000, Dave Chinner wrote:
> On Sun, Apr 03, 2022 at 02:01:17PM +0200, Christoph Hellwig wrote:
> > xfs_buf_get_map has a bit of a strange structure where the xfs_buf_find
> > helper is called twice before we actually insert a new buffer on a cache
> > miss.  Given that the rhashtable has an interface to insert a new entry
> > and return the found one on a conflict we can easily get rid of the
> > double lookup by using that.
> 
> We can do that without completely rewriting this code.

We could.  And I had something similar earlier.  But I actually think
the structure of the code after this patch makes much more sense.  All
the logic for the fast path buffer lookup is clearly laid out in one
function, which then just calls a helper to perform the lookup.
The new scheme is also slightly less code overall, even more so once
the lockless lookup comes into play, which requires different locking
and refcount increments.
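
(For reference, the lockless lookup fast path ends up looking
something like this - hand-waving, not part of this series:

	rcu_read_lock();
	bp = rhashtable_lookup(&pag->pag_buf_hash, &cmap, xfs_buf_hash_params);
	if (bp && !atomic_inc_not_zero(&bp->b_hold))
		bp = NULL;	/* raced with the last release, treat as a miss */
	rcu_read_unlock();

so neither pag_buf_lock nor a plain atomic_inc appear in the fast path
at all.)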

> The return cases of this function end up being a bit of a mess. We can return:
> 
>  - error = 0 and a locked buffer in *bpp
>  - error = -EEXIST and an unlocked buffer in *bpp
>  - error != 0 and a modified *bpp pointer
>  - error != 0 and an unmodified *bpp pointer

The last two are the same  - the *bpp pointer simply is not valid on a
"real" error return.  So the return really is a tristate, similar
to many other places in xfs.
Dave Chinner April 5, 2022, 10:01 p.m. UTC | #3
On Tue, Apr 05, 2022 at 05:00:27PM +0200, Christoph Hellwig wrote:
> On Mon, Apr 04, 2022 at 09:04:52AM +1000, Dave Chinner wrote:
> > On Sun, Apr 03, 2022 at 02:01:17PM +0200, Christoph Hellwig wrote:
> > > xfs_buf_get_map has a bit of a strange structure where the xfs_buf_find
> > > helper is called twice before we actually insert a new buffer on a cache
> > > miss.  Given that the rhashtable has an interface to insert a new entry
> > > and return the found one on a conflict we can easily get rid of the
> > > double lookup by using that.
> > 
> > We can do that without completely rewriting this code.
> 
> We could.  And I had something similar earlier.  But I actually think
> the structure of the code after this patch makes much more sense.  All
> the logic for the fast path buffer lookup is clearly laid out in one
> function, which then just calls a helper to perform the lookup.
> The new scheme is also slightly less code overall, even more so once
> the lockless lookup comes into play, which requires different locking
> and refcount increments.

Agreed, but you're making two distinct, significant modifications in
the one patchset. One is changing the way we use a generic library
functionality, the other is changing the entire structure of the
lookup path.

IOWs, I was not saying the end result was bad, I was (clumsily)
trying to suggest that you should split these two modifications into
separate patches because they are largely separate changes.

Once I thought about it that way, looking at them separately made me
want to structure the code quite differently.

> > The return cases of this function end up being a bit of a mess. We can return:
> > 
> >  - error = 0 and a locked buffer in *bpp
> >  - error = -EEXIST and an unlocked buffer in *bpp
> >  - error != 0 and a modified *bpp pointer
> >  - error != 0 and an unmodified *bpp pointer
> 
> The last two are the same  - the *bpp pointer simply is not valid on a
> "real" error return.  So the return really is a tristate, similar
> to many other places in xfs.

I think you missed the point I was making. I'm not complaining about
whether it's a tristate return or not; it's the fact that it can
return buffers in different states, and the caller has to handle that
inconsistency itself whilst still maintaining an efficient fast
path.

That's what makes the code difficult to follow - xfs_buf_insert() is
the slow path, so all the complexity and twisted logic should be
inside that function rather than directly impacting the fast path
code.

e.g. Most of the complexity goes away if we factor out the buffer
trylock/locking code into a helper (like we have in the iomap code)
and then have xfs_buf_insert() call it when it finds an existing
buffer. Then the -EEXIST return value can go away, and
xfs_buf_insert can return a locked buffer exactly the same as if it
inserted a new buffer. Have the newly allocated buffer take a new
perag reference, too, instead of stealing the caller's reference,
and then all the differences between insert and -EEXIST cases go
away.

Then you can move all the conditional lookup failure cases into
xfs_buf_insert(), too, and we end up with high level fast path code
that is clear and concise:

	/* cache hits generally outnumber misses by at least 10:1 */
	bp = xfs_buf_lookup_fast(pag, &cmap);
	if (likely(bp))
		error = xfs_buf_lookup_lock(bp, flags);
	else
		error = xfs_buf_lookup_insert(pag, &cmap, flags, &bp);

	xfs_perag_put(pag);
	if (!error)
		*bpp = bp;
	return error;
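
where the lock helper is essentially the existing trylock/stale logic
pulled out of line, something like this (name made up, untested):

static int
xfs_buf_lookup_lock(
	struct xfs_buf		*bp,
	xfs_buf_flags_t		flags)
{
	if (!xfs_buf_trylock(bp)) {
		if (flags & XBF_TRYLOCK) {
			xfs_buf_rele(bp);
			XFS_STATS_INC(bp->b_mount, xb_busy_locked);
			return -EAGAIN;
		}
		xfs_buf_lock(bp);
		XFS_STATS_INC(bp->b_mount, xb_get_locked_waited);
	}

	/* clear the stale state so the buffer can be reused */
	if (bp->b_flags & XBF_STALE) {
		ASSERT((bp->b_flags & _XBF_DELWRI_Q) == 0);
		bp->b_flags &= _XBF_KMEM | _XBF_PAGES;
		bp->b_ops = NULL;
	}

	XFS_STATS_INC(bp->b_mount, xb_get_locked);
	return 0;
}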

Cheers,

Dave.
Christoph Hellwig April 6, 2022, 4:26 p.m. UTC | #4
On Wed, Apr 06, 2022 at 08:01:21AM +1000, Dave Chinner wrote:
> Agreed, but you're making two distinct, significant modifications in
> the one patchset. One is changing the way we use a generic library
> functionality, the other is changing the entire structure of the
> lookup path.
> 
> IOWs, I was not saying the end result was bad, I was (clumsily)
> trying to suggest that you should split these two modifications into
> separate patches because they are largely separate changes.
> 
> Once I thought about it that way, looking at them separately made me
> want to structure the code quite differently.

Ok, I'll see if I can split things up a bit better.

> 
> e.g. Most of the complexity goes away if we factor out the buffer
> trylock/locking code into a helper (like we have in the iomap code)
> and then have xfs_buf_insert() call it when it finds an existing
> buffer. Then the -EEXIST return value can go away, and
> xfs_buf_insert can return a locked buffer exactly the same as if it
> inserted a new buffer. Have the newly allocated buffer take a new
> perag reference, too, instead of stealing the caller's reference,
> and then all the differences between insert and -EEXIST cases go
> away.

I actually had that earlier as well and really like the flow of
the single function.  So it certainly is doable.

Patch

diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index 0951b7cbd79675..ef645e15935369 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -504,12 +504,66 @@  xfs_buf_hash_destroy(
 	rhashtable_destroy(&pag->pag_buf_hash);
 }
 
+static int
+xfs_buf_insert(
+	struct xfs_buftarg	*btp,
+	struct xfs_buf_map	*map,
+	int			nmaps,
+	xfs_buf_flags_t		flags,
+	struct xfs_perag	*pag,
+	struct xfs_buf		**bpp)
+{
+	int			error;
+	struct xfs_buf		*bp;
+
+	error = _xfs_buf_alloc(btp, map, nmaps, flags, &bp);
+	if (error)
+		return error;
+
+	/*
+	 * For buffers that fit entirely within a single page, first attempt to
+	 * allocate the memory from the heap to minimise memory usage. If we
+	 * can't get heap memory for these small buffers, we fall back to using
+	 * the page allocator.
+	 */
+	if (BBTOB(bp->b_length) >= PAGE_SIZE ||
+	    xfs_buf_alloc_kmem(bp, flags) < 0) {
+		error = xfs_buf_alloc_pages(bp, flags);
+		if (error)
+			goto out_free_buf;
+	}
+
+	/* the buffer keeps the perag reference until it is freed */
+	bp->b_pag = pag;
+
+	spin_lock(&pag->pag_buf_lock);
+	*bpp = rhashtable_lookup_get_insert_fast(&pag->pag_buf_hash,
+			&bp->b_rhash_head, xfs_buf_hash_params);
+	if (IS_ERR(*bpp)) {
+		error = PTR_ERR(*bpp);
+		goto out_unlock;
+	}
+	if (*bpp) {
+		/* found an existing buffer */
+		atomic_inc(&(*bpp)->b_hold);
+		error = -EEXIST;
+		goto out_unlock;
+	}
+	spin_unlock(&pag->pag_buf_lock);
+	*bpp = bp;
+	return 0;
+
+out_unlock:
+	spin_unlock(&pag->pag_buf_lock);
+out_free_buf:
+	xfs_buf_free(bp);
+	return error;
+}
+
 /*
- * Look up a buffer in the buffer cache and return it referenced and locked
- * in @found_bp.
- *
- * If @new_bp is supplied and we have a lookup miss, insert @new_bp into the
- * cache.
+ * Assemble a buffer covering the specified range. The code is optimised for
+ * cache hits, as metadata intensive workloads will see 3 orders of magnitude
+ * more hits than misses.
  *
  * If XBF_TRYLOCK is set in @flags, only try to lock the buffer and return
  * -EAGAIN if we fail to lock it.
@@ -517,27 +571,26 @@  xfs_buf_hash_destroy(
  * Return values are:
  *	-EFSCORRUPTED if have been supplied with an invalid address
  *	-EAGAIN on trylock failure
- *	-ENOENT if we fail to find a match and @new_bp was NULL
- *	0, with @found_bp:
- *		- @new_bp if we inserted it into the cache
- *		- the buffer we found and locked.
+ *	-ENOENT if we fail to find a match and alloc was %false
+ *	0 if a buffer was found or newly created
  */
-static int
-xfs_buf_find(
+int
+xfs_buf_get_map(
 	struct xfs_buftarg	*btp,
 	struct xfs_buf_map	*map,
 	int			nmaps,
 	xfs_buf_flags_t		flags,
-	struct xfs_buf		*new_bp,
-	struct xfs_buf		**found_bp)
+	struct xfs_buf		**bpp)
 {
+	struct xfs_mount	*mp = btp->bt_mount;
 	struct xfs_perag	*pag;
 	struct xfs_buf		*bp;
 	struct xfs_buf_map	cmap = { .bm_bn = map[0].bm_bn };
+	int			error = 0;
 	xfs_daddr_t		eofs;
 	int			i;
 
-	*found_bp = NULL;
+	*bpp = NULL;
 
 	for (i = 0; i < nmaps; i++)
 		cmap.bm_len += map[i].bm_len;
@@ -550,54 +603,47 @@  xfs_buf_find(
 	 * Corrupted block numbers can get through to here, unfortunately, so we
 	 * have to check that the buffer falls within the filesystem bounds.
 	 */
-	eofs = XFS_FSB_TO_BB(btp->bt_mount, btp->bt_mount->m_sb.sb_dblocks);
-	if (cmap.bm_bn < 0 || cmap.bm_bn >= eofs) {
-		xfs_alert(btp->bt_mount,
-			  "%s: daddr 0x%llx out of range, EOFS 0x%llx",
+	eofs = XFS_FSB_TO_BB(mp, mp->m_sb.sb_dblocks);
+	if (WARN_ON_ONCE(cmap.bm_bn < 0) || WARN_ON_ONCE(cmap.bm_bn >= eofs)) {
+		xfs_alert(mp, "%s: daddr 0x%llx out of range, EOFS 0x%llx",
 			  __func__, cmap.bm_bn, eofs);
-		WARN_ON(1);
 		return -EFSCORRUPTED;
 	}
 
-	pag = xfs_perag_get(btp->bt_mount,
-			    xfs_daddr_to_agno(btp->bt_mount, cmap.bm_bn));
+	pag = xfs_perag_get(mp, xfs_daddr_to_agno(mp, cmap.bm_bn));
 
 	spin_lock(&pag->pag_buf_lock);
 	bp = rhashtable_lookup_fast(&pag->pag_buf_hash, &cmap,
 				    xfs_buf_hash_params);
-	if (bp) {
+	if (bp)
 		atomic_inc(&bp->b_hold);
-		goto found;
-	}
-
-	/* No match found */
-	if (!new_bp) {
-		XFS_STATS_INC(btp->bt_mount, xb_miss_locked);
-		spin_unlock(&pag->pag_buf_lock);
-		xfs_perag_put(pag);
-		return -ENOENT;
-	}
-
-	/* the buffer keeps the perag reference until it is freed */
-	new_bp->b_pag = pag;
-	rhashtable_insert_fast(&pag->pag_buf_hash, &new_bp->b_rhash_head,
-			       xfs_buf_hash_params);
 	spin_unlock(&pag->pag_buf_lock);
-	*found_bp = new_bp;
-	return 0;
 
-found:
-	spin_unlock(&pag->pag_buf_lock);
+	if (unlikely(!bp)) {
+		if (flags & XBF_NOALLOC) {
+			XFS_STATS_INC(mp, xb_miss_locked);
+			xfs_perag_put(pag);
+			return -ENOENT;
+		}
+
+		error = xfs_buf_insert(btp, map, nmaps, flags, pag, &bp);
+		if (!error)
+			goto finish;
+		if (error != -EEXIST) {
+			xfs_perag_put(pag);
+			return error;
+		}
+	}
 	xfs_perag_put(pag);
 
 	if (!xfs_buf_trylock(bp)) {
 		if (flags & XBF_TRYLOCK) {
 			xfs_buf_rele(bp);
-			XFS_STATS_INC(btp->bt_mount, xb_busy_locked);
+			XFS_STATS_INC(mp, xb_busy_locked);
 			return -EAGAIN;
 		}
 		xfs_buf_lock(bp);
-		XFS_STATS_INC(btp->bt_mount, xb_get_locked_waited);
+		XFS_STATS_INC(mp, xb_get_locked_waited);
 	}
 
 	/*
@@ -612,64 +658,12 @@  xfs_buf_find(
 	}
 
 	trace_xfs_buf_find(bp, flags, _RET_IP_);
-	XFS_STATS_INC(btp->bt_mount, xb_get_locked);
-	*found_bp = bp;
-	return 0;
-}
-
-/*
- * Assembles a buffer covering the specified range. The code is optimised for
- * cache hits, as metadata intensive workloads will see 3 orders of magnitude
- * more hits than misses.
- */
-int
-xfs_buf_get_map(
-	struct xfs_buftarg	*target,
-	struct xfs_buf_map	*map,
-	int			nmaps,
-	xfs_buf_flags_t		flags,
-	struct xfs_buf		**bpp)
-{
-	struct xfs_buf		*bp;
-	struct xfs_buf		*new_bp;
-	int			error;
-
-	*bpp = NULL;
-	error = xfs_buf_find(target, map, nmaps, flags, NULL, &bp);
-	if (!error)
-		goto found;
-	if (error != -ENOENT || (flags & XBF_NOALLOC))
-		return error;
-
-	error = _xfs_buf_alloc(target, map, nmaps, flags, &new_bp);
-	if (error)
-		return error;
-
-	/*
-	 * For buffers that fit entirely within a single page, first attempt to
-	 * allocate the memory from the heap to minimise memory usage. If we
-	 * can't get heap memory for these small buffers, we fall back to using
-	 * the page allocator.
-	 */
-	if (BBTOB(new_bp->b_length) >= PAGE_SIZE ||
-	    xfs_buf_alloc_kmem(new_bp, flags) < 0) {
-		error = xfs_buf_alloc_pages(new_bp, flags);
-		if (error)
-			goto out_free_buf;
-	}
-
-	error = xfs_buf_find(target, map, nmaps, flags, new_bp, &bp);
-	if (error)
-		goto out_free_buf;
-
-	if (bp != new_bp)
-		xfs_buf_free(new_bp);
-
-found:
+	XFS_STATS_INC(mp, xb_get_locked);
+finish:
 	if (!bp->b_addr) {
 		error = _xfs_buf_map_pages(bp, flags);
 		if (unlikely(error)) {
-			xfs_warn_ratelimited(target->bt_mount,
+			xfs_warn_ratelimited(mp,
 				"%s: failed to map %u pages", __func__,
 				bp->b_page_count);
 			xfs_buf_relse(bp);
@@ -684,13 +678,10 @@  xfs_buf_get_map(
 	if (!(flags & XBF_READ))
 		xfs_buf_ioerror(bp, 0);
 
-	XFS_STATS_INC(target->bt_mount, xb_get);
+	XFS_STATS_INC(mp, xb_get);
 	trace_xfs_buf_get(bp, flags, _RET_IP_);
 	*bpp = bp;
 	return 0;
-out_free_buf:
-	xfs_buf_free(new_bp);
-	return error;
 }
 
 int
@@ -962,12 +953,12 @@  xfs_buf_rele(
 	/*
 	 * We grab the b_lock here first to serialise racing xfs_buf_rele()
 	 * calls. The pag_buf_lock being taken on the last reference only
-	 * serialises against racing lookups in xfs_buf_find(). IOWs, the second
-	 * to last reference we drop here is not serialised against the last
-	 * reference until we take bp->b_lock. Hence if we don't grab b_lock
-	 * first, the last "release" reference can win the race to the lock and
-	 * free the buffer before the second-to-last reference is processed,
-	 * leading to a use-after-free scenario.
+	 * serialises against racing lookups in xfs_buf_get_map(). IOWs, the
+	 * second to last reference we drop here is not serialised against the
+	 * last reference until we take bp->b_lock. Hence if we don't grab
+	 * b_lock first, the last "release" reference can win the race to the
+	 * lock and free the buffer before the second-to-last reference is
+	 * processed, leading to a use-after-free scenario.
 	 */
 	spin_lock(&bp->b_lock);
 	release = atomic_dec_and_lock(&bp->b_hold, &pag->pag_buf_lock);