diff mbox

[1/4] xfs: fix bogus minleft manipulations

Message ID 1481644767-9098-2-git-send-email-hch@lst.de (mailing list archive)
State Superseded
Headers show

Commit Message

Christoph Hellwig Dec. 13, 2016, 3:59 p.m. UTC
We can't just set minleft to 0 when we're low on space - that's exactly
what we need minleft for: to protect space in the AG for btree block
allocations when we are low on free space.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/xfs/libxfs/xfs_alloc.c      | 24 +++++++-----------------
 fs/xfs/libxfs/xfs_bmap.c       |  3 ---
 fs/xfs/libxfs/xfs_bmap_btree.c | 14 --------------
 3 files changed, 7 insertions(+), 34 deletions(-)

Comments

Brian Foster Dec. 14, 2016, 9:51 p.m. UTC | #1
On Wed, Dec 14, 2016 at 08:36:26PM +0100, Christoph Hellwig wrote:
> On Wed, Dec 14, 2016 at 12:35:07PM -0500, Brian Foster wrote:
> > I'm still trying to grok the minleft/dop_low stuff, but doesn't this
> > increase the risk of bmbt block allocation failure?
> 
> No.  If working correctly it reduces the chance of a failing bmbt block
> allocation in transaction to zero, at the expense of potentially
> failing whole block allocations while we still have a few blocks left.
> But given that the first leads to a fs shutdown and the latter just to
> an ENOSPC it seems worthwhile.
> 

Indeed, I completely agree in principle with the series. It's much
better to leave some free space efficiency on the table for greater
reliability.  I just want to make sure I understand how this change
works and whether it does so correctly...

> > E.g., args.minleft
> > is set to the transaction reservation earlier in the function. AFAIK the
> > res is not guaranteed to be in a particular AG, but in that case we're
> > just setting a start AG and ultimately cycling through the AGs. If that
> > fails, this hunk resets minleft, restarts at fsbno 0 and activates the
> > sequential agno allocation requirement for subsequent allocs. Doesn't
> > removing this logic mean we can successfully reserve blocks in a write
> > transaction that will never ultimately be able to allocate bmbt blocks,
> > even if the data blocks can all be allocated..?
> 
> We only set it to res if no firstblock was set.  The only case where
> that is possible during the first bmbt block allocation in a reflink
> operation - for actual direct I/O writes, fallocate calls or unwritten
> extent conversion we will always have firstblock set from the allocation
> of the actual data extent.  And that allocation now has the proper minleft
> set after this series so that we are guaranteed we have enough space
> in one single AG for the whole allocation.
> 

Ok, I need to stare at that codepath a bit more..

In the meantime, what about the delalloc codepath as well? We reserve
indlen out of the global pool at write time and track the res in the
in-core extents. Writeback comes around later and we try to allocate
real blocks without any need for transaction reservation. Now that we
set 'args.minleft = xfs_bmap_worst_indlen()' in xfs_bmap_btalloc() and
never reset it, what guarantees any particular AG can satisfy that
request based on the global reservation?

I'm basically just a little nervous that we might rely on some of these
minleft hacks to avoid serious failures like in bmbt allocation or
writeback, despite the fact that the broader mechanism may be bogus...

Brian

> In the reflink case where firstblock wasn't set the minleft ensures
> that we fail the first allocation if we don't have enough blocks
> in a single AG.
> retryloop to try the allocation 
> 
> > The other thing to note here is that minleft is initialized to 0 and
> > only set to the transaction res conditionally, so if firstblock is
> > already set we won't use minleft at all.
> 
> Yes, that's intentional - we already did our minleft check when allocating
> the data extent.
--
To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Dave Chinner Dec. 15, 2016, 10:09 p.m. UTC | #2
On Tue, Dec 13, 2016 at 04:59:24PM +0100, Christoph Hellwig wrote:
> We can't just set minleft to 0 when we're low on space - that's exactly
> what we need minleft for: to protect space in the AG for btree block
> allocations when we are low on free space.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>
....

The xfs_alloc_vextent() loop changes look fine. The minleft hacks
always troubled me.

> ---
> diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
> index 2760bc3..44773c9 100644
> --- a/fs/xfs/libxfs/xfs_bmap.c
> +++ b/fs/xfs/libxfs/xfs_bmap.c
> @@ -3812,7 +3812,6 @@ xfs_bmap_btalloc(
>  		args.fsbno = 0;
>  		args.type = XFS_ALLOCTYPE_FIRST_AG;
>  		args.total = ap->minlen;
> -		args.minleft = 0;
>  		if ((error = xfs_alloc_vextent(&args)))
>  			return error;
>  		ap->dfops->dop_low = true;

But this looks wrong. when combined with the next patch that sets:

	args.minleft = xfs_bmap_worst_indlen(ap->ip, args.maxlen);

we've got a minleft value that might be for multigigabyte
allocation, but now we are only asking for a total allocation
of minlen, and that might be 1 block. IOWs, shouldn't this "last,
final attempt" code actually do something like this:

		args.maxlen = ap->minlen;
		args.total = ap->minlen;
		args.minleft = xfs_bmap_worst_indlen(ap->ip, args.maxlen);

> +++ b/fs/xfs/libxfs/xfs_bmap_btree.c
> @@ -499,20 +499,6 @@ xfs_bmbt_alloc_block(
>  		goto try_another_ag;
>  	}
>  
> -	if (args.fsbno == NULLFSBLOCK && args.minleft) {
> -		/*
> -		 * Could not find an AG with enough free space to satisfy
> -		 * a full btree split.  Try again without minleft and if
> -		 * successful activate the lowspace algorithm.
> -		 */
> -		args.fsbno = 0;
> -		args.type = XFS_ALLOCTYPE_FIRST_AG;
> -		args.minleft = 0;
> -		error = xfs_alloc_vextent(&args);
> -		if (error)
> -			goto error0;
> -		cur->bc_private.b.dfops->dop_low = true;
> -	}
>  	if (args.fsbno == NULLFSBLOCK) {
>  		XFS_BTREE_TRACE_CURSOR(cur, XBT_EXIT);
>  		*stat = 0;

Yes, that's fair enough considering the common in the block that
set args.minleft, and that was a bug fix introduced long after this
fallback was put in place because the fallback could still fail...

Cheers,

Dave.
Christoph Hellwig Dec. 16, 2016, 8:20 a.m. UTC | #3
On Fri, Dec 16, 2016 at 09:09:38AM +1100, Dave Chinner wrote:
> > @@ -3812,7 +3812,6 @@ xfs_bmap_btalloc(
> >  		args.fsbno = 0;
> >  		args.type = XFS_ALLOCTYPE_FIRST_AG;
> >  		args.total = ap->minlen;
> > -		args.minleft = 0;
> >  		if ((error = xfs_alloc_vextent(&args)))
> >  			return error;
> >  		ap->dfops->dop_low = true;
> 
> But this looks wrong. when combined with the next patch that sets:
> 
> 	args.minleft = xfs_bmap_worst_indlen(ap->ip, args.maxlen);
> 
> we've got a minleft value that might be for multigigabyte
> allocation, but now we are only asking for a total allocation
> of minlen, and that might be 1 block. IOWs, shouldn't this "last,
> final attempt" code actually do something like this:
> 
> 		args.maxlen = ap->minlen;
> 		args.total = ap->minlen;
> 		args.minleft = xfs_bmap_worst_indlen(ap->ip, args.maxlen);

Yes.  And I actually had that in a previous iteration, and it got
dropped at some point.
--
To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Christoph Hellwig Dec. 16, 2016, 8:21 a.m. UTC | #4
On Thu, Dec 15, 2016 at 09:34:33AM -0500, Brian Foster wrote:
> FWIW, I was playing with this a bit more and managed to manufacture a
> filesystem layout that this series doesn't handle too well. Emphasis on
> "manufactured" because this might not be a likely real world scenario,
> but either way the current code handles it fine.
> 
> I've attached a metadump of the offending image. mdestore it, mount and
> attempt something like 'dd if=/dev/zero of=/mnt/file' on the root. The
> buffered write looks like it's in a livelock, waiting indefinitely for a
> writeback cycle that will never complete...

Ok, I'll give it a spin.
--
To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Darrick J. Wong Jan. 4, 2017, 6:32 a.m. UTC | #5
On Fri, Dec 16, 2016 at 09:20:44AM +0100, Christoph Hellwig wrote:
> On Fri, Dec 16, 2016 at 09:09:38AM +1100, Dave Chinner wrote:
> > > @@ -3812,7 +3812,6 @@ xfs_bmap_btalloc(
> > >  		args.fsbno = 0;
> > >  		args.type = XFS_ALLOCTYPE_FIRST_AG;
> > >  		args.total = ap->minlen;
> > > -		args.minleft = 0;
> > >  		if ((error = xfs_alloc_vextent(&args)))
> > >  			return error;
> > >  		ap->dfops->dop_low = true;
> > 
> > But this looks wrong. when combined with the next patch that sets:
> > 
> > 	args.minleft = xfs_bmap_worst_indlen(ap->ip, args.maxlen);
> > 
> > we've got a minleft value that might be for multigigabyte
> > allocation, but now we are only asking for a total allocation
> > of minlen, and that might be 1 block. IOWs, shouldn't this "last,
> > final attempt" code actually do something like this:
> > 
> > 		args.maxlen = ap->minlen;
> > 		args.total = ap->minlen;
> > 		args.minleft = xfs_bmap_worst_indlen(ap->ip, args.maxlen);
> 
> Yes.  And I actually had that in a previous iteration, and it got
> dropped at some point.

I see the diff hunk in question hasn't changed in the v2 patchset...
Did Dave's suggestion not work?

--D
--
To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Christoph Hellwig Jan. 4, 2017, 8:50 a.m. UTC | #6
> > Yes.  And I actually had that in a previous iteration, and it got
> > dropped at some point.
> 
> I see the diff hunk in question hasn't changed in the v2 patchset...
> Did Dave's suggestion not work?

We don't need it anymore.  The previous iteration set minleft to the
worst case indirect blocks for the requested extent length.  This series
instead sticks to the old minleft value for just a single extent
allocation and thus doesn't need the fallback as the minleft value
can't get any smaller.
--
To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
index 5050056..91a6c17 100644
--- a/fs/xfs/libxfs/xfs_alloc.c
+++ b/fs/xfs/libxfs/xfs_alloc.c
@@ -2638,12 +2638,10 @@  xfs_alloc_vextent(
 	xfs_agblock_t	agsize;	/* allocation group size */
 	int		error;
 	int		flags;	/* XFS_ALLOC_FLAG_... locking flags */
-	xfs_extlen_t	minleft;/* minimum left value, temp copy */
 	xfs_mount_t	*mp;	/* mount structure pointer */
 	xfs_agnumber_t	sagno;	/* starting allocation group number */
 	xfs_alloctype_t	type;	/* input allocation type */
 	int		bump_rotor = 0;
-	int		no_min = 0;
 	xfs_agnumber_t	rotorstep = xfs_rotorstep; /* inode32 agf stepper */
 
 	mp = args->mp;
@@ -2672,7 +2670,6 @@  xfs_alloc_vextent(
 		trace_xfs_alloc_vextent_badargs(args);
 		return 0;
 	}
-	minleft = args->minleft;
 
 	switch (type) {
 	case XFS_ALLOCTYPE_THIS_AG:
@@ -2683,9 +2680,7 @@  xfs_alloc_vextent(
 		 */
 		args->agno = XFS_FSB_TO_AGNO(mp, args->fsbno);
 		args->pag = xfs_perag_get(mp, args->agno);
-		args->minleft = 0;
 		error = xfs_alloc_fix_freelist(args, 0);
-		args->minleft = minleft;
 		if (error) {
 			trace_xfs_alloc_vextent_nofix(args);
 			goto error0;
@@ -2750,9 +2745,7 @@  xfs_alloc_vextent(
 		 */
 		for (;;) {
 			args->pag = xfs_perag_get(mp, args->agno);
-			if (no_min) args->minleft = 0;
 			error = xfs_alloc_fix_freelist(args, flags);
-			args->minleft = minleft;
 			if (error) {
 				trace_xfs_alloc_vextent_nofix(args);
 				goto error0;
@@ -2792,20 +2785,17 @@  xfs_alloc_vextent(
 			 * or switch to non-trylock mode.
 			 */
 			if (args->agno == sagno) {
-				if (no_min == 1) {
+				if (flags == 0) {
 					args->agbno = NULLAGBLOCK;
 					trace_xfs_alloc_vextent_allfailed(args);
 					break;
 				}
-				if (flags == 0) {
-					no_min = 1;
-				} else {
-					flags = 0;
-					if (type == XFS_ALLOCTYPE_START_BNO) {
-						args->agbno = XFS_FSB_TO_AGBNO(mp,
-							args->fsbno);
-						args->type = XFS_ALLOCTYPE_NEAR_BNO;
-					}
+
+				flags = 0;
+				if (type == XFS_ALLOCTYPE_START_BNO) {
+					args->agbno = XFS_FSB_TO_AGBNO(mp,
+						args->fsbno);
+					args->type = XFS_ALLOCTYPE_NEAR_BNO;
 				}
 			}
 			xfs_perag_put(args->pag);
diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index 2760bc3..44773c9 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -3812,7 +3812,6 @@  xfs_bmap_btalloc(
 		args.fsbno = 0;
 		args.type = XFS_ALLOCTYPE_FIRST_AG;
 		args.total = ap->minlen;
-		args.minleft = 0;
 		if ((error = xfs_alloc_vextent(&args)))
 			return error;
 		ap->dfops->dop_low = true;
@@ -4344,8 +4343,6 @@  xfs_bmapi_allocate(
 	if (error)
 		return error;
 
-	if (bma->dfops->dop_low)
-		bma->minleft = 0;
 	if (bma->cur)
 		bma->cur->bc_private.b.firstblock = *bma->firstblock;
 	if (bma->blkno == NULLFSBLOCK)
diff --git a/fs/xfs/libxfs/xfs_bmap_btree.c b/fs/xfs/libxfs/xfs_bmap_btree.c
index d6330c2..9581ee8 100644
--- a/fs/xfs/libxfs/xfs_bmap_btree.c
+++ b/fs/xfs/libxfs/xfs_bmap_btree.c
@@ -499,20 +499,6 @@  xfs_bmbt_alloc_block(
 		goto try_another_ag;
 	}
 
-	if (args.fsbno == NULLFSBLOCK && args.minleft) {
-		/*
-		 * Could not find an AG with enough free space to satisfy
-		 * a full btree split.  Try again without minleft and if
-		 * successful activate the lowspace algorithm.
-		 */
-		args.fsbno = 0;
-		args.type = XFS_ALLOCTYPE_FIRST_AG;
-		args.minleft = 0;
-		error = xfs_alloc_vextent(&args);
-		if (error)
-			goto error0;
-		cur->bc_private.b.dfops->dop_low = true;
-	}
 	if (args.fsbno == NULLFSBLOCK) {
 		XFS_BTREE_TRACE_CURSOR(cur, XBT_EXIT);
 		*stat = 0;