diff mbox series

[REPOST,1/2] xfs: drop minlen before tossing alignment on bmap allocs

Message ID 20190912143223.24194-2-bfoster@redhat.com (mailing list archive)
State New, archived
Headers show
Series xfs: rely on minleft instead of total for bmbt res | expand

Commit Message

Brian Foster Sept. 12, 2019, 2:32 p.m. UTC
The bmap block allocation code issues a sequence of retries to
perform an optimal allocation, gradually loosening constraints as
allocations fail. For example, the first attempt might begin at a
particular bno, with maxlen == minlen and alignment incorporated. As
allocations fail, the parameters fall back to different modes, drop
alignment requirements and reduce the minlen and total block
requirements.

For large extent allocations with an args.total value that exceeds
the allocation length (i.e., non-delalloc), the total value tends to
dominate despite these fallbacks. For example, an aligned extent
allocation request of tens to hundreds of MB that cannot be
satisfied from a particular AG will not succeed after dropping
alignment or minlen because xfs_alloc_space_available() never
selects an AG that can't satisfy args.total. The retry sequence
eventually reduces total and ultimately succeeds if a minlen extent
is available somewhere, but the first several retries are
effectively pointless in this scenario.

Beyond simply being inefficient, another side effect of this
behavior is that we drop alignment requirements too aggressively.
Consider a 1GB fallocate on a 15GB fs with 16 AGs and 128k stripe
unit:

 # xfs_io -c "falloc 0 1g" /mnt/file
 # <xfstests>/src/t_stripealign /mnt/file 32
 /mnt/file: Start block 347176 not multiple of sunit 32

Despite the filesystem being completely empty, the fact that the
allocation request cannot be satisifed from a single AG means the
allocation doesn't succeed until xfs_bmap_btalloc() drops total from
the original value based on maxlen. This occurs after we've dropped
minlen and alignment (unnecessarily).

As a step towards addressing this problem, insert a new retry in the
bmap allocation sequence to drop minlen (from maxlen) before tossing
alignment. This should still result in as large of an extent as
possible as the block allocator prioritizes extent size in all but
exact allocation modes. By itself, this does not change the behavior
of the command above because the preallocation code still specifies
total based on maxlen. Instead, this facilitates preservation of
alignment once extra reservation is separated from the extent length
portion of the total block requirement.

Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/libxfs/xfs_bmap.c | 12 +++++++++---
 1 file changed, 9 insertions(+), 3 deletions(-)

Comments

Dave Chinner Sept. 12, 2019, 10:35 p.m. UTC | #1
On Thu, Sep 12, 2019 at 10:32:22AM -0400, Brian Foster wrote:
> The bmap block allocation code issues a sequence of retries to
> perform an optimal allocation, gradually loosening constraints as
> allocations fail. For example, the first attempt might begin at a
> particular bno, with maxlen == minlen and alignment incorporated. As
> allocations fail, the parameters fall back to different modes, drop
> alignment requirements and reduce the minlen and total block
> requirements.
> 
> For large extent allocations with an args.total value that exceeds
> the allocation length (i.e., non-delalloc), the total value tends to
> dominate despite these fallbacks. For example, an aligned extent
> allocation request of tens to hundreds of MB that cannot be
> satisfied from a particular AG will not succeed after dropping
> alignment or minlen because xfs_alloc_space_available() never
> selects an AG that can't satisfy args.total. The retry sequence
> eventually reduces total and ultimately succeeds if a minlen extent
> is available somewhere, but the first several retries are
> effectively pointless in this scenario.
> 
> Beyond simply being inefficient, another side effect of this
> behavior is that we drop alignment requirements too aggressively.
> Consider a 1GB fallocate on a 15GB fs with 16 AGs and 128k stripe
> unit:
> 
>  # xfs_io -c "falloc 0 1g" /mnt/file
>  # <xfstests>/src/t_stripealign /mnt/file 32
>  /mnt/file: Start block 347176 not multiple of sunit 32

Ok, so what Carlos and I found last night was an issue with the
the agresv code leading to the maximum free extent calculated
by xfs_alloc_longest_free_extent() being longer than the largest
allowable extent allocation (mp->m_ag_max_usable) resulting in the
situation where blen > args->maxlen, and so in the case of initial
allocation here, we never run this:

	/*
	 * Adjust for alignment
	 */
	if (blen > args.alignment && blen <= args.maxlen)
		args.minlen = blen - args.alignment;
	args.minalignslop = 0;

this is how we end up with args.minlen = args.maxlen and the
initial allocation failing.

The issue is the way mp->m_ag_max_usable is calculated versus how
the pag->pag_meta_resv.ar_reserved value is set up for the finobt.
That is, "ask" = max tree size, and "used" = 1 because we have a
root block allocated. that code does:

	mp->m_ag_max_unused -= ask;
...
	pag->pag_meta_resv.ar_reserved = ask - used

That means when we calculate the longest extent in the AG, we do:

	longest = pag->pagf_longest - min_needed - resv(NONE)
		= pag->pagf_longest - min_needed - pag->pag_meta_resv.ar_reserved

whilst mp->m_ag_max_usable is calculated as

	usable = agf_length - AG header blocks - AGFL - resv(ask)

When the AG is empty, this ends up with

	pag->pagf_longest = agf_length - AG header blocks
and
	min_needed = AGFL blocks
and
	resv(ask) = pag->pag_meta_resv.ar_reserved + 1

and so:
	longest = usable + 1

And that's how we get blen = maxlen + 1, and that's why alignment is
being dropped for the initial allocation in this "allocate full AG"
corner case.

> Despite the filesystem being completely empty, the fact that the
> allocation request cannot be satisifed from a single AG means the
> allocation doesn't succeed until xfs_bmap_btalloc() drops total from
> the original value based on maxlen. This occurs after we've dropped
> minlen and alignment (unnecessarily).

Right, we'll continue to fail until minlen is reduced appropriately.
But that's not an issue in the fallback algorithms, that's a problem
with the initial conditions not being set up correctly.

> As a step towards addressing this problem, insert a new retry in the
> bmap allocation sequence to drop minlen (from maxlen) before tossing
> alignment. This should still result in as large of an extent as
> possible as the block allocator prioritizes extent size in all but
> exact allocation modes. By itself, this does not change the behavior
> of the command above because the preallocation code still specifies
> total based on maxlen. Instead, this facilitates preservation of
> alignment once extra reservation is separated from the extent length
> portion of the total block requirement.

AFAICT this is not necessary. The prototypoe patch I wrote last
night while working through this with Carlos is attached below. I
updated with a variant of your patch 2 to demonstrate that it does
actually solve the problem of full AG allocation failing to be
aligned.

Cheers,

Dave.
Dave Chinner Sept. 12, 2019, 10:46 p.m. UTC | #2
On Fri, Sep 13, 2019 at 08:35:19AM +1000, Dave Chinner wrote:
> On Thu, Sep 12, 2019 at 10:32:22AM -0400, Brian Foster wrote:
> > The bmap block allocation code issues a sequence of retries to
> > perform an optimal allocation, gradually loosening constraints as
> > allocations fail. For example, the first attempt might begin at a
> > particular bno, with maxlen == minlen and alignment incorporated. As
> > allocations fail, the parameters fall back to different modes, drop
> > alignment requirements and reduce the minlen and total block
> > requirements.
> > 
> > For large extent allocations with an args.total value that exceeds
> > the allocation length (i.e., non-delalloc), the total value tends to
> > dominate despite these fallbacks. For example, an aligned extent
> > allocation request of tens to hundreds of MB that cannot be
> > satisfied from a particular AG will not succeed after dropping
> > alignment or minlen because xfs_alloc_space_available() never
> > selects an AG that can't satisfy args.total. The retry sequence
> > eventually reduces total and ultimately succeeds if a minlen extent
> > is available somewhere, but the first several retries are
> > effectively pointless in this scenario.
> > 
> > Beyond simply being inefficient, another side effect of this
> > behavior is that we drop alignment requirements too aggressively.
> > Consider a 1GB fallocate on a 15GB fs with 16 AGs and 128k stripe
> > unit:
> > 
> >  # xfs_io -c "falloc 0 1g" /mnt/file
> >  # <xfstests>/src/t_stripealign /mnt/file 32
> >  /mnt/file: Start block 347176 not multiple of sunit 32
> 
> Ok, so what Carlos and I found last night was an issue with the
> the agresv code leading to the maximum free extent calculated
> by xfs_alloc_longest_free_extent() being longer than the largest
> allowable extent allocation (mp->m_ag_max_usable) resulting in the
> situation where blen > args->maxlen, and so in the case of initial
> allocation here, we never run this:

Just to make it clear: carlos did all the hard work of narrowing it
down and isolating the accounting discrepancy in the allocation
code. All I did was put 2 and 2 together - the agresv discrepancy -
wrote a quick patch and did a trace to make sure I didn't get 5...

Cheers,

Dave.
Brian Foster Sept. 13, 2019, 2:58 p.m. UTC | #3
On Fri, Sep 13, 2019 at 08:35:19AM +1000, Dave Chinner wrote:
> On Thu, Sep 12, 2019 at 10:32:22AM -0400, Brian Foster wrote:
> > The bmap block allocation code issues a sequence of retries to
> > perform an optimal allocation, gradually loosening constraints as
> > allocations fail. For example, the first attempt might begin at a
> > particular bno, with maxlen == minlen and alignment incorporated. As
> > allocations fail, the parameters fall back to different modes, drop
> > alignment requirements and reduce the minlen and total block
> > requirements.
> > 
> > For large extent allocations with an args.total value that exceeds
> > the allocation length (i.e., non-delalloc), the total value tends to
> > dominate despite these fallbacks. For example, an aligned extent
> > allocation request of tens to hundreds of MB that cannot be
> > satisfied from a particular AG will not succeed after dropping
> > alignment or minlen because xfs_alloc_space_available() never
> > selects an AG that can't satisfy args.total. The retry sequence
> > eventually reduces total and ultimately succeeds if a minlen extent
> > is available somewhere, but the first several retries are
> > effectively pointless in this scenario.
> > 
> > Beyond simply being inefficient, another side effect of this
> > behavior is that we drop alignment requirements too aggressively.
> > Consider a 1GB fallocate on a 15GB fs with 16 AGs and 128k stripe
> > unit:
> > 
> >  # xfs_io -c "falloc 0 1g" /mnt/file
> >  # <xfstests>/src/t_stripealign /mnt/file 32
> >  /mnt/file: Start block 347176 not multiple of sunit 32
> 
> Ok, so what Carlos and I found last night was an issue with the
> the agresv code leading to the maximum free extent calculated
> by xfs_alloc_longest_free_extent() being longer than the largest
> allowable extent allocation (mp->m_ag_max_usable) resulting in the
> situation where blen > args->maxlen, and so in the case of initial
> allocation here, we never run this:
> 
> 	/*
> 	 * Adjust for alignment
> 	 */
> 	if (blen > args.alignment && blen <= args.maxlen)
> 		args.minlen = blen - args.alignment;
> 	args.minalignslop = 0;
> 

Interesting, I think I missed this logic when I looked at this
originally. It makes sense that we should be allowing this to handle the
maxlen = minlen alignment case.

> this is how we end up with args.minlen = args.maxlen and the
> initial allocation failing.
> 
> The issue is the way mp->m_ag_max_usable is calculated versus how
> the pag->pag_meta_resv.ar_reserved value is set up for the finobt.
> That is, "ask" = max tree size, and "used" = 1 because we have a
> root block allocated. that code does:
> 

Ok, I see that this behavior changes without perag reservations in the
mix (i.e., disable finobt, reflink, etc.). The perag reservation
calculation stuff always confuses me, however..

> 	mp->m_ag_max_unused -= ask;
> ...
> 	pag->pag_meta_resv.ar_reserved = ask - used
> 
> That means when we calculate the longest extent in the AG, we do:
> 
> 	longest = pag->pagf_longest - min_needed - resv(NONE)
> 		= pag->pagf_longest - min_needed - pag->pag_meta_resv.ar_reserved
> 

So here we use ar_reserved, which reflects outstanding and so far unused
reservation for the metadata type.

> whilst mp->m_ag_max_usable is calculated as
> 
> 	usable = agf_length - AG header blocks - AGFL - resv(ask)
> 

And here we apparently use the total reservation requirement, regardless
of how much is used, because we're calculating the maximum amount ever
available from the AG.

> When the AG is empty, this ends up with
> 
> 	pag->pagf_longest = agf_length - AG header blocks
> and
> 	min_needed = AGFL blocks
> and
> 	resv(ask) = pag->pag_meta_resv.ar_reserved + 1
> 
> and so:
> 	longest = usable + 1
> 
> And that's how we get blen = maxlen + 1, and that's why alignment is
> being dropped for the initial allocation in this "allocate full AG"
> corner case.
> 

And maxlen is capped to ->m_ag_max_usable, hence the delta between the
two values. I think this makes sense. Thanks for the breakdown.

> > Despite the filesystem being completely empty, the fact that the
> > allocation request cannot be satisifed from a single AG means the
> > allocation doesn't succeed until xfs_bmap_btalloc() drops total from
> > the original value based on maxlen. This occurs after we've dropped
> > minlen and alignment (unnecessarily).
> 
> Right, we'll continue to fail until minlen is reduced appropriately.
> But that's not an issue in the fallback algorithms, that's a problem
> with the initial conditions not being set up correctly.
> 
> > As a step towards addressing this problem, insert a new retry in the
> > bmap allocation sequence to drop minlen (from maxlen) before tossing
> > alignment. This should still result in as large of an extent as
> > possible as the block allocator prioritizes extent size in all but
> > exact allocation modes. By itself, this does not change the behavior
> > of the command above because the preallocation code still specifies
> > total based on maxlen. Instead, this facilitates preservation of
> > alignment once extra reservation is separated from the extent length
> > portion of the total block requirement.
> 
> AFAICT this is not necessary. The prototypoe patch I wrote last
> night while working through this with Carlos is attached below. I
> updated with a variant of your patch 2 to demonstrate that it does
> actually solve the problem of full AG allocation failing to be
> aligned.
> 

I agree that this addresses the reported issue, but I can reproduce
other corner cases affected by the original patch that aren't affected
by this one. For example, if the allocation request happens to be
slightly less than blen but not enough to allow for alignment, minlen
isn't dropped and we can run through the same allocation retry sequence
that kills off alignment before success.

This does raise an interesting question in that current allocation
behavior allocates one extent out of whatever that largest free extent
happened to be in the AG, regardless of whether it's aligned (because
args.alignment drops to 1 before minlen). With the retry in this patch,
we'd satisfy alignment but end up allocating two extents in the file
after chopping alignment off the first free extent.

I could see a reasonable argument for either. On one hand contiguity may
be preferable to alignment. On the other, the difference between one or
two extents in this case is basically the alignment size. Getting back
an unaligned extent of say 200MB over an aligned extent 128k smaller
might not be worth it, or at the very least unexpected to users who
don't understand XFS geometry when there are multiples of that amount of
free space still available in the broader fs. I don't feel too strongly
about it either way, but figured it's worth noting..

...
> 
> xfs: cap longest free extent to maximum allocatable
> 
> From: Dave Chinner <dchinner@redhat.com>
> 
> Cap longest extent to the largest we can allocate based on limits
> calculated at mount time. Dynamic state (such as finobt blocks)
> can result in the longest free extent exceeding the size we can
> allocate, and that results in failure to align full AG allocations
> when the AG is empty.
> 
...
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---

Are either of you guys planning to post this patch without the
bmapi.total hack?

Brian

>  fs/xfs/libxfs/xfs_alloc.c |  3 ++-
>  fs/xfs/libxfs/xfs_bmap.c  | 14 ++++++++++++++
>  2 files changed, 16 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
> index 533b04aaf6f6..9dead25d2e70 100644
> --- a/fs/xfs/libxfs/xfs_alloc.c
> +++ b/fs/xfs/libxfs/xfs_alloc.c
> @@ -1989,7 +1989,8 @@ xfs_alloc_longest_free_extent(
>  	 * reservations and AGFL rules in place, we can return this extent.
>  	 */
>  	if (pag->pagf_longest > delta)
> -		return pag->pagf_longest - delta;
> +		return min_t(xfs_extlen_t, pag->pag_mount->m_ag_max_usable,
> +				pag->pagf_longest - delta);
>  
>  	/* Otherwise, let the caller try for 1 block if there's space. */
>  	return pag->pagf_flcount > 0 || pag->pagf_longest > 0;
> diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
> index 054b4ce30033..b05683f649a6 100644
> --- a/fs/xfs/libxfs/xfs_bmap.c
> +++ b/fs/xfs/libxfs/xfs_bmap.c
> @@ -4286,6 +4286,20 @@ xfs_bmapi_write(
>  #endif
>  	whichfork = xfs_bmapi_whichfork(flags);
>  
> +	/*
> +	 * XXX: Hack!
> +	 *
> +	 * If the total blocks requested is larger than an AG, we can't allocate
> +	 * all the space atomically and within a single AG. This will be a
> +	 * "short" allocation. In this case, just ignore the total block count
> +	 * and rely on minleft calculations to ensure the allocation we do fits
> +	 * inside an AG properly.
> +	 *
> +	 * Based on a patch from Brian.
> +	 */
> +	if (bma.total > mp->m_ag_max_usable)
> +		bma.total = 0;
> +
>  	ASSERT(*nmap >= 1);
>  	ASSERT(*nmap <= XFS_BMAP_MAX_NMAP);
>  	ASSERT(tp != NULL);
Dave Chinner Sept. 14, 2019, 10 p.m. UTC | #4
On Fri, Sep 13, 2019 at 10:58:02AM -0400, Brian Foster wrote:
> On Fri, Sep 13, 2019 at 08:35:19AM +1000, Dave Chinner wrote:
> > On Thu, Sep 12, 2019 at 10:32:22AM -0400, Brian Foster wrote:
> > > The bmap block allocation code issues a sequence of retries to
> > > perform an optimal allocation, gradually loosening constraints as
> > > allocations fail. For example, the first attempt might begin at a
> > > particular bno, with maxlen == minlen and alignment incorporated. As
> > > allocations fail, the parameters fall back to different modes, drop
> > > alignment requirements and reduce the minlen and total block
> > > requirements.
> > > 
> > > For large extent allocations with an args.total value that exceeds
> > > the allocation length (i.e., non-delalloc), the total value tends to
> > > dominate despite these fallbacks. For example, an aligned extent
> > > allocation request of tens to hundreds of MB that cannot be
> > > satisfied from a particular AG will not succeed after dropping
> > > alignment or minlen because xfs_alloc_space_available() never
> > > selects an AG that can't satisfy args.total. The retry sequence
> > > eventually reduces total and ultimately succeeds if a minlen extent
> > > is available somewhere, but the first several retries are
> > > effectively pointless in this scenario.
> > > 
> > > Beyond simply being inefficient, another side effect of this
> > > behavior is that we drop alignment requirements too aggressively.
> > > Consider a 1GB fallocate on a 15GB fs with 16 AGs and 128k stripe
> > > unit:
> > > 
> > >  # xfs_io -c "falloc 0 1g" /mnt/file
> > >  # <xfstests>/src/t_stripealign /mnt/file 32
> > >  /mnt/file: Start block 347176 not multiple of sunit 32
> > 
> > Ok, so what Carlos and I found last night was an issue with the
> > the agresv code leading to the maximum free extent calculated
> > by xfs_alloc_longest_free_extent() being longer than the largest
> > allowable extent allocation (mp->m_ag_max_usable) resulting in the
> > situation where blen > args->maxlen, and so in the case of initial
> > allocation here, we never run this:
> > 
> > 	/*
> > 	 * Adjust for alignment
> > 	 */
> > 	if (blen > args.alignment && blen <= args.maxlen)
> > 		args.minlen = blen - args.alignment;
> > 	args.minalignslop = 0;
> > 
....
> > > As a step towards addressing this problem, insert a new retry in the
> > > bmap allocation sequence to drop minlen (from maxlen) before tossing
> > > alignment. This should still result in as large of an extent as
> > > possible as the block allocator prioritizes extent size in all but
> > > exact allocation modes. By itself, this does not change the behavior
> > > of the command above because the preallocation code still specifies
> > > total based on maxlen. Instead, this facilitates preservation of
> > > alignment once extra reservation is separated from the extent length
> > > portion of the total block requirement.
> > 
> > AFAICT this is not necessary. The prototypoe patch I wrote last
> > night while working through this with Carlos is attached below. I
> > updated with a variant of your patch 2 to demonstrate that it does
> > actually solve the problem of full AG allocation failing to be
> > aligned.
> > 
> 
> I agree that this addresses the reported issue, but I can reproduce
> other corner cases affected by the original patch that aren't affected
> by this one. For example, if the allocation request happens to be
> slightly less than blen but not enough to allow for alignment, minlen
> isn't dropped and we can run through the same allocation retry sequence
> that kills off alignment before success.

But isn't that just another variation of the initial conditions
(minlen/maxlen) not being set up correctly for alignment when the AG
is empty?

i.e. Take the above condition and change it like this:

 	/*
 	 * Adjust for alignment
 	 */
-	if (blen > args.alignment && blen <= args.maxlen)
+	if (blen > args.alignment && blen <= args.maxlen + args.alignment)
 		args.minlen = blen - args.alignment;
 	args.minalignslop = 0;

and now we cover all the cases when blen covers an aligned maxlen
allocation...

Cheers,

Dave.
Brian Foster Sept. 15, 2019, 1:09 p.m. UTC | #5
On Sun, Sep 15, 2019 at 08:00:35AM +1000, Dave Chinner wrote:
> On Fri, Sep 13, 2019 at 10:58:02AM -0400, Brian Foster wrote:
> > On Fri, Sep 13, 2019 at 08:35:19AM +1000, Dave Chinner wrote:
> > > On Thu, Sep 12, 2019 at 10:32:22AM -0400, Brian Foster wrote:
> > > > The bmap block allocation code issues a sequence of retries to
> > > > perform an optimal allocation, gradually loosening constraints as
> > > > allocations fail. For example, the first attempt might begin at a
> > > > particular bno, with maxlen == minlen and alignment incorporated. As
> > > > allocations fail, the parameters fall back to different modes, drop
> > > > alignment requirements and reduce the minlen and total block
> > > > requirements.
> > > > 
> > > > For large extent allocations with an args.total value that exceeds
> > > > the allocation length (i.e., non-delalloc), the total value tends to
> > > > dominate despite these fallbacks. For example, an aligned extent
> > > > allocation request of tens to hundreds of MB that cannot be
> > > > satisfied from a particular AG will not succeed after dropping
> > > > alignment or minlen because xfs_alloc_space_available() never
> > > > selects an AG that can't satisfy args.total. The retry sequence
> > > > eventually reduces total and ultimately succeeds if a minlen extent
> > > > is available somewhere, but the first several retries are
> > > > effectively pointless in this scenario.
> > > > 
> > > > Beyond simply being inefficient, another side effect of this
> > > > behavior is that we drop alignment requirements too aggressively.
> > > > Consider a 1GB fallocate on a 15GB fs with 16 AGs and 128k stripe
> > > > unit:
> > > > 
> > > >  # xfs_io -c "falloc 0 1g" /mnt/file
> > > >  # <xfstests>/src/t_stripealign /mnt/file 32
> > > >  /mnt/file: Start block 347176 not multiple of sunit 32
> > > 
> > > Ok, so what Carlos and I found last night was an issue with the
> > > the agresv code leading to the maximum free extent calculated
> > > by xfs_alloc_longest_free_extent() being longer than the largest
> > > allowable extent allocation (mp->m_ag_max_usable) resulting in the
> > > situation where blen > args->maxlen, and so in the case of initial
> > > allocation here, we never run this:
> > > 
> > > 	/*
> > > 	 * Adjust for alignment
> > > 	 */
> > > 	if (blen > args.alignment && blen <= args.maxlen)
> > > 		args.minlen = blen - args.alignment;
> > > 	args.minalignslop = 0;
> > > 
> ....
> > > > As a step towards addressing this problem, insert a new retry in the
> > > > bmap allocation sequence to drop minlen (from maxlen) before tossing
> > > > alignment. This should still result in as large of an extent as
> > > > possible as the block allocator prioritizes extent size in all but
> > > > exact allocation modes. By itself, this does not change the behavior
> > > > of the command above because the preallocation code still specifies
> > > > total based on maxlen. Instead, this facilitates preservation of
> > > > alignment once extra reservation is separated from the extent length
> > > > portion of the total block requirement.
> > > 
> > > AFAICT this is not necessary. The prototypoe patch I wrote last
> > > night while working through this with Carlos is attached below. I
> > > updated with a variant of your patch 2 to demonstrate that it does
> > > actually solve the problem of full AG allocation failing to be
> > > aligned.
> > > 
> > 
> > I agree that this addresses the reported issue, but I can reproduce
> > other corner cases affected by the original patch that aren't affected
> > by this one. For example, if the allocation request happens to be
> > slightly less than blen but not enough to allow for alignment, minlen
> > isn't dropped and we can run through the same allocation retry sequence
> > that kills off alignment before success.
> 
> But isn't that just another variation of the initial conditions
> (minlen/maxlen) not being set up correctly for alignment when the AG
> is empty?
> 

Perhaps, though I don't think it's exclusive to an empty AG.

> i.e. Take the above condition and change it like this:
> 
>  	/*
>  	 * Adjust for alignment
>  	 */
> -	if (blen > args.alignment && blen <= args.maxlen)
> +	if (blen > args.alignment && blen <= args.maxlen + args.alignment)
>  		args.minlen = blen - args.alignment;
>  	args.minalignslop = 0;
> 
> and now we cover all the cases when blen covers an aligned maxlen
> allocation...
> 

Do we want to consider whether minlen goes to 1? Otherwise that looks
reasonable to me. What I was trying to get at is just that we should
consider whether there are any other corner cases (that we might care
about) where this particular allocation might not behave as expected vs.
just the example used in the original commit log.

If somebody wants to send a finalized patch or two with these fixes
along with the bma.total one (or I can tack it on in reply..?), I'll
think about it further on review as well..

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
Carlos Maiolino Sept. 16, 2019, 8:36 a.m. UTC | #6
> > > > aligned.
> > > > 
> > > 
> > > I agree that this addresses the reported issue, but I can reproduce
> > > other corner cases affected by the original patch that aren't affected
> > > by this one. For example, if the allocation request happens to be
> > > slightly less than blen but not enough to allow for alignment, minlen
> > > isn't dropped and we can run through the same allocation retry sequence
> > > that kills off alignment before success.
> > 
> > But isn't that just another variation of the initial conditions
> > (minlen/maxlen) not being set up correctly for alignment when the AG
> > is empty?
> > 
> 
> Perhaps, though I don't think it's exclusive to an empty AG.
> 
> > i.e. Take the above condition and change it like this:
> > 
> >  	/*
> >  	 * Adjust for alignment
> >  	 */
> > -	if (blen > args.alignment && blen <= args.maxlen)
> > +	if (blen > args.alignment && blen <= args.maxlen + args.alignment)
> >  		args.minlen = blen - args.alignment;
> >  	args.minalignslop = 0;
> > 
> > and now we cover all the cases when blen covers an aligned maxlen
> > allocation...
> > 
> 
> Do we want to consider whether minlen goes to 1? Otherwise that looks
> reasonable to me. What I was trying to get at is just that we should
> consider whether there are any other corner cases (that we might care
> about) where this particular allocation might not behave as expected vs.
> just the example used in the original commit log.
> 
> If somebody wants to send a finalized patch or two with these fixes
> along with the bma.total one (or I can tack it on in reply..?), I'll
> think about it further on review as well..

I didn't realize the conversation was already going on before I replied for the
first time, my apologies for unnecessary emails.

I've been working on some patches about this issue since I had this chat with
Dave, but I do not have anything 'mature' yet, exactly because some of the
issues you mentioned above, like the behavior not being exclusive to an empty
AG, and the fact the generic/223 was still failing after the 'fix' has been
applied (the single large fallocated file worked, but generic/223 no), so I was
kind of chasing my own tails on Friday :D

I'll get back to it today and see what I can do with fresh eyes.

Thanks Dave and Brian.

> 
> Brian
> 
> > Cheers,
> > 
> > Dave.
> > -- 
> > Dave Chinner
> > david@fromorbit.com
Carlos Maiolino Sept. 17, 2019, 12:22 p.m. UTC | #7
Hi Dave.

I've been playing a bit with it, and, based on our talk on IRC:

> +++ b/fs/xfs/libxfs/xfs_bmap.c
> @@ -4286,6 +4286,20 @@ xfs_bmapi_write(
>  #endif
>  	whichfork = xfs_bmapi_whichfork(flags);
>  
> +	/*
> +	 * XXX: Hack!
> +	 *
> +	 * If the total blocks requested is larger than an AG, we can't allocate
> +	 * all the space atomically and within a single AG. This will be a
> +	 * "short" allocation. In this case, just ignore the total block count
> +	 * and rely on minleft calculations to ensure the allocation we do fits
> +	 * inside an AG properly.
> +	 *
> +	 * Based on a patch from Brian.
> +	 */
> +	if (bma.total > mp->m_ag_max_usable)
> +		bma.total = 0;

Instead zeroing bma.total here, can't we crop it to blen in xfs_bmap_btalloc()?

I did some tests here and looks like the result is the same, although I'm not
sure if it's a good approach.

Cheers

> +
>  	ASSERT(*nmap >= 1);
>  	ASSERT(*nmap <= XFS_BMAP_MAX_NMAP);
>  	ASSERT(tp != NULL);
Darrick J. Wong Sept. 18, 2019, 9:41 p.m. UTC | #8
On Thu, Sep 12, 2019 at 10:32:22AM -0400, Brian Foster wrote:
> The bmap block allocation code issues a sequence of retries to
> perform an optimal allocation, gradually loosening constraints as
> allocations fail. For example, the first attempt might begin at a
> particular bno, with maxlen == minlen and alignment incorporated. As
> allocations fail, the parameters fall back to different modes, drop
> alignment requirements and reduce the minlen and total block
> requirements.
> 
> For large extent allocations with an args.total value that exceeds
> the allocation length (i.e., non-delalloc), the total value tends to
> dominate despite these fallbacks. For example, an aligned extent
> allocation request of tens to hundreds of MB that cannot be
> satisfied from a particular AG will not succeed after dropping
> alignment or minlen because xfs_alloc_space_available() never
> selects an AG that can't satisfy args.total. The retry sequence

"..that can satisfy args.total"?

> eventually reduces total and ultimately succeeds if a minlen extent
> is available somewhere, but the first several retries are
> effectively pointless in this scenario.
> 
> Beyond simply being inefficient, another side effect of this
> behavior is that we drop alignment requirements too aggressively.
> Consider a 1GB fallocate on a 15GB fs with 16 AGs and 128k stripe
> unit:
> 
>  # xfs_io -c "falloc 0 1g" /mnt/file
>  # <xfstests>/src/t_stripealign /mnt/file 32
>  /mnt/file: Start block 347176 not multiple of sunit 32
> 
> Despite the filesystem being completely empty, the fact that the
> allocation request cannot be satisifed from a single AG means the
> allocation doesn't succeed until xfs_bmap_btalloc() drops total from
> the original value based on maxlen. This occurs after we've dropped
> minlen and alignment (unnecessarily).
> 
> As a step towards addressing this problem, insert a new retry in the
> bmap allocation sequence to drop minlen (from maxlen) before tossing
> alignment. This should still result in as large of an extent as
> possible as the block allocator prioritizes extent size in all but
> exact allocation modes. By itself, this does not change the behavior
> of the command above because the preallocation code still specifies
> total based on maxlen. Instead, this facilitates preservation of
> alignment once extra reservation is separated from the extent length
> portion of the total block requirement.
> 
> Signed-off-by: Brian Foster <bfoster@redhat.com>
> ---
>  fs/xfs/libxfs/xfs_bmap.c | 12 +++++++++---
>  1 file changed, 9 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
> index 054b4ce30033..eaa965920a03 100644
> --- a/fs/xfs/libxfs/xfs_bmap.c
> +++ b/fs/xfs/libxfs/xfs_bmap.c
> @@ -3573,6 +3573,14 @@ xfs_bmap_btalloc(
>  		if ((error = xfs_alloc_vextent(&args)))
>  			return error;
>  	}
> +	if (args.fsbno == NULLFSBLOCK && nullfb &&
> +	    args.minlen > ap->minlen) {

Maybe a comment here to point out that we're retrying the allocation
with the minimum acceptable minlen?  I say this mostly because (some of)
the other clauses have a quick description of the constraints that are
being fed to the allocation request, and it makes it easier to keep
track of what's going on.

> +		args.minlen = ap->minlen;
> +		args.fsbno = ap->blkno;
> +		error = xfs_alloc_vextent(&args);
> +		if (error)
> +			return error;
> +	}
>  	if (isaligned && args.fsbno == NULLFSBLOCK) {
>  		/*
>  		 * allocation failed, so turn off alignment and
> @@ -3584,9 +3592,7 @@ xfs_bmap_btalloc(
>  		if ((error = xfs_alloc_vextent(&args)))
>  			return error;
>  	}
> -	if (args.fsbno == NULLFSBLOCK && nullfb &&
> -	    args.minlen > ap->minlen) {
> -		args.minlen = ap->minlen;
> +	if (args.fsbno == NULLFSBLOCK && nullfb) {
>  		args.type = XFS_ALLOCTYPE_START_BNO;

Particularly when we get here and I have to look pretty closely to
figure out what this retry is now attempting to do.  This one is trying
the allocation again, but now with no alignment and the caller's minlen,
right?

--D

>  		args.fsbno = ap->blkno;
>  		if ((error = xfs_alloc_vextent(&args)))
> -- 
> 2.20.1
>
Brian Foster Sept. 19, 2019, 11:49 a.m. UTC | #9
On Wed, Sep 18, 2019 at 02:41:41PM -0700, Darrick J. Wong wrote:
> On Thu, Sep 12, 2019 at 10:32:22AM -0400, Brian Foster wrote:
> > The bmap block allocation code issues a sequence of retries to
> > perform an optimal allocation, gradually loosening constraints as
> > allocations fail. For example, the first attempt might begin at a
> > particular bno, with maxlen == minlen and alignment incorporated. As
> > allocations fail, the parameters fall back to different modes, drop
> > alignment requirements and reduce the minlen and total block
> > requirements.
> > 
> > For large extent allocations with an args.total value that exceeds
> > the allocation length (i.e., non-delalloc), the total value tends to
> > dominate despite these fallbacks. For example, an aligned extent
> > allocation request of tens to hundreds of MB that cannot be
> > satisfied from a particular AG will not succeed after dropping
> > alignment or minlen because xfs_alloc_space_available() never
> > selects an AG that can't satisfy args.total. The retry sequence
> 
> "..that can satisfy args.total"?
> 

Heh. "can't" was intended there, but after reading it back it is poorly
worded as a double negative. :P Basically it's just saying that
args.total is what restricts AG selection in this corner case.

> > eventually reduces total and ultimately succeeds if a minlen extent
> > is available somewhere, but the first several retries are
> > effectively pointless in this scenario.
> > 
> > Beyond simply being inefficient, another side effect of this
> > behavior is that we drop alignment requirements too aggressively.
> > Consider a 1GB fallocate on a 15GB fs with 16 AGs and 128k stripe
> > unit:
> > 
> >  # xfs_io -c "falloc 0 1g" /mnt/file
> >  # <xfstests>/src/t_stripealign /mnt/file 32
> >  /mnt/file: Start block 347176 not multiple of sunit 32
> > 
> > Despite the filesystem being completely empty, the fact that the
> > allocation request cannot be satisifed from a single AG means the
> > allocation doesn't succeed until xfs_bmap_btalloc() drops total from
> > the original value based on maxlen. This occurs after we've dropped
> > minlen and alignment (unnecessarily).
> > 
> > As a step towards addressing this problem, insert a new retry in the
> > bmap allocation sequence to drop minlen (from maxlen) before tossing
> > alignment. This should still result in as large of an extent as
> > possible as the block allocator prioritizes extent size in all but
> > exact allocation modes. By itself, this does not change the behavior
> > of the command above because the preallocation code still specifies
> > total based on maxlen. Instead, this facilitates preservation of
> > alignment once extra reservation is separated from the extent length
> > portion of the total block requirement.
> > 
> > Signed-off-by: Brian Foster <bfoster@redhat.com>
> > ---
> >  fs/xfs/libxfs/xfs_bmap.c | 12 +++++++++---
> >  1 file changed, 9 insertions(+), 3 deletions(-)
> > 
> > diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
> > index 054b4ce30033..eaa965920a03 100644
> > --- a/fs/xfs/libxfs/xfs_bmap.c
> > +++ b/fs/xfs/libxfs/xfs_bmap.c
> > @@ -3573,6 +3573,14 @@ xfs_bmap_btalloc(
> >  		if ((error = xfs_alloc_vextent(&args)))
> >  			return error;
> >  	}
> > +	if (args.fsbno == NULLFSBLOCK && nullfb &&
> > +	    args.minlen > ap->minlen) {
> 
> Maybe a comment here to point out that we're retrying the allocation
> with the minimum acceptable minlen?  I say this mostly because (some of)
> the other clauses have a quick description of the constraints that are
> being fed to the allocation request, and it makes it easier to keep
> track of what's going on.
> 

Yeah..

> > +		args.minlen = ap->minlen;
> > +		args.fsbno = ap->blkno;
> > +		error = xfs_alloc_vextent(&args);
> > +		if (error)
> > +			return error;
> > +	}
> >  	if (isaligned && args.fsbno == NULLFSBLOCK) {
> >  		/*
> >  		 * allocation failed, so turn off alignment and
> > @@ -3584,9 +3592,7 @@ xfs_bmap_btalloc(
> >  		if ((error = xfs_alloc_vextent(&args)))
> >  			return error;
> >  	}
> > -	if (args.fsbno == NULLFSBLOCK && nullfb &&
> > -	    args.minlen > ap->minlen) {
> > -		args.minlen = ap->minlen;
> > +	if (args.fsbno == NULLFSBLOCK && nullfb) {
> >  		args.type = XFS_ALLOCTYPE_START_BNO;
> 
> Particularly when we get here and I have to look pretty closely to
> figure out what this retry is now attempting to do.  This one is trying
> the allocation again, but now with no alignment and the caller's minlen,
> right?
> 

Yeah, but this branch also (potentially) changes the allocation type.
IIRC, this wasn't relevant to the corner case I was trying to address
with this patch. I basically just wanted to get minlen dropped before
tossing alignment and at the same time didn't want to screw around with
the existing logic that was unrelated (hence the separation of the
minlen update into a new retry as opposed to just reordering code).

That said, the other subthread suggests this patch can be replaced with
more localized fixes to bma.minlen alignment handling code. My original
reproducer of this problem was based on some of the extent allocation
rework bits that have been deferred (rework of the size allocation
mode). I wasn't aware of the bma.minlen alignment logic at the time, so
I may have misanalyzed the problem and my current development tree
doesn't reproduce. I'll make the tweaks to this patch locally in the
event that I run into the problem again when I get back to that work and
if there still happens to be corner cases not covered by the minlen
fixup patch that Carlos has posted..

Brian

> --D
> 
> >  		args.fsbno = ap->blkno;
> >  		if ((error = xfs_alloc_vextent(&args)))
> > -- 
> > 2.20.1
> >
diff mbox series

Patch

diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index 054b4ce30033..eaa965920a03 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -3573,6 +3573,14 @@  xfs_bmap_btalloc(
 		if ((error = xfs_alloc_vextent(&args)))
 			return error;
 	}
+	if (args.fsbno == NULLFSBLOCK && nullfb &&
+	    args.minlen > ap->minlen) {
+		args.minlen = ap->minlen;
+		args.fsbno = ap->blkno;
+		error = xfs_alloc_vextent(&args);
+		if (error)
+			return error;
+	}
 	if (isaligned && args.fsbno == NULLFSBLOCK) {
 		/*
 		 * allocation failed, so turn off alignment and
@@ -3584,9 +3592,7 @@  xfs_bmap_btalloc(
 		if ((error = xfs_alloc_vextent(&args)))
 			return error;
 	}
-	if (args.fsbno == NULLFSBLOCK && nullfb &&
-	    args.minlen > ap->minlen) {
-		args.minlen = ap->minlen;
+	if (args.fsbno == NULLFSBLOCK && nullfb) {
 		args.type = XFS_ALLOCTYPE_START_BNO;
 		args.fsbno = ap->blkno;
 		if ((error = xfs_alloc_vextent(&args)))