diff mbox series

[v2,04/14] fs: xfs: Make file data allocations observe the 'forcealign' flag

Message ID 20240304130428.13026-5-john.g.garry@oracle.com (mailing list archive)
State New, archived
Headers show
Series block atomic writes for XFS | expand

Commit Message

John Garry March 4, 2024, 1:04 p.m. UTC
From: "Darrick J. Wong" <djwong@kernel.org>

The existing extsize hint code already did the work of expanding file
range mapping requests so that the range is aligned to the hint value.
Now add the code we need to guarantee that the space allocations are
also always aligned.

XXX: still need to check all this with reflink

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Co-developed-by: John Garry <john.g.garry@oracle.com>
Signed-off-by: John Garry <john.g.garry@oracle.com>
---
 fs/xfs/libxfs/xfs_bmap.c | 22 +++++++++++++++++-----
 fs/xfs/xfs_iomap.c       |  4 +++-
 2 files changed, 20 insertions(+), 6 deletions(-)

Comments

Dave Chinner March 5, 2024, 12:44 a.m. UTC | #1
On Mon, Mar 04, 2024 at 01:04:18PM +0000, John Garry wrote:
> From: "Darrick J. Wong" <djwong@kernel.org>
> 
> The existing extsize hint code already did the work of expanding file
> range mapping requests so that the range is aligned to the hint value.
> Now add the code we need to guarantee that the space allocations are
> also always aligned.
> 
> XXX: still need to check all this with reflink
> 
> Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
> Co-developed-by: John Garry <john.g.garry@oracle.com>
> Signed-off-by: John Garry <john.g.garry@oracle.com>
> ---
>  fs/xfs/libxfs/xfs_bmap.c | 22 +++++++++++++++++-----
>  fs/xfs/xfs_iomap.c       |  4 +++-
>  2 files changed, 20 insertions(+), 6 deletions(-)
> 
> diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
> index 60d100134280..8dee60795cf4 100644
> --- a/fs/xfs/libxfs/xfs_bmap.c
> +++ b/fs/xfs/libxfs/xfs_bmap.c
> @@ -3343,6 +3343,19 @@ xfs_bmap_compute_alignments(
>  		align = xfs_get_cowextsz_hint(ap->ip);
>  	else if (ap->datatype & XFS_ALLOC_USERDATA)
>  		align = xfs_get_extsz_hint(ap->ip);
> +
> +	/*
> +	 * xfs_get_cowextsz_hint() returns extsz_hint for when forcealign is
> +	 * set as forcealign and cowextsz_hint are mutually exclusive
> +	 */
> +	if (xfs_inode_forcealign(ap->ip) && align) {
> +		args->alignment = align;
> +		if (stripe_align % align)
> +			stripe_align = align;

This kinda does the right thing, but it strikes me as the wrong
thing to be doing - it seems to conflates two different physical
alignment properties in potentially bad ways, and we should never
get this far into the allocator to discover these issues.

Stripe alignment is alignment to physical disks in a RAID array.
Inode forced alignment is file offset alignment to specificly
defined LBA address ranges.  The stripe unit/width is set at mkfs
time, the inode extent size hint at runtime.

These can only work together in specific situations:

	1. max sized RWF_ATOMIC IO must be the same or smaller size
	   than the stripe unit. Alternatively: stripe unit must be
	   an integer (power of 2?) multiple of max atomic IO size.

	   IOWs, stripe unit vs atomic IO constraints must be
	   resolved in a compatible manner at mkfs time. If they
	   can't be resolved (e.g. non-power of 2 stripe unit) then
	   atomic IO cannot be done on the filesystem at all. Stripe
	   unit constraints always win over atomic IO.

	2. min sized RWF_ATOMIC IO must be an integer divider of
	   stripe unit so that small atomic IOs are always correctly
	   aligned to the stripe unit.

	3. Inode forced alignment constraints set at runtime (i.e.
	   extsize hints) must fit within the above stripe unit vs
	   min/max atomic IO requirements.

	   i.e. extent size hint will typically need to an integer
	   multiple of stripe unit size which will always be
	   compatible with stripe unit and atomic IO size and
	   alignments...

IOWs, these are static geometry constraints and so should be checked
and rejected at the point where alignments are specified (i.e.
mkfs, mount and ioctls). Then the allocator can simply assume that
forced inode alignments are always stripe alignment compatible and
we don't need separate handling of two possibly incompatible
alignments.

Regardless, I think the code as it stands won't work correctly when
a stripe unit is not set.

The correct functioning of this code appears to be dependent on
stripe_align being set >= the extent size hint. If stripe align is
not set, then what does (0 % align) return? If my checks are
correct, this will return 0, and so the above code will end up with
stripe_align = 0.

Now, consider that args->alignment is used to signal to the
allocator what the -start alignment- of the allocation is supposed
to be, whilst args->prod/args->mod are used to tell the allocator
what the tail adjustment is supposed to be.

Stripe alignment wants start alignment, extent size hints want tail
adjustment and force aligned allocation wants both start alignment
and tail adjustment.

However, the allocator overrides these depending on what it is
doing. e.g. exact block EOF allocation turns off start alignment but
has to ensure space is reserved for start alignment if it fails.
This reserves space for stripe_align in args->minalignslop, but if
force alignment has a start alignment requirement larger than stripe
unit alignment, this code fails to reserve the necessary alignment
slop for the force alignment code.

If we aren't doing exact block EOF allocation, then we do stripe
unit alignment. Again, if this fails and the forced alignment is
larger than stripe alignment, this code does not reserve space for
the larger alignment in args->minalignslop, so it will also do the
wrong when we fall back to an args->alignment > 1 allocation.

Failing to correctly account for minalignslop is known to cause
accounting problems when the AG is very near empty, and the usual
result of that is cancelling of a dirty allocation transaction and a
forced shutdown. So this is something we need to get right.

More below....

> +	} else {
> +		args->alignment = 1;
> +	}

Just initialise the allocation args structure with a value of 1 like
we already do?

> +
>  	if (align) {
>  		if (xfs_bmap_extsize_align(mp, &ap->got, &ap->prev, align, 0,
>  					ap->eof, 0, ap->conv, &ap->offset,
> @@ -3438,7 +3451,6 @@ xfs_bmap_exact_minlen_extent_alloc(
>  	args.minlen = args.maxlen = ap->minlen;
>  	args.total = ap->total;
>  
> -	args.alignment = 1;
>  	args.minalignslop = 0;

This likely breaks the debug code. This isn't intended for testing
atomic write capability, it's for exercising low space allocation
algorithms with worst case fragmentation patterns. This is only
called when error injection is turned on to trigger this path, so we
shouldn't need to support forced IO alignment here.

>  
>  	args.minleft = ap->minleft;
> @@ -3484,6 +3496,7 @@ xfs_bmap_btalloc_at_eof(
>  {
>  	struct xfs_mount	*mp = args->mp;
>  	struct xfs_perag	*caller_pag = args->pag;
> +	int			orig_alignment = args->alignment;
>  	int			error;
>  
>  	/*
> @@ -3558,10 +3571,10 @@ xfs_bmap_btalloc_at_eof(
>  
>  	/*
>  	 * Allocation failed, so turn return the allocation args to their
> -	 * original non-aligned state so the caller can proceed on allocation
> -	 * failure as if this function was never called.
> +	 * original state so the caller can proceed on allocation failure as
> +	 * if this function was never called.
>  	 */
> -	args->alignment = 1;
> +	args->alignment = orig_alignment;
>  	return 0;
>  }

As I said above, we can't set an alignment of > 1 here if we haven't
accounted for that alignment in args->minalignslop above. This leads
to unexpected ENOSPC conditions and filesystem shutdowns.

I suspect what we need to do is get rid of the separate stripe_align
variable altogether and always just set args->alignment to what we
need the extent start alignment to be, regardless of whether it is
from stripe alignment or forced alignment.

Then the code in xfs_bmap_btalloc_at_eof() doesn't need to know what
'stripe_align' is - the exact EOF block allocation can simply save
and restore the args->alignment value and use it for minalignslop
calculations for the initial exact block allocation.

Then, if xfs_bmap_btalloc_at_eof() fails and xfs_inode_forcealign()
is true, we can abort allocation immediately, and not bother to fall
back on further aligned/unaligned attempts that will also fail or do
the wrong them.

Similarly, if we aren't doing EOF allocation, having args->alignment
set means it will do the right thing for the first allocation
attempt. Again, if that fails, we can check if
xfs_inode_forcealign() is true and fail the aligned allocation
instead of running the low space algorithm. This now makes it clear
that we're failing the allocation because of the forced alignment
requirement, and now the low space allocation code can explicitly
turn off start alignment as it isn't required...


> diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
> index 18c8f168b153..70fe873951f3 100644
> --- a/fs/xfs/xfs_iomap.c
> +++ b/fs/xfs/xfs_iomap.c
> @@ -181,7 +181,9 @@ xfs_eof_alignment(
>  		 * If mounted with the "-o swalloc" option the alignment is
>  		 * increased from the strip unit size to the stripe width.
>  		 */
> -		if (mp->m_swidth && xfs_has_swalloc(mp))
> +		if (xfs_inode_forcealign(ip))
> +			align = xfs_get_extsz_hint(ip);
> +		else if (mp->m_swidth && xfs_has_swalloc(mp))
>  			align = mp->m_swidth;
>  		else if (mp->m_dalign)
>  			align = mp->m_dalign;

This doesn't work for files with a current size of less than
"align". The next line of code does:

	if (align && XFS_ISIZE(ip) < XFS_FSB_TO_B(mp, align))
		align = 0;

IOWs, the first aligned write allocation will occur with a file size
of zero, and that will result in this function returning 0. i.e.
speculative post EOF delalloc doesn't turn on and align the EOF
until writes up to offset == align have already been completed.

Essentially, this code wants to be:

	if (xfs_inode_forcealign(ip))
		return xfs_get_extsz_hint(ip);

So that any write to the a force aligned inode will always trigger
extent size hint EOF alignment.

-Dave.
John Garry March 5, 2024, 3:22 p.m. UTC | #2
On 05/03/2024 00:44, Dave Chinner wrote:
> On Mon, Mar 04, 2024 at 01:04:18PM +0000, John Garry wrote:
>> From: "Darrick J. Wong" <djwong@kernel.org>
>>
>> The existing extsize hint code already did the work of expanding file
>> range mapping requests so that the range is aligned to the hint value.
>> Now add the code we need to guarantee that the space allocations are
>> also always aligned.
>>
>> XXX: still need to check all this with reflink
>>
>> Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
>> Co-developed-by: John Garry <john.g.garry@oracle.com>
>> Signed-off-by: John Garry <john.g.garry@oracle.com>
>> ---
>>   fs/xfs/libxfs/xfs_bmap.c | 22 +++++++++++++++++-----
>>   fs/xfs/xfs_iomap.c       |  4 +++-
>>   2 files changed, 20 insertions(+), 6 deletions(-)
>>
>> diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
>> index 60d100134280..8dee60795cf4 100644
>> --- a/fs/xfs/libxfs/xfs_bmap.c
>> +++ b/fs/xfs/libxfs/xfs_bmap.c
>> @@ -3343,6 +3343,19 @@ xfs_bmap_compute_alignments(
>>   		align = xfs_get_cowextsz_hint(ap->ip);
>>   	else if (ap->datatype & XFS_ALLOC_USERDATA)
>>   		align = xfs_get_extsz_hint(ap->ip);
>> +
>> +	/*
>> +	 * xfs_get_cowextsz_hint() returns extsz_hint for when forcealign is
>> +	 * set as forcealign and cowextsz_hint are mutually exclusive
>> +	 */
>> +	if (xfs_inode_forcealign(ap->ip) && align) {
>> +		args->alignment = align;
>> +		if (stripe_align % align)
>> +			stripe_align = align;
> 
> This kinda does the right thing, but it strikes me as the wrong
> thing to be doing - it seems to conflates two different physical
> alignment properties in potentially bad ways, and we should never
> get this far into the allocator to discover these issues.
> 
> Stripe alignment is alignment to physical disks in a RAID array.
> Inode forced alignment is file offset alignment to specificly
> defined LBA address ranges.  The stripe unit/width is set at mkfs
> time, the inode extent size hint at runtime.
> 
> These can only work together in specific situations:
> 
> 	1. max sized RWF_ATOMIC IO must be the same or smaller size
> 	   than the stripe unit. Alternatively: stripe unit must be
> 	   an integer (power of 2?) multiple of max atomic IO size.

Sure, it would not make sense to have max sized RWF_ATOMIC IO > stripe 
alignment.

> 
> 	   IOWs, stripe unit vs atomic IO constraints must be
> 	   resolved in a compatible manner at mkfs time. If they
> 	   can't be resolved (e.g. non-power of 2 stripe unit) then
> 	   atomic IO cannot be done on the filesystem at all. Stripe
> 	   unit constraints always win over atomic IO.

We can could do that. Indeed, I thought our xfsprogs was doing that, but 
we have had a few versions now for forcealign ...

> 
> 	2. min sized RWF_ATOMIC IO must be an integer divider of
> 	   stripe unit so that small atomic IOs are always correctly
> 	   aligned to the stripe unit.

That's a given from atomic write rules and point 1.

> 
> 	3. Inode forced alignment constraints set at runtime (i.e.
> 	   extsize hints) must fit within the above stripe unit vs
> 	   min/max atomic IO requirements.
>  > 	   i.e. extent size hint will typically need to an integer
> 	   multiple of stripe unit size which will always be
> 	   compatible with stripe unit and atomic IO size and
> 	   alignments...

I think that any extsize hint for forcealign needs to comply with stripe 
unit, same as point 1.

> 
> IOWs, these are static geometry constraints and so should be checked
> and rejected at the point where alignments are specified (i.e.
> mkfs, mount and ioctls). Then the allocator can simply assume that
> forced inode alignments are always stripe alignment compatible and
> we don't need separate handling of two possibly incompatible
> alignments.

ok, makes sense.

Please note in case missed, I am mandating extsize hint for forcealign 
needs to be a power-of-2. It just makes life easier for all the 
sub-extent zero'ing later on.

Also we need to enforce that the AG count to be compliant with the 
extsize hint for forcealign; but since the extsize hint for forcealign 
needs to be compliant with stripe unit, above, and stripe unit would be 
compliant wth AG count (right?), then this would be a given.

> 
> Regardless, I think the code as it stands won't work correctly when
> a stripe unit is not set.
> 
> The correct functioning of this code appears to be dependent on
> stripe_align being set >= the extent size hint. If stripe align is
> not set, then what does (0 % align) return? If my checks are
> correct, this will return 0,

Correct

> and so the above code will end up with
> stripe_align = 0.
> 
> Now, consider that args->alignment is used to signal to the
> allocator what the -start alignment- of the allocation is supposed
> to be, whilst args->prod/args->mod are used to tell the allocator
> what the tail adjustment is supposed to be.
> 
> Stripe alignment wants start alignment, extent size hints want tail
> adjustment and force aligned allocation wants both start alignment
> and tail adjustment.

Right

> 
> However, the allocator overrides these depending on what it is
> doing. e.g. exact block EOF allocation turns off start alignment but
> has to ensure space is reserved for start alignment if it fails.
> This reserves space for stripe_align in args->minalignslop, but if
> force alignment has a start alignment requirement larger than stripe
> unit alignment, this code fails to reserve the necessary alignment
> slop for the force alignment code.
> 
> If we aren't doing exact block EOF allocation, then we do stripe
> unit alignment. Again, if this fails and the forced alignment is
> larger than stripe alignment, this code does not reserve space for
> the larger alignment in args->minalignslop, so it will also do the
> wrong when we fall back to an args->alignment > 1 allocation.
> 
> Failing to correctly account for minalignslop is known to cause
> accounting problems when the AG is very near empty, and the usual
> result of that is cancelling of a dirty allocation transaction and a
> forced shutdown. So this is something we need to get right.

For sure

> 
> More below....
> 
>> +	} else {
>> +		args->alignment = 1;
>> +	}
> 
> Just initialise the allocation args structure with a value of 1 like
> we already do?

It was being done in this way to have just a single place where the 
value is initialised. It can easily be kept as is.

> 
>> +
>>   	if (align) {
>>   		if (xfs_bmap_extsize_align(mp, &ap->got, &ap->prev, align, 0,
>>   					ap->eof, 0, ap->conv, &ap->offset,
>> @@ -3438,7 +3451,6 @@ xfs_bmap_exact_minlen_extent_alloc(
>>   	args.minlen = args.maxlen = ap->minlen;
>>   	args.total = ap->total;
>>   
>> -	args.alignment = 1;
>>   	args.minalignslop = 0;
> 
> This likely breaks the debug code. This isn't intended for testing
> atomic write capability, it's for exercising low space allocation
> algorithms with worst case fragmentation patterns. This is only
> called when error injection is turned on to trigger this path, so we
> shouldn't need to support forced IO alignment here.

ok, it can be left untouched.

> 
>>   
>>   	args.minleft = ap->minleft;
>> @@ -3484,6 +3496,7 @@ xfs_bmap_btalloc_at_eof(
>>   {
>>   	struct xfs_mount	*mp = args->mp;
>>   	struct xfs_perag	*caller_pag = args->pag;
>> +	int			orig_alignment = args->alignment;
>>   	int			error;
>>   
>>   	/*
>> @@ -3558,10 +3571,10 @@ xfs_bmap_btalloc_at_eof(
>>   
>>   	/*
>>   	 * Allocation failed, so turn return the allocation args to their
>> -	 * original non-aligned state so the caller can proceed on allocation
>> -	 * failure as if this function was never called.
>> +	 * original state so the caller can proceed on allocation failure as
>> +	 * if this function was never called.
>>   	 */
>> -	args->alignment = 1;
>> +	args->alignment = orig_alignment;
>>   	return 0;
>>   }
> 
> As I said above, we can't set an alignment of > 1 here if we haven't
> accounted for that alignment in args->minalignslop above. This leads
> to unexpected ENOSPC conditions and filesystem shutdowns.
> 
> I suspect what we need to do is get rid of the separate stripe_align
> variable altogether and always just set args->alignment to what we
> need the extent start alignment to be, regardless of whether it is
> from stripe alignment or forced alignment.

ok, it sounds a bit simpler at least

> 
> Then the code in xfs_bmap_btalloc_at_eof() doesn't need to know what
> 'stripe_align' is - the exact EOF block allocation can simply save
> and restore the args->alignment value and use it for minalignslop
> calculations for the initial exact block allocation.
> 
> Then, if xfs_bmap_btalloc_at_eof() fails and xfs_inode_forcealign()
> is true, we can abort allocation immediately, and not bother to fall
> back on further aligned/unaligned attempts that will also fail or do
> the wrong them.

ok

> 
> Similarly, if we aren't doing EOF allocation, having args->alignment
> set means it will do the right thing for the first allocation
> attempt. Again, if that fails, we can check if
> xfs_inode_forcealign() is true and fail the aligned allocation
> instead of running the low space algorithm. This now makes it clear
> that we're failing the allocation because of the forced alignment
> requirement, and now the low space allocation code can explicitly
> turn off start alignment as it isn't required...

are you saying that low-space allocator can set args->alignment = 1 to 
be explicit?

> 
> 
>> diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
>> index 18c8f168b153..70fe873951f3 100644
>> --- a/fs/xfs/xfs_iomap.c
>> +++ b/fs/xfs/xfs_iomap.c
>> @@ -181,7 +181,9 @@ xfs_eof_alignment(
>>   		 * If mounted with the "-o swalloc" option the alignment is
>>   		 * increased from the strip unit size to the stripe width.
>>   		 */
>> -		if (mp->m_swidth && xfs_has_swalloc(mp))
>> +		if (xfs_inode_forcealign(ip))

I actually thought that I dropped this chunk as it was causing alignment 
issues. I need to check that again.

>> +			align = xfs_get_extsz_hint(ip);
>> +		else if (mp->m_swidth && xfs_has_swalloc(mp))
>>   			align = mp->m_swidth;
>>   		else if (mp->m_dalign)
>>   			align = mp->m_dalign;
> 
> This doesn't work for files with a current size of less than
> "align". The next line of code does:
> 
> 	if (align && XFS_ISIZE(ip) < XFS_FSB_TO_B(mp, align))
> 		align = 0;
> 
> IOWs, the first aligned write allocation will occur with a file size
> of zero, and that will result in this function returning 0. i.e.
> speculative post EOF delalloc doesn't turn on and align the EOF
> until writes up to offset == align have already been completed.

Maybe it was working as I have the change to keep EOF aligned for 
forcealign. I'll check that.

> 
> Essentially, this code wants to be:
> 
> 	if (xfs_inode_forcealign(ip))
> 		return xfs_get_extsz_hint(ip);
> 
> So that any write to the a force aligned inode will always trigger
> extent size hint EOF alignment.
> 

ok, sounds reasonable.

Thanks,
John
Dave Chinner March 5, 2024, 10:18 p.m. UTC | #3
On Tue, Mar 05, 2024 at 03:22:52PM +0000, John Garry wrote:
> On 05/03/2024 00:44, Dave Chinner wrote:
> > On Mon, Mar 04, 2024 at 01:04:18PM +0000, John Garry wrote:
....
> > IOWs, these are static geometry constraints and so should be checked
> > and rejected at the point where alignments are specified (i.e.
> > mkfs, mount and ioctls). Then the allocator can simply assume that
> > forced inode alignments are always stripe alignment compatible and
> > we don't need separate handling of two possibly incompatible
> > alignments.
> 
> ok, makes sense.
> 
> Please note in case missed, I am mandating extsize hint for forcealign needs
> to be a power-of-2. It just makes life easier for all the sub-extent
> zero'ing later on.

That's fine - that will need to be documented in the xfsctl man
page...

> Also we need to enforce that the AG count to be compliant with the extsize
                                      ^^^^^ size?

> hint for forcealign; but since the extsize hint for forcealign needs to be
> compliant with stripe unit, above, and stripe unit would be compliant wth AG
> count (right?), then this would be a given.

We already align AG size to stripe unit when a stripe unit is set,
and ensure that we don't place all the AG headers on the same stripe
unit.

However, if there is no stripe unit we don't align the AG to
anything. So, yes, AG sizing by mkfs will need to ensure that all
AGs are correctly aligned to the underlying storage (integer
multiple of the max atomic write size, right?)...

> > More below....
> > 
> > > +	} else {
> > > +		args->alignment = 1;
> > > +	}
> > 
> > Just initialise the allocation args structure with a value of 1 like
> > we already do?
> 
> It was being done in this way to have just a single place where the value is
> initialised. It can easily be kept as is.

I'd prefer it as is, because then the value is always initialised
correctly and we only override in the special cases....

> > >   	args.minleft = ap->minleft;
> > > @@ -3484,6 +3496,7 @@ xfs_bmap_btalloc_at_eof(
> > >   {
> > >   	struct xfs_mount	*mp = args->mp;
> > >   	struct xfs_perag	*caller_pag = args->pag;
> > > +	int			orig_alignment = args->alignment;
> > >   	int			error;
> > >   	/*
> > > @@ -3558,10 +3571,10 @@ xfs_bmap_btalloc_at_eof(
> > >   	/*
> > >   	 * Allocation failed, so turn return the allocation args to their
> > > -	 * original non-aligned state so the caller can proceed on allocation
> > > -	 * failure as if this function was never called.
> > > +	 * original state so the caller can proceed on allocation failure as
> > > +	 * if this function was never called.
> > >   	 */
> > > -	args->alignment = 1;
> > > +	args->alignment = orig_alignment;
> > >   	return 0;
> > >   }
> > 
> > As I said above, we can't set an alignment of > 1 here if we haven't
> > accounted for that alignment in args->minalignslop above. This leads
> > to unexpected ENOSPC conditions and filesystem shutdowns.
> > 
> > I suspect what we need to do is get rid of the separate stripe_align
> > variable altogether and always just set args->alignment to what we
> > need the extent start alignment to be, regardless of whether it is
> > from stripe alignment or forced alignment.
> 
> ok, it sounds a bit simpler at least
> 
> > 
> > Then the code in xfs_bmap_btalloc_at_eof() doesn't need to know what
> > 'stripe_align' is - the exact EOF block allocation can simply save
> > and restore the args->alignment value and use it for minalignslop
> > calculations for the initial exact block allocation.
> > 
> > Then, if xfs_bmap_btalloc_at_eof() fails and xfs_inode_forcealign()
> > is true, we can abort allocation immediately, and not bother to fall
> > back on further aligned/unaligned attempts that will also fail or do
> > the wrong them.
> 
> ok
> 
> > 
> > Similarly, if we aren't doing EOF allocation, having args->alignment
> > set means it will do the right thing for the first allocation
> > attempt. Again, if that fails, we can check if
> > xfs_inode_forcealign() is true and fail the aligned allocation
> > instead of running the low space algorithm. This now makes it clear
> > that we're failing the allocation because of the forced alignment
> > requirement, and now the low space allocation code can explicitly
> > turn off start alignment as it isn't required...
> 
> are you saying that low-space allocator can set args->alignment = 1 to be
> explicit?

Yes.

-Dave.
John Garry March 6, 2024, 9:41 a.m. UTC | #4
>> Please note in case missed, I am mandating extsize hint for forcealign needs
>> to be a power-of-2. It just makes life easier for all the sub-extent
>> zero'ing later on.
> 
> That's fine - that will need to be documented in the xfsctl man
> page...
> 
>> Also we need to enforce that the AG count to be compliant with the extsize
>                                        ^^^^^ size?

Yes

> 
>> hint for forcealign; but since the extsize hint for forcealign needs to be
>> compliant with stripe unit, above, and stripe unit would be compliant wth AG
>> count (right?), then this would be a given.
> 
> We already align AG size to stripe unit when a stripe unit is set,
> and ensure that we don't place all the AG headers on the same stripe
> unit.
> 
> However, if there is no stripe unit we don't align the AG to
> anything. 


> So, yes, AG sizing by mkfs will need to ensure that all
> AGs are correctly aligned to the underlying storage (integer
> multiple of the max atomic write size, right?)...

right, this is really important

> 
>>> More below....
>>>
>>>> +	} else {
>>>> +		args->alignment = 1;
>>>> +	}
>>>
>>> Just initialise the allocation args structure with a value of 1 like
>>> we already do?
>>
>> It was being done in this way to have just a single place where the value is
>> initialised. It can easily be kept as is.
> 
> I'd prefer it as is, because then the value is always initialised
> correctly and we only override in the special cases....

ok

>>
>> are you saying that low-space allocator can set args->alignment = 1 to be
>> explicit?
> 
> Yes.

ok

Thanks,
John
diff mbox series

Patch

diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index 60d100134280..8dee60795cf4 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -3343,6 +3343,19 @@  xfs_bmap_compute_alignments(
 		align = xfs_get_cowextsz_hint(ap->ip);
 	else if (ap->datatype & XFS_ALLOC_USERDATA)
 		align = xfs_get_extsz_hint(ap->ip);
+
+	/*
+	 * xfs_get_cowextsz_hint() returns extsz_hint for when forcealign is
+	 * set as forcealign and cowextsz_hint are mutually exclusive
+	 */
+	if (xfs_inode_forcealign(ap->ip) && align) {
+		args->alignment = align;
+		if (stripe_align % align)
+			stripe_align = align;
+	} else {
+		args->alignment = 1;
+	}
+
 	if (align) {
 		if (xfs_bmap_extsize_align(mp, &ap->got, &ap->prev, align, 0,
 					ap->eof, 0, ap->conv, &ap->offset,
@@ -3438,7 +3451,6 @@  xfs_bmap_exact_minlen_extent_alloc(
 	args.minlen = args.maxlen = ap->minlen;
 	args.total = ap->total;
 
-	args.alignment = 1;
 	args.minalignslop = 0;
 
 	args.minleft = ap->minleft;
@@ -3484,6 +3496,7 @@  xfs_bmap_btalloc_at_eof(
 {
 	struct xfs_mount	*mp = args->mp;
 	struct xfs_perag	*caller_pag = args->pag;
+	int			orig_alignment = args->alignment;
 	int			error;
 
 	/*
@@ -3558,10 +3571,10 @@  xfs_bmap_btalloc_at_eof(
 
 	/*
 	 * Allocation failed, so turn return the allocation args to their
-	 * original non-aligned state so the caller can proceed on allocation
-	 * failure as if this function was never called.
+	 * original state so the caller can proceed on allocation failure as
+	 * if this function was never called.
 	 */
-	args->alignment = 1;
+	args->alignment = orig_alignment;
 	return 0;
 }
 
@@ -3709,7 +3722,6 @@  xfs_bmap_btalloc(
 		.wasdel		= ap->wasdel,
 		.resv		= XFS_AG_RESV_NONE,
 		.datatype	= ap->datatype,
-		.alignment	= 1,
 		.minalignslop	= 0,
 	};
 	xfs_fileoff_t		orig_offset;
diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index 18c8f168b153..70fe873951f3 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -181,7 +181,9 @@  xfs_eof_alignment(
 		 * If mounted with the "-o swalloc" option the alignment is
 		 * increased from the strip unit size to the stripe width.
 		 */
-		if (mp->m_swidth && xfs_has_swalloc(mp))
+		if (xfs_inode_forcealign(ip))
+			align = xfs_get_extsz_hint(ip);
+		else if (mp->m_swidth && xfs_has_swalloc(mp))
 			align = mp->m_swidth;
 		else if (mp->m_dalign)
 			align = mp->m_dalign;