
[5/9] xfs: force inode garbage collection before fallocate when space is low

Message ID 162310472140.3465262.3509717954267805085.stgit@locust (mailing list archive)
State New, archived
Series: xfs: deferred inode inactivation

Commit Message

Darrick J. Wong June 7, 2021, 10:25 p.m. UTC
From: Darrick J. Wong <djwong@kernel.org>

Generally speaking, when a user calls fallocate, they're looking to
preallocate space in a file in the largest contiguous chunks possible.
If free space is low, it's possible that the free space will look
unnecessarily fragmented because there are unlinked inodes that are
holding on to space that we could allocate.  When this happens,
fallocate makes suboptimal allocation decisions for the sake of deleted
files, which doesn't make much sense, so scan the filesystem for dead
items to delete to try to avoid this.

Note that there are a handful of fstests that fill a filesystem, delete
just enough files to allow a single large allocation, and check that
fallocate actually gets the allocation.  These tests regress because the
test runs fallocate before the inode gc has a chance to run, so add this
behavior to maintain as much of the old behavior as possible.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/xfs_bmap_util.c |   43 +++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/xfs_icache.c    |    8 ++++++++
 fs/xfs/xfs_icache.h    |    1 +
 3 files changed, 52 insertions(+)

Comments

Dave Chinner June 8, 2021, 1:26 a.m. UTC | #1
On Mon, Jun 07, 2021 at 03:25:21PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> Generally speaking, when a user calls fallocate, they're looking to
> preallocate space in a file in the largest contiguous chunks possible.
> If free space is low, it's possible that the free space will look
> unnecessarily fragmented because there are unlinked inodes that are
> holding on to space that we could allocate.  When this happens,
> fallocate makes suboptimal allocation decisions for the sake of deleted
> files, which doesn't make much sense, so scan the filesystem for dead
> items to delete to try to avoid this.
> 
> Note that there are a handful of fstests that fill a filesystem, delete
> just enough files to allow a single large allocation, and check that
> fallocate actually gets the allocation.  These tests regress because the
> test runs fallocate before the inode gc has a chance to run, so add this
> behavior to maintain as much of the old behavior as possible.

I don't think this is a good justification for the change. Just
because the unit tests exploit an undefined behaviour that no
filesystem actually guarantees to achieve a specific layout, it
doesn't mean we always have to behave that way.

For example, many tests used to use reverse sequential writes to
exploit deficiencies in the allocation algorithms to generate
fragmented files. We fixed that problem and the tests broke because
they couldn't fragment files any more.

Did we reject those changes because the tests broke? No, we didn't
because the tests were exploiting an observed behaviour rather than
a guaranteed behaviour.

So, yeah, "test does X to make Y happen" doesn't mean "X will always
make Y happen". It just means the test needs to be made more robust,
or we have to provide a way for the test to trigger the behaviour it
needs.

Indeed, I think that the way to fix these sorts of issues is to have
the tests issue a syncfs(2) after they've deleted the inodes and have
the filesystem run an inodegc flush as part of the sync mechanism.
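
Something like this minimal userspace sketch, say (mount point and file
name invented for illustration, and it assumes ->sync_fs grows an
inodegc flush; error checking omitted):

#define _GNU_SOURCE	/* for syncfs(2) */
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
	/* delete the files whose space the test wants back */
	unlink("/mnt/scratch/bigfile");

	/* syncfs would flush inodegc, releasing the unlinked inodes' space */
	int fd = open("/mnt/scratch", O_RDONLY | O_DIRECTORY);
	syncfs(fd);
	close(fd);

	/* now fallocate() should find the consolidated free space */
	return 0;
}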

Then we don't need to do.....

> +/*
> + * If the target device (or some part of it) is full enough that it won't be
> + * able to satisfy the entire request, try to free inactive files to free up
> + * space.  While it's perfectly fine to fill a preallocation request with a
> + * bunch of short extents, we prefer to slow down preallocation requests to
> + * combat long term fragmentation in new file data.
> + */
> +static int
> +xfs_alloc_consolidate_freespace(
> +	struct xfs_inode	*ip,
> +	xfs_filblks_t		wanted)
> +{
> +	struct xfs_mount	*mp = ip->i_mount;
> +	struct xfs_perag	*pag;
> +	struct xfs_sb		*sbp = &mp->m_sb;
> +	xfs_agnumber_t		agno;
> +
> +	if (!xfs_has_inodegc_work(mp))
> +		return 0;
> +
> +	if (XFS_IS_REALTIME_INODE(ip)) {
> +		if (sbp->sb_frextents * sbp->sb_rextsize >= wanted)
> +			return 0;
> +		goto free_space;
> +	}
> +
> +	for_each_perag(mp, agno, pag) {
> +		if (pag->pagf_freeblks >= wanted) {
> +			xfs_perag_put(pag);
> +			return 0;
> +		}
> +	}

... really hurty things (e.g. on high AG count fs) on every fallocate()
call, and we have a simple modification to the tests that allows them
to work as they want to on both old and new kernels....

Cheers,

Dave.
Brian Foster June 8, 2021, 11:48 a.m. UTC | #2
On Tue, Jun 08, 2021 at 11:26:05AM +1000, Dave Chinner wrote:
> On Mon, Jun 07, 2021 at 03:25:21PM -0700, Darrick J. Wong wrote:
> > From: Darrick J. Wong <djwong@kernel.org>
> > 
> > Generally speaking, when a user calls fallocate, they're looking to
> > preallocate space in a file in the largest contiguous chunks possible.
> > If free space is low, it's possible that the free space will look
> > unnecessarily fragmented because there are unlinked inodes that are
> > holding on to space that we could allocate.  When this happens,
> > fallocate makes suboptimal allocation decisions for the sake of deleted
> > files, which doesn't make much sense, so scan the filesystem for dead
> > items to delete to try to avoid this.
> > 
> > Note that there are a handful of fstests that fill a filesystem, delete
> > just enough files to allow a single large allocation, and check that
> > fallocate actually gets the allocation.  These tests regress because the
> > test runs fallocate before the inode gc has a chance to run, so add this
> > behavior to maintain as much of the old behavior as possible.
> 
> I don't think this is a good justification for the change. Just
> because the unit tests exploit an undefined behaviour that no
> filesystem actually guarantees to achieve a specific layout, it
> doesn't mean we always have to behave that way.
> 
> For example, many tests used to use reverse sequential writes to
> exploit deficiencies in the allocation algorithms to generate
> fragmented files. We fixed that problem and the tests broke because
> they couldn't fragment files any more.
> 
> Did we reject those changes because the tests broke? No, we didn't
> because the tests were exploiting an observed behaviour rather than
> a guaranteed behaviour.
> 
> So, yeah, "test does X to make Y happen" doesn't mean "X will always
> make Y happen". It just means the test needs to be made more robust,
> or we have to provide a way for the test to trigger the behaviour it
> needs.
> 

Agree on all this..

> Indeed, I think that the way to fix these sorts of issues is to have
> the tests issue a syncfs(2) after they've deleted the inodes and have
> the filesystem run an inodegc flush as part of the sync mechanism.
> 

... but it seems a bit of a leap to equate exploitation of a
historically poorly handled allocation pattern in developer tests with
adding a new requirement (i.e. sync) to achieve optimal behavior of a
fairly common allocation pattern (delete a file, use the space for
something else).

IOW, how to hack around test regressions aside (are the test regressions
actual ENOSPC failures or something else, btw?), what's the impact on
users/workloads that might operate under these conditions? I guess
historically we've always recommended to not consistently operate in
<20% free space conditions, so to some degree there is an expectation
for less than optimal behavior if one decides to constantly bash an fs
into ENOSPC. Then again with large enough files, will/can we put the
filesystem into that state ourselves without any indication to the user?

I kind of wonder, unless/until there's some kind of efficient feedback
between allocation and "pending" free space, whether deferred
inactivation should be an optimization tied to some kind of heuristic
that balances the amount of currently available free space against
pending free space (but I've not combed through the code enough to grok
whether this already does something like that).

Brian

> Then we don't need to do.....
> 
> > +/*
> > + * If the target device (or some part of it) is full enough that it won't be
> > + * able to satisfy the entire request, try to free inactive files to free up
> > + * space.  While it's perfectly fine to fill a preallocation request with a
> > + * bunch of short extents, we prefer to slow down preallocation requests to
> > + * combat long term fragmentation in new file data.
> > + */
> > +static int
> > +xfs_alloc_consolidate_freespace(
> > +	struct xfs_inode	*ip,
> > +	xfs_filblks_t		wanted)
> > +{
> > +	struct xfs_mount	*mp = ip->i_mount;
> > +	struct xfs_perag	*pag;
> > +	struct xfs_sb		*sbp = &mp->m_sb;
> > +	xfs_agnumber_t		agno;
> > +
> > +	if (!xfs_has_inodegc_work(mp))
> > +		return 0;
> > +
> > +	if (XFS_IS_REALTIME_INODE(ip)) {
> > +		if (sbp->sb_frextents * sbp->sb_rextsize >= wanted)
> > +			return 0;
> > +		goto free_space;
> > +	}
> > +
> > +	for_each_perag(mp, agno, pag) {
> > +		if (pag->pagf_freeblks >= wanted) {
> > +			xfs_perag_put(pag);
> > +			return 0;
> > +		}
> > +	}
> 
> ... really hurty things (e.g. on high AG count fs) on every fallocate()
> call, and we have a simple modification to the tests that allows them
> to work as they want to on both old and new kernels....
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
>
Darrick J. Wong June 8, 2021, 3:32 p.m. UTC | #3
On Tue, Jun 08, 2021 at 07:48:05AM -0400, Brian Foster wrote:
> On Tue, Jun 08, 2021 at 11:26:05AM +1000, Dave Chinner wrote:
> > On Mon, Jun 07, 2021 at 03:25:21PM -0700, Darrick J. Wong wrote:
> > > From: Darrick J. Wong <djwong@kernel.org>
> > > 
> > > Generally speaking, when a user calls fallocate, they're looking to
> > > preallocate space in a file in the largest contiguous chunks possible.
> > > If free space is low, it's possible that the free space will look
> > > unnecessarily fragmented because there are unlinked inodes that are
> > > holding on to space that we could allocate.  When this happens,
> > > fallocate makes suboptimal allocation decisions for the sake of deleted
> > > files, which doesn't make much sense, so scan the filesystem for dead
> > > items to delete to try to avoid this.
> > > 
> > > Note that there are a handful of fstests that fill a filesystem, delete
> > > just enough files to allow a single large allocation, and check that
> > > fallocate actually gets the allocation.  These tests regress because the
> > > test runs fallocate before the inode gc has a chance to run, so add this
> > > behavior to maintain as much of the old behavior as possible.
> > 
> > I don't think this is a good justification for the change. Just
> > because the unit tests exploit an undefined behaviour that no
> > filesystem actually guarantees to achieve a specific layout, it
> > doesn't mean we always have to behave that way.
> > 
> > For example, many tests used to use reverse sequential writes to
> > exploit deficiencies in the allocation algorithms to generate
> > fragmented files. We fixed that problem and the tests broke because
> > they couldn't fragment files any more.
> > 
> > Did we reject those changes because the tests broke? No, we didn't
> > because the tests were exploiting an observed behaviour rather than
> > a guaranteed behaviour.
> > 
> > So, yeah, "test does X to make Y happen" doesn't mean "X will always
> > make Y happen". It just means the test needs to be made more robust,
> > or we have to provide a way for the test to trigger the behaviour it
> > needs.
> > 
> 
> Agree on all this..
> 
> > Indeed, I think that the way to fix these sorts of issues is to have
> > the tests issue a syncfs(2) after they've deleted the inodes and have
> > the filesystem run an inodegc flush as part of the sync mechanism.
> > 
> 
> ... but it seems a bit of a leap to equate exploitation of a
> historically poorly handled allocation pattern in developer tests with
> adding a new requirement (i.e. sync) to achieve optimal behavior of a
> fairly common allocation pattern (delete a file, use the space for
> something else).
> 
> IOW, how to hack around test regressions aside (are the test regressions
> actual ENOSPC failures or something else, btw?), what's the impact on

They're not ENOSPC failures, they're fallocate layout tests that assume
that you can format the fs with a stripe alignment, fragment the free
space so that it isn't possible to obtain stripe-aligned blocks, delete
80% of the file(s) you used to fragment the free space, and fallocate a
stripe-aligned extent from the newly freed space in one go after the
last unlink() returns.  Unfortunately, I don't remember which test it
was that tripped over this.

IOWs, tests that confirm the historic behavior of XFS (and presumably
other filesystems) even though we don't guarantee anything about file
layout and never have.  This is a similar issue to the one Dave
complained about a few patches ago, about needing to kick the inodegc
workers so that df reporting behavior stays the same as it has been on
xfs for ages.

We /might/ have figured out a solution to some of that nastiness.

> users/workloads that might operate under these conditions? I guess
> historically we've always recommended to not consistently operate in
> <20% free space conditions, so to some degree there is an expectation
> for less than optimal behavior if one decides to constantly bash an fs
> into ENOSPC. Then again with large enough files, will/can we put the
> filesystem into that state ourselves without any indication to the user?
> 
> I kind of wonder, unless/until there's some kind of efficient feedback
> between allocation and "pending" free space, whether deferred
> inactivation should be an optimization tied to some kind of heuristic
> that balances the amount of currently available free space against
> pending free space (but I've not combed through the code enough to grok
> whether this already does something like that).

Ooh!  You mentioned "efficient feedback", and one sprung immediately to
mind -- if the AG is near full (or above 80% full, or whatever) we
schedule the per-AG inodegc worker immediately instead of delaying it.
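
Maybe something like this sketch -- the per-AG work item and workqueue
are invented plumbing here (pag_inodegc_work would be a struct
delayed_work), and the 80% cutoff is arbitrary:

static void
xfs_inodegc_queue_ag(
	struct xfs_perag	*pag)
{
	struct xfs_mount	*mp = pag->pag_mount;
	unsigned long		delay = msecs_to_jiffies(100);	/* made-up default */

	/* don't delay the gc worker if the AG is more than 80% allocated */
	if (pag->pagf_freeblks < mp->m_sb.sb_agblocks / 5)
		delay = 0;

	queue_delayed_work(mp->m_gc_workqueue, &pag->pag_inodegc_work, delay);
}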

--D

> 
> Brian
> 
> > Then we don't need to do.....
> > 
> > > +/*
> > > + * If the target device (or some part of it) is full enough that it won't be
> > > + * able to satisfy the entire request, try to free inactive files to free up
> > > + * space.  While it's perfectly fine to fill a preallocation request with a
> > > + * bunch of short extents, we prefer to slow down preallocation requests to
> > > + * combat long term fragmentation in new file data.
> > > + */
> > > +static int
> > > +xfs_alloc_consolidate_freespace(
> > > +	struct xfs_inode	*ip,
> > > +	xfs_filblks_t		wanted)
> > > +{
> > > +	struct xfs_mount	*mp = ip->i_mount;
> > > +	struct xfs_perag	*pag;
> > > +	struct xfs_sb		*sbp = &mp->m_sb;
> > > +	xfs_agnumber_t		agno;
> > > +
> > > +	if (!xfs_has_inodegc_work(mp))
> > > +		return 0;
> > > +
> > > +	if (XFS_IS_REALTIME_INODE(ip)) {
> > > +		if (sbp->sb_frextents * sbp->sb_rextsize >= wanted)
> > > +			return 0;
> > > +		goto free_space;
> > > +	}
> > > +
> > > +	for_each_perag(mp, agno, pag) {
> > > +		if (pag->pagf_freeblks >= wanted) {
> > > +			xfs_perag_put(pag);
> > > +			return 0;
> > > +		}
> > > +	}
> > 
> > ... really hurty things (e.g. on high AG count fs) on every fallocate()
> > call, and we have a simple modification to the tests that allows them
> > to work as they want to on both old and new kernels....
> > 
> > Cheers,
> > 
> > Dave.
> > -- 
> > Dave Chinner
> > david@fromorbit.com
> > 
>
Brian Foster June 8, 2021, 4:06 p.m. UTC | #4
On Tue, Jun 08, 2021 at 08:32:04AM -0700, Darrick J. Wong wrote:
> On Tue, Jun 08, 2021 at 07:48:05AM -0400, Brian Foster wrote:
> > On Tue, Jun 08, 2021 at 11:26:05AM +1000, Dave Chinner wrote:
> > > On Mon, Jun 07, 2021 at 03:25:21PM -0700, Darrick J. Wong wrote:
> > > > From: Darrick J. Wong <djwong@kernel.org>
> > > > 
> > > > Generally speaking, when a user calls fallocate, they're looking to
> > > > preallocate space in a file in the largest contiguous chunks possible.
> > > > If free space is low, it's possible that the free space will look
> > > > unnecessarily fragmented because there are unlinked inodes that are
> > > > holding on to space that we could allocate.  When this happens,
> > > > fallocate makes suboptimal allocation decisions for the sake of deleted
> > > > files, which doesn't make much sense, so scan the filesystem for dead
> > > > items to delete to try to avoid this.
> > > > 
> > > > Note that there are a handful of fstests that fill a filesystem, delete
> > > > just enough files to allow a single large allocation, and check that
> > > > fallocate actually gets the allocation.  These tests regress because the
> > > > test runs fallocate before the inode gc has a chance to run, so add this
> > > > behavior to maintain as much of the old behavior as possible.
> > > 
> > > I don't think this is a good justification for the change. Just
> > > because the unit tests exploit an undefined behaviour that no
> > > filesystem actually guarantees to achieve a specific layout, it
> > > doesn't mean we always have to behave that way.
> > > 
> > > For example, many tests used to use reverse sequential writes to
> > > exploit deficiencies in the allocation algorithms to generate
> > > fragmented files. We fixed that problem and the tests broke because
> > > they couldn't fragment files any more.
> > > 
> > > Did we reject those changes because the tests broke? No, we didn't
> > > because the tests were exploiting an observed behaviour rather than
> > > a guaranteed behaviour.
> > > 
> > > So, yeah, "test does X to make Y happen" doesn't mean "X will always
> > > make Y happen". It just means the test needs to be made more robust,
> > > or we have to provide a way for the test to trigger the behaviour it
> > > needs.
> > > 
> > 
> > Agree on all this..
> > 
> > > Indeed, I think that the way to fix these sorts of issues is to have
> > > the tests issue a syncfs(2) after they've deleted the inodes and have
> > > the filesystem run an inodegc flush as part of the sync mechanism.
> > > 
> > 
> > ... but it seems a bit of a leap to equate exploitation of a
> > historically poorly handled allocation pattern in developer tests with
> > adding a new requirement (i.e. sync) to achieve optimal behavior of a
> > fairly common allocation pattern (delete a file, use the space for
> > something else).
> > 
> > IOW, how to hack around test regressions aside (are the test regressions
> > actual ENOSPC failures or something else, btw?), what's the impact on
> 
> They're not ENOSPC failures, they're fallocate layout tests that assume
> that you can format the fs with a stripe alignment, fragment the free
> space so that it isn't possible to obtain stripe-aligned blocks, delete
> 80% of the file(s) you used to fragment the free space, and fallocate a
> stripe-aligned extent from the newly freed space in one go after the
> last unlink() returns.  Unfortunately, I don't remember which test it
> was that tripped over this.
> 
> IOWs, tests that confirm the historic behavior of XFS (and presumably
> other filesystems) even though we don't guarantee anything about file
> layout and never have.  This is a similar issue to the one Dave
> complained about a few patches ago, about needing to kick the inodegc
> workers so that df reporting behavior stays the same as it has been on
> xfs for ages.
> 

Ah, Ok. That's less critical than what I was thinking when I read "test
regression" in the earlier comments.. thanks. To Dave's earlier point, I
think it's reasonable to update tests if they rely on non-guaranteed
behavior as such. I just wanted to make sure that the associated impact
on the user wasn't terrible.

> We /might/ have figured out a solution to some of that nastiness.
> 
> > users/workloads that might operate under these conditions? I guess
> > historically we've always recommended to not consistently operate in
> > <20% free space conditions, so to some degree there is an expectation
> > for less than optimal behavior if one decides to constantly bash an fs
> > into ENOSPC. Then again with large enough files, will/can we put the
> > filesystem into that state ourselves without any indication to the user?
> > 
> > I kind of wonder, unless/until there's some kind of efficient feedback
> > between allocation and "pending" free space, whether deferred
> > inactivation should be an optimization tied to some kind of heuristic
> > that balances the amount of currently available free space against
> > pending free space (but I've not combed through the code enough to grok
> > whether this already does something like that).
> 
> Ooh!  You mentioned "efficient feedback", and one sprung immediately to
> mind -- if the AG is near full (or above 80% full, or whatever) we
> schedule the per-AG inodegc worker immediately instead of delaying it.
> 

Indeed, or not deferring at all in that case (not sure if there's much
difference there anyway?). Either way, something like that might be a
nice incremental step to get the mechanism in place while (hopefully)
avoiding some of these corner case behaviors that might require more
thought and care to get right. TBH, even something more conservative
like "defer/delay when <50% full && fs >= some minimum size" might be a
reasonable starting point. We could always bypass that entirely via
DEBUG=1 to get the full effect. Just a handwavy thought though...
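
I.e., roughly this shape (completely handwavy -- the helper and the
minimum-size constant are made up):

/* made-up floor: don't bother gating gc on small filesystems */
#define XFS_INODEGC_MIN_DBLOCKS		((xfs_rfsblock_t)(1 << 20))

static bool
xfs_inodegc_want_defer(
	struct xfs_mount	*mp)
{
	uint64_t		freesp;

	if (mp->m_sb.sb_dblocks < XFS_INODEGC_MIN_DBLOCKS)
		return false;

	/* only defer inactivation while at least half the fs is free */
	freesp = percpu_counter_read_positive(&mp->m_fdblocks);
	return freesp > mp->m_sb.sb_dblocks / 2;
}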

Brian

> --D
> 
> > 
> > Brian
> > 
> > > Then we don't need to do.....
> > > 
> > > > +/*
> > > > + * If the target device (or some part of it) is full enough that it won't be
> > > > + * able to satisfy the entire request, try to free inactive files to free up
> > > > + * space.  While it's perfectly fine to fill a preallocation request with a
> > > > + * bunch of short extents, we prefer to slow down preallocation requests to
> > > > + * combat long term fragmentation in new file data.
> > > > + */
> > > > +static int
> > > > +xfs_alloc_consolidate_freespace(
> > > > +	struct xfs_inode	*ip,
> > > > +	xfs_filblks_t		wanted)
> > > > +{
> > > > +	struct xfs_mount	*mp = ip->i_mount;
> > > > +	struct xfs_perag	*pag;
> > > > +	struct xfs_sb		*sbp = &mp->m_sb;
> > > > +	xfs_agnumber_t		agno;
> > > > +
> > > > +	if (!xfs_has_inodegc_work(mp))
> > > > +		return 0;
> > > > +
> > > > +	if (XFS_IS_REALTIME_INODE(ip)) {
> > > > +		if (sbp->sb_frextents * sbp->sb_rextsize >= wanted)
> > > > +			return 0;
> > > > +		goto free_space;
> > > > +	}
> > > > +
> > > > +	for_each_perag(mp, agno, pag) {
> > > > +		if (pag->pagf_freeblks >= wanted) {
> > > > +			xfs_perag_put(pag);
> > > > +			return 0;
> > > > +		}
> > > > +	}
> > > 
> > > ... really hurty things (e.g. on high AG count fs) on every fallocate()
> > > call, and we have a simple modification to the tests that allows them
> > > to work as they want to on both old and new kernels....
> > > 
> > > Cheers,
> > > 
> > > Dave.
> > > -- 
> > > Dave Chinner
> > > david@fromorbit.com
> > > 
> > 
>
Dave Chinner June 8, 2021, 9:55 p.m. UTC | #5
On Tue, Jun 08, 2021 at 08:32:04AM -0700, Darrick J. Wong wrote:
> On Tue, Jun 08, 2021 at 07:48:05AM -0400, Brian Foster wrote:
> > users/workloads that might operate under these conditions? I guess
> > historically we've always recommended to not consistently operate in
> > <20% free space conditions, so to some degree there is an expectation
> > for less than optimal behavior if one decides to constantly bash an fs
> > into ENOSPC. Then again with large enough files, will/can we put the
> > filesystem into that state ourselves without any indication to the user?
> > 
> > I kind of wonder, unless/until there's some kind of efficient feedback
> > between allocation and "pending" free space, whether deferred
> > inactivation should be an optimization tied to some kind of heuristic
> > that balances the amount of currently available free space against
> > pending free space (but I've not combed through the code enough to grok
> > whether this already does something like that).
> 
> Ooh!  You mentioned "efficient feedback", and one sprung immediately to
> mind -- if the AG is near full (or above 80% full, or whatever) we
> schedule the per-AG inodegc worker immediately instead of delaying it.

That's what the lowspace thresholds in speculative preallocation are
for...

20% of a 1TB AG is an awful lot of freespace still remaining, and
if someone is asking for a 200GB fallocate(), they are always going
to get some fragmentation on a used, 80% full filesystem regardless
of deferred inode inactivation.

IMO, if you're going to do this, use the same thresholds we already
use to limit preallocation near global ENOSPC and graduate it to be
more severe the closer we get to global ENOSPC...
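
(Those are the mount-time m_low_space[] thresholds:
xfs_set_low_space_thresholds() fills the array with 1%..5% of
sb_dblocks, and xfs_iomap_prealloc_size() already consults it to shrink
speculative preallocation as free space runs out. A graduated gc trigger
could key off the same array -- a sketch, with an invented helper name:)

static bool
xfs_inodegc_near_enospc(
	struct xfs_mount	*mp)
{
	int64_t			freesp;

	freesp = percpu_counter_read_positive(&mp->m_fdblocks);

	/* start flushing inodegc eagerly once below the 5% threshold */
	return freesp < mp->m_low_space[XFS_LOWSP_5_PCNT];
}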

Cheers,

Dave.
Darrick J. Wong June 9, 2021, 12:25 a.m. UTC | #6
On Wed, Jun 09, 2021 at 07:55:42AM +1000, Dave Chinner wrote:
> On Tue, Jun 08, 2021 at 08:32:04AM -0700, Darrick J. Wong wrote:
> > On Tue, Jun 08, 2021 at 07:48:05AM -0400, Brian Foster wrote:
> > > users/workloads that might operate under these conditions? I guess
> > > historically we've always recommended to not consistently operate in
> > > <20% free space conditions, so to some degree there is an expectation
> > > for less than optimal behavior if one decides to constantly bash an fs
> > > into ENOSPC. Then again with large enough files, will/can we put the
> > > filesystem into that state ourselves without any indication to the user?
> > > 
> > > I kind of wonder, unless/until there's some kind of efficient feedback
> > > between allocation and "pending" free space, whether deferred
> > > inactivation should be an optimization tied to some kind of heuristic
> > > that balances the amount of currently available free space against
> > > pending free space (but I've not combed through the code enough to grok
> > > whether this already does something like that).
> > 
> > Ooh!  You mentioned "efficient feedback", and one sprung immediately to
> > mind -- if the AG is near full (or above 80% full, or whatever) we
> > schedule the per-AG inodegc worker immediately instead of delaying it.
> 
> That's what the lowspace thresholds in speculative preallocation are
> for...
> 
> 20% of a 1TB AG is an awful lot of freespace still remaining, and
> if someone is asking for a 200GB fallocate(), they are always going
> to get some fragmentation on a used, 80% full filesystem regardless
> of deferred inode inactivation.
> 
> IMO, if you're going to do this, use the same thresholds we already
> use to limit preallocation near global ENOSPC and graduate it to be
> more severe the closer we get to global ENOSPC...

Ok.  I'll just crib the same 5/4/3/2/1% thresholds as prealloc, then.

--D

> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

Patch

diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
index 997eb5c6e9b4..a1be77fe89d6 100644
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -28,6 +28,8 @@ 
 #include "xfs_icache.h"
 #include "xfs_iomap.h"
 #include "xfs_reflink.h"
+#include "xfs_sb.h"
+#include "xfs_ag.h"
 
 /* Kernel only BMAP related definitions and functions */
 
@@ -767,6 +769,43 @@  xfs_free_eofblocks(
 	return error;
 }
 
+/*
+ * If the target device (or some part of it) is full enough that it won't be
+ * able to satisfy the entire request, try to free inactive files to free up
+ * space.  While it's perfectly fine to fill a preallocation request with a
+ * bunch of short extents, we prefer to slow down preallocation requests to
+ * combat long term fragmentation in new file data.
+ */
+static int
+xfs_alloc_consolidate_freespace(
+	struct xfs_inode	*ip,
+	xfs_filblks_t		wanted)
+{
+	struct xfs_mount	*mp = ip->i_mount;
+	struct xfs_perag	*pag;
+	struct xfs_sb		*sbp = &mp->m_sb;
+	xfs_agnumber_t		agno;
+
+	if (!xfs_has_inodegc_work(mp))
+		return 0;
+
+	if (XFS_IS_REALTIME_INODE(ip)) {
+		if (sbp->sb_frextents * sbp->sb_rextsize >= wanted)
+			return 0;
+		goto free_space;
+	}
+
+	for_each_perag(mp, agno, pag) {
+		if (pag->pagf_freeblks >= wanted) {
+			xfs_perag_put(pag);
+			return 0;
+		}
+	}
+
+free_space:
+	return xfs_inodegc_free_space(mp, NULL);
+}
+
 int
 xfs_alloc_file_space(
 	struct xfs_inode	*ip,
@@ -851,6 +890,10 @@  xfs_alloc_file_space(
 			rblocks = 0;
 		}
 
+		error = xfs_alloc_consolidate_freespace(ip, allocatesize_fsb);
+		if (error)
+			break;
+
 		/*
 		 * Allocate and setup the transaction.
 		 */
diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index a7ca6b988e29..8016e90b7b6d 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -1965,6 +1965,14 @@  xfs_inodegc_start(
 	xfs_inodegc_queue(mp);
 }
 
+/* Are there files waiting for inactivation? */
+bool
+xfs_has_inodegc_work(
+	struct xfs_mount	*mp)
+{
+	return radix_tree_tagged(&mp->m_perag_tree, XFS_ICI_INODEGC_TAG);
+}
+
 /* XFS Inode Cache Walking Code */
 
 /*
diff --git a/fs/xfs/xfs_icache.h b/fs/xfs/xfs_icache.h
index d03d46f83316..1f693e7fe6c8 100644
--- a/fs/xfs/xfs_icache.h
+++ b/fs/xfs/xfs_icache.h
@@ -85,6 +85,7 @@  void xfs_inodegc_flush(struct xfs_mount *mp);
 void xfs_inodegc_stop(struct xfs_mount *mp);
 void xfs_inodegc_start(struct xfs_mount *mp);
 int xfs_inodegc_free_space(struct xfs_mount *mp, struct xfs_icwalk *icw);
+bool xfs_has_inodegc_work(struct xfs_mount *mp);
 
 /*
  * Process all pending inode inactivations immediately (sort of) so that a