
[06/11] xfs: flush eof/cowblocks if we can't reserve quota for file blocks

Message ID 161142795294.2171939.2305516748220731694.stgit@magnolia (mailing list archive)
State Superseded
Series xfs: try harder to reclaim space when we run out

Commit Message

Darrick J. Wong Jan. 23, 2021, 6:52 p.m. UTC
From: Darrick J. Wong <djwong@kernel.org>

If a fs modification (data write, reflink, xattr set, fallocate, etc.)
is unable to reserve enough quota to handle the modification, try
clearing whatever space the filesystem might have been hanging onto in
the hopes of speeding up the filesystem.  The flushing behavior will
become particularly important when we add deferred inode inactivation
because that will increase the amount of space that isn't actively tied
to user data.
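
Each affected call site ends up with a retry loop of roughly this shape
(condensed from the xfs_unmap_extent hunk below; error handling elided):

retry:
	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write, resblks, 0, 0, &tp);
	if (error)
		return error;

	xfs_ilock(ip, XFS_ILOCK_EXCL);
	error = xfs_trans_reserve_quota_nblks(&tp, ip, resblks, 0,
			XFS_QMOPT_RES_REGBLKS, &quota_retry);
	if (error)
		goto out_trans_cancel;
	if (quota_retry)
		goto retry;	/* reservation cancelled tp, dropped ILOCK, ran blockgc */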

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_attr.c |    8 +++++-
 fs/xfs/libxfs/xfs_bmap.c |    8 +++++-
 fs/xfs/xfs_bmap_util.c   |   17 ++++++++++----
 fs/xfs/xfs_iomap.c       |   19 ++++++++++++---
 fs/xfs/xfs_quota.h       |   20 ++++++++++------
 fs/xfs/xfs_reflink.c     |   16 ++++++++++---
 fs/xfs/xfs_trans_dquot.c |   57 ++++++++++++++++++++++++++++++++++++----------
 7 files changed, 108 insertions(+), 37 deletions(-)

Comments

Christoph Hellwig Jan. 24, 2021, 9:39 a.m. UTC | #1
> +	/* We only allow one retry for EDQUOT/ENOSPC. */
> +	if (*retry || (error != -EDQUOT && error != -ENOSPC)) {
> +		*retry = false;
> +		return error;
> +	}

> +	/* Release resources, prepare for scan. */
> +	xfs_trans_cancel(*tpp);
> +	*tpp = NULL;
> +	xfs_iunlock(ip, XFS_ILOCK_EXCL);
> +
> +	/* Try to free some quota for this file's dquots. */
> +	*retry = true;
> +	xfs_blockgc_free_quota(ip, 0);
> +	return 0;

I still have grave reservations about this calling convention.  And if
you just remove the unlock and the call to xfs_blockgc_free_quota here
we don't require a whole lot of boilerplate code in the callers while
making the code possible to reason about for a mere human.
Brian Foster Jan. 25, 2021, 6:16 p.m. UTC | #2
On Sun, Jan 24, 2021 at 09:39:53AM +0000, Christoph Hellwig wrote:
> > +	/* We only allow one retry for EDQUOT/ENOSPC. */
> > +	if (*retry || (error != -EDQUOT && error != -ENOSPC)) {
> > +		*retry = false;
> > +		return error;
> > +	}
> 
> > +	/* Release resources, prepare for scan. */
> > +	xfs_trans_cancel(*tpp);
> > +	*tpp = NULL;
> > +	xfs_iunlock(ip, XFS_ILOCK_EXCL);
> > +
> > +	/* Try to free some quota for this file's dquots. */
> > +	*retry = true;
> > +	xfs_blockgc_free_quota(ip, 0);
> > +	return 0;
> 
> I still have grave reservations about this calling convention.  And if
> you just remove the unlock and the call to xfs_blockgc_free_quota here
> we don't require a whole lot of boilerplate code in the callers while
> making the code possible to reason about for a mere human.
> 

I agree that the retry pattern is rather odd. I'm curious, is there a
specific reason this scanning task has to execute outside of transaction
context in the first place? Assuming it does because the underlying work
may involve more transactions or whatnot, I'm wondering if this logic
could be buried further down in the transaction allocation path.

For example, if we passed the quota reservation and inode down into a
new variant of xfs_trans_alloc(), it could acquire the ilock and attempt
the quota reservation as a final step (to avoid adding an extra
unconditional ilock cycle). If quota res fails, iunlock and release the
log res internally and perform the scan. From there, perhaps we could
retry the quota reservation immediately without logres or the ilock by
saving references to the dquots, and then only reacquire logres/ilock on
success..? Just thinking out loud so that might require further
thought...
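
Purely as a sketch of the shape I have in mind (the name, signature, and
retry-the-whole-allocation approach are illustrative assumptions, not an
existing interface):

int
xfs_trans_alloc_quota_sketch(
	struct xfs_mount	*mp,
	struct xfs_trans_res	*resv,
	struct xfs_inode	*ip,
	unsigned int		dblocks,
	unsigned int		qflags,
	struct xfs_trans	**tpp)
{
	struct xfs_trans	*tp;
	bool			retried = false;
	int			error;

retry:
	error = xfs_trans_alloc(mp, resv, dblocks, 0, 0, &tp);
	if (error)
		return error;

	xfs_ilock(ip, XFS_ILOCK_EXCL);
	error = xfs_trans_reserve_quota_nblks(tp, ip, dblocks, 0, qflags);
	if ((error == -EDQUOT || error == -ENOSPC) && !retried) {
		/* drop everything, run the blockgc scan, then try once more */
		xfs_trans_cancel(tp);
		xfs_iunlock(ip, XFS_ILOCK_EXCL);
		xfs_blockgc_free_quota(ip, 0);
		retried = true;
		goto retry;
	}
	if (error) {
		xfs_trans_cancel(tp);
		xfs_iunlock(ip, XFS_ILOCK_EXCL);
		return error;
	}

	*tpp = tp;
	return 0;
}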

Brian
Darrick J. Wong Jan. 25, 2021, 6:57 p.m. UTC | #3
On Mon, Jan 25, 2021 at 01:16:23PM -0500, Brian Foster wrote:
> On Sun, Jan 24, 2021 at 09:39:53AM +0000, Christoph Hellwig wrote:
> > > +	/* We only allow one retry for EDQUOT/ENOSPC. */
> > > +	if (*retry || (error != -EDQUOT && error != -ENOSPC)) {
> > > +		*retry = false;
> > > +		return error;
> > > +	}
> > 
> > > +	/* Release resources, prepare for scan. */
> > > +	xfs_trans_cancel(*tpp);
> > > +	*tpp = NULL;
> > > +	xfs_iunlock(ip, XFS_ILOCK_EXCL);
> > > +
> > > +	/* Try to free some quota for this file's dquots. */
> > > +	*retry = true;
> > > +	xfs_blockgc_free_quota(ip, 0);
> > > +	return 0;
> > 
> > I still have grave reservations about this calling convention.  And if
> > you just remove the unlock and the call to xfs_blockgc_free_quota here
> > we don't require a whole lot of boilerplate code in the callers while
> > making the code possible to reason about for a mere human.
> > 
> 
> I agree that the retry pattern is rather odd. I'm curious, is there a
> specific reason this scanning task has to execute outside of transaction
> context in the first place?

Dave didn't like the open-coded retry and told me to shrink the call
sites to:

	error = xfs_trans_reserve_quota(...);
	if (error)
		goto out_trans_cancel;
	if (quota_retry)
		goto retry;

So here we are, slowly putting things almost all the way back to where
they were originally.  Now I have a little utility function:

/*
 * Cancel a transaction and try to clear some space so that we can
 * reserve some quota.  The caller must hold the ILOCK; when this
 * function returns, the transaction will be cancelled and the ILOCK
 * will have been released.
 */
int
xfs_trans_cancel_qretry(
	struct xfs_trans	*tp,
	struct xfs_inode	*ip)
{
	ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL));

	xfs_trans_cancel(tp);
	xfs_iunlock(ip, XFS_ILOCK_EXCL);

	return xfs_blockgc_free_quota(ip, 0);
}

Which I guess reduces the amount of call site boilerplate from 4 lines
to two, only now I've spent half of last week on this.
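
For illustration, a call site using that helper would look something like
this (sketch only; the retried flag and goto labels are assumed):

	error = xfs_trans_reserve_quota_nblks(tp, ip, blks, 0, qflags);
	if ((error == -EDQUOT || error == -ENOSPC) && !retried) {
		retried = true;
		error = xfs_trans_cancel_qretry(tp, ip);
		if (error)
			return error;
		goto retry;
	}
	if (error)
		goto out_trans_cancel;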

> Assuming it does because the underlying work
> may involve more transactions or whatnot, I'm wondering if this logic
> could be buried further down in the transaction allocation path.
> 
> For example, if we passed the quota reservation and inode down into a
> new variant of xfs_trans_alloc(), it could acquire the ilock and attempt
> the quota reservation as a final step (to avoid adding an extra
> unconditional ilock cycle). If quota res fails, iunlock and release the
> log res internally and perform the scan. From there, perhaps we could
> retry the quota reservation immediately without logres or the ilock by
> saving references to the dquots, and then only reacquire logres/ilock on
> success..? Just thinking out loud so that might require further
> thought...

Yes, that's certainly possible, and probably a good design goal to have
a xfs_trans_alloc_quota(tres, ip, whichfork, nblks, &tp) that one could
call to reserve a transaction, lock the inode, and reserve the
appropriate amounts of quota to handle mapping nblks into an inode fork.

However, there are complications that don't make this a trivial switch:

1. Reflink and (new) swapext don't actually know how many blocks they
need to reserve until after they've grabbed the two ILOCKs, which means
that the wrapper is of no use here.

2. For the remaining quota reservation callsites, you have to deal with
the bmap code that computes qblocks for reservation against the realtime
device.  This is opening a huge can of worms because:

3. Realtime and quota are not supported, which means that none of that
code ever gets properly QA'd.  It would be totally stupid to rework most
of the quota reservation callsites and still leave that logic bomb.
This gigantic piece of technical debt needs to be paid off, either by
fixing the functionality and getting it under test, or by dropping rt
quota support completely and officially.

My guess is that fixing rt quota is probably going to take 10-15
patches, and doing more small cleanups to convert the callsites will be
another 10 or so.

4. We're already past -rc5, and what started as two cleanup patchsets of
13 is now four patchsets of 27 patches, and I /really/ would just like
to get these patches merged without expanding the scope of work even
further.

--D

> Brian
>
Brian Foster Jan. 26, 2021, 1:26 p.m. UTC | #4
On Mon, Jan 25, 2021 at 10:57:35AM -0800, Darrick J. Wong wrote:
> On Mon, Jan 25, 2021 at 01:16:23PM -0500, Brian Foster wrote:
> > On Sun, Jan 24, 2021 at 09:39:53AM +0000, Christoph Hellwig wrote:
> > > > +	/* We only allow one retry for EDQUOT/ENOSPC. */
> > > > +	if (*retry || (error != -EDQUOT && error != -ENOSPC)) {
> > > > +		*retry = false;
> > > > +		return error;
> > > > +	}
> > > 
> > > > +	/* Release resources, prepare for scan. */
> > > > +	xfs_trans_cancel(*tpp);
> > > > +	*tpp = NULL;
> > > > +	xfs_iunlock(ip, XFS_ILOCK_EXCL);
> > > > +
> > > > +	/* Try to free some quota for this file's dquots. */
> > > > +	*retry = true;
> > > > +	xfs_blockgc_free_quota(ip, 0);
> > > > +	return 0;
> > > 
> > > I still have grave reservations about this calling convention.  And if
> > > you just remove the unlock and the call to xfs_blockgc_free_quota here
> > > we don't require a whole lot of boilerplate code in the callers while
> > > making the code possible to reason about for a mere human.
> > > 
> > 
> > I agree that the retry pattern is rather odd. I'm curious, is there a
> > specific reason this scanning task has to execute outside of transaction
> > context in the first place?
> 
> Dave didn't like the open-coded retry and told me to shrink the call
> sites to:
> 
> 	error = xfs_trans_reserve_quota(...);
> 	if (error)
> 		goto out_trans_cancel;
> 	if (quota_retry)
> 		goto retry;
> 
> So here we are, slowly putting things almost all the way back to where
> they were originally.  Now I have a little utility function:
> 
> /*
>  * Cancel a transaction and try to clear some space so that we can
>  * reserve some quota.  The caller must hold the ILOCK; when this
>  * function returns, the transaction will be cancelled and the ILOCK
>  * will have been released.
>  */
> int
> xfs_trans_cancel_qretry(
> 	struct xfs_trans	*tp,
> 	struct xfs_inode	*ip)
> {
> 	ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL));
> 
> 	xfs_trans_cancel(tp);
> 	xfs_iunlock(ip, XFS_ILOCK_EXCL);
> 
> 	return xfs_blockgc_free_quota(ip, 0);
> }
> 
> Which I guess reduces the amount of call site boilerplate from 4 lines
> to two, only now I've spent half of last week on this.
> 
> > Assuming it does because the underlying work
> > may involve more transactions or whatnot, I'm wondering if this logic
> > could be buried further down in the transaction allocation path.
> > 
> > For example, if we passed the quota reservation and inode down into a
> > new variant of xfs_trans_alloc(), it could acquire the ilock and attempt
> > the quota reservation as a final step (to avoid adding an extra
> > unconditional ilock cycle). If quota res fails, iunlock and release the
> > log res internally and perform the scan. From there, perhaps we could
> > retry the quota reservation immediately without logres or the ilock by
> > saving references to the dquots, and then only reacquire logres/ilock on
> > success..? Just thinking out loud so that might require further
> > thought...
> 
> Yes, that's certainly possible, and probably a good design goal to have
> a xfs_trans_alloc_quota(tres, ip, whichfork, nblks, &tp) that one could
> call to reserve a transaction, lock the inode, and reserve the
> appropriate amounts of quota to handle mapping nblks into an inode fork.
> 
> However, there are complications that don't make this a trivial switch:
> 
> 1. Reflink and (new) swapext don't actually know how many blocks they
> need to reserve until after they've grabbed the two ILOCKs, which means
> that the wrapper is of no use here.
> 

IMO, it's preferable to define a clean/usable interface if we can find
one that covers the majority of use cases and have to open code a
handful of outliers than define a cumbersome interface that must be used
everywhere to accommodate the outliers. Perhaps we'll find cleaner ways
to deal with open coded outliers over time..? Perhaps (at least in the
reflink case) we could attempt a worst case quota reservation with the
helper, knowing that it will have invoked the scan on -EDQUOT, and then
fall back to a more accurate open-coded xfs_trans_reserve_() call (that
will no longer fall into retry loops on failure)..?

> 2. For the remaining quota reservation callsites, you have to deal with
> the bmap code that computes qblocks for reservation against the realtime
> device.  This is opening a huge can of worms because:
> 
> 3. Realtime and quota are not supported, which means that none of that
> code ever gets properly QA'd.  It would be totally stupid to rework most
> of the quota reservation callsites and still leave that logic bomb.
> This gigantic piece of technical debt needs to be paid off, either by
> fixing the functionality and getting it under test, or by dropping rt
> quota support completely and officially.
> 

I'm not following what you're referring to here. Can you point to
examples in the code for reference, please?

Brian

> My guess is that fixing rt quota is probably going to take 10-15
> patches, and doing more small cleanups to convert the callsites will be
> another 10 or so.
> 
> 4. We're already past -rc5, and what started as two cleanup patchsets of
> 13 is now four patchsets of 27 patches, and I /really/ would just like
> to get these patches merged without expanding the scope of work even
> further.
> 
> --D
> 
> > Brian
> > 
>
Darrick J. Wong Jan. 26, 2021, 9:12 p.m. UTC | #5
On Tue, Jan 26, 2021 at 08:26:00AM -0500, Brian Foster wrote:
> On Mon, Jan 25, 2021 at 10:57:35AM -0800, Darrick J. Wong wrote:
> > On Mon, Jan 25, 2021 at 01:16:23PM -0500, Brian Foster wrote:
> > > On Sun, Jan 24, 2021 at 09:39:53AM +0000, Christoph Hellwig wrote:
> > > > > +	/* We only allow one retry for EDQUOT/ENOSPC. */
> > > > > +	if (*retry || (error != -EDQUOT && error != -ENOSPC)) {
> > > > > +		*retry = false;
> > > > > +		return error;
> > > > > +	}
> > > > 
> > > > > +	/* Release resources, prepare for scan. */
> > > > > +	xfs_trans_cancel(*tpp);
> > > > > +	*tpp = NULL;
> > > > > +	xfs_iunlock(ip, XFS_ILOCK_EXCL);
> > > > > +
> > > > > +	/* Try to free some quota for this file's dquots. */
> > > > > +	*retry = true;
> > > > > +	xfs_blockgc_free_quota(ip, 0);
> > > > > +	return 0;
> > > > 
> > > > I still have grave reservations about this calling convention.  And if
> > > > you just remove the unlock and the call to xfs_blockgc_free_quota here
> > > > we don't require a whole lot of boilerplate code in the callers while
> > > > making the code possible to reason about for a mere human.
> > > > 
> > > 
> > > I agree that the retry pattern is rather odd. I'm curious, is there a
> > > specific reason this scanning task has to execute outside of transaction
> > > context in the first place?
> > 
> > Dave didn't like the open-coded retry and told me to shrink the call
> > sites to:
> > 
> > 	error = xfs_trans_reserve_quota(...);
> > 	if (error)
> > 		goto out_trans_cancel;
> > 	if (quota_retry)
> > 		goto retry;
> > 
> > So here we are, slowly putting things almost all the way back to where
> > they were originally.  Now I have a little utility function:
> > 
> > /*
> >  * Cancel a transaction and try to clear some space so that we can
> >  * reserve some quota.  The caller must hold the ILOCK; when this
> >  * function returns, the transaction will be cancelled and the ILOCK
> >  * will have been released.
> >  */
> > int
> > xfs_trans_cancel_qretry(
> > 	struct xfs_trans	*tp,
> > 	struct xfs_inode	*ip)
> > {
> > 	ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL));
> > 
> > 	xfs_trans_cancel(tp);
> > 	xfs_iunlock(ip, XFS_ILOCK_EXCL);
> > 
> > 	return xfs_blockgc_free_quota(ip, 0);
> > }
> > 
> > Which I guess reduces the amount of call site boilerplate from 4 lines
> > to two, only now I've spent half of last week on this.
> > 
> > > Assuming it does because the underlying work
> > > may involve more transactions or whatnot, I'm wondering if this logic
> > > could be buried further down in the transaction allocation path.
> > > 
> > > For example, if we passed the quota reservation and inode down into a
> > > new variant of xfs_trans_alloc(), it could acquire the ilock and attempt
> > > the quota reservation as a final step (to avoid adding an extra
> > > unconditional ilock cycle). If quota res fails, iunlock and release the
> > > log res internally and perform the scan. From there, perhaps we could
> > > retry the quota reservation immediately without logres or the ilock by
> > > saving references to the dquots, and then only reacquire logres/ilock on
> > > success..? Just thinking out loud so that might require further
> > > thought...
> > 
> > Yes, that's certainly possible, and probably a good design goal to have
> > a xfs_trans_alloc_quota(tres, ip, whichfork, nblks, &tp) that one could
> > call to reserve a transaction, lock the inode, and reserve the
> > appropriate amounts of quota to handle mapping nblks into an inode fork.
> > 
> > However, there are complications that don't make this a trivial switch:
> > 
> > 1. Reflink and (new) swapext don't actually know how many blocks they
> > need to reserve until after they've grabbed the two ILOCKs, which means
> > that the wrapper is of no use here.
> > 
> 
> IMO, it's preferable to define a clean/usable interface if we can find
> one that covers the majority of use cases and have to open code a
> handful of outliers than define a cumbersome interface that must be used
> everywhere to accommodate the outliers. Perhaps we'll find cleaner ways
> to deal with open coded outliers over time..?

Sure, we might, but let's not delay this cleanup, since these are the
last two pieces that I need to get merged before I can send out deferred
inode inactivation for review.  Deferred inode inactivation adds yet
another button that we can push to reclaim free space when something hits
EDQUOT/ENOSPC.

FWIW I did start down the path of building a better interface last week,
but quickly became mired in (1) how do we allocate rt quota with a new
interface and (2) do we care?  And then I started looking at what rt
allocations do wrt quota and decided that fixing that (or even removing
it) would be an entire patchset.

Hence I'm trying to constrain this patchset to updating the existing
callsites to do the scan+retry, and no more.

> Perhaps (at least in the
> reflink case) we could attempt a worst case quota reservation with the
> helper, knowing that it will have invoked the scan on -EDQUOT, and then
> fall back to a more accurate open-coded xfs_trans_reserve_() call (that
> will no longer fall into retry loops on failure)..?

Making a worst case reservation and backing off creates more ways for
things to fail unnecessarily.

For a remap operation, the worst case is if the source file range has an
allocated mapping and the destination file range is a hole, because we
have to increment quota by the size of that allocated mapping.  If we
run out of quota we'll have to flush the fs and try again.  If we fail
the quota reservation a second time, the whole operation fails.

This is not good behavior for all the other cases -- if both mappings
are holes or both allocated, we just failed an operation that would have
made no net change to the quota allocations.  If the source file range
is a hole and the destination range is allocated, we actually would have
/decreased/ the quota usage, but instead we fail with EDQUOT.
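
To restate the case analysis (summary only, not code from the patch):

	source range    destination range    net quota change
	allocated       hole                 increases by the mapping size
	hole            allocated            decreases by the mapping size
	hole            hole                 no change
	allocated       allocated            no change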

Right now the remap code handles those cases just fine, at a cost of
open coded logic.

> > 2. For the remaining quota reservation callsites, you have to deal with
> > the bmap code that computes qblocks for reservation against the realtime
> > device.  This is opening a huge can of worms because:
> > 
> > 3. Realtime and quota are not supported, which means that none of that
> > code ever gets properly QA'd.  It would be totally stupid to rework most
> > of the quota reservation callsites and still leave that logic bomb.
> > This gigantic piece of technical debt needs to be paid off, either by
> > fixing the functionality and getting it under test, or by dropping rt
> > quota support completely and officially.
> > 
> 
> I'm not following what you're referring to here. Can you point to
> examples in the code for reference, please?

If you format a filesystem with realtime and mount it with -oquota, xfs
will ignore the 'quota' mount option completely.  Some code exists to
do rt quota accounting (xfs_alloc_file_space and xfs_iomap_write_direct
are examples) but since we never allow rt+quota, the code coverage on
that is 0%.

I've also noticed that those functions seem to have this odd behavior
where for rt files, they'll reserve quota for the allocated blocks
themselves but not the bmbt blocks; but for regular files, they reserve
quota for both the allocated blocks and the bmbt blocks.  The quota code
makes various allowances for transactions that try to commit quota count
updates but have zero quota reservation attached to the transaction,
which I /think/ could have been an attempt to work around that quirk.
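
For example, the reservation sizing in xfs_iomap_write_direct is roughly
this (paraphrased, not a verbatim quote of the function):

	if (XFS_IS_REALTIME_INODE(ip)) {
		/* rt: quota covers only the data blocks themselves */
		qblocks = resaligned;
		resrtextents = qblocks / mp->m_sb.sb_rextsize;
		resblks = XFS_DIOSTRAT_SPACE_RES(mp, 0);
		quota_flag = XFS_QMOPT_RES_RTBLKS;
	} else {
		/* regular: quota covers the data blocks and the bmbt blocks */
		resblks = qblocks = XFS_DIOSTRAT_SPACE_RES(mp, count_fsb);
		quota_flag = XFS_QMOPT_RES_REGBLKS;
	}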

I also just noticed that xfs_bmapi_reserve_delalloc only works with
non-rt files.  I guess that's fine since rt files don't use the delalloc
mechanism anyway (and I think the  reason they don't is that xfs can't
currently do write-around to handle rextsize>1 filesystems) but that's
another mess to sort.

(FWIW I'm slowly working through all those rt issues as part of maturing
the rt reflink patchset, but that's at the far end of my dev tree...)

--D

> 
> Brian
> 
> > My guess is that fixing rt quota is probably going to take 10-15
> > patches, and doing more small cleanups to convert the callsites will be
> > another 10 or so.
> > 
> > 4. We're already past -rc5, and what started as two cleanup patchsets of
> > 13 is now four patchsets of 27 patches, and I /really/ would just like
> > to get these patches merged without expanding the scope of work even
> > further.
> > 
> > --D
> > 
> > > Brian
> > > 
> > 
>
Brian Foster Jan. 27, 2021, 2:19 p.m. UTC | #6
On Tue, Jan 26, 2021 at 01:12:59PM -0800, Darrick J. Wong wrote:
> On Tue, Jan 26, 2021 at 08:26:00AM -0500, Brian Foster wrote:
> > On Mon, Jan 25, 2021 at 10:57:35AM -0800, Darrick J. Wong wrote:
> > > On Mon, Jan 25, 2021 at 01:16:23PM -0500, Brian Foster wrote:
> > > > On Sun, Jan 24, 2021 at 09:39:53AM +0000, Christoph Hellwig wrote:
> > > > > > +	/* We only allow one retry for EDQUOT/ENOSPC. */
> > > > > > +	if (*retry || (error != -EDQUOT && error != -ENOSPC)) {
> > > > > > +		*retry = false;
> > > > > > +		return error;
> > > > > > +	}
> > > > > 
> > > > > > +	/* Release resources, prepare for scan. */
> > > > > > +	xfs_trans_cancel(*tpp);
> > > > > > +	*tpp = NULL;
> > > > > > +	xfs_iunlock(ip, XFS_ILOCK_EXCL);
> > > > > > +
> > > > > > +	/* Try to free some quota for this file's dquots. */
> > > > > > +	*retry = true;
> > > > > > +	xfs_blockgc_free_quota(ip, 0);
> > > > > > +	return 0;
> > > > > 
> > > > > I still have grave reservations about this calling convention.  And if
> > > > > you just remove the unlock and the call to xfs_blockgc_free_quota here
> > > > > we don't require a whole lot of boilerplate code in the callers while
> > > > > making the code possible to reason about for a mere human.
> > > > > 
> > > > 
> > > > I agree that the retry pattern is rather odd. I'm curious, is there a
> > > > specific reason this scanning task has to execute outside of transaction
> > > > context in the first place?
> > > 
> > > Dave didn't like the open-coded retry and told me to shrink the call
> > > sites to:
> > > 
> > > 	error = xfs_trans_reserve_quota(...);
> > > 	if (error)
> > > 		goto out_trans_cancel;
> > > 	if (quota_retry)
> > > 		goto retry;
> > > 
> > > So here we are, slowly putting things almost all the way back to where
> > > they were originally.  Now I have a little utility function:
> > > 
> > > /*
> > >  * Cancel a transaction and try to clear some space so that we can
> > >  * reserve some quota.  The caller must hold the ILOCK; when this
> > >  * function returns, the transaction will be cancelled and the ILOCK
> > >  * will have been released.
> > >  */
> > > int
> > > xfs_trans_cancel_qretry(
> > > 	struct xfs_trans	*tp,
> > > 	struct xfs_inode	*ip)
> > > {
> > > 	ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL));
> > > 
> > > 	xfs_trans_cancel(tp);
> > > 	xfs_iunlock(ip, XFS_ILOCK_EXCL);
> > > 
> > > 	return xfs_blockgc_free_quota(ip, 0);
> > > }
> > > 
> > > Which I guess reduces the amount of call site boilerplate from 4 lines
> > > to two, only now I've spent half of last week on this.
> > > 
> > > > Assuming it does because the underlying work
> > > > may involve more transactions or whatnot, I'm wondering if this logic
> > > > could be buried further down in the transaction allocation path.
> > > > 
> > > > For example, if we passed the quota reservation and inode down into a
> > > > new variant of xfs_trans_alloc(), it could acquire the ilock and attempt
> > > > the quota reservation as a final step (to avoid adding an extra
> > > > unconditional ilock cycle). If quota res fails, iunlock and release the
> > > > log res internally and perform the scan. From there, perhaps we could
> > > > retry the quota reservation immediately without logres or the ilock by
> > > > saving references to the dquots, and then only reacquire logres/ilock on
> > > > success..? Just thinking out loud so that might require further
> > > > thought...
> > > 
> > > Yes, that's certainly possible, and probably a good design goal to have
> > > a xfs_trans_alloc_quota(tres, ip, whichfork, nblks, &tp) that one could
> > > call to reserve a transaction, lock the inode, and reserve the
> > > appropriate amounts of quota to handle mapping nblks into an inode fork.
> > > 
> > > However, there are complications that don't make this a trivial switch:
> > > 
> > > 1. Reflink and (new) swapext don't actually know how many blocks they
> > > need to reserve until after they've grabbed the two ILOCKs, which means
> > > that the wrapper is of no use here.
> > > 
> > 
> > IMO, it's preferable to define a clean/usable interface if we can find
> > one that covers the majority of use cases and have to open code a
> > handful of outliers than define a cumbersome interface that must be used
> > everywhere to accommodate the outliers. Perhaps we'll find cleaner ways
> > to deal with open coded outliers over time..?
> 
> Sure, we might, but let's not delay this cleanup, since these are the
> last two pieces that I need to get merged before I can send out deferred
> inode inactivation for review.  Deferred inode inactivation adds yet
> another button that we can push to reclaim free space when something hits
> EDQUOT/ENOSPC.
> 

Not sure I see the need to rush in a particular interface that multiple
reviewers have expressed reservations about just because there are more
patches coming built on top. That just creates more churn and cleanup
work for later, which means more review/test cycles and more work
indirectly for people who might have to deal with backports, etc. I'm
not dead set against what this patch does if there's no better
alternative, but IMO it's better to get it right than get it fast so we
should at least give fair consideration to some alternatives if ideas
are being presented.

> FWIW I did start down the path of building a better interface last week,
> but quickly became mired in (1) how do we allocate rt quota with a new
> interface and (2) do we care?  And then I started looking at what rt
> allocations do wrt quota and decided that fixing that (or even removing
> it) would be an entire patchset.
> 
> Hence I'm trying to constrain this patchset to updating the existing
> callsites to do the scan+retry, and no more.
> 

Ok, well I think that helps me understand the situation, but I'm still
not really following if/how that conflicts with any of the previous
suggestions (which is why I was asking for example code to consider).

> > Perhaps (at least in the
> > reflink case) we could attempt a worst case quota reservation with the
> > helper, knowing that it will have invoked the scan on -EDQUOT, and then
> > fall back to a more accurate open-coded xfs_trans_reserve_() call (that
> > will no longer fall into retry loops on failure)..?
> 
> Making a worst case reservation and backing off creates more ways for
> things to fail unnecessarily.
> 
> For a remap operation, the worst case is if the source file range has an
> allocated mapping and the destination file range is a hole, because we
> have to increment quota by the size of that allocated mapping.  If we
> run out of quota we'll have to flush the fs and try again.  If we fail
> the quota reservation a second time, the whole operation fails.
> 

Right...

> This is not good behavior for all the other cases -- if both mappings
> are holes or both allocated, we just failed an operation that would have
> made no net change to the quota allocations.  If the source file range
> is a hole and the destination range is allocated, we actually would have
> /decreased/ the quota usage, but instead we fail with EDQUOT.
> 

But that wasn't the suggestion. The suggestion was to do something along
the lines of the following in the reflink case:

	error = xfs_trans_alloc_quota(..., ip, resblks, worstqres, ...);
	if (error == -EDQUOT) {
		worstqres = 0;
		error = xfs_trans_alloc(..., resblks, ...);
		...
	}

	if (!worstqres) {
		worstqres = <calculate actual quota res>
		error = xfs_trans_reserve_quota(...);
		if (error)
			return error;
	}

	...

... where the initial transaction allocation failure would have failed
on the worst case qres, but also would have run the internal reclaim
scan and retried before it returned. Therefore, we could still attempt
the open coded non-worst case reservation and either proceed or return
-EDQUOT with generally similar scan->retry semantics as this patch, just
without the open coded goto loops everywhere we attach quota reservation
to a transaction. This of course assumes that the
xfs_trans_alloc_quota() interface works well enough for the majority of
other cases without need for open coded reservation...

> Right now the remap code handles those cases just fine, at a cost of
> open coded logic.
> 
> > > 2. For the remaining quota reservation callsites, you have to deal with
> > > the bmap code that computes qblocks for reservation against the realtime
> > > device.  This is opening a huge can of worms because:
> > > 
> > > 3. Realtime and quota are not supported, which means that none of that
> > > code ever gets properly QA'd.  It would be totally stupid to rework most
> > > of the quota reservation callsites and still leave that logic bomb.
> > > This gigantic piece of technical debt needs to be paid off, either by
> > > fixing the functionality and getting it under test, or by dropping rt
> > > quota support completely and officially.
> > > 
> > 
> > I'm not following what you're referring to here. Can you point to
> > examples in the code for reference, please?
> 
> If you format a filesystem with realtime and mount it with -oquota, xfs
> will ignore the 'quota' mount option completely.  Some code exists to
> do rt quota accounting (xfs_alloc_file_space and xfs_iomap_write_direct
> are examples) but since we never allow rt+quota, the code coverage on
> that is 0%.
> 

Ok, but how is that any different for what this patch does?

Brian

> I've also noticed that those functions seem to have this odd behavior
> where for rt files, they'll reserve quota for the allocated blocks
> themselves but not the bmbt blocks; but for regular files, they reserve
> quota for both the allocated blocks and the bmbt blocks.  The quota code
> makes various allowances for transactions that try to commit quota count
> updates but have zero quota reservation attached to the transaction,
> which I /think/ could have been an attempt to work around that quirk.
> 
> I also just noticed that xfs_bmapi_reserve_delalloc only works with
> non-rt files.  I guess that's fine since rt files don't use the delalloc
> mechanism anyway (and I think the  reason they don't is that xfs can't
> currently do write-around to handle rextsize>1 filesystems) but that's
> another mess to sort.
> 
> (FWIW I'm slowly working through all those rt issues as part of maturing
> the rt reflink patchset, but that's at the far end of my dev tree...)
> 
> --D
> 
> > 
> > Brian
> > 
> > > My guess is that fixing rt quota is probably going to take 10-15
> > > patches, and doing more small cleanups to convert the callsites will be
> > > another 10 or so.
> > > 
> > > 4. We're already past -rc5, and what started as two cleanup patchsets of
> > > 13 is now four patchsets of 27 patches, and I /really/ would just like
> > > to get these patches merged without expanding the scope of work even
> > > further.
> > > 
> > > --D
> > > 
> > > > Brian
> > > > 
> > > 
> > 
>
Darrick J. Wong Jan. 27, 2021, 5:19 p.m. UTC | #7
On Wed, Jan 27, 2021 at 09:19:10AM -0500, Brian Foster wrote:
> On Tue, Jan 26, 2021 at 01:12:59PM -0800, Darrick J. Wong wrote:
> > On Tue, Jan 26, 2021 at 08:26:00AM -0500, Brian Foster wrote:
> > > On Mon, Jan 25, 2021 at 10:57:35AM -0800, Darrick J. Wong wrote:
> > > > On Mon, Jan 25, 2021 at 01:16:23PM -0500, Brian Foster wrote:
> > > > > On Sun, Jan 24, 2021 at 09:39:53AM +0000, Christoph Hellwig wrote:
> > > > > > > +	/* We only allow one retry for EDQUOT/ENOSPC. */
> > > > > > > +	if (*retry || (error != -EDQUOT && error != -ENOSPC)) {
> > > > > > > +		*retry = false;
> > > > > > > +		return error;
> > > > > > > +	}
> > > > > > 
> > > > > > > +	/* Release resources, prepare for scan. */
> > > > > > > +	xfs_trans_cancel(*tpp);
> > > > > > > +	*tpp = NULL;
> > > > > > > +	xfs_iunlock(ip, XFS_ILOCK_EXCL);
> > > > > > > +
> > > > > > > +	/* Try to free some quota for this file's dquots. */
> > > > > > > +	*retry = true;
> > > > > > > +	xfs_blockgc_free_quota(ip, 0);
> > > > > > > +	return 0;
> > > > > > 
> > > > > > I still have grave reservations about this calling convention.  And if
> > > > > > you just remove the unlock and the call to xfs_blockgc_free_quota here
> > > > > > we don't require a whole lot of boilerplate code in the callers while
> > > > > > making the code possible to reason about for a mere human.
> > > > > > 
> > > > > 
> > > > > I agree that the retry pattern is rather odd. I'm curious, is there a
> > > > > specific reason this scanning task has to execute outside of transaction
> > > > > context in the first place?
> > > > 
> > > > Dave didn't like the open-coded retry and told me to shrink the call
> > > > sites to:
> > > > 
> > > > 	error = xfs_trans_reserve_quota(...);
> > > > 	if (error)
> > > > 		goto out_trans_cancel;
> > > > 	if (quota_retry)
> > > > 		goto retry;
> > > > 
> > > > So here we are, slowly putting things almost all the way back to where
> > > > they were originally.  Now I have a little utility function:
> > > > 
> > > > /*
> > > >  * Cancel a transaction and try to clear some space so that we can
> > > >  * reserve some quota.  The caller must hold the ILOCK; when this
> > > >  * function returns, the transaction will be cancelled and the ILOCK
> > > >  * will have been released.
> > > >  */
> > > > int
> > > > xfs_trans_cancel_qretry(
> > > > 	struct xfs_trans	*tp,
> > > > 	struct xfs_inode	*ip)
> > > > {
> > > > 	ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL));
> > > > 
> > > > 	xfs_trans_cancel(tp);
> > > > 	xfs_iunlock(ip, XFS_ILOCK_EXCL);
> > > > 
> > > > 	return xfs_blockgc_free_quota(ip, 0);
> > > > }
> > > > 
> > > > Which I guess reduces the amount of call site boilerplate from 4 lines
> > > > to two, only now I've spent half of last week on this.
> > > > 
> > > > > Assuming it does because the underlying work
> > > > > may involve more transactions or whatnot, I'm wondering if this logic
> > > > > could be buried further down in the transaction allocation path.
> > > > > 
> > > > > For example, if we passed the quota reservation and inode down into a
> > > > > new variant of xfs_trans_alloc(), it could acquire the ilock and attempt
> > > > > the quota reservation as a final step (to avoid adding an extra
> > > > > unconditional ilock cycle). If quota res fails, iunlock and release the
> > > > > log res internally and perform the scan. From there, perhaps we could
> > > > > retry the quota reservation immediately without logres or the ilock by
> > > > > saving references to the dquots, and then only reacquire logres/ilock on
> > > > > success..? Just thinking out loud so that might require further
> > > > > thought...
> > > > 
> > > > Yes, that's certainly possible, and probably a good design goal to have
> > > > a xfs_trans_alloc_quota(tres, ip, whichfork, nblks, &tp) that one could
> > > > call to reserve a transaction, lock the inode, and reserve the
> > > > appropriate amounts of quota to handle mapping nblks into an inode fork.
> > > > 
> > > > However, there are complications that don't make this a trivial switch:
> > > > 
> > > > 1. Reflink and (new) swapext don't actually know how many blocks they
> > > > need to reserve until after they've grabbed the two ILOCKs, which means
> > > > that the wrapper is of no use here.
> > > > 
> > > 
> > > IMO, it's preferable to define a clean/usable interface if we can find
> > > one that covers the majority of use cases and have to open code a
> > > handful of outliers than define a cumbersome interface that must be used
> > > everywhere to accommodate the outliers. Perhaps we'll find cleaner ways
> > > to deal with open coded outliers over time..?
> > 
> > Sure, we might, but let's not delay this cleanup, since these are the
> > last two pieces that I need to get merged before I can send out deferred
> > inode inactivation for review.  Deferred inode inactivation adds yet
> > another button that we can push to reclaim free space when something hits
> > EDQUOT/ENOSPC.
> > 
> 
> Not sure I see the need to rush in a particular interface that multiple
> reviewers have expressed reservations about just because there are more
> patches coming built on top. That just creates more churn and cleanup
> work for later, which means more review/test cycles and more work
> indirectly for people who might have to deal with backports, etc. I'm
> not dead set against what this patch does if there's no better
> alternative, but IMO it's better to get it right than get it fast so we
> should at least give fair consideration to some alternatives if ideas
> are being presented.
> 
> > FWIW I did start down the path of building a better interface last week,
> > but quickly became mired in (1) how do we allocate rt quota with a new
> > interface and (2) do we care?  And then I started looking at what rt
> > allocations do wrt quota and decided that fixing that (or even removing
> > it) would be an entire patchset.
> > 
> > Hence I'm trying to constrain this patchset to updating the existing
> > callsites to do the scan+retry, and no more.
> > 
> 
> Ok, well I think that helps me understand the situation, but I'm still
> not really following if/how that conflicts with any of the previous
> suggestions (which is why I was asking for example code to consider).
> 
> > > Perhaps (at least in the
> > > reflink case) we could attempt a worst case quota reservation with the
> > > helper, knowing that it will have invoked the scan on -EDQUOT, and then
> > > fall back to a more accurate open-coded xfs_trans_reserve_() call (that
> > > will no longer fall into retry loops on failure)..?
> > 
> > Making a worst case reservation and backing off creates more ways for
> > things to fail unnecessarily.
> > 
> > For a remap operation, the worst case is if the source file range has an
> > allocated mapping and the destination file range is a hole, because we
> > have to increment quota by the size of that allocated mapping.  If we
> > run out of quota we'll have to flush the fs and try again.  If we fail
> > the quota reservation a second time, the whole operation fails.
> > 
> 
> Right...
> 
> > This is not good behavior for all the other cases -- if both mappings
> > are holes or both allocated, we just failed an operation that would have
> > made no net change to the quota allocations.  If the source file range
> > is a hole and the destination range is allocated, we actually would have
> > /decreased/ the quota usage, but instead we fail with EDQUOT.
> > 
> 
> But that wasn't the suggestion. The suggestion was to do something along
> the lines of the following in the reflink case:
> 
> 	error = xfs_trans_alloc_quota(..., ip, resblks, worstqres, ...);
> 	if (error == -EDQUOT) {
> 		worstqres = 0;
> 		error = xfs_trans_alloc(..., resblks, ...);
> 		...
> 	}

OH.  I misread that sentence with "fall back to a more accurate reserve
call", and totally thought that your suggestion was to use
xfs_trans_alloc_quota on its own, then later when we know how much quota
we really want, using xfs_trans_reserve_quota to adjust the transaction.

I am totally ok with doing it this way.

> 	if (!worstqres) {
> 		worstqres = <calculate actual quota res>
> 		error = xfs_trans_reserve_quota(...);
> 		if (error)
> 			return error;
> 	}
> 
> 	...
> 
> ... where the initial transaction allocation failure would have failed
> on the worst case qres, but also would have run the internal reclaim
> scan and retried before it returned. Therefore, we could still attempt
> the open coded non-worst case reservation and either proceed or return
> -EDQUOT with generally similar scan->retry semantics as this patch, just
> without the open coded goto loops everywhere we attach quota reservation
> to a transaction. This of course assumes that the
> xfs_trans_alloc_quota() interface works well enough for the majority of
> other cases without need for open coded reservation...

I think it will.

> > Right now the remap code handles those cases just fine, at a cost of
> > open coded logic.
> > 
> > > > 2. For the remaining quota reservation callsites, you have to deal with
> > > > the bmap code that computes qblocks for reservation against the realtime
> > > > device.  This is opening a huge can of worms because:
> > > > 
> > > > 3. Realtime and quota are not supported, which means that none of that
> > > > code ever gets properly QA'd.  It would be totally stupid to rework most
> > > > of the quota reservation callsites and still leave that logic bomb.
> > > > This gigantic piece of technical debt needs to be paid off, either by
> > > > fixing the functionality and getting it under test, or by dropping rt
> > > > quota support completely and officially.
> > > > 
> > > 
> > > I'm not following what you're referring to here. Can you point to
> > > examples in the code for reference, please?
> > 
> > If you format a filesystem with realtime and mount it with -oquota, xfs
> > will ignore the 'quota' mount option completely.  Some code exists to
> > do rt quota accounting (xfs_alloc_file_space and xfs_iomap_write_direct
> > are examples) but since we never allow rt+quota, the code coverage on
> > that is 0%.
> > 
> 
> Ok, but how is that any different for what this patch does?

In the end there isn't any practical difference; I had to get over my
reluctance to fiddle around with code that can't ever be run.  Whatever
the state of rt quota, at least users can't get to it.

With that... between the long delivery delays and replies arriving out
of order and with unpredictable lag time, it might just be time for me
to tidy up my patchsets and send a v5.

--D

> Brian
> 
> > I've also noticed that those functions seem to have this odd behavior
> > where for rt files, they'll reserve quota for the allocated blocks
> > themselves but not the bmbt blocks; but for regular files, they reserve
> > quota for both the allocated blocks and the bmbt blocks.  The quota code
> > makes various allowances for transactions that try to commit quota count
> > updates but have zero quota reservation attached to the transaction,
> > which I /think/ could have been an attempt to work around that quirk.
> > 
> > I also just noticed that xfs_bmapi_reserve_delalloc only works with
> > non-rt files.  I guess that's fine since rt files don't use the delalloc
> > mechanism anyway (and I think the  reason they don't is that xfs can't
> > currently do write-around to handle rextsize>1 filesystems) but that's
> > another mess to sort.
> > 
> > (FWIW I'm slowly working through all those rt issues as part of maturing
> > the rt reflink patchset, but that's at the far end of my dev tree...)
> > 
> > --D
> > 
> > > 
> > > Brian
> > > 
> > > > My guess is that fixing rt quota is probably going to take 10-15
> > > > patches, and doing more small cleanups to convert the callsites will be
> > > > another 10 or so.
> > > > 
> > > > 4. We're already past -rc5, and what started as two cleanup patchsets of
> > > > 13 is now four patchsets of 27 patches, and I /really/ would just like
> > > > to get these patches merged without expanding the scope of work even
> > > > further.
> > > > 
> > > > --D
> > > > 
> > > > > Brian
> > > > > 
> > > > 
> > > 
> > 
>

Patch

diff --git a/fs/xfs/libxfs/xfs_attr.c b/fs/xfs/libxfs/xfs_attr.c
index be51e7068dcd..af835ea0ca80 100644
--- a/fs/xfs/libxfs/xfs_attr.c
+++ b/fs/xfs/libxfs/xfs_attr.c
@@ -395,6 +395,7 @@  xfs_attr_set(
 	struct xfs_mount	*mp = dp->i_mount;
 	struct xfs_trans_res	tres;
 	bool			rsvd = (args->attr_filter & XFS_ATTR_ROOT);
+	bool			quota_retry = false;
 	int			error, local;
 	int			rmt_blks = 0;
 	unsigned int		total;
@@ -458,6 +459,7 @@  xfs_attr_set(
 	 * Root fork attributes can use reserved data blocks for this
 	 * operation if necessary
 	 */
+retry:
 	error = xfs_trans_alloc(mp, &tres, total, 0,
 			rsvd ? XFS_TRANS_RESERVE : 0, &args->trans);
 	if (error)
@@ -478,10 +480,12 @@  xfs_attr_set(
 
 		if (rsvd)
 			quota_flags |= XFS_QMOPT_FORCE_RES;
-		error = xfs_trans_reserve_quota_nblks(args->trans, dp,
-				args->total, 0, quota_flags);
+		error = xfs_trans_reserve_quota_nblks(&args->trans, dp,
+				args->total, 0, quota_flags, &quota_retry);
 		if (error)
 			goto out_trans_cancel;
+		if (quota_retry)
+			goto retry;
 
 		error = xfs_has_attr(args);
 		if (error == -EEXIST && (args->attr_flags & XATTR_CREATE))
diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index 908b7d49da60..0247763dfac3 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -1070,6 +1070,7 @@  xfs_bmap_add_attrfork(
 	int			blks;		/* space reservation */
 	int			version = 1;	/* superblock attr version */
 	int			logflags;	/* logging flags */
+	bool			quota_retry = false;
 	int			error;		/* error return value */
 
 	ASSERT(XFS_IFORK_Q(ip) == 0);
@@ -1079,17 +1080,20 @@  xfs_bmap_add_attrfork(
 
 	blks = XFS_ADDAFORK_SPACE_RES(mp);
 
+retry:
 	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_addafork, blks, 0,
 			rsvd ? XFS_TRANS_RESERVE : 0, &tp);
 	if (error)
 		return error;
 
 	xfs_ilock(ip, XFS_ILOCK_EXCL);
-	error = xfs_trans_reserve_quota_nblks(tp, ip, blks, 0, rsvd ?
+	error = xfs_trans_reserve_quota_nblks(&tp, ip, blks, 0, rsvd ?
 			XFS_QMOPT_RES_REGBLKS | XFS_QMOPT_FORCE_RES :
-			XFS_QMOPT_RES_REGBLKS);
+			XFS_QMOPT_RES_REGBLKS, &quota_retry);
 	if (error)
 		goto trans_cancel;
+	if (quota_retry)
+		goto retry;
 	if (XFS_IFORK_Q(ip))
 		goto trans_cancel;
 
diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
index 792809debaaa..6eaf92bf8fc6 100644
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -761,6 +761,7 @@  xfs_alloc_file_space(
 	 */
 	while (allocatesize_fsb && !error) {
 		xfs_fileoff_t	s, e;
+		bool		quota_retry = false;
 
 		/*
 		 * Determine space reservations for data/realtime.
@@ -803,6 +804,7 @@  xfs_alloc_file_space(
 		/*
 		 * Allocate and setup the transaction.
 		 */
+retry:
 		error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write, resblks,
 				resrtextents, 0, &tp);
 
@@ -817,10 +819,12 @@  xfs_alloc_file_space(
 			break;
 		}
 		xfs_ilock(ip, XFS_ILOCK_EXCL);
-		error = xfs_trans_reserve_quota_nblks(tp, ip, qblocks,
-						      0, quota_flag);
+		error = xfs_trans_reserve_quota_nblks(&tp, ip, qblocks, 0,
+				quota_flag, &quota_retry);
 		if (error)
 			goto error1;
+		if (quota_retry)
+			goto retry;
 
 		error = xfs_iext_count_may_overflow(ip, XFS_DATA_FORK,
 				XFS_IEXT_ADD_NOSPLIT_CNT);
@@ -858,7 +862,6 @@  xfs_alloc_file_space(
 
 error0:	/* unlock inode, unreserve quota blocks, cancel trans */
 	xfs_trans_unreserve_quota_nblks(tp, ip, (long)qblocks, 0, quota_flag);
-
 error1:	/* Just cancel transaction */
 	xfs_trans_cancel(tp);
 	xfs_iunlock(ip, XFS_ILOCK_EXCL);
@@ -875,8 +878,10 @@  xfs_unmap_extent(
 	struct xfs_mount	*mp = ip->i_mount;
 	struct xfs_trans	*tp;
 	uint			resblks = XFS_DIOSTRAT_SPACE_RES(mp, 0);
+	bool			quota_retry = false;
 	int			error;
 
+retry:
 	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write, resblks, 0, 0, &tp);
 	if (error) {
 		ASSERT(error == -ENOSPC || XFS_FORCED_SHUTDOWN(mp));
@@ -884,10 +889,12 @@  xfs_unmap_extent(
 	}
 
 	xfs_ilock(ip, XFS_ILOCK_EXCL);
-	error = xfs_trans_reserve_quota_nblks(tp, ip, resblks, 0,
-			XFS_QMOPT_RES_REGBLKS);
+	error = xfs_trans_reserve_quota_nblks(&tp, ip, resblks, 0,
+			XFS_QMOPT_RES_REGBLKS, &quota_retry);
 	if (error)
 		goto out_trans_cancel;
+	if (quota_retry)
+		goto retry;
 
 	xfs_trans_ijoin(tp, ip, 0);
 
diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index 514e6ae010e0..294d819c30c6 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -27,7 +27,7 @@ 
 #include "xfs_dquot_item.h"
 #include "xfs_dquot.h"
 #include "xfs_reflink.h"
-
+#include "xfs_icache.h"
 
 #define XFS_ALLOC_ALIGN(mp, off) \
 	(((off) >> mp->m_allocsize_log) << mp->m_allocsize_log)
@@ -197,6 +197,7 @@  xfs_iomap_write_direct(
 	int			quota_flag;
 	uint			qblocks, resblks;
 	unsigned int		resrtextents = 0;
+	bool			quota_retry = false;
 	int			error;
 	int			bmapi_flags = XFS_BMAPI_PREALLOC;
 	uint			tflags = 0;
@@ -239,6 +240,7 @@  xfs_iomap_write_direct(
 			resblks = XFS_DIOSTRAT_SPACE_RES(mp, 0) << 1;
 		}
 	}
+retry:
 	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write, resblks, resrtextents,
 			tflags, &tp);
 	if (error)
@@ -246,9 +248,12 @@  xfs_iomap_write_direct(
 
 	xfs_ilock(ip, XFS_ILOCK_EXCL);
 
-	error = xfs_trans_reserve_quota_nblks(tp, ip, qblocks, 0, quota_flag);
+	error = xfs_trans_reserve_quota_nblks(&tp, ip, qblocks, 0, quota_flag,
+			&quota_retry);
 	if (error)
 		goto out_trans_cancel;
+	if (quota_retry)
+		goto retry;
 
 	error = xfs_iext_count_may_overflow(ip, XFS_DATA_FORK,
 			XFS_IEXT_ADD_NOSPLIT_CNT);
@@ -544,6 +549,8 @@  xfs_iomap_write_unwritten(
 		return error;
 
 	do {
+		bool	quota_retry = false;
+
 		/*
 		 * Set up a transaction to convert the range of extents
 		 * from unwritten to real. Do allocations in a loop until
@@ -553,6 +560,7 @@  xfs_iomap_write_unwritten(
 		 * here as we might be asked to write out the same inode that we
 		 * complete here and might deadlock on the iolock.
 		 */
+retry:
 		error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write, resblks, 0,
 				XFS_TRANS_RESERVE, &tp);
 		if (error)
@@ -561,10 +569,13 @@  xfs_iomap_write_unwritten(
 		xfs_ilock(ip, XFS_ILOCK_EXCL);
 		xfs_trans_ijoin(tp, ip, 0);
 
-		error = xfs_trans_reserve_quota_nblks(tp, ip, resblks, 0,
-				XFS_QMOPT_RES_REGBLKS | XFS_QMOPT_FORCE_RES);
+		error = xfs_trans_reserve_quota_nblks(&tp, ip, resblks, 0,
+				XFS_QMOPT_RES_REGBLKS | XFS_QMOPT_FORCE_RES,
+				&quota_retry);
 		if (error)
 			goto error_on_bmapi_transaction;
+		if (quota_retry)
+			goto retry;
 
 		error = xfs_iext_count_may_overflow(ip, XFS_DATA_FORK,
 				XFS_IEXT_WRITE_UNWRITTEN_CNT);
diff --git a/fs/xfs/xfs_quota.h b/fs/xfs/xfs_quota.h
index 16a2e7adf4da..1c083b5267d9 100644
--- a/fs/xfs/xfs_quota.h
+++ b/fs/xfs/xfs_quota.h
@@ -81,8 +81,9 @@  extern void xfs_trans_mod_dquot_byino(struct xfs_trans *, struct xfs_inode *,
 		uint, int64_t);
 extern void xfs_trans_apply_dquot_deltas(struct xfs_trans *);
 extern void xfs_trans_unreserve_and_mod_dquots(struct xfs_trans *);
-extern int xfs_trans_reserve_quota_nblks(struct xfs_trans *,
-		struct xfs_inode *, int64_t, long, uint);
+int xfs_trans_reserve_quota_nblks(struct xfs_trans **tpp, struct xfs_inode *ip,
+		int64_t nblocks, long ninos, unsigned int flags,
+		bool *retry);
 extern int xfs_trans_reserve_quota_bydquots(struct xfs_trans *,
 		struct xfs_mount *, struct xfs_dquot *,
 		struct xfs_dquot *, struct xfs_dquot *, int64_t, long, uint);
@@ -114,8 +115,11 @@  extern void xfs_qm_unmount_quotas(struct xfs_mount *);
 static inline int
 xfs_quota_reserve_blkres(struct xfs_inode *ip, int64_t nblks, bool isrt)
 {
-	return xfs_trans_reserve_quota_nblks(NULL, ip, nblks, 0,
-			isrt ? XFS_QMOPT_RES_RTBLKS : XFS_QMOPT_RES_REGBLKS);
+	struct xfs_trans	*tp = NULL;
+
+	return xfs_trans_reserve_quota_nblks(&tp, ip, nblks, 0,
+			isrt ? XFS_QMOPT_RES_RTBLKS : XFS_QMOPT_RES_REGBLKS,
+			NULL);
 }
 #else
 static inline int
@@ -133,8 +137,9 @@  xfs_qm_vop_dqalloc(struct xfs_inode *ip, kuid_t kuid, kgid_t kgid,
 #define xfs_trans_mod_dquot_byino(tp, ip, fields, delta)
 #define xfs_trans_apply_dquot_deltas(tp)
 #define xfs_trans_unreserve_and_mod_dquots(tp)
-static inline int xfs_trans_reserve_quota_nblks(struct xfs_trans *tp,
-		struct xfs_inode *ip, int64_t nblks, long ninos, uint flags)
+static inline int xfs_trans_reserve_quota_nblks(struct xfs_trans **tpp,
+		struct xfs_inode *ip, int64_t nblks, long ninos,
+		unsigned int flags, bool *retry)
 {
 	return 0;
 }
@@ -179,7 +184,8 @@  static inline int
 xfs_trans_unreserve_quota_nblks(struct xfs_trans *tp, struct xfs_inode *ip,
 		int64_t nblks, long ninos, unsigned int flags)
 {
-	return xfs_trans_reserve_quota_nblks(tp, ip, -nblks, -ninos, flags);
+	return xfs_trans_reserve_quota_nblks(&tp, ip, -nblks, -ninos, flags,
+			NULL);
 }
 
 static inline int
diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
index 0da1a603b7d8..0afd74f35ab7 100644
--- a/fs/xfs/xfs_reflink.c
+++ b/fs/xfs/xfs_reflink.c
@@ -355,6 +355,7 @@  xfs_reflink_allocate_cow(
 	xfs_filblks_t		count_fsb = imap->br_blockcount;
 	struct xfs_trans	*tp;
 	int			nimaps, error = 0;
+	bool			quota_retry = false;
 	bool			found;
 	xfs_filblks_t		resaligned;
 	xfs_extlen_t		resblks = 0;
@@ -376,6 +377,7 @@  xfs_reflink_allocate_cow(
 	resblks = XFS_DIOSTRAT_SPACE_RES(mp, resaligned);
 
 	xfs_iunlock(ip, *lockmode);
+retry:
 	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write, resblks, 0, 0, &tp);
 	*lockmode = XFS_ILOCK_EXCL;
 	xfs_ilock(ip, *lockmode);
@@ -398,10 +400,12 @@  xfs_reflink_allocate_cow(
 		goto convert;
 	}
 
-	error = xfs_trans_reserve_quota_nblks(tp, ip, resblks, 0,
-			XFS_QMOPT_RES_REGBLKS);
+	error = xfs_trans_reserve_quota_nblks(&tp, ip, resblks, 0,
+			XFS_QMOPT_RES_REGBLKS, &quota_retry);
 	if (error)
 		goto out_trans_cancel;
+	if (quota_retry)
+		goto retry;
 
 	xfs_trans_ijoin(tp, ip, 0);
 
@@ -1006,10 +1010,12 @@  xfs_reflink_remap_extent(
 	unsigned int		resblks;
 	bool			smap_real;
 	bool			dmap_written = xfs_bmap_is_written_extent(dmap);
+	bool			quota_retry = false;
 	int			iext_delta = 0;
 	int			nimaps;
 	int			error;
 
+retry:
 	/* Start a rolling transaction to switch the mappings */
 	resblks = XFS_EXTENTADD_SPACE_RES(mp, XFS_DATA_FORK);
 	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write, resblks, 0, 0, &tp);
@@ -1094,10 +1100,12 @@  xfs_reflink_remap_extent(
 	if (!smap_real && dmap_written)
 		qres += dmap->br_blockcount;
 	if (qres > 0) {
-		error = xfs_trans_reserve_quota_nblks(tp, ip, qres, 0,
-				XFS_QMOPT_RES_REGBLKS);
+		error = xfs_trans_reserve_quota_nblks(&tp, ip, qres, 0,
+				XFS_QMOPT_RES_REGBLKS, &quota_retry);
 		if (error)
 			goto out_cancel;
+		if (quota_retry)
+			goto retry;
 	}
 
 	if (smap_real)
diff --git a/fs/xfs/xfs_trans_dquot.c b/fs/xfs/xfs_trans_dquot.c
index 3315498a6fa1..adc7331ff182 100644
--- a/fs/xfs/xfs_trans_dquot.c
+++ b/fs/xfs/xfs_trans_dquot.c
@@ -16,6 +16,7 @@ 
 #include "xfs_quota.h"
 #include "xfs_qm.h"
 #include "xfs_trace.h"
+#include "xfs_icache.h"
 
 STATIC void	xfs_trans_alloc_dqinfo(xfs_trans_t *);
 
@@ -770,21 +771,38 @@  xfs_trans_reserve_quota_bydquots(
 	return error;
 }
 
-
 /*
- * Lock the dquot and change the reservation if we can.
- * This doesn't change the actual usage, just the reservation.
- * The inode sent in is locked.
+ * Lock the dquot and change the reservation if we can.  This doesn't change
+ * the actual usage, just the reservation.  The caller must hold ILOCK_EXCL on
+ * the inode.  If @retry is not a NULL pointer, the caller must ensure that
+ * *retry is set to false before the first time this function is called.
+ *
+ * If the quota reservation fails because we hit a quota limit (and retry is
+ * not a NULL pointer, and *retry is false), this function will try to invoke
+ * the speculative preallocation gc scanner to reduce quota usage.  In order to
+ * do that, we cancel the transaction, NULL out tpp, drop the ILOCK, and set
+ * *retry to true.
+ *
+ * This function ends one of two ways:
+ *
+ *  1) To signal the caller to try again, *retry is set to true; *tpp is
+ *     cancelled and set to NULL; the inode is unlocked; and the return value
+ *     is zero.
+ *
+ *  2) Otherwise, *tpp is still set; the inode is still locked; and the return
+ *     value is zero or the usual negative error code.
  */
 int
 xfs_trans_reserve_quota_nblks(
-	struct xfs_trans	*tp,
+	struct xfs_trans	**tpp,
 	struct xfs_inode	*ip,
 	int64_t			nblks,
 	long			ninos,
-	uint			flags)
+	unsigned int		flags,
+	bool			*retry)
 {
 	struct xfs_mount	*mp = ip->i_mount;
+	int			error;
 
 	if (!XFS_IS_QUOTA_RUNNING(mp) || !XFS_IS_QUOTA_ON(mp))
 		return 0;
@@ -795,13 +813,26 @@  xfs_trans_reserve_quota_nblks(
 	ASSERT((flags & ~(XFS_QMOPT_FORCE_RES)) == XFS_TRANS_DQ_RES_RTBLKS ||
 	       (flags & ~(XFS_QMOPT_FORCE_RES)) == XFS_TRANS_DQ_RES_BLKS);
 
-	/*
-	 * Reserve nblks against these dquots, with trans as the mediator.
-	 */
-	return xfs_trans_reserve_quota_bydquots(tp, mp,
-						ip->i_udquot, ip->i_gdquot,
-						ip->i_pdquot,
-						nblks, ninos, flags);
+	/* Reserve nblks against these dquots, with trans as the mediator. */
+	error = xfs_trans_reserve_quota_bydquots(*tpp, mp, ip->i_udquot,
+			ip->i_gdquot, ip->i_pdquot, nblks, ninos, flags);
+	if (retry == NULL)
+		return error;
+	/* We only allow one retry for EDQUOT/ENOSPC. */
+	if (*retry || (error != -EDQUOT && error != -ENOSPC)) {
+		*retry = false;
+		return error;
+	}
+
+	/* Release resources, prepare for scan. */
+	xfs_trans_cancel(*tpp);
+	*tpp = NULL;
+	xfs_iunlock(ip, XFS_ILOCK_EXCL);
+
+	/* Try to free some quota for this file's dquots. */
+	*retry = true;
+	xfs_blockgc_free_quota(ip, 0);
+	return 0;
 }
 
 /* Change the quota reservations for an inode creation activity. */