
[20/24] xfs: use AIL pushing for inode reclaim IO

Message ID 20190801021752.4986-21-david@fromorbit.com (mailing list archive)
State New, archived
Series mm, xfs: non-blocking inode reclaim

Commit Message

Dave Chinner Aug. 1, 2019, 2:17 a.m. UTC
From: Dave Chinner <dchinner@redhat.com>

Inode reclaim currently issues its own inode IO when it comes
across dirty inodes. This is used to throttle direct reclaim down to
the rate at which we can reclaim dirty inodes. Failure to throttle
in this manner results in the OOM killer being trivial to trigger
even when there is lots of free memory available.

However, having direct reclaimers issue IO causes an amount of
IO thrashing to occur. We can have up to the number of AGs in the
filesystem concurrently issuing IO, plus the AIL pushing thread as
well. This means we can have many competing sources of IO and they all
end up thrashing and competing for the request slots in the block
device.

Similar to dirty page throttling and the BDI flusher thread, we can
use the AIL pushing thread as the sole place we issue inode writeback
from and everything else waits for it to make progress. To do this,
reclaim will skip over dirty inodes, but in doing so will record the
lowest LSN of all the dirty inodes it skips. It will then push the
AIL to this LSN and wait for it to complete that work.
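
In outline, the reworked walk looks like this (a condensed sketch of the
xfs_reclaim_inodes_ag() changes in the patch below; the per-AG radix tree
walk, batching and cursor handling are elided and marked with "..."):

/*
 * Condensed sketch of the reworked xfs_reclaim_inodes_ag(): reclaim no
 * longer issues IO itself.  Dirty or pinned inodes are skipped, their AIL
 * position is recorded, and one synchronous AIL push to the lowest LSN
 * seen provides the blocking/throttling.
 */
STATIC int
xfs_reclaim_inodes_ag(
	struct xfs_mount	*mp,
	int			flags,
	int			nr_to_scan)
{
	xfs_lsn_t		lsn, lowest_lsn = NULLCOMMITLSN;
	long			freed = 0;

	/* ... for each reclaimable inode found by the per-AG walk ... */

		if (!xfs_reclaim_inode_grab(ip, flags, &lsn)) {
			/* dirty or busy: remember where it sits in the AIL */
			if (lsn && XFS_LSN_CMP(lsn, lowest_lsn) < 0)
				lowest_lsn = lsn;
			continue;
		}
		if (xfs_reclaim_inode(ip, pag, &lsn))
			freed++;
		if (lsn && XFS_LSN_CMP(lsn, lowest_lsn) < 0)
			lowest_lsn = lsn;

	/* ... end of per-AG walk ... */

	/* Block on at least one IO: wait for the AIL to reach lowest_lsn. */
	if ((flags & SYNC_WAIT) && lowest_lsn != NULLCOMMITLSN)
		xfs_ail_push_sync(mp->m_ail, lowest_lsn);

	return freed;
}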

In doing so, we block direct reclaim on the completion of at least one IO,
thereby providing some level of throttling for when we encounter
dirty inodes. However we gain the ability to scan and reclaim
clean inodes in a non-blocking fashion. This allows us to
remove all the per-ag reclaim locking that avoids excessive direct
reclaim, as repeated concurrent direct reclaim will hit the same
dirty inodes and block waiting on the same IO to complete.

Hence direct reclaim will be throttled directly by the rate at which
dirty inodes are cleaned by AIL pushing, rather than by delays
caused by competing IO submissions. This allows us to remove all the
locking that limits direct reclaim concurrency and greatly
simplifies the inode reclaim code now that it just skips dirty
inodes.

Note: this patch by itself isn't completely able to throttle direct
reclaim sufficiently to prevent OOM killer madness. We can't do that
until we change the way we index reclaimable inodes in the next
patch and can feed back state to the mm core sanely.  However, we
can't change the way we index reclaimable inodes until we have
IO-less non-blocking reclaim for both direct reclaim and kswapd
reclaim.  Catch-22...

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_icache.c    | 208 +++++++++++++++++------------------------
 fs/xfs/xfs_mount.c     |   4 -
 fs/xfs/xfs_mount.h     |   1 -
 fs/xfs/xfs_trans_ail.c |   4 +-
 4 files changed, 90 insertions(+), 127 deletions(-)

Comments

Brian Foster Aug. 7, 2019, 6:09 p.m. UTC | #1
On Thu, Aug 01, 2019 at 12:17:48PM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Inode reclaim currently issues it's own inode IO when it comes
> across dirty inodes. This is used to throttle direct reclaim down to
> the rate at which we can reclaim dirty inodes. Failure to throttle
> in this manner results in the OOM killer being trivial to trigger
> even when there is lots of free memory available.
> 
> However, having direct reclaimers issue IO causes an amount of
> IO thrashing to occur. We can have up to the number of AGs in the
> filesystem concurrently issuing IO, plus the AIL pushing thread as
> well. This means we can many competing sources of IO and they all
> end up thrashing and competing for the request slots in the block
> device.
> 
> Similar to dirty page throttling and the BDI flusher thread, we can
> use the AIL pushing thread the sole place we issue inode writeback
> from and everything else waits for it to make progress. To do this,
> reclaim will skip over dirty inodes, but in doing so will record the
> lowest LSN of all the dirty inodes it skips. It will then push the
> AIL to this LSN and wait for it to complete that work.
> 
> In doing so, we block direct reclaim on the IO of at least one IO,
> thereby providing some level of throttling for when we encounter
> dirty inodes. However we gain the ability to scan and reclaim
> clean inodes in a non-blocking fashion. This allows us to
> remove all the per-ag reclaim locking that avoids excessive direct
> reclaim, as repeated concurrent direct reclaim will hit the same
> dirty inodes on block waiting on the same IO to complete.
> 

The last part of the above sentence sounds borked..

> Hence direct reclaim will be throttled directly by the rate at which
> dirty inodes are cleaned by AIL pushing, rather than by delays
> caused by competing IO submissions. This allows us to remove all the
> locking that limits direct reclaim concurrency and greatly
> simplifies the inode reclaim code now that it just skips dirty
> inodes.
> 
> Note: this patch by itself isn't completely able to throttle direct
> reclaim sufficiently to prevent OOM killer madness. We can't do that
> until we change the way we index reclaimable inodes in the next
> patch and can feed back state to the mm core sanely.  However, we
> can't change the way we index reclaimable inodes until we have
> IO-less non-blocking reclaim for both direct reclaim and kswapd
> reclaim.  Catch-22...
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_icache.c    | 208 +++++++++++++++++------------------------
>  fs/xfs/xfs_mount.c     |   4 -
>  fs/xfs/xfs_mount.h     |   1 -
>  fs/xfs/xfs_trans_ail.c |   4 +-
>  4 files changed, 90 insertions(+), 127 deletions(-)
> 
> diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> index 0bd4420a7e16..4c4c5bc12147 100644
> --- a/fs/xfs/xfs_icache.c
> +++ b/fs/xfs/xfs_icache.c
> @@ -22,6 +22,7 @@
>  #include "xfs_dquot_item.h"
>  #include "xfs_dquot.h"
>  #include "xfs_reflink.h"
> +#include "xfs_log.h"
>  
>  #include <linux/iversion.h>
>  
> @@ -967,28 +968,42 @@ xfs_inode_ag_iterator_tag(
>  }
>  
>  /*
> - * Grab the inode for reclaim exclusively.
> - * Return 0 if we grabbed it, non-zero otherwise.
> + * Grab the inode for reclaim.
> + *
> + * Return false if we aren't going to reclaim it, true if it is a reclaim
> + * candidate.
> + *
> + * If the inode is clean or unreclaimable, return NULLCOMMITLSN to tell the
> + * caller it does not require flushing. Otherwise return the log item lsn of the
> + * inode so the caller can determine it's inode flush target.  If we get the
> + * clean/dirty state wrong then it will be sorted in xfs_reclaim_inode() once we
> + * have locks held.
>   */
> -STATIC int
> +STATIC bool
>  xfs_reclaim_inode_grab(
>  	struct xfs_inode	*ip,
> -	int			flags)
> +	int			flags,
> +	xfs_lsn_t		*lsn)
>  {
>  	ASSERT(rcu_read_lock_held());
> +	*lsn = 0;

The comment above says we return NULLCOMMITLSN. Given the rest of the
code, I'm assuming we should just fix up the comment.

>  
>  	/* quick check for stale RCU freed inode */
>  	if (!ip->i_ino)
> -		return 1;
> +		return false;
>  
>  	/*
> -	 * If we are asked for non-blocking operation, do unlocked checks to
> -	 * see if the inode already is being flushed or in reclaim to avoid
> -	 * lock traffic.
> +	 * Do unlocked checks to see if the inode already is being flushed or in
> +	 * reclaim to avoid lock traffic. If the inode is not clean, return the
> +	 * it's position in the AIL for the caller to push to.
>  	 */
> -	if ((flags & SYNC_TRYLOCK) &&
> -	    __xfs_iflags_test(ip, XFS_IFLOCK | XFS_IRECLAIM))
> -		return 1;
> +	if (!xfs_inode_clean(ip)) {
> +		*lsn = ip->i_itemp->ili_item.li_lsn;
> +		return false;
> +	}
> +
> +	if (__xfs_iflags_test(ip, XFS_IFLOCK | XFS_IRECLAIM))
> +		return false;
>  
>  	/*
>  	 * The radix tree lock here protects a thread in xfs_iget from racing
...
> @@ -1050,92 +1065,67 @@ xfs_reclaim_inode_grab(
>   *	clean		=> reclaim
>   *	dirty, async	=> requeue
>   *	dirty, sync	=> flush, wait and reclaim
> + *
> + * Returns true if the inode was reclaimed, false otherwise.
>   */
> -STATIC int
> +STATIC bool
>  xfs_reclaim_inode(
>  	struct xfs_inode	*ip,
>  	struct xfs_perag	*pag,
> -	int			sync_mode)
> +	xfs_lsn_t		*lsn)
>  {
> -	struct xfs_buf		*bp = NULL;
> -	xfs_ino_t		ino = ip->i_ino; /* for radix_tree_delete */
> -	int			error;
> +	xfs_ino_t		ino;
> +
> +	*lsn = 0;
>  
> -restart:
> -	error = 0;
>  	/*
>  	 * Don't try to flush the inode if another inode in this cluster has
>  	 * already flushed it after we did the initial checks in
>  	 * xfs_reclaim_inode_grab().
>  	 */
> -	if (sync_mode & SYNC_TRYLOCK) {
> -		if (!xfs_ilock_nowait(ip, XFS_ILOCK_EXCL))
> -			goto out;
> -		if (!xfs_iflock_nowait(ip))
> -			goto out_unlock;
> -	} else {
> -		xfs_ilock(ip, XFS_ILOCK_EXCL);
> -		if (!xfs_iflock_nowait(ip)) {
> -			if (!(sync_mode & SYNC_WAIT))
> -				goto out_unlock;
> -			xfs_iflock(ip);
> -		}
> -	}
> +	if (!xfs_ilock_nowait(ip, XFS_ILOCK_EXCL))
> +		goto out;
> +	if (!xfs_iflock_nowait(ip))
> +		goto out_unlock;
>  

Do we even need the flush lock here any more if we're never going to
flush from this context? The shutdown case just below notwithstanding
(which I'm also wondering if we should just drop given we abort from
xfs_iflush() on shutdown), the pin count is an atomic and the dirty
state changes under ilock.

Maybe I'm missing something else, but the reason I ask is that the
increased flush lock contention in codepaths that don't actually flush
once it's acquired gives me a bit of concern that we could reduce
effectiveness of the one task that actually does (xfsaild).

> +	/* If we are in shutdown, we don't care about blocking. */
>  	if (XFS_FORCED_SHUTDOWN(ip->i_mount)) {
>  		xfs_iunpin_wait(ip);
>  		/* xfs_iflush_abort() drops the flush lock */
>  		xfs_iflush_abort(ip, false);
>  		goto reclaim;
>  	}
> -	if (xfs_ipincount(ip)) {
> -		if (!(sync_mode & SYNC_WAIT))
> -			goto out_ifunlock;
> -		xfs_iunpin_wait(ip);
> -	}
> -	if (xfs_iflags_test(ip, XFS_ISTALE) || xfs_inode_clean(ip)) {
> -		xfs_ifunlock(ip);
> -		goto reclaim;
> -	}
>  
>  	/*
> -	 * Never flush out dirty data during non-blocking reclaim, as it would
> -	 * just contend with AIL pushing trying to do the same job.
> +	 * If it is pinned, we only want to flush this if there's nothing else
> +	 * to be flushed as it requires a log force. Hence we essentially set
> +	 * the LSN to flush the entire AIL which will end up triggering a log
> +	 * force to unpin this inode, but that will only happen if there are not
> +	 * other inodes in the scan that only need writeback.
>  	 */
> -	if (!(sync_mode & SYNC_WAIT))
> +	if (xfs_ipincount(ip)) {
> +		*lsn = ip->i_itemp->ili_last_lsn;

->ili_last_lsn comes from xfs_cil_ctx->sequence, which I don't think is
actually a physical LSN suitable for AIL pushing. The lsn assigned to
the item once it's physically logged and AIL inserted comes from
ctx->start_lsn, which comes from the iclog header and so is a physical
LSN.

That said, this usage of ili_last_lsn seems to disappear by the end of
the series...

>  		goto out_ifunlock;
> +	}
>  
...
> @@ -1205,39 +1189,28 @@ xfs_reclaim_inode(
>   * corrupted, we still want to try to reclaim all the inodes. If we don't,
>   * then a shut down during filesystem unmount reclaim walk leak all the
>   * unreclaimed inodes.
> + *
> + * Return the number of inodes freed.
>   */
>  STATIC int
>  xfs_reclaim_inodes_ag(
>  	struct xfs_mount	*mp,
>  	int			flags,
> -	int			*nr_to_scan)
> +	int			nr_to_scan)
>  {
>  	struct xfs_perag	*pag;
> -	int			error = 0;
> -	int			last_error = 0;
>  	xfs_agnumber_t		ag;
> -	int			trylock = flags & SYNC_TRYLOCK;
> -	int			skipped;
> +	xfs_lsn_t		lsn, lowest_lsn = NULLCOMMITLSN;
> +	long			freed = 0;
>  
> -restart:
>  	ag = 0;
> -	skipped = 0;
>  	while ((pag = xfs_perag_get_tag(mp, ag, XFS_ICI_RECLAIM_TAG))) {
>  		unsigned long	first_index = 0;
>  		int		done = 0;
>  		int		nr_found = 0;
>  
>  		ag = pag->pag_agno + 1;
> -
> -		if (trylock) {
> -			if (!mutex_trylock(&pag->pag_ici_reclaim_lock)) {
> -				skipped++;
> -				xfs_perag_put(pag);
> -				continue;
> -			}
> -			first_index = pag->pag_ici_reclaim_cursor;
> -		} else
> -			mutex_lock(&pag->pag_ici_reclaim_lock);

I understand that the eliminated blocking drops a dependency on the
perag reclaim exclusion as described by the commit log, but I'm not sure
it's enough to justify removing it entirely. For one, the reclaim cursor
management looks potentially racy. Also, doesn't this exclusion provide
some balance for reclaim across AGs? E.g., if a bunch of reclaim threads
come in at the same time, this allows them to walk across AGs instead of
potentially stumbling over each other in the batching/grabbing code.

I see again that most of this code seems to ultimately go away, replaced
by an LRU mechanism so we no longer operate on a per-ag basis. I can see
how this becomes irrelevant with that mechanism, but I think it might
make more sense to drop this locking along with the broader mechanism in
the last patch or two of the series rather than doing it here. If
nothing else, that eliminates the need for the reviewer to consider this
transient "old mechanism + new locking" state as opposed to reasoning
about the old mechanism vs. new mechanism and why the old locking simply
no longer applies.

> +		first_index = pag->pag_ici_reclaim_cursor;
>  
>  		do {
>  			struct xfs_inode *batch[XFS_LOOKUP_BATCH];
...
> diff --git a/fs/xfs/xfs_trans_ail.c b/fs/xfs/xfs_trans_ail.c
> index 00d66175f41a..5802139f786b 100644
> --- a/fs/xfs/xfs_trans_ail.c
> +++ b/fs/xfs/xfs_trans_ail.c
> @@ -676,8 +676,10 @@ xfs_ail_push_sync(
>  	spin_lock(&ailp->ail_lock);
>  	while ((lip = xfs_ail_min(ailp)) != NULL) {
>  		prepare_to_wait(&ailp->ail_push, &wait, TASK_UNINTERRUPTIBLE);
> +	trace_printk("lip lsn 0x%llx thres 0x%llx targ 0x%llx",
> +			lip->li_lsn, threshold_lsn, ailp->ail_target);
>  		if (XFS_FORCED_SHUTDOWN(ailp->ail_mount) ||
> -		    XFS_LSN_CMP(threshold_lsn, lip->li_lsn) <= 0)
> +		    XFS_LSN_CMP(threshold_lsn, lip->li_lsn) < 0)
>  			break;

Stale/mislocated changes?

Brian

>  		/* XXX: cmpxchg? */
>  		while (XFS_LSN_CMP(threshold_lsn, ailp->ail_target) > 0)
> -- 
> 2.22.0
>
Dave Chinner Aug. 7, 2019, 11:10 p.m. UTC | #2
On Wed, Aug 07, 2019 at 02:09:15PM -0400, Brian Foster wrote:
> On Thu, Aug 01, 2019 at 12:17:48PM +1000, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > Inode reclaim currently issues it's own inode IO when it comes
> > across dirty inodes. This is used to throttle direct reclaim down to
> > the rate at which we can reclaim dirty inodes. Failure to throttle
> > in this manner results in the OOM killer being trivial to trigger
> > even when there is lots of free memory available.
> > 
> > However, having direct reclaimers issue IO causes an amount of
> > IO thrashing to occur. We can have up to the number of AGs in the
> > filesystem concurrently issuing IO, plus the AIL pushing thread as
> > well. This means we can many competing sources of IO and they all
> > end up thrashing and competing for the request slots in the block
> > device.
> > 
> > Similar to dirty page throttling and the BDI flusher thread, we can
> > use the AIL pushing thread the sole place we issue inode writeback
> > from and everything else waits for it to make progress. To do this,
> > reclaim will skip over dirty inodes, but in doing so will record the
> > lowest LSN of all the dirty inodes it skips. It will then push the
> > AIL to this LSN and wait for it to complete that work.
> > 
> > In doing so, we block direct reclaim on the IO of at least one IO,
> > thereby providing some level of throttling for when we encounter
> > dirty inodes. However we gain the ability to scan and reclaim
> > clean inodes in a non-blocking fashion. This allows us to
> > remove all the per-ag reclaim locking that avoids excessive direct
> > reclaim, as repeated concurrent direct reclaim will hit the same
> > dirty inodes on block waiting on the same IO to complete.
> > 
> 
> The last part of the above sentence sounds borked..

s/on/and/

:)

> >  /*
> > - * Grab the inode for reclaim exclusively.
> > - * Return 0 if we grabbed it, non-zero otherwise.
> > + * Grab the inode for reclaim.
> > + *
> > + * Return false if we aren't going to reclaim it, true if it is a reclaim
> > + * candidate.
> > + *
> > + * If the inode is clean or unreclaimable, return NULLCOMMITLSN to tell the
> > + * caller it does not require flushing. Otherwise return the log item lsn of the
> > + * inode so the caller can determine it's inode flush target.  If we get the
> > + * clean/dirty state wrong then it will be sorted in xfs_reclaim_inode() once we
> > + * have locks held.
> >   */
> > -STATIC int
> > +STATIC bool
> >  xfs_reclaim_inode_grab(
> >  	struct xfs_inode	*ip,
> > -	int			flags)
> > +	int			flags,
> > +	xfs_lsn_t		*lsn)
> >  {
> >  	ASSERT(rcu_read_lock_held());
> > +	*lsn = 0;
> 
> The comment above says we return NULLCOMMITLSN. Given the rest of the
> code, I'm assuming we should just fix up the comment.

Yup, I think I've already fixed it.

> > -restart:
> > -	error = 0;
> >  	/*
> >  	 * Don't try to flush the inode if another inode in this cluster has
> >  	 * already flushed it after we did the initial checks in
> >  	 * xfs_reclaim_inode_grab().
> >  	 */
> > -	if (sync_mode & SYNC_TRYLOCK) {
> > -		if (!xfs_ilock_nowait(ip, XFS_ILOCK_EXCL))
> > -			goto out;
> > -		if (!xfs_iflock_nowait(ip))
> > -			goto out_unlock;
> > -	} else {
> > -		xfs_ilock(ip, XFS_ILOCK_EXCL);
> > -		if (!xfs_iflock_nowait(ip)) {
> > -			if (!(sync_mode & SYNC_WAIT))
> > -				goto out_unlock;
> > -			xfs_iflock(ip);
> > -		}
> > -	}
> > +	if (!xfs_ilock_nowait(ip, XFS_ILOCK_EXCL))
> > +		goto out;
> > +	if (!xfs_iflock_nowait(ip))
> > +		goto out_unlock;
> >  
> 
> Do we even need the flush lock here any more if we're never going to
> flush from this context?

Ideally, no. But the inode my currently be being flushed, in which
case the incore inode is clean, but we can't reclaim it yet. Hence
we need the flush lock to serialise against IO completion.

> The shutdown case just below notwithstanding
> (which I'm also wondering if we should just drop given we abort from
> xfs_iflush() on shutdown), the pin count is an atomic and the dirty
> state changes under ilock.

The shutdown case has to handle pinned inodes, not just inodes being
flushed.

> Maybe I'm missing something else, but the reason I ask is that the
> increased flush lock contention in codepaths that don't actually flush
> once it's acquired gives me a bit of concern that we could reduce
> effectiveness of the one task that actually does (xfsaild).

The flush lock isn't a contended lock - it's actually a bit that is
protected by the i_flags_lock, so if we are contending on anything
it will be the flags lock. And, well, see the LRU isolate function
conversion of this code, because it changes how the flags lock is
used for reclaim but I haven't seen any contention as a result of
that change....
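
What that means concretely, roughly paraphrased (a sketch of the idea
only, not the literal helper implementations in xfs_inode.h):

/*
 * Sketch only: XFS_IFLOCK is an inode flag bit, and taking the "flush
 * lock" without blocking is a test-and-set of that bit under
 * ip->i_flags_lock.  So any contention shows up on the flags spinlock,
 * not on a separate flush lock.
 */
static inline bool
iflock_nowait_sketch(
	struct xfs_inode	*ip)
{
	bool			locked = false;

	spin_lock(&ip->i_flags_lock);
	if (!(ip->i_flags & XFS_IFLOCK)) {
		ip->i_flags |= XFS_IFLOCK;
		locked = true;
	}
	spin_unlock(&ip->i_flags_lock);
	return locked;
}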

> 
> > -	 * Never flush out dirty data during non-blocking reclaim, as it would
> > -	 * just contend with AIL pushing trying to do the same job.
> > +	 * If it is pinned, we only want to flush this if there's nothing else
> > +	 * to be flushed as it requires a log force. Hence we essentially set
> > +	 * the LSN to flush the entire AIL which will end up triggering a log
> > +	 * force to unpin this inode, but that will only happen if there are not
> > +	 * other inodes in the scan that only need writeback.
> >  	 */
> > -	if (!(sync_mode & SYNC_WAIT))
> > +	if (xfs_ipincount(ip)) {
> > +		*lsn = ip->i_itemp->ili_last_lsn;
> 
> ->ili_last_lsn comes from xfs_cil_ctx->sequence, which I don't think is
> actually a physical LSN suitable for AIL pushing. The lsn assigned to
> the item once it's physically logged and AIL inserted comes from
> ctx->start_lsn, which comes from the iclog header and so is a physical
> LSN.

Yup, I've already noticed and fixed that bug :)

> >  	while ((pag = xfs_perag_get_tag(mp, ag, XFS_ICI_RECLAIM_TAG))) {
> >  		unsigned long	first_index = 0;
> >  		int		done = 0;
> >  		int		nr_found = 0;
> >  
> >  		ag = pag->pag_agno + 1;
> > -
> > -		if (trylock) {
> > -			if (!mutex_trylock(&pag->pag_ici_reclaim_lock)) {
> > -				skipped++;
> > -				xfs_perag_put(pag);
> > -				continue;
> > -			}
> > -			first_index = pag->pag_ici_reclaim_cursor;
> > -		} else
> > -			mutex_lock(&pag->pag_ici_reclaim_lock);
> 
> I understand that the eliminated blocking drops a dependency on the
> perag reclaim exclusion as described by the commit log, but I'm not sure
> it's enough to justify removing it entirely. For one, the reclaim cursor
> management looks potentially racy.

We really don't care if the cursor updates are racy. All that will
result in is some inode ranges being scanned twice in quick
succession. All this does now is prevent reclaim from starting at
the start of the AG every time it runs, so we end up with most
reclaimers iterating over previously unscanned inodes.

> Also, doesn't this exclusion provide
> some balance for reclaim across AGs? E.g., if a bunch of reclaim threads
> come in at the same time, this allows them to walk across AGs instead of
> potentially stumbling over eachother in the batching/grabbing code.

What they do now is leapfrog each other and work through the same
AGs much faster. The overall pattern of reclaim doesn't actually
change much, just the speed at which individual AGs are scanned.

But that was not what the locking was put in place for. The locking
was put in place to be able to throttle the number of concurrent
reclaimers issuing IO. If the reclaimers leapfrogged like they do
without the locking, then we end up with non-sequential inode
writeback patterns, and writeback performance goes really bad,
really quickly. Hence the locking is there to ensure we get
sequential inode writeback patterns from each AG that is being
reclaimed from. That can be optimised by block layer merging, and so
even though we might have a lot of concurrent reclaimers, we get
lots of large, well-formed IOs from each of them.

IOWs, the locking was all about preventing the IO patterns from
breaking down under memory pressure, not about optimising how
reclaimers interact with each other.

> I see again that most of this code seems to ultimately go away, replaced
> by an LRU mechanism so we no longer operate on a per-ag basis. I can see
> how this becomes irrelevant with that mechanism, but I think it might
> make more sense to drop this locking along with the broader mechanism in
> the last patch or two of the series rather than doing it here.

Fundamentally, this patch is all about shifting the IO and blocking
mechanisms to the AIL. This locking is no longer necessary, and it
actually gets in the way of doing non-blocking reclaim and shifting
the IO to the AIL. i.e. we block where we no longer need to, and
that causes more problems for this change than it solves.

> If
> nothing else, that eliminates the need for the reviewer to consider this
> transient "old mechanism + new locking" state as opposed to reasoning
> about the old mechanism vs. new mechanism and why the old locking simply
> no longer applies.

I think you're putting too much "make every step of the transition
perfect" focus on this. We've got to drop this locking to make
reclaim non-blocking, and we have to make reclaim non-blocking
before we can move to a LRU mechanisms that relies on LRU removal
being completely non-blocking and cannot issue IO. It's a waste of
time trying to break this down further and further into "perfect"
patches - it works well enough and without functional regressions so
it does not create landmines for people doing bisects, and that's
largely all that matters in the middle of a large patchset that is
making large algorithm changes...

> > +		first_index = pag->pag_ici_reclaim_cursor;
> >  
> >  		do {
> >  			struct xfs_inode *batch[XFS_LOOKUP_BATCH];
> ...
> > diff --git a/fs/xfs/xfs_trans_ail.c b/fs/xfs/xfs_trans_ail.c
> > index 00d66175f41a..5802139f786b 100644
> > --- a/fs/xfs/xfs_trans_ail.c
> > +++ b/fs/xfs/xfs_trans_ail.c
> > @@ -676,8 +676,10 @@ xfs_ail_push_sync(
> >  	spin_lock(&ailp->ail_lock);
> >  	while ((lip = xfs_ail_min(ailp)) != NULL) {
> >  		prepare_to_wait(&ailp->ail_push, &wait, TASK_UNINTERRUPTIBLE);
> > +	trace_printk("lip lsn 0x%llx thres 0x%llx targ 0x%llx",
> > +			lip->li_lsn, threshold_lsn, ailp->ail_target);
> >  		if (XFS_FORCED_SHUTDOWN(ailp->ail_mount) ||
> > -		    XFS_LSN_CMP(threshold_lsn, lip->li_lsn) <= 0)
> > +		    XFS_LSN_CMP(threshold_lsn, lip->li_lsn) < 0)
> >  			break;
> 
> Stale/mislocated changes?

I've already cleaned that one up, too.

Cheers,

Dave.
Brian Foster Aug. 8, 2019, 4:20 p.m. UTC | #3
On Thu, Aug 08, 2019 at 09:10:44AM +1000, Dave Chinner wrote:
> On Wed, Aug 07, 2019 at 02:09:15PM -0400, Brian Foster wrote:
> > On Thu, Aug 01, 2019 at 12:17:48PM +1000, Dave Chinner wrote:
> > > From: Dave Chinner <dchinner@redhat.com>
> > > 
> > > Inode reclaim currently issues it's own inode IO when it comes
> > > across dirty inodes. This is used to throttle direct reclaim down to
> > > the rate at which we can reclaim dirty inodes. Failure to throttle
> > > in this manner results in the OOM killer being trivial to trigger
> > > even when there is lots of free memory available.
> > > 
> > > However, having direct reclaimers issue IO causes an amount of
> > > IO thrashing to occur. We can have up to the number of AGs in the
> > > filesystem concurrently issuing IO, plus the AIL pushing thread as
> > > well. This means we can many competing sources of IO and they all
> > > end up thrashing and competing for the request slots in the block
> > > device.
> > > 
> > > Similar to dirty page throttling and the BDI flusher thread, we can
> > > use the AIL pushing thread the sole place we issue inode writeback
> > > from and everything else waits for it to make progress. To do this,
> > > reclaim will skip over dirty inodes, but in doing so will record the
> > > lowest LSN of all the dirty inodes it skips. It will then push the
> > > AIL to this LSN and wait for it to complete that work.
> > > 
> > > In doing so, we block direct reclaim on the IO of at least one IO,
> > > thereby providing some level of throttling for when we encounter
> > > dirty inodes. However we gain the ability to scan and reclaim
> > > clean inodes in a non-blocking fashion. This allows us to
> > > remove all the per-ag reclaim locking that avoids excessive direct
> > > reclaim, as repeated concurrent direct reclaim will hit the same
> > > dirty inodes on block waiting on the same IO to complete.
> > > 
...
> > > -restart:
> > > -	error = 0;
> > >  	/*
> > >  	 * Don't try to flush the inode if another inode in this cluster has
> > >  	 * already flushed it after we did the initial checks in
> > >  	 * xfs_reclaim_inode_grab().
> > >  	 */
> > > -	if (sync_mode & SYNC_TRYLOCK) {
> > > -		if (!xfs_ilock_nowait(ip, XFS_ILOCK_EXCL))
> > > -			goto out;
> > > -		if (!xfs_iflock_nowait(ip))
> > > -			goto out_unlock;
> > > -	} else {
> > > -		xfs_ilock(ip, XFS_ILOCK_EXCL);
> > > -		if (!xfs_iflock_nowait(ip)) {
> > > -			if (!(sync_mode & SYNC_WAIT))
> > > -				goto out_unlock;
> > > -			xfs_iflock(ip);
> > > -		}
> > > -	}
> > > +	if (!xfs_ilock_nowait(ip, XFS_ILOCK_EXCL))
> > > +		goto out;
> > > +	if (!xfs_iflock_nowait(ip))
> > > +		goto out_unlock;
> > >  
> > 
> > Do we even need the flush lock here any more if we're never going to
> > flush from this context?
> 
> Ideally, no. But the inode my currently be being flushed, in which
> case the incore inode is clean, but we can't reclaim it yet. Hence
> we need the flush lock to serialise against IO completion.
> 

Ok, so xfs_inode_clean() checks ili_fields and that state is cleared at
flush time (under the flush lock). Hmm, so xfs_inode_clean() basically
determines whether the inode needs flushing or not, not necessarily
whether the in-core inode is clean with respect to the on-disk inode. I
wonder if we could enhance that or provide a new helper variant to cover
the latter as well, perhaps by also looking at ->ili_last_fields (with a
separate lock). Anyways, that's a matter for a separate patch..
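
For illustration, such a variant might look something like this (a
hypothetical sketch only; the name is made up and the serialisation of
ili_last_fields is hand-waved, as noted above):

/*
 * Hypothetical helper, not existing XFS API: treat the inode as clean
 * with respect to the on-disk inode only if it neither needs flushing
 * (ili_fields) nor has a flush still in progress (ili_last_fields).
 */
static inline bool
xfs_inode_clean_ondisk(
	struct xfs_inode	*ip)
{
	struct xfs_inode_log_item *iip = ip->i_itemp;

	if (!iip)
		return true;

	/* locking of ili_last_fields is hand-waved here */
	return !iip->ili_fields && !iip->ili_last_fields;
}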

> > The shutdown case just below notwithstanding
> > (which I'm also wondering if we should just drop given we abort from
> > xfs_iflush() on shutdown), the pin count is an atomic and the dirty
> > state changes under ilock.
> 
> The shutdown case has to handle pinned inodes, not just inodes being
> flushed.
> 
> > Maybe I'm missing something else, but the reason I ask is that the
> > increased flush lock contention in codepaths that don't actually flush
> > once it's acquired gives me a bit of concern that we could reduce
> > effectiveness of the one task that actually does (xfsaild).
> 
> The flush lock isn't a contended lock - it's actually a bit that is
> protected by the i_flags_lock, so if we are contending on anything
> it will be the flags lock. And, well, see the LRU isolate function
> conversion of this code, becuase it changes how the flags lock is
> used for reclaim but I haven't seen any contention as a result of
> that change....
> 

True, but I guess it's not so much the lock contention I'm concerned
about as opposed to the resulting impact of increased nonblocking
reclaim scanning on xfsaild. As of right now reclaim activity is highly
throttled and if a direct reclaimer does ultimately acquire the flush
lock, it will perform the flush.

With reduced blocking/throttling and no reclaim flushing, I'm wondering
how possible it is for a bunch of direct reclaimers to come in and
effectively cycle over the same batch of dirty inodes repeatedly and
faster than xfsaild to the point where xfsaild can't make progress
actually flushing some of these inodes. Looking again, the combination
of the unlocked dirty check before grabbing the inode and sync AIL push
after each pass might be enough to prevent this from being a problem.
The former allows a direct reclaimer to skip the lock cycle if the inode
is obviously dirty and if the lock cycle was necessary to detect dirty
state, the associated task will wait for a while before coming around
again.

...
> > >  	while ((pag = xfs_perag_get_tag(mp, ag, XFS_ICI_RECLAIM_TAG))) {
> > >  		unsigned long	first_index = 0;
> > >  		int		done = 0;
> > >  		int		nr_found = 0;
> > >  
> > >  		ag = pag->pag_agno + 1;
> > > -
> > > -		if (trylock) {
> > > -			if (!mutex_trylock(&pag->pag_ici_reclaim_lock)) {
> > > -				skipped++;
> > > -				xfs_perag_put(pag);
> > > -				continue;
> > > -			}
> > > -			first_index = pag->pag_ici_reclaim_cursor;
> > > -		} else
> > > -			mutex_lock(&pag->pag_ici_reclaim_lock);
> > 
> > I understand that the eliminated blocking drops a dependency on the
> > perag reclaim exclusion as described by the commit log, but I'm not sure
> > it's enough to justify removing it entirely. For one, the reclaim cursor
> > management looks potentially racy.
> 
> We really don't care if the cursor updates are racy. All that will
> result in is some inode ranges being scanned twice in quick
> succession. All this does now is prevent reclaim from starting at
> the start of the AG every time it runs, so we end up with most
> reclaimers iterating over previously unscanned inodes.
> 

That might happen, sure, but a racing update may also cause a reclaimer
in progress to jump backwards and rescan a set of inodes it just
finished with (assuming some were dirty and left around). If that
happens once or twice it's probably not a big deal, but if you have a
bunch of reclaimer threads repeatedly stammering over the same ranges
over and over again, ISTM the whole thing could slow down to a crawl and
do quite a bit less real scanning work than requested by the shrinker
core.

> > Also, doesn't this exclusion provide
> > some balance for reclaim across AGs? E.g., if a bunch of reclaim threads
> > come in at the same time, this allows them to walk across AGs instead of
> > potentially stumbling over eachother in the batching/grabbing code.
> 
> What they do now is leapfrog each other and work through the same
> AGs much faster. The overall pattern of reclaim doesn't actually
> change much, just the speed at which individual AGs are scanned.
> 
> But that was not what the locking was put in place for. THe locking
> was put in place to be able to throttle the number of concurrent
> reclaimers issuing IO. If the reclaimers leapfrogged like they do
> without the locking, then we end up with non-sequential inode
> writeback patterns, and writeback performance goes really bad,
> really quickly. Hence the locking is there to ensure we get
> sequential inode writeback patterns from each AG that is being
> reclaimed from. That can be optimised by block layer merging, and so
> even though we might have a lot of concurrent reclaimers, we get
> lots of large, well-formed IOs from each of them.
> 
> IOWs, the locking was all about preventing the IO patterns from
> breaking down under memory pressure, not about optimising how
> reclaimers interact with each other.
> 

I don't dispute any of that, but that doesn't necessarily mean that code
added since the locking was added doesn't depend on the serialization
that was already in place at the time to function sanely. I'm not asking
for performance/optimization here.

> > I see again that most of this code seems to ultimately go away, replaced
> > by an LRU mechanism so we no longer operate on a per-ag basis. I can see
> > how this becomes irrelevant with that mechanism, but I think it might
> > make more sense to drop this locking along with the broader mechanism in
> > the last patch or two of the series rather than doing it here.
> 
> Fundamentally, this patch is all about shifting the IO and blocking
> mechanisms to the AIL. This locking is no longer necessary, and it
> actually gets in the way of doing non-blocking reclaim and shifting
> the IO to the AIL. i.e. we block where we no longer need to, and
> that causes more problems for this change than it solves.
> 

I could see the issue of blocking where we no longer need to, but this
patch could also change the contextual blocking behavior without
necessarily removing the lock entirely. Or better yet, drop the trylock
and reduce the scope of the lock to the inode grabbing and cursor update
(and re-sample the cursor on each iteration). The cursor update index
doesn't change after we grab the last inode we're going to try and
reclaim, but we currently hold the lock across the entire batch reclaim.
Doing the lookup and grab under the lock and then releasing it once we
start to reclaim allows multiple threads to drive reclaim of the same AG
as you want, but prevents those tasks from disrupting each other. We
still would have a blocking lock in this path, but it's purely
structural and we'd only block for what would otherwise be time spent
scanning over already handled inodes (i.e.  wasted CPU time). IOW, if
there's contention on the lock, then by definition some other task has
the same lookup index.

> > If
> > nothing else, that eliminates the need for the reviewer to consider this
> > transient "old mechanism + new locking" state as opposed to reasoning
> > about the old mechanism vs. new mechanism and why the old locking simply
> > no longer applies.
> 
> I think you're putting to much "make every step of the transition
> perfect" focus on this. We've got to drop this locking to make
> reclaim non-blocking, and we have to make reclaim non-blocking
> before we can move to a LRU mechanisms that relies on LRU removal
> being completely non-blocking and cannot issue IO. It's a waste of
> time trying to break this down further and further into "perfect"
> patches - it works well enough and without functional regressions so
> it does not create landmines for people doing bisects, and that's
> largely all that matters in the middle of a large patchset that is
> making large algorithm changes...
> 

I'm not really worried about maintaining perfection throughout every
step of this series. I just think this patch is a bit too fast and
loose.

In logistical terms, this patch and the next few continue to modify the
perag based reclaim mechanism, the second to last patch adds a new
reclaim mechanism to reclaim via the newly added LRU, and the final
patch removes the old/unused perag bits. The locking we're talking about
here is part of that old/unused mechanism that is ripped out in the
final patch, not common code shared between the two, so I don't see a
problem with either leaving it or changing it around as described above.
An untested patch (based on top of this one) for the latter is appended
for reference..

Brian

--- 8< ---

diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index 4c4c5bc12147..87f3ca86dfd1 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -1210,12 +1210,14 @@ xfs_reclaim_inodes_ag(
 		int		nr_found = 0;
 
 		ag = pag->pag_agno + 1;
-		first_index = pag->pag_ici_reclaim_cursor;
 
 		do {
 			struct xfs_inode *batch[XFS_LOOKUP_BATCH];
 			int	i;
 
+			mutex_lock(&pag->pag_ici_reclaim_lock);
+			first_index = pag->pag_ici_reclaim_cursor;
+
 			rcu_read_lock();
 			nr_found = radix_tree_gang_lookup_tag(
 					&pag->pag_ici_root,
@@ -1225,6 +1227,7 @@ xfs_reclaim_inodes_ag(
 			if (!nr_found) {
 				done = 1;
 				rcu_read_unlock();
+				mutex_unlock(&pag->pag_ici_reclaim_lock);
 				break;
 			}
 
@@ -1266,6 +1269,11 @@ xfs_reclaim_inodes_ag(
 
 			/* unlock now we've grabbed the inodes. */
 			rcu_read_unlock();
+			if (!done)
+				pag->pag_ici_reclaim_cursor = first_index;
+			else
+				pag->pag_ici_reclaim_cursor = 0;
+			mutex_unlock(&pag->pag_ici_reclaim_lock);
 
 			for (i = 0; i < nr_found; i++) {
 				if (!batch[i])
@@ -1281,10 +1289,6 @@ xfs_reclaim_inodes_ag(
 
 		} while (nr_found && !done && nr_to_scan > 0);
 
-		if (!done)
-			pag->pag_ici_reclaim_cursor = first_index;
-		else
-			pag->pag_ici_reclaim_cursor = 0;
 		xfs_perag_put(pag);
 	}
 
diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
index bcf8f64d1b1f..a1805021c92f 100644
--- a/fs/xfs/xfs_mount.c
+++ b/fs/xfs/xfs_mount.c
@@ -148,6 +148,7 @@ xfs_free_perag(
 		ASSERT(atomic_read(&pag->pag_ref) == 0);
 		xfs_iunlink_destroy(pag);
 		xfs_buf_hash_destroy(pag);
+		mutex_destroy(&pag->pag_ici_reclaim_lock);
 		call_rcu(&pag->rcu_head, __xfs_free_perag);
 	}
 }
@@ -199,6 +200,7 @@ xfs_initialize_perag(
 		pag->pag_agno = index;
 		pag->pag_mount = mp;
 		spin_lock_init(&pag->pag_ici_lock);
+		mutex_init(&pag->pag_ici_reclaim_lock);
 		INIT_RADIX_TREE(&pag->pag_ici_root, GFP_ATOMIC);
 		if (xfs_buf_hash_init(pag))
 			goto out_free_pag;
@@ -240,6 +242,7 @@ xfs_initialize_perag(
 out_hash_destroy:
 	xfs_buf_hash_destroy(pag);
 out_free_pag:
+	mutex_destroy(&pag->pag_ici_reclaim_lock);
 	kmem_free(pag);
 out_unwind_new_pags:
 	/* unwind any prior newly initialized pags */
@@ -249,6 +252,7 @@ xfs_initialize_perag(
 			break;
 		xfs_buf_hash_destroy(pag);
 		xfs_iunlink_destroy(pag);
+		mutex_destroy(&pag->pag_ici_reclaim_lock);
 		kmem_free(pag);
 	}
 	return error;
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index 3ed6d942240f..a585860eaa94 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -390,6 +390,7 @@ typedef struct xfs_perag {
 	spinlock_t	pag_ici_lock;	/* incore inode cache lock */
 	struct radix_tree_root pag_ici_root;	/* incore inode cache root */
 	int		pag_ici_reclaimable;	/* reclaimable inodes */
+	struct mutex	pag_ici_reclaim_lock;	/* serialisation point */
 	unsigned long	pag_ici_reclaim_cursor;	/* reclaim restart point */
 
 	/* buffer cache index */

Patch

diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index 0bd4420a7e16..4c4c5bc12147 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -22,6 +22,7 @@ 
 #include "xfs_dquot_item.h"
 #include "xfs_dquot.h"
 #include "xfs_reflink.h"
+#include "xfs_log.h"
 
 #include <linux/iversion.h>
 
@@ -967,28 +968,42 @@  xfs_inode_ag_iterator_tag(
 }
 
 /*
- * Grab the inode for reclaim exclusively.
- * Return 0 if we grabbed it, non-zero otherwise.
+ * Grab the inode for reclaim.
+ *
+ * Return false if we aren't going to reclaim it, true if it is a reclaim
+ * candidate.
+ *
+ * If the inode is clean or unreclaimable, return NULLCOMMITLSN to tell the
+ * caller it does not require flushing. Otherwise return the log item lsn of the
+ * inode so the caller can determine it's inode flush target.  If we get the
+ * clean/dirty state wrong then it will be sorted in xfs_reclaim_inode() once we
+ * have locks held.
  */
-STATIC int
+STATIC bool
 xfs_reclaim_inode_grab(
 	struct xfs_inode	*ip,
-	int			flags)
+	int			flags,
+	xfs_lsn_t		*lsn)
 {
 	ASSERT(rcu_read_lock_held());
+	*lsn = 0;
 
 	/* quick check for stale RCU freed inode */
 	if (!ip->i_ino)
-		return 1;
+		return false;
 
 	/*
-	 * If we are asked for non-blocking operation, do unlocked checks to
-	 * see if the inode already is being flushed or in reclaim to avoid
-	 * lock traffic.
+	 * Do unlocked checks to see if the inode already is being flushed or in
+	 * reclaim to avoid lock traffic. If the inode is not clean, return the
+	 * it's position in the AIL for the caller to push to.
 	 */
-	if ((flags & SYNC_TRYLOCK) &&
-	    __xfs_iflags_test(ip, XFS_IFLOCK | XFS_IRECLAIM))
-		return 1;
+	if (!xfs_inode_clean(ip)) {
+		*lsn = ip->i_itemp->ili_item.li_lsn;
+		return false;
+	}
+
+	if (__xfs_iflags_test(ip, XFS_IFLOCK | XFS_IRECLAIM))
+		return false;
 
 	/*
 	 * The radix tree lock here protects a thread in xfs_iget from racing
@@ -1005,11 +1020,11 @@  xfs_reclaim_inode_grab(
 	    __xfs_iflags_test(ip, XFS_IRECLAIM)) {
 		/* not a reclaim candidate. */
 		spin_unlock(&ip->i_flags_lock);
-		return 1;
+		return false;
 	}
 	__xfs_iflags_set(ip, XFS_IRECLAIM);
 	spin_unlock(&ip->i_flags_lock);
-	return 0;
+	return true;
 }
 
 /*
@@ -1050,92 +1065,67 @@  xfs_reclaim_inode_grab(
  *	clean		=> reclaim
  *	dirty, async	=> requeue
  *	dirty, sync	=> flush, wait and reclaim
+ *
+ * Returns true if the inode was reclaimed, false otherwise.
  */
-STATIC int
+STATIC bool
 xfs_reclaim_inode(
 	struct xfs_inode	*ip,
 	struct xfs_perag	*pag,
-	int			sync_mode)
+	xfs_lsn_t		*lsn)
 {
-	struct xfs_buf		*bp = NULL;
-	xfs_ino_t		ino = ip->i_ino; /* for radix_tree_delete */
-	int			error;
+	xfs_ino_t		ino;
+
+	*lsn = 0;
 
-restart:
-	error = 0;
 	/*
 	 * Don't try to flush the inode if another inode in this cluster has
 	 * already flushed it after we did the initial checks in
 	 * xfs_reclaim_inode_grab().
 	 */
-	if (sync_mode & SYNC_TRYLOCK) {
-		if (!xfs_ilock_nowait(ip, XFS_ILOCK_EXCL))
-			goto out;
-		if (!xfs_iflock_nowait(ip))
-			goto out_unlock;
-	} else {
-		xfs_ilock(ip, XFS_ILOCK_EXCL);
-		if (!xfs_iflock_nowait(ip)) {
-			if (!(sync_mode & SYNC_WAIT))
-				goto out_unlock;
-			xfs_iflock(ip);
-		}
-	}
+	if (!xfs_ilock_nowait(ip, XFS_ILOCK_EXCL))
+		goto out;
+	if (!xfs_iflock_nowait(ip))
+		goto out_unlock;
 
+	/* If we are in shutdown, we don't care about blocking. */
 	if (XFS_FORCED_SHUTDOWN(ip->i_mount)) {
 		xfs_iunpin_wait(ip);
 		/* xfs_iflush_abort() drops the flush lock */
 		xfs_iflush_abort(ip, false);
 		goto reclaim;
 	}
-	if (xfs_ipincount(ip)) {
-		if (!(sync_mode & SYNC_WAIT))
-			goto out_ifunlock;
-		xfs_iunpin_wait(ip);
-	}
-	if (xfs_iflags_test(ip, XFS_ISTALE) || xfs_inode_clean(ip)) {
-		xfs_ifunlock(ip);
-		goto reclaim;
-	}
 
 	/*
-	 * Never flush out dirty data during non-blocking reclaim, as it would
-	 * just contend with AIL pushing trying to do the same job.
+	 * If it is pinned, we only want to flush this if there's nothing else
+	 * to be flushed as it requires a log force. Hence we essentially set
+	 * the LSN to flush the entire AIL which will end up triggering a log
+	 * force to unpin this inode, but that will only happen if there are not
+	 * other inodes in the scan that only need writeback.
 	 */
-	if (!(sync_mode & SYNC_WAIT))
+	if (xfs_ipincount(ip)) {
+		*lsn = ip->i_itemp->ili_last_lsn;
 		goto out_ifunlock;
+	}
 
 	/*
-	 * Now we have an inode that needs flushing.
-	 *
-	 * Note that xfs_iflush will never block on the inode buffer lock, as
-	 * xfs_ifree_cluster() can lock the inode buffer before it locks the
-	 * ip->i_lock, and we are doing the exact opposite here.  As a result,
-	 * doing a blocking xfs_imap_to_bp() to get the cluster buffer would
-	 * result in an ABBA deadlock with xfs_ifree_cluster().
-	 *
-	 * As xfs_ifree_cluser() must gather all inodes that are active in the
-	 * cache to mark them stale, if we hit this case we don't actually want
-	 * to do IO here - we want the inode marked stale so we can simply
-	 * reclaim it.  Hence if we get an EAGAIN error here,  just unlock the
-	 * inode, back off and try again.  Hopefully the next pass through will
-	 * see the stale flag set on the inode.
+	 * Dirty inode we didn't catch, skip it.
 	 */
-	error = xfs_iflush(ip, &bp);
-	if (error == -EAGAIN) {
-		xfs_iunlock(ip, XFS_ILOCK_EXCL);
-		/* backoff longer than in xfs_ifree_cluster */
-		delay(2);
-		goto restart;
+	if (!xfs_inode_clean(ip) && !xfs_iflags_test(ip, XFS_ISTALE)) {
+		*lsn = ip->i_itemp->ili_item.li_lsn;
+		goto out_ifunlock;
 	}
 
-	if (!error) {
-		error = xfs_bwrite(bp);
-		xfs_buf_relse(bp);
-	}
+	/*
+	 * It's clean, we have it locked, we can now drop the flush lock
+	 * and reclaim it.
+	 */
+	xfs_ifunlock(ip);
 
 reclaim:
 	ASSERT(!xfs_isiflocked(ip));
+	ASSERT(xfs_inode_clean(ip) || xfs_iflags_test(ip, XFS_ISTALE));
+	ASSERT(ip->i_ino != 0);
 
 	/*
 	 * Because we use RCU freeing we need to ensure the inode always appears
@@ -1148,6 +1138,7 @@  xfs_reclaim_inode(
 	 * will see an invalid inode that it can skip.
 	 */
 	spin_lock(&ip->i_flags_lock);
+	ino = ip->i_ino; /* for radix_tree_delete */
 	ip->i_flags = XFS_IRECLAIM;
 	ip->i_ino = 0;
 	spin_unlock(&ip->i_flags_lock);
@@ -1182,7 +1173,7 @@  xfs_reclaim_inode(
 	xfs_iunlock(ip, XFS_ILOCK_EXCL);
 
 	__xfs_inode_free(ip);
-	return error;
+	return true;
 
 out_ifunlock:
 	xfs_ifunlock(ip);
@@ -1190,14 +1181,7 @@  xfs_reclaim_inode(
 	xfs_iunlock(ip, XFS_ILOCK_EXCL);
 out:
 	xfs_iflags_clear(ip, XFS_IRECLAIM);
-	/*
-	 * We could return -EAGAIN here to make reclaim rescan the inode tree in
-	 * a short while. However, this just burns CPU time scanning the tree
-	 * waiting for IO to complete and the reclaim work never goes back to
-	 * the idle state. Instead, return 0 to let the next scheduled
-	 * background reclaim attempt to reclaim the inode again.
-	 */
-	return 0;
+	return false;
 }
 
 /*
@@ -1205,39 +1189,28 @@  xfs_reclaim_inode(
  * corrupted, we still want to try to reclaim all the inodes. If we don't,
  * then a shut down during filesystem unmount reclaim walk leak all the
  * unreclaimed inodes.
+ *
+ * Return the number of inodes freed.
  */
 STATIC int
 xfs_reclaim_inodes_ag(
 	struct xfs_mount	*mp,
 	int			flags,
-	int			*nr_to_scan)
+	int			nr_to_scan)
 {
 	struct xfs_perag	*pag;
-	int			error = 0;
-	int			last_error = 0;
 	xfs_agnumber_t		ag;
-	int			trylock = flags & SYNC_TRYLOCK;
-	int			skipped;
+	xfs_lsn_t		lsn, lowest_lsn = NULLCOMMITLSN;
+	long			freed = 0;
 
-restart:
 	ag = 0;
-	skipped = 0;
 	while ((pag = xfs_perag_get_tag(mp, ag, XFS_ICI_RECLAIM_TAG))) {
 		unsigned long	first_index = 0;
 		int		done = 0;
 		int		nr_found = 0;
 
 		ag = pag->pag_agno + 1;
-
-		if (trylock) {
-			if (!mutex_trylock(&pag->pag_ici_reclaim_lock)) {
-				skipped++;
-				xfs_perag_put(pag);
-				continue;
-			}
-			first_index = pag->pag_ici_reclaim_cursor;
-		} else
-			mutex_lock(&pag->pag_ici_reclaim_lock);
+		first_index = pag->pag_ici_reclaim_cursor;
 
 		do {
 			struct xfs_inode *batch[XFS_LOOKUP_BATCH];
@@ -1262,9 +1235,13 @@  xfs_reclaim_inodes_ag(
 			for (i = 0; i < nr_found; i++) {
 				struct xfs_inode *ip = batch[i];
 
-				if (done || xfs_reclaim_inode_grab(ip, flags))
+				if (done ||
+				    !xfs_reclaim_inode_grab(ip, flags, &lsn))
 					batch[i] = NULL;
 
+				if (lsn && XFS_LSN_CMP(lsn, lowest_lsn) < 0)
+					lowest_lsn = lsn;
+
 				/*
 				 * Update the index for the next lookup. Catch
 				 * overflows into the next AG range which can
@@ -1293,37 +1270,28 @@  xfs_reclaim_inodes_ag(
 			for (i = 0; i < nr_found; i++) {
 				if (!batch[i])
 					continue;
-				error = xfs_reclaim_inode(batch[i], pag, flags);
-				if (error && last_error != -EFSCORRUPTED)
-					last_error = error;
+				if (xfs_reclaim_inode(batch[i], pag, &lsn))
+					freed++;
+				if (lsn && XFS_LSN_CMP(lsn, lowest_lsn) < 0)
+					lowest_lsn = lsn;
 			}
 
-			*nr_to_scan -= XFS_LOOKUP_BATCH;
-
+			nr_to_scan -= XFS_LOOKUP_BATCH;
 			cond_resched();
 
-		} while (nr_found && !done && *nr_to_scan > 0);
+		} while (nr_found && !done && nr_to_scan > 0);
 
-		if (trylock && !done)
+		if (!done)
 			pag->pag_ici_reclaim_cursor = first_index;
 		else
 			pag->pag_ici_reclaim_cursor = 0;
-		mutex_unlock(&pag->pag_ici_reclaim_lock);
 		xfs_perag_put(pag);
 	}
 
-	/*
-	 * if we skipped any AG, and we still have scan count remaining, do
-	 * another pass this time using blocking reclaim semantics (i.e
-	 * waiting on the reclaim locks and ignoring the reclaim cursors). This
-	 * ensure that when we get more reclaimers than AGs we block rather
-	 * than spin trying to execute reclaim.
-	 */
-	if (skipped && (flags & SYNC_WAIT) && *nr_to_scan > 0) {
-		trylock = 0;
-		goto restart;
-	}
-	return last_error;
+	if ((flags & SYNC_WAIT) && lowest_lsn != NULLCOMMITLSN)
+		xfs_ail_push_sync(mp->m_ail, lowest_lsn);
+
+	return freed;
 }
 
 int
@@ -1331,9 +1299,7 @@  xfs_reclaim_inodes(
 	xfs_mount_t	*mp,
 	int		mode)
 {
-	int		nr_to_scan = INT_MAX;
-
-	return xfs_reclaim_inodes_ag(mp, mode, &nr_to_scan);
+	return xfs_reclaim_inodes_ag(mp, mode, INT_MAX);
 }
 
 /*
@@ -1350,7 +1316,7 @@  xfs_reclaim_inodes_nr(
 	struct xfs_mount	*mp,
 	int			nr_to_scan)
 {
-	int			sync_mode = SYNC_TRYLOCK;
+	int			sync_mode = 0;
 
 	/*
 	 * For kswapd, we kick background inode writeback. For direct
@@ -1362,7 +1328,7 @@  xfs_reclaim_inodes_nr(
 	else
 		sync_mode |= SYNC_WAIT;
 
-	return xfs_reclaim_inodes_ag(mp, sync_mode, &nr_to_scan);
+	return xfs_reclaim_inodes_ag(mp, sync_mode, nr_to_scan);
 }
 
 /*
diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
index a1805021c92f..bcf8f64d1b1f 100644
--- a/fs/xfs/xfs_mount.c
+++ b/fs/xfs/xfs_mount.c
@@ -148,7 +148,6 @@  xfs_free_perag(
 		ASSERT(atomic_read(&pag->pag_ref) == 0);
 		xfs_iunlink_destroy(pag);
 		xfs_buf_hash_destroy(pag);
-		mutex_destroy(&pag->pag_ici_reclaim_lock);
 		call_rcu(&pag->rcu_head, __xfs_free_perag);
 	}
 }
@@ -200,7 +199,6 @@  xfs_initialize_perag(
 		pag->pag_agno = index;
 		pag->pag_mount = mp;
 		spin_lock_init(&pag->pag_ici_lock);
-		mutex_init(&pag->pag_ici_reclaim_lock);
 		INIT_RADIX_TREE(&pag->pag_ici_root, GFP_ATOMIC);
 		if (xfs_buf_hash_init(pag))
 			goto out_free_pag;
@@ -242,7 +240,6 @@  xfs_initialize_perag(
 out_hash_destroy:
 	xfs_buf_hash_destroy(pag);
 out_free_pag:
-	mutex_destroy(&pag->pag_ici_reclaim_lock);
 	kmem_free(pag);
 out_unwind_new_pags:
 	/* unwind any prior newly initialized pags */
@@ -252,7 +249,6 @@  xfs_initialize_perag(
 			break;
 		xfs_buf_hash_destroy(pag);
 		xfs_iunlink_destroy(pag);
-		mutex_destroy(&pag->pag_ici_reclaim_lock);
 		kmem_free(pag);
 	}
 	return error;
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index f0cc952ad527..2049e764faed 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -383,7 +383,6 @@  typedef struct xfs_perag {
 	spinlock_t	pag_ici_lock;	/* incore inode cache lock */
 	struct radix_tree_root pag_ici_root;	/* incore inode cache root */
 	int		pag_ici_reclaimable;	/* reclaimable inodes */
-	struct mutex	pag_ici_reclaim_lock;	/* serialisation point */
 	unsigned long	pag_ici_reclaim_cursor;	/* reclaim restart point */
 
 	/* buffer cache index */
diff --git a/fs/xfs/xfs_trans_ail.c b/fs/xfs/xfs_trans_ail.c
index 00d66175f41a..5802139f786b 100644
--- a/fs/xfs/xfs_trans_ail.c
+++ b/fs/xfs/xfs_trans_ail.c
@@ -676,8 +676,10 @@  xfs_ail_push_sync(
 	spin_lock(&ailp->ail_lock);
 	while ((lip = xfs_ail_min(ailp)) != NULL) {
 		prepare_to_wait(&ailp->ail_push, &wait, TASK_UNINTERRUPTIBLE);
+	trace_printk("lip lsn 0x%llx thres 0x%llx targ 0x%llx",
+			lip->li_lsn, threshold_lsn, ailp->ail_target);
 		if (XFS_FORCED_SHUTDOWN(ailp->ail_mount) ||
-		    XFS_LSN_CMP(threshold_lsn, lip->li_lsn) <= 0)
+		    XFS_LSN_CMP(threshold_lsn, lip->li_lsn) < 0)
 			break;
 		/* XXX: cmpxchg? */
 		while (XFS_LSN_CMP(threshold_lsn, ailp->ail_target) > 0)