
xfs: add readahead bufs to lru early to prevent post-unmount panic

Message ID 20160712172259.GA22757@bfoster.bfoster (mailing list archive)
State New, archived

Commit Message

Brian Foster July 12, 2016, 5:22 p.m. UTC
On Tue, Jul 12, 2016 at 08:03:15AM -0400, Brian Foster wrote:
> On Tue, Jul 12, 2016 at 08:44:51AM +1000, Dave Chinner wrote:
> > On Mon, Jul 11, 2016 at 11:29:22AM -0400, Brian Foster wrote:
> > > On Mon, Jul 11, 2016 at 09:52:52AM -0400, Brian Foster wrote:
> > > ...
> > > > So what is your preference out of the possible approaches here? AFAICS,
> > > > we have the following options:
> > > > 
> > > > 1.) The original "add readahead to LRU early" approach.
> > > > 	Pros: simple one-liner
> > > > 	Cons: bit of a hack, only covers readahead scenario
> > > > 2.) Defer I/O count decrement to buffer release (this patch).
> > > > 	Pros: should cover all cases (reads/writes)
> > > > 	Cons: more complex (requires per-buffer accounting, etc.)
> > > > 3.) Raw (buffer or bio?) I/O count (no defer to buffer release)
> > > > 	Pros: eliminates some complexity from #2
> > > > 	Cons: still more complex than #1, racy in that decrement does
> > > > 	not serialize against LRU addition (requires drain_workqueue(),
> > > > 	which still doesn't cover error conditions)
> > > > 
> > > > As noted above, option #3 also allows for either a buffer based count or
> > > > bio based count, the latter of which might simplify things a bit further
> > > > (TBD). Thoughts?
> > 
> > Pretty good summary :P
> > 
> > > FWIW, the following is a slightly cleaned up version of my initial
> > > approach (option #3 above). Note that the flag is used to help deal with
> > > varying ioend behavior. E.g., xfs_buf_ioend() is called once for some
> > > buffers, multiple times for others with an iodone callback, that
> > > behavior changes in some cases when an error is set, etc. (I'll add
> > > comments before an official post.)
> > 
> > The approach looks good - I think there's a couple of things we can
> > do to clean it up and make it robust. Comments inline.
> > 
> > > diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> > > index 4665ff6..45d3ddd 100644
> > > --- a/fs/xfs/xfs_buf.c
> > > +++ b/fs/xfs/xfs_buf.c
> > > @@ -1018,7 +1018,10 @@ xfs_buf_ioend(
> > >  
> > >  	trace_xfs_buf_iodone(bp, _RET_IP_);
> > >  
> > > -	bp->b_flags &= ~(XBF_READ | XBF_WRITE | XBF_READ_AHEAD);
> > > +	if (bp->b_flags & XBF_IN_FLIGHT)
> > > +		percpu_counter_dec(&bp->b_target->bt_io_count);
> > > +
> > > +	bp->b_flags &= ~(XBF_READ | XBF_WRITE | XBF_READ_AHEAD | XBF_IN_FLIGHT);
> > >  
> > >  	/*
> > >  	 * Pull in IO completion errors now. We are guaranteed to be running
> > 
> > I think the XBF_IN_FLIGHT can be moved to the final xfs_buf_rele()
> > processing if:
> > 
> > > @@ -1341,6 +1344,11 @@ xfs_buf_submit(
> > >  	 * xfs_buf_ioend too early.
> > >  	 */
> > >  	atomic_set(&bp->b_io_remaining, 1);
> > > +	if (bp->b_flags & XBF_ASYNC) {
> > > +		percpu_counter_inc(&bp->b_target->bt_io_count);
> > > +		bp->b_flags |= XBF_IN_FLIGHT;
> > > +	}
> > 
> > You change this to:
> > 
> > 	if (!(bp->b_flags & XBF_IN_FLIGHT)) {
> > 		percpu_counter_inc(&bp->b_target->bt_io_count);
> > 		bp->b_flags |= XBF_IN_FLIGHT;
> > 	}
> > 
> 
> Ok, so use the flag to cap the I/O count and defer the decrement to
> release. I think that should work and addresses the raciness issue. I'll
> give it a try.
> 

This appears to be doable, but it reintroduces some ugliness from the
previous approach. For example, we have to start filtering out uncached
buffers again (if we defer the decrement to release, we must handle
never-released buffers one way or another). Also, given the feedback on
the previous patch with regard to filtering out non-new buffers from the
I/O count, I've dropped that and replaced it with updates to
xfs_buf_rele() to decrement when the buffer is returned to the LRU (we
either have to filter out buffers already on the LRU at submit time or
make sure that they are decremented when released back to the LRU).

Code follows...

Brian


> > We shouldn't have to check for XBF_ASYNC in xfs_buf_submit() - it is
> > the path taken for async IO submission, so we should probably
> > ASSERT(bp->b_flags & XBF_ASYNC) in this function to ensure that is
> > the case.
> > 
> 
> Yeah, that's unnecessary. There's already such an assert in
> xfs_buf_submit(), actually.
> 
> > [Thinking aloud - __test_and_set_bit() might make this code a bit
> > cleaner]
> > 
> 
> On a quick try, this complains about b_flags being an unsigned int. I
> think I'll leave the set bit as is and use a helper for the release,
> which also provides a location to explain how the count works.
> 
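[Editor's note: a minimal sketch of the type clash Brian refers to above. The
bitop prototype is paraphrased from the kernel's bitops, and the open-coded
test-and-set mirrors the submit-path hunk in the patch below.]

	/*
	 * __test_and_set_bit() operates on an unsigned long bitmap, roughly:
	 *
	 *	int __test_and_set_bit(long nr, volatile unsigned long *addr);
	 *
	 * but b_flags is an xfs_buf_flags_t (unsigned int), so &bp->b_flags
	 * can't be passed without a cast and word-size assumptions. The
	 * open-coded equivalent on the flag word is used instead:
	 */
	if (!(bp->b_flags & _XBF_IN_FLIGHT)) {
		bp->b_flags |= _XBF_IN_FLIGHT;
		percpu_counter_inc(&bp->b_target->bt_io_count);
	}
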
> > > diff --git a/fs/xfs/xfs_buf.h b/fs/xfs/xfs_buf.h
> > > index 8bfb974..e1f95e0 100644
> > > --- a/fs/xfs/xfs_buf.h
> > > +++ b/fs/xfs/xfs_buf.h
> > > @@ -43,6 +43,7 @@ typedef enum {
> > >  #define XBF_READ	 (1 << 0) /* buffer intended for reading from device */
> > >  #define XBF_WRITE	 (1 << 1) /* buffer intended for writing to device */
> > >  #define XBF_READ_AHEAD	 (1 << 2) /* asynchronous read-ahead */
> > > +#define XBF_IN_FLIGHT	 (1 << 3)
> > 
> > Hmmm - it's an internal flag, so probably should be prefixed with an
> > "_" and moved down to the section with _XBF_KMEM and friends.
> > 
> 
> Indeed, thanks.
> 
> Brian
> 
> > Thoughts?
> > 
> > Cheers,
> > 
> > Dave.
> > -- 
> > Dave Chinner
> > david@fromorbit.com
> > 

Comments

Dave Chinner July 12, 2016, 11:57 p.m. UTC | #1
On Tue, Jul 12, 2016 at 01:22:59PM -0400, Brian Foster wrote:
> On Tue, Jul 12, 2016 at 08:03:15AM -0400, Brian Foster wrote:
> > On Tue, Jul 12, 2016 at 08:44:51AM +1000, Dave Chinner wrote:
> > > On Mon, Jul 11, 2016 at 11:29:22AM -0400, Brian Foster wrote:
> > > > On Mon, Jul 11, 2016 at 09:52:52AM -0400, Brian Foster wrote:
> > > > ...
> > > > > So what is your preference out of the possible approaches here? AFAICS,
> > > > > we have the following options:
> > > > > 
> > > > > 1.) The original "add readahead to LRU early" approach.
> > > > > 	Pros: simple one-liner
> > > > > 	Cons: bit of a hack, only covers readahead scenario
> > > > > 2.) Defer I/O count decrement to buffer release (this patch).
> > > > > 	Pros: should cover all cases (reads/writes)
> > > > > 	Cons: more complex (requires per-buffer accounting, etc.)
> > > > > 3.) Raw (buffer or bio?) I/O count (no defer to buffer release)
> > > > > 	Pros: eliminates some complexity from #2
> > > > > 	Cons: still more complex than #1, racy in that decrement does
> > > > > 	not serialize against LRU addition (requires drain_workqueue(),
> > > > > 	which still doesn't cover error conditions)
> > > > > 
> > > > > As noted above, option #3 also allows for either a buffer based count or
> > > > > bio based count, the latter of which might simplify things a bit further
> > > > > (TBD). Thoughts?
> > > 
> > > Pretty good summary :P
> > > 
> > > > FWIW, the following is a slightly cleaned up version of my initial
> > > > approach (option #3 above). Note that the flag is used to help deal with
> > > > varying ioend behavior. E.g., xfs_buf_ioend() is called once for some
> > > > buffers, multiple times for others with an iodone callback, that
> > > > behavior changes in some cases when an error is set, etc. (I'll add
> > > > comments before an official post.)
> > > 
> > > The approach looks good - I think there's a couple of things we can
> > > do to clean it up and make it robust. Comments inline.
> > > 
> > > > diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> > > > index 4665ff6..45d3ddd 100644
> > > > --- a/fs/xfs/xfs_buf.c
> > > > +++ b/fs/xfs/xfs_buf.c
> > > > @@ -1018,7 +1018,10 @@ xfs_buf_ioend(
> > > >  
> > > >  	trace_xfs_buf_iodone(bp, _RET_IP_);
> > > >  
> > > > -	bp->b_flags &= ~(XBF_READ | XBF_WRITE | XBF_READ_AHEAD);
> > > > +	if (bp->b_flags & XBF_IN_FLIGHT)
> > > > +		percpu_counter_dec(&bp->b_target->bt_io_count);
> > > > +
> > > > +	bp->b_flags &= ~(XBF_READ | XBF_WRITE | XBF_READ_AHEAD | XBF_IN_FLIGHT);
> > > >  
> > > >  	/*
> > > >  	 * Pull in IO completion errors now. We are guaranteed to be running
> > > 
> > > I think the XBF_IN_FLIGHT can be moved to the final xfs_buf_rele()
> > > processing if:
> > > 
> > > > @@ -1341,6 +1344,11 @@ xfs_buf_submit(
> > > >  	 * xfs_buf_ioend too early.
> > > >  	 */
> > > >  	atomic_set(&bp->b_io_remaining, 1);
> > > > +	if (bp->b_flags & XBF_ASYNC) {
> > > > +		percpu_counter_inc(&bp->b_target->bt_io_count);
> > > > +		bp->b_flags |= XBF_IN_FLIGHT;
> > > > +	}
> > > 
> > > You change this to:
> > > 
> > > 	if (!(bp->b_flags & XBF_IN_FLIGHT)) {
> > > 		percpu_counter_inc(&bp->b_target->bt_io_count);
> > > 		bp->b_flags |= XBF_IN_FLIGHT;
> > > 	}
> > > 
> > 
> > Ok, so use the flag to cap the I/O count and defer the decrement to
> > release. I think that should work and addresses the raciness issue. I'll
> > give it a try.
> > 
> 
> This appears to be doable, but it reintroduces some ugliness from the
> previous approach.

Ah, so it does. Bugger.

> For example, we have to start filtering out uncached
> buffers again (if we defer the decrement to release, we must handle
> never-released buffers one way or another).

So the problem is limited to the superblock buffer and the iclog
buffers, right? How about making that special case explicit via a
flag set on the buffer? e.g. XBF_NO_IOCOUNT. That way the exceptions
are clearly spelt out, rather than avoiding all uncached buffers?

> Also, given the feedback on
> the previous patch with regard to filtering out non-new buffers from the
> I/O count, I've dropped that and replaced it with updates to
> xfs_buf_rele() to decrement when the buffer is returned to the LRU (we
> either have to filter out buffers already on the LRU at submit time or
> make sure that they are decremented when released back to the LRU).
> 
> Code follows...
> 
> Brian
> 
> diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> index 4665ff6..b7afbac 100644
> --- a/fs/xfs/xfs_buf.c
> +++ b/fs/xfs/xfs_buf.c
> @@ -80,6 +80,25 @@ xfs_buf_vmap_len(
>  }
>  
>  /*
> + * Clear the in-flight state on a buffer about to be released to the LRU or
> + * freed and unaccount from the buftarg. The buftarg I/O count maintains a count
> + * of held buffers that have undergone at least one I/O in the current hold
> + * cycle (e.g., not a total I/O count). This provides protection against unmount
> + * for buffer I/O completion (see xfs_wait_buftarg()) processing.
> + */
> +static inline void
> +xfs_buf_rele_in_flight(
> +	struct xfs_buf	*bp)

Not sure about the name: xfs_buf_ioacct_dec()?

> +{
> +	if (!(bp->b_flags & _XBF_IN_FLIGHT))
> +		return;
> +
> +	ASSERT(bp->b_flags & XBF_ASYNC);
> +	bp->b_flags &= ~_XBF_IN_FLIGHT;
> +	percpu_counter_dec(&bp->b_target->bt_io_count);
> +}
> +
> +/*
>   * When we mark a buffer stale, we remove the buffer from the LRU and clear the
>   * b_lru_ref count so that the buffer is freed immediately when the buffer
>   * reference count falls to zero. If the buffer is already on the LRU, we need
> @@ -866,30 +885,37 @@ xfs_buf_hold(
>  }
>  
>  /*
> - *	Releases a hold on the specified buffer.  If the
> - *	the hold count is 1, calls xfs_buf_free.
> + * Release a hold on the specified buffer. If the hold count is 1, the buffer is
> + * placed on LRU or freed (depending on b_lru_ref).
>   */
>  void
>  xfs_buf_rele(
>  	xfs_buf_t		*bp)
>  {
>  	struct xfs_perag	*pag = bp->b_pag;
> +	bool			release;
> +	bool			freebuf = false;
>  
>  	trace_xfs_buf_rele(bp, _RET_IP_);
>  
>  	if (!pag) {
>  		ASSERT(list_empty(&bp->b_lru));
>  		ASSERT(RB_EMPTY_NODE(&bp->b_rbnode));
> -		if (atomic_dec_and_test(&bp->b_hold))
> +		if (atomic_dec_and_test(&bp->b_hold)) {
> +			xfs_buf_rele_in_flight(bp);
>  			xfs_buf_free(bp);
> +		}
>  		return;
>  	}
>  
>  	ASSERT(!RB_EMPTY_NODE(&bp->b_rbnode));
>  
>  	ASSERT(atomic_read(&bp->b_hold) > 0);
> -	if (atomic_dec_and_lock(&bp->b_hold, &pag->pag_buf_lock)) {
> -		spin_lock(&bp->b_lock);
> +
> +	release = atomic_dec_and_lock(&bp->b_hold, &pag->pag_buf_lock);
> +	spin_lock(&bp->b_lock);
> +	if (release) {
> +		xfs_buf_rele_in_flight(bp);
>  		if (!(bp->b_flags & XBF_STALE) && atomic_read(&bp->b_lru_ref)) {
>  			/*
>  			 * If the buffer is added to the LRU take a new
> @@ -900,7 +926,6 @@ xfs_buf_rele(
>  				bp->b_state &= ~XFS_BSTATE_DISPOSE;
>  				atomic_inc(&bp->b_hold);
>  			}
> -			spin_unlock(&bp->b_lock);
>  			spin_unlock(&pag->pag_buf_lock);
>  		} else {
>  			/*
> @@ -914,15 +939,24 @@ xfs_buf_rele(
>  			} else {
>  				ASSERT(list_empty(&bp->b_lru));
>  			}
> -			spin_unlock(&bp->b_lock);
>  
>  			ASSERT(!(bp->b_flags & _XBF_DELWRI_Q));
>  			rb_erase(&bp->b_rbnode, &pag->pag_buf_tree);
>  			spin_unlock(&pag->pag_buf_lock);
>  			xfs_perag_put(pag);
> -			xfs_buf_free(bp);
> +			freebuf = true;
>  		}
> +	} else if ((atomic_read(&bp->b_hold) == 1) && !list_empty(&bp->b_lru)) {
> +		/*
> +		 * The buffer is already on the LRU and it holds the only
> +		 * reference. Drop the in flight state.
> +		 */
> +		xfs_buf_rele_in_flight(bp);
>  	}

This b_hold check is racy - bp->b_lock is not enough to stabilise
the b_hold count. Because we don't hold the buffer semaphore any
more, another buffer reference holder can successfully run the above
atomic_dec_and_lock(&bp->b_hold, &pag->pag_buf_lock). New references
can be taken in xfs_buf_find() so the count could go up, but I
think that's fine given the eventual case we care about here is
draining references on unmount.

I think this is still ok for draining references, too, because of
the flag check inside xfs_buf_rele_in_flight(). If we race on a
transition to a value of 1, then we end up running the branch in each
caller. If we race on transition to zero, then the caller that is
releasing the buffer will execute xfs_buf_rele_in_flight() and all
will be well.

Needs comments, and maybe restructuring the code to handle the
xfs_buf_rele_in_flight() call up front so it's clear that io
accounting is a separate case from the rest of release
handling. e.g.

	release = atomic_dec_and_lock(&bp->b_hold, &pag->pag_buf_lock);
	spin_lock(&bp->b_lock);
	if (!release) {
		if ((atomic_read(&bp->b_hold) == 1) && !list_empty(&bp->b_lru))
			xfs_buf_ioacct_dec(bp);
		goto out_unlock;
	}
	xfs_buf_ioacct_dec(bp);

	/* rest of release code, one level of indentation removed */

out_unlock:
	spin_unlock(&bp->b_lock);

	if (freebuf)
		xfs_buf_free(bp);



> @@ -1341,6 +1375,18 @@ xfs_buf_submit(
>  	 * xfs_buf_ioend too early.
>  	 */
>  	atomic_set(&bp->b_io_remaining, 1);
> +
> +	/*
> +	 * Bump the I/O in flight count on the buftarg if we haven't yet done
> +	 * so for this buffer. Skip uncached buffers because many of those
> +	 * (e.g., superblock, log buffers) are never released.
> +	 */
> +	if ((bp->b_bn != XFS_BUF_DADDR_NULL) &&
> +	    !(bp->b_flags & _XBF_IN_FLIGHT)) {
> +		bp->b_flags |= _XBF_IN_FLIGHT;
> +		percpu_counter_inc(&bp->b_target->bt_io_count);
> +	}

xfs_buf_ioacct_inc()
{
	if (bp->b_flags & (XBF_NO_IOACCT | _XBF_IN_FLIGHT))
		return;
	percpu_counter_inc(&bp->b_target->bt_io_count);
	bp->b_flags |= _XBF_IN_FLIGHT;
}

Cheers,

Dave.
Brian Foster July 13, 2016, 11:32 a.m. UTC | #2
On Wed, Jul 13, 2016 at 09:57:52AM +1000, Dave Chinner wrote:
> On Tue, Jul 12, 2016 at 01:22:59PM -0400, Brian Foster wrote:
> > On Tue, Jul 12, 2016 at 08:03:15AM -0400, Brian Foster wrote:
> > > On Tue, Jul 12, 2016 at 08:44:51AM +1000, Dave Chinner wrote:
> > > > On Mon, Jul 11, 2016 at 11:29:22AM -0400, Brian Foster wrote:
> > > > > On Mon, Jul 11, 2016 at 09:52:52AM -0400, Brian Foster wrote:
> > > > > ...
> > > > > > So what is your preference out of the possible approaches here? AFAICS,
> > > > > > we have the following options:
> > > > > > 
> > > > > > 1.) The original "add readahead to LRU early" approach.
> > > > > > 	Pros: simple one-liner
> > > > > > 	Cons: bit of a hack, only covers readahead scenario
> > > > > > 2.) Defer I/O count decrement to buffer release (this patch).
> > > > > > 	Pros: should cover all cases (reads/writes)
> > > > > > 	Cons: more complex (requires per-buffer accounting, etc.)
> > > > > > 3.) Raw (buffer or bio?) I/O count (no defer to buffer release)
> > > > > > 	Pros: eliminates some complexity from #2
> > > > > > 	Cons: still more complex than #1, racy in that decrement does
> > > > > > 	not serialize against LRU addition (requires drain_workqueue(),
> > > > > > 	which still doesn't cover error conditions)
> > > > > > 
> > > > > > As noted above, option #3 also allows for either a buffer based count or
> > > > > > bio based count, the latter of which might simplify things a bit further
> > > > > > (TBD). Thoughts?
> > > > 
> > > > Pretty good summary :P
> > > > 
> > > > > FWIW, the following is a slightly cleaned up version of my initial
> > > > > approach (option #3 above). Note that the flag is used to help deal with
> > > > > varying ioend behavior. E.g., xfs_buf_ioend() is called once for some
> > > > > buffers, multiple times for others with an iodone callback, that
> > > > > behavior changes in some cases when an error is set, etc. (I'll add
> > > > > comments before an official post.)
> > > > 
> > > > The approach looks good - I think there's a couple of things we can
> > > > do to clean it up and make it robust. Comments inline.
> > > > 
> > > > > diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> > > > > index 4665ff6..45d3ddd 100644
> > > > > --- a/fs/xfs/xfs_buf.c
> > > > > +++ b/fs/xfs/xfs_buf.c
> > > > > @@ -1018,7 +1018,10 @@ xfs_buf_ioend(
> > > > >  
> > > > >  	trace_xfs_buf_iodone(bp, _RET_IP_);
> > > > >  
> > > > > -	bp->b_flags &= ~(XBF_READ | XBF_WRITE | XBF_READ_AHEAD);
> > > > > +	if (bp->b_flags & XBF_IN_FLIGHT)
> > > > > +		percpu_counter_dec(&bp->b_target->bt_io_count);
> > > > > +
> > > > > +	bp->b_flags &= ~(XBF_READ | XBF_WRITE | XBF_READ_AHEAD | XBF_IN_FLIGHT);
> > > > >  
> > > > >  	/*
> > > > >  	 * Pull in IO completion errors now. We are guaranteed to be running
> > > > 
> > > > I think the XBF_IN_FLIGHT can be moved to the final xfs_buf_rele()
> > > > processing if:
> > > > 
> > > > > @@ -1341,6 +1344,11 @@ xfs_buf_submit(
> > > > >  	 * xfs_buf_ioend too early.
> > > > >  	 */
> > > > >  	atomic_set(&bp->b_io_remaining, 1);
> > > > > +	if (bp->b_flags & XBF_ASYNC) {
> > > > > +		percpu_counter_inc(&bp->b_target->bt_io_count);
> > > > > +		bp->b_flags |= XBF_IN_FLIGHT;
> > > > > +	}
> > > > 
> > > > You change this to:
> > > > 
> > > > 	if (!(bp->b_flags & XBF_IN_FLIGHT)) {
> > > > 		percpu_counter_inc(&bp->b_target->bt_io_count);
> > > > 		bp->b_flags |= XBF_IN_FLIGHT;
> > > > 	}
> > > > 
> > > 
> > > Ok, so use the flag to cap the I/O count and defer the decrement to
> > > release. I think that should work and addresses the raciness issue. I'll
> > > give it a try.
> > > 
> > 
> > This appears to be doable, but it reintroduces some ugliness from the
> > previous approach.
> 
> Ah, so it does. Bugger.
> 
> > For example, we have to start filtering out uncached
> > buffers again (if we defer the decrement to release, we must handle
> > never-released buffers one way or another).
> 
> So the problem is limited to the superblock buffer and the iclog
> buffers, right? How about making that special case explicit via a
> flag set on the buffer? e.g. XBF_NO_IOCOUNT. That way the exceptions
> are clearly spelt out, rather than avoiding all uncached buffers?
> 

I think so. I considered a similar approach earlier, but I didn't want
to spend time tracking down the associated users until the broader
approach was nailed down. More specifically, I think we could set
b_lru_ref to 0 on those buffers and use that to bypass the accounting.
That makes it clear that these buffers are not destined for the LRU and
alternative synchronization is required (which already exists in the
form of lock cycles).

The rest of the feedback makes sense, so I'll incorporate that and give
the above a try... thanks.

Brian

> > Also, given the feedback on
> > the previous patch with regard to filtering out non-new buffers from the
> > I/O count, I've dropped that and replaced it with updates to
> > xfs_buf_rele() to decrement when the buffer is returned to the LRU (we
> > either have to filter out buffers already on the LRU at submit time or
> > make sure that they are decremented when released back to the LRU).
> > 
> > Code follows...
> > 
> > Brian
> > 
> > diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> > index 4665ff6..b7afbac 100644
> > --- a/fs/xfs/xfs_buf.c
> > +++ b/fs/xfs/xfs_buf.c
> > @@ -80,6 +80,25 @@ xfs_buf_vmap_len(
> >  }
> >  
> >  /*
> > + * Clear the in-flight state on a buffer about to be released to the LRU or
> > + * freed and unaccount from the buftarg. The buftarg I/O count maintains a count
> > + * of held buffers that have undergone at least one I/O in the current hold
> > + * cycle (e.g., not a total I/O count). This provides protection against unmount
> > + * for buffer I/O completion (see xfs_wait_buftarg()) processing.
> > + */
> > +static inline void
> > +xfs_buf_rele_in_flight(
> > +	struct xfs_buf	*bp)
> 
> Not sure about the name: xfs_buf_ioacct_dec()?
> 
> > +{
> > +	if (!(bp->b_flags & _XBF_IN_FLIGHT))
> > +		return;
> > +
> > +	ASSERT(bp->b_flags & XBF_ASYNC);
> > +	bp->b_flags &= ~_XBF_IN_FLIGHT;
> > +	percpu_counter_dec(&bp->b_target->bt_io_count);
> > +}
> > +
> > +/*
> >   * When we mark a buffer stale, we remove the buffer from the LRU and clear the
> >   * b_lru_ref count so that the buffer is freed immediately when the buffer
> >   * reference count falls to zero. If the buffer is already on the LRU, we need
> > @@ -866,30 +885,37 @@ xfs_buf_hold(
> >  }
> >  
> >  /*
> > - *	Releases a hold on the specified buffer.  If the
> > - *	the hold count is 1, calls xfs_buf_free.
> > + * Release a hold on the specified buffer. If the hold count is 1, the buffer is
> > + * placed on LRU or freed (depending on b_lru_ref).
> >   */
> >  void
> >  xfs_buf_rele(
> >  	xfs_buf_t		*bp)
> >  {
> >  	struct xfs_perag	*pag = bp->b_pag;
> > +	bool			release;
> > +	bool			freebuf = false;
> >  
> >  	trace_xfs_buf_rele(bp, _RET_IP_);
> >  
> >  	if (!pag) {
> >  		ASSERT(list_empty(&bp->b_lru));
> >  		ASSERT(RB_EMPTY_NODE(&bp->b_rbnode));
> > -		if (atomic_dec_and_test(&bp->b_hold))
> > +		if (atomic_dec_and_test(&bp->b_hold)) {
> > +			xfs_buf_rele_in_flight(bp);
> >  			xfs_buf_free(bp);
> > +		}
> >  		return;
> >  	}
> >  
> >  	ASSERT(!RB_EMPTY_NODE(&bp->b_rbnode));
> >  
> >  	ASSERT(atomic_read(&bp->b_hold) > 0);
> > -	if (atomic_dec_and_lock(&bp->b_hold, &pag->pag_buf_lock)) {
> > -		spin_lock(&bp->b_lock);
> > +
> > +	release = atomic_dec_and_lock(&bp->b_hold, &pag->pag_buf_lock);
> > +	spin_lock(&bp->b_lock);
> > +	if (release) {
> > +		xfs_buf_rele_in_flight(bp);
> >  		if (!(bp->b_flags & XBF_STALE) && atomic_read(&bp->b_lru_ref)) {
> >  			/*
> >  			 * If the buffer is added to the LRU take a new
> > @@ -900,7 +926,6 @@ xfs_buf_rele(
> >  				bp->b_state &= ~XFS_BSTATE_DISPOSE;
> >  				atomic_inc(&bp->b_hold);
> >  			}
> > -			spin_unlock(&bp->b_lock);
> >  			spin_unlock(&pag->pag_buf_lock);
> >  		} else {
> >  			/*
> > @@ -914,15 +939,24 @@ xfs_buf_rele(
> >  			} else {
> >  				ASSERT(list_empty(&bp->b_lru));
> >  			}
> > -			spin_unlock(&bp->b_lock);
> >  
> >  			ASSERT(!(bp->b_flags & _XBF_DELWRI_Q));
> >  			rb_erase(&bp->b_rbnode, &pag->pag_buf_tree);
> >  			spin_unlock(&pag->pag_buf_lock);
> >  			xfs_perag_put(pag);
> > -			xfs_buf_free(bp);
> > +			freebuf = true;
> >  		}
> > +	} else if ((atomic_read(&bp->b_hold) == 1) && !list_empty(&bp->b_lru)) {
> > +		/*
> > +		 * The buffer is already on the LRU and it holds the only
> > +		 * reference. Drop the in flight state.
> > +		 */
> > +		xfs_buf_rele_in_flight(bp);
> >  	}
> 
> This b_hold check is racy - bp->b_lock is not enough to stabilise
> the b_hold count. Because we don't hold the buffer semaphore any
> more, another buffer reference holder can successfully run the above
> atomic_dec_and_lock(&bp->b_hold, &pag->pag_buf_lock). New references
> can be taken in xfs_buf_find() so the count could go up, but I
> think that's fine given the eventual case we care about here is
> draining references on unmount.
> 
> I think this is still ok for draining references, too, because of
> the flag check inside xfs_buf_rele_in_flight(). If we race on a
> transition to a value of 1, then we end up running the branch in each
> caller. If we race on transition to zero, then the caller that is
> releasing the buffer will execute xfs_buf_rele_in_flight() and all
> will be well.
> 
> Needs comments, and maybe restructuring the code to handle the
> xfs_buf_rele_in_flight() call up front so it's clear that io
> accounting is a separate case from the rest of release
> handling. e.g.
> 
> 	release = atomic_dec_and_lock(&bp->b_hold, &pag->pag_buf_lock);
> 	spin_lock(&bp->b_lock);
> 	if (!release) {
> 		if ((atomic_read(&bp->b_hold) == 1) && !list_empty(&bp->b_lru))
> 			xfs_buf_ioacct_dec(bp);
> 		goto out_unlock;
> 	}
> 	xfs_buf_ioacct_dec(bp);
> 
> 	/* rest of release code, one level of indentation removed */
> 
> out_unlock:
> 	spin_unlock(&bp->b_lock);
> 
> 	if (freebuf)
> 		xfs_buf_free(bp);
> 
> 
> 
> > @@ -1341,6 +1375,18 @@ xfs_buf_submit(
> >  	 * xfs_buf_ioend too early.
> >  	 */
> >  	atomic_set(&bp->b_io_remaining, 1);
> > +
> > +	/*
> > +	 * Bump the I/O in flight count on the buftarg if we haven't yet done
> > +	 * so for this buffer. Skip uncached buffers because many of those
> > +	 * (e.g., superblock, log buffers) are never released.
> > +	 */
> > +	if ((bp->b_bn != XFS_BUF_DADDR_NULL) &&
> > +	    !(bp->b_flags & _XBF_IN_FLIGHT)) {
> > +		bp->b_flags |= _XBF_IN_FLIGHT;
> > +		percpu_counter_inc(&bp->b_target->bt_io_count);
> > +	}
> 
> xfs_buf_ioacct_inc()
> {
> 	if (bp->b_flags & (XBF_NO_IOACCT | _XBF_IN_FLIGHT))
> 		return;
> 	percpu_counter_inc(&bp->b_target->bt_io_count);
> 	bp->b_flags |= _XBF_IN_FLIGHT;
> }
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> 
Brian Foster July 13, 2016, 12:49 p.m. UTC | #3
On Wed, Jul 13, 2016 at 07:32:26AM -0400, Brian Foster wrote:
> On Wed, Jul 13, 2016 at 09:57:52AM +1000, Dave Chinner wrote:
> > On Tue, Jul 12, 2016 at 01:22:59PM -0400, Brian Foster wrote:
> > > On Tue, Jul 12, 2016 at 08:03:15AM -0400, Brian Foster wrote:
> > > > On Tue, Jul 12, 2016 at 08:44:51AM +1000, Dave Chinner wrote:
> > > > > On Mon, Jul 11, 2016 at 11:29:22AM -0400, Brian Foster wrote:
> > > > > > On Mon, Jul 11, 2016 at 09:52:52AM -0400, Brian Foster wrote:
> > > > > > ...
> > > > > > > So what is your preference out of the possible approaches here? AFAICS,
> > > > > > > we have the following options:
> > > > > > > 
> > > > > > > 1.) The original "add readahead to LRU early" approach.
> > > > > > > 	Pros: simple one-liner
> > > > > > > 	Cons: bit of a hack, only covers readahead scenario
> > > > > > > 2.) Defer I/O count decrement to buffer release (this patch).
> > > > > > > 	Pros: should cover all cases (reads/writes)
> > > > > > > 	Cons: more complex (requires per-buffer accounting, etc.)
> > > > > > > 3.) Raw (buffer or bio?) I/O count (no defer to buffer release)
> > > > > > > 	Pros: eliminates some complexity from #2
> > > > > > > 	Cons: still more complex than #1, racy in that decrement does
> > > > > > > 	not serialize against LRU addition (requires drain_workqueue(),
> > > > > > > 	which still doesn't cover error conditions)
> > > > > > > 
> > > > > > > As noted above, option #3 also allows for either a buffer based count or
> > > > > > > bio based count, the latter of which might simplify things a bit further
> > > > > > > (TBD). Thoughts?
> > > > > 
> > > > > Pretty good summary :P
> > > > > 
> > > > > > FWIW, the following is a slightly cleaned up version of my initial
> > > > > > approach (option #3 above). Note that the flag is used to help deal with
> > > > > > varying ioend behavior. E.g., xfs_buf_ioend() is called once for some
> > > > > > buffers, multiple times for others with an iodone callback, that
> > > > > > behavior changes in some cases when an error is set, etc. (I'll add
> > > > > > comments before an official post.)
> > > > > 
> > > > > The approach looks good - I think there's a couple of things we can
> > > > > do to clean it up and make it robust. Comments inline.
> > > > > 
> > > > > > diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> > > > > > index 4665ff6..45d3ddd 100644
> > > > > > --- a/fs/xfs/xfs_buf.c
> > > > > > +++ b/fs/xfs/xfs_buf.c
> > > > > > @@ -1018,7 +1018,10 @@ xfs_buf_ioend(
> > > > > >  
> > > > > >  	trace_xfs_buf_iodone(bp, _RET_IP_);
> > > > > >  
> > > > > > -	bp->b_flags &= ~(XBF_READ | XBF_WRITE | XBF_READ_AHEAD);
> > > > > > +	if (bp->b_flags & XBF_IN_FLIGHT)
> > > > > > +		percpu_counter_dec(&bp->b_target->bt_io_count);
> > > > > > +
> > > > > > +	bp->b_flags &= ~(XBF_READ | XBF_WRITE | XBF_READ_AHEAD | XBF_IN_FLIGHT);
> > > > > >  
> > > > > >  	/*
> > > > > >  	 * Pull in IO completion errors now. We are guaranteed to be running
> > > > > 
> > > > > I think the XBF_IN_FLIGHT can be moved to the final xfs_buf_rele()
> > > > > processing if:
> > > > > 
> > > > > > @@ -1341,6 +1344,11 @@ xfs_buf_submit(
> > > > > >  	 * xfs_buf_ioend too early.
> > > > > >  	 */
> > > > > >  	atomic_set(&bp->b_io_remaining, 1);
> > > > > > +	if (bp->b_flags & XBF_ASYNC) {
> > > > > > +		percpu_counter_inc(&bp->b_target->bt_io_count);
> > > > > > +		bp->b_flags |= XBF_IN_FLIGHT;
> > > > > > +	}
> > > > > 
> > > > > You change this to:
> > > > > 
> > > > > 	if (!(bp->b_flags & XBF_IN_FLIGHT)) {
> > > > > 		percpu_counter_inc(&bp->b_target->bt_io_count);
> > > > > 		bp->b_flags |= XBF_IN_FLIGHT;
> > > > > 	}
> > > > > 
> > > > 
> > > > Ok, so use the flag to cap the I/O count and defer the decrement to
> > > > release. I think that should work and addresses the raciness issue. I'll
> > > > give it a try.
> > > > 
> > > 
> > > This appears to be doable, but it reintroduces some ugliness from the
> > > previous approach.
> > 
> > Ah, so it does. Bugger.
> > 
> > > For example, we have to start filtering out uncached
> > > buffers again (if we defer the decrement to release, we must handle
> > > never-released buffers one way or another).
> > 
> > So the problem is limited to the superblock buffer and the iclog
> > buffers, right? How about making that special case explicit via a
> > flag set on the buffer? e.g. XBF_NO_IOCOUNT. That way the exceptions
> > are clearly spelt out, rather than avoiding all uncached buffers?
> > 
> 
> I think so. I considered a similar approach earlier, but I didn't want
> to spend time tracking down the associated users until the broader
> approach was nailed down. More specifically, I think we could set
> b_lru_ref to 0 on those buffers and use that to bypass the accounting.
> That makes it clear that these buffers are not destined for the LRU and
> alternative synchronization is required (which already exists in the
> form of lock cycles).
> 

It occurs to me that this probably won't work in all cases because
b_lru_ref is decremented naturally as part of the LRU shrinker
mechanism. Disregard this, I'll look into the flag..

Brian
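
[Editor's note: a rough illustration of the flag-based direction settled on
here. The names follow Dave's xfs_buf_ioacct_inc() sketch above; the bit
value and the idea of setting the flag at allocation time are assumptions
for illustration, not part of this thread.]

	#define XBF_NO_IOACCT	(1 << 3)	/* exclude from buftarg I/O accounting */

	static inline void
	xfs_buf_ioacct_inc(
		struct xfs_buf	*bp)
	{
		/* skip opted-out buffers and ones already counted this hold cycle */
		if (bp->b_flags & (XBF_NO_IOACCT | _XBF_IN_FLIGHT))
			return;
		percpu_counter_inc(&bp->b_target->bt_io_count);
		bp->b_flags |= _XBF_IN_FLIGHT;
	}

Long-lived, never-released buffers (e.g. the superblock buffer) would set
XBF_NO_IOACCT once when created, so they never pin bt_io_count at unmount.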

> The rest of the feedback makes sense, so I'll incorporate that and give
> the above a try... thanks.
> 
> Brian
> 
> > > Also, given the feedback on
> > > the previous patch with regard to filtering out non-new buffers from the
> > > I/O count, I've dropped that and replaced it with updates to
> > > xfs_buf_rele() to decrement when the buffer is returned to the LRU (we
> > > either have to filter out buffers already on the LRU at submit time or
> > > make sure that they are decremented when released back to the LRU).
> > > 
> > > Code follows...
> > > 
> > > Brian
> > > 
> > > diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> > > index 4665ff6..b7afbac 100644
> > > --- a/fs/xfs/xfs_buf.c
> > > +++ b/fs/xfs/xfs_buf.c
> > > @@ -80,6 +80,25 @@ xfs_buf_vmap_len(
> > >  }
> > >  
> > >  /*
> > > + * Clear the in-flight state on a buffer about to be released to the LRU or
> > > + * freed and unaccount from the buftarg. The buftarg I/O count maintains a count
> > > + * of held buffers that have undergone at least one I/O in the current hold
> > > + * cycle (e.g., not a total I/O count). This provides protection against unmount
> > > + * for buffer I/O completion (see xfs_wait_buftarg()) processing.
> > > + */
> > > +static inline void
> > > +xfs_buf_rele_in_flight(
> > > +	struct xfs_buf	*bp)
> > 
> > Not sure about the name: xfs_buf_ioacct_dec()?
> > 
> > > +{
> > > +	if (!(bp->b_flags & _XBF_IN_FLIGHT))
> > > +		return;
> > > +
> > > +	ASSERT(bp->b_flags & XBF_ASYNC);
> > > +	bp->b_flags &= ~_XBF_IN_FLIGHT;
> > > +	percpu_counter_dec(&bp->b_target->bt_io_count);
> > > +}
> > > +
> > > +/*
> > >   * When we mark a buffer stale, we remove the buffer from the LRU and clear the
> > >   * b_lru_ref count so that the buffer is freed immediately when the buffer
> > >   * reference count falls to zero. If the buffer is already on the LRU, we need
> > > @@ -866,30 +885,37 @@ xfs_buf_hold(
> > >  }
> > >  
> > >  /*
> > > - *	Releases a hold on the specified buffer.  If the
> > > - *	the hold count is 1, calls xfs_buf_free.
> > > + * Release a hold on the specified buffer. If the hold count is 1, the buffer is
> > > + * placed on LRU or freed (depending on b_lru_ref).
> > >   */
> > >  void
> > >  xfs_buf_rele(
> > >  	xfs_buf_t		*bp)
> > >  {
> > >  	struct xfs_perag	*pag = bp->b_pag;
> > > +	bool			release;
> > > +	bool			freebuf = false;
> > >  
> > >  	trace_xfs_buf_rele(bp, _RET_IP_);
> > >  
> > >  	if (!pag) {
> > >  		ASSERT(list_empty(&bp->b_lru));
> > >  		ASSERT(RB_EMPTY_NODE(&bp->b_rbnode));
> > > -		if (atomic_dec_and_test(&bp->b_hold))
> > > +		if (atomic_dec_and_test(&bp->b_hold)) {
> > > +			xfs_buf_rele_in_flight(bp);
> > >  			xfs_buf_free(bp);
> > > +		}
> > >  		return;
> > >  	}
> > >  
> > >  	ASSERT(!RB_EMPTY_NODE(&bp->b_rbnode));
> > >  
> > >  	ASSERT(atomic_read(&bp->b_hold) > 0);
> > > -	if (atomic_dec_and_lock(&bp->b_hold, &pag->pag_buf_lock)) {
> > > -		spin_lock(&bp->b_lock);
> > > +
> > > +	release = atomic_dec_and_lock(&bp->b_hold, &pag->pag_buf_lock);
> > > +	spin_lock(&bp->b_lock);
> > > +	if (release) {
> > > +		xfs_buf_rele_in_flight(bp);
> > >  		if (!(bp->b_flags & XBF_STALE) && atomic_read(&bp->b_lru_ref)) {
> > >  			/*
> > >  			 * If the buffer is added to the LRU take a new
> > > @@ -900,7 +926,6 @@ xfs_buf_rele(
> > >  				bp->b_state &= ~XFS_BSTATE_DISPOSE;
> > >  				atomic_inc(&bp->b_hold);
> > >  			}
> > > -			spin_unlock(&bp->b_lock);
> > >  			spin_unlock(&pag->pag_buf_lock);
> > >  		} else {
> > >  			/*
> > > @@ -914,15 +939,24 @@ xfs_buf_rele(
> > >  			} else {
> > >  				ASSERT(list_empty(&bp->b_lru));
> > >  			}
> > > -			spin_unlock(&bp->b_lock);
> > >  
> > >  			ASSERT(!(bp->b_flags & _XBF_DELWRI_Q));
> > >  			rb_erase(&bp->b_rbnode, &pag->pag_buf_tree);
> > >  			spin_unlock(&pag->pag_buf_lock);
> > >  			xfs_perag_put(pag);
> > > -			xfs_buf_free(bp);
> > > +			freebuf = true;
> > >  		}
> > > +	} else if ((atomic_read(&bp->b_hold) == 1) && !list_empty(&bp->b_lru)) {
> > > +		/*
> > > +		 * The buffer is already on the LRU and it holds the only
> > > +		 * reference. Drop the in flight state.
> > > +		 */
> > > +		xfs_buf_rele_in_flight(bp);
> > >  	}
> > 
> > This b_hold check is racy - bp->b_lock is not enough to stabilise
> > the b_hold count. Because we don't hold the buffer semaphore any
> > more, another buffer reference holder can successfully run the above
> > atomic_dec_and_lock(&bp->b_hold, &pag->pag_buf_lock). New references
> > can be taken in xfs_buf_find() so the count could go up, but I
> > think that's fine given the eventual case we care about here is
> > draining references on unmount.
> > 
> > I think this is still ok for draining references, too, because of
> > the flag check inside xfs_buf_rele_in_flight(). If we race on a
> > transition to a value of 1, then we end up running the branch in each
> > caller. If we race on transition to zero, then the caller that is
> > releasing the buffer will execute xfs_buf_rele_in_flight() and all
> > will be well.
> > 
> > Needs comments, and maybe restructuring the code to handle the
> > xfs_buf_rele_in_flight() call up front so it's clear that io
> > accounting is a separate case from the rest of release
> > handling. e.g.
> > 
> > 	release = atomic_dec_and_lock(&bp->b_hold, &pag->pag_buf_lock);
> > 	spin_lock(&bp->b_lock);
> > 	if (!release) {
> > 		if ((atomic_read(&bp->b_hold) == 1) && !list_empty(&bp->b_lru))
> > 			xfs_buf_ioacct_dec(bp);
> > 		goto out_unlock;
> > 	}
> > 	xfs_buf_ioacct_dec(bp);
> > 
> > 	/* rest of release code, one level of indentation removed */
> > 
> > out_unlock:
> > 	spin_unlock(&bp->b_lock);
> > 
> > 	if (freebuf)
> > 		xfs_buf_free(bp);
> > 
> > 
> > 
> > > @@ -1341,6 +1375,18 @@ xfs_buf_submit(
> > >  	 * xfs_buf_ioend too early.
> > >  	 */
> > >  	atomic_set(&bp->b_io_remaining, 1);
> > > +
> > > +	/*
> > > +	 * Bump the I/O in flight count on the buftarg if we haven't yet done
> > > +	 * so for this buffer. Skip uncached buffers because many of those
> > > +	 * (e.g., superblock, log buffers) are never released.
> > > +	 */
> > > +	if ((bp->b_bn != XFS_BUF_DADDR_NULL) &&
> > > +	    !(bp->b_flags & _XBF_IN_FLIGHT)) {
> > > +		bp->b_flags |= _XBF_IN_FLIGHT;
> > > +		percpu_counter_inc(&bp->b_target->bt_io_count);
> > > +	}
> > 
> > xfs_buf_ioacct_inc()
> > {
> > 	if (bp->b_flags & (XBF_NO_IOACCT | _XBF_IN_FLIGHT))
> > 		return;
> > 	percpu_counter_inc(&bp->b_target->bt_io_count);
> > 	bp->b_flags |= _XBF_IN_FLIGHT;
> > }
> > 
> > Cheers,
> > 
> > Dave.
> > -- 
> > Dave Chinner
> > david@fromorbit.com
> > 

Patch

diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index 4665ff6..b7afbac 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -80,6 +80,25 @@  xfs_buf_vmap_len(
 }
 
 /*
+ * Clear the in-flight state on a buffer about to be released to the LRU or
+ * freed and unaccount from the buftarg. The buftarg I/O count maintains a count
+ * of held buffers that have undergone at least one I/O in the current hold
+ * cycle (e.g., not a total I/O count). This provides protection against unmount
+ * for buffer I/O completion (see xfs_wait_buftarg()) processing.
+ */
+static inline void
+xfs_buf_rele_in_flight(
+	struct xfs_buf	*bp)
+{
+	if (!(bp->b_flags & _XBF_IN_FLIGHT))
+		return;
+
+	ASSERT(bp->b_flags & XBF_ASYNC);
+	bp->b_flags &= ~_XBF_IN_FLIGHT;
+	percpu_counter_dec(&bp->b_target->bt_io_count);
+}
+
+/*
  * When we mark a buffer stale, we remove the buffer from the LRU and clear the
  * b_lru_ref count so that the buffer is freed immediately when the buffer
  * reference count falls to zero. If the buffer is already on the LRU, we need
@@ -866,30 +885,37 @@  xfs_buf_hold(
 }
 
 /*
- *	Releases a hold on the specified buffer.  If the
- *	the hold count is 1, calls xfs_buf_free.
+ * Release a hold on the specified buffer. If the hold count is 1, the buffer is
+ * placed on LRU or freed (depending on b_lru_ref).
  */
 void
 xfs_buf_rele(
 	xfs_buf_t		*bp)
 {
 	struct xfs_perag	*pag = bp->b_pag;
+	bool			release;
+	bool			freebuf = false;
 
 	trace_xfs_buf_rele(bp, _RET_IP_);
 
 	if (!pag) {
 		ASSERT(list_empty(&bp->b_lru));
 		ASSERT(RB_EMPTY_NODE(&bp->b_rbnode));
-		if (atomic_dec_and_test(&bp->b_hold))
+		if (atomic_dec_and_test(&bp->b_hold)) {
+			xfs_buf_rele_in_flight(bp);
 			xfs_buf_free(bp);
+		}
 		return;
 	}
 
 	ASSERT(!RB_EMPTY_NODE(&bp->b_rbnode));
 
 	ASSERT(atomic_read(&bp->b_hold) > 0);
-	if (atomic_dec_and_lock(&bp->b_hold, &pag->pag_buf_lock)) {
-		spin_lock(&bp->b_lock);
+
+	release = atomic_dec_and_lock(&bp->b_hold, &pag->pag_buf_lock);
+	spin_lock(&bp->b_lock);
+	if (release) {
+		xfs_buf_rele_in_flight(bp);
 		if (!(bp->b_flags & XBF_STALE) && atomic_read(&bp->b_lru_ref)) {
 			/*
 			 * If the buffer is added to the LRU take a new
@@ -900,7 +926,6 @@  xfs_buf_rele(
 				bp->b_state &= ~XFS_BSTATE_DISPOSE;
 				atomic_inc(&bp->b_hold);
 			}
-			spin_unlock(&bp->b_lock);
 			spin_unlock(&pag->pag_buf_lock);
 		} else {
 			/*
@@ -914,15 +939,24 @@  xfs_buf_rele(
 			} else {
 				ASSERT(list_empty(&bp->b_lru));
 			}
-			spin_unlock(&bp->b_lock);
 
 			ASSERT(!(bp->b_flags & _XBF_DELWRI_Q));
 			rb_erase(&bp->b_rbnode, &pag->pag_buf_tree);
 			spin_unlock(&pag->pag_buf_lock);
 			xfs_perag_put(pag);
-			xfs_buf_free(bp);
+			freebuf = true;
 		}
+	} else if ((atomic_read(&bp->b_hold) == 1) && !list_empty(&bp->b_lru)) {
+		/*
+		 * The buffer is already on the LRU and it holds the only
+		 * reference. Drop the in flight state.
+		 */
+		xfs_buf_rele_in_flight(bp);
 	}
+	spin_unlock(&bp->b_lock);
+
+	if (freebuf)
+		xfs_buf_free(bp);
 }
 
 
@@ -1341,6 +1375,18 @@  xfs_buf_submit(
 	 * xfs_buf_ioend too early.
 	 */
 	atomic_set(&bp->b_io_remaining, 1);
+
+	/*
+	 * Bump the I/O in flight count on the buftarg if we haven't yet done
+	 * so for this buffer. Skip uncached buffers because many of those
+	 * (e.g., superblock, log buffers) are never released.
+	 */
+	if ((bp->b_bn != XFS_BUF_DADDR_NULL) &&
+	    !(bp->b_flags & _XBF_IN_FLIGHT)) {
+		bp->b_flags |= _XBF_IN_FLIGHT;
+		percpu_counter_inc(&bp->b_target->bt_io_count);
+	}
+
 	_xfs_buf_ioapply(bp);
 
 	/*
@@ -1526,13 +1572,19 @@  xfs_wait_buftarg(
 	int loop = 0;
 
 	/*
-	 * We need to flush the buffer workqueue to ensure that all IO
-	 * completion processing is 100% done. Just waiting on buffer locks is
-	 * not sufficient for async IO as the reference count held over IO is
-	 * not released until after the buffer lock is dropped. Hence we need to
-	 * ensure here that all reference counts have been dropped before we
-	 * start walking the LRU list.
+	 * First wait on the buftarg I/O count for all in-flight buffers to be
+	 * released. This is critical as new buffers do not make the LRU until
+	 * they are released.
+	 *
+	 * Next, flush the buffer workqueue to ensure all completion processing
+	 * has finished. Just waiting on buffer locks is not sufficient for
+	 * async IO as the reference count held over IO is not released until
+	 * after the buffer lock is dropped. Hence we need to ensure here that
+	 * all reference counts have been dropped before we start walking the
+	 * LRU list.
 	 */
+	while (percpu_counter_sum(&btp->bt_io_count))
+		delay(100);
 	drain_workqueue(btp->bt_mount->m_buf_workqueue);
 
 	/* loop until there is nothing left on the lru list. */
@@ -1629,6 +1681,8 @@  xfs_free_buftarg(
 	struct xfs_buftarg	*btp)
 {
 	unregister_shrinker(&btp->bt_shrinker);
+	ASSERT(percpu_counter_sum(&btp->bt_io_count) == 0);
+	percpu_counter_destroy(&btp->bt_io_count);
 	list_lru_destroy(&btp->bt_lru);
 
 	if (mp->m_flags & XFS_MOUNT_BARRIER)
@@ -1693,6 +1747,9 @@  xfs_alloc_buftarg(
 	if (list_lru_init(&btp->bt_lru))
 		goto error;
 
+	if (percpu_counter_init(&btp->bt_io_count, 0, GFP_KERNEL))
+		goto error;
+
 	btp->bt_shrinker.count_objects = xfs_buftarg_shrink_count;
 	btp->bt_shrinker.scan_objects = xfs_buftarg_shrink_scan;
 	btp->bt_shrinker.seeks = DEFAULT_SEEKS;
diff --git a/fs/xfs/xfs_buf.h b/fs/xfs/xfs_buf.h
index 8bfb974..19f70e2 100644
--- a/fs/xfs/xfs_buf.h
+++ b/fs/xfs/xfs_buf.h
@@ -62,6 +62,7 @@  typedef enum {
 #define _XBF_KMEM	 (1 << 21)/* backed by heap memory */
 #define _XBF_DELWRI_Q	 (1 << 22)/* buffer on a delwri queue */
 #define _XBF_COMPOUND	 (1 << 23)/* compound buffer */
+#define _XBF_IN_FLIGHT	 (1 << 25) /* I/O in flight, for accounting purposes */
 
 typedef unsigned int xfs_buf_flags_t;
 
@@ -115,6 +116,8 @@  typedef struct xfs_buftarg {
 	/* LRU control structures */
 	struct shrinker		bt_shrinker;
 	struct list_lru		bt_lru;
+
+	struct percpu_counter	bt_io_count;
 } xfs_buftarg_t;
 
 struct xfs_buf;
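
[Editor's note: the unmount drain loop above calls percpu_counter_sum()
rather than percpu_counter_read(), because only the former folds in the
per-CPU deltas for an exact total. A self-contained sketch of the counter
API as assumed here, illustrative only and not XFS code:]

	#include <linux/percpu_counter.h>

	static struct percpu_counter demo_count;

	static int demo_init(void)
	{
		/* start at 0; deltas accumulate per-CPU internally */
		return percpu_counter_init(&demo_count, 0, GFP_KERNEL);
	}

	static void demo_io(void)
	{
		percpu_counter_inc(&demo_count);	/* cheap, per-CPU */
		/* ... I/O in flight ... */
		percpu_counter_dec(&demo_count);
	}

	static bool demo_quiesced(void)
	{
		/*
		 * percpu_counter_read() may return an approximate value; a
		 * drain loop needs percpu_counter_sum(), which walks all CPUs.
		 */
		return percpu_counter_sum(&demo_count) == 0;
	}

	static void demo_exit(void)
	{
		percpu_counter_destroy(&demo_count);
	}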