xfs: close xfs_wait_buftarg() race with buffer lookups

Message ID 1471883461-27407-1-git-send-email-bfoster@redhat.com (mailing list archive)
State New, archived

Commit Message

Brian Foster Aug. 22, 2016, 4:31 p.m. UTC
xfs_wait_buftarg() is invoked on unmount and filesystem freeze to drain and
free the buffer LRU. We have reports of filesystem freeze hangs with the
following call chain:

  ->xfs_wait_buftarg()
    ->xfs_log_quiesce()
      ->xfs_quiesce_attr()
        ->xfs_fs_freeze()
          ...

This hang can be reproduced with a long-running fsstress instance in
parallel with a tight freeze/unfreeze loop. The cause of the hang is
racing b_hold updates between xfs_wait_buftarg() and _xfs_buf_find().
Specifically, the buftarg wait path checks whether a buffer has a >1
b_hold count to determine whether to skip the buffer. If b_hold == 1,
xfs_buftarg_wait_rele() proceeds to prepare the buffer for final
removal and ultimately calls xfs_buf_rele() to drop the LRU reference.

The problem is that _xfs_buf_find() can acquire a b_hold reference any
time after xfs_buftarg_wait_rele() has decided it has the only remaining
reference. If this occurs, xfs_wait_buftarg() drops the LRU reference,
but the xfs_buf_rele() instance doesn't actually remove the buffer from
the LRU due to the _xfs_buf_find() hold. At this point b_hold == 1, yet
the buffer is held via the _xfs_buf_find() codepath and still remains on
the LRU. Both call chains will ultimately call xfs_buf_rele() on a
buffer with b_hold == 1. This can have several user-facing side effects
such as use-after-free errors or the freed buffer remaining on the LRU
indefinitely with an underflowed or garbage b_hold count value.
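
In outline, the pre-patch logic is a classic check-then-act race (a
simplified sketch based on the code removed below, not the exact code):

	/* xfs_buftarg_wait_rele(), simplified pre-patch logic */
	if (atomic_read(&bp->b_hold) > 1)
		return LRU_SKIP;	/* buffer is in use, skip it */
	/*
	 * Race window: _xfs_buf_find() may atomic_inc(&bp->b_hold)
	 * here, after the check but before the buffer is isolated.
	 */
	atomic_set(&bp->b_lru_ref, 0);
	bp->b_state |= XFS_BSTATE_DISPOSE;
	/* ... xfs_wait_buftarg() later drops the LRU ref via xfs_buf_rele() */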

Update xfs_buftarg_wait_rele() to properly synchronize the LRU drain
against buffer lookups. Atomically decrement and lock the perag
associated with the buffer to lock out buffer lookups before
xfs_wait_buftarg() determines whether the LRU holds the final reference.
Open code freeing of the buffer so we can remove it from the perag
rbtree before the perag lock is dropped and guarantee it cannot be
looked up once freeing is imminent. Also update xfs_buftarg_wait_rele()
to drop and reacquire the lru_lock in the correct order to avoid
deadlocks with LRU insertion.

Signed-off-by: Brian Foster <bfoster@redhat.com>
---

A couple of notes... first, there are several different ways to close
this race. I opted for this approach because xfs_wait_buftarg() is the
slow/less frequent path.
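
For reference, the synchronization above hinges on atomic_dec_and_lock().
Its semantics, roughly (a simplified sketch along the lines of
lib/dec_and_lock.c, not the exact implementation):

	int atomic_dec_and_lock(atomic_t *atomic, spinlock_t *lock)
	{
		/* fast path: count > 1, just decrement and return false */
		if (atomic_add_unless(atomic, -1, 1))
			return 0;

		/* count is 1: take the lock before dropping the last ref */
		spin_lock(lock);
		if (atomic_dec_and_test(atomic))
			return 1;	/* hit zero; lock stays held */
		spin_unlock(lock);
		return 0;	/* someone re-referenced; lock dropped */
	}

Because pag_buf_lock is held on success, _xfs_buf_find() cannot acquire
a new reference between the decrement and the rbtree removal.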

Second, I believe we still have an independent freeze hang upstream due
to an issue with how we use drain_workqueue(). Specifically, I see the
following warning from kernel/workqueue.c:__queue_work():

	if (unlikely(wq->flags & __WQ_DRAINING) &&
	    WARN_ON_ONCE(!is_chained_work(wq)))
		return;

... followed by a hang. I suspect this is due to the ioend work being
dropped and thus never completing the I/O, but I haven't dug into it
yet. For now, I'm testing for hangs with this fix and the above
commented out.

Brian

 fs/xfs/xfs_buf.c | 52 ++++++++++++++++++++++++++++++++++++++++++----------
 1 file changed, 42 insertions(+), 10 deletions(-)

Comments

Dave Chinner Aug. 22, 2016, 11:24 p.m. UTC | #1
On Mon, Aug 22, 2016 at 12:31:01PM -0400, Brian Foster wrote:
> xfs_wait_buftarg() is invoked on unmount and filesystem freeze to drain and
> free the buffer LRU. We have reports of filesystem freeze hangs with the
> following call chain:
> 
>   ->xfs_wait_buftarg()
>     ->xfs_log_quiesce()
>       ->xfs_quiesce_attr()
>         ->xfs_fs_freeze()
>           ...
> 
> This hang can be reproduced with a long-running fsstress instance in
> parallel with a tight freeze/unfreeze loop. The cause of the hang is
> racing b_hold updates between xfs_wait_buftarg() and _xfs_buf_find().
> Specifically, the buftarg wait path checks whether a buffer has a >1
> b_hold count to determine whether to skip the buffer. If b_hold == 1,
> xfs_buftarg_wait_rele() proceeds to prepare the buffer for final
> removal and ultimately calls xfs_buf_rele() to drop the LRU reference.

So, dumb question: why are we reclaiming the entire buffer cache
when the filesystem is being frozen?

> The problem is that _xfs_buf_find() can acquire a b_hold reference any
> time after xfs_buftarg_wait_rele() has decided it has the only remaining
> reference. If this occurs, xfs_wait_buftarg() drops the LRU reference,
> but the xfs_buf_rele() instance doesn't actually remove the buffer from
> the LRU due to the _xfs_buf_find() hold.

Right.


> At this point b_hold == 1, yet
> the buffer is held via the _xfs_buf_find() codepath and still remains on
> the LRU.  Both call chains will ultimately call xfs_buf_rele() on a
> buffer with b_hold == 1.

I don't follow your logic here? We've gone from a count of 2 (i.e.
lru + racing lookup) to a count of 1 by dropping the LRU hold count,
but now you're saying that both the find and lru still need to call
xfs_buf_rele()? I'm clearly missing something here - how do we get
to a hold count of 1 here without dropping a reference twice?

(I wrote out this analysis, so I'll leave it here for discussion
purposes.)

The code paths are:

Freeze process			lookup process

xfs_buftarg_wait
<lru>
xfs_buftarg_wait_rele()
b_hold = 1
lock(b_lock)
b_lru_ref = 0
XFS_BSTATE_DISPOSE
move to dispose list
unlock(b_lock)
.....
				_xfs_buf_find()
				<finds buffer>
				atomic_inc(b_hold)	//hold count = 2
				.....
walks dispose list
xfs_buf_rele()
atomic_dec(b_hold)					//hold count = 1
release == false
<does nothing>

				b_lru_ref set to non-zero
				.....
				xfs_buf_rele()
				atomic_dec(b_hold)	// hold_count = 0
				!stale, b_lru_ref > 0
				add back to LRU,
				atomic_inc(b_hold)	// hold_count = 1
				~XFS_BSTATE_DISPOSE


So where does the hold count become 1 while both still need to call
xfs_buf_rele()?

> Update xfs_buftarg_wait_rele() to properly synchronize the LRU drain
> against buffer lookups. Atomically decrement and lock the perag
> associated with the buffer to lock out buffer lookups before
> xfs_wait_buftarg() determines whether the LRU holds the final reference.
> Open code freeing of the buffer so we can remove it from the perag
> rbtree before the perag lock is dropped and guarantee it cannot be
> looked up once freeing is imminent. Also update xfs_buftarg_wait_rele()
> to drop and reacquire the lru_lock in the correct order to avoid
> deadlocks with LRU insertion.

While this is a solution, I don't yet understand the race being
solved, so I can't comment on whether it's the best solution or not.
It doesn't happen at unmount because there can't be racing lookups
occurring at that point in unmount, so it comes back to this
question: why are we invalidating the entire LRU on freeze?

We really only need to wait for IO to complete during freeze,
right?  i.e.  xfs_wait_buftarg() is doing 2 things: the first is
waiting for outstanding IO to complete, and the other is reclaiming
the LRU. And for freeze, we don't need to do the second?

So if we stop trying to reclaim the buffer cache on freeze, then the
whole lookup vs forced reclaim problem will go away?
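
A freeze-only wait would then reduce to just the I/O drain half, e.g.
(a hypothetical sketch reusing the existing bt_io_count wait loop; the
xfs_buftarg_wait_io() name is made up):

	static void
	xfs_buftarg_wait_io(
		struct xfs_buftarg	*btp)
	{
		/* wait for all in-flight buffer I/O to complete */
		while (percpu_counter_sum(&btp->bt_io_count))
			delay(100);
		flush_workqueue(btp->bt_mount->m_buf_workqueue);
	}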

> Second, I believe we still have an independent freeze hang upstream due
> to an issue with how we use drain_workqueue(). Specifically, I see the
> following warning from kernel/workqueue.c:__queue_work():
> 
> 	if (unlikely(wq->flags & __WQ_DRAINING) &&
> 	    WARN_ON_ONCE(!is_chained_work(wq)))
> 		return;

Using flush_workqueue() would solve that problem. We've already
guaranteed for unmount that there isn't any new incoming work. i.e:

        while (percpu_counter_sum(&btp->bt_io_count)) {
		delay(100);
		flush_workqueue(btp->bt_mount->m_buf_workqueue);
	}
	flush_workqueue(btp->bt_mount->m_buf_workqueue);

Cheers,

Dave.
Brian Foster Aug. 23, 2016, 12:54 p.m. UTC | #2
On Tue, Aug 23, 2016 at 09:24:10AM +1000, Dave Chinner wrote:
> On Mon, Aug 22, 2016 at 12:31:01PM -0400, Brian Foster wrote:
> > xfs_wait_buftarg() is invoked on unmount and filesystem freeze to drain and
> > free the buffer LRU. We have reports of filesystem freeze hangs with the
> > following call chain:
> > 
> >   ->xfs_wait_buftarg()
> >     ->xfs_log_quiesce()
> >       ->xfs_quiesce_attr()
> >         ->xfs_fs_freeze()
> >           ...
> > 
> > This hang can be reproduced with a long-running fsstress instance in
> > parallel with a tight freeze/unfreeze loop. The cause of the hang is
> > racing b_hold updates between xfs_wait_buftarg() and _xfs_buf_find().
> > Specifically, the buftarg wait path checks whether a buffer has a >1
> > b_hold count to determine whether to skip the buffer. If b_hold == 1,
> > xfs_buftarg_wait_rele() proceeds to prepare the buffer for final
> > removal and ultimately calls xfs_buf_rele() to drop the LRU reference.
> 
> So, dumb question: why are we reclaiming the entire buffer cache
> when the filesystem is being frozen?
> 

Not sure. I noticed that briefly, but I guess I was too preoccupied with
figuring out what was happening to step back and think about it. It
looks like this changed in commit c75921a7 ("xfs: xfs_quiesce_attr()
should quiesce the log like unmount"), which may or may not have been
motivated by fs freeze. AFAICT, it just looks like a semi-lazy reuse of
the unmount mechanism. E.g., xfs_fs_freeze() was calling
xfs_quiesce_attr() at the time and that commit altered the latter to
reuse xfs_log_quiesce(), which happens to call xfs_wait_buftarg().

> > The problem is that _xfs_buf_find() can acquire a b_hold reference any
> > time after xfs_buftarg_wait_rele() has decided it has the only remaining
> > reference. If this occurs, xfs_wait_buftarg() drops the LRU reference,
> > but the xfs_buf_rele() instance doesn't actually remove the buffer from
> > the LRU due to the _xfs_buf_find() hold.
> 
> Right.
> 
> 
> > At this point b_hold == 1, yet
> > the buffer is held via the _xfs_buf_find() codepath and still remains on
> > the LRU.  Both call chains will ultimately call xfs_buf_rele() on a
> > buffer with b_hold == 1.
> 
> I don't follow your logic here? We've gone from a count of 2 (i.e.
> lru + racing lookup) to a count of 1 by dropping the LRU hold count,
> but now you're saying that both the find and lru still need to call
> xfs_buf_rele()? I'm clearly missing something here - how do we get
> to a hold count of 1 here without dropping a reference twice?
> 

Sorry if the explanation is unclear. I probably relied too much on the
downstream code to describe the problem as this is more difficult to
reproduce upstream. Downstream is also slightly different in that it
predates the list_lru based mechanism.

> (I wrote out this analysis, so I'll leave it here for discussion
> purposes.)
> 
> The code paths are:
> 
> Freeze process			lookup process
> 
> xfs_buftarg_wait
> <lru>
> xfs_buftarg_wait_rele()
> b_hold = 1
> lock(b_lock)
> b_lru_ref = 0
> XFS_BSTATE_DISPOSE
> move to dispose list
> unlock(b_lock)
> .....
> 				_xfs_buf_find()
> 				<finds buffer>
> 				atomic_inc(b_hold)	//hold count = 2
> 				.....
> walks dispose list
> xfs_buf_rele()
> atomic_dec(b_hold)					//hold count = 1
> release == false
> <does nothing>
> 
> 				b_lru_ref set to non-zero
> 				.....
> 				xfs_buf_rele()
> 				atomic_dec(b_hold)	// hold_count = 0
> 				!stale, b_lru_ref > 0
> 				add back to LRU,
> 				atomic_inc(b_hold)	// hold_count = 1
> 				~XFS_BSTATE_DISPOSE
> 
> 
> So where does the hold count become 1 while both still need to call
> xfs_buf_rele()?
> 

Hmm, it's possible the hang due to this race is a downstream only
problem and the upstream hang is purely the workqueue issue. Looking at
the upstream code again, the part I missed the first time around is that
the buffer is dropped from the lru by xfs_wait_buftarg() rather than
xfs_buf_rele(). The fact that doesn't happen in the downstream code is
sort of the problem. E.g., in downstream, xfs_wait_buftarg() calls
xfs_buf_rele() which never removes the item from the LRU due to the
elevated hold count, so effectively it ends up calling xfs_buf_rele()
again and again until the buffer is removed from the LRU (and thus
freed). The upstream xfs_wait_buftarg() checks b_hold, migrates the
buffer to the dispose list, removes it from the dispose list, then
invokes xfs_buf_rele() and never sees the buffer again. The list_lru
management here should ensure that xfs_buf_rele() is called only once
for the LRU afaict...
(though I'm still not sure passing through xfs_wait_buftarg() with a
buffer being held by another context is quite correct, even if the race
is more of a landmine than anything).

I thought I had confirmed/reproduced this upstream, but it's possible
that occurred before I recognized the workqueue hang was independent.
Let me step back and try to reproduce this with the hack to work around
the workqueue issue and see what happens. If it reoccurs, I'll try to
put together more applicable details. If not, we can probably drop this
for now and incorporate it as a downstream only fix...

Brian

> > Update xfs_buftarg_wait_rele() to properly synchronize the LRU drain
> > against buffer lookups. Atomically decrement and lock the perag
> > associated with the buffer to lock out buffer lookups before
> > xfs_wait_buftarg() determines whether the LRU holds the final reference.
> > Open code freeing of the buffer so we can remove it from the perag
> > rbtree before the perag lock is dropped and guarantee it cannot be
> > looked up once freeing is imminent. Also update xfs_buftarg_wait_rele()
> > to drop and reacquire the lru_lock in the correct order to avoid
> > deadlocks with LRU insertion.
> 
> While this is a solution, I don't yet understand the race being
> solved, so I can't comment on whether it's the best solution or not.
> It doesn't happen at unmount because there can't be racing lookups
> occurring at that point in unmount, so it comes back to this
> question: why are we invalidating the entire LRU on freeze?
> 
> We really only need to wait for IO to complete during freeze,
> right?  i.e.  xfs_wait_buftarg() is doing 2 things: the first is
> waiting for outstanding IO to complete, and the other is reclaiming
> the LRU. And for freeze, we don't need to do the second?
> 
> So if we stop trying to reclaim the buffer cache on freeze, then the
> whole lookup vs forced reclaim problem will go away?
> 
> > Second, I believe we still have an independent freeze hang upstream due
> > to an issue with how we use drain_workqueue(). Specifically, I see the
> > following warning from kernel/workqueue.c:__queue_work():
> > 
> > 	if (unlikely(wq->flags & __WQ_DRAINING) &&
> > 	    WARN_ON_ONCE(!is_chained_work(wq)))
> > 		return;
> 
> Using flush_workqueue() would solve that problem. We've already
> guaranteed for unmount that there isn't any new incoming work. i.e:
> 
>         while (percpu_counter_sum(&btp->bt_io_count)) {
> 		delay(100);
> 		flush_workqueue(btp->bt_mount->m_buf_workqueue);
> 	}
> 	flush_workqueue(btp->bt_mount->m_buf_workqueue);
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

Patch

diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index 607cc29..ce9c419 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -1570,24 +1570,48 @@  xfs_buftarg_wait_rele(
 {
 	struct xfs_buf		*bp = container_of(item, struct xfs_buf, b_lru);
 	struct list_head	*dispose = arg;
+	struct xfs_perag	*pag = bp->b_pag;
+	int			release;
+
+	/*
+	 * Drop lru_lock to avoid deadlocks with LRU insertion. The required
+	 * lock order to safely isolate a buffer is pag_buf_lock -> b_lock
+	 * -> lru_lock.
+	 */
+	spin_unlock(lru_lock);
 
-	if (atomic_read(&bp->b_hold) > 1) {
+	/*
+	 * Atomically decrement and lock the perag to synchronize against buffer
+	 * lookups. We unconditionally decrement because a check for b_hold == 1
+	 * alone is not sufficient. _xfs_buf_find() can acquire a reference
+	 * immediately after the check.
+	 */
+	release = atomic_dec_and_lock(&bp->b_hold, &pag->pag_buf_lock);
+	if (!release) {
 		/* need to wait, so skip it this pass */
 		trace_xfs_buf_wait_buftarg(bp, _RET_IP_);
-		return LRU_SKIP;
+		goto skip_buf;
 	}
-	if (!spin_trylock(&bp->b_lock))
-		return LRU_SKIP;
+	if (!spin_trylock(&bp->b_lock)) {
+		spin_unlock(&pag->pag_buf_lock);
+		goto skip_buf;
+	}
+	spin_lock(lru_lock);
+
+	/* remove from the rbtree to prevent further lookups */
+	rb_erase(&bp->b_rbnode, &pag->pag_buf_tree);
+	spin_unlock(&pag->pag_buf_lock);
 
-	/*
-	 * clear the LRU reference count so the buffer doesn't get
-	 * ignored in xfs_buf_rele().
-	 */
-	atomic_set(&bp->b_lru_ref, 0);
 	bp->b_state |= XFS_BSTATE_DISPOSE;
 	list_lru_isolate_move(lru, item, dispose);
 	spin_unlock(&bp->b_lock);
 	return LRU_REMOVED;
+
+skip_buf:
+	/* skipping the buf, put back the LRU hold */
+	atomic_inc(&bp->b_hold);
+	spin_lock(lru_lock);
+	return LRU_SKIP;
 }
 
 void
@@ -1629,7 +1653,15 @@  xfs_wait_buftarg(
 				xfs_alert(btp->bt_mount,
 "Please run xfs_repair to determine the extent of the problem.");
 			}
-			xfs_buf_rele(bp);
+
+			/*
+			 * The buffer has already been removed from the rbtree
+			 * and had the last reference dropped. Drop the perag
+			 * reference and free it.
+			 */
+			ASSERT(atomic_read(&bp->b_hold) == 0);
+			xfs_perag_put(bp->b_pag);
+			xfs_buf_free(bp);
 		}
 		if (loop++ != 0)
 			delay(100);