[1/5] xfs: bypass dquot reclaim to avoid quotacheck deadlock

Message ID 1487173247-5965-2-git-send-email-bfoster@redhat.com (mailing list archive)
State New, archived

Commit Message

Brian Foster Feb. 15, 2017, 3:40 p.m. UTC
Reclaim during quotacheck can lead to deadlocks on the dquot flush lock:

 - Quotacheck populates a local delwri queue with the physical dquot
   buffers.
 - Quotacheck performs the xfs_qm_dqusage_adjust() bulkstat and dirties
   all of the dquots.
 - Reclaim kicks in and attempts to flush a dquot whose buffer is
   already queued on the quotacheck queue. The flush succeeds but
   queueing to the reclaim delwri queue fails as the backing buffer is
   already queued. The flush unlock is now deferred to I/O completion of
   the buffer from the quotacheck queue.
 - Quotacheck proceeds to the xfs_qm_flush_one() walk which requires the
   flush lock to update the backing buffers with the in-core
   recalculated values. This deadlocks as the flush lock was acquired by
   reclaim but the buffer never submitted for I/O as it already resided
   on the quotacheck queue.

This is reproduced by running quotacheck on a filesystem with a couple
million inodes in low memory (512MB-1GB) situations.

Quotacheck first resets and collects the physical dquot buffers in a
delwri queue. Then, it traverses the filesystem inodes via bulkstat,
updates the in-core dquots, flushes the corrected dquots to the backing
buffers and finally submits the delwri queue for I/O. Since the backing
buffers are queued across the entire quotacheck operation, dquot reclaim
cannot possibly complete a dquot flush before quotacheck completes.
Therefore, dquot reclaim provides no real value during quotacheck.
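
For reference, the ordering in xfs_qm_quotacheck() looks roughly like the
following (paraphrased from fs/xfs/xfs_qm.c; error handling and the group
and project quota trees are omitted, so treat this as an outline rather
than the literal code):

        LIST_HEAD(buffer_list);

        /* 1. Reset the on-disk dquot buffers, queueing them on the local list. */
        xfs_qm_dqiterate(mp, uip, XFS_QMOPT_UQUOTA, &buffer_list);

        /* 2. Bulkstat every inode and dirty the in-core dquots. */
        xfs_bulkstat(mp, &lastino, &count, xfs_qm_dqusage_adjust, ...);

        /* 3. Flush each dirty dquot back to its (still locally queued) buffer. */
        xfs_qm_dquot_walk(mp, XFS_DQ_USER, xfs_qm_flush_one, &buffer_list);

        /* 4. Only now is the delwri queue submitted for I/O. */
        xfs_buf_delwri_submit(&buffer_list);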

Lock out dquot reclaim during quotacheck to avoid the deadlock. Define
and set a new quotainfo flag when quotacheck is in progress that reclaim
can use to bypass processing as appropriate.

Reported-by: Martin Svec <martin.svec@zoner.cz>
Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/xfs_qm.c | 11 +++++++++++
 fs/xfs/xfs_qm.h |  1 +
 2 files changed, 12 insertions(+)

Comments

Dave Chinner Feb. 16, 2017, 10:37 p.m. UTC | #1
On Wed, Feb 15, 2017 at 10:40:43AM -0500, Brian Foster wrote:
> Reclaim during quotacheck can lead to deadlocks on the dquot flush lock:
> 
>  - Quotacheck populates a local delwri queue with the physical dquot
>    buffers.
>  - Quotacheck performs the xfs_qm_dqusage_adjust() bulkstat and dirties
>    all of the dquots.
>  - Reclaim kicks in and attempts to flush a dquot whose buffer is
>    already queued on the quotacheck queue. The flush succeeds but
>    queueing to the reclaim delwri queue fails as the backing buffer is
>    already queued. The flush unlock is now deferred to I/O completion of
>    the buffer from the quotacheck queue.
>  - Quotacheck proceeds to the xfs_qm_flush_one() walk which requires the
>    flush lock to update the backing buffers with the in-core
>    recalculated values. This deadlocks as the flush lock was acquired by
>    reclaim but the buffer never submitted for I/O as it already resided
>    on the quotacheck queue.
> 
> This is reproduced by running quotacheck on a filesystem with a couple
> million inodes in low memory (512MB-1GB) situations.
> 
> Quotacheck first resets and collects the physical dquot buffers in a
> delwri queue. Then, it traverses the filesystem inodes via bulkstat,
> updates the in-core dquots, flushes the corrected dquots to the backing
> buffers and finally submits the delwri queue for I/O. Since the backing
> buffers are queued across the entire quotacheck operation, dquot reclaim
> cannot possibly complete a dquot flush before quotacheck completes.
> Therefore, dquot reclaim provides no real value during quotacheck.

Which is an OOM vector on systems with lots of dquots and low memory
at mount time. Turning off dquot reclaim doesn't change that.

Address the root cause - the buffer list is never flushed and so
pins the memory quotacheck requires, so we need to flush the buffer
list more often.  We also need to submit the buffer list before the
flush walks begin, thereby unlocking all the dquots before doing the
flush walk and avoiding the deadlock.
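
A minimal sketch of that ordering change in xfs_qm_quotacheck() (untested,
just to show the intent) would be:

        /* after the dqadjust bulkstat, write and drain the local queue... */
        error = xfs_buf_delwri_submit(&buffer_list);

        /* ...so every dquot is flush-unlocked before the flush walk runs */
        error = xfs_qm_dquot_walk(mp, XFS_DQ_USER, xfs_qm_flush_one, &buffer_list);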

-Dave.
Brian Foster Feb. 17, 2017, 6:30 p.m. UTC | #2
On Fri, Feb 17, 2017 at 09:37:05AM +1100, Dave Chinner wrote:
> On Wed, Feb 15, 2017 at 10:40:43AM -0500, Brian Foster wrote:
> > Reclaim during quotacheck can lead to deadlocks on the dquot flush lock:
> > 
> >  - Quotacheck populates a local delwri queue with the physical dquot
> >    buffers.
> >  - Quotacheck performs the xfs_qm_dqusage_adjust() bulkstat and dirties
> >    all of the dquots.
> >  - Reclaim kicks in and attempts to flush a dquot whose buffer is
> >    already queued on the quotacheck queue. The flush succeeds but
> >    queueing to the reclaim delwri queue fails as the backing buffer is
> >    already queued. The flush unlock is now deferred to I/O completion of
> >    the buffer from the quotacheck queue.
> >  - Quotacheck proceeds to the xfs_qm_flush_one() walk which requires the
> >    flush lock to update the backing buffers with the in-core
> >    recalculated values. This deadlocks as the flush lock was acquired by
> >    reclaim but the buffer never submitted for I/O as it already resided
> >    on the quotacheck queue.
> > 
> > This is reproduced by running quotacheck on a filesystem with a couple
> > million inodes in low memory (512MB-1GB) situations.
> > 
> > Quotacheck first resets and collects the physical dquot buffers in a
> > delwri queue. Then, it traverses the filesystem inodes via bulkstat,
> > updates the in-core dquots, flushes the corrected dquots to the backing
> > buffers and finally submits the delwri queue for I/O. Since the backing
> > buffers are queued across the entire quotacheck operation, dquot reclaim
> > cannot possibly complete a dquot flush before quotacheck completes.
> > Therefore, dquot reclaim provides no real value during quotacheck.
> 
> Which is an OOM vector on systems with lots of dquots and low memory
> at mount time. Turning off dquot reclaim doesn't change that.
> 

It's not intended to. The inefficient design of quotacheck wrt memory
usage is a separate problem. AFAICT, that problem goes back as far as my
git tree does (2.6.xx).

Looking back through the git history, this deadlock appears to be a
regression as of commit 43ff2122e6 ("xfs: on-stack delayed write buffer
lists") in 2012, which replaced a trylock/push sequence from the
quotacheck flush walk with a synchronous flush lock.

> Address the root cause - the buffer list is never flushed and so
> pins the memory quotacheck requires, so we need to flush the buffer
> list more often.  We also need to submit the buffer list before the
> flush walks begin, thereby unlocking all the dquots before doing the
> flush walk and avoiding the deadlock.
> 

I acknowledge that this patch doesn't address the root problem, but
neither does this proposal. The root cause is not that quotacheck uses
too much memory, it's that the serialization between quotacheck and
dquot reclaim is bogus.

For example, if we drop the buffer list being held across the adjust and
flush walks, we do indeed unpin the buffers and dquots for memory
reclaim, particularly during the dqadjust bulkstat. The flush walk then
still has to cycle the flush lock of every dquot to guarantee we wait on
dquots being flushed by reclaim (otherwise we risk corruption with the
xfs_qm_flush_one() logic as it is today). The same deadlock vector still
exists, albeit with a smaller window, if reclaim happens to flush a
dquot that is backed by a buffer that has already been queued by the
flush walk. AFAICT, the only requirements for this to occur are a couple
of dquots on a single backing buffer and some memory pressure from
somewhere, so the local memory usage is irrelevant.
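
For context, the flush walk callback as it stands today does roughly the
following (paraphrasing xfs_qm_flush_one() with the error handling trimmed):

        xfs_dqlock(dqp);
        if (!(dqp->dq_flags & XFS_DQ_FREEING) && XFS_DQ_IS_DIRTY(dqp)) {
                xfs_dqflock(dqp);       /* blocks if reclaim already holds it */
                error = xfs_qm_dqflush(dqp, &bp);
                if (!error) {
                        xfs_buf_delwri_queue(bp, buffer_list);
                        xfs_buf_relse(bp);
                }
        }
        xfs_dqunlock(dqp);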

I'm specifically trying to address the deadlock problem here. If we
don't want to bypass reclaim, the other options I see are to rework the
logic on the reclaim side to check the buffer state before we flush the
dquot, or push on the buffer from the quotacheck side like we used to
before the commit above. I think the latter may be more robust as it
isolates the logic to quotacheck. The risk there is that we're open to
pushing buffers off the reclaim list, but that might be fine so long as
we have the buffer lock.

I'm open to other ideas of course, but so far "flushing the buffer list
more often" isn't clear enough to suggest it actually solves the locking
problem... I'll put a quotacheck rework on my todo list as well, but
that's something I'll have to get to separately (but I'm happy to help
if anybody else wants to jump on that sooner.. ;).

Brian

> -Dave.
> 
> -- 
> Dave Chinner
> david@fromorbit.com
Dave Chinner Feb. 17, 2017, 11:12 p.m. UTC | #3
On Fri, Feb 17, 2017 at 01:30:09PM -0500, Brian Foster wrote:
> On Fri, Feb 17, 2017 at 09:37:05AM +1100, Dave Chinner wrote:
> > On Wed, Feb 15, 2017 at 10:40:43AM -0500, Brian Foster wrote:
> > > Reclaim during quotacheck can lead to deadlocks on the dquot flush lock:
> > > 
> > >  - Quotacheck populates a local delwri queue with the physical dquot
> > >    buffers.
> > >  - Quotacheck performs the xfs_qm_dqusage_adjust() bulkstat and dirties
> > >    all of the dquots.
> > >  - Reclaim kicks in and attempts to flush a dquot whose buffer is
> > >    already queued on the quotacheck queue. The flush succeeds but
> > >    queueing to the reclaim delwri queue fails as the backing buffer is
> > >    already queued. The flush unlock is now deferred to I/O completion of
> > >    the buffer from the quotacheck queue.
> > >  - Quotacheck proceeds to the xfs_qm_flush_one() walk which requires the
> > >    flush lock to update the backing buffers with the in-core
> > >    recalculated values. This deadlocks as the flush lock was acquired by
> > >    reclaim but the buffer never submitted for I/O as it already resided
> > >    on the quotacheck queue.
> > > 
> > > This is reproduced by running quotacheck on a filesystem with a couple
> > > million inodes in low memory (512MB-1GB) situations.
> > > 
> > > Quotacheck first resets and collects the physical dquot buffers in a
> > > delwri queue. Then, it traverses the filesystem inodes via bulkstat,
> > > updates the in-core dquots, flushes the corrected dquots to the backing
> > > buffers and finally submits the delwri queue for I/O. Since the backing
> > > buffers are queued across the entire quotacheck operation, dquot reclaim
> > > cannot possibly complete a dquot flush before quotacheck completes.
> > > Therefore, dquot reclaim provides no real value during quotacheck.
> > 
> > Which is an OOM vector on systems with lots of dquots and low memory
> > at mount time. Turning off dquot reclaim doesn't change that.
> > 
> 
> It's not intended to. The inefficient design of quotacheck wrt memory
> usage is a separate problem. AFAICT, that problem goes back as far as my
> git tree does (2.6.xx).
> 
> Looking back through the git history, this deadlock appears to be a
> regression as of commit 43ff2122e6 ("xfs: on-stack delayed write buffer
> lists") in 2012, which replaced a trylock/push sequence from the
> quotacheck flush walk with a synchronous flush lock.

Right - that's the commit that changed the code to what it is now.
That's what I implied needs fixing....

> > Address the root cause - the buffer list is never flushed and so
> > pins the memory quotacheck requires, so we need to flush the buffer
> > list more often.  We also need to submit the buffer list before the
> > flush walks begin, thereby unlocking all the dquots before doing the
> > flush walk and avoiding the deadlock.
> > 
> 
> I acknowledge that this patch doesn't address the root problem, but
> neither does this proposal. The root cause is not that quotacheck uses
> too much memory, it's that the serialization between quotacheck and
> dquot reclaim is bogus.
> 
> For example, if we drop the buffer list being held across the adjust and
> flush walks, we do indeed unpin the buffers and dquots for memory
> reclaim, particularly during the dqadjust bulkstat. The flush walk then
> still has to cycle the flush lock of every dquot to guarantee we wait on
> dquots being flushed by reclaim (otherwise we risk corruption with the
> > xfs_qm_flush_one() logic as it is today). The same deadlock vector still
> exists, albeit with a smaller window, if reclaim happens to flush a
> dquot that is backed by a buffer that has already been queued by the
> flush walk. AFAICT, the only requirements for this to occur are a couple
> of dquots on a single backing buffer and some memory pressure from
> somewhere, so the local memory usage is irrelevant.

Except that the window is *tiny* because the dquots are walked in
order and so all the dquots are processed in ascending order. This
means all the dquots in a buffer are flushed sequentially, and each
dquot lookup resets the reclaim reference count on the dquot
buffer. Hence it is extremely unlikely that a dquot buffer will be
reclaimed while it is partially flushed during the flush walk.

> I'm specifically trying to address the deadlock problem here. If we
> don't want to bypass reclaim, the other options I see are to rework the
> logic on the reclaim side to check the buffer state before we flush the
> dquot, or push on the buffer from the quotacheck side like we used to
> before the commit above.

The latter is effectively what flushing the buffer list before the
flush walk is run does.

Cheers,

Dave.
Brian Foster Feb. 18, 2017, 12:55 p.m. UTC | #4
On Sat, Feb 18, 2017 at 10:12:15AM +1100, Dave Chinner wrote:
> On Fri, Feb 17, 2017 at 01:30:09PM -0500, Brian Foster wrote:
> > On Fri, Feb 17, 2017 at 09:37:05AM +1100, Dave Chinner wrote:
> > > On Wed, Feb 15, 2017 at 10:40:43AM -0500, Brian Foster wrote:
> > > > Reclaim during quotacheck can lead to deadlocks on the dquot flush lock:
> > > > 
> > > >  - Quotacheck populates a local delwri queue with the physical dquot
> > > >    buffers.
> > > >  - Quotacheck performs the xfs_qm_dqusage_adjust() bulkstat and dirties
> > > >    all of the dquots.
> > > >  - Reclaim kicks in and attempts to flush a dquot whose buffer is
> > > >    already queued on the quotacheck queue. The flush succeeds but
> > > >    queueing to the reclaim delwri queue fails as the backing buffer is
> > > >    already queued. The flush unlock is now deferred to I/O completion of
> > > >    the buffer from the quotacheck queue.
> > > >  - Quotacheck proceeds to the xfs_qm_flush_one() walk which requires the
> > > >    flush lock to update the backing buffers with the in-core
> > > >    recalculated values. This deadlocks as the flush lock was acquired by
> > > >    reclaim but the buffer never submitted for I/O as it already resided
> > > >    on the quotacheck queue.
> > > > 
> > > > This is reproduced by running quotacheck on a filesystem with a couple
> > > > million inodes in low memory (512MB-1GB) situations.
> > > > 
> > > > Quotacheck first resets and collects the physical dquot buffers in a
> > > > delwri queue. Then, it traverses the filesystem inodes via bulkstat,
> > > > updates the in-core dquots, flushes the corrected dquots to the backing
> > > > buffers and finally submits the delwri queue for I/O. Since the backing
> > > > buffers are queued across the entire quotacheck operation, dquot reclaim
> > > > cannot possibly complete a dquot flush before quotacheck completes.
> > > > Therefore, dquot reclaim provides no real value during quotacheck.
> > > 
> > > Which is an OOM vector on systems with lots of dquots and low memory
> > > at mount time. Turning off dquot reclaim doesn't change that.
> > > 
> > 
> > It's not intended to. The inefficient design of quotacheck wrt memory
> > usage is a separate problem. AFAICT, that problem goes back as far as my
> > git tree does (2.6.xx).
> > 
> > Looking back through the git history, this deadlock appears to be a
> > regression as of commit 43ff2122e6 ("xfs: on-stack delayed write buffer
> > lists") in 2012, which replaced a trylock/push sequence from the
> > quotacheck flush walk with a synchronous flush lock.
> 
> Right - that's the commit that changed the code to what it is now.
> That's what I implied needs fixing....
> 

Ok, but that commit has nothing to do with the design of quotacheck.

> > > Address the root cause - the buffer list is never flushed and so
> > > pins the memory quotacheck requires, so we need to flush the buffer
> > > list more often.  We also need to submit the buffer list before the
> > > flush walks begin, thereby unlocking all the dquots before doing the
> > > flush walk and avoiding the deadlock.
> > > 
> > 
> > I acknowledge that this patch doesn't address the root problem, but
> > neither does this proposal. The root cause is not that quotacheck uses
> > too much memory, it's that the serialization between quotacheck and
> > dquot reclaim is bogus.
> > 
> > For example, if we drop the buffer list being held across the adjust and
> > flush walks, we do indeed unpin the buffers and dquots for memory
> > reclaim, particularly during the dqadjust bulkstat. The flush walk then
> > still has to cycle the flush lock of every dquot to guarantee we wait on
> > dquots being flushed by reclaim (otherwise we risk corruption with the
> > > xfs_qm_flush_one() logic as it is today). The same deadlock vector still
> > exists, albeit with a smaller window, if reclaim happens to flush a
> > dquot that is backed by a buffer that has already been queued by the
> > flush walk. AFAICT, the only requirements for this to occur are a couple
> > of dquots on a single backing buffer and some memory pressure from
> > somewhere, so the local memory usage is irrelevant.
> 
> Except that the window is *tiny* because the dquots are walked in
> order and so all the dquots are processed in ascending order. This
> means all the dquots in a buffer are flushed sequentially, and each
> dquot lookup resets the reclaim reference count on the dquot
> buffer. Hence it is extremely unlikely that a dquot buffer will be
> reclaimed while it is partially flushed during the flush walk.
> 

At a high level, this fundamentally makes dquot reclaim safe during the
dqadjust bulkstat. We are susceptible to the same exact problem during
the flush walk... it is true that the race window is now reduced to the
time spent processing dquots for a particular buffer at a time. The logic
used to explain away the race here doesn't make a lot of sense,
however...

First, the reference count on the underlying buffer is irrelevant
because dquot reclaim is what causes the race. All dquot reference
counts are zero when the flush walk begins. In other words, all in-core
dquots are on the lru and thus subject to reclaim at this point in time.

Second, the window is definitely not tiny. According to xfs_qm.h, we
have 30 dquots per 4k block. The only way a flush via dquot reclaim does
_not_ deadlock is if reclaim is the first to flush a dquot of a particular
buffer (such that it can queue the buffer). In most reclaim scans this
may work fine, but if dquots are flushed in ascending order, that means
if reclaim ever has to flush a dquot, it's pretty much caught up to the
flush walk or will intersect eventually.

Further, the flush walk code iterates 32 dquots at a time. That's
effectively a qi_tree_lock cycle and radix tree lookup per dquot buffer.
With thousands of dquots around in memory, I think that presents ample
opportunity for reclaim to kick in and flush a dquot that's not the
first to be flushed to its backing buffer. All it probably takes is for
the tree lock cycle and the radix tree lookup for the next batch of 32
dquots to occur before the end of a particular buffer to present an even
bigger window for reclaim to kick in and flush one of the remaining
dquots on that particular buffer. Again, that's going to occur for
almost every dquot buffer.

This is all kind of pointless analysis anyways because if the buffer
list is submitted before the flush walk, the flush walk in its current
form is not sufficient to serialize quotacheck against I/O completion of
underlying dquot buffers. dquot reclaim can now flush and free any
number of dquots for which quotacheck won't have any window into the
state of the underlying buffer I/O. This probably means the
xfs_qm_dquot_walk() (radix tree) based flush walk needs to be replaced
with something that visits every dquot (or dquot buffer) in the fs,
reallocating them if necessary (which adds more flush walk -> reclaim
contention than we've considered above), in order to serialize against
the underlying buffer. Either way, to just submit the buffer list before
the flush walk and make no other changes to account for the lost buffer
serialization is to break quotacheck.

> > I'm specifically trying to address the deadlock problem here. If we
> > don't want to bypass reclaim, the other options I see are to rework the
> > logic on the reclaim side to check the buffer state before we flush the
> > dquot, or push on the buffer from the quotacheck side like we used to
> > before the commit above.
> 
> The latter is effectively what flushing the buffer list before the
> flush walk is run does.
> 

The buffer push forcibly writes out the buffer for any dquot that is
found flush locked during the quotacheck flush walk and waits on the
flush lock again to serialize against the write and allow us to flush
such dquots to the buffer without a deadlock. Flushing the buffer list
before the flush walk doesn't protect us from deadlock once the flush
walk actually begins, and creates new serialization problems with
broader quotacheck that can cause corruption if not handled properly.
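
Roughly, that would bring back something along these lines in the flush walk
(an untested sketch of the idea, not the exact pre-43ff2122e6 code):

        if (!xfs_dqflock_nowait(dqp)) {
                /*
                 * Reclaim beat us to the flush lock. Force the backing
                 * buffer out to disk so its I/O completion drops the
                 * flush lock, then wait for it and flush our corrected
                 * values as usual.
                 */
                /* ... push dqp's backing buffer ... */
                xfs_dqflock(dqp);
        }
        error = xfs_qm_dqflush(dqp, &bp);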

I can try to put something together next week to resurrect the buffer
push similar to what was implemented historically prior to commit 43ff2122e6
if you're Ok with that approach for the locking problem (or have some
other suggestion that doesn't handwave away the serialization problem
;)... Otherwise, I guess we can leave things as they are. I don't want
to rely on things like indeterminate cache walk and reclaim ordering to
try and justify whether a clear serialization problem is going to
trigger or not in various high/low memory/dquot configurations. Even if
it doesn't trigger right now, if we change something innocuous down the
road like the dquot buffer size then all of this logic goes out the
window and we have a "new" deadlock to track down and resolve..

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

Patch

diff --git a/fs/xfs/xfs_qm.c b/fs/xfs/xfs_qm.c
index b669b12..7c44a6e 100644
--- a/fs/xfs/xfs_qm.c
+++ b/fs/xfs/xfs_qm.c
@@ -527,6 +527,8 @@  xfs_qm_shrink_scan(
 
 	if ((sc->gfp_mask & (__GFP_FS|__GFP_DIRECT_RECLAIM)) != (__GFP_FS|__GFP_DIRECT_RECLAIM))
 		return 0;
+	if (qi->qi_quotacheck)
+		return 0;
 
 	INIT_LIST_HEAD(&isol.buffers);
 	INIT_LIST_HEAD(&isol.dispose);
@@ -623,6 +625,7 @@  xfs_qm_init_quotainfo(
 	INIT_RADIX_TREE(&qinf->qi_gquota_tree, GFP_NOFS);
 	INIT_RADIX_TREE(&qinf->qi_pquota_tree, GFP_NOFS);
 	mutex_init(&qinf->qi_tree_lock);
+	qinf->qi_quotacheck = false;
 
 	/* mutex used to serialize quotaoffs */
 	mutex_init(&qinf->qi_quotaofflock);
@@ -1294,6 +1297,12 @@  xfs_qm_quotacheck(
 	ASSERT(uip || gip || pip);
 	ASSERT(XFS_IS_QUOTA_RUNNING(mp));
 
+	/*
+	 * Set a flag to lock out dquot reclaim during quotacheck. The dquot
+	 * shrinker can cause flush lock deadlocks by attempting to flush dquots
+	 * whose backing buffers are already on the quotacheck delwri queue.
+	 */
+	mp->m_quotainfo->qi_quotacheck = true;
 	xfs_notice(mp, "Quotacheck needed: Please wait.");
 
 	/*
@@ -1384,6 +1393,8 @@  xfs_qm_quotacheck(
 	mp->m_qflags |= flags;
 
  error_return:
+	mp->m_quotainfo->qi_quotacheck = false;
+
 	while (!list_empty(&buffer_list)) {
 		struct xfs_buf *bp =
 			list_first_entry(&buffer_list, struct xfs_buf, b_list);
diff --git a/fs/xfs/xfs_qm.h b/fs/xfs/xfs_qm.h
index 2975a82..d5a443d 100644
--- a/fs/xfs/xfs_qm.h
+++ b/fs/xfs/xfs_qm.h
@@ -89,6 +89,7 @@  typedef struct xfs_quotainfo {
 	struct xfs_def_quota	qi_grp_default;
 	struct xfs_def_quota	qi_prj_default;
 	struct shrinker  qi_shrinker;
+	bool			qi_quotacheck;	/* quotacheck is running */
 } xfs_quotainfo_t;
 
 static inline struct radix_tree_root *