
[2/3] xfs: push buffer of flush locked dquot to avoid quotacheck deadlock

Message ID 1487966001-63263-3-git-send-email-bfoster@redhat.com (mailing list archive)
State Superseded, archived

Commit Message

Brian Foster Feb. 24, 2017, 7:53 p.m. UTC
Reclaim during quotacheck can lead to deadlocks on the dquot flush
lock:

 - Quotacheck populates a local delwri queue with the physical dquot
   buffers.
 - Quotacheck performs the xfs_qm_dqusage_adjust() bulkstat and
   dirties all of the dquots.
 - Reclaim kicks in and attempts to flush a dquot whose buffer is
   already queued on the quotacheck queue. The flush succeeds but
   queueing to the reclaim delwri queue fails as the backing buffer is
   already queued. The flush unlock is now deferred to I/O completion
   of the buffer from the quotacheck queue.
 - The dqadjust bulkstat continues and dirties the recently flushed
   dquot once again.
 - Quotacheck proceeds to the xfs_qm_flush_one() walk which requires
   the flush lock to update the backing buffers with the in-core
   recalculated values. It deadlocks on the redirtied dquot as the
   flush lock was already acquired by reclaim, but the buffer resides
   on the local delwri queue which isn't submitted until the end of
   quotacheck.

This is reproduced by running quotacheck on a filesystem with a
couple million inodes in low memory (512MB-1GB) situations. This is
a regression as of commit 43ff2122e6 ("xfs: on-stack delayed write
buffer lists"), which removed a trylock and buffer I/O submission
from the quotacheck dquot flush sequence.

Quotacheck first resets and collects the physical dquot buffers in a
delwri queue. Then, it traverses the filesystem inodes via bulkstat,
updates the in-core dquots, flushes the corrected dquots to the
backing buffers and finally submits the delwri queue for I/O. Since
the backing buffers are queued across the entire quotacheck
operation, dquot reclaim cannot possibly complete a dquot flush
before quotacheck completes.

Therefore, quotacheck must submit the buffer for I/O in order to
cycle the flush lock and flush the dirty in-core dquot to the
buffer. Add a delwri queue buffer push mechanism to submit an
individual buffer for I/O without losing the delwri queue status and
use it from quotacheck to avoid the deadlock. This restores
quotacheck behavior to what it was before the regression was
introduced.

Reported-by: Martin Svec <martin.svec@zoner.cz>
Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/xfs_buf.c   | 37 +++++++++++++++++++++++++++++++++++++
 fs/xfs/xfs_buf.h   |  1 +
 fs/xfs/xfs_qm.c    | 28 +++++++++++++++++++++++++++-
 fs/xfs/xfs_trace.h |  1 +
 4 files changed, 66 insertions(+), 1 deletion(-)
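
For orientation, the quotacheck sequence described in the commit message
can be sketched roughly as follows. This is illustrative pseudocode only;
the helper names are hypothetical and this is not the actual
xfs_qm_quotacheck() code:

static int
quotacheck_outline(
	struct xfs_mount	*mp)
{
	LIST_HEAD		(buffer_list);	/* local delwri queue, held open throughout */
	int			error;

	/* 1. Reset the on-disk dquot buffers and queue them all locally. */
	error = reset_and_collect_dquot_buffers(mp, &buffer_list);
	if (error)
		return error;

	/*
	 * 2. Bulkstat every inode and adjust (dirty) the in-core dquots.
	 *    Under memory pressure, dquot reclaim may flush a dquot here;
	 *    the flush lock then stays held because the backing buffer is
	 *    already queued above and reclaim can neither queue nor submit
	 *    it.
	 */
	error = bulkstat_and_adjust_dquots(mp);
	if (error)
		return error;

	/*
	 * 3. Flush each dirty in-core dquot to its backing buffer. This
	 *    requires the flush lock, which may still be held from step 2:
	 *    the deadlock addressed by this patch.
	 */
	error = flush_dirty_dquots(mp, &buffer_list);
	if (error)
		return error;

	/* 4. Finally submit the whole delwri queue for I/O. */
	return xfs_buf_delwri_submit(&buffer_list);
}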

Comments

Darrick J. Wong April 6, 2017, 10:34 p.m. UTC | #1
On Fri, Feb 24, 2017 at 02:53:20PM -0500, Brian Foster wrote:
> Reclaim during quotacheck can lead to deadlocks on the dquot flush
> lock:
> 
>  - Quotacheck populates a local delwri queue with the physical dquot
>    buffers.
>  - Quotacheck performs the xfs_qm_dqusage_adjust() bulkstat and
>    dirties all of the dquots.
>  - Reclaim kicks in and attempts to flush a dquot whose buffer is
>    already queued on the quotacheck queue. The flush succeeds but
>    queueing to the reclaim delwri queue fails as the backing buffer is
>    already queued. The flush unlock is now deferred to I/O completion
>    of the buffer from the quotacheck queue.

<typing aloud here, trying to follow along...>

If I'm reading this correctly, this is the _buf_delwri_queue call
towards the end of the _DQ_IS_DIRTY clause in _qm_dquot_isolate?  Once
the dquot is flushed we fail to add it to isol->buffers because it's
already on quotacheck's buffer_list, which means that the
_buf_delwri_submit call in _qm_shrink_scan never sees our dirty dquot.

Therefore, the dquot doesn't get written and the flush lock stays
locked...
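
For reference, the reclaim-side queueing is a no-op because of the
already-queued check at the top of xfs_buf_delwri_queue(). Roughly, and
paraphrased rather than quoted verbatim from the kernel:

bool
xfs_buf_delwri_queue(
	struct xfs_buf		*bp,
	struct list_head	*list)
{
	ASSERT(xfs_buf_islocked(bp));

	/*
	 * The buffer is already marked _XBF_DELWRI_Q, i.e. it is already on
	 * some delwri list - here, quotacheck's buffer_list - so the call
	 * returns false without touching 'list'. Reclaim never gets the
	 * buffer onto isol->buffers and nothing submits it until
	 * quotacheck's final delwri submit.
	 */
	if (bp->b_flags & _XBF_DELWRI_Q) {
		trace_xfs_buf_delwri_queued(bp, _RET_IP_);
		return false;
	}

	/* otherwise mark it, take a hold if needed and add it to 'list' */
	bp->b_flags |= _XBF_DELWRI_Q;
	if (list_empty(&bp->b_list)) {
		atomic_inc(&bp->b_hold);
		list_add_tail(&bp->b_list, list);
	}
	return true;
}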

>  - The dqadjust bulkstat continues and dirties the recently flushed
>    dquot once again.
>  - Quotacheck proceeds to the xfs_qm_flush_one() walk which requires
>    the flush lock to update the backing buffers with the in-core
>    recalculated values. It deadlocks on the redirtied dquot as the
>    flush lock was already acquired by reclaim, but the buffer resides
>    on the local delwri queue which isn't submitted until the end of
>    quotacheck.

...until quotacheck tries to relock it and deadlocks, like you said.
Ok, I think I understand the deadlock...

> This is reproduced by running quotacheck on a filesystem with a
> couple million inodes in low memory (512MB-1GB) situations. This is
> a regression as of commit 43ff2122e6 ("xfs: on-stack delayed write
> buffer lists"), which removed a trylock and buffer I/O submission
> from the quotacheck dquot flush sequence.
> 
> Quotacheck first resets and collects the physical dquot buffers in a
> delwri queue. Then, it traverses the filesystem inodes via bulkstat,
> updates the in-core dquots, flushes the corrected dquots to the
> backing buffers and finally submits the delwri queue for I/O. Since
> the backing buffers are queued across the entire quotacheck
> operation, dquot reclaim cannot possibly complete a dquot flush
> before quotacheck completes.
> 
> Therefore, quotacheck must submit the buffer for I/O in order to
> cycle the flush lock and flush the dirty in-core dquot to the
> buffer. Add a delwri queue buffer push mechanism to submit an
> individual buffer for I/O without losing the delwri queue status and
> use it from quotacheck to avoid the deadlock. This restores
> quotacheck behavior to as before the regression was introduced.

...and therefore why this patch changes quotacheck to check if the
dquot's flush lock is already held and if so, to call
_buf_delwri_submit_buffers so that the io completion will unlock the
flush lock and try again.

When we return to _qm_flush_one, the dquot is no longer dirty and we're
done.  I /think/ this looks ok, though I also think I want to hear if
either Christian Affolter or Martin Svec have had any problems with
these two patches applied.  My current feeling is that this is a bit
late for 4.12, but I'd certainly put it in -washme for wider testing.
If anyone thinks this needs to happen sooner, please speak up.

Sorry this one took me a while to get to. :/

Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>

--D

> Reported-by: Martin Svec <martin.svec@zoner.cz>
> Signed-off-by: Brian Foster <bfoster@redhat.com>
> ---
>  fs/xfs/xfs_buf.c   | 37 +++++++++++++++++++++++++++++++++++++
>  fs/xfs/xfs_buf.h   |  1 +
>  fs/xfs/xfs_qm.c    | 28 +++++++++++++++++++++++++++-
>  fs/xfs/xfs_trace.h |  1 +
>  4 files changed, 66 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> index e566510..e97cf56 100644
> --- a/fs/xfs/xfs_buf.c
> +++ b/fs/xfs/xfs_buf.c
> @@ -2011,6 +2011,43 @@ xfs_buf_delwri_submit(
>  	return error;
>  }
>  
> +int
> +xfs_buf_delwri_pushbuf(
> +	struct xfs_buf		*bp,
> +	struct list_head	*buffer_list)
> +{
> +	LIST_HEAD		(submit_list);
> +	int			error;
> +
> +	ASSERT(xfs_buf_islocked(bp));
> +	ASSERT(bp->b_flags & _XBF_DELWRI_Q);
> +
> +	trace_xfs_buf_delwri_pushbuf(bp, _RET_IP_);
> +
> +	/*
> +	 * Move the buffer to an empty list and submit. Pass the original list
> +	 * as the wait list so delwri submission moves the buf back to it before
> +	 * it is submitted (and thus before it is unlocked). This means the
> +	 * buffer cannot be placed on another list while we wait for it.
> +	 */
> +	list_move(&bp->b_list, &submit_list);
> +	xfs_buf_unlock(bp);
> +
> +	xfs_buf_delwri_submit_buffers(&submit_list, buffer_list);
> +
> +	/*
> +	 * Lock the buffer to wait for I/O completion. It's already held on the
> +	 * original list, so all we have to do is reset the delwri queue flag
> +	 * that was cleared by delwri submission.
> +	 */
> +	xfs_buf_lock(bp);
> +	error = bp->b_error;
> +	bp->b_flags |= _XBF_DELWRI_Q;
> +	xfs_buf_unlock(bp);
> +
> +	return error;
> +}
> +
>  int __init
>  xfs_buf_init(void)
>  {
> diff --git a/fs/xfs/xfs_buf.h b/fs/xfs/xfs_buf.h
> index 8a9d3a9..cd74216 100644
> --- a/fs/xfs/xfs_buf.h
> +++ b/fs/xfs/xfs_buf.h
> @@ -334,6 +334,7 @@ extern void xfs_buf_stale(struct xfs_buf *bp);
>  extern bool xfs_buf_delwri_queue(struct xfs_buf *, struct list_head *);
>  extern int xfs_buf_delwri_submit(struct list_head *);
>  extern int xfs_buf_delwri_submit_nowait(struct list_head *);
> +extern int xfs_buf_delwri_pushbuf(struct xfs_buf *, struct list_head *);
>  
>  /* Buffer Daemon Setup Routines */
>  extern int xfs_buf_init(void);
> diff --git a/fs/xfs/xfs_qm.c b/fs/xfs/xfs_qm.c
> index 4ff993c..3815ed3 100644
> --- a/fs/xfs/xfs_qm.c
> +++ b/fs/xfs/xfs_qm.c
> @@ -1247,6 +1247,7 @@ xfs_qm_flush_one(
>  	struct xfs_dquot	*dqp,
>  	void			*data)
>  {
> +	struct xfs_mount	*mp = dqp->q_mount;
>  	struct list_head	*buffer_list = data;
>  	struct xfs_buf		*bp = NULL;
>  	int			error = 0;
> @@ -1257,7 +1258,32 @@ xfs_qm_flush_one(
>  	if (!XFS_DQ_IS_DIRTY(dqp))
>  		goto out_unlock;
>  
> -	xfs_dqflock(dqp);
> +	/*
> +	 * The only way the dquot is already flush locked by the time quotacheck
> +	 * gets here is if reclaim flushed it before the dqadjust walk dirtied
> +	 * it for the final time. Quotacheck collects all dquot bufs in the
> +	 * local delwri queue before dquots are dirtied, so reclaim can't have
> +	 * possibly queued it for I/O. The only way out is to push the buffer to
> +	 * cycle the flush lock.
> +	 */
> +	if (!xfs_dqflock_nowait(dqp)) {
> +		/* buf is pinned in-core by delwri list */
> +		DEFINE_SINGLE_BUF_MAP(map, dqp->q_blkno,
> +				      mp->m_quotainfo->qi_dqchunklen);
> +		bp = _xfs_buf_find(mp->m_ddev_targp, &map, 1, 0, NULL);
> +		if (!bp) {
> +			error = -EINVAL;
> +			goto out_unlock;
> +		}
> +
> +		/* delwri_pushbuf drops the buf lock */
> +		xfs_buf_delwri_pushbuf(bp, buffer_list);
> +		xfs_buf_rele(bp);
> +
> +		error = -EAGAIN;
> +		goto out_unlock;
> +	}
> +
>  	error = xfs_qm_dqflush(dqp, &bp);
>  	if (error)
>  		goto out_unlock;
> diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
> index fb7555e..c8d28a9 100644
> --- a/fs/xfs/xfs_trace.h
> +++ b/fs/xfs/xfs_trace.h
> @@ -365,6 +365,7 @@ DEFINE_BUF_EVENT(xfs_buf_iowait_done);
>  DEFINE_BUF_EVENT(xfs_buf_delwri_queue);
>  DEFINE_BUF_EVENT(xfs_buf_delwri_queued);
>  DEFINE_BUF_EVENT(xfs_buf_delwri_split);
> +DEFINE_BUF_EVENT(xfs_buf_delwri_pushbuf);
>  DEFINE_BUF_EVENT(xfs_buf_get_uncached);
>  DEFINE_BUF_EVENT(xfs_buf_item_relse);
>  DEFINE_BUF_EVENT(xfs_buf_item_iodone_async);
> -- 
> 2.7.4
> 
Brian Foster April 7, 2017, 12:06 p.m. UTC | #2
On Thu, Apr 06, 2017 at 03:34:17PM -0700, Darrick J. Wong wrote:
> On Fri, Feb 24, 2017 at 02:53:20PM -0500, Brian Foster wrote:
> > Reclaim during quotacheck can lead to deadlocks on the dquot flush
> > lock:
> > 
> >  - Quotacheck populates a local delwri queue with the physical dquot
> >    buffers.
> >  - Quotacheck performs the xfs_qm_dqusage_adjust() bulkstat and
> >    dirties all of the dquots.
> >  - Reclaim kicks in and attempts to flush a dquot whose buffer is
> >    already queued on the quotacheck queue. The flush succeeds but
> >    queueing to the reclaim delwri queue fails as the backing buffer is
> >    already queued. The flush unlock is now deferred to I/O completion
> >    of the buffer from the quotacheck queue.
> 
> <typing aloud here, trying to follow along...>
> 
> If I'm reading this correctly, this is the _buf_delwri_queue call
> towards the end of the _DQ_IS_DIRTY clause in _qm_dquot_isolate?  Once
> the dquot is flushed we fail to add it to isol->buffers because it's
> already on quotacheck's buffer_list, which means that the
> _buf_delwri_submit call in _qm_shrink_scan never sees our dirty dquot.
> 
> Therefore, the dquot doesn't get written and the flush lock stays
> locked...
> 

Yes, exactly.

> >  - The dqadjust bulkstat continues and dirties the recently flushed
> >    dquot once again.
> >  - Quotacheck proceeds to the xfs_qm_flush_one() walk which requires
> >    the flush lock to update the backing buffers with the in-core
> >    recalculated values. It deadlocks on the redirtied dquot as the
> >    flush lock was already acquired by reclaim, but the buffer resides
> >    on the local delwri queue which isn't submitted until the end of
> >    quotacheck.
> 
> ...until quotacheck tries to relock it and deadlocks, like you said.
> Ok, I think I understand the deadlock...
> 

Yep.

> > This is reproduced by running quotacheck on a filesystem with a
> > couple million inodes in low memory (512MB-1GB) situations. This is
> > a regression as of commit 43ff2122e6 ("xfs: on-stack delayed write
> > buffer lists"), which removed a trylock and buffer I/O submission
> > from the quotacheck dquot flush sequence.
> > 
> > Quotacheck first resets and collects the physical dquot buffers in a
> > delwri queue. Then, it traverses the filesystem inodes via bulkstat,
> > updates the in-core dquots, flushes the corrected dquots to the
> > backing buffers and finally submits the delwri queue for I/O. Since
> > the backing buffers are queued across the entire quotacheck
> > operation, dquot reclaim cannot possibly complete a dquot flush
> > before quotacheck completes.
> > 
> > Therefore, quotacheck must submit the buffer for I/O in order to
> > cycle the flush lock and flush the dirty in-core dquot to the
> > buffer. Add a delwri queue buffer push mechanism to submit an
> > individual buffer for I/O without losing the delwri queue status and
> > use it from quotacheck to avoid the deadlock. This restores
> > quotacheck behavior to as before the regression was introduced.
> 
> ...and therefore why this patch changes quotacheck to check if the
> dquot's flush lock is already held and if so, to call
> _buf_delwri_submit_buffers so that the io completion will unlock the
> flush lock and try again.
> 

Yes, though only for the specific buffer.

> When we return to _qm_flush_one, the dquot is no longer dirty and we're
> done.  I /think/ this looks ok, though I also think I want to hear if
> either Christian Affolter or Martin Svec have had any problems with
> these two patches applied.  My current feeling is that this is a bit
> late for 4.12, but I'd certainly put it in -washme for wider testing.
> If anyone thinks this needs to happen sooner, please speak up.
> 

The dquot is still dirty and needs to be flushed on the second go around
(otherwise we wouldn't have needed the flush lock here on the first
try). The issue is basically that the dquot was adjusted (dirtied) at
some point during the bulkstat scan, reclaimed and flushed, and then
subsequently dirtied again such that another flush is required via
qm_flush_one(). The buffer push frees up the held flush lock so that
final flush can proceed.

BTW, what's the reasoning for being late to 4.12? I actually don't have
much of an opinion which release this goes into, so I guess my question
is more logistical due to the fact that up until recently I haven't paid
much attention to the upstream release cycles. ;P The 4.12 merge window
hasn't opened as we are still on 4.11-rc5, right? Is the concern then
that the merge window is close enough that this has missed 4.12
linux-next soak time?

Also, can you clarify the semantics of the -washme thing? ;) That
doesn't necessarily mean the patch has been "merged." Rather, it might
be suitable for the next for-next (4.13) provided nothing explodes by
then, correct? I'm also a little curious on what extra testing that
entails, if any, though perhaps this topic is better suited by an
independent thread..

> Sorry this one took me a while to get to. :/
> 
> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
> 

No problem. Thanks for review...

Brian

> --D
> 
> > Reported-by: Martin Svec <martin.svec@zoner.cz>
> > Signed-off-by: Brian Foster <bfoster@redhat.com>
> > ---
> >  fs/xfs/xfs_buf.c   | 37 +++++++++++++++++++++++++++++++++++++
> >  fs/xfs/xfs_buf.h   |  1 +
> >  fs/xfs/xfs_qm.c    | 28 +++++++++++++++++++++++++++-
> >  fs/xfs/xfs_trace.h |  1 +
> >  4 files changed, 66 insertions(+), 1 deletion(-)
> > 
> > diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> > index e566510..e97cf56 100644
> > --- a/fs/xfs/xfs_buf.c
> > +++ b/fs/xfs/xfs_buf.c
> > @@ -2011,6 +2011,43 @@ xfs_buf_delwri_submit(
> >  	return error;
> >  }
> >  
> > +int
> > +xfs_buf_delwri_pushbuf(
> > +	struct xfs_buf		*bp,
> > +	struct list_head	*buffer_list)
> > +{
> > +	LIST_HEAD		(submit_list);
> > +	int			error;
> > +
> > +	ASSERT(xfs_buf_islocked(bp));
> > +	ASSERT(bp->b_flags & _XBF_DELWRI_Q);
> > +
> > +	trace_xfs_buf_delwri_pushbuf(bp, _RET_IP_);
> > +
> > +	/*
> > +	 * Move the buffer to an empty list and submit. Pass the original list
> > +	 * as the wait list so delwri submission moves the buf back to it before
> > +	 * it is submitted (and thus before it is unlocked). This means the
> > +	 * buffer cannot be placed on another list while we wait for it.
> > +	 */
> > +	list_move(&bp->b_list, &submit_list);
> > +	xfs_buf_unlock(bp);
> > +
> > +	xfs_buf_delwri_submit_buffers(&submit_list, buffer_list);
> > +
> > +	/*
> > +	 * Lock the buffer to wait for I/O completion. It's already held on the
> > +	 * original list, so all we have to do is reset the delwri queue flag
> > +	 * that was cleared by delwri submission.
> > +	 */
> > +	xfs_buf_lock(bp);
> > +	error = bp->b_error;
> > +	bp->b_flags |= _XBF_DELWRI_Q;
> > +	xfs_buf_unlock(bp);
> > +
> > +	return error;
> > +}
> > +
> >  int __init
> >  xfs_buf_init(void)
> >  {
> > diff --git a/fs/xfs/xfs_buf.h b/fs/xfs/xfs_buf.h
> > index 8a9d3a9..cd74216 100644
> > --- a/fs/xfs/xfs_buf.h
> > +++ b/fs/xfs/xfs_buf.h
> > @@ -334,6 +334,7 @@ extern void xfs_buf_stale(struct xfs_buf *bp);
> >  extern bool xfs_buf_delwri_queue(struct xfs_buf *, struct list_head *);
> >  extern int xfs_buf_delwri_submit(struct list_head *);
> >  extern int xfs_buf_delwri_submit_nowait(struct list_head *);
> > +extern int xfs_buf_delwri_pushbuf(struct xfs_buf *, struct list_head *);
> >  
> >  /* Buffer Daemon Setup Routines */
> >  extern int xfs_buf_init(void);
> > diff --git a/fs/xfs/xfs_qm.c b/fs/xfs/xfs_qm.c
> > index 4ff993c..3815ed3 100644
> > --- a/fs/xfs/xfs_qm.c
> > +++ b/fs/xfs/xfs_qm.c
> > @@ -1247,6 +1247,7 @@ xfs_qm_flush_one(
> >  	struct xfs_dquot	*dqp,
> >  	void			*data)
> >  {
> > +	struct xfs_mount	*mp = dqp->q_mount;
> >  	struct list_head	*buffer_list = data;
> >  	struct xfs_buf		*bp = NULL;
> >  	int			error = 0;
> > @@ -1257,7 +1258,32 @@ xfs_qm_flush_one(
> >  	if (!XFS_DQ_IS_DIRTY(dqp))
> >  		goto out_unlock;
> >  
> > -	xfs_dqflock(dqp);
> > +	/*
> > +	 * The only way the dquot is already flush locked by the time quotacheck
> > +	 * gets here is if reclaim flushed it before the dqadjust walk dirtied
> > +	 * it for the final time. Quotacheck collects all dquot bufs in the
> > +	 * local delwri queue before dquots are dirtied, so reclaim can't have
> > +	 * possibly queued it for I/O. The only way out is to push the buffer to
> > +	 * cycle the flush lock.
> > +	 */
> > +	if (!xfs_dqflock_nowait(dqp)) {
> > +		/* buf is pinned in-core by delwri list */
> > +		DEFINE_SINGLE_BUF_MAP(map, dqp->q_blkno,
> > +				      mp->m_quotainfo->qi_dqchunklen);
> > +		bp = _xfs_buf_find(mp->m_ddev_targp, &map, 1, 0, NULL);
> > +		if (!bp) {
> > +			error = -EINVAL;
> > +			goto out_unlock;
> > +		}
> > +
> > +		/* delwri_pushbuf drops the buf lock */
> > +		xfs_buf_delwri_pushbuf(bp, buffer_list);
> > +		xfs_buf_rele(bp);
> > +
> > +		error = -EAGAIN;
> > +		goto out_unlock;
> > +	}
> > +
> >  	error = xfs_qm_dqflush(dqp, &bp);
> >  	if (error)
> >  		goto out_unlock;
> > diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
> > index fb7555e..c8d28a9 100644
> > --- a/fs/xfs/xfs_trace.h
> > +++ b/fs/xfs/xfs_trace.h
> > @@ -365,6 +365,7 @@ DEFINE_BUF_EVENT(xfs_buf_iowait_done);
> >  DEFINE_BUF_EVENT(xfs_buf_delwri_queue);
> >  DEFINE_BUF_EVENT(xfs_buf_delwri_queued);
> >  DEFINE_BUF_EVENT(xfs_buf_delwri_split);
> > +DEFINE_BUF_EVENT(xfs_buf_delwri_pushbuf);
> >  DEFINE_BUF_EVENT(xfs_buf_get_uncached);
> >  DEFINE_BUF_EVENT(xfs_buf_item_relse);
> >  DEFINE_BUF_EVENT(xfs_buf_item_iodone_async);
> > -- 
> > 2.7.4
> > 
Darrick J. Wong April 7, 2017, 8:07 p.m. UTC | #3
On Fri, Apr 07, 2017 at 08:06:31AM -0400, Brian Foster wrote:
> On Thu, Apr 06, 2017 at 03:34:17PM -0700, Darrick J. Wong wrote:
> > On Fri, Feb 24, 2017 at 02:53:20PM -0500, Brian Foster wrote:
> > > Reclaim during quotacheck can lead to deadlocks on the dquot flush
> > > lock:
> > > 
> > >  - Quotacheck populates a local delwri queue with the physical dquot
> > >    buffers.
> > >  - Quotacheck performs the xfs_qm_dqusage_adjust() bulkstat and
> > >    dirties all of the dquots.
> > >  - Reclaim kicks in and attempts to flush a dquot whose buffer is
> > >    already queued on the quotacheck queue. The flush succeeds but
> > >    queueing to the reclaim delwri queue fails as the backing buffer is
> > >    already queued. The flush unlock is now deferred to I/O completion
> > >    of the buffer from the quotacheck queue.
> > 
> > <typing aloud here, trying to follow along...>
> > 
> > If I'm reading this correctly, this is the _buf_delwri_queue call
> > towards the end of the _DQ_IS_DIRTY clause in _qm_dquot_isolate?  Once
> > the dquot is flushed we fail to add it to isol->buffers because it's
> > already on quotacheck's buffer_list, which means that the
> > _buf_delwri_submit call in _qm_shrink_scan never sees our dirty dquot.
> > 
> > Therefore, the dquot doesn't get written and the flush lock stays
> > locked...
> > 
> 
> Yes, exactly.
> 
> > >  - The dqadjust bulkstat continues and dirties the recently flushed
> > >    dquot once again.
> > >  - Quotacheck proceeds to the xfs_qm_flush_one() walk which requires
> > >    the flush lock to update the backing buffers with the in-core
> > >    recalculated values. It deadlocks on the redirtied dquot as the
> > >    flush lock was already acquired by reclaim, but the buffer resides
> > >    on the local delwri queue which isn't submitted until the end of
> > >    quotacheck.
> > 
> > ...until quotacheck tries to relock it and deadlocks, like you said.
> > Ok, I think I understand the deadlock...
> > 
> 
> Yep.
> 
> > > This is reproduced by running quotacheck on a filesystem with a
> > > couple million inodes in low memory (512MB-1GB) situations. This is
> > > a regression as of commit 43ff2122e6 ("xfs: on-stack delayed write
> > > buffer lists"), which removed a trylock and buffer I/O submission
> > > from the quotacheck dquot flush sequence.
> > > 
> > > Quotacheck first resets and collects the physical dquot buffers in a
> > > delwri queue. Then, it traverses the filesystem inodes via bulkstat,
> > > updates the in-core dquots, flushes the corrected dquots to the
> > > backing buffers and finally submits the delwri queue for I/O. Since
> > > the backing buffers are queued across the entire quotacheck
> > > operation, dquot reclaim cannot possibly complete a dquot flush
> > > before quotacheck completes.
> > > 
> > > Therefore, quotacheck must submit the buffer for I/O in order to
> > > cycle the flush lock and flush the dirty in-core dquot to the
> > > buffer. Add a delwri queue buffer push mechanism to submit an
> > > individual buffer for I/O without losing the delwri queue status and
> > > use it from quotacheck to avoid the deadlock. This restores
> > > quotacheck behavior to as before the regression was introduced.
> > 
> > ...and therefore why this patch changes quotacheck to check if the
> > dquot's flush lock is already held and if so, to call
> > _buf_delwri_submit_buffers so that the io completion will unlock the
> > flush lock and try again.
> > 
> 
> Yes, though only for the specific buffer.
> 
> > When we return to _qm_flush_one, the dquot is no longer dirty and we're
> > done.  I /think/ this looks ok, though I also think I want to hear if
> > either Christian Affolter or Martin Svec have had any problems with
> > these two patches applied.  My current feeling is that this is a bit
> > late for 4.12, but I'd certainly put it in -washme for wider testing.
> > If anyone thinks this needs to happen sooner, please speak up.
> > 
> 
> The dquot is still dirty and needs to be flushed on the second go around
> (otherwise we wouldn't have needed the flush lock here on the first
> try). The issue is basically that the dquot was adjusted (dirtied) at
> some point during the bulkstat scan, reclaimed and flushed, and then
> subsequently dirtied again such that another flush is required via
> qm_flush_one(). The buffer push frees up the held flush lock so that
> final flush can proceed.

Ok.  I'd thought that the dquot could still be dirty when we got there.

> BTW, what's the reasoning for being late to 4.12? I actually don't have
> much of an opinion which release this goes into, so I guess my question
> is more logistical due to the fact that up until recently I haven't paid
> much attention to the upstream release cycles. ;P The 4.12 merge window
> hasn't opened as we are still on 4.11-rc5, right? Is the concern then
> that the merge window is close enough that this has missed 4.12
> linux-next soak time?

We're still at least a week or two away from the merge window opening,
but I want to hear from the two bug reporters that this patch series has
made their symptoms go away, and that they've mounted enough times to be
reasonably confident that they haven't simply gotten lucky and not hit
the deadlock.  I don't know that both of those things will happen in the
time we have left, and seeing as the frequency of reports is pretty low,
it doesn't seem like we have to rush it into 4.12.

> Also, can you clarify the semantics of the -washme thing? ;) That
> doesn't necessarily mean the patch has been "merged." Rather, it might
> be suitable for the next for-next (4.13) provided nothing explodes by
> then, correct?

Yep.

> I'm also a little curious on what extra testing that entails, if any,
> though perhaps this topic is better suited by an independent thread..

Same testing as for-next gets (nightly auto group runs, ltp every now
and then, various $magic workloads); it's really just a place to land
patches that have been code reviewed but I'm not yet confident enough
about to put into for-next for the next cycle.

--D

> 
> > Sorry this one took me a while to get to. :/
> > 
> > Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
> > 
> 
> No problem. Thanks for review...
> 
> Brian
> 
> > --D
> > 
> > > Reported-by: Martin Svec <martin.svec@zoner.cz>
> > > Signed-off-by: Brian Foster <bfoster@redhat.com>
> > > ---
> > >  fs/xfs/xfs_buf.c   | 37 +++++++++++++++++++++++++++++++++++++
> > >  fs/xfs/xfs_buf.h   |  1 +
> > >  fs/xfs/xfs_qm.c    | 28 +++++++++++++++++++++++++++-
> > >  fs/xfs/xfs_trace.h |  1 +
> > >  4 files changed, 66 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> > > index e566510..e97cf56 100644
> > > --- a/fs/xfs/xfs_buf.c
> > > +++ b/fs/xfs/xfs_buf.c
> > > @@ -2011,6 +2011,43 @@ xfs_buf_delwri_submit(
> > >  	return error;
> > >  }
> > >  
> > > +int
> > > +xfs_buf_delwri_pushbuf(
> > > +	struct xfs_buf		*bp,
> > > +	struct list_head	*buffer_list)
> > > +{
> > > +	LIST_HEAD		(submit_list);
> > > +	int			error;
> > > +
> > > +	ASSERT(xfs_buf_islocked(bp));
> > > +	ASSERT(bp->b_flags & _XBF_DELWRI_Q);
> > > +
> > > +	trace_xfs_buf_delwri_pushbuf(bp, _RET_IP_);
> > > +
> > > +	/*
> > > +	 * Move the buffer to an empty list and submit. Pass the original list
> > > +	 * as the wait list so delwri submission moves the buf back to it before
> > > +	 * it is submitted (and thus before it is unlocked). This means the
> > > +	 * buffer cannot be placed on another list while we wait for it.
> > > +	 */
> > > +	list_move(&bp->b_list, &submit_list);
> > > +	xfs_buf_unlock(bp);
> > > +
> > > +	xfs_buf_delwri_submit_buffers(&submit_list, buffer_list);
> > > +
> > > +	/*
> > > +	 * Lock the buffer to wait for I/O completion. It's already held on the
> > > +	 * original list, so all we have to do is reset the delwri queue flag
> > > +	 * that was cleared by delwri submission.
> > > +	 */
> > > +	xfs_buf_lock(bp);
> > > +	error = bp->b_error;
> > > +	bp->b_flags |= _XBF_DELWRI_Q;
> > > +	xfs_buf_unlock(bp);
> > > +
> > > +	return error;
> > > +}
> > > +
> > >  int __init
> > >  xfs_buf_init(void)
> > >  {
> > > diff --git a/fs/xfs/xfs_buf.h b/fs/xfs/xfs_buf.h
> > > index 8a9d3a9..cd74216 100644
> > > --- a/fs/xfs/xfs_buf.h
> > > +++ b/fs/xfs/xfs_buf.h
> > > @@ -334,6 +334,7 @@ extern void xfs_buf_stale(struct xfs_buf *bp);
> > >  extern bool xfs_buf_delwri_queue(struct xfs_buf *, struct list_head *);
> > >  extern int xfs_buf_delwri_submit(struct list_head *);
> > >  extern int xfs_buf_delwri_submit_nowait(struct list_head *);
> > > +extern int xfs_buf_delwri_pushbuf(struct xfs_buf *, struct list_head *);
> > >  
> > >  /* Buffer Daemon Setup Routines */
> > >  extern int xfs_buf_init(void);
> > > diff --git a/fs/xfs/xfs_qm.c b/fs/xfs/xfs_qm.c
> > > index 4ff993c..3815ed3 100644
> > > --- a/fs/xfs/xfs_qm.c
> > > +++ b/fs/xfs/xfs_qm.c
> > > @@ -1247,6 +1247,7 @@ xfs_qm_flush_one(
> > >  	struct xfs_dquot	*dqp,
> > >  	void			*data)
> > >  {
> > > +	struct xfs_mount	*mp = dqp->q_mount;
> > >  	struct list_head	*buffer_list = data;
> > >  	struct xfs_buf		*bp = NULL;
> > >  	int			error = 0;
> > > @@ -1257,7 +1258,32 @@ xfs_qm_flush_one(
> > >  	if (!XFS_DQ_IS_DIRTY(dqp))
> > >  		goto out_unlock;
> > >  
> > > -	xfs_dqflock(dqp);
> > > +	/*
> > > +	 * The only way the dquot is already flush locked by the time quotacheck
> > > +	 * gets here is if reclaim flushed it before the dqadjust walk dirtied
> > > +	 * it for the final time. Quotacheck collects all dquot bufs in the
> > > +	 * local delwri queue before dquots are dirtied, so reclaim can't have
> > > +	 * possibly queued it for I/O. The only way out is to push the buffer to
> > > +	 * cycle the flush lock.
> > > +	 */
> > > +	if (!xfs_dqflock_nowait(dqp)) {
> > > +		/* buf is pinned in-core by delwri list */
> > > +		DEFINE_SINGLE_BUF_MAP(map, dqp->q_blkno,
> > > +				      mp->m_quotainfo->qi_dqchunklen);
> > > +		bp = _xfs_buf_find(mp->m_ddev_targp, &map, 1, 0, NULL);
> > > +		if (!bp) {
> > > +			error = -EINVAL;
> > > +			goto out_unlock;
> > > +		}
> > > +
> > > +		/* delwri_pushbuf drops the buf lock */
> > > +		xfs_buf_delwri_pushbuf(bp, buffer_list);
> > > +		xfs_buf_rele(bp);
> > > +
> > > +		error = -EAGAIN;
> > > +		goto out_unlock;
> > > +	}
> > > +
> > >  	error = xfs_qm_dqflush(dqp, &bp);
> > >  	if (error)
> > >  		goto out_unlock;
> > > diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
> > > index fb7555e..c8d28a9 100644
> > > --- a/fs/xfs/xfs_trace.h
> > > +++ b/fs/xfs/xfs_trace.h
> > > @@ -365,6 +365,7 @@ DEFINE_BUF_EVENT(xfs_buf_iowait_done);
> > >  DEFINE_BUF_EVENT(xfs_buf_delwri_queue);
> > >  DEFINE_BUF_EVENT(xfs_buf_delwri_queued);
> > >  DEFINE_BUF_EVENT(xfs_buf_delwri_split);
> > > +DEFINE_BUF_EVENT(xfs_buf_delwri_pushbuf);
> > >  DEFINE_BUF_EVENT(xfs_buf_get_uncached);
> > >  DEFINE_BUF_EVENT(xfs_buf_item_relse);
> > >  DEFINE_BUF_EVENT(xfs_buf_item_iodone_async);
> > > -- 
> > > 2.7.4
> > > 
Dave Chinner April 10, 2017, 5 a.m. UTC | #4
On Fri, Feb 24, 2017 at 02:53:20PM -0500, Brian Foster wrote:
> Reclaim during quotacheck can lead to deadlocks on the dquot flush
> lock:
> 
>  - Quotacheck populates a local delwri queue with the physical dquot
>    buffers.
>  - Quotacheck performs the xfs_qm_dqusage_adjust() bulkstat and
>    dirties all of the dquots.
>  - Reclaim kicks in and attempts to flush a dquot whose buffer is
>    already queued on the quotacheck queue. The flush succeeds but
>    queueing to the reclaim delwri queue fails as the backing buffer is
>    already queued. The flush unlock is now deferred to I/O completion
>    of the buffer from the quotacheck queue.
>  - The dqadjust bulkstat continues and dirties the recently flushed
>    dquot once again.
>  - Quotacheck proceeds to the xfs_qm_flush_one() walk which requires
>    the flush lock to update the backing buffers with the in-core
>    recalculated values. It deadlocks on the redirtied dquot as the
>    flush lock was already acquired by reclaim, but the buffer resides
>    on the local delwri queue which isn't submitted until the end of
>    quotacheck.
> 
> This is reproduced by running quotacheck on a filesystem with a
> couple million inodes in low memory (512MB-1GB) situations. This is
> a regression as of commit 43ff2122e6 ("xfs: on-stack delayed write
> buffer lists"), which removed a trylock and buffer I/O submission
> from the quotacheck dquot flush sequence.
> 
> Quotacheck first resets and collects the physical dquot buffers in a
> delwri queue. Then, it traverses the filesystem inodes via bulkstat,
> updates the in-core dquots, flushes the corrected dquots to the
> backing buffers and finally submits the delwri queue for I/O. Since
> the backing buffers are queued across the entire quotacheck
> operation, dquot reclaim cannot possibly complete a dquot flush
> before quotacheck completes.
> 
> Therefore, quotacheck must submit the buffer for I/O in order to
> cycle the flush lock and flush the dirty in-core dquot to the
> buffer. Add a delwri queue buffer push mechanism to submit an
> individual buffer for I/O without losing the delwri queue status and
> use it from quotacheck to avoid the deadlock. This restores
> quotacheck behavior to as before the regression was introduced.

While it may fix the problem, the solution gives me the heebee
jeebees. I'm on holidays, so I haven't bothered to spend the hours
necessary to answer these questions, but to give you an idea, this
was what I thought as I read the patch.  i.e. I have concerns about
whether....

> Reported-by: Martin Svec <martin.svec@zoner.cz>
> Signed-off-by: Brian Foster <bfoster@redhat.com>
> ---
>  fs/xfs/xfs_buf.c   | 37 +++++++++++++++++++++++++++++++++++++
>  fs/xfs/xfs_buf.h   |  1 +
>  fs/xfs/xfs_qm.c    | 28 +++++++++++++++++++++++++++-
>  fs/xfs/xfs_trace.h |  1 +
>  4 files changed, 66 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> index e566510..e97cf56 100644
> --- a/fs/xfs/xfs_buf.c
> +++ b/fs/xfs/xfs_buf.c
> @@ -2011,6 +2011,43 @@ xfs_buf_delwri_submit(
>  	return error;
>  }
>  
> +int
> +xfs_buf_delwri_pushbuf(
> +	struct xfs_buf		*bp,
> +	struct list_head	*buffer_list)
> +{
> +	LIST_HEAD		(submit_list);
> +	int			error;
> +
> +	ASSERT(xfs_buf_islocked(bp));
> +	ASSERT(bp->b_flags & _XBF_DELWRI_Q);
> +
> +	trace_xfs_buf_delwri_pushbuf(bp, _RET_IP_);
> +
> +	/*
> +	 * Move the buffer to an empty list and submit. Pass the original list
> +	 * as the wait list so delwri submission moves the buf back to it before
> +	 * it is submitted (and thus before it is unlocked). This means the
> +	 * buffer cannot be placed on another list while we wait for it.
> +	 */
> +	list_move(&bp->b_list, &submit_list);
> +	xfs_buf_unlock(bp);

.... this is safe/racy as we may have just moved it off the delwri
queue without changing state, reference counts, etc?

> +
> +	xfs_buf_delwri_submit_buffers(&submit_list, buffer_list);

.... using a caller supplied delwri buffer list as the buffer IO
wait list destination is making big assumptions about the internal
use of the wait list? e.g that xfs_buf_delwri_submit_buffers() does
not initialise the list_head before use...

.... we should be doing IO submission while holding other things on
the delwri list and unknown caller locks?

.... we have all the buffer reference counts we need to make this
work correctly?

> +	/*
> +	 * Lock the buffer to wait for I/O completion. It's already held on the
> +	 * original list, so all we have to do is reset the delwri queue flag
> +	 * that was cleared by delwri submission.
> +	 */
> +	xfs_buf_lock(bp);
> +	error = bp->b_error;
> +	bp->b_flags |= _XBF_DELWRI_Q;
> +	xfs_buf_unlock(bp);

.... this is racy w.r.t. the buffer going back onto the
buffer list without holding the buffer lock, or that the
_XBF_DELWRI_Q setting/clearing is not atomic w.r.t. the delwri queue
manipulations (i.e. can now be on the delwri list but not have
_XBF_DELWRI_Q set)?

.... the error is left on the buffer so it gets tripped over when
it is next accessed?

.... that the buffer locking is unbalanced for some undocumented
reason?

> +	return error;
> +}
> +
>  int __init
>  xfs_buf_init(void)
>  {
> diff --git a/fs/xfs/xfs_buf.h b/fs/xfs/xfs_buf.h
> index 8a9d3a9..cd74216 100644
> --- a/fs/xfs/xfs_buf.h
> +++ b/fs/xfs/xfs_buf.h
> @@ -334,6 +334,7 @@ extern void xfs_buf_stale(struct xfs_buf *bp);
>  extern bool xfs_buf_delwri_queue(struct xfs_buf *, struct list_head *);
>  extern int xfs_buf_delwri_submit(struct list_head *);
>  extern int xfs_buf_delwri_submit_nowait(struct list_head *);
> +extern int xfs_buf_delwri_pushbuf(struct xfs_buf *, struct list_head *);
>  
>  /* Buffer Daemon Setup Routines */
>  extern int xfs_buf_init(void);
> diff --git a/fs/xfs/xfs_qm.c b/fs/xfs/xfs_qm.c
> index 4ff993c..3815ed3 100644
> --- a/fs/xfs/xfs_qm.c
> +++ b/fs/xfs/xfs_qm.c
> @@ -1247,6 +1247,7 @@ xfs_qm_flush_one(
>  	struct xfs_dquot	*dqp,
>  	void			*data)
>  {
> +	struct xfs_mount	*mp = dqp->q_mount;
>  	struct list_head	*buffer_list = data;
>  	struct xfs_buf		*bp = NULL;
>  	int			error = 0;
> @@ -1257,7 +1258,32 @@ xfs_qm_flush_one(
>  	if (!XFS_DQ_IS_DIRTY(dqp))
>  		goto out_unlock;
>  
> -	xfs_dqflock(dqp);
> +	/*
> +	 * The only way the dquot is already flush locked by the time quotacheck
> +	 * gets here is if reclaim flushed it before the dqadjust walk dirtied
> +	 * it for the final time. Quotacheck collects all dquot bufs in the
> +	 * local delwri queue before dquots are dirtied, so reclaim can't have
> +	 * possibly queued it for I/O. The only way out is to push the buffer to
> +	 * cycle the flush lock.
> +	 */
> +	if (!xfs_dqflock_nowait(dqp)) {
> +		/* buf is pinned in-core by delwri list */
> +		DEFINE_SINGLE_BUF_MAP(map, dqp->q_blkno,
> +				      mp->m_quotainfo->qi_dqchunklen);
> +		bp = _xfs_buf_find(mp->m_ddev_targp, &map, 1, 0, NULL);
> +		if (!bp) {
> +			error = -EINVAL;
> +			goto out_unlock;
> +		}
> +
> +		/* delwri_pushbuf drops the buf lock */
> +		xfs_buf_delwri_pushbuf(bp, buffer_list);

Ummm - you threw away the returned error....

> +		xfs_buf_rele(bp);

And despite the comment, I think this is simply wrong. We try really
hard to maintain balanced locking, and as such xfs_buf_relse() is
what goes along with _xfs_buf_find(). i.e. we are returned a locked
buffer with a reference by xfs_buf_find(), but xfs_buf_rele() only
drops the reference.

So if I look at this code in isolation, it looks like it leaks a
buffer lock, and now I have to go read other code to understand why
it doesn't and I'm left to wonder why it was implemented this
way....

Cheers,

Dave.
Brian Foster April 10, 2017, 2:15 p.m. UTC | #5
On Mon, Apr 10, 2017 at 03:00:01PM +1000, Dave Chinner wrote:
> On Fri, Feb 24, 2017 at 02:53:20PM -0500, Brian Foster wrote:
> > Reclaim during quotacheck can lead to deadlocks on the dquot flush
> > lock:
> > 
> >  - Quotacheck populates a local delwri queue with the physical dquot
> >    buffers.
> >  - Quotacheck performs the xfs_qm_dqusage_adjust() bulkstat and
> >    dirties all of the dquots.
> >  - Reclaim kicks in and attempts to flush a dquot whose buffer is
> >    already queued on the quotacheck queue. The flush succeeds but
> >    queueing to the reclaim delwri queue fails as the backing buffer is
> >    already queued. The flush unlock is now deferred to I/O completion
> >    of the buffer from the quotacheck queue.
> >  - The dqadjust bulkstat continues and dirties the recently flushed
> >    dquot once again.
> >  - Quotacheck proceeds to the xfs_qm_flush_one() walk which requires
> >    the flush lock to update the backing buffers with the in-core
> >    recalculated values. It deadlocks on the redirtied dquot as the
> >    flush lock was already acquired by reclaim, but the buffer resides
> >    on the local delwri queue which isn't submitted until the end of
> >    quotacheck.
> > 
> > This is reproduced by running quotacheck on a filesystem with a
> > couple million inodes in low memory (512MB-1GB) situations. This is
> > a regression as of commit 43ff2122e6 ("xfs: on-stack delayed write
> > buffer lists"), which removed a trylock and buffer I/O submission
> > from the quotacheck dquot flush sequence.
> > 
> > Quotacheck first resets and collects the physical dquot buffers in a
> > delwri queue. Then, it traverses the filesystem inodes via bulkstat,
> > updates the in-core dquots, flushes the corrected dquots to the
> > backing buffers and finally submits the delwri queue for I/O. Since
> > the backing buffers are queued across the entire quotacheck
> > operation, dquot reclaim cannot possibly complete a dquot flush
> > before quotacheck completes.
> > 
> > Therefore, quotacheck must submit the buffer for I/O in order to
> > cycle the flush lock and flush the dirty in-core dquot to the
> > buffer. Add a delwri queue buffer push mechanism to submit an
> > individual buffer for I/O without losing the delwri queue status and
> > use it from quotacheck to avoid the deadlock. This restores
> > quotacheck behavior to as before the regression was introduced.
> 
> While it may fix the problem, the solution gives me the heebee
> jeebees. I'm on holidays, so I haven't bothered to spend the hours
> necessary to answer these questions, but to give you an idea, this
> was what I thought as I read the patch.  i.e. I have concerns about
> whether....
> 

Thanks for looking at this. FWIW, this is based on a preexisting
solution prior to the commit referenced above (obviously this doesn't
mean the code is correct or that this is the best solution).

> > Reported-by: Martin Svec <martin.svec@zoner.cz>
> > Signed-off-by: Brian Foster <bfoster@redhat.com>
> > ---
> >  fs/xfs/xfs_buf.c   | 37 +++++++++++++++++++++++++++++++++++++
> >  fs/xfs/xfs_buf.h   |  1 +
> >  fs/xfs/xfs_qm.c    | 28 +++++++++++++++++++++++++++-
> >  fs/xfs/xfs_trace.h |  1 +
> >  4 files changed, 66 insertions(+), 1 deletion(-)
> > 
> > diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> > index e566510..e97cf56 100644
> > --- a/fs/xfs/xfs_buf.c
> > +++ b/fs/xfs/xfs_buf.c
> > @@ -2011,6 +2011,43 @@ xfs_buf_delwri_submit(
> >  	return error;
> >  }
> >  
> > +int
> > +xfs_buf_delwri_pushbuf(
> > +	struct xfs_buf		*bp,
> > +	struct list_head	*buffer_list)
> > +{
> > +	LIST_HEAD		(submit_list);
> > +	int			error;
> > +
> > +	ASSERT(xfs_buf_islocked(bp));
> > +	ASSERT(bp->b_flags & _XBF_DELWRI_Q);
> > +
> > +	trace_xfs_buf_delwri_pushbuf(bp, _RET_IP_);
> > +
> > +	/*
> > +	 * Move the buffer to an empty list and submit. Pass the original list
> > +	 * as the wait list so delwri submission moves the buf back to it before
> > +	 * it is submitted (and thus before it is unlocked). This means the
> > +	 * buffer cannot be placed on another list while we wait for it.
> > +	 */
> > +	list_move(&bp->b_list, &submit_list);
> > +	xfs_buf_unlock(bp);
> 
> .... this is safe/racy as we may have just moved it off the delwri
> queue without changing state, reference counts, etc?
> 

Racy with..? The buffer is effectively still on the delwri queue here.
We've just replaced the caller queue with a local queue to reuse the
existing submission infrastructure. IOW, we've just changed the queue
that the buffer is on.

> > +
> > +	xfs_buf_delwri_submit_buffers(&submit_list, buffer_list);
> 
> .... using a caller supplied delwri buffer list as the buffer IO
> wait list destination is making big assumptions about the internal
> use of the wait list? e.g that xfs_buf_delwri_submit_buffers() does
> not initialise the list_head before use...
> 

This is the scope at which the wait list is defined/initialized. E.g.,
_delwri_submit() initializes its local wait list and passes it down for
_delwri_submit_buffers() to use. This basically ensures that the buffer
is placed back on the original delwri queue because the latter moves the
buffer over under lock, before it is submitted for I/O.

We could probably use a local wait_list just the same if that is
preferred, and explicitly add the buffer back from the wait_list to the
original delwri queue. FWIW, it's not clear to me whether you're asking
about the use of the wait_list here or indicating preference.
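
A rough sketch of that alternative, purely for illustration (the function
name is made up, the reference/lock handling follows the description
above and is untested):

int
xfs_buf_delwri_pushbuf_alt(
	struct xfs_buf		*bp,
	struct list_head	*buffer_list)
{
	LIST_HEAD		(submit_list);
	LIST_HEAD		(wait_list);
	int			error;

	ASSERT(xfs_buf_islocked(bp));
	ASSERT(bp->b_flags & _XBF_DELWRI_Q);

	/* move the buffer off the caller's queue and submit it alone */
	list_move(&bp->b_list, &submit_list);
	xfs_buf_unlock(bp);

	/* the buffer ends up on the private wait_list with a wait_list hold */
	xfs_buf_delwri_submit_buffers(&submit_list, &wait_list);

	/*
	 * Cycle the buffer lock to wait for I/O completion, then explicitly
	 * requeue the buffer on the caller's delwri queue. The wait_list
	 * hold carries over as the requeued delwri reference.
	 */
	xfs_buf_lock(bp);
	error = bp->b_error;
	list_move_tail(&bp->b_list, buffer_list);
	bp->b_flags |= _XBF_DELWRI_Q;
	xfs_buf_unlock(bp);

	return error;
}

The main difference from the posted version is the explicit requeue step;
passing buffer_list as the wait list instead lets
xfs_buf_delwri_submit_buffers() put the buffer back on the original queue
before it is unlocked, as the comment in the patch notes.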

> .... we should be doing IO submission while holding other things on
> the delwri list and unknown caller locks?
> 

I'm not following... this is effectively a move/submit of a delwri queue
item to another delwri queue. If the delwri queue is by design a local
data structure, then only one thread should be processing the queue at a
time. Why would this particular action be problematic?

Technically, we could rework this into something like a
delwri_queue_move()/delwri_submit() sequence, but the problem with that
is we lose the queue-populated state of the buffer for a period of time
and thus somebody else (i.e., reclaim) can jump in and queue it out from
underneath the caller. The purpose of this mechanism is to preserve
original delwri queue ownership of the buffer.

> .... we have all the buffer reference counts we need to make this
> work correctly?
> 

AFAICT, we follow the existing delwri queue rules with regard to locking
and reference counting. A buffer is queued under the buf lock and the
queue takes a reference on the buffer. Queue submission locks the
buffer, acquires a wait_list reference and moves to the wait list if one
is specified and submits the buffer. Buffer I/O completion drops the
original queue reference. If the wait_list was specified, we cycle the
buffer lock (to wait for I/O completion), remove from the wait_list and
drop the wait_list reference.

The pushbuf() call moves the buffer to the submit_list under the buffer
lock, which is effectively another local delwri queue. This takes
ownership of the original delwri queue reference. Queue submission
occurs as above: the buffer is moved to the wait_list, an associated
wait_list reference is acquired and the buffer is submitted. I/O
completion drops the original delwri queue reference as before, so we
are in the same state as delwri_submit() up through I/O completion.

We again cycle the buf lock to wait on I/O completion and process an
error. Rather than remove from the wait_list and drop the reference,
however, pushbuf() has the buffer back on the original delwri queue with
an associated reference. The only explicit state change here is the
DELWRI_Q flag, which is reset.

So am I missing something here with regard to proper reference
counting..? If so, please elaborate and I'll try to dig into it.
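
To map the above onto the posted code, here is the pushbuf sequence again
with the lock/reference state noted at each step (the code is the
patch's; the annotations paraphrase the explanation above):

/* entry: buffer locked by the caller, on buffer_list with a delwri reference */
list_move(&bp->b_list, &submit_list);	/* same reference, now on the local queue */
xfs_buf_unlock(bp);

/*
 * Submission relocks the buffer, takes a wait_list hold, moves it back
 * to buffer_list (passed as the wait list) and issues the async write.
 * I/O completion drops the original delwri queue reference.
 */
xfs_buf_delwri_submit_buffers(&submit_list, buffer_list);

/*
 * Locking here waits for I/O completion. The buffer is already back on
 * buffer_list and the wait_list hold has become the new delwri queue
 * reference, so only the _XBF_DELWRI_Q flag needs restoring.
 */
xfs_buf_lock(bp);
error = bp->b_error;
bp->b_flags |= _XBF_DELWRI_Q;
xfs_buf_unlock(bp);
/* exit: buffer unlocked, back on buffer_list with a single delwri reference */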

> > +	/*
> > +	 * Lock the buffer to wait for I/O completion. It's already held on the
> > +	 * original list, so all we have to do is reset the delwri queue flag
> > +	 * that was cleared by delwri submission.
> > +	 */
> > +	xfs_buf_lock(bp);
> > +	error = bp->b_error;
> > +	bp->b_flags |= _XBF_DELWRI_Q;
> > +	xfs_buf_unlock(bp);
> 
> .... this is racy w.r.t. the buffer going back onto the
> buffer list without holding the buffer lock, or that the
> _XBF_DELWRI_Q setting/clearing is not atomic w.r.t. the delwri queue
> manipulations (i.e. can now be on the delwri list but not have
> _XBF_DELWRI_Q set)?
> 

The buffer has been put back on the original delwri queue (by
_delwri_submit_buffers()) before the lock was dropped. Using the delwri
queue as the wait_list does mean that the flag might not be set until
the delwri queue processing is done and we have returned to the caller.
If another delwri queue occurs, the flag may be reset but the buffer
cannot be added to another queue, which is what we want (as noted
above).

(I'm not sure why delwri_queue() returns true in that case, but that's a
separate issue.)

> .... the error is left on the buffer so it gets tripped over when
> it is next accessed?
> 

Same as delwri_submit(). IIUC, errors are left on buffers by design and
cleared on a subsequent I/O submission.

> .... that the buffer locking is unbalanced for some undocumented
> reason?
> 
> > +	return error;
> > +}
> > +
> >  int __init
> >  xfs_buf_init(void)
> >  {
> > diff --git a/fs/xfs/xfs_buf.h b/fs/xfs/xfs_buf.h
> > index 8a9d3a9..cd74216 100644
> > --- a/fs/xfs/xfs_buf.h
> > +++ b/fs/xfs/xfs_buf.h
> > @@ -334,6 +334,7 @@ extern void xfs_buf_stale(struct xfs_buf *bp);
> >  extern bool xfs_buf_delwri_queue(struct xfs_buf *, struct list_head *);
> >  extern int xfs_buf_delwri_submit(struct list_head *);
> >  extern int xfs_buf_delwri_submit_nowait(struct list_head *);
> > +extern int xfs_buf_delwri_pushbuf(struct xfs_buf *, struct list_head *);
> >  
> >  /* Buffer Daemon Setup Routines */
> >  extern int xfs_buf_init(void);
> > diff --git a/fs/xfs/xfs_qm.c b/fs/xfs/xfs_qm.c
> > index 4ff993c..3815ed3 100644
> > --- a/fs/xfs/xfs_qm.c
> > +++ b/fs/xfs/xfs_qm.c
> > @@ -1247,6 +1247,7 @@ xfs_qm_flush_one(
> >  	struct xfs_dquot	*dqp,
> >  	void			*data)
> >  {
> > +	struct xfs_mount	*mp = dqp->q_mount;
> >  	struct list_head	*buffer_list = data;
> >  	struct xfs_buf		*bp = NULL;
> >  	int			error = 0;
> > @@ -1257,7 +1258,32 @@ xfs_qm_flush_one(
> >  	if (!XFS_DQ_IS_DIRTY(dqp))
> >  		goto out_unlock;
> >  
> > -	xfs_dqflock(dqp);
> > +	/*
> > +	 * The only way the dquot is already flush locked by the time quotacheck
> > +	 * gets here is if reclaim flushed it before the dqadjust walk dirtied
> > +	 * it for the final time. Quotacheck collects all dquot bufs in the
> > +	 * local delwri queue before dquots are dirtied, so reclaim can't have
> > +	 * possibly queued it for I/O. The only way out is to push the buffer to
> > +	 * cycle the flush lock.
> > +	 */
> > +	if (!xfs_dqflock_nowait(dqp)) {
> > +		/* buf is pinned in-core by delwri list */
> > +		DEFINE_SINGLE_BUF_MAP(map, dqp->q_blkno,
> > +				      mp->m_quotainfo->qi_dqchunklen);
> > +		bp = _xfs_buf_find(mp->m_ddev_targp, &map, 1, 0, NULL);
> > +		if (!bp) {
> > +			error = -EINVAL;
> > +			goto out_unlock;
> > +		}
> > +
> > +		/* delwri_pushbuf drops the buf lock */
> > +		xfs_buf_delwri_pushbuf(bp, buffer_list);
> 
> Ummm - you threw away the returned error....
> 

The I/O is going to be retried, but I'll fix it to just return from
here.

> > +		xfs_buf_rele(bp);
> 
> And despite the comment, I think this is simply wrong. We try really
> hard to maintain balanced locking, and as such xfs_buf_relse() is
> what goes along with _xfs_buf_find(). i.e. we are returned a locked
> buffer with a reference by xfs_buf_find(), but xfs_buf_rele() only
> drops the reference.
> 

I'm not quite following if you are saying the locking is unbalanced or
it simply looks that way. If the former, can you please elaborate?

Otherwise, we can probably have pushbuf() return with the buffer lock
held so the caller remains responsible for the lock and reference it
acquired.

> So if I look at this code in isolation, it looks like it leaks a
> buffer lock, and now I have to go read other code to understand why
> it doesn't and I'm left to wonder why it was implemented this
> way....
> 

Thanks for taking the time to send feedback. FWIW, I take most of these
questions as "things one would have to verify in order to review this
patch if one were not on vacation" :P as opposed to pointing things out
as explicitly broken. As noted previously, please do point out anything
I've missed to the contrary. Otherwise, I'll plan to send another
version that includes the error and locking API fixups.

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
Brian Foster April 10, 2017, 2:19 p.m. UTC | #6
On Fri, Apr 07, 2017 at 01:07:25PM -0700, Darrick J. Wong wrote:
> On Fri, Apr 07, 2017 at 08:06:31AM -0400, Brian Foster wrote:
> > On Thu, Apr 06, 2017 at 03:34:17PM -0700, Darrick J. Wong wrote:
> > > On Fri, Feb 24, 2017 at 02:53:20PM -0500, Brian Foster wrote:
> > > > Reclaim during quotacheck can lead to deadlocks on the dquot flush
> > > > lock:
> > > > 
> > > >  - Quotacheck populates a local delwri queue with the physical dquot
> > > >    buffers.
> > > >  - Quotacheck performs the xfs_qm_dqusage_adjust() bulkstat and
> > > >    dirties all of the dquots.
> > > >  - Reclaim kicks in and attempts to flush a dquot whose buffer is
> > > >    already queud on the quotacheck queue. The flush succeeds but
> > > >    queueing to the reclaim delwri queue fails as the backing buffer is
> > > >    already queued. The flush unlock is now deferred to I/O completion
> > > >    of the buffer from the quotacheck queue.
> > > 
> > > <typing aloud here, trying to follow along...>
> > > 
> > > If I'm reading this correctly, this is the _buf_delwri_queue call
> > > towards the end of the _DQ_IS_DIRTY clause in _qm_dquot_isolate?  Once
> > > the dquot is flushed we fail to add it to isol->buffers because it's
> > > already on quotacheck's buffer_list, which means that the
> > > _buf_delwri_submit call in _qm_shrink_scan never sees our dirty dquot.
> > > 
> > > Therefore, the dquot doesn't get written and the flush lock stays
> > > locked...
> > > 
> > 
> > Yes, exactly.
> > 
> > > >  - The dqadjust bulkstat continues and dirties the recently flushed
> > > >    dquot once again.
> > > >  - Quotacheck proceeds to the xfs_qm_flush_one() walk which requires
> > > >    the flush lock to update the backing buffers with the in-core
> > > >    recalculated values. It deadlocks on the redirtied dquot as the
> > > >    flush lock was already acquired by reclaim, but the buffer resides
> > > >    on the local delwri queue which isn't submitted until the end of
> > > >    quotacheck.
> > > 
> > > ...until quotacheck tries to relock it and deadlocks, like you said.
> > > Ok, I think I understand the deadlock...
> > > 
> > 
> > Yep.
> > 
> > > > This is reproduced by running quotacheck on a filesystem with a
> > > > couple million inodes in low memory (512MB-1GB) situations. This is
> > > > a regression as of commit 43ff2122e6 ("xfs: on-stack delayed write
> > > > buffer lists"), which removed a trylock and buffer I/O submission
> > > > from the quotacheck dquot flush sequence.
> > > > 
> > > > Quotacheck first resets and collects the physical dquot buffers in a
> > > > delwri queue. Then, it traverses the filesystem inodes via bulkstat,
> > > > updates the in-core dquots, flushes the corrected dquots to the
> > > > backing buffers and finally submits the delwri queue for I/O. Since
> > > > the backing buffers are queued across the entire quotacheck
> > > > operation, dquot reclaim cannot possibly complete a dquot flush
> > > > before quotacheck completes.
> > > > 
> > > > Therefore, quotacheck must submit the buffer for I/O in order to
> > > > cycle the flush lock and flush the dirty in-core dquot to the
> > > > buffer. Add a delwri queue buffer push mechanism to submit an
> > > > individual buffer for I/O without losing the delwri queue status and
> > > > use it from quotacheck to avoid the deadlock. This restores
> > > > quotacheck behavior to as before the regression was introduced.
> > > 
> > > ...and therefore why this patch changes quotacheck to check if the
> > > dquot's flush lock is already held and if so, to call
> > > _buf_delwri_submit_buffers so that the io completion will unlock the
> > > flush lock and try again.
> > > 
> > 
> > Yes, though only for the specific buffer.
> > 
> > > When we return to _qm_flush_one, the dquot is no longer dirty and we're
> > > done.  I /think/ this looks ok, though I also think I want to hear if
> > > either Christian Affolter or Martin Svec have had any problems with
> > > these two patches applied.  My current feeling is that this is a bit
> > > late for 4.12, but I'd certainly put it in -washme for wider testing.
> > > If anyone thinks this needs to happen sooner, please speak up.
> > > 
> > 
> > The dquot is still dirty and needs to be flushed on the second go around
> > (otherwise we wouldn't have needed the flush lock here on the first
> > try). The issue is basically that the dquot was adjusted (dirtied) at
> > some point during the bulkstat scan, reclaimed and flushed, and then
> > subsequently dirtied again such that another flush is required via
> > qm_flush_one(). The buffer push frees up the held flush lock so that
> > final flush can proceed.
> 
> Ok.  I'd thought that the dquot could still be dirty when we got to there.
> 
> > BTW, what's the reasoning for being late to 4.12? I actually don't have
> > much of an opinion which release this goes into, so I guess my question
> > is more logistical due to the fact that up until recently I haven't paid
> > much attention to the upstream release cycles. ;P The 4.12 merge window
> > hasn't opened as we are still on 4.11-rc5, right? Is the concern then
> > that the merge window is close enough that this has missed 4.12
> > linux-next soak time?
> 
> We're still at least a week or two away from the merge window opening,
> but I want to hear from the two bug reporters that this patch series has
> made their symptoms go away, and that they've mounted enough times to be
> reasonably confident that they haven't simply gotten lucky and not hit
> the deadlock.  I don't know that both of those things will happen in the
> time we have left, and seeing as the frequency of reports is pretty low,
> it doesn't seem like we have to rush it into 4.12.
> 
> > Also, can you clarify the semantics of the -washme thing? ;) That
> > doesn't necessarily mean the patch has been "merged." Rather, it might
> > be suitable for the next for-next (4.13) provided nothing explodes by
> > then, correct?
> 
> Yep.
> 
> > I'm also a little curious on what extra testing that entails, if any,
> > though perhaps this topic is better suited by an independent thread..
> 
> Same testing as for-next gets (nightly auto group runs, ltp every now
> and then, various $magic workloads); it's really just a place to land
> patches that have been code reviewed but I'm not yet confident enough
> about to put into for-next for the next cycle.
> 

Ok, thanks. Note there will be another version incoming based on Dave's
feedback. Otherwise sounds good.

Brian

> --D
> 
> > 
> > > Sorry this one took me a while to get to. :/
> > > 
> > > Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
> > > 
> > 
> > No problem. Thanks for review...
> > 
> > Brian
> > 
> > > --D
> > > 
> > > > Reported-by: Martin Svec <martin.svec@zoner.cz>
> > > > Signed-off-by: Brian Foster <bfoster@redhat.com>
> > > > ---
> > > >  fs/xfs/xfs_buf.c   | 37 +++++++++++++++++++++++++++++++++++++
> > > >  fs/xfs/xfs_buf.h   |  1 +
> > > >  fs/xfs/xfs_qm.c    | 28 +++++++++++++++++++++++++++-
> > > >  fs/xfs/xfs_trace.h |  1 +
> > > >  4 files changed, 66 insertions(+), 1 deletion(-)
> > > > 
> > > > diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> > > > index e566510..e97cf56 100644
> > > > --- a/fs/xfs/xfs_buf.c
> > > > +++ b/fs/xfs/xfs_buf.c
> > > > @@ -2011,6 +2011,43 @@ xfs_buf_delwri_submit(
> > > >  	return error;
> > > >  }
> > > >  
> > > > +int
> > > > +xfs_buf_delwri_pushbuf(
> > > > +	struct xfs_buf		*bp,
> > > > +	struct list_head	*buffer_list)
> > > > +{
> > > > +	LIST_HEAD		(submit_list);
> > > > +	int			error;
> > > > +
> > > > +	ASSERT(xfs_buf_islocked(bp));
> > > > +	ASSERT(bp->b_flags & _XBF_DELWRI_Q);
> > > > +
> > > > +	trace_xfs_buf_delwri_pushbuf(bp, _RET_IP_);
> > > > +
> > > > +	/*
> > > > +	 * Move the buffer to an empty list and submit. Pass the original list
> > > > +	 * as the wait list so delwri submission moves the buf back to it before
> > > > +	 * it is submitted (and thus before it is unlocked). This means the
> > > > +	 * buffer cannot be placed on another list while we wait for it.
> > > > +	 */
> > > > +	list_move(&bp->b_list, &submit_list);
> > > > +	xfs_buf_unlock(bp);
> > > > +
> > > > +	xfs_buf_delwri_submit_buffers(&submit_list, buffer_list);
> > > > +
> > > > +	/*
> > > > +	 * Lock the buffer to wait for I/O completion. It's already held on the
> > > > +	 * original list, so all we have to do is reset the delwri queue flag
> > > > +	 * that was cleared by delwri submission.
> > > > +	 */
> > > > +	xfs_buf_lock(bp);
> > > > +	error = bp->b_error;
> > > > +	bp->b_flags |= _XBF_DELWRI_Q;
> > > > +	xfs_buf_unlock(bp);
> > > > +
> > > > +	return error;
> > > > +}
> > > > +
> > > >  int __init
> > > >  xfs_buf_init(void)
> > > >  {
> > > > diff --git a/fs/xfs/xfs_buf.h b/fs/xfs/xfs_buf.h
> > > > index 8a9d3a9..cd74216 100644
> > > > --- a/fs/xfs/xfs_buf.h
> > > > +++ b/fs/xfs/xfs_buf.h
> > > > @@ -334,6 +334,7 @@ extern void xfs_buf_stale(struct xfs_buf *bp);
> > > >  extern bool xfs_buf_delwri_queue(struct xfs_buf *, struct list_head *);
> > > >  extern int xfs_buf_delwri_submit(struct list_head *);
> > > >  extern int xfs_buf_delwri_submit_nowait(struct list_head *);
> > > > +extern int xfs_buf_delwri_pushbuf(struct xfs_buf *, struct list_head *);
> > > >  
> > > >  /* Buffer Daemon Setup Routines */
> > > >  extern int xfs_buf_init(void);
> > > > diff --git a/fs/xfs/xfs_qm.c b/fs/xfs/xfs_qm.c
> > > > index 4ff993c..3815ed3 100644
> > > > --- a/fs/xfs/xfs_qm.c
> > > > +++ b/fs/xfs/xfs_qm.c
> > > > @@ -1247,6 +1247,7 @@ xfs_qm_flush_one(
> > > >  	struct xfs_dquot	*dqp,
> > > >  	void			*data)
> > > >  {
> > > > +	struct xfs_mount	*mp = dqp->q_mount;
> > > >  	struct list_head	*buffer_list = data;
> > > >  	struct xfs_buf		*bp = NULL;
> > > >  	int			error = 0;
> > > > @@ -1257,7 +1258,32 @@ xfs_qm_flush_one(
> > > >  	if (!XFS_DQ_IS_DIRTY(dqp))
> > > >  		goto out_unlock;
> > > >  
> > > > -	xfs_dqflock(dqp);
> > > > +	/*
> > > > +	 * The only way the dquot is already flush locked by the time quotacheck
> > > > +	 * gets here is if reclaim flushed it before the dqadjust walk dirtied
> > > > +	 * it for the final time. Quotacheck collects all dquot bufs in the
> > > > +	 * local delwri queue before dquots are dirtied, so reclaim can't have
> > > > +	 * possibly queued it for I/O. The only way out is to push the buffer to
> > > > +	 * cycle the flush lock.
> > > > +	 */
> > > > +	if (!xfs_dqflock_nowait(dqp)) {
> > > > +		/* buf is pinned in-core by delwri list */
> > > > +		DEFINE_SINGLE_BUF_MAP(map, dqp->q_blkno,
> > > > +				      mp->m_quotainfo->qi_dqchunklen);
> > > > +		bp = _xfs_buf_find(mp->m_ddev_targp, &map, 1, 0, NULL);
> > > > +		if (!bp) {
> > > > +			error = -EINVAL;
> > > > +			goto out_unlock;
> > > > +		}
> > > > +
> > > > +		/* delwri_pushbuf drops the buf lock */
> > > > +		xfs_buf_delwri_pushbuf(bp, buffer_list);
> > > > +		xfs_buf_rele(bp);
> > > > +
> > > > +		error = -EAGAIN;
> > > > +		goto out_unlock;
> > > > +	}
> > > > +
> > > >  	error = xfs_qm_dqflush(dqp, &bp);
> > > >  	if (error)
> > > >  		goto out_unlock;
> > > > diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
> > > > index fb7555e..c8d28a9 100644
> > > > --- a/fs/xfs/xfs_trace.h
> > > > +++ b/fs/xfs/xfs_trace.h
> > > > @@ -365,6 +365,7 @@ DEFINE_BUF_EVENT(xfs_buf_iowait_done);
> > > >  DEFINE_BUF_EVENT(xfs_buf_delwri_queue);
> > > >  DEFINE_BUF_EVENT(xfs_buf_delwri_queued);
> > > >  DEFINE_BUF_EVENT(xfs_buf_delwri_split);
> > > > +DEFINE_BUF_EVENT(xfs_buf_delwri_pushbuf);
> > > >  DEFINE_BUF_EVENT(xfs_buf_get_uncached);
> > > >  DEFINE_BUF_EVENT(xfs_buf_item_relse);
> > > >  DEFINE_BUF_EVENT(xfs_buf_item_iodone_async);
> > > > -- 
> > > > 2.7.4
> > > > 
Dave Chinner April 10, 2017, 11:55 p.m. UTC | #7
On Mon, Apr 10, 2017 at 10:15:21AM -0400, Brian Foster wrote:
> On Mon, Apr 10, 2017 at 03:00:01PM +1000, Dave Chinner wrote:
> > On Fri, Feb 24, 2017 at 02:53:20PM -0500, Brian Foster wrote:
> > > +	/*
> > > +	 * Move the buffer to an empty list and submit. Pass the original list
> > > +	 * as the wait list so delwri submission moves the buf back to it before
> > > +	 * it is submitted (and thus before it is unlocked). This means the
> > > +	 * buffer cannot be placed on another list while we wait for it.
> > > +	 */
> > > +	list_move(&bp->b_list, &submit_list);
> > > +	xfs_buf_unlock(bp);
> > 
> > .... this is safe/racy as we may have just moved it off the delwri
> > queue without changing state, reference counts, etc?
> > 
> 
> Racy with..?

Whatever might possibly manipulate the buffer that thinks it's on a
specific delwri list. We've unlocked it.

As I said, I haven't spend hours thinking this all through, I'm just
throwing out the questions I had....

> The buffer is effectively still on the delwri queue here.
> We've just replaced the caller queue with a local queue to reuse the
> existing submission infrastructure. IOW, we've just changed the queue
> that the buffer is on.

Yes, in theory, but it's not something we ever considered would be
done when designing the private delwri queue infrastructure...

> > > +
> > > +	xfs_buf_delwri_submit_buffers(&submit_list, buffer_list);
> > 
> > .... using a caller supplied delwri buffer list as the buffer IO
> > wait list destination is making big assumptions about the internal
> > use of the wait list? e.g that xfs_buf_delwri_submit_buffers() does
> > not initialise the list_head before use...
> > 
> 
> This is the scope at which the wait list is defined/initialized.

It is now - passing in a list that is not empty effectively prevents
us from implementing anything differently in future....

> E.g.,
> _delwri_submit() initializes its local wait list and passes it down for
> _delwri_submit_buffers() to use. This basically ensures that the buffer
> is placed back on the original delwri queue because the latter moves the
> buffer over under lock, before it is submitted for I/O.

But xfs_buf_delwri_submit_buffers() adds a reference count to the
buffer if it's placed on a wait list, but we just moved it back to
the original buffer list. Now I have to look for why that works, and
determine if that was intended behaviour or not.

i.e. _delwri_submit_buffers is being used for more than just moving
the buffer back to the buffer list - there's reference counting
shenanigans going on here, too....

> We could probably use a local wait_list just the same if that is
> preferred, and explicitly add the buffer back from the wait_list to the
> original delwri queue. FWIW, it's not clear to me whether you're asking
> about the use of the wait_list here or indicating preference.

All I'm pointing out is that I don't think the patch is a good
solution, not what the solution should be. I haven't spent the hours
needed to think out a proper solution.

> > .... we should be doing IO submission while holding other things on
> > the delwri list and unknown caller locks?
> > 
> 
> I'm not following... this is effectively a move/submit of a delwri queue
> item to another delwri queue. If the delwri queue is by design a local
> data structure then only one thread should be queue processing at a
> time. Why would this particular action be problematic?

No, no it's not just a "list move". xfs_buf_delwri_submit_buffers()
calls xfs_buf_submit() to start IO on the buffers passed in on the
submit list. We're adding a new IO submission path here....

> Technically, we could rework this into something like a
> delwri_queue_move()/delwri_submit() sequence, but the problem with that
> is we lose the queue-populated state of the buffer for a period of time
> and thus somebody else (i.e., reclaim) can jump in and queue it out from
> underneath the caller. The purpose of this mechanism is to preserve
> original delwri queue ownership of the buffer.

Sure, I understand that, but I still don't think this is the right
thing to be doing.

> > .... we have all the buffer reference counts we need to make this
> > work correctly?
> > 
> 
> AFAICT, we follow the existing delwri queue rules with regard to locking
> and reference counting. A buffer is queued under the buf lock and the
> queue takes a reference on the buffer. Queue submission locks the
> buffer, acquires a wait_list reference and moves to the wait list if one
> is specified and submits the buffer. Buffer I/O completion drops the
> original queue reference. If the wait_list was specified, we cycle the
> buffer lock (to wait for I/O completion), remove from the wait_list and
> drop the wait_list reference.
>
> The pushbuf() call moves the buffer to the submit_list under the buffer
> lock, which is effectively another local delwri queue. This takes
> ownership of the original delwri queue reference. Queue submission
> occurs as above: the buffer is moved to the wait_list, an associated
> wait_list reference is acquired and the buffer is submitted. I/O
> completion drops the original delwri queue reference as before, so we
> are in the same state as delwri_submit() up through I/O completion.
> 
> We again cycle the buf lock to wait on I/O completion and process an
> error. Rather than remove from the wait_list and drop the reference,
> however, pushbuf() has the buffer back on the original delwri queue with
> an associated reference. The only explicit state change here is the
> DELWRI_Q flag, which is reset.
> 
> So am I missing something here with regard to proper reference
> counting..? If so, please elaborate and I'll try to dig into it.

IOWs, you're abusing the waitlist reference count added deep inside
xfs_buf_delwri_submit_buffers() to handle the fact
that the IO submission removes the delwri queue reference count that
the buffer had on entry to the function, but still needs on exit to
the function.

I guarantee we'll break this in future - it's too subtle and fragile
for core infrastructure. Also, there's no comments to explain any of
these games being played, which makes it even less maintainable...


> > > +	/*
> > > +	 * Lock the buffer to wait for I/O completion. It's already held on the
> > > +	 * original list, so all we have to do is reset the delwri queue flag
> > > +	 * that was cleared by delwri submission.
> > > +	 */
> > > +	xfs_buf_lock(bp);
> > > +	error = bp->b_error;
> > > +	bp->b_flags |= _XBF_DELWRI_Q;
> > > +	xfs_buf_unlock(bp);
> > 
> > .... this is racy w.r.t. the buffer going back onto the
> > buffer list without holding the buffer lock, or that the
> > _XBF_DELWRI_Q setting/clearing is not atomic w.r.t. the delwri queue
> > manipulations (i.e. can now be on the delwri list but not have
> > _XBF_DELWRI_Q set)?
> > 
> 
> The buffer has been put back on the original delwri queue (by
> _delwri_submit_buffers()) before the lock was dropped. Using the delwri
> queue as the wait_list does mean that the flag might not be set until
> the delwri queue processing is done and we have returned to the caller.
> If another delwri queue occurs, the flag may be reset but the buffer
> cannot be added to another queue, which is what we want (as noted
> above).

Yes, I know this is what happens. My point is that the internal
_XBF_DELWRI_Q no longer indicates that a buffer is on a delwri queue
(i.e. that bp->b_list is valid). IOWs, if we want to check a buffer
is on a delwri queue, we can no longer just check the flag, we have
to check both the flag and list_empty(bp->b_list) because they are
no longer atomically updated.
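
As an illustration, any such check would then need something like the
hypothetical helper below (sketch only, the name is made up):

static inline bool
xfs_buf_is_delwri_queued(
	struct xfs_buf		*bp)
{
	/* flag and list membership can disagree around pushbuf() */
	return (bp->b_flags & _XBF_DELWRI_Q) || !list_empty(&bp->b_list);
}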

To me this is a design rule violation, regardless of whether the
code works or not...

> (I'm not sure why delwri_queue() returns true in that case, but that's a
> separate issue.)
> 
> > .... the error is left on the buffer so it gets tripped over when
> > it is next accessed?
> > 
> 
> Same as delwri_submit(). IIUC, errors are left on buffers by design and
> cleared on a subsequent I/O submission.

That doesn't mean it's the right thing to do here. Think about how
the error should be handled in this case, don't copy something from a
bulk IO submission interface....

> > > +	/*
> > > +	 * The only way the dquot is already flush locked by the time quotacheck
> > > +	 * gets here is if reclaim flushed it before the dqadjust walk dirtied
> > > +	 * it for the final time. Quotacheck collects all dquot bufs in the
> > > +	 * local delwri queue before dquots are dirtied, so reclaim can't have
> > > +	 * possibly queued it for I/O. The only way out is to push the buffer to
> > > +	 * cycle the flush lock.
> > > +	 */
> > > +	if (!xfs_dqflock_nowait(dqp)) {
> > > +		/* buf is pinned in-core by delwri list */
> > > +		DEFINE_SINGLE_BUF_MAP(map, dqp->q_blkno,
> > > +				      mp->m_quotainfo->qi_dqchunklen);
> > > +		bp = _xfs_buf_find(mp->m_ddev_targp, &map, 1, 0, NULL);
> > > +		if (!bp) {
> > > +			error = -EINVAL;
> > > +			goto out_unlock;
> > > +		}
> > > +
> > > +		/* delwri_pushbuf drops the buf lock */
> > > +		xfs_buf_delwri_pushbuf(bp, buffer_list);
> > 
> > Ummm - you threw away the returned error....
> > 
> 
> The I/O is going to be retried, but I'll fix it to just return from
> here.
> 
> > > +		xfs_buf_rele(bp);
> > 
> > And despite the comment, I think this is simply wrong. We try really
> > hard to maintain balanced locking, and as such xfs_buf_relse() is
> > what goes along with _xfs_buf_find(). i.e. we are returned a locked
> > buffer with a reference by xfs_buf_find(), but xfs_buf_rele() only
> > drops the reference.
> > 
> 
> I'm not quite following if you are saying the locking is unbalanced or
> it simply looks that way. If the former, can you please elaborate?

It's not obviously correct, which to me is a bug. If I read this
code, you're forcing me to go read more code to work out why it is
correct because it doesn't follow the expected patterns. And then
the reader disappears down a hole for hours trying to work out how
this code actually works.

> Otherwise, we can probably have pushbuf() return with the buffer lock
> held so the caller remains responsible for the lock and reference it
> acquired.
>
> > So if I look at this code in isolation, it looks like it leaks a
> > buffer lock, and now I have to go read other code to understand why
> > it doesn't and I'm left to wonder why it was implemented this
> > way....
> > 
> 
> Thanks for taking the time to send feedback. FWIW, I take most of these
> questions as "things one would have to verify in order to review this
> patch if one were not on vacation" :P as opposed to pointing things out
> as explicitly broken.

Keep in mind that while the code /might work/, that doesn't mean it
can be merged. The more I look at the patch and think about it, the
more strongly I feel that we need to "go back to the drawing board
and try again".

So, rather than just being a negative prick that says no and doesn't
provide any real help, how about considering this line of
thinking....

We hold the buffer locked, the dquot is locked and the dquot is
flush locked already, right? We know the IO has not been submitted
because we have the buffer lock and the dquot is still flush locked,
which means /some/ dirty dquot data is already in the buffer.

So why can't we just modify the dquot in the buffer?  We already
hold all the locks needed to guarantee exclusive access to the dquot
and buffer, and they are all we hold when we do the initial flush to
the buffer. So why do we need to do IO to unlock the dquot flush
lock when we could just rewrite before we submit the buffer for IO?

Indeed, let's go look at inodes for an example of this, 
xfs_ifree_cluster() to be precise:

                /*                                                               
                 * We obtain and lock the backing buffer first in the process    
                 * here, as we have to ensure that any dirty inode that we       
                 * can't get the flush lock on is attached to the buffer.        
                 * If we scan the in-memory inodes first, then buffer IO can     
                 * complete before we get a lock on it, and hence we may fail    
                 * to mark all the active inodes on the buffer stale.            
                 */                                                              
 .....
                /*                                                               
                 * Walk the inodes already attached to the buffer and mark them  
                 * stale. These will all have the flush locks held, so an        
                 * in-memory inode walk can't lock them. By marking them all     
                 * stale first, we will not attempt to lock them in the loop     
                 * below as the XFS_ISTALE flag will be set.                     
                 */                                                              
.....
                /*                                                               
                 * For each inode in memory attempt to add it to the inode       
                 * buffer and set it up for being staled on buffer IO            
                 * completion.  This is safe as we've locked out tail pushing    
                 * and flushing by locking the buffer.                           
                 *                                                               
                 * We have already marked every inode that was part of a         
                 * transaction stale above, which means there is no point in     
                 * even trying to lock them.                                     
                 */                                                              

IOWs, what we have here is precedent for modifying the flush
locked objects attached to a buffer after first locking the buffer.
dquots have the same transaction/flush model as inodes, so I'm
pretty sure this will lead to the simplest, cleanest way to avoid
this deadlock....


Cheers,

Dave.
Brian Foster April 11, 2017, 12:50 p.m. UTC | #8
On Mon, Apr 10, 2017 at 03:00:01PM +1000, Dave Chinner wrote:
> On Fri, Feb 24, 2017 at 02:53:20PM -0500, Brian Foster wrote:
> > Reclaim during quotacheck can lead to deadlocks on the dquot flush
> > lock:
> > 
> >  - Quotacheck populates a local delwri queue with the physical dquot
> >    buffers.
> >  - Quotacheck performs the xfs_qm_dqusage_adjust() bulkstat and
> >    dirties all of the dquots.
> >  - Reclaim kicks in and attempts to flush a dquot whose buffer is
> >    already queud on the quotacheck queue. The flush succeeds but
> >    queueing to the reclaim delwri queue fails as the backing buffer is
> >    already queued. The flush unlock is now deferred to I/O completion
> >    of the buffer from the quotacheck queue.
> >  - The dqadjust bulkstat continues and dirties the recently flushed
> >    dquot once again.
> >  - Quotacheck proceeds to the xfs_qm_flush_one() walk which requires
> >    the flush lock to update the backing buffers with the in-core
> >    recalculated values. It deadlocks on the redirtied dquot as the
> >    flush lock was already acquired by reclaim, but the buffer resides
> >    on the local delwri queue which isn't submitted until the end of
> >    quotacheck.
> > 
> > This is reproduced by running quotacheck on a filesystem with a
> > couple million inodes in low memory (512MB-1GB) situations. This is
> > a regression as of commit 43ff2122e6 ("xfs: on-stack delayed write
> > buffer lists"), which removed a trylock and buffer I/O submission
> > from the quotacheck dquot flush sequence.
> > 
> > Quotacheck first resets and collects the physical dquot buffers in a
> > delwri queue. Then, it traverses the filesystem inodes via bulkstat,
> > updates the in-core dquots, flushes the corrected dquots to the
> > backing buffers and finally submits the delwri queue for I/O. Since
> > the backing buffers are queued across the entire quotacheck
> > operation, dquot reclaim cannot possibly complete a dquot flush
> > before quotacheck completes.
> > 
> > Therefore, quotacheck must submit the buffer for I/O in order to
> > cycle the flush lock and flush the dirty in-core dquot to the
> > buffer. Add a delwri queue buffer push mechanism to submit an
> > individual buffer for I/O without losing the delwri queue status and
> > use it from quotacheck to avoid the deadlock. This restores
> > quotacheck behavior to as before the regression was introduced.
> 
> While it may fix the problem, the solution gives me the heebee
> jeebees. I'm on holidays, so I haven't bothered to spend the hours
> necessary to answer these questions, but to give you an idea, this
> was what I thought as I read the patch.  i.e. I have concerns about
> whether....
> 

Thanks for looking at this. FWIW, this is based on the solution that
existed prior to the commit referenced above (obviously this doesn't
mean the code is correct or that this is the best solution).

> > Reported-by: Martin Svec <martin.svec@zoner.cz>
> > Signed-off-by: Brian Foster <bfoster@redhat.com>
> > ---
> >  fs/xfs/xfs_buf.c   | 37 +++++++++++++++++++++++++++++++++++++
> >  fs/xfs/xfs_buf.h   |  1 +
> >  fs/xfs/xfs_qm.c    | 28 +++++++++++++++++++++++++++-
> >  fs/xfs/xfs_trace.h |  1 +
> >  4 files changed, 66 insertions(+), 1 deletion(-)
> > 
> > diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> > index e566510..e97cf56 100644
> > --- a/fs/xfs/xfs_buf.c
> > +++ b/fs/xfs/xfs_buf.c
> > @@ -2011,6 +2011,43 @@ xfs_buf_delwri_submit(
> >  	return error;
> >  }
> >  
> > +int
> > +xfs_buf_delwri_pushbuf(
> > +	struct xfs_buf		*bp,
> > +	struct list_head	*buffer_list)
> > +{
> > +	LIST_HEAD		(submit_list);
> > +	int			error;
> > +
> > +	ASSERT(xfs_buf_islocked(bp));
> > +	ASSERT(bp->b_flags & _XBF_DELWRI_Q);
> > +
> > +	trace_xfs_buf_delwri_pushbuf(bp, _RET_IP_);
> > +
> > +	/*
> > +	 * Move the buffer to an empty list and submit. Pass the original list
> > +	 * as the wait list so delwri submission moves the buf back to it before
> > +	 * it is submitted (and thus before it is unlocked). This means the
> > +	 * buffer cannot be placed on another list while we wait for it.
> > +	 */
> > +	list_move(&bp->b_list, &submit_list);
> > +	xfs_buf_unlock(bp);
> 
> .... this is safe/racy as we may have just moved it off the delwri
> queue without changing state, reference counts, etc?
> 

Racy with..? The buffer is effectively still on the delwri queue here.
We've just replaced the caller queue with a local queue to reuse the
existing submission infrastructure. IOW, we've just changed the queue
that the buffer is on.

> > +
> > +	xfs_buf_delwri_submit_buffers(&submit_list, buffer_list);
> 
> .... using a caller supplied delwri buffer list as the buffer IO
> wait list destination is making big assumptions about the internal
> use of the wait list? e.g that xfs_buf_delwri_submit_buffers() does
> not initialise the list_head before use...
> 

This is the scope at which the wait list is defined/initialized. E.g.,
_delwri_submit() initializes its local wait list and passes it down for
_delwri_submit_buffers() to use. This basically ensures that the buffer
is placed back on the original delwri queue because the latter moves the
buffer over under lock, before it is submitted for I/O.

We could probably use a local wait_list just the same if that is
preferred, and explicitly add the buffer back from the wait_list to the
original delwri queue. FWIW, it's not clear to me whether you're asking
about the use of the wait_list here or indicating preference.
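
For reference, a minimal sketch of that variant (untested, assuming the
current xfs_buf_delwri_submit_buffers() wait list semantics, with the
wait list reference simply inherited as the requeued delwri reference):

	LIST_HEAD		(submit_list);
	LIST_HEAD		(wait_list);
	int			error;

	list_move(&bp->b_list, &submit_list);
	xfs_buf_unlock(bp);

	xfs_buf_delwri_submit_buffers(&submit_list, &wait_list);

	/*
	 * Cycle the buffer lock to wait for I/O completion, then move the
	 * buffer from the wait list back onto the caller's delwri queue.
	 */
	xfs_buf_lock(bp);
	error = bp->b_error;
	list_move_tail(&bp->b_list, buffer_list);
	bp->b_flags |= _XBF_DELWRI_Q;
	xfs_buf_unlock(bp);

	return error;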

> .... we should be doing IO submission while holding other things on
> the delwri list and unknown caller locks?
> 

I'm not following... this is effectively a move/submit of a delwri queue
item to another delwri queue. If the delwri queue is by design a local
data structure then only one thread should be queue processing at a
time. Why would this particular action be problematic?

Technically, we could rework this into something like a
delwri_queue_move()/delwri_submit() sequence, but the problem with that
is we lose the queue-populated state of the buffer for a period of time
and thus somebody else (i.e., reclaim) can jump in and queue it out from
underneath the caller. The purpose of this mechanism is to preserve
original delwri queue ownership of the buffer.

> .... we have all the buffer reference counts we need to make this
> work correctly?
> 

AFAICT, we follow the existing delwri queue rules with regard to locking
and reference counting. A buffer is queued under the buf lock and the
queue takes a reference on the buffer. Queue submission locks the
buffer, acquires a wait_list reference and moves to the wait list if one
is specified and submits the buffer. Buffer I/O completion drops the
original queue reference. If the wait_list was specified, we cycle the
buffer lock (to wait for I/O completion), remove from the wait_list and
drop the wait_list reference.

The pushbuf() call moves the buffer to the submit_list under the buffer
lock, which is effectively another local delwri queue. This takes
ownership of the original delwri queue reference. Queue submission
occurs as above: the buffer is moved to the wait_list, an associated
wait_list reference is acquired and the buffer is submitted. I/O
completion drops the original delwri queue reference as before, so we
are in the same state as delwri_submit() up through I/O completion.

We again cycle the buf lock to wait on I/O completion and process an
error. Rather than remove from the wait_list and drop the reference,
however, pushbuf() has the buffer back on the original delwri queue with
an associated reference. The only explicit state change here is the
DELWRI_Q flag, which is reset.

So am I missing something here with regard to proper reference
counting..? If so, please elaborate and I'll try to dig into it.

> > +	/*
> > +	 * Lock the buffer to wait for I/O completion. It's already held on the
> > +	 * original list, so all we have to do is reset the delwri queue flag
> > +	 * that was cleared by delwri submission.
> > +	 */
> > +	xfs_buf_lock(bp);
> > +	error = bp->b_error;
> > +	bp->b_flags |= _XBF_DELWRI_Q;
> > +	xfs_buf_unlock(bp);
> 
> .... this is racy w.r.t. the buffer going back onto the
> buffer list without holding the buffer lock, or that the
> _XBF_DELWRI_Q setting/clearing is not atomic w.r.t. the delwri queue
> manipulations (i.e. can now be on the delwri list but not have
> _XBF_DELWRI_Q set)?
> 

The buffer has been put back on the original delwri queue (by
_delwri_submit_buffers()) before the lock was dropped. Using the delwri
queue as the wait_list does mean that the flag might not be set until
the delwri queue processing is done and we have returned to the caller.
If another delwri queue occurs, the flag may be reset but the buffer
cannot be added to another queue, which is what we want (as noted
above).

(I'm not sure why delwri_queue() would return true in that case, but
that's a separate issue.)

> .... the error is left on the buffer so it gets tripped over when
> it is next accessed?
> 

Same as delwri_submit(). IIUC, errors are left on buffers by design and
cleared on a subsequent I/O submission.

> .... that the buffer locking is unbalanced for some undocumented
> reason?
> 
> > +	return error;
> > +}
> > +
> >  int __init
> >  xfs_buf_init(void)
> >  {
> > diff --git a/fs/xfs/xfs_buf.h b/fs/xfs/xfs_buf.h
> > index 8a9d3a9..cd74216 100644
> > --- a/fs/xfs/xfs_buf.h
> > +++ b/fs/xfs/xfs_buf.h
> > @@ -334,6 +334,7 @@ extern void xfs_buf_stale(struct xfs_buf *bp);
> >  extern bool xfs_buf_delwri_queue(struct xfs_buf *, struct list_head *);
> >  extern int xfs_buf_delwri_submit(struct list_head *);
> >  extern int xfs_buf_delwri_submit_nowait(struct list_head *);
> > +extern int xfs_buf_delwri_pushbuf(struct xfs_buf *, struct list_head *);
> >  
> >  /* Buffer Daemon Setup Routines */
> >  extern int xfs_buf_init(void);
> > diff --git a/fs/xfs/xfs_qm.c b/fs/xfs/xfs_qm.c
> > index 4ff993c..3815ed3 100644
> > --- a/fs/xfs/xfs_qm.c
> > +++ b/fs/xfs/xfs_qm.c
> > @@ -1247,6 +1247,7 @@ xfs_qm_flush_one(
> >  	struct xfs_dquot	*dqp,
> >  	void			*data)
> >  {
> > +	struct xfs_mount	*mp = dqp->q_mount;
> >  	struct list_head	*buffer_list = data;
> >  	struct xfs_buf		*bp = NULL;
> >  	int			error = 0;
> > @@ -1257,7 +1258,32 @@ xfs_qm_flush_one(
> >  	if (!XFS_DQ_IS_DIRTY(dqp))
> >  		goto out_unlock;
> >  
> > -	xfs_dqflock(dqp);
> > +	/*
> > +	 * The only way the dquot is already flush locked by the time quotacheck
> > +	 * gets here is if reclaim flushed it before the dqadjust walk dirtied
> > +	 * it for the final time. Quotacheck collects all dquot bufs in the
> > +	 * local delwri queue before dquots are dirtied, so reclaim can't have
> > +	 * possibly queued it for I/O. The only way out is to push the buffer to
> > +	 * cycle the flush lock.
> > +	 */
> > +	if (!xfs_dqflock_nowait(dqp)) {
> > +		/* buf is pinned in-core by delwri list */
> > +		DEFINE_SINGLE_BUF_MAP(map, dqp->q_blkno,
> > +				      mp->m_quotainfo->qi_dqchunklen);
> > +		bp = _xfs_buf_find(mp->m_ddev_targp, &map, 1, 0, NULL);
> > +		if (!bp) {
> > +			error = -EINVAL;
> > +			goto out_unlock;
> > +		}
> > +
> > +		/* delwri_pushbuf drops the buf lock */
> > +		xfs_buf_delwri_pushbuf(bp, buffer_list);
> 
> Ummm - you threw away the returned error....
> 

The I/O is going to be retried, but I'll fix it to just return from
here.
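
E.g., something like this in xfs_qm_flush_one(), purely as a sketch of
the fixup rather than the eventual patch:

		/* delwri_pushbuf drops the buf lock */
		error = xfs_buf_delwri_pushbuf(bp, buffer_list);
		xfs_buf_rele(bp);
		if (!error)
			error = -EAGAIN;
		goto out_unlock;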

> > +		xfs_buf_rele(bp);
> 
> And despite the comment, I think this is simply wrong. We try really
> hard to maintain balanced locking, and as such xfs_buf_relse() is
> what goes along with _xfs_buf_find(). i.e. we are returned a locked
> buffer with a reference by xfs_buf_find(), but xfs_buf_rele() only
> drops the reference.
> 

I'm not quite following if you are saying the locking is unbalanced or
it simply looks that way. If the former, can you please elaborate?

Otherwise, we can probably have pushbuf() return with the buffer lock
held so the caller remains responsible for the lock and reference it
acquired.

> So if I look at this code in isolation, it looks like it leaks a
> buffer lock, and now I have to go read other code to understand why
> it doesn't and I'm left to wonder why it was implemented this
> way....
> 

Thanks for taking the time to send feedback. FWIW, I take most of these
questions as "things one would have to verify in order to review this
patch if one were not on vacation" :P as opposed to pointing things out
as explicitly broken. As noted previously, please do point out anything
I've missed to the contrary. Otherwise, I'll plan to send another
version that includes the error and locking API fixups.

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
Brian Foster April 11, 2017, 2:53 p.m. UTC | #9
On Tue, Apr 11, 2017 at 09:55:25AM +1000, Dave Chinner wrote:
> On Mon, Apr 10, 2017 at 10:15:21AM -0400, Brian Foster wrote:
> > On Mon, Apr 10, 2017 at 03:00:01PM +1000, Dave Chinner wrote:
> > > On Fri, Feb 24, 2017 at 02:53:20PM -0500, Brian Foster wrote:
> > > > +	/*
> > > > +	 * Move the buffer to an empty list and submit. Pass the original list
> > > > +	 * as the wait list so delwri submission moves the buf back to it before
> > > > +	 * it is submitted (and thus before it is unlocked). This means the
> > > > +	 * buffer cannot be placed on another list while we wait for it.
> > > > +	 */
> > > > +	list_move(&bp->b_list, &submit_list);
> > > > +	xfs_buf_unlock(bp);
> > > 
> > > .... this is safe/racy as we may have just moved it off the delwri
> > > queue without changing state, reference counts, etc?
> > > 
> > 
> > Racy with..?
> 
> Whatever might possibly manipulate the buffer that thinks it's on a
> specific delwri list. We've unlocked it.
> 
> As I said, I haven't spend hours thinking this all through, I'm just
> throwing out the questions I had....
> 

Ok, I'm just trying to answer them. ;)

Here, we unlock buffers after they are added to the original delwri
queue as well. So the buffer is sitting on a local delwri queue before
it is locked and it is sitting on a local delwri queue after it is
unlocked. This is the expected state for an external lookup of the
buffer to observe before delwri_submit_buffers() acquires the lock for
I/O submission.

> > The buffer is effectively still on the delwri queue here.
> > We've just replaced the caller queue with a local queue to reuse the
> > existing submission infrastructure. IOW, we've just changed the queue
> > that the buffer is on.
> 
> Yes, in theory, but it's not something we ever considered would be
> done when designing the private delwri queue infrastructure...
> 

As mentioned before, this is a resurrection of an old fix, though it may
pre-date the shift of the delwri queue mechanism to a local one.

> > > > +
> > > > +	xfs_buf_delwri_submit_buffers(&submit_list, buffer_list);
> > > 
> > > .... using a caller supplied delwri buffer list as the buffer IO
> > > wait list destination is making big assumptions about the internal
> > > use of the wait list? e.g that xfs_buf_delwri_submit_buffers() does
> > > not initialise the list_head before use...
> > > 
> > 
> > This is the scope at which the wait list is defined/initialized.
> 
> It is now - passing in a list that is not empty effectively prevents
> us from implementing anything differently in future....
> 

This is nothing more than an implementation detail. It simply optimizes
away an additional list on the stack. It exists purely because it is
safe to do with the current code and can be trivially factored away
either now or if the code demands it in the future.

> > E.g.,
> > _delwri_submit() initializes its local wait list and passes it down for
> > _delwri_submit_buffers() to use. This basically ensures that the buffer
> > is placed back on the original delwri queue because the latter moves the
> > buffer over under lock, before it is submitted for I/O.
> 
> But xfs_buf_delwri_submit_buffers() adds a reference count to the
> buffer if it's placed on a wait list, but we just moved it back to
> the original buffer list. Now I have to look for why that works, and
> determine if that was intended behaviour or not.
> 
> i.e. _delwri_submit_buffers is being used for more than just moving
> the buffer back to the buffer list - there's reference counting
> shenanigans going on here, too....
> 
> > We could probably use a local wait_list just the same if that is
> > preferred, and explicitly add the buffer back from the wait_list to the
> > original delwri queue. FWIW, it's not clear to me whether you're asking
> > about the use of the wait_list here or indicating preference.
> 
> All I'm pointing out is that I don't think the patch is a good
> solution, not what the solution should be. I haven't spent the hours
> needed to think out a proper solution.
> 
> > > .... we should be doing IO submission while holding other things on
> > > the delwri list and unknown caller locks?
> > > 
> > 
> > I'm not following... this is effectively a move/submit of a delwri queue
> > item to another delwri queue. If the delwri queue is by design a local
> > data structure then only one thread should be queue processing at a
> > time. Why would this particular action be problematic?
> 
> No, no it's not just a "list move". xfs_buf_delwri_submit_buffers()
> calls xfs_buf_submit() to start IO on the buffers passed in on the
> submit list. We're adding a new IO submission path here....
> 

Just above, I said "move/submit," which clearly refers to pushbuf() as
an aggregate of both actions. Regardless, it sounds to me that the
concern here is "is it safe to submit this I/O while we still have
things on a list?" What I'm trying to say is the notion of a private
delwri queue mechanism implies that is safe from a delwri queue
infrastructure point of view. Nothing prevents a particular context from
using multiple lists. Analysis beyond that is context dependent, and I
don't see anything in quotacheck that conflicts.

> > Technically, we could rework this into something like a
> > delwri_queue_move()/delwri_submit() sequence, but the problem with that
> > is we lose the queue-populated state of the buffer for a period of time
> > and thus somebody else (i.e., reclaim) can jump in and queue it out from
> > underneath the caller. The purpose of this mechanism is to preserve
> > original delwri queue ownership of the buffer.
> 
> Sure, I understand that, but I still don't think this is the right
> thing to be doing.
> 

Fair enough..

> > > .... we have all the buffer reference counts we need to make this
> > > work correctly?
> > > 
> > 
> > AFAICT, we follow the existing delwri queue rules with regard to locking
> > and reference counting. A buffer is queued under the buf lock and the
> > queue takes a reference on the buffer. Queue submission locks the
> > buffer, acquires a wait_list reference and moves to the wait list if one
> > is specified and submits the buffer. Buffer I/O completion drops the
> > original queue reference. If the wait_list was specified, we cycle the
> > buffer lock (to wait for I/O completion), remove from the wait_list and
> > drop the wait_list reference.
> >
> > The pushbuf() call moves the buffer to the submit_list under the buffer
> > lock, which is effectively another local delwri queue. This takes
> > ownership of the original delwri queue reference. Queue submission
> > occurs as above: the buffer is moved to the wait_list, an associated
> > wait_list reference is acquired and the buffer is submitted. I/O
> > completion drops the original delwri queue reference as before, so we
> > are in the same state as delwri_submit() up through I/O completion.
> > 
> > We again cycle the buf lock to wait on I/O completion and process an
> > error. Rather than remove from the wait_list and drop the reference,
> > however, pushbuf() has the buffer back on the original delwri queue with
> > an associated reference. The only explicit state change here is the
> > DELWRI_Q flag, which is reset.
> > 
> > So am I missing something here with regard to proper reference
> > counting..? If so, please elaborate and I'll try to dig into it.
> 
> IOWs, you're abusing the waitlist reference count added deep inside
> xfs_buf_delwri_submit_buffers() to handle the fact
> that the IO submission removes the delwri queue reference count that
> the buffer had on entry to the function, but still needs on exit to
> the function.
> 
> I guarantee we'll break this in future - it's too subtle and fragile
> for core infrastructure. Also, there's no comments to explain any of
> these games being played, which makes it even less maintainable...
> 

Heh. Yeah, this is not clear from the code. We should have a comment
here to walk through the reference count semantics.
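
Something along these lines, perhaps (my own summary of the counts, so
treat it as a reading of the current code rather than documentation):

	/*
	 * Reference counting across xfs_buf_delwri_pushbuf():
	 *
	 * on entry:	the buffer holds one delwri queue reference and
	 *		sits locked on the caller's buffer_list
	 * submit:	_delwri_submit_buffers() takes a wait list hold,
	 *		moves the buffer back onto buffer_list (reused as
	 *		the wait list) and submits the I/O
	 * completion:	async I/O completion drops the original delwri
	 *		queue reference
	 * on return:	the wait list reference is inherited as the new
	 *		delwri queue reference, so only _XBF_DELWRI_Q
	 *		needs to be reset
	 */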

That aside... if we drop the wait_list optimization, as mentioned
earlier, we'd basically do what delwri_submit() does with the exception
of moving the buffer to a private delwri queue for submission and moving
it back to the original delwri queue (from the wait_list) after
acquiring the buffer lock (to wait on the I/O). We could even do the
moves in the caller and the only change that would be required to the
delwri queue infrastructure is effectively to export
delwri_submit_buffers() (I'd probably suggest an
xfs_buf_delwri_submit_waitlist() variant or some such instead, but that
is beside the point).

The broader point here is that this code isn't doing much fundamentally
different from the existing delwri infrastructure, certainly not enough
to qualify this as fragile and the existing mechanism as not, so I think
that is a bit over the top. Another option to consider here is that this
could all easily be refactored to open code the buffer submission from
pushbuf(), which perhaps would localize the reference count bump to the
pushbuf() function rather than reuse the delwri_submit_buffers() logic.
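
For example, an open-coded variant could write the buffer synchronously
under the buffer lock we already hold. Rough sketch only, assuming
xfs_bwrite() semantics (synchronous write of a locked buffer that
returns with the buffer still locked); note that xfs_bwrite() shuts the
filesystem down on write error, which is a behavior change from the
delwri submission path:

int
xfs_buf_delwri_pushbuf(
	struct xfs_buf		*bp,
	struct list_head	*buffer_list)
{
	int			error;

	ASSERT(xfs_buf_islocked(bp));
	ASSERT(bp->b_flags & _XBF_DELWRI_Q);

	/*
	 * The buffer never leaves the caller's delwri queue (buffer_list)
	 * and keeps its queue reference. xfs_bwrite() clears _XBF_DELWRI_Q
	 * for the duration of the I/O, but nothing else can observe that
	 * because we hold the buffer lock across the synchronous write.
	 */
	error = xfs_bwrite(bp);
	bp->b_flags |= _XBF_DELWRI_Q;
	xfs_buf_unlock(bp);

	return error;
}

That keeps the wait list and its reference count games out of the
picture entirely, at the cost of a second I/O submission path for
delwri queued buffers.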

> 
> > > > +	/*
> > > > +	 * Lock the buffer to wait for I/O completion. It's already held on the
> > > > +	 * original list, so all we have to do is reset the delwri queue flag
> > > > +	 * that was cleared by delwri submission.
> > > > +	 */
> > > > +	xfs_buf_lock(bp);
> > > > +	error = bp->b_error;
> > > > +	bp->b_flags |= _XBF_DELWRI_Q;
> > > > +	xfs_buf_unlock(bp);
> > > 
> > > .... this is racy w.r.t. the buffer going back onto the
> > > buffer list without holding the buffer lock, or that the
> > > _XBF_DELWRI_Q setting/clearing is not atomic w.r.t. the delwri queue
> > > manipulations (i.e. can now be on the delwri list but not have
> > > _XBF_DELWRI_Q set)?
> > > 
> > 
> > The buffer has been put back on the original delwri queue (by
> > _delwri_submit_buffers()) before the lock was dropped. Using the delwri
> > queue as the wait_list does mean that the flag might not be set until
> > the delwri queue processing is done and we have returned to the caller.
> > If another delwri queue occurs, the flag may be reset but the buffer
> > cannot be added to another queue, which is what we want (as noted
> > above).
> 
> Yes, I know this is what happens. My point is that the internal
> _XBF_DELWRI_Q no longer indicates that a buffer is on a delwri queue
> (i.e. that bp->b_list is valid). IOWs, if we want to check a buffer
> is on a delwri queue, we can no longer just check the flag, we have
> to check both the flag and list_empty(bp->b_list) because they are
> no longer atomically updated.
> 

It is true that this code puts the buffer through a state where
_XBF_DELWRI_Q cannot be used to verify the buffer is on a delwri queue.
It is also true that today the absence of _XBF_DELWRI_Q cannot be used
to verify that a buffer can actually be put onto a delwri queue.

IOW, any code that truly cares about delwri queue state probably needs
to check both as it is, particularly any code that actually wants to
queue the buffer locally.

> To me this is a design rule violation, regardless of whether the
> code works or not...
> 
> > (I'm not sure why delwri_queue() returns true in that case, but that's a
> > separate issue.)
> > 
> > > .... the error is left on the buffer so it gets tripped over when
> > > it is next accessed?
> > > 
> > 
> > Same as delwri_submit(). IIUC, errors are left on buffers by design and
> > cleared on a subsequent I/O submission.
> 
> That doesn't mean it's the right thing to do here. Think about how
> the error should be handled in this case, don't copy something from a
> bulk IO submission interface....
> 

I'm not following what you are suggesting here.

> > > > +	/*
> > > > +	 * The only way the dquot is already flush locked by the time quotacheck
> > > > +	 * gets here is if reclaim flushed it before the dqadjust walk dirtied
> > > > +	 * it for the final time. Quotacheck collects all dquot bufs in the
> > > > +	 * local delwri queue before dquots are dirtied, so reclaim can't have
> > > > +	 * possibly queued it for I/O. The only way out is to push the buffer to
> > > > +	 * cycle the flush lock.
> > > > +	 */
> > > > +	if (!xfs_dqflock_nowait(dqp)) {
> > > > +		/* buf is pinned in-core by delwri list */
> > > > +		DEFINE_SINGLE_BUF_MAP(map, dqp->q_blkno,
> > > > +				      mp->m_quotainfo->qi_dqchunklen);
> > > > +		bp = _xfs_buf_find(mp->m_ddev_targp, &map, 1, 0, NULL);
> > > > +		if (!bp) {
> > > > +			error = -EINVAL;
> > > > +			goto out_unlock;
> > > > +		}
> > > > +
> > > > +		/* delwri_pushbuf drops the buf lock */
> > > > +		xfs_buf_delwri_pushbuf(bp, buffer_list);
> > > 
> > > Ummm - you threw away the returned error....
> > > 
> > 
> > The I/O is going to be retried, but I'll fix it to just return from
> > here.
> > 
> > > > +		xfs_buf_rele(bp);
> > > 
> > > And despite the comment, I think this is simply wrong. We try really
> > > hard to maintain balanced locking, and as such xfs_buf_relse() is
> > > what goes along with _xfs_buf_find(). i.e. we are returned a locked
> > > buffer with a reference by xfs_buf_find(), but xfs_buf_rele() only
> > > drops the reference.
> > > 
> > 
> > I'm not quite following if you are saying the locking is unbalanced or
> > it simply looks that way. If the former, can you please elaborate?
> 
> It's not obviously correct, which to me is a bug. If I read this
> code, you're forcing me to go read more code to work out why it is
> correct because it doesn't follow the expected patterns. And then
> the reader disappears down a hole for hours trying to work out how
> this code actually works.
> 

I do agree that maintaining the lock state through the pushbuf() call is
probably more clear than what the patch does currently (e.g., I agree
with your core point to justify adjusting how the locking is done here).

That said, the buffer locking in xfs_buf_delwri_pushbuf() can be
evaluated pretty quickly between reading through the former function and
xfs_buf_delwri_submit[_buffers](), if one is familiar with the delwri
queue infrastructure (which you are). So the idea that this requires
hours of validation time because the locking pattern is quirky is also a
little over the top.

> > Otherwise, we can probably have pushbuf() return with the buffer lock
> > held so the caller remains responsible for the lock and reference it
> > acquired.
> >
> > > So if I look at this code in isolation, it looks like it leaks a
> > > buffer lock, and now I have to go read other code to understand why
> > > it doesn't and I'm left to wonder why it was implemented this
> > > way....
> > > 
> > 
> > Thanks for taking the time to send feedback. FWIW, I take most of these
> > questions as "things one would have to verify in order to review this
> > patch if one were not on vacation" :P as opposed to pointing things out
> > as explicitly broken.
> 
> Keep in mind that while the code /might work/, that doesn't mean it
> can be merged. The more I look at the patch and think about it, the
> more strongly I feel that we need to "go back to the drawing board
> and try again".
> 

This is fair enough. "I don't like the approach" as a subjective
argument is more concrete (IMO) than artificially inflating some of the
implementation details into architectural flaws. That at least saves me
the time spent working out what those arguments mean and whether they
legitimately apply to the changes made in the code, and/or refactoring
the code to try to deal with what, to me, appears to mostly boil down
to confusion/complaints about code factoring (only to eventually
determine that the core argument is subjective).

> So, rather than just being a negative prick that says no and doesn't
> provide any real help, how about considering this line of
> thinking....
> 

That would certainly be more productive. This is not the only solution
I've considered, but clearly it is one of only a couple or so I'm aware
of so far that I consider any better than just leaving the code as is.
In fact, I had pretty much given up on this before Darrick happened to
pick up review of it.

> We hold the buffer locked, the dquot is locked and the dquot is
> flush locked already, right? We know the IO has not been submitted
> because we have the buffer lock and the dquot is still flush locked,
> which means /some/ dirty dquot data is already in the buffer.
> 

Essentially. The buffer is queued and the dquot flush locked by reclaim.
By the time we get to the problematic scenario, we acquire the dquot
lock and can acquire the buffer lock in response to flush lock failure.
The buffer cannot be under I/O because we hold it on the local delwri
queue.

> So why can't we just modify the dquot in the buffer?  We already
> hold all the locks needed to guarantee exclusive access to the dquot
> and buffer, and they are all we hold when we do the initial flush to
> the buffer. So why do we need to do IO to unlock the dquot flush
> lock when we could just rewrite before we submit the buffer for IO?
> 

Are you suggesting to essentially ignore the flush lock? I suppose we
could get away with this in dq_flush_one() since it is only called from
quotacheck, but we may have to kill off any assertions that expect the
flush lock in xfs_dqflush(), and perhaps refactor that code to accept a
locked buffer as a param.

I don't see anything that would break off hand, but my first reaction is
it sounds more hackish than this patch or the previous patch that just
disabled reclaim during quotacheck. In fact, if we're going to hack
around avoiding the flush lock, why not just hack up xfs_buf_submit() to
allow submission of a buffer while it remains on a delwri queue? AFAICT,
that would only require removing an assert and not clearing the flag on
I/O completion. I'm not convinced I'd post something for either
approach, but I'd have to think about it some more.
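
For the sake of discussion, that approach might look something like the
following in xfs_qm_flush_one(). Sketch only: xfs_qm_dqflush_locked_buf()
is a hypothetical stand-in for an xfs_dqflush() variant refactored to
write into an already-locked buffer without taking the flush lock:

	if (!xfs_dqflock_nowait(dqp)) {
		/* buf is pinned in-core by the quotacheck delwri queue */
		DEFINE_SINGLE_BUF_MAP(map, dqp->q_blkno,
				      mp->m_quotainfo->qi_dqchunklen);
		bp = _xfs_buf_find(mp->m_ddev_targp, &map, 1, 0, NULL);
		if (!bp) {
			error = -EINVAL;
			goto out_unlock;
		}

		/*
		 * We hold the dquot lock and the buffer lock. Reclaim holds
		 * the flush lock but cannot submit the buffer because it
		 * sits on our local delwri queue, so rewrite the on-disk
		 * dquot in the buffer and leave it queued for the final
		 * delwri submission.
		 */
		error = xfs_qm_dqflush_locked_buf(dqp, bp); /* hypothetical */
		xfs_buf_relse(bp);
		goto out_unlock;
	}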

> Indeed, let's go look at inodes for an example of this, 
> xfs_ifree_cluster() to be precise:
> 
>                 /*                                                               
>                  * We obtain and lock the backing buffer first in the process    
>                  * here, as we have to ensure that any dirty inode that we       
>                  * can't get the flush lock on is attached to the buffer.        
>                  * If we scan the in-memory inodes first, then buffer IO can     
>                  * complete before we get a lock on it, and hence we may fail    
>                  * to mark all the active inodes on the buffer stale.            
>                  */                                                              
>  .....
>                 /*                                                               
>                  * Walk the inodes already attached to the buffer and mark them  
>                  * stale. These will all have the flush locks held, so an        
>                  * in-memory inode walk can't lock them. By marking them all     
>                  * stale first, we will not attempt to lock them in the loop     
>                  * below as the XFS_ISTALE flag will be set.                     
>                  */                                                              
> .....
>                 /*                                                               
>                  * For each inode in memory attempt to add it to the inode       
>                  * buffer and set it up for being staled on buffer IO            
>                  * completion.  This is safe as we've locked out tail pushing    
>                  * and flushing by locking the buffer.                           
>                  *                                                               
>                  * We have already marked every inode that was part of a         
>                  * transaction stale above, which means there is no point in     
>                  * even trying to lock them.                                     
>                  */                                                              
> 
> IOWs, what we have here is precendence for modifying the flush
> locked objects attached to a buffer after first locking the buffer.
> dquots have the same transaction/flush model as inodes, so I'm
> pretty sure this will lead to the simplest, cleanest way to avoid
> this deadlock....
> 

The simplest approach to avoid the deadlock is to disable dquot reclaim
during quotacheck. ;)

On a quick scan through the inode code, it seems the above refer to
using the buffer lock to serialize invalidation of higher level objects
based on the buffer. IOW, to ensure that we don't leave "valid" in-core
inode objects referring to an inode metadata buffer that is about to be
invalidated.

So while we do set the stale state on presumably flush locked objects
attached to the buffer here, this doesn't exactly establish precedent
for modifying flush locked content (meaning the range of the buffer that
covers the flush locked inode). But if precedent is of particular
concern, the approach with the most historical precedent is probably the
one actually used in XFS to avoid this deadlock up until commit 43ff2122e6
(which is effectively what this patch implements :/).

Brian

> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
Dave Chinner April 18, 2017, 2:35 a.m. UTC | #10
On Tue, Apr 11, 2017 at 10:53:03AM -0400, Brian Foster wrote:
> On Tue, Apr 11, 2017 at 09:55:25AM +1000, Dave Chinner wrote:
> > We hold the buffer locked, the dquot is locked and the dquot is
> > flush locked already, right? We know the IO has not been submitted
> > because we have the buffer lock and the dquot is still flush locked,
> > which means /some/ dirty dquot data is already in the buffer.
> > 
> 
> Essentially. The buffer is queued and the dquot flush locked by reclaim.
> By the time we get to the problematic scenario, we acquire the dquot
> lock and can acquire the buffer lock in response to flush lock failure.
> The buffer cannot be under I/O because we hold it on the local delwri
> queue.

Seems to me that you've missed the fundamental reason that the
buffer can't be under IO. It's not because it's on some local
private queue, it can't be under IO because /we hold the buffer
lock/.

> > So why can't we just modify the dquot in the buffer?  We already
> > hold all the locks needed to guarantee exclusive access to the dquot
> > and buffer, and they are all we hold when we do the initial flush to
> > the buffer. So why do we need to do IO to unlock the dquot flush
> > lock when we could just rewrite before we submit the buffer for IO?
> > 
> 
> Are you suggesting to essentially ignore the flush lock? I suppose we
> could get away with this in dq_flush_one() since it is only called from
> quotacheck, but we may have to kill off any assertions that expect the
> flush lock in xfs_dqflush(), and perhaps refactor that code to accept a
> locked buffer as a param.

No, I'm not suggesting we ignore the flush lock. What the flush
"lock" does is prevent higher layer attempts to flush the dquot to
the backing buffer whilst IO may be in progress. Its higher level
function is to allow async log item state updates to be done
correctly w.r.t. relogging and flushing, as we allow transactional
modifications to inodes and dquots while they are under IO.

But to flush an object, we have to own the object lock, the flush
lock and the underlying buffer lock. To flush a dquot (or inode) we
need to, in order:

	a) lock the dquot (inode)
	b) obtain the flush lock
	c) obtain the buffer lock
	d) flush the dquot to the buffer
	e) update log item state
	f) unlock the buffer

IO is then effectively:

	a) lock buffer
	b) submit IO
	<IO completes>
	b) dquot callback unlocks flush lock
	a) buffer is unlocked

IOWs, the flush "lock" does not work on it's own. It's really a
completion notifier, not a lock, and it's actually implemented that
way now. The name is historical because it used a semaphore when
that was all Irix or Linux had to implement an async non-owner
object release. That is, when we lock and flush the object, we
effectively hand the "flush lock" to the buffer, as it is the buffer
that is responsible for "unlocking" it on IO completion.

The "lock" function prevents the higher layers from trying to update
the buffer until the buffer says "it's OK to modify me", but it does
not mean we cannot modify the buffer contents if we hold all the
correct context - the "modify the buffer" action is the innermost.

IOWs, if we lock the buffer and the buffer owns the flush lock state
for the object, then if we are able to lock the object, we now have
exclusive access to the logical object, the log item
associated with it, the backing buffer and the flush context lock.
And with all that context held, it is safe to flush the logical
object state to the backing buffer....

i.e. rather than locking from the top down to get to the state where
we can flush an object, we can lock objects from the bottom up and
end up in the same place...
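
Something like the following, perhaps - totally untested, with the CRC
update, log item state and error handling elided and the function name
made up - just to show the shape of it for the xfs_qm_flush_one() case
where the flush trylock fails:

STATIC int
xfs_qm_dqflush_again(
        struct xfs_dquot        *dqp)
{
        struct xfs_mount        *mp = dqp->q_mount;
        struct xfs_disk_dquot   *ddqp;
        struct xfs_buf          *bp;
        int                     error;

        ASSERT(XFS_DQ_IS_LOCKED(dqp));

        /*
         * Reclaim has already flushed this dquot and handed the flush
         * lock to the buffer sitting on the quotacheck delwri queue.
         * Lock the buffer - it cannot be under IO once we own the lock.
         */
        error = xfs_trans_read_buf(mp, NULL, mp->m_ddev_targp, dqp->q_blkno,
                                   mp->m_quotainfo->qi_dqchunklen, 0, &bp,
                                   &xfs_dquot_buf_ops);
        if (error)
                return error;

        /*
         * dquot lock + buffer lock + buffer-owned flush lock state
         * means it is safe to rewrite the redirtied in-core dquot over
         * the stale image reclaim flushed earlier.
         */
        ddqp = bp->b_addr + dqp->q_bufoffset;
        memcpy(ddqp, &dqp->q_core, sizeof(*ddqp));
        dqp->dq_flags &= ~XFS_DQ_DIRTY;

        xfs_buf_relse(bp);      /* buffer stays on the delwri queue */
        return 0;
}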

> I don't see anything that would break off hand, but my first reaction is
> it sounds more hackish than this patch or the previous patch that just
> disabled reclaim during quotacheck.

I thought we'd done that discussion. i.e. we can't disable reclaim
in quotacheck because that can set off the OOM killer...

> In fact, if we're going to hack
> around avoiding the flush lock, why not just hack up xfs_buf_submit() to
> allow submission of a buffer while it remains on a delwri queue? AFAICT,
> that would only require removing an assert and not clearing the flag on
> I/O completion. I'm not convinced I'd post something for either
> approach, but I'd have to think about it some more.
> 
> > Indeed, let's go look at inodes for an example of this, 
> > xfs_ifree_cluster() to be precise:
> > 
> >                 /*                                                               
> >                  * We obtain and lock the backing buffer first in the process    
> >                  * here, as we have to ensure that any dirty inode that we       
> >                  * can't get the flush lock on is attached to the buffer.        
> >                  * If we scan the in-memory inodes first, then buffer IO can     
> >                  * complete before we get a lock on it, and hence we may fail    
> >                  * to mark all the active inodes on the buffer stale.            
> >                  */                                                              
> >  .....
> >                 /*                                                               
> >                  * Walk the inodes already attached to the buffer and mark them  
> >                  * stale. These will all have the flush locks held, so an        
> >                  * in-memory inode walk can't lock them. By marking them all     
> >                  * stale first, we will not attempt to lock them in the loop     
> >                  * below as the XFS_ISTALE flag will be set.                     
> >                  */                                                              
> > .....
> >                 /*                                                               
> >                  * For each inode in memory attempt to add it to the inode       
> >                  * buffer and set it up for being staled on buffer IO            
> >                  * completion.  This is safe as we've locked out tail pushing    
> >                  * and flushing by locking the buffer.                           
> >                  *                                                               
> >                  * We have already marked every inode that was part of a         
> >                  * transaction stale above, which means there is no point in     
> >                  * even trying to lock them.                                     
> >                  */                                                              
> > 
> > IOWs, what we have here is precendence for modifying the flush
> > locked objects attached to a buffer after first locking the buffer.
> > dquots have the same transaction/flush model as inodes, so I'm
> > pretty sure this will lead to the simplest, cleanest way to avoid
> > this deadlock....
> > 
> 
> The simplest approach to avoid the deadlock is to disable dquot reclaim
> during quotacheck. ;)
> 
> On a quick scan through the inode code, it seems the above refer to
> using the buffer lock to serialize invalidation of higher level objects
> based on the buffer. IOW, to ensure that we don't leave "valid" in-core
> inode objects referring to an inode metadata buffer that is about to be
> invalidated.

Yes, we lock from the bottom up until we have exclusive access to
the objects. Invalidation is the /purpose/ of that code, not the
architectural mechanism used to lock the objects into a state where
we can then invalidate them.

Cheers,

Dave.
Brian Foster April 18, 2017, 1:55 p.m. UTC | #11
On Tue, Apr 18, 2017 at 12:35:48PM +1000, Dave Chinner wrote:
> On Tue, Apr 11, 2017 at 10:53:03AM -0400, Brian Foster wrote:
> > On Tue, Apr 11, 2017 at 09:55:25AM +1000, Dave Chinner wrote:
> > > We hold the buffer locked, the dquot is locked and the dquot is
> > > flush locked already, right? We know the IO has not been submitted
> > > because we have the buffer lock and the dquot is still flush locked,
> > > which means /some/ dirty dquot data is already in the buffer.
> > > 
> > 
> > Essentially. The buffer is queued and the dquot flush locked by reclaim.
> > By the time we get to the problematic scenario, we acquire the dquot
> > lock and can acquire the buffer lock in response to flush lock failure.
> > The buffer cannot be under I/O because we hold it on the local delwri
> > queue.
> 
> Seems to me that you've missed the fundamental reason that the
> buffer can't be under IO. It's not because it's on some local
> private queue, it can't be under IO because /we hold the buffer
> lock/.
> 

Yes, I'm aware that the buffer lock protects I/O submission. Poor
wording above, perhaps.

> > > So why can't we just modify the dquot in the buffer?  We already
> > > hold all the locks needed to guarantee exclusive access to the dquot
> > > and buffer, and they are all we hold when we do the initial flush to
> > > the buffer. So why do we need to do IO to unlock the dquot flush
> > > lock when we could just rewrite before we submit the buffer for IO?
> > > 
> > 
> > Are you suggesting to essentially ignore the flush lock? I suppose we
> > could get away with this in dq_flush_one() since it is only called from
> > quotacheck, but we may have to kill off any assertions that expect the
> > flush lock in xfs_dqflush(), and perhaps refactor that code to accept a
> > locked buffer as a param.
> 
> No, I'm not suggesting we ignore the flush lock. What the flush
> "lock" does is prevent higher layer attempts to flush the dquot to
> the backing buffer whilst IO may be in progress. It's higher level
> function is to allows async log item state updates to be done
> correctly w.r.t. relogging and flushing, as we allow transactional
> modifications to inodes and dquots while they are under IO.
> 

Ok. What I was trying to recall previously was some discussion around
the flush lock retry problem where it was noted that we can't unlock the
flush lock until the currently flushed state makes it to disk[1]. Is the
conceptual difference here that we are not proposing to unlock the flush
lock, but rather co-opt it to perform another flush?

[1] http://www.spinics.net/lists/linux-xfs/msg01248.html

> But to flush an object, we have to own the object lock, the flush
> lock and the underlying buffer lock. To flush a dquot (or inode) we
> need to, in order:
> 
> 	a) lock the dquot (inode)
> 	b) obtain the flush lock
> 	c) obtain the buffer lock
> 	d) flush the dquot to the buffer
> 	e) update log item state
> 	f) unlock the buffer
> 
> IO is then effectively:
> 
> 	a) lock buffer
> 	b) submit IO
> 	<IO completes>
> 	b) dquot callback unlocks flush lock
> 	a) buffer is unlocked
> 
> IOWs, the flush "lock" does not work on it's own. It's really a
> completion notifier, not a lock, and it's actually implemented that
> way now. The name is historical because it used a semaphore when
> that was all Irix or Linux had to implement an async non-owner
> object release. That is, when we lock and flush the object, we
> effective hand the "flush lock" to the buffer, as it is the buffer
> that is responsible for "unlocking" it on IO completion.
> 
> The "lock" function prevents the higher layers from trying to update
> the buffer until the buffer says "it's OK to modify me", but it does
> not mean we cannot modify the buffer contents if we hold all the
> correct context - the "modify the buffer" action is the inner most.
> 
> IOWs, if we lock the buffer and the buffer owns the flush lock state
> for the object, then if we are able to lock the object, we have now
> have exclusive access to both the logical object, the log item
> associated with it, the backing buffer and the flush context lock.
> And with all that context held, it is safe to flush the logical
> object state to the backing buffer....
> 
> i.e. rather than locking from the top down to get to the state where
> we can flush an object, we can lock objects from the bottom up and
> end up in the same place...
> 

I didn't claim the locking or overall proposal wouldn't work, so this
all makes sense to me from a serialization perspective. I think the
implementation would be more hackish than the current and previous
proposals, however. That is a subjective argument, of course. To be
fair, the current/previous proposals are also hackish, but what makes
this approach so much better that it is worth starting over on this
problem yet again?

IOWs, if the argument here is more substantive than "look, we can do
this differently," I'm not seeing it. In the meantime, this patch has
been reviewed, tested against quotacheck under various low memory
conditions and verified by users in the field who actually reproduce the
problem. It also has the benefit of historical precedent in that it is
effectively a partial revert of the patch that introduced the regression
in the first place.

> > I don't see anything that would break off hand, but my first reaction is
> > it sounds more hackish than this patch or the previous patch that just
> > disabled reclaim during quotacheck.
> 
> I thought we'd done that discussion. i.e. we can't disable reclaim
> in quotacheck because that can set off the OOM killer...
> 

Huh? The only reason disabling of dquot reclaim during quotacheck was
proposed in the first place is because it is 100% superfluous.
Quotacheck, by design, does not allow reclaim to free memory. Therefore
reclaim does not and afaict never has prevented or even mitigated OOM
during quotacheck.

Is OOM possible due to quotacheck? You bet. Could quotacheck be improved
to be more memory efficient? Most likely, but that's beyond the scope of
this deadlock regression. However, unless I'm missing something, we can
disable reclaim during quotacheck with no ill side effect.

Brian

P.S., To be completely clear about my position on this issue at this
point... given the amount of time I've already spent responding to
handwavy arguments (ultimately resulting in discussion trailing off
until a repost occurs), or experimenting with a known bogus quotacheck
rework (which is still left without feedback, I believe), etc.,
clarification on the correctness of this alternate approach (while
interesting) is not nearly convincing enough for me to start over on
this bug. I don't mind handwavy questions if the "caller" is receptive
to or attempting to process (or debate) clarification, but I don't get
the impression that is the case here.

If you feel strongly enough about a certain approach, feel free to just
send a patch. At this point, I'm happy to help review anything from the
sole perspective of technical correctness (regardless of whether I
think the approach is ugly), but I'm not personally wasting any more
time than I already have to implement and test such an approach without
a more logical and convincing argument. IMO, the feedback to this patch
doesn't fairly or reasonably justify the level of pushback.

> > In fact, if we're going to hack
> > around avoiding the flush lock, why not just hack up xfs_buf_submit() to
> > allow submission of a buffer while it remains on a delwri queue? AFAICT,
> > that would only require removing an assert and not clearing the flag on
> > I/O completion. I'm not convinced I'd post something for either
> > approach, but I'd have to think about it some more.
> > 
> > > Indeed, let's go look at inodes for an example of this, 
> > > xfs_ifree_cluster() to be precise:
> > > 
> > >                 /*                                                               
> > >                  * We obtain and lock the backing buffer first in the process    
> > >                  * here, as we have to ensure that any dirty inode that we       
> > >                  * can't get the flush lock on is attached to the buffer.        
> > >                  * If we scan the in-memory inodes first, then buffer IO can     
> > >                  * complete before we get a lock on it, and hence we may fail    
> > >                  * to mark all the active inodes on the buffer stale.            
> > >                  */                                                              
> > >  .....
> > >                 /*                                                               
> > >                  * Walk the inodes already attached to the buffer and mark them  
> > >                  * stale. These will all have the flush locks held, so an        
> > >                  * in-memory inode walk can't lock them. By marking them all     
> > >                  * stale first, we will not attempt to lock them in the loop     
> > >                  * below as the XFS_ISTALE flag will be set.                     
> > >                  */                                                              
> > > .....
> > >                 /*                                                               
> > >                  * For each inode in memory attempt to add it to the inode       
> > >                  * buffer and set it up for being staled on buffer IO            
> > >                  * completion.  This is safe as we've locked out tail pushing    
> > >                  * and flushing by locking the buffer.                           
> > >                  *                                                               
> > >                  * We have already marked every inode that was part of a         
> > >                  * transaction stale above, which means there is no point in     
> > >                  * even trying to lock them.                                     
> > >                  */                                                              
> > > 
> > > IOWs, what we have here is precendence for modifying the flush
> > > locked objects attached to a buffer after first locking the buffer.
> > > dquots have the same transaction/flush model as inodes, so I'm
> > > pretty sure this will lead to the simplest, cleanest way to avoid
> > > this deadlock....
> > > 
> > 
> > The simplest approach to avoid the deadlock is to disable dquot reclaim
> > during quotacheck. ;)
> > 
> > On a quick scan through the inode code, it seems the above refer to
> > using the buffer lock to serialize invalidation of higher level objects
> > based on the buffer. IOW, to ensure that we don't leave "valid" in-core
> > inode objects referring to an inode metadata buffer that is about to be
> > invalidated.
> 
> Yes, we lock from the bottom up until we have exclusive access to
> the objects. Invalidation is the /purpose/ of that code, not the
> architectural mechanism used to lock the objects into a state where
> we can then invalidate them.
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
Dave Chinner April 19, 2017, 2:46 a.m. UTC | #12
On Tue, Apr 18, 2017 at 09:55:31AM -0400, Brian Foster wrote:
> On Tue, Apr 18, 2017 at 12:35:48PM +1000, Dave Chinner wrote:
> > On Tue, Apr 11, 2017 at 10:53:03AM -0400, Brian Foster wrote:
> > > On Tue, Apr 11, 2017 at 09:55:25AM +1000, Dave Chinner wrote:
> > > > So why can't we just modify the dquot in the buffer?  We already
> > > > hold all the locks needed to guarantee exclusive access to the dquot
> > > > and buffer, and they are all we hold when we do the initial flush to
> > > > the buffer. So why do we need to do IO to unlock the dquot flush
> > > > lock when we could just rewrite before we submit the buffer for IO?
> > > > 
> > > 
> > > Are you suggesting to essentially ignore the flush lock? I suppose we
> > > could get away with this in dq_flush_one() since it is only called from
> > > quotacheck, but we may have to kill off any assertions that expect the
> > > flush lock in xfs_dqflush(), and perhaps refactor that code to accept a
> > > locked buffer as a param.
> > 
> > No, I'm not suggesting we ignore the flush lock. What the flush
> > "lock" does is prevent higher layer attempts to flush the dquot to
> > the backing buffer whilst IO may be in progress. It's higher level
> > function is to allows async log item state updates to be done
> > correctly w.r.t. relogging and flushing, as we allow transactional
> > modifications to inodes and dquots while they are under IO.
> > 
> 
> Ok. What I was trying to recall previously was some discussion around
> the flush lock retry problem where it was noted that we can't unlock the
> flush lock until the currently flushed state makes it to disk[1]. Is the
> conceptual difference here that we are not proposing to unlock the flush
> lock, but rather co-opt it to perform another flush?

I wouldn't use that wording, but you've got the idea.

[...]

> > > I don't see anything that would break off hand, but my first reaction is
> > > it sounds more hackish than this patch or the previous patch that just
> > > disabled reclaim during quotacheck.
> > 
> > I thought we'd done that discussion. i.e. we can't disable reclaim
> > in quotacheck because that can set off the OOM killer...
> > 
> 
> Huh? The only reason disabling of dquot reclaim during quotacheck was
> proposed in the first place is because it is 100% superfluous.

ISTR that we broke dquot reclaim during quotacheck by moving to
private delwri lists. I'm working under the impression that dquot
reclaim during quotacheck used to work just fine. Maybe I'm wrong,
but ....

> Quotacheck, by design, does not allow reclaim to free memory. Therefore
> reclaim does not and afaict never has prevented or even mitigated OOM
> during quotacheck.

.... the intent has always been to allow dquot reclaim to run when
quotacheck is active because we've had our fair share of OOM
problems in quotacheck over the past 10 years. Bugs notwithstanding,
we really should be trying to ensure the code fulfils that intent
rather than sweep it under the carpet and tell users "too bad, so
sad" when quotacheck OOMs...

[...]

> P.S., To be completely clear of my position on this issue at this
> point... given the amount of time I've already spent responding to
> handwavy arguments (ultimately resulting in discussing trailing off
> until a repost occurs), or experimenting with a known bogus quotacheck
> rework (which is still left without feedback, I believe), etc.,
> clarification on the correctness of this alternate approach (while
> interesting) is not nearly convincing enough for me to start over on
> this bug. I don't mind handwavy questions if the "caller" is receptive
> to or attempting to process (or debate) clarification, but I don't get
> the impression that is the case here.
> 
> If you feel strongly enough about a certain approach, feel free to just
> send a patch. At this point, I'm happy to help review anything from the
> sole perspective of technical correctness (regardless of whether the I
> think the approach is ugly), but I'm not personally wasting any more
> time than I already have to implement and test such an approach without
> a more logical and convincing argument. IMO, the feedback to this patch
> doesn't fairly or reasonably justify the level of pushback.

I just responded to the code that was posted, pointing out a
list of things that concerned me and, I thought, we've been working
through that quite amicably.

Really, it is up to the maintainer whether to merge the code or not.
That's not me - I'm now just someone who knows the code and its
history.  This is where the maintainer needs to step in and make a
decision one way or the other....

-Dave.
Darrick J. Wong April 19, 2017, 7:55 p.m. UTC | #13
On Wed, Apr 19, 2017 at 12:46:49PM +1000, Dave Chinner wrote:
> On Tue, Apr 18, 2017 at 09:55:31AM -0400, Brian Foster wrote:
> > On Tue, Apr 18, 2017 at 12:35:48PM +1000, Dave Chinner wrote:
> > > On Tue, Apr 11, 2017 at 10:53:03AM -0400, Brian Foster wrote:
> > > > On Tue, Apr 11, 2017 at 09:55:25AM +1000, Dave Chinner wrote:
> > > > > So why can't we just modify the dquot in the buffer?  We already
> > > > > hold all the locks needed to guarantee exclusive access to the dquot
> > > > > and buffer, and they are all we hold when we do the initial flush to
> > > > > the buffer. So why do we need to do IO to unlock the dquot flush
> > > > > lock when we could just rewrite before we submit the buffer for IO?
> > > > > 
> > > > 
> > > > Are you suggesting to essentially ignore the flush lock? I suppose we
> > > > could get away with this in dq_flush_one() since it is only called from
> > > > quotacheck, but we may have to kill off any assertions that expect the
> > > > flush lock in xfs_dqflush(), and perhaps refactor that code to accept a
> > > > locked buffer as a param.
> > > 
> > > No, I'm not suggesting we ignore the flush lock. What the flush
> > > "lock" does is prevent higher layer attempts to flush the dquot to
> > > the backing buffer whilst IO may be in progress. It's higher level
> > > function is to allows async log item state updates to be done
> > > correctly w.r.t. relogging and flushing, as we allow transactional
> > > modifications to inodes and dquots while they are under IO.
> > > 
> > 
> > Ok. What I was trying to recall previously was some discussion around
> > the flush lock retry problem where it was noted that we can't unlock the
> > flush lock until the currently flushed state makes it to disk[1]. Is the
> > conceptual difference here that we are not proposing to unlock the flush
> > lock, but rather co-opt it to perform another flush?
> 
> I wouldn't use that wording, but you've got the idea.
> 
> [...]
> 
> > > > I don't see anything that would break off hand, but my first reaction is
> > > > it sounds more hackish than this patch or the previous patch that just
> > > > disabled reclaim during quotacheck.
> > > 
> > > I thought we'd done that discussion. i.e. we can't disable reclaim
> > > in quotacheck because that can set off the OOM killer...
> > > 
> > 
> > Huh? The only reason disabling of dquot reclaim during quotacheck was
> > proposed in the first place is because it is 100% superfluous.
> 
> ISTR that we broke dquot reclaim during quotacheck by moving to
> private delwri lists. I'm working under the impression that dquot
> reclaim during quotacheck used to work just fine. maybe I'm wrong,
> but ....
> 
> > Quotacheck, by design, does not allow reclaim to free memory. Therefore
> > reclaim does not and afaict never has prevented or even mitigated OOM
> > during quotacheck.
> 
> .... the intent has always been to allow dquot reclaim to run when
> quotacheck is active because we've had our fair share of OOM
> problems in quotacheck over the past 10 years. Bugs notwithstanding,
> we really should be trying to ensure the code fulfils that intent
> rather than sweep it under the carpet and tell users "too bad, so
> sad" when quotacheck OOMs...
> 
> [...]
> 
> > P.S., To be completely clear of my position on this issue at this
> > point... given the amount of time I've already spent responding to
> > handwavy arguments (ultimately resulting in discussing trailing off
> > until a repost occurs), or experimenting with a known bogus quotacheck
> > rework (which is still left without feedback, I believe), etc.,
> > clarification on the correctness of this alternate approach (while
> > interesting) is not nearly convincing enough for me to start over on
> > this bug. I don't mind handwavy questions if the "caller" is receptive
> > to or attempting to process (or debate) clarification, but I don't get
> > the impression that is the case here.
> > 
> > If you feel strongly enough about a certain approach, feel free to just
> > send a patch. At this point, I'm happy to help review anything from the
> > sole perspective of technical correctness (regardless of whether the I
> > think the approach is ugly), but I'm not personally wasting any more
> > time than I already have to implement and test such an approach without
> > a more logical and convincing argument. IMO, the feedback to this patch
> > doesn't fairly or reasonably justify the level of pushback.
> 
> I just responded to the code that was posted, pointing out a
> list of things that concerned me and, I thought, we've been working
> through that quite amicably.
> 
> Really, it is up to the maintainer whether to merge the code or not.
> That's not me - I'm now just someone who knows the code and it's
> history.  This is where the maintainer needs to step in and make a
> decision one way or the other....

So I do have a few (more) comments:

The first point I have to make is that the quotacheck OOM still seems to
happen infrequently, so at the point that this thread started going, I
was (and still am) fine with letting the discussion continue until we
run out of questions. :)

As far as the patch itself goes, it took me a while to figure out what's
going on with the delwri buffer list shuffling.  The part where the
buffer comes back from xfs_buf_delwri_submit_buffers during _pushbuf
still attached to a delwri queue and yet we still have to re-add the
DELWRI_Q flag raised my eyebrows.  I thought it was a little odd, but
in those particular circumstances (quotacheck and reclaim) I didn't find
anything that made me think it would fail, even if it probably wasn't
what the designers had in mind.

However, Dave's comments inspired me to take a second look.  Sorry if
this has been covered already, but what if dquot_isolate noticed when
xfs_buf_delwri_queue doesn't actually put the buffer on isol->buffers
(i.e. something else already put it on another delwri queue, which
afaict only happens during quotacheck?) and simply dqfunlock's the
buffer?  Reclaim will have already formatted the in-core dquot into the
on-disk buffer and can free the dquot, and quotacheck is still on the
hook to issue the IO.  (FWIW even that doesn't quite smell right since
it seems weird to push metadata to the buffer layer but someone else
writes it to disk.)

In any case, it would be helpful to document how the delwri queues work,
whether or not the list is allowed to change, and what relation the
_XBF_DELWRI_Q flag has to the value of b_list.  For example, I didn't
know that one wasn't supposed to bounce buffers between lists.  Can that
documentation be part of whatever patch series we end up committing,
please?

(It's hard for me to make decisions when I know I don't have enough info...)

Also, I think Brian said that he'd post updated patches with at least a
fix to a problem with dropped xfs_buf_delwri_pushbuf return codes, so
I'd still like to see that.

As far as testing goes I'm happ(ier) that /one/ of the original
complainants says that he could move on with the patches applied.  I
haven't observed any breakage w/ xfstests, though that doesn't prove it
right.

--D

> 
> -Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
Brian Foster April 19, 2017, 8:40 p.m. UTC | #14
On Wed, Apr 19, 2017 at 12:46:49PM +1000, Dave Chinner wrote:
> On Tue, Apr 18, 2017 at 09:55:31AM -0400, Brian Foster wrote:
> > On Tue, Apr 18, 2017 at 12:35:48PM +1000, Dave Chinner wrote:
> > > On Tue, Apr 11, 2017 at 10:53:03AM -0400, Brian Foster wrote:
> > > > On Tue, Apr 11, 2017 at 09:55:25AM +1000, Dave Chinner wrote:
> > > > > So why can't we just modify the dquot in the buffer?  We already
> > > > > hold all the locks needed to guarantee exclusive access to the dquot
> > > > > and buffer, and they are all we hold when we do the initial flush to
> > > > > the buffer. So why do we need to do IO to unlock the dquot flush
> > > > > lock when we could just rewrite before we submit the buffer for IO?
> > > > > 
> > > > 
> > > > Are you suggesting to essentially ignore the flush lock? I suppose we
> > > > could get away with this in dq_flush_one() since it is only called from
> > > > quotacheck, but we may have to kill off any assertions that expect the
> > > > flush lock in xfs_dqflush(), and perhaps refactor that code to accept a
> > > > locked buffer as a param.
> > > 
> > > No, I'm not suggesting we ignore the flush lock. What the flush
> > > "lock" does is prevent higher layer attempts to flush the dquot to
> > > the backing buffer whilst IO may be in progress. It's higher level
> > > function is to allows async log item state updates to be done
> > > correctly w.r.t. relogging and flushing, as we allow transactional
> > > modifications to inodes and dquots while they are under IO.
> > > 
> > 
> > Ok. What I was trying to recall previously was some discussion around
> > the flush lock retry problem where it was noted that we can't unlock the
> > flush lock until the currently flushed state makes it to disk[1]. Is the
> > conceptual difference here that we are not proposing to unlock the flush
> > lock, but rather co-opt it to perform another flush?
> 
> I wouldn't use that wording, but you've got the idea.
> 
> [...]
> 

Ok.

> > > > I don't see anything that would break off hand, but my first reaction is
> > > > it sounds more hackish than this patch or the previous patch that just
> > > > disabled reclaim during quotacheck.
> > > 
> > > I thought we'd done that discussion. i.e. we can't disable reclaim
> > > in quotacheck because that can set off the OOM killer...
> > > 
> > 
> > Huh? The only reason disabling of dquot reclaim during quotacheck was
> > proposed in the first place is because it is 100% superfluous.
> 
> ISTR that we broke dquot reclaim during quotacheck by moving to
> private delwri lists. I'm working under the impression that dquot
> reclaim during quotacheck used to work just fine. maybe I'm wrong,
> but ....
> 

As noted in the commit log for this patch, the commit that appears to
introduce the regression is 43ff2122e6 ("xfs: on-stack delayed write
buffer lists"), which I believe introduces (or is part of a series that
introduces) the private delwri queue bits.

I should correct my previous statement... it's not clear to me whether
dquot reclaim during quotacheck worked at some time before this. It may
very well have; I'm not really familiar with the pre-private delwri
queue bits. I've only gone back as far as that commit, enough to see
that quotacheck previously had a mechanism to deal with flush locked
buffers, and that since the commit I don't see how dquot reclaim has had
the ability to free memory during quotacheck. My intent is more to point
out that this apparently has been the state of things for quite some
time.

Either way, this is not terribly important as this patch replaces the
original "disable reclaim during quotacheck" variant (IOW, I'll just
accept that it worked at some point in the past). I was originally
responding to the claim that the whole bottom up locking thing was
somehow notably simpler than what has been proposed so far.

> > Quotacheck, by design, does not allow reclaim to free memory. Therefore
> > reclaim does not and afaict never has prevented or even mitigated OOM
> > during quotacheck.
> 
> .... the intent has always been to allow dquot reclaim to run when
> quotacheck is active because we've had our fair share of OOM
> problems in quotacheck over the past 10 years. Bugs notwithstanding,
> we really should be trying to ensure the code fulfils that intent
> rather than sweep it under the carpet and tell users "too bad, so
> sad" when quotacheck OOMs...
> 

Apparently not so much in the last... looks like almost 5 years now. ;)

But as noted (multiple times) previously, this is not mutually exclusive
with fixing the deadlock, and I would have been happy to get the deadlock
fixed and follow up looking at improving quotacheck efficiency.

> [...]
> 
> > P.S., To be completely clear of my position on this issue at this
> > point... given the amount of time I've already spent responding to
> > handwavy arguments (ultimately resulting in discussing trailing off
> > until a repost occurs), or experimenting with a known bogus quotacheck
> > rework (which is still left without feedback, I believe), etc.,
> > clarification on the correctness of this alternate approach (while
> > interesting) is not nearly convincing enough for me to start over on
> > this bug. I don't mind handwavy questions if the "caller" is receptive
> > to or attempting to process (or debate) clarification, but I don't get
> > the impression that is the case here.
> > 
> > If you feel strongly enough about a certain approach, feel free to just
> > send a patch. At this point, I'm happy to help review anything from the
> > sole perspective of technical correctness (regardless of whether the I
> > think the approach is ugly), but I'm not personally wasting any more
> > time than I already have to implement and test such an approach without
> > a more logical and convincing argument. IMO, the feedback to this patch
> > doesn't fairly or reasonably justify the level of pushback.
> 
> I just responded to the code that was posted, pointing out a
> list of things that concerned me and, I thought, we've been working
> through that quite amicably.
> 

I have noted a few useful updates to this patch, but otherwise don't see
anything to suggest that posting this again will get it anywhere close
to being merged. The previous mail actually states that this "can't be
merged." :P

Be that as it may, the justifications for that objection appear to me to
be mostly unsubstantiated, not fleshed out (IMO), and lead to an
alternative proposal that doesn't seem any less hacky to me. Therefore,
those reasons aren't enough to justify starting over when I've 1.)
done that twice already, 2.) already invested considerable time into a
fix that I consider more appropriate and relatively straightforward, and
3.) invested similarly to ensure that said fix actually works.

So feel free to either propose an approach that is sufficiently
attractive to warrant further investigation (I don't have any other
ideas at the moment), or just send a patch that implements your
preferred approach. As mentioned, I'll help review whatever for
technical correctness at this point. I think it's fair to call that
"amicable disagreement." ;)

> Really, it is up to the maintainer whether to merge the code or not.
> That's not me - I'm now just someone who knows the code and it's
> history.  This is where the maintainer needs to step in and make a
> decision one way or the other....
> 

I don't think anybody is going to suggest we merge a patch that you (or
any consistent XFS code reviewer) effectively object to. I certainly
wouldn't ask Darrick to do that, nor would I take that approach if I
were in his shoes.

Brian

> -Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
Brian Foster April 19, 2017, 8:46 p.m. UTC | #15
On Wed, Apr 19, 2017 at 12:55:19PM -0700, Darrick J. Wong wrote:
> On Wed, Apr 19, 2017 at 12:46:49PM +1000, Dave Chinner wrote:
> > On Tue, Apr 18, 2017 at 09:55:31AM -0400, Brian Foster wrote:
> > > On Tue, Apr 18, 2017 at 12:35:48PM +1000, Dave Chinner wrote:
> > > > On Tue, Apr 11, 2017 at 10:53:03AM -0400, Brian Foster wrote:
> > > > > On Tue, Apr 11, 2017 at 09:55:25AM +1000, Dave Chinner wrote:
> > > > > > So why can't we just modify the dquot in the buffer?  We already
> > > > > > hold all the locks needed to guarantee exclusive access to the dquot
> > > > > > and buffer, and they are all we hold when we do the initial flush to
> > > > > > the buffer. So why do we need to do IO to unlock the dquot flush
> > > > > > lock when we could just rewrite before we submit the buffer for IO?
> > > > > > 
> > > > > 
> > > > > Are you suggesting to essentially ignore the flush lock? I suppose we
> > > > > could get away with this in dq_flush_one() since it is only called from
> > > > > quotacheck, but we may have to kill off any assertions that expect the
> > > > > flush lock in xfs_dqflush(), and perhaps refactor that code to accept a
> > > > > locked buffer as a param.
> > > > 
> > > > No, I'm not suggesting we ignore the flush lock. What the flush
> > > > "lock" does is prevent higher layer attempts to flush the dquot to
> > > > the backing buffer whilst IO may be in progress. It's higher level
> > > > function is to allows async log item state updates to be done
> > > > correctly w.r.t. relogging and flushing, as we allow transactional
> > > > modifications to inodes and dquots while they are under IO.
> > > > 
> > > 
> > > Ok. What I was trying to recall previously was some discussion around
> > > the flush lock retry problem where it was noted that we can't unlock the
> > > flush lock until the currently flushed state makes it to disk[1]. Is the
> > > conceptual difference here that we are not proposing to unlock the flush
> > > lock, but rather co-opt it to perform another flush?
> > 
> > I wouldn't use that wording, but you've got the idea.
> > 
> > [...]
> > 
> > > > > I don't see anything that would break off hand, but my first reaction is
> > > > > it sounds more hackish than this patch or the previous patch that just
> > > > > disabled reclaim during quotacheck.
> > > > 
> > > > I thought we'd done that discussion. i.e. we can't disable reclaim
> > > > in quotacheck because that can set off the OOM killer...
> > > > 
> > > 
> > > Huh? The only reason disabling of dquot reclaim during quotacheck was
> > > proposed in the first place is because it is 100% superfluous.
> > 
> > ISTR that we broke dquot reclaim during quotacheck by moving to
> > private delwri lists. I'm working under the impression that dquot
> > reclaim during quotacheck used to work just fine. maybe I'm wrong,
> > but ....
> > 
> > > Quotacheck, by design, does not allow reclaim to free memory. Therefore
> > > reclaim does not and afaict never has prevented or even mitigated OOM
> > > during quotacheck.
> > 
> > .... the intent has always been to allow dquot reclaim to run when
> > quotacheck is active because we've had our fair share of OOM
> > problems in quotacheck over the past 10 years. Bugs notwithstanding,
> > we really should be trying to ensure the code fulfils that intent
> > rather than sweep it under the carpet and tell users "too bad, so
> > sad" when quotacheck OOMs...
> > 
> > [...]
> > 
> > > P.S., To be completely clear of my position on this issue at this
> > > point... given the amount of time I've already spent responding to
> > > handwavy arguments (ultimately resulting in discussing trailing off
> > > until a repost occurs), or experimenting with a known bogus quotacheck
> > > rework (which is still left without feedback, I believe), etc.,
> > > clarification on the correctness of this alternate approach (while
> > > interesting) is not nearly convincing enough for me to start over on
> > > this bug. I don't mind handwavy questions if the "caller" is receptive
> > > to or attempting to process (or debate) clarification, but I don't get
> > > the impression that is the case here.
> > > 
> > > If you feel strongly enough about a certain approach, feel free to just
> > > send a patch. At this point, I'm happy to help review anything from the
> > > sole perspective of technical correctness (regardless of whether the I
> > > think the approach is ugly), but I'm not personally wasting any more
> > > time than I already have to implement and test such an approach without
> > > a more logical and convincing argument. IMO, the feedback to this patch
> > > doesn't fairly or reasonably justify the level of pushback.
> > 
> > I just responded to the code that was posted, pointing out a
> > list of things that concerned me and, I thought, we've been working
> > through that quite amicably.
> > 
> > Really, it is up to the maintainer whether to merge the code or not.
> > That's not me - I'm now just someone who knows the code and it's
> > history.  This is where the maintainer needs to step in and make a
> > decision one way or the other....
> 
> So I do have a few (more) comments:
> 
> The first point I have to make is that the quotacheck OOM still seems to
> happen infrequently, so at the point that this thread started going, I
> was (and still am) fine with letting the discussion continue until we
> run out of questions. :)
> 
> As far as the patch itself goes, it took me a while to figure out what's
> going on with the delwri buffer list shuffling.  The part where the
> buffer comes back from xfs_buf_delwri_submit_buffers during _pushbuf
> still attached to a delwri queue and yet we still have to re-add the
> DELWRI_Q flag raised my eyebrows.  I thought it was a little odd, but
> in those particular circumstances (quotacheck and reclaim) I didn't find
> anything that made me think it would fail, even if it probably wasn't
> what the designers had in mind.
> 
> However, Dave's comments inspired me to take a second look.  Sorry if
> this has been covered already, but what if dquot_isolate noticed when
> xfs_buf_delwri_queue doesn't actually put the buffer on isol->buffers
> (i.e. something else already put it on another delwri queue, which
> afaict only happens during quotacheck?) and simply dqfunlock's the
> buffer?  Reclaim will have already formatted the in-core dquot into the
> on-disk buffer and can free the dquot, and quotacheck is still on the
> hook to issue the IO.  (FWIW even that doesn't quite smell right since
> it seems weird to push metadata to the buffer layer but someone else
> writes it to disk.)
> 

FWIW, I did also consider something similar on the reclaim side of
things. Not to unlock the flush lock (I think we don't generally unlock
a flush lock until state reaches disk, even though technically it may
not be a problem from quotacheck context), but to avoid acquiring it in
the first place if the underlying buffer appeared to already belong on a
delwri queue (or something along those lines).

I don't recall the exact details off the top of my head, but I didn't
like how it turned out enough such that it never turned into something
post-worthy (I may still have that around somewhere, though).
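
The shape of it was something along these lines (purely illustrative -
the helper name is made up and this isn't the code I had):

        /*
         * In xfs_qm_dquot_isolate(), before trying the flush lock: if
         * the backing buffer already sits on someone else's delwri
         * queue (i.e. quotacheck's), skip the dquot the same way a
         * flush lock trylock failure is skipped and let quotacheck
         * write it back.
         */
        if (xfs_dquot_buf_queued(dqp)) {        /* hypothetical helper */
                xfs_dqunlock(dqp);
                goto out_miss_busy;
        }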

> In any case, it would be helpful to document how the delwri queues work,
> whether or not the list is allowed to change, and what relation the
> _XBF_DELWRI_Q flag has to the value of b_list.  For example, I didn't
> know that one wasn't supposed to bounce buffers between lists.  Can that
> documentation be part of whatever patch series we end up committing,
> please?
> 

I'm not aware of any such limitation.

> (It's hard for me to make decisions when I know I don't have enough info...)
> 
> Also, I think Brian said that he'd post updated patches with at least a
> fix to a problem with dropped xfs_buf_delwri_pushbuf return codes, so
> I'd still like to see that.
> 

That was my intent, though I realized I haven't even made those changes
locally yet because this appears to be going nowhere fast, afaict. I can
post another update if that is actually useful, however.

(As noted in my previous mail, though, I don't really expect you to
consider merging a patch with outstanding objections.)

Brian

> As far as testing goes I'm happ(ier) that /one/ of the original
> complainants says that he could move on with the patches applied.  I
> haven't observed any breakage w/ xfstests, though that doesn't prove it
> right.
> 
> --D
> 
> > 
> > -Dave.
> > -- 
> > Dave Chinner
> > david@fromorbit.com
Brian Foster April 21, 2017, 12:18 p.m. UTC | #16
On Wed, Apr 19, 2017 at 04:46:46PM -0400, Brian Foster wrote:
> On Wed, Apr 19, 2017 at 12:55:19PM -0700, Darrick J. Wong wrote:
> > On Wed, Apr 19, 2017 at 12:46:49PM +1000, Dave Chinner wrote:
> > > On Tue, Apr 18, 2017 at 09:55:31AM -0400, Brian Foster wrote:
> > > > On Tue, Apr 18, 2017 at 12:35:48PM +1000, Dave Chinner wrote:
> > > > > On Tue, Apr 11, 2017 at 10:53:03AM -0400, Brian Foster wrote:
> > > > > > On Tue, Apr 11, 2017 at 09:55:25AM +1000, Dave Chinner wrote:
> > > > > > > So why can't we just modify the dquot in the buffer?  We already
> > > > > > > hold all the locks needed to guarantee exclusive access to the dquot
> > > > > > > and buffer, and they are all we hold when we do the initial flush to
> > > > > > > the buffer. So why do we need to do IO to unlock the dquot flush
> > > > > > > lock when we could just rewrite before we submit the buffer for IO?
> > > > > > > 
> > > > > > 
> > > > > > Are you suggesting to essentially ignore the flush lock? I suppose we
> > > > > > could get away with this in dq_flush_one() since it is only called from
> > > > > > quotacheck, but we may have to kill off any assertions that expect the
> > > > > > flush lock in xfs_dqflush(), and perhaps refactor that code to accept a
> > > > > > locked buffer as a param.
> > > > > 
> > > > > No, I'm not suggesting we ignore the flush lock. What the flush
> > > > > "lock" does is prevent higher layer attempts to flush the dquot to
> > > > > the backing buffer whilst IO may be in progress. It's higher level
> > > > > function is to allows async log item state updates to be done
> > > > > correctly w.r.t. relogging and flushing, as we allow transactional
> > > > > modifications to inodes and dquots while they are under IO.
> > > > > 
> > > > 
> > > > Ok. What I was trying to recall previously was some discussion around
> > > > the flush lock retry problem where it was noted that we can't unlock the
> > > > flush lock until the currently flushed state makes it to disk[1]. Is the
> > > > conceptual difference here that we are not proposing to unlock the flush
> > > > lock, but rather co-opt it to perform another flush?
> > > 
> > > I wouldn't use that wording, but you've got the idea.
> > > 
> > > [...]
> > > 
> > > > > > I don't see anything that would break off hand, but my first reaction is
> > > > > > it sounds more hackish than this patch or the previous patch that just
> > > > > > disabled reclaim during quotacheck.
> > > > > 
> > > > > I thought we'd done that discussion. i.e. we can't disable reclaim
> > > > > in quotacheck because that can set off the OOM killer...
> > > > > 
> > > > 
> > > > Huh? The only reason disabling of dquot reclaim during quotacheck was
> > > > proposed in the first place is because it is 100% superfluous.
> > > 
> > > ISTR that we broke dquot reclaim during quotacheck by moving to
> > > private delwri lists. I'm working under the impression that dquot
> > > reclaim during quotacheck used to work just fine. maybe I'm wrong,
> > > but ....
> > > 
> > > > Quotacheck, by design, does not allow reclaim to free memory. Therefore
> > > > reclaim does not and afaict never has prevented or even mitigated OOM
> > > > during quotacheck.
> > > 
> > > .... the intent has always been to allow dquot reclaim to run when
> > > quotacheck is active because we've had our fair share of OOM
> > > problems in quotacheck over the past 10 years. Bugs notwithstanding,
> > > we really should be trying to ensure the code fulfils that intent
> > > rather than sweep it under the carpet and tell users "too bad, so
> > > sad" when quotacheck OOMs...
> > > 
> > > [...]
> > > 
> > > > P.S., To be completely clear of my position on this issue at this
> > > > point... given the amount of time I've already spent responding to
> > > > handwavy arguments (ultimately resulting in discussing trailing off
> > > > until a repost occurs), or experimenting with a known bogus quotacheck
> > > > rework (which is still left without feedback, I believe), etc.,
> > > > clarification on the correctness of this alternate approach (while
> > > > interesting) is not nearly convincing enough for me to start over on
> > > > this bug. I don't mind handwavy questions if the "caller" is receptive
> > > > to or attempting to process (or debate) clarification, but I don't get
> > > > the impression that is the case here.
> > > > 
> > > > If you feel strongly enough about a certain approach, feel free to just
> > > > send a patch. At this point, I'm happy to help review anything from the
> > > > sole perspective of technical correctness (regardless of whether the I
> > > > think the approach is ugly), but I'm not personally wasting any more
> > > > time than I already have to implement and test such an approach without
> > > > a more logical and convincing argument. IMO, the feedback to this patch
> > > > doesn't fairly or reasonably justify the level of pushback.
> > > 
> > > I just responded to the code that was posted, pointing out a
> > > list of things that concerned me and, I thought, we've been working
> > > through that quite amicably.
> > > 
> > > Really, it is up to the maintainer whether to merge the code or not.
> > > That's not me - I'm now just someone who knows the code and it's
> > > history.  This is where the maintainer needs to step in and make a
> > > decision one way or the other....
> > 
> > So I do have a few (more) comments:
> > 
> > The first point I have to make is that the quotacheck OOM still seems to
> > happen infrequently, so at the point that this thread started going, I
> > was (and still am) fine with letting the discussion continue until we
> > run out of questions. :)
> > 
> > As far as the patch itself goes, it took me a while to figure out what's
> > going on with the delwri buffer list shuffling.  The part where the
> > buffer comes back from xfs_buf_delwri_submit_buffers during _pushbuf
> > still attached to a delwri queue and yet we still have to re-add the
> > DELWRI_Q flag raised my eyebrows.  I thought it was a little odd, but
> > in those particular circumstances (quotacheck and reclaim) I didn't find
> > anything that made me think it would fail, even if it probably wasn't
> > what the designers had in mind.
> > 
> > However, Dave's comments inspired me to take a second look.  Sorry if
> > this has been covered already, but what if dquot_isolate noticed when
> > xfs_buf_delwri_queue doesn't actually put the buffer on isol->buffers
> > (i.e. something else already put it on another delwri queue, which
> > afaict only happens during quotacheck?) and simply dqfunlock's the
> > buffer?  Reclaim will have already formatted the in-core dquot into the
> > on-disk buffer and can free the dquot, and quotacheck is still on the
> > hook to issue the IO.  (FWIW even that doesn't quite smell right since
> > it seems weird to push metadata to the buffer layer but have someone
> > else write it to disk.)
> > 
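To make the alternative above concrete, a rough (untested) sketch against the
reclaim-side isolate path might look like the following. It borrows
isol->buffers and the surrounding structure from the existing dirty-dquot path
in xfs_qm_dquot_isolate(); the error label is assumed, and whether dropping the
flush lock before the buffer actually reaches disk is safe is exactly the
concern raised in the reply below:

	error = xfs_qm_dqflush(dqp, &bp);
	if (error)
		goto out_unlock_dirty;	/* assumed existing error path */

	/*
	 * Hypothetical change: xfs_buf_delwri_queue() returns false when the
	 * buffer already sits on another delwri queue (afaict only
	 * quotacheck's local list can own it here). In that case quotacheck
	 * owns the I/O, so drop the flush lock now instead of deferring the
	 * funlock to I/O completion of a buffer that won't be submitted
	 * until quotacheck finishes.
	 */
	if (!xfs_buf_delwri_queue(bp, &isol->buffers))
		xfs_dqfunlock(dqp);
	xfs_buf_relse(bp);
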
> 
> FWIW, I did also consider something similar on the reclaim side of
> things. Not to unlock the flush lock (I think we don't generally unlock
> a flush lock until state reaches disk, even though technically it may
> not be a problem from quotacheck context), but to avoid acquiring it in
> the first place if the underlying buffer appeared to already belong on a
> delwri queue (or something along those lines).
> 
> I don't recall the exact details off the top of my head, but I didn't
> like how it turned out enough such that it never turned into something
> post-worthy (I may still have that around somewhere, though).
> 

I don't appear to have any code around for the above, unfortunately.
Having thought about this a bit more, I think one of the reasons I
didn't prefer that approach (or the latest alternative proposal) is that
all things being equal, I'd rather not complicate external, core
infrastructure bits of code such as reclaim or dquot flushing with a
hack that is only necessary to deal with quotacheck.

IOW, the problem is primarily with the quotacheck implementation, so IMO
the most appropriate place for whatever hacks are necessary to make it
work correctly is in quotacheck itself. Then it's more straightforward
to improve the whole implementation going forward (and possibly replace
or remove the code being added here), rather than have to reassess
whatever hacks we've sprinkled elsewhere in the codebase.

That's just my .02, of course. v3 incoming...

Brian

> > In any case, it would be helpful to document how the delwri queues work,
> > whether or not the list is allowed to change, and what relation the
> > _XBF_DELWRI_Q flag has to the value of b_list.  For example, I didn't
> > know that one wasn't supposed to bounce buffers between lists.  Can that
> > documentation be part of whatever patch series we end up committing,
> > please?
> > 
> 
> I'm not aware of any such limitation.
> 
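For reference, the sort of comment block being asked for might look roughly
like the following, inferred only from the behavior described in this thread
and in the patch below (a sketch, not existing documentation):

/*
 * Delwri queue rules, as relied upon by quotacheck and dquot reclaim:
 *
 *  - A buffer sits on at most one delwri queue at a time, linked via
 *    b_list, and carries _XBF_DELWRI_Q for as long as it is queued.
 *  - xfs_buf_delwri_queue() returns false if the buffer is already on
 *    a delwri queue, even one owned by another context.
 *  - xfs_buf_delwri_submit_buffers() clears _XBF_DELWRI_Q at submission
 *    time; when given a wait list, it moves the buffer onto that list
 *    before the buffer is unlocked, so a waited-on buffer never drops
 *    off every list.
 *  - xfs_buf_delwri_pushbuf() passes the original queue as the wait
 *    list, so the buffer returns to it across the I/O, and re-sets
 *    _XBF_DELWRI_Q afterwards so the buffer remains delwri-queued.
 */
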
> > (It's hard for me to make decisions when I know I don't have enough info...)
> > 
> > Also, I think Brian said that he'd post updated patches with at least a
> > fix to a problem with dropped xfs_buf_delwri_pushbuf return codes, so
> > I'd still like to see that.
> > 
> 
> That was my intent, though I realized I haven't even made those changes
> locally yet because this appears to be going nowhere fast, afaict. I can
> post another update if that is actually useful, however.
> 
> (As noted in my previous mail, though, I don't really expect you to
> consider merging a patch with outstanding objections.)
> 
> Brian
> 
> > As far as testing goes I'm happ(ier) that /one/ of the original
> > complainants says that he could move on with the patches applied.  I
> > haven't observed any breakage w/ xfstests, though that doesn't prove it
> > right.
> > 
> > --D
> > 
> > > 
> > > -Dave.
> > > -- 
> > > Dave Chinner
> > > david@fromorbit.com

Patch

diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index e566510..e97cf56 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -2011,6 +2011,43 @@  xfs_buf_delwri_submit(
 	return error;
 }
 
+int
+xfs_buf_delwri_pushbuf(
+	struct xfs_buf		*bp,
+	struct list_head	*buffer_list)
+{
+	LIST_HEAD		(submit_list);
+	int			error;
+
+	ASSERT(xfs_buf_islocked(bp));
+	ASSERT(bp->b_flags & _XBF_DELWRI_Q);
+
+	trace_xfs_buf_delwri_pushbuf(bp, _RET_IP_);
+
+	/*
+	 * Move the buffer to an empty list and submit. Pass the original list
+	 * as the wait list so delwri submission moves the buf back to it before
+	 * it is submitted (and thus before it is unlocked). This means the
+	 * buffer cannot be placed on another list while we wait for it.
+	 */
+	list_move(&bp->b_list, &submit_list);
+	xfs_buf_unlock(bp);
+
+	xfs_buf_delwri_submit_buffers(&submit_list, buffer_list);
+
+	/*
+	 * Lock the buffer to wait for I/O completion. It's already held on the
+	 * original list, so all we have to do is reset the delwri queue flag
+	 * that was cleared by delwri submission.
+	 */
+	xfs_buf_lock(bp);
+	error = bp->b_error;
+	bp->b_flags |= _XBF_DELWRI_Q;
+	xfs_buf_unlock(bp);
+
+	return error;
+}
+
 int __init
 xfs_buf_init(void)
 {
diff --git a/fs/xfs/xfs_buf.h b/fs/xfs/xfs_buf.h
index 8a9d3a9..cd74216 100644
--- a/fs/xfs/xfs_buf.h
+++ b/fs/xfs/xfs_buf.h
@@ -334,6 +334,7 @@  extern void xfs_buf_stale(struct xfs_buf *bp);
 extern bool xfs_buf_delwri_queue(struct xfs_buf *, struct list_head *);
 extern int xfs_buf_delwri_submit(struct list_head *);
 extern int xfs_buf_delwri_submit_nowait(struct list_head *);
+extern int xfs_buf_delwri_pushbuf(struct xfs_buf *, struct list_head *);
 
 /* Buffer Daemon Setup Routines */
 extern int xfs_buf_init(void);
diff --git a/fs/xfs/xfs_qm.c b/fs/xfs/xfs_qm.c
index 4ff993c..3815ed3 100644
--- a/fs/xfs/xfs_qm.c
+++ b/fs/xfs/xfs_qm.c
@@ -1247,6 +1247,7 @@  xfs_qm_flush_one(
 	struct xfs_dquot	*dqp,
 	void			*data)
 {
+	struct xfs_mount	*mp = dqp->q_mount;
 	struct list_head	*buffer_list = data;
 	struct xfs_buf		*bp = NULL;
 	int			error = 0;
@@ -1257,7 +1258,32 @@  xfs_qm_flush_one(
 	if (!XFS_DQ_IS_DIRTY(dqp))
 		goto out_unlock;
 
-	xfs_dqflock(dqp);
+	/*
+	 * The only way the dquot is already flush locked by the time quotacheck
+	 * gets here is if reclaim flushed it before the dqadjust walk dirtied
+	 * it for the final time. Quotacheck collects all dquot bufs in the
+	 * local delwri queue before dquots are dirtied, so reclaim can't have
+	 * possibly queued it for I/O. The only way out is to push the buffer to
+	 * cycle the flush lock.
+	 */
+	if (!xfs_dqflock_nowait(dqp)) {
+		/* buf is pinned in-core by delwri list */
+		DEFINE_SINGLE_BUF_MAP(map, dqp->q_blkno,
+				      mp->m_quotainfo->qi_dqchunklen);
+		bp = _xfs_buf_find(mp->m_ddev_targp, &map, 1, 0, NULL);
+		if (!bp) {
+			error = -EINVAL;
+			goto out_unlock;
+		}
+
+		/* delwri_pushbuf drops the buf lock */
+		xfs_buf_delwri_pushbuf(bp, buffer_list);
+		xfs_buf_rele(bp);
+
+		error = -EAGAIN;
+		goto out_unlock;
+	}
+
 	error = xfs_qm_dqflush(dqp, &bp);
 	if (error)
 		goto out_unlock;
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index fb7555e..c8d28a9 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -365,6 +365,7 @@  DEFINE_BUF_EVENT(xfs_buf_iowait_done);
 DEFINE_BUF_EVENT(xfs_buf_delwri_queue);
 DEFINE_BUF_EVENT(xfs_buf_delwri_queued);
 DEFINE_BUF_EVENT(xfs_buf_delwri_split);
+DEFINE_BUF_EVENT(xfs_buf_delwri_pushbuf);
 DEFINE_BUF_EVENT(xfs_buf_get_uncached);
 DEFINE_BUF_EVENT(xfs_buf_item_relse);
 DEFINE_BUF_EVENT(xfs_buf_item_iodone_async);