| Message ID | 1487966001-63263-3-git-send-email-bfoster@redhat.com (mailing list archive) |
| --- | --- |
| State | Superseded, archived |
On Fri, Feb 24, 2017 at 02:53:20PM -0500, Brian Foster wrote: > Reclaim during quotacheck can lead to deadlocks on the dquot flush > lock: > > - Quotacheck populates a local delwri queue with the physical dquot > buffers. > - Quotacheck performs the xfs_qm_dqusage_adjust() bulkstat and > dirties all of the dquots. > - Reclaim kicks in and attempts to flush a dquot whose buffer is > already queud on the quotacheck queue. The flush succeeds but > queueing to the reclaim delwri queue fails as the backing buffer is > already queued. The flush unlock is now deferred to I/O completion > of the buffer from the quotacheck queue. <typing aloud here, trying to follow along...> If I'm reading this correctly, this is the _buf_delwri_queue call towards the end of the _DQ_IS_DIRTY clause in _qm_dquot_isolate? Once the dquot is flushed we fail to add it to isol->buffers because it's already on quotacheck's buffer_list, which means that the _buf_delwri_submit call in _qm_shrink_scan never sees our dirty dquot. Therefore, the dquot doesn't get written and the flush lock stays locked... > - The dqadjust bulkstat continues and dirties the recently flushed > dquot once again. > - Quotacheck proceeds to the xfs_qm_flush_one() walk which requires > the flush lock to update the backing buffers with the in-core > recalculated values. It deadlocks on the redirtied dquot as the > flush lock was already acquired by reclaim, but the buffer resides > on the local delwri queue which isn't submitted until the end of > quotacheck. ...until quotacheck tries to relock it and deadlocks, like you said. Ok, I think I understand the deadlock... > This is reproduced by running quotacheck on a filesystem with a > couple million inodes in low memory (512MB-1GB) situations. This is > a regression as of commit 43ff2122e6 ("xfs: on-stack delayed write > buffer lists"), which removed a trylock and buffer I/O submission > from the quotacheck dquot flush sequence. > > Quotacheck first resets and collects the physical dquot buffers in a > delwri queue. Then, it traverses the filesystem inodes via bulkstat, > updates the in-core dquots, flushes the corrected dquots to the > backing buffers and finally submits the delwri queue for I/O. Since > the backing buffers are queued across the entire quotacheck > operation, dquot reclaim cannot possibly complete a dquot flush > before quotacheck completes. > > Therefore, quotacheck must submit the buffer for I/O in order to > cycle the flush lock and flush the dirty in-core dquot to the > buffer. Add a delwri queue buffer push mechanism to submit an > individual buffer for I/O without losing the delwri queue status and > use it from quotacheck to avoid the deadlock. This restores > quotacheck behavior to as before the regression was introduced. ...and therefore why this patch changes quotacheck to check if the dquot's flush lock is already held and if so, to call _buf_delwri_submit_buffers so that the io completion will unlock the flush lock and try again. When we return to _qm_flush_one, the dquot is no longer dirty and we're done. I /think/ this looks ok, though I also think I want to hear if either Christian Affolter or Martin Svec have had any problems with these two patches applied. My current feeling is that this is a bit late for 4.12, but I'd certainly put it in -washme for wider testing. If anyone thinks this needs to happen sooner, please speak up. Sorry this one took me a while to get to. :/ Reviewed-by: Darrick J. 
Wong <darrick.wong@oracle.com> --D > Reported-by: Martin Svec <martin.svec@zoner.cz> > Signed-off-by: Brian Foster <bfoster@redhat.com> > --- > fs/xfs/xfs_buf.c | 37 +++++++++++++++++++++++++++++++++++++ > fs/xfs/xfs_buf.h | 1 + > fs/xfs/xfs_qm.c | 28 +++++++++++++++++++++++++++- > fs/xfs/xfs_trace.h | 1 + > 4 files changed, 66 insertions(+), 1 deletion(-) > > diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c > index e566510..e97cf56 100644 > --- a/fs/xfs/xfs_buf.c > +++ b/fs/xfs/xfs_buf.c > @@ -2011,6 +2011,43 @@ xfs_buf_delwri_submit( > return error; > } > > +int > +xfs_buf_delwri_pushbuf( > + struct xfs_buf *bp, > + struct list_head *buffer_list) > +{ > + LIST_HEAD (submit_list); > + int error; > + > + ASSERT(xfs_buf_islocked(bp)); > + ASSERT(bp->b_flags & _XBF_DELWRI_Q); > + > + trace_xfs_buf_delwri_pushbuf(bp, _RET_IP_); > + > + /* > + * Move the buffer to an empty list and submit. Pass the original list > + * as the wait list so delwri submission moves the buf back to it before > + * it is submitted (and thus before it is unlocked). This means the > + * buffer cannot be placed on another list while we wait for it. > + */ > + list_move(&bp->b_list, &submit_list); > + xfs_buf_unlock(bp); > + > + xfs_buf_delwri_submit_buffers(&submit_list, buffer_list); > + > + /* > + * Lock the buffer to wait for I/O completion. It's already held on the > + * original list, so all we have to do is reset the delwri queue flag > + * that was cleared by delwri submission. > + */ > + xfs_buf_lock(bp); > + error = bp->b_error; > + bp->b_flags |= _XBF_DELWRI_Q; > + xfs_buf_unlock(bp); > + > + return error; > +} > + > int __init > xfs_buf_init(void) > { > diff --git a/fs/xfs/xfs_buf.h b/fs/xfs/xfs_buf.h > index 8a9d3a9..cd74216 100644 > --- a/fs/xfs/xfs_buf.h > +++ b/fs/xfs/xfs_buf.h > @@ -334,6 +334,7 @@ extern void xfs_buf_stale(struct xfs_buf *bp); > extern bool xfs_buf_delwri_queue(struct xfs_buf *, struct list_head *); > extern int xfs_buf_delwri_submit(struct list_head *); > extern int xfs_buf_delwri_submit_nowait(struct list_head *); > +extern int xfs_buf_delwri_pushbuf(struct xfs_buf *, struct list_head *); > > /* Buffer Daemon Setup Routines */ > extern int xfs_buf_init(void); > diff --git a/fs/xfs/xfs_qm.c b/fs/xfs/xfs_qm.c > index 4ff993c..3815ed3 100644 > --- a/fs/xfs/xfs_qm.c > +++ b/fs/xfs/xfs_qm.c > @@ -1247,6 +1247,7 @@ xfs_qm_flush_one( > struct xfs_dquot *dqp, > void *data) > { > + struct xfs_mount *mp = dqp->q_mount; > struct list_head *buffer_list = data; > struct xfs_buf *bp = NULL; > int error = 0; > @@ -1257,7 +1258,32 @@ xfs_qm_flush_one( > if (!XFS_DQ_IS_DIRTY(dqp)) > goto out_unlock; > > - xfs_dqflock(dqp); > + /* > + * The only way the dquot is already flush locked by the time quotacheck > + * gets here is if reclaim flushed it before the dqadjust walk dirtied > + * it for the final time. Quotacheck collects all dquot bufs in the > + * local delwri queue before dquots are dirtied, so reclaim can't have > + * possibly queued it for I/O. The only way out is to push the buffer to > + * cycle the flush lock. 
> + */ > + if (!xfs_dqflock_nowait(dqp)) { > + /* buf is pinned in-core by delwri list */ > + DEFINE_SINGLE_BUF_MAP(map, dqp->q_blkno, > + mp->m_quotainfo->qi_dqchunklen); > + bp = _xfs_buf_find(mp->m_ddev_targp, &map, 1, 0, NULL); > + if (!bp) { > + error = -EINVAL; > + goto out_unlock; > + } > + > + /* delwri_pushbuf drops the buf lock */ > + xfs_buf_delwri_pushbuf(bp, buffer_list); > + xfs_buf_rele(bp); > + > + error = -EAGAIN; > + goto out_unlock; > + } > + > error = xfs_qm_dqflush(dqp, &bp); > if (error) > goto out_unlock; > diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h > index fb7555e..c8d28a9 100644 > --- a/fs/xfs/xfs_trace.h > +++ b/fs/xfs/xfs_trace.h > @@ -365,6 +365,7 @@ DEFINE_BUF_EVENT(xfs_buf_iowait_done); > DEFINE_BUF_EVENT(xfs_buf_delwri_queue); > DEFINE_BUF_EVENT(xfs_buf_delwri_queued); > DEFINE_BUF_EVENT(xfs_buf_delwri_split); > +DEFINE_BUF_EVENT(xfs_buf_delwri_pushbuf); > DEFINE_BUF_EVENT(xfs_buf_get_uncached); > DEFINE_BUF_EVENT(xfs_buf_item_relse); > DEFINE_BUF_EVENT(xfs_buf_item_iodone_async); > -- > 2.7.4 > > -- > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-xfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
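For readers tracing the code paths named above, the behaviour the deadlock hinges on is the duplicate-queue check in xfs_buf_delwri_queue(): a buffer already flagged _XBF_DELWRI_Q is skipped, so reclaim's isolate list never picks up a buffer that quotacheck has already queued. A condensed sketch of that check, paraphrased from the fs/xfs/xfs_buf.c of that era (not a verbatim copy; details may differ):

```c
/*
 * Condensed sketch of xfs_buf_delwri_queue(), paraphrased for
 * illustration, not verbatim kernel source.
 */
bool
xfs_buf_delwri_queue(
	struct xfs_buf		*bp,
	struct list_head	*list)
{
	ASSERT(xfs_buf_islocked(bp));

	/*
	 * Already on another delwri queue (here: quotacheck's buffer_list).
	 * Reclaim's attempt to add the dquot buffer to isol->buffers takes
	 * this branch, so _qm_shrink_scan() never submits it and the dquot
	 * flush lock stays held until quotacheck submits its own queue.
	 */
	if (bp->b_flags & _XBF_DELWRI_Q)
		return false;

	bp->b_flags |= _XBF_DELWRI_Q;
	if (list_empty(&bp->b_list)) {
		atomic_inc(&bp->b_hold);	/* the queue owns a buffer reference */
		list_add_tail(&bp->b_list, list);
	}
	return true;
}
```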
On Thu, Apr 06, 2017 at 03:34:17PM -0700, Darrick J. Wong wrote: > On Fri, Feb 24, 2017 at 02:53:20PM -0500, Brian Foster wrote: > > Reclaim during quotacheck can lead to deadlocks on the dquot flush > > lock: > > > > - Quotacheck populates a local delwri queue with the physical dquot > > buffers. > > - Quotacheck performs the xfs_qm_dqusage_adjust() bulkstat and > > dirties all of the dquots. > > - Reclaim kicks in and attempts to flush a dquot whose buffer is > > already queud on the quotacheck queue. The flush succeeds but > > queueing to the reclaim delwri queue fails as the backing buffer is > > already queued. The flush unlock is now deferred to I/O completion > > of the buffer from the quotacheck queue. > > <typing aloud here, trying to follow along...> > > If I'm reading this correctly, this is the _buf_delwri_queue call > towards the end of the _DQ_IS_DIRTY clause in _qm_dquot_isolate? Once > the dquot is flushed we fail to add it to isol->buffers because it's > already on quotacheck's buffer_list, which means that the > _buf_delwri_submit call in _qm_shrink_scan never sees our dirty dquot. > > Therefore, the dquot doesn't get written and the flush lock stays > locked... > Yes, exactly. > > - The dqadjust bulkstat continues and dirties the recently flushed > > dquot once again. > > - Quotacheck proceeds to the xfs_qm_flush_one() walk which requires > > the flush lock to update the backing buffers with the in-core > > recalculated values. It deadlocks on the redirtied dquot as the > > flush lock was already acquired by reclaim, but the buffer resides > > on the local delwri queue which isn't submitted until the end of > > quotacheck. > > ...until quotacheck tries to relock it and deadlocks, like you said. > Ok, I think I understand the deadlock... > Yep. > > This is reproduced by running quotacheck on a filesystem with a > > couple million inodes in low memory (512MB-1GB) situations. This is > > a regression as of commit 43ff2122e6 ("xfs: on-stack delayed write > > buffer lists"), which removed a trylock and buffer I/O submission > > from the quotacheck dquot flush sequence. > > > > Quotacheck first resets and collects the physical dquot buffers in a > > delwri queue. Then, it traverses the filesystem inodes via bulkstat, > > updates the in-core dquots, flushes the corrected dquots to the > > backing buffers and finally submits the delwri queue for I/O. Since > > the backing buffers are queued across the entire quotacheck > > operation, dquot reclaim cannot possibly complete a dquot flush > > before quotacheck completes. > > > > Therefore, quotacheck must submit the buffer for I/O in order to > > cycle the flush lock and flush the dirty in-core dquot to the > > buffer. Add a delwri queue buffer push mechanism to submit an > > individual buffer for I/O without losing the delwri queue status and > > use it from quotacheck to avoid the deadlock. This restores > > quotacheck behavior to as before the regression was introduced. > > ...and therefore why this patch changes quotacheck to check if the > dquot's flush lock is already held and if so, to call > _buf_delwri_submit_buffers so that the io completion will unlock the > flush lock and try again. > Yes, though only for the specific buffer. > When we return to _qm_flush_one, the dquot is no longer dirty and we're > done. I /think/ this looks ok, though I also think I want to hear if > either Christian Affolter or Martin Svec have had any problems with > these two patches applied. 
My current feeling is that this is a bit > late for 4.12, but I'd certainly put it in -washme for wider testing. > If anyone thinks this needs to happen sooner, please speak up. > The dquot is still dirty and needs to be flushed on the second go around (otherwise we wouldn't have needed the flush lock here on the first try). The issue is basically that the dquot was adjusted (dirtied) at some point during the bulkstat scan, reclaimed and flushed, and then subsequently dirtied again such that another flush is required via qm_flush_one(). The buffer push frees up the held flush lock so that final flush can proceed. BTW, what's the reasoning for being late to 4.12? I actually don't have much of an opinion which release this goes into, so I guess my question is more logistical due to the fact that up until recently I haven't paid much attention to the upstream release cycles. ;P The 4.12 merge window hasn't opened as we are still on 4.11-rc5, right? Is the concern then that the merge window is close enough that this has missed 4.12 linux-next soak time? Also, can you clarify the semantics of the -washme thing? ;) That doesn't necessarily mean the patch has been "merged." Rather, it might be suitable for the next for-next (4.13) provided nothing explodes by then, correct? I'm also a little curious on what extra testing that entails, if any, though perhaps this topic is better suited by an independent thread.. > Sorry this one took me a while to get to. :/ > > Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> > No problem. Thanks for review... Brian > --D > > > Reported-by: Martin Svec <martin.svec@zoner.cz> > > Signed-off-by: Brian Foster <bfoster@redhat.com> > > --- > > fs/xfs/xfs_buf.c | 37 +++++++++++++++++++++++++++++++++++++ > > fs/xfs/xfs_buf.h | 1 + > > fs/xfs/xfs_qm.c | 28 +++++++++++++++++++++++++++- > > fs/xfs/xfs_trace.h | 1 + > > 4 files changed, 66 insertions(+), 1 deletion(-) > > > > diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c > > index e566510..e97cf56 100644 > > --- a/fs/xfs/xfs_buf.c > > +++ b/fs/xfs/xfs_buf.c > > @@ -2011,6 +2011,43 @@ xfs_buf_delwri_submit( > > return error; > > } > > > > +int > > +xfs_buf_delwri_pushbuf( > > + struct xfs_buf *bp, > > + struct list_head *buffer_list) > > +{ > > + LIST_HEAD (submit_list); > > + int error; > > + > > + ASSERT(xfs_buf_islocked(bp)); > > + ASSERT(bp->b_flags & _XBF_DELWRI_Q); > > + > > + trace_xfs_buf_delwri_pushbuf(bp, _RET_IP_); > > + > > + /* > > + * Move the buffer to an empty list and submit. Pass the original list > > + * as the wait list so delwri submission moves the buf back to it before > > + * it is submitted (and thus before it is unlocked). This means the > > + * buffer cannot be placed on another list while we wait for it. > > + */ > > + list_move(&bp->b_list, &submit_list); > > + xfs_buf_unlock(bp); > > + > > + xfs_buf_delwri_submit_buffers(&submit_list, buffer_list); > > + > > + /* > > + * Lock the buffer to wait for I/O completion. It's already held on the > > + * original list, so all we have to do is reset the delwri queue flag > > + * that was cleared by delwri submission. 
> > + */ > > + xfs_buf_lock(bp); > > + error = bp->b_error; > > + bp->b_flags |= _XBF_DELWRI_Q; > > + xfs_buf_unlock(bp); > > + > > + return error; > > +} > > + > > int __init > > xfs_buf_init(void) > > { > > diff --git a/fs/xfs/xfs_buf.h b/fs/xfs/xfs_buf.h > > index 8a9d3a9..cd74216 100644 > > --- a/fs/xfs/xfs_buf.h > > +++ b/fs/xfs/xfs_buf.h > > @@ -334,6 +334,7 @@ extern void xfs_buf_stale(struct xfs_buf *bp); > > extern bool xfs_buf_delwri_queue(struct xfs_buf *, struct list_head *); > > extern int xfs_buf_delwri_submit(struct list_head *); > > extern int xfs_buf_delwri_submit_nowait(struct list_head *); > > +extern int xfs_buf_delwri_pushbuf(struct xfs_buf *, struct list_head *); > > > > /* Buffer Daemon Setup Routines */ > > extern int xfs_buf_init(void); > > diff --git a/fs/xfs/xfs_qm.c b/fs/xfs/xfs_qm.c > > index 4ff993c..3815ed3 100644 > > --- a/fs/xfs/xfs_qm.c > > +++ b/fs/xfs/xfs_qm.c > > @@ -1247,6 +1247,7 @@ xfs_qm_flush_one( > > struct xfs_dquot *dqp, > > void *data) > > { > > + struct xfs_mount *mp = dqp->q_mount; > > struct list_head *buffer_list = data; > > struct xfs_buf *bp = NULL; > > int error = 0; > > @@ -1257,7 +1258,32 @@ xfs_qm_flush_one( > > if (!XFS_DQ_IS_DIRTY(dqp)) > > goto out_unlock; > > > > - xfs_dqflock(dqp); > > + /* > > + * The only way the dquot is already flush locked by the time quotacheck > > + * gets here is if reclaim flushed it before the dqadjust walk dirtied > > + * it for the final time. Quotacheck collects all dquot bufs in the > > + * local delwri queue before dquots are dirtied, so reclaim can't have > > + * possibly queued it for I/O. The only way out is to push the buffer to > > + * cycle the flush lock. > > + */ > > + if (!xfs_dqflock_nowait(dqp)) { > > + /* buf is pinned in-core by delwri list */ > > + DEFINE_SINGLE_BUF_MAP(map, dqp->q_blkno, > > + mp->m_quotainfo->qi_dqchunklen); > > + bp = _xfs_buf_find(mp->m_ddev_targp, &map, 1, 0, NULL); > > + if (!bp) { > > + error = -EINVAL; > > + goto out_unlock; > > + } > > + > > + /* delwri_pushbuf drops the buf lock */ > > + xfs_buf_delwri_pushbuf(bp, buffer_list); > > + xfs_buf_rele(bp); > > + > > + error = -EAGAIN; > > + goto out_unlock; > > + } > > + > > error = xfs_qm_dqflush(dqp, &bp); > > if (error) > > goto out_unlock; > > diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h > > index fb7555e..c8d28a9 100644 > > --- a/fs/xfs/xfs_trace.h > > +++ b/fs/xfs/xfs_trace.h > > @@ -365,6 +365,7 @@ DEFINE_BUF_EVENT(xfs_buf_iowait_done); > > DEFINE_BUF_EVENT(xfs_buf_delwri_queue); > > DEFINE_BUF_EVENT(xfs_buf_delwri_queued); > > DEFINE_BUF_EVENT(xfs_buf_delwri_split); > > +DEFINE_BUF_EVENT(xfs_buf_delwri_pushbuf); > > DEFINE_BUF_EVENT(xfs_buf_get_uncached); > > DEFINE_BUF_EVENT(xfs_buf_item_relse); > > DEFINE_BUF_EVENT(xfs_buf_item_iodone_async); > > -- > > 2.7.4 > > > > -- > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in > > the body of a message to majordomo@vger.kernel.org > > More majordomo info at http://vger.kernel.org/majordomo-info.html > -- > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-xfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
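As background for the "second go around" described above: xfs_qm_flush_one() returns -EAGAIN after pushing the buffer, and the dquot walk treats -EAGAIN as "skip this dquot and restart the scan", which is how the dquot is revisited once the flush lock has been cycled by buffer I/O completion. A rough sketch of that retry behaviour, simplified from xfs_qm_dquot_walk() (for_each_dquot() is an illustrative placeholder, not a real kernel helper):

```c
	/*
	 * Rough sketch of the retry loop in xfs_qm_dquot_walk() that the
	 * -EAGAIN return from xfs_qm_flush_one() relies on.  Simplified;
	 * for_each_dquot() stands in for the radix tree gang lookup the
	 * real code uses.
	 */
restart:
	skipped = 0;
	for_each_dquot(qi, type, dqp) {
		error = execute(dqp, data);	/* xfs_qm_flush_one() here */
		if (error == -EAGAIN) {
			skipped++;		/* buffer pushed; revisit later */
			continue;
		}
		if (error)
			last_error = error;
	}
	if (skipped) {
		delay(1);			/* give buffer I/O a chance to complete */
		goto restart;			/* flush lock should now be free */
	}
	return last_error;
```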
On Fri, Apr 07, 2017 at 08:06:31AM -0400, Brian Foster wrote: > On Thu, Apr 06, 2017 at 03:34:17PM -0700, Darrick J. Wong wrote: > > On Fri, Feb 24, 2017 at 02:53:20PM -0500, Brian Foster wrote: > > > Reclaim during quotacheck can lead to deadlocks on the dquot flush > > > lock: > > > > > > - Quotacheck populates a local delwri queue with the physical dquot > > > buffers. > > > - Quotacheck performs the xfs_qm_dqusage_adjust() bulkstat and > > > dirties all of the dquots. > > > - Reclaim kicks in and attempts to flush a dquot whose buffer is > > > already queud on the quotacheck queue. The flush succeeds but > > > queueing to the reclaim delwri queue fails as the backing buffer is > > > already queued. The flush unlock is now deferred to I/O completion > > > of the buffer from the quotacheck queue. > > > > <typing aloud here, trying to follow along...> > > > > If I'm reading this correctly, this is the _buf_delwri_queue call > > towards the end of the _DQ_IS_DIRTY clause in _qm_dquot_isolate? Once > > the dquot is flushed we fail to add it to isol->buffers because it's > > already on quotacheck's buffer_list, which means that the > > _buf_delwri_submit call in _qm_shrink_scan never sees our dirty dquot. > > > > Therefore, the dquot doesn't get written and the flush lock stays > > locked... > > > > Yes, exactly. > > > > - The dqadjust bulkstat continues and dirties the recently flushed > > > dquot once again. > > > - Quotacheck proceeds to the xfs_qm_flush_one() walk which requires > > > the flush lock to update the backing buffers with the in-core > > > recalculated values. It deadlocks on the redirtied dquot as the > > > flush lock was already acquired by reclaim, but the buffer resides > > > on the local delwri queue which isn't submitted until the end of > > > quotacheck. > > > > ...until quotacheck tries to relock it and deadlocks, like you said. > > Ok, I think I understand the deadlock... > > > > Yep. > > > > This is reproduced by running quotacheck on a filesystem with a > > > couple million inodes in low memory (512MB-1GB) situations. This is > > > a regression as of commit 43ff2122e6 ("xfs: on-stack delayed write > > > buffer lists"), which removed a trylock and buffer I/O submission > > > from the quotacheck dquot flush sequence. > > > > > > Quotacheck first resets and collects the physical dquot buffers in a > > > delwri queue. Then, it traverses the filesystem inodes via bulkstat, > > > updates the in-core dquots, flushes the corrected dquots to the > > > backing buffers and finally submits the delwri queue for I/O. Since > > > the backing buffers are queued across the entire quotacheck > > > operation, dquot reclaim cannot possibly complete a dquot flush > > > before quotacheck completes. > > > > > > Therefore, quotacheck must submit the buffer for I/O in order to > > > cycle the flush lock and flush the dirty in-core dquot to the > > > buffer. Add a delwri queue buffer push mechanism to submit an > > > individual buffer for I/O without losing the delwri queue status and > > > use it from quotacheck to avoid the deadlock. This restores > > > quotacheck behavior to as before the regression was introduced. > > > > ...and therefore why this patch changes quotacheck to check if the > > dquot's flush lock is already held and if so, to call > > _buf_delwri_submit_buffers so that the io completion will unlock the > > flush lock and try again. > > > > Yes, though only for the specific buffer. > > > When we return to _qm_flush_one, the dquot is no longer dirty and we're > > done. 
I /think/ this looks ok, though I also think I want to hear if > > either Christian Affolter or Martin Svec have had any problems with > > these two patches applied. My current feeling is that this is a bit > > late for 4.12, but I'd certainly put it in -washme for wider testing. > > If anyone thinks this needs to happen sooner, please speak up. > > > > The dquot is still dirty and needs to be flushed on the second go around > (otherwise we wouldn't have needed the flush lock here on the first > try). The issue is basically that the dquot was adjusted (dirtied) at > some point during the bulkstat scan, reclaimed and flushed, and then > subsequently dirtied again such that another flush is required via > qm_flush_one(). The buffer push frees up the held flush lock so that > final flush can proceed. Ok. I'd thought that the dquot could still be dirty when we got to there. > BTW, what's the reasoning for being late to 4.12? I actually don't have > much of an opinion which release this goes into, so I guess my question > is more logistical due to the fact that up until recently I haven't paid > much attention to the upstream release cycles. ;P The 4.12 merge window > hasn't opened as we are still on 4.11-rc5, right? Is the concern then > that the merge window is close enough that this has missed 4.12 > linux-next soak time? We're still at least a week or two away from the merge window opening, but I want to hear from the two bug reporters that this patch series has made their symptoms go away, and that they've mounted enough times to be reasonably confident that they haven't simply gotten lucky and not hit the deadlock. I don't know that both of those things will happen in the time we have left, and seeing as the frequency of reports is pretty low, it doesn't seem like we have to rush it into 4.12. > Also, can you clarify the semantics of the -washme thing? ;) That > doesn't necessarily mean the patch has been "merged." Rather, it might > be suitable for the next for-next (4.13) provided nothing explodes by > then, correct? Yep. > I'm also a little curious on what extra testing that entails, if any, > though perhaps this topic is better suited by an independent thread.. Same testing as for-next gets (nightly auto group runs, ltp every now and then, various $magic workloads); it's really just a place to land patches that have been code reviewed but I'm not yet confident enough about to put into for-next for the next cycle. --D > > > Sorry this one took me a while to get to. :/ > > > > Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> > > > > No problem. Thanks for review... 
> > Brian > > > --D > > > > > Reported-by: Martin Svec <martin.svec@zoner.cz> > > > Signed-off-by: Brian Foster <bfoster@redhat.com> > > > --- > > > fs/xfs/xfs_buf.c | 37 +++++++++++++++++++++++++++++++++++++ > > > fs/xfs/xfs_buf.h | 1 + > > > fs/xfs/xfs_qm.c | 28 +++++++++++++++++++++++++++- > > > fs/xfs/xfs_trace.h | 1 + > > > 4 files changed, 66 insertions(+), 1 deletion(-) > > > > > > diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c > > > index e566510..e97cf56 100644 > > > --- a/fs/xfs/xfs_buf.c > > > +++ b/fs/xfs/xfs_buf.c > > > @@ -2011,6 +2011,43 @@ xfs_buf_delwri_submit( > > > return error; > > > } > > > > > > +int > > > +xfs_buf_delwri_pushbuf( > > > + struct xfs_buf *bp, > > > + struct list_head *buffer_list) > > > +{ > > > + LIST_HEAD (submit_list); > > > + int error; > > > + > > > + ASSERT(xfs_buf_islocked(bp)); > > > + ASSERT(bp->b_flags & _XBF_DELWRI_Q); > > > + > > > + trace_xfs_buf_delwri_pushbuf(bp, _RET_IP_); > > > + > > > + /* > > > + * Move the buffer to an empty list and submit. Pass the original list > > > + * as the wait list so delwri submission moves the buf back to it before > > > + * it is submitted (and thus before it is unlocked). This means the > > > + * buffer cannot be placed on another list while we wait for it. > > > + */ > > > + list_move(&bp->b_list, &submit_list); > > > + xfs_buf_unlock(bp); > > > + > > > + xfs_buf_delwri_submit_buffers(&submit_list, buffer_list); > > > + > > > + /* > > > + * Lock the buffer to wait for I/O completion. It's already held on the > > > + * original list, so all we have to do is reset the delwri queue flag > > > + * that was cleared by delwri submission. > > > + */ > > > + xfs_buf_lock(bp); > > > + error = bp->b_error; > > > + bp->b_flags |= _XBF_DELWRI_Q; > > > + xfs_buf_unlock(bp); > > > + > > > + return error; > > > +} > > > + > > > int __init > > > xfs_buf_init(void) > > > { > > > diff --git a/fs/xfs/xfs_buf.h b/fs/xfs/xfs_buf.h > > > index 8a9d3a9..cd74216 100644 > > > --- a/fs/xfs/xfs_buf.h > > > +++ b/fs/xfs/xfs_buf.h > > > @@ -334,6 +334,7 @@ extern void xfs_buf_stale(struct xfs_buf *bp); > > > extern bool xfs_buf_delwri_queue(struct xfs_buf *, struct list_head *); > > > extern int xfs_buf_delwri_submit(struct list_head *); > > > extern int xfs_buf_delwri_submit_nowait(struct list_head *); > > > +extern int xfs_buf_delwri_pushbuf(struct xfs_buf *, struct list_head *); > > > > > > /* Buffer Daemon Setup Routines */ > > > extern int xfs_buf_init(void); > > > diff --git a/fs/xfs/xfs_qm.c b/fs/xfs/xfs_qm.c > > > index 4ff993c..3815ed3 100644 > > > --- a/fs/xfs/xfs_qm.c > > > +++ b/fs/xfs/xfs_qm.c > > > @@ -1247,6 +1247,7 @@ xfs_qm_flush_one( > > > struct xfs_dquot *dqp, > > > void *data) > > > { > > > + struct xfs_mount *mp = dqp->q_mount; > > > struct list_head *buffer_list = data; > > > struct xfs_buf *bp = NULL; > > > int error = 0; > > > @@ -1257,7 +1258,32 @@ xfs_qm_flush_one( > > > if (!XFS_DQ_IS_DIRTY(dqp)) > > > goto out_unlock; > > > > > > - xfs_dqflock(dqp); > > > + /* > > > + * The only way the dquot is already flush locked by the time quotacheck > > > + * gets here is if reclaim flushed it before the dqadjust walk dirtied > > > + * it for the final time. Quotacheck collects all dquot bufs in the > > > + * local delwri queue before dquots are dirtied, so reclaim can't have > > > + * possibly queued it for I/O. The only way out is to push the buffer to > > > + * cycle the flush lock. 
> > > + */ > > > + if (!xfs_dqflock_nowait(dqp)) { > > > + /* buf is pinned in-core by delwri list */ > > > + DEFINE_SINGLE_BUF_MAP(map, dqp->q_blkno, > > > + mp->m_quotainfo->qi_dqchunklen); > > > + bp = _xfs_buf_find(mp->m_ddev_targp, &map, 1, 0, NULL); > > > + if (!bp) { > > > + error = -EINVAL; > > > + goto out_unlock; > > > + } > > > + > > > + /* delwri_pushbuf drops the buf lock */ > > > + xfs_buf_delwri_pushbuf(bp, buffer_list); > > > + xfs_buf_rele(bp); > > > + > > > + error = -EAGAIN; > > > + goto out_unlock; > > > + } > > > + > > > error = xfs_qm_dqflush(dqp, &bp); > > > if (error) > > > goto out_unlock; > > > diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h > > > index fb7555e..c8d28a9 100644 > > > --- a/fs/xfs/xfs_trace.h > > > +++ b/fs/xfs/xfs_trace.h > > > @@ -365,6 +365,7 @@ DEFINE_BUF_EVENT(xfs_buf_iowait_done); > > > DEFINE_BUF_EVENT(xfs_buf_delwri_queue); > > > DEFINE_BUF_EVENT(xfs_buf_delwri_queued); > > > DEFINE_BUF_EVENT(xfs_buf_delwri_split); > > > +DEFINE_BUF_EVENT(xfs_buf_delwri_pushbuf); > > > DEFINE_BUF_EVENT(xfs_buf_get_uncached); > > > DEFINE_BUF_EVENT(xfs_buf_item_relse); > > > DEFINE_BUF_EVENT(xfs_buf_item_iodone_async); > > > -- > > > 2.7.4 > > > > > > -- > > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in > > > the body of a message to majordomo@vger.kernel.org > > > More majordomo info at http://vger.kernel.org/majordomo-info.html > > -- > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in > > the body of a message to majordomo@vger.kernel.org > > More majordomo info at http://vger.kernel.org/majordomo-info.html > -- > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-xfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Fri, Feb 24, 2017 at 02:53:20PM -0500, Brian Foster wrote: > Reclaim during quotacheck can lead to deadlocks on the dquot flush > lock: > > - Quotacheck populates a local delwri queue with the physical dquot > buffers. > - Quotacheck performs the xfs_qm_dqusage_adjust() bulkstat and > dirties all of the dquots. > - Reclaim kicks in and attempts to flush a dquot whose buffer is > already queud on the quotacheck queue. The flush succeeds but > queueing to the reclaim delwri queue fails as the backing buffer is > already queued. The flush unlock is now deferred to I/O completion > of the buffer from the quotacheck queue. > - The dqadjust bulkstat continues and dirties the recently flushed > dquot once again. > - Quotacheck proceeds to the xfs_qm_flush_one() walk which requires > the flush lock to update the backing buffers with the in-core > recalculated values. It deadlocks on the redirtied dquot as the > flush lock was already acquired by reclaim, but the buffer resides > on the local delwri queue which isn't submitted until the end of > quotacheck. > > This is reproduced by running quotacheck on a filesystem with a > couple million inodes in low memory (512MB-1GB) situations. This is > a regression as of commit 43ff2122e6 ("xfs: on-stack delayed write > buffer lists"), which removed a trylock and buffer I/O submission > from the quotacheck dquot flush sequence. > > Quotacheck first resets and collects the physical dquot buffers in a > delwri queue. Then, it traverses the filesystem inodes via bulkstat, > updates the in-core dquots, flushes the corrected dquots to the > backing buffers and finally submits the delwri queue for I/O. Since > the backing buffers are queued across the entire quotacheck > operation, dquot reclaim cannot possibly complete a dquot flush > before quotacheck completes. > > Therefore, quotacheck must submit the buffer for I/O in order to > cycle the flush lock and flush the dirty in-core dquot to the > buffer. Add a delwri queue buffer push mechanism to submit an > individual buffer for I/O without losing the delwri queue status and > use it from quotacheck to avoid the deadlock. This restores > quotacheck behavior to as before the regression was introduced. While it may fix the problem, the solution gives me the heebee jeebees. I'm on holidays, so I haven't bothered to spend the hours necessary to answer these questions, but to give you an idea, this was what I thought as I read the patch. i.e. I have concerns about whether.... > Reported-by: Martin Svec <martin.svec@zoner.cz> > Signed-off-by: Brian Foster <bfoster@redhat.com> > --- > fs/xfs/xfs_buf.c | 37 +++++++++++++++++++++++++++++++++++++ > fs/xfs/xfs_buf.h | 1 + > fs/xfs/xfs_qm.c | 28 +++++++++++++++++++++++++++- > fs/xfs/xfs_trace.h | 1 + > 4 files changed, 66 insertions(+), 1 deletion(-) > > diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c > index e566510..e97cf56 100644 > --- a/fs/xfs/xfs_buf.c > +++ b/fs/xfs/xfs_buf.c > @@ -2011,6 +2011,43 @@ xfs_buf_delwri_submit( > return error; > } > > +int > +xfs_buf_delwri_pushbuf( > + struct xfs_buf *bp, > + struct list_head *buffer_list) > +{ > + LIST_HEAD (submit_list); > + int error; > + > + ASSERT(xfs_buf_islocked(bp)); > + ASSERT(bp->b_flags & _XBF_DELWRI_Q); > + > + trace_xfs_buf_delwri_pushbuf(bp, _RET_IP_); > + > + /* > + * Move the buffer to an empty list and submit. Pass the original list > + * as the wait list so delwri submission moves the buf back to it before > + * it is submitted (and thus before it is unlocked). 
This means the > + * buffer cannot be placed on another list while we wait for it. > + */ > + list_move(&bp->b_list, &submit_list); > + xfs_buf_unlock(bp); .... this is safe/racy as we may have just moved it off the delwri queue without changing state, reference counts, etc? > + > + xfs_buf_delwri_submit_buffers(&submit_list, buffer_list); .... using a caller supplied delwri buffer list as the buffer IO wait list destination is making big assumptions about the internal use of the wait list? e.g that xfs_buf_delwri_submit_buffers() does not initialise the list_head before use... .... we should be doing IO submission while holding other things on the delwri list and unknown caller locks? .... we have all the buffer reference counts we need to make this work correctly? > + /* > + * Lock the buffer to wait for I/O completion. It's already held on the > + * original list, so all we have to do is reset the delwri queue flag > + * that was cleared by delwri submission. > + */ > + xfs_buf_lock(bp); > + error = bp->b_error; > + bp->b_flags |= _XBF_DELWRI_Q; > + xfs_buf_unlock(bp); .... this is racy w.r.t. the buffer going back onto the buffer list without holding the buffer lock, or that the _XBF_DELWRI_Q setting/clearing is not atomic w.r.t. the delwri queue manipulations (i.e. can now be on the delwri list but not have _XBF_DELWRI_Q set)? .... the error is left on the buffer so it gets tripped over when it is next accessed? .... that the buffer locking is unbalanced for some undocumented reason? > + return error; > +} > + > int __init > xfs_buf_init(void) > { > diff --git a/fs/xfs/xfs_buf.h b/fs/xfs/xfs_buf.h > index 8a9d3a9..cd74216 100644 > --- a/fs/xfs/xfs_buf.h > +++ b/fs/xfs/xfs_buf.h > @@ -334,6 +334,7 @@ extern void xfs_buf_stale(struct xfs_buf *bp); > extern bool xfs_buf_delwri_queue(struct xfs_buf *, struct list_head *); > extern int xfs_buf_delwri_submit(struct list_head *); > extern int xfs_buf_delwri_submit_nowait(struct list_head *); > +extern int xfs_buf_delwri_pushbuf(struct xfs_buf *, struct list_head *); > > /* Buffer Daemon Setup Routines */ > extern int xfs_buf_init(void); > diff --git a/fs/xfs/xfs_qm.c b/fs/xfs/xfs_qm.c > index 4ff993c..3815ed3 100644 > --- a/fs/xfs/xfs_qm.c > +++ b/fs/xfs/xfs_qm.c > @@ -1247,6 +1247,7 @@ xfs_qm_flush_one( > struct xfs_dquot *dqp, > void *data) > { > + struct xfs_mount *mp = dqp->q_mount; > struct list_head *buffer_list = data; > struct xfs_buf *bp = NULL; > int error = 0; > @@ -1257,7 +1258,32 @@ xfs_qm_flush_one( > if (!XFS_DQ_IS_DIRTY(dqp)) > goto out_unlock; > > - xfs_dqflock(dqp); > + /* > + * The only way the dquot is already flush locked by the time quotacheck > + * gets here is if reclaim flushed it before the dqadjust walk dirtied > + * it for the final time. Quotacheck collects all dquot bufs in the > + * local delwri queue before dquots are dirtied, so reclaim can't have > + * possibly queued it for I/O. The only way out is to push the buffer to > + * cycle the flush lock. > + */ > + if (!xfs_dqflock_nowait(dqp)) { > + /* buf is pinned in-core by delwri list */ > + DEFINE_SINGLE_BUF_MAP(map, dqp->q_blkno, > + mp->m_quotainfo->qi_dqchunklen); > + bp = _xfs_buf_find(mp->m_ddev_targp, &map, 1, 0, NULL); > + if (!bp) { > + error = -EINVAL; > + goto out_unlock; > + } > + > + /* delwri_pushbuf drops the buf lock */ > + xfs_buf_delwri_pushbuf(bp, buffer_list); Ummm - you threw away the returned error.... > + xfs_buf_rele(bp); And despite the comment, I think this is simply wrong. 
We try really hard to maintain balanced locking, and as such xfs_buf_relse() is what goes along with _xfs_buf_find(). i.e. we are returned a locked buffer with a reference by xfs_buf_find(), but xfs_buf_rele() only drops the reference. So if I look at this code in isolation, it looks like it leaks a buffer lock, and now I have to go read other code to understand why it doesn't and I'm left to wonder why it was implemented this way.... Cheers, Dave.
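For context on the balanced-locking convention Dave refers to here: _xfs_buf_find() returns a buffer that is both locked and referenced, and xfs_buf_relse() is the helper that undoes both, whereas xfs_buf_rele() only drops the reference. Roughly, as the helper is defined in fs/xfs/xfs_buf.h:

```c
/* The pairing Dave refers to, roughly as defined in fs/xfs/xfs_buf.h: */
static inline void xfs_buf_relse(struct xfs_buf *bp)
{
	xfs_buf_unlock(bp);	/* drop the lock _xfs_buf_find() returned held */
	xfs_buf_rele(bp);	/* drop the reference it returned */
}
```

The quotacheck hunk calls only xfs_buf_rele() because xfs_buf_delwri_pushbuf() has already dropped the buffer lock internally, which is why the call site looks unbalanced when read in isolation.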
On Mon, Apr 10, 2017 at 03:00:01PM +1000, Dave Chinner wrote: > On Fri, Feb 24, 2017 at 02:53:20PM -0500, Brian Foster wrote: > > Reclaim during quotacheck can lead to deadlocks on the dquot flush > > lock: > > > > - Quotacheck populates a local delwri queue with the physical dquot > > buffers. > > - Quotacheck performs the xfs_qm_dqusage_adjust() bulkstat and > > dirties all of the dquots. > > - Reclaim kicks in and attempts to flush a dquot whose buffer is > > already queud on the quotacheck queue. The flush succeeds but > > queueing to the reclaim delwri queue fails as the backing buffer is > > already queued. The flush unlock is now deferred to I/O completion > > of the buffer from the quotacheck queue. > > - The dqadjust bulkstat continues and dirties the recently flushed > > dquot once again. > > - Quotacheck proceeds to the xfs_qm_flush_one() walk which requires > > the flush lock to update the backing buffers with the in-core > > recalculated values. It deadlocks on the redirtied dquot as the > > flush lock was already acquired by reclaim, but the buffer resides > > on the local delwri queue which isn't submitted until the end of > > quotacheck. > > > > This is reproduced by running quotacheck on a filesystem with a > > couple million inodes in low memory (512MB-1GB) situations. This is > > a regression as of commit 43ff2122e6 ("xfs: on-stack delayed write > > buffer lists"), which removed a trylock and buffer I/O submission > > from the quotacheck dquot flush sequence. > > > > Quotacheck first resets and collects the physical dquot buffers in a > > delwri queue. Then, it traverses the filesystem inodes via bulkstat, > > updates the in-core dquots, flushes the corrected dquots to the > > backing buffers and finally submits the delwri queue for I/O. Since > > the backing buffers are queued across the entire quotacheck > > operation, dquot reclaim cannot possibly complete a dquot flush > > before quotacheck completes. > > > > Therefore, quotacheck must submit the buffer for I/O in order to > > cycle the flush lock and flush the dirty in-core dquot to the > > buffer. Add a delwri queue buffer push mechanism to submit an > > individual buffer for I/O without losing the delwri queue status and > > use it from quotacheck to avoid the deadlock. This restores > > quotacheck behavior to as before the regression was introduced. > > While it may fix the problem, the solution gives me the heebee > jeebees. I'm on holidays, so I haven't bothered to spend the hours > necessary to answer these questions, but to give you an idea, this > was what I thought as I read the patch. i.e. I have concerns about > whether.... > Thanks for looking at this. FWIW, this is based on a preexisting solution prior to the commit referenced above (obviously this doesn't mean the code is correct or that this is the best solution). 
> > Reported-by: Martin Svec <martin.svec@zoner.cz> > > Signed-off-by: Brian Foster <bfoster@redhat.com> > > --- > > fs/xfs/xfs_buf.c | 37 +++++++++++++++++++++++++++++++++++++ > > fs/xfs/xfs_buf.h | 1 + > > fs/xfs/xfs_qm.c | 28 +++++++++++++++++++++++++++- > > fs/xfs/xfs_trace.h | 1 + > > 4 files changed, 66 insertions(+), 1 deletion(-) > > > > diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c > > index e566510..e97cf56 100644 > > --- a/fs/xfs/xfs_buf.c > > +++ b/fs/xfs/xfs_buf.c > > @@ -2011,6 +2011,43 @@ xfs_buf_delwri_submit( > > return error; > > } > > > > +int > > +xfs_buf_delwri_pushbuf( > > + struct xfs_buf *bp, > > + struct list_head *buffer_list) > > +{ > > + LIST_HEAD (submit_list); > > + int error; > > + > > + ASSERT(xfs_buf_islocked(bp)); > > + ASSERT(bp->b_flags & _XBF_DELWRI_Q); > > + > > + trace_xfs_buf_delwri_pushbuf(bp, _RET_IP_); > > + > > + /* > > + * Move the buffer to an empty list and submit. Pass the original list > > + * as the wait list so delwri submission moves the buf back to it before > > + * it is submitted (and thus before it is unlocked). This means the > > + * buffer cannot be placed on another list while we wait for it. > > + */ > > + list_move(&bp->b_list, &submit_list); > > + xfs_buf_unlock(bp); > > .... this is safe/racy as we may have just moved it off the delwri > queue without changing state, reference counts, etc? > Racy with..? The buffer is effectively still on the delwri queue here. We've just replaced the caller queue with a local queue to reuse the existing submission infrastructure. IOW, we've just changed the queue that the buffer is on. > > + > > + xfs_buf_delwri_submit_buffers(&submit_list, buffer_list); > > .... using a caller supplied delwri buffer list as the buffer IO > wait list destination is making big assumptions about the internal > use of the wait list? e.g that xfs_buf_delwri_submit_buffers() does > not initialise the list_head before use... > This is the scope at which the wait list is defined/initialized. E.g., _delwri_submit() initializes its local wait list and passes it down for _delwri_submit_buffers() to use. This basically ensures that the buffer is placed back on the original delwri queue because the latter moves the buffer over under lock, before it is submitted for I/O. We could probably use a local wait_list just the same if that is preferred, and explicitly add the buffer back from the wait_list to the original delwri queue. FWIW, it's not clear to me whether you're asking about the use of the wait_list here or indicating preference. > .... we should be doing IO submission while holding other things on > the delwri list and unknown caller locks? > I'm not following... this is effectively a move/submit of a delwri queue item to another delwri queue. If the delwri queue is by design a local data structure then only one thread should be queue processing at a time. Why would this particular action be problematic? Technically, we could rework this into something like a delwri_queue_move()/delwri_submit() sequence, but the problem with that is we lose the queue-populated state of the buffer for a period of time and thus somebody else (i.e., reclaim) can jump in and queue it out from underneath the caller. The purpose of this mechanism is to preserve original delwri queue ownership of the buffer. > .... we have all the buffer reference counts we need to make this > work correctly? > AFAICT, we follow the existing delwri queue rules with regard to locking and reference counting. 
A buffer is queued under the buf lock and the queue takes a reference on the buffer. Queue submission locks the buffer, acquires a wait_list reference and moves to the wait list if one is specified and submits the buffer. Buffer I/O completion drops the original queue reference. If the wait_list was specified, we cycle the buffer lock (to wait for I/O completion), remove from the wait_list and drop the wait_list reference. The pushbuf() call moves the buffer to the submit_list under the buffer lock, which is effectively another local delwri queue. This takes ownership of the original delwri queue reference. Queue submission occurs as above: the buffer is moved to the wait_list, an associated wait_list reference is acquired and the buffer is submitted. I/O completion drops the original delwri queue reference as before, so we are in the same state as delwri_submit() up through I/O completion. We again cycle the buf lock to wait on I/O completion and process an error. Rather than remove from the wait_list and drop the reference, however, pushbuf() has the buffer back on the original delwri queue with an associated reference. The only explicit state change here is the DELWRI_Q flag, which is reset. So am I missing something here with regard to proper reference counting..? If so, please elaborate and I'll try to dig into it. > > + /* > > + * Lock the buffer to wait for I/O completion. It's already held on the > > + * original list, so all we have to do is reset the delwri queue flag > > + * that was cleared by delwri submission. > > + */ > > + xfs_buf_lock(bp); > > + error = bp->b_error; > > + bp->b_flags |= _XBF_DELWRI_Q; > > + xfs_buf_unlock(bp); > > .... this is racy w.r.t. the buffer going back onto the > buffer list without holding the buffer lock, or that the > _XBF_DELWRI_Q setting/clearing is not atomic w.r.t. the delwri queue > manipulations (i.e. can now be on the delwri list but not have > _XBF_DELWRI_Q set)? > The buffer has been put back on the original delwri queue (by _delwri_submit_buffers()) before the lock was dropped. Using the delwri queue as the wait_list does mean that the flag might not be set until the delwri queue processing is done and we have returned to the caller. If another delwri queue occurs, the flag may be reset but the buffer cannot be added to another queue, which is what we want (as noted above). (I'm not sure why delwri_queue() returns true in that case, but that's a separate issue.) > .... the error is left on the buffer so it gets tripped over when > it is next accessed? > Same as delwri_submit(). IIUC, errors are left on buffers by design and cleared on a subsequemt I/O submission. > .... that the buffer locking is unbalanced for some undocumented > reason? 
> > > + return error; > > +} > > + > > int __init > > xfs_buf_init(void) > > { > > diff --git a/fs/xfs/xfs_buf.h b/fs/xfs/xfs_buf.h > > index 8a9d3a9..cd74216 100644 > > --- a/fs/xfs/xfs_buf.h > > +++ b/fs/xfs/xfs_buf.h > > @@ -334,6 +334,7 @@ extern void xfs_buf_stale(struct xfs_buf *bp); > > extern bool xfs_buf_delwri_queue(struct xfs_buf *, struct list_head *); > > extern int xfs_buf_delwri_submit(struct list_head *); > > extern int xfs_buf_delwri_submit_nowait(struct list_head *); > > +extern int xfs_buf_delwri_pushbuf(struct xfs_buf *, struct list_head *); > > > > /* Buffer Daemon Setup Routines */ > > extern int xfs_buf_init(void); > > diff --git a/fs/xfs/xfs_qm.c b/fs/xfs/xfs_qm.c > > index 4ff993c..3815ed3 100644 > > --- a/fs/xfs/xfs_qm.c > > +++ b/fs/xfs/xfs_qm.c > > @@ -1247,6 +1247,7 @@ xfs_qm_flush_one( > > struct xfs_dquot *dqp, > > void *data) > > { > > + struct xfs_mount *mp = dqp->q_mount; > > struct list_head *buffer_list = data; > > struct xfs_buf *bp = NULL; > > int error = 0; > > @@ -1257,7 +1258,32 @@ xfs_qm_flush_one( > > if (!XFS_DQ_IS_DIRTY(dqp)) > > goto out_unlock; > > > > - xfs_dqflock(dqp); > > + /* > > + * The only way the dquot is already flush locked by the time quotacheck > > + * gets here is if reclaim flushed it before the dqadjust walk dirtied > > + * it for the final time. Quotacheck collects all dquot bufs in the > > + * local delwri queue before dquots are dirtied, so reclaim can't have > > + * possibly queued it for I/O. The only way out is to push the buffer to > > + * cycle the flush lock. > > + */ > > + if (!xfs_dqflock_nowait(dqp)) { > > + /* buf is pinned in-core by delwri list */ > > + DEFINE_SINGLE_BUF_MAP(map, dqp->q_blkno, > > + mp->m_quotainfo->qi_dqchunklen); > > + bp = _xfs_buf_find(mp->m_ddev_targp, &map, 1, 0, NULL); > > + if (!bp) { > > + error = -EINVAL; > > + goto out_unlock; > > + } > > + > > + /* delwri_pushbuf drops the buf lock */ > > + xfs_buf_delwri_pushbuf(bp, buffer_list); > > Ummm - you threw away the returned error.... > The I/O is going to be retried, but I'll fix it to just return from here. > > + xfs_buf_rele(bp); > > And despite the comment, I think this is simply wrong. We try really > hard to maintain balanced locking, and as such xfs_buf_relse() is > what goes along with _xfs_buf_find(). i.e. we are returned a locked > buffer with a reference by xfs_buf_find(), but xfs_buf_rele() only > drops the reference. > I'm not quite following if you are saying the locking is unbalanced or it simply looks that way. If the former, can you please elaborate? Otherwise, we can probably have pushbuf() return with the buffer lock held so the caller remains responsible for the lock and reference it acquired. > So if I look at this code in isolation, it looks like it leaks a > buffer lock, and now I have to go read other code to understand why > it doesn't and I'm left to wonder why it was implemented this > way.... > Thanks for taking the time to send feedback. FWIW, I take most of these questions as "things one would have to verify in order to review this patch if one were not on vacation" :P as opposed to pointing things out as explicitly broken. As noted previously, please do point out anything I've missed to the contrary. Otherwise, I'll plan to send another version that includes the error and locking API fixups. Brian > Cheers, > > Dave. 
> -- > Dave Chinner > david@fromorbit.com > -- > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-xfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
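To condense the lifecycle Brian walks through above, here is the lock and reference state across xfs_buf_delwri_pushbuf() as argued in that mail (a summary of the argument, not a statement about the final code):

```c
/*
 * Lock/hold state across xfs_buf_delwri_pushbuf(), per the discussion
 * above (summary of the mail, not authoritative):
 *
 * entry:       bp locked by caller; one hold owned by the original
 *              delwri queue; _XBF_DELWRI_Q set
 * list_move(): bp moved to the local submit_list, still carrying the
 *              original queue's hold; bp then unlocked
 * xfs_buf_delwri_submit_buffers(&submit_list, buffer_list):
 *              takes a wait-list hold, moves bp back onto buffer_list
 *              under the buffer lock, clears _XBF_DELWRI_Q, submits I/O;
 *              I/O completion drops the original queue's hold
 * xfs_buf_lock(): waits for I/O completion
 * exit:        bp back on buffer_list, the wait-list hold standing in
 *              for the delwri-queue hold; _XBF_DELWRI_Q set again
 */
```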
On Fri, Apr 07, 2017 at 01:07:25PM -0700, Darrick J. Wong wrote: > On Fri, Apr 07, 2017 at 08:06:31AM -0400, Brian Foster wrote: > > On Thu, Apr 06, 2017 at 03:34:17PM -0700, Darrick J. Wong wrote: > > > On Fri, Feb 24, 2017 at 02:53:20PM -0500, Brian Foster wrote: > > > > Reclaim during quotacheck can lead to deadlocks on the dquot flush > > > > lock: > > > > > > > > - Quotacheck populates a local delwri queue with the physical dquot > > > > buffers. > > > > - Quotacheck performs the xfs_qm_dqusage_adjust() bulkstat and > > > > dirties all of the dquots. > > > > - Reclaim kicks in and attempts to flush a dquot whose buffer is > > > > already queud on the quotacheck queue. The flush succeeds but > > > > queueing to the reclaim delwri queue fails as the backing buffer is > > > > already queued. The flush unlock is now deferred to I/O completion > > > > of the buffer from the quotacheck queue. > > > > > > <typing aloud here, trying to follow along...> > > > > > > If I'm reading this correctly, this is the _buf_delwri_queue call > > > towards the end of the _DQ_IS_DIRTY clause in _qm_dquot_isolate? Once > > > the dquot is flushed we fail to add it to isol->buffers because it's > > > already on quotacheck's buffer_list, which means that the > > > _buf_delwri_submit call in _qm_shrink_scan never sees our dirty dquot. > > > > > > Therefore, the dquot doesn't get written and the flush lock stays > > > locked... > > > > > > > Yes, exactly. > > > > > > - The dqadjust bulkstat continues and dirties the recently flushed > > > > dquot once again. > > > > - Quotacheck proceeds to the xfs_qm_flush_one() walk which requires > > > > the flush lock to update the backing buffers with the in-core > > > > recalculated values. It deadlocks on the redirtied dquot as the > > > > flush lock was already acquired by reclaim, but the buffer resides > > > > on the local delwri queue which isn't submitted until the end of > > > > quotacheck. > > > > > > ...until quotacheck tries to relock it and deadlocks, like you said. > > > Ok, I think I understand the deadlock... > > > > > > > Yep. > > > > > > This is reproduced by running quotacheck on a filesystem with a > > > > couple million inodes in low memory (512MB-1GB) situations. This is > > > > a regression as of commit 43ff2122e6 ("xfs: on-stack delayed write > > > > buffer lists"), which removed a trylock and buffer I/O submission > > > > from the quotacheck dquot flush sequence. > > > > > > > > Quotacheck first resets and collects the physical dquot buffers in a > > > > delwri queue. Then, it traverses the filesystem inodes via bulkstat, > > > > updates the in-core dquots, flushes the corrected dquots to the > > > > backing buffers and finally submits the delwri queue for I/O. Since > > > > the backing buffers are queued across the entire quotacheck > > > > operation, dquot reclaim cannot possibly complete a dquot flush > > > > before quotacheck completes. > > > > > > > > Therefore, quotacheck must submit the buffer for I/O in order to > > > > cycle the flush lock and flush the dirty in-core dquot to the > > > > buffer. Add a delwri queue buffer push mechanism to submit an > > > > individual buffer for I/O without losing the delwri queue status and > > > > use it from quotacheck to avoid the deadlock. This restores > > > > quotacheck behavior to as before the regression was introduced. 
> > > > > > ...and therefore why this patch changes quotacheck to check if the > > > dquot's flush lock is already held and if so, to call > > > _buf_delwri_submit_buffers so that the io completion will unlock the > > > flush lock and try again. > > > > > > > Yes, though only for the specific buffer. > > > > > When we return to _qm_flush_one, the dquot is no longer dirty and we're > > > done. I /think/ this looks ok, though I also think I want to hear if > > > either Christian Affolter or Martin Svec have had any problems with > > > these two patches applied. My current feeling is that this is a bit > > > late for 4.12, but I'd certainly put it in -washme for wider testing. > > > If anyone thinks this needs to happen sooner, please speak up. > > > > > > > The dquot is still dirty and needs to be flushed on the second go around > > (otherwise we wouldn't have needed the flush lock here on the first > > try). The issue is basically that the dquot was adjusted (dirtied) at > > some point during the bulkstat scan, reclaimed and flushed, and then > > subsequently dirtied again such that another flush is required via > > qm_flush_one(). The buffer push frees up the held flush lock so that > > final flush can proceed. > > Ok. I'd thought that the dquot could still be dirty when we got to there. > > > BTW, what's the reasoning for being late to 4.12? I actually don't have > > much of an opinion which release this goes into, so I guess my question > > is more logistical due to the fact that up until recently I haven't paid > > much attention to the upstream release cycles. ;P The 4.12 merge window > > hasn't opened as we are still on 4.11-rc5, right? Is the concern then > > that the merge window is close enough that this has missed 4.12 > > linux-next soak time? > > We're still at least a week or two away from the merge window opening, > but I want to hear from the two bug reporters that this patch series has > made their symptoms go away, and that they've mounted enough times to be > reasonably confident that they haven't simply gotten lucky and not hit > the deadlock. I don't know that both of those things will happen in the > time we have left, and seeing as the frequency of reports is pretty low, > it doesn't seem like we have to rush it into 4.12. > > > Also, can you clarify the semantics of the -washme thing? ;) That > > doesn't necessarily mean the patch has been "merged." Rather, it might > > be suitable for the next for-next (4.13) provided nothing explodes by > > then, correct? > > Yep. > > > I'm also a little curious on what extra testing that entails, if any, > > though perhaps this topic is better suited by an independent thread.. > > Same testing as for-next gets (nightly auto group runs, ltp every now > and then, various $magic workloads); it's really just a place to land > patches that have been code reviewed but I'm not yet confident enough > about to put into for-next for the next cycle. > Ok, thanks. Note there will be another version incoming based on Dave's feedback. Otherwise sounds good. Brian > --D > > > > > > Sorry this one took me a while to get to. :/ > > > > > > Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> > > > > > > > No problem. Thanks for review... 
> > > > Brian > > > > > --D > > > > > > > Reported-by: Martin Svec <martin.svec@zoner.cz> > > > > Signed-off-by: Brian Foster <bfoster@redhat.com> > > > > --- > > > > fs/xfs/xfs_buf.c | 37 +++++++++++++++++++++++++++++++++++++ > > > > fs/xfs/xfs_buf.h | 1 + > > > > fs/xfs/xfs_qm.c | 28 +++++++++++++++++++++++++++- > > > > fs/xfs/xfs_trace.h | 1 + > > > > 4 files changed, 66 insertions(+), 1 deletion(-) > > > > > > > > diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c > > > > index e566510..e97cf56 100644 > > > > --- a/fs/xfs/xfs_buf.c > > > > +++ b/fs/xfs/xfs_buf.c > > > > @@ -2011,6 +2011,43 @@ xfs_buf_delwri_submit( > > > > return error; > > > > } > > > > > > > > +int > > > > +xfs_buf_delwri_pushbuf( > > > > + struct xfs_buf *bp, > > > > + struct list_head *buffer_list) > > > > +{ > > > > + LIST_HEAD (submit_list); > > > > + int error; > > > > + > > > > + ASSERT(xfs_buf_islocked(bp)); > > > > + ASSERT(bp->b_flags & _XBF_DELWRI_Q); > > > > + > > > > + trace_xfs_buf_delwri_pushbuf(bp, _RET_IP_); > > > > + > > > > + /* > > > > + * Move the buffer to an empty list and submit. Pass the original list > > > > + * as the wait list so delwri submission moves the buf back to it before > > > > + * it is submitted (and thus before it is unlocked). This means the > > > > + * buffer cannot be placed on another list while we wait for it. > > > > + */ > > > > + list_move(&bp->b_list, &submit_list); > > > > + xfs_buf_unlock(bp); > > > > + > > > > + xfs_buf_delwri_submit_buffers(&submit_list, buffer_list); > > > > + > > > > + /* > > > > + * Lock the buffer to wait for I/O completion. It's already held on the > > > > + * original list, so all we have to do is reset the delwri queue flag > > > > + * that was cleared by delwri submission. > > > > + */ > > > > + xfs_buf_lock(bp); > > > > + error = bp->b_error; > > > > + bp->b_flags |= _XBF_DELWRI_Q; > > > > + xfs_buf_unlock(bp); > > > > + > > > > + return error; > > > > +} > > > > + > > > > int __init > > > > xfs_buf_init(void) > > > > { > > > > diff --git a/fs/xfs/xfs_buf.h b/fs/xfs/xfs_buf.h > > > > index 8a9d3a9..cd74216 100644 > > > > --- a/fs/xfs/xfs_buf.h > > > > +++ b/fs/xfs/xfs_buf.h > > > > @@ -334,6 +334,7 @@ extern void xfs_buf_stale(struct xfs_buf *bp); > > > > extern bool xfs_buf_delwri_queue(struct xfs_buf *, struct list_head *); > > > > extern int xfs_buf_delwri_submit(struct list_head *); > > > > extern int xfs_buf_delwri_submit_nowait(struct list_head *); > > > > +extern int xfs_buf_delwri_pushbuf(struct xfs_buf *, struct list_head *); > > > > > > > > /* Buffer Daemon Setup Routines */ > > > > extern int xfs_buf_init(void); > > > > diff --git a/fs/xfs/xfs_qm.c b/fs/xfs/xfs_qm.c > > > > index 4ff993c..3815ed3 100644 > > > > --- a/fs/xfs/xfs_qm.c > > > > +++ b/fs/xfs/xfs_qm.c > > > > @@ -1247,6 +1247,7 @@ xfs_qm_flush_one( > > > > struct xfs_dquot *dqp, > > > > void *data) > > > > { > > > > + struct xfs_mount *mp = dqp->q_mount; > > > > struct list_head *buffer_list = data; > > > > struct xfs_buf *bp = NULL; > > > > int error = 0; > > > > @@ -1257,7 +1258,32 @@ xfs_qm_flush_one( > > > > if (!XFS_DQ_IS_DIRTY(dqp)) > > > > goto out_unlock; > > > > > > > > - xfs_dqflock(dqp); > > > > + /* > > > > + * The only way the dquot is already flush locked by the time quotacheck > > > > + * gets here is if reclaim flushed it before the dqadjust walk dirtied > > > > + * it for the final time. Quotacheck collects all dquot bufs in the > > > > + * local delwri queue before dquots are dirtied, so reclaim can't have > > > > + * possibly queued it for I/O. 
The only way out is to push the buffer to > > > > + * cycle the flush lock. > > > > + */ > > > > + if (!xfs_dqflock_nowait(dqp)) { > > > > + /* buf is pinned in-core by delwri list */ > > > > + DEFINE_SINGLE_BUF_MAP(map, dqp->q_blkno, > > > > + mp->m_quotainfo->qi_dqchunklen); > > > > + bp = _xfs_buf_find(mp->m_ddev_targp, &map, 1, 0, NULL); > > > > + if (!bp) { > > > > + error = -EINVAL; > > > > + goto out_unlock; > > > > + } > > > > + > > > > + /* delwri_pushbuf drops the buf lock */ > > > > + xfs_buf_delwri_pushbuf(bp, buffer_list); > > > > + xfs_buf_rele(bp); > > > > + > > > > + error = -EAGAIN; > > > > + goto out_unlock; > > > > + } > > > > + > > > > error = xfs_qm_dqflush(dqp, &bp); > > > > if (error) > > > > goto out_unlock; > > > > diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h > > > > index fb7555e..c8d28a9 100644 > > > > --- a/fs/xfs/xfs_trace.h > > > > +++ b/fs/xfs/xfs_trace.h > > > > @@ -365,6 +365,7 @@ DEFINE_BUF_EVENT(xfs_buf_iowait_done); > > > > DEFINE_BUF_EVENT(xfs_buf_delwri_queue); > > > > DEFINE_BUF_EVENT(xfs_buf_delwri_queued); > > > > DEFINE_BUF_EVENT(xfs_buf_delwri_split); > > > > +DEFINE_BUF_EVENT(xfs_buf_delwri_pushbuf); > > > > DEFINE_BUF_EVENT(xfs_buf_get_uncached); > > > > DEFINE_BUF_EVENT(xfs_buf_item_relse); > > > > DEFINE_BUF_EVENT(xfs_buf_item_iodone_async); > > > > -- > > > > 2.7.4 > > > > > > > > -- > > > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in > > > > the body of a message to majordomo@vger.kernel.org > > > > More majordomo info at http://vger.kernel.org/majordomo-info.html > > > -- > > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in > > > the body of a message to majordomo@vger.kernel.org > > > More majordomo info at http://vger.kernel.org/majordomo-info.html > > -- > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in > > the body of a message to majordomo@vger.kernel.org > > More majordomo info at http://vger.kernel.org/majordomo-info.html > -- > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-xfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
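As a reference point for the discussion that follows, here is a minimal annotated sketch of the flush-lock branch the patch adds to xfs_qm_flush_one(), with the locking rationale spelled out in comments. It restates the hunk quoted above rather than adding anything new; the handling of the pushbuf return value is a guess at the fix Brian mentions later in the thread, so treat it as an interpretation, not the literal patch.

	/*
	 * Reclaim may already hold the dquot flush lock: it flushed the
	 * dquot but could not queue the backing buffer because quotacheck
	 * already holds that buffer on buffer_list, so the flush unlock is
	 * deferred to I/O completion of a buffer quotacheck won't submit
	 * until the very end.
	 */
	if (!xfs_dqflock_nowait(dqp)) {
		/* buf is pinned in-core by the quotacheck delwri list */
		DEFINE_SINGLE_BUF_MAP(map, dqp->q_blkno,
				      mp->m_quotainfo->qi_dqchunklen);
		bp = _xfs_buf_find(mp->m_ddev_targp, &map, 1, 0, NULL);
		if (!bp) {
			error = -EINVAL;
			goto out_unlock;
		}

		/*
		 * Submit just this buffer for I/O without losing its place
		 * on buffer_list.  I/O completion runs the dquot callback,
		 * which releases the flush lock that reclaim handed to the
		 * buffer.  pushbuf drops the buffer lock for us.
		 */
		error = xfs_buf_delwri_pushbuf(bp, buffer_list);
		xfs_buf_rele(bp);

		/*
		 * The dquot is still dirty; ask the dquot walk to retry so
		 * the final flush can happen with the flush lock free.
		 */
		if (!error)
			error = -EAGAIN;
		goto out_unlock;
	}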
On Mon, Apr 10, 2017 at 10:15:21AM -0400, Brian Foster wrote: > On Mon, Apr 10, 2017 at 03:00:01PM +1000, Dave Chinner wrote: > > On Fri, Feb 24, 2017 at 02:53:20PM -0500, Brian Foster wrote: > > > + /* > > > + * Move the buffer to an empty list and submit. Pass the original list > > > + * as the wait list so delwri submission moves the buf back to it before > > > + * it is submitted (and thus before it is unlocked). This means the > > > + * buffer cannot be placed on another list while we wait for it. > > > + */ > > > + list_move(&bp->b_list, &submit_list); > > > + xfs_buf_unlock(bp); > > > > .... this is safe/racy as we may have just moved it off the delwri > > queue without changing state, reference counts, etc? > > > > Racy with..? Whatever might possibly manipulate the buffer that thinks it's on a specific delwri list. We've unlocked it. As I said, I haven't spend hours thinking this all through, I'm just throwing out the questions I had.... > The buffer is effectively still on the delwri queue here. > We've just replaced the caller queue with a local queue to reuse the > existing submission infrastructure. IOW, we've just changed the queue > that the buffer is on. Yes, in theory, but it's not something we every considered would be done when designing the private delwri queue infrastructure... > > > + > > > + xfs_buf_delwri_submit_buffers(&submit_list, buffer_list); > > > > .... using a caller supplied delwri buffer list as the buffer IO > > wait list destination is making big assumptions about the internal > > use of the wait list? e.g that xfs_buf_delwri_submit_buffers() does > > not initialise the list_head before use... > > > > This is the scope at which the wait list is defined/initialized. It is now - passing in a list that is not empty effectively prevents us from implementing anything differently in future.... > E.g., > _delwri_submit() initializes its local wait list and passes it down for > _delwri_submit_buffers() to use. This basically ensures that the buffer > is placed back on the original delwri queue because the latter moves the > buffer over under lock, before it is submitted for I/O. But xfs_buf_delwri_submit_buffers() adds a reference count to the buffer if it's placed on a wait list. but we just moved it back to the original buffer list. Now I have to look for why that works, and determine if that was intended behaviour or not. i.e. _delwri_submit_buffers is being used for more than just moving the buffer back to the buffer list - there's reference counting shenanigans going on here, too.... > We could probably use a local wait_list just the same if that is > preferred, and explicitly add the buffer back from the wait_list to the > original delwri queue. FWIW, it's not clear to me whether you're asking > about the use of the wait_list here or indicating preference. All I'm pointing out is that I don't think the patch is a good solution, not what the solution should be. I haven't spent the hours needed to think out a proper solution. > > .... we should be doing IO submission while holding other things on > > the delwri list and unknown caller locks? > > > > I'm not following... this is effectively a move/submit of a delwri queue > item to another delwri queue. If the delwri queue is by design a local > data structure then only one thread should be queue processing at a > time. Why would this particular action be problematic? No, no it's not just a "list move". 
xfs_buf_delwri_submit_buffers() calls xfs_buf_submit() to start IO on the buffers passed in on the submit list. We're adding a new IO submission path here.... > Technically, we could rework this into something like a > delwri_queue_move()/delwri_submit() sequence, but the problem with that > is we lose the queue-populated state of the buffer for a period of time > and thus somebody else (i.e., reclaim) can jump in and queue it out from > underneath the caller. The purpose of this mechanism is to preserve > original delwri queue ownership of the buffer. Sure, I understand that, but I still don't think this is the right thing to be doing. > > .... we have all the buffer reference counts we need to make this > > work correctly? > > > > AFAICT, we follow the existing delwri queue rules with regard to locking > and reference counting. A buffer is queued under the buf lock and the > queue takes a reference on the buffer. Queue submission locks the > buffer, acquires a wait_list reference and moves to the wait list if one > is specified and submits the buffer. Buffer I/O completion drops the > original queue reference. If the wait_list was specified, we cycle the > buffer lock (to wait for I/O completion), remove from the wait_list and > drop the wait_list reference. > > The pushbuf() call moves the buffer to the submit_list under the buffer > lock, which is effectively another local delwri queue. This takes > ownership of the original delwri queue reference. Queue submission > occurs as above: the buffer is moved to the wait_list, an associated > wait_list reference is acquired and the buffer is submitted. I/O > completion drops the original delwri queue reference as before, so we > are in the same state as delwri_submit() up through I/O completion. > > We again cycle the buf lock to wait on I/O completion and process an > error. Rather than remove from the wait_list and drop the reference, > however, pushbuf() has the buffer back on the original delwri queue with > an associated reference. The only explicit state change here is the > DELWRI_Q flag, which is reset. > > So am I missing something here with regard to proper reference > counting..? If so, please elaborate and I'll try to dig into it. IOWs, your abusing the waitlist reference count added deep inside xfs_buf_delwri_submit_buffers() to handle the fact that the IO submission removes the delwri queue reference count that the buffer had on entry to the function, but still needs on exit to the function. I guarantee we'll break this in future - it's too subtle and fragile for core infrastructure. Also, there's no comments to explain any of these games being played, which makes it even less maintainable... > > > + /* > > > + * Lock the buffer to wait for I/O completion. It's already held on the > > > + * original list, so all we have to do is reset the delwri queue flag > > > + * that was cleared by delwri submission. > > > + */ > > > + xfs_buf_lock(bp); > > > + error = bp->b_error; > > > + bp->b_flags |= _XBF_DELWRI_Q; > > > + xfs_buf_unlock(bp); > > > > .... this is racy w.r.t. the buffer going back onto the > > buffer list without holding the buffer lock, or that the > > _XBF_DELWRI_Q setting/clearing is not atomic w.r.t. the delwri queue > > manipulations (i.e. can now be on the delwri list but not have > > _XBF_DELWRI_Q set)? > > > > The buffer has been put back on the original delwri queue (by > _delwri_submit_buffers()) before the lock was dropped. 
Using the delwri > queue as the wait_list does mean that the flag might not be set until > the delwri queue processing is done and we have returned to the caller. > If another delwri queue occurs, the flag may be reset but the buffer > cannot be added to another queue, which is what we want (as noted > above). Yes, I know this is what happens. My point is that the internal _XBF_DELWRI_Q no longer indicates that a buffer is on a delwri queue (i.e. that bp->b_list is valid). IOWs, if we want to check a buffer is on a delwri queue, we can no longer just check the flag, we have to check both the flag and list_empty(bp->b_list) because they are no longer atomically updated. To me this is a design rule violation, regardless of whether the code works or not... > (I'm not sure why delwri_queue() returns true in that case, but that's a > separate issue.) > > > .... the error is left on the buffer so it gets tripped over when > > it is next accessed? > > > > Same as delwri_submit(). IIUC, errors are left on buffers by design and > cleared on a subsequemt I/O submission. That doesn't mean it's the right thing to do here. Think about how the error shold be handled in this case, don't copy something from a bulk IO submission interface.... > > > + /* > > > + * The only way the dquot is already flush locked by the time quotacheck > > > + * gets here is if reclaim flushed it before the dqadjust walk dirtied > > > + * it for the final time. Quotacheck collects all dquot bufs in the > > > + * local delwri queue before dquots are dirtied, so reclaim can't have > > > + * possibly queued it for I/O. The only way out is to push the buffer to > > > + * cycle the flush lock. > > > + */ > > > + if (!xfs_dqflock_nowait(dqp)) { > > > + /* buf is pinned in-core by delwri list */ > > > + DEFINE_SINGLE_BUF_MAP(map, dqp->q_blkno, > > > + mp->m_quotainfo->qi_dqchunklen); > > > + bp = _xfs_buf_find(mp->m_ddev_targp, &map, 1, 0, NULL); > > > + if (!bp) { > > > + error = -EINVAL; > > > + goto out_unlock; > > > + } > > > + > > > + /* delwri_pushbuf drops the buf lock */ > > > + xfs_buf_delwri_pushbuf(bp, buffer_list); > > > > Ummm - you threw away the returned error.... > > > > The I/O is going to be retried, but I'll fix it to just return from > here. > > > > + xfs_buf_rele(bp); > > > > And despite the comment, I think this is simply wrong. We try really > > hard to maintain balanced locking, and as such xfs_buf_relse() is > > what goes along with _xfs_buf_find(). i.e. we are returned a locked > > buffer with a reference by xfs_buf_find(), but xfs_buf_rele() only > > drops the reference. > > > > I'm not quite following if you are saying the locking is unbalanced or > it simply looks that way. If the former, can you please elaborate? It's not obviously correct, which to me is a bug. If I read this code, you're forcing me to go read more code to work out why it is correct because it doesn't follow the expected patterns. And then the reader disappears down a hole for hours trying to work out how this code actually works. > Otherwise, we can probably have pushbuf() return with the buffer lock > held so the caller remains responsible for the lock and reference it > acquired. > > > So if I look at this code in isolation, it looks like it leaks a > > buffer lock, and now I have to go read other code to understand why > > it doesn't and I'm left to wonder why it was implemented this > > way.... > > > > Thanks for taking the time to send feedback. 
FWIW, I take most of these > questions as "things one would have to verify in order to review this > patch if one were not on vacation" :P as opposed to pointing things out > as explicitly broken. Keep in mind that while the code /might work/, that doesn't mean it can be merged. The more I look at the patch and think about it, the more strongly I feel that we need to "go back to the drawing board and try again". So, rather than just being a negative prick that says no and doesn't provide any real help, how about considering this line of thinking.... We hold the buffer locked, the dquot is locked and the dquot is flush locked already, right? We know the IO has not been submitted because we have the buffer lock and the dquot is still flush locked, which means /some/ dirty dquot data is already in the buffer. So why can't we just modify the dquot in the buffer? We already hold all the locks needed to guarantee exclusive access to the dquot and buffer, and they are all we hold when we do the initial flush to the buffer. So why do we need to do IO to unlock the dquot flush lock when we could just rewrite before we submit the buffer for IO? Indeed, let's go look at inodes for an example of this, xfs_ifree_cluster() to be precise: /* * We obtain and lock the backing buffer first in the process * here, as we have to ensure that any dirty inode that we * can't get the flush lock on is attached to the buffer. * If we scan the in-memory inodes first, then buffer IO can * complete before we get a lock on it, and hence we may fail * to mark all the active inodes on the buffer stale. */ ..... /* * Walk the inodes already attached to the buffer and mark them * stale. These will all have the flush locks held, so an * in-memory inode walk can't lock them. By marking them all * stale first, we will not attempt to lock them in the loop * below as the XFS_ISTALE flag will be set. */ ..... /* * For each inode in memory attempt to add it to the inode * buffer and set it up for being staled on buffer IO * completion. This is safe as we've locked out tail pushing * and flushing by locking the buffer. * * We have already marked every inode that was part of a * transaction stale above, which means there is no point in * even trying to lock them. */ IOWs, what we have here is precendence for modifying the flush locked objects attached to a buffer after first locking the buffer. dquots have the same transaction/flush model as inodes, so I'm pretty sure this will lead to the simplest, cleanest way to avoid this deadlock.... Cheers, Dave.
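To make the alternative Dave sketches above concrete, the following is a rough, untested illustration of what "just modify the dquot in the buffer" might look like in xfs_qm_flush_one(). It assumes xfs_qm_dqflush() is refactored to accept the already-locked backing buffer; the xfs_qm_dqflush_buf() name is invented for this sketch and exists neither in the posted patch nor in the tree.

	if (!xfs_dqflock_nowait(dqp)) {
		/*
		 * Reclaim flushed this dquot and handed the flush lock to
		 * the buffer, which quotacheck still holds on its delwri
		 * queue.  Lock the buffer: with the buffer lock, the dquot
		 * lock and the flush lock all accounted for, no I/O can be
		 * in flight and nothing else can touch the dquot.
		 */
		DEFINE_SINGLE_BUF_MAP(map, dqp->q_blkno,
				      mp->m_quotainfo->qi_dqchunklen);
		bp = _xfs_buf_find(mp->m_ddev_targp, &map, 1, 0, NULL);
		if (!bp) {
			error = -EINVAL;
			goto out_unlock;
		}

		/*
		 * Rewrite the on-disk dquot in the already-queued buffer
		 * with the recalculated counters instead of forcing I/O to
		 * cycle the flush lock.  The flush lock stays with the
		 * buffer and is released as usual when quotacheck finally
		 * submits its delwri queue.
		 */
		error = xfs_qm_dqflush_buf(dqp, bp);	/* hypothetical */
		xfs_buf_relse(bp);	/* unlock + release, pairs with _find */
		goto out_unlock;
	}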
On Mon, Apr 10, 2017 at 03:00:01PM +1000, Dave Chinner wrote: > On Fri, Feb 24, 2017 at 02:53:20PM -0500, Brian Foster wrote: > > Reclaim during quotacheck can lead to deadlocks on the dquot flush > > lock: > > > > - Quotacheck populates a local delwri queue with the physical dquot > > buffers. > > - Quotacheck performs the xfs_qm_dqusage_adjust() bulkstat and > > dirties all of the dquots. > > - Reclaim kicks in and attempts to flush a dquot whose buffer is > > already queud on the quotacheck queue. The flush succeeds but > > queueing to the reclaim delwri queue fails as the backing buffer is > > already queued. The flush unlock is now deferred to I/O completion > > of the buffer from the quotacheck queue. > > - The dqadjust bulkstat continues and dirties the recently flushed > > dquot once again. > > - Quotacheck proceeds to the xfs_qm_flush_one() walk which requires > > the flush lock to update the backing buffers with the in-core > > recalculated values. It deadlocks on the redirtied dquot as the > > flush lock was already acquired by reclaim, but the buffer resides > > on the local delwri queue which isn't submitted until the end of > > quotacheck. > > > > This is reproduced by running quotacheck on a filesystem with a > > couple million inodes in low memory (512MB-1GB) situations. This is > > a regression as of commit 43ff2122e6 ("xfs: on-stack delayed write > > buffer lists"), which removed a trylock and buffer I/O submission > > from the quotacheck dquot flush sequence. > > > > Quotacheck first resets and collects the physical dquot buffers in a > > delwri queue. Then, it traverses the filesystem inodes via bulkstat, > > updates the in-core dquots, flushes the corrected dquots to the > > backing buffers and finally submits the delwri queue for I/O. Since > > the backing buffers are queued across the entire quotacheck > > operation, dquot reclaim cannot possibly complete a dquot flush > > before quotacheck completes. > > > > Therefore, quotacheck must submit the buffer for I/O in order to > > cycle the flush lock and flush the dirty in-core dquot to the > > buffer. Add a delwri queue buffer push mechanism to submit an > > individual buffer for I/O without losing the delwri queue status and > > use it from quotacheck to avoid the deadlock. This restores > > quotacheck behavior to as before the regression was introduced. > > While it may fix the problem, the solution gives me the heebee > jeebees. I'm on holidays, so I haven't bothered to spend the hours > necessary to answer these questions, but to give you an idea, this > was what I thought as I read the patch. i.e. I have concerns about > whether.... > Thanks for looking at this. FWIW, this is based on a preexisting solution prior to the commit referenced above (obviously this doesn't mean the code is correct or that this is the best solution). 
> > Reported-by: Martin Svec <martin.svec@zoner.cz> > > Signed-off-by: Brian Foster <bfoster@redhat.com> > > --- > > fs/xfs/xfs_buf.c | 37 +++++++++++++++++++++++++++++++++++++ > > fs/xfs/xfs_buf.h | 1 + > > fs/xfs/xfs_qm.c | 28 +++++++++++++++++++++++++++- > > fs/xfs/xfs_trace.h | 1 + > > 4 files changed, 66 insertions(+), 1 deletion(-) > > > > diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c > > index e566510..e97cf56 100644 > > --- a/fs/xfs/xfs_buf.c > > +++ b/fs/xfs/xfs_buf.c > > @@ -2011,6 +2011,43 @@ xfs_buf_delwri_submit( > > return error; > > } > > > > +int > > +xfs_buf_delwri_pushbuf( > > + struct xfs_buf *bp, > > + struct list_head *buffer_list) > > +{ > > + LIST_HEAD (submit_list); > > + int error; > > + > > + ASSERT(xfs_buf_islocked(bp)); > > + ASSERT(bp->b_flags & _XBF_DELWRI_Q); > > + > > + trace_xfs_buf_delwri_pushbuf(bp, _RET_IP_); > > + > > + /* > > + * Move the buffer to an empty list and submit. Pass the original list > > + * as the wait list so delwri submission moves the buf back to it before > > + * it is submitted (and thus before it is unlocked). This means the > > + * buffer cannot be placed on another list while we wait for it. > > + */ > > + list_move(&bp->b_list, &submit_list); > > + xfs_buf_unlock(bp); > > .... this is safe/racy as we may have just moved it off the delwri > queue without changing state, reference counts, etc? > Racy with..? The buffer is effectively still on the delwri queue here. We've just replaced the caller queue with a local queue to reuse the existing submission infrastructure. IOW, we've just changed the queue that the buffer is on. > > + > > + xfs_buf_delwri_submit_buffers(&submit_list, buffer_list); > > .... using a caller supplied delwri buffer list as the buffer IO > wait list destination is making big assumptions about the internal > use of the wait list? e.g that xfs_buf_delwri_submit_buffers() does > not initialise the list_head before use... > This is the scope at which the wait list is defined/initialized. E.g., _delwri_submit() initializes its local wait list and passes it down for _delwri_submit_buffers() to use. This basically ensures that the buffer is placed back on the original delwri queue because the latter moves the buffer over under lock, before it is submitted for I/O. We could probably use a local wait_list just the same if that is preferred, and explicitly add the buffer back from the wait_list to the original delwri queue. FWIW, it's not clear to me whether you're asking about the use of the wait_list here or indicating preference. > .... we should be doing IO submission while holding other things on > the delwri list and unknown caller locks? > I'm not following... this is effectively a move/submit of a delwri queue item to another delwri queue. If the delwri queue is by design a local data structure then only one thread should be queue processing at a time. Why would this particular action be problematic? Technically, we could rework this into something like a delwri_queue_move()/delwri_submit() sequence, but the problem with that is we lose the queue-populated state of the buffer for a period of time and thus somebody else (i.e., reclaim) can jump in and queue it out from underneath the caller. The purpose of this mechanism is to preserve original delwri queue ownership of the buffer. > .... we have all the buffer reference counts we need to make this > work correctly? > AFAICT, we follow the existing delwri queue rules with regard to locking and reference counting. 
A buffer is queued under the buf lock and the queue takes a reference on the buffer. Queue submission locks the buffer, acquires a wait_list reference and moves to the wait list if one is specified and submits the buffer. Buffer I/O completion drops the original queue reference. If the wait_list was specified, we cycle the buffer lock (to wait for I/O completion), remove from the wait_list and drop the wait_list reference. The pushbuf() call moves the buffer to the submit_list under the buffer lock, which is effectively another local delwri queue. This takes ownership of the original delwri queue reference. Queue submission occurs as above: the buffer is moved to the wait_list, an associated wait_list reference is acquired and the buffer is submitted. I/O completion drops the original delwri queue reference as before, so we are in the same state as delwri_submit() up through I/O completion. We again cycle the buf lock to wait on I/O completion and process an error. Rather than remove from the wait_list and drop the reference, however, pushbuf() has the buffer back on the original delwri queue with an associated reference. The only explicit state change here is the DELWRI_Q flag, which is reset. So am I missing something here with regard to proper reference counting..? If so, please elaborate and I'll try to dig into it. > > + /* > > + * Lock the buffer to wait for I/O completion. It's already held on the > > + * original list, so all we have to do is reset the delwri queue flag > > + * that was cleared by delwri submission. > > + */ > > + xfs_buf_lock(bp); > > + error = bp->b_error; > > + bp->b_flags |= _XBF_DELWRI_Q; > > + xfs_buf_unlock(bp); > > .... this is racy w.r.t. the buffer going back onto the > buffer list without holding the buffer lock, or that the > _XBF_DELWRI_Q setting/clearing is not atomic w.r.t. the delwri queue > manipulations (i.e. can now be on the delwri list but not have > _XBF_DELWRI_Q set)? > The buffer has been put back on the original delwri queue (by _delwri_submit_buffers()) before the lock was dropped. Using the delwri queue as the wait_list does mean that the flag might not be set until the delwri queue processing is done and we have returned to the caller. If another delwri queue occurs, the flag may be reset but the buffer cannot be added to another queue, which is what we want (as noted above). (I'm not sure why delwri_queue() would return true in that case, but that's a separate issue.) > .... the error is left on the buffer so it gets tripped over when > it is next accessed? > Same as delwri_submit(). IIUC, errors are left on buffers by design and cleared on a subsequemt I/O submission. > .... that the buffer locking is unbalanced for some undocumented > reason? 
> > > + return error; > > +} > > + > > int __init > > xfs_buf_init(void) > > { > > diff --git a/fs/xfs/xfs_buf.h b/fs/xfs/xfs_buf.h > > index 8a9d3a9..cd74216 100644 > > --- a/fs/xfs/xfs_buf.h > > +++ b/fs/xfs/xfs_buf.h > > @@ -334,6 +334,7 @@ extern void xfs_buf_stale(struct xfs_buf *bp); > > extern bool xfs_buf_delwri_queue(struct xfs_buf *, struct list_head *); > > extern int xfs_buf_delwri_submit(struct list_head *); > > extern int xfs_buf_delwri_submit_nowait(struct list_head *); > > +extern int xfs_buf_delwri_pushbuf(struct xfs_buf *, struct list_head *); > > > > /* Buffer Daemon Setup Routines */ > > extern int xfs_buf_init(void); > > diff --git a/fs/xfs/xfs_qm.c b/fs/xfs/xfs_qm.c > > index 4ff993c..3815ed3 100644 > > --- a/fs/xfs/xfs_qm.c > > +++ b/fs/xfs/xfs_qm.c > > @@ -1247,6 +1247,7 @@ xfs_qm_flush_one( > > struct xfs_dquot *dqp, > > void *data) > > { > > + struct xfs_mount *mp = dqp->q_mount; > > struct list_head *buffer_list = data; > > struct xfs_buf *bp = NULL; > > int error = 0; > > @@ -1257,7 +1258,32 @@ xfs_qm_flush_one( > > if (!XFS_DQ_IS_DIRTY(dqp)) > > goto out_unlock; > > > > - xfs_dqflock(dqp); > > + /* > > + * The only way the dquot is already flush locked by the time quotacheck > > + * gets here is if reclaim flushed it before the dqadjust walk dirtied > > + * it for the final time. Quotacheck collects all dquot bufs in the > > + * local delwri queue before dquots are dirtied, so reclaim can't have > > + * possibly queued it for I/O. The only way out is to push the buffer to > > + * cycle the flush lock. > > + */ > > + if (!xfs_dqflock_nowait(dqp)) { > > + /* buf is pinned in-core by delwri list */ > > + DEFINE_SINGLE_BUF_MAP(map, dqp->q_blkno, > > + mp->m_quotainfo->qi_dqchunklen); > > + bp = _xfs_buf_find(mp->m_ddev_targp, &map, 1, 0, NULL); > > + if (!bp) { > > + error = -EINVAL; > > + goto out_unlock; > > + } > > + > > + /* delwri_pushbuf drops the buf lock */ > > + xfs_buf_delwri_pushbuf(bp, buffer_list); > > Ummm - you threw away the returned error.... > The I/O is going to be retried, but I'll fix it to just return from here. > > + xfs_buf_rele(bp); > > And despite the comment, I think this is simply wrong. We try really > hard to maintain balanced locking, and as such xfs_buf_relse() is > what goes along with _xfs_buf_find(). i.e. we are returned a locked > buffer with a reference by xfs_buf_find(), but xfs_buf_rele() only > drops the reference. > I'm not quite following if you are saying the locking is unbalanced or it simply looks that way. If the former, can you please elaborate? Otherwise, we can probably have pushbuf() return with the buffer lock held so the caller remains responsible for the lock and reference it acquired. > So if I look at this code in isolation, it looks like it leaks a > buffer lock, and now I have to go read other code to understand why > it doesn't and I'm left to wonder why it was implemented this > way.... > Thanks for taking the time to send feedback. FWIW, I take most of these questions as "things one would have to verify in order to review this patch if one were not on vacation" :P as opposed to pointing things out as explicitly broken. As noted previously, please do point out anything I've missed to the contrary. Otherwise, I'll plan to send another version that includes the error and locking API fixups. Brian > Cheers, > > Dave. 
> -- > Dave Chinner > david@fromorbit.com
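Because much of the disagreement below comes down to who owns which lock and reference at each step, here is the xfs_buf_delwri_pushbuf() body from the patch again, with the ownership hand-offs Brian describes above written out as comments. The comments are an interpretation of that walk-through, not part of the posted patch.

int
xfs_buf_delwri_pushbuf(
	struct xfs_buf		*bp,
	struct list_head	*buffer_list)
{
	LIST_HEAD		(submit_list);
	int			error;

	ASSERT(xfs_buf_islocked(bp));
	ASSERT(bp->b_flags & _XBF_DELWRI_Q);

	trace_xfs_buf_delwri_pushbuf(bp, _RET_IP_);

	/*
	 * On entry the buffer is locked and carries the reference taken when
	 * it was first queued on buffer_list.  Moving it to a private list
	 * transfers ownership of that delwri queue reference to submit_list
	 * so the normal submission path can be reused.
	 */
	list_move(&bp->b_list, &submit_list);
	xfs_buf_unlock(bp);

	/*
	 * Submit with buffer_list as the wait list.  Under the buffer lock,
	 * submission moves the buffer back to buffer_list and takes a
	 * wait-list reference before the buffer is unlocked for I/O, so it
	 * cannot be queued anywhere else in the meantime.  I/O completion
	 * drops the original queue reference; the wait-list reference is
	 * what keeps the buffer on buffer_list for the caller.
	 */
	xfs_buf_delwri_submit_buffers(&submit_list, buffer_list);

	/*
	 * Cycle the buffer lock to wait for I/O completion, then restore the
	 * delwri queue flag that submission cleared.
	 */
	xfs_buf_lock(bp);
	error = bp->b_error;
	bp->b_flags |= _XBF_DELWRI_Q;
	xfs_buf_unlock(bp);

	return error;
}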
On Tue, Apr 11, 2017 at 09:55:25AM +1000, Dave Chinner wrote: > On Mon, Apr 10, 2017 at 10:15:21AM -0400, Brian Foster wrote: > > On Mon, Apr 10, 2017 at 03:00:01PM +1000, Dave Chinner wrote: > > > On Fri, Feb 24, 2017 at 02:53:20PM -0500, Brian Foster wrote: > > > > + /* > > > > + * Move the buffer to an empty list and submit. Pass the original list > > > > + * as the wait list so delwri submission moves the buf back to it before > > > > + * it is submitted (and thus before it is unlocked). This means the > > > > + * buffer cannot be placed on another list while we wait for it. > > > > + */ > > > > + list_move(&bp->b_list, &submit_list); > > > > + xfs_buf_unlock(bp); > > > > > > .... this is safe/racy as we may have just moved it off the delwri > > > queue without changing state, reference counts, etc? > > > > > > > Racy with..? > > Whatever might possibly manipulate the buffer that thinks it's on a > specific delwri list. We've unlocked it. > > As I said, I haven't spend hours thinking this all through, I'm just > throwing out the questions I had.... > Ok, I'm just trying to answer them. ;) Here, we unlock buffers after they are added to the original delwri queue as well. So the buffer is sitting on a local delwri queue before it is locked and it is sitting on a local delwri queue after it is unlocked. This is the expected state for an external lookup of the buffer to observe before delwri_submit_buffers() acquires the lock for I/O submission. > > The buffer is effectively still on the delwri queue here. > > We've just replaced the caller queue with a local queue to reuse the > > existing submission infrastructure. IOW, we've just changed the queue > > that the buffer is on. > > Yes, in theory, but it's not something we every considered would be > done when designing the private delwri queue infrastructure... > As mentioned before, this is a resurrection of an old fix, though it may pre-date the shift of the delwri queue mechanism to a local one. > > > > + > > > > + xfs_buf_delwri_submit_buffers(&submit_list, buffer_list); > > > > > > .... using a caller supplied delwri buffer list as the buffer IO > > > wait list destination is making big assumptions about the internal > > > use of the wait list? e.g that xfs_buf_delwri_submit_buffers() does > > > not initialise the list_head before use... > > > > > > > This is the scope at which the wait list is defined/initialized. > > It is now - passing in a list that is not empty effectively prevents > us from implementing anything differently in future.... > This is nothing more than an implementation detail. It simply optimizes away an additional list on the stack. It exists purely because it is safe to do with the current code and can be trivially factored away either now or if the code demands it in the future. > > E.g., > > _delwri_submit() initializes its local wait list and passes it down for > > _delwri_submit_buffers() to use. This basically ensures that the buffer > > is placed back on the original delwri queue because the latter moves the > > buffer over under lock, before it is submitted for I/O. > > But xfs_buf_delwri_submit_buffers() adds a reference count to the > buffer if it's placed on a wait list. but we just moved it back to > the original buffer list. Now I have to look for why that works, and > determine if that was intended behaviour or not. > > i.e. _delwri_submit_buffers is being used for more than just moving > the buffer back to the buffer list - there's reference counting > shenanigans going on here, too.... 
> > > We could probably use a local wait_list just the same if that is > > preferred, and explicitly add the buffer back from the wait_list to the > > original delwri queue. FWIW, it's not clear to me whether you're asking > > about the use of the wait_list here or indicating preference. > > All I'm pointing out is that I don't think the patch is a good > solution, not what the solution should be. I haven't spent the hours > needed to think out a proper solution. > > > > .... we should be doing IO submission while holding other things on > > > the delwri list and unknown caller locks? > > > > > > > I'm not following... this is effectively a move/submit of a delwri queue > > item to another delwri queue. If the delwri queue is by design a local > > data structure then only one thread should be queue processing at a > > time. Why would this particular action be problematic? > > No, no it's not just a "list move". xfs_buf_delwri_submit_buffers() > calls xfs_buf_submit() to start IO on the buffers passed in on the > submit list. We're adding a new IO submission path here.... > Just above, I said "move/submit," which clearly refers to pushbuf() as an aggregate of both actions. Regardless, it sounds to me that the concern here is "is it safe to submit this I/O while we still have things on a list?" What I'm trying to say is the notion of a private delwri queue mechanism implies that is safe from a delwri queue infrastructure point of view. Nothing prevents a particular context from using multiple lists. Analysis beyond that is context dependent, and I don't see anything in quotacheck that conflicts. > > Technically, we could rework this into something like a > > delwri_queue_move()/delwri_submit() sequence, but the problem with that > > is we lose the queue-populated state of the buffer for a period of time > > and thus somebody else (i.e., reclaim) can jump in and queue it out from > > underneath the caller. The purpose of this mechanism is to preserve > > original delwri queue ownership of the buffer. > > Sure, I understand that, but I still don't think this is the right > thing to be doing. > Fair enough.. > > > .... we have all the buffer reference counts we need to make this > > > work correctly? > > > > > > > AFAICT, we follow the existing delwri queue rules with regard to locking > > and reference counting. A buffer is queued under the buf lock and the > > queue takes a reference on the buffer. Queue submission locks the > > buffer, acquires a wait_list reference and moves to the wait list if one > > is specified and submits the buffer. Buffer I/O completion drops the > > original queue reference. If the wait_list was specified, we cycle the > > buffer lock (to wait for I/O completion), remove from the wait_list and > > drop the wait_list reference. > > > > The pushbuf() call moves the buffer to the submit_list under the buffer > > lock, which is effectively another local delwri queue. This takes > > ownership of the original delwri queue reference. Queue submission > > occurs as above: the buffer is moved to the wait_list, an associated > > wait_list reference is acquired and the buffer is submitted. I/O > > completion drops the original delwri queue reference as before, so we > > are in the same state as delwri_submit() up through I/O completion. > > > > We again cycle the buf lock to wait on I/O completion and process an > > error. 
Rather than remove from the wait_list and drop the reference, > > however, pushbuf() has the buffer back on the original delwri queue with > > an associated reference. The only explicit state change here is the > > DELWRI_Q flag, which is reset. > > > > So am I missing something here with regard to proper reference > > counting..? If so, please elaborate and I'll try to dig into it. > > IOWs, your abusing the waitlist reference count added deep inside > xfs_buf_delwri_submit_buffers() to handle the fact > that the IO submission removes the delwri queue reference count that > the buffer had on entry to the function, but still needs on exit to > the function. > > I guarantee we'll break this in future - it's too subtle and fragile > for core infrastructure. Also, there's no comments to explain any of > these games being played, which makes it even less maintainable... > Heh. Yeah, this is not clear from the code. We should have a comment here to walk through the reference count semantics. That aside... if we drop the wait_list optimization, as mentioned earlier, we'd basically do what delwri_submit() does with the exception of moving the buffer to a private delwri queue for submission and moving it back to the original delwri queue (from the wait_list) after acquiring the buffer lock (to wait on the I/O). We could even do the moves in the caller and the only change that would be required to the delwri queue infrastructure is effectively to export delwri_submit_buffers() (I'd probably suggest an xfs_buf_delwri_submit_waitlist() variant or some such instead, but that is beside the point). The broader point here is that this code isn't doing much fundamentally different from the existing delwri infrastructure, certainly not enough to qualify this as fragile and the existing mechanism as not, so I think that is a bit over the top. Another option to consider here is that this could all easily be refactored to open code the buffer submission from pushbuf(), which perhaps would localize the reference count bump to the pushbuf() function rather than reuse the delwri_submit_buffers() logic. > > > > > + /* > > > > + * Lock the buffer to wait for I/O completion. It's already held on the > > > > + * original list, so all we have to do is reset the delwri queue flag > > > > + * that was cleared by delwri submission. > > > > + */ > > > > + xfs_buf_lock(bp); > > > > + error = bp->b_error; > > > > + bp->b_flags |= _XBF_DELWRI_Q; > > > > + xfs_buf_unlock(bp); > > > > > > .... this is racy w.r.t. the buffer going back onto the > > > buffer list without holding the buffer lock, or that the > > > _XBF_DELWRI_Q setting/clearing is not atomic w.r.t. the delwri queue > > > manipulations (i.e. can now be on the delwri list but not have > > > _XBF_DELWRI_Q set)? > > > > > > > The buffer has been put back on the original delwri queue (by > > _delwri_submit_buffers()) before the lock was dropped. Using the delwri > > queue as the wait_list does mean that the flag might not be set until > > the delwri queue processing is done and we have returned to the caller. > > If another delwri queue occurs, the flag may be reset but the buffer > > cannot be added to another queue, which is what we want (as noted > > above). > > Yes, I know this is what happens. My point is that the internal > _XBF_DELWRI_Q no longer indicates that a buffer is on a delwri queue > (i.e. that bp->b_list is valid). 
IOWs, if we want to check a buffer > is on a delwri queue, we can no longer just check the flag, we have > to check both the flag and list_empty(bp->b_list) because they are > no longer atomically updated. > It is true that this code puts the buffer through a state where _XBF_DELWRI_Q cannot be used to verify the buffer is on a delwri queue. It is also true that today the absence of _XBF_DELWRI_Q cannot be used to verify that a buffer can actually be put onto a delwri queue. IOW, any code that truly cares about delwri queue state probably needs to check both as it is, particularly any code that actually wants to queue the buffer locally. > To me this is a design rule violation, regardless of whether the > code works or not... > > > (I'm not sure why delwri_queue() returns true in that case, but that's a > > separate issue.) > > > > > .... the error is left on the buffer so it gets tripped over when > > > it is next accessed? > > > > > > > Same as delwri_submit(). IIUC, errors are left on buffers by design and > > cleared on a subsequemt I/O submission. > > That doesn't mean it's the right thing to do here. Think about how > the error shold be handled in this case, don't copy something from a > bulk IO submission interface.... > I'm not following what you are suggesting here. > > > > + /* > > > > + * The only way the dquot is already flush locked by the time quotacheck > > > > + * gets here is if reclaim flushed it before the dqadjust walk dirtied > > > > + * it for the final time. Quotacheck collects all dquot bufs in the > > > > + * local delwri queue before dquots are dirtied, so reclaim can't have > > > > + * possibly queued it for I/O. The only way out is to push the buffer to > > > > + * cycle the flush lock. > > > > + */ > > > > + if (!xfs_dqflock_nowait(dqp)) { > > > > + /* buf is pinned in-core by delwri list */ > > > > + DEFINE_SINGLE_BUF_MAP(map, dqp->q_blkno, > > > > + mp->m_quotainfo->qi_dqchunklen); > > > > + bp = _xfs_buf_find(mp->m_ddev_targp, &map, 1, 0, NULL); > > > > + if (!bp) { > > > > + error = -EINVAL; > > > > + goto out_unlock; > > > > + } > > > > + > > > > + /* delwri_pushbuf drops the buf lock */ > > > > + xfs_buf_delwri_pushbuf(bp, buffer_list); > > > > > > Ummm - you threw away the returned error.... > > > > > > > The I/O is going to be retried, but I'll fix it to just return from > > here. > > > > > > + xfs_buf_rele(bp); > > > > > > And despite the comment, I think this is simply wrong. We try really > > > hard to maintain balanced locking, and as such xfs_buf_relse() is > > > what goes along with _xfs_buf_find(). i.e. we are returned a locked > > > buffer with a reference by xfs_buf_find(), but xfs_buf_rele() only > > > drops the reference. > > > > > > > I'm not quite following if you are saying the locking is unbalanced or > > it simply looks that way. If the former, can you please elaborate? > > It's not obviously correct, which to me is a bug. If I read this > code, you're forcing me to go read more code to work out why it is > correct because it doesn't follow the expected patterns. And then > the reader disappears down a hole for hours trying to work out how > this code actually works. > I do agree that maintaining the lock state through the pushbuf() call is probably more clear than what the patch does currently (e.g., I agree with your core point to justify adjusting how the locking is done here). 
That said, the buffer locking in xfs_buf_delwri_pushbuf() can be evaluated pretty quickly between reading through the former function and xfs_buf_delwri_submit[_buffers](), if one is familiar with the delwri queue infrastructure (which you are). So the idea that this requires hours of validation time because the locking pattern is quirky is also a little over the top. > > Otherwise, we can probably have pushbuf() return with the buffer lock > > held so the caller remains responsible for the lock and reference it > > acquired. > > > > > So if I look at this code in isolation, it looks like it leaks a > > > buffer lock, and now I have to go read other code to understand why > > > it doesn't and I'm left to wonder why it was implemented this > > > way.... > > > > > > > Thanks for taking the time to send feedback. FWIW, I take most of these > > questions as "things one would have to verify in order to review this > > patch if one were not on vacation" :P as opposed to pointing things out > > as explicitly broken. > > Keep in mind that while the code /might work/, that doesn't mean it > can be merged. The more I look at the patch and think about it, the > more strongly I feel that we need to "go back to the drawing board > and try again". > This is fair enough. "I don't like the approach" as a subjective argument is more concrete (IMO) than to artificially inflate some of the implementation details into architectural flaws. That at least saves me the time spent working through what those arguments mean, whether they legitimately apply to the changes made in the code and/or refactoring the code to try and deal with what, to me, appears to mostly boil down to confusion/complaints about code factoring (only to eventually determine that the core argument is subjective). > So, rather than just being a negative prick that says no and doesn't > provide any real help, how about considering this line of > thinking.... > That would certainly be more productive. This is not the only solution I've considered, but clearly it is one of only a couple or so I'm aware of so far that I consider any better than just leaving the code as is. In fact, I had pretty much given up on this before Darrick happened to pick up review of it. > We hold the buffer locked, the dquot is locked and the dquot is > flush locked already, right? We know the IO has not been submitted > because we have the buffer lock and the dquot is still flush locked, > which means /some/ dirty dquot data is already in the buffer. > Essentially. The buffer is queued and the dquot flush locked by reclaim. By the time we get to the problematic scenario, we acquire the dquot lock and can acquire the buffer lock in response to flush lock failure. The buffer cannot be under I/O because we hold it on the local delwri queue. > So why can't we just modify the dquot in the buffer? We already > hold all the locks needed to guarantee exclusive access to the dquot > and buffer, and they are all we hold when we do the initial flush to > the buffer. So why do we need to do IO to unlock the dquot flush > lock when we could just rewrite before we submit the buffer for IO? > Are you suggesting to essentially ignore the flush lock? I suppose we could get away with this in dq_flush_one() since it is only called from quotacheck, but we may have to kill off any assertions that expect the flush lock in xfs_dqflush(), and perhaps refactor that code to accept a locked buffer as a param. 
I don't see anything that would break off hand, but my first reaction is it sounds more hackish than this patch or the previous patch that just disabled reclaim during quotacheck. In fact, if we're going to hack around avoiding the flush lock, why not just hack up xfs_buf_submit() to allow submission of a buffer while it remains on a delwri queue? AFAICT, that would only require removing an assert and not clearing the flag on I/O completion. I'm not convinced I'd post something for either approach, but I'd have to think about it some more. > Indeed, let's go look at inodes for an example of this, > xfs_ifree_cluster() to be precise: > > /* > * We obtain and lock the backing buffer first in the process > * here, as we have to ensure that any dirty inode that we > * can't get the flush lock on is attached to the buffer. > * If we scan the in-memory inodes first, then buffer IO can > * complete before we get a lock on it, and hence we may fail > * to mark all the active inodes on the buffer stale. > */ > ..... > /* > * Walk the inodes already attached to the buffer and mark them > * stale. These will all have the flush locks held, so an > * in-memory inode walk can't lock them. By marking them all > * stale first, we will not attempt to lock them in the loop > * below as the XFS_ISTALE flag will be set. > */ > ..... > /* > * For each inode in memory attempt to add it to the inode > * buffer and set it up for being staled on buffer IO > * completion. This is safe as we've locked out tail pushing > * and flushing by locking the buffer. > * > * We have already marked every inode that was part of a > * transaction stale above, which means there is no point in > * even trying to lock them. > */ > > IOWs, what we have here is precendence for modifying the flush > locked objects attached to a buffer after first locking the buffer. > dquots have the same transaction/flush model as inodes, so I'm > pretty sure this will lead to the simplest, cleanest way to avoid > this deadlock.... > The simplest approach to avoid the deadlock is to disable dquot reclaim during quotacheck. ;) On a quick scan through the inode code, it seems the above refer to using the buffer lock to serialize invalidation of higher level objects based on the buffer. IOW, to ensure that we don't leave "valid" in-core inode objects referring to an inode metadata buffer that is about to be invalidated. So while we do set the stale state on presumably flush locked objects attached to the buffer here, this doesn't exactly establish precedent for modifying flush locked content (meaning the range of the buffer that covers the flush locked inode). But if precedent is of particular concern, the approach with most historical precedent is probably the one actually used in XFS to avoid this deadlock up until commit 43ff2122e6 (that is effectively implemented by this patch :/). Brian > > Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com > -- > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-xfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
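For what it's worth, here is a rough, untested sketch of the "local wait_list" variant Brian floats above, in which the buffer is explicitly moved back onto the caller's delwri queue after the wait rather than using buffer_list itself as the wait list. The names are those from the posted patch; the exact placement of the move back under the buffer lock is a guess.

int
xfs_buf_delwri_pushbuf(
	struct xfs_buf		*bp,
	struct list_head	*buffer_list)
{
	LIST_HEAD		(submit_list);
	LIST_HEAD		(wait_list);
	int			error;

	ASSERT(xfs_buf_islocked(bp));
	ASSERT(bp->b_flags & _XBF_DELWRI_Q);

	/* private queue for submission, as in the posted patch */
	list_move(&bp->b_list, &submit_list);
	xfs_buf_unlock(bp);

	/* wait on a local list rather than the caller's delwri queue */
	xfs_buf_delwri_submit_buffers(&submit_list, &wait_list);

	/*
	 * Cycle the buffer lock to wait for I/O, then move the buffer from
	 * the local wait list back onto the caller's delwri queue and
	 * restore the queue flag.  The wait-list reference taken during
	 * submission becomes the new delwri queue reference, replacing the
	 * one dropped at I/O completion.
	 */
	xfs_buf_lock(bp);
	error = bp->b_error;
	list_move_tail(&bp->b_list, buffer_list);
	bp->b_flags |= _XBF_DELWRI_Q;
	xfs_buf_unlock(bp);

	return error;
}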
On Tue, Apr 11, 2017 at 10:53:03AM -0400, Brian Foster wrote: > On Tue, Apr 11, 2017 at 09:55:25AM +1000, Dave Chinner wrote: > > We hold the buffer locked, the dquot is locked and the dquot is > > flush locked already, right? We know the IO has not been submitted > > because we have the buffer lock and the dquot is still flush locked, > > which means /some/ dirty dquot data is already in the buffer. > > > > Essentially. The buffer is queued and the dquot flush locked by reclaim. > By the time we get to the problematic scenario, we acquire the dquot > lock and can acquire the buffer lock in response to flush lock failure. > The buffer cannot be under I/O because we hold it on the local delwri > queue. Seems to me that you've missed the fundamental reason that the buffer can't be under IO. It's not because it's on some local private queue, it can't be under IO because /we hold the buffer lock/. > > So why can't we just modify the dquot in the buffer? We already > > hold all the locks needed to guarantee exclusive access to the dquot > > and buffer, and they are all we hold when we do the initial flush to > > the buffer. So why do we need to do IO to unlock the dquot flush > > lock when we could just rewrite before we submit the buffer for IO? > > > > Are you suggesting to essentially ignore the flush lock? I suppose we > could get away with this in dq_flush_one() since it is only called from > quotacheck, but we may have to kill off any assertions that expect the > flush lock in xfs_dqflush(), and perhaps refactor that code to accept a > locked buffer as a param. No, I'm not suggesting we ignore the flush lock. What the flush "lock" does is prevent higher layer attempts to flush the dquot to the backing buffer whilst IO may be in progress. It's higher level function is to allows async log item state updates to be done correctly w.r.t. relogging and flushing, as we allow transactional modifications to inodes and dquots while they are under IO. But to flush an object, we have to own the object lock, the flush lock and the underlying buffer lock. To flush a dquot (or inode) we need to, in order: a) lock the dquot (inode) b) obtain the flush lock c) obtain the buffer lock d) flush the dquot to the buffer e) update log item state f) unlock the buffer IO is then effectively: a) lock buffer b) submit IO <IO completes> b) dquot callback unlocks flush lock a) buffer is unlocked IOWs, the flush "lock" does not work on it's own. It's really a completion notifier, not a lock, and it's actually implemented that way now. The name is historical because it used a semaphore when that was all Irix or Linux had to implement an async non-owner object release. That is, when we lock and flush the object, we effective hand the "flush lock" to the buffer, as it is the buffer that is responsible for "unlocking" it on IO completion. The "lock" function prevents the higher layers from trying to update the buffer until the buffer says "it's OK to modify me", but it does not mean we cannot modify the buffer contents if we hold all the correct context - the "modify the buffer" action is the inner most. IOWs, if we lock the buffer and the buffer owns the flush lock state for the object, then if we are able to lock the object, we have now have exclusive access to both the logical object, the log item associated with it, the backing buffer and the flush context lock. And with all that context held, it is safe to flush the logical object state to the backing buffer.... i.e. 
rather than locking from the top down to get to the state where we can flush an object, we can lock objects from the bottom up and end up in the same place... > I don't see anything that would break off hand, but my first reaction is > it sounds more hackish than this patch or the previous patch that just > disabled reclaim during quotacheck. I thought we'd done that discussion. i.e. we can't disable reclaim in quotacheck because that can set off the OOM killer... > In fact, if we're going to hack > around avoiding the flush lock, why not just hack up xfs_buf_submit() to > allow submission of a buffer while it remains on a delwri queue? AFAICT, > that would only require removing an assert and not clearing the flag on > I/O completion. I'm not convinced I'd post something for either > approach, but I'd have to think about it some more. > > > Indeed, let's go look at inodes for an example of this, > > xfs_ifree_cluster() to be precise: > > > > /* > > * We obtain and lock the backing buffer first in the process > > * here, as we have to ensure that any dirty inode that we > > * can't get the flush lock on is attached to the buffer. > > * If we scan the in-memory inodes first, then buffer IO can > > * complete before we get a lock on it, and hence we may fail > > * to mark all the active inodes on the buffer stale. > > */ > > ..... > > /* > > * Walk the inodes already attached to the buffer and mark them > > * stale. These will all have the flush locks held, so an > > * in-memory inode walk can't lock them. By marking them all > > * stale first, we will not attempt to lock them in the loop > > * below as the XFS_ISTALE flag will be set. > > */ > > ..... > > /* > > * For each inode in memory attempt to add it to the inode > > * buffer and set it up for being staled on buffer IO > > * completion. This is safe as we've locked out tail pushing > > * and flushing by locking the buffer. > > * > > * We have already marked every inode that was part of a > > * transaction stale above, which means there is no point in > > * even trying to lock them. > > */ > > > > IOWs, what we have here is precendence for modifying the flush > > locked objects attached to a buffer after first locking the buffer. > > dquots have the same transaction/flush model as inodes, so I'm > > pretty sure this will lead to the simplest, cleanest way to avoid > > this deadlock.... > > > > The simplest approach to avoid the deadlock is to disable dquot reclaim > during quotacheck. ;) > > On a quick scan through the inode code, it seems the above refer to > using the buffer lock to serialize invalidation of higher level objects > based on the buffer. IOW, to ensure that we don't leave "valid" in-core > inode objects referring to an inode metadata buffer that is about to be > invalidated. Yes, we lock from the bottom up until we have exclusive access to the objects. Invalidation is the /purpose/ of that code, not the architectural mechanism used to lock the objects into a state where we can then invalidate them. Cheers, Dave.
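Condensing the flush and I/O ordering Dave lays out above into one place, the sequence looks roughly like the sketch below. xfs_dqlock()/xfs_dqunlock() are the usual dquot lock helpers and are assumed here; they do not appear in the quoted patch.

	/* Flushing a dquot, top down: */
	xfs_dqlock(dqp);		/* a) lock the dquot		*/
	xfs_dqflock(dqp);		/* b) obtain the flush lock	*/
	xfs_buf_lock(bp);		/* c) obtain the buffer lock	*/
	/* d) copy the in-core dquot into the buffer's disk dquot	*/
	/* e) update the log item state, attach the iodone callback	*/
	xfs_buf_unlock(bp);		/* f) unlock the buffer		*/
	xfs_dqunlock(dqp);

	/* I/O, later: */
	xfs_buf_lock(bp);
	xfs_buf_submit(bp);
	/*
	 * On completion the dquot callback releases the flush "lock" that
	 * was handed to the buffer during the flush, and then the buffer
	 * lock itself is released.  The bottom-up approach, as in
	 * xfs_ifree_cluster(), instead takes the buffer lock first and the
	 * object lock second before touching the flush-locked contents.
	 */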
On Tue, Apr 18, 2017 at 12:35:48PM +1000, Dave Chinner wrote: > On Tue, Apr 11, 2017 at 10:53:03AM -0400, Brian Foster wrote: > > On Tue, Apr 11, 2017 at 09:55:25AM +1000, Dave Chinner wrote: > > > We hold the buffer locked, the dquot is locked and the dquot is > > > flush locked already, right? We know the IO has not been submitted > > > because we have the buffer lock and the dquot is still flush locked, > > > which means /some/ dirty dquot data is already in the buffer. > > > > > > > Essentially. The buffer is queued and the dquot flush locked by reclaim. > > By the time we get to the problematic scenario, we acquire the dquot > > lock and can acquire the buffer lock in response to flush lock failure. > > The buffer cannot be under I/O because we hold it on the local delwri > > queue. > > Seems to me that you've missed the fundamental reason that the > buffer can't be under IO. It's not because it's on some local > private queue, it can't be under IO because /we hold the buffer > lock/. > Yes, I'm aware that the buffer lock protects I/O submission. Poor wording above, perhaps. > > > So why can't we just modify the dquot in the buffer? We already > > > hold all the locks needed to guarantee exclusive access to the dquot > > > and buffer, and they are all we hold when we do the initial flush to > > > the buffer. So why do we need to do IO to unlock the dquot flush > > > lock when we could just rewrite before we submit the buffer for IO? > > > > > > > Are you suggesting to essentially ignore the flush lock? I suppose we > > could get away with this in dq_flush_one() since it is only called from > > quotacheck, but we may have to kill off any assertions that expect the > > flush lock in xfs_dqflush(), and perhaps refactor that code to accept a > > locked buffer as a param. > > No, I'm not suggesting we ignore the flush lock. What the flush > "lock" does is prevent higher layer attempts to flush the dquot to > the backing buffer whilst IO may be in progress. It's higher level > function is to allows async log item state updates to be done > correctly w.r.t. relogging and flushing, as we allow transactional > modifications to inodes and dquots while they are under IO. > Ok. What I was trying to recall previously was some discussion around the flush lock retry problem where it was noted that we can't unlock the flush lock until the currently flushed state makes it to disk[1]. Is the conceptual difference here that we are not proposing to unlock the flush lock, but rather co-opt it to perform another flush? [1] http://www.spinics.net/lists/linux-xfs/msg01248.html > But to flush an object, we have to own the object lock, the flush > lock and the underlying buffer lock. To flush a dquot (or inode) we > need to, in order: > > a) lock the dquot (inode) > b) obtain the flush lock > c) obtain the buffer lock > d) flush the dquot to the buffer > e) update log item state > f) unlock the buffer > > IO is then effectively: > > a) lock buffer > b) submit IO > <IO completes> > b) dquot callback unlocks flush lock > a) buffer is unlocked > > IOWs, the flush "lock" does not work on it's own. It's really a > completion notifier, not a lock, and it's actually implemented that > way now. The name is historical because it used a semaphore when > that was all Irix or Linux had to implement an async non-owner > object release. That is, when we lock and flush the object, we > effective hand the "flush lock" to the buffer, as it is the buffer > that is responsible for "unlocking" it on IO completion. 
> > The "lock" function prevents the higher layers from trying to update > the buffer until the buffer says "it's OK to modify me", but it does > not mean we cannot modify the buffer contents if we hold all the > correct context - the "modify the buffer" action is the inner most. > > IOWs, if we lock the buffer and the buffer owns the flush lock state > for the object, then if we are able to lock the object, we have now > have exclusive access to both the logical object, the log item > associated with it, the backing buffer and the flush context lock. > And with all that context held, it is safe to flush the logical > object state to the backing buffer.... > > i.e. rather than locking from the top down to get to the state where > we can flush an object, we can lock objects from the bottom up and > end up in the same place... > I didn't claim the locking or overall proposal wouldn't work, so this all makes sense to me from a serialization perspective. I think the implementation would be more hackish than the current and previous proposals, however. That is a subjective argument, of course. To be fair, the current/previous proposals are also hackish, but what makes this approach so much better that it is worth starting over on this problem yet again? IOWs, if the argument here is more substantive than "look, we can do this differently," I'm not seeing it. In the meantime, this patch has been reviewed, tested against quotacheck under various low memory conditions and verified by users in the field who actually reproduce the problem. It also has the benefit of historical precedent in that it is effectively a partial revert of the patch that introduced the regression in the first place. > > I don't see anything that would break off hand, but my first reaction is > > it sounds more hackish than this patch or the previous patch that just > > disabled reclaim during quotacheck. > > I thought we'd done that discussion. i.e. we can't disable reclaim > in quotacheck because that can set off the OOM killer... > Huh? The only reason disabling of dquot reclaim during quotacheck was proposed in the first place is because it is 100% superfluous. Quotacheck, by design, does not allow reclaim to free memory. Therefore reclaim does not and afaict never has prevented or even mitigated OOM during quotacheck. Is OOM possible due to quotacheck? You bet. Could quotacheck be improved to be more memory efficient? Most likely, but that's beyond the scope of this deadlock regression. However unless I'm missing something, we can disable reclaim during quotacheck to no ill side effect. Brian P.S., To be completely clear of my position on this issue at this point... given the amount of time I've already spent responding to handwavy arguments (ultimately resulting in discussing trailing off until a repost occurs), or experimenting with a known bogus quotacheck rework (which is still left without feedback, I believe), etc., clarification on the correctness of this alternate approach (while interesting) is not nearly convincing enough for me to start over on this bug. I don't mind handwavy questions if the "caller" is receptive to or attempting to process (or debate) clarification, but I don't get the impression that is the case here. If you feel strongly enough about a certain approach, feel free to just send a patch. 
At this point, I'm happy to help review anything from the sole perspective of technical correctness (regardless of whether the I think the approach is ugly), but I'm not personally wasting any more time than I already have to implement and test such an approach without a more logical and convincing argument. IMO, the feedback to this patch doesn't fairly or reasonably justify the level of pushback. > > In fact, if we're going to hack > > around avoiding the flush lock, why not just hack up xfs_buf_submit() to > > allow submission of a buffer while it remains on a delwri queue? AFAICT, > > that would only require removing an assert and not clearing the flag on > > I/O completion. I'm not convinced I'd post something for either > > approach, but I'd have to think about it some more. > > > > > Indeed, let's go look at inodes for an example of this, > > > xfs_ifree_cluster() to be precise: > > > > > > /* > > > * We obtain and lock the backing buffer first in the process > > > * here, as we have to ensure that any dirty inode that we > > > * can't get the flush lock on is attached to the buffer. > > > * If we scan the in-memory inodes first, then buffer IO can > > > * complete before we get a lock on it, and hence we may fail > > > * to mark all the active inodes on the buffer stale. > > > */ > > > ..... > > > /* > > > * Walk the inodes already attached to the buffer and mark them > > > * stale. These will all have the flush locks held, so an > > > * in-memory inode walk can't lock them. By marking them all > > > * stale first, we will not attempt to lock them in the loop > > > * below as the XFS_ISTALE flag will be set. > > > */ > > > ..... > > > /* > > > * For each inode in memory attempt to add it to the inode > > > * buffer and set it up for being staled on buffer IO > > > * completion. This is safe as we've locked out tail pushing > > > * and flushing by locking the buffer. > > > * > > > * We have already marked every inode that was part of a > > > * transaction stale above, which means there is no point in > > > * even trying to lock them. > > > */ > > > > > > IOWs, what we have here is precendence for modifying the flush > > > locked objects attached to a buffer after first locking the buffer. > > > dquots have the same transaction/flush model as inodes, so I'm > > > pretty sure this will lead to the simplest, cleanest way to avoid > > > this deadlock.... > > > > > > > The simplest approach to avoid the deadlock is to disable dquot reclaim > > during quotacheck. ;) > > > > On a quick scan through the inode code, it seems the above refer to > > using the buffer lock to serialize invalidation of higher level objects > > based on the buffer. IOW, to ensure that we don't leave "valid" in-core > > inode objects referring to an inode metadata buffer that is about to be > > invalidated. > > Yes, we lock from the bottom up until we have exclusive access to > the objects. Invalidation is the /purpose/ of that code, not the > architectural mechanism used to lock the objects into a state where > we can then invalidate them. > > Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com > -- > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-xfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
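For reference, the a)-f) ordering quoted above maps onto the existing (uncontended) quotacheck flush path roughly as follows. This is a simplified sketch, not tree code verbatim: the function name is made up, error handling is trimmed, and the log item state update of step e) happens inside xfs_qm_dqflush().

STATIC int
example_flush_order(
        struct xfs_dquot        *dqp,
        struct list_head        *buffer_list)
{
        struct xfs_buf          *bp = NULL;
        int                     error;

        xfs_dqlock(dqp);                        /* a) lock the dquot */
        xfs_dqflock(dqp);                       /* b) obtain the flush lock */

        /*
         * c)-e): xfs_qm_dqflush() locks the backing buffer, copies the in-core
         * dquot into it and attaches the callback that drops the flush lock at
         * I/O completion; the buffer comes back locked in bp.
         */
        error = xfs_qm_dqflush(dqp, &bp);
        if (!error) {
                xfs_buf_delwri_queue(bp, buffer_list);  /* I/O issued later */
                xfs_buf_relse(bp);              /* f) unlock the buffer */
        }
        xfs_dqunlock(dqp);
        return error;
}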
On Tue, Apr 18, 2017 at 09:55:31AM -0400, Brian Foster wrote: > On Tue, Apr 18, 2017 at 12:35:48PM +1000, Dave Chinner wrote: > > On Tue, Apr 11, 2017 at 10:53:03AM -0400, Brian Foster wrote: > > > On Tue, Apr 11, 2017 at 09:55:25AM +1000, Dave Chinner wrote: > > > > So why can't we just modify the dquot in the buffer? We already > > > > hold all the locks needed to guarantee exclusive access to the dquot > > > > and buffer, and they are all we hold when we do the initial flush to > > > > the buffer. So why do we need to do IO to unlock the dquot flush > > > > lock when we could just rewrite before we submit the buffer for IO? > > > > > > > > > > Are you suggesting to essentially ignore the flush lock? I suppose we > > > could get away with this in dq_flush_one() since it is only called from > > > quotacheck, but we may have to kill off any assertions that expect the > > > flush lock in xfs_dqflush(), and perhaps refactor that code to accept a > > > locked buffer as a param. > > > > No, I'm not suggesting we ignore the flush lock. What the flush > > "lock" does is prevent higher layer attempts to flush the dquot to > > the backing buffer whilst IO may be in progress. It's higher level > > function is to allows async log item state updates to be done > > correctly w.r.t. relogging and flushing, as we allow transactional > > modifications to inodes and dquots while they are under IO. > > > > Ok. What I was trying to recall previously was some discussion around > the flush lock retry problem where it was noted that we can't unlock the > flush lock until the currently flushed state makes it to disk[1]. Is the > conceptual difference here that we are not proposing to unlock the flush > lock, but rather co-opt it to perform another flush? I wouldn't use that wording, but you've got the idea. [...] > > > I don't see anything that would break off hand, but my first reaction is > > > it sounds more hackish than this patch or the previous patch that just > > > disabled reclaim during quotacheck. > > > > I thought we'd done that discussion. i.e. we can't disable reclaim > > in quotacheck because that can set off the OOM killer... > > > > Huh? The only reason disabling of dquot reclaim during quotacheck was > proposed in the first place is because it is 100% superfluous. ISTR that we broke dquot reclaim during quotacheck by moving to private delwri lists. I'm working under the impression that dquot reclaim during quotacheck used to work just fine. maybe I'm wrong, but .... > Quotacheck, by design, does not allow reclaim to free memory. Therefore > reclaim does not and afaict never has prevented or even mitigated OOM > during quotacheck. .... the intent has always been to allow dquot reclaim to run when quotacheck is active because we've had our fair share of OOM problems in quotacheck over the past 10 years. Bugs notwithstanding, we really should be trying to ensure the code fulfils that intent rather than sweep it under the carpet and tell users "too bad, so sad" when quotacheck OOMs... [...] > P.S., To be completely clear of my position on this issue at this > point... given the amount of time I've already spent responding to > handwavy arguments (ultimately resulting in discussing trailing off > until a repost occurs), or experimenting with a known bogus quotacheck > rework (which is still left without feedback, I believe), etc., > clarification on the correctness of this alternate approach (while > interesting) is not nearly convincing enough for me to start over on > this bug. 
I don't mind handwavy questions if the "caller" is receptive > to or attempting to process (or debate) clarification, but I don't get > the impression that is the case here. > > If you feel strongly enough about a certain approach, feel free to just > send a patch. At this point, I'm happy to help review anything from the > sole perspective of technical correctness (regardless of whether the I > think the approach is ugly), but I'm not personally wasting any more > time than I already have to implement and test such an approach without > a more logical and convincing argument. IMO, the feedback to this patch > doesn't fairly or reasonably justify the level of pushback. I just responded to the code that was posted, pointing out a list of things that concerned me and, I thought, we've been working through that quite amicably. Really, it is up to the maintainer whether to merge the code or not. That's not me - I'm now just someone who knows the code and it's history. This is where the maintainer needs to step in and make a decision one way or the other.... -Dave.
On Wed, Apr 19, 2017 at 12:46:49PM +1000, Dave Chinner wrote: > On Tue, Apr 18, 2017 at 09:55:31AM -0400, Brian Foster wrote: > > On Tue, Apr 18, 2017 at 12:35:48PM +1000, Dave Chinner wrote: > > > On Tue, Apr 11, 2017 at 10:53:03AM -0400, Brian Foster wrote: > > > > On Tue, Apr 11, 2017 at 09:55:25AM +1000, Dave Chinner wrote: > > > > > So why can't we just modify the dquot in the buffer? We already > > > > > hold all the locks needed to guarantee exclusive access to the dquot > > > > > and buffer, and they are all we hold when we do the initial flush to > > > > > the buffer. So why do we need to do IO to unlock the dquot flush > > > > > lock when we could just rewrite before we submit the buffer for IO? > > > > > > > > > > > > > Are you suggesting to essentially ignore the flush lock? I suppose we > > > > could get away with this in dq_flush_one() since it is only called from > > > > quotacheck, but we may have to kill off any assertions that expect the > > > > flush lock in xfs_dqflush(), and perhaps refactor that code to accept a > > > > locked buffer as a param. > > > > > > No, I'm not suggesting we ignore the flush lock. What the flush > > > "lock" does is prevent higher layer attempts to flush the dquot to > > > the backing buffer whilst IO may be in progress. It's higher level > > > function is to allows async log item state updates to be done > > > correctly w.r.t. relogging and flushing, as we allow transactional > > > modifications to inodes and dquots while they are under IO. > > > > > > > Ok. What I was trying to recall previously was some discussion around > > the flush lock retry problem where it was noted that we can't unlock the > > flush lock until the currently flushed state makes it to disk[1]. Is the > > conceptual difference here that we are not proposing to unlock the flush > > lock, but rather co-opt it to perform another flush? > > I wouldn't use that wording, but you've got the idea. > > [...] > > > > > I don't see anything that would break off hand, but my first reaction is > > > > it sounds more hackish than this patch or the previous patch that just > > > > disabled reclaim during quotacheck. > > > > > > I thought we'd done that discussion. i.e. we can't disable reclaim > > > in quotacheck because that can set off the OOM killer... > > > > > > > Huh? The only reason disabling of dquot reclaim during quotacheck was > > proposed in the first place is because it is 100% superfluous. > > ISTR that we broke dquot reclaim during quotacheck by moving to > private delwri lists. I'm working under the impression that dquot > reclaim during quotacheck used to work just fine. maybe I'm wrong, > but .... > > > Quotacheck, by design, does not allow reclaim to free memory. Therefore > > reclaim does not and afaict never has prevented or even mitigated OOM > > during quotacheck. > > .... the intent has always been to allow dquot reclaim to run when > quotacheck is active because we've had our fair share of OOM > problems in quotacheck over the past 10 years. Bugs notwithstanding, > we really should be trying to ensure the code fulfils that intent > rather than sweep it under the carpet and tell users "too bad, so > sad" when quotacheck OOMs... > > [...] > > > P.S., To be completely clear of my position on this issue at this > > point... 
given the amount of time I've already spent responding to > > handwavy arguments (ultimately resulting in discussing trailing off > > until a repost occurs), or experimenting with a known bogus quotacheck > > rework (which is still left without feedback, I believe), etc., > > clarification on the correctness of this alternate approach (while > > interesting) is not nearly convincing enough for me to start over on > > this bug. I don't mind handwavy questions if the "caller" is receptive > > to or attempting to process (or debate) clarification, but I don't get > > the impression that is the case here. > > > > If you feel strongly enough about a certain approach, feel free to just > > send a patch. At this point, I'm happy to help review anything from the > > sole perspective of technical correctness (regardless of whether the I > > think the approach is ugly), but I'm not personally wasting any more > > time than I already have to implement and test such an approach without > > a more logical and convincing argument. IMO, the feedback to this patch > > doesn't fairly or reasonably justify the level of pushback. > > I just responded to the code that was posted, pointing out a > list of things that concerned me and, I thought, we've been working > through that quite amicably. > > Really, it is up to the maintainer whether to merge the code or not. > That's not me - I'm now just someone who knows the code and it's > history. This is where the maintainer needs to step in and make a > decision one way or the other.... So I do have a few (more) comments: The first point I have to make is that the quotacheck OOM still seems to happen infrequently, so at the point that this thread started going, I was (and still am) fine with letting the discussion continue until we run out of questions. :) As far as the patch itself goes, it took me a while to figure out what's going on with the delwri buffer list shuffling. The part where the buffer comes back from xfs_buf_delwri_submit_buffers during _pushbuf still attached to a delwri queue and yet we still have to re-add the DELWRI_Q flag raised my eyebrows. I thought it was a little odd, but in those particular circumstances (quotacheck and reclaim) I didn't find anything that made me think it would fail, even if it probably wasn't what the designers had in mind. However, Dave's comments inspired me to take a second look. Sorry if this has been covered already, but what if dquot_isolate noticed when xfs_buf_delwri_queue doesn't actually put the buffer on isol->buffers (i.e. something else already put it on another delwri queue, which afaict only happens during quotacheck?) and simply dqfunlock's the buffer? Reclaim will have already formatted the in-core dquot into the on-disk buffer and can free the dquot, and quotacheck is still on the hook to issue the IO. (FWIW even that doesn't quite smell right since it seems weird to push metadata to the buffer layer but someone else writes it to disk.) In any case, it would be helpful to document how the delwri queues work, whether or not the list is allowed to change, and what relation the _XBF_DELWRI_Q flag has to the value of b_list. For example, I didn't know that one wasn't supposed to bounce buffers between lists. Can that documentation be part of whatever patch series we end up committing, please? (It's hard for me to make decisions when I know I don't have enough info...) 
Also, I think Brian said that he'd post updated patches with at least a fix to a problem with dropped xfs_buf_delwri_pushbuf return codes, so I'd still like to see that. As far as testing goes I'm happ(ier) that /one/ of the original complainants says that he could move on with the patches applied. I haven't observed any breakage w/ xfstests, though that doesn't prove it right. --D > > -Dave. > -- > Dave Chinner > david@fromorbit.com > -- > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-xfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
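To make the alternative floated above concrete, the reclaim-side change might look something like the fragment below. This is a sketch only, not a posted patch; it shows just the branch in question, and whether dropping the flush lock before the flushed state reaches disk is actually safe is exactly the open question.

        /* inside the XFS_DQ_IS_DIRTY() branch of the dquot shrinker isolate callback */
        error = xfs_qm_dqflush(dqp, &bp);
        if (error)
                goto out_unlock_dirty;

        if (!xfs_buf_delwri_queue(bp, &isol->buffers)) {
                /*
                 * Quotacheck already has this buffer on its own delwri queue
                 * and will issue the I/O. The in-core dquot state has been
                 * written into the buffer, so drop the flush lock here rather
                 * than waiting for an I/O completion that reclaim will never
                 * submit.
                 */
                xfs_dqfunlock(dqp);
        }
        xfs_buf_relse(bp);
        goto out_unlock_dirty;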
On Wed, Apr 19, 2017 at 12:46:49PM +1000, Dave Chinner wrote: > On Tue, Apr 18, 2017 at 09:55:31AM -0400, Brian Foster wrote: > > On Tue, Apr 18, 2017 at 12:35:48PM +1000, Dave Chinner wrote: > > > On Tue, Apr 11, 2017 at 10:53:03AM -0400, Brian Foster wrote: > > > > On Tue, Apr 11, 2017 at 09:55:25AM +1000, Dave Chinner wrote: > > > > > So why can't we just modify the dquot in the buffer? We already > > > > > hold all the locks needed to guarantee exclusive access to the dquot > > > > > and buffer, and they are all we hold when we do the initial flush to > > > > > the buffer. So why do we need to do IO to unlock the dquot flush > > > > > lock when we could just rewrite before we submit the buffer for IO? > > > > > > > > > > > > > Are you suggesting to essentially ignore the flush lock? I suppose we > > > > could get away with this in dq_flush_one() since it is only called from > > > > quotacheck, but we may have to kill off any assertions that expect the > > > > flush lock in xfs_dqflush(), and perhaps refactor that code to accept a > > > > locked buffer as a param. > > > > > > No, I'm not suggesting we ignore the flush lock. What the flush > > > "lock" does is prevent higher layer attempts to flush the dquot to > > > the backing buffer whilst IO may be in progress. It's higher level > > > function is to allows async log item state updates to be done > > > correctly w.r.t. relogging and flushing, as we allow transactional > > > modifications to inodes and dquots while they are under IO. > > > > > > > Ok. What I was trying to recall previously was some discussion around > > the flush lock retry problem where it was noted that we can't unlock the > > flush lock until the currently flushed state makes it to disk[1]. Is the > > conceptual difference here that we are not proposing to unlock the flush > > lock, but rather co-opt it to perform another flush? > > I wouldn't use that wording, but you've got the idea. > > [...] > Ok. > > > > I don't see anything that would break off hand, but my first reaction is > > > > it sounds more hackish than this patch or the previous patch that just > > > > disabled reclaim during quotacheck. > > > > > > I thought we'd done that discussion. i.e. we can't disable reclaim > > > in quotacheck because that can set off the OOM killer... > > > > > > > Huh? The only reason disabling of dquot reclaim during quotacheck was > > proposed in the first place is because it is 100% superfluous. > > ISTR that we broke dquot reclaim during quotacheck by moving to > private delwri lists. I'm working under the impression that dquot > reclaim during quotacheck used to work just fine. maybe I'm wrong, > but .... > As noted in the commit log for this patch, the commit that appears to introduce the regression is 43ff2122e6 ("xfs: on-stack delayed write buffer lists"), which I believe introduces (or is part of a series that introduces) the private delwri queue bits. I should correct my previous statement.. it's not clear to me whether dquot reclaim during quotacheck worked at some time before this. It may very well have, I'm not really familiar with the pre-private delwri queue bits. I've only gone as far back as this patch to identify that quotacheck had a mechanism to deal with flush locked buffers and that since this commit, I don't see how dquot reclaim has had the ability to free memory during quotacheck. My intent is more to point out that this apparently has been the state of things for quite some time. 
Either way, this is not terribly important as this patch replaces the original "disable reclaim during quotacheck" variant (IOW, I'll just accept that it worked at some point in the past). I was originally responding to the claim that the whole bottom up locking thing was somehow notably more simple than what has been proposed so far. > > Quotacheck, by design, does not allow reclaim to free memory. Therefore > > reclaim does not and afaict never has prevented or even mitigated OOM > > during quotacheck. > > .... the intent has always been to allow dquot reclaim to run when > quotacheck is active because we've had our fair share of OOM > problems in quotacheck over the past 10 years. Bugs notwithstanding, > we really should be trying to ensure the code fulfils that intent > rather than sweep it under the carpet and tell users "too bad, so > sad" when quotacheck OOMs... > Apparently not so much in the last... looks like almost 5 years now. ;) But as noted (multiple times) previously, this is not mutually exclusive to fixing the deadlock and I would have been happy to get the deadlock fixed and follow up looking at improving quotacheck efficiency. > [...] > > > P.S., To be completely clear of my position on this issue at this > > point... given the amount of time I've already spent responding to > > handwavy arguments (ultimately resulting in discussing trailing off > > until a repost occurs), or experimenting with a known bogus quotacheck > > rework (which is still left without feedback, I believe), etc., > > clarification on the correctness of this alternate approach (while > > interesting) is not nearly convincing enough for me to start over on > > this bug. I don't mind handwavy questions if the "caller" is receptive > > to or attempting to process (or debate) clarification, but I don't get > > the impression that is the case here. > > > > If you feel strongly enough about a certain approach, feel free to just > > send a patch. At this point, I'm happy to help review anything from the > > sole perspective of technical correctness (regardless of whether the I > > think the approach is ugly), but I'm not personally wasting any more > > time than I already have to implement and test such an approach without > > a more logical and convincing argument. IMO, the feedback to this patch > > doesn't fairly or reasonably justify the level of pushback. > > I just responded to the code that was posted, pointing out a > list of things that concerned me and, I thought, we've been working > through that quite amicably. > I have noted a few useful updates to this patch, but otherwise don't see anything to suggest that posting this again will get it anywhere close to being merged. The previous mail actually states that this "can't be merged." :P Be that as it may, the justifications for that objection appear to me to be mostly unsubstantiated, not fleshed out (IMO) and lead to an alternative proposal that doesn't seem any less hacky to me. Therefore, those reasons don't justify enough to me to start over when I've 1.) done that twice already 2.) already invested considerable time into a fix that I consider more appropriate and relatively straightforward and 3.) invested similarly to ensure that said fix actually works. So feel free to either propose an approach that is sufficiently attractive to warrant further investigation (I don't have any other ideas at the moment), or just send a patch that implements your preferred approach. 
As mentioned, I'll help review whatever for technical correctness at this point. I think it's fair to call that "amicable disagreement." ;) > Really, it is up to the maintainer whether to merge the code or not. > That's not me - I'm now just someone who knows the code and it's > history. This is where the maintainer needs to step in and make a > decision one way or the other.... > I don't think anybody is going to suggest we merge a patch that you (or any consistent XFS code reviewer) effectively object to. I certainly wouldn't ask Darrick to do that, nor would I take that approach if I were in his shoes. Brian > -Dave. > -- > Dave Chinner > david@fromorbit.com -- To unsubscribe from this list: send the line "unsubscribe linux-xfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, Apr 19, 2017 at 12:55:19PM -0700, Darrick J. Wong wrote: > On Wed, Apr 19, 2017 at 12:46:49PM +1000, Dave Chinner wrote: > > On Tue, Apr 18, 2017 at 09:55:31AM -0400, Brian Foster wrote: > > > On Tue, Apr 18, 2017 at 12:35:48PM +1000, Dave Chinner wrote: > > > > On Tue, Apr 11, 2017 at 10:53:03AM -0400, Brian Foster wrote: > > > > > On Tue, Apr 11, 2017 at 09:55:25AM +1000, Dave Chinner wrote: > > > > > > So why can't we just modify the dquot in the buffer? We already > > > > > > hold all the locks needed to guarantee exclusive access to the dquot > > > > > > and buffer, and they are all we hold when we do the initial flush to > > > > > > the buffer. So why do we need to do IO to unlock the dquot flush > > > > > > lock when we could just rewrite before we submit the buffer for IO? > > > > > > > > > > > > > > > > Are you suggesting to essentially ignore the flush lock? I suppose we > > > > > could get away with this in dq_flush_one() since it is only called from > > > > > quotacheck, but we may have to kill off any assertions that expect the > > > > > flush lock in xfs_dqflush(), and perhaps refactor that code to accept a > > > > > locked buffer as a param. > > > > > > > > No, I'm not suggesting we ignore the flush lock. What the flush > > > > "lock" does is prevent higher layer attempts to flush the dquot to > > > > the backing buffer whilst IO may be in progress. It's higher level > > > > function is to allows async log item state updates to be done > > > > correctly w.r.t. relogging and flushing, as we allow transactional > > > > modifications to inodes and dquots while they are under IO. > > > > > > > > > > Ok. What I was trying to recall previously was some discussion around > > > the flush lock retry problem where it was noted that we can't unlock the > > > flush lock until the currently flushed state makes it to disk[1]. Is the > > > conceptual difference here that we are not proposing to unlock the flush > > > lock, but rather co-opt it to perform another flush? > > > > I wouldn't use that wording, but you've got the idea. > > > > [...] > > > > > > > I don't see anything that would break off hand, but my first reaction is > > > > > it sounds more hackish than this patch or the previous patch that just > > > > > disabled reclaim during quotacheck. > > > > > > > > I thought we'd done that discussion. i.e. we can't disable reclaim > > > > in quotacheck because that can set off the OOM killer... > > > > > > > > > > Huh? The only reason disabling of dquot reclaim during quotacheck was > > > proposed in the first place is because it is 100% superfluous. > > > > ISTR that we broke dquot reclaim during quotacheck by moving to > > private delwri lists. I'm working under the impression that dquot > > reclaim during quotacheck used to work just fine. maybe I'm wrong, > > but .... > > > > > Quotacheck, by design, does not allow reclaim to free memory. Therefore > > > reclaim does not and afaict never has prevented or even mitigated OOM > > > during quotacheck. > > > > .... the intent has always been to allow dquot reclaim to run when > > quotacheck is active because we've had our fair share of OOM > > problems in quotacheck over the past 10 years. Bugs notwithstanding, > > we really should be trying to ensure the code fulfils that intent > > rather than sweep it under the carpet and tell users "too bad, so > > sad" when quotacheck OOMs... > > > > [...] > > > > > P.S., To be completely clear of my position on this issue at this > > > point... 
given the amount of time I've already spent responding to > > > handwavy arguments (ultimately resulting in discussing trailing off > > > until a repost occurs), or experimenting with a known bogus quotacheck > > > rework (which is still left without feedback, I believe), etc., > > > clarification on the correctness of this alternate approach (while > > > interesting) is not nearly convincing enough for me to start over on > > > this bug. I don't mind handwavy questions if the "caller" is receptive > > > to or attempting to process (or debate) clarification, but I don't get > > > the impression that is the case here. > > > > > > If you feel strongly enough about a certain approach, feel free to just > > > send a patch. At this point, I'm happy to help review anything from the > > > sole perspective of technical correctness (regardless of whether the I > > > think the approach is ugly), but I'm not personally wasting any more > > > time than I already have to implement and test such an approach without > > > a more logical and convincing argument. IMO, the feedback to this patch > > > doesn't fairly or reasonably justify the level of pushback. > > > > I just responded to the code that was posted, pointing out a > > list of things that concerned me and, I thought, we've been working > > through that quite amicably. > > > > Really, it is up to the maintainer whether to merge the code or not. > > That's not me - I'm now just someone who knows the code and it's > > history. This is where the maintainer needs to step in and make a > > decision one way or the other.... > > So I do have a few (more) comments: > > The first point I have to make is that the quotacheck OOM still seems to > happen infrequently, so at the point that this thread started going, I > was (and still am) fine with letting the discussion continue until we > run out of questions. :) > > As far as the patch itself goes, it took me a while to figure out what's > going on with the delwri buffer list shuffling. The part where the > buffer comes back from xfs_buf_delwri_submit_buffers during _pushbuf > still attached to a delwri queue and yet we still have to re-add the > DELWRI_Q flag raised my eyebrows. I thought it was a little odd, but > in those particular circumstances (quotacheck and reclaim) I didn't find > anything that made me think it would fail, even if it probably wasn't > what the designers had in mind. > > However, Dave's comments inspired me to take a second look. Sorry if > this has been covered already, but what if dquot_isolate noticed when > xfs_buf_delwri_queue doesn't actually put the buffer on isol->buffers > (i.e. something else already put it on another delwri queue, which > afaict only happens during quotacheck?) and simply dqfunlock's the > buffer? Reclaim will have already formatted the in-core dquot into the > on-disk buffer and can free the dquot, and quotacheck is still on the > hook to issue the IO. (FWIW even that doesn't quite smell right since > it seems weird to push metadata to the buffer layer but someone else > writes it to disk.) > FWIW, I did also consider something similar on the reclaim side of things. Not to unlock the flush lock (I think we don't generally unlock a flush lock until state reaches disk, even though technically it may not be a problem from quotacheck context), but to avoid acquiring it in the first place if the underlying buffer appeared to already belong on a delwri queue (or something along those lines). 
I don't recall the exact details off the top of my head, but I didn't like how it turned out enough such that it never turned into something post-worthy (I may still have that around somewhere, though). > In any case, it would be helpful to document how the delwri queues work, > whether or not the list is allowed to change, and what relation the > _XBF_DELWRI_Q flag has to the value of b_list. For example, I didn't > know that one wasn't supposed to bounce buffers between lists. Can that > documentation be part of whatever patch series we end up committing, > please? > I'm not aware of any such limitation. > (It's hard for me to make decisions when I know I don't have enough info...) > > Also, I think Brian said that he'd post updated patches with at least a > fix to a problem with dropped xfs_buf_delwri_pushbuf return codes, so > I'd still like to see that. > That was my intent, though I realized I haven't even made those changes locally yet because this appears to be going nowhere fast, afaict. I can post another update if that is actually useful, however. (As noted in my previous mail, though, I don't really expect you to consider merging a patch with outstanding objections.) Brian > As far as testing goes I'm happ(ier) that /one/ of the original > complainants says that he could move on with the patches applied. I > haven't observed any breakage w/ xfstests, though that doesn't prove it > right. > > --D > > > > > -Dave. > > -- > > Dave Chinner > > david@fromorbit.com > > -- > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in > > the body of a message to majordomo@vger.kernel.org > > More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-xfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, Apr 19, 2017 at 04:46:46PM -0400, Brian Foster wrote: > On Wed, Apr 19, 2017 at 12:55:19PM -0700, Darrick J. Wong wrote: > > On Wed, Apr 19, 2017 at 12:46:49PM +1000, Dave Chinner wrote: > > > On Tue, Apr 18, 2017 at 09:55:31AM -0400, Brian Foster wrote: > > > > On Tue, Apr 18, 2017 at 12:35:48PM +1000, Dave Chinner wrote: > > > > > On Tue, Apr 11, 2017 at 10:53:03AM -0400, Brian Foster wrote: > > > > > > On Tue, Apr 11, 2017 at 09:55:25AM +1000, Dave Chinner wrote: > > > > > > > So why can't we just modify the dquot in the buffer? We already > > > > > > > hold all the locks needed to guarantee exclusive access to the dquot > > > > > > > and buffer, and they are all we hold when we do the initial flush to > > > > > > > the buffer. So why do we need to do IO to unlock the dquot flush > > > > > > > lock when we could just rewrite before we submit the buffer for IO? > > > > > > > > > > > > > > > > > > > Are you suggesting to essentially ignore the flush lock? I suppose we > > > > > > could get away with this in dq_flush_one() since it is only called from > > > > > > quotacheck, but we may have to kill off any assertions that expect the > > > > > > flush lock in xfs_dqflush(), and perhaps refactor that code to accept a > > > > > > locked buffer as a param. > > > > > > > > > > No, I'm not suggesting we ignore the flush lock. What the flush > > > > > "lock" does is prevent higher layer attempts to flush the dquot to > > > > > the backing buffer whilst IO may be in progress. It's higher level > > > > > function is to allows async log item state updates to be done > > > > > correctly w.r.t. relogging and flushing, as we allow transactional > > > > > modifications to inodes and dquots while they are under IO. > > > > > > > > > > > > > Ok. What I was trying to recall previously was some discussion around > > > > the flush lock retry problem where it was noted that we can't unlock the > > > > flush lock until the currently flushed state makes it to disk[1]. Is the > > > > conceptual difference here that we are not proposing to unlock the flush > > > > lock, but rather co-opt it to perform another flush? > > > > > > I wouldn't use that wording, but you've got the idea. > > > > > > [...] > > > > > > > > > I don't see anything that would break off hand, but my first reaction is > > > > > > it sounds more hackish than this patch or the previous patch that just > > > > > > disabled reclaim during quotacheck. > > > > > > > > > > I thought we'd done that discussion. i.e. we can't disable reclaim > > > > > in quotacheck because that can set off the OOM killer... > > > > > > > > > > > > > Huh? The only reason disabling of dquot reclaim during quotacheck was > > > > proposed in the first place is because it is 100% superfluous. > > > > > > ISTR that we broke dquot reclaim during quotacheck by moving to > > > private delwri lists. I'm working under the impression that dquot > > > reclaim during quotacheck used to work just fine. maybe I'm wrong, > > > but .... > > > > > > > Quotacheck, by design, does not allow reclaim to free memory. Therefore > > > > reclaim does not and afaict never has prevented or even mitigated OOM > > > > during quotacheck. > > > > > > .... the intent has always been to allow dquot reclaim to run when > > > quotacheck is active because we've had our fair share of OOM > > > problems in quotacheck over the past 10 years. 
Bugs notwithstanding, > > > we really should be trying to ensure the code fulfils that intent > > > rather than sweep it under the carpet and tell users "too bad, so > > > sad" when quotacheck OOMs... > > > > > > [...] > > > > > > > P.S., To be completely clear of my position on this issue at this > > > > point... given the amount of time I've already spent responding to > > > > handwavy arguments (ultimately resulting in discussing trailing off > > > > until a repost occurs), or experimenting with a known bogus quotacheck > > > > rework (which is still left without feedback, I believe), etc., > > > > clarification on the correctness of this alternate approach (while > > > > interesting) is not nearly convincing enough for me to start over on > > > > this bug. I don't mind handwavy questions if the "caller" is receptive > > > > to or attempting to process (or debate) clarification, but I don't get > > > > the impression that is the case here. > > > > > > > > If you feel strongly enough about a certain approach, feel free to just > > > > send a patch. At this point, I'm happy to help review anything from the > > > > sole perspective of technical correctness (regardless of whether the I > > > > think the approach is ugly), but I'm not personally wasting any more > > > > time than I already have to implement and test such an approach without > > > > a more logical and convincing argument. IMO, the feedback to this patch > > > > doesn't fairly or reasonably justify the level of pushback. > > > > > > I just responded to the code that was posted, pointing out a > > > list of things that concerned me and, I thought, we've been working > > > through that quite amicably. > > > > > > Really, it is up to the maintainer whether to merge the code or not. > > > That's not me - I'm now just someone who knows the code and it's > > > history. This is where the maintainer needs to step in and make a > > > decision one way or the other.... > > > > So I do have a few (more) comments: > > > > The first point I have to make is that the quotacheck OOM still seems to > > happen infrequently, so at the point that this thread started going, I > > was (and still am) fine with letting the discussion continue until we > > run out of questions. :) > > > > As far as the patch itself goes, it took me a while to figure out what's > > going on with the delwri buffer list shuffling. The part where the > > buffer comes back from xfs_buf_delwri_submit_buffers during _pushbuf > > still attached to a delwri queue and yet we still have to re-add the > > DELWRI_Q flag raised my eyebrows. I thought it was a little odd, but > > in those particular circumstances (quotacheck and reclaim) I didn't find > > anything that made me think it would fail, even if it probably wasn't > > what the designers had in mind. > > > > However, Dave's comments inspired me to take a second look. Sorry if > > this has been covered already, but what if dquot_isolate noticed when > > xfs_buf_delwri_queue doesn't actually put the buffer on isol->buffers > > (i.e. something else already put it on another delwri queue, which > > afaict only happens during quotacheck?) and simply dqfunlock's the > > buffer? Reclaim will have already formatted the in-core dquot into the > > on-disk buffer and can free the dquot, and quotacheck is still on the > > hook to issue the IO. (FWIW even that doesn't quite smell right since > > it seems weird to push metadata to the buffer layer but someone else > > writes it to disk.) 
> > > > FWIW, I did also consider something similar on the reclaim side of > things. Not to unlock the flush lock (I think we don't generally unlock > a flush lock until state reaches disk, even though technically it may > not be a problem from quotacheck context), but to avoid acquiring it in > the first place if the underlying buffer appeared to already belong on a > delwri queue (or something along those lines). > > I don't recall the exact details off the top of my head, but I didn't > like how it turned out enough such that it never turned into something > post-worthy (I may still have that around somewhere, though). > I don't appear to have any code around for the above, unfortunately. Having thought about this a bit more, I think one of the reasons I didn't prefer that approach (or the latest alternative proposal) is that all things being equal, I'd rather not complicate external, core infrastructure bits of code such as reclaim or dquot flushing with a hack that is only necessary to deal with quotacheck. IOW, the problem is primarily with the quotacheck implementation, so IMO the most appropriate place for whatever hacks are necessary to make it work correctly is in quotacheck itself. Then it's more straightforward to improve the whole implementation going forward (and possibly replace or remove the code being added here), rather than have to reassess whatever hacks we've sprinkled elsewhere in the codebase. That's just my .02, of course. v3 incoming... Brian > > In any case, it would be helpful to document how the delwri queues work, > > whether or not the list is allowed to change, and what relation the > > _XBF_DELWRI_Q flag has to the value of b_list. For example, I didn't > > know that one wasn't supposed to bounce buffers between lists. Can that > > documentation be part of whatever patch series we end up committing, > > please? > > > > I'm not aware of any such limitation. > > > (It's hard for me to make decisions when I know I don't have enough info...) > > > > Also, I think Brian said that he'd post updated patches with at least a > > fix to a problem with dropped xfs_buf_delwri_pushbuf return codes, so > > I'd still like to see that. > > > > That was my intent, though I realized I haven't even made those changes > locally yet because this appears to be going nowhere fast, afaict. I can > post another update if that is actually useful, however. > > (As noted in my previous mail, though, I don't really expect you to > consider merging a patch with outstanding objections.) > > Brian > > > As far as testing goes I'm happ(ier) that /one/ of the original > > complainants says that he could move on with the patches applied. I > > haven't observed any breakage w/ xfstests, though that doesn't prove it > > right. > > > > --D > > > > > > > > -Dave. > > > -- > > > Dave Chinner > > > david@fromorbit.com > > > -- > > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in > > > the body of a message to majordomo@vger.kernel.org > > > More majordomo info at http://vger.kernel.org/majordomo-info.html > -- > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-xfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index e566510..e97cf56 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -2011,6 +2011,43 @@ xfs_buf_delwri_submit(
 	return error;
 }
 
+int
+xfs_buf_delwri_pushbuf(
+	struct xfs_buf		*bp,
+	struct list_head	*buffer_list)
+{
+	LIST_HEAD		(submit_list);
+	int			error;
+
+	ASSERT(xfs_buf_islocked(bp));
+	ASSERT(bp->b_flags & _XBF_DELWRI_Q);
+
+	trace_xfs_buf_delwri_pushbuf(bp, _RET_IP_);
+
+	/*
+	 * Move the buffer to an empty list and submit. Pass the original list
+	 * as the wait list so delwri submission moves the buf back to it before
+	 * it is submitted (and thus before it is unlocked). This means the
+	 * buffer cannot be placed on another list while we wait for it.
+	 */
+	list_move(&bp->b_list, &submit_list);
+	xfs_buf_unlock(bp);
+
+	xfs_buf_delwri_submit_buffers(&submit_list, buffer_list);
+
+	/*
+	 * Lock the buffer to wait for I/O completion. It's already held on the
+	 * original list, so all we have to do is reset the delwri queue flag
+	 * that was cleared by delwri submission.
+	 */
+	xfs_buf_lock(bp);
+	error = bp->b_error;
+	bp->b_flags |= _XBF_DELWRI_Q;
+	xfs_buf_unlock(bp);
+
+	return error;
+}
+
 int __init
 xfs_buf_init(void)
 {
diff --git a/fs/xfs/xfs_buf.h b/fs/xfs/xfs_buf.h
index 8a9d3a9..cd74216 100644
--- a/fs/xfs/xfs_buf.h
+++ b/fs/xfs/xfs_buf.h
@@ -334,6 +334,7 @@ extern void xfs_buf_stale(struct xfs_buf *bp);
 extern bool xfs_buf_delwri_queue(struct xfs_buf *, struct list_head *);
 extern int xfs_buf_delwri_submit(struct list_head *);
 extern int xfs_buf_delwri_submit_nowait(struct list_head *);
+extern int xfs_buf_delwri_pushbuf(struct xfs_buf *, struct list_head *);
 
 /* Buffer Daemon Setup Routines */
 extern int xfs_buf_init(void);
diff --git a/fs/xfs/xfs_qm.c b/fs/xfs/xfs_qm.c
index 4ff993c..3815ed3 100644
--- a/fs/xfs/xfs_qm.c
+++ b/fs/xfs/xfs_qm.c
@@ -1247,6 +1247,7 @@ xfs_qm_flush_one(
 	struct xfs_dquot	*dqp,
 	void			*data)
 {
+	struct xfs_mount	*mp = dqp->q_mount;
 	struct list_head	*buffer_list = data;
 	struct xfs_buf		*bp = NULL;
 	int			error = 0;
@@ -1257,7 +1258,32 @@ xfs_qm_flush_one(
 	if (!XFS_DQ_IS_DIRTY(dqp))
 		goto out_unlock;
 
-	xfs_dqflock(dqp);
+	/*
+	 * The only way the dquot is already flush locked by the time quotacheck
+	 * gets here is if reclaim flushed it before the dqadjust walk dirtied
+	 * it for the final time. Quotacheck collects all dquot bufs in the
+	 * local delwri queue before dquots are dirtied, so reclaim can't have
+	 * possibly queued it for I/O. The only way out is to push the buffer to
+	 * cycle the flush lock.
+	 */
+	if (!xfs_dqflock_nowait(dqp)) {
+		/* buf is pinned in-core by delwri list */
+		DEFINE_SINGLE_BUF_MAP(map, dqp->q_blkno,
+				      mp->m_quotainfo->qi_dqchunklen);
+		bp = _xfs_buf_find(mp->m_ddev_targp, &map, 1, 0, NULL);
+		if (!bp) {
+			error = -EINVAL;
+			goto out_unlock;
+		}
+
+		/* delwri_pushbuf drops the buf lock */
+		xfs_buf_delwri_pushbuf(bp, buffer_list);
+		xfs_buf_rele(bp);
+
+		error = -EAGAIN;
+		goto out_unlock;
+	}
+
 	error = xfs_qm_dqflush(dqp, &bp);
 	if (error)
 		goto out_unlock;
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index fb7555e..c8d28a9 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -365,6 +365,7 @@ DEFINE_BUF_EVENT(xfs_buf_iowait_done);
 DEFINE_BUF_EVENT(xfs_buf_delwri_queue);
 DEFINE_BUF_EVENT(xfs_buf_delwri_queued);
 DEFINE_BUF_EVENT(xfs_buf_delwri_split);
+DEFINE_BUF_EVENT(xfs_buf_delwri_pushbuf);
 DEFINE_BUF_EVENT(xfs_buf_get_uncached);
 DEFINE_BUF_EVENT(xfs_buf_item_relse);
 DEFINE_BUF_EVENT(xfs_buf_item_iodone_async);
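One caller-side detail that is not visible in the diff: the -EAGAIN returned from xfs_qm_flush_one() relies on the retry behaviour of the dquot walk in xfs_qm.c. From memory (so treat the exact code as an approximation rather than tree code), xfs_qm_dquot_walk() counts an -EAGAIN as a skipped dquot and restarts the walk after a short delay, which is what gives flush_one another attempt once the pushed buffer has completed I/O and released the flush lock:

	/* approximate shape of the xfs_qm_dquot_walk() loop */
	error = execute(batch[i], data);
	if (error == -EAGAIN) {
		skipped++;
		continue;
	}

	/* and after the radix tree walk finishes: */
	if (skipped) {
		delay(1);
		goto restart;
	}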
Reclaim during quotacheck can lead to deadlocks on the dquot flush
lock:

- Quotacheck populates a local delwri queue with the physical dquot
  buffers.
- Quotacheck performs the xfs_qm_dqusage_adjust() bulkstat and
  dirties all of the dquots.
- Reclaim kicks in and attempts to flush a dquot whose buffer is
  already queued on the quotacheck queue. The flush succeeds but
  queueing to the reclaim delwri queue fails as the backing buffer is
  already queued. The flush unlock is now deferred to I/O completion
  of the buffer from the quotacheck queue.
- The dqadjust bulkstat continues and dirties the recently flushed
  dquot once again.
- Quotacheck proceeds to the xfs_qm_flush_one() walk which requires
  the flush lock to update the backing buffers with the in-core
  recalculated values. It deadlocks on the redirtied dquot as the
  flush lock was already acquired by reclaim, but the buffer resides
  on the local delwri queue which isn't submitted until the end of
  quotacheck.

This is reproduced by running quotacheck on a filesystem with a
couple million inodes in low memory (512MB-1GB) situations. This is
a regression as of commit 43ff2122e6 ("xfs: on-stack delayed write
buffer lists"), which removed a trylock and buffer I/O submission
from the quotacheck dquot flush sequence.

Quotacheck first resets and collects the physical dquot buffers in a
delwri queue. Then, it traverses the filesystem inodes via bulkstat,
updates the in-core dquots, flushes the corrected dquots to the
backing buffers and finally submits the delwri queue for I/O. Since
the backing buffers are queued across the entire quotacheck
operation, dquot reclaim cannot possibly complete a dquot flush
before quotacheck completes.

Therefore, quotacheck must submit the buffer for I/O in order to
cycle the flush lock and flush the dirty in-core dquot to the
buffer. Add a delwri queue buffer push mechanism to submit an
individual buffer for I/O without losing the delwri queue status and
use it from quotacheck to avoid the deadlock. This restores
quotacheck behavior to as before the regression was introduced.

Reported-by: Martin Svec <martin.svec@zoner.cz>
Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/xfs_buf.c   | 37 +++++++++++++++++++++++++++++++++++++
 fs/xfs/xfs_buf.h   |  1 +
 fs/xfs/xfs_qm.c    | 28 +++++++++++++++++++++++++++-
 fs/xfs/xfs_trace.h |  1 +
 4 files changed, 66 insertions(+), 1 deletion(-)