Message ID | 20220315064241.3133751-5-david@fromorbit.com (mailing list archive) |
---|---|
State | Superseded, archived |
Headers | show |
Series | xfs: log recovery fixes | expand |
On Tue, Mar 15, 2022 at 05:42:38PM +1100, Dave Chinner wrote: > From: Dave Chinner <dchinner@redhat.com> > > When the AIL tries to flush the CIL, it relies on the CIL push > ending up on stable storage without having to wait for and > manipulate iclog state directly. However, if there is already a > pending CIL push when the AIL tries to flush the CIL, it won't set > the cil->xc_push_commit_stable flag and so the CIL push will not > actively flush the commit record iclog. > > generic/530 when run on a single CPU test VM can trigger this fairly > reliably. This test exercises unlinked inode recovery, and can > result in inodes being pinned in memory by ongoing modifications to > the inode cluster buffer to record unlinked list modifications. As a > result, the first inode unlinked in a buffer can pin the tail of the > log whilst the inode cluster buffer is pinned by the current > checkpoint that has been pushed but isn't on stable storage because > because the cil->xc_push_commit_stable was not set. This results in > the log/AIL effectively deadlocking until something triggers the > commit record iclog to be pushed to stable storage (i.e. the > periodic log worker calling xfs_log_force()). > > The fix is two-fold - first we should always set the > cil->xc_push_commit_stable when xlog_cil_flush() is called, > regardless of whether there is already a pending push or not. > > Second, if the CIL is empty, we should trigger an iclog flush to > ensure that the iclogs of the last checkpoint have actually been > submitted to disk as that checkpoint may not have been run under > stable completion constraints. Can it ever be the case that the CIL is not empty but the last checkpoint wasn't committed to disk? Let's say someone else commits a transaction after the worker samples xc_push_commit_stable? IOWs, why does a not-empty CIL mean that the last checkpoint is on disk? > Reported-and-tested-by: Matthew Wilcox <willy@infradead.org> MMMM testing. :D --D > Fixes: 0020a190cf3e ("xfs: AIL needs asynchronous CIL forcing") > Signed-off-by: Dave Chinner <dchinner@redhat.com> > --- > fs/xfs/xfs_log_cil.c | 22 +++++++++++++++++++--- > 1 file changed, 19 insertions(+), 3 deletions(-) > > diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c > index 3d8ebf2a1e55..48b16a5feb27 100644 > --- a/fs/xfs/xfs_log_cil.c > +++ b/fs/xfs/xfs_log_cil.c > @@ -1369,18 +1369,27 @@ xlog_cil_push_now( > if (!async) > flush_workqueue(cil->xc_push_wq); > > + spin_lock(&cil->xc_push_lock); > + > + /* > + * If this is an async flush request, we always need to set the > + * xc_push_commit_stable flag even if something else has already queued > + * a push. The flush caller is asking for the CIL to be on stable > + * storage when the next push completes, so regardless of who has queued > + * the push, the flush requires stable semantics from it. > + */ > + cil->xc_push_commit_stable = async; > + > /* > * If the CIL is empty or we've already pushed the sequence then > - * there's no work we need to do. > + * there's no more work that we need to do. > */ > - spin_lock(&cil->xc_push_lock); > if (list_empty(&cil->xc_cil) || push_seq <= cil->xc_push_seq) { > spin_unlock(&cil->xc_push_lock); > return; > } > > cil->xc_push_seq = push_seq; > - cil->xc_push_commit_stable = async; > queue_work(cil->xc_push_wq, &cil->xc_ctx->push_work); > spin_unlock(&cil->xc_push_lock); > } > @@ -1520,6 +1529,13 @@ xlog_cil_flush( > > trace_xfs_log_force(log->l_mp, seq, _RET_IP_); > xlog_cil_push_now(log, seq, true); > + > + /* > + * If the CIL is empty, make sure that any previous checkpoint that may > + * still be in an active iclog is pushed to stable storage. > + */ > + if (list_empty(&log->l_cilp->xc_cil)) > + xfs_log_force(log->l_mp, 0); > } > > /* > -- > 2.35.1 >
On Tue, Mar 15, 2022 at 12:36:24PM -0700, Darrick J. Wong wrote: > On Tue, Mar 15, 2022 at 05:42:38PM +1100, Dave Chinner wrote: > > From: Dave Chinner <dchinner@redhat.com> > > > > When the AIL tries to flush the CIL, it relies on the CIL push > > ending up on stable storage without having to wait for and > > manipulate iclog state directly. However, if there is already a > > pending CIL push when the AIL tries to flush the CIL, it won't set > > the cil->xc_push_commit_stable flag and so the CIL push will not > > actively flush the commit record iclog. > > > > generic/530 when run on a single CPU test VM can trigger this fairly > > reliably. This test exercises unlinked inode recovery, and can > > result in inodes being pinned in memory by ongoing modifications to > > the inode cluster buffer to record unlinked list modifications. As a > > result, the first inode unlinked in a buffer can pin the tail of the > > log whilst the inode cluster buffer is pinned by the current > > checkpoint that has been pushed but isn't on stable storage because > > because the cil->xc_push_commit_stable was not set. This results in > > the log/AIL effectively deadlocking until something triggers the > > commit record iclog to be pushed to stable storage (i.e. the > > periodic log worker calling xfs_log_force()). > > > > The fix is two-fold - first we should always set the > > cil->xc_push_commit_stable when xlog_cil_flush() is called, > > regardless of whether there is already a pending push or not. > > > > Second, if the CIL is empty, we should trigger an iclog flush to > > ensure that the iclogs of the last checkpoint have actually been > > submitted to disk as that checkpoint may not have been run under > > stable completion constraints. > > Can it ever be the case that the CIL is not empty but the last > checkpoint wasn't committed to disk? Yes. But xlog_cil_push_now() will capture that, queue it and mark it as xc_push_commit_stable. Remember that the push_now() code updates the push seq/stable state under down_read(ctx lock) + spin_lock(push lock) context. The push seq/stable state is cleared by the push worker under down_write(ctx lock) + spin_lock(push lock) conditions when it atomically swaps in the new empty CIL context. Hence the push worker will always see the stable flag if it has been set for that push sequence. > Let's say someone else > commits a transaction after the worker samples xc_push_commit_stable? If we race with commits between the xlog_cil_push_now(log, seq, true) and the CIL list_empty check in xlog_cil_flush(), there are two posibilities: 1. the CIL push worker hasn't run and atomically switched in a new CIL context. 2. the CIL push worker has run and switched contexts In the first case, the commit will end up in the same context that xlog_cil_flush() pushed, and it will be stable. That will result in an empty CIL after the CIL push worker runs, but the racing commit will be stable as per the xc_push_commit_stable flag. This can also lead to the CIL being empty by the time the list_empty check is done (because pre-empt), in which case the log force will be a no-op because none of the iclogs need flushing. > IOWs, why does a not-empty CIL mean that the last checkpoint is on disk? In the second case, the CIL push triggered by xlog_cil_push_now() will be stable because xc_push_commit_stable says it must be, and the racing commit will end up in the new CIL context and the CIL won't be empty. We don't need a log force in this case because the previous sequence that was flushed with stable semantics as required. In the case of AIL pushing, we don't actually care about racing CIL commits because we are trying to get pinned AIL items unpinned so we can move the tail of the log forwards. If those pinned items are relogged by racing transactions, then the next call to xlog_cil_flush() from the AIL will get them unpinned and that will move them forward in the log, anyway. Cheers, Dave.
On Wed, Mar 16, 2022 at 08:47:45AM +1100, Dave Chinner wrote: > On Tue, Mar 15, 2022 at 12:36:24PM -0700, Darrick J. Wong wrote: > > On Tue, Mar 15, 2022 at 05:42:38PM +1100, Dave Chinner wrote: > > > From: Dave Chinner <dchinner@redhat.com> > > > > > > When the AIL tries to flush the CIL, it relies on the CIL push > > > ending up on stable storage without having to wait for and > > > manipulate iclog state directly. However, if there is already a > > > pending CIL push when the AIL tries to flush the CIL, it won't set > > > the cil->xc_push_commit_stable flag and so the CIL push will not > > > actively flush the commit record iclog. > > > > > > generic/530 when run on a single CPU test VM can trigger this fairly > > > reliably. This test exercises unlinked inode recovery, and can > > > result in inodes being pinned in memory by ongoing modifications to > > > the inode cluster buffer to record unlinked list modifications. As a > > > result, the first inode unlinked in a buffer can pin the tail of the > > > log whilst the inode cluster buffer is pinned by the current > > > checkpoint that has been pushed but isn't on stable storage because > > > because the cil->xc_push_commit_stable was not set. This results in > > > the log/AIL effectively deadlocking until something triggers the > > > commit record iclog to be pushed to stable storage (i.e. the > > > periodic log worker calling xfs_log_force()). > > > > > > The fix is two-fold - first we should always set the > > > cil->xc_push_commit_stable when xlog_cil_flush() is called, > > > regardless of whether there is already a pending push or not. > > > > > > Second, if the CIL is empty, we should trigger an iclog flush to > > > ensure that the iclogs of the last checkpoint have actually been > > > submitted to disk as that checkpoint may not have been run under > > > stable completion constraints. > > > > Can it ever be the case that the CIL is not empty but the last > > checkpoint wasn't committed to disk? > > Yes. But xlog_cil_push_now() will capture that, queue it and mark > it as xc_push_commit_stable. > > Remember that the push_now() code updates the push seq/stable > state under down_read(ctx lock) + spin_lock(push lock) context. The > push seq/stable state is cleared by the push worker under > down_write(ctx lock) + spin_lock(push lock) conditions when it > atomically swaps in the new empty CIL context. Hence the push worker > will always see the stable flag if it has been set for that push > sequence. > > > Let's say someone else > > commits a transaction after the worker samples xc_push_commit_stable? > > If we race with commits between the xlog_cil_push_now(log, seq, > true) and the CIL list_empty check in xlog_cil_flush(), there are > two posibilities: > > 1. the CIL push worker hasn't run and atomically switched in a new > CIL context. > 2. the CIL push worker has run and switched contexts > > In the first case, the commit will end up in the same context that > xlog_cil_flush() pushed, and it will be stable. That will result in > an empty CIL after the CIL push worker runs, but the racing commit > will be stable as per the xc_push_commit_stable flag. This can also > lead to the CIL being empty by the time the list_empty check is done > (because pre-empt), in which case the log force will be a no-op > because none of the iclogs need flushing. > > > IOWs, why does a not-empty CIL mean that the last checkpoint is on disk? > > In the second case, the CIL push triggered by xlog_cil_push_now() > will be stable because xc_push_commit_stable says it must be, and > the racing commit will end up in the new CIL context and the CIL > won't be empty. We don't need a log force in this case because the > previous sequence that was flushed with stable semantics as required. > > In the case of AIL pushing, we don't actually care about racing CIL > commits because we are trying to get pinned AIL items unpinned so we > can move the tail of the log forwards. If those pinned items are > relogged by racing transactions, then the next call to > xlog_cil_flush() from the AIL will get them unpinned and that will > move them forward in the log, anyway. Ok, that's kinda what I was thinking, but wasn't sure there wasn't some other weird subtlety that I hadn't figured out. Reviewed-by: Darrick J. Wong <djwong@kernel.org> --D --D > > Cheers, > > Dave. > > -- > Dave Chinner > david@fromorbit.com
On 15 Mar 2022 at 12:12, Dave Chinner wrote: > From: Dave Chinner <dchinner@redhat.com> > > When the AIL tries to flush the CIL, it relies on the CIL push > ending up on stable storage without having to wait for and > manipulate iclog state directly. However, if there is already a > pending CIL push when the AIL tries to flush the CIL, it won't set > the cil->xc_push_commit_stable flag and so the CIL push will not > actively flush the commit record iclog. I think the above sentence maps to the following snippet from xlog_cil_push_now(), if (list_empty(&cil->xc_cil) || push_seq <= cil->xc_push_seq) { spin_unlock(&cil->xc_push_lock); return; } i.e. if the CIL sequence that we are trying to push is already being pushed then xlog_cil_push_now() returns without queuing work on cil->xc_push_wq. However, the push_seq could have been previously pushed by, 1. xfsaild_push() In this case, cil->xc_push_commit_stable is set to true. Hence, xlog_cil_push_work() will definitely make sure to submit the commit record iclog for write I/O. 2. xfs_log_force_seq() => xlog_cil_force_seq() xfs_log_force_seq() invokes xlog_force_lsn() after executing xlog_cil_force_seq(). Here, A partially filled iclog will be in XLOG_STATE_ACTIVE state. This will cause xlog_force_and_check_iclog() to be invoked and hence the iclog is submitted for write I/O. In both the cases listed above, iclog is guaranteed to be submitted for I/O without any help from the log worker task. Looks like I am missing something obvious here. > > generic/530 when run on a single CPU test VM can trigger this fairly > reliably. This test exercises unlinked inode recovery, and can > result in inodes being pinned in memory by ongoing modifications to > the inode cluster buffer to record unlinked list modifications. As a > result, the first inode unlinked in a buffer can pin the tail of the > log whilst the inode cluster buffer is pinned by the current > checkpoint that has been pushed but isn't on stable storage because > because the cil->xc_push_commit_stable was not set. This results in > the log/AIL effectively deadlocking until something triggers the > commit record iclog to be pushed to stable storage (i.e. the > periodic log worker calling xfs_log_force()). > > The fix is two-fold - first we should always set the > cil->xc_push_commit_stable when xlog_cil_flush() is called, > regardless of whether there is already a pending push or not. > > Second, if the CIL is empty, we should trigger an iclog flush to > ensure that the iclogs of the last checkpoint have actually been > submitted to disk as that checkpoint may not have been run under > stable completion constraints. > > Reported-and-tested-by: Matthew Wilcox <willy@infradead.org> > Fixes: 0020a190cf3e ("xfs: AIL needs asynchronous CIL forcing") > Signed-off-by: Dave Chinner <dchinner@redhat.com> > --- > fs/xfs/xfs_log_cil.c | 22 +++++++++++++++++++--- > 1 file changed, 19 insertions(+), 3 deletions(-) > > diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c > index 3d8ebf2a1e55..48b16a5feb27 100644 > --- a/fs/xfs/xfs_log_cil.c > +++ b/fs/xfs/xfs_log_cil.c > @@ -1369,18 +1369,27 @@ xlog_cil_push_now( > if (!async) > flush_workqueue(cil->xc_push_wq); > > + spin_lock(&cil->xc_push_lock); > + > + /* > + * If this is an async flush request, we always need to set the > + * xc_push_commit_stable flag even if something else has already queued > + * a push. The flush caller is asking for the CIL to be on stable > + * storage when the next push completes, so regardless of who has queued > + * the push, the flush requires stable semantics from it. > + */ > + cil->xc_push_commit_stable = async; > + > /* > * If the CIL is empty or we've already pushed the sequence then > - * there's no work we need to do. > + * there's no more work that we need to do. > */ > - spin_lock(&cil->xc_push_lock); > if (list_empty(&cil->xc_cil) || push_seq <= cil->xc_push_seq) { > spin_unlock(&cil->xc_push_lock); > return; > } > > cil->xc_push_seq = push_seq; > - cil->xc_push_commit_stable = async; > queue_work(cil->xc_push_wq, &cil->xc_ctx->push_work); > spin_unlock(&cil->xc_push_lock); > } > @@ -1520,6 +1529,13 @@ xlog_cil_flush( > > trace_xfs_log_force(log->l_mp, seq, _RET_IP_); > xlog_cil_push_now(log, seq, true); > + > + /* > + * If the CIL is empty, make sure that any previous checkpoint that may > + * still be in an active iclog is pushed to stable storage. > + */ > + if (list_empty(&log->l_cilp->xc_cil)) > + xfs_log_force(log->l_mp, 0); > } > > /*
On Wed, Mar 16, 2022 at 04:04:55PM +0530, Chandan Babu R wrote: > On 15 Mar 2022 at 12:12, Dave Chinner wrote: > > From: Dave Chinner <dchinner@redhat.com> > > > > When the AIL tries to flush the CIL, it relies on the CIL push > > ending up on stable storage without having to wait for and > > manipulate iclog state directly. However, if there is already a > > pending CIL push when the AIL tries to flush the CIL, it won't set > > the cil->xc_push_commit_stable flag and so the CIL push will not > > actively flush the commit record iclog. > > I think the above sentence maps to the following snippet from > xlog_cil_push_now(), > > if (list_empty(&cil->xc_cil) || push_seq <= cil->xc_push_seq) { > spin_unlock(&cil->xc_push_lock); > return; > } > > i.e. if the CIL sequence that we are trying to push is already being pushed > then xlog_cil_push_now() returns without queuing work on cil->xc_push_wq. > > However, the push_seq could have been previously pushed by, > 1. xfsaild_push() > In this case, cil->xc_push_commit_stable is set to true. Hence, > xlog_cil_push_work() will definitely make sure to submit the commit record > iclog for write I/O. > 2. xfs_log_force_seq() => xlog_cil_force_seq() > xfs_log_force_seq() invokes xlog_force_lsn() after executing > xlog_cil_force_seq(). Here, A partially filled iclog will be in > XLOG_STATE_ACTIVE state. This will cause xlog_force_and_check_iclog() to be > invoked and hence the iclog is submitted for write I/O. > > In both the cases listed above, iclog is guaranteed to be submitted for I/O > without any help from the log worker task. > > Looks like I am missing something obvious here. Pushes triggered by xlog_cil_push_background() can complete leaving the partially filled iclog in ACTIVE state. Then xlog_cil_push_now() does nothing because it doesn't trigger a new CIL push and so setting the cil->xc_push_commit_stable flag doesn't trigger a flush of the ACTIVE iclog. The AIL flush does not use xfs_log_force_seq() because that blocks waiting for the entire CIL to hit the disk before it can force the last iclog to disk. Hence the second piece of this patch is necessary, and that is to call xfs_log_force() if the CIL is empty (i.e. the case where xlog_cil_push_now() is a no-op because the CIL is empty due to background pushes). Cheers, Dave.
On 17 Mar 2022 at 04:54, Dave Chinner wrote: > On Wed, Mar 16, 2022 at 04:04:55PM +0530, Chandan Babu R wrote: >> On 15 Mar 2022 at 12:12, Dave Chinner wrote: >> > From: Dave Chinner <dchinner@redhat.com> >> > >> > When the AIL tries to flush the CIL, it relies on the CIL push >> > ending up on stable storage without having to wait for and >> > manipulate iclog state directly. However, if there is already a >> > pending CIL push when the AIL tries to flush the CIL, it won't set >> > the cil->xc_push_commit_stable flag and so the CIL push will not >> > actively flush the commit record iclog. >> >> I think the above sentence maps to the following snippet from >> xlog_cil_push_now(), >> >> if (list_empty(&cil->xc_cil) || push_seq <= cil->xc_push_seq) { >> spin_unlock(&cil->xc_push_lock); >> return; >> } >> >> i.e. if the CIL sequence that we are trying to push is already being pushed >> then xlog_cil_push_now() returns without queuing work on cil->xc_push_wq. >> >> However, the push_seq could have been previously pushed by, >> 1. xfsaild_push() >> In this case, cil->xc_push_commit_stable is set to true. Hence, >> xlog_cil_push_work() will definitely make sure to submit the commit record >> iclog for write I/O. >> 2. xfs_log_force_seq() => xlog_cil_force_seq() >> xfs_log_force_seq() invokes xlog_force_lsn() after executing >> xlog_cil_force_seq(). Here, A partially filled iclog will be in >> XLOG_STATE_ACTIVE state. This will cause xlog_force_and_check_iclog() to be >> invoked and hence the iclog is submitted for write I/O. >> >> In both the cases listed above, iclog is guaranteed to be submitted for I/O >> without any help from the log worker task. >> >> Looks like I am missing something obvious here. > > Pushes triggered by xlog_cil_push_background() can complete leaving > the partially filled iclog in ACTIVE state. Then xlog_cil_push_now() > does nothing because it doesn't trigger a new CIL push and so > setting the cil->xc_push_commit_stable flag doesn't trigger a flush > of the ACTIVE iclog. Ah. I had missed xlog_cil_push_background(). > > The AIL flush does not use xfs_log_force_seq() because that blocks > waiting for the entire CIL to hit the disk before it can force the > last iclog to disk. Hence the second piece of this patch is > necessary, and that is to call xfs_log_force() if the CIL is empty > (i.e. the case where xlog_cil_push_now() is a no-op because the > CIL is empty due to background pushes). > Thanks for clarifying my doubts.
diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c index 3d8ebf2a1e55..48b16a5feb27 100644 --- a/fs/xfs/xfs_log_cil.c +++ b/fs/xfs/xfs_log_cil.c @@ -1369,18 +1369,27 @@ xlog_cil_push_now( if (!async) flush_workqueue(cil->xc_push_wq); + spin_lock(&cil->xc_push_lock); + + /* + * If this is an async flush request, we always need to set the + * xc_push_commit_stable flag even if something else has already queued + * a push. The flush caller is asking for the CIL to be on stable + * storage when the next push completes, so regardless of who has queued + * the push, the flush requires stable semantics from it. + */ + cil->xc_push_commit_stable = async; + /* * If the CIL is empty or we've already pushed the sequence then - * there's no work we need to do. + * there's no more work that we need to do. */ - spin_lock(&cil->xc_push_lock); if (list_empty(&cil->xc_cil) || push_seq <= cil->xc_push_seq) { spin_unlock(&cil->xc_push_lock); return; } cil->xc_push_seq = push_seq; - cil->xc_push_commit_stable = async; queue_work(cil->xc_push_wq, &cil->xc_ctx->push_work); spin_unlock(&cil->xc_push_lock); } @@ -1520,6 +1529,13 @@ xlog_cil_flush( trace_xfs_log_force(log->l_mp, seq, _RET_IP_); xlog_cil_push_now(log, seq, true); + + /* + * If the CIL is empty, make sure that any previous checkpoint that may + * still be in an active iclog is pushed to stable storage. + */ + if (list_empty(&log->l_cilp->xc_cil)) + xfs_log_force(log->l_mp, 0); } /*