Message ID | 20190904042451.9314-8-david@fromorbit.com (mailing list archive) |
---|---|
State | Superseded |
Series | xfs: log race fixes and cleanups |
> + ASSERT(XFS_LSN_CMP(atomic64_read(&log->l_last_sync_lsn), header_lsn) <= 0);
This adds an > 80 char line.
Otherwise this looks sensible to me.
On Wed, Sep 04, 2019 at 02:24:51PM +1000, Dave Chinner wrote: > From: Dave Chinner <dchinner@redhat.com> > > When the log fills up, we can get into the state where the > outstanding items in the CIL being committed and aggregated are > larger than the range that the reservation grant head tail pushing > will attempt to clean. This can result in the tail pushing range > being trimmed back to the the log head (l_last_sync_lsn) and so > may not actually move the push target at all. > > When the iclogs associated with the CIL commit finally land, the > log head moves forward, and this removes the restriction on the AIL > push target. However, if we already have transactions sleeping on > the grant head, and there's nothing in the AIL still to flush from > the current push target, then nothing will move the tail of the log > and trigger a log reservation wakeup. > > Hence the there is nothing that will trigger xlog_grant_push_ail() > to recalculate the AIL push target and start pushing on the AIL > again to write back the metadata objects that pin the tail of the > log and hence free up space and allow the transaction reservations > to be woken and make progress. > > Hence we need to push on the grant head when we move the log head > forward, as this may be the only trigger we have that can move the > AIL push target forwards in this situation. > > Signed-off-by: Dave Chinner <dchinner@redhat.com> > --- > fs/xfs/xfs_log.c | 72 +++++++++++++++++++++++++++++++----------------- > 1 file changed, 47 insertions(+), 25 deletions(-) > > diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c > index 6a59d71d4c60..733693e1ac9f 100644 > --- a/fs/xfs/xfs_log.c > +++ b/fs/xfs/xfs_log.c > @@ -2632,6 +2632,46 @@ xlog_get_lowest_lsn( > return lowest_lsn; > } > > +/* > + * Completion of a iclog IO does not imply that a transaction has completed, as > + * transactions can be large enough to span many iclogs. 
We cannot change the > + * tail of the log half way through a transaction as this may be the only > + * transaction in the log and moving the tail to point to the middle of it > + * will prevent recovery from finding the start of the transaction. Hence we > + * should only update the last_sync_lsn if this iclog contains transaction > + * completion callbacks on it. > + * > + * We have to do this before we drop the icloglock to ensure we are the only one > + * that can update it. > + * > + * If we are moving the last_sync_lsn forwards, we also need to ensure we kick > + * the reservation grant head pushing. This is due to the fact that the push > + * target is bound by the current last_sync_lsn value. Hence if we have a large > + * amount of log space bound up in this committing transaction then the > + * last_sync_lsn value may be the limiting factor preventing tail pushing from > + * freeing space in the log. Hence once we've updated the last_sync_lsn we > + * should push the AIL to ensure the push target (and hence the grant head) is > + * no longer bound by the old log head location and can move forwards and make > + * progress again. > + */ > +static void > +xlog_state_set_callback( > + struct xlog *log, > + struct xlog_in_core *iclog, > + xfs_lsn_t header_lsn) > +{ > + iclog->ic_state = XLOG_STATE_CALLBACK; > + > + ASSERT(XFS_LSN_CMP(atomic64_read(&log->l_last_sync_lsn), header_lsn) <= 0); > + > + if (list_empty_careful(&iclog->ic_callbacks)) > + return; > + > + atomic64_set(&log->l_last_sync_lsn, header_lsn); > + xlog_grant_push_ail(log, 0); > + Nit: extra whitespace line above. This still seems racy to me, FWIW. What if the AIL is empty (i.e. the push is skipped)? What if xfsaild completes this push before the associated log items land in the AIL or we race with xfsaild emptying the AIL? Why not just reuse/update the existing grant head wake up logic in the iclog callback itself? 
E.g., something like the following (untested):

@@ -740,12 +740,10 @@ xfs_trans_ail_update_bulk(
 	if (mlip_changed) {
 		if (!XFS_FORCED_SHUTDOWN(ailp->ail_mount))
 			xlog_assign_tail_lsn_locked(ailp->ail_mount);
-		spin_unlock(&ailp->ail_lock);
-
-		xfs_log_space_wake(ailp->ail_mount);
-	} else {
-		spin_unlock(&ailp->ail_lock);
 	}
+
+	spin_unlock(&ailp->ail_lock);
+	xfs_log_space_wake(ailp->ail_mount);
 }

... seems to solve the same prospective problem without being racy and
with less, and simpler, code. Hm?

Brian

> +}
> +
>  /*
>   * Return true if we need to stop processing, false to continue to the next
>   * iclog. The caller will need to run callbacks if the iclog is returned in the
> @@ -2644,6 +2684,7 @@ xlog_state_iodone_process_iclog(
>  	struct xlog_in_core *completed_iclog)
>  {
>  	xfs_lsn_t lowest_lsn;
> +	xfs_lsn_t header_lsn;
>
>  	/* Skip all iclogs in the ACTIVE & DIRTY states */
>  	if (iclog->ic_state & (XLOG_STATE_ACTIVE|XLOG_STATE_DIRTY))
> @@ -2681,34 +2722,15 @@ xlog_state_iodone_process_iclog(
>  	 * callbacks) see the above if.
>  	 *
>  	 * We will do one more check here to see if we have chased our tail
> -	 * around.
> +	 * around. If this is not the lowest lsn iclog, then we will leave it
> +	 * for another completion to process.
>  	 */
> +	header_lsn = be64_to_cpu(iclog->ic_header.h_lsn);
>  	lowest_lsn = xlog_get_lowest_lsn(log);
> -	if (lowest_lsn &&
> -	    XFS_LSN_CMP(lowest_lsn, be64_to_cpu(iclog->ic_header.h_lsn)) < 0)
> -		return false; /* Leave this iclog for another thread */
> -
> -	iclog->ic_state = XLOG_STATE_CALLBACK;
> -
> -	/*
> -	 * Completion of a iclog IO does not imply that a transaction has
> -	 * completed, as transactions can be large enough to span many iclogs.
> -	 * We cannot change the tail of the log half way through a transaction
> -	 * as this may be the only transaction in the log and moving th etail to
> -	 * point to the middle of it will prevent recovery from finding the
> -	 * start of the transaction.
Hence we should only update the > - * last_sync_lsn if this iclog contains transaction completion callbacks > - * on it. > - * > - * We have to do this before we drop the icloglock to ensure we are the > - * only one that can update it. > - */ > - ASSERT(XFS_LSN_CMP(atomic64_read(&log->l_last_sync_lsn), > - be64_to_cpu(iclog->ic_header.h_lsn)) <= 0); > - if (!list_empty_careful(&iclog->ic_callbacks)) > - atomic64_set(&log->l_last_sync_lsn, > - be64_to_cpu(iclog->ic_header.h_lsn)); > + if (lowest_lsn && XFS_LSN_CMP(lowest_lsn, header_lsn) < 0) > + return false; > > + xlog_state_set_callback(log, iclog, header_lsn); > return false; > > } > -- > 2.23.0.rc1 >
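The rule described in the quoted comment (only advance last_sync_lsn when the completing iclog carries commit callbacks, then kick the push) can be modelled in userspace. The following is a minimal sketch, not the kernel code: lsn_t, make_lsn(), lsn_cmp(), struct model_log, struct model_iclog and model_set_callback() are all hypothetical stand-ins, and the cycle-in-high-bits, block-in-low-bits LSN layout is assumed here for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for xfs_lsn_t: cycle in the high 32 bits, block in the low 32. */
typedef int64_t lsn_t;

static lsn_t make_lsn(uint32_t cycle, uint32_t block)
{
	return (lsn_t)(((uint64_t)cycle << 32) | block);
}

static uint32_t lsn_cycle(lsn_t lsn) { return (uint32_t)((uint64_t)lsn >> 32); }
static uint32_t lsn_block(lsn_t lsn) { return (uint32_t)lsn; }

/* Three-way compare: cycle first, then block. Returns <0, 0 or >0. */
static int lsn_cmp(lsn_t a, lsn_t b)
{
	if (lsn_cycle(a) != lsn_cycle(b))
		return lsn_cycle(a) < lsn_cycle(b) ? -1 : 1;
	if (lsn_block(a) != lsn_block(b))
		return lsn_block(a) < lsn_block(b) ? -1 : 1;
	return 0;
}

/* Much-reduced, hypothetical model of the log and of one iclog. */
struct model_log {
	lsn_t	last_sync_lsn;	/* models log->l_last_sync_lsn */
	int	push_kicks;	/* counts modelled xlog_grant_push_ail() calls */
};

struct model_iclog {
	lsn_t	header_lsn;
	bool	has_callbacks;	/* models a non-empty iclog->ic_callbacks list */
};

static void model_set_callback(struct model_log *log,
			       const struct model_iclog *iclog)
{
	/* The on-disk log head must never move backwards. */
	assert(lsn_cmp(log->last_sync_lsn, iclog->header_lsn) <= 0);

	/* No commit record completed in this iclog: leave the head alone. */
	if (!iclog->has_callbacks)
		return;

	log->last_sync_lsn = iclog->header_lsn;
	log->push_kicks++;	/* stands in for xlog_grant_push_ail(log, 0) */
}
```

In this model an iclog from the middle of a checkpoint (has_callbacks == false) leaves the head and the push count untouched, while the iclog holding the commit record moves both.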
On Tue, Sep 03, 2019 at 11:45:10PM -0700, Christoph Hellwig wrote: > > + ASSERT(XFS_LSN_CMP(atomic64_read(&log->l_last_sync_lsn), header_lsn) <= 0); > > This adds an > 80 char line. Fixed.
On Wed, Sep 04, 2019 at 03:34:42PM -0400, Brian Foster wrote: > On Wed, Sep 04, 2019 at 02:24:51PM +1000, Dave Chinner wrote: > > From: Dave Chinner <dchinner@redhat.com> > > +/* > > + * Completion of a iclog IO does not imply that a transaction has completed, as > > + * transactions can be large enough to span many iclogs. We cannot change the > > + * tail of the log half way through a transaction as this may be the only > > + * transaction in the log and moving the tail to point to the middle of it > > + * will prevent recovery from finding the start of the transaction. Hence we > > + * should only update the last_sync_lsn if this iclog contains transaction > > + * completion callbacks on it. > > + * > > + * We have to do this before we drop the icloglock to ensure we are the only one > > + * that can update it. > > + * > > + * If we are moving the last_sync_lsn forwards, we also need to ensure we kick > > + * the reservation grant head pushing. This is due to the fact that the push > > + * target is bound by the current last_sync_lsn value. Hence if we have a large > > + * amount of log space bound up in this committing transaction then the > > + * last_sync_lsn value may be the limiting factor preventing tail pushing from > > + * freeing space in the log. Hence once we've updated the last_sync_lsn we > > + * should push the AIL to ensure the push target (and hence the grant head) is > > + * no longer bound by the old log head location and can move forwards and make > > + * progress again. 
> > + */ > > +static void > > +xlog_state_set_callback( > > + struct xlog *log, > > + struct xlog_in_core *iclog, > > + xfs_lsn_t header_lsn) > > +{ > > + iclog->ic_state = XLOG_STATE_CALLBACK; > > + > > + ASSERT(XFS_LSN_CMP(atomic64_read(&log->l_last_sync_lsn), header_lsn) <= 0); > > + > > + if (list_empty_careful(&iclog->ic_callbacks)) > > + return; > > + > > + atomic64_set(&log->l_last_sync_lsn, header_lsn); > > + xlog_grant_push_ail(log, 0); > > + > > Nit: extra whitespace line above. Fixed. > This still seems racy to me, FWIW. What if the AIL is empty (i.e. the > push is skipped)? If the AIL is empty, then it's a no-op because pushing on the AIL is not going to make more log space become free. Besides, that's not the problem being solved here - reservation wakeups on first insert into the AIL are already handled by xfs_trans_ail_update_bulk() and hence the first patch in the series. This patch is addressing the situation where the bulk insert that occurs from the callbacks that are about to run -does not modify the tail of the log-. i.e. the commit moved the head but not the tail, so we have to update the AIL push target to take into account the new log head.... i.e. the AIL is for moving the tail of the log - this code moves the head of the log. But both impact on the AIL push target (it is based on the distance between the head and tail), so we need to update the push target just in case this commit does not move the tail... > What if xfsaild completes this push before the > associated log items land in the AIL or we race with xfsaild emptying > the AIL? Why not just reuse/update the existing grant head wake up logic > in the iclog callback itself? 
> E.g., something like the following
> (untested):
>
> @@ -740,12 +740,10 @@ xfs_trans_ail_update_bulk(
>  	if (mlip_changed) {
>  		if (!XFS_FORCED_SHUTDOWN(ailp->ail_mount))
>  			xlog_assign_tail_lsn_locked(ailp->ail_mount);
> -		spin_unlock(&ailp->ail_lock);
> -
> -		xfs_log_space_wake(ailp->ail_mount);
> -	} else {
> -		spin_unlock(&ailp->ail_lock);
>  	}
> +
> +	spin_unlock(&ailp->ail_lock);
> +	xfs_log_space_wake(ailp->ail_mount);

Two things that I see straight away:

1. if the AIL is empty, the first insert does not set mlip_changed =
true and so there will be no wakeup in the scenario you are posing.
This would be easy to fix - if (!mlip || changed) - so that a wakeup
is triggered in this case.

2. if we have not moved the tail, then calling xfs_log_space_wake()
will, at best, just burn CPU. At worst, it will cause hundreds of
thousands of spurious wakeups a second because the waiting
transaction reservation will be woken continuously when there isn't
space available and there hasn't been any space made available.

So, from #1 we see that unconditional wakeups are not necessary in
the scenario you pose, and from #2 it's not a viable solution even
if it was required.

However, #1 indicates other problems if a xfs_log_space_wake() call
is necessary in this case. No reservation space and an empty AIL
implies that the CIL pins the entire log - a pending commit that
hasn't finished flushing and the current context that is
aggregating. This implies we've violated a much more important rule
of the on-disk log format: finding the head and tail of the log
requires no individual commit be larger than 50% of the log.

So if we are actually stalling on transaction reservations with an
empty AIL and an uncommitted CIL, screwing around with tail pushing
wakeups does not address the bigger problem being seen...

Cheers,

Dave.
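The point that the push target depends on both the tail and the head can be illustrated with a small model. This is a simplified sketch under stated assumptions, not the kernel's xlog_grant_push_ail(): positions are linearised block counts rather than cycle/block LSNs, the threshold is assumed to be the larger of a quarter of the log and 256 blocks, and struct push_model and model_push_target() are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, linearised view of the log (no cycle wrapping). */
struct push_model {
	uint64_t tail;		/* tail position, in blocks */
	uint64_t last_sync;	/* on-disk head, models l_last_sync_lsn */
	uint64_t log_blocks;	/* total log size, in blocks */
	uint64_t free_blocks;	/* space left for new reservations */
};

static uint64_t max_u64(uint64_t a, uint64_t b)
{
	return a > b ? a : b;
}

/*
 * Returns false if enough space is free and no push is needed.
 * Otherwise stores the push target in *target and returns true.
 */
static bool model_push_target(const struct push_model *m, uint64_t *target)
{
	uint64_t threshold = max_u64(m->log_blocks / 4, 256);

	assert(target != NULL);

	if (m->free_blocks >= threshold)
		return false;

	/* Aim one threshold's worth of blocks past the tail ... */
	*target = m->tail + threshold;

	/* ... but never beyond what is known to be on disk. */
	if (*target > m->last_sync)
		*target = m->last_sync;
	return true;
}
```

With a 4096-block log, a tail at 1000 and the on-disk head at 1200, the target is clamped to 1200. Once the head moves to 3000 the same call yields 2024, which is why something has to recompute the target when the head advances even though the tail has not moved.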
On Thu, Sep 05, 2019 at 08:50:56AM +1000, Dave Chinner wrote: > On Wed, Sep 04, 2019 at 03:34:42PM -0400, Brian Foster wrote: > > On Wed, Sep 04, 2019 at 02:24:51PM +1000, Dave Chinner wrote: > > > From: Dave Chinner <dchinner@redhat.com> > > > +/* > > > + * Completion of a iclog IO does not imply that a transaction has completed, as > > > + * transactions can be large enough to span many iclogs. We cannot change the > > > + * tail of the log half way through a transaction as this may be the only > > > + * transaction in the log and moving the tail to point to the middle of it > > > + * will prevent recovery from finding the start of the transaction. Hence we > > > + * should only update the last_sync_lsn if this iclog contains transaction > > > + * completion callbacks on it. > > > + * > > > + * We have to do this before we drop the icloglock to ensure we are the only one > > > + * that can update it. > > > + * > > > + * If we are moving the last_sync_lsn forwards, we also need to ensure we kick > > > + * the reservation grant head pushing. This is due to the fact that the push > > > + * target is bound by the current last_sync_lsn value. Hence if we have a large > > > + * amount of log space bound up in this committing transaction then the > > > + * last_sync_lsn value may be the limiting factor preventing tail pushing from > > > + * freeing space in the log. Hence once we've updated the last_sync_lsn we > > > + * should push the AIL to ensure the push target (and hence the grant head) is > > > + * no longer bound by the old log head location and can move forwards and make > > > + * progress again. 
> > > + */ > > > +static void > > > +xlog_state_set_callback( > > > + struct xlog *log, > > > + struct xlog_in_core *iclog, > > > + xfs_lsn_t header_lsn) > > > +{ > > > + iclog->ic_state = XLOG_STATE_CALLBACK; > > > + > > > + ASSERT(XFS_LSN_CMP(atomic64_read(&log->l_last_sync_lsn), header_lsn) <= 0); > > > + > > > + if (list_empty_careful(&iclog->ic_callbacks)) > > > + return; > > > + > > > + atomic64_set(&log->l_last_sync_lsn, header_lsn); > > > + xlog_grant_push_ail(log, 0); > > > + > > > > Nit: extra whitespace line above. > > Fixed. > > > This still seems racy to me, FWIW. What if the AIL is empty (i.e. the > > push is skipped)? > > If the AIL is empty, then it's a no-op because pushing on the AIL is > not going to make more log space become free. Besides, that's not > the problem being solved here - reservation wakeups on first insert > into the AIL are already handled by xfs_trans_ail_update_bulk() and > hence the first patch in the series. This patch is addressing the Nothing currently wakes up reservation waiters on first AIL insertion. I pointed this out in the original thread along with the fact that the push is a no-op for an empty AIL. What wasn't clear to me is whether it matters for the problem this patch is trying to fix. It sounds like not, but that's a separate question from whether this is a problem itself. > situation where the bulk insert that occurs from the callbacks that > are about to run -does not modify the tail of the log-. i.e. the > commit moved the head but not the tail, so we have to update the AIL > push target to take into account the new log head.... > Ok, I figured based on process of elimination. xfs_ail_push() ignores the push on an empty AIL and we obviously already have wakeups on tail updates. > i.e. the AIL is for moving the tail of the log - this code moves the > head of the log. 
But both impact on the AIL push target (it is based on > the distance between the head and tail), so we need > to update the push target just in case this commit does not move > the tail... > > > What if xfsaild completes this push before the > > associated log items land in the AIL or we race with xfsaild emptying > > the AIL? Why not just reuse/update the existing grant head wake up logic > > in the iclog callback itself? E.g., something like the following > > (untested): > > And the raciness concerns..? AFAICT this still opens a race window where the AIL can idle on the push target before AIL insertion. > > @@ -740,12 +740,10 @@ xfs_trans_ail_update_bulk( > > if (mlip_changed) { > > if (!XFS_FORCED_SHUTDOWN(ailp->ail_mount)) > > xlog_assign_tail_lsn_locked(ailp->ail_mount); > > - spin_unlock(&ailp->ail_lock); > > - > > - xfs_log_space_wake(ailp->ail_mount); > > - } else { > > - spin_unlock(&ailp->ail_lock); > > } > > + > > + spin_unlock(&ailp->ail_lock); > > + xfs_log_space_wake(ailp->ail_mount); > > Two things that I see straight away: > > 1. if the AIL is empty, the first insert does not set mlip_changed = > true and and so there will be no wakeup in the scenario you are > posing. This would be easy to fix - if (!mlip || changed) - so that > a wakeup is triggered in this case. > This (again) was what I suggested originally in Chandan's thread for the empty AIL case. > 2. if we have not moved the tail, then calling xfs_log_space_wake() > will, at best, just burn CPU. At worst, it wll cause hundreds of > thousands of spurious wakeups a seconds because the waiting > transaction reservation will be woken continuously when there isn't > space available and there hasn't been any space made available. > Yes, I can see how that would be problematic with the diff I posted above. It's also something that can be easily fixed. 
Note that I think there's another potential side effect of that diff
in terms of amplifying pressure on the AIL because we don't know
whether the waiters were blocked because of pent up in-core
reservation consumption or simply because the tail is pinned. That
said, I think both patches share that particular quirk.

Either way, this doesn't address the raciness concern I have with this
patch. If you're wedded to this particular approach, then the simplest
fix is probably to just reorder the xlog_grant_push_ail() call properly
after processing iclog callbacks. A more appropriate fix, IMO, would be
to either export this logic to where the AIL update happens and/or
enhance the existing log space wake up logic to filter wakeups in the
scenarios where it is not necessary (i.e. no tail update &&
xa_push_target == max_lsn), but this is more subjective...

> So, from #1 we see that unconditional wakeups are not necessary in
> the scenario you pose, and from #2 it's not a viable solution even
> if it was required.
>
> However, #1 indicates other problems if a xfs_log_space_wake() call
> is necessary in this case. No reservations space and an empty AIL
> implies that the CIL pins the entire log - a pending commit that
> hasn't finished flushing and the current context that is
> aggregating. This implies we've violated a much more important rule
> of the on-disk log format: finding the head and tail of the log
> requires no individual commit be larger than 50% of the log.
>

I described this exact problem days ago in the original thread. There's
no need to rehash it here. FWIW, I can reproduce much worse than 50% log
consumption aggregated outside of the AIL with the current code and it
doesn't depend on a nonpreemptible kernel (though the workqueue fix
looks legit to me).

Brian

> So if we are actually stalling on trasnaction reservations with an
> empty AIL and an uncommitted CIL, screwing around with tail pushing
> wakeups does not address the bigger problem being seen...
> > Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com
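The wakeup-filtering question debated above (wake on every bulk update, only when the tail moves, or also on first insert into an empty AIL) can be made concrete with a toy model. This sketch is not taken from either patch: the AIL is modelled as a flat array of linearised LSNs, NO_ITEM marks an item that is not in the AIL, and model_ail_min() and model_ail_update() are hypothetical names. The return value mirrors the `!mlip || changed` condition mentioned in the thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define NO_ITEM UINT64_MAX	/* slot is not currently in the modelled AIL */

/* Lowest LSN in the modelled AIL, or NO_ITEM if it is empty. */
static uint64_t model_ail_min(const uint64_t *lsns, size_t n)
{
	uint64_t min = NO_ITEM;

	for (size_t i = 0; i < n; i++)
		if (lsns[i] < min)
			min = lsns[i];
	return min;
}

/*
 * Insert or relog item idx at new_lsn. Returns true if reservation
 * waiters should be woken: either the AIL was empty beforehand or the
 * minimum LSN (the log tail in this model) changed.
 */
static bool model_ail_update(uint64_t *lsns, size_t n, size_t idx,
			     uint64_t new_lsn)
{
	assert(idx < n);

	uint64_t old_min = model_ail_min(lsns, n);

	lsns[idx] = new_lsn;
	return old_min == NO_ITEM || model_ail_min(lsns, n) != old_min;
}
```

In this model, inserting or relogging an item that is not the tail item returns false, so a commit that moves only the head produces no wakeup. An unconditional wake would instead fire on every one of those updates.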
On Thu, Sep 05, 2019 at 12:25:33PM -0400, Brian Foster wrote: > On Thu, Sep 05, 2019 at 08:50:56AM +1000, Dave Chinner wrote: > > On Wed, Sep 04, 2019 at 03:34:42PM -0400, Brian Foster wrote: > > > On Wed, Sep 04, 2019 at 02:24:51PM +1000, Dave Chinner wrote: > > > > From: Dave Chinner <dchinner@redhat.com> > > > > +/* > > > > + * Completion of a iclog IO does not imply that a transaction has completed, as > > > > + * transactions can be large enough to span many iclogs. We cannot change the > > > > + * tail of the log half way through a transaction as this may be the only > > > > + * transaction in the log and moving the tail to point to the middle of it > > > > + * will prevent recovery from finding the start of the transaction. Hence we > > > > + * should only update the last_sync_lsn if this iclog contains transaction > > > > + * completion callbacks on it. > > > > + * > > > > + * We have to do this before we drop the icloglock to ensure we are the only one > > > > + * that can update it. > > > > + * > > > > + * If we are moving the last_sync_lsn forwards, we also need to ensure we kick > > > > + * the reservation grant head pushing. This is due to the fact that the push > > > > + * target is bound by the current last_sync_lsn value. Hence if we have a large > > > > + * amount of log space bound up in this committing transaction then the > > > > + * last_sync_lsn value may be the limiting factor preventing tail pushing from > > > > + * freeing space in the log. Hence once we've updated the last_sync_lsn we > > > > + * should push the AIL to ensure the push target (and hence the grant head) is > > > > + * no longer bound by the old log head location and can move forwards and make > > > > + * progress again. 
> > > > + */ > > > > +static void > > > > +xlog_state_set_callback( > > > > + struct xlog *log, > > > > + struct xlog_in_core *iclog, > > > > + xfs_lsn_t header_lsn) > > > > +{ > > > > + iclog->ic_state = XLOG_STATE_CALLBACK; > > > > + > > > > + ASSERT(XFS_LSN_CMP(atomic64_read(&log->l_last_sync_lsn), header_lsn) <= 0); > > > > + > > > > + if (list_empty_careful(&iclog->ic_callbacks)) > > > > + return; > > > > + > > > > + atomic64_set(&log->l_last_sync_lsn, header_lsn); > > > > + xlog_grant_push_ail(log, 0); > > > > + > > > > > > Nit: extra whitespace line above. > > > > Fixed. > > > > > This still seems racy to me, FWIW. What if the AIL is empty (i.e. the > > > push is skipped)? > > > > If the AIL is empty, then it's a no-op because pushing on the AIL is > > not going to make more log space become free. Besides, that's not > > the problem being solved here - reservation wakeups on first insert > > into the AIL are already handled by xfs_trans_ail_update_bulk() and > > hence the first patch in the series. This patch is addressing the > > Nothing currently wakes up reservation waiters on first AIL insertion. Nor should it be necessary - it's the removal from the AIL that frees up log space, not insertion. The update operation is a remove followed by an insert - the remove part of that operation is what may free up log space, not the insert. Hence if we need to wake the log reservation waiters on first AIL insert to fix a bug, we haven't found the underlying problem is preventing log space from being freed... > > > i.e. the AIL is for moving the tail of the log - this code moves the > > head of the log. But both impact on the AIL push target (it is based on > > the distance between the head and tail), so we need > > to update the push target just in case this commit does not move > > the tail... > > > > > What if xfsaild completes this push before the > > > associated log items land in the AIL or we race with xfsaild emptying > > > the AIL? 
> > > Why not just reuse/update the existing grant head wake up logic
> > > in the iclog callback itself? E.g., something like the following
> > > (untested):
> > >
>
> And the raciness concerns..? AFAICT this still opens a race window where
> the AIL can idle on the push target before AIL insertion.

I don't know what race you see - if the AIL completes a push before
we insert new objects at the head from the current commit, then it
does not matter one bit because the items are being inserted at the
log head, not the log tail where the pushing occurs. If we are
inserting objects into the AIL within the push target window, then
there is something else very wrong going on, because when the log is
out of space the push target should be nowhere near the LSN we are
inserting new objects into the AIL at. (i.e. they should be 3/4s of
the log apart...)

> > So, from #1 we see that unconditional wakeups are not necessary in
> > the scenario you pose, and from #2 it's not a viable solution even
> > if it was required.
> >
> > However, #1 indicates other problems if a xfs_log_space_wake() call
> > is necessary in this case. No reservations space and an empty AIL
> > implies that the CIL pins the entire log - a pending commit that
> > hasn't finished flushing and the current context that is
> > aggregating. This implies we've violated a much more important rule
> > of the on-disk log format: finding the head and tail of the log
> > requires no individual commit be larger than 50% of the log.
> >
>
> I described this exact problem days ago in the original thread. There's
> no need to rehash it here. FWIW, I can reproduce much worse than 50% log
> consumption aggregated outside of the AIL with the current code and it
> doesn't depend on a nonpreemptible kernel (though the workqueue fix
> looks legit to me).

I'm not rehashing anything intentionally - I'm responding to the
questions you are asking me directly in this thread. Maybe I am
going over something you've already mentioned in a previous thread,
and maybe that hasn't occurred to me because you didn't reference it
and the similarities didn't occur to me because I've spent more time
looking at the code trying to understand how this "impossible
situation" was occurring than reading mailing list discussions.

I've been certain that what we were seeing was some fundamental rule
being violated to cause this "log full, AIL empty" state, but I
couldn't work out exactly what it was. I was even questioning whether
I understood the basic operation of the log because there was no way
I could see that the CIL would not push during log recovery until the
log was full.

I said this to Darrick yesterday morning on #xfs:

[5/9/19 12:56] <dchinner> there's something bothering me about this log head update thing and I can't put my finger on what it is....

It wasn't until Chandan's trace showed me the CPU hold-off problem
with the CIL workqueue that I worked out what it was. A couple of
hours later, after I'd seen Chandan's trace:

[5/9/19 14:26] <dchinner> oooohhhh
[5/9/19 14:27] <dchinner> this isn't a premeptible kernel, is it?

And that was the thing that I couldn't put my finger on - I couldn't
work out how a CIL push was being delayed so long on a multi-cpu
system with lots of idle CPU that we had a completely empty AIL when
we ran out of reservation space. IOWs, I didn't know the right
question to ask until I saw the answer in front of me.

I've never seen a "CIL checkpoint too large" issue manifest in the
real world, but it's been there since delayed logging was introduced.
I knew about this issue right from the start, but it was largely a
theoretical concern because workqueue scheduling preempts userspace
and so is mostly only ever delayed by the number of transactions in a
single syscall. And for large, ongoing transactions like a truncate,
it will yield the moment we have to pull in metadata from disk.

What's new in recent kernels is that the in-core inode unlinked
processing mechanisms have changed the way both the syscall and log
recovery mechanisms work (merged in 5.1, IIRC), and it looks like it
no longer blocks in log recovery like it used to. Given Christoph
first reported this generic/530 issue in May there's a fair
correlation indicating that the two are linked.

i.e. we changed the unlinked inode processing in a way that the
kernel can now run tens of thousands of unlink transactions without
yielding the CPU. That violated the "CIL push work will run within a
few transactions of the background push occurring" mechanism the
workqueue provided us with and that, fundamentally, is the underlying
issue here. It's not a CIL vs empty AIL vs log reservation exhaustion
race condition - that's just an observable symptom.

To that end, I have been prototyping patches to fix this exact
problem as part of the non-blocking inode reclaim series. I've been
looking at this because the CIL pins so much memory on large logs and
I wanted to put an upper bound on it that wasn't measured in GBs of
RAM. Hence I'm planning to pull these out into a separate series now
as it's clear that non-preemptible kernels and workqueues do not play
well together and that the more we use workqueues for async
processing, the more we introduce a potential real-world vector for
CIL overruns...

Cheers,

Dave.
On Fri, Sep 06, 2019 at 10:02:05AM +1000, Dave Chinner wrote: > On Thu, Sep 05, 2019 at 12:25:33PM -0400, Brian Foster wrote: > > On Thu, Sep 05, 2019 at 08:50:56AM +1000, Dave Chinner wrote: > > > On Wed, Sep 04, 2019 at 03:34:42PM -0400, Brian Foster wrote: > > > > On Wed, Sep 04, 2019 at 02:24:51PM +1000, Dave Chinner wrote: > > > > > From: Dave Chinner <dchinner@redhat.com> > > > > > +/* > > > > > + * Completion of a iclog IO does not imply that a transaction has completed, as > > > > > + * transactions can be large enough to span many iclogs. We cannot change the > > > > > + * tail of the log half way through a transaction as this may be the only > > > > > + * transaction in the log and moving the tail to point to the middle of it > > > > > + * will prevent recovery from finding the start of the transaction. Hence we > > > > > + * should only update the last_sync_lsn if this iclog contains transaction > > > > > + * completion callbacks on it. > > > > > + * > > > > > + * We have to do this before we drop the icloglock to ensure we are the only one > > > > > + * that can update it. > > > > > + * > > > > > + * If we are moving the last_sync_lsn forwards, we also need to ensure we kick > > > > > + * the reservation grant head pushing. This is due to the fact that the push > > > > > + * target is bound by the current last_sync_lsn value. Hence if we have a large > > > > > + * amount of log space bound up in this committing transaction then the > > > > > + * last_sync_lsn value may be the limiting factor preventing tail pushing from > > > > > + * freeing space in the log. Hence once we've updated the last_sync_lsn we > > > > > + * should push the AIL to ensure the push target (and hence the grant head) is > > > > > + * no longer bound by the old log head location and can move forwards and make > > > > > + * progress again. 
> > > > > + */ > > > > > +static void > > > > > +xlog_state_set_callback( > > > > > + struct xlog *log, > > > > > + struct xlog_in_core *iclog, > > > > > + xfs_lsn_t header_lsn) > > > > > +{ > > > > > + iclog->ic_state = XLOG_STATE_CALLBACK; > > > > > + > > > > > + ASSERT(XFS_LSN_CMP(atomic64_read(&log->l_last_sync_lsn), header_lsn) <= 0); > > > > > + > > > > > + if (list_empty_careful(&iclog->ic_callbacks)) > > > > > + return; > > > > > + > > > > > + atomic64_set(&log->l_last_sync_lsn, header_lsn); > > > > > + xlog_grant_push_ail(log, 0); > > > > > + > > > > > > > > Nit: extra whitespace line above. > > > > > > Fixed. > > > > > > > This still seems racy to me, FWIW. What if the AIL is empty (i.e. the > > > > push is skipped)? > > > > > > If the AIL is empty, then it's a no-op because pushing on the AIL is > > > not going to make more log space become free. Besides, that's not > > > the problem being solved here - reservation wakeups on first insert > > > into the AIL are already handled by xfs_trans_ail_update_bulk() and > > > hence the first patch in the series. This patch is addressing the > > > > Nothing currently wakes up reservation waiters on first AIL insertion. > > Nor should it be necessary - it's the removal from the AIL that > frees up log space, not insertion. The update operation is a > remove followed by an insert - the remove part of that operation is > what may free up log space, not the insert. > Just above you wrote: "reservation wakeups on first insert into the AIL are already handled by xfs_trans_ail_update_bulk()". My reply was just to point out that there are no wakeups in that case. > Hence if we need to wake the log reservation waiters on first AIL > insert to fix a bug, we haven't found the underlying problem is > preventing log space from being freed... > > > > > i.e. the AIL is for moving the tail of the log - this code moves the > > > head of the log. 
But both impact on the AIL push target (it is based on > > > the distance between the head and tail), so we need > > > to update the push target just in case this commit does not move > > > the tail... > > > > > > > What if xfsaild completes this push before the > > > > associated log items land in the AIL or we race with xfsaild emptying > > > > the AIL? Why not just reuse/update the existing grant head wake up logic > > > > in the iclog callback itself? E.g., something like the following > > > > (untested): > > > > > > > > And the raciness concerns..? AFAICT this still opens a race window where > > the AIL can idle on the push target before AIL insertion. > > I don't know what race you see - if the AIL completes a push before > we insert new objects at the head from the current commit, then it > does not matter one bit because the items are being inserted at the > log head, not the log tail where the pushing occurs at. If we are > inserting objects into the AIL within the push target window, then > there is something else very wrong going on, because when the log is > out of space the push target should be nowhere near the LSN we are > inserting inew objects into the AIL at. (i.e. they should be 3/4s of > the log apart...) > I'm not following your reasoning. It sounds to me that you're arguing it doesn't matter that the AIL is not populated from the current commit because the push target should be much farther behind the head. If that's the case, why does this patch order the AIL push after a ->l_last_sync_lsn update? That's the LSN of the most recent commit record to hit the log and hence the new physical log head. Side note: I think the LSN of the commit record iclog is different and actually ahead of the LSN associated with AIL insertion. I don't necessarily think that's a problem given how the log subsystem behaves today, but it's another subtle/undocumented (and easily avoidable) quirk that may not always be so benign. 
> > > So, from #1 we see that unconditional wakeups are not necessary in > > > the scenario you pose, and from #2 it's not a viable solution even > > > if it was required. > > > > > > However, #1 indicates other problems if a xfs_log_space_wake() call > > > is necessary in this case. No reservations space and an empty AIL > > > implies that the CIL pins the entire log - a pending commit that > > > hasn't finished flushing and the current context that is > > > aggregating. This implies we've violated a much more important rule > > > of the on-disk log format: finding the head and tail of the log > > > requires no individual commit be larger than 50% of the log. > > > > > > > I described this exact problem days ago in the original thread. There's > > no need to rehash it here. FWIW, I can reproduce much worse than 50% log > > consumption aggregated outside of the AIL with the current code and it > > doesn't depend on a nonpreemptible kernel (though the workqueue fix > > looks legit to me). > ... > > i.e. we changed the unlinked inode processing in a way that > the kernel can now runs tens of thousands of unlink transactions > without yeilding the CPU. That violated the "CIL push work will run > within a few transactions of the background push occurring" > mechanism the workqueue provided us with and that, fundamentally, is > the underlying issue here. It's not a CIL vs empty AIL vs log > reservation exhaustion race condition - that's just an observable > symptom. > Yes, but the point is that's not the only thing that can delay CIL push work. Since the AIL is not populated until the commit record iclog is written out, and background CIL pushing doesn't flush the commit record for the associated checkpoint before it completes, and CIL pushing itself is serialized, a stalled commit record iclog I/O is enough to create "log full, empty AIL" conditions. 
> To that end, I have been prototyping patches to fix this exact problem > as part of the non-blocking inode reclaim series. I've been looking at > this because the CIL pins so much memory on large logs and I wanted to > put an upper bound on it that wasn't measured in GBs of RAM. Hence I'm > planning to pull these out into a separate series now as it's clear > that non-preemptible kernels and workqueues do not play well together > and that the more we use workqueues for async processing, the more we > introduce a potential real-world vector for CIL overruns... > Yes, I think a separate series for CIL management makes sense. Brian > Cheers, > > Dave. -- Dave Chinner david@fromorbit.com
On Fri, Sep 06, 2019 at 09:10:14AM -0400, Brian Foster wrote: > On Fri, Sep 06, 2019 at 10:02:05AM +1000, Dave Chinner wrote: > > On Thu, Sep 05, 2019 at 12:25:33PM -0400, Brian Foster wrote: > > > On Thu, Sep 05, 2019 at 08:50:56AM +1000, Dave Chinner wrote: > > > > On Wed, Sep 04, 2019 at 03:34:42PM -0400, Brian Foster wrote: > > > > > On Wed, Sep 04, 2019 at 02:24:51PM +1000, Dave Chinner wrote: > > > > > > From: Dave Chinner <dchinner@redhat.com> > > > > > > +/* > > > > > > + * Completion of a iclog IO does not imply that a transaction has completed, as > > > > > > + * transactions can be large enough to span many iclogs. We cannot change the > > > > > > + * tail of the log half way through a transaction as this may be the only > > > > > > + * transaction in the log and moving the tail to point to the middle of it > > > > > > + * will prevent recovery from finding the start of the transaction. Hence we > > > > > > + * should only update the last_sync_lsn if this iclog contains transaction > > > > > > + * completion callbacks on it. > > > > > > + * > > > > > > + * We have to do this before we drop the icloglock to ensure we are the only one > > > > > > + * that can update it. > > > > > > + * > > > > > > + * If we are moving the last_sync_lsn forwards, we also need to ensure we kick > > > > > > + * the reservation grant head pushing. This is due to the fact that the push > > > > > > + * target is bound by the current last_sync_lsn value. Hence if we have a large > > > > > > + * amount of log space bound up in this committing transaction then the > > > > > > + * last_sync_lsn value may be the limiting factor preventing tail pushing from > > > > > > + * freeing space in the log. Hence once we've updated the last_sync_lsn we > > > > > > + * should push the AIL to ensure the push target (and hence the grant head) is > > > > > > + * no longer bound by the old log head location and can move forwards and make > > > > > > + * progress again. 
> > > > > > + */ > > > > > > +static void > > > > > > +xlog_state_set_callback( > > > > > > + struct xlog *log, > > > > > > + struct xlog_in_core *iclog, > > > > > > + xfs_lsn_t header_lsn) > > > > > > +{ > > > > > > + iclog->ic_state = XLOG_STATE_CALLBACK; > > > > > > + > > > > > > + ASSERT(XFS_LSN_CMP(atomic64_read(&log->l_last_sync_lsn), header_lsn) <= 0); > > > > > > + > > > > > > + if (list_empty_careful(&iclog->ic_callbacks)) > > > > > > + return; > > > > > > + > > > > > > + atomic64_set(&log->l_last_sync_lsn, header_lsn); > > > > > > + xlog_grant_push_ail(log, 0); > > > > > > + > > > > > > > > > > Nit: extra whitespace line above. > > > > > > > > Fixed. > > > > > > > > > This still seems racy to me, FWIW. What if the AIL is empty (i.e. the > > > > > push is skipped)? > > > > > > > > If the AIL is empty, then it's a no-op because pushing on the AIL is > > > > not going to make more log space become free. Besides, that's not > > > > the problem being solved here - reservation wakeups on first insert > > > > into the AIL are already handled by xfs_trans_ail_update_bulk() and > > > > hence the first patch in the series. This patch is addressing the > > > > > > Nothing currently wakes up reservation waiters on first AIL insertion. > > > > Nor should it be necessary - it's the removal from the AIL that > > frees up log space, not insertion. The update operation is a > > remove followed by an insert - the remove part of that operation is > > what may free up log space, not the insert. > > > > Just above you wrote: "reservation wakeups on first insert into the AIL > are already handled by xfs_trans_ail_update_bulk()". My reply was just > to point out that there are no wakeups in that case. > > > Hence if we need to wake the log reservation waiters on first AIL > > insert to fix a bug, we haven't found the underlying problem is > > preventing log space from being freed... > > > > > > > i.e. 
the AIL is for moving the tail of the log - this code moves the > > > > head of the log. But both impact on the AIL push target (it is based on > > > > the distance between the head and tail), so we need > > > > to update the push target just in case this commit does not move > > > > the tail... > > > > > > > > > What if xfsaild completes this push before the > > > > > associated log items land in the AIL or we race with xfsaild emptying > > > > > the AIL? Why not just reuse/update the existing grant head wake up logic > > > > > in the iclog callback itself? E.g., something like the following > > > > > (untested): > > > > > > > > > > > And the raciness concerns..? AFAICT this still opens a race window where > > > the AIL can idle on the push target before AIL insertion. > > > > I don't know what race you see - if the AIL completes a push before > > we insert new objects at the head from the current commit, then it > > does not matter one bit because the items are being inserted at the > > log head, not the log tail where the pushing occurs at. If we are > > inserting objects into the AIL within the push target window, then > > there is something else very wrong going on, because when the log is > > out of space the push target should be nowhere near the LSN we are > > inserting inew objects into the AIL at. (i.e. they should be 3/4s of > > the log apart...) > > > > I'm not following your reasoning. It sounds to me that you're arguing it > doesn't matter that the AIL is not populated from the current commit > because the push target should be much farther behind the head. If > that's the case, why does this patch order the AIL push after a > ->l_last_sync_lsn update? That's the LSN of the most recent commit > record to hit the log and hence the new physical log head. > > Side note: I think the LSN of the commit record iclog is different and > actually ahead of the LSN associated with AIL insertion. 
I don't
> necessarily think that's a problem given how the log subsystem behaves
> today, but it's another subtle/undocumented (and easily avoidable) quirk
> that may not always be so benign.
>

Just to put a finer point on this (and since this seems to be the only
way I can get you to consider nontrivial feedback to your patches):

kworker/0:1H-220 [000] ...1 3869.403829: xlog_state_do_callback: 2691: l_last_sync_lsn 0x15000021f6
kworker/0:1H-220 [000] ...1 3869.403864: xfs_ail_push: 639: ail_target 0x15000021f6
<...>-215246 [002] ...1 3875.568561: xfsaild: 403: empty (target 0x15000021f6)
<...>-215246 [002] .... 3875.568649: xfsaild: 589: idle
kworker/0:1H-220 [000] ...1 3889.843872: xfs_trans_ail_update_bulk: 746: inserted lsn 0x1500001bf6

This is an instance of xfsaild going idle between the time this new AIL
push sets the target based on the iclog about to be committed and AIL
insertion of the associated log items, reproduced via a bit of timing
instrumentation. Don't be distracted by the timestamps or the fact that
the LSNs do not match because the log items in the AIL end up indexed by
the start lsn of the CIL checkpoint (whereas last_sync_lsn refers to the
commit record). The point is simply that xfsaild has completed a push of
a target that hasn't been inserted yet.

A couple additional notes.. I don't see further side effects in the
variant I reproduced, I suspect because we have other wakeups that
squash this transient state created by the race, but I'm not totally
sure of that. I'm also not totally convinced this is the only vector to
this problem, FWIW. It wouldn't surprise me a ton if we had some other
scenario that could result in the same problem with actual side effects,
but this is beside the point.

Brian

> > > > So, from #1 we see that unconditional wakeups are not necessary in
> > > > the scenario you pose, and from #2 it's not a viable solution even
> > > > if it was required.
> > > > > > > > However, #1 indicates other problems if a xfs_log_space_wake() call > > > > is necessary in this case. No reservations space and an empty AIL > > > > implies that the CIL pins the entire log - a pending commit that > > > > hasn't finished flushing and the current context that is > > > > aggregating. This implies we've violated a much more important rule > > > > of the on-disk log format: finding the head and tail of the log > > > > requires no individual commit be larger than 50% of the log. > > > > > > > > > > I described this exact problem days ago in the original thread. There's > > > no need to rehash it here. FWIW, I can reproduce much worse than 50% log > > > consumption aggregated outside of the AIL with the current code and it > > > doesn't depend on a nonpreemptible kernel (though the workqueue fix > > > looks legit to me). > > > ... > > > > i.e. we changed the unlinked inode processing in a way that > > the kernel can now runs tens of thousands of unlink transactions > > without yeilding the CPU. That violated the "CIL push work will run > > within a few transactions of the background push occurring" > > mechanism the workqueue provided us with and that, fundamentally, is > > the underlying issue here. It's not a CIL vs empty AIL vs log > > reservation exhaustion race condition - that's just an observable > > symptom. > > > > Yes, but the point is that's not the only thing that can delay CIL push > work. Since the AIL is not populated until the commit record iclog is > written out, and background CIL pushing doesn't flush the commit record > for the associated checkpoint before it completes, and CIL pushing > itself is serialized, a stalled commit record iclog I/O is enough to > create "log full, empty AIL" conditions. > > > To that end, I have been prototyping patches to fix this exact problem > > as part of the non-blocking inode reclaim series. 
I've been looking at > > this because the CIL pins so much memory on large logs and I wanted to > > put an upper bound on it that wasn't measured in GBs of RAM. Hence I'm > > planning to pull these out into a separate series now as it's clear > > that non-preemptible kernels and workqueues do not play well together > > and that the more we use workqueues for async processing, the more we > > introduce a potential real-world vector for CIL overruns... > > > > Yes, I think a separate series for CIL management makes sense. > > Brian > > > Cheers, > > > > Dave. -- Dave Chinner david@fromorbit.com
On Sat, Sep 07, 2019 at 11:10:50AM -0400, Brian Foster wrote: > On Fri, Sep 06, 2019 at 09:10:14AM -0400, Brian Foster wrote: > > On Fri, Sep 06, 2019 at 10:02:05AM +1000, Dave Chinner wrote: > > > > And the raciness concerns..? AFAICT this still opens a race window where > > > > the AIL can idle on the push target before AIL insertion. > > > > > > I don't know what race you see - if the AIL completes a push before > > > we insert new objects at the head from the current commit, then it > > > does not matter one bit because the items are being inserted at the > > > log head, not the log tail where the pushing occurs at. If we are > > > inserting objects into the AIL within the push target window, then > > > there is something else very wrong going on, because when the log is > > > out of space the push target should be nowhere near the LSN we are > > > inserting inew objects into the AIL at. (i.e. they should be 3/4s of > > > the log apart...) > > > > > > > I'm not following your reasoning. It sounds to me that you're arguing it > > doesn't matter that the AIL is not populated from the current commit > > because the push target should be much farther behind the head. If > > that's the case, why does this patch order the AIL push after a > > ->l_last_sync_lsn update? That's the LSN of the most recent commit > > record to hit the log and hence the new physical log head. > > > > Side note: I think the LSN of the commit record iclog is different and > > actually ahead of the LSN associated with AIL insertion. I don't > > necessarily think that's a problem given how the log subsystem behaves > > today, but it's another subtle/undocumented (and easily avoidable) quirk > > that may not always be so benign. > > > > Just to put a finer point on this (and since this seems to be the only > way I can get you to consider nontrivial feedback to your patches): If I can't make head or tail of the problem you are describing, exactly how am I supposed to respond? 
If I'm unable to get my point across, I'd much prefer to spend my time
on patches than on going around in circles. I'm not interested in
winning arguments. I'm not interested in spending lots of time
discussing theoretical problems with the current set of fixes that
don't exist once the root cause we've already identified is fixed. My
time is much better spent fixing that root cause...

Keep in mind that I also have a lot of different, complex things going
on at once that all require total focus while I'm looking at them, so it
can take days for me to cycle through everything and get back to past
topics. Delay doesn't mean I haven't read your response or taken it on
board, it just means I don't have time to write a -meaningful response-
straight away.

> kworker/0:1H-220 [000] ...1 3869.403829: xlog_state_do_callback: 2691: l_last_sync_lsn 0x15000021f6
> kworker/0:1H-220 [000] ...1 3869.403864: xfs_ail_push: 639: ail_target 0x15000021f6

Which implies that the log has less than 25% of space free because
we've issued a push, and that the distance we push is bound by the log
head.

> <...>-215246 [002] ...1 3875.568561: xfsaild: 403: empty (target 0x15000021f6)
> <...>-215246 [002] .... 3875.568649: xfsaild: 589: idle

has an empty AIL. IOWs, you are creating the situation where the CIL
has not been allowed to run and hence has violated the >50% log size
limit on transactions. This goes away once we prevent the CIL from
doing this.

> kworker/0:1H-220 [000] ...1 3889.843872: xfs_trans_ail_update_bulk: 746: inserted lsn 0x1500001bf6

Ok, so what you see here is somewhat intentional, based on the fact
that the LSN used for items is different to the LSN used for the commit
record (start of commit vs end of commit). We don't want to push the
currently committing items instantly to disk as that defeats the
"delayed write" behaviour the AIL uses to allow efficient relogging to
occur.
The next commit will do a similar push with the new l_last_sync_lsn,
which causes the target to point at the new last_sync_lsn and so all
the items in the AIL from the previous commit that haven't been
relogged (pinned) in the current commit will get pushed. i.e. commit N
will cause commit (N - 1) to get pushed.

This will continue while we are in a situation where the current log
head location is limiting the push target and we are completely out of
log reservation space. Once we get to the point where the physical head
of the log is more than 25% of the log away from the tail, the push
target will stop being limited by the l_last_sync_lsn and we'll go back
to triggering push target updates via the tail of the log moving
forwards as we currently do. IOWs, this "log head pushing" behaviour is
likely only necessary for the first 2-3 CIL commits of a workload, then
we fall back into the normal tail pushing scenario.

> This is an instance of xfsaild going idle between the time this
> new AIL push sets the target based on the iclog about to be
> committed and AIL insertion of the associated log items,
> reproduced via a bit of timing instrumentation. Don't be
> distracted by the timestamps or the fact that the LSNs do not
> match because the log items in the AIL end up indexed by the start
> lsn of the CIL checkpoint (whereas last_sync_lsn refers to the
> commit record). The point is simply that xfsaild has completed a
> push of a target that hasn't been inserted yet.

AFAICT, what you are showing requires delaying of the CIL push to the
point it violates a fundamental assumption about commit sizes, which is
why I largely think it's irrelevant.

> A couple additional notes.. I don't see further side effects in the
> variant I reproduced, I suspect because we have other wakeups that
> squash this transient state created by the race,

Right, if we do run out of log space, the log reservation tail pushing
mechanism takes over and does the right thing.

> > > i.e. 
we changed the unlinked inode processing in a way that
> > > the kernel can now run tens of thousands of unlink transactions
> > > without yielding the CPU. That violated the "CIL push work will run
> > > within a few transactions of the background push occurring"
> > > mechanism the workqueue provided us with and that, fundamentally, is
> > > the underlying issue here. It's not a CIL vs empty AIL vs log
> > > reservation exhaustion race condition - that's just an observable
> > > symptom.
> > >
> >
> > Yes, but the point is that's not the only thing that can delay CIL push
> > work. Since the AIL is not populated until the commit record iclog is
> > written out, and background CIL pushing doesn't flush the commit record
> > for the associated checkpoint before it completes, and CIL pushing
> > itself is serialized, a stalled commit record iclog I/O is enough to
> > create "log full, empty AIL" conditions.

CIL pushing is not actually serialised. Ordered, yes, serialised, no.
And stalling an iclog with a commit record should not cause the log to
fill completely - the next CIL push when it overflows should get it
moving long before the log runs out of reservation space.

Cheers,

Dave.
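
The head-limited push target described in this message can be sketched as a small userspace model. This is only a simplified illustration, not the kernel's xlog_grant_push_ail(): the helper names make_lsn() and push_target(), the 0x10000-block log size and the tail position are all assumptions made up for the example, and the real code also takes the size of the waiting reservation into account.

```c
#include <assert.h>
#include <stdint.h>

/* An LSN packs a cycle number (high 32 bits) and a block number (low 32 bits). */
typedef int64_t lsn_t;

/* Hypothetical helper: build an LSN from its two halves. */
static lsn_t make_lsn(uint32_t cycle, uint32_t block)
{
	return (lsn_t)(((uint64_t)cycle << 32) | block);
}

/*
 * Hypothetical model of the push target: aim a quarter of the log past the
 * tail, but never beyond the last synced LSN (the physical log head).
 * log_blocks is the log size in blocks; tail's block must be < log_blocks.
 */
static lsn_t push_target(lsn_t tail, lsn_t last_sync, uint32_t log_blocks)
{
	uint32_t cycle = (uint32_t)((uint64_t)tail >> 32);
	uint32_t block = (uint32_t)tail + (log_blocks >> 2);
	lsn_t target;

	if (block >= log_blocks) {	/* wrapped past the end of the log */
		block -= log_blocks;
		cycle++;
	}
	target = make_lsn(cycle, block);

	/* Positive LSNs order correctly as plain integers (cycle is on top). */
	return target > last_sync ? last_sync : target;
}

/* Example values only; the tail position and log size are assumed. */
void push_target_demo(void)
{
	lsn_t tail = make_lsn(0x15, 0x1000);

	/* Head less than 25% of the log past the tail: target is clamped to it. */
	assert(push_target(tail, make_lsn(0x15, 0x21f6), 0x10000) == 0x15000021f6LL);

	/* Head more than 25% past the tail: target is tail + 25% of the log. */
	assert(push_target(tail, make_lsn(0x15, 0x9000), 0x10000) == 0x1500005000LL);
}
```

In the first case the target only moves when the head moves, which is the situation where a push from iclog completion matters. In the second case the target depends only on the tail, so normal tail pushing takes over.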
On Mon, Sep 09, 2019 at 09:26:32AM +1000, Dave Chinner wrote: > On Sat, Sep 07, 2019 at 11:10:50AM -0400, Brian Foster wrote: > > On Fri, Sep 06, 2019 at 09:10:14AM -0400, Brian Foster wrote: > > > On Fri, Sep 06, 2019 at 10:02:05AM +1000, Dave Chinner wrote: > > > > > And the raciness concerns..? AFAICT this still opens a race window where > > > > > the AIL can idle on the push target before AIL insertion. > > > > > > > > I don't know what race you see - if the AIL completes a push before > > > > we insert new objects at the head from the current commit, then it > > > > does not matter one bit because the items are being inserted at the > > > > log head, not the log tail where the pushing occurs at. If we are > > > > inserting objects into the AIL within the push target window, then > > > > there is something else very wrong going on, because when the log is > > > > out of space the push target should be nowhere near the LSN we are > > > > inserting inew objects into the AIL at. (i.e. they should be 3/4s of > > > > the log apart...) > > > > > > > > > > I'm not following your reasoning. It sounds to me that you're arguing it > > > doesn't matter that the AIL is not populated from the current commit > > > because the push target should be much farther behind the head. If > > > that's the case, why does this patch order the AIL push after a > > > ->l_last_sync_lsn update? That's the LSN of the most recent commit > > > record to hit the log and hence the new physical log head. > > > > > > Side note: I think the LSN of the commit record iclog is different and > > > actually ahead of the LSN associated with AIL insertion. I don't > > > necessarily think that's a problem given how the log subsystem behaves > > > today, but it's another subtle/undocumented (and easily avoidable) quirk > > > that may not always be so benign. 
> > > > > > > Just to put a finer point on this (and since this seems to be the only > > way I can get you to consider nontrivial feedback to your patches): > > If I can't make head or tail of the problem you are describing, > exactly how am I supposed to respond? If I'm unable to get my point > across, I'd much prefer to spend my time on patches than on going > around in circles. I'm not interested in winning arguments. I'm not > interested in spending lots of time discussing theoretical problems > with the current set of fixes that don't exist once the root cause > we've already identified is fixed. My time is much better spent > fixing that root cause... > > Keep in mind that I also have a lot of different, complex things > going on at once that all require total focus while I'm looking at > them, so it can take days for me to cycle through everything and get > back to past topics. Delay doesn't mean I haven't read your response > or taken it on board, it just means I don't have time to write a > -meaingful response- straight away. > > > kworker/0:1H-220 [000] ...1 3869.403829: xlog_state_do_callback: 2691: l_last_sync_lsn 0x15000021f6 > > kworker/0:1H-220 [000] ...1 3869.403864: xfs_ail_push: 639: ail_target 0x15000021f6 > > Which implies that the log has less than 25% of space free because > we've issued a push, and that the distance we push is bound by the > log head. > > > <...>-215246 [002] ...1 3875.568561: xfsaild: 403: empty (target 0x15000021f6) > > <...>-215246 [002] .... 3875.568649: xfsaild: 589: idle > > has an empty AIL. IOWs, you are creating the situation where the CIL > has not been allowed to run and hence has violated the >50% log size > limit on transactions. This goes away once we prevent the CIL from > doing this. 
> > > kworker/0:1H-220 [000] ...1 3889.843872: xfs_trans_ail_update_bulk: 746: inserted lsn 0x1500001bf6 > > Ok, so what you see here is somewhat intentional, based on the fact > that the LSN used for items is different to the LSN used for the > commit record (start of commit vs end of commit). We don't want to > push the currently commiting items instantly to disk as that defeats > the "delayed write" behaviour the AIL uses to allow efficient > relogging to occur. > > The next commit will do a similar push during with the new > l_last_sync_lsn, which causes the target to point at the new > last_sync_lsn and so all the items in the AIL from the previous > commit that haven't been relogged (pinned) in the current commit > will get pushed. i.e. commit N will cause commit (N - 1) to get > pushed. > > This will continue while we are in a situation where the current log > head location is limiting the push target and we are completely out > of log reservation space. Once we get to the point where the > physical head of the log is more than 25% of the log away from the > tail, the push target will stop being limited by the l_last_sync_lsn > and we'll go back to triggering push target updates via the tail of > the log moving forwards as we currently do. IOWs, this "log head > pushing" behaviour is likely only necessary for the first 2-3 CIL > commits of a workload, then we fall back into the normal tail > pushing scenario. > > > This is an instance of xfsaild going idle between the time this > > new AIL push sets the target based on the iclog about to be > > committed and AIL insertion of the associated log items, > > reproduced via a bit of timing instrumentation. Don't be > > distracted by the timestamps or the fact that the LSNs do not > > match because the log items in the AIL end up indexed by the start > > lsn of the CIL checkpoint (whereas last_sync_lsn refers to the > > commit record). 
The point is simply that xfsaild has completed a
> > push of a target that hasn't been inserted yet.
>
> AFAICT, what you are showing requires delaying of the CIL push to the
> point it violates a fundamental assumption about commit sizes, which
> is why I largely think it's irrelevant.
>

The CIL checkpoint size is an unrelated side effect of the test I
happened to use, not a fundamental cause of the problem it demonstrates.
Fixing CIL checkpoint size issues won't change anything. Here's a
different variant of the same problem with a small enough number of log
items such that background CIL pushing is not a factor:

<...>-79670 [000] ...1 56126.015522: xfs_log_force: dev 253:4 lsn 0x0 caller xfs_log_worker+0x2f/0xf0 [xfs]
kworker/0:1H-220 [000] ...1 56126.030587: __xlog_grant_push_ail: 1596: threshold_lsn 0x1000032e4
...
<...>-81293 [000] ...2 56126.032647: xfs_ail_delete: dev 253:4 lip 00000000cbe82125 old lsn 1/13026 new lsn 1/13026 type XFS_LI_INODE flags IN_AIL
<...>-81633 [000] .... 56126.053544: xfsaild: 588: idle ->ail_target 0x1000032e4
kworker/0:1H-220 [000] ...2 56127.038835: xfs_ail_insert: dev 253:4 lip 00000000a44ab1ef old lsn 0/0 new lsn 1/13028 type XFS_LI_INODE flags IN_AIL
kworker/0:1H-220 [000] ...2 56127.038911: xfs_ail_insert: dev 253:4 lip 0000000028d2061f old lsn 0/0 new lsn 1/13028 type XFS_LI_INODE flags IN_AIL
....

This sequence starts with one log item in the AIL and some number of
items in the CIL such that a checkpoint executes from the background log
worker. The worker forces the CIL and log I/O completion issues an AIL
push that is truncated by the recently updated ->l_last_sync_lsn due to
outstanding transaction reservation and small AIL size. This push races
with completion of a previous push that empties the AIL and iclog
callbacks insert log items for the current checkpoint at the LSN target
xfsaild just idled at.

Brian

> > A couple additional notes.. 
I don't see further side effects in the > > variant I reproduced, I suspect because we have other wakeups that > > squash this transient state created by the race, > > Right, if we do run out of log space, the log reservation tail > pushing mechanisms takes over and does the right thing. > > > > > i.e. we changed the unlinked inode processing in a way that > > > > the kernel can now runs tens of thousands of unlink transactions > > > > without yeilding the CPU. That violated the "CIL push work will run > > > > within a few transactions of the background push occurring" > > > > mechanism the workqueue provided us with and that, fundamentally, is > > > > the underlying issue here. It's not a CIL vs empty AIL vs log > > > > reservation exhaustion race condition - that's just an observable > > > > symptom. > > > > > > > > > > Yes, but the point is that's not the only thing that can delay CIL push > > > work. Since the AIL is not populated until the commit record iclog is > > > written out, and background CIL pushing doesn't flush the commit record > > > for the associated checkpoint before it completes, and CIL pushing > > > itself is serialized, a stalled commit record iclog I/O is enough to > > > create "log full, empty AIL" conditions. > > CIL pushing is not actually serialised. Ordered, yes, serialised, > no. ANd stalling an iclog with a commit record should not cause the > log to fill completely - the next CIL push when it overflows should > get it moving long before the log runs out of reservation space. > > Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com
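
The LSNs in the two traces are easier to compare once split into cycle and block. The sketch below does that in userspace C. The helpers lsn_cycle(), lsn_block() and lsn_cmp() are hypothetical names written for this example; they are modelled on the cycle/block split that XFS_LSN_CMP() relies on, but they are not the kernel macros themselves.

```c
#include <assert.h>
#include <stdint.h>

typedef int64_t lsn_t;

/* Cycle number: the high 32 bits of the LSN. */
static uint32_t lsn_cycle(lsn_t lsn)
{
	return (uint32_t)((uint64_t)lsn >> 32);
}

/* Block number: the low 32 bits of the LSN. */
static uint32_t lsn_block(lsn_t lsn)
{
	return (uint32_t)lsn;
}

/* Three-way compare: cycle first, then block. */
static int lsn_cmp(lsn_t a, lsn_t b)
{
	if (lsn_cycle(a) != lsn_cycle(b))
		return lsn_cycle(a) < lsn_cycle(b) ? -1 : 1;
	if (lsn_block(a) != lsn_block(b))
		return lsn_block(a) < lsn_block(b) ? -1 : 1;
	return 0;
}

void decode_traces(void)
{
	/*
	 * Second trace: the target xfsaild idled at, 0x1000032e4, is cycle 1,
	 * block 13028, the same "1/13028" the xfs_ail_insert events report.
	 */
	assert(lsn_cycle(0x1000032e4LL) == 1);
	assert(lsn_block(0x1000032e4LL) == 13028);

	/*
	 * First trace: the items were inserted at 0x1500001bf6, which sorts
	 * below the push target 0x15000021f6 within the same cycle.
	 */
	assert(lsn_cmp(0x1500001bf6LL, 0x15000021f6LL) < 0);
}
```

Read this way, both traces show items landing in the AIL at or below a target that xfsaild had already gone idle on.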
On Tue, Sep 10, 2019 at 05:56:28AM -0400, Brian Foster wrote: > On Mon, Sep 09, 2019 at 09:26:32AM +1000, Dave Chinner wrote: > > On Sat, Sep 07, 2019 at 11:10:50AM -0400, Brian Foster wrote: > > > This is an instance of xfsaild going idle between the time this > > > new AIL push sets the target based on the iclog about to be > > > committed and AIL insertion of the associated log items, > > > reproduced via a bit of timing instrumentation. Don't be > > > distracted by the timestamps or the fact that the LSNs do not > > > match because the log items in the AIL end up indexed by the start > > > lsn of the CIL checkpoint (whereas last_sync_lsn refers to the > > > commit record). The point is simply that xfsaild has completed a > > > push of a target that hasn't been inserted yet. > > > > AFAICT, what you are showing requires delaying of the CIL push to the > > point it violates a fundamental assumption about commit sizes, which > > is why I largely think it's irrelevant. > > > > The CIL checkpoint size is an unrelated side effect of the test I > happened to use, not a fundamental cause of the problem it demonstrates. > Fixing CIL checkpoint size issues won't change anything. Here's a > different variant of the same problem with a small enough number of log > items such that background CIL pushing is not a factor: > > <...>-79670 [000] ...1 56126.015522: xfs_log_force: dev 253:4 lsn 0x0 caller xfs_log_worker+0x2f/0xf0 [xfs] > kworker/0:1H-220 [000] ...1 56126.030587: __xlog_grant_push_ail: 1596: threshold_lsn 0x1000032e4 > ... > <...>-81293 [000] ...2 56126.032647: xfs_ail_delete: dev 253:4 lip 00000000cbe82125 old lsn 1/13026 new lsn 1/13026 type XFS_LI_INODE flags IN_AIL > <...>-81633 [000] .... 
56126.053544: xfsaild: 588: idle ->ail_target 0x1000032e4
> kworker/0:1H-220 [000] ...2 56127.038835: xfs_ail_insert: dev 253:4 lip 00000000a44ab1ef old lsn 0/0 new lsn 1/13028 type XFS_LI_INODE flags IN_AIL
> kworker/0:1H-220 [000] ...2 56127.038911: xfs_ail_insert: dev 253:4 lip 0000000028d2061f old lsn 0/0 new lsn 1/13028 type XFS_LI_INODE flags IN_AIL
> ....
>
> This sequence starts with one log item in the AIL and some number of
> items in the CIL such that a checkpoint executes from the background log
> worker. The worker forces the CIL and log I/O completion issues an AIL
> push that is truncated by the recently updated ->l_last_sync_lsn due to
> outstanding transaction reservation and small AIL size. This push races
> with completion of a previous push that empties the AIL and iclog
> callbacks insert log items for the current checkpoint at the LSN target
> xfsaild just idled at.

I'm just not seeing what the problem here is. The behaviour you are
describing has been around since day zero and doesn't require the
addition of an ail push from iclog completion to trigger. Prior to
this series, it would be:

process 1 reservation    log completion    xfsaild
<completes metadata IO>
xfs_ail_delete()
mlip_changed
xlog_assign_tail_lsn_locked()
ail empty, sets l_last_sync = 0x1000032e2
xfs_log_space_wake()
xlog_state_do_callback
sets CALLBACK
sets last_sync_lsn to iclog head
-> 0x1000032e4
<drops icloglock, gets preempted>
<wakes>
xlog_grant_head_wait
free_bytes < need_bytes
xlog_grant_push_ail()
xlog_push_ail()
->ail_target 0x1000032e4
<sleeps>
<wakes>
sets prev target to 0x1000032e4
sees empty AIL
<sleeps>
<runs again>
runs callbacks
xfs_ail_insert(lsn = 0x1000032e4)

and now we have the AIL push thread asleep with items in it at the
push threshold. IOWs, what you describe has always been possible,
and before the CIL was introduced this sort of thing happened quite
a bit because iclog completions freed up much less space in the log
than a CIL commit completion.
It's not a problem, however, because if we are out of transaction
reservation space we must have transactions in progress, and as long
as they make progress then the commit of each transaction will end
up calling xlog_ungrant_log_space() to return the unused portion of
the transaction reservation. That calls xfs_log_space_wake() to
allow reservation waiters to try to make progress.

If there's still not enough space reservation after the transaction
in progress has released its reservation, then it goes back to
sleep. As long as we have active transactions in progress while
there are transaction reservations waiting on reservation space,
there will be a wakeup vector for the reservation independent of
the CIL, iclogs and AIL behaviour.

[ Yes, there was a bug here, in the case xfs_log_space_wake() did
not issue a wakeup because of not enough space being available and
the push target was limited by the old log head location. i.e.
nothing ever updated the push target to reflect the new log head and
so the tail might never get moved now. That particular bug was fixed
by an earlier patch in the series, so we can ignore it here. ]

IOWs, if the AIL is empty, the CIL cannot consume more than 25% of
the log space, and we have transactions waiting on log reservation
space, then we must have enough transactions in progress to cover at
least 75% of the log space. Completion of those transactions will
wake waiters and, if necessary, push the AIL again to keep the log
tail moving appropriately. This handles the AIL empty and "insert
before target" situations you are concerned about just fine, as long
as we have a guarantee of forwards progress. Bounding the CIL size
provides that forwards progress guarantee for the CIL...

Cheers,

Dave.
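The end state under discussion (the pusher asleep while the AIL holds items at the push threshold) can be sketched as a toy model. This is not kernel code: `struct toy_ail`, `toy_ail_push()`, `toy_xfsaild_run()` and `toy_replay_trace()` are hypothetical names invented for illustration. The guard in `toy_ail_push()` is an assumption drawn from the statement in this thread that a push on an empty AIL has no effect, plus the further assumption that a push request which does not advance the target issues no wakeup. LSNs are compared as plain 64-bit integers here, which is also a simplification.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy stand-in for the AIL push state; NOT the kernel's struct xfs_ail. */
struct toy_ail {
	uint64_t target;     /* last push target handed to the pusher */
	int      nitems;     /* number of log items currently in the AIL */
	bool     task_awake; /* is the toy pusher scheduled to run? */
};

/*
 * Push request. Assumed to be a no-op when the AIL is empty or when the
 * requested threshold does not advance the current target.
 * Returns true only if the target moved and the pusher was woken.
 */
static bool toy_ail_push(struct toy_ail *ail, uint64_t threshold)
{
	if (ail->nitems == 0 || threshold <= ail->target)
		return false;
	ail->target = threshold;
	ail->task_awake = true;
	return true;
}

/*
 * One pass of the toy pusher. Only the idle transition is modelled:
 * with nothing in the AIL the pusher goes back to sleep.
 */
static void toy_xfsaild_run(struct toy_ail *ail)
{
	if (ail->nitems == 0)
		ail->task_awake = false;
}

/*
 * Replay the ordering seen in the quoted trace and return the number of
 * items left in the AIL behind a sleeping pusher (0 if it is awake).
 */
static int toy_replay_trace(void)
{
	struct toy_ail ail = { .target = 0, .nitems = 1, .task_awake = false };
	const uint64_t threshold = 0x1000032e4ULL;

	toy_ail_push(&ail, threshold); /* push sets the target, wakes pusher */
	ail.nitems = 0;                /* last item is deleted from the AIL */
	toy_xfsaild_run(&ail);         /* pusher sees an empty AIL and idles */
	ail.nitems = 2;                /* two items inserted at the target LSN */
	toy_ail_push(&ail, threshold); /* hypothetical repeat push: no wakeup */

	return ail.task_awake ? 0 : ail.nitems;
}
```

Calling `toy_replay_trace()` from a small `main()` reports two stranded items under these assumptions. Whether anything else later wakes the pusher is exactly the point being argued in the thread; the sketch takes no position on that.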
On Wed, Sep 11, 2019 at 09:38:58AM +1000, Dave Chinner wrote: > On Tue, Sep 10, 2019 at 05:56:28AM -0400, Brian Foster wrote: > > On Mon, Sep 09, 2019 at 09:26:32AM +1000, Dave Chinner wrote: > > > On Sat, Sep 07, 2019 at 11:10:50AM -0400, Brian Foster wrote: > > > > This is an instance of xfsaild going idle between the time this > > > > new AIL push sets the target based on the iclog about to be > > > > committed and AIL insertion of the associated log items, > > > > reproduced via a bit of timing instrumentation. Don't be > > > > distracted by the timestamps or the fact that the LSNs do not > > > > match because the log items in the AIL end up indexed by the start > > > > lsn of the CIL checkpoint (whereas last_sync_lsn refers to the > > > > commit record). The point is simply that xfsaild has completed a > > > > push of a target that hasn't been inserted yet. > > > > > > AFAICT, what you are showing requires delaying of the CIL push to the > > > point it violates a fundamental assumption about commit sizes, which > > > is why I largely think it's irrelevant. > > > > > > > The CIL checkpoint size is an unrelated side effect of the test I > > happened to use, not a fundamental cause of the problem it demonstrates. > > Fixing CIL checkpoint size issues won't change anything. Here's a > > different variant of the same problem with a small enough number of log > > items such that background CIL pushing is not a factor: > > > > <...>-79670 [000] ...1 56126.015522: xfs_log_force: dev 253:4 lsn 0x0 caller xfs_log_worker+0x2f/0xf0 [xfs] > > kworker/0:1H-220 [000] ...1 56126.030587: __xlog_grant_push_ail: 1596: threshold_lsn 0x1000032e4 > > ... > > <...>-81293 [000] ...2 56126.032647: xfs_ail_delete: dev 253:4 lip 00000000cbe82125 old lsn 1/13026 new lsn 1/13026 type XFS_LI_INODE flags IN_AIL > > <...>-81633 [000] .... 
56126.053544: xfsaild: 588: idle ->ail_target 0x1000032e4
> > kworker/0:1H-220 [000] ...2 56127.038835: xfs_ail_insert: dev 253:4 lip 00000000a44ab1ef old lsn 0/0 new lsn 1/13028 type XFS_LI_INODE flags IN_AIL
> > kworker/0:1H-220 [000] ...2 56127.038911: xfs_ail_insert: dev 253:4 lip 0000000028d2061f old lsn 0/0 new lsn 1/13028 type XFS_LI_INODE flags IN_AIL
> > ....
> >
> > This sequence starts with one log item in the AIL and some number of
> > items in the CIL such that a checkpoint executes from the background log
> > worker. The worker forces the CIL and log I/O completion issues an AIL
> > push that is truncated by the recently updated ->l_last_sync_lsn due to
> > outstanding transaction reservation and small AIL size. This push races
> > with completion of a previous push that empties the AIL and iclog
> > callbacks insert log items for the current checkpoint at the LSN target
> > xfsaild just idled at.
>
> I'm just not seeing what the problem here is. The behaviour you are
> describing has been around since day zero and doesn't require the
> addition of an ail push from iclog completion to trigger. Prior to
> this series, it would be:
>

A few days ago you said that if we're inserting log items before the
push target, "something is very wrong." Since this was what I was
concerned about, I attempted to manufacture the issue to demonstrate.
You suggested the first reproducer I came up with was a separate problem
(related to CIL size issues), so I came up with the one above to avoid
that distraction. Now you're telling me this has always happened and is
fine..

While I don't think this is quite accurate (more below), I do find this
reasoning somewhat amusing in that it essentially implies that this
patch itself is dubious. If this new AIL push is required to fix a real
issue, and this race is essentially manifest as implied, then this patch
can't possibly reliably fix the original problem. Anyways, that is
neither here nor there..
All of the details of this particular issue aside, I do think there's a
development process problem here. It shouldn't require an extended game
of whack-a-mole with this kind of inconsistent reasoning just to request
a trivial change to a patch (you also implied in a previous response it
was me wasting your time on this topic) that closes an obvious race and
otherwise has no negative effect. Someone is being unreasonable here and
I don't think it's me. More importantly, discussion of open issues
shouldn't be a race against the associated patch being merged. :/

> process 1 reservation    log completion    xfsaild
> <completes metadata IO>
> xfs_ail_delete()
> mlip_changed
> xlog_assign_tail_lsn_locked()
> ail empty, sets l_last_sync = 0x1000032e2
> xfs_log_space_wake()
> xlog_state_do_callback
> sets CALLBACK
> sets last_sync_lsn to iclog head
> -> 0x1000032e4
> <drops icloglock, gets preempted>
> <wakes>
> xlog_grant_head_wait
> free_bytes < need_bytes
> xlog_grant_push_ail()
> xlog_push_ail()
> ->ail_target 0x1000032e4
> <sleeps>
> <wakes>
> sets prev target to 0x1000032e4
> sees empty AIL
> <sleeps>
> <runs again>
> runs callbacks
> xfs_ail_insert(lsn = 0x1000032e4)
>
> and now we have the AIL push thread asleep with items in it at the
> push threshold. IOWs, what you describe has always been possible,
> and before the CIL was introduced this sort of thing happened quite
> a bit because iclog completions freed up much less space in the log
> than a CIL commit completion.
>

I was suspicious that this could occur prior to this change but I hadn't
confirmed. The scenario documented above cannot occur because a push on
an empty AIL has no effect. The target doesn't move and the task isn't
woken. That said, I still suspect the race can occur with the current
code between a grant head waiter, AIL emptying and iclog completion.

This just speaks to the frequency of the problem, though.
I'm not convinced it's something that happens "quite a bit" given the
nature of the 3-way race. I also don't agree that the existence of a
historical problem somehow excuses the introduction of a new variant of
the same problem. Instead, if this patch exposes a historical problem
that simply had no noticeable impact to this point, we should probably
look into whether it needs fixing too.

> It's not a problem, however, because if we are out of transaction
> reservation space we must have transactions in progress, and as long
> as they make progress then the commit of each transaction will end
> up calling xlog_ungrant_log_space() to return the unused portion of
> the transaction reservation. That calls xfs_log_space_wake() to
> allow reservation waiters to try to make progress.
>

Yes, this is why I don't see immediate side effects in the tests I've
run so far. The assumptions you're basing this off are not always true,
however. Particularly on smaller (<= 1GB) filesystems, it's relatively
easy to produce conditions where the entire reservation space is
consumed by open transactions that don't ultimately commit anything to
the log subsystem and thus generate no forward progress.

> If there's still not enough space reservation after the transaction
> in progress has released its reservation, then it goes back to
> sleep. As long as we have active transactions in progress while
> there are transaction reservations waiting on reservation space,
> there will be a wakeup vector for the reservation independent of
> the CIL, iclogs and AIL behaviour.
>

We do have clean transaction cancel and error scenarios, existing log
deadlock vectors, increasing reliance on long running transactions via
deferred ops, scrub, etc. Also consider the fact that open transactions
consume considerably more reservation than committed transactions on
average.
I'm not saying it's likely for a real world workload to consume the
entirety of log reservation space via open transactions and then release
it without filesystem modification (and then race with log I/O and AIL
emptying), but from the perspective of proving the existence of a bug
it's really not that difficult to produce. I've not seen a real world
workload that reproduces the problems fixed by any of these patches
either, but we still fix them.

> [ Yes, there was a bug here, in the case xfs_log_space_wake() did
> not issue a wakeup because of not enough space being available and
> the push target was limited by the old log head location. i.e.
> nothing ever updated the push target to reflect the new log head and
> so the tail might never get moved now. That particular bug was fixed
> by an earlier patch in the series, so we can ignore it here. ]
>
> IOWs, if the AIL is empty, the CIL cannot consume more than 25% of
> the log space, and we have transactions waiting on log reservation
> space, then we must have enough transactions in progress to cover at
> least 75% of the log space. Completion of those transactions will
> wake waiters and, if necessary, push the AIL again to keep the log
> tail moving appropriately. This handles the AIL empty and "insert
> before target" situations you are concerned about just fine, as long
> as we have a guarantee of forwards progress. Bounding the CIL size
> provides that forwards progress guarantee for the CIL...
>

I think you have some tunnel vision or something going on here with
regard to the higher level architectural view of how things are supposed
to operate in a normal running/steady state vs simply what can and
cannot happen in the code. I can't really tell why/how, but the only
suggestion I can make is to perhaps separate from this high level view
of things and take a closer look at the code. This is a simple code bug,
not some grand architectural flaw. The context here is way out of whack.
The repeated unrelated and overblown architectural assertions come off
as an indication of a lack of any real argument to allow this race to
live. There is simply no such guarantee of forward progress in all
scenarios that produce the conditions that can cause this race.

Yet another example:

<...>-369 [002] ...2 220.055746: xfs_ail_insert: dev 253:4 lip 00000000ddb123f2 old lsn 0/0 new lsn 1/248 type XFS_LI_INODE flags IN_AIL
<...>-27 [003] ...1 224.753110: xfs_log_force: dev 253:4 lsn 0x0 caller xfs_log_worker+0x2f/0xf0 [xfs]
<...>-404 [003] ...1 224.775551: __xlog_grant_push_ail: 1596: threshold_lsn 0x1000000fa
kworker/3:1-39 [003] ...2 224.777953: xfs_ail_delete: dev 253:4 lip 00000000ddb123f2 old lsn 1/248 new lsn 1/248 type XFS_LI_INODE flags IN_AIL
xfsaild/dm-4-1034 [000] .... 224.797919: xfsaild: 588: idle ->ail_target 0x1000000fa
kworker/3:1H-404 [003] ...2 225.841198: xfs_ail_insert: dev 253:4 lip 000000006845aeed old lsn 0/0 new lsn 1/250 type XFS_LI_INODE flags IN_AIL
kworker/3:1-39 [003] ...1 254.962822: xfs_log_force: dev 253:4 lsn 0x0 caller xfs_log_worker+0x2f/0xf0 [xfs]
...
kworker/3:2-1920 [003] ...1 3759.291275: xfs_log_force: dev 253:4 lsn 0x0 caller xfs_log_worker+0x2f/0xf0 [xfs]

# cat /sys/fs/xfs/dm-4/log/log_*lsn
1:252
1:250

This instance of the race uses the same serialization instrumentation to
control execution timing and whatnot as before (i.e. no functional
changes). First, an item is inserted into the AIL. Immediately after AIL
insertion, another transaction commits to the CIL (not shown in the
trace). The background log worker comes around a few seconds later and
forces the log/CIL. The checkpoint for this log force races with an AIL
delete and idle (same as before). AIL insertion occurs at the push
target xfsaild just idled at, but this time reservation pressure
relieves and the filesystem goes idle.

At this point, nothing occurs on the fs except for continuous background
log worker jobs.
Note the timestamp difference between the first
post-race log force and the last in the trace. The log worker runs at
the default 30s interval and has run repeatedly for almost an hour while
failing to push the AIL and subsequently cover the log. To confirm the
AIL is populated, see the log head/tail LSNs reported via sysfs. This
state persists indefinitely so long as the fs is idle. This is a bug.

Brian

> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com
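The numbers in this trace can be cross-checked with a little arithmetic. The sketch below assumes that a 64-bit LSN carries the log cycle in its upper 32 bits and the block number in its lower 32 bits, which is consistent with the `threshold_lsn 0x1000000fa` and `new lsn 1/250` pairs shown above. The helper names `lsn_cycle()`, `lsn_block()` and `worker_intervals()` are hypothetical and exist only for this illustration.

```c
#include <stdint.h>

/* Upper 32 bits of a 64-bit LSN: the log cycle number (assumed layout). */
static uint32_t lsn_cycle(uint64_t lsn)
{
	return (uint32_t)(lsn >> 32);
}

/* Lower 32 bits of a 64-bit LSN: the block within the log (assumed layout). */
static uint32_t lsn_block(uint64_t lsn)
{
	return (uint32_t)(lsn & 0xffffffffu);
}

/*
 * Number of whole log worker periods that fit between two trace
 * timestamps. All arguments are in seconds.
 */
static unsigned int worker_intervals(double first, double last, double period)
{
	return (unsigned int)((last - first) / period);
}
```

With these helpers, `0x1000000fa` decodes to cycle 1, block 250, and `0x1000032e4` from the earlier trace decodes to cycle 1, block 13028, matching the `xfs_ail_insert` events. The gap between the log forces at 254.962822 and 3759.291275 is about 3504 seconds, roughly 58 minutes, or 116 full 30s intervals.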
On Thu, Sep 12, 2019 at 09:46:06AM -0400, Brian Foster wrote: > On Wed, Sep 11, 2019 at 09:38:58AM +1000, Dave Chinner wrote: > > On Tue, Sep 10, 2019 at 05:56:28AM -0400, Brian Foster wrote: > > > On Mon, Sep 09, 2019 at 09:26:32AM +1000, Dave Chinner wrote: > > > > On Sat, Sep 07, 2019 at 11:10:50AM -0400, Brian Foster wrote: > > > > > This is an instance of xfsaild going idle between the time this > > > > > new AIL push sets the target based on the iclog about to be > > > > > committed and AIL insertion of the associated log items, > > > > > reproduced via a bit of timing instrumentation. Don't be > > > > > distracted by the timestamps or the fact that the LSNs do not > > > > > match because the log items in the AIL end up indexed by the start > > > > > lsn of the CIL checkpoint (whereas last_sync_lsn refers to the > > > > > commit record). The point is simply that xfsaild has completed a > > > > > push of a target that hasn't been inserted yet. > > > > > > > > AFAICT, what you are showing requires delaying of the CIL push to the > > > > point it violates a fundamental assumption about commit sizes, which > > > > is why I largely think it's irrelevant. > > > > > > > > > > The CIL checkpoint size is an unrelated side effect of the test I > > > happened to use, not a fundamental cause of the problem it demonstrates. > > > Fixing CIL checkpoint size issues won't change anything. Here's a > > > different variant of the same problem with a small enough number of log > > > items such that background CIL pushing is not a factor: > > > > > > <...>-79670 [000] ...1 56126.015522: xfs_log_force: dev 253:4 lsn 0x0 caller xfs_log_worker+0x2f/0xf0 [xfs] > > > kworker/0:1H-220 [000] ...1 56126.030587: __xlog_grant_push_ail: 1596: threshold_lsn 0x1000032e4 > > > ... > > > <...>-81293 [000] ...2 56126.032647: xfs_ail_delete: dev 253:4 lip 00000000cbe82125 old lsn 1/13026 new lsn 1/13026 type XFS_LI_INODE flags IN_AIL > > > <...>-81633 [000] .... 
56126.053544: xfsaild: 588: idle ->ail_target 0x1000032e4 > > > kworker/0:1H-220 [000] ...2 56127.038835: xfs_ail_insert: dev 253:4 lip 00000000a44ab1ef old lsn 0/0 new lsn 1/13028 type XFS_LI_INODE flags IN_AIL > > > kworker/0:1H-220 [000] ...2 56127.038911: xfs_ail_insert: dev 253:4 lip 0000000028d2061f old lsn 0/0 new lsn 1/13028 type XFS_LI_INODE flags IN_AIL > > > .... > > > > > > This sequence starts with one log item in the AIL and some number of > > > items in the CIL such that a checkpoint executes from the background log > > > worker. The worker forces the CIL and log I/O completion issues an AIL > > > push that is truncated by the recently updated ->l_last_sync_lsn due to > > > outstanding transaction reservation and small AIL size. This push races > > > with completion of a previous push that empties the AIL and iclog > > > callbacks insert log items for the current checkpoint at the LSN target > > > xfsaild just idled at. > > > > I'm just not seeing what the problem here is. The behaviour you are > > describing has been around since day zero and doesn't require the > > addition of an ail push from iclog completion to trigger. Prior to > > this series, it would be: > > > > A few days ago you said that if we're inserting log items before the > push target, "something is very wrong." Since this was what I was > concerned about, I attempted to manufacture the issue to demonstrate. > You suggested the first reproducer I came up with was a separate problem > (related to CIL size issues), so I came up with the one above to avoid > that distraction. Now you're telling me this has always happened and is > fine.. > > While I don't think this is quite accurate (more below), I do find this > reasoning somewhat amusing in that it essentially implies that this > patch itself is dubious. If this new AIL push is required to fix a real > issue, and this race is essentially manifest as implied, then this patch > can't possibly reliably fix the original problem. 
Anyways, that is > neither here nor there.. > > All of the details of this particular issue aside, I do think there's a > development process problem here. It shouldn't require an extended game > of whack-a-mole with this kind of inconsistent reasoning just to request > a trivial change to a patch (you also implied in a previous response it > was me wasting your time on this topic) that closes an obvious race and > otherwise has no negative effect. Someone is being unreasonable here and > I don't think it's me. More importantly, discussion of open issues > shouldn't be a race against the associated patch being merged. :/ > > > process 1 reservation log completion xfsaild > > <completes metadata IO> > > xfs_ail_delete() > > mlip_changed > > xlog_assign_tail_lsn_locked() > > ail empty, sets l_last_sync = 0x1000032e2 > > xfs_log_space_wake() > > xlog_state_do_callback > > sets CALLBACK > > sets last_sync_lsn to iclog head > > -> 0x1000032e4 > > <drops icloglock, gets preempted> > > <wakes> > > xlog_grant_head_wait > > free_bytes < need_bytes > > xlog_grant_push_ail() > > xlog_push_ail() > > ->ail_target 0x1000032e4 > > <sleeps> > > <wakes> > > sets prev target to 0x1000032e4 > > sees empty AIL > > <sleeps> > > <runs again> > > runs callbacks > > xfs_ail_insert(lsn = 0x1000032e4) > > > > and now we have the AIL push thread asleep with items in it at the > > push threshold. IOWs, what you describe has always been possible, > > and before the CIL was introduced this sort of thing happened quite > > a bit because iclog completions freed up much less space in the log > > than a CIL commit completion. > > > > I was suspicious that this could occur prior to this change but I hadn't > confirmed. The scenario documented above cannot occur because a push on > an empty AIL has no effect. The target doesn't move and the task isn't > woken. That said, I still suspect the race can occur with the current > code via between a grant head waiter, AIL emptying and iclog completion. 
> > This just speaks to the frequency of the problem, though. I'm not > convinced it's something that happens "quite a bit" given the nature of > the 3-way race. I also don't agree that existence of a historical > problem somehow excuses introduction a new variant of the same problem. > Instead, if this patch exposes a historical problem that simply had no > noticeable impact to this point, we should probably look into whether it > needs fixing too. > > > It's not a problem, however, because if we are out of transaction > > reservation space we must have transactions in progress, and as long > > as they make progress then the commit of each transaction will end > > up calling xlog_ungrant_log_space() to return the unused portion of > > the transaction reservation. That calls xfs_log_space_wake() to > > allow reservation waiters to try to make progress. > > > > Yes, this is why I don't see immediate side effects in the tests I've > run so far. The assumptions you're basing this off are not always true, > however. Particularly on smaller (<= 1GB) filesystems, it's relatively > easy to produce conditions where the entire reservation space is > consumed by open transactions that don't ultimately commit anything to > the log subsystem and thus generate no forward progress. > > > If there's still not enough space reservation after the transaction > > in progress has released it's reservation, then it goes back to > > sleep. As long as we have active transactions in progress while > > there are transaction reservations waiting on reservation space, > > there will be a wakeup vector for the reservation independent of > > the CIL, iclogs and AIL behaviour. > > > > We do have clean transaction cancel and error scenarios, existing log > deadlock vectors, increasing reliance on long running transactions via > deferred ops, scrub, etc. Also consider the fact that open transactions > consume considerably more reservation than committed transactions on > average. 
> > I'm not saying it's likely for a real world workload to consume the > entirety of log reservation space via open transactions and then release > it without filesystem modification (and then race with log I/O and AIL > emptying), but from the perspective of proving the existence of a bug > it's really not that difficult to produce. I've not seen a real world > workload that reproduces the problems fixed by any of these patches > either, but we still fix them. > > > [ Yes, there was a bug here, in the case xfs_log_space_wake() did > > not issue a wakeup because of not enough space being availble and > > the push target was limited by the old log head location. i.e. > > nothing ever updated the push target to reflect the new log head and > > so the tail might never get moved now. That particular bug was fixed > > by a an earlier patch in the series, so we can ignore it here. ] > > > > IOWs, if the AIL is empty, the CIL cannot consume more than 25% of > > the log space, and we have transactions waiting on log reservation > > space, then we must have enough transactions in progress to cover at > > least 75% of the log space. Completion of those transactions will > > wake waiters and, if necessary, push the AIL again to keep the log > > tail moving appropriately. This handles the AIL empty and "insert > > before target" situations you are concerned about just fine, as long > > as we have a guarantee of forwards progress. Bounding the CIL size > > provides that forwards progress guarantee for the CIL... > > > > I think you have some tunnel vision or something going on here with > regard to the higher level architectural view of how things are supposed > to operate in a normal running/steady state vs simply what can and > cannot happen in the code. I can't really tell why/how, but the only > suggestion I can make is to perhaps separate from this high level view > of things and take a closer look at the code. 
This is a simple code bug, > not some grand architectural flaw. The context here is way out of whack. > The repeated unrelated and overblown architectural assertions come off > as indication of lack of any real argument to allow this race to live. > There is simply no such guarantee of forward progress in all scenarios > that produce the conditions that can cause this race. > > Yet another example: > > <...>-369 [002] ...2 220.055746: xfs_ail_insert: dev 253:4 lip 00000000ddb123f2 old lsn 0/0 new lsn 1/248 type XFS_LI_INODE flags IN_AIL > <...>-27 [003] ...1 224.753110: xfs_log_force: dev 253:4 lsn 0x0 caller xfs_log_worker+0x2f/0xf0 [xfs] > <...>-404 [003] ...1 224.775551: __xlog_grant_push_ail: 1596: threshold_lsn 0x1000000fa > kworker/3:1-39 [003] ...2 224.777953: xfs_ail_delete: dev 253:4 lip 00000000ddb123f2 old lsn 1/248 new lsn 1/248 type XFS_LI_INODE flags IN_AIL > xfsaild/dm-4-1034 [000] .... 224.797919: xfsaild: 588: idle ->ail_target 0x1000000fa > kworker/3:1H-404 [003] ...2 225.841198: xfs_ail_insert: dev 253:4 lip 000000006845aeed old lsn 0/0 new lsn 1/250 type XFS_LI_INODE flags IN_AIL > kworker/3:1-39 [003] ...1 254.962822: xfs_log_force: dev 253:4 lsn 0x0 caller xfs_log_worker+0x2f/0xf0 [xfs] > ... > kworker/3:2-1920 [003] ...1 3759.291275: xfs_log_force: dev 253:4 lsn 0x0 caller xfs_log_worker+0x2f/0xf0 [xfs] > > > # cat /sys/fs/xfs/dm-4/log/log_*lsn > 1:252 > 1:250 > > This instance of the race uses the same serialization instrumentation to > control execution timing and whatnot as before (i.e. no functional > changes). First, an item is inserted into the AIL. Immediately after AIL > insertion, another transaction commits to the CIL (not shown in the > trace). The background log worker comes around a few seconds later and > forces the log/CIL. The checkpoint for this log force races with an AIL > delete and idle (same as before). 
> AIL insertion occurs at the push
> target xfsaild just idled at, but this time reservation pressure
> relieves and the filesystem goes idle.
>
> At this point, nothing occurs on the fs except for continuous background
> log worker jobs. Note the timestamp difference between the first
> post-race log force and the last in the trace. The log worker runs at
> the default 30s interval and has run repeatedly for almost an hour while
> failing to push the AIL and subsequently cover the log. To confirm the
> AIL is populated, see the log head/tail LSNs reported via sysfs. This
> state persists indefinitely so long as the fs is idle. This is a bug.

/me stumbles back in after ~2wks, and has a few questions:

1) Are these concerns a reason to hold up this series, or are they a
separate bug lurking in the code being touched by the series? AFAICT I
think it's the second, but <shrug> my brain is still mush.

2) Er... how do you get the log stuck like this? I see things earlier in
the thread like "open transactions that don't ultimately commit anything
to the log subsystem" and think "OH, you mean xfs_scrub!"

--D

> Brian
>
> > Cheers,
> >
> > Dave.
> > --
> > Dave Chinner
> > david@fromorbit.com
On Mon, Sep 16, 2019 at 09:31:56PM -0700, Darrick J. Wong wrote:
> On Thu, Sep 12, 2019 at 09:46:06AM -0400, Brian Foster wrote:
> > On Wed, Sep 11, 2019 at 09:38:58AM +1000, Dave Chinner wrote:
> > > On Tue, Sep 10, 2019 at 05:56:28AM -0400, Brian Foster wrote:
> > > > On Mon, Sep 09, 2019 at 09:26:32AM +1000, Dave Chinner wrote:
> > > > > On Sat, Sep 07, 2019 at 11:10:50AM -0400, Brian Foster wrote:
> > > > > > This is an instance of xfsaild going idle between the time this
> > > > > > new AIL push sets the target based on the iclog about to be
> > > > > > committed and AIL insertion of the associated log items,
> > > > > > reproduced via a bit of timing instrumentation. Don't be
> > > > > > distracted by the timestamps or the fact that the LSNs do not
> > > > > > match because the log items in the AIL end up indexed by the start
> > > > > > lsn of the CIL checkpoint (whereas last_sync_lsn refers to the
> > > > > > commit record). The point is simply that xfsaild has completed a
> > > > > > push of a target that hasn't been inserted yet.
> > > > >
> > > > > AFAICT, what you are showing requires delaying of the CIL push to the
> > > > > point it violates a fundamental assumption about commit sizes, which
> > > > > is why I largely think it's irrelevant.
> > > > >
> > > >
> > > > The CIL checkpoint size is an unrelated side effect of the test I
> > > > happened to use, not a fundamental cause of the problem it demonstrates.
> > > > Fixing CIL checkpoint size issues won't change anything. Here's a
> > > > different variant of the same problem with a small enough number of log
> > > > items such that background CIL pushing is not a factor:
> > > >
> > > > <...>-79670 [000] ...1 56126.015522: xfs_log_force: dev 253:4 lsn 0x0 caller xfs_log_worker+0x2f/0xf0 [xfs]
> > > > kworker/0:1H-220 [000] ...1 56126.030587: __xlog_grant_push_ail: 1596: threshold_lsn 0x1000032e4
> > > > ...
> > > > <...>-81293 [000] ...2 56126.032647: xfs_ail_delete: dev 253:4 lip 00000000cbe82125 old lsn 1/13026 new lsn 1/13026 type XFS_LI_INODE flags IN_AIL
> > > > <...>-81633 [000] .... 56126.053544: xfsaild: 588: idle ->ail_target 0x1000032e4
> > > > kworker/0:1H-220 [000] ...2 56127.038835: xfs_ail_insert: dev 253:4 lip 00000000a44ab1ef old lsn 0/0 new lsn 1/13028 type XFS_LI_INODE flags IN_AIL
> > > > kworker/0:1H-220 [000] ...2 56127.038911: xfs_ail_insert: dev 253:4 lip 0000000028d2061f old lsn 0/0 new lsn 1/13028 type XFS_LI_INODE flags IN_AIL
> > > > ....
> > > >
> > > > This sequence starts with one log item in the AIL and some number of
> > > > items in the CIL such that a checkpoint executes from the background log
> > > > worker. The worker forces the CIL and log I/O completion issues an AIL
> > > > push that is truncated by the recently updated ->l_last_sync_lsn due to
> > > > outstanding transaction reservation and small AIL size. This push races
> > > > with completion of a previous push that empties the AIL and iclog
> > > > callbacks insert log items for the current checkpoint at the LSN target
> > > > xfsaild just idled at.
> > > >
> > > I'm just not seeing what the problem here is. The behaviour you are
> > > describing has been around since day zero and doesn't require the
> > > addition of an ail push from iclog completion to trigger. Prior to
> > > this series, it would be:
> > >
> >
> > A few days ago you said that if we're inserting log items before the
> > push target, "something is very wrong." Since this was what I was
> > concerned about, I attempted to manufacture the issue to demonstrate.
> > You suggested the first reproducer I came up with was a separate problem
> > (related to CIL size issues), so I came up with the one above to avoid
> > that distraction. Now you're telling me this has always happened and is
> > fine..
> >
> > While I don't think this is quite accurate (more below), I do find this
> > reasoning somewhat amusing in that it essentially implies that this
> > patch itself is dubious. If this new AIL push is required to fix a real
> > issue, and this race is essentially manifest as implied, then this patch
> > can't possibly reliably fix the original problem. Anyways, that is
> > neither here nor there..
> >
> > All of the details of this particular issue aside, I do think there's a
> > development process problem here. It shouldn't require an extended game
> > of whack-a-mole with this kind of inconsistent reasoning just to request
> > a trivial change to a patch (you also implied in a previous response it
> > was me wasting your time on this topic) that closes an obvious race and
> > otherwise has no negative effect. Someone is being unreasonable here and
> > I don't think it's me. More importantly, discussion of open issues
> > shouldn't be a race against the associated patch being merged. :/
> >
> > > process 1       reservation     log completion          xfsaild
> > > <completes metadata IO>
> > >   xfs_ail_delete()
> > >     mlip_changed
> > >     xlog_assign_tail_lsn_locked()
> > >       ail empty, sets l_last_sync = 0x1000032e2
> > >     xfs_log_space_wake()
> > >                                 xlog_state_do_callback
> > >                                   sets CALLBACK
> > >                                   sets last_sync_lsn to iclog head
> > >                                     -> 0x1000032e4
> > >                                   <drops icloglock, gets preempted>
> > >                 <wakes>
> > >                 xlog_grant_head_wait
> > >                   free_bytes < need_bytes
> > >                     xlog_grant_push_ail()
> > >                       xlog_push_ail()
> > >                         ->ail_target 0x1000032e4
> > >                 <sleeps>
> > >                                                         <wakes>
> > >                                                         sets prev target to 0x1000032e4
> > >                                                         sees empty AIL
> > >                                                         <sleeps>
> > >                                 <runs again>
> > >                                 runs callbacks
> > >                                   xfs_ail_insert(lsn = 0x1000032e4)
> > >
> > > and now we have the AIL push thread asleep with items in it at the
> > > push threshold.
> > > IOWs, what you describe has always been possible,
> > > and before the CIL was introduced this sort of thing happened quite
> > > a bit because iclog completions freed up much less space in the log
> > > than a CIL commit completion.
> > >
> >
> > I was suspicious that this could occur prior to this change but I hadn't
> > confirmed. The scenario documented above cannot occur because a push on
> > an empty AIL has no effect. The target doesn't move and the task isn't
> > woken. That said, I still suspect the race can occur with the current
> > code between a grant head waiter, AIL emptying and iclog completion.
> >
> > This just speaks to the frequency of the problem, though. I'm not
> > convinced it's something that happens "quite a bit" given the nature of
> > the 3-way race. I also don't agree that existence of a historical
> > problem somehow excuses introduction of a new variant of the same problem.
> > Instead, if this patch exposes a historical problem that simply had no
> > noticeable impact to this point, we should probably look into whether it
> > needs fixing too.
> >
> > > It's not a problem, however, because if we are out of transaction
> > > reservation space we must have transactions in progress, and as long
> > > as they make progress then the commit of each transaction will end
> > > up calling xlog_ungrant_log_space() to return the unused portion of
> > > the transaction reservation. That calls xfs_log_space_wake() to
> > > allow reservation waiters to try to make progress.
> > >
> >
> > Yes, this is why I don't see immediate side effects in the tests I've
> > run so far. The assumptions you're basing this off are not always true,
> > however. Particularly on smaller (<= 1GB) filesystems, it's relatively
> > easy to produce conditions where the entire reservation space is
> > consumed by open transactions that don't ultimately commit anything to
> > the log subsystem and thus generate no forward progress.
> >
> > > If there's still not enough space reservation after the transaction
> > > in progress has released its reservation, then it goes back to
> > > sleep. As long as we have active transactions in progress while
> > > there are transaction reservations waiting on reservation space,
> > > there will be a wakeup vector for the reservation independent of
> > > the CIL, iclogs and AIL behaviour.
> > >
> >
> > We do have clean transaction cancel and error scenarios, existing log
> > deadlock vectors, increasing reliance on long running transactions via
> > deferred ops, scrub, etc. Also consider the fact that open transactions
> > consume considerably more reservation than committed transactions on
> > average.
> >
> > I'm not saying it's likely for a real world workload to consume the
> > entirety of log reservation space via open transactions and then release
> > it without filesystem modification (and then race with log I/O and AIL
> > emptying), but from the perspective of proving the existence of a bug
> > it's really not that difficult to produce. I've not seen a real world
> > workload that reproduces the problems fixed by any of these patches
> > either, but we still fix them.
> >
> > > [ Yes, there was a bug here, in the case xfs_log_space_wake() did
> > > not issue a wakeup because of not enough space being available and
> > > the push target was limited by the old log head location. i.e.
> > > nothing ever updated the push target to reflect the new log head and
> > > so the tail might never get moved now. That particular bug was fixed
> > > by an earlier patch in the series, so we can ignore it here. ]
> > >
> > > IOWs, if the AIL is empty, the CIL cannot consume more than 25% of
> > > the log space, and we have transactions waiting on log reservation
> > > space, then we must have enough transactions in progress to cover at
> > > least 75% of the log space.
> > > Completion of those transactions will
> > > wake waiters and, if necessary, push the AIL again to keep the log
> > > tail moving appropriately. This handles the AIL empty and "insert
> > > before target" situations you are concerned about just fine, as long
> > > as we have a guarantee of forwards progress. Bounding the CIL size
> > > provides that forwards progress guarantee for the CIL...
> > >
> >
> > I think you have some tunnel vision or something going on here with
> > regard to the higher level architectural view of how things are supposed
> > to operate in a normal running/steady state vs simply what can and
> > cannot happen in the code. I can't really tell why/how, but the only
> > suggestion I can make is to perhaps separate from this high level view
> > of things and take a closer look at the code. This is a simple code bug,
> > not some grand architectural flaw. The context here is way out of whack.
> > The repeated unrelated and overblown architectural assertions come off
> > as indication of lack of any real argument to allow this race to live.
> > There is simply no such guarantee of forward progress in all scenarios
> > that produce the conditions that can cause this race.
> >
> > Yet another example:
> >
> > <...>-369 [002] ...2 220.055746: xfs_ail_insert: dev 253:4 lip 00000000ddb123f2 old lsn 0/0 new lsn 1/248 type XFS_LI_INODE flags IN_AIL
> > <...>-27 [003] ...1 224.753110: xfs_log_force: dev 253:4 lsn 0x0 caller xfs_log_worker+0x2f/0xf0 [xfs]
> > <...>-404 [003] ...1 224.775551: __xlog_grant_push_ail: 1596: threshold_lsn 0x1000000fa
> > kworker/3:1-39 [003] ...2 224.777953: xfs_ail_delete: dev 253:4 lip 00000000ddb123f2 old lsn 1/248 new lsn 1/248 type XFS_LI_INODE flags IN_AIL
> > xfsaild/dm-4-1034 [000] .... 224.797919: xfsaild: 588: idle ->ail_target 0x1000000fa
> > kworker/3:1H-404 [003] ...2 225.841198: xfs_ail_insert: dev 253:4 lip 000000006845aeed old lsn 0/0 new lsn 1/250 type XFS_LI_INODE flags IN_AIL
> > kworker/3:1-39 [003] ...1 254.962822: xfs_log_force: dev 253:4 lsn 0x0 caller xfs_log_worker+0x2f/0xf0 [xfs]
> > ...
> > kworker/3:2-1920 [003] ...1 3759.291275: xfs_log_force: dev 253:4 lsn 0x0 caller xfs_log_worker+0x2f/0xf0 [xfs]
> >
> >
> > # cat /sys/fs/xfs/dm-4/log/log_*lsn
> > 1:252
> > 1:250
> >
> > This instance of the race uses the same serialization instrumentation to
> > control execution timing and whatnot as before (i.e. no functional
> > changes). First, an item is inserted into the AIL. Immediately after AIL
> > insertion, another transaction commits to the CIL (not shown in the
> > trace). The background log worker comes around a few seconds later and
> > forces the log/CIL. The checkpoint for this log force races with an AIL
> > delete and idle (same as before). AIL insertion occurs at the push
> > target xfsaild just idled at, but this time reservation pressure
> > relieves and the filesystem goes idle.
> >
> > At this point, nothing occurs on the fs except for continuous background
> > log worker jobs. Note the timestamp difference between the first
> > post-race log force and the last in the trace. The log worker runs at
> > the default 30s interval and has run repeatedly for almost an hour while
> > failing to push the AIL and subsequently cover the log. To confirm the
> > AIL is populated, see the log head/tail LSNs reported via sysfs. This
> > state persists indefinitely so long as the fs is idle. This is a bug.
>
> /me stumbles back in after ~2wks, and has a few questions:
>

Heh, welcome back.. ;)

> 1) Are these concerns a reason to hold up this series, or are they a
> separate bug lurking in the code being touched by the series? AFAICT I
> think it's the second, but <shrug> my brain is still mush.
>

A little of both I guess. To Dave's earlier point, I think this
technically can happen in the existing code as a 3-way race between the
aforementioned tasks (just not the way it was described). OTOH, I'm not
sure what this has to do with the fact that the new code being added is
racy on its own (or since when discovery of some old bug justifies
adding new ones..?). The examples shown above are fundamentally races
between log I/O completion and xfsaild. The last one shows the log
remain uncovered indefinitely on an idle fs (which is not a corruption
or anything, but certainly a bug) simply because that's the easiest side
effect to reproduce. I'm fairly confident at this point that one could
be manufactured into a similar log deadlock if we really wanted to try,
but that would be much more difficult and TBH I'm tired of burning
myself out on these kinds of objections to obvious and easily addressed
landmines. How likely is it that somebody would hit these problems?
Probably highly unlikely. How likely is it somebody would hit this
problem before whatever problem this patch fixes? *shrug*

I don't think it's a reason to hold up the series, but at the same time
this patch is unrelated to the original problem. IIRC, it fell out of
some other issue reproduced with a different experimental hack/fix (that
was eventually replaced) to the original problem. FWIW, I'm annoyed with
the lazy approach to review here more than anything. In hindsight, if I
knew the feedback was going to be dismissed and the patchset rolled
forward and merged, perhaps I should have just nacked the subsequent
reposts to make the objection clear.

I dunno, not my call on what to do with it now. Feel free to add my
Nacked-by: to the upstream commit I guess so I at least remember this
when/if considering whether to backport it anywhere. :/

> 2) Er... how do you get the log stuck like this? I see things earlier
> in the thread like "open transactions that don't ultimately commit
> anything to the log subsystem" and think "OH, you mean xfs_scrub!"
>

That's one thing I was thinking about but I didn't end up looking into
it (does scrub actually acquire log reservation?). For a simpler
example, consider a bunch of threads running into quota block allocation
failures where a system is also under memory pressure. On filesystems
with smaller logs, it only takes a handful of such threads to bash the
reservation grant head against the log tail even though the log is empty
(and doing so without ever committing anything to the log).

Note that this by itself isn't what gets the log "stuck" in the most
recent example (note: not deadlocked), but rather if we're in a state
where the grant head is close enough to the log head (such that we AIL
push the items associated with the current checkpoint before it inserts)
when log I/O completion happens to race with AIL emptying as described.

Brian

> --D
>
> > Brian
> >
> > > Cheers,
> > >
> > > Dave.
> > > --
> > > Dave Chinner
> > > david@fromorbit.com
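The ordering shown in the last trace can be replayed outside the kernel. What follows is a minimal userspace sketch, not XFS code: every `toy_*` name is made up for illustration, and the pusher is simplified so that it drops items directly instead of issuing writeback. It assumes only the two behaviours stated in the thread, namely that a push on an empty AIL (or one that does not advance the target) has no effect and wakes nobody, and that inserting an item does not by itself wake the pusher.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy model only: none of these names are XFS kernel symbols. */
typedef uint64_t toy_lsn_t;

#define TOY_AIL_MAX 8

struct toy_ail {
	toy_lsn_t items[TOY_AIL_MAX];	/* LSNs of items sitting in the AIL */
	size_t    nr_items;
	toy_lsn_t target;		/* highest LSN a push was requested for */
	bool      pusher_awake;
};

/* Pack cycle/block the way the traces print them: 1/250 == 0x1000000fa. */
static toy_lsn_t toy_lsn(uint32_t cycle, uint32_t block)
{
	return ((toy_lsn_t)cycle << 32) | block;
}

/* Assumption: insertion never wakes the pusher. */
static void toy_ail_insert(struct toy_ail *ail, toy_lsn_t lsn)
{
	assert(ail->nr_items < TOY_AIL_MAX);
	ail->items[ail->nr_items++] = lsn;
}

/*
 * Request a push up to threshold. A push on an empty AIL, or one that
 * does not move the target forward, is a no-op and wakes nobody.
 */
static bool toy_ail_push(struct toy_ail *ail, toy_lsn_t threshold)
{
	if (ail->nr_items == 0 || threshold <= ail->target)
		return false;
	ail->target = threshold;
	ail->pusher_awake = true;
	return true;
}

/* One pass of the pusher: clean everything at or below target, then idle. */
static void toy_pusher_run(struct toy_ail *ail)
{
	size_t kept = 0;

	for (size_t i = 0; i < ail->nr_items; i++)
		if (ail->items[i] > ail->target)
			ail->items[kept++] = ail->items[i];
	ail->nr_items = kept;
	ail->pusher_awake = false;
}

/* Replay the ordering from the trace; returns the number of stranded items. */
size_t toy_replay_race(void)
{
	struct toy_ail ail = {0};

	toy_ail_insert(&ail, toy_lsn(1, 248));	/* one item already in the AIL */
	toy_ail_push(&ail, toy_lsn(1, 250));	/* log I/O completion sets the target */
	toy_pusher_run(&ail);			/* AIL empties, pusher idles at target */
	toy_ail_insert(&ail, toy_lsn(1, 250));	/* callbacks insert at the idled target */

	/* Later pushes that reach no further than the old target do nothing. */
	for (int i = 0; i < 100; i++)
		toy_ail_push(&ail, toy_lsn(1, 250));

	return ail.nr_items;
}
```

In this model the item inserted at 1/250 stays in the AIL until something requests a push beyond 0x1000000fa, which matches the idle filesystem in the trace where the tail stays at 1:250. The model says nothing about how likely the interleaving is on a real system.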
On Tue, Sep 17, 2019 at 08:48:27AM -0400, Brian Foster wrote:
> On Mon, Sep 16, 2019 at 09:31:56PM -0700, Darrick J. Wong wrote:
> > /me stumbles back in after ~2wks, and has a few questions:
> >
>
> Heh, welcome back.. ;)
>
> > 1) Are these concerns a reason to hold up this series, or are they a
> > separate bug lurking in the code being touched by the series? AFAICT I
> > think it's the second, but <shrug> my brain is still mush.
> >
>
> A little of both I guess. To Dave's earlier point, I think this
> technically can happen in the existing code as a 3-way race between the
> aforementioned tasks (just not the way it was described). OTOH, I'm not
> sure what this has to do with the fact that the new code being added is
> racy on its own (or since when discovery of some old bug justifies
> adding new ones..?). The examples shown above are fundamentally races
> between log I/O completion and xfsaild. The last one shows the log
> remain uncovered indefinitely on an idle fs (which is not a corruption
> or anything, but certainly a bug) simply because that's the easiest side
> effect to reproduce.
> I'm fairly confident at this point that one could
> be manufactured into a similar log deadlock if we really wanted to try,
> but that would be much more difficult and TBH I'm tired of burning
> myself out on these kinds of objections to obvious and easily addressed
> landmines. How likely is it that somebody would hit these problems?
> Probably highly unlikely. How likely is it somebody would hit this
> problem before whatever problem this patch fixes? *shrug*
>
> I don't think it's a reason to hold up the series, but at the same time
> this patch is unrelated to the original problem. IIRC, it fell out of
> some other issue reproduced with a different experimental hack/fix (that
> was eventually replaced) to the original problem. FWIW, I'm annoyed with
> the lazy approach to review here more than anything. In hindsight, if I
> knew the feedback was going to be dismissed and the patchset rolled
> forward and merged, perhaps I should have just nacked the subsequent
> reposts to make the objection clear.

:( I'm sorry you feel that way. I myself don't feel that my own handling
of this merge window has been good, between feeling pressured to get the
branches ready to go before vacation and for-next becoming intermittent
right around the same time. Those both decrease my certainty about
what's going in the next merge and increase my own anxieties, and it
becomes a competition in my head between "I can add it now and revert it
later as a regression fix" vs. "if I don't add it I'll wonder if it was
necessary".

Anyway, I /think/ the end result is that if the AIL gets stuck /and/ the
system goes down before it becomes unstuck, then there'll be more work
for log recovery to do, because we failed to checkpoint everything that
we possibly could have before the crash?

So AFAICT it's not a critical disaster bug but I would like to study
this situation some more, particularly now that we have ~2mos for
stabilizing things.

> I dunno, not my call on what to do with it now. Feel free to add my
> Nacked-by: to the upstream commit I guess so I at least remember this
> when/if considering whether to backport it anywhere. :/

(/me continues to wish there was an easy way to add tagging to a commit,
particularly when it comes well after the fact.)

> > 2) Er... how do you get the log stuck like this? I see things earlier
> > in the thread like "open transactions that don't ultimately commit
> > anything to the log subsystem" and think "OH, you mean xfs_scrub!"
> >
>
> That's one thing I was thinking about but I didn't end up looking into
> it (does scrub actually acquire log reservation?).

If you invoke the scrub ioctl with IFLAG_REPAIR set, it allocates a
non-empty transaction (itruncate, iirc) to do the check and rebuild the
data structure. If the item is ok then it'll cancel the transaction.

> For a simpler example, consider a bunch of threads running into quota
> block allocation failures where a system is also under memory pressure.
> On filesystems with smaller logs, it only takes a handful of such
> threads to bash the reservation grant head against the log tail even
> though the log is empty (and doing so without ever committing anything
> to the log).
>
> Note that this by itself isn't what gets the log "stuck" in the most
> recent example (note: not deadlocked), but rather if we're in a state
> where the grant head is close enough to the log head (such that we AIL
> push the items associated with the current checkpoint before it inserts)
> when log I/O completion happens to race with AIL emptying as described.

Hmm... I wonder if we could reproduce this by formatting a filesystem
with a small log; running a slow moving thread that touches a file once
per second (or something to generate a moderate amount of workload) and
monitors the log to watch its progress; and then kicking off dozens of
threads to invoke IFLAG_REPAIR scrubbers on some other non-corrupt part
of the filesystem?

--D

> Brian
>
> > --D
> >
> > > Brian
> > >
> > > > Cheers,
> > > >
> > > > Dave.
> > > > --
> > > > Dave Chinner
> > > > david@fromorbit.com
On Tue, Sep 24, 2019 at 10:16:09AM -0700, Darrick J. Wong wrote: > On Tue, Sep 17, 2019 at 08:48:27AM -0400, Brian Foster wrote: > > On Mon, Sep 16, 2019 at 09:31:56PM -0700, Darrick J. Wong wrote: > > > On Thu, Sep 12, 2019 at 09:46:06AM -0400, Brian Foster wrote: > > > > On Wed, Sep 11, 2019 at 09:38:58AM +1000, Dave Chinner wrote: > > > > > On Tue, Sep 10, 2019 at 05:56:28AM -0400, Brian Foster wrote: > > > > > > On Mon, Sep 09, 2019 at 09:26:32AM +1000, Dave Chinner wrote: > > > > > > > On Sat, Sep 07, 2019 at 11:10:50AM -0400, Brian Foster wrote: > > > > > > > > This is an instance of xfsaild going idle between the time this > > > > > > > > new AIL push sets the target based on the iclog about to be > > > > > > > > committed and AIL insertion of the associated log items, > > > > > > > > reproduced via a bit of timing instrumentation. Don't be > > > > > > > > distracted by the timestamps or the fact that the LSNs do not > > > > > > > > match because the log items in the AIL end up indexed by the start > > > > > > > > lsn of the CIL checkpoint (whereas last_sync_lsn refers to the > > > > > > > > commit record). The point is simply that xfsaild has completed a > > > > > > > > push of a target that hasn't been inserted yet. > > > > > > > > > > > > > > AFAICT, what you are showing requires delaying of the CIL push to the > > > > > > > point it violates a fundamental assumption about commit sizes, which > > > > > > > is why I largely think it's irrelevant. > > > > > > > > > > > > > > > > > > > The CIL checkpoint size is an unrelated side effect of the test I > > > > > > happened to use, not a fundamental cause of the problem it demonstrates. > > > > > > Fixing CIL checkpoint size issues won't change anything. 
> > > > > > Here's a
> > > > > > different variant of the same problem with a small enough number of log
> > > > > > items such that background CIL pushing is not a factor:
> > > > > >
> > > > > > <...>-79670 [000] ...1 56126.015522: xfs_log_force: dev 253:4 lsn 0x0 caller xfs_log_worker+0x2f/0xf0 [xfs]
> > > > > > kworker/0:1H-220 [000] ...1 56126.030587: __xlog_grant_push_ail: 1596: threshold_lsn 0x1000032e4
> > > > > > ...
> > > > > > <...>-81293 [000] ...2 56126.032647: xfs_ail_delete: dev 253:4 lip 00000000cbe82125 old lsn 1/13026 new lsn 1/13026 type XFS_LI_INODE flags IN_AIL
> > > > > > <...>-81633 [000] .... 56126.053544: xfsaild: 588: idle ->ail_target 0x1000032e4
> > > > > > kworker/0:1H-220 [000] ...2 56127.038835: xfs_ail_insert: dev 253:4 lip 00000000a44ab1ef old lsn 0/0 new lsn 1/13028 type XFS_LI_INODE flags IN_AIL
> > > > > > kworker/0:1H-220 [000] ...2 56127.038911: xfs_ail_insert: dev 253:4 lip 0000000028d2061f old lsn 0/0 new lsn 1/13028 type XFS_LI_INODE flags IN_AIL
> > > > > > ....
> > > > > >
> > > > > > This sequence starts with one log item in the AIL and some number of
> > > > > > items in the CIL such that a checkpoint executes from the background log
> > > > > > worker. The worker forces the CIL and log I/O completion issues an AIL
> > > > > > push that is truncated by the recently updated ->l_last_sync_lsn due to
> > > > > > outstanding transaction reservation and small AIL size. This push races
> > > > > > with completion of a previous push that empties the AIL and iclog
> > > > > > callbacks insert log items for the current checkpoint at the LSN target
> > > > > > xfsaild just idled at.
> > > > >
> > > > > I'm just not seeing what the problem here is. The behaviour you are
> > > > > describing has been around since day zero and doesn't require the
> > > > > addition of an ail push from iclog completion to trigger.
> > > > > Prior to this series, it would be:
> > > > >
> > > >
> > > > A few days ago you said that if we're inserting log items before the
> > > > push target, "something is very wrong." Since this was what I was
> > > > concerned about, I attempted to manufacture the issue to demonstrate.
> > > > You suggested the first reproducer I came up with was a separate problem
> > > > (related to CIL size issues), so I came up with the one above to avoid
> > > > that distraction. Now you're telling me this has always happened and is
> > > > fine..
> > > >
> > > > While I don't think this is quite accurate (more below), I do find this
> > > > reasoning somewhat amusing in that it essentially implies that this
> > > > patch itself is dubious. If this new AIL push is required to fix a real
> > > > issue, and this race is essentially manifest as implied, then this patch
> > > > can't possibly reliably fix the original problem. Anyways, that is
> > > > neither here nor there..
> > > >
> > > > All of the details of this particular issue aside, I do think there's a
> > > > development process problem here. It shouldn't require an extended game
> > > > of whack-a-mole with this kind of inconsistent reasoning just to request
> > > > a trivial change to a patch (you also implied in a previous response it
> > > > was me wasting your time on this topic) that closes an obvious race and
> > > > otherwise has no negative effect. Someone is being unreasonable here and
> > > > I don't think it's me. More importantly, discussion of open issues
> > > > shouldn't be a race against the associated patch being merged.
> > > > :/
> > > >
> > > > > process 1       reservation     log completion      xfsaild
> > > > > <completes metadata IO>
> > > > > xfs_ail_delete()
> > > > > mlip_changed
> > > > > xlog_assign_tail_lsn_locked()
> > > > > ail empty, sets l_last_sync = 0x1000032e2
> > > > > xfs_log_space_wake()
> > > > >                                 xlog_state_do_callback
> > > > >                                 sets CALLBACK
> > > > >                                 sets last_sync_lsn to iclog head
> > > > >                                 -> 0x1000032e4
> > > > >                                 <drops icloglock, gets preempted>
> > > > >                 <wakes>
> > > > >                 xlog_grant_head_wait
> > > > >                 free_bytes < need_bytes
> > > > >                 xlog_grant_push_ail()
> > > > >                 xlog_push_ail()
> > > > >                 ->ail_target 0x1000032e4
> > > > >                 <sleeps>
> > > > >                                                     <wakes>
> > > > >                                                     sets prev target to 0x1000032e4
> > > > >                                                     sees empty AIL
> > > > >                                                     <sleeps>
> > > > >                                 <runs again>
> > > > >                                 runs callbacks
> > > > >                                 xfs_ail_insert(lsn = 0x1000032e4)
> > > > >
> > > > > and now we have the AIL push thread asleep with items in it at the
> > > > > push threshold. IOWs, what you describe has always been possible,
> > > > > and before the CIL was introduced this sort of thing happened quite
> > > > > a bit because iclog completions freed up much less space in the log
> > > > > than a CIL commit completion.
> > > > >
> > > >
> > > > I was suspicious that this could occur prior to this change but I hadn't
> > > > confirmed. The scenario documented above cannot occur because a push on
> > > > an empty AIL has no effect. The target doesn't move and the task isn't
> > > > woken. That said, I still suspect the race can occur with the current
> > > > code via between a grant head waiter, AIL emptying and iclog completion.
> > > >
> > > > This just speaks to the frequency of the problem, though. I'm not
> > > > convinced it's something that happens "quite a bit" given the nature of
> > > > the 3-way race. I also don't agree that existence of a historical
> > > > problem somehow excuses introduction of a new variant of the same problem.
> > > > Instead, if this patch exposes a historical problem that simply had no
> > > > noticeable impact to this point, we should probably look into whether it
> > > > needs fixing too.
> > > >
> > > > > It's not a problem, however, because if we are out of transaction
> > > > > reservation space we must have transactions in progress, and as long
> > > > > as they make progress then the commit of each transaction will end
> > > > > up calling xlog_ungrant_log_space() to return the unused portion of
> > > > > the transaction reservation. That calls xfs_log_space_wake() to
> > > > > allow reservation waiters to try to make progress.
> > > > >
> > > >
> > > > Yes, this is why I don't see immediate side effects in the tests I've
> > > > run so far. The assumptions you're basing this off are not always true,
> > > > however. Particularly on smaller (<= 1GB) filesystems, it's relatively
> > > > easy to produce conditions where the entire reservation space is
> > > > consumed by open transactions that don't ultimately commit anything to
> > > > the log subsystem and thus generate no forward progress.
> > > >
> > > > > If there's still not enough space reservation after the transaction
> > > > > in progress has released its reservation, then it goes back to
> > > > > sleep. As long as we have active transactions in progress while
> > > > > there are transaction reservations waiting on reservation space,
> > > > > there will be a wakeup vector for the reservation independent of
> > > > > the CIL, iclogs and AIL behaviour.
> > > > >
> > > >
> > > > We do have clean transaction cancel and error scenarios, existing log
> > > > deadlock vectors, increasing reliance on long running transactions via
> > > > deferred ops, scrub, etc. Also consider the fact that open transactions
> > > > consume considerably more reservation than committed transactions on
> > > > average.
> > > >
> > > > I'm not saying it's likely for a real world workload to consume the
> > > > entirety of log reservation space via open transactions and then release
> > > > it without filesystem modification (and then race with log I/O and AIL
> > > > emptying), but from the perspective of proving the existence of a bug
> > > > it's really not that difficult to produce. I've not seen a real world
> > > > workload that reproduces the problems fixed by any of these patches
> > > > either, but we still fix them.
> > > >
> > > > > [ Yes, there was a bug here, in the case xfs_log_space_wake() did
> > > > > not issue a wakeup because of not enough space being available and
> > > > > the push target was limited by the old log head location. i.e.
> > > > > nothing ever updated the push target to reflect the new log head and
> > > > > so the tail might never get moved now. That particular bug was fixed
> > > > > by an earlier patch in the series, so we can ignore it here. ]
> > > > >
> > > > > IOWs, if the AIL is empty, the CIL cannot consume more than 25% of
> > > > > the log space, and we have transactions waiting on log reservation
> > > > > space, then we must have enough transactions in progress to cover at
> > > > > least 75% of the log space. Completion of those transactions will
> > > > > wake waiters and, if necessary, push the AIL again to keep the log
> > > > > tail moving appropriately. This handles the AIL empty and "insert
> > > > > before target" situations you are concerned about just fine, as long
> > > > > as we have a guarantee of forwards progress. Bounding the CIL size
> > > > > provides that forwards progress guarantee for the CIL...
> > > > >
> > > >
> > > > I think you have some tunnel vision or something going on here with
> > > > regard to the higher level architectural view of how things are supposed
> > > > to operate in a normal running/steady state vs simply what can and
> > > > cannot happen in the code.
> > > > I can't really tell why/how, but the only
> > > > suggestion I can make is to perhaps separate from this high level view
> > > > of things and take a closer look at the code. This is a simple code bug,
> > > > not some grand architectural flaw. The context here is way out of whack.
> > > > The repeated unrelated and overblown architectural assertions come off
> > > > as indication of lack of any real argument to allow this race to live.
> > > > There is simply no such guarantee of forward progress in all scenarios
> > > > that produce the conditions that can cause this race.
> > > >
> > > > Yet another example:
> > > >
> > > > <...>-369 [002] ...2 220.055746: xfs_ail_insert: dev 253:4 lip 00000000ddb123f2 old lsn 0/0 new lsn 1/248 type XFS_LI_INODE flags IN_AIL
> > > > <...>-27 [003] ...1 224.753110: xfs_log_force: dev 253:4 lsn 0x0 caller xfs_log_worker+0x2f/0xf0 [xfs]
> > > > <...>-404 [003] ...1 224.775551: __xlog_grant_push_ail: 1596: threshold_lsn 0x1000000fa
> > > > kworker/3:1-39 [003] ...2 224.777953: xfs_ail_delete: dev 253:4 lip 00000000ddb123f2 old lsn 1/248 new lsn 1/248 type XFS_LI_INODE flags IN_AIL
> > > > xfsaild/dm-4-1034 [000] .... 224.797919: xfsaild: 588: idle ->ail_target 0x1000000fa
> > > > kworker/3:1H-404 [003] ...2 225.841198: xfs_ail_insert: dev 253:4 lip 000000006845aeed old lsn 0/0 new lsn 1/250 type XFS_LI_INODE flags IN_AIL
> > > > kworker/3:1-39 [003] ...1 254.962822: xfs_log_force: dev 253:4 lsn 0x0 caller xfs_log_worker+0x2f/0xf0 [xfs]
> > > > ...
> > > > kworker/3:2-1920 [003] ...1 3759.291275: xfs_log_force: dev 253:4 lsn 0x0 caller xfs_log_worker+0x2f/0xf0 [xfs]
> > > >
> > > >
> > > > # cat /sys/fs/xfs/dm-4/log/log_*lsn
> > > > 1:252
> > > > 1:250
> > > >
> > > > This instance of the race uses the same serialization instrumentation to
> > > > control execution timing and whatnot as before (i.e. no functional
> > > > changes). First, an item is inserted into the AIL.
> > > > Immediately after AIL
> > > > insertion, another transaction commits to the CIL (not shown in the
> > > > trace). The background log worker comes around a few seconds later and
> > > > forces the log/CIL. The checkpoint for this log force races with an AIL
> > > > delete and idle (same as before). AIL insertion occurs at the push
> > > > target xfsaild just idled at, but this time reservation pressure
> > > > relieves and the filesystem goes idle.
> > > >
> > > > At this point, nothing occurs on the fs except for continuous background
> > > > log worker jobs. Note the timestamp difference between the first
> > > > post-race log force and the last in the trace. The log worker runs at
> > > > the default 30s interval and has run repeatedly for almost an hour while
> > > > failing to push the AIL and subsequently cover the log. To confirm the
> > > > AIL is populated, see the log head/tail LSNs reported via sysfs. This
> > > > state persists indefinitely so long as the fs is idle. This is a bug.
> > > >
> > > /me stumbles back in after ~2wks, and has a few questions:
> > >
> >
> > Heh, welcome back.. ;)
> >
> > > 1) Are these concerns a reason to hold up this series, or are they a
> > > separate bug lurking in the code being touched by the series? AFAICT I
> > > think it's the second, but <shrug> my brain is still mush.
> > >
> >
> > A little of both I guess. To Dave's earlier point, I think this
> > technically can happen in the existing code as a 3-way race between the
> > aforementioned tasks (just not the way it was described). OTOH, I'm not
> > sure what this has to do with the fact that the new code being added is
> > racy on its own (or since when discovery of some old bug justifies
> > adding new ones..?). The examples shown above are fundamentally races
> > between log I/O completion and xfsaild.
> > The last one shows the log
> > remain uncovered indefinitely on an idle fs (which is not a corruption
> > or anything, but certainly a bug) simply because that's the easiest side
> > effect to reproduce. I'm fairly confident at this point that one could
> > be manufactured into a similar log deadlock if we really wanted to try,
> > but that would be much more difficult and TBH I'm tired of burning
> > myself out on these kind of objections to obvious and easily addressed
> > landmines. How likely is it that somebody would hit these problems?
> > Probably highly unlikely. How likely is it somebody would hit this
> > problem before whatever problem this patch fixes? *shrug*
> >
> > I don't think it's a reason to hold up the series, but at the same time
> > this patch is unrelated to the original problem. IIRC, it fell out of
> > some other issue reproduced with a different experimental hack/fix (that
> > was eventually replaced) to the original problem. FWIW, I'm annoyed with
> > the lazy approach to review here more than anything. In hindsight, if I
> > knew the feedback was going to be dismissed and the patchset rolled
> > forward and merged, perhaps I should have just nacked the subsequent
> > reposts to make the objection clear.
>
> :(
>
> I'm sorry you feel that way. I myself don't feel that my own handling
> of this merge window has been good, between feeling pressured to get the
> branches ready to go before vacation and for-next becoming intermittent
> right around the same time. Those both decrease my certainty about
> what's going in the next merge and increases my own anxieties, and it
> becomes a competition in my head between "I can add it now and revert it
> later as a regression fix" vs. "if I don't add it I'll wonder if it was
> necessary".
>

Eh, it is what it is. I don't expect to always agree on everything. I
could/should have noted the objection more clearly on subsequent posts
and will try to do so in the future.
Conversely, a quick "there appears to be a disagreement, the maintainer
is making a decision" note on the list would be appreciated should that
play out in the future.

Just my .02 (and not saying this is some kind of pattern or anything),
but I do think the whole "merge it since we can revert it from for-next
later if it causes problems" thing kind of sets a low bar and a bad
precedent. What's the point of reviewing patches with that approach?
Beyond discouraging review, it also diminishes sense of responsibility
for the quality of affected areas of code IMO.

> Anyway, I /think/ the end result is that if the AIL gets stuck /and/ the
> system goes down before it becomes unstuck, then there'll be more work
> for log recovery to do, because we failed to checkpoint everything that
> we possibly could have before the crash?
>

Yep, in this particular variant at least.

> So AFAICT it's not a critical disaster bug but I would like to study
> this situation some more, particularly now that we have ~2mos for
> stabilizing things.
>

Right, this is definitely not a critical bug. AFAICT neither is the bug
fixed by the patch. Note that just on principle I'm not going to spend a
whole lot of time testing for things post-merge that I consider and/or
call out as problems during review and end up ignored. The point of
addressing such things during review is to avoid those problems in the
first place. In this particular case, why spend time on that when it is
so relatively simple to relocate the xlog_grant_push_ail() call to after
callback processing? I still don't understand that tbh.

> > I dunno, not my call on what to do with it now. Feel free to add my
> > Nacked-by: to the upstream commit I guess so I at least remember this
> > when/if considering whether to backport it anywhere. :/
>
> (/me continues to wish there was an easy way to add tagging to a commit,
> particularly when it comes well after the fact.)
>

No big deal, this was all in hindsight.

> > > 2) Er...
> > > how do you get the log stuck like this? I see things earlier
> > > in the thread like "open transactions that don't ultimately commit
> > > anything to the log subsystem" and think "OH, you mean xfs_scrub!"
> > >
> >
> > That's one thing I was thinking about but I didn't end up looking into
> > it (does scrub actually acquire log reservation?).
>
> If you invoke the scrub ioctl with IFLAG_REPAIR set, it allocates a
> non-empty transaction (itruncate, iirc) to do the check and rebuild the
> data structure. If the item is ok then it'll cancel the transaction.
>

Ah, tr_itruncate is actually larger than tr_write too (which I assume is
why it's used here)...

> > For a more simple
> > example, consider a bunch of threads running into quota block allocation
> > failures where a system is also under memory pressure. On filesystems
> > with smaller logs, it only takes a handful of such threads to bash the
> > reservation grant head against the log tail even though the log is empty
> > (and doing so without ever committing anything to the log).
> >
> > Note that this by itself isn't what gets the log "stuck" in the most
> > recent example (note: not deadlocked), but rather if we're in a state
> > where the grant head is close enough to the log head (such that we AIL
> > push the items associated with the current checkpoint before it inserts)
> > when log I/O completion happens to race with AIL emptying as described.
>
> Hmm... I wonder if we could reproduce this by formatting a filesystem
> with a small log; running a slow moving thread that touches a file once
> per second (or something to generate a moderate amount of workload) and
> monitors the log to watch its progress; and then kicking off dozens of
> threads to invoke IFLAG_REPAIR scrubbers on some other non-corrupt part
> of the filesystem?
>

Yeah, given the above and if we're able to kick off enough concurrent
scrubbers such that they do nontrivial work and don't block each other
before transaction allocation, that looks like it could result in
similar behavior wrt reservation. Concurrent allocbt scrubs perhaps? If
so, on an otherwise default sized 100g fs, concurrent repair scrubs to
anything more than 18 or so AGs (based on tr_itruncate size) would be
enough to cause AIL pushing from the log subsystem (with a minimum size
log, I think something like 3 AGs would be enough) without any guarantee
that any of those transactions commit. The rest is just a simple race
between AIL pushing a fabricated target and xfsaild.

BTW, another thing I noticed with regard to timing is that on 32-bit
systems the push target is updated under ->ail_lock, which I think means
the xfs_ail_push() call from log I/O completion can serialize behind an
xfsaild push in progress (after the former has checked for a !empty
AIL). I'm not sure that's enough to make the race easy to reproduce
without explicit delay injection, but it's a step in that direction..

Brian

> --D
>
> > Brian
> >
> > > --D
> > >
> > > > Brian
> > > >
> > > > > Cheers,
> > > > >
> > > > > Dave.
> > > > > --
> > > > > Dave Chinner
> > > > > david@fromorbit.com
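
The traces quoted in this message print the AIL push target as a raw 64-bit value (0x1000032e4, 0x1000000fa), while the xfs_ail_insert and xfs_ail_delete events print cycle/block pairs (1/13028, 1/250). Below is a minimal user-space sketch for converting between the two. It assumes the usual XFS LSN layout of cycle number in the upper 32 bits and block number in the lower 32 bits. The lsn_* helper names are hypothetical stand-ins written for this note; they are not the kernel's CYCLE_LSN(), BLOCK_LSN() or XFS_LSN_CMP() macros.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical user-space type; the kernel uses xfs_lsn_t. */
typedef int64_t lsn_t;

/* Assumed layout: cycle in the upper 32 bits, block in the lower 32 bits. */
static uint32_t lsn_cycle(lsn_t lsn)
{
	return (uint32_t)((uint64_t)lsn >> 32);
}

static uint32_t lsn_block(lsn_t lsn)
{
	return (uint32_t)lsn;
}

/*
 * Compare by cycle first, then by block. Returns <0, 0 or >0; only the
 * sign is meant to correspond to what XFS_LSN_CMP() reports.
 */
static int lsn_cmp(lsn_t a, lsn_t b)
{
	if (lsn_cycle(a) != lsn_cycle(b))
		return lsn_cycle(a) < lsn_cycle(b) ? -1 : 1;
	if (lsn_block(a) != lsn_block(b))
		return lsn_block(a) < lsn_block(b) ? -1 : 1;
	return 0;
}

/* Format as "cycle/block", the notation used by the AIL tracepoints. */
static void lsn_format(lsn_t lsn, char *buf, size_t len)
{
	snprintf(buf, len, "%u/%u", lsn_cycle(lsn), lsn_block(lsn));
}
```

With these helpers, 0x1000032e4 formats as 1/13028 and 0x1000000fa as 1/250. In both traces that is the LSN at which the later xfs_ail_insert events land, i.e. the same value xfsaild reported as its target when it went idle.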
diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
index 6a59d71d4c60..733693e1ac9f 100644
--- a/fs/xfs/xfs_log.c
+++ b/fs/xfs/xfs_log.c
@@ -2632,6 +2632,46 @@ xlog_get_lowest_lsn(
 	return lowest_lsn;
 }
 
+/*
+ * Completion of a iclog IO does not imply that a transaction has completed, as
+ * transactions can be large enough to span many iclogs. We cannot change the
+ * tail of the log half way through a transaction as this may be the only
+ * transaction in the log and moving the tail to point to the middle of it
+ * will prevent recovery from finding the start of the transaction. Hence we
+ * should only update the last_sync_lsn if this iclog contains transaction
+ * completion callbacks on it.
+ *
+ * We have to do this before we drop the icloglock to ensure we are the only one
+ * that can update it.
+ *
+ * If we are moving the last_sync_lsn forwards, we also need to ensure we kick
+ * the reservation grant head pushing. This is due to the fact that the push
+ * target is bound by the current last_sync_lsn value. Hence if we have a large
+ * amount of log space bound up in this committing transaction then the
+ * last_sync_lsn value may be the limiting factor preventing tail pushing from
+ * freeing space in the log. Hence once we've updated the last_sync_lsn we
+ * should push the AIL to ensure the push target (and hence the grant head) is
+ * no longer bound by the old log head location and can move forwards and make
+ * progress again.
+ */
+static void
+xlog_state_set_callback(
+	struct xlog		*log,
+	struct xlog_in_core	*iclog,
+	xfs_lsn_t		header_lsn)
+{
+	iclog->ic_state = XLOG_STATE_CALLBACK;
+
+	ASSERT(XFS_LSN_CMP(atomic64_read(&log->l_last_sync_lsn), header_lsn) <= 0);
+
+	if (list_empty_careful(&iclog->ic_callbacks))
+		return;
+
+	atomic64_set(&log->l_last_sync_lsn, header_lsn);
+	xlog_grant_push_ail(log, 0);
+
+}
+
 /*
  * Return true if we need to stop processing, false to continue to the next
  * iclog. The caller will need to run callbacks if the iclog is returned in the
@@ -2644,6 +2684,7 @@ xlog_state_iodone_process_iclog(
 	struct xlog_in_core	*completed_iclog)
 {
 	xfs_lsn_t		lowest_lsn;
+	xfs_lsn_t		header_lsn;
 
 	/* Skip all iclogs in the ACTIVE & DIRTY states */
 	if (iclog->ic_state & (XLOG_STATE_ACTIVE|XLOG_STATE_DIRTY))
@@ -2681,34 +2722,15 @@ xlog_state_iodone_process_iclog(
 	 * callbacks) see the above if.
 	 *
 	 * We will do one more check here to see if we have chased our tail
-	 * around.
+	 * around. If this is not the lowest lsn iclog, then we will leave it
+	 * for another completion to process.
 	 */
+	header_lsn = be64_to_cpu(iclog->ic_header.h_lsn);
 	lowest_lsn = xlog_get_lowest_lsn(log);
-	if (lowest_lsn &&
-	    XFS_LSN_CMP(lowest_lsn, be64_to_cpu(iclog->ic_header.h_lsn)) < 0)
-		return false; /* Leave this iclog for another thread */
-
-	iclog->ic_state = XLOG_STATE_CALLBACK;
-
-	/*
-	 * Completion of a iclog IO does not imply that a transaction has
-	 * completed, as transactions can be large enough to span many iclogs.
-	 * We cannot change the tail of the log half way through a transaction
-	 * as this may be the only transaction in the log and moving th etail to
-	 * point to the middle of it will prevent recovery from finding the
-	 * start of the transaction. Hence we should only update the
-	 * last_sync_lsn if this iclog contains transaction completion callbacks
-	 * on it.
-	 *
-	 * We have to do this before we drop the icloglock to ensure we are the
-	 * only one that can update it.
-	 */
-	ASSERT(XFS_LSN_CMP(atomic64_read(&log->l_last_sync_lsn),
-			be64_to_cpu(iclog->ic_header.h_lsn)) <= 0);
-	if (!list_empty_careful(&iclog->ic_callbacks))
-		atomic64_set(&log->l_last_sync_lsn,
-			be64_to_cpu(iclog->ic_header.h_lsn));
+	if (lowest_lsn && XFS_LSN_CMP(lowest_lsn, header_lsn) < 0)
+		return false;
 
+	xlog_state_set_callback(log, iclog, header_lsn);
 	return false;
 }
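
The control flow of the new xlog_state_set_callback() helper in the diff above can be modelled in user space. This is a toy sketch under loud assumptions: struct toy_log, struct toy_iclog and toy_set_callback() are hypothetical names invented for this note, the callback list is reduced to a boolean, the AIL push is reduced to a counter, and a plain integer comparison stands in for XFS_LSN_CMP(). It only illustrates that an iclog without commit callbacks changes state without moving last_sync_lsn or pushing, while an iclog with callbacks does both inside the helper, which is before the caller has run any of those callbacks.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy stand-in for the fields of struct xlog that the helper touches. */
struct toy_log {
	int64_t	last_sync_lsn;	/* models log->l_last_sync_lsn */
	int	ail_pushes;	/* counts would-be xlog_grant_push_ail() calls */
};

/* Toy stand-in for struct xlog_in_core. */
struct toy_iclog {
	int64_t	header_lsn;		/* models ic_header.h_lsn, CPU endian */
	bool	has_callbacks;		/* models a non-empty ic_callbacks list */
	bool	in_callback_state;	/* models XLOG_STATE_CALLBACK */
};

static void toy_set_callback(struct toy_log *log, struct toy_iclog *iclog)
{
	iclog->in_callback_state = true;

	/* The log head must never move backwards. */
	assert(log->last_sync_lsn <= iclog->header_lsn);

	/* No commit callbacks: leave the head and the push target alone. */
	if (!iclog->has_callbacks)
		return;

	/* Move the head forward, then kick the push (modelled as a count). */
	log->last_sync_lsn = iclog->header_lsn;
	log->ail_pushes++;
}
```

In this model the push counter increments at the point where the real helper calls xlog_grant_push_ail(), so it is a convenient place to experiment with the alternative ordering discussed in the thread (issuing the push only after callback processing).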