Message ID: 20210223053212.3287398-3-david@fromorbit.com (mailing list archive)
State: Superseded
Series: xfs: CIL improvements
On Tue, Feb 23, 2021 at 04:32:11PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
>
> The AIL pushing is stalling on log forces when it comes across
> pinned items. This is happening on removal workloads where the AIL
> is dominated by stale items that are removed from the AIL when the
> checkpoint that marks the items stale is committed to the journal.
> This results in relatively few items in the AIL, but those that
> remain are often pinned, as the directories the items are being
> removed from are still being logged.
>
> As a result, many push cycles through the CIL will first issue a
> blocking log force to unpin the items. This can take some time to
> complete, with tracing regularly showing push delays of half a
> second and sometimes up into the range of several seconds. Sequences
> like this aren't uncommon:
>
> ....
> 399.829437: xfsaild: last lsn 0x11002dd000 count 101 stuck 101 flushing 0 tout 20
>     <wanted 20ms, got 270ms delay>
> 400.099622: xfsaild: target 0x11002f3600, prev 0x11002f3600, last lsn 0x0
> 400.099623: xfsaild: first lsn 0x11002f3600
> 400.099679: xfsaild: last lsn 0x1100305000 count 16 stuck 11 flushing 0 tout 50
>     <wanted 50ms, got 500ms delay>
> 400.589348: xfsaild: target 0x110032e600, prev 0x11002f3600, last lsn 0x0
> 400.589349: xfsaild: first lsn 0x1100305000
> 400.589595: xfsaild: last lsn 0x110032e600 count 156 stuck 101 flushing 30 tout 50
>     <wanted 50ms, got 460ms delay>
> 400.950341: xfsaild: target 0x1100353000, prev 0x110032e600, last lsn 0x0
> 400.950343: xfsaild: first lsn 0x1100317c00
> 400.950436: xfsaild: last lsn 0x110033d200 count 105 stuck 101 flushing 0 tout 20
>     <wanted 20ms, got 200ms delay>
> 401.142333: xfsaild: target 0x1100361600, prev 0x1100353000, last lsn 0x0
> 401.142334: xfsaild: first lsn 0x110032e600
> 401.142535: xfsaild: last lsn 0x1100353000 count 122 stuck 101 flushing 8 tout 10
>     <wanted 10ms, got 10ms delay>
> 401.154323: xfsaild: target 0x1100361600, prev 0x1100361600, last lsn 0x1100353000
> 401.154328: xfsaild: first lsn 0x1100353000
> 401.154389: xfsaild: last lsn 0x1100353000 count 101 stuck 101 flushing 0 tout 20
>     <wanted 20ms, got 300ms delay>
> 401.451525: xfsaild: target 0x1100361600, prev 0x1100361600, last lsn 0x0
> 401.451526: xfsaild: first lsn 0x1100353000
> 401.451804: xfsaild: last lsn 0x1100377200 count 170 stuck 22 flushing 122 tout 50
>     <wanted 50ms, got 500ms delay>
> 401.933581: xfsaild: target 0x1100361600, prev 0x1100361600, last lsn 0x0
> ....
>
> In each of these cases, every AIL pass saw 101 log items stuck on
> the AIL (pinned) with very few other items being found. Each pass, a
> log force was issued, and the delay between last/first is the sleep
> time + the sync log force time.
>
> Some of these 101 items pinned the tail of the log. The tail of the
> log does slowly creep forward (first lsn), but the problem is that
> the log is actually out of reservation space because it's been
> running so many transactions whose stale items never reach the AIL
> but still consume log space. Hence we have a largely empty AIL, with
> long term pins on items that pin the tail of the log that don't get
> pushed frequently enough to keep log space available.
>
> The problem is the hundreds of milliseconds that we block in the log
> force pushing the CIL out to disk. The AIL should not be stalled
> like this - it needs to run and flush items that are at the tail of
> the log with minimal latency. What we really need to do is trigger a
> log flush, but then not wait for it at all - we've already done our
> waiting for stuff to complete when we backed off prior to the log
> force being issued.
>
> Even if we remove the XFS_LOG_SYNC from the xfs_log_force() call, we
> still do a blocking flush of the CIL and that is what is causing the
> issue. Hence we need a new interface for the CIL to trigger an
> immediate background push of the CIL to get it moving faster but not
> to wait on that to occur. While the CIL is pushing, the AIL can also
> be pushing.
>
> We already have an internal interface to do this -
> xlog_cil_push_now() - but we need a wrapper for it to be used
> externally. xlog_cil_force_seq() can easily be extended to do what
> we need as it already implements the synchronous CIL push via
> xlog_cil_push_now(). Add the necessary flags and "push current
> sequence" semantics to xlog_cil_force_seq() and convert the AIL
> pushing to use it.

Overnight testing indicates generic/530 hangs on small logs with
this patch. It essentially runs out of log space because there are
inode cluster buffers permanently pinned by this workload.

That is, as inodes are unlinked, they repeatedly relog the inode
cluster buffer, and so while the CIL is flushing to unpin the
buffer, a new unlink relogs it and adds a new pin to the buffer.
Hence the log force that the AIL does to unpin pinned items doesn't
actually unpin buffers that are being relogged.

These buffers only get unpinned when the log actually runs out of
space because they pin the tail of the log. Then the modifications
that are relogging the buffer stop because they fail to get a log
reservation, and the tail of the log does not move forward until the
AIL issues a log force that finally unpins the buffer and writes it
back.

Essentially, relogged buffers cannot be flushed by the AIL until the
log completely stalls.

The problem being tripped over here is that we no longer force the
final iclogs in a CIL push to disk - we leave the iclog with the
commit record in it in ACTIVE state, and by not waiting and forcing
all the iclogs to disk, the buffer never gets unpinned because there
isn't any more log pressure to force it out because everything is
blocked on reservation space.

The solution to this is to have the CIL push change the state of the
commit iclog to WANT_SYNC before it is released. This means the CIL
push will always flush the iclog to disk and the checkpoint will
complete and unpin the buffers.

Right now, we really only want to do this state switch for these
async pushes - for small sync transactions and fsync we really want
the iclog aggregation that we have now to optimise iclogbuf usage,
so I'll have to pass a new flag through the push code and back into
xlog_write(). That will make this patch behave the same as we
currently do.

In the longer term, we need to change how the AIL deals with pinned
buffers, as the current method is definitely sub-optimal. It also
explains the "everything stops for a second" performance anomalies
I've been seeing for a while in testing. But fixing that is outside
the scope of this patchset, so in the meantime I'll fix this one up
and repost it in a little while.

FWIW, this is also the likely cause of the "CIL workqueue too deep"
hangs I was seeing with the next patch in the series, too.

Cheers,

Dave.
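The conditional state switch Dave describes can be sketched as a tiny state machine. This is a userspace model for illustration only, not the kernel code: the state names mirror the real iclog states, but the structure and function are invented here. An async CIL push switches the commit iclog to WANT_SYNC on release so the checkpoint always reaches disk; a sync push leaves it ACTIVE so later small transactions can still aggregate into it, relying on the waiting log force to switch it out.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative model only -- state names mirror the kernel's iclog
 * states; everything else is invented for this sketch. */
enum iclog_state { ICLOG_ACTIVE, ICLOG_WANT_SYNC, ICLOG_SYNCING };

struct iclog {
	enum iclog_state state;
};

/*
 * Release the commit iclog at the end of a CIL push.  An async push
 * switches the iclog to WANT_SYNC so it is always submitted and the
 * checkpoint completes (unpinning its items); a sync push leaves it
 * ACTIVE for further aggregation.
 */
static void release_commit_iclog(struct iclog *iclog, bool async_push)
{
	if (async_push && iclog->state == ICLOG_ACTIVE)
		iclog->state = ICLOG_WANT_SYNC;
}
```

The point of the flag is that only the callers who cannot wait (the AIL, xfs_iunpin()) pay the cost of giving up iclog aggregation.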
On Thu, Feb 25, 2021 at 08:10:58AM +1100, Dave Chinner wrote:
> On Tue, Feb 23, 2021 at 04:32:11PM +1100, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>

[... quoted patch description and trace trimmed ...]

> The problem being tripped over here is that we no longer force the
> final iclogs in a CIL push to disk - we leave the iclog with the
> commit record in it in ACTIVE state, and by not waiting and forcing
> all the iclogs to disk, the buffer never gets unpinned because there
> isn't any more log pressure to force it out because everything is
> blocked on reservation space.
>
> The solution to this is to have the CIL push change the state of the
> commit iclog to WANT_SYNC before it is released. This means the CIL
> push will always flush the iclog to disk and the checkpoint will
> complete and unpin the buffers.
>
> Right now, we really only want to do this state switch for these
> async pushes - for small sync transactions and fsync we really want
> the iclog aggregation that we have now to optimise iclogbuf usage,
> so I'll have to pass a new flag through the push code and back into
> xlog_write(). That will make this patch behave the same as we
> currently do.
>

Unfortunately I've not yet caught up to reviewing your most recently
posted set of log patches so I can easily be missing some context, but
when passing through some of the feedback/updates so far this has me
rather confused. We discussed this pre-existing CIL behavior in the
previous version; I suggested some similar potential behavior change
where we would opportunistically send off checkpoint iclogs for I/O a
bit earlier than normal and you argued [1] against it. Now it sounds
like not only are we implementing that, but it's actually necessary to
fix a log hang problem..? What am I missing?

The updated iclog behavior does sound more friendly to me than what we
currently do (obviously, based on my previous comments), but I am
slightly skeptical of how such a change fixes the root cause of a hang.
Is this a stall/perf issue or an actual log deadlock? If the latter,
what prevents this deadlock on current upstream?

Brian

[1] https://lore.kernel.org/linux-xfs/20210129222559.GT4662@dread.disaster.area/

> In the longer term, we need to change how the AIL deals with pinned
> buffers, as the current method is definitely sub-optimal. It also
> explains the "everything stops for a second" performance anomalies
> I've been seeing for a while in testing. But fixing that is outside
> the scope of this patchset, so in the meantime I'll fix this one up
> and repost it in a little while.
>
> FWIW, this is also the likely cause of the "CIL workqueue too deep"
> hangs I was seeing with the next patch in the series, too.
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com
>
On Thu, Feb 25, 2021 at 08:12:12AM -0500, Brian Foster wrote:
> On Thu, Feb 25, 2021 at 08:10:58AM +1100, Dave Chinner wrote:
> > On Tue, Feb 23, 2021 at 04:32:11PM +1100, Dave Chinner wrote:

[... quoted context trimmed ...]

> Unfortunately I've not yet caught up to reviewing your most recently
> posted set of log patches so I can easily be missing some context, but
> when passing through some of the feedback/updates so far this has me
> rather confused. We discussed this pre-existing CIL behavior in the
> previous version, I suggested some similar potential behavior change
> where we would opportunistically send off checkpoint iclogs for I/O a
> bit earlier than normal and you argued [1] against it. Now it sounds
> like not only are we implementing that, but it's actually necessary to
> fix a log hang problem..? What am I missing?

You're talking about this comment?

| > That means there's an increased chance that
| > the next iclog in the ring may be active. Perhaps we could introduce
| > some logic to switch out the commit record iclog before the CIL push
| > returns in those particular cases. For example, switch out if the
| > current checkpoint hit multiple iclogs..? Or hit multiple iclogs and the
| > next iclog in the ring is clean..?
|
| We could just call xlog_state_switch_iclogs() to mark it WANT_SYNC,
| but then we can't aggregate more changes into it and fill it up. If
| someone starts waiting on that iclog (i.e. a log force), then it
| immediately gets marked WANT_SYNC and submitted to disk when it is
| released. But if there is no-one waiting on it, then we largely
| don't care if an asynchronous checkpoint is committed immediately,
| at the start of the next checkpoint, or at worst, within 30s when
| the log worker next kicks a log force....

I'm arguing against the general case here of submitting iclog
buffers with checkpoints in them early. There is no reason to do that
because the xfs_log_force() code will kick the iclog to disk if
necessary. That still stands - I have not changed behaviour
unconditionally.

This patch only kicks the commit_iclog to disk if the cause of the
CIL push was an async call to xlog_cil_force_seq(). That can occur
from xfs_iunpin() and, with this patch, the AIL.

In both cases, all they care about is that the CIL checkpoint fully
completes in the near future. For xfs_iunpin(), the log force
doesn't guarantee the inode is actually unpinned, hence it waits on
the pin count and not the log force. So waiting for the CIL push is
unnecessary as long as the CIL push gets the entire checkpoint to
disk.

Similarly, the AIL only needs the log force to guarantee the CIL
commits to disk in the near future, but it does not need to block
waiting for that to occur.

So in these cases, we don't want to block while the CIL is pushing
just so the log force code can iterate the last iclog the CIL wrote
to flush it to disk. Instead, we pass a flag to the CIL to tell it
"you need to push the commit record to disk" so we don't have to
wait for it to then check if it pushed the commit record or not.
This is a special case, not the general case where we are running
lots of fsync or sync transactions.

In the general case, we are trying to aggregate as many sync
transactions/fsync log forces as possible into the current iclog to
batch them up effectively. We'll continue to batch them while there
is space in the current iclog and the previous iclog is in the
WANT_SYNC/SYNCING state indicating it is still being written to
disk.

This automated batching behaviour of the iclog state machine is one
of the reasons XFS has always had excellent performance under sync
transaction/fsync heavy workloads such as NFS/CIFS servers. It is
functionality we need to preserve, and this patch does not actually
change that behaviour.

> The updated iclog behavior does sound more friendly to me than what we
> currently do (obviously, based on my previous comments),

As per above, the conditional behaviour I've added here is actually
less friendly :)

> but I am
> slightly skeptical of how such a change fixes the root cause of a hang.
> Is this a stall/perf issue or an actual log deadlock? If the latter,
> what prevents this deadlock on current upstream?

The problem that I talk about is introduced by this patch and also
fixed by this patch. The problem does not exist until we have code
that depends on an async CIL push guaranteeing that items in the CIL
are unpinned. A stock xfs_log_force_seq(0) call guarantees that the
log reaches disk by waiting on the CIL push then pushing the iclogs.
All it does is elide waiting for the final iclogs to hit the disk.

By eliding the "wait for CIL push" in an async xfs_log_force_seq()
call, we also elide the "push iclogs" part of the log force and
that's what causes the AIL problem. We don't push the iclogs, so the
items committed to the last iclog don't get dispatched and the items
never get unpinned because the checkpoint never completes.

So this "AIL can hang because tail items aren't unpinned" condition
is exposed by async CIL forces. It is also fixed by this patch
passing the async push state to the CIL push and having it ensure
that the commit iclog is written out if the CIL push was triggered
by an async push.

Like all log forces, async pushes are rare, but we have to ensure
they behave correctly. There is no bug in the code before this
patch, and there is no bug in the code (that I know of) after this
patch. I've described a condition that we have to handle in this
patch that, if we don't, we end up with broken AIL pushing.

As to the "root cause of the hang" I now know what that is too, but
addressing it is outside the scope of this patchset. It's taken me a
long time to identify this, but I've been seeing symptoms of it for
quite a long time (i.e. years) but until doing this async CIL push
work I've never been able to identify the cause.

That is, the root cause is that the AIL cannot force a pinned buffer
to disk, and a log force (of any kind) does not guarantee a pinned
item is unpinned. That's because the CIL is pipelined, meaning that
an item can be pinned in one CIL context, and while that is being
pushed, the front end can relog that same item and pin it in the
current CIL context. Hence it has two pins, not one. Now when the
original log force completes, the pin from the first CIL context has
been removed, but the pin from the current CIL context is still
present.

Hence when we are relogging buffers over time (e.g. repeatedly
modifying a directory), the AIL always sees the buffer pinned and
cannot flush it. Even if the buffer has been moved to the delwri
list (i.e. had a successful push) we still can't write it out
because the delwri code skips pinned buffers.

This iterative pin continues preventing the AIL from flushing the
buffer until it pins the tail of the log. When the log runs out of
space and blocks the modifying/relogging task(s), the log force the
AIL issues allows the CIL to be pushed without the front end
relogging the item and re-pinning the item. At the completion of the
log force, the AIL can now push the unpinned buffer and hence move
the tail of the log forward, which frees up log reservation space.
This then wakes the front end, which goes back to its relogging
game and the cycle repeats over.

This is really nasty behaviour, and it's only recently that I've got
a handle on it. I found it because my original "async CIL push" code
resulted in long stalls every time the log is filled and the tail is
pinned by a buffer that is being relogged in this manner....

I'm not sure how to fix this yet - the AIL needs to block the front
end relogging to allow the buffer to be unpinned. Essentially, we
need to hold the pinned items locked across a CIL push to guarantee
they are unpinned, but that's the complete opposite of what the AIL
currently does to prevent the front end from seeing long tail lock
latencies when modifying stuff....

Cheers,

Dave.
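The pipelined double-pin behaviour described above can be modelled with a simple pin count. This is a userspace sketch with invented names, not kernel code: a completed checkpoint only drops the pin taken by the CIL context it pushed, so a buffer that is relogged into each new context never reaches a zero pin count, and the AIL (and delwri submission, which skips pinned buffers) can never write it back until relogging stops.

```c
#include <assert.h>
#include <stdbool.h>

/* Userspace sketch of the pipelined CIL pin behaviour -- all names
 * are invented for illustration. */
struct buf {
	int pin_count;	/* one pin per CIL context the buffer is logged in */
};

/* Front end relogs the buffer into the current CIL context. */
static void relog(struct buf *bp)
{
	bp->pin_count++;
}

/* Completion of a checkpoint drops the pin its CIL context took. */
static void checkpoint_complete(struct buf *bp)
{
	bp->pin_count--;
}

/* The AIL (and delwri submission) skip pinned buffers. */
static bool ail_can_flush(const struct buf *bp)
{
	return bp->pin_count == 0;
}
```

Running the cycle Dave describes - relog into context N, relog again into context N+1 before N's push completes - leaves the pin count at one after every checkpoint, which is exactly the state the AIL keeps tripping over.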
On Fri, Feb 26, 2021 at 09:03:05AM +1100, Dave Chinner wrote: > On Thu, Feb 25, 2021 at 08:12:12AM -0500, Brian Foster wrote: > > On Thu, Feb 25, 2021 at 08:10:58AM +1100, Dave Chinner wrote: > > > On Tue, Feb 23, 2021 at 04:32:11PM +1100, Dave Chinner wrote: > > > The problem being tripped over here is that we no longer force the > > > final iclogs in a CIL push to disk - we leave the iclog with the > > > commit record in it in ACTIVE state, and by not waiting and forcing > > > all the iclogs to disk, the buffer never gets unpinned because there > > > isn't any more log pressure to force it out because everything is > > > blocked on reservation space. > > > > > > The solution to this is to have the CIL push change the state of the > > > commit iclog to WANT_SYNC before it is released. This means the CIL > > > push will always flush the iclog to disk and the checkpoint will > > > complete and unpin the buffers. > > > > > > Right now, we really only want to do this state switch for these > > > async pushes - for small sync transactions and fsync we really want > > > the iclog aggregation that we have now to optimise iclogbuf usage, > > > so I'll have to pass a new flag through the push code and back into > > > xlog_write(). That will make this patch behave the same as we > > > currently do. > > > > > > > Unfortunately I've not yet caught up to reviewing your most recently > > posted set of log patches so I can easily be missing some context, but > > when passing through some of the feedback/updates so far this has me > > rather confused. We discussed this pre-existing CIL behavior in the > > previous version, I suggested some similar potential behavior change > > where we would opportunistically send off checkpoint iclogs for I/O a > > bit earlier than normal and you argued [1] against it. Now it sounds > > like not only are we implementing that, but it's actually necessary to > > fix a log hang problem..? What am I missing? > > You're talking about this comment? 
> Yes. > | > That means there's an increased chance that > | > the next iclog in the ring may be active. Perhaps we could introduce > | > some logic to switch out the commit record iclog before the CIL push > | > returns in those particular cases. For example, switch out if the > | > current checkpoint hit multiple iclogs..? Or hit multiple iclogs and the > | > next iclog in the ring is clean..? > | > | We could just call xlog_state_switch_iclogs() to mark it WANT_SYNC, > | but then we can't aggregate more changes into it and fill it up. If > | someone starts waiting on that iclog (i.e. a log force), then it > | immediately gets marked WANT_SYNC and submitted to disk when it is > | released. But if there is no-one waiting on it, then we largely > | don't care if an asynchronous checkpoint is committed immediately, > | at the start of the next checkpoint, or at worst, within 30s when > | the log worker next kicks a log force.... > > I'm arguing against the general case here of submitting iclog > buffers with checkpoints in them early. THere is no reason to do that > because the xfs_log_force() code will kick the iclog to disk if > necessary. That still stands - I have not changed behaviour > unconditionally. > But the suggestion wasn't to change behavior unconditionally. That seems fairly clear to me even from just the quoted snippet above..? > This patch only kicks the commit_iclog to disk if the cause of the > CIL push was an async call to xlog_cil_force_seq(). That can occur > from xfs_iunpin() and, with this patch, the AIL. > > In both cases, all they care about is that the CIL checkpoint fully > completes in the near future. For xfs_iunpin(), the log force > doesn't guarantee the inode is actually unpinned, hence it waits on > the pin count and not the log force. So waiting for the CIL push is > unnecessary as long as the CIL push gets the entire checkpoint to > disk. 
> > Similarly, the AIL only needs the log force to guarantee the CIL > commits to disk in the near future, but it does not need to block > waiting for that to occur. > > So in these cases, we don't want to block while the CIL is pushing > just so the log force code can iterate the last iclog the CIL wrote > to flush it to disk. Instead, we pass a flag to the CIL to tell it > "you need to push the commit record to disk" so we don't have to > wait for it to then check if it pushed the commit record or not. > This is a special case, not the general case where we are running > lots of fsync or sync transactions. > > In the general case, we are trying to aggregate as many sync > transactions/fsync log forces as possible into the current iclog to > batch them up effectively. We'll continue to batch them while there > is space in the current iclog and the previous iclog is in the > WANT_SYNC/SYNCING state indicating it is still being written to > disk. > > This automated batching behaviour of the iclog state machine is one > of the reasons XFS has always had excellent performance under > synci transaction/fsync heavy workloads such as NFS/CIFS servers. It > is functionality we need to preserve, and this patch does not > actually change that behaviour. > Sure, this all makes sense. > > The updated iclog behavior does sound more friendly to me than what we > > currently do (obviously, based on my previous comments), > > As per above, the conditional behaviour I've added here is actually > less friendly :) > Heh. :P > > but I am > > slightly skeptical of how such a change fixes the root cause of a hang. > > Is this a stall/perf issue or an actual log deadlock? If the latter, > > what prevents this deadlock on current upstream? > > The problem that I talk about is introduced by this patch and also > fixed by this patch. The problem does not exist until we have code > that depends on an async CIL push guaranteeing that items in the CIL > are unpinned. 
A stock xfs_log_force_seq(0) call guarantees that the > log reaches disk by waiting on the CIL push then pushing the iclogs. > All it does is elide waiting for the final iclogs to hit the disk. > > By eliding the "wait for CIL push" in an async xfs_log_force_seq() > call, we also elide the "push iclogs" part of the log force and > that's what causes the AIL problem. We don't push the iclogs, so the > items committed to the last iclog don't get dispatched and the items > never get unpinned because the checkpoint never completes. > So in general it sounds like this patch broke async log force by skipping the current iclog switch somehow or another and then this change is incorporated to fix it (hence there is no outstanding bug upstream). That makes a bit more sense. I'll get the detailed context from the patches. > So this "AIL can hang because tail items aren't unpinned" condition > is exposed by async CIL forces. It is also fixed by this patch > passing the async push state to the CIL push and having it ensure > that the commit iclog if written out if the CIL push was triggered > by an async push. > > Like all log forces, async pushes are rare, but we have to ensure > tehy behave correctly. There is no bug in the code before this > patch, and there is no bug in the code (that I know of) after this > patch. I've described a condition that we have to handle in this > patch that, if we don't, we end up with broken AIL pushing. > Ok. > As to the "root cause of the hang" I now know what that is too, but > addressing it is outside the scope of this patchset. It's taken me a > long time to identify this, but I've been seeing symptoms of it for > quite a long time (i.e. years) but until doing this async CIL push > work I've never been able to identify the cause. > > That is, the root cause is that the AIL cannot force a pinned buffer > to disk, and a log force (of any kind) does not guarantee a pinned > item is unpinned. 
That's because the CIL is pipelined, meaning that > an item can be pinned in one CIL context, and while that is being > pushed, the front end can relog that same item and pin it in the > current CIL context. Hence it has two pins, not one. Now when the > original log force completes, the pin from the first CIL context has > been removed, but the pin from the current CIL context is still > present. > Somewhat related to the above point, ISTM our current behavior can lead to scenarios where the potential window for elevated pin counts is somewhat artificially widened. IOW, the log vector for a pincount == 1 item might have been written out to disk, but the log item itself not unpinned for some time later because the commit record is still sitting in an active iclog. The unpin processing of the item doesn't occur until that commit record iclog is synced out and I/O completes, which leaves plenty of time for another pin to come along in the meantime (and the scenario potentially repeats, starving out pincount == 0 state). So ISTM the "damage" is done to some degree long before the AIL has any opportunity to fix it, particularly if the only recourse it has is a log force. (And of course there's a huge tradeoff between current behavior and something that unconditionally synced out commit record iclogs, particularly depending on CIL checkpoint sizes, so that's a broad simplification and not a valid solution.) > Hence when we are relogging buffers over time (e.g. repeatedly > modifying a directory), the AIL always sees the buffer pinned and > cannot flush it. Even if the buffer has been moved to the delwri > list (i.e. had a successful push) we still can't write it out > because the delwri code skips pinned buffers. > > This iterative pin continues preventing the AIL from flushing the > buffer until it pins the tail of the log. 
When the log runs out of > space and blocks the modifying/relogging task(s), the log force the > AIL issues allows the CIL to be pushed without the front end > relogging the item and re-pinning the item. At the completion of the > log force, the AIL can now push the unpinned buffer and hence move > the tail of the log forward which frees up log reservation space. > This then wakes the front end, which goes back to its relogging > game and the cycle repeats over. > The description of the nested pin behavior makes sense. I've observed the same thing recently when looking at some unrelated shutdown problems. In any event, if I'm following correctly the deadlock issue was a transient problem with this patch and not an upstream problem. However there is still a pipeline stall/performance issue that manifests as described above, and that preexists the proposed changes. Yes? > This is really nasty behaviour, and it's only recently that I've got > a handle on it. I found it because my original "async CIL push" code > resulted in long stalls every time the log is filled and the tail is > pinned by a buffer that is being relogged in this manner.... > > I'm not sure how to fix this yet - the AIL needs to block the front > end relogging to allow the buffer to be unpinned. Essentially, we > need to hold the pinned items locked across a CIL push to guarantee > they are unpinned, but that's the complete opposite of what the AIL > currently does to prevent the front end from seeing long tail lock > latencies when modifying stuff.... > When this stall problem manifests, I'm assuming it's exacerbated by delayed logging and the commit record behavior I described above. If that's the case, could the AIL communicate writeback pressure through affected log items such that checkpoints in which they are resident are flushed out completely/immediately when the checkpoints occur?
I suppose that would require a log item flag or some such, which does raise a concern of unnecessarily tagging many items (it's not clear to me how likely that really is), but I'm curious if that would be an effective POC at least.. Brian > Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com >
On Sat, Feb 27, 2021 at 11:25:06AM -0500, Brian Foster wrote: > On Fri, Feb 26, 2021 at 09:03:05AM +1100, Dave Chinner wrote: > > On Thu, Feb 25, 2021 at 08:12:12AM -0500, Brian Foster wrote: > > > On Thu, Feb 25, 2021 at 08:10:58AM +1100, Dave Chinner wrote: > > > > On Tue, Feb 23, 2021 at 04:32:11PM +1100, Dave Chinner wrote: > > The problem that I talk about is introduced by this patch and also > > fixed by this patch. The problem does not exist until we have code > > that depends on an async CIL push guaranteeing that items in the CIL > > are unpinned. A stock xfs_log_force_seq(0) call guarantees that the > > log reaches disk by waiting on the CIL push then pushing the iclogs. > > All it does is elide waiting for the final iclogs to hit the disk. > > > > By eliding the "wait for CIL push" in an async xfs_log_force_seq() > > call, we also elide the "push iclogs" part of the log force and > > that's what causes the AIL problem. We don't push the iclogs, so the > > items committed to the last iclog don't get dispatched and the items > > never get unpinned because the checkpoint never completes. > > > > So in general it sounds like this patch broke async log force by > skipping the current iclog switch somehow or another and then this > change is incorporated to fix it (hence there is no outstanding bug > upstream). That makes a bit more sense. I'll get the detailed context > from the patches. Yes. > > As to the "root cause of the hang" I now know what that is too, but > > addressing it is outside the scope of this patchset. It's taken me a > > long time to identify this, but I've been seeing symptoms of it for > > quite a long time (i.e. years) but until doing this async CIL push > > work I've never been able to identify the cause. > > > > That is, the root cause is that the AIL cannot force a pinned buffer > > to disk, and a log force (of any kind) does not guarantee a pinned > > item is unpinned. 
That's because the CIL is pipelined, meaning that > > an item can be pinned in one CIL context, and while that is being > > pushed, the front end can relog that same item and pin it in the > > current CIL context. Hence it has two pins, not one. Now when the > > original log force completes, the pin from the first CIL context has > > been removed, but the pin from the current CIL context is still > > present. > > > > Somewhat related to the above point, ISTM our current behavior can lead > to scenarios where the potential window for elevated pin counts is > somewhat artificially widened. IOW, the log vector for a pincount == 1 > item might have been written out to disk, but the log item itself not > unpinned for some time later because the commit record is still sitting > in an active iclog. The unpin processing of the item doesn't occur until > that commit record iclog is synced out and I/O completes, which leaves > plenty of time for another pin to come along in the meantime (and the > scenario potentially repeats, starving out pincount == 0 state). So ISTM > the "damage" is done to some degree long before the AIL has any > opportunity to fix it, particularly if the only recourse it has is a log > force. > > (And of course there's a huge tradeoff between current behavior and > something that unconditionally synced out commit record iclogs, > particularly depending on CIL checkpoint sizes, so that's a broad > simplification and not a valid solution.) Yes, this can be an issue because when the single CIL work is CPU bound, it is the first iclog write of the checkpoint that pushes the previous commit record to disk. So, like you say, there's a natural pin overlap in this situation. > > Hence when we are relogging buffers over time (e.g. repeatedly > > modifying a directory), the AIL always sees the buffer pinned and > > cannot flush it. Even if the buffer has been moved to the delwri > > list (i.e. 
had a successful push) we still can't write it out > > because the delwri code skips pinned buffers. > > > > This iterative pin continues preventing the AIL from flushing the > > buffer until it pins the tail of the log. When the log runs out of > > space and blocks the modifying/relogging task(s), the log force the > > AIL issues allows the CIL to be pushed without the front end > > relogging the item and re-pinning the item. At the completion of the > > log force, the AIL can now push the unpinned buffer and hence move > > the tail of the log forward which frees up log reservation space. > > This then wakes the front end, which goes back to it's relogging > > game and the cycle repeats over. > > > > The description of the nested pin behavior makes sense. I've observed > the same thing recently when looking at some unrelated shutdown > problems. > > In any event, if I'm following correctly the deadlock issue was a > transient problem with this patch and not an upstream problem. However > there is still a pipeline stall/performance issue that manifests as > described above, and that preexists the proposed changes. Yes? Yes, AFAICT, this pin problem dates back to day zero. Before delayed logging it required the inode or buffer to be relogged in almost every iclog that was written (think a long running truncate removing hundreds of thousands of extents). But before delayed logging, this was very slow and limited by iclog IO so having the AIL stall would have largely been unnoticed. > > This is really nasty behaviour, and it's only recently that I've got > > a handle on it. I found it because my original "async CIL push" code > > resulted in long stalls every time the log is filled and the tail is > > pinned by a buffer that is being relogged in this manner.... > > > > I'm not sure how to fix this yet - the AIL needs to block the front > > end relogging to allow the buffer to be unpinned. 
Essentially, we > > need to hold the pinned items locked across a CIL push to guarantee > > they are unpinned, but that's the complete opposite of what the AIL > > currently does to prevent the front end from seeing long tail lock > > latencies when modifying stuff.... > > When this stall problem manifests, I'm assuming it's exacerbated by > delayed logging and the commit record behavior I described above. If > that's the case, could the AIL communicate writeback pressure through > affected log items such that checkpoints in which they are resident are > flushed out completely/immediately when the checkpoints occur? I suppose > that would require a log item flag or some such, which does raise a > concern of unnecessarily tagging many items (it's not clear to me how > likely that really is), but I'm curious if that would be an effective > POC at least.. I don't think we need to do anything like that. All we need to do to ensure that the AIL can flush a pinned buffer is to lock it, kick the log and wait for the pin count to go to zero. Then we can write it just fine, blocking only the front end transactions that need that buffer lock. Same goes for inodes, though xfs_iunpin_wait() already does this.... The only issue is determining if this causes unacceptable long tail latencies on buffers like the AG headers.... Cheers, Dave.
On Mon, Mar 01, 2021 at 03:54:22PM +1100, Dave Chinner wrote: > On Sat, Feb 27, 2021 at 11:25:06AM -0500, Brian Foster wrote: > > On Fri, Feb 26, 2021 at 09:03:05AM +1100, Dave Chinner wrote: > > > On Thu, Feb 25, 2021 at 08:12:12AM -0500, Brian Foster wrote: > > > > On Thu, Feb 25, 2021 at 08:10:58AM +1100, Dave Chinner wrote: > > > > > On Tue, Feb 23, 2021 at 04:32:11PM +1100, Dave Chinner wrote: > > > The problem that I talk about is introduced by this patch and also > > > fixed by this patch. The problem does not exist until we have code > > > that depends on an async CIL push guaranteeing that items in the CIL > > > are unpinned. A stock xfs_log_force_seq(0) call guarantees that the > > > log reaches disk by waiting on the CIL push then pushing the iclogs. > > > All it does is elide waiting for the final iclogs to hit the disk. > > > > > > By eliding the "wait for CIL push" in an async xfs_log_force_seq() > > > call, we also elide the "push iclogs" part of the log force and > > > that's what causes the AIL problem. We don't push the iclogs, so the > > > items committed to the last iclog don't get dispatched and the items > > > never get unpinned because the checkpoint never completes. > > > > > > > So in general it sounds like this patch broke async log force by > > skipping the current iclog switch somehow or another and then this > > change is incorporated to fix it (hence there is no outstanding bug > > upstream). That makes a bit more sense. I'll get the detailed context > > from the patches. > > Yes. > > > > As to the "root cause of the hang" I now know what that is too, but > > > addressing it is outside the scope of this patchset. It's taken me a > > > long time to identify this, but I've been seeing symptoms of it for > > > quite a long time (i.e. years) but until doing this async CIL push > > > work I've never been able to identify the cause. 
> > > > > > That is, the root cause is that the AIL cannot force a pinned buffer > > > to disk, and a log force (of any kind) does not guarantee a pinned > > > item is unpinned. That's because the CIL is pipelined, meaning that > > > an item can be pinned in one CIL context, and while that is being > > > pushed, the front end can relog that same item and pin it in the > > > current CIL context. Hence it has two pins, not one. Now when the > > > original log force completes, the pin from the first CIL context has > > > been removed, but the pin from the current CIL context is still > > > present. > > > > > > > Somewhat related to the above point, ISTM our current behavior can lead > > to scenarios where the potential window for elevated pin counts is > > somewhat artificially widened. IOW, the log vector for a pincount == 1 > > item might have been written out to disk, but the log item itself not > > unpinned for some time later because the commit record is still sitting > > in an active iclog. The unpin processing of the item doesn't occur until > > that commit record iclog is synced out and I/O completes, which leaves > > plenty of time for another pin to come along in the meantime (and the > > scenario potentially repeats, starving out pincount == 0 state). So ISTM > > the "damage" is done to some degree long before the AIL has any > > opportunity to fix it, particularly if the only recourse it has is a log > > force. > > > > (And of course there's a huge tradeoff between current behavior and > > something that unconditionally synced out commit record iclogs, > > particularly depending on CIL checkpoint sizes, so that's a broad > > simplification and not a valid solution.) > > Yes, this can be an issue because when the single CIL work is CPU > bound, it is the first iclog write of the checkpoint that pushes the > previous commit record to disk. So, like you say, there's a natural > pin overlap in this situation. 
> > > > Hence when we are relogging buffers over time (e.g. repeatedly > > > modifying a directory), the AIL always sees the buffer pinned and > > > cannot flush it. Even if the buffer has been moved to the delwri > > > list (i.e. had a successful push) we still can't write it out > > > because the delwri code skips pinned buffers. > > > > > > This iterative pin continues preventing the AIL from flushing the > > > buffer until it pins the tail of the log. When the log runs out of > > > space and blocks the modifying/relogging task(s), the log force the > > > AIL issues allows the CIL to be pushed without the front end > > > relogging the item and re-pinning the item. At the completion of the > > > log force, the AIL can now push the unpinned buffer and hence move > > > the tail of the log forward which frees up log reservation space. > > > This then wakes the front end, which goes back to it's relogging > > > game and the cycle repeats over. > > > > > > > The description of the nested pin behavior makes sense. I've observed > > the same thing recently when looking at some unrelated shutdown > > problems. > > > > In any event, if I'm following correctly the deadlock issue was a > > transient problem with this patch and not an upstream problem. However > > there is still a pipeline stall/performance issue that manifests as > > described above, and that preexists the proposed changes. Yes? > > Yes, AFAICT, this pin problem dates back to day zero. Before delayed > logging it required the inode or buffer to be relogged in almost > every iclog that was written (think a long running truncate removing > hundreds of thousands of extents). But before delayed logging, this > was very slow and limited by iclog IO so having the AIL stall would > have largely been unnoticed. > > > > This is really nasty behaviour, and it's only recently that I've got > > > a handle on it. 
I found it because my original "async CIL push" code > > > resulted in long stalls every time the log is filled and the tail is > > > pinned by a buffer that is being relogged in this manner.... > > > > > > I'm not sure how to fix this yet - the AIL needs to block the front > > > end relogging to allow the buffer to be unpinned. Essentially, we > > > need to hold the pinned items locked across a CIL push to guarantee > > > they are unpinned, but that's the complete opposite of what the AIL > > > currently does to prevent the front end from seeing long tail lock > > > latencies when modifying stuff.... > > > > When this stall problem manifests, I'm assuming it's exacerbated by > > delayed logging and the commit record behavior I described above. If > > that's the case, could the AIL communicate writeback pressure through > > affected log items such that checkpoints in which they are resident are > > flushed out completely/immediately when the checkpoints occur? I suppose > > that would require a log item flag or some such, which does raise a > > concern of unnecessarily tagging many items (it's not clear to me how > > likely that really is), but I'm curious if that would be an effective > > POC at least.. > > I don't think we need to do anything like that. All we need to do to > ensure that the AIL can flush a pinned buffer is to lock it, kick > the log and wait for the pin count to go to zero. Then we can write > it just fine, blocking only the front end transactions that need > that buffer lock. Same goes for inodes, though xfs_iunpin_wait() > already does this.... > Yeah, but why would we want to block xfsaild on a single item like that? Wouldn't holding the item locked like that just create a new stall point within xfsaild? Maybe I'm missing something, but what you describe here basically just sounds like a "lock the item and run a sync log force" pattern. 
I agree that the implementation I handwave about above seems overkill, but I'm really curious what the disadvantage of some form of logic in the logging code to do something like "if this checkpoint contains items that are AIL resident && under some measure of writeback pressure && the next iclog is active, want_sync the commit record iclog on background push to reduce item pin count times" might be. Brian > The only issue is determining if this causes unaaceptible long tail > latencies on buffers like the AG headers.... > > Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com >
On Mon, Mar 01, 2021 at 08:32:18AM -0500, Brian Foster wrote: > On Mon, Mar 01, 2021 at 03:54:22PM +1100, Dave Chinner wrote: > > On Sat, Feb 27, 2021 at 11:25:06AM -0500, Brian Foster wrote: > > > On Fri, Feb 26, 2021 at 09:03:05AM +1100, Dave Chinner wrote: > > > > This is really nasty behaviour, and it's only recently that I've got > > > > a handle on it. I found it because my original "async CIL push" code > > > > resulted in long stalls every time the log is filled and the tail is > > > > pinned by a buffer that is being relogged in this manner.... > > > > > > > > I'm not sure how to fix this yet - the AIL needs to block the front > > > > end relogging to allow the buffer to be unpinned. Essentially, we > > > > need to hold the pinned items locked across a CIL push to guarantee > > > > they are unpinned, but that's the complete opposite of what the AIL > > > > currently does to prevent the front end from seeing long tail lock > > > > latencies when modifying stuff.... > > > > > > When this stall problem manifests, I'm assuming it's exacerbated by > > > delayed logging and the commit record behavior I described above. If > > > that's the case, could the AIL communicate writeback pressure through > > > affected log items such that checkpoints in which they are resident are > > > flushed out completely/immediately when the checkpoints occur? I suppose > > > that would require a log item flag or some such, which does raise a > > > concern of unnecessarily tagging many items (it's not clear to me how > > > likely that really is), but I'm curious if that would be an effective > > > POC at least.. > > > > I don't think we need to do anything like that. All we need to do to > > ensure that the AIL can flush a pinned buffer is to lock it, kick > > the log and wait for the pin count to go to zero. Then we can write > > it just fine, blocking only the front end transactions that need > > that buffer lock. 
Same goes for inodes, though xfs_iunpin_wait() > > already does this.... > > Yeah, but why would we want to block xfsaild on a single item like that? Who said anything about blocking the AIL on a single item? :) > Wouldn't holding the item locked like that just create a new stall point > within xfsaild? Maybe I'm missing something, but what you describe here > basically just sounds like a "lock the item and run a sync log force" > pattern. What I was thinking is that if the item is pinned and at the tail of the log, then we leave it locked when we flush it rather than unlocking it and relocking it when the delwri submit code gets to it. If it gets unpinned before the delwri code gets to it, all good. If not, the delwri code being unable to flush it will feed back up into the AIL to trigger a log force, which will unpin it in the near future and it will be written on the next AIL delwri submit cycle. The thing we need to be careful of here is minimising the lock hold time by the AIL while we unpin the item. That's the "long tail latencies" problem I mention above. Essentially, we need to try to avoid holding the item locked for a long period of time before issuing the log force and/or resubmitting it for IO once it is unlocked. I have a few ideas on this, but no patches yet. Cheers, Dave.
On Wed, Mar 03, 2021 at 12:23:04PM +1100, Dave Chinner wrote: > On Mon, Mar 01, 2021 at 08:32:18AM -0500, Brian Foster wrote: > > On Mon, Mar 01, 2021 at 03:54:22PM +1100, Dave Chinner wrote: > > > On Sat, Feb 27, 2021 at 11:25:06AM -0500, Brian Foster wrote: > > > > On Fri, Feb 26, 2021 at 09:03:05AM +1100, Dave Chinner wrote: > > > > > This is really nasty behaviour, and it's only recently that I've got > > > > > a handle on it. I found it because my original "async CIL push" code > > > > > resulted in long stalls every time the log is filled and the tail is > > > > > pinned by a buffer that is being relogged in this manner.... > > > > > > > > > > I'm not sure how to fix this yet - the AIL needs to block the front > > > > > end relogging to allow the buffer to be unpinned. Essentially, we > > > > > need to hold the pinned items locked across a CIL push to guarantee > > > > > they are unpinned, but that's the complete opposite of what the AIL > > > > > currently does to prevent the front end from seeing long tail lock > > > > > latencies when modifying stuff.... > > > > > > > > When this stall problem manifests, I'm assuming it's exacerbated by > > > > delayed logging and the commit record behavior I described above. If > > > > that's the case, could the AIL communicate writeback pressure through > > > > affected log items such that checkpoints in which they are resident are > > > > flushed out completely/immediately when the checkpoints occur? I suppose > > > > that would require a log item flag or some such, which does raise a > > > > concern of unnecessarily tagging many items (it's not clear to me how > > > > likely that really is), but I'm curious if that would be an effective > > > > POC at least.. > > > > > > I don't think we need to do anything like that. All we need to do to > > > ensure that the AIL can flush a pinned buffer is to lock it, kick > > > the log and wait for the pin count to go to zero. 
Then we can write > > > it just fine, blocking only the front end transactions that need > > > that buffer lock. Same goes for inodes, though xfs_iunpin_wait() > > > already does this.... > > > > > > > Yeah, but why would we want to block xfsaild on a single item like that? > > Who said anything about blocking the AIL on a single item? :) > > > Wouldn't holding the item locked like that just create a new stall point > > within xfsaild? Maybe I'm missing something, but what you describe here > > basically just sounds like a "lock the item and run a sync log force" > > pattern. > > What I was thinking is that if the item is pinned and at the > tail of the log, then we leave it locked when we flush it rather > than unlocking it and relocking it when the delwri submit code gets > to it. If it gets unpinned before the delwri code gets to it, all > good. If not, the delwri code being unable to flush it will feed > back up into the AIL to trigger a log force, which will unpin it > in the near future and it will be written on the next AIL delwri > submit cycle. > I'm not sure what you mean by leaving the item locked when we flush it if it is pinned, since we don't flush pinned items. Perhaps implied is that this would trylock the buffer first, then only worry about if it's pinned if we acquire the lock. If so (and the pin is at the tail), hold the lock and kick the log as a means to ensure that xfsaild is guaranteed next access to the buffer once unpinned. Hm? IIUC, that seems interesting. Though as noted in the flush optimization series, I am a little leery about issuing (even async) log forces with locks held, at least with the current implementation.. Brian > The thing we need to be careful of this is minimising the lock hold > time by the AIL while we unpin the item. That's the "long tail > latencies" problem I mention above. 
Essentially, we need to try to > avoid holding the item locked for a long period of time before > issuing the log force and/or resubmitting it for IO once it is > unlocked. I have a few ideas on this, but no patches yet. > > Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com >
On Wed, Mar 03, 2021 at 12:20:19PM -0500, Brian Foster wrote: > On Wed, Mar 03, 2021 at 12:23:04PM +1100, Dave Chinner wrote: > > On Mon, Mar 01, 2021 at 08:32:18AM -0500, Brian Foster wrote: > > > On Mon, Mar 01, 2021 at 03:54:22PM +1100, Dave Chinner wrote: > > > > On Sat, Feb 27, 2021 at 11:25:06AM -0500, Brian Foster wrote: > > > > > On Fri, Feb 26, 2021 at 09:03:05AM +1100, Dave Chinner wrote: > > > > > > This is really nasty behaviour, and it's only recently that I've got > > > > > > a handle on it. I found it because my original "async CIL push" code > > > > > > resulted in long stalls every time the log is filled and the tail is > > > > > > pinned by a buffer that is being relogged in this manner.... > > > > > > > > > > > > I'm not sure how to fix this yet - the AIL needs to block the front > > > > > > end relogging to allow the buffer to be unpinned. Essentially, we > > > > > > need to hold the pinned items locked across a CIL push to guarantee > > > > > > they are unpinned, but that's the complete opposite of what the AIL > > > > > > currently does to prevent the front end from seeing long tail lock > > > > > > latencies when modifying stuff.... > > > > > > > > > > When this stall problem manifests, I'm assuming it's exacerbated by > > > > > delayed logging and the commit record behavior I described above. If > > > > > that's the case, could the AIL communicate writeback pressure through > > > > > affected log items such that checkpoints in which they are resident are > > > > > flushed out completely/immediately when the checkpoints occur? I suppose > > > > > that would require a log item flag or some such, which does raise a > > > > > concern of unnecessarily tagging many items (it's not clear to me how > > > > > likely that really is), but I'm curious if that would be an effective > > > > > POC at least.. > > > > > > > > I don't think we need to do anything like that. 
All we need to do to > > > > ensure that the AIL can flush a pinned buffer is to lock it, kick > > > > the log and wait for the pin count to go to zero. Then we can write > > > > it just fine, blocking only the front end transactions that need > > > > that buffer lock. Same goes for inodes, though xfs_iunpin_wait() > > > > already does this.... > > > > > > > > > > Yeah, but why would we want to block xfsaild on a single item like that? > > > > Who said anything about blocking the AIL on a single item? :) > > > > > Wouldn't holding the item locked like that just create a new stall point > > > within xfsaild? Maybe I'm missing something, but what you describe here > > > basically just sounds like a "lock the item and run a sync log force" > > > pattern. > > > > What I was thinking is that if the item is pinned and at the > > tail of the log, then we leave it locked when we flush it rather > > than unlocking it and relocking it when the delwri submit code gets > > to it. If it gets unpinned before the delwri code gets to it, all > > good. If not, the delwri code being unable to flush it will feed > > back up into the AIL to trigger a log force, which will unpin it > > in the near future and it will be written on the next AIL delwri > > submit cycle. > > > > I'm not sure what you mean by leaving the item locked when we flush it > if it is pinned, since we don't flush pinned items. That's exactly what I'm talking about changing. -Dave.
diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
index 84cd9b6c6d1f..a0ecaf4f9b77 100644
--- a/fs/xfs/xfs_log.c
+++ b/fs/xfs/xfs_log.c
@@ -3335,6 +3335,20 @@ xfs_log_force(
 	return -EIO;
 }
 
+/*
+ * Force the log to a specific LSN.
+ *
+ * If an iclog with that lsn can be found:
+ *	If it is in the DIRTY state, just return.
+ *	If it is in the ACTIVE state, move the in-core log into the WANT_SYNC
+ *		state and go to sleep or return.
+ *	If it is in any other state, go to sleep or return.
+ *
+ * Synchronous forces are implemented with a wait queue. All callers trying
+ * to force a given lsn to disk must wait on the queue attached to the
+ * specific in-core log. When given in-core log finally completes its write
+ * to disk, that thread will wake up all threads waiting on the queue.
+ */
 static int
 xlog_force_lsn(
 	struct xlog		*log,
@@ -3398,18 +3412,15 @@ xlog_force_lsn(
 }
 
 /*
- * Force the in-core log to disk for a specific LSN.
+ * Force the log to a specific checkpoint sequence.
  *
- * Find in-core log with lsn.
- *	If it is in the DIRTY state, just return.
- *	If it is in the ACTIVE state, move the in-core log into the WANT_SYNC
- *		state and go to sleep or return.
- *	If it is in any other state, go to sleep or return.
+ * First force the CIL so that all the required changes have been flushed to the
+ * iclogs. If the CIL force completed it will return a commit LSN that indicates
+ * the iclog that needs to be flushed to stable storage.
  *
- * Synchronous forces are implemented with a wait queue. All callers trying
- * to force a given lsn to disk must wait on the queue attached to the
- * specific in-core log. When given in-core log finally completes its write
- * to disk, that thread will wake up all threads waiting on the queue.
+ * If the XFS_LOG_SYNC flag is not set, we only trigger a background CIL force
+ * and do not wait for it to complete, nor do we attempt to check/flush iclogs
+ * as the CIL will not have committed when xlog_cil_force_seq() returns.
  */
 int
 xfs_log_force_seq(
@@ -3426,7 +3437,7 @@ xfs_log_force_seq(
 	XFS_STATS_INC(mp, xs_log_force);
 	trace_xfs_log_force(mp, seq, _RET_IP_);
 
-	lsn = xlog_cil_force_seq(log, seq);
+	lsn = xlog_cil_force_seq(log, flags, seq);
 	if (lsn == NULLCOMMITLSN)
 		return 0;
 
diff --git a/fs/xfs/xfs_log.h b/fs/xfs/xfs_log.h
index 33ae53401060..77d3fca28f55 100644
--- a/fs/xfs/xfs_log.h
+++ b/fs/xfs/xfs_log.h
@@ -104,6 +104,7 @@ struct xlog_ticket;
 struct xfs_log_item;
 struct xfs_item_ops;
 struct xfs_trans;
+struct xlog;
 
 int	xfs_log_force(struct xfs_mount *mp, uint flags);
 int	xfs_log_force_seq(struct xfs_mount *mp, uint64_t seq, uint flags,
diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
index 2fda8c4513b2..be6c91d7218f 100644
--- a/fs/xfs/xfs_log_cil.c
+++ b/fs/xfs/xfs_log_cil.c
@@ -999,7 +999,8 @@ xlog_cil_push_background(
 static void
 xlog_cil_push_now(
 	struct xlog	*log,
-	xfs_lsn_t	push_seq)
+	xfs_lsn_t	push_seq,
+	bool		flush)
 {
 	struct xfs_cil	*cil = log->l_cilp;
 
@@ -1009,7 +1010,8 @@ xlog_cil_push_now(
 	ASSERT(push_seq && push_seq <= cil->xc_current_sequence);
 
 	/* start on any pending background push to minimise wait time on it */
-	flush_work(&cil->xc_push_work);
+	if (flush)
+		flush_work(&cil->xc_push_work);
 
 	/*
 	 * If the CIL is empty or we've already pushed the sequence then
@@ -1109,16 +1111,22 @@ xlog_cil_commit(
 /*
  * Conditionally push the CIL based on the sequence passed in.
  *
- * We only need to push if we haven't already pushed the sequence
- * number given. Hence the only time we will trigger a push here is
- * if the push sequence is the same as the current context.
+ * We only need to push if we haven't already pushed the sequence number given.
+ * Hence the only time we will trigger a push here is if the push sequence is
+ * the same as the current context.
  *
- * We return the current commit lsn to allow the callers to determine if a
- * iclog flush is necessary following this call.
+ * If the sequence is zero, push the current sequence. If XFS_LOG_SYNC is set
+ * in the flags, wait for it to complete, otherwise just return NULLCOMMITLSN
+ * to indicate we didn't wait for a commit lsn.
+ *
+ * If we waited for the push to complete, then we return the current commit lsn
+ * to allow the callers to determine if an iclog flush is necessary following
+ * this call.
  */
 xfs_lsn_t
 xlog_cil_force_seq(
 	struct xlog	*log,
+	uint32_t	flags,
 	uint64_t	sequence)
 {
 	struct xfs_cil	*cil = log->l_cilp;
@@ -1127,13 +1135,18 @@ xlog_cil_force_seq(
 
 	ASSERT(sequence <= cil->xc_current_sequence);
 
+	if (!sequence)
+		sequence = cil->xc_current_sequence;
+
 	/*
	 * check to see if we need to force out the current context.
	 * xlog_cil_push() handles racing pushes for the same sequence,
	 * so no need to deal with it here.
	 */
 restart:
-	xlog_cil_push_now(log, sequence);
+	xlog_cil_push_now(log, sequence, flags & XFS_LOG_SYNC);
+	if (!(flags & XFS_LOG_SYNC))
+		return commit_lsn;
 
 	/*
	 * See if we can find a previous sequence still committing.
diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h
index 59778cd5ecdd..d9929d9146e4 100644
--- a/fs/xfs/xfs_log_priv.h
+++ b/fs/xfs/xfs_log_priv.h
@@ -558,12 +558,13 @@ void	xlog_cil_commit(struct xlog *log, struct xfs_trans *tp,
 /*
  * CIL force routines
  */
-xfs_lsn_t xlog_cil_force_seq(struct xlog *log, uint64_t sequence);
+xfs_lsn_t xlog_cil_force_seq(struct xlog *log, uint32_t flags,
+				uint64_t sequence);
 
 static inline void
 xlog_cil_force(struct xlog *log)
 {
-	xlog_cil_force_seq(log, log->l_cilp->xc_current_sequence);
+	xlog_cil_force_seq(log, XFS_LOG_SYNC, log->l_cilp->xc_current_sequence);
 }
 
 /*
diff --git a/fs/xfs/xfs_sysfs.c b/fs/xfs/xfs_sysfs.c
index f1bc88f4367c..18dc5eca6c04 100644
--- a/fs/xfs/xfs_sysfs.c
+++ b/fs/xfs/xfs_sysfs.c
@@ -10,6 +10,7 @@
 #include "xfs_log_format.h"
 #include "xfs_trans_resv.h"
 #include "xfs_sysfs.h"
+#include "xfs_log.h"
 #include "xfs_log_priv.h"
 #include "xfs_mount.h"
 
diff --git a/fs/xfs/xfs_trace.c b/fs/xfs/xfs_trace.c
index 9b8d703dc9fd..d111a994b7b6 100644
--- a/fs/xfs/xfs_trace.c
+++ b/fs/xfs/xfs_trace.c
@@ -20,6 +20,7 @@
 #include "xfs_bmap.h"
 #include "xfs_attr.h"
 #include "xfs_trans.h"
+#include "xfs_log.h"
 #include "xfs_log_priv.h"
 #include "xfs_buf_item.h"
 #include "xfs_quota.h"
diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
index 697703f3be48..7d05694681e3 100644
--- a/fs/xfs/xfs_trans.c
+++ b/fs/xfs/xfs_trans.c
@@ -9,7 +9,6 @@
 #include "xfs_shared.h"
 #include "xfs_format.h"
 #include "xfs_log_format.h"
-#include "xfs_log_priv.h"
 #include "xfs_trans_resv.h"
 #include "xfs_mount.h"
 #include "xfs_extent_busy.h"
@@ -17,6 +16,7 @@
 #include "xfs_trans.h"
 #include "xfs_trans_priv.h"
 #include "xfs_log.h"
+#include "xfs_log_priv.h"
 #include "xfs_trace.h"
 #include "xfs_error.h"
 #include "xfs_defer.h"
diff --git a/fs/xfs/xfs_trans_ail.c b/fs/xfs/xfs_trans_ail.c
index dbb69b4bf3ed..dfc0206c0d36 100644
--- a/fs/xfs/xfs_trans_ail.c
+++ b/fs/xfs/xfs_trans_ail.c
@@ -17,6 +17,7 @@
 #include "xfs_errortag.h"
 #include "xfs_error.h"
 #include "xfs_log.h"
+#include "xfs_log_priv.h"
 
 #ifdef DEBUG
 /*
@@ -429,8 +430,12 @@ xfsaild_push(
 
 	/*
	 * If we encountered pinned items or did not finish writing out all
-	 * buffers the last time we ran, force the log first and wait for it
-	 * before pushing again.
+	 * buffers the last time we ran, force a background CIL push to get the
+	 * items unpinned in the near future. We do not wait on the CIL push as
+	 * that could stall us for seconds if there is enough background IO
+	 * load. Stalling for that long when the tail of the log is pinned and
+	 * needs flushing will hard stop the transaction subsystem when log
+	 * space runs out.
	 */
 	if (ailp->ail_log_flush && ailp->ail_last_pushed_lsn == 0 &&
 	    (!list_empty_careful(&ailp->ail_buf_list) ||
@@ -438,7 +443,7 @@ xfsaild_push(
 		ailp->ail_log_flush = 0;
 
 		XFS_STATS_INC(mp, xs_push_ail_flush);
-		xfs_log_force(mp, XFS_LOG_SYNC);
+		xlog_cil_force_seq(mp->m_log, 0, 0);
 	}
 
 	spin_lock(&ailp->ail_lock);