
[2/3] xfs: AIL needs asynchronous CIL forcing

Message ID 20210223053212.3287398-3-david@fromorbit.com (mailing list archive)
State Superseded
Series xfs: CIL improvements

Commit Message

Dave Chinner Feb. 23, 2021, 5:32 a.m. UTC
From: Dave Chinner <dchinner@redhat.com>

The AIL pushing is stalling on log forces when it comes across
pinned items. This is happening on removal workloads where the AIL
is dominated by stale items that are removed from AIL when the
checkpoint that marks the items stale is committed to the journal.
This results in relatively few items in the AIL, but those that
remain are often pinned, as the directories the items are being
removed from are still being logged.

As a result, many push cycles through the CIL will first issue a
blocking log force to unpin the items. This can take some time to
complete, with tracing regularly showing push delays of half a
second and sometimes up into the range of several seconds. Sequences
like this aren't uncommon:

....
 399.829437:  xfsaild: last lsn 0x11002dd000 count 101 stuck 101 flushing 0 tout 20
<wanted 20ms, got 270ms delay>
 400.099622:  xfsaild: target 0x11002f3600, prev 0x11002f3600, last lsn 0x0
 400.099623:  xfsaild: first lsn 0x11002f3600
 400.099679:  xfsaild: last lsn 0x1100305000 count 16 stuck 11 flushing 0 tout 50
<wanted 50ms, got 500ms delay>
 400.589348:  xfsaild: target 0x110032e600, prev 0x11002f3600, last lsn 0x0
 400.589349:  xfsaild: first lsn 0x1100305000
 400.589595:  xfsaild: last lsn 0x110032e600 count 156 stuck 101 flushing 30 tout 50
<wanted 50ms, got 460ms delay>
 400.950341:  xfsaild: target 0x1100353000, prev 0x110032e600, last lsn 0x0
 400.950343:  xfsaild: first lsn 0x1100317c00
 400.950436:  xfsaild: last lsn 0x110033d200 count 105 stuck 101 flushing 0 tout 20
<wanted 20ms, got 200ms delay>
 401.142333:  xfsaild: target 0x1100361600, prev 0x1100353000, last lsn 0x0
 401.142334:  xfsaild: first lsn 0x110032e600
 401.142535:  xfsaild: last lsn 0x1100353000 count 122 stuck 101 flushing 8 tout 10
<wanted 10ms, got 10ms delay>
 401.154323:  xfsaild: target 0x1100361600, prev 0x1100361600, last lsn 0x1100353000
 401.154328:  xfsaild: first lsn 0x1100353000
 401.154389:  xfsaild: last lsn 0x1100353000 count 101 stuck 101 flushing 0 tout 20
<wanted 20ms, got 300ms delay>
 401.451525:  xfsaild: target 0x1100361600, prev 0x1100361600, last lsn 0x0
 401.451526:  xfsaild: first lsn 0x1100353000
 401.451804:  xfsaild: last lsn 0x1100377200 count 170 stuck 22 flushing 122 tout 50
<wanted 50ms, got 500ms delay>
 401.933581:  xfsaild: target 0x1100361600, prev 0x1100361600, last lsn 0x0
....

In each of these cases, every AIL pass saw 101 log items stuck on
the AIL (pinned) with very few other items being found. Each pass, a
log force was issued, and delay between last/first is the sleep time
+ the sync log force time.

Some of these 101 items pinned the tail of the log. The tail of the
log does slowly creep forward (first lsn), but the problem is that
the log is actually out of reservation space because it has been
running so many transactions whose stale items never reach the
AIL but still consume log space. Hence we have a largely empty AIL, with
long term pins on items that pin the tail of the log that don't get
pushed frequently enough to keep log space available.

The problem is the hundreds of milliseconds that we block in the log
force pushing the CIL out to disk. The AIL should not be stalled
like this - it needs to run and flush items that are at the tail of
the log with minimal latency. What we really need to do is trigger a
log flush, but then not wait for it at all - we've already done our
waiting for stuff to complete when we backed off prior to the log
force being issued.

Even if we remove the XFS_LOG_SYNC from the xfs_log_force() call, we
still do a blocking flush of the CIL and that is what is causing the
issue. Hence we need a new interface for the CIL to trigger an
immediate background push of the CIL to get it moving faster but not
to wait on that to occur. While the CIL is pushing, the AIL can also
be pushing.

We already have an internal interface to do this -
xlog_cil_push_now() - but we need a wrapper for it to be used
externally. xlog_cil_force_seq() can easily be extended to do what
we need as it already implements the synchronous CIL push via
xlog_cil_push_now(). Add the necessary flags and "push current
sequence" semantics to xlog_cil_force_seq() and convert the AIL
pushing to use it.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_log.c       | 33 ++++++++++++++++++++++-----------
 fs/xfs/xfs_log.h       |  1 +
 fs/xfs/xfs_log_cil.c   | 29 +++++++++++++++++++++--------
 fs/xfs/xfs_log_priv.h  |  5 +++--
 fs/xfs/xfs_sysfs.c     |  1 +
 fs/xfs/xfs_trace.c     |  1 +
 fs/xfs/xfs_trans.c     |  2 +-
 fs/xfs/xfs_trans_ail.c | 11 ++++++++---
 8 files changed, 58 insertions(+), 25 deletions(-)

Comments

Dave Chinner Feb. 24, 2021, 9:10 p.m. UTC | #1
On Tue, Feb 23, 2021 at 04:32:11PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> The AIL pushing is stalling on log forces when it comes across
> pinned items. This is happening on removal workloads where the AIL
> is dominated by stale items that are removed from AIL when the
> checkpoint that marks the items stale is committed to the journal.
> This results in relatively few items in the AIL, but those that
> remain are often pinned, as the directories the items are being
> removed from are still being logged.
> 
> As a result, many push cycles through the CIL will first issue a
> blocking log force to unpin the items. This can take some time to
> complete, with tracing regularly showing push delays of half a
> second and sometimes up into the range of several seconds. Sequences
> like this aren't uncommon:
> 
> ....
>  399.829437:  xfsaild: last lsn 0x11002dd000 count 101 stuck 101 flushing 0 tout 20
> <wanted 20ms, got 270ms delay>
>  400.099622:  xfsaild: target 0x11002f3600, prev 0x11002f3600, last lsn 0x0
>  400.099623:  xfsaild: first lsn 0x11002f3600
>  400.099679:  xfsaild: last lsn 0x1100305000 count 16 stuck 11 flushing 0 tout 50
> <wanted 50ms, got 500ms delay>
>  400.589348:  xfsaild: target 0x110032e600, prev 0x11002f3600, last lsn 0x0
>  400.589349:  xfsaild: first lsn 0x1100305000
>  400.589595:  xfsaild: last lsn 0x110032e600 count 156 stuck 101 flushing 30 tout 50
> <wanted 50ms, got 460ms delay>
>  400.950341:  xfsaild: target 0x1100353000, prev 0x110032e600, last lsn 0x0
>  400.950343:  xfsaild: first lsn 0x1100317c00
>  400.950436:  xfsaild: last lsn 0x110033d200 count 105 stuck 101 flushing 0 tout 20
> <wanted 20ms, got 200ms delay>
>  401.142333:  xfsaild: target 0x1100361600, prev 0x1100353000, last lsn 0x0
>  401.142334:  xfsaild: first lsn 0x110032e600
>  401.142535:  xfsaild: last lsn 0x1100353000 count 122 stuck 101 flushing 8 tout 10
> <wanted 10ms, got 10ms delay>
>  401.154323:  xfsaild: target 0x1100361600, prev 0x1100361600, last lsn 0x1100353000
>  401.154328:  xfsaild: first lsn 0x1100353000
>  401.154389:  xfsaild: last lsn 0x1100353000 count 101 stuck 101 flushing 0 tout 20
> <wanted 20ms, got 300ms delay>
>  401.451525:  xfsaild: target 0x1100361600, prev 0x1100361600, last lsn 0x0
>  401.451526:  xfsaild: first lsn 0x1100353000
>  401.451804:  xfsaild: last lsn 0x1100377200 count 170 stuck 22 flushing 122 tout 50
> <wanted 50ms, got 500ms delay>
>  401.933581:  xfsaild: target 0x1100361600, prev 0x1100361600, last lsn 0x0
> ....
> 
> In each of these cases, every AIL pass saw 101 log items stuck on
> the AIL (pinned) with very few other items being found. Each pass, a
> log force was issued, and delay between last/first is the sleep time
> + the sync log force time.
> 
> Some of these 101 items pinned the tail of the log. The tail of the
> log does slowly creep forward (first lsn), but the problem is that
> the log is actually out of reservation space because it has been
> running so many transactions whose stale items never reach the
> AIL but still consume log space. Hence we have a largely empty AIL, with
> long term pins on items that pin the tail of the log that don't get
> pushed frequently enough to keep log space available.
> 
> The problem is the hundreds of milliseconds that we block in the log
> force pushing the CIL out to disk. The AIL should not be stalled
> like this - it needs to run and flush items that are at the tail of
> the log with minimal latency. What we really need to do is trigger a
> log flush, but then not wait for it at all - we've already done our
> waiting for stuff to complete when we backed off prior to the log
> force being issued.
> 
> Even if we remove the XFS_LOG_SYNC from the xfs_log_force() call, we
> still do a blocking flush of the CIL and that is what is causing the
> issue. Hence we need a new interface for the CIL to trigger an
> immediate background push of the CIL to get it moving faster but not
> to wait on that to occur. While the CIL is pushing, the AIL can also
> be pushing.
> 
> We already have an internal interface to do this -
> xlog_cil_push_now() - but we need a wrapper for it to be used
> externally. xlog_cil_force_seq() can easily be extended to do what
> we need as it already implements the synchronous CIL push via
> xlog_cil_push_now(). Add the necessary flags and "push current
> sequence" semantics to xlog_cil_force_seq() and convert the AIL
> pushing to use it.

Overnight testing indicates generic/530 hangs on small logs with
this patch. It essentially runs out of log space because there are
inode cluster buffers permanently pinned by this workload.

That is, as inodes are unlinked, they repeatedly relog the inode
cluster buffer, and so while the CIL is flushing to unpin the
buffer, a new unlink relogs it and adds a new pin to the buffer.
Hence the log force that the AIL does to unpin pinned items doesn't
actually unpin buffers that are being relogged.

These buffers only get unpinned when the log actually runs out of
space because they pin the tail of the log. Then the modifications
that are relogging the buffer stop because they fail to get a log
reservation, and the tail of the log does not move forward until the
AIL issues a log force that finally unpins the buffer and writes it
back.

Essentially, relogged buffers cannot be flushed by the AIL until the
log completely stalls.

The problem being tripped over here is that we no longer force the
final iclogs in a CIL push to disk - we leave the iclog with the
commit record in it in ACTIVE state, and by not waiting and forcing
all the iclogs to disk, the buffer never gets unpinned because there
isn't any more log pressure to force it out because everything is
blocked on reservation space.

The solution to this is to have the CIL push change the state of the
commit iclog to WANT_SYNC before it is released. This means the CIL
push will always flush the iclog to disk and the checkpoint will
complete and unpin the buffers.

Right now, we really only want to do this state switch for these
async pushes - for small sync transactions and fsync we really want
the iclog aggregation that we have now to optimise iclogbuf usage,
so I'll have to pass a new flag through the push code and back into
xlog_write(). That will make this patch behave the same as we
currently do.

In the longer term, we need to change how the AIL deals with pinned
buffers, as the current method is definitely sub-optimal. It also
explains the "everything stops for a second" performance anomalies
I've been seeing for a while in testing. But fixing that is outside
the scope of this patchset, so in the mean time I'll fix this one up
and repost it in a little while.

FWIW, this is also the likely cause of the "CIL workqueue too deep"
hangs I was seeing with the next patch in the series, too.

Cheers,

Dave.
Brian Foster Feb. 25, 2021, 1:12 p.m. UTC | #2
On Thu, Feb 25, 2021 at 08:10:58AM +1100, Dave Chinner wrote:
> On Tue, Feb 23, 2021 at 04:32:11PM +1100, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > The AIL pushing is stalling on log forces when it comes across
> > pinned items. This is happening on removal workloads where the AIL
> > is dominated by stale items that are removed from AIL when the
> > checkpoint that marks the items stale is committed to the journal.
> > This results in relatively few items in the AIL, but those that
> > remain are often pinned, as the directories the items are being
> > removed from are still being logged.
> > 
> > As a result, many push cycles through the CIL will first issue a
> > blocking log force to unpin the items. This can take some time to
> > complete, with tracing regularly showing push delays of half a
> > second and sometimes up into the range of several seconds. Sequences
> > like this aren't uncommon:
> > 
> > ....
> >  399.829437:  xfsaild: last lsn 0x11002dd000 count 101 stuck 101 flushing 0 tout 20
> > <wanted 20ms, got 270ms delay>
> >  400.099622:  xfsaild: target 0x11002f3600, prev 0x11002f3600, last lsn 0x0
> >  400.099623:  xfsaild: first lsn 0x11002f3600
> >  400.099679:  xfsaild: last lsn 0x1100305000 count 16 stuck 11 flushing 0 tout 50
> > <wanted 50ms, got 500ms delay>
> >  400.589348:  xfsaild: target 0x110032e600, prev 0x11002f3600, last lsn 0x0
> >  400.589349:  xfsaild: first lsn 0x1100305000
> >  400.589595:  xfsaild: last lsn 0x110032e600 count 156 stuck 101 flushing 30 tout 50
> > <wanted 50ms, got 460ms delay>
> >  400.950341:  xfsaild: target 0x1100353000, prev 0x110032e600, last lsn 0x0
> >  400.950343:  xfsaild: first lsn 0x1100317c00
> >  400.950436:  xfsaild: last lsn 0x110033d200 count 105 stuck 101 flushing 0 tout 20
> > <wanted 20ms, got 200ms delay>
> >  401.142333:  xfsaild: target 0x1100361600, prev 0x1100353000, last lsn 0x0
> >  401.142334:  xfsaild: first lsn 0x110032e600
> >  401.142535:  xfsaild: last lsn 0x1100353000 count 122 stuck 101 flushing 8 tout 10
> > <wanted 10ms, got 10ms delay>
> >  401.154323:  xfsaild: target 0x1100361600, prev 0x1100361600, last lsn 0x1100353000
> >  401.154328:  xfsaild: first lsn 0x1100353000
> >  401.154389:  xfsaild: last lsn 0x1100353000 count 101 stuck 101 flushing 0 tout 20
> > <wanted 20ms, got 300ms delay>
> >  401.451525:  xfsaild: target 0x1100361600, prev 0x1100361600, last lsn 0x0
> >  401.451526:  xfsaild: first lsn 0x1100353000
> >  401.451804:  xfsaild: last lsn 0x1100377200 count 170 stuck 22 flushing 122 tout 50
> > <wanted 50ms, got 500ms delay>
> >  401.933581:  xfsaild: target 0x1100361600, prev 0x1100361600, last lsn 0x0
> > ....
> > 
> > In each of these cases, every AIL pass saw 101 log items stuck on
> > the AIL (pinned) with very few other items being found. Each pass, a
> > log force was issued, and delay between last/first is the sleep time
> > + the sync log force time.
> > 
> > Some of these 101 items pinned the tail of the log. The tail of the
> > log does slowly creep forward (first lsn), but the problem is that
> > the log is actually out of reservation space because it has been
> > running so many transactions whose stale items never reach the
> > AIL but still consume log space. Hence we have a largely empty AIL, with
> > long term pins on items that pin the tail of the log that don't get
> > pushed frequently enough to keep log space available.
> > 
> > The problem is the hundreds of milliseconds that we block in the log
> > force pushing the CIL out to disk. The AIL should not be stalled
> > like this - it needs to run and flush items that are at the tail of
> > the log with minimal latency. What we really need to do is trigger a
> > log flush, but then not wait for it at all - we've already done our
> > waiting for stuff to complete when we backed off prior to the log
> > force being issued.
> > 
> > Even if we remove the XFS_LOG_SYNC from the xfs_log_force() call, we
> > still do a blocking flush of the CIL and that is what is causing the
> > issue. Hence we need a new interface for the CIL to trigger an
> > immediate background push of the CIL to get it moving faster but not
> > to wait on that to occur. While the CIL is pushing, the AIL can also
> > be pushing.
> > 
> > We already have an internal interface to do this -
> > xlog_cil_push_now() - but we need a wrapper for it to be used
> > externally. xlog_cil_force_seq() can easily be extended to do what
> > we need as it already implements the synchronous CIL push via
> > xlog_cil_push_now(). Add the necessary flags and "push current
> > sequence" semantics to xlog_cil_force_seq() and convert the AIL
> > pushing to use it.
> 
> Overnight testing indicates generic/530 hangs on small logs with
> this patch. It essentially runs out of log space because there are
> inode cluster buffers permanently pinned by this workload.
> 
> That is, as inodes are unlinked, they repeatedly relog the inode
> cluster buffer, and so while the CIL is flushing to unpin the
> buffer, a new unlink relogs it and adds a new pin to the buffer.
> Hence the log force that the AIL does to unpin pinned items doesn't
> actually unpin buffers that are being relogged.
> 
> These buffers only get unpinned when the log actually runs out of
> space because they pin the tail of the log. Then the modifications
> that are relogging the buffer stop because they fail to get a log
> reservation, and the tail of the log does not move forward until the
> AIL issues a log force that finally unpins the buffer and writes it
> back.
> 
> Essentially, relogged buffers cannot be flushed by the AIL until the
> log completely stalls.
> 
> The problem being tripped over here is that we no longer force the
> final iclogs in a CIL push to disk - we leave the iclog with the
> commit record in it in ACTIVE state, and by not waiting and forcing
> all the iclogs to disk, the buffer never gets unpinned because there
> isn't any more log pressure to force it out because everything is
> blocked on reservation space.
> 
> The solution to this is to have the CIL push change the state of the
> commit iclog to WANT_SYNC before it is released. This means the CIL
> push will always flush the iclog to disk and the checkpoint will
> complete and unpin the buffers.
> 
> Right now, we really only want to do this state switch for these
> async pushes - for small sync transactions and fsync we really want
> the iclog aggregation that we have now to optimise iclogbuf usage,
> so I'll have to pass a new flag through the push code and back into
> xlog_write(). That will make this patch behave the same as we
> currently do.
> 

Unfortunately I've not yet caught up to reviewing your most recently
posted set of log patches so I can easily be missing some context, but
when passing through some of the feedback/updates so far this has me
rather confused. We discussed this pre-existing CIL behavior in the
previous version, I suggested some similar potential behavior change
where we would opportunistically send off checkpoint iclogs for I/O a
bit earlier than normal and you argued [1] against it. Now it sounds
like not only are we implementing that, but it's actually necessary to
fix a log hang problem..? What am I missing?

The updated iclog behavior does sound more friendly to me than what we
currently do (obviously, based on my previous comments), but I am
slightly skeptical of how such a change fixes the root cause of a hang.
Is this a stall/perf issue or an actual log deadlock? If the latter,
what prevents this deadlock on current upstream?

Brian

[1] https://lore.kernel.org/linux-xfs/20210129222559.GT4662@dread.disaster.area/

> In the longer term, we need to change how the AIL deals with pinned
> buffers, as the current method is definitely sub-optimal. It also
> explains the "everything stops for a second" performance anomalies
> I've been seeing for a while in testing. But fixing that is outside
> the scope of this patchset, so in the mean time I'll fix this one up
> and repost it in a little while.
> 
> FWIW, this is also the likely cause of the "CIL workqueue too deep"
> hangs I was seeing with the next patch in the series, too.
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
>
Dave Chinner Feb. 25, 2021, 10:03 p.m. UTC | #3
On Thu, Feb 25, 2021 at 08:12:12AM -0500, Brian Foster wrote:
> On Thu, Feb 25, 2021 at 08:10:58AM +1100, Dave Chinner wrote:
> > On Tue, Feb 23, 2021 at 04:32:11PM +1100, Dave Chinner wrote:
> > The problem being tripped over here is that we no longer force the
> > final iclogs in a CIL push to disk - we leave the iclog with the
> > commit record in it in ACTIVE state, and by not waiting and forcing
> > all the iclogs to disk, the buffer never gets unpinned because there
> > isn't any more log pressure to force it out because everything is
> > blocked on reservation space.
> > 
> > The solution to this is to have the CIL push change the state of the
> > commit iclog to WANT_SYNC before it is released. This means the CIL
> > push will always flush the iclog to disk and the checkpoint will
> > complete and unpin the buffers.
> > 
> > Right now, we really only want to do this state switch for these
> > async pushes - for small sync transactions and fsync we really want
> > the iclog aggregation that we have now to optimise iclogbuf usage,
> > so I'll have to pass a new flag through the push code and back into
> > xlog_write(). That will make this patch behave the same as we
> > currently do.
> > 
> 
> Unfortunately I've not yet caught up to reviewing your most recently
> posted set of log patches so I can easily be missing some context, but
> when passing through some of the feedback/updates so far this has me
> rather confused. We discussed this pre-existing CIL behavior in the
> previous version, I suggested some similar potential behavior change
> where we would opportunistically send off checkpoint iclogs for I/O a
> bit earlier than normal and you argued [1] against it. Now it sounds
> like not only are we implementing that, but it's actually necessary to
> fix a log hang problem..? What am I missing?

You're talking about this comment?

| > That means there's an increased chance that
| > the next iclog in the ring may be active. Perhaps we could introduce
| > some logic to switch out the commit record iclog before the CIL push
| > returns in those particular cases.  For example, switch out if the
| > current checkpoint hit multiple iclogs..? Or hit multiple iclogs and the
| > next iclog in the ring is clean..?
|
| We could just call xlog_state_switch_iclogs() to mark it WANT_SYNC,
| but then we can't aggregate more changes into it and fill it up. If
| someone starts waiting on that iclog (i.e. a log force), then it
| immediately gets marked WANT_SYNC and submitted to disk when it is
| released. But if there is no-one waiting on it, then we largely
| don't care if an asynchronous checkpoint is committed immediately,
| at the start of the next checkpoint, or at worst, within 30s when
| the log worker next kicks a log force....

I'm arguing against the general case here of submitting iclog
buffers with checkpoints in them early. There is no reason to do that
because the xfs_log_force() code will kick the iclog to disk if
necessary. That still stands - I have not changed behaviour
unconditionally.

This patch only kicks the commit_iclog to disk if the cause of the
CIL push was an async call to xlog_cil_force_seq(). That can occur
from xfs_iunpin() and, with this patch, the AIL.

In both cases, all they care about is that the CIL checkpoint fully
completes in the near future. For xfs_iunpin(), the log force
doesn't guarantee the inode is actually unpinned, hence it waits on
the pin count and not the log force. So waiting for the CIL push is
unnecessary as long as the CIL push gets the entire checkpoint to
disk.

Similarly, the AIL only needs the log force to guarantee the CIL
commits to disk in the near future, but it does not need to block
waiting for that to occur.

So in these cases, we don't want to block while the CIL is pushing
just so the log force code can iterate the last iclog the CIL wrote
to flush it to disk. Instead, we pass a flag to the CIL to tell it
"you need to push the commit record to disk" so we don't have to
wait for it to then check if it pushed the commit record or not.
This is a special case, not the general case where we are running
lots of fsync or sync transactions.

In the general case, we are trying to aggregate as many sync
transactions/fsync log forces as possible into the current iclog to
batch them up effectively. We'll continue to batch them while there
is space in the current iclog and the previous iclog is in the
WANT_SYNC/SYNCING state indicating it is still being written to
disk.

This automated batching behaviour of the iclog state machine is one
of the reasons XFS has always had excellent performance under
sync transaction/fsync heavy workloads such as NFS/CIFS servers. It
is functionality we need to preserve, and this patch does not
actually change that behaviour.

> The updated iclog behavior does sound more friendly to me than what we
> currently do (obviously, based on my previous comments),

As per above, the conditional behaviour I've added here is actually
less friendly :)

> but I am
> slightly skeptical of how such a change fixes the root cause of a hang.
> Is this a stall/perf issue or an actual log deadlock? If the latter,
> what prevents this deadlock on current upstream?

The problem that I talk about is introduced by this patch and also
fixed by this patch. The problem does not exist until we have code
that depends on an async CIL push guaranteeing that items in the CIL
are unpinned. A stock xfs_log_force_seq(0) call guarantees that the
log reaches disk by waiting on the CIL push then pushing the iclogs.
All it does is elide waiting for the final iclogs to hit the disk.

By eliding the "wait for CIL push" in an async xfs_log_force_seq()
call, we also elide the "push iclogs" part of the log force and
that's what causes the AIL problem. We don't push the iclogs, so the
items committed to the last iclog don't get dispatched and the items
never get unpinned because the checkpoint never completes.

So this "AIL can hang because tail items aren't unpinned" condition
is exposed by async CIL forces. It is also fixed by this patch
passing the async push state to the CIL push and having it ensure
that the commit iclog is written out if the CIL push was triggered
by an async push.

Like all log forces, async pushes are rare, but we have to ensure
they behave correctly. There is no bug in the code before this
patch, and there is no bug in the code (that I know of) after this
patch. I've described a condition that we have to handle in this
patch that, if we don't, we end up with broken AIL pushing.

As to the "root cause of the hang" I now know what that is too, but
addressing it is outside the scope of this patchset. It's taken me a
long time to identify this, but I've been seeing symptoms of it for
quite a long time (i.e. years) but until doing this async CIL push
work I've never been able to identify the cause.

That is, the root cause is that the AIL cannot force a pinned buffer
to disk, and a log force (of any kind) does not guarantee a pinned
item is unpinned. That's because the CIL is pipelined, meaning that
an item can be pinned in one CIL context, and while that is being
pushed, the front end can relog that same item and pin it in the
current CIL context. Hence it has two pins, not one. Now when the
original log force completes, the pin from the first CIL context has
been removed, but the pin from the current CIL context is still
present.

Hence when we are relogging buffers over time (e.g. repeatedly
modifying a directory), the AIL always sees the buffer pinned and
cannot flush it. Even if the buffer has been moved to the delwri
list (i.e. had a successful push) we still can't write it out
because the delwri code skips pinned buffers.

This iterative pin continues preventing the AIL from flushing the
buffer until it pins the tail of the log. When the log runs out of
space and blocks the modifying/relogging task(s), the log force the
AIL issues allows the CIL to be pushed without the front end
relogging the item and re-pinning the item. At the completion of the
log force, the AIL can now push the unpinned buffer and hence move
the tail of the log forward which frees up log reservation space.
This then wakes the front end, which goes back to its relogging
game and the cycle repeats over.

This is really nasty behaviour, and it's only recently that I've got
a handle on it. I found it because my original "async CIL push" code
resulted in long stalls every time the log is filled and the tail is
pinned by a buffer that is being relogged in this manner....

I'm not sure how to fix this yet - the AIL needs to block the front
end relogging to allow the buffer to be unpinned. Essentially, we
need to hold the pinned items locked across a CIL push to guarantee
they are unpinned, but that's the complete opposite of what the AIL
currently does to prevent the front end from seeing long tail lock
latencies when modifying stuff....

Cheers,

Dave.
Brian Foster Feb. 27, 2021, 4:25 p.m. UTC | #4
On Fri, Feb 26, 2021 at 09:03:05AM +1100, Dave Chinner wrote:
> On Thu, Feb 25, 2021 at 08:12:12AM -0500, Brian Foster wrote:
> > On Thu, Feb 25, 2021 at 08:10:58AM +1100, Dave Chinner wrote:
> > > On Tue, Feb 23, 2021 at 04:32:11PM +1100, Dave Chinner wrote:
> > > The problem being tripped over here is that we no longer force the
> > > final iclogs in a CIL push to disk - we leave the iclog with the
> > > commit record in it in ACTIVE state, and by not waiting and forcing
> > > all the iclogs to disk, the buffer never gets unpinned because there
> > > isn't any more log pressure to force it out because everything is
> > > blocked on reservation space.
> > > 
> > > The solution to this is to have the CIL push change the state of the
> > > commit iclog to WANT_SYNC before it is released. This means the CIL
> > > push will always flush the iclog to disk and the checkpoint will
> > > complete and unpin the buffers.
> > > 
> > > Right now, we really only want to do this state switch for these
> > > async pushes - for small sync transactions and fsync we really want
> > > the iclog aggregation that we have now to optimise iclogbuf usage,
> > > so I'll have to pass a new flag through the push code and back into
> > > xlog_write(). That will make this patch behave the same as we
> > > currently do.
> > > 
> > 
> > Unfortunately I've not yet caught up to reviewing your most recently
> > posted set of log patches so I can easily be missing some context, but
> > when passing through some of the feedback/updates so far this has me
> > rather confused. We discussed this pre-existing CIL behavior in the
> > previous version, I suggested some similar potential behavior change
> > where we would opportunistically send off checkpoint iclogs for I/O a
> > bit earlier than normal and you argued [1] against it. Now it sounds
> > like not only are we implementing that, but it's actually necessary to
> > fix a log hang problem..? What am I missing?
> 
> You're talking about this comment?
> 

Yes.

> | > That means there's an increased chance that
> | > the next iclog in the ring may be active. Perhaps we could introduce
> | > some logic to switch out the commit record iclog before the CIL push
> | > returns in those particular cases.  For example, switch out if the
> | > current checkpoint hit multiple iclogs..? Or hit multiple iclogs and the
> | > next iclog in the ring is clean..?
> |
> | We could just call xlog_state_switch_iclogs() to mark it WANT_SYNC,
> | but then we can't aggregate more changes into it and fill it up. If
> | someone starts waiting on that iclog (i.e. a log force), then it
> | immediately gets marked WANT_SYNC and submitted to disk when it is
> | released. But if there is no-one waiting on it, then we largely
> | don't care if an asynchronous checkpoint is committed immediately,
> | at the start of the next checkpoint, or at worst, within 30s when
> | the log worker next kicks a log force....
> 
> I'm arguing against the general case here of submitting iclog
> buffers with checkpoints in them early. There is no reason to do that
> because the xfs_log_force() code will kick the iclog to disk if
> necessary. That still stands - I have not changed behaviour
> unconditionally.
> 

But the suggestion wasn't to change behavior unconditionally. That seems
fairly clear to me even from just the quoted snippet above..?

> This patch only kicks the commit_iclog to disk if the cause of the
> CIL push was an async call to xlog_cil_force_seq(). That can occur
> from xfs_iunpin() and, with this patch, the AIL.
> 
> In both cases, all they care about is that the CIL checkpoint fully
> completes in the near future. For xfs_iunpin(), the log force
> doesn't guarantee the inode is actually unpinned, hence it waits on
> the pin count and not the log force. So waiting for the CIL push is
> unnecessary as long as the CIL push gets the entire checkpoint to
> disk.
> 
> Similarly, the AIL only needs the log force to guarantee the CIL
> commits to disk in the near future, but it does not need to block
> waiting for that to occur.
> 
> So in these cases, we don't want to block while the CIL is pushing
> just so the log force code can iterate the last iclog the CIL wrote
> to flush it to disk. Instead, we pass a flag to the CIL to tell it
> "you need to push the commit record to disk" so we don't have to
> wait for it to then check if it pushed the commit record or not.
> This is a special case, not the general case where we are running
> lots of fsync or sync transactions.
> 
> In the general case, we are trying to aggregate as many sync
> transactions/fsync log forces as possible into the current iclog to
> batch them up effectively. We'll continue to batch them while there
> is space in the current iclog and the previous iclog is in the
> WANT_SYNC/SYNCING state indicating it is still being written to
> disk.
> 
> This automated batching behaviour of the iclog state machine is one
> of the reasons XFS has always had excellent performance under
> sync transaction/fsync heavy workloads such as NFS/CIFS servers. It
> is functionality we need to preserve, and this patch does not
> actually change that behaviour.
> 
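The batching rule Dave describes above can be sketched as a toy model. This is illustrative Python, not kernel code; the state names mirror the iclog states mentioned in the thread, but the function and its parameters are hypothetical:

```python
# Toy model of the iclog batching heuristic described above: keep
# aggregating sync transaction/fsync commits into the current iclog
# while it still has space and the previous iclog is still being
# written to disk (WANT_SYNC/SYNCING).
ACTIVE, WANT_SYNC, SYNCING = "active", "want_sync", "syncing"

def can_batch_into_current(space_left, record_size, prev_iclog_state):
    """Return True if a new commit should be aggregated into the
    current iclog rather than switching it out early."""
    return (space_left >= record_size
            and prev_iclog_state in (WANT_SYNC, SYNCING))
```

Under this model, batching stops either when the current iclog fills up or when the previous iclog has finished syncing, which matches the aggregation behaviour described for fsync-heavy workloads.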

Sure, this all makes sense.

> > The updated iclog behavior does sound more friendly to me than what we
> > currently do (obviously, based on my previous comments),
> 
> As per above, the conditional behaviour I've added here is actually
> less friendly :)
> 

Heh. :P

> > but I am
> > slightly skeptical of how such a change fixes the root cause of a hang.
> > Is this a stall/perf issue or an actual log deadlock? If the latter,
> > what prevents this deadlock on current upstream?
> 
> The problem that I talk about is introduced by this patch and also
> fixed by this patch. The problem does not exist until we have code
> that depends on an async CIL push guaranteeing that items in the CIL
> are unpinned. A stock xfs_log_force_seq(0) call guarantees that the
> log reaches disk by waiting on the CIL push then pushing the iclogs.
> All it does is elide waiting for the final iclogs to hit the disk.
> 
> By eliding the "wait for CIL push" in an async xfs_log_force_seq()
> call, we also elide the "push iclogs" part of the log force and
> that's what causes the AIL problem. We don't push the iclogs, so the
> items committed to the last iclog don't get dispatched and the items
> never get unpinned because the checkpoint never completes.
> 

So in general it sounds like this patch broke async log force by
skipping the current iclog switch somehow or another and then this
change is incorporated to fix it (hence there is no outstanding bug
upstream). That makes a bit more sense. I'll get the detailed context
from the patches.

> So this "AIL can hang because tail items aren't unpinned" condition
> is exposed by async CIL forces. It is also fixed by this patch
> passing the async push state to the CIL push and having it ensure
> that the commit iclog is written out if the CIL push was triggered
> by an async push.
> 
> Like all log forces, async pushes are rare, but we have to ensure
> they behave correctly. There is no bug in the code before this
> patch, and there is no bug in the code (that I know of) after this
> patch. I've described a condition that we have to handle in this
> patch that, if we don't, we end up with broken AIL pushing.
> 

Ok.

> As to the "root cause of the hang" I now know what that is too, but
> addressing it is outside the scope of this patchset. It's taken me a
> long time to identify this, but I've been seeing symptoms of it for
> quite a long time (i.e. years) but until doing this async CIL push
> work I've never been able to identify the cause.
> 
> That is, the root cause is that the AIL cannot force a pinned buffer
> to disk, and a log force (of any kind) does not guarantee a pinned
> item is unpinned. That's because the CIL is pipelined, meaning that
> an item can be pinned in one CIL context, and while that is being
> pushed, the front end can relog that same item and pin it in the
> current CIL context. Hence it has two pins, not one. Now when the
> original log force completes, the pin from the first CIL context has
> been removed, but the pin from the current CIL context is still
> present.
> 

Somewhat related to the above point, ISTM our current behavior can lead
to scenarios where the potential window for elevated pin counts is
somewhat artificially widened. IOW, the log vector for a pincount == 1
item might have been written out to disk, but the log item itself not
unpinned for some time later because the commit record is still sitting
in an active iclog. The unpin processing of the item doesn't occur until
that commit record iclog is synced out and I/O completes, which leaves
plenty of time for another pin to come along in the meantime (and the
scenario potentially repeats, starving out pincount == 0 state). So ISTM
the "damage" is done to some degree long before the AIL has any
opportunity to fix it, particularly if the only recourse it has is a log
force.

(And of course there's a huge tradeoff between current behavior and
something that unconditionally synced out commit record iclogs,
particularly depending on CIL checkpoint sizes, so that's a broad
simplification and not a valid solution.)
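The pin overlap described above can be modeled with a toy sequence. This is illustrative Python, not kernel code: each CIL context a log item is relogged into takes its own pin, and each pin is only dropped when that context's commit record I/O completes:

```python
# Toy model of the pin-count overlap: an item relogged into a second
# CIL context before the first context's commit record I/O completes
# never reaches pin_count == 0 in between.
class LogItem:
    def __init__(self):
        self.pin_count = 0
    def pin(self):
        self.pin_count += 1
    def unpin(self):
        self.pin_count -= 1

item = LogItem()
history = []

item.pin()                      # pinned by CIL context N
item.pin()                      # relogged: pinned again by context N+1
history.append(item.pin_count)  # the AIL sees the item doubly pinned
item.unpin()                    # context N commit record I/O completes
history.append(item.pin_count)  # still pinned by context N+1
```

The point of the model is that the original log force only removes the first pin; the second pin persists, so the AIL still cannot flush the item.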

> Hence when we are relogging buffers over time (e.g. repeatedly
> modifying a directory), the AIL always sees the buffer pinned and
> cannot flush it. Even if the buffer has been moved to the delwri
> list (i.e. had a successful push) we still can't write it out
> because the delwri code skips pinned buffers.
> 
> This iterative pin continues preventing the AIL from flushing the
> buffer until it pins the tail of the log. When the log runs out of
> space and blocks the modifying/relogging task(s), the log force the
> AIL issues allows the CIL to be pushed without the front end
> relogging the item and re-pinning the item. At the completion of the
> log force, the AIL can now push the unpinned buffer and hence move
> the tail of the log forward which frees up log reservation space.
> This then wakes the front end, which goes back to its relogging
> game and the cycle repeats over.
> 

The description of the nested pin behavior makes sense. I've observed
the same thing recently when looking at some unrelated shutdown
problems.

In any event, if I'm following correctly the deadlock issue was a
transient problem with this patch and not an upstream problem. However
there is still a pipeline stall/performance issue that manifests as
described above, and that preexists the proposed changes. Yes?

> This is really nasty behaviour, and it's only recently that I've got
> a handle on it. I found it because my original "async CIL push" code
> resulted in long stalls every time the log is filled and the tail is
> pinned by a buffer that is being relogged in this manner....
> 
> I'm not sure how to fix this yet - the AIL needs to block the front
> end relogging to allow the buffer to be unpinned. Essentially, we
> need to hold the pinned items locked across a CIL push to guarantee
> they are unpinned, but that's the complete opposite of what the AIL
> currently does to prevent the front end from seeing long tail lock
> latencies when modifying stuff....
> 

When this stall problem manifests, I'm assuming it's exacerbated by
delayed logging and the commit record behavior I described above. If
that's the case, could the AIL communicate writeback pressure through
affected log items such that checkpoints in which they are resident are
flushed out completely/immediately when the checkpoints occur? I suppose
that would require a log item flag or some such, which does raise a
concern of unnecessarily tagging many items (it's not clear to me how
likely that really is), but I'm curious if that would be an effective
POC at least..

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
>
Dave Chinner March 1, 2021, 4:54 a.m. UTC | #5
On Sat, Feb 27, 2021 at 11:25:06AM -0500, Brian Foster wrote:
> On Fri, Feb 26, 2021 at 09:03:05AM +1100, Dave Chinner wrote:
> > On Thu, Feb 25, 2021 at 08:12:12AM -0500, Brian Foster wrote:
> > > On Thu, Feb 25, 2021 at 08:10:58AM +1100, Dave Chinner wrote:
> > > > On Tue, Feb 23, 2021 at 04:32:11PM +1100, Dave Chinner wrote:
> > The problem that I talk about is introduced by this patch and also
> > fixed by this patch. The problem does not exist until we have code
> > that depends on an async CIL push guaranteeing that items in the CIL
> > are unpinned. A stock xfs_log_force_seq(0) call guarantees that the
> > log reaches disk by waiting on the CIL push then pushing the iclogs.
> > All it does is elide waiting for the final iclogs to hit the disk.
> > 
> > By eliding the "wait for CIL push" in an async xfs_log_force_seq()
> > call, we also elide the "push iclogs" part of the log force and
> > that's what causes the AIL problem. We don't push the iclogs, so the
> > items committed to the last iclog don't get dispatched and the items
> > never get unpinned because the checkpoint never completes.
> > 
> 
> So in general it sounds like this patch broke async log force by
> skipping the current iclog switch somehow or another and then this
> change is incorporated to fix it (hence there is no outstanding bug
> upstream). That makes a bit more sense. I'll get the detailed context
> from the patches.

Yes.

> > As to the "root cause of the hang" I now know what that is too, but
> > addressing it is outside the scope of this patchset. It's taken me a
> > long time to identify this, but I've been seeing symptoms of it for
> > quite a long time (i.e. years) but until doing this async CIL push
> > work I've never been able to identify the cause.
> > 
> > That is, the root cause is that the AIL cannot force a pinned buffer
> > to disk, and a log force (of any kind) does not guarantee a pinned
> > item is unpinned. That's because the CIL is pipelined, meaning that
> > an item can be pinned in one CIL context, and while that is being
> > pushed, the front end can relog that same item and pin it in the
> > current CIL context. Hence it has two pins, not one. Now when the
> > original log force completes, the pin from the first CIL context has
> > been removed, but the pin from the current CIL context is still
> > present.
> > 
> 
> Somewhat related to the above point, ISTM our current behavior can lead
> to scenarios where the potential window for elevated pin counts is
> somewhat artificially widened. IOW, the log vector for a pincount == 1
> item might have been written out to disk, but the log item itself not
> unpinned for some time later because the commit record is still sitting
> in an active iclog. The unpin processing of the item doesn't occur until
> that commit record iclog is synced out and I/O completes, which leaves
> plenty of time for another pin to come along in the meantime (and the
> scenario potentially repeats, starving out pincount == 0 state). So ISTM
> the "damage" is done to some degree long before the AIL has any
> opportunity to fix it, particularly if the only recourse it has is a log
> force.
> 
> (And of course there's a huge tradeoff between current behavior and
> something that unconditionally synced out commit record iclogs,
> particularly depending on CIL checkpoint sizes, so that's a broad
> simplification and not a valid solution.)

Yes, this can be an issue because when the single CIL work is CPU
bound, it is the first iclog write of the checkpoint that pushes the
previous commit record to disk. So, like you say, there's a natural
pin overlap in this situation.

> > Hence when we are relogging buffers over time (e.g. repeatedly
> > modifying a directory), the AIL always sees the buffer pinned and
> > cannot flush it. Even if the buffer has been moved to the delwri
> > list (i.e. had a successful push) we still can't write it out
> > because the delwri code skips pinned buffers.
> > 
> > This iterative pin continues preventing the AIL from flushing the
> > buffer until it pins the tail of the log. When the log runs out of
> > space and blocks the modifying/relogging task(s), the log force the
> > AIL issues allows the CIL to be pushed without the front end
> > relogging the item and re-pinning the item. At the completion of the
> > log force, the AIL can now push the unpinned buffer and hence move
> > the tail of the log forward which frees up log reservation space.
> > This then wakes the front end, which goes back to its relogging
> > game and the cycle repeats over.
> > 
> 
> The description of the nested pin behavior makes sense. I've observed
> the same thing recently when looking at some unrelated shutdown
> problems.
> 
> In any event, if I'm following correctly the deadlock issue was a
> transient problem with this patch and not an upstream problem. However
> there is still a pipeline stall/performance issue that manifests as
> described above, and that preexists the proposed changes. Yes?

Yes, AFAICT, this pin problem dates back to day zero. Before delayed
logging it required the inode or buffer to be relogged in almost
every iclog that was written (think a long running truncate removing
hundreds of thousands of extents). But before delayed logging, this
was very slow and limited by iclog IO so having the AIL stall would
have largely been unnoticed.

> > This is really nasty behaviour, and it's only recently that I've got
> > a handle on it. I found it because my original "async CIL push" code
> > resulted in long stalls every time the log is filled and the tail is
> > pinned by a buffer that is being relogged in this manner....
> > 
> > I'm not sure how to fix this yet - the AIL needs to block the front
> > end relogging to allow the buffer to be unpinned. Essentially, we
> > need to hold the pinned items locked across a CIL push to guarantee
> > they are unpinned, but that's the complete opposite of what the AIL
> > currently does to prevent the front end from seeing long tail lock
> > latencies when modifying stuff....
> 
> When this stall problem manifests, I'm assuming it's exacerbated by
> delayed logging and the commit record behavior I described above. If
> that's the case, could the AIL communicate writeback pressure through
> affected log items such that checkpoints in which they are resident are
> flushed out completely/immediately when the checkpoints occur? I suppose
> that would require a log item flag or some such, which does raise a
> concern of unnecessarily tagging many items (it's not clear to me how
> likely that really is), but I'm curious if that would be an effective
> POC at least..

I don't think we need to do anything like that. All we need to do to
ensure that the AIL can flush a pinned buffer is to lock it, kick
the log and wait for the pin count to go to zero. Then we can write
it just fine, blocking only the front end transactions that need
that buffer lock.  Same goes for inodes, though xfs_iunpin_wait()
already does this....

The only issue is determining if this causes unacceptable long tail
latencies on buffers like the AG headers....
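The "lock it, kick the log, wait for the pin count to drop" approach can be sketched as a toy sequential model. This is illustrative Python, not kernel code, and every name here is hypothetical rather than a real XFS API:

```python
# Toy model of flushing a pinned buffer from the AIL: lock the buffer
# so the front end cannot relog it, kick the log asynchronously so the
# checkpoint pinning it completes, wait for pin_count to reach zero,
# then write the buffer out.
events = []

def force_log():
    events.append("log force")

def wait_unpin(buf):
    buf["pin_count"] = 0        # stand-in for checkpoint I/O completing

def flush_pinned_buffer(buf):
    buf["locked"] = True        # blocks front-end relogging
    if buf["pin_count"] > 0:
        force_log()
        wait_unpin(buf)
    assert buf["pin_count"] == 0
    buf["written"] = True       # safe to write once unpinned
    buf["locked"] = False

buf = {"pin_count": 2, "locked": False, "written": False}
flush_pinned_buffer(buf)
```

The model also shows where the latency concern comes from: the buffer lock is held for the whole unpin wait, so only transactions needing that buffer lock stall, but they stall for the full duration of the force.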

Cheers,

Dave.
Brian Foster March 1, 2021, 1:32 p.m. UTC | #6
On Mon, Mar 01, 2021 at 03:54:22PM +1100, Dave Chinner wrote:
> On Sat, Feb 27, 2021 at 11:25:06AM -0500, Brian Foster wrote:
> > On Fri, Feb 26, 2021 at 09:03:05AM +1100, Dave Chinner wrote:
> > > On Thu, Feb 25, 2021 at 08:12:12AM -0500, Brian Foster wrote:
> > > > On Thu, Feb 25, 2021 at 08:10:58AM +1100, Dave Chinner wrote:
> > > > > On Tue, Feb 23, 2021 at 04:32:11PM +1100, Dave Chinner wrote:
> > > The problem that I talk about is introduced by this patch and also
> > > fixed by this patch. The problem does not exist until we have code
> > > that depends on an async CIL push guaranteeing that items in the CIL
> > > are unpinned. A stock xfs_log_force_seq(0) call guarantees that the
> > > log reaches disk by waiting on the CIL push then pushing the iclogs.
> > > All it does is elide waiting for the final iclogs to hit the disk.
> > > 
> > > By eliding the "wait for CIL push" in an async xfs_log_force_seq()
> > > call, we also elide the "push iclogs" part of the log force and
> > > that's what causes the AIL problem. We don't push the iclogs, so the
> > > items committed to the last iclog don't get dispatched and the items
> > > never get unpinned because the checkpoint never completes.
> > > 
> > 
> > So in general it sounds like this patch broke async log force by
> > skipping the current iclog switch somehow or another and then this
> > change is incorporated to fix it (hence there is no outstanding bug
> > upstream). That makes a bit more sense. I'll get the detailed context
> > from the patches.
> 
> Yes.
> 
> > > As to the "root cause of the hang" I now know what that is too, but
> > > addressing it is outside the scope of this patchset. It's taken me a
> > > long time to identify this, but I've been seeing symptoms of it for
> > > quite a long time (i.e. years) but until doing this async CIL push
> > > work I've never been able to identify the cause.
> > > 
> > > That is, the root cause is that the AIL cannot force a pinned buffer
> > > to disk, and a log force (of any kind) does not guarantee a pinned
> > > item is unpinned. That's because the CIL is pipelined, meaning that
> > > an item can be pinned in one CIL context, and while that is being
> > > pushed, the front end can relog that same item and pin it in the
> > > current CIL context. Hence it has two pins, not one. Now when the
> > > original log force completes, the pin from the first CIL context has
> > > been removed, but the pin from the current CIL context is still
> > > present.
> > > 
> > 
> > Somewhat related to the above point, ISTM our current behavior can lead
> > to scenarios where the potential window for elevated pin counts is
> > somewhat artificially widened. IOW, the log vector for a pincount == 1
> > item might have been written out to disk, but the log item itself not
> > unpinned for some time later because the commit record is still sitting
> > in an active iclog. The unpin processing of the item doesn't occur until
> > that commit record iclog is synced out and I/O completes, which leaves
> > plenty of time for another pin to come along in the meantime (and the
> > scenario potentially repeats, starving out pincount == 0 state). So ISTM
> > the "damage" is done to some degree long before the AIL has any
> > opportunity to fix it, particularly if the only recourse it has is a log
> > force.
> > 
> > (And of course there's a huge tradeoff between current behavior and
> > something that unconditionally synced out commit record iclogs,
> > particularly depending on CIL checkpoint sizes, so that's a broad
> > simplification and not a valid solution.)
> 
> Yes, this can be an issue because when the single CIL work is CPU
> bound, it is the first iclog write of the checkpoint that pushes the
> previous commit record to disk. So, like you say, there's a natural
> pin overlap in this situation.
> 
> > > Hence when we are relogging buffers over time (e.g. repeatedly
> > > modifying a directory), the AIL always sees the buffer pinned and
> > > cannot flush it. Even if the buffer has been moved to the delwri
> > > list (i.e. had a successful push) we still can't write it out
> > > because the delwri code skips pinned buffers.
> > > 
> > > This iterative pin continues preventing the AIL from flushing the
> > > buffer until it pins the tail of the log. When the log runs out of
> > > space and blocks the modifying/relogging task(s), the log force the
> > > AIL issues allows the CIL to be pushed without the front end
> > > relogging the item and re-pinning the item. At the completion of the
> > > log force, the AIL can now push the unpinned buffer and hence move
> > > the tail of the log forward which frees up log reservation space.
> > > This then wakes the front end, which goes back to its relogging
> > > game and the cycle repeats over.
> > > 
> > 
> > The description of the nested pin behavior makes sense. I've observed
> > the same thing recently when looking at some unrelated shutdown
> > problems.
> > 
> > In any event, if I'm following correctly the deadlock issue was a
> > transient problem with this patch and not an upstream problem. However
> > there is still a pipeline stall/performance issue that manifests as
> > described above, and that preexists the proposed changes. Yes?
> 
> Yes, AFAICT, this pin problem dates back to day zero. Before delayed
> logging it required the inode or buffer to be relogged in almost
> every iclog that was written (think a long running truncate removing
> hundreds of thousands of extents). But before delayed logging, this
> was very slow and limited by iclog IO so having the AIL stall would
> have largely been unnoticed.
> 
> > > This is really nasty behaviour, and it's only recently that I've got
> > > a handle on it. I found it because my original "async CIL push" code
> > > resulted in long stalls every time the log is filled and the tail is
> > > pinned by a buffer that is being relogged in this manner....
> > > 
> > > I'm not sure how to fix this yet - the AIL needs to block the front
> > > end relogging to allow the buffer to be unpinned. Essentially, we
> > > need to hold the pinned items locked across a CIL push to guarantee
> > > they are unpinned, but that's the complete opposite of what the AIL
> > > currently does to prevent the front end from seeing long tail lock
> > > latencies when modifying stuff....
> > 
> > When this stall problem manifests, I'm assuming it's exacerbated by
> > delayed logging and the commit record behavior I described above. If
> > that's the case, could the AIL communicate writeback pressure through
> > affected log items such that checkpoints in which they are resident are
> > flushed out completely/immediately when the checkpoints occur? I suppose
> > that would require a log item flag or some such, which does raise a
> > concern of unnecessarily tagging many items (it's not clear to me how
> > likely that really is), but I'm curious if that would be an effective
> > POC at least..
> 
> I don't think we need to do anything like that. All we need to do to
> ensure that the AIL can flush a pinned buffer is to lock it, kick
> the log and wait for the pin count to go to zero. Then we can write
> it just fine, blocking only the front end transactions that need
> that buffer lock.  Same goes for inodes, though xfs_iunpin_wait()
> already does this....
> 

Yeah, but why would we want to block xfsaild on a single item like that?
Wouldn't holding the item locked like that just create a new stall point
within xfsaild? Maybe I'm missing something, but what you describe here
basically just sounds like a "lock the item and run a sync log force"
pattern.

I agree that the implementation I handwave about above seems overkill,
but I'm really curious what the disadvantage of some form of logic in
the logging code to do something like "if this checkpoint contains items
that are AIL resident && under some measure of writeback pressure && the
next iclog is active, want_sync the commit record iclog on background
push to reduce item pin count times" might be.

Brian

> The only issue is determining if this causes unacceptable long tail
> latencies on buffers like the AG headers....
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
>
Dave Chinner March 3, 2021, 1:23 a.m. UTC | #7
On Mon, Mar 01, 2021 at 08:32:18AM -0500, Brian Foster wrote:
> On Mon, Mar 01, 2021 at 03:54:22PM +1100, Dave Chinner wrote:
> > On Sat, Feb 27, 2021 at 11:25:06AM -0500, Brian Foster wrote:
> > > On Fri, Feb 26, 2021 at 09:03:05AM +1100, Dave Chinner wrote:
> > > > This is really nasty behaviour, and it's only recently that I've got
> > > > a handle on it. I found it because my original "async CIL push" code
> > > > resulted in long stalls every time the log is filled and the tail is
> > > > pinned by a buffer that is being relogged in this manner....
> > > > 
> > > > I'm not sure how to fix this yet - the AIL needs to block the front
> > > > end relogging to allow the buffer to be unpinned. Essentially, we
> > > > need to hold the pinned items locked across a CIL push to guarantee
> > > > they are unpinned, but that's the complete opposite of what the AIL
> > > > currently does to prevent the front end from seeing long tail lock
> > > > latencies when modifying stuff....
> > > 
> > > When this stall problem manifests, I'm assuming it's exacerbated by
> > > delayed logging and the commit record behavior I described above. If
> > > that's the case, could the AIL communicate writeback pressure through
> > > affected log items such that checkpoints in which they are resident are
> > > flushed out completely/immediately when the checkpoints occur? I suppose
> > > that would require a log item flag or some such, which does raise a
> > > concern of unnecessarily tagging many items (it's not clear to me how
> > > likely that really is), but I'm curious if that would be an effective
> > > POC at least..
> > 
> > I don't think we need to do anything like that. All we need to do to
> > ensure that the AIL can flush a pinned buffer is to lock it, kick
> > the log and wait for the pin count to go to zero. Then we can write
> > it just fine, blocking only the front end transactions that need
> > that buffer lock.  Same goes for inodes, though xfs_iunpin_wait()
> > already does this....
> > 
> 
> Yeah, but why would we want to block xfsaild on a single item like that?

Who said anything about blocking the AIL on a single item? :)

> Wouldn't holding the item locked like that just create a new stall point
> within xfsaild? Maybe I'm missing something, but what you describe here
> basically just sounds like a "lock the item and run a sync log force"
> pattern.

What I was thinking is that if the item is pinned and at the
tail of the log, then we leave it locked when we flush it rather
than unlocking it and relocking it when the delwri submit code gets
to it. If it gets unpinned before the delwri code gets to it, all
good. If not, the delwri code being unable to flush it will feed
back up into the AIL to trigger a log force, which will unpin it
in the near future and it will be written on the next AIL delwri
submit cycle.

The thing we need to be careful of here is minimising the lock hold
time by the AIL while we unpin the item. That's the "long tail
latencies" problem I mention above. Essentially, we need to try to
avoid holding the item locked for a long period of time before
issuing the log force and/or resubmitting it for IO once it is
unlocked. I have a few ideas on this, but no patches yet.
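The hold-locked-across-cycles flow described above can be modeled as a toy loop. This is illustrative Python, not kernel code, and the names are hypothetical: the AIL leaves the pinned tail buffer locked, the delwri submit skips it while pinned and feeds back a log force, and a later submit cycle writes it once unpinned:

```python
# Toy model of the delwri feedback loop: a pinned buffer stays locked
# and is skipped by delwri submit; the skip feeds back a log force that
# unpins it, and the next submit cycle writes it out.
def delwri_cycles(buf, max_cycles=5):
    log_forces = 0
    for cycle in range(1, max_cycles + 1):
        if buf["pin_count"] > 0:
            log_forces += 1         # feedback: AIL triggers a log force
            buf["pin_count"] = 0    # stand-in for the force unpinning it
            continue                # skip the pinned buffer this cycle
        buf["written"] = True       # unpinned: write while still locked
        buf["locked"] = False       # release only after submission
        return cycle, log_forces
    return None, log_forces

buf = {"pin_count": 1, "locked": True, "written": False}
cycle, forces = delwri_cycles(buf)
```

In the model the buffer is written on the second cycle after exactly one feedback log force, which is the "written on the next AIL delwri submit cycle" behaviour the paragraph describes.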

Cheers,

Dave.
Brian Foster March 3, 2021, 5:20 p.m. UTC | #8
On Wed, Mar 03, 2021 at 12:23:04PM +1100, Dave Chinner wrote:
> On Mon, Mar 01, 2021 at 08:32:18AM -0500, Brian Foster wrote:
> > On Mon, Mar 01, 2021 at 03:54:22PM +1100, Dave Chinner wrote:
> > > On Sat, Feb 27, 2021 at 11:25:06AM -0500, Brian Foster wrote:
> > > > On Fri, Feb 26, 2021 at 09:03:05AM +1100, Dave Chinner wrote:
> > > > > This is really nasty behaviour, and it's only recently that I've got
> > > > > a handle on it. I found it because my original "async CIL push" code
> > > > > resulted in long stalls every time the log is filled and the tail is
> > > > > pinned by a buffer that is being relogged in this manner....
> > > > > 
> > > > > I'm not sure how to fix this yet - the AIL needs to block the front
> > > > > end relogging to allow the buffer to be unpinned. Essentially, we
> > > > > need to hold the pinned items locked across a CIL push to guarantee
> > > > > they are unpinned, but that's the complete opposite of what the AIL
> > > > > currently does to prevent the front end from seeing long tail lock
> > > > > latencies when modifying stuff....
> > > > 
> > > > When this stall problem manifests, I'm assuming it's exacerbated by
> > > > delayed logging and the commit record behavior I described above. If
> > > > that's the case, could the AIL communicate writeback pressure through
> > > > affected log items such that checkpoints in which they are resident are
> > > > flushed out completely/immediately when the checkpoints occur? I suppose
> > > > that would require a log item flag or some such, which does raise a
> > > > concern of unnecessarily tagging many items (it's not clear to me how
> > > > likely that really is), but I'm curious if that would be an effective
> > > > POC at least..
> > > 
> > > I don't think we need to do anything like that. All we need to do to
> > > ensure that the AIL can flush a pinned buffer is to lock it, kick
> > > the log and wait for the pin count to go to zero. Then we can write
> > > it just fine, blocking only the front end transactions that need
> > > that buffer lock.  Same goes for inodes, though xfs_iunpin_wait()
> > > already does this....
> > > 
> > 
> > Yeah, but why would we want to block xfsaild on a single item like that?
> 
> Who said anything about blocking the AIL on a single item? :)
> 
> > Wouldn't holding the item locked like that just create a new stall point
> > within xfsaild? Maybe I'm missing something, but what you describe here
> > basically just sounds like a "lock the item and run a sync log force"
> > pattern.
> 
> What I was thinking is that if the item is pinned and at the
> tail of the log, then we leave it locked when we flush it rather
> than unlocking it and relocking it when the delwri submit code gets
> to it. If it gets unpinned before the delwri code gets to it, all
> good. If not, the delwri code being unable to flush it will feed
> back up into the AIL to trigger a log force, which will unpin it
> in the near future and it will be written on the next AIL delwri
> submit cycle.
> 

I'm not sure what you mean by leaving the item locked when we flush it
if it is pinned, since we don't flush pinned items. Perhaps implied is
that this would trylock the buffer first, then only worry about if it's
pinned if we acquire the lock. If so (and the pin is at the tail), hold
the lock and kick the log as a means to ensure that xfsaild is
guaranteed next access to the buffer once unpinned. Hm?

IIUC, that seems interesting. Though as noted in the flush optimization
series, I am a little leery about issuing (even async) log forces with
locks held, at least with the current implementation..

Brian

> The thing we need to be careful of here is minimising the lock hold
> time by the AIL while we unpin the item. That's the "long tail
> latencies" problem I mention above. Essentially, we need to try to
> avoid holding the item locked for a long period of time before
> issuing the log force and/or resubmitting it for IO once it is
> unlocked. I have a few ideas on this, but no patches yet.
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
>
Dave Chinner March 4, 2021, 2:01 a.m. UTC | #9
On Wed, Mar 03, 2021 at 12:20:19PM -0500, Brian Foster wrote:
> On Wed, Mar 03, 2021 at 12:23:04PM +1100, Dave Chinner wrote:
> > On Mon, Mar 01, 2021 at 08:32:18AM -0500, Brian Foster wrote:
> > > On Mon, Mar 01, 2021 at 03:54:22PM +1100, Dave Chinner wrote:
> > > > On Sat, Feb 27, 2021 at 11:25:06AM -0500, Brian Foster wrote:
> > > > > On Fri, Feb 26, 2021 at 09:03:05AM +1100, Dave Chinner wrote:
> > > > > > This is really nasty behaviour, and it's only recently that I've got
> > > > > > a handle on it. I found it because my original "async CIL push" code
> > > > > > resulted in long stalls every time the log is filled and the tail is
> > > > > > pinned by a buffer that is being relogged in this manner....
> > > > > > 
> > > > > > I'm not sure how to fix this yet - the AIL needs to block the front
> > > > > > end relogging to allow the buffer to be unpinned. Essentially, we
> > > > > > need to hold the pinned items locked across a CIL push to guarantee
> > > > > > they are unpinned, but that's the complete opposite of what the AIL
> > > > > > currently does to prevent the front end from seeing long tail lock
> > > > > > latencies when modifying stuff....
> > > > > 
> > > > > When this stall problem manifests, I'm assuming it's exacerbated by
> > > > > delayed logging and the commit record behavior I described above. If
> > > > > that's the case, could the AIL communicate writeback pressure through
> > > > > affected log items such that checkpoints in which they are resident are
> > > > > flushed out completely/immediately when the checkpoints occur? I suppose
> > > > > that would require a log item flag or some such, which does raise a
> > > > > concern of unnecessarily tagging many items (it's not clear to me how
> > > > > likely that really is), but I'm curious if that would be an effective
> > > > > POC at least..
> > > > 
> > > > I don't think we need to do anything like that. All we need to do to
> > > > ensure that the AIL can flush a pinned buffer is to lock it, kick
> > > > the log and wait for the pin count to go to zero. Then we can write
> > > > it just fine, blocking only the front end transactions that need
> > > > that buffer lock.  Same goes for inodes, though xfs_iunpin_wait()
> > > > already does this....
> > > > 
> > > 
> > > Yeah, but why would we want to block xfsaild on a single item like that?
> > 
> > Who said anything about blocking the AIL on a single item? :)
> > 
> > > Wouldn't holding the item locked like that just create a new stall point
> > > within xfsaild? Maybe I'm missing something, but what you describe here
> > > basically just sounds like a "lock the item and run a sync log force"
> > > pattern.
> > 
> > What I was thinking is that if the item is pinned and at the
> > tail of the log, then we leave it locked when we flush it rather
> > than unlocking it and relocking it when the delwri submit code gets
> > to it. If it gets unpinned before the delwri code gets to it, all
> > good. If not, the delwri code being unable to flush it will feed
> > back up into the AIL to trigger a log force, which will unpin it
> > in the near future and it will be written on the next AIL delwri
> > submit cycle.
> > 
> 
> I'm not sure what you mean by leaving the item locked when we flush it
> if it is pinned, since we don't flush pinned items.

That's exactly what I'm talking about changing.

-Dave.

Patch

diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
index 84cd9b6c6d1f..a0ecaf4f9b77 100644
--- a/fs/xfs/xfs_log.c
+++ b/fs/xfs/xfs_log.c
@@ -3335,6 +3335,20 @@  xfs_log_force(
 	return -EIO;
 }
 
+/*
+ * Force the log to a specific LSN.
+ *
+ * If an iclog with that lsn can be found:
+ *	If it is in the DIRTY state, just return.
+ *	If it is in the ACTIVE state, move the in-core log into the WANT_SYNC
+ *		state and go to sleep or return.
+ *	If it is in any other state, go to sleep or return.
+ *
+ * Synchronous forces are implemented with a wait queue.  All callers trying
+ * to force a given lsn to disk must wait on the queue attached to the
+ * specific in-core log.  When given in-core log finally completes its write
+ * to disk, that thread will wake up all threads waiting on the queue.
+ */
 static int
 xlog_force_lsn(
 	struct xlog		*log,
@@ -3398,18 +3412,15 @@  xlog_force_lsn(
 }
 
 /*
- * Force the in-core log to disk for a specific LSN.
+ * Force the log to a specific checkpoint sequence.
  *
- * Find in-core log with lsn.
- *	If it is in the DIRTY state, just return.
- *	If it is in the ACTIVE state, move the in-core log into the WANT_SYNC
- *		state and go to sleep or return.
- *	If it is in any other state, go to sleep or return.
+ * First force the CIL so that all the required changes have been flushed to the
+ * iclogs. If the CIL force completed it will return a commit LSN that indicates
+ * the iclog that needs to be flushed to stable storage.
  *
- * Synchronous forces are implemented with a wait queue.  All callers trying
- * to force a given lsn to disk must wait on the queue attached to the
- * specific in-core log.  When given in-core log finally completes its write
- * to disk, that thread will wake up all threads waiting on the queue.
+ * If the XFS_LOG_SYNC flag is not set, we only trigger a background CIL force
+ * and do not wait for it to complete, nor do we attempt to check/flush iclogs
+ * as the CIL will not have committed when xlog_cil_force_seq() returns.
  */
 int
 xfs_log_force_seq(
@@ -3426,7 +3437,7 @@  xfs_log_force_seq(
 	XFS_STATS_INC(mp, xs_log_force);
 	trace_xfs_log_force(mp, seq, _RET_IP_);
 
-	lsn = xlog_cil_force_seq(log, seq);
+	lsn = xlog_cil_force_seq(log, flags, seq);
 	if (lsn == NULLCOMMITLSN)
 		return 0;
 
diff --git a/fs/xfs/xfs_log.h b/fs/xfs/xfs_log.h
index 33ae53401060..77d3fca28f55 100644
--- a/fs/xfs/xfs_log.h
+++ b/fs/xfs/xfs_log.h
@@ -104,6 +104,7 @@  struct xlog_ticket;
 struct xfs_log_item;
 struct xfs_item_ops;
 struct xfs_trans;
+struct xlog;
 
 int	  xfs_log_force(struct xfs_mount *mp, uint flags);
 int	  xfs_log_force_seq(struct xfs_mount *mp, uint64_t seq, uint flags,
diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
index 2fda8c4513b2..be6c91d7218f 100644
--- a/fs/xfs/xfs_log_cil.c
+++ b/fs/xfs/xfs_log_cil.c
@@ -999,7 +999,8 @@  xlog_cil_push_background(
 static void
 xlog_cil_push_now(
 	struct xlog	*log,
-	xfs_lsn_t	push_seq)
+	xfs_lsn_t	push_seq,
+	bool		flush)
 {
 	struct xfs_cil	*cil = log->l_cilp;
 
@@ -1009,7 +1010,8 @@  xlog_cil_push_now(
 	ASSERT(push_seq && push_seq <= cil->xc_current_sequence);
 
 	/* start on any pending background push to minimise wait time on it */
-	flush_work(&cil->xc_push_work);
+	if (flush)
+		flush_work(&cil->xc_push_work);
 
 	/*
 	 * If the CIL is empty or we've already pushed the sequence then
@@ -1109,16 +1111,22 @@  xlog_cil_commit(
 /*
  * Conditionally push the CIL based on the sequence passed in.
  *
- * We only need to push if we haven't already pushed the sequence
- * number given. Hence the only time we will trigger a push here is
- * if the push sequence is the same as the current context.
+ * We only need to push if we haven't already pushed the sequence number given.
+ * Hence the only time we will trigger a push here is if the push sequence is
+ * the same as the current context.
  *
- * We return the current commit lsn to allow the callers to determine if a
- * iclog flush is necessary following this call.
+ * If the sequence is zero, push the current sequence. If XFS_LOG_SYNC is set
+ * in the flags, wait for it to complete, otherwise just return NULLCOMMITLSN
+ * indicate we didn't wait for a commit lsn.
+ *
+ * If we waited for the push to complete, then we return the current commit lsn
+ * to allow the callers to determine if an iclog flush is necessary following
+ * this call.
  */
 xfs_lsn_t
 xlog_cil_force_seq(
 	struct xlog	*log,
+	uint32_t	flags,
 	uint64_t	sequence)
 {
 	struct xfs_cil		*cil = log->l_cilp;
@@ -1127,13 +1135,18 @@  xlog_cil_force_seq(
 
 	ASSERT(sequence <= cil->xc_current_sequence);
 
+	if (!sequence)
+		sequence = cil->xc_current_sequence;
+
 	/*
 	 * check to see if we need to force out the current context.
 	 * xlog_cil_push() handles racing pushes for the same sequence,
 	 * so no need to deal with it here.
 	 */
 restart:
-	xlog_cil_push_now(log, sequence);
+	xlog_cil_push_now(log, sequence, flags & XFS_LOG_SYNC);
+	if (!(flags & XFS_LOG_SYNC))
+		return commit_lsn;
 
 	/*
 	 * See if we can find a previous sequence still committing.
diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h
index 59778cd5ecdd..d9929d9146e4 100644
--- a/fs/xfs/xfs_log_priv.h
+++ b/fs/xfs/xfs_log_priv.h
@@ -558,12 +558,13 @@  void	xlog_cil_commit(struct xlog *log, struct xfs_trans *tp,
 /*
  * CIL force routines
  */
-xfs_lsn_t xlog_cil_force_seq(struct xlog *log, uint64_t sequence);
+xfs_lsn_t xlog_cil_force_seq(struct xlog *log, uint32_t flags,
+				uint64_t sequence);
 
 static inline void
 xlog_cil_force(struct xlog *log)
 {
-	xlog_cil_force_seq(log, log->l_cilp->xc_current_sequence);
+	xlog_cil_force_seq(log, XFS_LOG_SYNC, log->l_cilp->xc_current_sequence);
 }
 
 /*
diff --git a/fs/xfs/xfs_sysfs.c b/fs/xfs/xfs_sysfs.c
index f1bc88f4367c..18dc5eca6c04 100644
--- a/fs/xfs/xfs_sysfs.c
+++ b/fs/xfs/xfs_sysfs.c
@@ -10,6 +10,7 @@ 
 #include "xfs_log_format.h"
 #include "xfs_trans_resv.h"
 #include "xfs_sysfs.h"
+#include "xfs_log.h"
 #include "xfs_log_priv.h"
 #include "xfs_mount.h"
 
diff --git a/fs/xfs/xfs_trace.c b/fs/xfs/xfs_trace.c
index 9b8d703dc9fd..d111a994b7b6 100644
--- a/fs/xfs/xfs_trace.c
+++ b/fs/xfs/xfs_trace.c
@@ -20,6 +20,7 @@ 
 #include "xfs_bmap.h"
 #include "xfs_attr.h"
 #include "xfs_trans.h"
+#include "xfs_log.h"
 #include "xfs_log_priv.h"
 #include "xfs_buf_item.h"
 #include "xfs_quota.h"
diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
index 697703f3be48..7d05694681e3 100644
--- a/fs/xfs/xfs_trans.c
+++ b/fs/xfs/xfs_trans.c
@@ -9,7 +9,6 @@ 
 #include "xfs_shared.h"
 #include "xfs_format.h"
 #include "xfs_log_format.h"
-#include "xfs_log_priv.h"
 #include "xfs_trans_resv.h"
 #include "xfs_mount.h"
 #include "xfs_extent_busy.h"
@@ -17,6 +16,7 @@ 
 #include "xfs_trans.h"
 #include "xfs_trans_priv.h"
 #include "xfs_log.h"
+#include "xfs_log_priv.h"
 #include "xfs_trace.h"
 #include "xfs_error.h"
 #include "xfs_defer.h"
diff --git a/fs/xfs/xfs_trans_ail.c b/fs/xfs/xfs_trans_ail.c
index dbb69b4bf3ed..dfc0206c0d36 100644
--- a/fs/xfs/xfs_trans_ail.c
+++ b/fs/xfs/xfs_trans_ail.c
@@ -17,6 +17,7 @@ 
 #include "xfs_errortag.h"
 #include "xfs_error.h"
 #include "xfs_log.h"
+#include "xfs_log_priv.h"
 
 #ifdef DEBUG
 /*
@@ -429,8 +430,12 @@  xfsaild_push(
 
 	/*
 	 * If we encountered pinned items or did not finish writing out all
-	 * buffers the last time we ran, force the log first and wait for it
-	 * before pushing again.
+	 * buffers the last time we ran, force a background CIL push to get the
+	 * items unpinned in the near future. We do not wait on the CIL push as
+	 * that could stall us for seconds if there is enough background IO
+	 * load. Stalling for that long when the tail of the log is pinned and
+	 * needs flushing will hard stop the transaction subsystem when log
+	 * space runs out.
 	 */
 	if (ailp->ail_log_flush && ailp->ail_last_pushed_lsn == 0 &&
 	    (!list_empty_careful(&ailp->ail_buf_list) ||
@@ -438,7 +443,7 @@  xfsaild_push(
 		ailp->ail_log_flush = 0;
 
 		XFS_STATS_INC(mp, xs_push_ail_flush);
-		xfs_log_force(mp, XFS_LOG_SYNC);
+		xlog_cil_force_seq(mp->m_log, 0, 0);
 	}
 
 	spin_lock(&ailp->ail_lock);