Message ID: 20190909015159.19662-2-david@fromorbit.com
State: Superseded
Series: [1/2] xfs: Lower CIL flush limit for large logs
On Mon, Sep 09, 2019 at 11:51:58AM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
>
> The current CIL size aggregation limit is 1/8th the log size. This means for large logs we might be aggregating at least 250MB of dirty objects in memory before the CIL is flushed to the journal. With CIL shadow buffers sitting around, this means the CIL is often consuming >500MB of temporary memory that is all allocated under GFP_NOFS conditions.
>
> Flushing the CIL can take some time to do if there is other IO ongoing, and can introduce substantial log force latency by itself. It also pins the memory until the objects are in the AIL and can be written back and reclaimed by shrinkers. Hence this threshold also tends to determine the minimum amount of memory XFS can operate in under heavy modification without triggering the OOM killer.
>
> Modify the CIL space limit to prevent such huge amounts of pinned metadata from aggregating. We can have 2MB of log IO in flight at once, so limit aggregation to 16x this size. This threshold was chosen as it has little impact on performance (on 16-way fsmark) or log traffic but pins a lot less memory on large logs, especially under heavy memory pressure. An aggregation limit of 8x had 5-10% performance degradation and a 50% increase in log throughput for the same workload, so clearly that was too small for highly concurrent workloads on large logs.

It would be nice to capture at least some of this reasoning in the already lengthy comment preceding the #define....

> This was found via trace analysis of AIL behaviour. e.g. insertion from a single CIL flush:
>
> xfs_ail_insert: old lsn 0/0 new lsn 1/3033090 type XFS_LI_INODE flags IN_AIIL
>
> $ grep xfs_ail_insert /mnt/scratch/s.t | grep "new lsn 1/3033090" | wc -l
> 1721823
> $
>
> So there were 1.7 million objects inserted into the AIL from this CIL checkpoint, the first at 2323.392108, the last at 2325.667566, which was the end of the trace (i.e. it hadn't finished). Clearly a major problem.
>
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_log_priv.h | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h
> index b880c23cb6e4..187a43ffeaf7 100644
> --- a/fs/xfs/xfs_log_priv.h
> +++ b/fs/xfs/xfs_log_priv.h
> @@ -329,7 +329,8 @@ struct xfs_cil {
>   * enforced to ensure we stay within our maximum checkpoint size bounds.
>   * threshold, yet give us plenty of space for aggregation on large logs.

...also, does XLOG_CIL_SPACE_LIMIT correspond to "a lower threshold at which background pushing is attempted", or "a separate, higher bound"? I think it's the first (????) but ... I don't know. The name made me think it was the second, but the single use of the symbol suggests the first. :)

--D

>   */
> -#define XLOG_CIL_SPACE_LIMIT(log)	(log->l_logsize >> 3)
> +#define XLOG_CIL_SPACE_LIMIT(log)	\
> +	min_t(int, (log)->l_logsize >> 3, BBTOB(XLOG_TOTAL_REC_SHIFT(log)) << 4)
>
>  /*
>   * ticket grant locks, queues and accounting have their own cachlines
> --
> 2.23.0.rc1
On Mon, Sep 16, 2019 at 09:33:25AM -0700, Darrick J. Wong wrote:
> On Mon, Sep 09, 2019 at 11:51:58AM +1000, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> >
> > The current CIL size aggregation limit is 1/8th the log size. This means for large logs we might be aggregating at least 250MB of dirty objects in memory before the CIL is flushed to the journal. With CIL shadow buffers sitting around, this means the CIL is often consuming >500MB of temporary memory that is all allocated under GFP_NOFS conditions.
> >
> > Flushing the CIL can take some time to do if there is other IO ongoing, and can introduce substantial log force latency by itself. It also pins the memory until the objects are in the AIL and can be written back and reclaimed by shrinkers. Hence this threshold also tends to determine the minimum amount of memory XFS can operate in under heavy modification without triggering the OOM killer.
> >
> > Modify the CIL space limit to prevent such huge amounts of pinned metadata from aggregating. We can have 2MB of log IO in flight at once, so limit aggregation to 16x this size. This threshold was chosen as it has little impact on performance (on 16-way fsmark) or log traffic but pins a lot less memory on large logs, especially under heavy memory pressure. An aggregation limit of 8x had 5-10% performance degradation and a 50% increase in log throughput for the same workload, so clearly that was too small for highly concurrent workloads on large logs.
>
> It would be nice to capture at least some of this reasoning in the already lengthy comment preceding the #define....

A lot of it is already there, but I will revise it.

> > This was found via trace analysis of AIL behaviour. e.g. insertion from a single CIL flush:
> >
> > xfs_ail_insert: old lsn 0/0 new lsn 1/3033090 type XFS_LI_INODE flags IN_AIL
> >
> > $ grep xfs_ail_insert /mnt/scratch/s.t | grep "new lsn 1/3033090" | wc -l
> > 1721823
> > $
> >
> > So there were 1.7 million objects inserted into the AIL from this CIL checkpoint, the first at 2323.392108, the last at 2325.667566, which was the end of the trace (i.e. it hadn't finished). Clearly a major problem.
> >
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > ---
> >  fs/xfs/xfs_log_priv.h | 3 ++-
> >  1 file changed, 2 insertions(+), 1 deletion(-)
> >
> > diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h
> > index b880c23cb6e4..187a43ffeaf7 100644
> > --- a/fs/xfs/xfs_log_priv.h
> > +++ b/fs/xfs/xfs_log_priv.h
> > @@ -329,7 +329,8 @@ struct xfs_cil {
> >   * enforced to ensure we stay within our maximum checkpoint size bounds.
> >   * threshold, yet give us plenty of space for aggregation on large logs.
>
> ...also, does XLOG_CIL_SPACE_LIMIT correspond to "a lower threshold at which background pushing is attempted", or "a separate, higher bound"? I think it's the first (????) but ... I don't know. The name made me think it was the second, but the single use of the symbol suggests the first. :)

See, the comment here talks about two limits, because that was how the initial implementation worked - the background CIL push was not async, so there was some juggling done to prevent new commits from blocking on background pushes in progress unless the size was actually growing too large. This patch pretty much describes the whole issue:

https://lore.kernel.org/linux-xfs/1285552073-14663-2-git-send-email-david@fromorbit.com/

That's in commit 80168676ebfe ("xfs: force background CIL push under sustained load"), which went into 2.6.38 or so. The cause of the problem in that case was concurrent transaction commit load causing lock contention and preventing a background push from getting the context lock to do the actual push.

The hard limit in the CIL code was dropped when the background push was converted to run asynchronously via a work queue in 2012, as that allowed the locking to be changed (down_write_trylock -> down_write) to turn it into a transaction commit barrier while the contexts are switched over. That was done via commit 4c2d542f2e78 ("xfs: Do background CIL flushes via a workqueue"), and so we haven't actually capped CIL checkpoint sizes since 2012.

Essentially, the comment you point out documents the two limits from the original code, and this commit is restoring that behaviour for background CIL pushes....

I'll do some work to update it all.

Cheers,

Dave.
On Wed, Sep 25, 2019 at 08:29:01AM +1000, Dave Chinner wrote:
> On Mon, Sep 16, 2019 at 09:33:25AM -0700, Darrick J. Wong wrote:
> > On Mon, Sep 09, 2019 at 11:51:58AM +1000, Dave Chinner wrote:
> > > From: Dave Chinner <dchinner@redhat.com>
> > >
> > > The current CIL size aggregation limit is 1/8th the log size. This means for large logs we might be aggregating at least 250MB of dirty objects in memory before the CIL is flushed to the journal. With CIL shadow buffers sitting around, this means the CIL is often consuming >500MB of temporary memory that is all allocated under GFP_NOFS conditions.
> > >
> > > Flushing the CIL can take some time to do if there is other IO ongoing, and can introduce substantial log force latency by itself. It also pins the memory until the objects are in the AIL and can be written back and reclaimed by shrinkers. Hence this threshold also tends to determine the minimum amount of memory XFS can operate in under heavy modification without triggering the OOM killer.
> > >
> > > Modify the CIL space limit to prevent such huge amounts of pinned metadata from aggregating. We can have 2MB of log IO in flight at once, so limit aggregation to 16x this size. This threshold was chosen as it has little impact on performance (on 16-way fsmark) or log traffic but pins a lot less memory on large logs, especially under heavy memory pressure. An aggregation limit of 8x had 5-10% performance degradation and a 50% increase in log throughput for the same workload, so clearly that was too small for highly concurrent workloads on large logs.
> >
> > It would be nice to capture at least some of this reasoning in the already lengthy comment preceding the #define....
>
> A lot of it is already there, but I will revise it.
>
> > > This was found via trace analysis of AIL behaviour. e.g. insertion from a single CIL flush:
> > >
> > > xfs_ail_insert: old lsn 0/0 new lsn 1/3033090 type XFS_LI_INODE flags IN_AIL
> > >
> > > $ grep xfs_ail_insert /mnt/scratch/s.t | grep "new lsn 1/3033090" | wc -l
> > > 1721823
> > > $
> > >
> > > So there were 1.7 million objects inserted into the AIL from this CIL checkpoint, the first at 2323.392108, the last at 2325.667566, which was the end of the trace (i.e. it hadn't finished). Clearly a major problem.
> > >
> > > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > > ---
> > >  fs/xfs/xfs_log_priv.h | 3 ++-
> > >  1 file changed, 2 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h
> > > index b880c23cb6e4..187a43ffeaf7 100644
> > > --- a/fs/xfs/xfs_log_priv.h
> > > +++ b/fs/xfs/xfs_log_priv.h
> > > @@ -329,7 +329,8 @@ struct xfs_cil {
> > >   * enforced to ensure we stay within our maximum checkpoint size bounds.
> > >   * threshold, yet give us plenty of space for aggregation on large logs.
> >
> > ...also, does XLOG_CIL_SPACE_LIMIT correspond to "a lower threshold at which background pushing is attempted", or "a separate, higher bound"? I think it's the first (????) but ... I don't know. The name made me think it was the second, but the single use of the symbol suggests the first. :)
>
> See, the comment here talks about two limits, because that was how the initial implementation worked - the background CIL push was not async, so there was some juggling done to prevent new commits from blocking on background pushes in progress unless the size was actually growing too large. This patch pretty much describes the whole issue:
>
> https://lore.kernel.org/linux-xfs/1285552073-14663-2-git-send-email-david@fromorbit.com/
>
> That's in commit 80168676ebfe ("xfs: force background CIL push under sustained load"), which went into 2.6.38 or so. The cause of the problem in that case was concurrent transaction commit load causing lock contention and preventing a background push from getting the context lock to do the actual push.
>

More related to the next patch, but what prevents a similar but generally unbounded concurrent workload from exceeding the new hard limit once transactions start to block post commit?

Brian

> The hard limit in the CIL code was dropped when the background push was converted to run asynchronously via a work queue in 2012, as that allowed the locking to be changed (down_write_trylock -> down_write) to turn it into a transaction commit barrier while the contexts are switched over. That was done via commit 4c2d542f2e78 ("xfs: Do background CIL flushes via a workqueue"), and so we haven't actually capped CIL checkpoint sizes since 2012.
>
> Essentially, the comment you point out documents the two limits from the original code, and this commit is restoring that behaviour for background CIL pushes....
>
> I'll do some work to update it all.
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com
On Wed, Sep 25, 2019 at 08:08:59AM -0400, Brian Foster wrote:
> On Wed, Sep 25, 2019 at 08:29:01AM +1000, Dave Chinner wrote:
> > That's in commit 80168676ebfe ("xfs: force background CIL push under sustained load"), which went into 2.6.38 or so. The cause of the problem in that case was concurrent transaction commit load causing lock contention and preventing a background push from getting the context lock to do the actual push.
>
> More related to the next patch, but what prevents a similar but generally unbounded concurrent workload from exceeding the new hard limit once transactions start to block post commit?

The new code, like the original code, is not actually a "hard" limit. It essentially just throttles ongoing work until the CIL push starts. In this case, it forces the current process to give up the CPU immediately once over the CIL high limit, which allows the workqueue to run the push work straight away.

I thought about making it a "hard limit" by blocking before the CIL insert, but that's no guarantee that by the time we get woken and add the new commit to the CIL, this new context has not already gone over the hard limit. i.e. we block the unbound concurrency before commit, then let it all go in a thundering herd on the new context and immediately punch way over the hard threshold again. To avoid this, we'd probably need a CIL ticket and grant mechanism to make CIL insertion FIFO and wakeups limited by remaining space in the CIL. I'm not sure we actually need such a complex solution, especially considering the potential serialisation problems it introduces in what is a highly concurrent fast path...

Cheers,

Dave.
On Sat, Sep 28, 2019 at 08:47:58AM +1000, Dave Chinner wrote:
> On Wed, Sep 25, 2019 at 08:08:59AM -0400, Brian Foster wrote:
> > On Wed, Sep 25, 2019 at 08:29:01AM +1000, Dave Chinner wrote:
> > > That's in commit 80168676ebfe ("xfs: force background CIL push under sustained load"), which went into 2.6.38 or so. The cause of the problem in that case was concurrent transaction commit load causing lock contention and preventing a background push from getting the context lock to do the actual push.
> >
> > More related to the next patch, but what prevents a similar but generally unbounded concurrent workload from exceeding the new hard limit once transactions start to block post commit?
>
> The new code, like the original code, is not actually a "hard" limit. It essentially just throttles ongoing work until the CIL push starts. In this case, it forces the current process to give up the CPU immediately once over the CIL high limit, which allows the workqueue to run the push work straight away.
>
> I thought about making it a "hard limit" by blocking before the CIL insert, but that's no guarantee that by the time we get woken and add the new commit to the CIL, this new context has not already gone over the hard limit. i.e. we block the unbound concurrency before commit, then let it all go in a thundering herd on the new context and immediately punch way over the hard threshold again. To avoid this, we'd probably need a CIL ticket and grant mechanism to make CIL insertion FIFO and wakeups limited by remaining space in the CIL. I'm not sure we actually need such a complex solution, especially considering the potential serialisation problems it introduces in what is a highly concurrent fast path...
>

Ok. The latter is more of what I'd expect to see if the objective is truly to hard cap CIL size, FWIW. This seems more reasonable if the objective is to yield committers under overloaded CIL conditions, with the caveat that CIL size is still not hard capped, so long as the documentation and whatnot is updated to more accurately reflect the implementation (and at a glance, it looks like that has happened in the next version..).

Brian

> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com
diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h
index b880c23cb6e4..187a43ffeaf7 100644
--- a/fs/xfs/xfs_log_priv.h
+++ b/fs/xfs/xfs_log_priv.h
@@ -329,7 +329,8 @@ struct xfs_cil {
  * enforced to ensure we stay within our maximum checkpoint size bounds.
  * threshold, yet give us plenty of space for aggregation on large logs.
  */
-#define XLOG_CIL_SPACE_LIMIT(log)	(log->l_logsize >> 3)
+#define XLOG_CIL_SPACE_LIMIT(log)	\
+	min_t(int, (log)->l_logsize >> 3, BBTOB(XLOG_TOTAL_REC_SHIFT(log)) << 4)
 
 /*
  * ticket grant locks, queues and accounting have their own cachlines