[1/2] xfs: Lower CIL flush limit for large logs

Message ID: 20190909015159.19662-2-david@fromorbit.com (mailing list archive)
State: Superseded

Commit Message

Dave Chinner Sept. 9, 2019, 1:51 a.m. UTC
From: Dave Chinner <dchinner@redhat.com>

The current CIL size aggregation limit is 1/8th the log size. This
means for large logs we might be aggregating at least 250MB of dirty objects
in memory before the CIL is flushed to the journal. With CIL shadow
buffers sitting around, this means the CIL is often consuming >500MB
of temporary memory that is all allocated under GFP_NOFS conditions.

Flushing the CIL can take some time to do if there is other IO
ongoing, and can introduce substantial log force latency by itself.
It also pins the memory until the objects are in the AIL and can be
written back and reclaimed by shrinkers. Hence this threshold also
tends to determine the minimum amount of memory XFS can operate in
under heavy modification without triggering the OOM killer.

Modify the CIL space limit to prevent such huge amounts of pinned
metadata from aggregating. We can have 2MB of log IO in flight at
once, so limit aggregation to 16x this size. This threshold was
chosen as it has little impact on performance (on 16-way fsmark) or
log traffic, but pins a lot less memory on large logs, especially
under heavy memory pressure.  An aggregation limit of 8x showed a
5-10% performance degradation and a 50% increase in log throughput
for the same workload, so clearly that was too small for highly
concurrent workloads on large logs.
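
For concreteness, here's a back-of-the-envelope sketch of the old
and new limits (a standalone illustration, not the kernel macro; it
assumes a v2 log with eight 256KB in-core log buffers, which is
where the 2MB of in-flight log IO comes from):

/*
 * Illustrative arithmetic only -- not the kernel code. A v2 log can
 * have 8 in-core log buffers (XLOG_MAX_ICLOGS) of 256KB each in
 * flight, i.e. 2MB of log IO at once.
 */
#include <stdio.h>

#define INFLIGHT_IO	(8 * 256 * 1024)	/* 2MB */

static long old_limit(long logsize)
{
	return logsize >> 3;			/* 1/8th of the log */
}

static long new_limit(long logsize)
{
	long cap = (long)INFLIGHT_IO << 4;	/* 16x in-flight = 32MB */
	long bg = logsize >> 3;

	return bg < cap ? bg : cap;
}

int main(void)
{
	long two_gig = 2048L << 20;		/* maximum log size */

	printf("old %ldMB, new %ldMB\n",
	       old_limit(two_gig) >> 20, new_limit(two_gig) >> 20);
	/* prints "old 256MB, new 32MB" */
	return 0;
}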

This was found via trace analysis of AIL behaviour. e.g. insertion
from a single CIL flush:

xfs_ail_insert: old lsn 0/0 new lsn 1/3033090 type XFS_LI_INODE flags IN_AIL

$ grep xfs_ail_insert /mnt/scratch/s.t |grep "new lsn 1/3033090" |wc -l
1721823
$

So there were 1.7 million objects inserted into the AIL from this
CIL checkpoint, the first at 2323.392108 and the last at 2325.667566, which
was the end of the trace (i.e. it hadn't finished). Clearly a major
problem.
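
To put a rate on that, a quick standalone calculation from the
numbers above (not part of the patch):

#include <stdio.h>

int main(void)
{
	double first = 2323.392108, last = 2325.667566;
	long inserts = 1721823;

	/* ~756,000 AIL insertions per second, and the checkpoint
	 * still hadn't finished when the trace ended */
	printf("%.0f inserts/sec\n", inserts / (last - first));
	return 0;
}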

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_log_priv.h | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

Comments

Darrick J. Wong Sept. 16, 2019, 4:33 p.m. UTC | #1
On Mon, Sep 09, 2019 at 11:51:58AM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> The current CIL size aggregation limit is 1/8th the log size. This
> means for large logs we might be aggregating at least 250MB of dirty objects
> in memory before the CIL is flushed to the journal. With CIL shadow
> buffers sitting around, this means the CIL is often consuming >500MB
> of temporary memory that is all allocated under GFP_NOFS conditions.
> 
> Flushing the CIL can take some time to do if there is other IO
> ongoing, and can introduce substantial log force latency by itself.
> It also pins the memory until the objects are in the AIL and can be
> written back and reclaimed by shrinkers. Hence this threshold also
> tends to determine the minimum amount of memory XFS can operate in
> under heavy modification without triggering the OOM killer.
> 
> Modify the CIL space limit to prevent such huge amounts of pinned
> metadata from aggregating. We can have 2MB of log IO in flight at
> once, so limit aggregation to 16x this size. This threshold was
> chosen as it has little impact on performance (on 16-way fsmark) or
> log traffic, but pins a lot less memory on large logs, especially
> under heavy memory pressure.  An aggregation limit of 8x showed a
> 5-10% performance degradation and a 50% increase in log throughput
> for the same workload, so clearly that was too small for highly
> concurrent workloads on large logs.

It would be nice to capture at least some of this reasoning in the
already lengthy comment preceding the #define....

> This was found via trace analysis of AIL behaviour. e.g. insertion
> from a single CIL flush:
> 
> xfs_ail_insert: old lsn 0/0 new lsn 1/3033090 type XFS_LI_INODE flags IN_AIL
> 
> $ grep xfs_ail_insert /mnt/scratch/s.t |grep "new lsn 1/3033090" |wc -l
> 1721823
> $
> 
> So there were 1.7 million objects inserted into the AIL from this
> CIL checkpoint, the first at 2323.392108 and the last at 2325.667566, which
> was the end of the trace (i.e. it hadn't finished). Clearly a major
> problem.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_log_priv.h | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h
> index b880c23cb6e4..187a43ffeaf7 100644
> --- a/fs/xfs/xfs_log_priv.h
> +++ b/fs/xfs/xfs_log_priv.h
> @@ -329,7 +329,8 @@ struct xfs_cil {
>   * enforced to ensure we stay within our maximum checkpoint size bounds.
>   * threshold, yet give us plenty of space for aggregation on large logs.

...also, does XLOG_CIL_SPACE_LIMIT correspond to "a lower threshold at
which background pushing is attempted", or "a separate, higher bound"?
I think it's the first (????) but ... I don't know.  The name made me
think it was the second, but the single use of the symbol suggests the
first. :)

--D

>   */
> -#define XLOG_CIL_SPACE_LIMIT(log)	(log->l_logsize >> 3)
> +#define XLOG_CIL_SPACE_LIMIT(log)	\
> +	min_t(int, (log)->l_logsize >> 3, BBTOB(XLOG_TOTAL_REC_SHIFT(log)) << 4)
>  
>  /*
>   * ticket grant locks, queues and accounting have their own cachelines
> -- 
> 2.23.0.rc1
>
Dave Chinner Sept. 24, 2019, 10:29 p.m. UTC | #2
On Mon, Sep 16, 2019 at 09:33:25AM -0700, Darrick J. Wong wrote:
> On Mon, Sep 09, 2019 at 11:51:58AM +1000, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > The current CIL size aggregation limit is 1/8th the log size. This
> > means for large logs we might be aggregating at least 250MB of dirty objects
> > in memory before the CIL is flushed to the journal. With CIL shadow
> > buffers sitting around, this means the CIL is often consuming >500MB
> > of temporary memory that is all allocated under GFP_NOFS conditions.
> > 
> > Flushing the CIL can take some time to do if there is other IO
> > ongoing, and can introduce substantial log force latency by itself.
> > It also pins the memory until the objects are in the AIL and can be
> > written back and reclaimed by shrinkers. Hence this threshold also
> > tends to determine the minimum amount of memory XFS can operate in
> > under heavy modification without triggering the OOM killer.
> > 
> > Modify the CIL space limit to prevent such huge amounts of pinned
> > metadata from aggregating. We can have 2MB of log IO in flight at
> > once, so limit aggregation to 16x this size. This threshold was
> > chosen as it has little impact on performance (on 16-way fsmark) or
> > log traffic, but pins a lot less memory on large logs, especially
> > under heavy memory pressure.  An aggregation limit of 8x showed a
> > 5-10% performance degradation and a 50% increase in log throughput
> > for the same workload, so clearly that was too small for highly
> > concurrent workloads on large logs.
> 
> It would be nice to capture at least some of this reasoning in the
> already lengthy comment preceding the #define....

A lot of it is already there, but I will revise it.

> 
> > This was found via trace analysis of AIL behaviour. e.g. insertion
> > from a single CIL flush:
> > 
> > xfs_ail_insert: old lsn 0/0 new lsn 1/3033090 type XFS_LI_INODE flags IN_AIL
> > 
> > $ grep xfs_ail_insert /mnt/scratch/s.t |grep "new lsn 1/3033090" |wc -l
> > 1721823
> > $
> > 
> > So there were 1.7 million objects inserted into the AIL from this
> > CIL checkpoint, the first at 2323.392108 and the last at 2325.667566, which
> > was the end of the trace (i.e. it hadn't finished). Clearly a major
> > problem.
> > 
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > ---
> >  fs/xfs/xfs_log_priv.h | 3 ++-
> >  1 file changed, 2 insertions(+), 1 deletion(-)
> > 
> > diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h
> > index b880c23cb6e4..187a43ffeaf7 100644
> > --- a/fs/xfs/xfs_log_priv.h
> > +++ b/fs/xfs/xfs_log_priv.h
> > @@ -329,7 +329,8 @@ struct xfs_cil {
> >   * enforced to ensure we stay within our maximum checkpoint size bounds.
> >   * threshold, yet give us plenty of space for aggregation on large logs.
> 
> ...also, does XLOG_CIL_SPACE_LIMIT correspond to "a lower threshold at
> which background pushing is attempted", or "a separate, higher bound"?
> I think it's the first (????) but ... I don't know.  The name made me
> think it was the second, but the single use of the symbol suggests the
> first. :)

See, the comment here talks about two limits, because that was how
the initial implementation worked - the background CIL push was not
async, so there was some juggling done to prevent new commits from
blocking on background pushes in progress unless the size was
actually growing too large.  This patch pretty much describes the
whole issue here:

https://lore.kernel.org/linux-xfs/1285552073-14663-2-git-send-email-david@fromorbit.com/

That's in commit 80168676ebfe ("xfs: force background CIL push under
sustained load") which went into 2.6.38 or so. The cause of the
problem in that case was concurrent transaction commit load causing
lock contention and preventing a background push from getting the
context lock to do the actual push.

The hard limit in the CIL code was dropped when the background push
was converted to run asynchronously via a workqueue, as that allowed
the locking to be changed (down_write_trylock -> down_write) to
turn it into a transaction commit barrier while the contexts are
switched over.  That was done in 2012 via commit 4c2d542f2e78
("xfs: Do background CIL flushes via a workqueue"), and so we
haven't actually capped CIL checkpoint sizes since then.
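
To illustrate the locking difference being described, here's a
simplified userspace-shaped sketch (pthreads standing in for the
kernel rwsem; the structure and names are invented for illustration,
not the actual xlog_cil_push code):

#include <pthread.h>

struct cil {
	pthread_rwlock_t ctx_lock;	/* commits hold this shared */
};

/*
 * Pre-2012 shape: if transaction commits held the context lock, the
 * trylock failed, the push was skipped and the CIL kept growing.
 */
static void cil_push_trylock(struct cil *cil)
{
	if (pthread_rwlock_trywrlock(&cil->ctx_lock) != 0)
		return;			/* contended: give up */
	/* ... switch contexts, format and write the checkpoint ... */
	pthread_rwlock_unlock(&cil->ctx_lock);
}

/*
 * Post-4c2d542f2e78 shape: the push runs from a workqueue and can
 * block, acting as a commit barrier while the contexts are switched
 * over -- but nothing bounds how large the CIL grows before the
 * work gets to run.
 */
static void cil_push_blocking(struct cil *cil)
{
	pthread_rwlock_wrlock(&cil->ctx_lock);	/* waits out commits */
	/* ... switch contexts, format and write the checkpoint ... */
	pthread_rwlock_unlock(&cil->ctx_lock);
}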

Essentially, the comment you point out documents the two limits from
the original code, and this commit is restoring that behaviour for
background CIL pushes....

I'll do some work to update it all.

Cheers,

Dave.
Brian Foster Sept. 25, 2019, 12:08 p.m. UTC | #3
On Wed, Sep 25, 2019 at 08:29:01AM +1000, Dave Chinner wrote:
> On Mon, Sep 16, 2019 at 09:33:25AM -0700, Darrick J. Wong wrote:
> > On Mon, Sep 09, 2019 at 11:51:58AM +1000, Dave Chinner wrote:
> > > From: Dave Chinner <dchinner@redhat.com>
> > > 
> > > The current CIL size aggregation limit is 1/8th the log size. This
> > > means for large logs we might be aggregating at least 250MB of dirty objects
> > > in memory before the CIL is flushed to the journal. With CIL shadow
> > > buffers sitting around, this means the CIL is often consuming >500MB
> > > of temporary memory that is all allocated under GFP_NOFS conditions.
> > > 
> > > Flushing the CIL can take some time to do if there is other IO
> > > ongoing, and can introduce substantial log force latency by itself.
> > > It also pins the memory until the objects are in the AIL and can be
> > > written back and reclaimed by shrinkers. Hence this threshold also
> > > tends to determine the minimum amount of memory XFS can operate in
> > > under heavy modification without triggering the OOM killer.
> > > 
> > > Modify the CIL space limit to prevent such huge amounts of pinned
> > > metadata from aggregating. We can have 2MB of log IO in flight at
> > > once, so limit aggregation to 16x this size. This threshold was
> > > chosen as it has little impact on performance (on 16-way fsmark) or
> > > log traffic, but pins a lot less memory on large logs, especially
> > > under heavy memory pressure.  An aggregation limit of 8x showed a
> > > 5-10% performance degradation and a 50% increase in log throughput
> > > for the same workload, so clearly that was too small for highly
> > > concurrent workloads on large logs.
> > 
> > It would be nice to capture at least some of this reasoning in the
> > already lengthy comment preceding the #define....
> 
> A lot of it is already there, but I will revise it.
> 
> > 
> > > This was found via trace analysis of AIL behaviour. e.g. insertion
> > > from a single CIL flush:
> > > 
> > > xfs_ail_insert: old lsn 0/0 new lsn 1/3033090 type XFS_LI_INODE flags IN_AIL
> > > 
> > > $ grep xfs_ail_insert /mnt/scratch/s.t |grep "new lsn 1/3033090" |wc -l
> > > 1721823
> > > $
> > > 
> > > So there were 1.7 million objects inserted into the AIL from this
> > > CIL checkpoint, the first at 2323.392108 and the last at 2325.667566, which
> > > was the end of the trace (i.e. it hadn't finished). Clearly a major
> > > problem.
> > > 
> > > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > > ---
> > >  fs/xfs/xfs_log_priv.h | 3 ++-
> > >  1 file changed, 2 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h
> > > index b880c23cb6e4..187a43ffeaf7 100644
> > > --- a/fs/xfs/xfs_log_priv.h
> > > +++ b/fs/xfs/xfs_log_priv.h
> > > @@ -329,7 +329,8 @@ struct xfs_cil {
> > >   * enforced to ensure we stay within our maximum checkpoint size bounds.
> > >   * threshold, yet give us plenty of space for aggregation on large logs.
> > 
> > ...also, does XLOG_CIL_SPACE_LIMIT correspond to "a lower threshold at
> > which background pushing is attempted", or "a separate, higher bound"?
> > I think it's the first (????) but ... I don't know.  The name made me
> > think it was the second, but the single use of the symbol suggests the
> > first. :)
> 
> See, the comment here talks about two limits, because that was how
> the initial implementation worked - the background CIL push was not
> async, so there was some juggling done to prevent new commits from
> blocking on background pushes in progress unless the size was
> actually growing too large.  This patch pretty much describes the
> whole issue here:
> 
> https://lore.kernel.org/linux-xfs/1285552073-14663-2-git-send-email-david@fromorbit.com/
> 
> That's in commit 80168676ebfe ("xfs: force background CIL push under
> sustained load") which went into 2.6.38 or so. The cause of the
> problem in that case was concurrent transaction commit load causing
> lock contention and preventing a background push from getting the
> context lock to do the actual push.
> 

More related to the next patch, but what prevents a similar but
generally unbound concurrent workload from exceeding the new hard limit
once transactions start to block post commit?

Brian

> The hard limit in the CIL code was dropped when the background push
> was converted to run asynchronously via a workqueue, as that allowed
> the locking to be changed (down_write_trylock -> down_write) to
> turn it into a transaction commit barrier while the contexts are
> switched over.  That was done in 2012 via commit 4c2d542f2e78
> ("xfs: Do background CIL flushes via a workqueue"), and so we
> haven't actually capped CIL checkpoint sizes since then.
> 
> Essentially, the comment you point out documents the two limits from
> the original code, and this commit is restoring that behaviour for
> background CIL pushes....
> 
> I'll do some work to update it all.
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
Dave Chinner Sept. 27, 2019, 10:47 p.m. UTC | #4
On Wed, Sep 25, 2019 at 08:08:59AM -0400, Brian Foster wrote:
> On Wed, Sep 25, 2019 at 08:29:01AM +1000, Dave Chinner wrote:
> > That's in commit 80168676ebfe ("xfs: force background CIL push under
> > sustained load") which went into 2.6.38 or so. The cause of the
> > problem in that case was concurrent transaction commit load causing
> > lock contention and preventing a background push from getting the
> > context lock to do the actual push.
> > 
> 
> More related to the next patch, but what prevents a similar but
> generally unbound concurrent workload from exceeding the new hard limit
> once transactions start to block post commit?

The new code, like the original code, is not actually a "hard" limit.
It essentially just throttles ongoing work until the CIL push
starts. In this case, it forces the current process to give up the
CPU immediately once over the CIL high limit, which allows the
workqueue to run the push work straight away.

I thought about making it a "hard limit" by blocking before the CIL
insert, but there's no guarantee that by the time we get woken and
add the new commit to the CIL that this new context has not already
gone over the hard limit. i.e. we block the unbound concurrency
before commit, then let it all go in a thundering herd on the new
context and immediately punch it way over the hard threshold
again. To avoid this, we'd probably need a CIL ticket and grant
mechanism to make CIL insertion FIFO and wakeups limited by
remaining space in the CIL. I'm not sure we actually need such a
complex solution, especially considering the potential serialisation
problems it introduces in what is a highly concurrent fast path...
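
As a rough sketch of the yield-style throttle I'm describing (the
thresholds, names and atomic bookkeeping here are all invented for
illustration, not the actual patch):

#include <sched.h>
#include <stdatomic.h>
#include <stdbool.h>

#define CIL_SPACE_LIMIT		(32L * 1024 * 1024)	/* push trigger */
#define CIL_HIGH_LIMIT		(2 * CIL_SPACE_LIMIT)	/* yield point */

static atomic_long cil_space_used;
static atomic_bool push_queued;

static void queue_push_work(void)
{
	/* kick the background push work (elided) */
}

static void cil_insert_done(long bytes_added)
{
	long used = atomic_fetch_add(&cil_space_used, bytes_added) +
		    bytes_added;

	/* over the background limit: make sure a push is queued */
	if (used >= CIL_SPACE_LIMIT && !atomic_exchange(&push_queued, true))
		queue_push_work();

	/*
	 * Over the high limit: yield the CPU immediately so the push
	 * work can run. This throttles committers but is not a hard
	 * cap -- the insert has already happened.
	 */
	if (used >= CIL_HIGH_LIMIT)
		sched_yield();
}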

Cheers,

Dave.
Brian Foster Sept. 30, 2019, 12:24 p.m. UTC | #5
On Sat, Sep 28, 2019 at 08:47:58AM +1000, Dave Chinner wrote:
> On Wed, Sep 25, 2019 at 08:08:59AM -0400, Brian Foster wrote:
> > On Wed, Sep 25, 2019 at 08:29:01AM +1000, Dave Chinner wrote:
> > > That's in commit 80168676ebfe ("xfs: force background CIL push under
> > > sustained load") which went into 2.6.38 or so. The cause of the
> > > problem in that case was concurrent transaction commit load causing
> > > lock contention and preventing a background push from getting the
> > > context lock to do the actual push.
> > > 
> > 
> > More related to the next patch, but what prevents a similar but
> > generally unbound concurrent workload from exceeding the new hard limit
> > once transactions start to block post commit?
> 
> The new code, like the original code, is not actually a "hard" limit.
> It essentially just throttles ongoing work until the CIL push
> starts. In this case, it forces the current process to give up the
> CPU immediately once over the CIL high limit, which allows the
> workqueue to run the push work straight away.
> 
> I thought about making it a "hard limit" by blocking before the CIL
> insert, but there's no guarantee that by the time we get woken and
> add the new commit to the CIL that this new context has not already
> gone over the hard limit. i.e. we block the unbound concurrency
> before commit, then let it all go in a thundering herd on the new
> context and immediately punch it way over the hard threshold
> again. To avoid this, we'd probably need a CIL ticket and grant
> mechanism to make CIL insertion FIFO and wakeups limited by
> remaining space in the CIL. I'm not sure we actually need such a
> complex solution, especially considering the potential serialisation
> problems it introduces in what is a highly concurrent fast path...
> 

Ok. The latter is more of what I'd expect to see if the objective is
truly to hard cap CIL size, FWIW. This seems more reasonable if the
objective is to yield committers under overloaded CIL conditions, with
the caveat that CIL size is still not hard capped, so long as the
documentation and whatnot is updated to more accurately reflect the
implementation (and at a glance, it looks like that has happened in the
next version..).

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

Patch

diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h
index b880c23cb6e4..187a43ffeaf7 100644
--- a/fs/xfs/xfs_log_priv.h
+++ b/fs/xfs/xfs_log_priv.h
@@ -329,7 +329,8 @@ struct xfs_cil {
  * enforced to ensure we stay within our maximum checkpoint size bounds.
  * threshold, yet give us plenty of space for aggregation on large logs.
  */
-#define XLOG_CIL_SPACE_LIMIT(log)	(log->l_logsize >> 3)
+#define XLOG_CIL_SPACE_LIMIT(log)	\
+	min_t(int, (log)->l_logsize >> 3, BBTOB(XLOG_TOTAL_REC_SHIFT(log)) << 4)
 
 /*
  * ticket grant locks, queues and accounting have their own cachelines