[16/24] xfs: Lower CIL flush limit for large logs

Message ID	20190801021752.4986-17-david@fromorbit.com (mailing list archive)
State	New, archived
Headers
Return-Path: <linux-fsdevel-owner@kernel.org>
From: Dave Chinner <david@fromorbit.com>
To: linux-xfs@vger.kernel.org
Cc: linux-mm@kvack.org, linux-fsdevel@vger.kernel.org
Subject: [PATCH 16/24] xfs: Lower CIL flush limit for large logs
Date: Thu, 1 Aug 2019 12:17:44 +1000
Message-Id: <20190801021752.4986-17-david@fromorbit.com>
In-Reply-To: <20190801021752.4986-1-david@fromorbit.com>
References: <20190801021752.4986-1-david@fromorbit.com>
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
Sender: linux-fsdevel-owner@vger.kernel.org
Precedence: bulk

Series	mm, xfs: non-blocking inode reclaim
[RFC,00/24] mm, xfs: non-blocking inode reclaim
[01/24] mm: directed shrinker work deferral
[02/24] shrinkers: use will_defer for GFP_NOFS sensitive shrinkers
[03/24] mm: factor shrinker work calculations
[04/24] shrinker: defer work only to kswapd
[05/24] shrinker: clean up variable types and tracepoints
[06/24] mm: reclaim_state records pages reclaimed, not slabs
[07/24] mm: back off direct reclaim on excessive shrinker deferral
[08/24] mm: kswapd backoff for shrinkers
[09/24] xfs: don't allow log IO to be throttled
[10/24] xfs: fix missed wakeup on l_flush_wait
[11/24] xfs:: account for memory freed from metadata buffers
[12/24] xfs: correctly acount for reclaimable slabs
[13/24] xfs: synchronous AIL pushing
[14/24] xfs: tail updates only need to occur when LSN changes
[15/24] xfs: eagerly free shadow buffers to reduce CIL footprint
[16/24] xfs: Lower CIL flush limit for large logs
[17/24] xfs: don't block kswapd in inode reclaim
[18/24] xfs: reduce kswapd blocking on inode locking.
[19/24] xfs: kill background reclaim work
[20/24] xfs: use AIL pushing for inode reclaim IO
[21/24] xfs: remove mode from xfs_reclaim_inodes()
[22/24] xfs: track reclaimable inodes using a LRU list
[23/24] xfs: reclaim inodes from the LRU
[24/24] xfs: remove unusued old inode reclaim code

Commit Message

Dave Chinner Aug. 1, 2019, 2:17 a.m. UTC

From: Dave Chinner <dchinner@redhat.com>

The current CIL size aggregation limit is 1/8th the log size. This
means for large logs we might be aggregating at least 250MB of dirty objects
in memory before the CIL is flushed to the journal. With CIL shadow
buffers sitting around, this means the CIL is often consuming >500MB
of temporary memory that is all allocated under GFP_NOFS conditions.

Flushing the CIL can take some time to do if there is other IO
ongoing, and can introduce substantial log force latency by itself.
It also pins the memory until the objects are in the AIL and can be
written back and reclaimed by shrinkers. Hence this threshold also
tends to determine the minimum amount of memory XFS can operate in
under heavy modification without triggering the OOM killer.

Modify the CIL space limit to prevent such huge amounts of pinned
metadata from aggregating. We can have 2MB of log IO in flight at once,
so limit aggregation to 8x this size (arbitrary). This has some
impact on performance (5-10% decrease on 16-way fsmark) and
increases the amount of log traffic (~50% on same workload) but it
is necessary to prevent rampant OOM killing under workloads that
modify large amounts of metadata under heavy memory pressure.

This was found via trace analysis of AIL behaviour. e.g. insertion
from a single CIL flush:

xfs_ail_insert: old lsn 0/0 new lsn 1/3033090 type XFS_LI_INODE flags IN_AIL

$ grep xfs_ail_insert /mnt/scratch/s.t |grep "new lsn 1/3033090" |wc -l
1721823
$

So there were 1.7 million objects inserted into the AIL from this
CIL checkpoint, the first at 2323.392108, the last at 2325.667566 which
was the end of the trace (i.e. it hadn't finished). Clearly a major
problem.

XXX: Need to try bigger sizes to see where the performance/stability
boundary lies to see if some of the losses can be regained and log
bandwidth increases minimised.

XXX: Ideally this threshold should slide with memory pressure. We
can allow large amounts of metadata to build up when there is no
memory pressure, but then close the window as memory pressure builds
up to reduce the footprint of the CIL until memory pressure passes.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_log_priv.h | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

Comments

Nikolay Borisov Aug. 4, 2019, 5:12 p.m. UTC | #1

On 1.08.19 at 5:17, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> The current CIL size aggregation limit is 1/8th the log size. This
> means for large logs we might be aggregating at least 250MB of dirty objects
> in memory before the CIL is flushed to the journal. With CIL shadow
> buffers sitting around, this means the CIL is often consuming >500MB
> of temporary memory that is all allocated under GFP_NOFS conditions.
> 
> FLushing the CIL can take some time to do if there is other IO
> ongoing, and can introduce substantial log force latency by itself.
> It also pins the memory until the objects are in the AIL and can be
> written back and reclaimed by shrinkers. Hence this threshold also
> tends to determine the minimum amount of memory XFS can operate in
> under heavy modification without triggering the OOM killer.
> 
> Modify the CIL space limit to prevent such huge amounts of pinned
> metadata from aggregating. We can 2MB of log IO in flight at once,
There is a word missing between 'can' and '2MB'
> so limit aggregation to 8x this size (arbitrary). This has some
> impact on performance (5-10% decrease on 16-way fsmark) and
> increases the amount of log traffic (~50% on same workload) but it
> is necessary to prevent rampant OOM killing under iworkloads that
> modify large amounts of metadata under heavy memory pressure.
> 
> This was found via trace analysis or AIL behaviour. e.g. insertion
s/or/of/

> from a single CIL flush:

<snip>

Patch

diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h
index b880c23cb6e4..87c6191daef7 100644
--- a/fs/xfs/xfs_log_priv.h
+++ b/fs/xfs/xfs_log_priv.h
@@ -329,7 +329,8 @@  struct xfs_cil {
  * enforced to ensure we stay within our maximum checkpoint size bounds.
  * threshold, yet give us plenty of space for aggregation on large logs.
  */
-#define XLOG_CIL_SPACE_LIMIT(log)	(log->l_logsize >> 3)
+#define XLOG_CIL_SPACE_LIMIT(log)	\
+	min_t(int, (log)->l_logsize >> 3, BBTOB(XLOG_TOTAL_REC_SHIFT(log)) << 3)
 
 /*
  * ticket grant locks, queues and accounting have their own cachlines
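
To make the arithmetic in the commit message concrete, here is a minimal userspace sketch of the old and new limits, worked in bytes. It is not kernel code. The `sketch_*` names are hypothetical, and the constants (8 in-core log buffers of 256KB each) are assumptions chosen to match the "2MB of log IO in flight" figure quoted above.

```c
#include <stdint.h>

/* Assumed values: 8 in-core log buffers of 256KB (1 << 18 bytes) each. */
#define SKETCH_MAX_ICLOGS		8
#define SKETCH_MAX_RECORD_BSHIFT	18

/* Bytes of log IO that can be in flight at once: 8 * 256KB = 2MB. */
static int64_t sketch_inflight_bytes(void)
{
	return (int64_t)SKETCH_MAX_ICLOGS << SKETCH_MAX_RECORD_BSHIFT;
}

/* Old behaviour: aggregate up to 1/8th of the log size. */
static int64_t sketch_old_limit(int64_t logsize)
{
	return logsize >> 3;
}

/*
 * New behaviour as described in the commit message: still 1/8th of the
 * log, but capped at 8x the in-flight log IO size (8 * 2MB = 16MB).
 */
static int64_t sketch_new_limit(int64_t logsize)
{
	int64_t eighth = logsize >> 3;
	int64_t cap = sketch_inflight_bytes() << 3;

	return eighth < cap ? eighth : cap;
}
```

Under these assumptions, logs of 128MB or smaller keep the 1/8th limit, while larger logs are capped at 16MB. A 2GB log, for example, drops from a 256MB aggregation limit to 16MB.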
