From patchwork Mon Nov 14 07:27:08 2016
Date: Mon, 14 Nov 2016 18:27:08 +1100
From: Dave Chinner
To: Chris Mason
Cc: linux-xfs@vger.kernel.org
Subject: Re: [PATCH RFC] xfs: drop SYNC_WAIT from xfs_reclaim_inodes_ag during slab reclaim
Message-ID: <20161114072708.GN28922@dastard>
In-Reply-To: <20161114005951.GB2127@clm-mbp.thefacebook.com>
List-ID: linux-xfs@vger.kernel.org

On Sun, Nov 13, 2016 at 08:00:04PM -0500, Chris Mason wrote:
> On Tue, Oct 18, 2016 at 01:03:24PM +1100, Dave Chinner wrote:
> > [ Long stalls from xfs_reclaim_inodes_ag ]
>
> > XFS has *1* tunable that can change the behaviour of metadata
> > writeback. Please try it.
>
> [ weeks go by, so this email is insanely long ]
>
> Testing all of this was slow going because two of the three test
> boxes I had with the hadoop configuration started having hardware
> problems. The good news is that while I was adjusting the
> benchmark, we lined up access to a bunch of duplicate boxes, so I
> can now try ~20 different configurations in parallel.
>
> My rough benchmark is here:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/mason/simoop.git
>
> The command line I ended up using was:
>
> simoop -t 512 -m 190 -M 128 -C 28 -r 60000 -f 70 -T 20 -R1 -W 1 -i
> 60 -w 300 -D 2 /hammer/*

There's a lightly tested patch below that should do the trick. After
5 minutes with a modified simoop command line on a single filesystem,
4.9-rc4+for-next:

$ ./simoop -t 128 -m 50 -M 128 -C 14 -r 60000 -f 2 -T 20 -R1 -W 1 -i 60 -w 300 -D 2 /mnt/scratch
....
Run time: 300 seconds
Read latency (p50: 3,174,400) (p95: 4,530,176) (p99: 18,055,168)
Write latency (p50: 14,991,360) (p95: 28,672,000) (p99: 33,325,056)
Allocation latency (p50: 1,771,520) (p95: 17,530,880) (p99: 23,756,800)
work rate = 4.75/sec (avg 5.24/sec) (p50: 5.79) (p95: 6.99) (p99: 6.99)
alloc stall rate = 94.42/sec (avg: 51.63) (p50: 51.60) (p95: 59.12) (p99: 59.12)

With the patch below:

Run time: 300 seconds
Read latency (p50: 3,043,328) (p95: 3,649,536) (p99: 4,710,400)
Write latency (p50: 21,004,288) (p95: 27,557,888) (p99: 29,130,752)
Allocation latency (p50: 280,064) (p95: 680,960) (p99: 863,232)
work rate = 4.08/sec (avg 4.76/sec) (p50: 5.39) (p95: 6.93) (p99: 6.93)
alloc stall rate = 0.08/sec (avg: 0.02) (p50: 0.00) (p95: 0.01) (p99: 0.01)

The stall rate went to zero at the 120s mark of the warmup and stayed
there. Note the p99 difference for read and allocation latency, too.

I'll post some graphs tomorrow from my live PCP telemetry that
demonstrate the difference in behaviour better than any words...

Cheers,

Dave.
diff --git a/fs/super.c b/fs/super.c
index c183835566c1..9fff5630b12d 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -102,8 +102,14 @@ static unsigned long super_cache_scan(struct shrinker *shrink,
 	freed += prune_icache_sb(sb, sc);
 
 	if (fs_objects) {
+		unsigned long	ret;
+
 		sc->nr_to_scan = fs_objects + 1;
-		freed += sb->s_op->free_cached_objects(sb, sc);
+		ret = sb->s_op->free_cached_objects(sb, sc);
+		if (ret == SHRINK_STOP)
+			freed = SHRINK_STOP;
+		else
+			freed += ret;
 	}
 
 	up_read(&sb->s_umount);
diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index 9c3e5c6ddf20..65c0f79f3edc 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -975,7 +975,7 @@ xfs_reclaim_inode(
 	error = 0;
 	xfs_ilock(ip, XFS_ILOCK_EXCL);
 	if (!xfs_iflock_nowait(ip)) {
-		if (!(sync_mode & SYNC_WAIT))
+		if (sync_mode & SYNC_TRYLOCK)
 			goto out;
 		xfs_iflock(ip);
 	}
@@ -987,7 +987,7 @@ xfs_reclaim_inode(
 		goto reclaim;
 	}
 	if (xfs_ipincount(ip)) {
-		if (!(sync_mode & SYNC_WAIT))
+		if (sync_mode & SYNC_TRYLOCK)
 			goto out_ifunlock;
 		xfs_iunpin_wait(ip);
 	}
@@ -1103,7 +1103,7 @@ xfs_reclaim_inode(
  * then a shut down during filesystem unmount reclaim walk leak all the
  * unreclaimed inodes.
  */
-STATIC int
+STATIC long
 xfs_reclaim_inodes_ag(
 	struct xfs_mount	*mp,
 	int			flags,
@@ -1113,18 +1113,22 @@ xfs_reclaim_inodes_ag(
 	int			error = 0;
 	int			last_error = 0;
 	xfs_agnumber_t		ag;
-	int			trylock = flags & SYNC_TRYLOCK;
+	int			trylock;
 	int			skipped;
+	int			dirty_ags;
 
 restart:
+	trylock = flags & SYNC_TRYLOCK;
 	ag = 0;
 	skipped = 0;
+	dirty_ags = 0;
 	while ((pag = xfs_perag_get_tag(mp, ag, XFS_ICI_RECLAIM_TAG))) {
 		unsigned long	first_index = 0;
 		int		done = 0;
 		int		nr_found = 0;
 
 		ag = pag->pag_agno + 1;
+		dirty_ags++;
 
 		if (trylock) {
 			if (!mutex_trylock(&pag->pag_ici_reclaim_lock)) {
@@ -1132,10 +1136,16 @@ xfs_reclaim_inodes_ag(
 				xfs_perag_put(pag);
 				continue;
 			}
-			first_index = pag->pag_ici_reclaim_cursor;
 		} else
 			mutex_lock(&pag->pag_ici_reclaim_lock);
 
+		/*
+		 * Always start from the last scanned inode so that we don't
+		 * block on inodes that a previous iteration just flushed.
+		 * Iterate over the entire inode range before coming back to
+		 * skipped/dirty inodes.
+		 */
+		first_index = pag->pag_ici_reclaim_cursor;
 		do {
 			struct xfs_inode *batch[XFS_LOOKUP_BATCH];
 			int	i;
@@ -1201,23 +1211,31 @@ xfs_reclaim_inodes_ag(
 
 		} while (nr_found && !done && *nr_to_scan > 0);
 
-		if (trylock && !done)
-			pag->pag_ici_reclaim_cursor = first_index;
-		else
-			pag->pag_ici_reclaim_cursor = 0;
+		if (done)
+			first_index = 0;
+		pag->pag_ici_reclaim_cursor = first_index;
 		mutex_unlock(&pag->pag_ici_reclaim_lock);
 		xfs_perag_put(pag);
 	}
 
 	/*
-	 * if we skipped any AG, and we still have scan count remaining, do
+	 * If we skipped all AGs because they are locked, we've reached maximum
+	 * reclaim concurrency. At this point there is no point in having more
+	 * attempts to shrink the cache from this context. Tell the shrinker to
+	 * stop and defer the reclaim work till later.
+	 */
+	if (skipped && skipped == dirty_ags)
+		return SHRINK_STOP;
+
+	/*
+	 * If we skipped any AG, and we still have scan count remaining, do
 	 * another pass this time using blocking reclaim semantics (i.e
 	 * waiting on the reclaim locks and ignoring the reclaim cursors). This
 	 * ensure that when we get more reclaimers than AGs we block rather
 	 * than spin trying to execute reclaim.
 	 */
 	if (skipped && (flags & SYNC_WAIT) && *nr_to_scan > 0) {
-		trylock = 0;
+		flags &= ~SYNC_TRYLOCK;
 		goto restart;
 	}
 	return last_error;
@@ -1229,8 +1247,12 @@ xfs_reclaim_inodes(
 	int		mode)
 {
 	int		nr_to_scan = INT_MAX;
+	long		ret;
 
-	return xfs_reclaim_inodes_ag(mp, mode, &nr_to_scan);
+	ret = xfs_reclaim_inodes_ag(mp, mode, &nr_to_scan);
+	if (ret == SHRINK_STOP)
+		return 0;
+	return ret;
 }
 
 /*
@@ -1241,17 +1263,28 @@ xfs_reclaim_inodes(
  * reclaim of inodes. That means if we come across dirty inodes, we wait for
  * them to be cleaned, which we hope will not be very long due to the
  * background walker having already kicked the IO off on those dirty inodes.
+ *
+ * Also, treat kswapd specially - we really want it to run asynchronously and
+ * not block on dirty inodes, unlike direct reclaim that we can tolerate
+ * blocking and some amount of IO latency. If we start to overload the reclaim
+ * subsystem with too many direct reclaimers, it will start returning
+ * SHRINK_STOP to tell the mm subsystem to defer the work rather than continuing
+ * to call us and forcing us to block.
  */
 long
 xfs_reclaim_inodes_nr(
 	struct xfs_mount	*mp,
 	int			nr_to_scan)
 {
+	int			flags = SYNC_TRYLOCK;
+
 	/* kick background reclaimer and push the AIL */
 	xfs_reclaim_work_queue(mp);
 	xfs_ail_push_all(mp->m_ail);
 
-	return xfs_reclaim_inodes_ag(mp, SYNC_TRYLOCK | SYNC_WAIT, &nr_to_scan);
+	if (!current_is_kswapd())
+		flags |= SYNC_WAIT;
+	return xfs_reclaim_inodes_ag(mp, flags, &nr_to_scan);
 }
 
 /*