From patchwork Mon Nov 14 07:27:08 2016
Date: Mon, 14 Nov 2016 18:27:08 +1100
From: Dave Chinner
To: Chris Mason
Cc: linux-xfs@vger.kernel.org
Subject: Re: [PATCH RFC] xfs: drop SYNC_WAIT from xfs_reclaim_inodes_ag during slab reclaim
Message-ID: <20161114072708.GN28922@dastard>
In-Reply-To: <20161114005951.GB2127@clm-mbp.thefacebook.com>
List-ID: linux-xfs@vger.kernel.org

On Sun, Nov 13, 2016 at 08:00:04PM -0500, Chris Mason wrote:
> On Tue, Oct 18, 2016 at 01:03:24PM +1100, Dave Chinner wrote:
> > [ Long stalls from xfs_reclaim_inodes_ag ]
>
> > XFS has *1* tunable that can change the behaviour of metadata
> > writeback. Please try it.
>
> [ weeks go by, so this email is insanely long ]
>
> Testing all of this was slow going because two of the three test
> boxes I had with the hadoop configuration started having hardware
> problems. The good news is that while I was adjusting the
> benchmark, we lined up access to a bunch of duplicate boxes, so I
> can now try ~20 different configurations in parallel.
>
> My rough benchmark is here:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/mason/simoop.git
>
> The command line I ended up using was:
>
> simoop -t 512 -m 190 -M 128 -C 28 -r 60000 -f 70 -T 20 -R1 -W 1 -i
> 60 -w 300 -D 2 /hammer/*

There's a lightly tested patch below that should do the trick. After
5 minutes with a modified simoop command line on a single filesystem,
4.9-rc4+for-next:

$ ./simoop -t 128 -m 50 -M 128 -C 14 -r 60000 -f 2 -T 20 -R1 -W 1 -i 60 -w 300 -D 2 /mnt/scratch
....
Run time: 300 seconds
Read latency (p50: 3,174,400) (p95: 4,530,176) (p99: 18,055,168)
Write latency (p50: 14,991,360) (p95: 28,672,000) (p99: 33,325,056)
Allocation latency (p50: 1,771,520) (p95: 17,530,880) (p99: 23,756,800)
work rate = 4.75/sec (avg 5.24/sec) (p50: 5.79) (p95: 6.99) (p99: 6.99)
alloc stall rate = 94.42/sec (avg: 51.63) (p50: 51.60) (p95: 59.12) (p99: 59.12)

With the patch below:

Run time: 300 seconds
Read latency (p50: 3,043,328) (p95: 3,649,536) (p99: 4,710,400)
Write latency (p50: 21,004,288) (p95: 27,557,888) (p99: 29,130,752)
Allocation latency (p50: 280,064) (p95: 680,960) (p99: 863,232)
work rate = 4.08/sec (avg 4.76/sec) (p50: 5.39) (p95: 6.93) (p99: 6.93)
alloc stall rate = 0.08/sec (avg: 0.02) (p50: 0.00) (p95: 0.01) (p99: 0.01)

The stall rate went to zero at the 120s mark of the warmup and stayed
there. Note the p99 difference for read and allocation latency, too.

I'll post some graphs tomorrow from my live PCP telemetry that
demonstrate the difference in behaviour better than any words...

Cheers,

Dave.
diff --git a/fs/super.c b/fs/super.c
index c183835566c1..9fff5630b12d 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -102,8 +102,14 @@ static unsigned long super_cache_scan(struct shrinker *shrink,
 	freed += prune_icache_sb(sb, sc);
 
 	if (fs_objects) {
+		unsigned long	ret;
+
 		sc->nr_to_scan = fs_objects + 1;
-		freed += sb->s_op->free_cached_objects(sb, sc);
+		ret = sb->s_op->free_cached_objects(sb, sc);
+		if (ret == SHRINK_STOP)
+			freed = SHRINK_STOP;
+		else
+			freed += ret;
 	}
 
 	up_read(&sb->s_umount);
diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index 9c3e5c6ddf20..65c0f79f3edc 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -975,7 +975,7 @@ xfs_reclaim_inode(
 	error = 0;
 	xfs_ilock(ip, XFS_ILOCK_EXCL);
 	if (!xfs_iflock_nowait(ip)) {
-		if (!(sync_mode & SYNC_WAIT))
+		if (sync_mode & SYNC_TRYLOCK)
 			goto out;
 		xfs_iflock(ip);
 	}
@@ -987,7 +987,7 @@ xfs_reclaim_inode(
 		goto reclaim;
 	}
 	if (xfs_ipincount(ip)) {
-		if (!(sync_mode & SYNC_WAIT))
+		if (sync_mode & SYNC_TRYLOCK)
 			goto out_ifunlock;
 		xfs_iunpin_wait(ip);
 	}
@@ -1103,7 +1103,7 @@ xfs_reclaim_inode(
  * then a shut down during filesystem unmount reclaim walk leak all the
  * unreclaimed inodes.
  */
-STATIC int
+STATIC long
 xfs_reclaim_inodes_ag(
 	struct xfs_mount	*mp,
 	int			flags,
@@ -1113,18 +1113,22 @@ xfs_reclaim_inodes_ag(
 	int			error = 0;
 	int			last_error = 0;
 	xfs_agnumber_t		ag;
-	int			trylock = flags & SYNC_TRYLOCK;
+	int			trylock;
 	int			skipped;
+	int			dirty_ags;
 
 restart:
+	trylock = flags & SYNC_TRYLOCK;
 	ag = 0;
 	skipped = 0;
+	dirty_ags = 0;
 	while ((pag = xfs_perag_get_tag(mp, ag, XFS_ICI_RECLAIM_TAG))) {
 		unsigned long	first_index = 0;
 		int		done = 0;
 		int		nr_found = 0;
 
 		ag = pag->pag_agno + 1;
+		dirty_ags++;
 
 		if (trylock) {
 			if (!mutex_trylock(&pag->pag_ici_reclaim_lock)) {
@@ -1132,10 +1136,16 @@ xfs_reclaim_inodes_ag(
 				xfs_perag_put(pag);
 				continue;
 			}
-			first_index = pag->pag_ici_reclaim_cursor;
 		} else
 			mutex_lock(&pag->pag_ici_reclaim_lock);
 
+		/*
+		 * Always start from the last scanned inode so that we don't
+		 * block on inodes that a previous iteration just flushed.
+		 * Iterate over the entire inode range before coming back to
+		 * skipped/dirty inodes.
+		 */
+		first_index = pag->pag_ici_reclaim_cursor;
 		do {
 			struct xfs_inode *batch[XFS_LOOKUP_BATCH];
 			int	i;
@@ -1201,23 +1211,31 @@ xfs_reclaim_inodes_ag(
 
 		} while (nr_found && !done && *nr_to_scan > 0);
 
-		if (trylock && !done)
-			pag->pag_ici_reclaim_cursor = first_index;
-		else
-			pag->pag_ici_reclaim_cursor = 0;
+		if (done)
+			first_index = 0;
+		pag->pag_ici_reclaim_cursor = first_index;
 		mutex_unlock(&pag->pag_ici_reclaim_lock);
 		xfs_perag_put(pag);
 	}
 
 	/*
-	 * if we skipped any AG, and we still have scan count remaining, do
+	 * If we skipped all AGs because they are locked, we've reached maximum
+	 * reclaim concurrency. At this point there is no point in having more
+	 * attempts to shrink the cache from this context. Tell the shrinker to
+	 * stop and defer the reclaim work till later.
+	 */
+	if (skipped && skipped == dirty_ags)
+		return SHRINK_STOP;
+
+	/*
+	 * If we skipped any AG, and we still have scan count remaining, do
 	 * another pass this time using blocking reclaim semantics (i.e
 	 * waiting on the reclaim locks and ignoring the reclaim cursors). This
 	 * ensure that when we get more reclaimers than AGs we block rather
 	 * than spin trying to execute reclaim.
 	 */
 	if (skipped && (flags & SYNC_WAIT) && *nr_to_scan > 0) {
-		trylock = 0;
+		flags &= ~SYNC_TRYLOCK;
 		goto restart;
 	}
 	return last_error;
@@ -1229,8 +1247,12 @@ xfs_reclaim_inodes(
 	int		mode)
 {
 	int		nr_to_scan = INT_MAX;
+	long		ret;
 
-	return xfs_reclaim_inodes_ag(mp, mode, &nr_to_scan);
+	ret = xfs_reclaim_inodes_ag(mp, mode, &nr_to_scan);
+	if (ret == SHRINK_STOP)
+		return 0;
+	return ret;
 }
 
 /*
@@ -1241,17 +1263,28 @@ xfs_reclaim_inodes(
  * reclaim of inodes. That means if we come across dirty inodes, we wait for
  * them to be cleaned, which we hope will not be very long due to the
  * background walker having already kicked the IO off on those dirty inodes.
+ *
+ * Also, treat kswapd specially - we really want it to run asynchronously and
+ * not block on dirty inodes, unlike direct reclaim that we can tolerate
+ * blocking and some amount of IO latency. If we start to overload the reclaim
+ * subsystem with too many direct reclaimers, it will start returning
+ * SHRINK_STOP to tell the mm subsystem to defer the work rather than continuing
+ * to call us and forcing us to block.
  */
 long
 xfs_reclaim_inodes_nr(
 	struct xfs_mount	*mp,
 	int			nr_to_scan)
 {
+	int			flags = SYNC_TRYLOCK;
+
 	/* kick background reclaimer and push the AIL */
 	xfs_reclaim_work_queue(mp);
 	xfs_ail_push_all(mp->m_ail);
 
-	return xfs_reclaim_inodes_ag(mp, SYNC_TRYLOCK | SYNC_WAIT, &nr_to_scan);
+	if (!current_is_kswapd())
+		flags |= SYNC_WAIT;
+	return xfs_reclaim_inodes_ag(mp, flags, &nr_to_scan);
 }
 
 /*