From patchwork Wed Oct 9 03:20:59 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Dave Chinner X-Patchwork-Id: 11180403 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 4A4121709 for ; Wed, 9 Oct 2019 03:22:08 +0000 (UTC) Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by mail.kernel.org (Postfix) with ESMTP id 0A9A0206C2 for ; Wed, 9 Oct 2019 03:22:08 +0000 (UTC) DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org 0A9A0206C2 Authentication-Results: mail.kernel.org; dmarc=none (p=none dis=none) header.from=fromorbit.com Authentication-Results: mail.kernel.org; spf=pass smtp.mailfrom=owner-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix) id DE76A8E0015; Tue, 8 Oct 2019 23:21:34 -0400 (EDT) Delivered-To: linux-mm-outgoing@kvack.org Received: by kanga.kvack.org (Postfix, from userid 40) id B254F8E001A; Tue, 8 Oct 2019 23:21:34 -0400 (EDT) X-Original-To: int-list-linux-mm@kvack.org X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 5E9B58E0010; Tue, 8 Oct 2019 23:21:34 -0400 (EDT) X-Original-To: linux-mm@kvack.org X-Delivered-To: linux-mm@kvack.org Received: from forelay.hostedemail.com (smtprelay0192.hostedemail.com [216.40.44.192]) by kanga.kvack.org (Postfix) with ESMTP id 19C1D8E0014 for ; Tue, 8 Oct 2019 23:21:34 -0400 (EDT) Received: from smtpin19.hostedemail.com (10.5.19.251.rfc1918.com [10.5.19.251]) by forelay03.hostedemail.com (Postfix) with SMTP id B2873824CA3B for ; Wed, 9 Oct 2019 03:21:33 +0000 (UTC) X-FDA: 76022796066.19.ring32_2aa243ef1800a X-Spam-Summary: 
40,2.5,0,a5c526419065748a,d41d8cd98f00b204,david@fromorbit.com,:linux-xfs@vger.kernel.org::linux-fsdevel@vger.kernel.org,RULES_HIT:41:355:379:541:800:960:966:973:988:989:1260:1261:1311:1314:1345:1359:1437:1515:1535:1544:1605:1711:1730:1747:1777:1792:2196:2198:2199:2200:2307:2393:2553:2559:2562:2693:2731:2903:2914:3138:3139:3140:3141:3142:3653:3865:3866:3867:3868:3870:3871:3872:3874:4250:4321:4385:5007:6119:6261:7576:7875:7903:8603:9036:10011:11026:11473:11658:11914:12043:12049:12291:12295:12296:12297:12438:12517:12519:12555:12679:12895:12986:13146:13161:13229:13230:13869:13894:14181:14394:14721:21080:21450:21451:21627:21740:30005:30054:30070:30085:30090,0,RBL:211.29.132.246:@fromorbit.com:.lbl8.mailshell.net-62.8.32.100 66.201.201.201,CacheIP:none,Bayesian:0.5,0.5,0.5,Netcheck:none,DomainCache:0,MSF:not bulk,SPF:fn,MSBL:0,DNSBL:neutral,Custom_rules:0:1:0,LFtime:2,LUA_SUMMARY:none X-HE-Tag: ring32_2aa243ef1800a X-Filterd-Recvd-Size: 5980 Received: from mail104.syd.optusnet.com.au (mail104.syd.optusnet.com.au [211.29.132.246]) by imf09.hostedemail.com (Postfix) with ESMTP for ; Wed, 9 Oct 2019 03:21:32 +0000 (UTC) Received: from dread.disaster.area (pa49-181-226-196.pa.nsw.optusnet.com.au [49.181.226.196]) by mail104.syd.optusnet.com.au (Postfix) with ESMTPS id 3F99E43EC20; Wed, 9 Oct 2019 14:21:27 +1100 (AEDT) Received: from discord.disaster.area ([192.168.253.110]) by dread.disaster.area with esmtp (Exim 4.92.2) (envelope-from ) id 1iI2XW-0006B2-La; Wed, 09 Oct 2019 14:21:26 +1100 Received: from dave by discord.disaster.area with local (Exim 4.92) (envelope-from ) id 1iI2XW-00038w-JB; Wed, 09 Oct 2019 14:21:26 +1100 From: Dave Chinner To: linux-xfs@vger.kernel.org Cc: linux-mm@kvack.org, linux-fsdevel@vger.kernel.org Subject: [PATCH 01/26] xfs: Lower CIL flush limit for large logs Date: Wed, 9 Oct 2019 14:20:59 +1100 Message-Id: <20191009032124.10541-2-david@fromorbit.com> X-Mailer: git-send-email 2.23.0.rc1 In-Reply-To: <20191009032124.10541-1-david@fromorbit.com> 
References: <20191009032124.10541-1-david@fromorbit.com> MIME-Version: 1.0 X-Optus-CM-Score: 0 X-Optus-CM-Analysis: v=2.2 cv=P6RKvmIu c=1 sm=1 tr=0 a=dRuLqZ1tmBNts2YiI0zFQg==:117 a=dRuLqZ1tmBNts2YiI0zFQg==:17 a=jpOVt7BSZ2e4Z31A5e1TngXxSK0=:19 a=XobE76Q3jBoA:10 a=20KFwNOVAAAA:8 a=UuDO-KUg2xYWwgiXHJ0A:9 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: From: Dave Chinner The current CIL size aggregation limit is 1/8th the log size. This means for large logs we might be aggregating at least 250MB of dirty objects in memory before the CIL is flushed to the journal. With CIL shadow buffers sitting around, this means the CIL is often consuming >500MB of temporary memory that is all allocated under GFP_NOFS conditions. Flushing the CIL can take some time to do if there is other IO ongoing, and can introduce substantial log force latency by itself. It also pins the memory until the objects are in the AIL and can be written back and reclaimed by shrinkers. Hence this threshold also tends to determine the minimum amount of memory XFS can operate in under heavy modification without triggering the OOM killer. Modify the CIL space limit to prevent such huge amounts of pinned metadata from aggregating. We can have 2MB of log IO in flight at once, so limit aggregation to 16x this size. This threshold was chosen as it has little impact on performance (on 16-way fsmark) or log traffic but pins a lot less memory on large logs, especially under heavy memory pressure. An aggregation limit of 8x had 5-10% performance degradation and a 50% increase in log throughput for the same workload, so clearly that was too small for highly concurrent workloads on large logs. This was found via trace analysis of AIL behaviour. e.g. 
insertion from a single CIL flush: xfs_ail_insert: old lsn 0/0 new lsn 1/3033090 type XFS_LI_INODE flags IN_AIL $ grep xfs_ail_insert /mnt/scratch/s.t |grep "new lsn 1/3033090" |wc -l 1721823 $ So there were 1.7 million objects inserted into the AIL from this CIL checkpoint, the first at 2323.392108, the last at 2325.667566 which was the end of the trace (i.e. it hadn't finished). Clearly a major problem. Signed-off-by: Dave Chinner Reviewed-by: Brian Foster Reviewed-by: Darrick J. Wong --- fs/xfs/xfs_log_priv.h | 29 +++++++++++++++++++++++------ 1 file changed, 23 insertions(+), 6 deletions(-) diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h index b880c23cb6e4..a3cc8a9a16d9 100644 --- a/fs/xfs/xfs_log_priv.h +++ b/fs/xfs/xfs_log_priv.h @@ -323,13 +323,30 @@ struct xfs_cil { * tries to keep 25% of the log free, so we need to keep below that limit or we * risk running out of free log space to start any new transactions. * - * In order to keep background CIL push efficient, we will set a lower - * threshold at which background pushing is attempted without blocking current - * transaction commits. A separate, higher bound defines when CIL pushes are - * enforced to ensure we stay within our maximum checkpoint size bounds. - * threshold, yet give us plenty of space for aggregation on large logs. + * In order to keep background CIL push efficient, we only need to ensure the + * CIL is large enough to maintain sufficient in-memory relogging to avoid + * repeated physical writes of frequently modified metadata. If we allow the CIL + * to grow to a substantial fraction of the log, then we may be pinning hundreds + * of megabytes of metadata in memory until the CIL flushes. This can cause + * issues when we are running low on memory - pinned memory cannot be reclaimed, + * and the CIL consumes a lot of memory. 
Hence we need to set an upper physical + * size limit for the CIL that limits the maximum amount of memory pinned by the + * CIL but does not limit performance by reducing relogging efficiency + * significantly. + * + * As such, the CIL push threshold ends up being the smaller of two thresholds: + * - a threshold large enough that it allows CIL to be pushed and progress to be + * made without excessive blocking of incoming transaction commits. This is + * defined to be 12.5% of the log space - half the 25% push threshold of the + * AIL. + * - small enough that it doesn't pin excessive amounts of memory but maintains + * close to peak relogging efficiency. This is defined to be 16x the iclog + * buffer window (32MB) as measurements have shown this to be roughly the + * point of diminishing performance increases under highly concurrent + * modification workloads. */ -#define XLOG_CIL_SPACE_LIMIT(log) (log->l_logsize >> 3) +#define XLOG_CIL_SPACE_LIMIT(log) \ + min_t(int, (log)->l_logsize >> 3, BBTOB(XLOG_TOTAL_REC_SHIFT(log)) << 4) /* * ticket grant locks, queues and accounting have their own cachlines From patchwork Wed Oct 9 03:21:00 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Dave Chinner X-Patchwork-Id: 11180395 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id F37551709 for ; Wed, 9 Oct 2019 03:21:57 +0000 (UTC) Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by mail.kernel.org (Postfix) with ESMTP id C0F15206C2 for ; Wed, 9 Oct 2019 03:21:57 +0000 (UTC) DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org C0F15206C2 Authentication-Results: mail.kernel.org; dmarc=none (p=none dis=none) header.from=fromorbit.com Authentication-Results: mail.kernel.org; spf=pass smtp.mailfrom=owner-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix) id 10F678E0011; Tue, 
8 Oct 2019 23:21:34 -0400 (EDT) Delivered-To: linux-mm-outgoing@kvack.org Received: by kanga.kvack.org (Postfix, from userid 40) id 095878E0013; Tue, 8 Oct 2019 23:21:33 -0400 (EDT) X-Original-To: int-list-linux-mm@kvack.org X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id DFFF18E0011; Tue, 8 Oct 2019 23:21:33 -0400 (EDT) X-Original-To: linux-mm@kvack.org X-Delivered-To: linux-mm@kvack.org Received: from forelay.hostedemail.com (smtprelay0227.hostedemail.com [216.40.44.227]) by kanga.kvack.org (Postfix) with ESMTP id BF7478E000D for ; Tue, 8 Oct 2019 23:21:33 -0400 (EDT) Received: from smtpin04.hostedemail.com (10.5.19.251.rfc1918.com [10.5.19.251]) by forelay03.hostedemail.com (Postfix) with SMTP id 5D15F824CA36 for ; Wed, 9 Oct 2019 03:21:33 +0000 (UTC) X-FDA: 76022796066.04.gold76_2a9583b51bb61 X-Spam-Summary: 2,0,0,eb49432ef4000535,d41d8cd98f00b204,david@fromorbit.com,:linux-xfs@vger.kernel.org::linux-fsdevel@vger.kernel.org,RULES_HIT:2:41:355:379:541:800:960:973:988:989:1260:1261:1311:1314:1345:1359:1437:1515:1535:1605:1730:1747:1777:1792:2393:2559:2562:2693:3138:3139:3140:3141:3142:3865:3866:3867:3868:3870:3871:3872:3874:4050:4119:4250:4321:4384:5007:6119:6261:6630:7576:7875:7903:8603:9010:9036:9389:10004:11026:11473:11658:11914:12043:12291:12296:12297:12438:12517:12519:12555:12679:12683:12895:12986:13161:13229:13894:14394:21080:21324:21433:21450:21451:21627:21740:30034:30054:30069:30070:30085,0,RBL:211.29.132.249:@fromorbit.com:.lbl8.mailshell.net-62.8.32.100 66.201.201.201,CacheIP:none,Bayesian:0.5,0.5,0.5,Netcheck:none,DomainCache:0,MSF:not bulk,SPF:fn,MSBL:0,DNSBL:neutral,Custom_rules:0:0:0,LFtime:24,LUA_SUMMARY:none X-HE-Tag: gold76_2a9583b51bb61 X-Filterd-Recvd-Size: 8710 Received: from mail105.syd.optusnet.com.au (mail105.syd.optusnet.com.au [211.29.132.249]) by imf42.hostedemail.com (Postfix) with ESMTP for ; Wed, 9 Oct 2019 03:21:32 +0000 (UTC) Received: from dread.disaster.area 
(pa49-181-226-196.pa.nsw.optusnet.com.au [49.181.226.196]) by mail105.syd.optusnet.com.au (Postfix) with ESMTPS id 596B7363311; Wed, 9 Oct 2019 14:21:27 +1100 (AEDT) Received: from discord.disaster.area ([192.168.253.110]) by dread.disaster.area with esmtp (Exim 4.92.2) (envelope-from ) id 1iI2XW-0006B4-M7; Wed, 09 Oct 2019 14:21:26 +1100 Received: from dave by discord.disaster.area with local (Exim 4.92) (envelope-from ) id 1iI2XW-00038y-KH; Wed, 09 Oct 2019 14:21:26 +1100 From: Dave Chinner To: linux-xfs@vger.kernel.org Cc: linux-mm@kvack.org, linux-fsdevel@vger.kernel.org Subject: [PATCH 02/26] xfs: Throttle commits on delayed background CIL push Date: Wed, 9 Oct 2019 14:21:00 +1100 Message-Id: <20191009032124.10541-3-david@fromorbit.com> X-Mailer: git-send-email 2.23.0.rc1 In-Reply-To: <20191009032124.10541-1-david@fromorbit.com> References: <20191009032124.10541-1-david@fromorbit.com> MIME-Version: 1.0 X-Optus-CM-Score: 0 X-Optus-CM-Analysis: v=2.2 cv=FNpr/6gs c=1 sm=1 tr=0 a=dRuLqZ1tmBNts2YiI0zFQg==:117 a=dRuLqZ1tmBNts2YiI0zFQg==:17 a=jpOVt7BSZ2e4Z31A5e1TngXxSK0=:19 a=XobE76Q3jBoA:10 a=20KFwNOVAAAA:8 a=tJydQp0-tQu7GIGPLyIA:9 a=6FibWmxOnplWDJ0V:21 a=ecaMPYEfmtohdEYC:21 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: From: Dave Chinner In certain situations the background CIL push can be indefinitely delayed. While we have workarounds for the obvious cases now, they don't solve the underlying issue. The issue is that there is no upper limit on the CIL where we will either force or wait for a background push to start, hence allowing the CIL to grow without bound until it consumes all log space. To fix this, add a new wait queue to the CIL which allows background pushes to wait for the CIL context to be switched out. This happens when the push starts, so it will allow us to block incoming transaction commit completion until the push has started. 
This will only affect processes that are running modifications, and only when the CIL threshold has been significantly overrun. This has no apparent impact on performance, and doesn't even trigger until over 45 million inodes had been created in a 16-way fsmark test on a 2GB log. That was limiting at 64MB of log space used, so the active CIL size is only about 3% of the total log in that case. The concurrent removal of those files did not trigger the background sleep at all. Signed-off-by: Dave Chinner --- fs/xfs/xfs_log_cil.c | 37 +++++++++++++++++++++++++++++++++---- fs/xfs/xfs_log_priv.h | 24 ++++++++++++++++++++++++ fs/xfs/xfs_trace.h | 1 + 3 files changed, 58 insertions(+), 4 deletions(-) diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c index ef652abd112c..4a09d50e1368 100644 --- a/fs/xfs/xfs_log_cil.c +++ b/fs/xfs/xfs_log_cil.c @@ -670,6 +670,11 @@ xlog_cil_push( push_seq = cil->xc_push_seq; ASSERT(push_seq <= ctx->sequence); + /* + * Wake up any background push waiters now this context is being pushed. + */ + wake_up_all(&ctx->push_wait); + /* * Check if we've anything to push. If there is nothing, then we don't * move on to a new sequence number and so we have to be able to push @@ -746,6 +751,7 @@ xlog_cil_push( */ INIT_LIST_HEAD(&new_ctx->committing); INIT_LIST_HEAD(&new_ctx->busy_extents); + init_waitqueue_head(&new_ctx->push_wait); new_ctx->sequence = ctx->sequence + 1; new_ctx->cil = cil; cil->xc_ctx = new_ctx; @@ -900,7 +906,7 @@ xlog_cil_push_work( */ static void xlog_cil_push_background( - struct xlog *log) + struct xlog *log) __releases(cil->xc_ctx_lock) { struct xfs_cil *cil = log->l_cilp; @@ -914,14 +920,36 @@ xlog_cil_push_background( * don't do a background push if we haven't used up all the * space available yet. 
*/ - if (cil->xc_ctx->space_used < XLOG_CIL_SPACE_LIMIT(log)) + if (cil->xc_ctx->space_used < XLOG_CIL_SPACE_LIMIT(log)) { + up_read(&cil->xc_ctx_lock); return; + } spin_lock(&cil->xc_push_lock); if (cil->xc_push_seq < cil->xc_current_sequence) { cil->xc_push_seq = cil->xc_current_sequence; queue_work(log->l_mp->m_cil_workqueue, &cil->xc_push_work); } + + /* + * Drop the context lock now, we can't hold that if we need to sleep + * because we are over the blocking threshold. The push_lock is still + * held, so blocking threshold sleep/wakeup is still correctly + * serialised here. + */ + up_read(&cil->xc_ctx_lock); + + /* + * If we are well over the space limit, throttle the work that is being + * done until the push work on this context has begun. + */ + if (cil->xc_ctx->space_used >= XLOG_CIL_BLOCKING_SPACE_LIMIT(log)) { + trace_xfs_log_cil_wait(log, cil->xc_ctx->ticket); + ASSERT(cil->xc_ctx->space_used < log->l_logsize); + xlog_wait(&cil->xc_ctx->push_wait, &cil->xc_push_lock); + return; + } + spin_unlock(&cil->xc_push_lock); } @@ -1038,9 +1066,9 @@ xfs_log_commit_cil( if (lip->li_ops->iop_committing) lip->li_ops->iop_committing(lip, xc_commit_lsn); } - xlog_cil_push_background(log); - up_read(&cil->xc_ctx_lock); + /* xlog_cil_push_background() releases cil->xc_ctx_lock */ + xlog_cil_push_background(log); } /* @@ -1199,6 +1227,7 @@ xlog_cil_init( INIT_LIST_HEAD(&ctx->committing); INIT_LIST_HEAD(&ctx->busy_extents); + init_waitqueue_head(&ctx->push_wait); ctx->sequence = 1; ctx->cil = cil; cil->xc_ctx = ctx; diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h index a3cc8a9a16d9..f231b7dfaeab 100644 --- a/fs/xfs/xfs_log_priv.h +++ b/fs/xfs/xfs_log_priv.h @@ -247,6 +247,7 @@ struct xfs_cil_ctx { struct xfs_log_vec *lv_chain; /* logvecs being pushed */ struct list_head iclog_entry; struct list_head committing; /* ctx committing list */ + wait_queue_head_t push_wait; /* background push throttle */ struct work_struct discard_endio_work; }; @@ -344,10 +345,33 @@ 
struct xfs_cil { * buffer window (32MB) as measurements have shown this to be roughly the * point of diminishing performance increases under highly concurrent * modification workloads. + * + * To prevent the CIL from overflowing upper commit size bounds, we introduce a + * new threshold at which we block committing transactions until the background + * CIL commit commences and switches to a new context. While this is not a hard + * limit, it forces the process committing a transaction to the CIL to block and + * yield the CPU, giving the CIL push work a chance to be scheduled and start + * work. This prevents a process running lots of transactions from overfilling + * the CIL because it is not yielding the CPU. We set the blocking limit at + * twice the background push space threshold so we keep in line with the AIL + * push thresholds. + * + * Note: this is not a -hard- limit as blocking is applied after the transaction + * is inserted into the CIL and the push has been triggered. It is largely a + * throttling mechanism that allows the CIL push to be scheduled and run. A hard + * limit will be difficult to implement without introducing global serialisation + * in the CIL commit fast path, and it's not at all clear that we actually need + * such hard limits given the ~7 years we've run without a hard limit before + * finding the first situation where a checkpoint size overflow actually + * occurred. Hence the simple throttle, and an ASSERT check to tell us that + * we've overrun the max size. */ #define XLOG_CIL_SPACE_LIMIT(log) \ min_t(int, (log)->l_logsize >> 3, BBTOB(XLOG_TOTAL_REC_SHIFT(log)) << 4) +#define XLOG_CIL_BLOCKING_SPACE_LIMIT(log) \ + (XLOG_CIL_SPACE_LIMIT(log) * 2) + /* * ticket grant locks, queues and accounting have their own cachlines * as these are quite hot and can be operated on concurrently. 
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h index eaae275ed430..e7087ede2662 100644 --- a/fs/xfs/xfs_trace.h +++ b/fs/xfs/xfs_trace.h @@ -1011,6 +1011,7 @@ DEFINE_LOGGRANT_EVENT(xfs_log_regrant_reserve_sub); DEFINE_LOGGRANT_EVENT(xfs_log_ungrant_enter); DEFINE_LOGGRANT_EVENT(xfs_log_ungrant_exit); DEFINE_LOGGRANT_EVENT(xfs_log_ungrant_sub); +DEFINE_LOGGRANT_EVENT(xfs_log_cil_wait); DECLARE_EVENT_CLASS(xfs_log_item_class, TP_PROTO(struct xfs_log_item *lip), From patchwork Wed Oct 9 03:21:01 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Dave Chinner X-Patchwork-Id: 11180397 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 85B161709 for ; Wed, 9 Oct 2019 03:22:00 +0000 (UTC) Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by mail.kernel.org (Postfix) with ESMTP id 52E0F206C2 for ; Wed, 9 Oct 2019 03:22:00 +0000 (UTC) DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org 52E0F206C2 Authentication-Results: mail.kernel.org; dmarc=none (p=none dis=none) header.from=fromorbit.com Authentication-Results: mail.kernel.org; spf=pass smtp.mailfrom=owner-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix) id 5EE068E0014; Tue, 8 Oct 2019 23:21:34 -0400 (EDT) Delivered-To: linux-mm-outgoing@kvack.org Received: by kanga.kvack.org (Postfix, from userid 40) id 59EBE8E0013; Tue, 8 Oct 2019 23:21:34 -0400 (EDT) X-Original-To: int-list-linux-mm@kvack.org X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 109748E0010; Tue, 8 Oct 2019 23:21:34 -0400 (EDT) X-Original-To: linux-mm@kvack.org X-Delivered-To: linux-mm@kvack.org Received: from forelay.hostedemail.com (smtprelay0116.hostedemail.com [216.40.44.116]) by kanga.kvack.org (Postfix) with ESMTP id D28868E0010 for ; Tue, 8 Oct 2019 23:21:33 -0400 (EDT) Received: 
from smtpin24.hostedemail.com (10.5.19.251.rfc1918.com [10.5.19.251]) by forelay02.hostedemail.com (Postfix) with SMTP id 742D94408 for ; Wed, 9 Oct 2019 03:21:33 +0000 (UTC) X-FDA: 76022796066.24.bread67_2a9f0d2e17445 X-Spam-Summary: 2,0,0,13675744e57fe269,d41d8cd98f00b204,david@fromorbit.com,:linux-xfs@vger.kernel.org::linux-fsdevel@vger.kernel.org,RULES_HIT:41:355:379:541:800:960:973:988:989:1260:1261:1311:1314:1345:1359:1437:1515:1534:1542:1711:1730:1747:1777:1792:2393:2559:2562:2693:3138:3139:3140:3141:3142:3354:3865:3866:3867:3868:3871:3872:3874:4250:5007:6261:7576:7903:10004:11026:11473:11658:11914:12043:12296:12297:12517:12519:12555:12679:12895:13161:13229:13869:13894:14096:14181:14394:14721:21080:21627:21740:21939:30005:30054:30091,0,RBL:211.29.132.249:@fromorbit.com:.lbl8.mailshell.net-62.8.32.100 66.201.201.201,CacheIP:none,Bayesian:0.5,0.5,0.5,Netcheck:none,DomainCache:0,MSF:not bulk,SPF:fn,MSBL:0,DNSBL:neutral,Custom_rules:0:0:0,LFtime:24,LUA_SUMMARY:none X-HE-Tag: bread67_2a9f0d2e17445 X-Filterd-Recvd-Size: 3693 Received: from mail105.syd.optusnet.com.au (mail105.syd.optusnet.com.au [211.29.132.249]) by imf41.hostedemail.com (Postfix) with ESMTP for ; Wed, 9 Oct 2019 03:21:32 +0000 (UTC) Received: from dread.disaster.area (pa49-181-226-196.pa.nsw.optusnet.com.au [49.181.226.196]) by mail105.syd.optusnet.com.au (Postfix) with ESMTPS id 6CC1C36271F; Wed, 9 Oct 2019 14:21:28 +1100 (AEDT) Received: from discord.disaster.area ([192.168.253.110]) by dread.disaster.area with esmtp (Exim 4.92.2) (envelope-from ) id 1iI2XW-0006B7-N8; Wed, 09 Oct 2019 14:21:26 +1100 Received: from dave by discord.disaster.area with local (Exim 4.92) (envelope-from ) id 1iI2XW-000391-LG; Wed, 09 Oct 2019 14:21:26 +1100 From: Dave Chinner To: linux-xfs@vger.kernel.org Cc: linux-mm@kvack.org, linux-fsdevel@vger.kernel.org Subject: [PATCH 03/26] xfs: don't allow log IO to be throttled Date: Wed, 9 Oct 2019 14:21:01 +1100 Message-Id: <20191009032124.10541-4-david@fromorbit.com> 
X-Mailer: git-send-email 2.23.0.rc1 In-Reply-To: <20191009032124.10541-1-david@fromorbit.com> References: <20191009032124.10541-1-david@fromorbit.com> MIME-Version: 1.0 X-Optus-CM-Score: 0 X-Optus-CM-Analysis: v=2.2 cv=P6RKvmIu c=1 sm=1 tr=0 a=dRuLqZ1tmBNts2YiI0zFQg==:117 a=dRuLqZ1tmBNts2YiI0zFQg==:17 a=jpOVt7BSZ2e4Z31A5e1TngXxSK0=:19 a=XobE76Q3jBoA:10 a=20KFwNOVAAAA:8 a=5HahVxdoFHTWBnBQlCYA:9 a=DiKeHqHhRZ4A:10 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: From: Dave Chinner Running metadata intensive workloads, I've been seeing the AIL pushing getting stuck on pinned buffers and triggering log forces. The log force is taking a long time to run because the log IO is getting throttled by wbt_wait() - the block layer writeback throttle. It's being throttled because there is a huge amount of metadata writeback going on which is filling the request queue. IOWs, we have a priority inversion problem here. Mark the log IO bios with REQ_IDLE so they don't get throttled by the block layer writeback throttle. When we are forcing the CIL, we are likely to need to do tens of log IOs, and they are issued as fast as they can be built and IO completed. Hence REQ_IDLE is appropriate - it's an indication that more IO will follow shortly. And because we also set REQ_SYNC, the writeback throttle will now treat log IO the same way it treats direct IO writes - it will not throttle them at all. Hence we solve the priority inversion problem caused by the writeback throttle being unable to distinguish between high priority log IO and background metadata writeback. Signed-off-by: Dave Chinner Reviewed-by: Christoph Hellwig Reviewed-by: Brian Foster Reviewed-by: Darrick J. 
Wong --- fs/xfs/xfs_log.c | 10 +++++++++- 1 file changed, 9 insertions(+), 1 deletion(-) diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c index 6f99d6eae6a4..cf098e19967e 100644 --- a/fs/xfs/xfs_log.c +++ b/fs/xfs/xfs_log.c @@ -1751,7 +1751,15 @@ xlog_write_iclog( iclog->ic_bio.bi_iter.bi_sector = log->l_logBBstart + bno; iclog->ic_bio.bi_end_io = xlog_bio_end_io; iclog->ic_bio.bi_private = iclog; - iclog->ic_bio.bi_opf = REQ_OP_WRITE | REQ_META | REQ_SYNC | REQ_FUA; + + /* + * We use REQ_SYNC | REQ_IDLE here to tell the block layer there are more + * IOs coming immediately after this one. This prevents the block layer + * writeback throttle from throttling log writes behind background + * metadata writeback and causing priority inversions. + */ + iclog->ic_bio.bi_opf = REQ_OP_WRITE | REQ_META | REQ_SYNC | + REQ_IDLE | REQ_FUA; if (need_flush) iclog->ic_bio.bi_opf |= REQ_PREFLUSH; From patchwork Wed Oct 9 03:21:02 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Dave Chinner X-Patchwork-Id: 11180333 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 9DB441864 for ; Wed, 9 Oct 2019 03:21:46 +0000 (UTC) Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by mail.kernel.org (Postfix) with ESMTP id 7509921924 for ; Wed, 9 Oct 2019 03:21:46 +0000 (UTC) DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org 7509921924 Authentication-Results: mail.kernel.org; dmarc=none (p=none dis=none) header.from=fromorbit.com Authentication-Results: mail.kernel.org; spf=pass smtp.mailfrom=owner-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix) id F3B128E000C; Tue, 8 Oct 2019 23:21:32 -0400 (EDT) Delivered-To: linux-mm-outgoing@kvack.org Received: by kanga.kvack.org (Postfix, from userid 40) id EEA4F8E0008; Tue, 8 Oct 2019 23:21:32 -0400 (EDT) X-Original-To: int-list-linux-mm@kvack.org 
X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id D8A528E000E; Tue, 8 Oct 2019 23:21:32 -0400 (EDT) X-Original-To: linux-mm@kvack.org X-Delivered-To: linux-mm@kvack.org Received: from forelay.hostedemail.com (smtprelay0038.hostedemail.com [216.40.44.38]) by kanga.kvack.org (Postfix) with ESMTP id AAEED8E0008 for ; Tue, 8 Oct 2019 23:21:32 -0400 (EDT) Received: from smtpin17.hostedemail.com (10.5.19.251.rfc1918.com [10.5.19.251]) by forelay03.hostedemail.com (Postfix) with SMTP id 156CE824CA36 for ; Wed, 9 Oct 2019 03:21:32 +0000 (UTC) X-FDA: 76022796024.17.nest71_2a6ac8710d05e X-Spam-Summary: 2,0,0,4ffbd214bfd918ba,d41d8cd98f00b204,david@fromorbit.com,:linux-xfs@vger.kernel.org::linux-fsdevel@vger.kernel.org,RULES_HIT:41:355:379:541:800:960:966:973:988:989:1260:1261:1311:1314:1345:1359:1437:1515:1534:1541:1711:1730:1747:1777:1792:2196:2199:2393:2559:2562:2890:3138:3139:3140:3141:3142:3352:3865:3866:3867:3870:3872:3874:4321:4385:5007:6261:7576:7903:10004:10394:11026:11473:11658:11914:12043:12296:12297:12438:12517:12519:12555:12679:12895:13069:13161:13229:13311:13357:13894:14096:14181:14384:14394:14721:21080:21450:21451:21627:21789:30012:30054,0,RBL:211.29.132.249:@fromorbit.com:.lbl8.mailshell.net-62.8.32.100 66.201.201.201,CacheIP:none,Bayesian:0.5,0.5,0.5,Netcheck:none,DomainCache:0,MSF:not bulk,SPF:fn,MSBL:0,DNSBL:neutral,Custom_rules:0:0:0,LFtime:166,LUA_SUMMARY:none X-HE-Tag: nest71_2a6ac8710d05e X-Filterd-Recvd-Size: 3026 Received: from mail105.syd.optusnet.com.au (mail105.syd.optusnet.com.au [211.29.132.249]) by imf42.hostedemail.com (Postfix) with ESMTP for ; Wed, 9 Oct 2019 03:21:31 +0000 (UTC) Received: from dread.disaster.area (pa49-181-226-196.pa.nsw.optusnet.com.au [49.181.226.196]) by mail105.syd.optusnet.com.au (Postfix) with ESMTPS id 4133D362EDA; Wed, 9 Oct 2019 14:21:27 +1100 (AEDT) Received: from discord.disaster.area ([192.168.253.110]) by dread.disaster.area with esmtp (Exim 4.92.2) 
(envelope-from ) id 1iI2XW-0006B9-Nw; Wed, 09 Oct 2019 14:21:26 +1100 Received: from dave by discord.disaster.area with local (Exim 4.92) (envelope-from ) id 1iI2XW-000395-M9; Wed, 09 Oct 2019 14:21:26 +1100 From: Dave Chinner To: linux-xfs@vger.kernel.org Cc: linux-mm@kvack.org, linux-fsdevel@vger.kernel.org Subject: [PATCH 04/26] xfs: Improve metadata buffer reclaim accountability Date: Wed, 9 Oct 2019 14:21:02 +1100 Message-Id: <20191009032124.10541-5-david@fromorbit.com> X-Mailer: git-send-email 2.23.0.rc1 In-Reply-To: <20191009032124.10541-1-david@fromorbit.com> References: <20191009032124.10541-1-david@fromorbit.com> MIME-Version: 1.0 X-Optus-CM-Score: 0 X-Optus-CM-Analysis: v=2.2 cv=P6RKvmIu c=1 sm=1 tr=0 a=dRuLqZ1tmBNts2YiI0zFQg==:117 a=dRuLqZ1tmBNts2YiI0zFQg==:17 a=jpOVt7BSZ2e4Z31A5e1TngXxSK0=:19 a=XobE76Q3jBoA:10 a=20KFwNOVAAAA:8 a=1dk79d6Hl8FtNpQQbMkA:9 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: From: Dave Chinner The buffer cache shrinker frees more than just the xfs_buf slab objects - it also frees the pages attached to the buffers. Make sure the memory reclaim code accounts for this memory being freed correctly, similar to how the inode shrinker accounts for pages freed from the page cache due to mapping invalidation. We also need to make sure that the mm subsystem knows these are reclaimable objects. We provide the memory reclaim subsystem with a shrinker to reclaim xfs_bufs, so we should really mark the slab that way. We also have a lot of xfs_bufs in a busy system, so spread them around like we do inodes. 
Signed-off-by: Dave Chinner --- fs/xfs/xfs_buf.c | 6 +++++- 1 file changed, 5 insertions(+), 1 deletion(-) diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c index e484f6bead53..45b470f55ad7 100644 --- a/fs/xfs/xfs_buf.c +++ b/fs/xfs/xfs_buf.c @@ -324,6 +324,9 @@ xfs_buf_free( __free_page(page); } + if (current->reclaim_state) + current->reclaim_state->reclaimed_slab += + bp->b_page_count; } else if (bp->b_flags & _XBF_KMEM) kmem_free(bp->b_addr); _xfs_buf_free_pages(bp); @@ -2064,7 +2067,8 @@ int __init xfs_buf_init(void) { xfs_buf_zone = kmem_zone_init_flags(sizeof(xfs_buf_t), "xfs_buf", - KM_ZONE_HWALIGN, NULL); + KM_ZONE_HWALIGN | KM_ZONE_SPREAD | KM_ZONE_RECLAIM, + NULL); if (!xfs_buf_zone) goto out; From patchwork Wed Oct 9 03:21:03 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Dave Chinner X-Patchwork-Id: 11180425 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 7776D1668 for ; Wed, 9 Oct 2019 03:50:55 +0000 (UTC) Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by mail.kernel.org (Postfix) with ESMTP id 407FA2133F for ; Wed, 9 Oct 2019 03:50:55 +0000 (UTC) DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org 407FA2133F Authentication-Results: mail.kernel.org; dmarc=none (p=none dis=none) header.from=fromorbit.com Authentication-Results: mail.kernel.org; spf=pass smtp.mailfrom=owner-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix) id 6E6B48E0005; Tue, 8 Oct 2019 23:50:54 -0400 (EDT) Delivered-To: linux-mm-outgoing@kvack.org Received: by kanga.kvack.org (Postfix, from userid 40) id 66FEE8E0003; Tue, 8 Oct 2019 23:50:54 -0400 (EDT) X-Original-To: int-list-linux-mm@kvack.org X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 55DFF8E0005; Tue, 8 Oct 2019 23:50:54 -0400 (EDT) X-Original-To: 
linux-mm@kvack.org X-Delivered-To: linux-mm@kvack.org Received: from forelay.hostedemail.com (smtprelay0087.hostedemail.com [216.40.44.87]) by kanga.kvack.org (Postfix) with ESMTP id 2E9098E0003 for ; Tue, 8 Oct 2019 23:50:54 -0400 (EDT) Received: from smtpin02.hostedemail.com (10.5.19.251.rfc1918.com [10.5.19.251]) by forelay02.hostedemail.com (Postfix) with SMTP id C1F634403 for ; Wed, 9 Oct 2019 03:50:53 +0000 (UTC) X-FDA: 76022869986.02.pipe08_7c18ead97e2f X-Spam-Summary: 2,0,0,6973fd83dcd6ea7b,d41d8cd98f00b204,david@fromorbit.com,:linux-xfs@vger.kernel.org::linux-fsdevel@vger.kernel.org,RULES_HIT:41:355:379:541:800:960:973:988:989:1260:1261:1311:1314:1345:1359:1437:1515:1534:1540:1711:1714:1730:1747:1777:1792:2198:2199:2393:2559:2562:3138:3139:3140:3141:3142:3351:3865:3866:3872:4250:4321:5007:6119:6261:7576:7903:7974:10004:11026:11473:11658:11914:12043:12297:12438:12517:12519:12555:12679:12895:13069:13161:13229:13311:13357:13894:14096:14181:14384:14394:14721:21080:21325:21627:21773:30054,0,RBL:211.29.132.246:@fromorbit.com:.lbl8.mailshell.net-62.8.32.100 66.201.201.201,CacheIP:none,Bayesian:0.5,0.5,0.5,Netcheck:none,DomainCache:0,MSF:not bulk,SPF:fn,MSBL:0,DNSBL:neutral,Custom_rules:0:0:0,LFtime:27,LUA_SUMMARY:none X-HE-Tag: pipe08_7c18ead97e2f X-Filterd-Recvd-Size: 2456 Received: from mail104.syd.optusnet.com.au (mail104.syd.optusnet.com.au [211.29.132.246]) by imf18.hostedemail.com (Postfix) with ESMTP for ; Wed, 9 Oct 2019 03:50:53 +0000 (UTC) Received: from dread.disaster.area (pa49-181-226-196.pa.nsw.optusnet.com.au [49.181.226.196]) by mail104.syd.optusnet.com.au (Postfix) with ESMTPS id 5626343ECA0; Wed, 9 Oct 2019 14:21:28 +1100 (AEDT) Received: from discord.disaster.area ([192.168.253.110]) by dread.disaster.area with esmtp (Exim 4.92.2) (envelope-from ) id 1iI2XW-0006BA-Op; Wed, 09 Oct 2019 14:21:26 +1100 Received: from dave by discord.disaster.area with local (Exim 4.92) (envelope-from ) id 1iI2XW-000397-Mz; Wed, 09 Oct 2019 14:21:26 +1100 From: 
Dave Chinner To: linux-xfs@vger.kernel.org Cc: linux-mm@kvack.org, linux-fsdevel@vger.kernel.org Subject: [PATCH 05/26] xfs: correctly account for reclaimable slabs Date: Wed, 9 Oct 2019 14:21:03 +1100 Message-Id: <20191009032124.10541-6-david@fromorbit.com> X-Mailer: git-send-email 2.23.0.rc1 In-Reply-To: <20191009032124.10541-1-david@fromorbit.com> References: <20191009032124.10541-1-david@fromorbit.com> MIME-Version: 1.0 X-Optus-CM-Score: 0 X-Optus-CM-Analysis: v=2.2 cv=D+Q3ErZj c=1 sm=1 tr=0 a=dRuLqZ1tmBNts2YiI0zFQg==:117 a=dRuLqZ1tmBNts2YiI0zFQg==:17 a=jpOVt7BSZ2e4Z31A5e1TngXxSK0=:19 a=XobE76Q3jBoA:10 a=20KFwNOVAAAA:8 a=q-nAjRHzglZ9esTleQAA:9 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: From: Dave Chinner The XFS inode item slab is actually reclaimed by inode shrinker callbacks from the memory reclaim subsystem. It should be marked as reclaimable so the mm subsystem has the full picture of how much memory it can actually reclaim from the XFS slab caches. Signed-off-by: Dave Chinner Reviewed-by: Brian Foster Reviewed-by: Darrick J.
Wong --- fs/xfs/xfs_super.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c index 8d1df9f8be07..f0aff1f034e6 100644 --- a/fs/xfs/xfs_super.c +++ b/fs/xfs/xfs_super.c @@ -1919,7 +1919,7 @@ xfs_init_zones(void) xfs_ili_zone = kmem_zone_init_flags(sizeof(xfs_inode_log_item_t), "xfs_ili", - KM_ZONE_SPREAD, NULL); + KM_ZONE_SPREAD | KM_ZONE_RECLAIM, NULL); if (!xfs_ili_zone) goto out_destroy_inode_zone; xfs_icreate_zone = kmem_zone_init(sizeof(struct xfs_icreate_item), From patchwork Wed Oct 9 03:21:04 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Dave Chinner X-Patchwork-Id: 11180343 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 8E8081864 for ; Wed, 9 Oct 2019 03:21:48 +0000 (UTC) Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by mail.kernel.org (Postfix) with ESMTP id 65D08206C2 for ; Wed, 9 Oct 2019 03:21:48 +0000 (UTC) DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org 65D08206C2 Authentication-Results: mail.kernel.org; dmarc=none (p=none dis=none) header.from=fromorbit.com Authentication-Results: mail.kernel.org; spf=pass smtp.mailfrom=owner-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix) id 32DD48E000A; Tue, 8 Oct 2019 23:21:33 -0400 (EDT) Delivered-To: linux-mm-outgoing@kvack.org Received: by kanga.kvack.org (Postfix, from userid 40) id 2B8C68E0010; Tue, 8 Oct 2019 23:21:33 -0400 (EDT) X-Original-To: int-list-linux-mm@kvack.org X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id F38518E000A; Tue, 8 Oct 2019 23:21:32 -0400 (EDT) X-Original-To: linux-mm@kvack.org X-Delivered-To: linux-mm@kvack.org Received: from forelay.hostedemail.com (smtprelay0178.hostedemail.com [216.40.44.178]) by kanga.kvack.org (Postfix) with ESMTP id B43EB8E000C 
for ; Tue, 8 Oct 2019 23:21:32 -0400 (EDT) Received: from smtpin08.hostedemail.com (10.5.19.251.rfc1918.com [10.5.19.251]) by forelay04.hostedemail.com (Postfix) with SMTP id 50CCD40E1 for ; Wed, 9 Oct 2019 03:21:32 +0000 (UTC) X-FDA: 76022796024.08.plot41_2a6c49537ca42 X-Spam-Summary: 2,0,0,496977d011d85cd9,d41d8cd98f00b204,david@fromorbit.com,:linux-xfs@vger.kernel.org::linux-fsdevel@vger.kernel.org,RULES_HIT:41:69:355:379:541:800:960:968:973:988:989:1260:1261:1311:1314:1345:1359:1437:1515:1535:1543:1711:1730:1747:1777:1792:2393:2559:2562:3138:3139:3140:3141:3142:3354:3865:3866:3867:3874:4250:4321:5007:6119:6261:7576:7875:9592:10004:10128:11026:11473:11658:11914:12043:12291:12296:12297:12438:12517:12519:12555:12679:12683:12895:12986:13894:14096:14110:14181:14394:14721:21080:21324:21611:21627:30012:30054:30079,0,RBL:211.29.132.249:@fromorbit.com:.lbl8.mailshell.net-62.8.32.100 66.201.201.201,CacheIP:none,Bayesian:0.5,0.5,0.5,Netcheck:none,DomainCache:0,MSF:not bulk,SPF:fn,MSBL:0,DNSBL:neutral,Custom_rules:0:0:0,LFtime:2,LUA_SUMMARY:none X-HE-Tag: plot41_2a6c49537ca42 X-Filterd-Recvd-Size: 5324 Received: from mail105.syd.optusnet.com.au (mail105.syd.optusnet.com.au [211.29.132.249]) by imf41.hostedemail.com (Postfix) with ESMTP for ; Wed, 9 Oct 2019 03:21:31 +0000 (UTC) Received: from dread.disaster.area (pa49-181-226-196.pa.nsw.optusnet.com.au [49.181.226.196]) by mail105.syd.optusnet.com.au (Postfix) with ESMTPS id 55AAD3632BB; Wed, 9 Oct 2019 14:21:27 +1100 (AEDT) Received: from discord.disaster.area ([192.168.253.110]) by dread.disaster.area with esmtp (Exim 4.92.2) (envelope-from ) id 1iI2XW-0006BB-QB; Wed, 09 Oct 2019 14:21:26 +1100 Received: from dave by discord.disaster.area with local (Exim 4.92) (envelope-from ) id 1iI2XW-00039B-O4; Wed, 09 Oct 2019 14:21:26 +1100 From: Dave Chinner To: linux-xfs@vger.kernel.org Cc: linux-mm@kvack.org, linux-fsdevel@vger.kernel.org Subject: [PATCH 06/26] xfs: synchronous AIL pushing Date: Wed, 9 Oct 2019 14:21:04 +1100 
Message-Id: <20191009032124.10541-7-david@fromorbit.com> X-Mailer: git-send-email 2.23.0.rc1 In-Reply-To: <20191009032124.10541-1-david@fromorbit.com> References: <20191009032124.10541-1-david@fromorbit.com> MIME-Version: 1.0 X-Optus-CM-Score: 0 X-Optus-CM-Analysis: v=2.2 cv=FNpr/6gs c=1 sm=1 tr=0 a=dRuLqZ1tmBNts2YiI0zFQg==:117 a=dRuLqZ1tmBNts2YiI0zFQg==:17 a=jpOVt7BSZ2e4Z31A5e1TngXxSK0=:19 a=XobE76Q3jBoA:10 a=20KFwNOVAAAA:8 a=MYL0kNLdbJVOrCnBhk4A:9 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: From: Dave Chinner Factor the common AIL deletion code that does all the wakeups into a helper so we only have one copy of this somewhat tricky code to interface with all the wakeups necessary when the LSN of the log tail changes. Signed-off-by: Dave Chinner Reviewed-by: Christoph Hellwig --- fs/xfs/xfs_inode_item.c | 12 +---------- fs/xfs/xfs_trans_ail.c | 48 ++++++++++++++++++++++------------------- fs/xfs/xfs_trans_priv.h | 4 +++- 3 files changed, 30 insertions(+), 34 deletions(-) diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c index bb8f076805b9..ab12e526540a 100644 --- a/fs/xfs/xfs_inode_item.c +++ b/fs/xfs/xfs_inode_item.c @@ -743,17 +743,7 @@ xfs_iflush_done( xfs_clear_li_failed(blip); } } - - if (mlip_changed) { - if (!XFS_FORCED_SHUTDOWN(ailp->ail_mount)) - xlog_assign_tail_lsn_locked(ailp->ail_mount); - if (list_empty(&ailp->ail_head)) - wake_up_all(&ailp->ail_empty); - } - spin_unlock(&ailp->ail_lock); - - if (mlip_changed) - xfs_log_space_wake(ailp->ail_mount); + xfs_ail_update_finish(ailp, mlip_changed); } /* diff --git a/fs/xfs/xfs_trans_ail.c b/fs/xfs/xfs_trans_ail.c index 6ccfd75d3c24..656819523bbd 100644 --- a/fs/xfs/xfs_trans_ail.c +++ b/fs/xfs/xfs_trans_ail.c @@ -678,6 +678,27 @@ xfs_ail_push_all_sync( finish_wait(&ailp->ail_empty, &wait); } +void +xfs_ail_update_finish( + struct xfs_ail *ailp, + bool do_tail_update) 
__releases(ailp->ail_lock) +{ + struct xfs_mount *mp = ailp->ail_mount; + + if (!do_tail_update) { + spin_unlock(&ailp->ail_lock); + return; + } + + if (!XFS_FORCED_SHUTDOWN(mp)) + xlog_assign_tail_lsn_locked(mp); + + if (list_empty(&ailp->ail_head)) + wake_up_all(&ailp->ail_empty); + spin_unlock(&ailp->ail_lock); + xfs_log_space_wake(mp); +} + /* * xfs_trans_ail_update - bulk AIL insertion operation. * @@ -737,15 +758,7 @@ xfs_trans_ail_update_bulk( if (!list_empty(&tmp)) xfs_ail_splice(ailp, cur, &tmp, lsn); - if (mlip_changed) { - if (!XFS_FORCED_SHUTDOWN(ailp->ail_mount)) - xlog_assign_tail_lsn_locked(ailp->ail_mount); - spin_unlock(&ailp->ail_lock); - - xfs_log_space_wake(ailp->ail_mount); - } else { - spin_unlock(&ailp->ail_lock); - } + xfs_ail_update_finish(ailp, mlip_changed); } bool @@ -789,10 +802,10 @@ void xfs_trans_ail_delete( struct xfs_ail *ailp, struct xfs_log_item *lip, - int shutdown_type) __releases(ailp->ail_lock) + int shutdown_type) { struct xfs_mount *mp = ailp->ail_mount; - bool mlip_changed; + bool need_update; if (!test_bit(XFS_LI_IN_AIL, &lip->li_flags)) { spin_unlock(&ailp->ail_lock); @@ -805,17 +818,8 @@ xfs_trans_ail_delete( return; } - mlip_changed = xfs_ail_delete_one(ailp, lip); - if (mlip_changed) { - if (!XFS_FORCED_SHUTDOWN(mp)) - xlog_assign_tail_lsn_locked(mp); - if (list_empty(&ailp->ail_head)) - wake_up_all(&ailp->ail_empty); - } - - spin_unlock(&ailp->ail_lock); - if (mlip_changed) - xfs_log_space_wake(ailp->ail_mount); + need_update = xfs_ail_delete_one(ailp, lip); + xfs_ail_update_finish(ailp, need_update); } int diff --git a/fs/xfs/xfs_trans_priv.h b/fs/xfs/xfs_trans_priv.h index 2e073c1c4614..64ffa746730e 100644 --- a/fs/xfs/xfs_trans_priv.h +++ b/fs/xfs/xfs_trans_priv.h @@ -92,8 +92,10 @@ xfs_trans_ail_update( } bool xfs_ail_delete_one(struct xfs_ail *ailp, struct xfs_log_item *lip); +void xfs_ail_update_finish(struct xfs_ail *ailp, bool do_tail_update) + __releases(ailp->ail_lock); void xfs_trans_ail_delete(struct 
xfs_ail *ailp, struct xfs_log_item *lip, - int shutdown_type) __releases(ailp->ail_lock); + int shutdown_type); static inline void xfs_trans_ail_remove( From patchwork Wed Oct 9 03:21:05 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Dave Chinner X-Patchwork-Id: 11180411 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id B26FA1709 for ; Wed, 9 Oct 2019 03:22:18 +0000 (UTC) Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by mail.kernel.org (Postfix) with ESMTP id 7D222206C2 for ; Wed, 9 Oct 2019 03:22:18 +0000 (UTC) DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org 7D222206C2 Authentication-Results: mail.kernel.org; dmarc=none (p=none dis=none) header.from=fromorbit.com Authentication-Results: mail.kernel.org; spf=pass smtp.mailfrom=owner-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix) id 7D15C8E0018; Tue, 8 Oct 2019 23:21:35 -0400 (EDT) Delivered-To: linux-mm-outgoing@kvack.org Received: by kanga.kvack.org (Postfix, from userid 40) id 3099A8E0019; Tue, 8 Oct 2019 23:21:35 -0400 (EDT) X-Original-To: int-list-linux-mm@kvack.org X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 12EFC8E0016; Tue, 8 Oct 2019 23:21:35 -0400 (EDT) X-Original-To: linux-mm@kvack.org X-Delivered-To: linux-mm@kvack.org Received: from forelay.hostedemail.com (smtprelay0171.hostedemail.com [216.40.44.171]) by kanga.kvack.org (Postfix) with ESMTP id A50CA8E000D for ; Tue, 8 Oct 2019 23:21:34 -0400 (EDT) Received: from smtpin28.hostedemail.com (10.5.19.251.rfc1918.com [10.5.19.251]) by forelay04.hostedemail.com (Postfix) with SMTP id 418B14408 for ; Wed, 9 Oct 2019 03:21:34 +0000 (UTC) X-FDA: 76022796108.28.drop83_2ab60e20f1005 X-Spam-Summary: 
2,0,0,6ef9e4321ce38677,d41d8cd98f00b204,david@fromorbit.com,:linux-xfs@vger.kernel.org::linux-fsdevel@vger.kernel.org,RULES_HIT:2:41:69:355:379:541:800:960:968:973:988:989:1260:1261:1311:1314:1345:1359:1437:1515:1535:1605:1730:1747:1777:1792:2198:2199:2393:2553:2559:2562:2693:2731:2898:3138:3139:3140:3141:3142:3865:3866:3867:3868:3870:3871:3872:3874:4049:4118:4250:4321:5007:6117:6119:6120:6261:7576:7901:7903:9389:10004:11026:11473:11658:11914:12043:12291:12296:12297:12438:12517:12519:12555:12679:12683:12895:12986:13161:13229:13894:14096:14394:21080:21324:21627:30012:30034:30054:30070:30090,0,RBL:211.29.132.249:@fromorbit.com:.lbl8.mailshell.net-62.8.32.100 66.201.201.201,CacheIP:none,Bayesian:0.5,0.5,0.5,Netcheck:none,DomainCache:0,MSF:not bulk,SPF:fn,MSBL:0,DNSBL:neutral,Custom_rules:0:0:0,LFtime:1,LUA_SUMMARY:none X-HE-Tag: drop83_2ab60e20f1005 X-Filterd-Recvd-Size: 7961 Received: from mail105.syd.optusnet.com.au (mail105.syd.optusnet.com.au [211.29.132.249]) by imf38.hostedemail.com (Postfix) with ESMTP for ; Wed, 9 Oct 2019 03:21:33 +0000 (UTC) Received: from dread.disaster.area (pa49-181-226-196.pa.nsw.optusnet.com.au [49.181.226.196]) by mail105.syd.optusnet.com.au (Postfix) with ESMTPS id 79E96362EB8; Wed, 9 Oct 2019 14:21:27 +1100 (AEDT) Received: from discord.disaster.area ([192.168.253.110]) by dread.disaster.area with esmtp (Exim 4.92.2) (envelope-from ) id 1iI2XW-0006BK-RT; Wed, 09 Oct 2019 14:21:26 +1100 Received: from dave by discord.disaster.area with local (Exim 4.92) (envelope-from ) id 1iI2XW-00039E-PS; Wed, 09 Oct 2019 14:21:26 +1100 From: Dave Chinner To: linux-xfs@vger.kernel.org Cc: linux-mm@kvack.org, linux-fsdevel@vger.kernel.org Subject: [PATCH 07/26] xfs: tail updates only need to occur when LSN changes Date: Wed, 9 Oct 2019 14:21:05 +1100 Message-Id: <20191009032124.10541-8-david@fromorbit.com> X-Mailer: git-send-email 2.23.0.rc1 In-Reply-To: <20191009032124.10541-1-david@fromorbit.com> References: 
<20191009032124.10541-1-david@fromorbit.com> MIME-Version: 1.0 X-Optus-CM-Score: 0 X-Optus-CM-Analysis: v=2.2 cv=D+Q3ErZj c=1 sm=1 tr=0 a=dRuLqZ1tmBNts2YiI0zFQg==:117 a=dRuLqZ1tmBNts2YiI0zFQg==:17 a=jpOVt7BSZ2e4Z31A5e1TngXxSK0=:19 a=XobE76Q3jBoA:10 a=20KFwNOVAAAA:8 a=W3SVq3cs_kfo-QxLsoMA:9 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: From: Dave Chinner We currently wake anything waiting on the log tail to move whenever the log item at the tail of the log is removed. Historically this was fine behaviour because there were very few items at any given LSN. But with delayed logging, there may be thousands of items at any given LSN, and we can't move the tail until they are all gone. Hence if we are removing them in near tail-first order, we might be waking up processes waiting on the tail LSN to change (e.g. log space waiters) repeatedly without them being able to make progress. This also occurs with the new sync push waiters, and can result in thousands of spurious wakeups every second when under heavy direct reclaim pressure. To fix this, check that the tail LSN has actually changed on the AIL before triggering wakeups. This will reduce the number of spurious wakeups when doing bulk AIL removal and make this code much more efficient. Signed-off-by: Dave Chinner Reviewed-by: Christoph Hellwig Reviewed-by: Brian Foster --- fs/xfs/xfs_inode_item.c | 18 ++++++++++---- fs/xfs/xfs_trans_ail.c | 52 ++++++++++++++++++++++++++++------------- fs/xfs/xfs_trans_priv.h | 4 ++-- 3 files changed, 51 insertions(+), 23 deletions(-) diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c index ab12e526540a..0d5eee456b0c 100644 --- a/fs/xfs/xfs_inode_item.c +++ b/fs/xfs/xfs_inode_item.c @@ -731,19 +731,27 @@ xfs_iflush_done( * holding the lock before removing the inode from the AIL. 
*/ if (need_ail) { - bool mlip_changed = false; + xfs_lsn_t tail_lsn = 0; /* this is an opencoded batch version of xfs_trans_ail_delete */ spin_lock(&ailp->ail_lock); list_for_each_entry(blip, &tmp, li_bio_list) { if (INODE_ITEM(blip)->ili_logged && - blip->li_lsn == INODE_ITEM(blip)->ili_flush_lsn) - mlip_changed |= xfs_ail_delete_one(ailp, blip); - else { + blip->li_lsn == INODE_ITEM(blip)->ili_flush_lsn) { + /* + * xfs_ail_update_finish() only cares about the + * lsn of the first tail item removed, any others + * will be at the same or higher lsn so we just + * ignore them. + */ + xfs_lsn_t lsn = xfs_ail_delete_one(ailp, blip); + if (!tail_lsn && lsn) + tail_lsn = lsn; + } else { xfs_clear_li_failed(blip); } } - xfs_ail_update_finish(ailp, mlip_changed); + xfs_ail_update_finish(ailp, tail_lsn); } /* diff --git a/fs/xfs/xfs_trans_ail.c b/fs/xfs/xfs_trans_ail.c index 656819523bbd..685a21cd24c0 100644 --- a/fs/xfs/xfs_trans_ail.c +++ b/fs/xfs/xfs_trans_ail.c @@ -108,17 +108,25 @@ xfs_ail_next( * We need the AIL lock in order to get a coherent read of the lsn of the last * item in the AIL. */ +static xfs_lsn_t +__xfs_ail_min_lsn( + struct xfs_ail *ailp) +{ + struct xfs_log_item *lip = xfs_ail_min(ailp); + + if (lip) + return lip->li_lsn; + return 0; +} + xfs_lsn_t xfs_ail_min_lsn( struct xfs_ail *ailp) { - xfs_lsn_t lsn = 0; - struct xfs_log_item *lip; + xfs_lsn_t lsn; spin_lock(&ailp->ail_lock); - lip = xfs_ail_min(ailp); - if (lip) - lsn = lip->li_lsn; + lsn = __xfs_ail_min_lsn(ailp); spin_unlock(&ailp->ail_lock); return lsn; @@ -681,11 +689,12 @@ xfs_ail_push_all_sync( void xfs_ail_update_finish( struct xfs_ail *ailp, - bool do_tail_update) __releases(ailp->ail_lock) + xfs_lsn_t old_lsn) __releases(ailp->ail_lock) { struct xfs_mount *mp = ailp->ail_mount; - if (!do_tail_update) { + /* if the tail lsn hasn't changed, don't do updates or wakeups. 
*/ + if (!old_lsn || old_lsn == __xfs_ail_min_lsn(ailp)) { spin_unlock(&ailp->ail_lock); return; } @@ -730,7 +739,7 @@ xfs_trans_ail_update_bulk( xfs_lsn_t lsn) __releases(ailp->ail_lock) { struct xfs_log_item *mlip; - int mlip_changed = 0; + xfs_lsn_t tail_lsn = 0; int i; LIST_HEAD(tmp); @@ -745,9 +754,10 @@ xfs_trans_ail_update_bulk( continue; trace_xfs_ail_move(lip, lip->li_lsn, lsn); + if (mlip == lip && !tail_lsn) + tail_lsn = lip->li_lsn; + xfs_ail_delete(ailp, lip); - if (mlip == lip) - mlip_changed = 1; } else { trace_xfs_ail_insert(lip, 0, lsn); } @@ -758,15 +768,23 @@ xfs_trans_ail_update_bulk( if (!list_empty(&tmp)) xfs_ail_splice(ailp, cur, &tmp, lsn); - xfs_ail_update_finish(ailp, mlip_changed); + xfs_ail_update_finish(ailp, tail_lsn); } -bool +/* + * Delete one log item from the AIL. + * + * If this item was at the tail of the AIL, return the LSN of the log item so + * that we can use it to check if the LSN of the tail of the log has moved + * when finishing up the AIL delete process in xfs_ail_update_finish(). 
+ */ +xfs_lsn_t xfs_ail_delete_one( struct xfs_ail *ailp, struct xfs_log_item *lip) { struct xfs_log_item *mlip = xfs_ail_min(ailp); + xfs_lsn_t lsn = lip->li_lsn; trace_xfs_ail_delete(lip, mlip->li_lsn, lip->li_lsn); xfs_ail_delete(ailp, lip); @@ -774,7 +792,9 @@ xfs_ail_delete_one( clear_bit(XFS_LI_IN_AIL, &lip->li_flags); lip->li_lsn = 0; - return mlip == lip; + if (mlip == lip) + return lsn; + return 0; } /** @@ -805,7 +825,7 @@ xfs_trans_ail_delete( int shutdown_type) { struct xfs_mount *mp = ailp->ail_mount; - bool need_update; + xfs_lsn_t tail_lsn; if (!test_bit(XFS_LI_IN_AIL, &lip->li_flags)) { spin_unlock(&ailp->ail_lock); @@ -818,8 +838,8 @@ xfs_trans_ail_delete( return; } - need_update = xfs_ail_delete_one(ailp, lip); - xfs_ail_update_finish(ailp, need_update); + tail_lsn = xfs_ail_delete_one(ailp, lip); + xfs_ail_update_finish(ailp, tail_lsn); } int diff --git a/fs/xfs/xfs_trans_priv.h b/fs/xfs/xfs_trans_priv.h index 64ffa746730e..35655eac01a6 100644 --- a/fs/xfs/xfs_trans_priv.h +++ b/fs/xfs/xfs_trans_priv.h @@ -91,8 +91,8 @@ xfs_trans_ail_update( xfs_trans_ail_update_bulk(ailp, NULL, &lip, 1, lsn); } -bool xfs_ail_delete_one(struct xfs_ail *ailp, struct xfs_log_item *lip); -void xfs_ail_update_finish(struct xfs_ail *ailp, bool do_tail_update) +xfs_lsn_t xfs_ail_delete_one(struct xfs_ail *ailp, struct xfs_log_item *lip); +void xfs_ail_update_finish(struct xfs_ail *ailp, xfs_lsn_t old_lsn) __releases(ailp->ail_lock); void xfs_trans_ail_delete(struct xfs_ail *ailp, struct xfs_log_item *lip, int shutdown_type); From patchwork Wed Oct 9 03:21:06 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Dave Chinner X-Patchwork-Id: 11180321 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 93BA61709 for ; Wed, 9 Oct 2019 03:21:44 +0000 (UTC) Received: from kanga.kvack.org 
(kanga.kvack.org [205.233.56.17]) by mail.kernel.org (Postfix) with ESMTP id 6AC3020B7C for ; Wed, 9 Oct 2019 03:21:44 +0000 (UTC) DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org 6AC3020B7C Authentication-Results: mail.kernel.org; dmarc=none (p=none dis=none) header.from=fromorbit.com Authentication-Results: mail.kernel.org; spf=pass smtp.mailfrom=owner-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix) id B4F798E0009; Tue, 8 Oct 2019 23:21:32 -0400 (EDT) Delivered-To: linux-mm-outgoing@kvack.org Received: by kanga.kvack.org (Postfix, from userid 40) id 9935E8E000A; Tue, 8 Oct 2019 23:21:32 -0400 (EDT) X-Original-To: int-list-linux-mm@kvack.org X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 8350C8E0008; Tue, 8 Oct 2019 23:21:32 -0400 (EDT) X-Original-To: linux-mm@kvack.org X-Delivered-To: linux-mm@kvack.org Received: from forelay.hostedemail.com (smtprelay0068.hostedemail.com [216.40.44.68]) by kanga.kvack.org (Postfix) with ESMTP id 573E78E0009 for ; Tue, 8 Oct 2019 23:21:32 -0400 (EDT) Received: from smtpin09.hostedemail.com (10.5.19.251.rfc1918.com [10.5.19.251]) by forelay05.hostedemail.com (Postfix) with SMTP id D5D15181AC9B4 for ; Wed, 9 Oct 2019 03:21:31 +0000 (UTC) X-FDA: 76022795982.09.bead12_2a6877e051d43 X-Spam-Summary: 2,0,0,fbbe6ae7d3c06007,d41d8cd98f00b204,david@fromorbit.com,:linux-xfs@vger.kernel.org::linux-fsdevel@vger.kernel.org,RULES_HIT:41:355:379:541:800:960:966:973:988:989:1260:1261:1311:1314:1345:1359:1437:1515:1534:1542:1711:1730:1747:1777:1792:2196:2199:2393:2559:2562:2693:3138:3139:3140:3141:3142:3353:3865:3866:3867:3868:3870:3871:3872:3874:4321:4385:4605:5007:6261:7576:7903:9707:10004:11026:11658:11914:12043:12294:12296:12297:12438:12517:12519:12555:12679:12895:12986:13894:14096:14181:14394:14721:14877:21080:21433:21450:21451:21627:21740:21796:30012:30036:30054:30091,0,RBL:211.29.132.246:@fromorbit.com:.lbl8.mailshell.net-62.8.32.100 
66.201.201.201,CacheIP:none,Bayesian:0.5,0.5,0.5,Netcheck:none,DomainCache:0,MSF:not bulk,SPF:fn,MSBL:0,DNSBL:neutral,Custom_rules:0:0:0,LFtime:29,LUA_SUMMARY:none X-HE-Tag: bead12_2a6877e051d43 X-Filterd-Recvd-Size: 3615 Received: from mail104.syd.optusnet.com.au (mail104.syd.optusnet.com.au [211.29.132.246]) by imf12.hostedemail.com (Postfix) with ESMTP for ; Wed, 9 Oct 2019 03:21:31 +0000 (UTC) Received: from dread.disaster.area (pa49-181-226-196.pa.nsw.optusnet.com.au [49.181.226.196]) by mail104.syd.optusnet.com.au (Postfix) with ESMTPS id 532C943EB2A; Wed, 9 Oct 2019 14:21:28 +1100 (AEDT) Received: from discord.disaster.area ([192.168.253.110]) by dread.disaster.area with esmtp (Exim 4.92.2) (envelope-from ) id 1iI2XW-0006BO-SU; Wed, 09 Oct 2019 14:21:26 +1100 Received: from dave by discord.disaster.area with local (Exim 4.92) (envelope-from ) id 1iI2XW-00039H-QV; Wed, 09 Oct 2019 14:21:26 +1100 From: Dave Chinner To: linux-xfs@vger.kernel.org Cc: linux-mm@kvack.org, linux-fsdevel@vger.kernel.org Subject: [PATCH 08/26] mm: directed shrinker work deferral Date: Wed, 9 Oct 2019 14:21:06 +1100 Message-Id: <20191009032124.10541-9-david@fromorbit.com> X-Mailer: git-send-email 2.23.0.rc1 In-Reply-To: <20191009032124.10541-1-david@fromorbit.com> References: <20191009032124.10541-1-david@fromorbit.com> MIME-Version: 1.0 X-Optus-CM-Score: 0 X-Optus-CM-Analysis: v=2.2 cv=D+Q3ErZj c=1 sm=1 tr=0 a=dRuLqZ1tmBNts2YiI0zFQg==:117 a=dRuLqZ1tmBNts2YiI0zFQg==:17 a=jpOVt7BSZ2e4Z31A5e1TngXxSK0=:19 a=XobE76Q3jBoA:10 a=20KFwNOVAAAA:8 a=3J8m_CEvPCn7CZIx0tYA:9 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: From: Dave Chinner Introduce a mechanism for ->count_objects() to indicate to the shrinker infrastructure that the reclaim context will not allow scanning work to be done and so the work it decides is necessary needs to be deferred. 
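A much simplified user-space model of this mechanism is sketched below. It is not the do_shrink_slab() implementation: struct sc_model, model_count() and model_shrink() are hypothetical names, fs_allowed stands in for a gfp_mask check, and the scan step is assumed to process everything it is asked to.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, cut-down stand-in for struct shrink_control. */
struct sc_model {
	bool fs_allowed;	/* models a reclaim context that may scan */
	bool defer_work;	/* set by the count callback */
};

/*
 * Count callback: always report the cache size, and flag deferral when the
 * reclaim context does not allow scanning.
 */
static unsigned long
model_count(struct sc_model *sc, unsigned long cached)
{
	if (!sc->fs_allowed)
		sc->defer_work = true;
	return cached;
}

/*
 * Control logic: decide how much work is needed, then either do it or add
 * it to the deferred total. Returns the number of objects scanned.
 */
static unsigned long
model_shrink(struct sc_model *sc, unsigned long cached,
	     unsigned long *deferred)
{
	unsigned long total_scan;
	unsigned long scanned = 0;

	assert(deferred != NULL);
	total_scan = model_count(sc, cached);

	if (!sc->defer_work)
		scanned = total_scan;	/* assume the scan does all the work */

	*deferred += total_scan - scanned;
	return scanned;
}
```

In this model the scan step never has to inspect the context itself; the decision not to scan is made once, from the flag set at count time.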
This simplifies the code by separating out the accounting of deferred work from the actual doing of the work, and allows better decisions to be made by the shrinker control logic on what action it can take. Signed-off-by: Dave Chinner --- include/linux/shrinker.h | 7 +++++++ mm/vmscan.c | 8 ++++++++ 2 files changed, 15 insertions(+) diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h index 0f80123650e2..3405c39ab92c 100644 --- a/include/linux/shrinker.h +++ b/include/linux/shrinker.h @@ -31,6 +31,13 @@ struct shrink_control { /* current memcg being shrunk (for memcg aware shrinkers) */ struct mem_cgroup *memcg; + + /* + * set by ->count_objects if reclaim context prevents reclaim from + * occurring. This allows the shrinker to immediately defer all the + * work and not even attempt to scan the cache. + */ + bool defer_work; }; #define SHRINK_STOP (~0UL) diff --git a/mm/vmscan.c b/mm/vmscan.c index c6659bb758a4..1445bc7578c0 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -535,6 +535,13 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl, trace_mm_shrink_slab_start(shrinker, shrinkctl, nr, freeable, delta, total_scan, priority); + /* + * If the shrinker can't run (e.g. due to gfp_mask constraints), then + * defer the work to a context that can scan the cache.
+ */ + if (shrinkctl->defer_work) + goto done; + /* * Normally, we should not scan less than batch_size objects in one * pass to avoid too frequent shrinker calls, but if the slab has less @@ -569,6 +576,7 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl, cond_resched(); } +done: if (next_deferred >= scanned) next_deferred -= scanned; else From patchwork Wed Oct 9 03:21:07 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Dave Chinner X-Patchwork-Id: 11180421 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id DD38676 for ; Wed, 9 Oct 2019 03:36:50 +0000 (UTC) Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by mail.kernel.org (Postfix) with ESMTP id A6C6920B7C for ; Wed, 9 Oct 2019 03:36:50 +0000 (UTC) DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org A6C6920B7C Authentication-Results: mail.kernel.org; dmarc=none (p=none dis=none) header.from=fromorbit.com Authentication-Results: mail.kernel.org; spf=pass smtp.mailfrom=owner-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix) id 2AE4B8E001D; Tue, 8 Oct 2019 23:36:48 -0400 (EDT) Delivered-To: linux-mm-outgoing@kvack.org Received: by kanga.kvack.org (Postfix, from userid 40) id 1EAF08E0016; Tue, 8 Oct 2019 23:36:48 -0400 (EDT) X-Original-To: int-list-linux-mm@kvack.org X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 0164E8E001D; Tue, 8 Oct 2019 23:36:47 -0400 (EDT) X-Original-To: linux-mm@kvack.org X-Delivered-To: linux-mm@kvack.org Received: from forelay.hostedemail.com (smtprelay0157.hostedemail.com [216.40.44.157]) by kanga.kvack.org (Postfix) with ESMTP id C528E8E0016 for ; Tue, 8 Oct 2019 23:36:47 -0400 (EDT) Received: from smtpin13.hostedemail.com (10.5.19.251.rfc1918.com [10.5.19.251]) by forelay01.hostedemail.com (Postfix) 
with SMTP id 69B0A180AD803 for ; Wed, 9 Oct 2019 03:36:47 +0000 (UTC) X-FDA: 76022834454.13.robin77_1e1373dde140e X-Spam-Summary: 2,0,0,0aec7a9ba6cc1b9d,d41d8cd98f00b204,david@fromorbit.com,:linux-xfs@vger.kernel.org::linux-fsdevel@vger.kernel.org,RULES_HIT:2:41:355:379:541:800:960:966:973:988:989:1260:1261:1311:1314:1345:1359:1437:1515:1535:1605:1730:1747:1777:1792:1801:2196:2198:2199:2200:2393:2553:2559:2562:3138:3139:3140:3141:3142:3865:3866:3867:3868:3870:3871:3872:3874:4049:4118:4250:4321:4385:4605:5007:6261:7576:9010:9592:9707:10004:11026:11232:11473:11658:11914:12043:12294:12296:12297:12438:12517:12519:12555:12679:12895:12986:13894:14096:14394:14877:21080:21433:21611:21627:21740:21965:30012:30054:30070:30090,0,RBL:211.29.132.249:@fromorbit.com:.lbl8.mailshell.net-62.8.32.100 66.201.201.201,CacheIP:none,Bayesian:0.5,0.5,0.5,Netcheck:none,DomainCache:0,MSF:not bulk,SPF:fn,MSBL:0,DNSBL:neutral,Custom_rules:0:0:0,LFtime:24,LUA_SUMMARY:none X-HE-Tag: robin77_1e1373dde140e X-Filterd-Recvd-Size: 7937 Received: from mail105.syd.optusnet.com.au (mail105.syd.optusnet.com.au [211.29.132.249]) by imf20.hostedemail.com (Postfix) with ESMTP for ; Wed, 9 Oct 2019 03:36:46 +0000 (UTC) Received: from dread.disaster.area (pa49-181-226-196.pa.nsw.optusnet.com.au [49.181.226.196]) by mail105.syd.optusnet.com.au (Postfix) with ESMTPS id 8CA32362BAA for ; Wed, 9 Oct 2019 14:36:45 +1100 (AEDT) Received: from discord.disaster.area ([192.168.253.110]) by dread.disaster.area with esmtp (Exim 4.92.2) (envelope-from ) id 1iI2XW-0006BS-Ty; Wed, 09 Oct 2019 14:21:26 +1100 Received: from dave by discord.disaster.area with local (Exim 4.92) (envelope-from ) id 1iI2XW-00039K-Rj; Wed, 09 Oct 2019 14:21:26 +1100 From: Dave Chinner To: linux-xfs@vger.kernel.org Cc: linux-mm@kvack.org, linux-fsdevel@vger.kernel.org Subject: [PATCH 09/26] shrinkers: use defer_work for GFP_NOFS sensitive shrinkers Date: Wed, 9 Oct 2019 14:21:07 +1100 Message-Id: <20191009032124.10541-10-david@fromorbit.com> 
X-Mailer: git-send-email 2.23.0.rc1 In-Reply-To: <20191009032124.10541-1-david@fromorbit.com> References: <20191009032124.10541-1-david@fromorbit.com> MIME-Version: 1.0 X-Optus-CM-Score: 0 X-Optus-CM-Analysis: v=2.2 cv=D+Q3ErZj c=1 sm=1 tr=0 a=dRuLqZ1tmBNts2YiI0zFQg==:117 a=dRuLqZ1tmBNts2YiI0zFQg==:17 a=jpOVt7BSZ2e4Z31A5e1TngXxSK0=:19 a=XobE76Q3jBoA:10 a=20KFwNOVAAAA:8 a=bIsfdx-f5ddGStTJopEA:9 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: From: Dave Chinner For shrinkers that currently avoid scanning when called under GFP_NOFS contexts, convert them to use the new ->defer_work flag rather than checking and returning errors during scans. This makes it very clear that these shrinkers are not doing any work because of the context limitations, not because there is no work that can be done. Signed-off-by: Dave Chinner --- drivers/staging/android/ashmem.c | 8 ++++---- fs/gfs2/glock.c | 5 +++-- fs/gfs2/quota.c | 6 +++--- fs/nfs/dir.c | 6 +++--- fs/super.c | 6 +++--- fs/xfs/xfs_qm.c | 11 ++++++++--- net/sunrpc/auth.c | 5 ++--- 7 files changed, 26 insertions(+), 21 deletions(-) diff --git a/drivers/staging/android/ashmem.c b/drivers/staging/android/ashmem.c index 74d497d39c5a..0b80149f0ac5 100644 --- a/drivers/staging/android/ashmem.c +++ b/drivers/staging/android/ashmem.c @@ -438,10 +438,6 @@ ashmem_shrink_scan(struct shrinker *shrink, struct shrink_control *sc) { unsigned long freed = 0; - /* We might recurse into filesystem code, so bail out if necessary */ - if (!(sc->gfp_mask & __GFP_FS)) - return SHRINK_STOP; - if (!mutex_trylock(&ashmem_mutex)) return -1; @@ -478,6 +474,10 @@ ashmem_shrink_scan(struct shrinker *shrink, struct shrink_control *sc) static unsigned long ashmem_shrink_count(struct shrinker *shrink, struct shrink_control *sc) { + /* We might recurse into filesystem code, so bail out if necessary */ + if (!(sc->gfp_mask & __GFP_FS)) + 
sc->defer_work = true; + /* * note that lru_count is count of pages on the lru, not a count of * objects on the list. This means the scan function needs to return the diff --git a/fs/gfs2/glock.c b/fs/gfs2/glock.c index 0290a22ebccf..a25161b93f96 100644 --- a/fs/gfs2/glock.c +++ b/fs/gfs2/glock.c @@ -1614,14 +1614,15 @@ static long gfs2_scan_glock_lru(int nr) static unsigned long gfs2_glock_shrink_scan(struct shrinker *shrink, struct shrink_control *sc) { - if (!(sc->gfp_mask & __GFP_FS)) - return SHRINK_STOP; return gfs2_scan_glock_lru(sc->nr_to_scan); } static unsigned long gfs2_glock_shrink_count(struct shrinker *shrink, struct shrink_control *sc) { + if (!(sc->gfp_mask & __GFP_FS)) + sc->defer_work = true; + return vfs_pressure_ratio(atomic_read(&lru_count)); } diff --git a/fs/gfs2/quota.c b/fs/gfs2/quota.c index 7c016a082aa6..661189b42c31 100644 --- a/fs/gfs2/quota.c +++ b/fs/gfs2/quota.c @@ -166,9 +166,6 @@ static unsigned long gfs2_qd_shrink_scan(struct shrinker *shrink, LIST_HEAD(dispose); unsigned long freed; - if (!(sc->gfp_mask & __GFP_FS)) - return SHRINK_STOP; - freed = list_lru_shrink_walk(&gfs2_qd_lru, sc, gfs2_qd_isolate, &dispose); @@ -180,6 +177,9 @@ static unsigned long gfs2_qd_shrink_scan(struct shrinker *shrink, static unsigned long gfs2_qd_shrink_count(struct shrinker *shrink, struct shrink_control *sc) { + if (!(sc->gfp_mask & __GFP_FS)) + sc->defer_work = true; + return vfs_pressure_ratio(list_lru_shrink_count(&gfs2_qd_lru, sc)); } diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c index e180033e35cf..fd4a70479790 100644 --- a/fs/nfs/dir.c +++ b/fs/nfs/dir.c @@ -2211,10 +2211,7 @@ unsigned long nfs_access_cache_scan(struct shrinker *shrink, struct shrink_control *sc) { int nr_to_scan = sc->nr_to_scan; - gfp_t gfp_mask = sc->gfp_mask; - if ((gfp_mask & GFP_KERNEL) != GFP_KERNEL) - return SHRINK_STOP; return nfs_do_access_cache_scan(nr_to_scan); } @@ -2222,6 +2219,9 @@ nfs_access_cache_scan(struct shrinker *shrink, struct shrink_control *sc) unsigned 
long nfs_access_cache_count(struct shrinker *shrink, struct shrink_control *sc) { + if ((sc->gfp_mask & GFP_KERNEL) != GFP_KERNEL) + sc->defer_work = true; + return vfs_pressure_ratio(atomic_long_read(&nfs_access_nr_entries)); } diff --git a/fs/super.c b/fs/super.c index f627b7c53d2b..d6a93d7fe05f 100644 --- a/fs/super.c +++ b/fs/super.c @@ -74,9 +74,6 @@ static unsigned long super_cache_scan(struct shrinker *shrink, * Deadlock avoidance. We may hold various FS locks, and we don't want * to recurse into the FS that called us in clear_inode() and friends.. */ - if (!(sc->gfp_mask & __GFP_FS)) - return SHRINK_STOP; - if (!trylock_super(sb)) return SHRINK_STOP; @@ -141,6 +138,9 @@ static unsigned long super_cache_count(struct shrinker *shrink, return 0; smp_rmb(); + if (!(sc->gfp_mask & __GFP_FS)) + sc->defer_work = true; + if (sb->s_op && sb->s_op->nr_cached_objects) total_objects = sb->s_op->nr_cached_objects(sb, sc); diff --git a/fs/xfs/xfs_qm.c b/fs/xfs/xfs_qm.c index ecd8ce152ab1..aa03f2448145 100644 --- a/fs/xfs/xfs_qm.c +++ b/fs/xfs/xfs_qm.c @@ -502,9 +502,6 @@ xfs_qm_shrink_scan( unsigned long freed; int error; - if ((sc->gfp_mask & (__GFP_FS|__GFP_DIRECT_RECLAIM)) != (__GFP_FS|__GFP_DIRECT_RECLAIM)) - return 0; - INIT_LIST_HEAD(&isol.buffers); INIT_LIST_HEAD(&isol.dispose); @@ -534,6 +531,14 @@ xfs_qm_shrink_count( struct xfs_quotainfo *qi = container_of(shrink, struct xfs_quotainfo, qi_shrinker); + /* + * __GFP_DIRECT_RECLAIM is used here to avoid blocking kswapd + */ + if ((sc->gfp_mask & (__GFP_FS|__GFP_DIRECT_RECLAIM)) != + (__GFP_FS|__GFP_DIRECT_RECLAIM)) { + sc->defer_work = true; + } + return list_lru_shrink_count(&qi->qi_lru, sc); } diff --git a/net/sunrpc/auth.c b/net/sunrpc/auth.c index cdb05b48de44..7d11a7034fee 100644 --- a/net/sunrpc/auth.c +++ b/net/sunrpc/auth.c @@ -527,9 +527,6 @@ static unsigned long rpcauth_cache_shrink_scan(struct shrinker *shrink, struct shrink_control *sc) { - if ((sc->gfp_mask & GFP_KERNEL) != GFP_KERNEL) - return 
SHRINK_STOP; - /* nothing left, don't come back */ if (list_empty(&cred_unused)) return SHRINK_STOP; @@ -541,6 +538,8 @@ static unsigned long rpcauth_cache_shrink_count(struct shrinker *shrink, struct shrink_control *sc) { + if ((sc->gfp_mask & GFP_KERNEL) != GFP_KERNEL) + sc->defer_work = true; return number_cred_unused * sysctl_vfs_cache_pressure / 100; } From patchwork Wed Oct 9 03:21:08 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Dave Chinner X-Patchwork-Id: 11180359 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id BD92317D4 for ; Wed, 9 Oct 2019 03:21:50 +0000 (UTC) Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by mail.kernel.org (Postfix) with ESMTP id 8B7EF206C2 for ; Wed, 9 Oct 2019 03:21:50 +0000 (UTC) DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org 8B7EF206C2 Authentication-Results: mail.kernel.org; dmarc=none (p=none dis=none) header.from=fromorbit.com Authentication-Results: mail.kernel.org; spf=pass smtp.mailfrom=owner-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix) id 53C528E000F; Tue, 8 Oct 2019 23:21:33 -0400 (EDT) Delivered-To: linux-mm-outgoing@kvack.org Received: by kanga.kvack.org (Postfix, from userid 40) id 4DAA78E000D; Tue, 8 Oct 2019 23:21:33 -0400 (EDT) X-Original-To: int-list-linux-mm@kvack.org X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 244318E000F; Tue, 8 Oct 2019 23:21:33 -0400 (EDT) X-Original-To: linux-mm@kvack.org X-Delivered-To: linux-mm@kvack.org Received: from forelay.hostedemail.com (smtprelay0040.hostedemail.com [216.40.44.40]) by kanga.kvack.org (Postfix) with ESMTP id D56308E000D for ; Tue, 8 Oct 2019 23:21:32 -0400 (EDT) Received: from smtpin12.hostedemail.com (10.5.19.251.rfc1918.com [10.5.19.251]) by forelay01.hostedemail.com (Postfix) with 
SMTP id 7A8BF180AD803 for ; Wed, 9 Oct 2019 03:21:32 +0000 (UTC) X-FDA: 76022796024.12.top44_2a77a5e3e2a58 X-Spam-Summary: 2,0,0,00ca1d837882e32a,d41d8cd98f00b204,david@fromorbit.com,:linux-xfs@vger.kernel.org::linux-fsdevel@vger.kernel.org,RULES_HIT:2:41:69:355:379:541:800:960:966:973:988:989:1260:1261:1311:1314:1345:1359:1437:1515:1535:1605:1606:1730:1747:1777:1792:2196:2199:2393:2559:2562:2690:2912:3138:3139:3140:3141:3142:3865:3866:3867:3868:3870:3871:3872:3874:4118:4250:4385:4605:5007:6119:6261:7576:7903:7974:8603:9036:9121:9592:9707:10004:11026:11473:11658:11914:12043:12291:12296:12297:12438:12517:12519:12555:12679:12683:12895:13141:13146:13230:13869:13894:13972:14096:14394:14877:21080:21324:21433:21451:21627:21740:30005:30012:30034:30054:30070:30079,0,RBL:211.29.132.246:@fromorbit.com:.lbl8.mailshell.net-62.8.32.100 66.201.201.201,CacheIP:none,Bayesian:0.5,0.5,0.5,Netcheck:none,DomainCache:0,MSF:not bulk,SPF:fn,MSBL:0,DNSBL:neutral,Custom_rules:0:0:0,LFtime:2,LUA_SUMMARY:none X-HE-Tag: top44_2a77a5e3e2a58 X-Filterd-Recvd-Size: 7381 Received: from mail104.syd.optusnet.com.au (mail104.syd.optusnet.com.au [211.29.132.246]) by imf39.hostedemail.com (Postfix) with ESMTP for ; Wed, 9 Oct 2019 03:21:31 +0000 (UTC) Received: from dread.disaster.area (pa49-181-226-196.pa.nsw.optusnet.com.au [49.181.226.196]) by mail104.syd.optusnet.com.au (Postfix) with ESMTPS id 564DF43ECAB; Wed, 9 Oct 2019 14:21:27 +1100 (AEDT) Received: from discord.disaster.area ([192.168.253.110]) by dread.disaster.area with esmtp (Exim 4.92.2) (envelope-from ) id 1iI2XW-0006BU-VF; Wed, 09 Oct 2019 14:21:26 +1100 Received: from dave by discord.disaster.area with local (Exim 4.92) (envelope-from ) id 1iI2XW-00039N-TH; Wed, 09 Oct 2019 14:21:26 +1100 From: Dave Chinner To: linux-xfs@vger.kernel.org Cc: linux-mm@kvack.org, linux-fsdevel@vger.kernel.org Subject: [PATCH 10/26] mm: factor shrinker work calculations Date: Wed, 9 Oct 2019 14:21:08 +1100 Message-Id: 
<20191009032124.10541-11-david@fromorbit.com> X-Mailer: git-send-email 2.23.0.rc1 In-Reply-To: <20191009032124.10541-1-david@fromorbit.com> References: <20191009032124.10541-1-david@fromorbit.com> MIME-Version: 1.0 X-Optus-CM-Score: 0 X-Optus-CM-Analysis: v=2.2 cv=FNpr/6gs c=1 sm=1 tr=0 a=dRuLqZ1tmBNts2YiI0zFQg==:117 a=dRuLqZ1tmBNts2YiI0zFQg==:17 a=jpOVt7BSZ2e4Z31A5e1TngXxSK0=:19 a=XobE76Q3jBoA:10 a=20KFwNOVAAAA:8 a=rQ5XUWRrd7byYprixfUA:9 a=FbgdKT8UfCh_ZLh5:21 a=DSgJKIrDGFEhNJPj:21 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: From: Dave Chinner Start to clean up the shrinker code by factoring out the calculation that determines how much work to do. This separates the calculation from clamping and other adjustments that are done before the shrinker work is run. Document the scan batch size calculation better while we are there. Also convert the calculation for the amount of work to be done to use 64 bit logic so we don't have to keep jumping through hoops to keep calculations within 32 bits on 32 bit systems. Signed-off-by: Dave Chinner --- mm/vmscan.c | 97 ++++++++++++++++++++++++++++++++++++++--------------- 1 file changed, 70 insertions(+), 27 deletions(-) diff --git a/mm/vmscan.c b/mm/vmscan.c index 1445bc7578c0..de6b09ad97ed 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -458,13 +458,68 @@ EXPORT_SYMBOL(unregister_shrinker); #define SHRINK_BATCH 128 +/* + * Calculate the number of new objects to scan this time around. Return + * the work to be done. If there are freeable objects, return that number in + * @freeable_objects. 
+ */ +static int64_t shrink_scan_count(struct shrink_control *shrinkctl, + struct shrinker *shrinker, int priority, + int64_t *freeable_objects) +{ + int64_t delta; + int64_t freeable; + + freeable = shrinker->count_objects(shrinker, shrinkctl); + if (freeable == 0 || freeable == SHRINK_EMPTY) + return freeable; + + if (shrinker->seeks) { + /* + * shrinker->seeks is a measure of how much IO is required to + * reinstantiate the object in memory. The default value is 2 + * which is typical for a cold inode requiring a directory read + * and an inode read to re-instantiate. + * + * The scan batch size is defined by the shrinker priority, but + * to be able to bias the reclaim we increase the default batch + * size by a factor of 4. Hence we end up with a scan batch multiplier that + * scales like so: + * + * ->seeks scan batch multiplier + * 1 4.00x + * 2 2.00x + * 3 1.33x + * 4 1.00x + * 8 0.50x + * + * IOWs, the more seeks it takes to pull the item into cache, + * the smaller the reclaim scan batch. Hence we put more reclaim + * pressure on caches that are fast to repopulate and keep a + * rough balance between caches that have different costs. + */ + delta = freeable >> (priority - 2); + do_div(delta, shrinker->seeks); + } else { + /* + * These objects don't require any IO to create. Trim them + * aggressively under memory pressure to keep them from causing + * refetches in the IO caches. + */ + delta = freeable / 2; + } + + *freeable_objects = freeable; + return delta > 0 ?
delta : 0; +} + static unsigned long do_shrink_slab(struct shrink_control *shrinkctl, struct shrinker *shrinker, int priority) { unsigned long freed = 0; - unsigned long long delta; long total_scan; - long freeable; + int64_t freeable_objects = 0; + int64_t scan_count; long nr; long new_nr; int nid = shrinkctl->nid; @@ -475,9 +530,10 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl, if (!(shrinker->flags & SHRINKER_NUMA_AWARE)) nid = 0; - freeable = shrinker->count_objects(shrinker, shrinkctl); - if (freeable == 0 || freeable == SHRINK_EMPTY) - return freeable; + scan_count = shrink_scan_count(shrinkctl, shrinker, priority, + &freeable_objects); + if (scan_count == 0 || scan_count == SHRINK_EMPTY) + return scan_count; /* * copy the current shrinker scan count into a local variable @@ -486,25 +542,11 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl, */ nr = atomic_long_xchg(&shrinker->nr_deferred[nid], 0); - total_scan = nr; - if (shrinker->seeks) { - delta = freeable >> priority; - delta *= 4; - do_div(delta, shrinker->seeks); - } else { - /* - * These objects don't require any IO to create. Trim - * them aggressively under memory pressure to keep - * them from causing refetches in the IO caches. - */ - delta = freeable / 2; - } - - total_scan += delta; + total_scan = nr + scan_count; if (total_scan < 0) { pr_err("shrink_slab: %pS negative objects to delete nr=%ld\n", shrinker->scan_objects, total_scan); - total_scan = freeable; + total_scan = scan_count; next_deferred = nr; } else next_deferred = total_scan; @@ -521,19 +563,20 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl, * Hence only allow the shrinker to scan the entire cache when * a large delta change is calculated directly. 
*/ - if (delta < freeable / 4) - total_scan = min(total_scan, freeable / 2); + if (scan_count < freeable_objects / 4) + total_scan = min_t(long, total_scan, freeable_objects / 2); /* * Avoid risking looping forever due to too large nr value: * never try to free more than twice the estimate number of * freeable entries. */ - if (total_scan > freeable * 2) - total_scan = freeable * 2; + if (total_scan > freeable_objects * 2) + total_scan = freeable_objects * 2; trace_mm_shrink_slab_start(shrinker, shrinkctl, nr, - freeable, delta, total_scan, priority); + freeable_objects, scan_count, + total_scan, priority); /* * If the shrinker can't run (e.g. due to gfp_mask constraints), then @@ -558,7 +601,7 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl, * possible. */ while (total_scan >= batch_size || - total_scan >= freeable) { + total_scan >= freeable_objects) { unsigned long ret; unsigned long nr_to_scan = min(batch_size, total_scan); From patchwork Wed Oct 9 03:21:09 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Dave Chinner X-Patchwork-Id: 11180265 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 4347E17D4 for ; Wed, 9 Oct 2019 03:21:32 +0000 (UTC) Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by mail.kernel.org (Postfix) with ESMTP id E78BD2133F for ; Wed, 9 Oct 2019 03:21:31 +0000 (UTC) DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org E78BD2133F Authentication-Results: mail.kernel.org; dmarc=none (p=none dis=none) header.from=fromorbit.com Authentication-Results: mail.kernel.org; spf=pass smtp.mailfrom=owner-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix) id E9D7A8E0005; Tue, 8 Oct 2019 23:21:30 -0400 (EDT) Delivered-To: linux-mm-outgoing@kvack.org Received: by kanga.kvack.org (Postfix, from userid 40) id E4E678E0003; Tue, 8 
Oct 2019 23:21:30 -0400 (EDT) X-Original-To: int-list-linux-mm@kvack.org X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id D641D8E0005; Tue, 8 Oct 2019 23:21:30 -0400 (EDT) X-Original-To: linux-mm@kvack.org X-Delivered-To: linux-mm@kvack.org Received: from forelay.hostedemail.com (smtprelay0142.hostedemail.com [216.40.44.142]) by kanga.kvack.org (Postfix) with ESMTP id B643F8E0003 for ; Tue, 8 Oct 2019 23:21:30 -0400 (EDT) Received: from smtpin10.hostedemail.com (10.5.19.251.rfc1918.com [10.5.19.251]) by forelay02.hostedemail.com (Postfix) with SMTP id 503E66406 for ; Wed, 9 Oct 2019 03:21:30 +0000 (UTC) X-FDA: 76022795940.10.smash17_2a25666906953 X-Spam-Summary: 2,0,0,33af8c0c29dc9edd,d41d8cd98f00b204,david@fromorbit.com,:linux-xfs@vger.kernel.org::linux-fsdevel@vger.kernel.org,RULES_HIT:1:2:41:69:355:379:541:800:960:966:973:988:989:1260:1261:1311:1314:1345:1359:1437:1515:1605:1730:1747:1777:1792:2196:2199:2393:2559:2562:2690:2693:2895:3138:3139:3140:3141:3142:3865:3866:3867:3868:3870:3871:3872:3873:3874:4031:4051:4385:4605:5007:6119:6261:7576:7875:7903:7974:9036:9592:9707:11026:11473:11658:11914:12043:12291:12296:12297:12438:12517:12519:12555:12679:12683:12895:13161:13180:13229:13868:13869:13894:14394:14877:21080:21325:21433:21451:21627:21740:30005:30012:30034:30054:30070:30079:30080,0,RBL:211.29.132.246:@fromorbit.com:.lbl8.mailshell.net-62.14.32.100 66.201.201.201,CacheIP:none,Bayesian:0.5,0.5,0.5,Netcheck:none,DomainCache:0,MSF:not bulk,SPF:fn,MSBL:0,DNSBL:neutral,Custom_rules:0:1:0,LFtime:24,LUA_SUMMARY:none X-HE-Tag: smash17_2a25666906953 X-Filterd-Recvd-Size: 10037 Received: from mail104.syd.optusnet.com.au (mail104.syd.optusnet.com.au [211.29.132.246]) by imf39.hostedemail.com (Postfix) with ESMTP for ; Wed, 9 Oct 2019 03:21:29 +0000 (UTC) Received: from dread.disaster.area (pa49-181-226-196.pa.nsw.optusnet.com.au [49.181.226.196]) by mail104.syd.optusnet.com.au (Postfix) with ESMTPS id 3FE9E43EC2F; 
Wed, 9 Oct 2019 14:21:27 +1100 (AEDT) Received: from discord.disaster.area ([192.168.253.110]) by dread.disaster.area with esmtp (Exim 4.92.2) (envelope-from ) id 1iI2XX-0006BY-0P; Wed, 09 Oct 2019 14:21:27 +1100 Received: from dave by discord.disaster.area with local (Exim 4.92) (envelope-from ) id 1iI2XW-00039Q-UQ; Wed, 09 Oct 2019 14:21:26 +1100 From: Dave Chinner To: linux-xfs@vger.kernel.org Cc: linux-mm@kvack.org, linux-fsdevel@vger.kernel.org Subject: [PATCH 11/26] shrinker: defer work only to kswapd Date: Wed, 9 Oct 2019 14:21:09 +1100 Message-Id: <20191009032124.10541-12-david@fromorbit.com> X-Mailer: git-send-email 2.23.0.rc1 In-Reply-To: <20191009032124.10541-1-david@fromorbit.com> References: <20191009032124.10541-1-david@fromorbit.com> MIME-Version: 1.0 X-Optus-CM-Score: 0 X-Optus-CM-Analysis: v=2.2 cv=FNpr/6gs c=1 sm=1 tr=0 a=dRuLqZ1tmBNts2YiI0zFQg==:117 a=dRuLqZ1tmBNts2YiI0zFQg==:17 a=jpOVt7BSZ2e4Z31A5e1TngXxSK0=:19 a=XobE76Q3jBoA:10 a=20KFwNOVAAAA:8 a=DY0x1sV2QIg9a-hhTtsA:9 a=MCR_tvXtDtI_MWXX:21 a=Jfgs850-GcERAMIO:21 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: From: Dave Chinner Right now deferred work is picked up by whichever GFP_KERNEL context reclaimer wins the race to empty the node's deferred work counter. However, if there are lots of direct reclaimers, that work might be continually picked up by contexts that can't do any work, and so the opportunities to do the work are missed by contexts that could do it. A further problem with the current code is that the deferred work can be picked up by a random direct reclaimer, resulting in that specific process having to do all the deferred reclaim work, and hence it can suffer extremely long latencies if the reclaim work blocks regularly. This is not good for direct reclaim fairness or for minimising long tail latency events.
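As a rough illustration of the second problem described above, here is a minimal userspace model of the pre-patch accounting. It is only a sketch, not kernel code: `nr_deferred` here is a plain C11 atomic standing in for one node's `shrinker->nr_deferred` slot, `old_style_reclaim()` and its `can_scan` parameter are hypothetical names, and the windup clamping done by the real `do_shrink_slab()` is deliberately left out.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for one node's shrinker->nr_deferred slot. */
static _Atomic int64_t nr_deferred;

/*
 * Model of the pre-patch behaviour: every reclaimer empties the shared
 * counter and adds the backlog to its own scan target. A context that
 * cannot do the work (can_scan == false, e.g. GFP_NOFS) puts the whole
 * total back, so its own share becomes somebody else's problem.
 *
 * Returns the number of objects this caller ends up having to scan.
 */
static int64_t old_style_reclaim(int64_t own_work, bool can_scan)
{
	int64_t backlog = atomic_exchange(&nr_deferred, 0);
	int64_t total = backlog + own_work;

	if (!can_scan) {
		atomic_fetch_add(&nr_deferred, total);
		return 0;
	}
	return total;
}
```

In this model, two constrained callers that each generate 100 objects of work leave 200 in the counter, and the next unconstrained caller with 100 of its own ends up scanning 300. That one caller pays for everyone, which is the unfairness the patch sets out to remove.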
To avoid these problems, simply limit deferred work to kswapd contexts. We know kswapd is a context that can always do reclaim work, and hence deferring work to kswapd allows the deferred work to be done in the background and not adversely affect any specific process context doing direct reclaim. The advantage of this is that the amount of work to be done in direct reclaim is now bounded and predictable - it is entirely based on the cache's freeable objects and the reclaim priority. Hence all direct reclaimers running at the same time should be doing relatively equal amounts of work, thereby reducing the incidence of long tail latencies due to uneven reclaim workloads. Note that we use signed integers for everything except the freed count as the returns from the shrinker callouts cannot be guaranteed untainted. Indeed, the shrinkers can return scan counts larger than were fed in, so we need scan counts to underflow in a detectable manner to terminate loops. This is necessary to prevent a misbehaving shrinker from triggering endless scanning loops. Signed-off-by: Dave Chinner --- include/linux/shrinker.h | 2 +- mm/vmscan.c | 98 +++++++++++++++++++++------------------- 2 files changed, 52 insertions(+), 48 deletions(-) diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h index 3405c39ab92c..30c10f42109f 100644 --- a/include/linux/shrinker.h +++ b/include/linux/shrinker.h @@ -81,7 +81,7 @@ struct shrinker { int id; #endif /* objs pending delete, per node */ - atomic_long_t *nr_deferred; + atomic64_t *nr_deferred; }; #define DEFAULT_SEEKS 2 /* A good number if you don't know better.
*/ diff --git a/mm/vmscan.c b/mm/vmscan.c index de6b09ad97ed..d05f64bd26ff 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -516,16 +516,16 @@ static int64_t shrink_scan_count(struct shrink_control *shrinkctl, static unsigned long do_shrink_slab(struct shrink_control *shrinkctl, struct shrinker *shrinker, int priority) { - unsigned long freed = 0; - long total_scan; + uint64_t freed = 0; int64_t freeable_objects = 0; int64_t scan_count; - long nr; - long new_nr; + int64_t scanned_objects = 0; + int64_t next_deferred = 0; + int64_t deferred_count = 0; + int64_t new_nr; int nid = shrinkctl->nid; long batch_size = shrinker->batch ? shrinker->batch : SHRINK_BATCH; - long scanned = 0, next_deferred; if (!(shrinker->flags & SHRINKER_NUMA_AWARE)) nid = 0; @@ -536,47 +536,51 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl, return scan_count; /* - * copy the current shrinker scan count into a local variable - * and zero it so that other concurrent shrinker invocations - * don't also do this scanning work. + * If kswapd, we take all the deferred work and do it here. We don't let + * direct reclaim do this, because then it means some poor sod is going + * to have to do somebody else's GFP_NOFS reclaim, and it hides the real + * amount of reclaim work from concurrent kswapd operations. Hence we do + * the work in the wrong place, at the wrong time, and it's largely + * unpredictable. + * + * By doing the deferred work only in kswapd, we can schedule the work + * according to the reclaim priority - low priority reclaim will do + * less deferred work, hence we'll do more of the deferred work the more + * desperate we become for free memory. This avoids the need + * to specifically avoid deferred work windup as low amounts of memory + * pressure won't excessively trim caches anymore.
*/ - nr = atomic_long_xchg(&shrinker->nr_deferred[nid], 0); + if (current_is_kswapd()) { + int64_t deferred_scan; - total_scan = nr + scan_count; - if (total_scan < 0) { - pr_err("shrink_slab: %pS negative objects to delete nr=%ld\n", - shrinker->scan_objects, total_scan); - total_scan = scan_count; - next_deferred = nr; - } else - next_deferred = total_scan; + deferred_count = atomic64_xchg(&shrinker->nr_deferred[nid], 0); - /* - * We need to avoid excessive windup on filesystem shrinkers - * due to large numbers of GFP_NOFS allocations causing the - * shrinkers to return -1 all the time. This results in a large - * nr being built up so when a shrink that can do some work - * comes along it empties the entire cache due to nr >>> - * freeable. This is bad for sustaining a working set in - * memory. - * - * Hence only allow the shrinker to scan the entire cache when - * a large delta change is calculated directly. - */ - if (scan_count < freeable_objects / 4) - total_scan = min_t(long, total_scan, freeable_objects / 2); + /* we want to scan 5-10% of the deferred work here at minimum */ + deferred_scan = deferred_count; + if (priority) + do_div(deferred_scan, priority); + scan_count += deferred_scan; + + /* + * If there is more deferred work than the number of freeable + * items in the cache, limit the amount of work we will carry + * over to the next kswapd run on this cache. This prevents + * deferred work windup. + */ + deferred_count = min(deferred_count, freeable_objects * 2); + + } /* * Avoid risking looping forever due to too large nr value: * never try to free more than twice the estimate number of * freeable entries. 
*/ - if (total_scan > freeable_objects * 2) - total_scan = freeable_objects * 2; + scan_count = min(scan_count, freeable_objects * 2); - trace_mm_shrink_slab_start(shrinker, shrinkctl, nr, + trace_mm_shrink_slab_start(shrinker, shrinkctl, deferred_count, freeable_objects, scan_count, - total_scan, priority); + scan_count, priority); /* * If the shrinker can't run (e.g. due to gfp_mask constraints), then @@ -600,10 +604,10 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl, * scanning at high prio and therefore should try to reclaim as much as * possible. */ - while (total_scan >= batch_size || - total_scan >= freeable_objects) { + while (scan_count >= batch_size || + scan_count >= freeable_objects) { unsigned long ret; - unsigned long nr_to_scan = min(batch_size, total_scan); + unsigned long nr_to_scan = min_t(long, batch_size, scan_count); shrinkctl->nr_to_scan = nr_to_scan; shrinkctl->nr_scanned = nr_to_scan; @@ -613,29 +617,29 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl, freed += ret; count_vm_events(SLABS_SCANNED, shrinkctl->nr_scanned); - total_scan -= shrinkctl->nr_scanned; - scanned += shrinkctl->nr_scanned; + scan_count -= shrinkctl->nr_scanned; + scanned_objects += shrinkctl->nr_scanned; cond_resched(); } - done: - if (next_deferred >= scanned) - next_deferred -= scanned; + if (deferred_count) + next_deferred = deferred_count - scanned_objects; else - next_deferred = 0; + next_deferred = scan_count; /* * move the unused scan count back into the shrinker in a * manner that handles concurrent updates. If we exhausted the * scan, there is no need to do an update. 
*/ if (next_deferred > 0) - new_nr = atomic_long_add_return(next_deferred, + new_nr = atomic64_add_return(next_deferred, &shrinker->nr_deferred[nid]); else - new_nr = atomic_long_read(&shrinker->nr_deferred[nid]); + new_nr = atomic64_read(&shrinker->nr_deferred[nid]); - trace_mm_shrink_slab_end(shrinker, nid, freed, nr, new_nr, total_scan); + trace_mm_shrink_slab_end(shrinker, nid, freed, deferred_count, new_nr, + scan_count); return freed; } From patchwork Wed Oct 9 03:21:10 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Dave Chinner X-Patchwork-Id: 11180419 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id EBBFC76 for ; Wed, 9 Oct 2019 03:36:48 +0000 (UTC) Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by mail.kernel.org (Postfix) with ESMTP id B9E1D21835 for ; Wed, 9 Oct 2019 03:36:48 +0000 (UTC) DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org B9E1D21835 Authentication-Results: mail.kernel.org; dmarc=none (p=none dis=none) header.from=fromorbit.com Authentication-Results: mail.kernel.org; spf=pass smtp.mailfrom=owner-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix) id DA8BD8E001C; Tue, 8 Oct 2019 23:36:46 -0400 (EDT) Delivered-To: linux-mm-outgoing@kvack.org Received: by kanga.kvack.org (Postfix, from userid 40) id D59B88E0016; Tue, 8 Oct 2019 23:36:46 -0400 (EDT) X-Original-To: int-list-linux-mm@kvack.org X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id C6F3C8E001C; Tue, 8 Oct 2019 23:36:46 -0400 (EDT) X-Original-To: linux-mm@kvack.org X-Delivered-To: linux-mm@kvack.org Received: from forelay.hostedemail.com (smtprelay0136.hostedemail.com [216.40.44.136]) by kanga.kvack.org (Postfix) with ESMTP id 94A108E0016 for ; Tue, 8 Oct 2019 23:36:46 -0400 (EDT) Received: from 
smtpin15.hostedemail.com (10.5.19.251.rfc1918.com [10.5.19.251]) by forelay01.hostedemail.com (Postfix) with SMTP id 3CE2D180AD803 for ; Wed, 9 Oct 2019 03:36:46 +0000 (UTC) X-FDA: 76022834412.15.pail87_1de86187c7f5a X-Spam-Summary: 2,0,0,29e6ee2100cd0331,d41d8cd98f00b204,david@fromorbit.com,:linux-xfs@vger.kernel.org::linux-fsdevel@vger.kernel.org,RULES_HIT:2:41:69:355:379:541:800:960:966:973:988:989:1260:1261:1311:1314:1345:1359:1437:1515:1535:1605:1730:1747:1777:1792:2196:2198:2199:2200:2393:2559:2562:2892:2895:3138:3139:3140:3141:3142:3865:3866:3867:3868:3870:3871:3872:3874:4049:4119:4321:4385:4605:5007:6119:6261:7576:7875:10004:10128:11026:11232:11658:11914:12043:12296:12297:12438:12517:12519:12555:12679:12683:12895:13161:13229:13894:14394:14877:21080:21451:21627:21740:30012:30054:30070:30075,0,RBL:211.29.132.249:@fromorbit.com:.lbl8.mailshell.net-62.8.32.100 66.201.201.201,CacheIP:none,Bayesian:0.5,0.5,0.5,Netcheck:none,DomainCache:0,MSF:not bulk,SPF:fn,MSBL:0,DNSBL:neutral,Custom_rules:0:0:0,LFtime:24,LUA_SUMMARY:none X-HE-Tag: pail87_1de86187c7f5a X-Filterd-Recvd-Size: 8250 Received: from mail105.syd.optusnet.com.au (mail105.syd.optusnet.com.au [211.29.132.249]) by imf37.hostedemail.com (Postfix) with ESMTP for ; Wed, 9 Oct 2019 03:36:45 +0000 (UTC) Received: from dread.disaster.area (pa49-181-226-196.pa.nsw.optusnet.com.au [49.181.226.196]) by mail105.syd.optusnet.com.au (Postfix) with ESMTPS id 4B476363A60 for ; Wed, 9 Oct 2019 14:36:44 +1100 (AEDT) Received: from discord.disaster.area ([192.168.253.110]) by dread.disaster.area with esmtp (Exim 4.92.2) (envelope-from ) id 1iI2XX-0006Bb-1p; Wed, 09 Oct 2019 14:21:27 +1100 Received: from dave by discord.disaster.area with local (Exim 4.92) (envelope-from ) id 1iI2XW-00039T-Vy; Wed, 09 Oct 2019 14:21:26 +1100 From: Dave Chinner To: linux-xfs@vger.kernel.org Cc: linux-mm@kvack.org, linux-fsdevel@vger.kernel.org Subject: [PATCH 12/26] shrinker: clean up variable types and tracepoints Date: Wed, 9 Oct 2019 
14:21:10 +1100 Message-Id: <20191009032124.10541-13-david@fromorbit.com> X-Mailer: git-send-email 2.23.0.rc1 In-Reply-To: <20191009032124.10541-1-david@fromorbit.com> References: <20191009032124.10541-1-david@fromorbit.com> MIME-Version: 1.0 X-Optus-CM-Score: 0 X-Optus-CM-Analysis: v=2.2 cv=P6RKvmIu c=1 sm=1 tr=0 a=dRuLqZ1tmBNts2YiI0zFQg==:117 a=dRuLqZ1tmBNts2YiI0zFQg==:17 a=jpOVt7BSZ2e4Z31A5e1TngXxSK0=:19 a=XobE76Q3jBoA:10 a=20KFwNOVAAAA:8 a=dfQxWFgAP5TgkvwPFjsA:9 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000008, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: From: Dave Chinner The tracepoint information in the shrinker code doesn't make a lot of sense anymore and contains redundant information as a result of the changes in the patchset. Refine the information passed to the tracepoints so they expose the operation of the shrinkers more precisely, and clean up the remaining code and variables in the shrinker code so it all makes sense.
Signed-off-by: Dave Chinner --- include/trace/events/vmscan.h | 69 ++++++++++++++++------------------- mm/vmscan.c | 24 +++++------- 2 files changed, 41 insertions(+), 52 deletions(-) diff --git a/include/trace/events/vmscan.h b/include/trace/events/vmscan.h index a5ab2973e8dc..110637d9efa5 100644 --- a/include/trace/events/vmscan.h +++ b/include/trace/events/vmscan.h @@ -184,84 +184,77 @@ DEFINE_EVENT(mm_vmscan_direct_reclaim_end_template, mm_vmscan_memcg_softlimit_re TRACE_EVENT(mm_shrink_slab_start, TP_PROTO(struct shrinker *shr, struct shrink_control *sc, - long nr_objects_to_shrink, unsigned long cache_items, - unsigned long long delta, unsigned long total_scan, - int priority), + int64_t deferred_count, int64_t freeable_objects, + int64_t scan_count, int priority), - TP_ARGS(shr, sc, nr_objects_to_shrink, cache_items, delta, total_scan, + TP_ARGS(shr, sc, deferred_count, freeable_objects, scan_count, priority), TP_STRUCT__entry( __field(struct shrinker *, shr) __field(void *, shrink) __field(int, nid) - __field(long, nr_objects_to_shrink) - __field(gfp_t, gfp_flags) - __field(unsigned long, cache_items) - __field(unsigned long long, delta) - __field(unsigned long, total_scan) + __field(int64_t, deferred_count) + __field(int64_t, freeable_objects) + __field(int64_t, scan_count) __field(int, priority) + __field(gfp_t, gfp_flags) ), TP_fast_assign( __entry->shr = shr; __entry->shrink = shr->scan_objects; __entry->nid = sc->nid; - __entry->nr_objects_to_shrink = nr_objects_to_shrink; - __entry->gfp_flags = sc->gfp_mask; - __entry->cache_items = cache_items; - __entry->delta = delta; - __entry->total_scan = total_scan; + __entry->deferred_count = deferred_count; + __entry->freeable_objects = freeable_objects; + __entry->scan_count = scan_count; __entry->priority = priority; + __entry->gfp_flags = sc->gfp_mask; ), - TP_printk("%pS %p: nid: %d objects to shrink %ld gfp_flags %s cache items %ld delta %lld total_scan %ld priority %d", + TP_printk("%pS %p: nid: %d scan 
count %lld freeable items %lld deferred count %lld priority %d gfp_flags %s", __entry->shrink, __entry->shr, __entry->nid, - __entry->nr_objects_to_shrink, - show_gfp_flags(__entry->gfp_flags), - __entry->cache_items, - __entry->delta, - __entry->total_scan, - __entry->priority) + __entry->scan_count, + __entry->freeable_objects, + __entry->deferred_count, + __entry->priority, + show_gfp_flags(__entry->gfp_flags)) ); TRACE_EVENT(mm_shrink_slab_end, - TP_PROTO(struct shrinker *shr, int nid, int shrinker_retval, - long unused_scan_cnt, long new_scan_cnt, long total_scan), + TP_PROTO(struct shrinker *shr, int nid, int64_t freed_objects, + int64_t scanned_objects, int64_t deferred_scan), - TP_ARGS(shr, nid, shrinker_retval, unused_scan_cnt, new_scan_cnt, - total_scan), + TP_ARGS(shr, nid, freed_objects, scanned_objects, + deferred_scan), TP_STRUCT__entry( __field(struct shrinker *, shr) __field(int, nid) __field(void *, shrink) - __field(long, unused_scan) - __field(long, new_scan) - __field(int, retval) - __field(long, total_scan) + __field(long long, freed_objects) + __field(long long, scanned_objects) + __field(long long, deferred_scan) ), TP_fast_assign( __entry->shr = shr; __entry->nid = nid; __entry->shrink = shr->scan_objects; - __entry->unused_scan = unused_scan_cnt; - __entry->new_scan = new_scan_cnt; - __entry->retval = shrinker_retval; - __entry->total_scan = total_scan; + __entry->freed_objects = freed_objects; + __entry->scanned_objects = scanned_objects; + __entry->deferred_scan = deferred_scan; ), - TP_printk("%pS %p: nid: %d unused scan count %ld new scan count %ld total_scan %ld last shrinker return val %d", + TP_printk("%pS %p: nid: %d freed objects %lld scanned objects %lld, deferred scan %lld", __entry->shrink, __entry->shr, __entry->nid, - __entry->unused_scan, - __entry->new_scan, - __entry->total_scan, - __entry->retval) + __entry->freed_objects, + __entry->scanned_objects, + __entry->deferred_scan) ); TRACE_EVENT(mm_vmscan_lru_isolate, diff 
--git a/mm/vmscan.c b/mm/vmscan.c index d05f64bd26ff..65093dd89dd7 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -522,7 +522,6 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl, int64_t scanned_objects = 0; int64_t next_deferred = 0; int64_t deferred_count = 0; - int64_t new_nr; int nid = shrinkctl->nid; long batch_size = shrinker->batch ? shrinker->batch : SHRINK_BATCH; @@ -579,8 +578,7 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl, scan_count = min(scan_count, freeable_objects * 2); trace_mm_shrink_slab_start(shrinker, shrinkctl, deferred_count, - freeable_objects, scan_count, - scan_count, priority); + freeable_objects, scan_count, priority); /* * If the shrinker can't run (e.g. due to gfp_mask constraints), then @@ -623,23 +621,21 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl, cond_resched(); } done: + /* + * Calculate the remaining work that we need to defer to kswapd, and + * store it in a manner that handles concurrent updates. If we exhausted + * the scan, there is no need to do an update. + */ if (deferred_count) next_deferred = deferred_count - scanned_objects; else next_deferred = scan_count; - /* - * move the unused scan count back into the shrinker in a - * manner that handles concurrent updates. If we exhausted the - * scan, there is no need to do an update. 
- */ + if (next_deferred > 0) - new_nr = atomic64_add_return(next_deferred, - &shrinker->nr_deferred[nid]); - else - new_nr = atomic64_read(&shrinker->nr_deferred[nid]); + atomic64_add(next_deferred, &shrinker->nr_deferred[nid]); - trace_mm_shrink_slab_end(shrinker, nid, freed, deferred_count, new_nr, - scan_count); + trace_mm_shrink_slab_end(shrinker, nid, freed, scanned_objects, + next_deferred); return freed; } From patchwork Wed Oct 9 03:21:11 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Dave Chinner X-Patchwork-Id: 11180371 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 26F031864 for ; Wed, 9 Oct 2019 03:21:53 +0000 (UTC) Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by mail.kernel.org (Postfix) with ESMTP id E87F120B7C for ; Wed, 9 Oct 2019 03:21:52 +0000 (UTC) DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org E87F120B7C Authentication-Results: mail.kernel.org; dmarc=none (p=none dis=none) header.from=fromorbit.com Authentication-Results: mail.kernel.org; spf=pass smtp.mailfrom=owner-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix) id 8AAB78E000E; Tue, 8 Oct 2019 23:21:33 -0400 (EDT) Delivered-To: linux-mm-outgoing@kvack.org Received: by kanga.kvack.org (Postfix, from userid 40) id 85DA78E000D; Tue, 8 Oct 2019 23:21:33 -0400 (EDT) X-Original-To: int-list-linux-mm@kvack.org X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 612628E0010; Tue, 8 Oct 2019 23:21:33 -0400 (EDT) X-Original-To: linux-mm@kvack.org X-Delivered-To: linux-mm@kvack.org Received: from forelay.hostedemail.com (smtprelay0137.hostedemail.com [216.40.44.137]) by kanga.kvack.org (Postfix) with ESMTP id 2FAA48E0008 for ; Tue, 8 Oct 2019 23:21:33 -0400 (EDT) Received: from smtpin07.hostedemail.com (10.5.19.251.rfc1918.com 
[10.5.19.251]) by forelay03.hostedemail.com (Postfix) with SMTP id C5D6E824CA36 for ; Wed, 9 Oct 2019 03:21:32 +0000 (UTC) Received: from mail105.syd.optusnet.com.au (mail105.syd.optusnet.com.au [211.29.132.249]) by imf24.hostedemail.com (Postfix) with ESMTP for ; Wed, 9 Oct 2019 03:21:31 +0000 (UTC) Received: from dread.disaster.area (pa49-181-226-196.pa.nsw.optusnet.com.au [49.181.226.196]) by mail105.syd.optusnet.com.au (Postfix) with ESMTPS id 559533632B6; Wed, 9 Oct 2019 14:21:27 +1100 (AEDT) Received: from discord.disaster.area ([192.168.253.110]) by dread.disaster.area with esmtp (Exim 4.92.2) (envelope-from ) id 1iI2XX-0006Be-3j; Wed, 09 Oct 2019 14:21:27 +1100 Received: from dave by discord.disaster.area with local (Exim 4.92) (envelope-from ) id 1iI2XX-00039W-1N; Wed, 09 Oct 2019 14:21:27 +1100 From: Dave Chinner To: linux-xfs@vger.kernel.org Cc: linux-mm@kvack.org, linux-fsdevel@vger.kernel.org Subject: [PATCH 13/26] mm: reclaim_state records pages reclaimed, not slabs Date: Wed, 9 Oct 2019 14:21:11 +1100 
Message-Id: <20191009032124.10541-14-david@fromorbit.com> X-Mailer: git-send-email 2.23.0.rc1 In-Reply-To: <20191009032124.10541-1-david@fromorbit.com> References: <20191009032124.10541-1-david@fromorbit.com> MIME-Version: 1.0 X-Optus-CM-Score: 0 X-Optus-CM-Analysis: v=2.2 cv=FNpr/6gs c=1 sm=1 tr=0 a=dRuLqZ1tmBNts2YiI0zFQg==:117 a=dRuLqZ1tmBNts2YiI0zFQg==:17 a=jpOVt7BSZ2e4Z31A5e1TngXxSK0=:19 a=XobE76Q3jBoA:10 a=20KFwNOVAAAA:8 a=avOCubNBGnevqGq14jMA:9 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: From: Dave Chinner Add a wrapper to account for page freeing in shrinker reclaim so that the high level scanning accounts for all the memory freed during a shrinker scan. No logic changes, just replacing open coded checks with a simple wrapper. Signed-off-by: Dave Chinner --- fs/inode.c | 3 +-- fs/xfs/xfs_buf.c | 4 +--- include/linux/swap.h | 20 ++++++++++++++++++-- mm/slab.c | 3 +-- mm/slob.c | 4 +--- mm/slub.c | 3 +-- mm/vmscan.c | 4 ++-- 7 files changed, 25 insertions(+), 16 deletions(-) diff --git a/fs/inode.c b/fs/inode.c index fef457a42882..a77caf216659 100644 --- a/fs/inode.c +++ b/fs/inode.c @@ -764,8 +764,7 @@ static enum lru_status inode_lru_isolate(struct list_head *item, __count_vm_events(KSWAPD_INODESTEAL, reap); else __count_vm_events(PGINODESTEAL, reap); - if (current->reclaim_state) - current->reclaim_state->reclaimed_slab += reap; + current_reclaim_account_pages(reap); } iput(inode); spin_lock(lru_lock); diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c index 45b470f55ad7..bc5e0c712e2e 100644 --- a/fs/xfs/xfs_buf.c +++ b/fs/xfs/xfs_buf.c @@ -324,9 +324,7 @@ xfs_buf_free( __free_page(page); } - if (current->reclaim_state) - current->reclaim_state->reclaimed_slab += - bp->b_page_count; + current_reclaim_account_pages(bp->b_page_count); } else if (bp->b_flags & _XBF_KMEM) kmem_free(bp->b_addr); _xfs_buf_free_pages(bp); diff --git 
a/include/linux/swap.h b/include/linux/swap.h index 063c0c1e112b..72b855fe20b0 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -126,12 +126,28 @@ union swap_header { /* * current->reclaim_state points to one of these when a task is running - * memory reclaim + * memory reclaim. It is typically used by shrinkers to return reclaim + * information back to the main vmscan loop. */ struct reclaim_state { - unsigned long reclaimed_slab; + unsigned long reclaimed_pages; /* pages freed by shrinkers */ }; +/* + * When code frees a page that may be run from a memory reclaim context, it + * needs to account for the pages it frees so memory reclaim can track them. + * Slab memory that is freed is accounted via this mechanism, so this is not + * necessary for slab or heap memory being freed. However, if the object being + * freed frees pages directly, then those pages should be accounted as well when + * in memory reclaim. This helper function takes care accounting for the pages + * being reclaimed when it is required. 
+ */ +static inline void current_reclaim_account_pages(int nr_pages) +{ + if (current->reclaim_state) + current->reclaim_state->reclaimed_pages += nr_pages; +} + #ifdef __KERNEL__ struct address_space; diff --git a/mm/slab.c b/mm/slab.c index 9df370558e5d..05baeda97fef 100644 --- a/mm/slab.c +++ b/mm/slab.c @@ -1395,8 +1395,7 @@ static void kmem_freepages(struct kmem_cache *cachep, struct page *page) page_mapcount_reset(page); page->mapping = NULL; - if (current->reclaim_state) - current->reclaim_state->reclaimed_slab += 1 << order; + current_reclaim_account_pages(1 << order); uncharge_slab_page(page, order, cachep); __free_pages(page, order); } diff --git a/mm/slob.c b/mm/slob.c index fa53e9f73893..c54a7eeee86d 100644 --- a/mm/slob.c +++ b/mm/slob.c @@ -211,9 +211,7 @@ static void slob_free_pages(void *b, int order) { struct page *sp = virt_to_page(b); - if (current->reclaim_state) - current->reclaim_state->reclaimed_slab += 1 << order; - + current_reclaim_account_pages(1 << order); mod_node_page_state(page_pgdat(sp), NR_SLAB_UNRECLAIMABLE, -(1 << order)); __free_pages(sp, order); diff --git a/mm/slub.c b/mm/slub.c index 3d63ae320d31..c79122dd9452 100644 --- a/mm/slub.c +++ b/mm/slub.c @@ -1746,8 +1746,7 @@ static void __free_slab(struct kmem_cache *s, struct page *page) __ClearPageSlab(page); page->mapping = NULL; - if (current->reclaim_state) - current->reclaim_state->reclaimed_slab += pages; + current_reclaim_account_pages(pages); uncharge_slab_page(page, order, s); __free_pages(page, order); } diff --git a/mm/vmscan.c b/mm/vmscan.c index 65093dd89dd7..feea179bcb67 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -2872,8 +2872,8 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc) } while ((memcg = mem_cgroup_iter(root, memcg, NULL))); if (reclaim_state) { - sc->nr_reclaimed += reclaim_state->reclaimed_slab; - reclaim_state->reclaimed_slab = 0; + sc->nr_reclaimed += reclaim_state->reclaimed_pages; + reclaim_state->reclaimed_pages = 0; } /* Record 
the subtree's reclaim efficiency */ From patchwork Wed Oct 9 03:21:12 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Dave Chinner X-Patchwork-Id: 11180279 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 020F418B7 for ; Wed, 9 Oct 2019 03:21:35 +0000 (UTC) Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by mail.kernel.org (Postfix) with ESMTP id CDB7F20B7C for ; Wed, 9 Oct 2019 03:21:34 +0000 (UTC) DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org CDB7F20B7C Authentication-Results: mail.kernel.org; dmarc=none (p=none dis=none) header.from=fromorbit.com Authentication-Results: mail.kernel.org; spf=pass smtp.mailfrom=owner-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix) id 48BF98E0007; Tue, 8 Oct 2019 23:21:31 -0400 (EDT) Delivered-To: linux-mm-outgoing@kvack.org Received: by kanga.kvack.org (Postfix, from userid 40) id 413D28E0006; Tue, 8 Oct 2019 23:21:31 -0400 (EDT) X-Original-To: int-list-linux-mm@kvack.org X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 2B4318E0007; Tue, 8 Oct 2019 23:21:31 -0400 (EDT) X-Original-To: linux-mm@kvack.org X-Delivered-To: linux-mm@kvack.org Received: from forelay.hostedemail.com (smtprelay0106.hostedemail.com [216.40.44.106]) by kanga.kvack.org (Postfix) with ESMTP id 0982E8E0006 for ; Tue, 8 Oct 2019 23:21:31 -0400 (EDT) Received: from smtpin16.hostedemail.com (10.5.19.251.rfc1918.com [10.5.19.251]) by forelay03.hostedemail.com (Postfix) with SMTP id 5F279824CA36 for ; Wed, 9 Oct 2019 03:21:30 +0000 (UTC) Received: from mail105.syd.optusnet.com.au (mail105.syd.optusnet.com.au [211.29.132.249]) by imf48.hostedemail.com (Postfix) with ESMTP for ; Wed, 9 Oct 2019 03:21:29 +0000 (UTC) Received: from dread.disaster.area (pa49-181-226-196.pa.nsw.optusnet.com.au [49.181.226.196]) by mail105.syd.optusnet.com.au (Postfix) with ESMTPS id 404CD362D5A; Wed, 9 Oct 2019 14:21:28 +1100 (AEDT) Received: from discord.disaster.area ([192.168.253.110]) by dread.disaster.area with esmtp (Exim 4.92.2) (envelope-from ) id 1iI2XX-0006Bg-4w; Wed, 09 Oct 2019 14:21:27 +1100 Received: from dave by discord.disaster.area with local (Exim 4.92) (envelope-from ) id 1iI2XX-00039Z-2i; Wed, 09 Oct 2019 14:21:27 +1100 From: Dave Chinner To: linux-xfs@vger.kernel.org Cc: linux-mm@kvack.org, linux-fsdevel@vger.kernel.org Subject: [PATCH 14/26] mm: back off direct reclaim on excessive shrinker deferral Date: Wed, 9 Oct 2019 14:21:12 +1100 Message-Id: <20191009032124.10541-15-david@fromorbit.com> X-Mailer: git-send-email 2.23.0.rc1 In-Reply-To: <20191009032124.10541-1-david@fromorbit.com> References: <20191009032124.10541-1-david@fromorbit.com> MIME-Version: 1.0 
X-Optus-CM-Score: 0 X-Optus-CM-Analysis: v=2.2 cv=D+Q3ErZj c=1 sm=1 tr=0 a=dRuLqZ1tmBNts2YiI0zFQg==:117 a=dRuLqZ1tmBNts2YiI0zFQg==:17 a=jpOVt7BSZ2e4Z31A5e1TngXxSK0=:19 a=XobE76Q3jBoA:10 a=20KFwNOVAAAA:8 a=c3jh6I83BcSAbW0NpfQA:9 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: From: Dave Chinner When the majority of possible shrinker reclaim work is deferred by the shrinkers (e.g. due to GFP_NOFS context), and there is more work deferred than LRU pages were scanned, back off reclaim if there are large amounts of IO in progress. This tends to occur when there are inode cache heavy workloads that have little page cache or application memory pressure on filesystems like XFS. Inode cache heavy workloads involve lots of IO, so if we are getting device congestion it is indicative of memory reclaim running up against an IO throughput limitation. In this situation we need to throttle direct reclaim as we need to wait for kswapd to get some of the deferred work done. However, if there is no device congestion, then the system is keeping up with both the workload and memory reclaim and so there's no need to throttle. Hence we should only back off scanning for a bit if we see this condition and there is block device congestion present. 
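As a rough illustration of the condition described above, the back-off decision reduces to two comparisons. This is a user-space sketch only, not the kernel code: struct reclaim_counts and reclaim_should_back_off() are hypothetical names chosen for the example, and lru_pages_scanned stands in for the number of LRU pages scanned during the current pass.

```c
#include <stdbool.h>

/*
 * Hypothetical user-space model of the counters involved. The first two
 * fields mirror the names used in the description; the struct itself is
 * an assumption made for illustration.
 */
struct reclaim_counts {
	unsigned long deferred_objects;   /* shrinker work that was deferred */
	unsigned long scanned_objects;    /* shrinker work actually done */
	unsigned long lru_pages_scanned;  /* LRU pages scanned in this pass */
};

/*
 * Back off only when more shrinker work was deferred than LRU pages were
 * scanned AND more work was deferred than the shrinkers actually did.
 */
static bool reclaim_should_back_off(const struct reclaim_counts *rc)
{
	return rc->deferred_objects > rc->lru_pages_scanned &&
	       rc->deferred_objects > rc->scanned_objects;
}
```

Even when this predicate is true, the throttle is only meant to take effect if block device congestion is present, so an uncongested system is not slowed down.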
Signed-off-by: Dave Chinner --- include/linux/swap.h | 2 ++ mm/vmscan.c | 30 +++++++++++++++++++++++++++++- 2 files changed, 31 insertions(+), 1 deletion(-) diff --git a/include/linux/swap.h b/include/linux/swap.h index 72b855fe20b0..da0913e14bb9 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -131,6 +131,8 @@ union swap_header { */ struct reclaim_state { unsigned long reclaimed_pages; /* pages freed by shrinkers */ + unsigned long scanned_objects; /* quantity of work done */ + unsigned long deferred_objects; /* work that wasn't done */ }; /* diff --git a/mm/vmscan.c b/mm/vmscan.c index feea179bcb67..fe8e8508f98d 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -569,6 +569,8 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl, deferred_count = min(deferred_count, freeable_objects * 2); } + if (current->reclaim_state) + current->reclaim_state->scanned_objects += scanned_objects; /* * Avoid risking looping forever due to too large nr value: @@ -584,8 +586,11 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl, * If the shrinker can't run (e.g. due to gfp_mask constraints), then * defer the work to a context that can scan the cache. */ - if (shrinkctl->defer_work) + if (shrinkctl->defer_work) { + if (current->reclaim_state) + current->reclaim_state->deferred_objects += scan_count; goto done; + } /* * Normally, we should not scan less than batch_size objects in one @@ -2873,7 +2878,30 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc) if (reclaim_state) { sc->nr_reclaimed += reclaim_state->reclaimed_pages; + + /* + * If we are deferring more work than we are actually + * doing in the shrinkers, and we are scanning more + * objects than we are pages, the we have a large amount + * of slab caches we are deferring work to kswapd for. 
+ * We better back off here for a while, otherwise + * we risk priority windup, swap storms and OOM kills + * once we empty the page lists but still can't make + * progress on the shrinker memory. + * + * kswapd won't ever defer work as it's run under a + * GFP_KERNEL context and can always do work. + */ + if ((reclaim_state->deferred_objects > + sc->nr_scanned - nr_scanned) && + (reclaim_state->deferred_objects > + reclaim_state->scanned_objects)) { + wait_iff_congested(BLK_RW_ASYNC, HZ/50); + } + reclaim_state->reclaimed_pages = 0; + reclaim_state->deferred_objects = 0; + reclaim_state->scanned_objects = 0; } /* Record the subtree's reclaim efficiency */ From patchwork Wed Oct 9 03:21:13 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Dave Chinner X-Patchwork-Id: 11180385 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 80C7B17D4 for ; Wed, 9 Oct 2019 03:21:55 +0000 (UTC) Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by mail.kernel.org (Postfix) with ESMTP id 571CC206C2 for ; Wed, 9 Oct 2019 03:21:55 +0000 (UTC) DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org 571CC206C2 Authentication-Results: mail.kernel.org; dmarc=none (p=none dis=none) header.from=fromorbit.com Authentication-Results: mail.kernel.org; spf=pass smtp.mailfrom=owner-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix) id A99698E0008; Tue, 8 Oct 2019 23:21:33 -0400 (EDT) Delivered-To: linux-mm-outgoing@kvack.org Received: by kanga.kvack.org (Postfix, from userid 40) id 945D48E0010; Tue, 8 Oct 2019 23:21:33 -0400 (EDT) X-Original-To: int-list-linux-mm@kvack.org X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 6D5028E0008; Tue, 8 Oct 2019 23:21:33 -0400 (EDT) X-Original-To: linux-mm@kvack.org X-Delivered-To: linux-mm@kvack.org 
Received: from forelay.hostedemail.com (smtprelay0065.hostedemail.com [216.40.44.65]) by kanga.kvack.org (Postfix) with ESMTP id 1FAB98E000E for ; Tue, 8 Oct 2019 23:21:33 -0400 (EDT) Received: from smtpin20.hostedemail.com (10.5.19.251.rfc1918.com [10.5.19.251]) by forelay04.hostedemail.com (Postfix) with SMTP id B6DE26131 for ; Wed, 9 Oct 2019 03:21:32 +0000 (UTC) Received: from mail105.syd.optusnet.com.au (mail105.syd.optusnet.com.au [211.29.132.249]) by imf34.hostedemail.com (Postfix) with ESMTP for ; Wed, 9 Oct 2019 03:21:31 +0000 (UTC) Received: from dread.disaster.area (pa49-181-226-196.pa.nsw.optusnet.com.au [49.181.226.196]) by mail105.syd.optusnet.com.au (Postfix) with ESMTPS id 56B213632DC; Wed, 9 Oct 2019 14:21:28 +1100 (AEDT) Received: from discord.disaster.area ([192.168.253.110]) by dread.disaster.area with esmtp (Exim 4.92.2) (envelope-from ) id 1iI2XX-0006Bj-61; Wed, 09 Oct 2019 14:21:27 +1100 Received: from dave by discord.disaster.area with local (Exim 4.92) (envelope-from ) id 1iI2XX-00039c-47; Wed, 09 Oct 2019 14:21:27 +1100 From: Dave Chinner To: 
linux-xfs@vger.kernel.org Cc: linux-mm@kvack.org, linux-fsdevel@vger.kernel.org Subject: [PATCH 15/26] mm: kswapd backoff for shrinkers Date: Wed, 9 Oct 2019 14:21:13 +1100 Message-Id: <20191009032124.10541-16-david@fromorbit.com> X-Mailer: git-send-email 2.23.0.rc1 In-Reply-To: <20191009032124.10541-1-david@fromorbit.com> References: <20191009032124.10541-1-david@fromorbit.com> MIME-Version: 1.0 X-Optus-CM-Score: 0 X-Optus-CM-Analysis: v=2.2 cv=FNpr/6gs c=1 sm=1 tr=0 a=dRuLqZ1tmBNts2YiI0zFQg==:117 a=dRuLqZ1tmBNts2YiI0zFQg==:17 a=jpOVt7BSZ2e4Z31A5e1TngXxSK0=:19 a=XobE76Q3jBoA:10 a=20KFwNOVAAAA:8 a=NNMOctoXzqbiiAOzY8AA:9 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: From: Dave Chinner When kswapd reaches the end of the page LRU and starts hitting dirty pages, the logic in shrink_node() allows it to back off and wait for IO to complete, thereby preventing kswapd from scanning excessively and driving the system into swap thrashing and OOM conditions. When we have inode cache heavy workloads on XFS, we have exactly the same problem with reclaim inodes. The non-blocking kswapd reclaim will keep putting pressure onto the inode cache which is unable to make progress. When the system gets to the point where there are no pages in the LRU to free, there is no swap left and there are no clean inodes that can be freed, it will OOM. This has a specific signature in OOM: [ 110.841987] Mem-Info: [ 110.842816] active_anon:241 inactive_anon:82 isolated_anon:1 active_file:168 inactive_file:143 isolated_file:0 unevictable:2621523 dirty:1 writeback:8 unstable:0 slab_reclaimable:564445 slab_unreclaimable:420046 mapped:1042 shmem:11 pagetables:6509 bounce:0 free:77626 free_pcp:2 free_cma:0 In this case, we have about 500-600 pages left in the LRUs, but we have ~565000 reclaimable slab pages still available for reclaim. 
Unfortunately, they are mostly dirty inodes, and so we really need to be able to throttle kswapd when shrinker progress is limited due to reaching the dirty end of the LRU... So, add a flag into the reclaim_state so if the shrinker decides it needs kswapd to back off and wait for a while (for whatever reason) it can do so. Signed-off-by: Dave Chinner --- include/linux/swap.h | 1 + mm/vmscan.c | 10 +++++++++- 2 files changed, 10 insertions(+), 1 deletion(-) diff --git a/include/linux/swap.h b/include/linux/swap.h index da0913e14bb9..76fc28f0e483 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -133,6 +133,7 @@ struct reclaim_state { unsigned long reclaimed_pages; /* pages freed by shrinkers */ unsigned long scanned_objects; /* quantity of work done */ unsigned long deferred_objects; /* work that wasn't done */ + bool need_backoff; /* tell kswapd to slow down */ }; /* diff --git a/mm/vmscan.c b/mm/vmscan.c index fe8e8508f98d..c56a9ac6d042 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -2951,8 +2951,16 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc) * implies that pages are cycling through the LRU * faster than they are written so also forcibly stall. 
*/ - if (sc->nr.immediate) + if (sc->nr.immediate) { congestion_wait(BLK_RW_ASYNC, HZ/10); + } else if (reclaim_state && reclaim_state->need_backoff) { + /* + * Ditto, but it's a slab cache that is cycling + * through the LRU faster than they are written + */ + congestion_wait(BLK_RW_ASYNC, HZ/10); + reclaim_state->need_backoff = false; + } } /* From patchwork Wed Oct 9 03:21:14 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Dave Chinner X-Patchwork-Id: 11180297 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id C42251864 for ; Wed, 9 Oct 2019 03:21:39 +0000 (UTC) Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by mail.kernel.org (Postfix) with ESMTP id 974DE206C2 for ; Wed, 9 Oct 2019 03:21:39 +0000 (UTC) DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org 974DE206C2 Authentication-Results: mail.kernel.org; dmarc=none (p=none dis=none) header.from=fromorbit.com Authentication-Results: mail.kernel.org; spf=pass smtp.mailfrom=owner-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix) id B0DE68E0003; Tue, 8 Oct 2019 23:21:31 -0400 (EDT) Delivered-To: linux-mm-outgoing@kvack.org Received: by kanga.kvack.org (Postfix, from userid 40) id 96EEB8E0009; Tue, 8 Oct 2019 23:21:31 -0400 (EDT) X-Original-To: int-list-linux-mm@kvack.org X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 797E88E0003; Tue, 8 Oct 2019 23:21:31 -0400 (EDT) X-Original-To: linux-mm@kvack.org X-Delivered-To: linux-mm@kvack.org Received: from forelay.hostedemail.com (smtprelay0179.hostedemail.com [216.40.44.179]) by kanga.kvack.org (Postfix) with ESMTP id 551DD8E0008 for ; Tue, 8 Oct 2019 23:21:31 -0400 (EDT) Received: from smtpin02.hostedemail.com (10.5.19.251.rfc1918.com [10.5.19.251]) by forelay05.hostedemail.com (Postfix) with SMTP id 
D8BC0181AC9B4 for ; Wed, 9 Oct 2019 03:21:30 +0000 (UTC) Received: from mail105.syd.optusnet.com.au (mail105.syd.optusnet.com.au [211.29.132.249]) by imf22.hostedemail.com (Postfix) with ESMTP for ; Wed, 9 Oct 2019 03:21:29 +0000 (UTC) Received: from dread.disaster.area (pa49-181-226-196.pa.nsw.optusnet.com.au [49.181.226.196]) by mail105.syd.optusnet.com.au (Postfix) with ESMTPS id 4039F362956; Wed, 9 Oct 2019 14:21:27 +1100 (AEDT) Received: from discord.disaster.area ([192.168.253.110]) by dread.disaster.area with esmtp (Exim 4.92.2) (envelope-from ) id 1iI2XX-0006Bm-7A; Wed, 09 Oct 2019 14:21:27 +1100 Received: from dave by discord.disaster.area with local (Exim 4.92) (envelope-from ) id 1iI2XX-00039f-5C; Wed, 09 Oct 2019 14:21:27 +1100 From: Dave Chinner To: linux-xfs@vger.kernel.org Cc: linux-mm@kvack.org, linux-fsdevel@vger.kernel.org Subject: [PATCH 16/26] xfs: synchronous AIL pushing Date: Wed, 9 Oct 2019 14:21:14 +1100 Message-Id: <20191009032124.10541-17-david@fromorbit.com> X-Mailer: git-send-email 2.23.0.rc1 In-Reply-To: 
<20191009032124.10541-1-david@fromorbit.com> References: <20191009032124.10541-1-david@fromorbit.com> MIME-Version: 1.0 X-Optus-CM-Score: 0 X-Optus-CM-Analysis: v=2.2 cv=P6RKvmIu c=1 sm=1 tr=0 a=dRuLqZ1tmBNts2YiI0zFQg==:117 a=dRuLqZ1tmBNts2YiI0zFQg==:17 a=jpOVt7BSZ2e4Z31A5e1TngXxSK0=:19 a=XobE76Q3jBoA:10 a=20KFwNOVAAAA:8 a=LpVGDu2G9yASKScQ9C8A:9 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: From: Dave Chinner Provide an interface to push the AIL to a target LSN and wait for the tail of the log to move past that LSN. This is used to wait for all items older than a specific LSN to either be cleaned (written back) or relogged to a higher LSN in the AIL. The primary use for this is to allow IO free inode reclaim throttling. Factor the common AIL deletion code that does all the wakeups into a helper so we only have one copy of this somewhat tricky code to interface with all the wakeups necessary when the LSN of the log tail changes. xfs_ail_push_sync() is temporary infrastructure to facilitate non-blocking, IO-less inode reclaim throttling that allows further structural changes to be made. Once those structural changes are made, the need for this function goes away and it is removed, leaving us with only the xfs_ail_update_finish() factoring when this is all done. Signed-off-by: Dave Chinner --- fs/xfs/xfs_trans_ail.c | 33 +++++++++++++++++++++++++++++++++ fs/xfs/xfs_trans_priv.h | 2 ++ 2 files changed, 35 insertions(+) diff --git a/fs/xfs/xfs_trans_ail.c b/fs/xfs/xfs_trans_ail.c index 685a21cd24c0..5e500a75b62b 100644 --- a/fs/xfs/xfs_trans_ail.c +++ b/fs/xfs/xfs_trans_ail.c @@ -662,6 +662,37 @@ xfs_ail_push_all( xfs_ail_push(ailp, threshold_lsn); } +/* + * Push the AIL to a specific lsn and wait for it to complete. 
+ */ +void +xfs_ail_push_sync( + struct xfs_ail *ailp, + xfs_lsn_t threshold_lsn) +{ + struct xfs_log_item *lip; + DEFINE_WAIT(wait); + + spin_lock(&ailp->ail_lock); + while ((lip = xfs_ail_min(ailp)) != NULL) { + prepare_to_wait(&ailp->ail_push, &wait, TASK_UNINTERRUPTIBLE); + if (XFS_FORCED_SHUTDOWN(ailp->ail_mount) || + XFS_LSN_CMP(threshold_lsn, lip->li_lsn) <= 0) + break; + /* XXX: cmpxchg? */ + while (XFS_LSN_CMP(threshold_lsn, ailp->ail_target) > 0) + xfs_trans_ail_copy_lsn(ailp, &ailp->ail_target, &threshold_lsn); + wake_up_process(ailp->ail_task); + spin_unlock(&ailp->ail_lock); + schedule(); + spin_lock(&ailp->ail_lock); + } + spin_unlock(&ailp->ail_lock); + + finish_wait(&ailp->ail_push, &wait); +} + + /* * Push out all items in the AIL immediately and wait until the AIL is empty. */ @@ -702,6 +733,7 @@ xfs_ail_update_finish( if (!XFS_FORCED_SHUTDOWN(mp)) xlog_assign_tail_lsn_locked(mp); + wake_up_all(&ailp->ail_push); if (list_empty(&ailp->ail_head)) wake_up_all(&ailp->ail_empty); spin_unlock(&ailp->ail_lock); @@ -858,6 +890,7 @@ xfs_trans_ail_init( spin_lock_init(&ailp->ail_lock); INIT_LIST_HEAD(&ailp->ail_buf_list); init_waitqueue_head(&ailp->ail_empty); + init_waitqueue_head(&ailp->ail_push); ailp->ail_task = kthread_run(xfsaild, ailp, "xfsaild/%s", ailp->ail_mount->m_fsname); diff --git a/fs/xfs/xfs_trans_priv.h b/fs/xfs/xfs_trans_priv.h index 35655eac01a6..1b6f4bbd47c0 100644 --- a/fs/xfs/xfs_trans_priv.h +++ b/fs/xfs/xfs_trans_priv.h @@ -61,6 +61,7 @@ struct xfs_ail { int ail_log_flush; struct list_head ail_buf_list; wait_queue_head_t ail_empty; + wait_queue_head_t ail_push; }; /* @@ -113,6 +114,7 @@ xfs_trans_ail_remove( } void xfs_ail_push(struct xfs_ail *, xfs_lsn_t); +void xfs_ail_push_sync(struct xfs_ail *, xfs_lsn_t); void xfs_ail_push_all(struct xfs_ail *); void xfs_ail_push_all_sync(struct xfs_ail *); struct xfs_log_item *xfs_ail_min(struct xfs_ail *ailp); From patchwork Wed Oct 9 03:21:15 2019 Content-Type: text/plain; charset="utf-8" 
MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Dave Chinner X-Patchwork-Id: 11180417 From: Dave Chinner To: linux-xfs@vger.kernel.org Cc: linux-mm@kvack.org, linux-fsdevel@vger.kernel.org Subject: [PATCH 17/26] xfs: don't block kswapd in inode reclaim Date: Wed, 9 Oct 2019 14:21:15 +1100 Message-Id: <20191009032124.10541-18-david@fromorbit.com> X-Mailer: git-send-email 2.23.0.rc1 In-Reply-To: <20191009032124.10541-1-david@fromorbit.com> References: <20191009032124.10541-1-david@fromorbit.com> MIME-Version: 1.0 X-Optus-CM-Score: 0 X-Optus-CM-Analysis: v=2.2
cv=D+Q3ErZj c=1 sm=1 tr=0 a=dRuLqZ1tmBNts2YiI0zFQg==:117 a=dRuLqZ1tmBNts2YiI0zFQg==:17 a=jpOVt7BSZ2e4Z31A5e1TngXxSK0=:19 a=XobE76Q3jBoA:10 a=20KFwNOVAAAA:8 a=x2yLRZ3W9cWBXR4-ZBgA:9 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: From: Dave Chinner We have a number of reasons for blocking kswapd in XFS inode reclaim, mostly to do with the fact that memory reclaim has no feedback mechanisms to throttle on dirty slab objects that need IO to reclaim. As a result, we currently throttle inode reclaim by issuing IO in the reclaim context. The unfortunate side effect of this is that it can cause long tail latencies in reclaim and for some workloads this can be a problem. Now that the shrinkers finally have a method of telling kswapd to back off, we can start the process of making inode reclaim in XFS non-blocking. The first thing we need to do is stop blocking kswapd. So that this doesn't cause immediate serious problems, we make sure inode writeback is always underway when kswapd is running. As we don't block kswapd now, we don't have to worry about reclaim scans taking long delays due to IO being issued and waited for. Hence while direct reclaim gets delayed by IO, kswapd will not and so it will keep pushing the AIL to clean inodes. Hence direct reclaim doesn't need to push the AIL anymore as kswapd will do it reliably now.
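As a rough illustration of the policy described above, here is a small userspace sketch. It is not kernel code: plan_inode_reclaim(), struct reclaim_plan and the SKETCH_* flag values are names invented for this example, and the real change is the diff to xfs_reclaim_inodes_nr() in this patch. The sketch only models which of the two throttling strategies each kind of caller ends up with.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Userspace sketch only. The flag values and the struct are illustrative
 * assumptions, not the kernel's definitions.
 */
#define SKETCH_SYNC_WAIT	0x1	/* wait for inode writeback to complete */
#define SKETCH_SYNC_TRYLOCK	0x2	/* never block on inode locks */

struct reclaim_plan {
	int	sync_mode;	/* flags handed to the per-AG reclaim walk */
	bool	push_ail;	/* kick background inode writeback via the AIL */
};

/*
 * kswapd must not block, so it only kicks AIL writeback and scans with
 * trylock semantics. Direct reclaim does not push the AIL but waits on
 * inode writeback, which throttles it to the rate inodes are cleaned.
 */
struct reclaim_plan
plan_inode_reclaim(bool caller_is_kswapd)
{
	struct reclaim_plan plan = {
		.sync_mode = SKETCH_SYNC_TRYLOCK,
		.push_ail = false,
	};

	if (caller_is_kswapd)
		plan.push_ail = true;
	else
		plan.sync_mode |= SKETCH_SYNC_WAIT;

	/* Exactly one of the two strategies is selected for any caller. */
	assert(plan.push_ail != ((plan.sync_mode & SKETCH_SYNC_WAIT) != 0));
	return plan;
}
```

Calling plan_inode_reclaim(true) models the kswapd path (AIL push, no waiting), while plan_inode_reclaim(false) models direct reclaim (no AIL push, wait on writeback). Keeping the decision in one place makes it easy to see that a caller never both pushes the AIL and blocks on IO.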
Signed-off-by: Dave Chinner Reviewed-by: Brian Foster --- fs/xfs/xfs_icache.c | 17 ++++++++++++++--- 1 file changed, 14 insertions(+), 3 deletions(-) diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c index 944add5ff8e0..edcc3f6bb3bf 100644 --- a/fs/xfs/xfs_icache.c +++ b/fs/xfs/xfs_icache.c @@ -1378,11 +1378,22 @@ xfs_reclaim_inodes_nr( struct xfs_mount *mp, int nr_to_scan) { - /* kick background reclaimer and push the AIL */ + int sync_mode = SYNC_TRYLOCK; + + /* kick background reclaimer */ xfs_reclaim_work_queue(mp); - xfs_ail_push_all(mp->m_ail); - return xfs_reclaim_inodes_ag(mp, SYNC_TRYLOCK | SYNC_WAIT, &nr_to_scan); + /* + * For kswapd, we kick background inode writeback. For direct + * reclaim, we issue and wait on inode writeback to throttle + * reclaim rates and avoid shouty OOM-death. + */ + if (current_is_kswapd()) + xfs_ail_push_all(mp->m_ail); + else + sync_mode |= SYNC_WAIT; + + return xfs_reclaim_inodes_ag(mp, sync_mode, &nr_to_scan); } /* From patchwork Wed Oct 9 03:21:16 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Dave Chinner X-Patchwork-Id: 11180415 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 1C95476 for ; Wed, 9 Oct 2019 03:36:45 +0000 (UTC) Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by mail.kernel.org (Postfix) with ESMTP id D01E220B7C for ; Wed, 9 Oct 2019 03:36:44 +0000 (UTC) DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org D01E220B7C Authentication-Results: mail.kernel.org; dmarc=none (p=none dis=none) header.from=fromorbit.com Authentication-Results: mail.kernel.org; spf=pass smtp.mailfrom=owner-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix) id 198A58E001A; Tue, 8 Oct 2019 23:36:44 -0400 (EDT) Delivered-To: linux-mm-outgoing@kvack.org Received: by kanga.kvack.org (Postfix, from userid 40) id 
149B18E0016; Tue, 8 Oct 2019 23:36:44 -0400 (EDT) Received: from discord.disaster.area ([192.168.253.110])
by dread.disaster.area with esmtp (Exim 4.92.2) (envelope-from ) id 1iI2XX-0006Bs-9i; Wed, 09 Oct 2019 14:21:27 +1100 Received: from dave by discord.disaster.area with local (Exim 4.92) (envelope-from ) id 1iI2XX-00039l-7Z; Wed, 09 Oct 2019 14:21:27 +1100 From: Dave Chinner To: linux-xfs@vger.kernel.org Cc: linux-mm@kvack.org, linux-fsdevel@vger.kernel.org Subject: [PATCH 18/26] xfs: reduce kswapd blocking on inode locking. Date: Wed, 9 Oct 2019 14:21:16 +1100 Message-Id: <20191009032124.10541-19-david@fromorbit.com> X-Mailer: git-send-email 2.23.0.rc1 In-Reply-To: <20191009032124.10541-1-david@fromorbit.com> References: <20191009032124.10541-1-david@fromorbit.com> MIME-Version: 1.0 X-Optus-CM-Score: 0 X-Optus-CM-Analysis: v=2.2 cv=D+Q3ErZj c=1 sm=1 tr=0 a=dRuLqZ1tmBNts2YiI0zFQg==:117 a=dRuLqZ1tmBNts2YiI0zFQg==:17 a=jpOVt7BSZ2e4Z31A5e1TngXxSK0=:19 a=XobE76Q3jBoA:10 a=20KFwNOVAAAA:8 a=KE6An8oM74Ymw0apzXAA:9 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: From: Dave Chinner When doing async inode reclaim, we grab a batch of inodes that we are likely to be able to reclaim and ignore those that are already flushing. However, when we actually go to reclaim them, the first thing we do is lock the inode. If we are racing with something else reclaiming the inode or flushing it because it is dirty, we block on the inode lock. Hence we can still block kswapd here. Further, if we flush an inode, we also cluster all the other dirty inodes in that cluster into the same IO, flush locking them all. However, if the workload is operating on sequential inodes (e.g. created by a tarball extraction) most of these inodes will be sequential in the cache and so in the same batch we've already grabbed for reclaim scanning.
As a result, it is common for all the inodes in the batch to be dirty and it is common for the first inode flushed to also flush all the inodes in the reclaim batch. In which case, they are now all going to be flush locked and we do not want to block on them. Hence, for async reclaim (SYNC_TRYLOCK) make sure we always use trylock semantics and abort reclaim of an inode as quickly as we can without blocking kswapd. This will be necessary for the upcoming conversion to LRU lists for inode reclaim tracking. Found via tracing and finding big batches of repeated lock/unlock runs on inodes that we just flushed by write clustering during reclaim. Signed-off-by: Dave Chinner Reviewed-by: Christoph Hellwig --- fs/xfs/xfs_icache.c | 23 ++++++++++++++++++----- 1 file changed, 18 insertions(+), 5 deletions(-) diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c index edcc3f6bb3bf..189cf423fe8f 100644 --- a/fs/xfs/xfs_icache.c +++ b/fs/xfs/xfs_icache.c @@ -1104,11 +1104,23 @@ xfs_reclaim_inode( restart: error = 0; - xfs_ilock(ip, XFS_ILOCK_EXCL); - if (!xfs_iflock_nowait(ip)) { - if (!(sync_mode & SYNC_WAIT)) + /* + * Don't try to flush the inode if another inode in this cluster has + * already flushed it after we did the initial checks in + * xfs_reclaim_inode_grab(). + */ + if (sync_mode & SYNC_TRYLOCK) { + if (!xfs_ilock_nowait(ip, XFS_ILOCK_EXCL)) goto out; - xfs_iflock(ip); + if (!xfs_iflock_nowait(ip)) + goto out_unlock; + } else { + xfs_ilock(ip, XFS_ILOCK_EXCL); + if (!xfs_iflock_nowait(ip)) { + if (!(sync_mode & SYNC_WAIT)) + goto out_unlock; + xfs_iflock(ip); + } } if (XFS_FORCED_SHUTDOWN(ip->i_mount)) { @@ -1215,9 +1227,10 @@ xfs_reclaim_inode( out_ifunlock: xfs_ifunlock(ip); +out_unlock: + xfs_iunlock(ip, XFS_ILOCK_EXCL); out: xfs_iflags_clear(ip, XFS_IRECLAIM); - xfs_iunlock(ip, XFS_ILOCK_EXCL); /* * We could return -EAGAIN here to make reclaim rescan the inode tree in * a short while. 
However, this just burns CPU time scanning the tree From patchwork Wed Oct 9 03:21:17 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Dave Chinner X-Patchwork-Id: 11180423 From: Dave Chinner To: linux-xfs@vger.kernel.org Cc: linux-mm@kvack.org, linux-fsdevel@vger.kernel.org Subject: [PATCH 19/26] xfs: kill background reclaim work Date: Wed, 9 Oct 2019 14:21:17 +1100 Message-Id: <20191009032124.10541-20-david@fromorbit.com> X-Mailer: git-send-email 2.23.0.rc1 In-Reply-To: <20191009032124.10541-1-david@fromorbit.com> References:
<20191009032124.10541-1-david@fromorbit.com> MIME-Version: 1.0 X-Optus-CM-Score: 0 X-Optus-CM-Analysis: v=2.2 cv=FNpr/6gs c=1 sm=1 tr=0 a=dRuLqZ1tmBNts2YiI0zFQg==:117 a=dRuLqZ1tmBNts2YiI0zFQg==:17 a=jpOVt7BSZ2e4Z31A5e1TngXxSK0=:19 a=XobE76Q3jBoA:10 a=20KFwNOVAAAA:8 a=1xoyCpcK-Ekt5S4qF2sA:9 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: From: Dave Chinner This function is now entirely done by kswapd, so we don't need the worker thread to do async reclaim anymore. Signed-off-by: Dave Chinner Reviewed-by: Christoph Hellwig --- fs/xfs/xfs_icache.c | 44 -------------------------------------------- fs/xfs/xfs_icache.h | 2 -- fs/xfs/xfs_mount.c | 2 -- fs/xfs/xfs_mount.h | 2 -- fs/xfs/xfs_super.c | 11 +---------- 5 files changed, 1 insertion(+), 60 deletions(-) diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c index 189cf423fe8f..7e175304e146 100644 --- a/fs/xfs/xfs_icache.c +++ b/fs/xfs/xfs_icache.c @@ -138,44 +138,6 @@ xfs_inode_free( __xfs_inode_free(ip); } -/* - * Queue a new inode reclaim pass if there are reclaimable inodes and there - * isn't a reclaim pass already in progress. By default it runs every 5s based - * on the xfs periodic sync default of 30s. Perhaps this should have it's own - * tunable, but that can be done if this method proves to be ineffective or too - * aggressive. - */ -static void -xfs_reclaim_work_queue( - struct xfs_mount *mp) -{ - - rcu_read_lock(); - if (radix_tree_tagged(&mp->m_perag_tree, XFS_ICI_RECLAIM_TAG)) { - queue_delayed_work(mp->m_reclaim_workqueue, &mp->m_reclaim_work, - msecs_to_jiffies(xfs_syncd_centisecs / 6 * 10)); - } - rcu_read_unlock(); -} - -/* - * This is a fast pass over the inode cache to try to get reclaim moving on as - * many inodes as possible in a short period of time. It kicks itself every few - * seconds, as well as being kicked by the inode cache shrinker when memory - * goes low. 
It scans as quickly as possible avoiding locked inodes or those - * already being flushed, and once done schedules a future pass. - */ -void -xfs_reclaim_worker( - struct work_struct *work) -{ - struct xfs_mount *mp = container_of(to_delayed_work(work), - struct xfs_mount, m_reclaim_work); - - xfs_reclaim_inodes(mp, SYNC_TRYLOCK); - xfs_reclaim_work_queue(mp); -} - static void xfs_perag_set_reclaim_tag( struct xfs_perag *pag) @@ -192,9 +154,6 @@ xfs_perag_set_reclaim_tag( XFS_ICI_RECLAIM_TAG); spin_unlock(&mp->m_perag_lock); - /* schedule periodic background inode reclaim */ - xfs_reclaim_work_queue(mp); - trace_xfs_perag_set_reclaim(mp, pag->pag_agno, -1, _RET_IP_); } @@ -1393,9 +1352,6 @@ xfs_reclaim_inodes_nr( { int sync_mode = SYNC_TRYLOCK; - /* kick background reclaimer */ - xfs_reclaim_work_queue(mp); - /* * For kswapd, we kick background inode writeback. For direct * reclaim, we issue and wait on inode writeback to throttle diff --git a/fs/xfs/xfs_icache.h b/fs/xfs/xfs_icache.h index 48f1fd2bb6ad..4c0d8920cc54 100644 --- a/fs/xfs/xfs_icache.h +++ b/fs/xfs/xfs_icache.h @@ -49,8 +49,6 @@ int xfs_iget(struct xfs_mount *mp, struct xfs_trans *tp, xfs_ino_t ino, struct xfs_inode * xfs_inode_alloc(struct xfs_mount *mp, xfs_ino_t ino); void xfs_inode_free(struct xfs_inode *ip); -void xfs_reclaim_worker(struct work_struct *work); - int xfs_reclaim_inodes(struct xfs_mount *mp, int mode); int xfs_reclaim_inodes_count(struct xfs_mount *mp); long xfs_reclaim_inodes_nr(struct xfs_mount *mp, int nr_to_scan); diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c index ba5b6f3b2b88..ecbc21af9100 100644 --- a/fs/xfs/xfs_mount.c +++ b/fs/xfs/xfs_mount.c @@ -988,7 +988,6 @@ xfs_mountfs( * qm_unmount_quotas and therefore rely on qm_unmount to release the * quota inodes. */ - cancel_delayed_work_sync(&mp->m_reclaim_work); xfs_reclaim_inodes(mp, SYNC_WAIT); xfs_health_unmount(mp); out_log_dealloc: @@ -1071,7 +1070,6 @@ xfs_unmountfs( * reclaim just to be sure. 
We can stop background inode reclaim * here as well if it is still running. */ - cancel_delayed_work_sync(&mp->m_reclaim_work); xfs_reclaim_inodes(mp, SYNC_WAIT); xfs_health_unmount(mp); diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h index fdb60e09a9c5..f0cc952ad527 100644 --- a/fs/xfs/xfs_mount.h +++ b/fs/xfs/xfs_mount.h @@ -165,7 +165,6 @@ typedef struct xfs_mount { uint m_chsize; /* size of next field */ atomic_t m_active_trans; /* number trans frozen */ struct xfs_mru_cache *m_filestream; /* per-mount filestream data */ - struct delayed_work m_reclaim_work; /* background inode reclaim */ struct delayed_work m_eofblocks_work; /* background eof blocks trimming */ struct delayed_work m_cowblocks_work; /* background cow blocks @@ -182,7 +181,6 @@ typedef struct xfs_mount { struct workqueue_struct *m_buf_workqueue; struct workqueue_struct *m_unwritten_workqueue; struct workqueue_struct *m_cil_workqueue; - struct workqueue_struct *m_reclaim_workqueue; struct workqueue_struct *m_eofblocks_workqueue; struct workqueue_struct *m_sync_workqueue; diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c index f0aff1f034e6..74767e6f48a7 100644 --- a/fs/xfs/xfs_super.c +++ b/fs/xfs/xfs_super.c @@ -823,15 +823,10 @@ xfs_init_mount_workqueues( if (!mp->m_cil_workqueue) goto out_destroy_unwritten; - mp->m_reclaim_workqueue = alloc_workqueue("xfs-reclaim/%s", - WQ_MEM_RECLAIM|WQ_FREEZABLE, 0, mp->m_fsname); - if (!mp->m_reclaim_workqueue) - goto out_destroy_cil; - mp->m_eofblocks_workqueue = alloc_workqueue("xfs-eofblocks/%s", WQ_MEM_RECLAIM|WQ_FREEZABLE, 0, mp->m_fsname); if (!mp->m_eofblocks_workqueue) - goto out_destroy_reclaim; + goto out_destroy_cil; mp->m_sync_workqueue = alloc_workqueue("xfs-sync/%s", WQ_FREEZABLE, 0, mp->m_fsname); @@ -842,8 +837,6 @@ xfs_init_mount_workqueues( out_destroy_eofb: destroy_workqueue(mp->m_eofblocks_workqueue); -out_destroy_reclaim: - destroy_workqueue(mp->m_reclaim_workqueue); out_destroy_cil: destroy_workqueue(mp->m_cil_workqueue); 
out_destroy_unwritten: @@ -860,7 +853,6 @@ xfs_destroy_mount_workqueues( { destroy_workqueue(mp->m_sync_workqueue); destroy_workqueue(mp->m_eofblocks_workqueue); - destroy_workqueue(mp->m_reclaim_workqueue); destroy_workqueue(mp->m_cil_workqueue); destroy_workqueue(mp->m_unwritten_workqueue); destroy_workqueue(mp->m_buf_workqueue); @@ -1558,7 +1550,6 @@ xfs_mount_alloc( spin_lock_init(&mp->m_perag_lock); mutex_init(&mp->m_growlock); atomic_set(&mp->m_active_trans, 0); - INIT_DELAYED_WORK(&mp->m_reclaim_work, xfs_reclaim_worker); INIT_DELAYED_WORK(&mp->m_eofblocks_work, xfs_eofblocks_worker); INIT_DELAYED_WORK(&mp->m_cowblocks_work, xfs_cowblocks_worker); mp->m_kobj.kobject.kset = xfs_kset; From patchwork Wed Oct 9 03:21:18 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Dave Chinner X-Patchwork-Id: 11180293 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 6721D18B7 for ; Wed, 9 Oct 2019 03:21:37 +0000 (UTC) Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by mail.kernel.org (Postfix) with ESMTP id 247E520B7C for ; Wed, 9 Oct 2019 03:21:37 +0000 (UTC) DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org 247E520B7C Authentication-Results: mail.kernel.org; dmarc=none (p=none dis=none) header.from=fromorbit.com Authentication-Results: mail.kernel.org; spf=pass smtp.mailfrom=owner-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix) id 8AB148E0006; Tue, 8 Oct 2019 23:21:31 -0400 (EDT) Delivered-To: linux-mm-outgoing@kvack.org Received: by kanga.kvack.org (Postfix, from userid 40) id 859398E0008; Tue, 8 Oct 2019 23:21:31 -0400 (EDT) X-Original-To: int-list-linux-mm@kvack.org X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 721628E0006; Tue, 8 Oct 2019 23:21:31 -0400 (EDT) X-Original-To: linux-mm@kvack.org 
X-Delivered-To: linux-mm@kvack.org Received: from discord.disaster.area ([192.168.253.110]) by dread.disaster.area with esmtp (Exim 4.92.2) (envelope-from ) id 1iI2XX-0006By-CK; Wed, 09 Oct 2019 14:21:27
+1100 Received: from dave by discord.disaster.area with local (Exim 4.92) (envelope-from ) id 1iI2XX-00039r-Ac; Wed, 09 Oct 2019 14:21:27 +1100 From: Dave Chinner To: linux-xfs@vger.kernel.org Cc: linux-mm@kvack.org, linux-fsdevel@vger.kernel.org Subject: [PATCH 20/26] xfs: use AIL pushing for inode reclaim IO Date: Wed, 9 Oct 2019 14:21:18 +1100 Message-Id: <20191009032124.10541-21-david@fromorbit.com> X-Mailer: git-send-email 2.23.0.rc1 In-Reply-To: <20191009032124.10541-1-david@fromorbit.com> References: <20191009032124.10541-1-david@fromorbit.com> MIME-Version: 1.0 X-Optus-CM-Score: 0 X-Optus-CM-Analysis: v=2.2 cv=FNpr/6gs c=1 sm=1 tr=0 a=dRuLqZ1tmBNts2YiI0zFQg==:117 a=dRuLqZ1tmBNts2YiI0zFQg==:17 a=jpOVt7BSZ2e4Z31A5e1TngXxSK0=:19 a=XobE76Q3jBoA:10 a=20KFwNOVAAAA:8 a=1ffGU6BmlR_CJC1Va-MA:9 a=C1E4cXzGdla5xFYO:21 a=f_UNZQBlpuBpCn4x:21 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: From: Dave Chinner Inode reclaim currently issues its own inode IO when it comes across dirty inodes. This is used to throttle direct reclaim down to the rate at which we can reclaim dirty inodes. Failure to throttle in this manner results in the OOM killer being trivial to trigger even when there is lots of free memory available. However, having direct reclaimers issue IO causes an amount of IO thrashing to occur. We can have up to the number of AGs in the filesystem concurrently issuing IO, plus the AIL pushing thread as well. This means we can have many competing sources of IO and they all end up thrashing and competing for the request slots in the block device. Similar to dirty page throttling and the BDI flusher thread, we can use the AIL pushing thread as the sole place we issue inode writeback from and everything else waits for it to make progress. To do this, reclaim will skip over dirty inodes, but in doing so will record the lowest LSN of all the dirty inodes it skips.
It will then push the AIL to this LSN and wait for it to complete that work. In doing so, we block direct reclaim on the completion of at least one IO, thereby providing some level of throttling for when we encounter dirty inodes. However, we gain the ability to scan and reclaim clean inodes in a non-blocking fashion. This allows us to remove all the per-ag reclaim locking that avoids excessive direct reclaim, as repeated concurrent direct reclaim will hit the same dirty inodes and block waiting on the same IO to complete. Hence direct reclaim will be throttled directly by the rate at which dirty inodes are cleaned by AIL pushing, rather than by delays caused by competing IO submissions. This allows us to remove all the locking that limits direct reclaim concurrency and greatly simplifies the inode reclaim code now that it just skips dirty inodes. Note: this patch by itself isn't completely able to throttle direct reclaim sufficiently to prevent OOM killer madness. We can't do that until we change the way we index reclaimable inodes in the next patch and can feed back state to the mm core sanely. However, we can't change the way we index reclaimable inodes until we have IO-less non-blocking reclaim for both direct reclaim and kswapd reclaim. Catch-22... Signed-off-by: Dave Chinner --- fs/xfs/xfs_icache.c | 215 +++++++++++++++++++------------------------- 1 file changed, 90 insertions(+), 125 deletions(-) diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c index 7e175304e146..ed996b37bda0 100644 --- a/fs/xfs/xfs_icache.c +++ b/fs/xfs/xfs_icache.c @@ -22,6 +22,7 @@ #include "xfs_dquot_item.h" #include "xfs_dquot.h" #include "xfs_reflink.h" +#include "xfs_log.h" #include @@ -967,28 +968,42 @@ xfs_inode_ag_iterator_tag( } /* - * Grab the inode for reclaim exclusively. - * Return 0 if we grabbed it, non-zero otherwise. + * Grab the inode for reclaim. + * + * Return false if we aren't going to reclaim it, true if it is a reclaim + * candidate.
+ * + * If the inode is clean or unreclaimable, return 0 to tell the caller it does + * not require flushing. Otherwise return the log item lsn of the inode so the + * caller can determine its inode flush target. If we get the clean/dirty + * state wrong then it will be sorted in xfs_reclaim_inode() once we have locks + * held. */ -STATIC int +STATIC bool xfs_reclaim_inode_grab( struct xfs_inode *ip, - int flags) + int flags, + xfs_lsn_t *lsn) { ASSERT(rcu_read_lock_held()); + *lsn = 0; /* quick check for stale RCU freed inode */ if (!ip->i_ino) - return 1; + return false; /* - * If we are asked for non-blocking operation, do unlocked checks to - * see if the inode already is being flushed or in reclaim to avoid - * lock traffic. + * Do unlocked checks to see if the inode already is being flushed or in + * reclaim to avoid lock traffic. If the inode is not clean, return + * its position in the AIL for the caller to push to. */ - if ((flags & SYNC_TRYLOCK) && - __xfs_iflags_test(ip, XFS_IFLOCK | XFS_IRECLAIM)) - return 1; + if (!xfs_inode_clean(ip)) { + *lsn = ip->i_itemp->ili_item.li_lsn; + return false; + } + + if (__xfs_iflags_test(ip, XFS_IFLOCK | XFS_IRECLAIM)) + return false; /* * The radix tree lock here protects a thread in xfs_iget from racing @@ -1005,11 +1020,11 @@ xfs_reclaim_inode_grab( __xfs_iflags_test(ip, XFS_IRECLAIM)) { /* not a reclaim candidate. */ spin_unlock(&ip->i_flags_lock); - return 1; + return false; } __xfs_iflags_set(ip, XFS_IRECLAIM); spin_unlock(&ip->i_flags_lock); - return 0; + return true; } /* @@ -1050,92 +1065,64 @@ xfs_reclaim_inode_grab( * clean => reclaim * dirty, async => requeue * dirty, sync => flush, wait and reclaim + * + * Returns true if the inode was reclaimed, false otherwise.
*/ -STATIC int +STATIC bool xfs_reclaim_inode( struct xfs_inode *ip, struct xfs_perag *pag, - int sync_mode) + xfs_lsn_t *lsn) { - struct xfs_buf *bp = NULL; - xfs_ino_t ino = ip->i_ino; /* for radix_tree_delete */ - int error; + xfs_ino_t ino; + + *lsn = 0; -restart: - error = 0; /* * Don't try to flush the inode if another inode in this cluster has * already flushed it after we did the initial checks in * xfs_reclaim_inode_grab(). */ - if (sync_mode & SYNC_TRYLOCK) { - if (!xfs_ilock_nowait(ip, XFS_ILOCK_EXCL)) - goto out; - if (!xfs_iflock_nowait(ip)) - goto out_unlock; - } else { - xfs_ilock(ip, XFS_ILOCK_EXCL); - if (!xfs_iflock_nowait(ip)) { - if (!(sync_mode & SYNC_WAIT)) - goto out_unlock; - xfs_iflock(ip); - } - } + if (!xfs_ilock_nowait(ip, XFS_ILOCK_EXCL)) + goto out; + if (!xfs_iflock_nowait(ip)) + goto out_unlock; + /* If we are in shutdown, we don't care about blocking. */ if (XFS_FORCED_SHUTDOWN(ip->i_mount)) { xfs_iunpin_wait(ip); /* xfs_iflush_abort() drops the flush lock */ xfs_iflush_abort(ip, false); goto reclaim; } - if (xfs_ipincount(ip)) { - if (!(sync_mode & SYNC_WAIT)) - goto out_ifunlock; - xfs_iunpin_wait(ip); - } - if (xfs_iflags_test(ip, XFS_ISTALE) || xfs_inode_clean(ip)) { - xfs_ifunlock(ip); - goto reclaim; - } /* - * Never flush out dirty data during non-blocking reclaim, as it would - * just contend with AIL pushing trying to do the same job. + * If it is pinned, we don't have an LSN we can push the AIL to - just + * an LSN that we can push the CIL with. We don't want to block doing + * that, so we'll just skip over this one without triggering writeback + * for now. */ - if (!(sync_mode & SYNC_WAIT)) + if (xfs_ipincount(ip)) goto out_ifunlock; /* - * Now we have an inode that needs flushing. - * - * Note that xfs_iflush will never block on the inode buffer lock, as - * xfs_ifree_cluster() can lock the inode buffer before it locks the - * ip->i_lock, and we are doing the exact opposite here. 
As a result, - * doing a blocking xfs_imap_to_bp() to get the cluster buffer would - * result in an ABBA deadlock with xfs_ifree_cluster(). - * - * As xfs_ifree_cluser() must gather all inodes that are active in the - * cache to mark them stale, if we hit this case we don't actually want - * to do IO here - we want the inode marked stale so we can simply - * reclaim it. Hence if we get an EAGAIN error here, just unlock the - * inode, back off and try again. Hopefully the next pass through will - * see the stale flag set on the inode. + * Dirty inode we didn't catch, skip it. */ - error = xfs_iflush(ip, &bp); - if (error == -EAGAIN) { - xfs_iunlock(ip, XFS_ILOCK_EXCL); - /* backoff longer than in xfs_ifree_cluster */ - delay(2); - goto restart; + if (!xfs_inode_clean(ip) && !xfs_iflags_test(ip, XFS_ISTALE)) { + *lsn = ip->i_itemp->ili_item.li_lsn; + goto out_ifunlock; } - if (!error) { - error = xfs_bwrite(bp); - xfs_buf_relse(bp); - } + /* + * It's clean, we have it locked, we can now drop the flush lock + * and reclaim it. + */ + xfs_ifunlock(ip); reclaim: ASSERT(!xfs_isiflocked(ip)); + ASSERT(xfs_inode_clean(ip) || xfs_iflags_test(ip, XFS_ISTALE)); + ASSERT(ip->i_ino != 0); /* * Because we use RCU freeing we need to ensure the inode always appears @@ -1148,6 +1135,7 @@ xfs_reclaim_inode( * will see an invalid inode that it can skip. */ spin_lock(&ip->i_flags_lock); + ino = ip->i_ino; /* for radix_tree_delete */ ip->i_flags = XFS_IRECLAIM; ip->i_ino = 0; spin_unlock(&ip->i_flags_lock); @@ -1182,7 +1170,7 @@ xfs_reclaim_inode( xfs_iunlock(ip, XFS_ILOCK_EXCL); __xfs_inode_free(ip); - return error; + return true; out_ifunlock: xfs_ifunlock(ip); @@ -1190,14 +1178,7 @@ xfs_reclaim_inode( xfs_iunlock(ip, XFS_ILOCK_EXCL); out: xfs_iflags_clear(ip, XFS_IRECLAIM); - /* - * We could return -EAGAIN here to make reclaim rescan the inode tree in - * a short while. 
However, this just burns CPU time scanning the tree - * waiting for IO to complete and the reclaim work never goes back to - * the idle state. Instead, return 0 to let the next scheduled - * background reclaim attempt to reclaim the inode again. - */ - return 0; + return false; } /* @@ -1205,44 +1186,34 @@ xfs_reclaim_inode( * corrupted, we still want to try to reclaim all the inodes. If we don't, * then a shut down during filesystem unmount reclaim walk leak all the * unreclaimed inodes. + * + * Return the number of inodes freed. */ STATIC int xfs_reclaim_inodes_ag( struct xfs_mount *mp, int flags, - int *nr_to_scan) + int nr_to_scan) { struct xfs_perag *pag; - int error = 0; - int last_error = 0; xfs_agnumber_t ag; - int trylock = flags & SYNC_TRYLOCK; - int skipped; + xfs_lsn_t lsn, lowest_lsn = NULLCOMMITLSN; + long freed = 0; -restart: ag = 0; - skipped = 0; while ((pag = xfs_perag_get_tag(mp, ag, XFS_ICI_RECLAIM_TAG))) { unsigned long first_index = 0; int done = 0; int nr_found = 0; ag = pag->pag_agno + 1; - - if (trylock) { - if (!mutex_trylock(&pag->pag_ici_reclaim_lock)) { - skipped++; - xfs_perag_put(pag); - continue; - } - first_index = pag->pag_ici_reclaim_cursor; - } else - mutex_lock(&pag->pag_ici_reclaim_lock); - do { struct xfs_inode *batch[XFS_LOOKUP_BATCH]; int i; + mutex_lock(&pag->pag_ici_reclaim_lock); + first_index = pag->pag_ici_reclaim_cursor; + rcu_read_lock(); nr_found = radix_tree_gang_lookup_tag( &pag->pag_ici_root, @@ -1262,9 +1233,13 @@ xfs_reclaim_inodes_ag( for (i = 0; i < nr_found; i++) { struct xfs_inode *ip = batch[i]; - if (done || xfs_reclaim_inode_grab(ip, flags)) + if (done || + !xfs_reclaim_inode_grab(ip, flags, &lsn)) batch[i] = NULL; + if (lsn && XFS_LSN_CMP(lsn, lowest_lsn) < 0) + lowest_lsn = lsn; + /* * Update the index for the next lookup. Catch * overflows into the next AG range which can @@ -1289,41 +1264,33 @@ xfs_reclaim_inodes_ag( /* unlock now we've grabbed the inodes. 
*/ rcu_read_unlock(); + if (!done) + pag->pag_ici_reclaim_cursor = first_index; + else + pag->pag_ici_reclaim_cursor = 0; + mutex_unlock(&pag->pag_ici_reclaim_lock); for (i = 0; i < nr_found; i++) { if (!batch[i]) continue; - error = xfs_reclaim_inode(batch[i], pag, flags); - if (error && last_error != -EFSCORRUPTED) - last_error = error; + if (xfs_reclaim_inode(batch[i], pag, &lsn)) + freed++; + if (lsn && XFS_LSN_CMP(lsn, lowest_lsn) < 0) + lowest_lsn = lsn; } - *nr_to_scan -= XFS_LOOKUP_BATCH; - + nr_to_scan -= XFS_LOOKUP_BATCH; cond_resched(); - } while (nr_found && !done && *nr_to_scan > 0); + } while (nr_found && !done && nr_to_scan > 0); - if (trylock && !done) - pag->pag_ici_reclaim_cursor = first_index; - else - pag->pag_ici_reclaim_cursor = 0; - mutex_unlock(&pag->pag_ici_reclaim_lock); xfs_perag_put(pag); } - /* - * if we skipped any AG, and we still have scan count remaining, do - * another pass this time using blocking reclaim semantics (i.e - * waiting on the reclaim locks and ignoring the reclaim cursors). This - * ensure that when we get more reclaimers than AGs we block rather - * than spin trying to execute reclaim. - */ - if (skipped && (flags & SYNC_WAIT) && *nr_to_scan > 0) { - trylock = 0; - goto restart; - } - return last_error; + if ((flags & SYNC_WAIT) && lowest_lsn != NULLCOMMITLSN) + xfs_ail_push_sync(mp->m_ail, lowest_lsn); + + return freed; } int @@ -1331,9 +1298,7 @@ xfs_reclaim_inodes( xfs_mount_t *mp, int mode) { - int nr_to_scan = INT_MAX; - - return xfs_reclaim_inodes_ag(mp, mode, &nr_to_scan); + return xfs_reclaim_inodes_ag(mp, mode, INT_MAX); } /* @@ -1350,7 +1315,7 @@ xfs_reclaim_inodes_nr( struct xfs_mount *mp, int nr_to_scan) { - int sync_mode = SYNC_TRYLOCK; + int sync_mode = 0; /* * For kswapd, we kick background inode writeback. 
For direct @@ -1362,7 +1327,7 @@ xfs_reclaim_inodes_nr( else sync_mode |= SYNC_WAIT; - return xfs_reclaim_inodes_ag(mp, sync_mode, &nr_to_scan); + return xfs_reclaim_inodes_ag(mp, sync_mode, nr_to_scan); } /* From patchwork Wed Oct 9 03:21:19 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Dave Chinner X-Patchwork-Id: 11180309 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 4713A1864 for ; Wed, 9 Oct 2019 03:21:42 +0000 (UTC) Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by mail.kernel.org (Postfix) with ESMTP id 1E35520B7C for ; Wed, 9 Oct 2019 03:21:42 +0000 (UTC) DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org 1E35520B7C Authentication-Results: mail.kernel.org; dmarc=none (p=none dis=none) header.from=fromorbit.com Authentication-Results: mail.kernel.org; spf=pass smtp.mailfrom=owner-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix) id 995948E000B; Tue, 8 Oct 2019 23:21:32 -0400 (EDT) Delivered-To: linux-mm-outgoing@kvack.org Received: by kanga.kvack.org (Postfix, from userid 40) id 8CF558E0009; Tue, 8 Oct 2019 23:21:32 -0400 (EDT) X-Original-To: int-list-linux-mm@kvack.org X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 7979C8E000A; Tue, 8 Oct 2019 23:21:32 -0400 (EDT) X-Original-To: linux-mm@kvack.org X-Delivered-To: linux-mm@kvack.org Received: from forelay.hostedemail.com (smtprelay0170.hostedemail.com [216.40.44.170]) by kanga.kvack.org (Postfix) with ESMTP id 56AA98E0008 for ; Tue, 8 Oct 2019 23:21:32 -0400 (EDT) Received: from smtpin20.hostedemail.com (10.5.19.251.rfc1918.com [10.5.19.251]) by forelay01.hostedemail.com (Postfix) with SMTP id ED7D4180AD803 for ; Wed, 9 Oct 2019 03:21:31 +0000 (UTC) X-FDA: 76022795982.20.value48_2a69cfd436632 X-Spam-Summary: 
2,0,0,4bae2ee31e3303fc,d41d8cd98f00b204,david@fromorbit.com,:linux-xfs@vger.kernel.org::linux-fsdevel@vger.kernel.org,RULES_HIT:41:355:379:541:800:960:966:973:988:989:1260:1261:1311:1314:1345:1359:1437:1515:1534:1542:1711:1730:1747:1777:1792:2196:2198:2199:2200:2393:2559:2562:3138:3139:3140:3141:3142:3353:3865:3867:3868:3871:3872:4250:4321:4385:5007:6261:7576:10004:11026:11473:11658:11914:12043:12114:12294:12296:12297:12438:12517:12519:12555:12679:12895:12986:13894:14096:14181:14394:14721:21080:21627:21965:30054,0,RBL:211.29.132.246:@fromorbit.com:.lbl8.mailshell.net-62.8.32.100 66.201.201.201,CacheIP:none,Bayesian:0.5,0.5,0.5,Netcheck:none,DomainCache:0,MSF:not bulk,SPF:fn,MSBL:0,DNSBL:neutral,Custom_rules:0:0:0,LFtime:24,LUA_SUMMARY:none X-HE-Tag: value48_2a69cfd436632 X-Filterd-Recvd-Size: 4069 Received: from mail104.syd.optusnet.com.au (mail104.syd.optusnet.com.au [211.29.132.246]) by imf45.hostedemail.com (Postfix) with ESMTP for ; Wed, 9 Oct 2019 03:21:31 +0000 (UTC) Received: from dread.disaster.area (pa49-181-226-196.pa.nsw.optusnet.com.au [49.181.226.196]) by mail104.syd.optusnet.com.au (Postfix) with ESMTPS id 5624543EC97; Wed, 9 Oct 2019 14:21:27 +1100 (AEDT) Received: from discord.disaster.area ([192.168.253.110]) by dread.disaster.area with esmtp (Exim 4.92.2) (envelope-from ) id 1iI2XX-0006C1-Dd; Wed, 09 Oct 2019 14:21:27 +1100 Received: from dave by discord.disaster.area with local (Exim 4.92) (envelope-from ) id 1iI2XX-00039u-Bl; Wed, 09 Oct 2019 14:21:27 +1100 From: Dave Chinner To: linux-xfs@vger.kernel.org Cc: linux-mm@kvack.org, linux-fsdevel@vger.kernel.org Subject: [PATCH 21/26] xfs: remove mode from xfs_reclaim_inodes() Date: Wed, 9 Oct 2019 14:21:19 +1100 Message-Id: <20191009032124.10541-22-david@fromorbit.com> X-Mailer: git-send-email 2.23.0.rc1 In-Reply-To: <20191009032124.10541-1-david@fromorbit.com> References: <20191009032124.10541-1-david@fromorbit.com> MIME-Version: 1.0 X-Optus-CM-Score: 0 X-Optus-CM-Analysis: v=2.2 cv=FNpr/6gs c=1 
sm=1 tr=0 a=dRuLqZ1tmBNts2YiI0zFQg==:117 a=dRuLqZ1tmBNts2YiI0zFQg==:17 a=jpOVt7BSZ2e4Z31A5e1TngXxSK0=:19 a=XobE76Q3jBoA:10 a=20KFwNOVAAAA:8 a=rBCjN8xBrULXB8iKm2EA:9 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: From: Dave Chinner Because it's always SYNC_WAIT now. Signed-off-by: Dave Chinner Reviewed-by: Brian Foster --- fs/xfs/xfs_icache.c | 7 +++---- fs/xfs/xfs_icache.h | 2 +- fs/xfs/xfs_mount.c | 4 ++-- fs/xfs/xfs_super.c | 3 +-- 4 files changed, 7 insertions(+), 9 deletions(-) diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c index ed996b37bda0..39c56200f1ce 100644 --- a/fs/xfs/xfs_icache.c +++ b/fs/xfs/xfs_icache.c @@ -1293,12 +1293,11 @@ xfs_reclaim_inodes_ag( return freed; } -int +void xfs_reclaim_inodes( - xfs_mount_t *mp, - int mode) + struct xfs_mount *mp) { - return xfs_reclaim_inodes_ag(mp, mode, INT_MAX); + xfs_reclaim_inodes_ag(mp, SYNC_WAIT, INT_MAX); } /* diff --git a/fs/xfs/xfs_icache.h b/fs/xfs/xfs_icache.h index 4c0d8920cc54..1c9b9edb2986 100644 --- a/fs/xfs/xfs_icache.h +++ b/fs/xfs/xfs_icache.h @@ -49,7 +49,7 @@ int xfs_iget(struct xfs_mount *mp, struct xfs_trans *tp, xfs_ino_t ino, struct xfs_inode * xfs_inode_alloc(struct xfs_mount *mp, xfs_ino_t ino); void xfs_inode_free(struct xfs_inode *ip); -int xfs_reclaim_inodes(struct xfs_mount *mp, int mode); +void xfs_reclaim_inodes(struct xfs_mount *mp); int xfs_reclaim_inodes_count(struct xfs_mount *mp); long xfs_reclaim_inodes_nr(struct xfs_mount *mp, int nr_to_scan); diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c index ecbc21af9100..3a38fe7c4f8d 100644 --- a/fs/xfs/xfs_mount.c +++ b/fs/xfs/xfs_mount.c @@ -988,7 +988,7 @@ xfs_mountfs( * qm_unmount_quotas and therefore rely on qm_unmount to release the * quota inodes. 
*/ - xfs_reclaim_inodes(mp, SYNC_WAIT); + xfs_reclaim_inodes(mp); xfs_health_unmount(mp); out_log_dealloc: mp->m_flags |= XFS_MOUNT_UNMOUNTING; @@ -1070,7 +1070,7 @@ xfs_unmountfs( * reclaim just to be sure. We can stop background inode reclaim * here as well if it is still running. */ - xfs_reclaim_inodes(mp, SYNC_WAIT); + xfs_reclaim_inodes(mp); xfs_health_unmount(mp); xfs_qm_unmount(mp); diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c index 74767e6f48a7..d0619bf02a5d 100644 --- a/fs/xfs/xfs_super.c +++ b/fs/xfs/xfs_super.c @@ -1180,8 +1180,7 @@ xfs_quiesce_attr( xfs_log_force(mp, XFS_LOG_SYNC); /* reclaim inodes to do any IO before the freeze completes */ - xfs_reclaim_inodes(mp, 0); - xfs_reclaim_inodes(mp, SYNC_WAIT); + xfs_reclaim_inodes(mp); /* Push the superblock and write an unmount record */ error = xfs_log_sbcount(mp); From patchwork Wed Oct 9 03:21:20 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Dave Chinner X-Patchwork-Id: 11180401 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id B1C6E17D4 for ; Wed, 9 Oct 2019 03:22:05 +0000 (UTC) Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by mail.kernel.org (Postfix) with ESMTP id 7B387206C2 for ; Wed, 9 Oct 2019 03:22:05 +0000 (UTC) DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org 7B387206C2 Authentication-Results: mail.kernel.org; dmarc=none (p=none dis=none) header.from=fromorbit.com Authentication-Results: mail.kernel.org; spf=pass smtp.mailfrom=owner-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix) id B414F8E0010; Tue, 8 Oct 2019 23:21:34 -0400 (EDT) Delivered-To: linux-mm-outgoing@kvack.org Received: by kanga.kvack.org (Postfix, from userid 40) id 947678E0015; Tue, 8 Oct 2019 23:21:34 -0400 (EDT) X-Original-To: int-list-linux-mm@kvack.org X-Delivered-To: int-list-linux-mm@kvack.org 
Received: by kanga.kvack.org (Postfix, from userid 63042) id 3F49D8E000D; Tue, 8 Oct 2019 23:21:34 -0400 (EDT) X-Original-To: linux-mm@kvack.org X-Delivered-To: linux-mm@kvack.org Received: from forelay.hostedemail.com (smtprelay0143.hostedemail.com [216.40.44.143]) by kanga.kvack.org (Postfix) with ESMTP id F151F8E0012 for ; Tue, 8 Oct 2019 23:21:33 -0400 (EDT) Received: from smtpin26.hostedemail.com (10.5.19.251.rfc1918.com [10.5.19.251]) by forelay05.hostedemail.com (Postfix) with SMTP id 8C4E3181AC9B4 for ; Wed, 9 Oct 2019 03:21:33 +0000 (UTC) X-FDA: 76022796066.26.pie49_2a90aac93b533 X-Spam-Summary: 2,0,0,40bc56097d8c11b8,d41d8cd98f00b204,david@fromorbit.com,:linux-xfs@vger.kernel.org::linux-fsdevel@vger.kernel.org,RULES_HIT:2:41:69:355:379:541:800:960:966:968:973:988:989:1260:1261:1311:1314:1345:1359:1437:1515:1535:1605:1730:1747:1777:1792:2194:2196:2198:2199:2200:2201:2393:2559:2562:2731:2914:3138:3139:3140:3141:3142:3308:3865:3866:3867:3868:3870:3871:3872:3874:4050:4119:4250:4321:4385:4419:5007:6119:6261:7576:7875:7903:9010:9012:9108:9163:9592:10004:11026:11232:11473:11657:11658:11914:12043:12291:12296:12297:12438:12485:12517:12519:12555:12679:12683:12895:12986:13215:13229:13894:14096:14394:21080:21451:21627:30012:30034:30054:30070:30079,0,RBL:211.29.132.246:@fromorbit.com:.lbl8.mailshell.net-62.8.32.100 66.201.201.201,CacheIP:none,Bayesian:0.5,0.5,0.5,Netcheck:none,DomainCache:0,MSF:not bulk,SPF:fn,MSBL:0,DNSBL:neutral,Custom_rules:0:0:0,LFtime:367,LUA_SUMMARY:none X-HE-Tag: pie49_2a90aac93b533 X-Filterd-Recvd-Size: 8659 Received: from mail104.syd.optusnet.com.au (mail104.syd.optusnet.com.au [211.29.132.246]) by imf45.hostedemail.com (Postfix) with ESMTP for ; Wed, 9 Oct 2019 03:21:32 +0000 (UTC) Received: from dread.disaster.area (pa49-181-226-196.pa.nsw.optusnet.com.au [49.181.226.196]) by mail104.syd.optusnet.com.au (Postfix) with ESMTPS id 7158143E6C9; Wed, 9 Oct 2019 14:21:28 +1100 (AEDT) Received: from discord.disaster.area ([192.168.253.110]) by 
dread.disaster.area with esmtp (Exim 4.92.2) (envelope-from ) id 1iI2XX-0006C4-Eo; Wed, 09 Oct 2019 14:21:27 +1100 Received: from dave by discord.disaster.area with local (Exim 4.92) (envelope-from ) id 1iI2XX-00039x-Cz; Wed, 09 Oct 2019 14:21:27 +1100 From: Dave Chinner To: linux-xfs@vger.kernel.org Cc: linux-mm@kvack.org, linux-fsdevel@vger.kernel.org Subject: [PATCH 22/26] xfs: track reclaimable inodes using a LRU list Date: Wed, 9 Oct 2019 14:21:20 +1100 Message-Id: <20191009032124.10541-23-david@fromorbit.com> X-Mailer: git-send-email 2.23.0.rc1 In-Reply-To: <20191009032124.10541-1-david@fromorbit.com> References: <20191009032124.10541-1-david@fromorbit.com> MIME-Version: 1.0 X-Optus-CM-Score: 0 X-Optus-CM-Analysis: v=2.2 cv=FNpr/6gs c=1 sm=1 tr=0 a=dRuLqZ1tmBNts2YiI0zFQg==:117 a=dRuLqZ1tmBNts2YiI0zFQg==:17 a=jpOVt7BSZ2e4Z31A5e1TngXxSK0=:19 a=XobE76Q3jBoA:10 a=20KFwNOVAAAA:8 a=xxV3NzHXsLv75N6J5P4A:9 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: From: Dave Chinner Now that we don't do IO from the inode reclaim code, there is no need to optimise inode scanning order for optimal IO characteristics. The AIL takes care of that for us, so now reclaim can focus on selecting the best inodes to reclaim. Hence we can change the inode reclaim algorithm to a real LRU and remove the need to use the radix tree to track and walk inodes under reclaim. This frees up a radix tree bit and simplifies the code that marks inodes as reclaim candidates. It also simplifies the reclaim code - we don't need batching anymore and all the reclaim logic can be added to the LRU isolation callback. Further, we get node aware reclaim at the xfs_inode level, which should help the per-node reclaim code free relevant inodes faster. We can re-use the VFS inode lru pointers - once the inode has been reclaimed from the VFS, we can use these pointers ourselves.
Hence we don't need to grow the inode to change the way we index reclaimable inodes. Start by adding the list_lru tracking in parallel with the existing reclaim code. This makes it easier to see the LRU infrastructure separate to the reclaim algorithm changes. Especially the locking order, which is ip->i_flags_lock -> list_lru lock. Signed-off-by: Dave Chinner Reviewed-by: Brian Foster --- fs/xfs/xfs_icache.c | 32 ++++++++------------------------ fs/xfs/xfs_icache.h | 1 - fs/xfs/xfs_mount.h | 1 + fs/xfs/xfs_super.c | 29 ++++++++++++++++++++++------- 4 files changed, 31 insertions(+), 32 deletions(-) diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c index 39c56200f1ce..06fdaa746674 100644 --- a/fs/xfs/xfs_icache.c +++ b/fs/xfs/xfs_icache.c @@ -198,6 +198,8 @@ xfs_inode_set_reclaim_tag( xfs_perag_set_reclaim_tag(pag); __xfs_iflags_set(ip, XFS_IRECLAIMABLE); + list_lru_add(&mp->m_inode_lru, &VFS_I(ip)->i_lru); + spin_unlock(&ip->i_flags_lock); spin_unlock(&pag->pag_ici_lock); xfs_perag_put(pag); @@ -370,12 +372,10 @@ xfs_iget_cache_hit( /* * We need to set XFS_IRECLAIM to prevent xfs_reclaim_inode - * from stomping over us while we recycle the inode. We can't - * clear the radix tree reclaimable tag yet as it requires - * pag_ici_lock to be held exclusive. + * from stomping over us while we recycle the inode. Remove it + * from the LRU straight away so we can re-init the VFS inode. 
*/ ip->i_flags |= XFS_IRECLAIM; - spin_unlock(&ip->i_flags_lock); rcu_read_unlock(); @@ -407,6 +407,7 @@ xfs_iget_cache_hit( */ ip->i_flags &= ~XFS_IRECLAIM_RESET_FLAGS; ip->i_flags |= XFS_INEW; + list_lru_del(&mp->m_inode_lru, &inode->i_lru); xfs_inode_clear_reclaim_tag(pag, ip->i_ino); inode->i_state = I_NEW; ip->i_sick = 0; @@ -1138,6 +1139,9 @@ xfs_reclaim_inode( ino = ip->i_ino; /* for radix_tree_delete */ ip->i_flags = XFS_IRECLAIM; ip->i_ino = 0; + + /* XXX: temporary until lru based reclaim */ + list_lru_del(&pag->pag_mount->m_inode_lru, &VFS_I(ip)->i_lru); spin_unlock(&ip->i_flags_lock); xfs_iunlock(ip, XFS_ILOCK_EXCL); @@ -1329,26 +1333,6 @@ xfs_reclaim_inodes_nr( return xfs_reclaim_inodes_ag(mp, sync_mode, nr_to_scan); } -/* - * Return the number of reclaimable inodes in the filesystem for - * the shrinker to determine how much to reclaim. - */ -int -xfs_reclaim_inodes_count( - struct xfs_mount *mp) -{ - struct xfs_perag *pag; - xfs_agnumber_t ag = 0; - int reclaimable = 0; - - while ((pag = xfs_perag_get_tag(mp, ag, XFS_ICI_RECLAIM_TAG))) { - ag = pag->pag_agno + 1; - reclaimable += pag->pag_ici_reclaimable; - xfs_perag_put(pag); - } - return reclaimable; -} - STATIC int xfs_inode_match_id( struct xfs_inode *ip, diff --git a/fs/xfs/xfs_icache.h b/fs/xfs/xfs_icache.h index 1c9b9edb2986..0ab08b58cd45 100644 --- a/fs/xfs/xfs_icache.h +++ b/fs/xfs/xfs_icache.h @@ -50,7 +50,6 @@ struct xfs_inode * xfs_inode_alloc(struct xfs_mount *mp, xfs_ino_t ino); void xfs_inode_free(struct xfs_inode *ip); void xfs_reclaim_inodes(struct xfs_mount *mp); -int xfs_reclaim_inodes_count(struct xfs_mount *mp); long xfs_reclaim_inodes_nr(struct xfs_mount *mp, int nr_to_scan); void xfs_inode_set_reclaim_tag(struct xfs_inode *ip); diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h index f0cc952ad527..f1e4c2eae984 100644 --- a/fs/xfs/xfs_mount.h +++ b/fs/xfs/xfs_mount.h @@ -75,6 +75,7 @@ typedef struct xfs_mount { uint8_t m_rt_sick; struct xfs_ail *m_ail; /* fs active log item 
list */ + struct list_lru m_inode_lru; struct xfs_sb m_sb; /* copy of fs superblock */ spinlock_t m_sb_lock; /* sb counter lock */ diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c index d0619bf02a5d..01f08706a3fb 100644 --- a/fs/xfs/xfs_super.c +++ b/fs/xfs/xfs_super.c @@ -920,28 +920,31 @@ xfs_fs_destroy_inode( struct inode *inode) { struct xfs_inode *ip = XFS_I(inode); + struct xfs_mount *mp = ip->i_mount; trace_xfs_destroy_inode(ip); ASSERT(!rwsem_is_locked(&inode->i_rwsem)); - XFS_STATS_INC(ip->i_mount, vn_rele); - XFS_STATS_INC(ip->i_mount, vn_remove); + XFS_STATS_INC(mp, vn_rele); + XFS_STATS_INC(mp, vn_remove); xfs_inactive(ip); - if (!XFS_FORCED_SHUTDOWN(ip->i_mount) && ip->i_delayed_blks) { + if (!XFS_FORCED_SHUTDOWN(mp) && ip->i_delayed_blks) { xfs_check_delalloc(ip, XFS_DATA_FORK); xfs_check_delalloc(ip, XFS_COW_FORK); ASSERT(0); } - XFS_STATS_INC(ip->i_mount, vn_reclaim); + XFS_STATS_INC(mp, vn_reclaim); /* * We should never get here with one of the reclaim flags already set. */ - ASSERT_ALWAYS(!xfs_iflags_test(ip, XFS_IRECLAIMABLE)); - ASSERT_ALWAYS(!xfs_iflags_test(ip, XFS_IRECLAIM)); + spin_lock(&ip->i_flags_lock); + ASSERT_ALWAYS(!__xfs_iflags_test(ip, XFS_IRECLAIMABLE)); + ASSERT_ALWAYS(!__xfs_iflags_test(ip, XFS_IRECLAIM)); + spin_unlock(&ip->i_flags_lock); /* * We always use background reclaim here because even if the @@ -1542,6 +1545,15 @@ xfs_mount_alloc( if (!mp) return NULL; + /* + * The inode lru needs to be associated with the superblock shrinker, + * and like the rest of the superblock shrinker, it's memcg aware. 
+ */ + if (list_lru_init_memcg(&mp->m_inode_lru, &sb->s_shrink)) { + kfree(mp); + return NULL; + } + mp->m_super = sb; spin_lock_init(&mp->m_sb_lock); spin_lock_init(&mp->m_agirotor_lock); @@ -1751,6 +1763,7 @@ xfs_fs_fill_super( out_free_fsname: sb->s_fs_info = NULL; xfs_free_fsname(mp); + list_lru_destroy(&mp->m_inode_lru); kfree(mp); out: return error; @@ -1783,6 +1796,7 @@ xfs_fs_put_super( sb->s_fs_info = NULL; xfs_free_fsname(mp); + list_lru_destroy(&mp->m_inode_lru); kfree(mp); } @@ -1804,7 +1818,8 @@ xfs_fs_nr_cached_objects( /* Paranoia: catch incorrect calls during mount setup or teardown */ if (WARN_ON_ONCE(!sb->s_fs_info)) return 0; - return xfs_reclaim_inodes_count(XFS_M(sb)); + + return list_lru_shrink_count(&XFS_M(sb)->m_inode_lru, sc); } static long From patchwork Wed Oct 9 03:21:21 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Dave Chinner X-Patchwork-Id: 11180399 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 108AC17D4 for ; Wed, 9 Oct 2019 03:22:03 +0000 (UTC) Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by mail.kernel.org (Postfix) with ESMTP id C12EF206C2 for ; Wed, 9 Oct 2019 03:22:02 +0000 (UTC) DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org C12EF206C2 Authentication-Results: mail.kernel.org; dmarc=none (p=none dis=none) header.from=fromorbit.com Authentication-Results: mail.kernel.org; spf=pass smtp.mailfrom=owner-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix) id 892A28E0013; Tue, 8 Oct 2019 23:21:34 -0400 (EDT) Delivered-To: linux-mm-outgoing@kvack.org Received: by kanga.kvack.org (Postfix, from userid 40) id 7BF948E0017; Tue, 8 Oct 2019 23:21:34 -0400 (EDT) X-Original-To: int-list-linux-mm@kvack.org X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 
2B9128E0015; Tue, 8 Oct 2019 23:21:34 -0400 (EDT) X-Original-To: linux-mm@kvack.org X-Delivered-To: linux-mm@kvack.org Received: from forelay.hostedemail.com (smtprelay0030.hostedemail.com [216.40.44.30]) by kanga.kvack.org (Postfix) with ESMTP id E2B2C8E000D for ; Tue, 8 Oct 2019 23:21:33 -0400 (EDT) Received: from smtpin14.hostedemail.com (10.5.19.251.rfc1918.com [10.5.19.251]) by forelay02.hostedemail.com (Postfix) with SMTP id 77B556406 for ; Wed, 9 Oct 2019 03:21:33 +0000 (UTC) X-FDA: 76022796066.14.river63_2a8d42708e447 X-Spam-Summary: 2,0,0,622357975a9cff7a,d41d8cd98f00b204,david@fromorbit.com,:linux-xfs@vger.kernel.org::linux-fsdevel@vger.kernel.org,RULES_HIT:4:41:355:379:541:800:960:966:968:973:988:989:1260:1261:1311:1314:1345:1359:1437:1515:1605:1730:1747:1777:1792:2194:2196:2198:2199:2200:2201:2393:2553:2559:2562:2639:2693:2731:2736:2903:3138:3139:3140:3141:3142:3308:3865:3866:3867:3868:3870:3871:3872:3874:4078:4081:4250:4321:4385:4605:5007:6119:6261:7576:7875:7903:9010:9151:11026:11473:11658:11914:12043:12291:12294:12296:12297:12438:12517:12519:12555:12679:12683:12895:13161:13184:13229:13894:14096:14394:21063:21080:21324:21433:21451:21611:21627:21740:30012:30034:30045:30054:30070:30090,0,RBL:211.29.132.246:@fromorbit.com:.lbl8.mailshell.net-62.8.32.100 66.201.201.201,CacheIP:none,Bayesian:0.5,0.5,0.5,Netcheck:none,DomainCache:0,MSF:not bulk,SPF:fn,MSBL:0,DNSBL:neutral,Custom_rules:0:1:0,LFtime:3,LUA_SUMMARY:none X-HE-Tag: river63_2a8d42708e447 X-Filterd-Recvd-Size: 15206 Received: from mail104.syd.optusnet.com.au (mail104.syd.optusnet.com.au [211.29.132.246]) by imf12.hostedemail.com (Postfix) with ESMTP for ; Wed, 9 Oct 2019 03:21:32 +0000 (UTC) Received: from dread.disaster.area (pa49-181-226-196.pa.nsw.optusnet.com.au [49.181.226.196]) by mail104.syd.optusnet.com.au (Postfix) with ESMTPS id 5621A43EC8D; Wed, 9 Oct 2019 14:21:27 +1100 (AEDT) Received: from discord.disaster.area ([192.168.253.110]) by dread.disaster.area with esmtp (Exim 4.92.2) 
(envelope-from ) id 1iI2XX-0006C6-G7; Wed, 09 Oct 2019 14:21:27 +1100 Received: from dave by discord.disaster.area with local (Exim 4.92) (envelope-from ) id 1iI2XX-0003A0-EG; Wed, 09 Oct 2019 14:21:27 +1100 From: Dave Chinner To: linux-xfs@vger.kernel.org Cc: linux-mm@kvack.org, linux-fsdevel@vger.kernel.org Subject: [PATCH 23/26] xfs: reclaim inodes from the LRU Date: Wed, 9 Oct 2019 14:21:21 +1100 Message-Id: <20191009032124.10541-24-david@fromorbit.com> X-Mailer: git-send-email 2.23.0.rc1 In-Reply-To: <20191009032124.10541-1-david@fromorbit.com> References: <20191009032124.10541-1-david@fromorbit.com> MIME-Version: 1.0 X-Optus-CM-Score: 0 X-Optus-CM-Analysis: v=2.2 cv=FNpr/6gs c=1 sm=1 tr=0 a=dRuLqZ1tmBNts2YiI0zFQg==:117 a=dRuLqZ1tmBNts2YiI0zFQg==:17 a=jpOVt7BSZ2e4Z31A5e1TngXxSK0=:19 a=XobE76Q3jBoA:10 a=20KFwNOVAAAA:8 a=DZylAeVkA9-7jzjJDzkA:9 a=I6VJddeCkTb8Bg0N:21 a=kod2Zbzq7Elvxm72:21 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: From: Dave Chinner Replace the AG radix tree walking reclaim code with a list_lru walker, giving us both node-aware and memcg-aware inode reclaim at the XFS level. This requires adding an inode isolation function to determine if the inode can be reclaimed, and a list walker to dispose of the inodes that were isolated. We want the isolation function to be non-blocking. If we can't grab an inode then we either skip it or rotate it. If it's clean then we skip it, if it's dirty then we rotate to give it time to be cleaned before it is scanned again. This congregates the dirty inodes at the tail of the LRU, which means that if we start hitting a majority of dirty inodes either there are lots of unlinked inodes in the reclaim list or we've reclaimed all the clean inodes and we're looped back on the dirty inodes. Either way, this is an indication we should tell kswapd to back off.
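The skip/rotate decision described above can be sketched in plain userspace C. This is only a toy model, not the kernel code: struct toy_inode, struct toy_reclaim_args, enum toy_lru_status and toy_isolate() are hypothetical names invented for illustration, the boolean fields stand in for trylock results and inode state checks, and the filesystem shutdown path is left out entirely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy outcomes; stand-ins for, not copies of, the kernel's enum lru_status. */
enum toy_lru_status {
	TOY_LRU_SKIP,		/* leave the inode where it is on the LRU */
	TOY_LRU_ROTATE,		/* move it to the tail so it is rescanned later */
	TOY_LRU_RECLAIM,	/* clean and fully locked: hand it to disposal */
};

/* Hypothetical, flattened view of the state the isolation callback checks. */
struct toy_inode {
	bool flags_lock_free;	/* would a trylock of i_flags_lock succeed? */
	bool ilock_free;	/* would a trylock of the ILOCK succeed? */
	bool flush_lock_free;	/* would a trylock of the flush lock succeed? */
	bool dirty;		/* neither clean nor stale */
	bool pinned;		/* still pinned in the log */
};

/* Hypothetical per-walk accounting used to decide when to back off. */
struct toy_reclaim_args {
	unsigned long dirty_skipped;
};

/* Never blocks: every lock is "tried", and failure means skip or rotate. */
static enum toy_lru_status
toy_isolate(const struct toy_inode *ip, struct toy_reclaim_args *ra)
{
	assert(ip != NULL && ra != NULL);

	if (!ip->flags_lock_free)
		return TOY_LRU_SKIP;

	/* Dirty inodes are rotated to give writeback time to clean them. */
	if (ip->dirty) {
		ra->dirty_skipped++;
		return TOY_LRU_ROTATE;
	}

	if (!ip->ilock_free)
		return TOY_LRU_SKIP;

	/* A held flush lock means IO is in flight; count it as dirty. */
	if (!ip->flush_lock_free) {
		ra->dirty_skipped++;
		return TOY_LRU_SKIP;
	}

	/* Pinned inodes cannot be written back yet; rotate them as well. */
	if (ip->pinned) {
		ra->dirty_skipped++;
		return TOY_LRU_ROTATE;
	}

	return TOY_LRU_RECLAIM;
}
```

In this model a caller would compare dirty_skipped against the number of inodes scanned after the walk; a high ratio corresponds to the "tell kswapd to back off" condition described above.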
The non-blocking isolation function introduces a complexity for the filesystem shutdown case. When the filesystem is shut down, we want to free the inode even if it is dirty, and this may require blocking. We already hold the locks needed to do this blocking, so we leave inodes locked - both the ILOCK and the flush lock - while they are sitting on the dispose list to be freed after the LRU walk completes. This allows us to process the shutdown state outside the LRU walk where we can block safely.

Because we are now reclaiming inodes from the context that needs memory (memcg and/or node), direct reclaim throttling within the high level reclaim code is now much more effective. Hence we don't wait on IO for either kswapd or direct reclaim. However, we have to tell kswapd to back off if we start hitting too many dirty inodes. This implies we've wrapped around the LRU and don't have many clean inodes left to reclaim, so it needs to wait a while for the AIL pushing to clean some of the remaining reclaimable inodes.

Keep in mind we don't have to care about inode lock order or blocking with inode locks held here because a) we are using trylocks, and b) once marked with XFS_IRECLAIM they can't be found via the LRU and inode cache lookups will abort and retry. Hence nobody will try to lock them in any other context that might also be holding other inode locks.

Also convert xfs_reclaim_inodes() to use an LRU walk to free all the reclaimable inodes in the filesystem.

Signed-off-by: Dave Chinner
---
 fs/xfs/xfs_icache.c | 210 ++++++++++++++++++++++++++++++++++++++------
 fs/xfs/xfs_icache.h | 10 ++-
 fs/xfs/xfs_inode.h | 8 ++
 fs/xfs/xfs_super.c | 48 ++++++++--
 4 files changed, 241 insertions(+), 35 deletions(-)

diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c index 06fdaa746674..ef9ef46cfe6c 100644 --- a/fs/xfs/xfs_icache.c +++ b/fs/xfs/xfs_icache.c @@ -1193,7 +1193,7 @@ xfs_reclaim_inode( * * Return the number of inodes freed. 
*/ -STATIC int +int xfs_reclaim_inodes_ag( struct xfs_mount *mp, int flags, @@ -1297,40 +1297,196 @@ xfs_reclaim_inodes_ag( return freed; } -void -xfs_reclaim_inodes( - struct xfs_mount *mp) +enum lru_status +xfs_inode_reclaim_isolate( + struct list_head *item, + struct list_lru_one *lru, + spinlock_t *lru_lock, + void *arg) { - xfs_reclaim_inodes_ag(mp, SYNC_WAIT, INT_MAX); + struct xfs_ireclaim_args *ra = arg; + struct inode *inode = container_of(item, struct inode, i_lru); + struct xfs_inode *ip = XFS_I(inode); + enum lru_status ret; + xfs_lsn_t lsn = 0; + + /* Careful: inversion of iflags_lock and everything else here */ + if (!spin_trylock(&ip->i_flags_lock)) + return LRU_SKIP; + + /* if we are in shutdown, we'll reclaim it even if dirty */ + ret = LRU_ROTATE; + if (!xfs_inode_clean(ip) && !__xfs_iflags_test(ip, XFS_ISTALE) && + !XFS_FORCED_SHUTDOWN(ip->i_mount)) { + lsn = ip->i_itemp->ili_item.li_lsn; + ra->dirty_skipped++; + goto out_unlock_flags; + } + + ret = LRU_SKIP; + if (!xfs_ilock_nowait(ip, XFS_ILOCK_EXCL)) + goto out_unlock_flags; + + if (!__xfs_iflock_nowait(ip)) { + lsn = ip->i_itemp->ili_item.li_lsn; + ra->dirty_skipped++; + goto out_unlock_inode; + } + + if (XFS_FORCED_SHUTDOWN(ip->i_mount)) + goto reclaim; + + /* + * Now the inode is locked, we can actually determine if it is dirty + * without racing with anything. + */ + ret = LRU_ROTATE; + if (xfs_ipincount(ip)) { + ra->dirty_skipped++; + goto out_ifunlock; + } + if (!xfs_inode_clean(ip) && !__xfs_iflags_test(ip, XFS_ISTALE)) { + lsn = ip->i_itemp->ili_item.li_lsn; + ra->dirty_skipped++; + goto out_ifunlock; + } + +reclaim: + /* + * Once we mark the inode with XFS_IRECLAIM, no-one will grab it again. + * RCU lookups will still find the inode, but they'll stop when they set + * the IRECLAIM flag. Hence we can leave the inode locked as we move it + * to the dispose list so we can deal with shutdown cleanup there + * outside the LRU lock context. 
+ */ + __xfs_iflags_set(ip, XFS_IRECLAIM); + list_lru_isolate_move(lru, &inode->i_lru, &ra->freeable); + spin_unlock(&ip->i_flags_lock); + return LRU_REMOVED; + +out_ifunlock: + xfs_ifunlock(ip); +out_unlock_inode: + xfs_iunlock(ip, XFS_ILOCK_EXCL); +out_unlock_flags: + spin_unlock(&ip->i_flags_lock); + + if (lsn && XFS_LSN_CMP(lsn, ra->lowest_lsn) < 0) + ra->lowest_lsn = lsn; + return ret; } -/* - * Scan a certain number of inodes for reclaim. - * - * When called we make sure that there is a background (fast) inode reclaim in - * progress, while we will throttle the speed of reclaim via doing synchronous - * reclaim of inodes. That means if we come across dirty inodes, we wait for - * them to be cleaned, which we hope will not be very long due to the - * background walker having already kicked the IO off on those dirty inodes. - */ -long -xfs_reclaim_inodes_nr( - struct xfs_mount *mp, - int nr_to_scan) +static void +xfs_dispose_inode( + struct xfs_inode *ip) { - int sync_mode = 0; + struct xfs_mount *mp = ip->i_mount; + struct xfs_perag *pag; + xfs_ino_t ino; + + ASSERT(xfs_isiflocked(ip)); + ASSERT(xfs_inode_clean(ip) || xfs_iflags_test(ip, XFS_ISTALE) || + XFS_FORCED_SHUTDOWN(mp)); + ASSERT(ip->i_ino != 0); /* - * For kswapd, we kick background inode writeback. For direct - * reclaim, we issue and wait on inode writeback to throttle - * reclaim rates and avoid shouty OOM-death. + * Process the shutdown reclaim work we deferred from the LRU isolation + * callback before we go any further. */ - if (current_is_kswapd()) - xfs_ail_push_all(mp->m_ail); - else - sync_mode |= SYNC_WAIT; + if (XFS_FORCED_SHUTDOWN(mp)) { + xfs_iunpin_wait(ip); + xfs_iflush_abort(ip, false); + } else { + xfs_ifunlock(ip); + } - return xfs_reclaim_inodes_ag(mp, sync_mode, nr_to_scan); + /* + * Because we use RCU freeing we need to ensure the inode always appears + * to be reclaimed with an invalid inode number when in the free state. 
+ * We do this as early as possible under the ILOCK so that + * xfs_iflush_cluster() and xfs_ifree_cluster() can be guaranteed to + * detect races with us here. By doing this, we guarantee that once + * xfs_iflush_cluster() or xfs_ifree_cluster() has locked XFS_ILOCK that + * it will see either a valid inode that will serialise correctly, or it + * will see an invalid inode that it can skip. + */ + spin_lock(&ip->i_flags_lock); + ino = ip->i_ino; /* for radix_tree_delete */ + ip->i_flags = XFS_IRECLAIM; + ip->i_ino = 0; + spin_unlock(&ip->i_flags_lock); + xfs_iunlock(ip, XFS_ILOCK_EXCL); + + XFS_STATS_INC(mp, xs_ig_reclaims); + /* + * Remove the inode from the per-AG radix tree. + * + * Because radix_tree_delete won't complain even if the item was never + * added to the tree assert that it's been there before to catch + * problems with the inode life time early on. + */ + pag = xfs_perag_get(mp, XFS_INO_TO_AGNO(mp, ino)); + spin_lock(&pag->pag_ici_lock); + if (!radix_tree_delete(&pag->pag_ici_root, XFS_INO_TO_AGINO(mp, ino))) + ASSERT(0); + spin_unlock(&pag->pag_ici_lock); + xfs_perag_put(pag); + + /* + * Here we do an (almost) spurious inode lock in order to coordinate + * with inode cache radix tree lookups. This is because the lookup + * can reference the inodes in the cache without taking references. + * + * We make that OK here by ensuring that we wait until the inode is + * unlocked after the lookup before we go ahead and free it. + * + * XXX: need to check this is still true. Not sure it is. 
+ */ + xfs_ilock(ip, XFS_ILOCK_EXCL); + xfs_qm_dqdetach(ip); + xfs_iunlock(ip, XFS_ILOCK_EXCL); + + __xfs_inode_free(ip); +} + +void +xfs_dispose_inodes( + struct list_head *freeable) +{ + while (!list_empty(freeable)) { + struct inode *inode; + + inode = list_first_entry(freeable, struct inode, i_lru); + list_del_init(&inode->i_lru); + + xfs_dispose_inode(XFS_I(inode)); + cond_resched(); + } +} +void +xfs_reclaim_inodes( + struct xfs_mount *mp) +{ + while (list_lru_count(&mp->m_inode_lru)) { + struct xfs_ireclaim_args ra; + long freed, to_free; + + INIT_LIST_HEAD(&ra.freeable); + ra.lowest_lsn = NULLCOMMITLSN; + to_free = list_lru_count(&mp->m_inode_lru); + + freed = list_lru_walk(&mp->m_inode_lru, xfs_inode_reclaim_isolate, + &ra, to_free); + xfs_dispose_inodes(&ra.freeable); + + if (freed == 0) { + xfs_log_force(mp, XFS_LOG_SYNC); + xfs_ail_push_all(mp->m_ail); + } else if (ra.lowest_lsn != NULLCOMMITLSN) { + xfs_ail_push_sync(mp->m_ail, ra.lowest_lsn); + } + cond_resched(); + } } STATIC int diff --git a/fs/xfs/xfs_icache.h b/fs/xfs/xfs_icache.h index 0ab08b58cd45..dadc69a30f33 100644 --- a/fs/xfs/xfs_icache.h +++ b/fs/xfs/xfs_icache.h @@ -49,8 +49,16 @@ int xfs_iget(struct xfs_mount *mp, struct xfs_trans *tp, xfs_ino_t ino, struct xfs_inode * xfs_inode_alloc(struct xfs_mount *mp, xfs_ino_t ino); void xfs_inode_free(struct xfs_inode *ip); +struct xfs_ireclaim_args { + struct list_head freeable; + xfs_lsn_t lowest_lsn; + unsigned long dirty_skipped; +}; + +enum lru_status xfs_inode_reclaim_isolate(struct list_head *item, + struct list_lru_one *lru, spinlock_t *lru_lock, void *arg); +void xfs_dispose_inodes(struct list_head *freeable); void xfs_reclaim_inodes(struct xfs_mount *mp); -long xfs_reclaim_inodes_nr(struct xfs_mount *mp, int nr_to_scan); void xfs_inode_set_reclaim_tag(struct xfs_inode *ip); diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h index 558173f95a03..463170dc4c02 100644 --- a/fs/xfs/xfs_inode.h +++ b/fs/xfs/xfs_inode.h @@ -263,6 +263,14 @@ 
static inline int xfs_isiflocked(struct xfs_inode *ip) extern void __xfs_iflock(struct xfs_inode *ip); +static inline int __xfs_iflock_nowait(struct xfs_inode *ip) +{ + if (ip->i_flags & XFS_IFLOCK) + return false; + ip->i_flags |= XFS_IFLOCK; + return true; +} + static inline int xfs_iflock_nowait(struct xfs_inode *ip) { return !xfs_iflags_test_and_set(ip, XFS_IFLOCK); diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c index 01f08706a3fb..3dfddd3a443b 100644 --- a/fs/xfs/xfs_super.c +++ b/fs/xfs/xfs_super.c @@ -17,6 +17,7 @@ #include "xfs_alloc.h" #include "xfs_fsops.h" #include "xfs_trans.h" +#include "xfs_trans_priv.h" #include "xfs_buf_item.h" #include "xfs_log.h" #include "xfs_log_priv.h" @@ -1811,23 +1812,56 @@ xfs_fs_mount( } static long -xfs_fs_nr_cached_objects( +xfs_fs_free_cached_objects( struct super_block *sb, struct shrink_control *sc) { - /* Paranoia: catch incorrect calls during mount setup or teardown */ - if (WARN_ON_ONCE(!sb->s_fs_info)) - return 0; + struct xfs_mount *mp = XFS_M(sb); + struct xfs_ireclaim_args ra; + long freed; - return list_lru_shrink_count(&XFS_M(sb)->m_inode_lru, sc); + INIT_LIST_HEAD(&ra.freeable); + ra.lowest_lsn = NULLCOMMITLSN; + ra.dirty_skipped = 0; + + freed = list_lru_shrink_walk(&mp->m_inode_lru, sc, + xfs_inode_reclaim_isolate, &ra); + xfs_dispose_inodes(&ra.freeable); + + /* + * Deal with dirty inodes. We will have the LSN of + * the oldest dirty inode in our reclaim args if we skipped any. + * + * For kswapd, if we skipped too many dirty inodes (i.e. more dirty than + * we freed) then we need kswapd to back off once its scan has been + * completed. That way it will have some clean inodes once it comes back + * and can make progress, but make sure we have inode cleaning in + * progress. + * + * Direct reclaim will be throttled by the caller as it winds the + * priority up. All we need to do is keep pushing on dirty inodes + * in the background so when we come back progress will be made. 
+ */ + if (current_is_kswapd() && ra.dirty_skipped >= freed) { + if (current->reclaim_state) + current->reclaim_state->need_backoff = true; + } + if (ra.lowest_lsn != NULLCOMMITLSN) + xfs_ail_push(mp->m_ail, ra.lowest_lsn); + + return freed; } static long -xfs_fs_free_cached_objects( +xfs_fs_nr_cached_objects( struct super_block *sb, struct shrink_control *sc) { - return xfs_reclaim_inodes_nr(XFS_M(sb), sc->nr_to_scan); + /* Paranoia: catch incorrect calls during mount setup or teardown */ + if (WARN_ON_ONCE(!sb->s_fs_info)) + return 0; + + return list_lru_shrink_count(&XFS_M(sb)->m_inode_lru, sc); } static const struct super_operations xfs_super_operations = { From patchwork Wed Oct 9 03:21:22 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Dave Chinner X-Patchwork-Id: 11180413 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 7431F1709 for ; Wed, 9 Oct 2019 03:22:21 +0000 (UTC) Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by mail.kernel.org (Postfix) with ESMTP id 2271B206C2 for ; Wed, 9 Oct 2019 03:22:21 +0000 (UTC) DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org 2271B206C2 Authentication-Results: mail.kernel.org; dmarc=none (p=none dis=none) header.from=fromorbit.com Authentication-Results: mail.kernel.org; spf=pass smtp.mailfrom=owner-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix) id A04F38E0019; Tue, 8 Oct 2019 23:21:35 -0400 (EDT) Delivered-To: linux-mm-outgoing@kvack.org Received: by kanga.kvack.org (Postfix, from userid 40) id 3ECC18E0016; Tue, 8 Oct 2019 23:21:35 -0400 (EDT) X-Original-To: int-list-linux-mm@kvack.org X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 0BC488E0012; Tue, 8 Oct 2019 23:21:34 -0400 (EDT) X-Original-To: linux-mm@kvack.org X-Delivered-To: 
linux-mm@kvack.org Received: from forelay.hostedemail.com (smtprelay0097.hostedemail.com [216.40.44.97]) by kanga.kvack.org (Postfix) with ESMTP id A546B8E0012 for ; Tue, 8 Oct 2019 23:21:34 -0400 (EDT) Received: from smtpin01.hostedemail.com (10.5.19.251.rfc1918.com [10.5.19.251]) by forelay01.hostedemail.com (Postfix) with SMTP id 2F22F180AD804 for ; Wed, 9 Oct 2019 03:21:34 +0000 (UTC) X-FDA: 76022796108.01.sail03_2aaa59d35ef0c X-Spam-Summary: 2,0,0,2dbb3683921ab844,d41d8cd98f00b204,david@fromorbit.com,:linux-xfs@vger.kernel.org::linux-fsdevel@vger.kernel.org,RULES_HIT:4:41:69:355:379:541:800:960:966:968:973:988:989:1260:1261:1311:1314:1345:1359:1437:1515:1605:1730:1747:1777:1792:2196:2198:2199:2200:2307:2393:2553:2559:2562:2693:2731:2895:2898:2903:2914:2924:2926:3138:3139:3140:3141:3142:3308:3865:3866:3867:3868:3870:3871:3872:3874:4250:4321:4385:4605:5007:6117:6119:6121:6261:7576:7875:7903:9121:9592:10004:11026:11232:11473:11657:11658:11914:12043:12296:12297:12438:12485:12517:12519:12555:12679:12683:12895:12986:13141:13230:13869:13894:14096:14110:14394:21063:21080:21324:21451:21524:21627:21740:30005:30012:30034:30045:30054:30070:30090,0,RBL:211.29.132.249:@fromorbit.com:.lbl8.mailshell.net-62.8.32.100 66.201.201.201,CacheIP:none,Bayesian:0.5,0.5,0.5,Netcheck:none,DomainCache:0,MSF:not bulk,SPF:fn,MSBL:0,DNSBL:neutral,Custom_rules:0:0:0,LFtime:23,LUA_SUMMARY:none X-HE-Tag: sail03_2aaa59d35ef0c X-Filterd-Recvd-Size: 19238 Received: from mail105.syd.optusnet.com.au (mail105.syd.optusnet.com.au [211.29.132.249]) by imf44.hostedemail.com (Postfix) with ESMTP for ; Wed, 9 Oct 2019 03:21:33 +0000 (UTC) Received: from dread.disaster.area (pa49-181-226-196.pa.nsw.optusnet.com.au [49.181.226.196]) by mail105.syd.optusnet.com.au (Postfix) with ESMTPS id 569963629EE; Wed, 9 Oct 2019 14:21:27 +1100 (AEDT) Received: from discord.disaster.area ([192.168.253.110]) by dread.disaster.area with esmtp (Exim 4.92.2) (envelope-from ) id 1iI2XX-0006CB-I3; Wed, 09 Oct 2019 14:21:27 
+1100 Received: from dave by discord.disaster.area with local (Exim 4.92) (envelope-from ) id 1iI2XX-0003A3-Fe; Wed, 09 Oct 2019 14:21:27 +1100 From: Dave Chinner To: linux-xfs@vger.kernel.org Cc: linux-mm@kvack.org, linux-fsdevel@vger.kernel.org Subject: [PATCH 24/26] xfs: remove unusued old inode reclaim code Date: Wed, 9 Oct 2019 14:21:22 +1100 Message-Id: <20191009032124.10541-25-david@fromorbit.com> X-Mailer: git-send-email 2.23.0.rc1 In-Reply-To: <20191009032124.10541-1-david@fromorbit.com> References: <20191009032124.10541-1-david@fromorbit.com> MIME-Version: 1.0 X-Optus-CM-Score: 0 X-Optus-CM-Analysis: v=2.2 cv=D+Q3ErZj c=1 sm=1 tr=0 a=dRuLqZ1tmBNts2YiI0zFQg==:117 a=dRuLqZ1tmBNts2YiI0zFQg==:17 a=jpOVt7BSZ2e4Z31A5e1TngXxSK0=:19 a=XobE76Q3jBoA:10 a=20KFwNOVAAAA:8 a=1n5jCYYRKrH5BCLvDUQA:9 a=9iRo5a8wJfiI_Cf1:21 a=vDsUGHXTItgeFNxT:21 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID:

From: Dave Chinner

We no longer use the custom AG radix tree walker, the reclaim radix tree tag, the reclaimable inode counters, etc., so remove them all now.

Signed-off-by: Dave Chinner --- fs/xfs/xfs_icache.c | 411 +------------------------------------------- fs/xfs/xfs_icache.h | 7 +- fs/xfs/xfs_mount.c | 4 - fs/xfs/xfs_mount.h | 3 - fs/xfs/xfs_super.c | 5 +- 5 files changed, 6 insertions(+), 424 deletions(-) diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c index ef9ef46cfe6c..a6de159c71c2 100644 --- a/fs/xfs/xfs_icache.c +++ b/fs/xfs/xfs_icache.c @@ -139,83 +139,6 @@ xfs_inode_free( __xfs_inode_free(ip); } -static void -xfs_perag_set_reclaim_tag( - struct xfs_perag *pag) -{ - struct xfs_mount *mp = pag->pag_mount; - - lockdep_assert_held(&pag->pag_ici_lock); - if (pag->pag_ici_reclaimable++) - return; - - /* propagate the reclaim tag up into the perag radix tree */ - spin_lock(&mp->m_perag_lock); - radix_tree_tag_set(&mp->m_perag_tree, pag->pag_agno, - XFS_ICI_RECLAIM_TAG); - spin_unlock(&mp->m_perag_lock); - - trace_xfs_perag_set_reclaim(mp, pag->pag_agno, -1, _RET_IP_); -} - -static void -xfs_perag_clear_reclaim_tag( - struct xfs_perag *pag) -{ - struct xfs_mount *mp = pag->pag_mount; - - lockdep_assert_held(&pag->pag_ici_lock); - if (--pag->pag_ici_reclaimable) - return; - - /* clear the reclaim tag from the perag radix tree */ - spin_lock(&mp->m_perag_lock); - radix_tree_tag_clear(&mp->m_perag_tree, pag->pag_agno, - XFS_ICI_RECLAIM_TAG); - spin_unlock(&mp->m_perag_lock); - trace_xfs_perag_clear_reclaim(mp, pag->pag_agno, -1, _RET_IP_); -} - - -/* - * We set the inode flag atomically with the radix tree tag. - * Once we get tag lookups on the radix tree, this inode flag - * can go away. 
- */ -void -xfs_inode_set_reclaim_tag( - struct xfs_inode *ip) -{ - struct xfs_mount *mp = ip->i_mount; - struct xfs_perag *pag; - - pag = xfs_perag_get(mp, XFS_INO_TO_AGNO(mp, ip->i_ino)); - spin_lock(&pag->pag_ici_lock); - spin_lock(&ip->i_flags_lock); - - radix_tree_tag_set(&pag->pag_ici_root, XFS_INO_TO_AGINO(mp, ip->i_ino), - XFS_ICI_RECLAIM_TAG); - xfs_perag_set_reclaim_tag(pag); - __xfs_iflags_set(ip, XFS_IRECLAIMABLE); - - list_lru_add(&mp->m_inode_lru, &VFS_I(ip)->i_lru); - - spin_unlock(&ip->i_flags_lock); - spin_unlock(&pag->pag_ici_lock); - xfs_perag_put(pag); -} - -STATIC void -xfs_inode_clear_reclaim_tag( - struct xfs_perag *pag, - xfs_ino_t ino) -{ - radix_tree_tag_clear(&pag->pag_ici_root, - XFS_INO_TO_AGINO(pag->pag_mount, ino), - XFS_ICI_RECLAIM_TAG); - xfs_perag_clear_reclaim_tag(pag); -} - static void xfs_inew_wait( struct xfs_inode *ip) @@ -397,18 +320,16 @@ xfs_iget_cache_hit( goto out_error; } - spin_lock(&pag->pag_ici_lock); - spin_lock(&ip->i_flags_lock); /* * Clear the per-lifetime state in the inode as we are now * effectively a new inode and need to return to the initial * state before reuse occurs. */ + spin_lock(&ip->i_flags_lock); ip->i_flags &= ~XFS_IRECLAIM_RESET_FLAGS; ip->i_flags |= XFS_INEW; list_lru_del(&mp->m_inode_lru, &inode->i_lru); - xfs_inode_clear_reclaim_tag(pag, ip->i_ino); inode->i_state = I_NEW; ip->i_sick = 0; ip->i_checked = 0; @@ -417,7 +338,6 @@ xfs_iget_cache_hit( init_rwsem(&inode->i_rwsem); spin_unlock(&ip->i_flags_lock); - spin_unlock(&pag->pag_ici_lock); } else { /* If the VFS inode is being torn down, pause and try again. */ if (!igrab(inode)) { @@ -968,335 +888,6 @@ xfs_inode_ag_iterator_tag( return last_error; } -/* - * Grab the inode for reclaim. - * - * Return false if we aren't going to reclaim it, true if it is a reclaim - * candidate. - * - * If the inode is clean or unreclaimable, return 0 to tell the caller it does - * not require flushing. 
Otherwise return the log item lsn of the inode so the - * caller can determine it's inode flush target. If we get the clean/dirty - * state wrong then it will be sorted in xfs_reclaim_inode() once we have locks - * held. - */ -STATIC bool -xfs_reclaim_inode_grab( - struct xfs_inode *ip, - int flags, - xfs_lsn_t *lsn) -{ - ASSERT(rcu_read_lock_held()); - *lsn = 0; - - /* quick check for stale RCU freed inode */ - if (!ip->i_ino) - return false; - - /* - * Do unlocked checks to see if the inode already is being flushed or in - * reclaim to avoid lock traffic. If the inode is not clean, return the - * it's position in the AIL for the caller to push to. - */ - if (!xfs_inode_clean(ip)) { - *lsn = ip->i_itemp->ili_item.li_lsn; - return false; - } - - if (__xfs_iflags_test(ip, XFS_IFLOCK | XFS_IRECLAIM)) - return false; - - /* - * The radix tree lock here protects a thread in xfs_iget from racing - * with us starting reclaim on the inode. Once we have the - * XFS_IRECLAIM flag set it will not touch us. - * - * Due to RCU lookup, we may find inodes that have been freed and only - * have XFS_IRECLAIM set. Indeed, we may see reallocated inodes that - * aren't candidates for reclaim at all, so we must check the - * XFS_IRECLAIMABLE is set first before proceeding to reclaim. - */ - spin_lock(&ip->i_flags_lock); - if (!__xfs_iflags_test(ip, XFS_IRECLAIMABLE) || - __xfs_iflags_test(ip, XFS_IRECLAIM)) { - /* not a reclaim candidate. */ - spin_unlock(&ip->i_flags_lock); - return false; - } - __xfs_iflags_set(ip, XFS_IRECLAIM); - spin_unlock(&ip->i_flags_lock); - return true; -} - -/* - * Inodes in different states need to be treated differently. 
The following - * table lists the inode states and the reclaim actions necessary: - * - * inode state iflush ret required action - * --------------- ---------- --------------- - * bad - reclaim - * shutdown EIO unpin and reclaim - * clean, unpinned 0 reclaim - * stale, unpinned 0 reclaim - * clean, pinned(*) 0 requeue - * stale, pinned EAGAIN requeue - * dirty, async - requeue - * dirty, sync 0 reclaim - * - * (*) dgc: I don't think the clean, pinned state is possible but it gets - * handled anyway given the order of checks implemented. - * - * Also, because we get the flush lock first, we know that any inode that has - * been flushed delwri has had the flush completed by the time we check that - * the inode is clean. - * - * Note that because the inode is flushed delayed write by AIL pushing, the - * flush lock may already be held here and waiting on it can result in very - * long latencies. Hence for sync reclaims, where we wait on the flush lock, - * the caller should push the AIL first before trying to reclaim inodes to - * minimise the amount of time spent waiting. For background relaim, we only - * bother to reclaim clean inodes anyway. - * - * Hence the order of actions after gaining the locks should be: - * bad => reclaim - * shutdown => unpin and reclaim - * pinned, async => requeue - * pinned, sync => unpin - * stale => reclaim - * clean => reclaim - * dirty, async => requeue - * dirty, sync => flush, wait and reclaim - * - * Returns true if the inode was reclaimed, false otherwise. - */ -STATIC bool -xfs_reclaim_inode( - struct xfs_inode *ip, - struct xfs_perag *pag, - xfs_lsn_t *lsn) -{ - xfs_ino_t ino; - - *lsn = 0; - - /* - * Don't try to flush the inode if another inode in this cluster has - * already flushed it after we did the initial checks in - * xfs_reclaim_inode_grab(). - */ - if (!xfs_ilock_nowait(ip, XFS_ILOCK_EXCL)) - goto out; - if (!xfs_iflock_nowait(ip)) - goto out_unlock; - - /* If we are in shutdown, we don't care about blocking. 
*/ - if (XFS_FORCED_SHUTDOWN(ip->i_mount)) { - xfs_iunpin_wait(ip); - /* xfs_iflush_abort() drops the flush lock */ - xfs_iflush_abort(ip, false); - goto reclaim; - } - - /* - * If it is pinned, we don't have an LSN we can push the AIL to - just - * an LSN that we can push the CIL with. We don't want to block doing - * that, so we'll just skip over this one without triggering writeback - * for now. - */ - if (xfs_ipincount(ip)) - goto out_ifunlock; - - /* - * Dirty inode we didn't catch, skip it. - */ - if (!xfs_inode_clean(ip) && !xfs_iflags_test(ip, XFS_ISTALE)) { - *lsn = ip->i_itemp->ili_item.li_lsn; - goto out_ifunlock; - } - - /* - * It's clean, we have it locked, we can now drop the flush lock - * and reclaim it. - */ - xfs_ifunlock(ip); - -reclaim: - ASSERT(!xfs_isiflocked(ip)); - ASSERT(xfs_inode_clean(ip) || xfs_iflags_test(ip, XFS_ISTALE)); - ASSERT(ip->i_ino != 0); - - /* - * Because we use RCU freeing we need to ensure the inode always appears - * to be reclaimed with an invalid inode number when in the free state. - * We do this as early as possible under the ILOCK so that - * xfs_iflush_cluster() and xfs_ifree_cluster() can be guaranteed to - * detect races with us here. By doing this, we guarantee that once - * xfs_iflush_cluster() or xfs_ifree_cluster() has locked XFS_ILOCK that - * it will see either a valid inode that will serialise correctly, or it - * will see an invalid inode that it can skip. - */ - spin_lock(&ip->i_flags_lock); - ino = ip->i_ino; /* for radix_tree_delete */ - ip->i_flags = XFS_IRECLAIM; - ip->i_ino = 0; - - /* XXX: temporary until lru based reclaim */ - list_lru_del(&pag->pag_mount->m_inode_lru, &VFS_I(ip)->i_lru); - spin_unlock(&ip->i_flags_lock); - - xfs_iunlock(ip, XFS_ILOCK_EXCL); - - XFS_STATS_INC(ip->i_mount, xs_ig_reclaims); - /* - * Remove the inode from the per-AG radix tree. 
- * - * Because radix_tree_delete won't complain even if the item was never - * added to the tree assert that it's been there before to catch - * problems with the inode life time early on. - */ - spin_lock(&pag->pag_ici_lock); - if (!radix_tree_delete(&pag->pag_ici_root, - XFS_INO_TO_AGINO(ip->i_mount, ino))) - ASSERT(0); - xfs_perag_clear_reclaim_tag(pag); - spin_unlock(&pag->pag_ici_lock); - - /* - * Here we do an (almost) spurious inode lock in order to coordinate - * with inode cache radix tree lookups. This is because the lookup - * can reference the inodes in the cache without taking references. - * - * We make that OK here by ensuring that we wait until the inode is - * unlocked after the lookup before we go ahead and free it. - */ - xfs_ilock(ip, XFS_ILOCK_EXCL); - xfs_qm_dqdetach(ip); - xfs_iunlock(ip, XFS_ILOCK_EXCL); - - __xfs_inode_free(ip); - return true; - -out_ifunlock: - xfs_ifunlock(ip); -out_unlock: - xfs_iunlock(ip, XFS_ILOCK_EXCL); -out: - xfs_iflags_clear(ip, XFS_IRECLAIM); - return false; -} - -/* - * Walk the AGs and reclaim the inodes in them. Even if the filesystem is - * corrupted, we still want to try to reclaim all the inodes. If we don't, - * then a shut down during filesystem unmount reclaim walk leak all the - * unreclaimed inodes. - * - * Return the number of inodes freed. 
- */ -int -xfs_reclaim_inodes_ag( - struct xfs_mount *mp, - int flags, - int nr_to_scan) -{ - struct xfs_perag *pag; - xfs_agnumber_t ag; - xfs_lsn_t lsn, lowest_lsn = NULLCOMMITLSN; - long freed = 0; - - ag = 0; - while ((pag = xfs_perag_get_tag(mp, ag, XFS_ICI_RECLAIM_TAG))) { - unsigned long first_index = 0; - int done = 0; - int nr_found = 0; - - ag = pag->pag_agno + 1; - do { - struct xfs_inode *batch[XFS_LOOKUP_BATCH]; - int i; - - mutex_lock(&pag->pag_ici_reclaim_lock); - first_index = pag->pag_ici_reclaim_cursor; - - rcu_read_lock(); - nr_found = radix_tree_gang_lookup_tag( - &pag->pag_ici_root, - (void **)batch, first_index, - XFS_LOOKUP_BATCH, - XFS_ICI_RECLAIM_TAG); - if (!nr_found) { - done = 1; - rcu_read_unlock(); - break; - } - - /* - * Grab the inodes before we drop the lock. if we found - * nothing, nr == 0 and the loop will be skipped. - */ - for (i = 0; i < nr_found; i++) { - struct xfs_inode *ip = batch[i]; - - if (done || - !xfs_reclaim_inode_grab(ip, flags, &lsn)) - batch[i] = NULL; - - if (lsn && XFS_LSN_CMP(lsn, lowest_lsn) < 0) - lowest_lsn = lsn; - - /* - * Update the index for the next lookup. Catch - * overflows into the next AG range which can - * occur if we have inodes in the last block of - * the AG and we are currently pointing to the - * last inode. - * - * Because we may see inodes that are from the - * wrong AG due to RCU freeing and - * reallocation, only update the index if it - * lies in this AG. It was a race that lead us - * to see this inode, so another lookup from - * the same index will not find it again. - */ - if (XFS_INO_TO_AGNO(mp, ip->i_ino) != - pag->pag_agno) - continue; - first_index = XFS_INO_TO_AGINO(mp, ip->i_ino + 1); - if (first_index < XFS_INO_TO_AGINO(mp, ip->i_ino)) - done = 1; - } - - /* unlock now we've grabbed the inodes. 
*/ - rcu_read_unlock(); - if (!done) - pag->pag_ici_reclaim_cursor = first_index; - else - pag->pag_ici_reclaim_cursor = 0; - mutex_unlock(&pag->pag_ici_reclaim_lock); - - for (i = 0; i < nr_found; i++) { - if (!batch[i]) - continue; - if (xfs_reclaim_inode(batch[i], pag, &lsn)) - freed++; - if (lsn && XFS_LSN_CMP(lsn, lowest_lsn) < 0) - lowest_lsn = lsn; - } - - nr_to_scan -= XFS_LOOKUP_BATCH; - cond_resched(); - - } while (nr_found && !done && nr_to_scan > 0); - - xfs_perag_put(pag); - } - - if ((flags & SYNC_WAIT) && lowest_lsn != NULLCOMMITLSN) - xfs_ail_push_sync(mp->m_ail, lowest_lsn); - - return freed; -} - enum lru_status xfs_inode_reclaim_isolate( struct list_head *item, diff --git a/fs/xfs/xfs_icache.h b/fs/xfs/xfs_icache.h index dadc69a30f33..0b4d06691275 100644 --- a/fs/xfs/xfs_icache.h +++ b/fs/xfs/xfs_icache.h @@ -25,9 +25,8 @@ struct xfs_eofblocks { */ #define XFS_ICI_NO_TAG (-1) /* special flag for an untagged lookup in xfs_inode_ag_iterator */ -#define XFS_ICI_RECLAIM_TAG 0 /* inode is to be reclaimed */ -#define XFS_ICI_EOFBLOCKS_TAG 1 /* inode has blocks beyond EOF */ -#define XFS_ICI_COWBLOCKS_TAG 2 /* inode can have cow blocks to gc */ +#define XFS_ICI_EOFBLOCKS_TAG 0 /* inode has blocks beyond EOF */ +#define XFS_ICI_COWBLOCKS_TAG 1 /* inode can have cow blocks to gc */ /* * Flags for xfs_iget() @@ -60,8 +59,6 @@ enum lru_status xfs_inode_reclaim_isolate(struct list_head *item, void xfs_dispose_inodes(struct list_head *freeable); void xfs_reclaim_inodes(struct xfs_mount *mp); -void xfs_inode_set_reclaim_tag(struct xfs_inode *ip); - void xfs_inode_set_eofblocks_tag(struct xfs_inode *ip); void xfs_inode_clear_eofblocks_tag(struct xfs_inode *ip); int xfs_icache_free_eofblocks(struct xfs_mount *, struct xfs_eofblocks *); diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c index 3a38fe7c4f8d..32c6bc186c14 100644 --- a/fs/xfs/xfs_mount.c +++ b/fs/xfs/xfs_mount.c @@ -148,7 +148,6 @@ xfs_free_perag( ASSERT(atomic_read(&pag->pag_ref) == 0); 
xfs_iunlink_destroy(pag); xfs_buf_hash_destroy(pag); - mutex_destroy(&pag->pag_ici_reclaim_lock); call_rcu(&pag->rcu_head, __xfs_free_perag); } } @@ -200,7 +199,6 @@ xfs_initialize_perag( pag->pag_agno = index; pag->pag_mount = mp; spin_lock_init(&pag->pag_ici_lock); - mutex_init(&pag->pag_ici_reclaim_lock); INIT_RADIX_TREE(&pag->pag_ici_root, GFP_ATOMIC); if (xfs_buf_hash_init(pag)) goto out_free_pag; @@ -242,7 +240,6 @@ xfs_initialize_perag( out_hash_destroy: xfs_buf_hash_destroy(pag); out_free_pag: - mutex_destroy(&pag->pag_ici_reclaim_lock); kmem_free(pag); out_unwind_new_pags: /* unwind any prior newly initialized pags */ @@ -252,7 +249,6 @@ xfs_initialize_perag( break; xfs_buf_hash_destroy(pag); xfs_iunlink_destroy(pag); - mutex_destroy(&pag->pag_ici_reclaim_lock); kmem_free(pag); } return error; diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h index f1e4c2eae984..ef63357da7af 100644 --- a/fs/xfs/xfs_mount.h +++ b/fs/xfs/xfs_mount.h @@ -383,9 +383,6 @@ typedef struct xfs_perag { spinlock_t pag_ici_lock; /* incore inode cache lock */ struct radix_tree_root pag_ici_root; /* incore inode cache root */ - int pag_ici_reclaimable; /* reclaimable inodes */ - struct mutex pag_ici_reclaim_lock; /* serialisation point */ - unsigned long pag_ici_reclaim_cursor; /* reclaim restart point */ /* buffer cache index */ spinlock_t pag_buf_lock; /* lock for pag_buf_hash */ diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c index 3dfddd3a443b..a706862994c8 100644 --- a/fs/xfs/xfs_super.c +++ b/fs/xfs/xfs_super.c @@ -945,7 +945,6 @@ xfs_fs_destroy_inode( spin_lock(&ip->i_flags_lock); ASSERT_ALWAYS(!__xfs_iflags_test(ip, XFS_IRECLAIMABLE)); ASSERT_ALWAYS(!__xfs_iflags_test(ip, XFS_IRECLAIM)); - spin_unlock(&ip->i_flags_lock); /* * We always use background reclaim here because even if the @@ -954,7 +953,9 @@ xfs_fs_destroy_inode( * this more efficiently than we can here, so simply let background * reclaim tear down all inodes. 
*/ - xfs_inode_set_reclaim_tag(ip); + __xfs_iflags_set(ip, XFS_IRECLAIMABLE); + list_lru_add(&mp->m_inode_lru, &VFS_I(ip)->i_lru); + spin_unlock(&ip->i_flags_lock); } static void From patchwork Wed Oct 9 03:21:23 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Dave Chinner X-Patchwork-Id: 11180407 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id A326D17D4 for ; Wed, 9 Oct 2019 03:22:13 +0000 (UTC) Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by mail.kernel.org (Postfix) with ESMTP id 524A4206C2 for ; Wed, 9 Oct 2019 03:22:13 +0000 (UTC) DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org 524A4206C2 Authentication-Results: mail.kernel.org; dmarc=none (p=none dis=none) header.from=fromorbit.com Authentication-Results: mail.kernel.org; spf=pass smtp.mailfrom=owner-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix) id 41A698E0012; Tue, 8 Oct 2019 23:21:35 -0400 (EDT) Delivered-To: linux-mm-outgoing@kvack.org Received: by kanga.kvack.org (Postfix, from userid 40) id EA13B8E0017; Tue, 8 Oct 2019 23:21:34 -0400 (EDT) X-Original-To: int-list-linux-mm@kvack.org X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id B1D9A8E0019; Tue, 8 Oct 2019 23:21:34 -0400 (EDT) X-Original-To: linux-mm@kvack.org X-Delivered-To: linux-mm@kvack.org Received: from forelay.hostedemail.com (smtprelay0234.hostedemail.com [216.40.44.234]) by kanga.kvack.org (Postfix) with ESMTP id 78B918E0016 for ; Tue, 8 Oct 2019 23:21:34 -0400 (EDT) Received: from smtpin10.hostedemail.com (10.5.19.251.rfc1918.com [10.5.19.251]) by forelay01.hostedemail.com (Postfix) with SMTP id 0F31C180AD803 for ; Wed, 9 Oct 2019 03:21:34 +0000 (UTC) X-FDA: 76022796108.10.brush39_2aa59be9fb85a X-Spam-Summary: 
2,0,0,f78d22bac9fd19be,d41d8cd98f00b204,david@fromorbit.com,:linux-xfs@vger.kernel.org::linux-fsdevel@vger.kernel.org,RULES_HIT:41:69:327:355:379:541:800:960:966:968:973:988:989:1260:1261:1311:1314:1345:1359:1437:1515:1605:1730:1747:1777:1792:2194:2196:2198:2199:2200:2201:2393:2553:2559:2562:2693:2731:2892:2898:2903:3138:3139:3140:3141:3142:3308:3865:3866:3867:3868:3870:3871:3872:3874:4250:4321:4385:4605:5007:6119:6261:6691:7576:7875:7903:8603:8660:9010:9592:9707:10004:10128:10394:10954:11026:11232:11473:11658:11914:12043:12291:12296:12297:12438:12485:12517:12519:12555:12679:12683:12895:12986:13141:13148:13161:13229:13230:13548:13869:13870:13894:14096:14394:21067:21080:21324:21433:21451:21627:21740:21789:30005:30012:30034:30045:30054:30062:30070:30075:30089:30090,0,RBL:211.29.132.246:@fromorbit.com:.lbl8.mailshell.net-62.8.32.100 66.201.201.201,CacheIP:none,Bayesian:0.5,0.5,0.5,Netcheck:none,DomainCache:0,MSF:not bulk,SPF:fn,MSBL:0,DNSBL:neutral,Custom_rules:0 :0:0,LFt X-HE-Tag: brush39_2aa59be9fb85a X-Filterd-Recvd-Size: 21404 Received: from mail104.syd.optusnet.com.au (mail104.syd.optusnet.com.au [211.29.132.246]) by imf39.hostedemail.com (Postfix) with ESMTP for ; Wed, 9 Oct 2019 03:21:32 +0000 (UTC) Received: from dread.disaster.area (pa49-181-226-196.pa.nsw.optusnet.com.au [49.181.226.196]) by mail104.syd.optusnet.com.au (Postfix) with ESMTPS id 10E5543EACC; Wed, 9 Oct 2019 14:21:29 +1100 (AEDT) Received: from discord.disaster.area ([192.168.253.110]) by dread.disaster.area with esmtp (Exim 4.92.2) (envelope-from ) id 1iI2XX-0006CD-JQ; Wed, 09 Oct 2019 14:21:27 +1100 Received: from dave by discord.disaster.area with local (Exim 4.92) (envelope-from ) id 1iI2XX-0003A6-HG; Wed, 09 Oct 2019 14:21:27 +1100 From: Dave Chinner To: linux-xfs@vger.kernel.org Cc: linux-mm@kvack.org, linux-fsdevel@vger.kernel.org Subject: [PATCH 25/26] xfs: rework unreferenced inode lookups Date: Wed, 9 Oct 2019 14:21:23 +1100 Message-Id: <20191009032124.10541-26-david@fromorbit.com> 
X-Mailer: git-send-email 2.23.0.rc1 In-Reply-To: <20191009032124.10541-1-david@fromorbit.com> References: <20191009032124.10541-1-david@fromorbit.com> MIME-Version: 1.0 X-Optus-CM-Score: 0 X-Optus-CM-Analysis: v=2.2 cv=P6RKvmIu c=1 sm=1 tr=0 a=dRuLqZ1tmBNts2YiI0zFQg==:117 a=dRuLqZ1tmBNts2YiI0zFQg==:17 a=jpOVt7BSZ2e4Z31A5e1TngXxSK0=:19 a=XobE76Q3jBoA:10 a=20KFwNOVAAAA:8 a=VlnApPuOkL4U3gyNBW0A:9 a=tgTecsELLDXGZ-Jq:21 a=o5b6YgS90KOZ5V1B:21 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: From: Dave Chinner Looking up an unreferenced inode in the inode cache is a bit hairy. We do this for inode invalidation and writeback clustering purposes, which is all invisible to the VFS. Hence we can't take reference counts to the inode and so must be very careful how we do it. There are several different places that all do the lookups and checks slightly differently. Fundamentally, though, they are all racy and inode reclaim has to block waiting for the inode lock if it loses the race. This is far from optimal given all the work we've already done to make reclaim non-blocking. We can make the reclaim process non-blocking with a couple of simple changes. If we define the unreferenced lookup process in a way that will either always grab an inode in a way that reclaim will notice and skip, or will notice a reclaim has grabbed the inode so it can skip the inode, then there is no need for reclaim to cycle the inode ILOCK at all. Selecting an inode for reclaim is already non-blocking, so if the ILOCK is held the inode will be skipped. If we ensure that reclaim holds the ILOCK until the inode is freed, then we can do the same thing in the unreferenced lookup to avoid inodes in reclaim. We can do this simply by holding the ILOCK until the RCU grace period expires and the inode freeing callback is run.
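To make the intended interaction concrete, here is a minimal single-threaded user-space sketch of the two sides. It is a model only, not kernel code: every model_* name and MODEL_IRECLAIM are hypothetical stand-ins, the ILOCK is reduced to a boolean, and the i_flags_lock and RCU read-side critical section are omitted because nothing runs concurrently here.

```c
/*
 * User-space model only. All names are hypothetical stand-ins for the
 * XFS constructs mentioned in the comments; this is not the kernel code.
 */
#include <assert.h>
#include <stdbool.h>

#define MODEL_IRECLAIM	0x1u	/* stand-in for XFS_IRECLAIM */

struct model_inode {
	unsigned long	ino;		/* stand-in for ip->i_ino, 0 once torn down */
	unsigned int	flags;		/* stand-in for ip->i_flags */
	bool		ilock_held;	/* stand-in for the ILOCK */
};

/* Non-blocking lock attempt, in the spirit of xfs_ilock_nowait(). */
static bool model_ilock_nowait(struct model_inode *ip)
{
	if (ip->ilock_held)
		return false;
	ip->ilock_held = true;
	return true;
}

static void model_iunlock(struct model_inode *ip)
{
	assert(ip->ilock_held);
	ip->ilock_held = false;
}

/*
 * Reclaim side: select the inode without blocking, mark it, and keep
 * the ILOCK held. Returns false if the inode was skipped.
 */
static bool model_reclaim_grab(struct model_inode *ip)
{
	if (!model_ilock_nowait(ip))
		return false;		/* a lookup holds a pin: skip it */
	ip->flags |= MODEL_IRECLAIM;
	ip->ino = 0;
	return true;			/* ILOCK deliberately still held */
}

/* Stand-in for the RCU freeing callback: the only place reclaim unlocks. */
static void model_free_callback(struct model_inode *ip)
{
	model_iunlock(ip);
	/* the real callback would free the inode here */
}

/*
 * Unreferenced lookup side, which in the kernel runs under
 * rcu_read_lock(). Returns true only if the caller now holds the ILOCK.
 */
static bool model_lookup_pin(struct model_inode *ip)
{
	if (ip->ino == 0 || (ip->flags & MODEL_IRECLAIM))
		return false;		/* reclaim got there first */
	return model_ilock_nowait(ip);	/* fails while reclaim holds the lock */
}
```

In this model, whichever side takes the lock first wins and the other side skips the inode. Because reclaim only unlocks in the freeing callback, a lookup has no window in which the trylock can succeed on an inode that is being torn down.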
As all unreferenced lookups have to hold the rcu_read_lock(), we are guaranteed that a reclaimed inode will be noticed as the trylock will fail. Additional research notes on final reclaim locking before free -------------------------------------------------------------- 2016: 1f2dcfe89eda ("xfs: xfs_inode_free() isn't RCU safe") Fixes the situation where the inode is found during RCU lookup within the freeing grace period, but critical structures have already been freed. Lookup code that has this problem includes xfs_iflush_cluster(). 2008: 455486b9ccdd ("[XFS] avoid all reclaimable inodes in xfs_sync_inodes_ag") Prior to this commit, the flushing of inodes required serialisation with xfs_ireclaim(), which did this lock/unlock thingy to ensure that it waited for flushing in xfs_sync_inodes_ag() to complete before freeing the inode: /* - * If we can't get a reference on the VFS_I, the inode must be - * in reclaim. If we can get the inode lock without blocking, - * it is safe to flush the inode because we hold the tree lock - * and xfs_iextract will block right now. Hence if we lock the - * inode while holding the tree lock, xfs_ireclaim() is - * guaranteed to block on the inode lock we now hold and hence - * it is safe to reference the inode until we drop the inode - * locks completely. + * If we can't get a reference on the inode, it must be + * in reclaim. Leave it for the reclaim code to flush. */ This case is completely gone from the modern code. The lock/unlock exists at the start of the git era. Switching to archive tree. This xfs_sync() functionality goes back to 1994 when inode writeback was first introduced by: 47ac6d60 ("Add support to xfs_ireclaim() needed for xfs_sync().") So it has been there forever - let's see if we can get rid of it. State of existing code: - xfs_iflush_cluster() does not check for the XFS_IRECLAIM inode flag while holding rcu_read_lock()/i_flags_lock, so it doesn't avoid reclaimable inodes or inodes that are in the process of being reclaimed.
Inodes at this point of reclaim are clean, so if xfs_iflush_cluster() wins the race to the ILOCK, then inode reclaim has to wait for the lock to be dropped by xfs_iflush_cluster() once it detects the inode is clean. - xfs_ifree_cluster() has similar logic based around XFS_ISTALE, which results in similar race conditions that require inode reclaim to cycle the ILOCK to serialise against. - xfs_inode_ag_walk() uses xfs_inode_ag_walk_grab(), and it checks XFS_IRECLAIM under RCU. It then tries to take a reference to the VFS inode via igrab(), which will fail if the inode has either XFS_IRECLAIMABLE or XFS_IRECLAIM set, and if it races then igrab() will fail because the inode has I_FREEING still set, so it's protected against reclaim races. That leaves xfs_iflush_cluster() + xfs_ifree_cluster() to be modified to do reclaim-safe lookups. W.r.t. new inode reclaim LRU isolate function: 1. inode can be referenced while rcu_read_lock() is held. 2. XFS_IRECLAIM means inode has been fully locked down and has been placed on the dispose list, and will be freed soon. - ilock_nowait() will fail once IRECLAIM is set due to lock order in isolation code. 3. ip->i_ino == 0 means it's been removed from the dispose list and is about to or has been removed from the radix tree and may have already been queued on the rcu freeing list to be freed at the end of the current grace period. - the old xfs_ireclaim() code will have dropped the ILOCK here, and so there's a race between checking IRECLAIM, grabbing ilock_nowait() and reclaim freeing the inode. - this is what the spurious lock/unlock avoids. 4. If xfs_ilock_nowait() fails until the rcu grace period expires, it doesn't matter if we race between checking IRECLAIM and failing the lock attempt. In fact, we don't even have to check XFS_IRECLAIM - just failing xfs_ilock_nowait() is sufficient to avoid inodes being reclaimed.
Hence when xfs_ilock_nowait() fails, we can either drop the rcu_read_lock at that point and restart the inode lookup, or we just skip the inode altogether. If we raced with reclaim, the retry will not find the inode in reclaim again. If we raced with some other lock holder, then we'll find the inode and try to lock it again. - Requires holding ILOCK into rcu freeing callback and dropping it there. i.e. inode to be reclaimed remains locked until grace period expires. - No window at all between IRECLAIM being set and visible to other CPUs and the inode being removed from the cache and freed where ilock_nowait will succeed. - simple, effective, reliable. Signed-off-by: Dave Chinner --- fs/xfs/xfs_icache.c | 86 ++++++++++++++++++++++------- fs/xfs/xfs_inode.c | 131 +++++++++++++++++++++----------------------- 2 files changed, 126 insertions(+), 91 deletions(-) diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c index a6de159c71c2..7a507aefeea6 100644 --- a/fs/xfs/xfs_icache.c +++ b/fs/xfs/xfs_icache.c @@ -105,6 +105,7 @@ xfs_inode_free_callback( ip->i_itemp = NULL; } + xfs_iunlock(ip, XFS_ILOCK_EXCL); kmem_zone_free(xfs_inode_zone, ip); } @@ -131,6 +132,7 @@ xfs_inode_free( * free state. The ip->i_flags_lock provides the barrier against lookup * races. */ + xfs_ilock(ip, XFS_ILOCK_EXCL); spin_lock(&ip->i_flags_lock); ip->i_flags = XFS_IRECLAIM; ip->i_ino = 0; @@ -294,11 +296,24 @@ xfs_iget_cache_hit( } /* - * We need to set XFS_IRECLAIM to prevent xfs_reclaim_inode - * from stomping over us while we recycle the inode. Remove it - * from the LRU straight away so we can re-init the VFS inode. + * Before we reinitialise the inode, we need to make sure + * reclaim does not pull it out from underneath us. We already + * hold the i_flags_lock, and because the XFS_IRECLAIM is not + * set we know the inode is still on the LRU.
However, the LRU + * code may have just selected this inode to reclaim, so we need + * to ensure we hold the i_flags_lock long enough for the + * trylock in xfs_inode_reclaim_isolate() to fail. We do this by + * removing the inode from the LRU, which will spin on the LRU + * list locks until reclaim stops walking, at which point we + * know there is no possible race between reclaim isolation and + * this lookup. + * + * We also set the XFS_IRECLAIM flag here while trying to do the + * re-initialisation to prevent multiple racing lookups on this + * inode from all landing here at the same time. */ ip->i_flags |= XFS_IRECLAIM; + list_lru_del(&mp->m_inode_lru, &inode->i_lru); spin_unlock(&ip->i_flags_lock); rcu_read_unlock(); @@ -312,7 +327,8 @@ xfs_iget_cache_hit( rcu_read_lock(); spin_lock(&ip->i_flags_lock); wake = !!__xfs_iflags_test(ip, XFS_INEW); ip->i_flags &= ~(XFS_INEW | XFS_IRECLAIM); + list_lru_add(&mp->m_inode_lru, &inode->i_lru); if (wake) wake_up_bit(&ip->i_flags, __XFS_INEW_BIT); ASSERT(ip->i_flags & XFS_IRECLAIMABLE); @@ -329,7 +345,6 @@ xfs_iget_cache_hit( spin_lock(&ip->i_flags_lock); ip->i_flags &= ~XFS_IRECLAIM_RESET_FLAGS; ip->i_flags |= XFS_INEW; - list_lru_del(&mp->m_inode_lru, &inode->i_lru); inode->i_state = I_NEW; ip->i_sick = 0; ip->i_checked = 0; @@ -609,8 +624,7 @@ xfs_icache_inode_is_allocated( /* * The inode lookup is done in batches to keep the amount of lock traffic and * radix tree lookups to a minimum. The batch size is a trade off between - * lookup reduction and stack usage. This is in the reclaim path, so we can't - * be too greedy. + * lookup reduction and stack usage. */ #define XFS_LOOKUP_BATCH 32 @@ -967,6 +981,41 @@ xfs_inode_reclaim_isolate( return ret; } +/* + * We are passed a locked inode to dispose of. + * + * To avoid race conditions with lookups that don't take references, we do + * not drop the XFS_ILOCK_EXCL until the RCU callback that frees the inode.
+ * This means that any attempt to lock the inode during the current RCU grace + * period will fail, and hence we do not need any synchronisation here to wait + * for code that pins unreferenced inodes with the XFS_ILOCK to drain. + * + * This requires code that takes such pins to do the following under a single + * rcu_read_lock() context: + * + * - rcu_read_lock() + * - find the inode via radix tree lookup + * - take the ip->i_flags_lock + * - check ip->i_ino != 0 + * - check XFS_IRECLAIM is not set + * - call xfs_ilock_nowait(ip, XFS_ILOCK_[SHARED|EXCL]) to lock the inode + * - drop ip->i_flags_lock + * - rcu_read_unlock() + * + * Only if all this succeeds does the caller have the inode locked and protected + * against it being freed until the ilock is released. If the XFS_IRECLAIM flag + * is set or xfs_ilock_nowait() fails, then the caller must either skip the + * inode and move on to the next inode (gang lookup) or drop the rcu_read_lock + * and start the entire inode lookup process again (individual lookup). + * + * This works because i_flags_lock serialises against + * xfs_inode_reclaim_isolate() - if the lookup wins the race on i_flags_lock and + * XFS_IRECLAIM is not set, then it will be able to lock the inode and hold off + * reclaim. If the isolate function wins the race, it will lock the inode and + * set the XFS_IRECLAIM flag if it is going to free the inode and this will + * prevent the lookup callers from succeeding in getting an unreferenced pin via + * the ILOCK. + */ static void xfs_dispose_inode( struct xfs_inode *ip) @@ -975,11 +1024,14 @@ xfs_dispose_inode( struct xfs_perag *pag; xfs_ino_t ino; + ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL)); ASSERT(xfs_isiflocked(ip)); ASSERT(xfs_inode_clean(ip) || xfs_iflags_test(ip, XFS_ISTALE) || XFS_FORCED_SHUTDOWN(mp)); ASSERT(ip->i_ino != 0); + XFS_STATS_INC(mp, xs_ig_reclaims); + /* * Process the shutdown reclaim work we deferred from the LRU isolation * callback before we go any further.
@@ -1006,9 +1058,7 @@ xfs_dispose_inode( ip->i_flags = XFS_IRECLAIM; ip->i_ino = 0; spin_unlock(&ip->i_flags_lock); - xfs_iunlock(ip, XFS_ILOCK_EXCL); - XFS_STATS_INC(mp, xs_ig_reclaims); /* * Remove the inode from the per-AG radix tree. * @@ -1023,19 +1073,7 @@ xfs_dispose_inode( spin_unlock(&pag->pag_ici_lock); xfs_perag_put(pag); - /* - * Here we do an (almost) spurious inode lock in order to coordinate - * with inode cache radix tree lookups. This is because the lookup - * can reference the inodes in the cache without taking references. - * - * We make that OK here by ensuring that we wait until the inode is - * unlocked after the lookup before we go ahead and free it. - * - * XXX: need to check this is still true. Not sure it is. - */ - xfs_ilock(ip, XFS_ILOCK_EXCL); xfs_qm_dqdetach(ip); - xfs_iunlock(ip, XFS_ILOCK_EXCL); __xfs_inode_free(ip); } @@ -1062,6 +1100,12 @@ xfs_reclaim_inodes( struct xfs_ireclaim_args ra; long freed, to_free; + /* push the AIL to clean dirty reclaimable inodes */ + xfs_ail_push_all(mp->m_ail); + + /* push the AIL to clean dirty reclaimable inodes */ + xfs_ail_push_all(mp->m_ail); + INIT_LIST_HEAD(&ra.freeable); ra.lowest_lsn = NULLCOMMITLSN; to_free = list_lru_count(&mp->m_inode_lru); diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c index 18f4b262e61c..1d7e3f575952 100644 --- a/fs/xfs/xfs_inode.c +++ b/fs/xfs/xfs_inode.c @@ -2622,52 +2622,54 @@ xfs_ifree_cluster( } /* - * because this is an RCU protected lookup, we could - * find a recently freed or even reallocated inode - * during the lookup. We need to check under the - * i_flags_lock for a valid inode here. Skip it if it - * is not valid, the wrong inode or stale. + * See xfs_dispose_inode() for an explanation of the + * tests here to avoid inode reclaim races. 
*/ spin_lock(&ip->i_flags_lock); - if (ip->i_ino != inum + i || - __xfs_iflags_test(ip, XFS_ISTALE)) { + if (!ip->i_ino || + __xfs_iflags_test(ip, XFS_IRECLAIM)) { spin_unlock(&ip->i_flags_lock); rcu_read_unlock(); continue; } - spin_unlock(&ip->i_flags_lock); /* - * Don't try to lock/unlock the current inode, but we - * _cannot_ skip the other inodes that we did not find - * in the list attached to the buffer and are not - * already marked stale. If we can't lock it, back off - * and retry. + * The inode isn't in reclaim, but it might be locked by + * someone else. In that case, we retry the inode rather + * than skipping it completely. */ - if (ip != free_ip) { - if (!xfs_ilock_nowait(ip, XFS_ILOCK_EXCL)) { - rcu_read_unlock(); - delay(1); - goto retry; - } + if (ip != free_ip && + !xfs_ilock_nowait(ip, XFS_ILOCK_EXCL)) { + spin_unlock(&ip->i_flags_lock); + rcu_read_unlock(); + delay(1); + goto retry; + } - /* - * Check the inode number again in case we're - * racing with freeing in xfs_reclaim_inode(). - * See the comments in that function for more - * information as to why the initial check is - * not sufficient. - */ - if (ip->i_ino != inum + i) { + /* + * Inode is now pinned against reclaim until we unlock + * it, so now we can do the work necessary to mark the + * inode stale and get it held until the cluster freeing + * transaction is logged. If it's stale, then it has + * already been attached to the buffer and we're done. + */ + if (__xfs_iflags_test(ip, XFS_ISTALE)) { + spin_unlock(&ip->i_flags_lock); + if (ip != free_ip) xfs_iunlock(ip, XFS_ILOCK_EXCL); - rcu_read_unlock(); - continue; - } + rcu_read_unlock(); + continue; } + __xfs_iflags_set(ip, XFS_ISTALE); + spin_unlock(&ip->i_flags_lock); rcu_read_unlock(); + /* + * Flush lock will hold off inode reclaim until the + * buffer completion routine runs the xfs_istale_done + * callback on the inode and unlocks it. 
+ */ xfs_iflock(ip); - xfs_iflags_set(ip, XFS_ISTALE); /* * we don't need to attach clean inodes or those only @@ -2677,7 +2679,8 @@ xfs_ifree_cluster( if (!iip || xfs_inode_clean(ip)) { ASSERT(ip != free_ip); xfs_ifunlock(ip); - xfs_iunlock(ip, XFS_ILOCK_EXCL); + if (ip != free_ip) + xfs_iunlock(ip, XFS_ILOCK_EXCL); continue; } @@ -3498,44 +3501,40 @@ xfs_iflush_cluster( continue; /* - * because this is an RCU protected lookup, we could find a - * recently freed or even reallocated inode during the lookup. - * We need to check under the i_flags_lock for a valid inode - * here. Skip it if it is not valid or the wrong inode. + * See xfs_dispose_inode() for an explanation of the + * tests here to avoid inode reclaim races. */ spin_lock(&cip->i_flags_lock); if (!cip->i_ino || - __xfs_iflags_test(cip, XFS_ISTALE)) { + __xfs_iflags_test(cip, XFS_IRECLAIM)) { spin_unlock(&cip->i_flags_lock); continue; } - /* - * Once we fall off the end of the cluster, no point checking - * any more inodes in the list because they will also all be - * outside the cluster. - */ + /* ILOCK will pin the inode against reclaim */ + if (!xfs_ilock_nowait(cip, XFS_ILOCK_SHARED)) { + spin_unlock(&cip->i_flags_lock); + continue; + } + + if (__xfs_iflags_test(cip, XFS_ISTALE)) { + xfs_iunlock(cip, XFS_ILOCK_SHARED); + spin_unlock(&cip->i_flags_lock); + continue; + } + + /* Lookup can find inodes outside the cluster being flushed. */ if ((XFS_INO_TO_AGINO(mp, cip->i_ino) & mask) != first_index) { + xfs_iunlock(cip, XFS_ILOCK_SHARED); spin_unlock(&cip->i_flags_lock); break; } spin_unlock(&cip->i_flags_lock); /* - * Do an un-protected check to see if the inode is dirty and - * is a candidate for flushing. These checks will be repeated - * later after the appropriate locks are acquired. - */ - if (xfs_inode_clean(cip) && xfs_ipincount(cip) == 0) - continue; - - /* - * Try to get locks. 
If any are unavailable or it is pinned, + * If we can't get the flush lock now or the inode is pinned, * then this inode cannot be flushed and is skipped. */ - - if (!xfs_ilock_nowait(cip, XFS_ILOCK_SHARED)) - continue; if (!xfs_iflock_nowait(cip)) { xfs_iunlock(cip, XFS_ILOCK_SHARED); continue; @@ -3546,22 +3545,9 @@ xfs_iflush_cluster( continue; } - /* - * Check the inode number again, just to be certain we are not - * racing with freeing in xfs_reclaim_inode(). See the comments - * in that function for more information as to why the initial - * check is not sufficient. - */ - if (!cip->i_ino) { - xfs_ifunlock(cip); - xfs_iunlock(cip, XFS_ILOCK_SHARED); - continue; - } - - /* - * arriving here means that this inode can be flushed. First - * re-check that it's dirty before flushing. + * Arriving here means that this inode can be flushed. First + * check that it's dirty before flushing. */ if (!xfs_inode_clean(cip)) { int error; @@ -3575,6 +3561,7 @@ xfs_iflush_cluster( xfs_ifunlock(cip); } xfs_iunlock(cip, XFS_ILOCK_SHARED); + /* unsafe to reference cip from here */ } if (clcount) { @@ -3613,7 +3600,11 @@ xfs_iflush_cluster( xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE); - /* abort the corrupt inode, as it was not attached to the buffer */ + /* + * Abort the corrupt inode, as it was not attached to the buffer. It is + * unlocked, but still pinned against reclaim by the flush lock so it is + * safe to reference here until after the flush abort completes. 
+ */ xfs_iflush_abort(cip, false); kmem_free(cilist); xfs_perag_put(pag); From patchwork Wed Oct 9 03:21:24 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Dave Chinner X-Patchwork-Id: 11180409 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 3969717D4 for ; Wed, 9 Oct 2019 03:22:16 +0000 (UTC) Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by mail.kernel.org (Postfix) with ESMTP id 10332206C2 for ; Wed, 9 Oct 2019 03:22:16 +0000 (UTC) DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org 10332206C2 Authentication-Results: mail.kernel.org; dmarc=none (p=none dis=none) header.from=fromorbit.com Authentication-Results: mail.kernel.org; spf=pass smtp.mailfrom=owner-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix) id 6074A8E0017; Tue, 8 Oct 2019 23:21:35 -0400 (EDT) Delivered-To: linux-mm-outgoing@kvack.org Received: by kanga.kvack.org (Postfix, from userid 40) id 049818E0018; Tue, 8 Oct 2019 23:21:34 -0400 (EDT) X-Original-To: int-list-linux-mm@kvack.org X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id D3BED8E0016; Tue, 8 Oct 2019 23:21:34 -0400 (EDT) X-Original-To: linux-mm@kvack.org X-Delivered-To: linux-mm@kvack.org Received: from forelay.hostedemail.com (smtprelay0007.hostedemail.com [216.40.44.7]) by kanga.kvack.org (Postfix) with ESMTP id AC1218E0017 for ; Tue, 8 Oct 2019 23:21:34 -0400 (EDT) Received: from smtpin05.hostedemail.com (10.5.19.251.rfc1918.com [10.5.19.251]) by forelay03.hostedemail.com (Postfix) with SMTP id 4AA7E824CA36 for ; Wed, 9 Oct 2019 03:21:34 +0000 (UTC) X-FDA: 76022796108.05.gun98_2aba91c0cbc0a X-Spam-Summary: 
2,0,0,1455a1de42acdd49,d41d8cd98f00b204,david@fromorbit.com,:linux-xfs@vger.kernel.org::linux-fsdevel@vger.kernel.org,RULES_HIT:41:69:355:379:541:617:800:960:966:968:973:988:989:1260:1261:1311:1314:1345:1359:1437:1515:1535:1544:1711:1730:1747:1777:1792:2196:2199:2393:2559:2562:3138:3139:3140:3141:3142:3308:3355:3865:3866:3867:3868:3870:3871:3872:3874:4250:4321:4385:5007:6119:6261:7576:7903:9592:10004:11026:11232:11473:11658:11914:12043:12114:12296:12297:12438:12517:12519:12555:12679:12683:12895:12986:13161:13229:13894:14096:14181:14394:14721:21080:21324:21433:21451:21611:21627:30054:30091,0,RBL:211.29.132.246:@fromorbit.com:.lbl8.mailshell.net-62.8.32.100 66.201.201.201,CacheIP:none,Bayesian:0.5,0.5,0.5,Netcheck:none,DomainCache:0,MSF:not bulk,SPF:fn,MSBL:0,DNSBL:neutral,Custom_rules:0:0:0,LFtime:2,LUA_SUMMARY:none X-HE-Tag: gun98_2aba91c0cbc0a X-Filterd-Recvd-Size: 5918 Received: from mail104.syd.optusnet.com.au (mail104.syd.optusnet.com.au [211.29.132.246]) by imf19.hostedemail.com (Postfix) with ESMTP for ; Wed, 9 Oct 2019 03:21:33 +0000 (UTC) Received: from dread.disaster.area (pa49-181-226-196.pa.nsw.optusnet.com.au [49.181.226.196]) by mail104.syd.optusnet.com.au (Postfix) with ESMTPS id 2343643ECB7; Wed, 9 Oct 2019 14:21:29 +1100 (AEDT) Received: from discord.disaster.area ([192.168.253.110]) by dread.disaster.area with esmtp (Exim 4.92.2) (envelope-from ) id 1iI2XX-0006CH-Ks; Wed, 09 Oct 2019 14:21:27 +1100 Received: from dave by discord.disaster.area with local (Exim 4.92) (envelope-from ) id 1iI2XX-0003A9-IY; Wed, 09 Oct 2019 14:21:27 +1100 From: Dave Chinner To: linux-xfs@vger.kernel.org Cc: linux-mm@kvack.org, linux-fsdevel@vger.kernel.org Subject: [PATCH 26/26] xfs: use xfs_ail_push_all_sync in xfs_reclaim_inodes Date: Wed, 9 Oct 2019 14:21:24 +1100 Message-Id: <20191009032124.10541-27-david@fromorbit.com> X-Mailer: git-send-email 2.23.0.rc1 In-Reply-To: <20191009032124.10541-1-david@fromorbit.com> References: 
<20191009032124.10541-1-david@fromorbit.com> MIME-Version: 1.0 X-Optus-CM-Score: 0 X-Optus-CM-Analysis: v=2.2 cv=FNpr/6gs c=1 sm=1 tr=0 a=dRuLqZ1tmBNts2YiI0zFQg==:117 a=dRuLqZ1tmBNts2YiI0zFQg==:17 a=jpOVt7BSZ2e4Z31A5e1TngXxSK0=:19 a=XobE76Q3jBoA:10 a=20KFwNOVAAAA:8 a=ZzdNmvoJkPJkjLTIc_MA:9 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: From: Dave Chinner If we are reclaiming all inodes, it is likely we need to flush the entire AIL to do that. We have mechanisms to do that without needing to push to a specific LSN. Convert xfs_reclaim_inodes() to use the xfs_ail_push_all() variant so we can get rid of the hacky xfs_ail_push_sync() scaffolding we used to support the intermediate stages of the non-blocking reclaim changeset. Signed-off-by: Dave Chinner --- fs/xfs/xfs_icache.c | 17 +++++++++++------ fs/xfs/xfs_trans_ail.c | 33 --------------------------------- fs/xfs/xfs_trans_priv.h | 2 -- 3 files changed, 11 insertions(+), 41 deletions(-) diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c index 7a507aefeea6..c1cbef610081 100644 --- a/fs/xfs/xfs_icache.c +++ b/fs/xfs/xfs_icache.c @@ -25,6 +25,7 @@ #include "xfs_log.h" #include +#include /* for congestion_wait() */ /* * Allocate and initialise an xfs_inode. @@ -1092,6 +1093,10 @@ xfs_dispose_inodes( cond_resched(); } } + +/* + * Reclaim all unused inodes in the filesystem.
+ */ void xfs_reclaim_inodes( struct xfs_mount *mp) @@ -1106,6 +1111,9 @@ xfs_reclaim_inodes( /* push the AIL to clean dirty reclaimable inodes */ xfs_ail_push_all(mp->m_ail); + /* push the AIL to clean dirty reclaimable inodes */ + xfs_ail_push_all(mp->m_ail); + INIT_LIST_HEAD(&ra.freeable); ra.lowest_lsn = NULLCOMMITLSN; to_free = list_lru_count(&mp->m_inode_lru); @@ -1114,13 +1122,10 @@ xfs_reclaim_inodes( &ra, to_free); xfs_dispose_inodes(&ra.freeable); - if (freed == 0) { + if (freed == 0) xfs_log_force(mp, XFS_LOG_SYNC); - xfs_ail_push_all(mp->m_ail); - } else if (ra.lowest_lsn != NULLCOMMITLSN) { - xfs_ail_push_sync(mp->m_ail, ra.lowest_lsn); - } - cond_resched(); + else if (ra.dirty_skipped) + congestion_wait(BLK_RW_ASYNC, HZ/10); } } diff --git a/fs/xfs/xfs_trans_ail.c b/fs/xfs/xfs_trans_ail.c index 5e500a75b62b..685a21cd24c0 100644 --- a/fs/xfs/xfs_trans_ail.c +++ b/fs/xfs/xfs_trans_ail.c @@ -662,37 +662,6 @@ xfs_ail_push_all( xfs_ail_push(ailp, threshold_lsn); } -/* - * Push the AIL to a specific lsn and wait for it to complete. - */ -void -xfs_ail_push_sync( - struct xfs_ail *ailp, - xfs_lsn_t threshold_lsn) -{ - struct xfs_log_item *lip; - DEFINE_WAIT(wait); - - spin_lock(&ailp->ail_lock); - while ((lip = xfs_ail_min(ailp)) != NULL) { - prepare_to_wait(&ailp->ail_push, &wait, TASK_UNINTERRUPTIBLE); - if (XFS_FORCED_SHUTDOWN(ailp->ail_mount) || - XFS_LSN_CMP(threshold_lsn, lip->li_lsn) <= 0) - break; - /* XXX: cmpxchg? */ - while (XFS_LSN_CMP(threshold_lsn, ailp->ail_target) > 0) - xfs_trans_ail_copy_lsn(ailp, &ailp->ail_target, &threshold_lsn); - wake_up_process(ailp->ail_task); - spin_unlock(&ailp->ail_lock); - schedule(); - spin_lock(&ailp->ail_lock); - } - spin_unlock(&ailp->ail_lock); - - finish_wait(&ailp->ail_push, &wait); -} - - /* * Push out all items in the AIL immediately and wait until the AIL is empty. 
*/ @@ -733,7 +702,6 @@ xfs_ail_update_finish( if (!XFS_FORCED_SHUTDOWN(mp)) xlog_assign_tail_lsn_locked(mp); - wake_up_all(&ailp->ail_push); if (list_empty(&ailp->ail_head)) wake_up_all(&ailp->ail_empty); spin_unlock(&ailp->ail_lock); @@ -890,7 +858,6 @@ xfs_trans_ail_init( spin_lock_init(&ailp->ail_lock); INIT_LIST_HEAD(&ailp->ail_buf_list); init_waitqueue_head(&ailp->ail_empty); - init_waitqueue_head(&ailp->ail_push); ailp->ail_task = kthread_run(xfsaild, ailp, "xfsaild/%s", ailp->ail_mount->m_fsname); diff --git a/fs/xfs/xfs_trans_priv.h b/fs/xfs/xfs_trans_priv.h index 1b6f4bbd47c0..35655eac01a6 100644 --- a/fs/xfs/xfs_trans_priv.h +++ b/fs/xfs/xfs_trans_priv.h @@ -61,7 +61,6 @@ struct xfs_ail { int ail_log_flush; struct list_head ail_buf_list; wait_queue_head_t ail_empty; - wait_queue_head_t ail_push; }; /* @@ -114,7 +113,6 @@ xfs_trans_ail_remove( } void xfs_ail_push(struct xfs_ail *, xfs_lsn_t); -void xfs_ail_push_sync(struct xfs_ail *, xfs_lsn_t); void xfs_ail_push_all(struct xfs_ail *); void xfs_ail_push_all_sync(struct xfs_ail *); struct xfs_log_item *xfs_ail_min(struct xfs_ail *ailp);