From patchwork Thu Mar 11 03:05:46 2021
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: "Darrick J. Wong"
X-Patchwork-Id: 12130191
Subject: [PATCH 01/11] xfs: prevent metadata files from being inactivated
From: "Darrick J. Wong"
To: djwong@kernel.org
Cc: linux-xfs@vger.kernel.org
Date: Wed, 10 Mar 2021 19:05:46 -0800
Message-ID: <161543194600.1947934.584103655060069020.stgit@magnolia>
In-Reply-To: <161543194009.1947934.9910987247994410125.stgit@magnolia>
References: <161543194009.1947934.9910987247994410125.stgit@magnolia>
User-Agent: StGit/0.19
Precedence: bulk
X-Mailing-List: linux-xfs@vger.kernel.org

From: Darrick J. Wong

Files containing metadata (quota records, rt bitmap and summary info) are
fully managed by the filesystem, which means that all resource cleanup
must be explicit, not automatic.  This means that they should never be
subjected to automatic post-EOF truncation, nor should they be freed
automatically even if the link count drops to zero.

In other words, xfs_inactive() should leave these files alone.  Add the
necessary predicate functions to make this happen.  This adds a second
layer of prevention for the kind of fs corruption that was fixed by
commit f4c32e87de7d.  If we ever decide to support removing metadata
files, we should make all those metadata updates explicit.

Rearrange the order of #includes to fix compiler errors, since
xfs_mount.h is supposed to be included before xfs_inode.h.

Followup-to: f4c32e87de7d ("xfs: fix realtime bitmap/summary file truncation when growing rt volume")
Signed-off-by: Darrick J. Wong
Reviewed-by: Christoph Hellwig
Reviewed-by: Dave Chinner
---
 fs/xfs/libxfs/xfs_iext_tree.c |    2 +-
 fs/xfs/xfs_inode.c            |    4 ++++
 fs/xfs/xfs_inode.h            |    8 ++++++++
 fs/xfs/xfs_xattr.c            |    2 ++
 4 files changed, 15 insertions(+), 1 deletion(-)

diff --git a/fs/xfs/libxfs/xfs_iext_tree.c b/fs/xfs/libxfs/xfs_iext_tree.c
index b4164256993d..773cf4349428 100644
--- a/fs/xfs/libxfs/xfs_iext_tree.c
+++ b/fs/xfs/libxfs/xfs_iext_tree.c
@@ -8,9 +8,9 @@
 #include "xfs_format.h"
 #include "xfs_bit.h"
 #include "xfs_log_format.h"
-#include "xfs_inode.h"
 #include "xfs_trans_resv.h"
 #include "xfs_mount.h"
+#include "xfs_inode.h"
 #include "xfs_trace.h"
 
 /*
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index f93370bd7b1e..12c79962f8c3 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -1697,6 +1697,10 @@ xfs_inactive(
 	if (mp->m_flags & XFS_MOUNT_RDONLY)
 		return;
 
+	/* Metadata inodes require explicit resource cleanup. */
+	if (xfs_is_metadata_inode(ip))
+		return;
+
 	/* Try to clean out the cow blocks if there are any. */
 	if (xfs_inode_has_cow_data(ip))
 		xfs_reflink_cancel_cow_range(ip, 0, NULLFILEOFF, true);
diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
index 88ee4c3930ae..c2c26f8f4a81 100644
--- a/fs/xfs/xfs_inode.h
+++ b/fs/xfs/xfs_inode.h
@@ -185,6 +185,14 @@ static inline bool xfs_is_reflink_inode(struct xfs_inode *ip)
 	return ip->i_d.di_flags2 & XFS_DIFLAG2_REFLINK;
 }
 
+static inline bool xfs_is_metadata_inode(struct xfs_inode *ip)
+{
+	struct xfs_mount	*mp = ip->i_mount;
+
+	return ip == mp->m_rbmip || ip == mp->m_rsumip ||
+		xfs_is_quota_inode(&mp->m_sb, ip->i_ino);
+}
+
 /*
  * Check if an inode has any data in the COW fork.  This might be often false
  * even for inodes with the reflink flag when there is no pending COW operation.
diff --git a/fs/xfs/xfs_xattr.c b/fs/xfs/xfs_xattr.c
index 12be32f66dc1..0d050f8829ef 100644
--- a/fs/xfs/xfs_xattr.c
+++ b/fs/xfs/xfs_xattr.c
@@ -9,6 +9,8 @@
 #include "xfs_format.h"
 #include "xfs_log_format.h"
 #include "xfs_da_format.h"
+#include "xfs_trans_resv.h"
+#include "xfs_mount.h"
 #include "xfs_inode.h"
 #include "xfs_attr.h"
 #include "xfs_acl.h"

From patchwork Thu Mar 11 03:05:51 2021
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: "Darrick J. Wong"
X-Patchwork-Id: 12130195
Subject: [PATCH 02/11] xfs: refactor the predicate part of xfs_free_eofblocks
From: "Darrick J. Wong"
To: djwong@kernel.org
Cc: linux-xfs@vger.kernel.org
Date: Wed, 10 Mar 2021 19:05:51 -0800
Message-ID: <161543195167.1947934.16237799936089844524.stgit@magnolia>
In-Reply-To: <161543194009.1947934.9910987247994410125.stgit@magnolia>
References: <161543194009.1947934.9910987247994410125.stgit@magnolia>
User-Agent: StGit/0.19
Precedence: bulk
X-Mailing-List: linux-xfs@vger.kernel.org

From: Darrick J. Wong

Refactor the part of _free_eofblocks that decides if it's really going
to truncate post-EOF blocks into a separate helper function.  The
upcoming deferred inode inactivation patch requires us to be able to
decide this prior to actual inactivation.  No functionality changes.

Signed-off-by: Darrick J. Wong
Reviewed-by: Christoph Hellwig
---
 fs/xfs/xfs_bmap_util.c |  129 ++++++++++++++++++++++++++++--------------------
 fs/xfs/xfs_bmap_util.h |    1
 2 files changed, 76 insertions(+), 54 deletions(-)

diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
index e7d68318e6a5..21aa38183ae9 100644
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -628,27 +628,23 @@ xfs_can_free_eofblocks(struct xfs_inode *ip, bool force)
 }
 
 /*
- * This is called to free any blocks beyond eof. The caller must hold
- * IOLOCK_EXCL unless we are in the inode reclaim path and have the only
- * reference to the inode.
+ * Decide if this inode has post-EOF blocks.  The caller is responsible
+ * for knowing / caring about the PREALLOC/APPEND flags.
  */
 int
-xfs_free_eofblocks(
-	struct xfs_inode	*ip)
+xfs_has_eofblocks(
+	struct xfs_inode	*ip,
+	bool			*has)
 {
-	struct xfs_trans	*tp;
-	int			error;
+	struct xfs_bmbt_irec	imap;
+	struct xfs_mount	*mp = ip->i_mount;
 	xfs_fileoff_t		end_fsb;
 	xfs_fileoff_t		last_fsb;
 	xfs_filblks_t		map_len;
 	int			nimaps;
-	struct xfs_bmbt_irec	imap;
-	struct xfs_mount	*mp = ip->i_mount;
+	int			error;
 
-	/*
-	 * Figure out if there are any blocks beyond the end
-	 * of the file.  If not, then there is nothing to do.
-	 */
+	*has = false;
 	end_fsb = XFS_B_TO_FSB(mp, (xfs_ufsize_t)XFS_ISIZE(ip));
 	last_fsb = XFS_B_TO_FSB(mp, mp->m_super->s_maxbytes);
 	if (last_fsb <= end_fsb)
@@ -660,55 +656,80 @@ xfs_free_eofblocks(
 	error = xfs_bmapi_read(ip, end_fsb, map_len, &imap, &nimaps, 0);
 	xfs_iunlock(ip, XFS_ILOCK_SHARED);
 
+	if (error || nimaps == 0)
+		return error;
+
+	*has = imap.br_startblock != HOLESTARTBLOCK || ip->i_delayed_blks;
+	return 0;
+}
+
+/*
+ * This is called to free any blocks beyond eof. The caller must hold
+ * IOLOCK_EXCL unless we are in the inode reclaim path and have the only
+ * reference to the inode.
+ */
+int
+xfs_free_eofblocks(
+	struct xfs_inode	*ip)
+{
+	struct xfs_trans	*tp;
+	struct xfs_mount	*mp = ip->i_mount;
+	bool			has;
+	int			error;
+
 	/*
 	 * If there are blocks after the end of file, truncate the file to its
 	 * current size to free them up.
 	 */
-	if (!error && (nimaps != 0) &&
-	    (imap.br_startblock != HOLESTARTBLOCK ||
-	     ip->i_delayed_blks)) {
-		/*
-		 * Attach the dquots to the inode up front.
-		 */
-		error = xfs_qm_dqattach(ip);
-		if (error)
-			return error;
+	error = xfs_has_eofblocks(ip, &has);
+	if (error || !has)
+		return error;
 
-		/* wait on dio to ensure i_size has settled */
-		inode_dio_wait(VFS_I(ip));
+	/*
+	 * Attach the dquots to the inode up front.
+	 */
+	error = xfs_qm_dqattach(ip);
+	if (error)
+		return error;
 
-		error = xfs_trans_alloc(mp, &M_RES(mp)->tr_itruncate, 0, 0, 0,
-				&tp);
-		if (error) {
-			ASSERT(XFS_FORCED_SHUTDOWN(mp));
-			return error;
-		}
+	/* wait on dio to ensure i_size has settled */
+	inode_dio_wait(VFS_I(ip));
 
-		xfs_ilock(ip, XFS_ILOCK_EXCL);
-		xfs_trans_ijoin(tp, ip, 0);
-
-		/*
-		 * Do not update the on-disk file size.  If we update the
-		 * on-disk file size and then the system crashes before the
-		 * contents of the file are flushed to disk then the files
-		 * may be full of holes (ie NULL files bug).
-		 */
-		error = xfs_itruncate_extents_flags(&tp, ip, XFS_DATA_FORK,
-					XFS_ISIZE(ip), XFS_BMAPI_NODISCARD);
-		if (error) {
-			/*
-			 * If we get an error at this point we simply don't
-			 * bother truncating the file.
-			 */
-			xfs_trans_cancel(tp);
-		} else {
-			error = xfs_trans_commit(tp);
-			if (!error)
-				xfs_inode_clear_eofblocks_tag(ip);
-		}
-
-		xfs_iunlock(ip, XFS_ILOCK_EXCL);
+	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_itruncate, 0, 0, 0, &tp);
+	if (error) {
+		ASSERT(XFS_FORCED_SHUTDOWN(mp));
+		return error;
 	}
+
+	xfs_ilock(ip, XFS_ILOCK_EXCL);
+	xfs_trans_ijoin(tp, ip, 0);
+
+	/*
+	 * Do not update the on-disk file size.  If we update the
+	 * on-disk file size and then the system crashes before the
+	 * contents of the file are flushed to disk then the files
+	 * may be full of holes (ie NULL files bug).
+	 */
+	error = xfs_itruncate_extents_flags(&tp, ip, XFS_DATA_FORK,
+				XFS_ISIZE(ip), XFS_BMAPI_NODISCARD);
+	if (error)
+		goto err_cancel;
+
+	error = xfs_trans_commit(tp);
+	if (error)
+		goto out_unlock;
+
+	xfs_inode_clear_eofblocks_tag(ip);
+	goto out_unlock;
+
+err_cancel:
+	/*
+	 * If we get an error at this point we simply don't
+	 * bother truncating the file.
+	 */
+	xfs_trans_cancel(tp);
+out_unlock:
+	xfs_iunlock(ip, XFS_ILOCK_EXCL);
 	return error;
 }
 
diff --git a/fs/xfs/xfs_bmap_util.h b/fs/xfs/xfs_bmap_util.h
index 9f993168b55b..af07a4a20d7c 100644
--- a/fs/xfs/xfs_bmap_util.h
+++ b/fs/xfs/xfs_bmap_util.h
@@ -63,6 +63,7 @@ int	xfs_insert_file_space(struct xfs_inode *, xfs_off_t offset,
 				xfs_off_t len);
 
 /* EOF block manipulation functions */
+int	xfs_has_eofblocks(struct xfs_inode *ip, bool *has);
 bool	xfs_can_free_eofblocks(struct xfs_inode *ip, bool force);
 int	xfs_free_eofblocks(struct xfs_inode *ip);
 

From patchwork Thu Mar 11 03:05:57 2021
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: "Darrick J. Wong"
X-Patchwork-Id: 12130187
Subject: [PATCH 03/11] xfs: don't reclaim dquots with incore reservations
From: "Darrick J. Wong"
To: djwong@kernel.org
Cc: linux-xfs@vger.kernel.org
Date: Wed, 10 Mar 2021 19:05:57 -0800
Message-ID: <161543195719.1947934.8218545606940173264.stgit@magnolia>
In-Reply-To: <161543194009.1947934.9910987247994410125.stgit@magnolia>
References: <161543194009.1947934.9910987247994410125.stgit@magnolia>
User-Agent: StGit/0.19
Precedence: bulk
X-Mailing-List: linux-xfs@vger.kernel.org

From: Darrick J. Wong

If a dquot has an incore reservation that exceeds the ondisk count, it
by definition has active incore state and must not be reclaimed.  Up to
this point every inode with an incore dquot reservation has always
retained a reference to the dquot so it was never possible for
xfs_qm_dquot_isolate to be called on a dquot with active state and zero
refcount, but this will soon change.

Deferred inode inactivation is about to reorganize how inodes are
inactivated by shunting all that work to a background workqueue.  In
order to avoid deadlocks with the quotaoff inode scan and reduce overall
memory requirements (since inodes can spend a lot of time waiting for
inactivation), inactive inodes will drop their dquot references while
they're waiting to be inactivated.

However, inactive inodes can have delalloc extents in the data fork or
any extents in the CoW fork.  Either of these contributes to the dquot's
incore reservation being larger than the resource count (i.e.
they're the reason the dquot still has active incore state), so we
cannot allow the dquot to be reclaimed.

Signed-off-by: Darrick J. Wong
Reviewed-by: Christoph Hellwig
---
 fs/xfs/xfs_qm.c |   29 ++++++++++++++++++++++++-----
 fs/xfs/xfs_qm.h |   17 +++++++++++++++++
 2 files changed, 41 insertions(+), 5 deletions(-)

diff --git a/fs/xfs/xfs_qm.c b/fs/xfs/xfs_qm.c
index bfa4164990b1..b3ce04dec181 100644
--- a/fs/xfs/xfs_qm.c
+++ b/fs/xfs/xfs_qm.c
@@ -166,9 +166,14 @@ xfs_qm_dqpurge(
 
 	/*
 	 * We move dquots to the freelist as soon as their reference count
-	 * hits zero, so it really should be on the freelist here.
+	 * hits zero, so it really should be on the freelist here.  If we're
+	 * running quotaoff, it's possible that we're purging a zero-refcount
+	 * dquot with active incore reservation because there are inodes
+	 * awaiting inactivation.  Dquots in this state will not be on the LRU
+	 * but it's quotaoff, so we don't care.
 	 */
-	ASSERT(!list_empty(&dqp->q_lru));
+	ASSERT(!(mp->m_qflags & xfs_quota_active_flag(xfs_dquot_type(dqp))) ||
+	       !list_empty(&dqp->q_lru));
 	list_lru_del(&qi->qi_lru, &dqp->q_lru);
 	XFS_STATS_DEC(mp, xs_qm_dquot_unused);
 
@@ -411,6 +416,15 @@ struct xfs_qm_isolate {
 	struct list_head	dispose;
 };
 
+static inline bool
+xfs_dquot_has_incore_resv(
+	struct xfs_dquot	*dqp)
+{
+	return dqp->q_blk.reserved > dqp->q_blk.count ||
+		dqp->q_ino.reserved > dqp->q_ino.count ||
+		dqp->q_rtb.reserved > dqp->q_rtb.count;
+}
+
 static enum lru_status
 xfs_qm_dquot_isolate(
 	struct list_head	*item,
@@ -427,10 +441,15 @@ xfs_qm_dquot_isolate(
 		goto out_miss_busy;
 
 	/*
-	 * This dquot has acquired a reference in the meantime remove it from
-	 * the freelist and try again.
+	 * Either this dquot has incore reservations or it has acquired a
+	 * reference.  Remove it from the freelist and try again.
+	 *
+	 * Inodes tagged for inactivation drop their dquot references to avoid
+	 * deadlocks with quotaoff.  If these inodes have delalloc reservations
+	 * in the data fork or any extents in the CoW fork, these contribute
+	 * to the dquot's incore block reservation exceeding the count.
 	 */
-	if (dqp->q_nrefs) {
+	if (xfs_dquot_has_incore_resv(dqp) || dqp->q_nrefs) {
 		xfs_dqunlock(dqp);
 		XFS_STATS_INC(dqp->q_mount, xs_qm_dqwants);
 
diff --git a/fs/xfs/xfs_qm.h b/fs/xfs/xfs_qm.h
index e3dabab44097..78f90935e91e 100644
--- a/fs/xfs/xfs_qm.h
+++ b/fs/xfs/xfs_qm.h
@@ -105,6 +105,23 @@ xfs_quota_inode(struct xfs_mount *mp, xfs_dqtype_t type)
 	return NULL;
 }
 
+static inline unsigned int
+xfs_quota_active_flag(
+	xfs_dqtype_t		type)
+{
+	switch (type) {
+	case XFS_DQTYPE_USER:
+		return XFS_UQUOTA_ACTIVE;
+	case XFS_DQTYPE_GROUP:
+		return XFS_GQUOTA_ACTIVE;
+	case XFS_DQTYPE_PROJ:
+		return XFS_PQUOTA_ACTIVE;
+	default:
+		ASSERT(0);
+	}
+	return 0;
+}
+
 extern void		xfs_trans_mod_dquot(struct xfs_trans *tp, struct xfs_dquot *dqp,
 				    uint field, int64_t delta);
 extern void		xfs_trans_dqjoin(struct xfs_trans *, struct xfs_dquot *);

From patchwork Thu Mar 11 03:06:02 2021
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: "Darrick J. Wong"
X-Patchwork-Id: 12130189
Subject: [PATCH 04/11] xfs: decide if inode needs inactivation
From: "Darrick J. Wong"
To: djwong@kernel.org
Cc: linux-xfs@vger.kernel.org
Date: Wed, 10 Mar 2021 19:06:02 -0800
Message-ID: <161543196269.1947934.4125444770307830204.stgit@magnolia>
In-Reply-To: <161543194009.1947934.9910987247994410125.stgit@magnolia>
References: <161543194009.1947934.9910987247994410125.stgit@magnolia>
User-Agent: StGit/0.19
Precedence: bulk
X-Mailing-List: linux-xfs@vger.kernel.org

From: Darrick J. Wong

Add a predicate function to decide if an inode needs (deferred)
inactivation.  Any file that has been unlinked or has speculative
preallocations either for post-EOF writes or for CoW qualifies.
This function will also be used by the upcoming deferred inactivation
patch.

Signed-off-by: Darrick J. Wong
---
 fs/xfs/xfs_inode.c |   63 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/xfs_inode.h |    2 ++
 2 files changed, 65 insertions(+)

diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index 12c79962f8c3..65897cb0cf2a 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -1665,6 +1665,69 @@ xfs_inactive_ifree(
 	return 0;
 }
 
+/*
+ * Returns true if we need to update the on-disk metadata before we can free
+ * the memory used by this inode.  Updates include freeing post-eof
+ * preallocations; freeing COW staging extents; and marking the inode free in
+ * the inobt if it is on the unlinked list.
+ */
+bool
+xfs_inode_needs_inactivation(
+	struct xfs_inode	*ip)
+{
+	struct xfs_mount	*mp = ip->i_mount;
+	struct xfs_ifork	*cow_ifp = XFS_IFORK_PTR(ip, XFS_COW_FORK);
+
+	/*
+	 * If the inode is already free, then there can be nothing
+	 * to clean up here.
+	 */
+	if (VFS_I(ip)->i_mode == 0)
+		return false;
+
+	/* If this is a read-only mount, don't do this (would generate I/O) */
+	if (mp->m_flags & XFS_MOUNT_RDONLY)
+		return false;
+
+	/* Metadata inodes require explicit resource cleanup. */
+	if (xfs_is_metadata_inode(ip))
+		return false;
+
+	/* Try to clean out the cow blocks if there are any. */
+	if (cow_ifp && cow_ifp->if_bytes > 0)
+		return true;
+
+	if (VFS_I(ip)->i_nlink != 0) {
+		int	error;
+		bool	has;
+
+		/*
+		 * force is true because we are evicting an inode from the
+		 * cache. Post-eof blocks must be freed, lest we end up with
+		 * broken free space accounting.
+		 *
+		 * Note: don't bother with iolock here since lockdep complains
+		 * about acquiring it in reclaim context. We have the only
+		 * reference to the inode at this point anyways.
+		 *
+		 * If the predicate errors out, send the inode through
+		 * inactivation anyway, because that's what we did before.
+		 * The inactivation worker will ignore an inode that doesn't
+		 * actually need it.
+		 */
+		if (!xfs_can_free_eofblocks(ip, true))
+			return false;
+		error = xfs_has_eofblocks(ip, &has);
+		return error != 0 || has;
+	}
+
+	/*
+	 * Link count dropped to zero, which means we have to mark the inode
+	 * free on disk and remove it from the AGI unlinked list.
+	 */
+	return true;
+}
+
 /*
  * xfs_inactive
  *
diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
index c2c26f8f4a81..3fe8c8afbc72 100644
--- a/fs/xfs/xfs_inode.h
+++ b/fs/xfs/xfs_inode.h
@@ -480,6 +480,8 @@ extern struct kmem_zone	*xfs_inode_zone;
 /* The default CoW extent size hint. */
 #define XFS_DEFAULT_COWEXTSZ_HINT 32
 
+bool xfs_inode_needs_inactivation(struct xfs_inode *ip);
+
 int xfs_iunlink_init(struct xfs_perag *pag);
 void xfs_iunlink_destroy(struct xfs_perag *pag);
 

From patchwork Thu Mar 11 03:06:08 2021
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: "Darrick J. Wong"
X-Patchwork-Id: 12130203
Subject: [PATCH 05/11] xfs: rename the blockgc workqueue
From: "Darrick J. Wong"
To: djwong@kernel.org
Cc: linux-xfs@vger.kernel.org
Date: Wed, 10 Mar 2021 19:06:08 -0800
Message-ID: <161543196819.1947934.4325937657338405659.stgit@magnolia>
In-Reply-To: <161543194009.1947934.9910987247994410125.stgit@magnolia>
References: <161543194009.1947934.9910987247994410125.stgit@magnolia>
User-Agent: StGit/0.19
Precedence: bulk
X-Mailing-List: linux-xfs@vger.kernel.org

From: Darrick J. Wong

Since we're about to start using the blockgc workqueue to dispose of
inactivated inodes, strip the "block" prefix from the name; now it's
merely the general garbage collection (gc) workqueue.

Signed-off-by: Darrick J. Wong
Reviewed-by: Christoph Hellwig
---
 Documentation/admin-guide/xfs.rst |    2 +-
 fs/xfs/xfs_icache.c               |    2 +-
 fs/xfs/xfs_mount.h                |    2 +-
 fs/xfs/xfs_super.c                |    8 ++++----
 4 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/Documentation/admin-guide/xfs.rst b/Documentation/admin-guide/xfs.rst
index 5422407a96d7..8de008c0c5ad 100644
--- a/Documentation/admin-guide/xfs.rst
+++ b/Documentation/admin-guide/xfs.rst
@@ -522,7 +522,7 @@ and the short name of the data device.  They all can be found in:
 ================  ===========
   xfs_iwalk-$pid  Inode scans of the entire filesystem. Currently limited to
                   mount time quotacheck.
-     xfs-blockgc  Background garbage collection of disk space that have been
+          xfs-gc  Background garbage collection of disk space that have been
                   speculatively allocated beyond EOF or for staging copy on
                   write operations.
================ =========== diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c index 1d7720a0c068..e6a62f765422 100644 --- a/fs/xfs/xfs_icache.c +++ b/fs/xfs/xfs_icache.c @@ -1335,7 +1335,7 @@ xfs_blockgc_queue( { rcu_read_lock(); if (radix_tree_tagged(&pag->pag_ici_root, XFS_ICI_BLOCKGC_TAG)) - queue_delayed_work(pag->pag_mount->m_blockgc_workqueue, + queue_delayed_work(pag->pag_mount->m_gc_workqueue, &pag->pag_blockgc_work, msecs_to_jiffies(xfs_blockgc_secs * 1000)); rcu_read_unlock(); diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h index 659ad95fe3e0..81829d19596e 100644 --- a/fs/xfs/xfs_mount.h +++ b/fs/xfs/xfs_mount.h @@ -93,7 +93,7 @@ typedef struct xfs_mount { struct workqueue_struct *m_unwritten_workqueue; struct workqueue_struct *m_cil_workqueue; struct workqueue_struct *m_reclaim_workqueue; - struct workqueue_struct *m_blockgc_workqueue; + struct workqueue_struct *m_gc_workqueue; struct workqueue_struct *m_sync_workqueue; int m_bsize; /* fs logical block size */ diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c index e5e0713bebcd..e774358383d6 100644 --- a/fs/xfs/xfs_super.c +++ b/fs/xfs/xfs_super.c @@ -519,10 +519,10 @@ xfs_init_mount_workqueues( if (!mp->m_reclaim_workqueue) goto out_destroy_cil; - mp->m_blockgc_workqueue = alloc_workqueue("xfs-blockgc/%s", + mp->m_gc_workqueue = alloc_workqueue("xfs-gc/%s", WQ_SYSFS | WQ_UNBOUND | WQ_FREEZABLE | WQ_MEM_RECLAIM, 0, mp->m_super->s_id); - if (!mp->m_blockgc_workqueue) + if (!mp->m_gc_workqueue) goto out_destroy_reclaim; mp->m_sync_workqueue = alloc_workqueue("xfs-sync/%s", @@ -533,7 +533,7 @@ xfs_init_mount_workqueues( return 0; out_destroy_eofb: - destroy_workqueue(mp->m_blockgc_workqueue); + destroy_workqueue(mp->m_gc_workqueue); out_destroy_reclaim: destroy_workqueue(mp->m_reclaim_workqueue); out_destroy_cil: @@ -551,7 +551,7 @@ xfs_destroy_mount_workqueues( struct xfs_mount *mp) { destroy_workqueue(mp->m_sync_workqueue); - destroy_workqueue(mp->m_blockgc_workqueue); + 
destroy_workqueue(mp->m_gc_workqueue); destroy_workqueue(mp->m_reclaim_workqueue); destroy_workqueue(mp->m_cil_workqueue); destroy_workqueue(mp->m_unwritten_workqueue); From patchwork Thu Mar 11 03:06:13 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Darrick J. Wong" X-Patchwork-Id: 12130211 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-16.2 required=3.0 tests=BAYES_00,DKIMWL_WL_HIGH, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,INCLUDES_CR_TRAILER,INCLUDES_PATCH, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 76763C432C3 for ; Thu, 11 Mar 2021 03:07:09 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 3DB8764FD6 for ; Thu, 11 Mar 2021 03:07:09 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S230051AbhCKDGg (ORCPT ); Wed, 10 Mar 2021 22:06:36 -0500 Received: from mail.kernel.org ([198.145.29.99]:45820 "EHLO mail.kernel.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229874AbhCKDGO (ORCPT ); Wed, 10 Mar 2021 22:06:14 -0500 Received: by mail.kernel.org (Postfix) with ESMTPSA id 14F7E64EDB; Thu, 11 Mar 2021 03:06:14 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1615431974; bh=sYD0CFhtMCkyrXRkEM12iphjxLVdMjPyyniGAO/ge+w=; h=Subject:From:To:Cc:Date:In-Reply-To:References:From; b=iF/kLv6CnkYpgquOE0+jwSV6rcLezXomCZTirI1zfh6mty2Zo+GBqd4bNa2L3Gro6 TuREhSaAcaaOuE6i7hoe2/bnK3H37Ciggcxh3IpM1GsdAnFD2Mw08z2tmiGtj2tNPh rjTOIM4g6mBQR2N//kqKbnZs0gvWVvAICdn7Ev86+nTZiy+wNOdyq59YH4mK16BELs 9BoXnCVbjHSm5TSsJC/7Mw2MeBEG+UMiGh0qf4QvFEktHeSuTJz9/ww+CccelEuEq3 
NpJomQwovR9ut5sWn39RaHeHmJ/3JcegvSdA9bB4P6eDOHmf+1H0Ril5gIt/VwIYVN klIyjZaxBy8ig==
Subject: [PATCH 06/11] xfs: deferred inode inactivation
From: "Darrick J. Wong"
To: djwong@kernel.org
Cc: linux-xfs@vger.kernel.org
Date: Wed, 10 Mar 2021 19:06:13 -0800
Message-ID: <161543197372.1947934.1230576164438094965.stgit@magnolia>
In-Reply-To: <161543194009.1947934.9910987247994410125.stgit@magnolia>
References: <161543194009.1947934.9910987247994410125.stgit@magnolia>
User-Agent: StGit/0.19
MIME-Version: 1.0
Precedence: bulk
List-ID:
X-Mailing-List: linux-xfs@vger.kernel.org

From: Darrick J. Wong

Instead of calling xfs_inactive directly from xfs_fs_destroy_inode,
defer the inactivation phase to a separate workqueue. With this we
avoid blocking memory reclaim on filesystem metadata updates that are
necessary to free an in-core inode, such as post-eof block freeing, COW
staging extent freeing, and truncating and freeing unlinked inodes. Now
that work is deferred to a workqueue where we can do the freeing in
batches.

We introduce two new inode flags -- NEED_INACTIVE and INACTIVATING. The
first flag helps our worker find inodes needing inactivation, and the
second flag marks inodes that are in the process of being inactivated.
A concurrent xfs_iget on the inode can still resurrect the inode by
clearing NEED_INACTIVE (or bailing if INACTIVATING is set).

Unfortunately, deferring the inactivation has one huge downside --
eventual consistency. Since all the freeing is deferred to a worker
thread, one can rm a file but the space doesn't come back immediately.
This can cause some odd side effects with quota accounting and statfs,
so we also force inactivation scans in order to maintain the existing
behaviors, at least outwardly.

For this patch we'll set the delay to zero to mimic the old timing as
much as possible; in the next patch we'll play with different delay
settings.

Signed-off-by: Darrick J.
Wong --- Documentation/admin-guide/xfs.rst | 3 fs/xfs/scrub/common.c | 2 fs/xfs/xfs_fsops.c | 9 + fs/xfs/xfs_icache.c | 436 ++++++++++++++++++++++++++++++++++++- fs/xfs/xfs_icache.h | 9 + fs/xfs/xfs_inode.c | 45 +++- fs/xfs/xfs_inode.h | 14 + fs/xfs/xfs_log_recover.c | 7 + fs/xfs/xfs_mount.c | 13 + fs/xfs/xfs_mount.h | 4 fs/xfs/xfs_qm_syscalls.c | 20 ++ fs/xfs/xfs_super.c | 53 ++++ fs/xfs/xfs_trace.h | 15 + 13 files changed, 604 insertions(+), 26 deletions(-) diff --git a/Documentation/admin-guide/xfs.rst b/Documentation/admin-guide/xfs.rst index 8de008c0c5ad..f9b109bfc6a6 100644 --- a/Documentation/admin-guide/xfs.rst +++ b/Documentation/admin-guide/xfs.rst @@ -524,7 +524,8 @@ and the short name of the data device. They all can be found in: mount time quotacheck. xfs-gc Background garbage collection of disk space that have been speculatively allocated beyond EOF or for staging copy on - write operations. + write operations; and files that are no longer linked into + the directory tree. ================ =========== For example, the knobs for the quotacheck workqueue for /dev/nvme0n1 would be diff --git a/fs/xfs/scrub/common.c b/fs/xfs/scrub/common.c index da60e7d1f895..8bc824515e0b 100644 --- a/fs/xfs/scrub/common.c +++ b/fs/xfs/scrub/common.c @@ -886,6 +886,7 @@ xchk_stop_reaping( { sc->flags |= XCHK_REAPING_DISABLED; xfs_blockgc_stop(sc->mp); + xfs_inodegc_stop(sc->mp); } /* Restart background reaping of resources. 
*/ @@ -893,6 +894,7 @@ void xchk_start_reaping( struct xfs_scrub *sc) { + xfs_inodegc_start(sc->mp); xfs_blockgc_start(sc->mp); sc->flags &= ~XCHK_REAPING_DISABLED; } diff --git a/fs/xfs/xfs_fsops.c b/fs/xfs/xfs_fsops.c index a2a407039227..3a3baf56198b 100644 --- a/fs/xfs/xfs_fsops.c +++ b/fs/xfs/xfs_fsops.c @@ -19,6 +19,8 @@ #include "xfs_log.h" #include "xfs_ag.h" #include "xfs_ag_resv.h" +#include "xfs_inode.h" +#include "xfs_icache.h" /* * growfs operations @@ -290,6 +292,13 @@ xfs_fs_counts( xfs_mount_t *mp, xfs_fsop_counts_t *cnt) { + /* + * Process all the queued file and speculative preallocation cleanup so + * that the counter values we report here do not incorporate any + * resources that were previously deleted. + */ + xfs_inodegc_force(mp); + cnt->allocino = percpu_counter_read_positive(&mp->m_icount); cnt->freeino = percpu_counter_read_positive(&mp->m_ifree); cnt->freedata = percpu_counter_read_positive(&mp->m_fdblocks) - diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c index e6a62f765422..1b7652af5ee5 100644 --- a/fs/xfs/xfs_icache.c +++ b/fs/xfs/xfs_icache.c @@ -195,6 +195,18 @@ xfs_perag_clear_reclaim_tag( trace_xfs_perag_clear_reclaim(mp, pag->pag_agno, -1, _RET_IP_); } +static void +__xfs_inode_set_reclaim_tag( + struct xfs_perag *pag, + struct xfs_inode *ip) +{ + struct xfs_mount *mp = ip->i_mount; + + radix_tree_tag_set(&pag->pag_ici_root, XFS_INO_TO_AGINO(mp, ip->i_ino), + XFS_ICI_RECLAIM_TAG); + xfs_perag_set_reclaim_tag(pag); + __xfs_iflags_set(ip, XFS_IRECLAIMABLE); +} /* * We set the inode flag atomically with the radix tree tag. 
@@ -212,10 +224,7 @@ xfs_inode_set_reclaim_tag( spin_lock(&pag->pag_ici_lock); spin_lock(&ip->i_flags_lock); - radix_tree_tag_set(&pag->pag_ici_root, XFS_INO_TO_AGINO(mp, ip->i_ino), - XFS_ICI_RECLAIM_TAG); - xfs_perag_set_reclaim_tag(pag); - __xfs_iflags_set(ip, XFS_IRECLAIMABLE); + __xfs_inode_set_reclaim_tag(pag, ip); spin_unlock(&ip->i_flags_lock); spin_unlock(&pag->pag_ici_lock); @@ -233,6 +242,94 @@ xfs_inode_clear_reclaim_tag( xfs_perag_clear_reclaim_tag(pag); } +/* Queue a new inode gc pass if there are inodes needing inactivation. */ +static void +xfs_inodegc_queue( + struct xfs_mount *mp) +{ + rcu_read_lock(); + if (radix_tree_tagged(&mp->m_perag_tree, XFS_ICI_INACTIVE_TAG)) + queue_delayed_work(mp->m_gc_workqueue, &mp->m_inodegc_work, + 0); + rcu_read_unlock(); +} + +/* Remember that an AG has one more inode to inactivate. */ +static void +xfs_perag_set_inactive_tag( + struct xfs_perag *pag) +{ + struct xfs_mount *mp = pag->pag_mount; + + lockdep_assert_held(&pag->pag_ici_lock); + if (pag->pag_ici_inactive++) + return; + + /* propagate the inactive tag up into the perag radix tree */ + spin_lock(&mp->m_perag_lock); + radix_tree_tag_set(&mp->m_perag_tree, pag->pag_agno, + XFS_ICI_INACTIVE_TAG); + spin_unlock(&mp->m_perag_lock); + + /* schedule periodic background inode inactivation */ + xfs_inodegc_queue(mp); + + trace_xfs_perag_set_inactive(mp, pag->pag_agno, -1, _RET_IP_); +} + +/* Set this inode's inactive tag and set the per-AG tag.
*/ +void +xfs_inode_set_inactive_tag( + struct xfs_inode *ip) +{ + struct xfs_mount *mp = ip->i_mount; + struct xfs_perag *pag; + + pag = xfs_perag_get(mp, XFS_INO_TO_AGNO(mp, ip->i_ino)); + spin_lock(&pag->pag_ici_lock); + spin_lock(&ip->i_flags_lock); + + radix_tree_tag_set(&pag->pag_ici_root, XFS_INO_TO_AGINO(mp, ip->i_ino), + XFS_ICI_INACTIVE_TAG); + xfs_perag_set_inactive_tag(pag); + __xfs_iflags_set(ip, XFS_NEED_INACTIVE); + + spin_unlock(&ip->i_flags_lock); + spin_unlock(&pag->pag_ici_lock); + xfs_perag_put(pag); +} + +/* Remember that an AG has one less inode to inactivate. */ +static void +xfs_perag_clear_inactive_tag( + struct xfs_perag *pag) +{ + struct xfs_mount *mp = pag->pag_mount; + + lockdep_assert_held(&pag->pag_ici_lock); + if (--pag->pag_ici_inactive) + return; + + /* clear the inactive tag from the perag radix tree */ + spin_lock(&mp->m_perag_lock); + radix_tree_tag_clear(&mp->m_perag_tree, pag->pag_agno, + XFS_ICI_INACTIVE_TAG); + spin_unlock(&mp->m_perag_lock); + trace_xfs_perag_clear_inactive(mp, pag->pag_agno, -1, _RET_IP_); +} + +/* Clear this inode's inactive tag and try to clear the AG's. */ +STATIC void +xfs_inode_clear_inactive_tag( + struct xfs_perag *pag, + xfs_ino_t ino) +{ + radix_tree_tag_clear(&pag->pag_ici_root, + XFS_INO_TO_AGINO(pag->pag_mount, ino), + XFS_ICI_INACTIVE_TAG); + xfs_perag_clear_inactive_tag(pag); +} + static void xfs_inew_wait( struct xfs_inode *ip) @@ -298,6 +395,13 @@ xfs_iget_check_free_state( struct xfs_inode *ip, int flags) { + /* + * Unlinked inodes awaiting inactivation must not be reused until we + * have a chance to clear the on-disk metadata. + */ + if (VFS_I(ip)->i_nlink == 0 && (ip->i_flags & XFS_NEED_INACTIVE)) + return -ENOENT; + if (flags & XFS_IGET_CREATE) { /* should be a free inode */ if (VFS_I(ip)->i_mode != 0) { @@ -323,6 +427,67 @@ xfs_iget_check_free_state( return 0; } +/* + * We've torn down the VFS part of this NEED_INACTIVE inode, so we need to get + * it back into working state. 
+ */ +static int +xfs_iget_inactive( + struct xfs_perag *pag, + struct xfs_inode *ip) +{ + struct xfs_mount *mp = ip->i_mount; + struct inode *inode = VFS_I(ip); + int error; + + error = xfs_reinit_inode(mp, inode); + if (error) { + bool wake; + /* + * Re-initializing the inode failed, and we are in deep + * trouble. Try to re-add it to the inactive list. + */ + rcu_read_lock(); + spin_lock(&ip->i_flags_lock); + wake = !!__xfs_iflags_test(ip, XFS_INEW); + ip->i_flags &= ~(XFS_INEW | XFS_INACTIVATING); + if (wake) + wake_up_bit(&ip->i_flags, __XFS_INEW_BIT); + ASSERT(ip->i_flags & XFS_NEED_INACTIVE); + trace_xfs_iget_inactive_fail(ip); + spin_unlock(&ip->i_flags_lock); + rcu_read_unlock(); + return error; + } + + spin_lock(&pag->pag_ici_lock); + spin_lock(&ip->i_flags_lock); + + /* + * Clear the per-lifetime state in the inode as we are now effectively + * a new inode and need to return to the initial state before reuse + * occurs. + */ + ip->i_flags &= ~XFS_IRECLAIM_RESET_FLAGS; + ip->i_flags |= XFS_INEW; + xfs_inode_clear_inactive_tag(pag, ip->i_ino); + inode->i_state = I_NEW; + ip->i_sick = 0; + ip->i_checked = 0; + + ASSERT(!rwsem_is_locked(&inode->i_rwsem)); + init_rwsem(&inode->i_rwsem); + + spin_unlock(&ip->i_flags_lock); + spin_unlock(&pag->pag_ici_lock); + + /* + * Reattach dquots since we might have removed them when we put this + * inode on the inactivation list. + */ + return xfs_qm_dqattach(ip); +} + /* * Check the validity of the inode we just found it the cache */ @@ -357,14 +522,14 @@ xfs_iget_cache_hit( /* * If we are racing with another cache hit that is currently * instantiating this inode or currently recycling it out of - * reclaimabe state, wait for the initialisation to complete + * reclaimable state, wait for the initialisation to complete * before continuing. * * XXX(hch): eventually we should do something equivalent to * wait_on_inode to wait for these flags to be cleared * instead of polling for it. 
*/ - if (ip->i_flags & (XFS_INEW|XFS_IRECLAIM)) { + if (ip->i_flags & (XFS_INEW | XFS_IRECLAIM | XFS_INACTIVATING)) { trace_xfs_iget_skip(ip); XFS_STATS_INC(mp, xs_ig_frecycle); error = -EAGAIN; @@ -438,6 +603,32 @@ xfs_iget_cache_hit( spin_unlock(&ip->i_flags_lock); spin_unlock(&pag->pag_ici_lock); + } else if (ip->i_flags & XFS_NEED_INACTIVE) { + /* + * If NEED_INACTIVE is set, we've torn down the VFS inode and + * need to carefully get it back into useable state. + */ + trace_xfs_iget_inactive(ip); + + if (flags & XFS_IGET_INCORE) { + error = -EAGAIN; + goto out_error; + } + + /* + * We need to set XFS_INACTIVATING to prevent + * xfs_inactive_inode from stomping over us while we recycle + * the inode. We can't clear the radix tree inactive tag yet + * as it requires pag_ici_lock to be held exclusive. + */ + ip->i_flags |= XFS_INACTIVATING; + + spin_unlock(&ip->i_flags_lock); + rcu_read_unlock(); + + error = xfs_iget_inactive(pag, ip); + if (error) + return error; } else { /* If the VFS inode is being torn down, pause and try again. */ if (!igrab(inode)) { @@ -713,6 +904,43 @@ xfs_icache_inode_is_allocated( return 0; } +/* + * Grab the inode for inactivation exclusively. + * Return true if we grabbed it. + */ +static bool +xfs_inactive_grab( + struct xfs_inode *ip) +{ + ASSERT(rcu_read_lock_held()); + + /* quick check for stale RCU freed inode */ + if (!ip->i_ino) + return false; + + /* + * The radix tree lock here protects a thread in xfs_iget from racing + * with us starting reclaim on the inode. + * + * Due to RCU lookup, we may find inodes that have been freed and only + * have XFS_IRECLAIM set. Indeed, we may see reallocated inodes that + * aren't candidates for reclaim at all, so we must check the + * XFS_IRECLAIMABLE is set first before proceeding to reclaim. + * Obviously if XFS_NEED_INACTIVE isn't set then we ignore this inode. 
+ */ + spin_lock(&ip->i_flags_lock); + if (!(ip->i_flags & XFS_NEED_INACTIVE) || + (ip->i_flags & XFS_INACTIVATING)) { + /* not an inactivation candidate. */ + spin_unlock(&ip->i_flags_lock); + return false; + } + + ip->i_flags |= XFS_INACTIVATING; + spin_unlock(&ip->i_flags_lock); + return true; +} + /* * The inode lookup is done in batches to keep the amount of lock traffic and * radix tree lookups to a minimum. The batch size is a trade off between @@ -736,6 +964,9 @@ xfs_inode_walk_ag_grab( ASSERT(rcu_read_lock_held()); + if (flags & XFS_INODE_WALK_INACTIVE) + return xfs_inactive_grab(ip); + /* Check for stale RCU freed inode */ spin_lock(&ip->i_flags_lock); if (!ip->i_ino) @@ -743,7 +974,8 @@ xfs_inode_walk_ag_grab( /* avoid new or reclaimable inodes. Leave for reclaim code to flush */ if ((!newinos && __xfs_iflags_test(ip, XFS_INEW)) || - __xfs_iflags_test(ip, XFS_IRECLAIMABLE | XFS_IRECLAIM)) + __xfs_iflags_test(ip, XFS_IRECLAIMABLE | XFS_IRECLAIM | + XFS_NEED_INACTIVE | XFS_INACTIVATING)) goto out_unlock_noent; spin_unlock(&ip->i_flags_lock); @@ -848,7 +1080,8 @@ xfs_inode_walk_ag( xfs_iflags_test(batch[i], XFS_INEW)) xfs_inew_wait(batch[i]); error = execute(batch[i], args); - xfs_irele(batch[i]); + if (!(iter_flags & XFS_INODE_WALK_INACTIVE)) + xfs_irele(batch[i]); if (error == -EAGAIN) { skipped++; continue; @@ -986,6 +1219,7 @@ xfs_reclaim_inode( xfs_iflags_clear(ip, XFS_IFLUSHING); reclaim: + trace_xfs_inode_reclaiming(ip); /* * Because we use RCU freeing we need to ensure the inode always appears @@ -1705,3 +1939,189 @@ xfs_blockgc_free_quota( xfs_inode_dquot(ip, XFS_DQTYPE_GROUP), xfs_inode_dquot(ip, XFS_DQTYPE_PROJ), eof_flags); } + +/* + * Deferred Inode Inactivation + * =========================== + * + * Sometimes, inodes need to have work done on them once the last program has + * closed the file. Typically this means cleaning out any leftover post-eof or + * CoW staging blocks for linked files.
For inodes that have been totally + * unlinked, this means unmapping data/attr/cow blocks, removing the inode + * from the unlinked buckets, and marking it free in the inobt and inode table. + * + * This process can generate many metadata updates, which shows up as close() + * and unlink() calls that take a long time. We defer all that work to a + * per-AG workqueue which means that we can batch a lot of work and do it in + * inode order for better performance. Furthermore, we can control the + * workqueue, which means that we can avoid doing inactivation work at a bad + * time, such as when the fs is frozen. + * + * Deferred inactivation introduces new inode flag states (NEED_INACTIVE and + * INACTIVATING) and adds a new INACTIVE radix tree tag for fast access. We + * maintain separate perag counters for both types, and move counts as inodes + * wander the state machine, which now works as follows: + * + * If the inode needs inactivation, we: + * - Set the NEED_INACTIVE inode flag + * - Increment the per-AG inactive count + * - Set the INACTIVE tag in the per-AG inode tree + * - Set the INACTIVE tag in the per-fs AG tree + * - Schedule background inode inactivation + * + * If the inode does not need inactivation, we: + * - Set the RECLAIMABLE inode flag + * - Increment the per-AG reclaim count + * - Set the RECLAIM tag in the per-AG inode tree + * - Set the RECLAIM tag in the per-fs AG tree + * - Schedule background inode reclamation + * + * When it is time for background inode inactivation, we: + * - Set the INACTIVATING inode flag + * - Make all the on-disk updates + * - Clear both INACTIVATING and NEED_INACTIVE inode flags + * - Decrement the per-AG inactive count + * - Clear the INACTIVE tag in the per-AG inode tree + * - Clear the INACTIVE tag in the per-fs AG tree if that was the last one + * - Kick the inode into reclamation per the previous paragraph. 
+ * + * When it is time for background inode reclamation, we: + * - Set the IRECLAIM inode flag + * - Detach all the resources and remove the inode from the per-AG inode tree + * - Clear both IRECLAIM and RECLAIMABLE inode flags + * - Decrement the per-AG reclaim count + * - Clear the RECLAIM tag from the per-AG inode tree + * - Clear the RECLAIM tag from the per-fs AG tree if there are no more + * inodes waiting for reclamation or inactivation + * + * Note that xfs_inodegc_queue and xfs_inactive_grab are further up in + * the source code so that we avoid static function declarations. + */ + +/* Inactivate this inode. */ +STATIC int +xfs_inactive_inode( + struct xfs_inode *ip, + void *args) +{ + struct xfs_eofblocks *eofb = args; + struct xfs_perag *pag; + + ASSERT(ip->i_mount->m_super->s_writers.frozen < SB_FREEZE_FS); + + /* + * Not a match for our passed in scan filter? Put it back on the shelf + * and move on. + */ + spin_lock(&ip->i_flags_lock); + if (!xfs_inode_matches_eofb(ip, eofb)) { + ip->i_flags &= ~XFS_INACTIVATING; + spin_unlock(&ip->i_flags_lock); + return 0; + } + spin_unlock(&ip->i_flags_lock); + + trace_xfs_inode_inactivating(ip); + + xfs_inactive(ip); + ASSERT(XFS_FORCED_SHUTDOWN(ip->i_mount) || ip->i_delayed_blks == 0); + + /* + * Clear the inactive state flags and schedule a reclaim run once + * we're done with the inactivations. We must ensure that the inode + * smoothly transitions from inactivating to reclaimable so that iget + * cannot see either data structure midway through the transition. 
+ */ + pag = xfs_perag_get(ip->i_mount, + XFS_INO_TO_AGNO(ip->i_mount, ip->i_ino)); + spin_lock(&pag->pag_ici_lock); + spin_lock(&ip->i_flags_lock); + + ip->i_flags &= ~(XFS_NEED_INACTIVE | XFS_INACTIVATING); + xfs_inode_clear_inactive_tag(pag, ip->i_ino); + + __xfs_inode_set_reclaim_tag(pag, ip); + + spin_unlock(&ip->i_flags_lock); + spin_unlock(&pag->pag_ici_lock); + xfs_perag_put(pag); + + return 0; +} + +/* + * Walk the AGs and reclaim the inodes in them. Even if the filesystem is + * corrupted, we still need to clear the INACTIVE iflag so that we can move + * on to reclaiming the inode. + */ +static int +xfs_inodegc_free_space( + struct xfs_mount *mp, + struct xfs_eofblocks *eofb) +{ + return xfs_inode_walk(mp, XFS_INODE_WALK_INACTIVE, + xfs_inactive_inode, eofb, XFS_ICI_INACTIVE_TAG); +} + +/* Try to get inode inactivation moving. */ +void +xfs_inodegc_worker( + struct work_struct *work) +{ + struct xfs_mount *mp = container_of(to_delayed_work(work), + struct xfs_mount, m_inodegc_work); + int error; + + /* + * We want to skip inode inactivation while the filesystem is frozen + * because we don't want the inactivation thread to block while taking + * sb_intwrite. Therefore, we try to take sb_write for the duration + * of the inactive scan -- a freeze attempt will block until we're + * done here, and if the fs is past stage 1 freeze we'll bounce out + * until things unfreeze. If the fs goes down while frozen we'll + * still have log recovery to clean up after us. + */ + if (!sb_start_write_trylock(mp->m_super)) + return; + + error = xfs_inodegc_free_space(mp, NULL); + if (error && error != -EAGAIN) + xfs_err(mp, "inode inactivation failed, error %d", error); + + sb_end_write(mp->m_super); + xfs_inodegc_queue(mp); +} + +/* Force all queued inode inactivation work to run immediately. 
*/ +void +xfs_inodegc_force( + struct xfs_mount *mp) +{ + /* + * In order to reset the delay timer to run immediately, we have to + * cancel the work item and requeue it with a zero timer value. We + * don't care if the worker races with our requeue, because at worst + * we iterate the radix tree and find no inodes to inactivate. + */ + if (!cancel_delayed_work(&mp->m_inodegc_work)) + return; + + queue_delayed_work(mp->m_gc_workqueue, &mp->m_inodegc_work, 0); + flush_delayed_work(&mp->m_inodegc_work); +} + +/* Stop all queued inactivation work. */ +void +xfs_inodegc_stop( + struct xfs_mount *mp) +{ + cancel_delayed_work_sync(&mp->m_inodegc_work); +} + +/* Schedule deferred inode inactivation work. */ +void +xfs_inodegc_start( + struct xfs_mount *mp) +{ + xfs_inodegc_queue(mp); +} diff --git a/fs/xfs/xfs_icache.h b/fs/xfs/xfs_icache.h index d1fddb152420..c199b920722a 100644 --- a/fs/xfs/xfs_icache.h +++ b/fs/xfs/xfs_icache.h @@ -25,6 +25,8 @@ struct xfs_eofblocks { #define XFS_ICI_RECLAIM_TAG 0 /* inode is to be reclaimed */ /* Inode has speculative preallocations (posteof or cow) to clean. */ #define XFS_ICI_BLOCKGC_TAG 1 +/* Inode can be inactivated. 
*/ +#define XFS_ICI_INACTIVE_TAG 2 /* * Flags for xfs_iget() @@ -38,6 +40,7 @@ struct xfs_eofblocks { * flags for AG inode iterator */ #define XFS_INODE_WALK_INEW_WAIT 0x1 /* wait on new inodes */ +#define XFS_INODE_WALK_INACTIVE 0x2 /* inactivation loop */ int xfs_iget(struct xfs_mount *mp, struct xfs_trans *tp, xfs_ino_t ino, uint flags, uint lock_flags, xfs_inode_t **ipp); @@ -53,6 +56,7 @@ int xfs_reclaim_inodes_count(struct xfs_mount *mp); long xfs_reclaim_inodes_nr(struct xfs_mount *mp, int nr_to_scan); void xfs_inode_set_reclaim_tag(struct xfs_inode *ip); +void xfs_inode_set_inactive_tag(struct xfs_inode *ip); int xfs_blockgc_free_dquots(struct xfs_mount *mp, struct xfs_dquot *udqp, struct xfs_dquot *gdqp, struct xfs_dquot *pdqp, @@ -78,4 +82,9 @@ int xfs_icache_inode_is_allocated(struct xfs_mount *mp, struct xfs_trans *tp, void xfs_blockgc_stop(struct xfs_mount *mp); void xfs_blockgc_start(struct xfs_mount *mp); +void xfs_inodegc_worker(struct work_struct *work); +void xfs_inodegc_force(struct xfs_mount *mp); +void xfs_inodegc_stop(struct xfs_mount *mp); +void xfs_inodegc_start(struct xfs_mount *mp); + #endif diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c index 65897cb0cf2a..f20694f220c8 100644 --- a/fs/xfs/xfs_inode.c +++ b/fs/xfs/xfs_inode.c @@ -1665,6 +1665,35 @@ xfs_inactive_ifree( return 0; } +/* Prepare inode for inactivation. */ +void +xfs_inode_inactivation_prep( + struct xfs_inode *ip) +{ + if (XFS_FORCED_SHUTDOWN(ip->i_mount)) + return; + + /* + * If this inode is unlinked (and now unreferenced) we need to dispose + * of it in the on disk metadata. + * + * Change the generation so that the inode can't be opened by handle + * now that the last external reference has dropped. Bulkstat won't + * return inodes with zero nlink so nobody will ever find this inode + * again. Then add this inode & blocks to the counts of things that + * will be freed during the next inactivation run.
+ */ + if (VFS_I(ip)->i_nlink == 0) + VFS_I(ip)->i_generation = prandom_u32(); + + /* + * Detach dquots just in case someone tries a quotaoff while the inode + * is waiting on the inactive list. We'll reattach them (if needed) + * when inactivating the inode. + */ + xfs_qm_dqdetach(ip); +} + /* * Returns true if we need to update the on-disk metadata before we can free * the memory used by this inode. Updates include freeing post-eof @@ -1738,7 +1767,7 @@ xfs_inode_needs_inactivation( */ void xfs_inactive( - xfs_inode_t *ip) + struct xfs_inode *ip) { struct xfs_mount *mp; int error; @@ -1764,6 +1793,16 @@ xfs_inactive( if (xfs_is_metadata_inode(ip)) return; + /* + * Re-attach dquots prior to freeing EOF blocks or CoW staging extents. + * We dropped the dquot prior to inactivation (because quotaoff can't + * resurrect inactive inodes to force-drop the dquot) so we /must/ + * do this before touching any block mappings. + */ + error = xfs_qm_dqattach(ip); + if (error) + return; + /* Try to clean out the cow blocks if there are any. 
*/ if (xfs_inode_has_cow_data(ip)) xfs_reflink_cancel_cow_range(ip, 0, NULLFILEOFF, true); @@ -1789,10 +1828,6 @@ xfs_inactive( ip->i_df.if_nextents > 0 || ip->i_delayed_blks > 0)) truncate = 1; - error = xfs_qm_dqattach(ip); - if (error) - return; - if (S_ISLNK(VFS_I(ip)->i_mode)) error = xfs_inactive_symlink(ip); else if (truncate) diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h index 3fe8c8afbc72..7aaff07d1210 100644 --- a/fs/xfs/xfs_inode.h +++ b/fs/xfs/xfs_inode.h @@ -222,6 +222,7 @@ static inline bool xfs_inode_has_bigtime(struct xfs_inode *ip) #define XFS_IRECLAIMABLE (1 << 2) /* inode can be reclaimed */ #define __XFS_INEW_BIT 3 /* inode has just been allocated */ #define XFS_INEW (1 << __XFS_INEW_BIT) +#define XFS_NEED_INACTIVE (1 << 4) /* see XFS_INACTIVATING below */ #define XFS_ITRUNCATED (1 << 5) /* truncated down so flush-on-close */ #define XFS_IDIRTY_RELEASE (1 << 6) /* dirty release already seen */ #define XFS_IFLUSHING (1 << 7) /* inode is being flushed */ @@ -236,6 +237,15 @@ static inline bool xfs_inode_has_bigtime(struct xfs_inode *ip) #define XFS_IRECOVERY (1 << 11) #define XFS_ICOWBLOCKS (1 << 12)/* has the cowblocks tag set */ +/* + * If we need to update on-disk metadata before this IRECLAIMABLE inode can be + * freed, then NEED_INACTIVE will be set. Once we start the updates, the + * INACTIVATING bit will be set to keep iget away from this inode. After the + * inactivation completes, both flags will be cleared and the inode is a + * plain old IRECLAIMABLE inode. + */ +#define XFS_INACTIVATING (1 << 13) + /* * Per-lifetime flags need to be reset when re-using a reclaimable inode during * inode lookup. 
This prevents unintended behaviour on the new inode from @@ -243,7 +253,8 @@ static inline bool xfs_inode_has_bigtime(struct xfs_inode *ip) */ #define XFS_IRECLAIM_RESET_FLAGS \ (XFS_IRECLAIMABLE | XFS_IRECLAIM | \ - XFS_IDIRTY_RELEASE | XFS_ITRUNCATED) + XFS_IDIRTY_RELEASE | XFS_ITRUNCATED | XFS_NEED_INACTIVE | \ + XFS_INACTIVATING) /* * Flags for inode locking. @@ -481,6 +492,7 @@ extern struct kmem_zone *xfs_inode_zone; #define XFS_DEFAULT_COWEXTSZ_HINT 32 bool xfs_inode_needs_inactivation(struct xfs_inode *ip); +void xfs_inode_inactivation_prep(struct xfs_inode *ip); int xfs_iunlink_init(struct xfs_perag *pag); void xfs_iunlink_destroy(struct xfs_perag *pag); diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c index 97f31308de03..b03b127e34cc 100644 --- a/fs/xfs/xfs_log_recover.c +++ b/fs/xfs/xfs_log_recover.c @@ -2792,6 +2792,13 @@ xlog_recover_process_iunlinks( } xfs_buf_rele(agibp); } + + /* + * Now that we've put all the iunlink inodes on the lru, let's make + * sure that we perform all the on-disk metadata updates to actually + * free those inodes. + */ + xfs_inodegc_force(mp); } STATIC void diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c index 1c97b155a8ee..cd015e3d72fc 100644 --- a/fs/xfs/xfs_mount.c +++ b/fs/xfs/xfs_mount.c @@ -640,6 +640,10 @@ xfs_check_summary_counts( * so we need to unpin them, write them back and/or reclaim them before unmount * can proceed. * + * Start the process by pushing all inodes through the inactivation process + * so that all file updates to on-disk metadata can be flushed with the log. + * After the AIL push, all inodes should be ready for reclamation. + * * An inode cluster that has been freed can have its buffer still pinned in * memory because the transaction is still sitting in a iclog. 
The stale inodes * on that buffer will be pinned to the buffer until the transaction hits the @@ -663,6 +667,7 @@ static void xfs_unmount_flush_inodes( struct xfs_mount *mp) { + xfs_inodegc_force(mp); xfs_log_force(mp, XFS_LOG_SYNC); xfs_extent_busy_wait_all(mp); flush_workqueue(xfs_discard_wq); @@ -670,6 +675,7 @@ xfs_unmount_flush_inodes( mp->m_flags |= XFS_MOUNT_UNMOUNTING; xfs_ail_push_all_sync(mp->m_ail); + xfs_inodegc_stop(mp); cancel_delayed_work_sync(&mp->m_reclaim_work); xfs_reclaim_inodes(mp); xfs_health_unmount(mp); @@ -1095,6 +1101,13 @@ xfs_unmountfs( uint64_t resblks; int error; + /* + * Perform all on-disk metadata updates required to inactivate inodes. + * Since this can involve finobt updates, do it now before we lose the + * per-AG space reservations. + */ + xfs_inodegc_force(mp); + xfs_blockgc_stop(mp); xfs_fs_unreserve_ag_blocks(mp); xfs_qm_unmount_quotas(mp); diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h index 81829d19596e..ce00ad47b8ea 100644 --- a/fs/xfs/xfs_mount.h +++ b/fs/xfs/xfs_mount.h @@ -177,6 +177,7 @@ typedef struct xfs_mount { uint64_t m_resblks_avail;/* available reserved blocks */ uint64_t m_resblks_save; /* reserved blks @ remount,ro */ struct delayed_work m_reclaim_work; /* background inode reclaim */ + struct delayed_work m_inodegc_work; /* background inode inactive */ struct xfs_kobj m_kobj; struct xfs_kobj m_error_kobj; struct xfs_kobj m_error_meta_kobj; @@ -349,7 +350,8 @@ typedef struct xfs_perag { spinlock_t pag_ici_lock; /* incore inode cache lock */ struct radix_tree_root pag_ici_root; /* incore inode cache root */ - int pag_ici_reclaimable; /* reclaimable inodes */ + unsigned int pag_ici_reclaimable; /* reclaimable inodes */ + unsigned int pag_ici_inactive; /* inactive inodes */ unsigned long pag_ici_reclaim_cursor; /* reclaim restart point */ /* buffer cache index */ diff --git a/fs/xfs/xfs_qm_syscalls.c b/fs/xfs/xfs_qm_syscalls.c index ca1b57d291dc..0f9a1450fe0e 100644 --- a/fs/xfs/xfs_qm_syscalls.c +++ 
b/fs/xfs/xfs_qm_syscalls.c @@ -104,6 +104,12 @@ xfs_qm_scall_quotaoff( uint inactivate_flags; struct xfs_qoff_logitem *qoffstart = NULL; + /* + * Clean up the inactive list before we turn quota off, to reduce the + * amount of quotaoff work we have to do with the mutex held. + */ + xfs_inodegc_force(mp); + /* * No file system can have quotas enabled on disk but not in core. * Note that quota utilities (like quotaoff) _expect_ @@ -697,6 +703,13 @@ xfs_qm_scall_getquota( struct xfs_dquot *dqp; int error; + /* + * Process all the queued file and speculative preallocation cleanup so + * that the counter values we report here do not incorporate any + * resources that were previously deleted. + */ + xfs_inodegc_force(mp); + /* * Try to get the dquot. We don't want it allocated on disk, so don't * set doalloc. If it doesn't exist, we'll get ENOENT back. @@ -735,6 +748,13 @@ xfs_qm_scall_getquota_next( struct xfs_dquot *dqp; int error; + /* + * Process all the queued file and speculative preallocation cleanup so + * that the counter values we report here do not incorporate any + * resources that were previously deleted. 
+ */ + xfs_inodegc_force(mp); + error = xfs_qm_dqget_next(mp, *id, type, &dqp); if (error) return error; diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c index e774358383d6..8d0142487fc7 100644 --- a/fs/xfs/xfs_super.c +++ b/fs/xfs/xfs_super.c @@ -637,28 +637,34 @@ xfs_fs_destroy_inode( struct inode *inode) { struct xfs_inode *ip = XFS_I(inode); + struct xfs_mount *mp = ip->i_mount; + bool need_inactive; trace_xfs_destroy_inode(ip); ASSERT(!rwsem_is_locked(&inode->i_rwsem)); - XFS_STATS_INC(ip->i_mount, vn_rele); - XFS_STATS_INC(ip->i_mount, vn_remove); + XFS_STATS_INC(mp, vn_rele); + XFS_STATS_INC(mp, vn_remove); - xfs_inactive(ip); - - if (!XFS_FORCED_SHUTDOWN(ip->i_mount) && ip->i_delayed_blks) { + need_inactive = xfs_inode_needs_inactivation(ip); + if (need_inactive) { + trace_xfs_inode_set_need_inactive(ip); + xfs_inode_inactivation_prep(ip); + } else if (!XFS_FORCED_SHUTDOWN(ip->i_mount) && ip->i_delayed_blks) { xfs_check_delalloc(ip, XFS_DATA_FORK); xfs_check_delalloc(ip, XFS_COW_FORK); ASSERT(0); } - - XFS_STATS_INC(ip->i_mount, vn_reclaim); + XFS_STATS_INC(mp, vn_reclaim); + trace_xfs_inode_set_reclaimable(ip); /* * We should never get here with one of the reclaim flags already set. */ ASSERT_ALWAYS(!xfs_iflags_test(ip, XFS_IRECLAIMABLE)); ASSERT_ALWAYS(!xfs_iflags_test(ip, XFS_IRECLAIM)); + ASSERT_ALWAYS(!xfs_iflags_test(ip, XFS_NEED_INACTIVE)); + ASSERT_ALWAYS(!xfs_iflags_test(ip, XFS_INACTIVATING)); /* * We always use background reclaim here because even if the inode is @@ -667,7 +673,10 @@ xfs_fs_destroy_inode( * reclaim path handles this more efficiently than we can here, so * simply let background reclaim tear down all inodes. 
*/ - xfs_inode_set_reclaim_tag(ip); + if (need_inactive) + xfs_inode_set_inactive_tag(ip); + else + xfs_inode_set_reclaim_tag(ip); } static void @@ -797,6 +806,13 @@ xfs_fs_statfs( xfs_extlen_t lsize; int64_t ffree; + /* + * Process all the queued file and speculative preallocation cleanup so + * that the counter values we report here do not incorporate any + * resources that were previously deleted. + */ + xfs_inodegc_force(mp); + statp->f_type = XFS_SUPER_MAGIC; statp->f_namelen = MAXNAMELEN - 1; @@ -911,6 +927,18 @@ xfs_fs_unfreeze( return 0; } +/* + * Before we get to stage 1 of a freeze, force all the inactivation work so + * that there's less work to do if we crash during the freeze. + */ +STATIC int +xfs_fs_freeze_super( + struct super_block *sb) +{ + xfs_inodegc_force(XFS_M(sb)); + return freeze_super(sb); +} + /* * This function fills in xfs_mount_t fields based on mount args. * Note: the superblock _has_ now been read in. @@ -1089,6 +1117,7 @@ static const struct super_operations xfs_super_operations = { .show_options = xfs_fs_show_options, .nr_cached_objects = xfs_fs_nr_cached_objects, .free_cached_objects = xfs_fs_free_cached_objects, + .freeze_super = xfs_fs_freeze_super, }; static int @@ -1720,6 +1749,13 @@ xfs_remount_ro( return error; } + /* + * Perform all on-disk metadata updates required to inactivate inodes. + * Since this can involve finobt updates, do it now before we lose the + * per-AG space reservations. + */ + xfs_inodegc_force(mp); + /* Free the per-AG metadata reservation pool. 
*/ error = xfs_fs_unreserve_ag_blocks(mp); if (error) { @@ -1843,6 +1879,7 @@ static int xfs_init_fs_context( mutex_init(&mp->m_growlock); INIT_WORK(&mp->m_flush_inodes_work, xfs_flush_inodes_worker); INIT_DELAYED_WORK(&mp->m_reclaim_work, xfs_reclaim_worker); + INIT_DELAYED_WORK(&mp->m_inodegc_work, xfs_inodegc_worker); mp->m_kobj.kobject.kset = xfs_kset; /* * We don't create the finobt per-ag space reservation until after log diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h index e74bbb648f83..9193cfbb02ef 100644 --- a/fs/xfs/xfs_trace.h +++ b/fs/xfs/xfs_trace.h @@ -157,6 +157,8 @@ DEFINE_PERAG_REF_EVENT(xfs_perag_set_reclaim); DEFINE_PERAG_REF_EVENT(xfs_perag_clear_reclaim); DEFINE_PERAG_REF_EVENT(xfs_perag_set_blockgc); DEFINE_PERAG_REF_EVENT(xfs_perag_clear_blockgc); +DEFINE_PERAG_REF_EVENT(xfs_perag_set_inactive); +DEFINE_PERAG_REF_EVENT(xfs_perag_clear_inactive); DECLARE_EVENT_CLASS(xfs_ag_class, TP_PROTO(struct xfs_mount *mp, xfs_agnumber_t agno), @@ -617,14 +619,17 @@ DECLARE_EVENT_CLASS(xfs_inode_class, TP_STRUCT__entry( __field(dev_t, dev) __field(xfs_ino_t, ino) + __field(unsigned long, iflags) ), TP_fast_assign( __entry->dev = VFS_I(ip)->i_sb->s_dev; __entry->ino = ip->i_ino; + __entry->iflags = ip->i_flags; ), - TP_printk("dev %d:%d ino 0x%llx", + TP_printk("dev %d:%d ino 0x%llx iflags 0x%lx", MAJOR(__entry->dev), MINOR(__entry->dev), - __entry->ino) + __entry->ino, + __entry->iflags) ) #define DEFINE_INODE_EVENT(name) \ @@ -634,6 +639,8 @@ DEFINE_EVENT(xfs_inode_class, name, \ DEFINE_INODE_EVENT(xfs_iget_skip); DEFINE_INODE_EVENT(xfs_iget_reclaim); DEFINE_INODE_EVENT(xfs_iget_reclaim_fail); +DEFINE_INODE_EVENT(xfs_iget_inactive); +DEFINE_INODE_EVENT(xfs_iget_inactive_fail); DEFINE_INODE_EVENT(xfs_iget_hit); DEFINE_INODE_EVENT(xfs_iget_miss); @@ -668,6 +675,10 @@ DEFINE_INODE_EVENT(xfs_inode_free_eofblocks_invalid); DEFINE_INODE_EVENT(xfs_inode_set_cowblocks_tag); DEFINE_INODE_EVENT(xfs_inode_clear_cowblocks_tag); 
DEFINE_INODE_EVENT(xfs_inode_free_cowblocks_invalid); +DEFINE_INODE_EVENT(xfs_inode_set_reclaimable); +DEFINE_INODE_EVENT(xfs_inode_reclaiming); +DEFINE_INODE_EVENT(xfs_inode_set_need_inactive); +DEFINE_INODE_EVENT(xfs_inode_inactivating); /* * ftrace's __print_symbolic requires that all enum values be wrapped in the From patchwork Thu Mar 11 03:06:19 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Darrick J. Wong" X-Patchwork-Id: 12130209 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-16.2 required=3.0 tests=BAYES_00,DKIMWL_WL_HIGH, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,INCLUDES_CR_TRAILER,INCLUDES_PATCH, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 4763BC43332 for ; Thu, 11 Mar 2021 03:07:09 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 1FB2E64FCE for ; Thu, 11 Mar 2021 03:07:09 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S230081AbhCKDGg (ORCPT ); Wed, 10 Mar 2021 22:06:36 -0500 Received: from mail.kernel.org ([198.145.29.99]:45836 "EHLO mail.kernel.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229899AbhCKDGU (ORCPT ); Wed, 10 Mar 2021 22:06:20 -0500 Received: by mail.kernel.org (Postfix) with ESMTPSA id AD9B464FC4; Thu, 11 Mar 2021 03:06:19 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1615431979; bh=i4dmjEdoS7xy4Ki0UaA2HJNEhsRGN/N/3/8XRDW2frY=; h=Subject:From:To:Cc:Date:In-Reply-To:References:From; b=i8qYy1oA/jvAto2S22YG/yfSHQz+vf1qfhDyhAKUNmd8o4lf6wXhMw3Qz83LULNoF crna4rdkSyWg/2vS4rSu/qHfAhhJ/6Q6eChw96oE3lAqC/xgoUJ53pu35vuLWrSNKM 
p0Xi0oh71eGLAGJwdnHKbORM38si1dp/vEZbngXhkbMr3DkgxXo/3DrsOEtGXRLF0z xN2XD2HcoHNE1Cz28qNI6WZYb5y4eWQIM3jt00lEi3WpN12RTxCU5FnirbSMlpJrIY XU8Jl2+iOMct0etaVoIpTfTfhEfxC3Q1leuPxwv09JgMqaOYWpZ0rskTnizRfujifg 3CwLKfnP5e2kQ== Subject: [PATCH 07/11] xfs: expose sysfs knob to control inode inactivation delay From: "Darrick J. Wong" To: djwong@kernel.org Cc: linux-xfs@vger.kernel.org Date: Wed, 10 Mar 2021 19:06:19 -0800 Message-ID: <161543197945.1947934.7946093923141335693.stgit@magnolia> In-Reply-To: <161543194009.1947934.9910987247994410125.stgit@magnolia> References: <161543194009.1947934.9910987247994410125.stgit@magnolia> User-Agent: StGit/0.19 MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: linux-xfs@vger.kernel.org From: Darrick J. Wong Allow administrators to control how long we defer inode inactivation. By default we'll set the delay to 2 seconds (200 centiseconds), as an arbitrary choice between allowing for some batching of a deltree operation, and not letting too many inodes pile up in memory. Signed-off-by: Darrick J. Wong --- Documentation/admin-guide/xfs.rst | 9 +++++++++ fs/xfs/xfs_globals.c | 3 +++ fs/xfs/xfs_icache.c | 2 +- fs/xfs/xfs_linux.h | 1 + fs/xfs/xfs_sysctl.c | 9 +++++++++ fs/xfs/xfs_sysctl.h | 1 + 6 files changed, 24 insertions(+), 1 deletion(-) diff --git a/Documentation/admin-guide/xfs.rst b/Documentation/admin-guide/xfs.rst index f9b109bfc6a6..608d0ba7a86e 100644 --- a/Documentation/admin-guide/xfs.rst +++ b/Documentation/admin-guide/xfs.rst @@ -277,6 +277,15 @@ The following sysctls are available for the XFS filesystem: references and returns timed-out AGs back to the free stream pool. + fs.xfs.inode_gc_delay + (Units: centiseconds Min: 1 Default: 200 Max: 360000) + The amount of time to delay garbage collection of inodes that + have been closed or have been unlinked from the directory tree. + Garbage collection here means clearing speculative preallocations + from linked files and freeing unlinked inodes.
A higher value + here enables more batching at a cost of delayed reclamation of + incore inodes. + fs.xfs.speculative_prealloc_lifetime (Units: seconds Min: 1 Default: 300 Max: 86400) The interval at which the background scanning for inodes diff --git a/fs/xfs/xfs_globals.c b/fs/xfs/xfs_globals.c index f62fa652c2fd..2945c2c54cf0 100644 --- a/fs/xfs/xfs_globals.c +++ b/fs/xfs/xfs_globals.c @@ -28,6 +28,9 @@ xfs_param_t xfs_params = { .rotorstep = { 1, 1, 255 }, .inherit_nodfrg = { 0, 1, 1 }, .fstrm_timer = { 1, 30*100, 3600*100}, + .inodegc_timer = { 1, 2*100, 3600*100}, + + /* Values below here are measured in seconds */ .blockgc_timer = { 1, 300, 3600*24}, }; diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c index 1b7652af5ee5..6081bba3c6ce 100644 --- a/fs/xfs/xfs_icache.c +++ b/fs/xfs/xfs_icache.c @@ -250,7 +250,7 @@ xfs_inodegc_queue( rcu_read_lock(); if (radix_tree_tagged(&mp->m_perag_tree, XFS_ICI_INACTIVE_TAG)) queue_delayed_work(mp->m_gc_workqueue, &mp->m_inodegc_work, - 2 * HZ); + msecs_to_jiffies(xfs_inodegc_centisecs * 10)); rcu_read_unlock(); } diff --git a/fs/xfs/xfs_linux.h b/fs/xfs/xfs_linux.h index af6be9b9ccdf..b4c5a2c71f43 100644 --- a/fs/xfs/xfs_linux.h +++ b/fs/xfs/xfs_linux.h @@ -99,6 +99,7 @@ typedef __u32 xfs_nlink_t; #define xfs_inherit_nodefrag xfs_params.inherit_nodfrg.val #define xfs_fstrm_centisecs xfs_params.fstrm_timer.val #define xfs_blockgc_secs xfs_params.blockgc_timer.val +#define xfs_inodegc_centisecs xfs_params.inodegc_timer.val #define current_cpu() (raw_smp_processor_id()) #define current_set_flags_nested(sp, f) \ diff --git a/fs/xfs/xfs_sysctl.c b/fs/xfs/xfs_sysctl.c index 546a6cd96729..878f31d3a587 100644 --- a/fs/xfs/xfs_sysctl.c +++ b/fs/xfs/xfs_sysctl.c @@ -176,6 +176,15 @@ static struct ctl_table xfs_table[] = { .extra1 = &xfs_params.fstrm_timer.min, .extra2 = &xfs_params.fstrm_timer.max, }, + { + .procname = "inode_gc_delay", + .data = &xfs_params.inodegc_timer.val, + .maxlen = sizeof(int), + .mode = 0644, + 
.proc_handler = proc_dointvec_minmax, + .extra1 = &xfs_params.inodegc_timer.min, + .extra2 = &xfs_params.inodegc_timer.max + }, { .procname = "speculative_prealloc_lifetime", .data = &xfs_params.blockgc_timer.val, diff --git a/fs/xfs/xfs_sysctl.h b/fs/xfs/xfs_sysctl.h index 7692e76ead33..a045c33c3d30 100644 --- a/fs/xfs/xfs_sysctl.h +++ b/fs/xfs/xfs_sysctl.h @@ -36,6 +36,7 @@ typedef struct xfs_param { xfs_sysctl_val_t inherit_nodfrg;/* Inherit the "nodefrag" inode flag. */ xfs_sysctl_val_t fstrm_timer; /* Filestream dir-AG assoc'n timeout. */ xfs_sysctl_val_t blockgc_timer; /* Interval between blockgc scans */ + xfs_sysctl_val_t inodegc_timer; /* Inode inactivation scan interval */ } xfs_param_t; /* From patchwork Thu Mar 11 03:06:25 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Darrick J. Wong" X-Patchwork-Id: 12130207 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-16.2 required=3.0 tests=BAYES_00,DKIMWL_WL_HIGH, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,INCLUDES_CR_TRAILER,INCLUDES_PATCH, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id DC4BDC4332E for ; Thu, 11 Mar 2021 03:07:08 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 947FD64FDF for ; Thu, 11 Mar 2021 03:07:08 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S230095AbhCKDGg (ORCPT ); Wed, 10 Mar 2021 22:06:36 -0500 Received: from mail.kernel.org ([198.145.29.99]:45852 "EHLO mail.kernel.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229900AbhCKDGZ (ORCPT ); Wed, 10 Mar 2021 22:06:25 -0500 Received: by mail.kernel.org (Postfix) with ESMTPSA id 59B4164FC4; 
Thu, 11 Mar 2021 03:06:25 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1615431985; bh=80TvvDWVCGz28cbyAa5zmk1OIJVYp8lU/u1j1GPkfK8=; h=Subject:From:To:Cc:Date:In-Reply-To:References:From; b=r8aU1+IRBBH9EYM4pdJFFD/HxQryuzAw4FqfHWEWwQdBSGsZcyw03jCag+1PxsmeY zk1vtwVoQTV5fbxArzMjW2Do5KqHZdcNjlRa/5n8UqOfz8DiVYKcZCuT1lpSDuKfgO 9K29BKXfARPkYnG9Er5CnWh6NsX1HCQX+xmXr2lW/j95rZOA3ll7n4zQweFcCaFpui cusU5DcEMq6EgwtIqLJCiS48HQXtJU/KJFIF3Q0qNyZH8C1r1rIfbCUkJ6y45JriLw 2LpOGDfaHD/6s9xgwv4D8kaqHWyiI+4ZO24fwUPJ3f/dodei2bBj22F52Z9cLOSFuZ aZ9vJDKdST7hg== Subject: [PATCH 08/11] xfs: force inode inactivation and retry fs writes when there isn't space From: "Darrick J. Wong" To: djwong@kernel.org Cc: linux-xfs@vger.kernel.org Date: Wed, 10 Mar 2021 19:06:25 -0800 Message-ID: <161543198495.1947934.14544893595452477454.stgit@magnolia> In-Reply-To: <161543194009.1947934.9910987247994410125.stgit@magnolia> References: <161543194009.1947934.9910987247994410125.stgit@magnolia> User-Agent: StGit/0.19 MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: linux-xfs@vger.kernel.org From: Darrick J. Wong Any time we try to modify a file's contents and it fails due to ENOSPC or EDQUOT, force inode inactivation work to try to free space. We're going to use the xfs_inodegc_free_space function externally in the next patch, so add it to xfs_icache.h now to reduce churn. Signed-off-by: Darrick J. 
Wong --- fs/xfs/xfs_icache.c | 10 ++++++++-- fs/xfs/xfs_icache.h | 1 + 2 files changed, 9 insertions(+), 2 deletions(-) diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c index 6081bba3c6ce..594d340bbe37 100644 --- a/fs/xfs/xfs_icache.c +++ b/fs/xfs/xfs_icache.c @@ -1868,10 +1868,16 @@ xfs_blockgc_free_space( struct xfs_mount *mp, struct xfs_eofblocks *eofb) { + int error; + trace_xfs_blockgc_free_space(mp, eofb, _RET_IP_); - return xfs_inode_walk(mp, 0, xfs_blockgc_scan_inode, eofb, + error = xfs_inode_walk(mp, 0, xfs_blockgc_scan_inode, eofb, XFS_ICI_BLOCKGC_TAG); + if (error) + return error; + + return xfs_inodegc_free_space(mp, eofb); } /* @@ -2054,7 +2060,7 @@ xfs_inactive_inode( * corrupted, we still need to clear the INACTIVE iflag so that we can move * on to reclaiming the inode. */ -static int +int xfs_inodegc_free_space( struct xfs_mount *mp, struct xfs_eofblocks *eofb) diff --git a/fs/xfs/xfs_icache.h b/fs/xfs/xfs_icache.h index c199b920722a..9d5a1f4c0369 100644 --- a/fs/xfs/xfs_icache.h +++ b/fs/xfs/xfs_icache.h @@ -86,5 +86,6 @@ void xfs_inodegc_worker(struct work_struct *work); void xfs_inodegc_force(struct xfs_mount *mp); void xfs_inodegc_stop(struct xfs_mount *mp); void xfs_inodegc_start(struct xfs_mount *mp); +int xfs_inodegc_free_space(struct xfs_mount *mp, struct xfs_eofblocks *eofb); #endif From patchwork Thu Mar 11 03:06:30 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Darrick J. 
Wong" X-Patchwork-Id: 12130205 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-16.2 required=3.0 tests=BAYES_00,DKIMWL_WL_HIGH, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,INCLUDES_CR_TRAILER,INCLUDES_PATCH, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 80D7DC433E6 for ; Thu, 11 Mar 2021 03:07:08 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 3861F64FD6 for ; Thu, 11 Mar 2021 03:07:08 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229900AbhCKDGg (ORCPT ); Wed, 10 Mar 2021 22:06:36 -0500 Received: from mail.kernel.org ([198.145.29.99]:45888 "EHLO mail.kernel.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229928AbhCKDGb (ORCPT ); Wed, 10 Mar 2021 22:06:31 -0500 Received: by mail.kernel.org (Postfix) with ESMTPSA id 07C3264FC4; Thu, 11 Mar 2021 03:06:30 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1615431991; bh=1VrstQJy5J9ECetOJMSTJJz1514eIi30YkNmcMekSaQ=; h=Subject:From:To:Cc:Date:In-Reply-To:References:From; b=WLnZovycUGMw/dPorb3Nv/lJ2JNlC9IgHIa3vw/CK/HEGjNX+i5inWtT2GI9sTZJY vGQFfmY0MnHPwqvCb4AxVV0HMYD7d0216xKiXjSXoG2yG71jhZs17iiwcExrmPxDBG rzDCpDGj3hmMxQzy7/JYU0qopow93CXAyxWhgTDkoZDbS/eoYuK0pNoIn9ozyMuFu3 quZV4NF9YvTkvLUz6EFXWB11uIp+mY5ExSrKeiA/H7a1rNTf83626LDDqvnvUzKnDl HATNKwm8L+9jfqaeGPdipUNpj8VdAglGjvu2DoFDyQdp4QC6GfDyA9TYgpTl9EtjVX fREwfM6/N+6FQ== Subject: [PATCH 09/11] xfs: force inode garbage collection before fallocate when space is low From: "Darrick J. 
Wong" To: djwong@kernel.org Cc: linux-xfs@vger.kernel.org Date: Wed, 10 Mar 2021 19:06:30 -0800 Message-ID: <161543199062.1947934.17280004993407696065.stgit@magnolia> In-Reply-To: <161543194009.1947934.9910987247994410125.stgit@magnolia> References: <161543194009.1947934.9910987247994410125.stgit@magnolia> User-Agent: StGit/0.19 MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: linux-xfs@vger.kernel.org From: Darrick J. Wong Generally speaking, when a user calls fallocate, they're looking to preallocate space in a file in the largest contiguous chunks possible. If free space is low, it's possible that the free space will look unnecessarily fragmented because there are unlinked inodes that are holding on to space that we could allocate. When this happens, fallocate makes suboptimal allocation decisions for the sake of deleted files, which doesn't make much sense, so scan the filesystem for dead items to delete to try to avoid this. Note that there are a handful of fstests that fill a filesystem, delete just enough files to allow a single large allocation, and check that fallocate actually gets the allocation. These tests regress because the test runs fallocate before the inode gc has a chance to run, so add this behavior to maintain as much of the old behavior as possible. Signed-off-by: Darrick J. 
Wong --- fs/xfs/xfs_bmap_util.c | 44 ++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 44 insertions(+) diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c index 21aa38183ae9..6d2fece45bdc 100644 --- a/fs/xfs/xfs_bmap_util.c +++ b/fs/xfs/xfs_bmap_util.c @@ -28,6 +28,7 @@ #include "xfs_icache.h" #include "xfs_iomap.h" #include "xfs_reflink.h" +#include "xfs_sb.h" /* Kernel only BMAP related definitions and functions */ @@ -733,6 +734,44 @@ xfs_free_eofblocks( return error; } +/* + * If we suspect that the target device is full enough that it won't be able + * to satisfy the entire request, try a non-sync inode inactivation scan to + * free up space. While it's perfectly fine to fill a preallocation request + * with a bunch of short extents, we'd prefer to do the inactivation work now + * to combat long term fragmentation in new file data. This is purely for + * optimization, so we don't take any blocking locks and we only look for space + * that is already on the reclaim list (i.e. we don't zap speculative + * preallocations). + */ +static int +xfs_alloc_reclaim_inactive_space( + struct xfs_mount *mp, + bool is_rt, + xfs_filblks_t allocatesize_fsb) +{ + struct xfs_perag *pag; + struct xfs_sb *sbp = &mp->m_sb; + xfs_extlen_t free; + xfs_agnumber_t agno; + + if (is_rt) { + if (sbp->sb_frextents * sbp->sb_rextsize >= allocatesize_fsb) + return 0; + } else { + for (agno = 0; agno < mp->m_sb.sb_agcount; agno++) { + pag = xfs_perag_get(mp, agno); + free = pag->pagf_freeblks; + xfs_perag_put(pag); + + if (free >= allocatesize_fsb) + return 0; + } + } + + return xfs_inodegc_free_space(mp, NULL); +} + int xfs_alloc_file_space( struct xfs_inode *ip, @@ -817,6 +856,11 @@ xfs_alloc_file_space( rblocks = 0; } + error = xfs_alloc_reclaim_inactive_space(mp, rt, + allocatesize_fsb); + if (error) + break; + /* * Allocate and setup the transaction.
*/ From patchwork Thu Mar 11 03:06:36 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Darrick J. Wong" X-Patchwork-Id: 12130213 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-16.2 required=3.0 tests=BAYES_00,DKIMWL_WL_HIGH, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,INCLUDES_CR_TRAILER,INCLUDES_PATCH, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 9CEA6C433E6 for ; Thu, 11 Mar 2021 03:07:39 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 779A064DAD for ; Thu, 11 Mar 2021 03:07:39 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S230212AbhCKDHH (ORCPT ); Wed, 10 Mar 2021 22:07:07 -0500 Received: from mail.kernel.org ([198.145.29.99]:45944 "EHLO mail.kernel.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S230116AbhCKDGh (ORCPT ); Wed, 10 Mar 2021 22:06:37 -0500 Received: by mail.kernel.org (Postfix) with ESMTPSA id A0E1064FC4; Thu, 11 Mar 2021 03:06:36 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1615431996; bh=BotaVq+EBr871yYHRbgnBTb2osg7CS9jK6S8330/ZzU=; h=Subject:From:To:Cc:Date:In-Reply-To:References:From; b=PJq0kr9rbU5F75MBbZoqyz+kkrbDw32Is+cbrMFFQ+XZo0R43KI4TqkxT099GMyhZ mJN0oNcM8YI9MGINTNbANpYvD2/XDOaSxl0k9Dbgr7TX2Wc3IsH8bevJQq5QfPjr+f JVe37hxuS0YTt8k6gg/vxan9m4LB7QqbuwlHP63QxS5ZPK1AGlqwRtTJgV2ewao/aT RyD2+oWUm72oHm5Hm37qjcYpE962ffzjF1ddDZoBptOYLD9vSYVl9D4BjJ/e5nu+6a 4nzH2U1MKNip2OXMiImXVFrHMyNLJ1xRY4C7bVczQ4latJLZ4LN87IUa1CD5Xq4WEx ZVvcnzaSatVig== Subject: [PATCH 10/11] xfs: parallelize inode inactivation From: "Darrick J. 
Wong" To: djwong@kernel.org Cc: linux-xfs@vger.kernel.org Date: Wed, 10 Mar 2021 19:06:36 -0800 Message-ID: <161543199635.1947934.2885924822578773349.stgit@magnolia> In-Reply-To: <161543194009.1947934.9910987247994410125.stgit@magnolia> References: <161543194009.1947934.9910987247994410125.stgit@magnolia> User-Agent: StGit/0.19 MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: linux-xfs@vger.kernel.org From: Darrick J. Wong Split the inode inactivation work into per-AG work items so that we can take advantage of parallelization. Signed-off-by: Darrick J. Wong --- fs/xfs/xfs_icache.c | 62 ++++++++++++++++++++++++++++++++++++++------------- fs/xfs/xfs_mount.c | 3 ++ fs/xfs/xfs_mount.h | 4 ++- fs/xfs/xfs_super.c | 1 - 4 files changed, 52 insertions(+), 18 deletions(-) diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c index 594d340bbe37..d5f580b92e48 100644 --- a/fs/xfs/xfs_icache.c +++ b/fs/xfs/xfs_icache.c @@ -245,11 +245,13 @@ xfs_inode_clear_reclaim_tag( /* Queue a new inode gc pass if there are inodes needing inactivation. 
*/ static void xfs_inodegc_queue( - struct xfs_mount *mp) + struct xfs_perag *pag) { + struct xfs_mount *mp = pag->pag_mount; + rcu_read_lock(); - if (radix_tree_tagged(&mp->m_perag_tree, XFS_ICI_INACTIVE_TAG)) - queue_delayed_work(mp->m_gc_workqueue, &mp->m_inodegc_work, + if (radix_tree_tagged(&pag->pag_ici_root, XFS_ICI_INACTIVE_TAG)) + queue_delayed_work(mp->m_gc_workqueue, &pag->pag_inodegc_work, msecs_to_jiffies(xfs_inodegc_centisecs * 10)); rcu_read_unlock(); } @@ -272,7 +274,7 @@ xfs_perag_set_inactive_tag( spin_unlock(&mp->m_perag_lock); /* schedule periodic background inode inactivation */ - xfs_inodegc_queue(mp); + xfs_inodegc_queue(pag); trace_xfs_perag_set_inactive(mp, pag->pag_agno, -1, _RET_IP_); } @@ -2074,8 +2076,9 @@ void xfs_inodegc_worker( struct work_struct *work) { - struct xfs_mount *mp = container_of(to_delayed_work(work), - struct xfs_mount, m_inodegc_work); + struct xfs_perag *pag = container_of(to_delayed_work(work), + struct xfs_perag, pag_inodegc_work); + struct xfs_mount *mp = pag->pag_mount; int error; /* @@ -2095,25 +2098,44 @@ xfs_inodegc_worker( xfs_err(mp, "inode inactivation failed, error %d", error); sb_end_write(mp->m_super); - xfs_inodegc_queue(mp); + xfs_inodegc_queue(pag); } -/* Force all queued inode inactivation work to run immediately. */ -void -xfs_inodegc_force( - struct xfs_mount *mp) +/* Garbage collect all inactive inodes in an AG immediately. */ +static inline bool +xfs_inodegc_force_pag( + struct xfs_perag *pag) { + struct xfs_mount *mp = pag->pag_mount; + /* * In order to reset the delay timer to run immediately, we have to * cancel the work item and requeue it with a zero timer value. We * don't care if the worker races with our requeue, because at worst * we iterate the radix tree and find no inodes to inactivate. 
*/ - if (!cancel_delayed_work(&mp->m_inodegc_work)) + if (!cancel_delayed_work(&pag->pag_inodegc_work)) + return false; + + queue_delayed_work(mp->m_gc_workqueue, &pag->pag_inodegc_work, 0); + return true; +} + +/* Force all queued inode inactivation work to run immediately. */ +void +xfs_inodegc_force( + struct xfs_mount *mp) +{ + struct xfs_perag *pag; + xfs_agnumber_t agno; + bool queued = false; + + for_each_perag_tag(mp, agno, pag, XFS_ICI_INACTIVE_TAG) + queued |= xfs_inodegc_force_pag(pag); + if (!queued) return; - queue_delayed_work(mp->m_gc_workqueue, &mp->m_inodegc_work, 0); - flush_delayed_work(&mp->m_inodegc_work); + flush_workqueue(mp->m_gc_workqueue); } /* Stop all queued inactivation work. */ @@ -2121,7 +2143,11 @@ void xfs_inodegc_stop( struct xfs_mount *mp) { - cancel_delayed_work_sync(&mp->m_inodegc_work); + struct xfs_perag *pag; + xfs_agnumber_t agno; + + for_each_perag_tag(mp, agno, pag, XFS_ICI_INACTIVE_TAG) + cancel_delayed_work_sync(&pag->pag_inodegc_work); } /* Schedule deferred inode inactivation work. 
*/ @@ -2129,5 +2155,9 @@ void xfs_inodegc_start( struct xfs_mount *mp) { - xfs_inodegc_queue(mp); + struct xfs_perag *pag; + xfs_agnumber_t agno; + + for_each_perag_tag(mp, agno, pag, XFS_ICI_INACTIVE_TAG) + xfs_inodegc_queue(pag); } diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c index cd015e3d72fc..a5963061485c 100644 --- a/fs/xfs/xfs_mount.c +++ b/fs/xfs/xfs_mount.c @@ -127,6 +127,7 @@ __xfs_free_perag( struct xfs_perag *pag = container_of(head, struct xfs_perag, rcu_head); ASSERT(!delayed_work_pending(&pag->pag_blockgc_work)); + ASSERT(!delayed_work_pending(&pag->pag_inodegc_work)); ASSERT(atomic_read(&pag->pag_ref) == 0); kmem_free(pag); } @@ -148,6 +149,7 @@ xfs_free_perag( ASSERT(pag); ASSERT(atomic_read(&pag->pag_ref) == 0); cancel_delayed_work_sync(&pag->pag_blockgc_work); + cancel_delayed_work_sync(&pag->pag_inodegc_work); xfs_iunlink_destroy(pag); xfs_buf_hash_destroy(pag); call_rcu(&pag->rcu_head, __xfs_free_perag); @@ -204,6 +206,7 @@ xfs_initialize_perag( pag->pag_mount = mp; spin_lock_init(&pag->pag_ici_lock); INIT_DELAYED_WORK(&pag->pag_blockgc_work, xfs_blockgc_worker); + INIT_DELAYED_WORK(&pag->pag_inodegc_work, xfs_inodegc_worker); INIT_RADIX_TREE(&pag->pag_ici_root, GFP_ATOMIC); error = xfs_buf_hash_init(pag); diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h index ce00ad47b8ea..835c07d00cd7 100644 --- a/fs/xfs/xfs_mount.h +++ b/fs/xfs/xfs_mount.h @@ -177,7 +177,6 @@ typedef struct xfs_mount { uint64_t m_resblks_avail;/* available reserved blocks */ uint64_t m_resblks_save; /* reserved blks @ remount,ro */ struct delayed_work m_reclaim_work; /* background inode reclaim */ - struct delayed_work m_inodegc_work; /* background inode inactive */ struct xfs_kobj m_kobj; struct xfs_kobj m_error_kobj; struct xfs_kobj m_error_meta_kobj; @@ -370,6 +369,9 @@ typedef struct xfs_perag { /* background prealloc block trimming */ struct delayed_work pag_blockgc_work; + /* background inode inactivation */ + struct delayed_work pag_inodegc_work; + /* 
reference count */ uint8_t pagf_refcount_level; diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c index 8d0142487fc7..566e5657c1b0 100644 --- a/fs/xfs/xfs_super.c +++ b/fs/xfs/xfs_super.c @@ -1879,7 +1879,6 @@ static int xfs_init_fs_context( mutex_init(&mp->m_growlock); INIT_WORK(&mp->m_flush_inodes_work, xfs_flush_inodes_worker); INIT_DELAYED_WORK(&mp->m_reclaim_work, xfs_reclaim_worker); - INIT_DELAYED_WORK(&mp->m_inodegc_work, xfs_inodegc_worker); mp->m_kobj.kobject.kset = xfs_kset; /* * We don't create the finobt per-ag space reservation until after log From patchwork Thu Mar 11 03:06:41 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Darrick J. Wong" X-Patchwork-Id: 12130215 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-16.2 required=3.0 tests=BAYES_00,DKIMWL_WL_HIGH, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,INCLUDES_CR_TRAILER,INCLUDES_PATCH, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 96093C433DB for ; Thu, 11 Mar 2021 03:07:39 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 604DF64FC9 for ; Thu, 11 Mar 2021 03:07:39 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S230116AbhCKDHI (ORCPT ); Wed, 10 Mar 2021 22:07:08 -0500 Received: from mail.kernel.org ([198.145.29.99]:45960 "EHLO mail.kernel.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229851AbhCKDGm (ORCPT ); Wed, 10 Mar 2021 22:06:42 -0500 Received: by mail.kernel.org (Postfix) with ESMTPSA id 3DFC064EDB; Thu, 11 Mar 2021 03:06:42 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1615432002; 
bh=ZCVsRKI42Linf5t4n2+G5ehDI3HVpBgtEno5ueROlwA=; h=Subject:From:To:Cc:Date:In-Reply-To:References:From; b=stB0B1dKcjZTXIH873R9aCeoTmdo+XmC7aUL8251UYFTqzzT7JEgS+mt7CsEOL6Tf 7Z/XazSzdPhbSKr++yAzrMF4B2Cd0SdRoq5YrY0vdKkCg4VNJR/EtjJRLiUVj8XDdo mQvBX8SLVf+dmFVXSevOHNcLxVVonV14V8qQ/zWj74+klatp93oG4lE6TUtNU7BEdo 7FcmPZKXTt2lFAwA9li0P/0BdVQzfJkcecPm69inKZ1tBcrmBxfBwY6QuetkmNz+3w QM7GDqqmh5b+p4qTQlo/KvWIky6ImPZcaYLsLGVupaYfil4hb0G2Hbyu/sNxH9ymKu YqOzSvLmlyTRw== Subject: [PATCH 11/11] xfs: create a polled function to force inode inactivation From: "Darrick J. Wong" To: djwong@kernel.org Cc: linux-xfs@vger.kernel.org Date: Wed, 10 Mar 2021 19:06:41 -0800 Message-ID: <161543200190.1947934.3117722394191799491.stgit@magnolia> In-Reply-To: <161543194009.1947934.9910987247994410125.stgit@magnolia> References: <161543194009.1947934.9910987247994410125.stgit@magnolia> User-Agent: StGit/0.19 MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: linux-xfs@vger.kernel.org From: Darrick J. Wong Create a polled version of xfs_inodegc_force so that we can force inactivation while holding a lock (usually the umount lock) without tripping over the softlockup timer. This is for callers that hold VFS locks while forcing inactivation, which are currently unmount, iunlink processing during mount, and rw->ro remount. Signed-off-by: Darrick J. Wong --- fs/xfs/xfs_icache.c | 38 +++++++++++++++++++++++++++++++++++++- fs/xfs/xfs_icache.h | 1 + fs/xfs/xfs_mount.c | 2 +- fs/xfs/xfs_mount.h | 5 +++++ fs/xfs/xfs_super.c | 3 ++- 5 files changed, 46 insertions(+), 3 deletions(-) diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c index d5f580b92e48..9db2beb4e732 100644 --- a/fs/xfs/xfs_icache.c +++ b/fs/xfs/xfs_icache.c @@ -25,6 +25,7 @@ #include "xfs_ialloc.h" #include +#include /* * Allocate and initialise an xfs_inode.
@@ -2067,8 +2068,12 @@ xfs_inodegc_free_space( struct xfs_mount *mp, struct xfs_eofblocks *eofb) { - return xfs_inode_walk(mp, XFS_INODE_WALK_INACTIVE, + int error; + + error = xfs_inode_walk(mp, XFS_INODE_WALK_INACTIVE, xfs_inactive_inode, eofb, XFS_ICI_INACTIVE_TAG); + wake_up(&mp->m_inactive_wait); + return error; } /* Try to get inode inactivation moving. */ @@ -2138,6 +2143,37 @@ xfs_inodegc_force( flush_workqueue(mp->m_gc_workqueue); } +/* + * Force all inode inactivation work to run immediately, and poll until the + * work is complete. Callers should only use this function if they must + * inactivate inodes while holding VFS locks, and must be prepared to prevent + * or to wait for inodes that are queued for inactivation while this runs. + */ +void +xfs_inodegc_force_poll( + struct xfs_mount *mp) +{ + struct xfs_perag *pag; + xfs_agnumber_t agno; + bool queued = false; + + for_each_perag_tag(mp, agno, pag, XFS_ICI_INACTIVE_TAG) + queued |= xfs_inodegc_force_pag(pag); + if (!queued) + return; + + /* + * Touch the softlockup watchdog every 1/10th of a second while there + * are still inactivation-tagged inodes in the filesystem. + */ + while (!wait_event_timeout(mp->m_inactive_wait, + !radix_tree_tagged(&mp->m_perag_tree, + XFS_ICI_INACTIVE_TAG), + HZ / 10)) { + touch_softlockup_watchdog(); + } +} + /* Stop all queued inactivation work. 
*/ void xfs_inodegc_stop( diff --git a/fs/xfs/xfs_icache.h b/fs/xfs/xfs_icache.h index 9d5a1f4c0369..80a79bace641 100644 --- a/fs/xfs/xfs_icache.h +++ b/fs/xfs/xfs_icache.h @@ -84,6 +84,7 @@ void xfs_blockgc_start(struct xfs_mount *mp); void xfs_inodegc_worker(struct work_struct *work); void xfs_inodegc_force(struct xfs_mount *mp); +void xfs_inodegc_force_poll(struct xfs_mount *mp); void xfs_inodegc_stop(struct xfs_mount *mp); void xfs_inodegc_start(struct xfs_mount *mp); int xfs_inodegc_free_space(struct xfs_mount *mp, struct xfs_eofblocks *eofb); diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c index a5963061485c..1012b1b361ba 100644 --- a/fs/xfs/xfs_mount.c +++ b/fs/xfs/xfs_mount.c @@ -1109,7 +1109,7 @@ xfs_unmountfs( * Since this can involve finobt updates, do it now before we lose the * per-AG space reservations. */ - xfs_inodegc_force(mp); + xfs_inodegc_force_poll(mp); xfs_blockgc_stop(mp); xfs_fs_unreserve_ag_blocks(mp); diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h index 835c07d00cd7..23d9888d2b82 100644 --- a/fs/xfs/xfs_mount.h +++ b/fs/xfs/xfs_mount.h @@ -213,6 +213,11 @@ typedef struct xfs_mount { unsigned int *m_errortag; struct xfs_kobj m_errortag_kobj; #endif + /* + * Use this to wait for the inode inactivation workqueue to finish + * inactivating all the inodes. + */ + struct wait_queue_head m_inactive_wait; } xfs_mount_t; #define M_IGEO(mp) (&(mp)->m_ino_geo) diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c index 566e5657c1b0..8329a3efced7 100644 --- a/fs/xfs/xfs_super.c +++ b/fs/xfs/xfs_super.c @@ -1754,7 +1754,7 @@ xfs_remount_ro( * Since this can involve finobt updates, do it now before we lose the * per-AG space reservations. */ - xfs_inodegc_force(mp); + xfs_inodegc_force_poll(mp); /* Free the per-AG metadata reservation pool. 
*/ error = xfs_fs_unreserve_ag_blocks(mp); @@ -1880,6 +1880,7 @@ static int xfs_init_fs_context( INIT_WORK(&mp->m_flush_inodes_work, xfs_flush_inodes_worker); INIT_DELAYED_WORK(&mp->m_reclaim_work, xfs_reclaim_worker); mp->m_kobj.kobject.kset = xfs_kset; + init_waitqueue_head(&mp->m_inactive_wait); /* * We don't create the finobt per-ag space reservation until after log * recovery, so we must set this to true so that an ifree transaction