From patchwork Thu Jun 4 07:45:37 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Dave Chinner X-Patchwork-Id: 11587237 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 40A06913 for ; Thu, 4 Jun 2020 07:46:14 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 3290F206C3 for ; Thu, 4 Jun 2020 07:46:14 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1726047AbgFDHqN (ORCPT ); Thu, 4 Jun 2020 03:46:13 -0400 Received: from mail106.syd.optusnet.com.au ([211.29.132.42]:49418 "EHLO mail106.syd.optusnet.com.au" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1727043AbgFDHqN (ORCPT ); Thu, 4 Jun 2020 03:46:13 -0400 Received: from dread.disaster.area (pa49-180-124-177.pa.nsw.optusnet.com.au [49.180.124.177]) by mail106.syd.optusnet.com.au (Postfix) with ESMTPS id 4311A5AAA3A for ; Thu, 4 Jun 2020 17:46:09 +1000 (AEST) Received: from discord.disaster.area ([192.168.253.110]) by dread.disaster.area with esmtp (Exim 4.92.3) (envelope-from ) id 1jgkZj-00049V-DF for linux-xfs@vger.kernel.org; Thu, 04 Jun 2020 17:46:07 +1000 Received: from dave by discord.disaster.area with local (Exim 4.93) (envelope-from ) id 1jgkZj-0017GX-48 for linux-xfs@vger.kernel.org; Thu, 04 Jun 2020 17:46:07 +1000 From: Dave Chinner To: linux-xfs@vger.kernel.org Subject: [PATCH 01/30] xfs: Don't allow logging of XFS_ISTALE inodes Date: Thu, 4 Jun 2020 17:45:37 +1000 Message-Id: <20200604074606.266213-2-david@fromorbit.com> X-Mailer: git-send-email 2.26.2.761.g0e0b3e54be In-Reply-To: <20200604074606.266213-1-david@fromorbit.com> References: <20200604074606.266213-1-david@fromorbit.com> MIME-Version: 1.0 X-Optus-CM-Score: 0 X-Optus-CM-Analysis: v=2.3 cv=QIgWuTDL c=1 sm=1 tr=0 a=k3aV/LVJup6ZGWgigO6cSA==:117 a=k3aV/LVJup6ZGWgigO6cSA==:17 a=nTHF0DUjJn0A:10 a=20KFwNOVAAAA:8 a=yPCof4ZbAAAA:8 a=E4R3UMY25Q1b2mD5dw0A:9 Sender: linux-xfs-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-xfs@vger.kernel.org From: Dave Chinner In tracking down a problem in this patchset, I discovered we are reclaiming dirty stale inodes. This wasn't discovered until inodes were always attached to the cluster buffer, at which point the rcu callback that freed inodes began assert failing because the inode still had an active pointer to the cluster buffer after it had been reclaimed. Debugging indicated that this was a pre-existing problem resulting from the way the inodes are handled in xfs_inactive_ifree. When we free a cluster buffer from xfs_ifree_cluster, all the inodes in cache are marked XFS_ISTALE. Those that are clean have nothing else done to them and so eventually get cleaned up by background reclaim. That is, it is assumed we'll never dirty/relog an inode marked XFS_ISTALE. On journal commit, dirty stale inodes are handled by both the buffer and inode log items: either they are run through xfs_istale_done() and removed from the AIL (buffer log item commit), or the inode log item simply unpins them because the buffer log item will clean them. What happens to any specific inode is entirely dependent on which log item wins the commit race, but the result is the same - stale inodes are clean, not attached to the cluster buffer, and not in the AIL. Hence inode reclaim can just free these inodes without further care.
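To make that invariant concrete before describing the bug, here is a minimal illustrative sketch. The wrapper function below is hypothetical and exists only for illustration; the ASSERT is the check this patch adds to xfs_trans_ijoin() and xfs_trans_log_inode() in the diff that follows:

	/*
	 * Hypothetical wrapper, sketch only: an inode that has been marked
	 * XFS_ISTALE must never be joined to a transaction or logged again.
	 */
	static void
	log_inode_checked(struct xfs_trans *tp, struct xfs_inode *ip, uint flags)
	{
		ASSERT(!xfs_iflags_test(ip, XFS_ISTALE));
		xfs_trans_log_inode(tp, ip, flags);
	}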
However, if the stale inode is relogged, it gets dirtied again and relogged into the CIL. Most of the time this isn't an issue, because relogging simply changes the inode's location in the current checkpoint. Problems arise, however, when the CIL checkpoints between two transactions in the xfs_inactive_ifree() deferops processing. This results in the XFS_ISTALE inode being redirtied and inserted into the CIL without any of the other stale cluster buffer infrastructure being in place. Hence on journal commit, it simply gets unpinned, so it remains dirty in memory. Everything in inode writeback avoids XFS_ISTALE inodes so it can't be written back, and it is not tracked in the AIL so there's not even a trigger to attempt to clean the inode. Hence the inode just sits dirty in memory until inode reclaim comes along, sees that it is XFS_ISTALE, and goes to reclaim it. This reclaiming of a dirty inode caused use-after-free problems, list corruptions and other nasty issues later in this patchset. Hence this patch addresses a violation of the "never log XFS_ISTALE inodes" rule caused by the deferops processing rolling a transaction and relogging a stale inode in xfs_inactive_ifree(). It also adds a bunch of asserts to catch this problem in debug kernels so that we don't reintroduce it in future. The reproducer for this issue was generic/558 on a v4 filesystem. Signed-off-by: Dave Chinner Reviewed-by: Brian Foster Reviewed-by: Darrick J. Wong --- fs/xfs/libxfs/xfs_trans_inode.c | 2 ++ fs/xfs/xfs_icache.c | 3 ++- fs/xfs/xfs_inode.c | 25 ++++++++++++++++++++++--- 3 files changed, 26 insertions(+), 4 deletions(-) diff --git a/fs/xfs/libxfs/xfs_trans_inode.c b/fs/xfs/libxfs/xfs_trans_inode.c index b5dfb66548422..4504d215cd590 100644 --- a/fs/xfs/libxfs/xfs_trans_inode.c +++ b/fs/xfs/libxfs/xfs_trans_inode.c @@ -36,6 +36,7 @@ xfs_trans_ijoin( ASSERT(iip->ili_lock_flags == 0); iip->ili_lock_flags = lock_flags; + ASSERT(!xfs_iflags_test(ip, XFS_ISTALE)); /* * Get a log_item_desc to point at the new item. @@ -89,6 +90,7 @@ xfs_trans_log_inode( ASSERT(ip->i_itemp != NULL); ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL)); + ASSERT(!xfs_iflags_test(ip, XFS_ISTALE)); /* * Don't bother with i_lock for the I_DIRTY_TIME check here, as races diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c index 0a5ac6f9a5834..dbba4c1946386 100644 --- a/fs/xfs/xfs_icache.c +++ b/fs/xfs/xfs_icache.c @@ -1141,7 +1141,7 @@ xfs_reclaim_inode( goto out_ifunlock; xfs_iunpin_wait(ip); } - if (xfs_iflags_test(ip, XFS_ISTALE) || xfs_inode_clean(ip)) { + if (xfs_inode_clean(ip)) { xfs_ifunlock(ip); goto reclaim; } @@ -1228,6 +1228,7 @@ xfs_reclaim_inode( xfs_ilock(ip, XFS_ILOCK_EXCL); xfs_qm_dqdetach(ip); xfs_iunlock(ip, XFS_ILOCK_EXCL); + ASSERT(xfs_inode_clean(ip)); __xfs_inode_free(ip); return error; diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c index 64f5f9a440aed..53a1d64782c35 100644 --- a/fs/xfs/xfs_inode.c +++ b/fs/xfs/xfs_inode.c @@ -1740,10 +1740,31 @@ xfs_inactive_ifree( return error; } + /* + * We do not hold the inode locked across the entire rolling transaction + * here. We only need to hold it for the first transaction that + * xfs_ifree() builds, which may mark the inode XFS_ISTALE if the + * underlying cluster buffer is freed. Relogging an XFS_ISTALE inode + * here breaks the relationship between cluster buffer invalidation and + * stale inode invalidation on cluster buffer item journal commit + * completion, and can result in leaving dirty stale inodes hanging + * around in memory.
+ * + * We have no need for serialising this inode operation against other + * operations - we freed the inode and hence reallocation is required + * and that will serialise on reallocating the space the deferops need + * to free. Hence we can unlock the inode on the first commit of + * the transaction rather than roll it right through the deferops. This + * avoids relogging the XFS_ISTALE inode. + * + * We check that xfs_ifree() hasn't grown an internal transaction roll + * by asserting that the inode is still locked when it returns. + */ xfs_ilock(ip, XFS_ILOCK_EXCL); - xfs_trans_ijoin(tp, ip, 0); + xfs_trans_ijoin(tp, ip, XFS_ILOCK_EXCL); error = xfs_ifree(tp, ip); + ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL)); if (error) { /* * If we fail to free the inode, shut down. The cancel @@ -1756,7 +1777,6 @@ xfs_inactive_ifree( xfs_force_shutdown(mp, SHUTDOWN_META_IO_ERROR); } xfs_trans_cancel(tp); - xfs_iunlock(ip, XFS_ILOCK_EXCL); return error; } @@ -1774,7 +1794,6 @@ xfs_inactive_ifree( xfs_notice(mp, "%s: xfs_trans_commit returned error %d", __func__, error); - xfs_iunlock(ip, XFS_ILOCK_EXCL); return 0; } From patchwork Thu Jun 4 07:45:38 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Dave Chinner X-Patchwork-Id: 11587245 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 7B4891392 for ; Thu, 4 Jun 2020 07:46:17 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 6B50C2067B for ; Thu, 4 Jun 2020 07:46:17 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1727966AbgFDHqQ (ORCPT ); Thu, 4 Jun 2020 03:46:16 -0400 Received: from mail110.syd.optusnet.com.au ([211.29.132.97]:34179 "EHLO mail110.syd.optusnet.com.au" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1727779AbgFDHqQ (ORCPT ); Thu, 4 Jun 2020 03:46:16 -0400 Received: from dread.disaster.area (pa49-180-124-177.pa.nsw.optusnet.com.au [49.180.124.177]) by mail110.syd.optusnet.com.au (Postfix) with ESMTPS id B4F0510882A for ; Thu, 4 Jun 2020 17:46:08 +1000 (AEST) Received: from discord.disaster.area ([192.168.253.110]) by dread.disaster.area with esmtp (Exim 4.92.3) (envelope-from ) id 1jgkZj-00049X-Ez for linux-xfs@vger.kernel.org; Thu, 04 Jun 2020 17:46:07 +1000 Received: from dave by discord.disaster.area with local (Exim 4.93) (envelope-from ) id 1jgkZj-0017Gc-5h for linux-xfs@vger.kernel.org; Thu, 04 Jun 2020 17:46:07 +1000 From: Dave Chinner To: linux-xfs@vger.kernel.org Subject: [PATCH 02/30] xfs: remove logged flag from inode log item Date: Thu, 4 Jun 2020 17:45:38 +1000 Message-Id: <20200604074606.266213-3-david@fromorbit.com> X-Mailer: git-send-email 2.26.2.761.g0e0b3e54be In-Reply-To: <20200604074606.266213-1-david@fromorbit.com> References: <20200604074606.266213-1-david@fromorbit.com> MIME-Version: 1.0 X-Optus-CM-Score: 0 X-Optus-CM-Analysis: v=2.3 cv=W5xGqiek c=1 sm=1 tr=0 a=k3aV/LVJup6ZGWgigO6cSA==:117 a=k3aV/LVJup6ZGWgigO6cSA==:17 a=nTHF0DUjJn0A:10 a=20KFwNOVAAAA:8 a=yPCof4ZbAAAA:8 a=RMhDva1DPA6hu7IbcAIA:9 Sender: linux-xfs-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-xfs@vger.kernel.org From: Dave Chinner This was used to track if the item had logged fields being flushed to disk. We log everything in the inode these days, so this logic is no longer needed. Remove it. 
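The net effect on the AIL removal checks in xfs_iflush_done() can be sketched as a before/after pair (condensed from the diff below; this is not new code):

	/* before: only pull the item from the AIL if it was flushed while logged */
	if ((iip->ili_logged && blip->li_lsn == iip->ili_flush_lsn) ||
	    test_bit(XFS_LI_FAILED, &blip->li_flags))
		need_ail++;

	/* after: ili_logged is gone, so the unchanged-LSN check alone suffices */
	if (blip->li_lsn == iip->ili_flush_lsn ||
	    test_bit(XFS_LI_FAILED, &blip->li_flags))
		need_ail++;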
Signed-off-by: Dave Chinner Reviewed-by: Christoph Hellwig Reviewed-by: Darrick J. Wong Reviewed-by: Brian Foster --- fs/xfs/xfs_inode.c | 13 ++++--------- fs/xfs/xfs_inode_item.c | 35 ++++++++++------------------------- fs/xfs/xfs_inode_item.h | 1 - 3 files changed, 14 insertions(+), 35 deletions(-) diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c index 53a1d64782c35..4fa12775ac146 100644 --- a/fs/xfs/xfs_inode.c +++ b/fs/xfs/xfs_inode.c @@ -2677,7 +2677,6 @@ xfs_ifree_cluster( list_for_each_entry(lip, &bp->b_li_list, li_bio_list) { if (lip->li_type == XFS_LI_INODE) { iip = (struct xfs_inode_log_item *)lip; - ASSERT(iip->ili_logged == 1); lip->li_cb = xfs_istale_done; xfs_trans_ail_copy_lsn(mp->m_ail, &iip->ili_flush_lsn, @@ -2706,7 +2705,6 @@ xfs_ifree_cluster( iip->ili_last_fields = iip->ili_fields; iip->ili_fields = 0; iip->ili_fsync_fields = 0; - iip->ili_logged = 1; xfs_trans_ail_copy_lsn(mp->m_ail, &iip->ili_flush_lsn, &iip->ili_item.li_lsn); @@ -3838,19 +3836,16 @@ xfs_iflush_int( * * We can play with the ili_fields bits here, because the inode lock * must be held exclusively in order to set bits there and the flush - * lock protects the ili_last_fields bits. Set ili_logged so the flush - * done routine can tell whether or not to look in the AIL. Also, store - * the current LSN of the inode so that we can tell whether the item has - * moved in the AIL from xfs_iflush_done(). In order to read the lsn we - * need the AIL lock, because it is a 64 bit value that cannot be read - * atomically. + * lock protects the ili_last_fields bits. Store the current LSN of the + * inode so that we can tell whether the item has moved in the AIL from + * xfs_iflush_done(). In order to read the lsn we need the AIL lock, + * because it is a 64 bit value that cannot be read atomically. */ error = 0; flush_out: iip->ili_last_fields = iip->ili_fields; iip->ili_fields = 0; iip->ili_fsync_fields = 0; - iip->ili_logged = 1; xfs_trans_ail_copy_lsn(mp->m_ail, &iip->ili_flush_lsn, &iip->ili_item.li_lsn); diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c index ba47bf65b772b..b17384aa8df40 100644 --- a/fs/xfs/xfs_inode_item.c +++ b/fs/xfs/xfs_inode_item.c @@ -528,8 +528,6 @@ xfs_inode_item_push( } ASSERT(iip->ili_fields != 0 || XFS_FORCED_SHUTDOWN(ip->i_mount)); - ASSERT(iip->ili_logged == 0 || XFS_FORCED_SHUTDOWN(ip->i_mount)); - spin_unlock(&lip->li_ailp->ail_lock); error = xfs_iflush(ip, &bp); @@ -690,30 +688,24 @@ xfs_iflush_done( continue; list_move_tail(&blip->li_bio_list, &tmp); - /* - * while we have the item, do the unlocked check for needing - * the AIL lock. - */ + + /* Do an unlocked check for needing the AIL lock. */ iip = INODE_ITEM(blip); - if ((iip->ili_logged && blip->li_lsn == iip->ili_flush_lsn) || + if (blip->li_lsn == iip->ili_flush_lsn || test_bit(XFS_LI_FAILED, &blip->li_flags)) need_ail++; } /* make sure we capture the state of the initial inode. */ iip = INODE_ITEM(lip); - if ((iip->ili_logged && lip->li_lsn == iip->ili_flush_lsn) || + if (lip->li_lsn == iip->ili_flush_lsn || test_bit(XFS_LI_FAILED, &lip->li_flags)) need_ail++; /* - * We only want to pull the item from the AIL if it is - * actually there and its location in the log has not - * changed since we started the flush. Thus, we only bother - * if the ili_logged flag is set and the inode's lsn has not - * changed. First we check the lsn outside - * the lock since it's cheaper, and then we recheck while - * holding the lock before removing the inode from the AIL. 
+ * We only want to pull the item from the AIL if it is actually there + * and its location in the log has not changed since we started the + * flush. Thus, we only bother if the inode's lsn has not changed. */ if (need_ail) { xfs_lsn_t tail_lsn = 0; @@ -721,8 +713,7 @@ xfs_iflush_done( /* this is an opencoded batch version of xfs_trans_ail_delete */ spin_lock(&ailp->ail_lock); list_for_each_entry(blip, &tmp, li_bio_list) { - if (INODE_ITEM(blip)->ili_logged && - blip->li_lsn == INODE_ITEM(blip)->ili_flush_lsn) { + if (blip->li_lsn == INODE_ITEM(blip)->ili_flush_lsn) { /* * xfs_ail_update_finish() only cares about the * lsn of the first tail item removed, any @@ -740,14 +731,13 @@ xfs_iflush_done( } /* - * clean up and unlock the flush lock now we are done. We can clear the + * Clean up and unlock the flush lock now we are done. We can clear the * ili_last_fields bits now that we know that the data corresponding to * them is safely on disk. */ list_for_each_entry_safe(blip, n, &tmp, li_bio_list) { list_del_init(&blip->li_bio_list); iip = INODE_ITEM(blip); - iip->ili_logged = 0; iip->ili_last_fields = 0; xfs_ifunlock(iip->ili_inode); } @@ -768,16 +758,11 @@ xfs_iflush_abort( if (iip) { xfs_trans_ail_delete(&iip->ili_item, 0); - iip->ili_logged = 0; - /* - * Clear the ili_last_fields bits now that we know that the - * data corresponding to them is safely on disk. - */ - iip->ili_last_fields = 0; /* * Clear the inode logging fields so no more flushes are * attempted. */ + iip->ili_last_fields = 0; iip->ili_fields = 0; iip->ili_fsync_fields = 0; } diff --git a/fs/xfs/xfs_inode_item.h b/fs/xfs/xfs_inode_item.h index 60b34bb66e8ed..4de5070e07655 100644 --- a/fs/xfs/xfs_inode_item.h +++ b/fs/xfs/xfs_inode_item.h @@ -19,7 +19,6 @@ struct xfs_inode_log_item { xfs_lsn_t ili_flush_lsn; /* lsn at last flush */ xfs_lsn_t ili_last_lsn; /* lsn at last transaction */ unsigned short ili_lock_flags; /* lock flags */ - unsigned short ili_logged; /* flushed logged data */ unsigned int ili_last_fields; /* fields when flushed */ unsigned int ili_fields; /* fields to be logged */ unsigned int ili_fsync_fields; /* logged since last fsync */ From patchwork Thu Jun 4 07:45:39 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Dave Chinner X-Patchwork-Id: 11587265 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 047F9913 for ; Thu, 4 Jun 2020 07:46:23 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id E1E132072E for ; Thu, 4 Jun 2020 07:46:22 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1727777AbgFDHqW (ORCPT ); Thu, 4 Jun 2020 03:46:22 -0400 Received: from mail109.syd.optusnet.com.au ([211.29.132.80]:50326 "EHLO mail109.syd.optusnet.com.au" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1727839AbgFDHqS (ORCPT ); Thu, 4 Jun 2020 03:46:18 -0400 Received: from dread.disaster.area (pa49-180-124-177.pa.nsw.optusnet.com.au [49.180.124.177]) by mail109.syd.optusnet.com.au (Postfix) with ESMTPS id F406BD79752 for ; Thu, 4 Jun 2020 17:46:11 +1000 (AEST) Received: from discord.disaster.area ([192.168.253.110]) by dread.disaster.area with esmtp (Exim 4.92.3) (envelope-from ) id 1jgkZj-00049Z-Ga for linux-xfs@vger.kernel.org; Thu, 04 Jun 2020 17:46:07 +1000 Received: from dave by discord.disaster.area with local (Exim 4.93) 
(envelope-from ) id 1jgkZj-0017Gh-72 for linux-xfs@vger.kernel.org; Thu, 04 Jun 2020 17:46:07 +1000 From: Dave Chinner To: linux-xfs@vger.kernel.org Subject: [PATCH 03/30] xfs: add an inode item lock Date: Thu, 4 Jun 2020 17:45:39 +1000 Message-Id: <20200604074606.266213-4-david@fromorbit.com> X-Mailer: git-send-email 2.26.2.761.g0e0b3e54be In-Reply-To: <20200604074606.266213-1-david@fromorbit.com> References: <20200604074606.266213-1-david@fromorbit.com> MIME-Version: 1.0 X-Optus-CM-Score: 0 X-Optus-CM-Analysis: v=2.3 cv=QIgWuTDL c=1 sm=1 tr=0 a=k3aV/LVJup6ZGWgigO6cSA==:117 a=k3aV/LVJup6ZGWgigO6cSA==:17 a=nTHF0DUjJn0A:10 a=20KFwNOVAAAA:8 a=JeUEpQy9zLAlQ1B7pzYA:9 Sender: linux-xfs-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-xfs@vger.kernel.org From: Dave Chinner The inode log item is kind of special in that it can be aggregating new changes in memory at the same time that existing changes are being written back to disk. This means there are fields in the log item that are accessed concurrently from contexts that don't share any locking at all. e.g. ili_last_fields is updated at flush time under both the ILOCK_EXCL and the flush lock, is updated under the flush lock alone at IO completion time, and is read under the ILOCK_EXCL when the inode is logged. Hence there is no actual serialisation between reading the field during logging of the inode in transactions vs clearing the field in IO completion. We currently get away with this because we only clear fields in IO completion, and nothing bad happens if we accidentally log more of the inode than we actually modify. The worst case is that we consume a tiny bit more memory and log bandwidth. However, if we want to do more complex state manipulations on the log item that require updates at all three of these potential locations, we need some mechanism for serialising those operations. To do this, introduce a spinlock into the log item to serialise internal state. This could be done via the xfs_inode i_flags_lock, but that then leads to potential lock inversion issues where inode flag updates need to occur inside locks that best nest inside the inode log item locks (e.g. marking inodes stale during inode cluster freeing). Using a separate spinlock avoids these sorts of problems and simplifies future code. This does not change the use of ili_fields in the item formatting code - that is entirely protected by the ILOCK_EXCL at this point in time, so it remains untouched.
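The resulting locking pattern is easiest to see as a short sketch, condensed from the hunks below (all of the names are from this patch):

	spin_lock(&iip->ili_lock);
	iip->ili_last_fields = iip->ili_fields;	/* capture fields being flushed */
	iip->ili_fields = 0;
	iip->ili_fsync_fields = 0;
	spin_unlock(&iip->ili_lock);

The same lock/update/unlock shape now appears at dirtying time in xfs_trans_log_inode(), at flush time in xfs_iflush_int() and xfs_ifree_cluster(), and at IO completion time in xfs_iflush_done() and xfs_iflush_abort().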
Signed-off-by: Dave Chinner Reviewed-by: Brian Foster --- fs/xfs/libxfs/xfs_trans_inode.c | 52 ++++++++++++++++----------------- fs/xfs/xfs_file.c | 9 ++++-- fs/xfs/xfs_inode.c | 20 ++++++++----- fs/xfs/xfs_inode_item.c | 7 +++++ fs/xfs/xfs_inode_item.h | 18 ++++++++++-- 5 files changed, 66 insertions(+), 40 deletions(-) diff --git a/fs/xfs/libxfs/xfs_trans_inode.c b/fs/xfs/libxfs/xfs_trans_inode.c index 4504d215cd590..c66d9d1dd58b9 100644 --- a/fs/xfs/libxfs/xfs_trans_inode.c +++ b/fs/xfs/libxfs/xfs_trans_inode.c @@ -82,16 +82,20 @@ xfs_trans_ichgtime( */ void xfs_trans_log_inode( - xfs_trans_t *tp, - xfs_inode_t *ip, - uint flags) + struct xfs_trans *tp, + struct xfs_inode *ip, + uint flags) { - struct inode *inode = VFS_I(ip); + struct xfs_inode_log_item *iip = ip->i_itemp; + struct inode *inode = VFS_I(ip); + uint iversion_flags = 0; - ASSERT(ip->i_itemp != NULL); + ASSERT(iip); ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL)); ASSERT(!xfs_iflags_test(ip, XFS_ISTALE)); + tp->t_flags |= XFS_TRANS_DIRTY; + /* * Don't bother with i_lock for the I_DIRTY_TIME check here, as races * don't matter - we either will need an extra transaction in 24 hours @@ -104,15 +108,6 @@ xfs_trans_log_inode( spin_unlock(&inode->i_lock); } - /* - * Record the specific change for fdatasync optimisation. This - * allows fdatasync to skip log forces for inodes that are only - * timestamp dirty. We do this before the change count so that - * the core being logged in this case does not impact on fdatasync - * behaviour. - */ - ip->i_itemp->ili_fsync_fields |= flags; - /* * First time we log the inode in a transaction, bump the inode change * counter if it is configured for this to occur. While we have the @@ -122,23 +117,28 @@ xfs_trans_log_inode( * set however, then go ahead and bump the i_version counter * unconditionally. */ - if (!test_and_set_bit(XFS_LI_DIRTY, &ip->i_itemp->ili_item.li_flags) && - IS_I_VERSION(VFS_I(ip))) { - if (inode_maybe_inc_iversion(VFS_I(ip), flags & XFS_ILOG_CORE)) - flags |= XFS_ILOG_CORE; + if (!test_and_set_bit(XFS_LI_DIRTY, &iip->ili_item.li_flags)) { + if (IS_I_VERSION(inode) && + inode_maybe_inc_iversion(inode, flags & XFS_ILOG_CORE)) + iversion_flags = XFS_ILOG_CORE; } - tp->t_flags |= XFS_TRANS_DIRTY; + /* + * Record the specific change for fdatasync optimisation. This allows + * fdatasync to skip log forces for inodes that are only timestamp + * dirty. + */ + spin_lock(&iip->ili_lock); + iip->ili_fsync_fields |= flags; /* - * Always OR in the bits from the ili_last_fields field. - * This is to coordinate with the xfs_iflush() and xfs_iflush_done() - * routines in the eventual clearing of the ili_fields bits. - * See the big comment in xfs_iflush() for an explanation of - * this coordination mechanism. + * Always OR in the bits from the ili_last_fields field. This is to + * coordinate with the xfs_iflush() and xfs_iflush_done() routines in + * the eventual clearing of the ili_fields bits. See the big comment in + * xfs_iflush() for an explanation of this coordination mechanism. 
*/ - flags |= ip->i_itemp->ili_last_fields; - ip->i_itemp->ili_fields |= flags; + iip->ili_fields |= (flags | iip->ili_last_fields | iversion_flags); + spin_unlock(&iip->ili_lock); } int diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c index 403c90309a8ff..0abf770b77498 100644 --- a/fs/xfs/xfs_file.c +++ b/fs/xfs/xfs_file.c @@ -94,6 +94,7 @@ xfs_file_fsync( { struct inode *inode = file->f_mapping->host; struct xfs_inode *ip = XFS_I(inode); + struct xfs_inode_log_item *iip = ip->i_itemp; struct xfs_mount *mp = ip->i_mount; int error = 0; int log_flushed = 0; @@ -137,13 +138,15 @@ xfs_file_fsync( xfs_ilock(ip, XFS_ILOCK_SHARED); if (xfs_ipincount(ip)) { if (!datasync || - (ip->i_itemp->ili_fsync_fields & ~XFS_ILOG_TIMESTAMP)) - lsn = ip->i_itemp->ili_last_lsn; + (iip->ili_fsync_fields & ~XFS_ILOG_TIMESTAMP)) + lsn = iip->ili_last_lsn; } if (lsn) { error = xfs_log_force_lsn(mp, lsn, XFS_LOG_SYNC, &log_flushed); - ip->i_itemp->ili_fsync_fields = 0; + spin_lock(&iip->ili_lock); + iip->ili_fsync_fields = 0; + spin_unlock(&iip->ili_lock); } xfs_iunlock(ip, XFS_ILOCK_SHARED); diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c index 4fa12775ac146..ac3c8af8c9a14 100644 --- a/fs/xfs/xfs_inode.c +++ b/fs/xfs/xfs_inode.c @@ -2702,9 +2702,11 @@ xfs_ifree_cluster( continue; iip = ip->i_itemp; + spin_lock(&iip->ili_lock); iip->ili_last_fields = iip->ili_fields; iip->ili_fields = 0; iip->ili_fsync_fields = 0; + spin_unlock(&iip->ili_lock); xfs_trans_ail_copy_lsn(mp->m_ail, &iip->ili_flush_lsn, &iip->ili_item.li_lsn); @@ -2740,6 +2742,7 @@ xfs_ifree( { int error; struct xfs_icluster xic = { 0 }; + struct xfs_inode_log_item *iip = ip->i_itemp; ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL)); ASSERT(VFS_I(ip)->i_nlink == 0); @@ -2777,7 +2780,9 @@ xfs_ifree( ip->i_df.if_format = XFS_DINODE_FMT_EXTENTS; /* Don't attempt to replay owner changes for a deleted inode */ - ip->i_itemp->ili_fields &= ~(XFS_ILOG_AOWNER|XFS_ILOG_DOWNER); + spin_lock(&iip->ili_lock); + iip->ili_fields &= ~(XFS_ILOG_AOWNER | XFS_ILOG_DOWNER); + spin_unlock(&iip->ili_lock); /* * Bump the generation count so no one will be confused @@ -3833,20 +3838,19 @@ xfs_iflush_int( * know that the information those bits represent is permanently on * disk. As long as the flush completes before the inode is logged * again, then both ili_fields and ili_last_fields will be cleared. - * - * We can play with the ili_fields bits here, because the inode lock - * must be held exclusively in order to set bits there and the flush - * lock protects the ili_last_fields bits. Store the current LSN of the - * inode so that we can tell whether the item has moved in the AIL from - * xfs_iflush_done(). In order to read the lsn we need the AIL lock, - * because it is a 64 bit value that cannot be read atomically. */ error = 0; flush_out: + spin_lock(&iip->ili_lock); iip->ili_last_fields = iip->ili_fields; iip->ili_fields = 0; iip->ili_fsync_fields = 0; + spin_unlock(&iip->ili_lock); + /* + * Store the current LSN of the inode so that we can tell whether the + * item has moved in the AIL from xfs_iflush_done(). 
+ */ xfs_trans_ail_copy_lsn(mp->m_ail, &iip->ili_flush_lsn, &iip->ili_item.li_lsn); diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c index b17384aa8df40..6ef9cbcfc94a7 100644 --- a/fs/xfs/xfs_inode_item.c +++ b/fs/xfs/xfs_inode_item.c @@ -637,6 +637,7 @@ xfs_inode_item_init( iip = ip->i_itemp = kmem_zone_zalloc(xfs_ili_zone, 0); iip->ili_inode = ip; + spin_lock_init(&iip->ili_lock); xfs_log_item_init(mp, &iip->ili_item, XFS_LI_INODE, &xfs_inode_item_ops); } @@ -738,7 +739,11 @@ xfs_iflush_done( list_for_each_entry_safe(blip, n, &tmp, li_bio_list) { list_del_init(&blip->li_bio_list); iip = INODE_ITEM(blip); + + spin_lock(&iip->ili_lock); iip->ili_last_fields = 0; + spin_unlock(&iip->ili_lock); + xfs_ifunlock(iip->ili_inode); } list_del(&tmp); @@ -762,9 +767,11 @@ xfs_iflush_abort( * Clear the inode logging fields so no more flushes are * attempted. */ + spin_lock(&iip->ili_lock); iip->ili_last_fields = 0; iip->ili_fields = 0; iip->ili_fsync_fields = 0; + spin_unlock(&iip->ili_lock); } /* * Release the inode's flush lock since we're done with it. diff --git a/fs/xfs/xfs_inode_item.h b/fs/xfs/xfs_inode_item.h index 4de5070e07655..4a10a1b92ee99 100644 --- a/fs/xfs/xfs_inode_item.h +++ b/fs/xfs/xfs_inode_item.h @@ -16,12 +16,24 @@ struct xfs_mount; struct xfs_inode_log_item { struct xfs_log_item ili_item; /* common portion */ struct xfs_inode *ili_inode; /* inode ptr */ - xfs_lsn_t ili_flush_lsn; /* lsn at last flush */ - xfs_lsn_t ili_last_lsn; /* lsn at last transaction */ - unsigned short ili_lock_flags; /* lock flags */ + unsigned short ili_lock_flags; /* inode lock flags */ + /* + * The ili_lock protects the interactions between the dirty state and + * the flush state of the inode log item. This allows us to do atomic + * modifications of multiple state fields without having to hold a + * specific inode lock to serialise them. + * + * We need atomic changes between inode dirtying, inode flushing and + * inode completion, but these all hold different combinations of + * ILOCK and iflock and hence we need some other method of serialising + * updates to the flush state. 
+ */ + spinlock_t ili_lock; /* flush state lock */ unsigned int ili_last_fields; /* fields when flushed */ unsigned int ili_fields; /* fields to be logged */ unsigned int ili_fsync_fields; /* logged since last fsync */ + xfs_lsn_t ili_flush_lsn; /* lsn at last flush */ + xfs_lsn_t ili_last_lsn; /* lsn at last transaction */ }; static inline int xfs_inode_clean(xfs_inode_t *ip) From patchwork Thu Jun 4 07:45:40 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Dave Chinner X-Patchwork-Id: 11587255 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id F21DE175A for ; Thu, 4 Jun 2020 07:46:20 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id E24422067B for ; Thu, 4 Jun 2020 07:46:20 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1727944AbgFDHqU (ORCPT ); Thu, 4 Jun 2020 03:46:20 -0400 Received: from mail105.syd.optusnet.com.au ([211.29.132.249]:41140 "EHLO mail105.syd.optusnet.com.au" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1727043AbgFDHqR (ORCPT ); Thu, 4 Jun 2020 03:46:17 -0400 Received: from dread.disaster.area (pa49-180-124-177.pa.nsw.optusnet.com.au [49.180.124.177]) by mail105.syd.optusnet.com.au (Postfix) with ESMTPS id BBA773A459A for ; Thu, 4 Jun 2020 17:46:12 +1000 (AEST) Received: from discord.disaster.area ([192.168.253.110]) by dread.disaster.area with esmtp (Exim 4.92.3) (envelope-from ) id 1jgkZj-00049e-Hh for linux-xfs@vger.kernel.org; Thu, 04 Jun 2020 17:46:07 +1000 Received: from dave by discord.disaster.area with local (Exim 4.93) (envelope-from ) id 1jgkZj-0017Gm-8t for linux-xfs@vger.kernel.org; Thu, 04 Jun 2020 17:46:07 +1000 From: Dave Chinner To: linux-xfs@vger.kernel.org Subject: [PATCH 04/30] xfs: mark inode buffers in cache Date: Thu, 4 Jun 2020 17:45:40 +1000 Message-Id: <20200604074606.266213-5-david@fromorbit.com> X-Mailer: git-send-email 2.26.2.761.g0e0b3e54be In-Reply-To: <20200604074606.266213-1-david@fromorbit.com> References: <20200604074606.266213-1-david@fromorbit.com> MIME-Version: 1.0 X-Optus-CM-Score: 0 X-Optus-CM-Analysis: v=2.3 cv=QIgWuTDL c=1 sm=1 tr=0 a=k3aV/LVJup6ZGWgigO6cSA==:117 a=k3aV/LVJup6ZGWgigO6cSA==:17 a=nTHF0DUjJn0A:10 a=20KFwNOVAAAA:8 a=k2ft56n7lZIwALnzbJ4A:9 Sender: linux-xfs-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-xfs@vger.kernel.org From: Dave Chinner Inode buffers always have write IO callbacks, so by marking them directly we can avoid needing to attach ->b_iodone functions to them. This avoids an indirect call, and makes future modifications much simpler. While this is largely a refactor of existing functionality, we broaden the scope of the flag beyond the places where inodes are explicitly attached, because future changes need to know what type of log items are attached to the buffer. Adding this buffer flag may invoke the inode iodone callback in cases where it wouldn't have been previously, but this is not a functional change because the callback is identical to the normal buffer write iodone callback when inodes are not attached.
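The resulting dispatch in the buffer IO completion path is short enough to quote as a sketch (condensed from the xfs_buf_ioend() hunk in the diff below; this is not new code):

	if (bp->b_flags & _XBF_INODES) {
		/* direct call replaces the indirect ->b_iodone invocation */
		xfs_buf_inode_iodone(bp);
		return;
	}

Later patches in this series extend the same flag-based dispatch to dquot and log recovery buffers.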
Signed-off-by: Dave Chinner Reviewed-by: Brian Foster --- fs/xfs/xfs_buf.c | 21 ++++++++++++++++----- fs/xfs/xfs_buf.h | 38 +++++++++++++++++++++++++------------- fs/xfs/xfs_buf_item.c | 42 +++++++++++++++++++++++++++++++----------- fs/xfs/xfs_buf_item.h | 1 + fs/xfs/xfs_inode.c | 2 +- fs/xfs/xfs_trans_buf.c | 3 +++ 6 files changed, 77 insertions(+), 30 deletions(-) diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c index 9c2fbb6bbf89d..fcf650575be61 100644 --- a/fs/xfs/xfs_buf.c +++ b/fs/xfs/xfs_buf.c @@ -14,6 +14,8 @@ #include "xfs_mount.h" #include "xfs_trace.h" #include "xfs_log.h" +#include "xfs_trans.h" +#include "xfs_buf_item.h" #include "xfs_errortag.h" #include "xfs_error.h" @@ -1202,12 +1204,21 @@ xfs_buf_ioend( bp->b_flags |= XBF_DONE; } - if (bp->b_iodone) + if (read) + goto out_finish; + + if (bp->b_flags & _XBF_INODES) { + xfs_buf_inode_iodone(bp); + return; + } + + if (bp->b_iodone) { (*(bp->b_iodone))(bp); - else if (bp->b_flags & XBF_ASYNC) - xfs_buf_relse(bp); - else - complete(&bp->b_iowait); + return; + } + +out_finish: + xfs_buf_ioend_finish(bp); } static void diff --git a/fs/xfs/xfs_buf.h b/fs/xfs/xfs_buf.h index 050c53b739e24..2400cb90a04c6 100644 --- a/fs/xfs/xfs_buf.h +++ b/fs/xfs/xfs_buf.h @@ -30,15 +30,18 @@ #define XBF_STALE (1 << 6) /* buffer has been staled, do not find it */ #define XBF_WRITE_FAIL (1 << 7) /* async writes have failed on this buffer */ -/* flags used only as arguments to access routines */ -#define XBF_TRYLOCK (1 << 16)/* lock requested, but do not wait */ -#define XBF_UNMAPPED (1 << 17)/* do not map the buffer */ +/* buffer type flags for write callbacks */ +#define _XBF_INODES (1 << 16)/* inode buffer */ /* flags used only internally */ #define _XBF_PAGES (1 << 20)/* backed by refcounted pages */ #define _XBF_KMEM (1 << 21)/* backed by heap memory */ #define _XBF_DELWRI_Q (1 << 22)/* buffer on a delwri queue */ +/* flags used only as arguments to access routines */ +#define XBF_TRYLOCK (1 << 30)/* lock requested, but do not wait */ +#define XBF_UNMAPPED (1 << 31)/* do not map the buffer */ + typedef unsigned int xfs_buf_flags_t; #define XFS_BUF_FLAGS \ @@ -50,12 +53,13 @@ typedef unsigned int xfs_buf_flags_t; { XBF_DONE, "DONE" }, \ { XBF_STALE, "STALE" }, \ { XBF_WRITE_FAIL, "WRITE_FAIL" }, \ - { XBF_TRYLOCK, "TRYLOCK" }, /* should never be set */\ - { XBF_UNMAPPED, "UNMAPPED" }, /* ditto */\ + { _XBF_INODES, "INODES" }, \ { _XBF_PAGES, "PAGES" }, \ { _XBF_KMEM, "KMEM" }, \ - { _XBF_DELWRI_Q, "DELWRI_Q" } - + { _XBF_DELWRI_Q, "DELWRI_Q" }, \ + /* The following interface flags should never be set */ \ + { XBF_TRYLOCK, "TRYLOCK" }, \ + { XBF_UNMAPPED, "UNMAPPED" } /* * Internal state flags. 
@@ -257,9 +261,23 @@ extern void xfs_buf_unlock(xfs_buf_t *); #define xfs_buf_islocked(bp) \ ((bp)->b_sema.count <= 0) +static inline void xfs_buf_relse(xfs_buf_t *bp) +{ + xfs_buf_unlock(bp); + xfs_buf_rele(bp); +} + /* Buffer Read and Write Routines */ extern int xfs_bwrite(struct xfs_buf *bp); extern void xfs_buf_ioend(struct xfs_buf *bp); +static inline void xfs_buf_ioend_finish(struct xfs_buf *bp) +{ + if (bp->b_flags & XBF_ASYNC) + xfs_buf_relse(bp); + else + complete(&bp->b_iowait); +} + extern void __xfs_buf_ioerror(struct xfs_buf *bp, int error, xfs_failaddr_t failaddr); #define xfs_buf_ioerror(bp, err) __xfs_buf_ioerror((bp), (err), __this_address) @@ -324,12 +342,6 @@ static inline int xfs_buf_ispinned(struct xfs_buf *bp) return atomic_read(&bp->b_pin_count); } -static inline void xfs_buf_relse(xfs_buf_t *bp) -{ - xfs_buf_unlock(bp); - xfs_buf_rele(bp); -} - static inline int xfs_buf_verify_cksum(struct xfs_buf *bp, unsigned long cksum_offset) { diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c index 9e75e8d6042ec..8659cf4282a64 100644 --- a/fs/xfs/xfs_buf_item.c +++ b/fs/xfs/xfs_buf_item.c @@ -1158,20 +1158,15 @@ xfs_buf_iodone_callback_error( return false; } -/* - * This is the iodone() function for buffers which have had callbacks attached - * to them by xfs_buf_attach_iodone(). We need to iterate the items on the - * callback list, mark the buffer as having no more callbacks and then push the - * buffer through IO completion processing. - */ -void -xfs_buf_iodone_callbacks( +static void +xfs_buf_run_callbacks( struct xfs_buf *bp) { + /* - * If there is an error, process it. Some errors require us - * to run callbacks after failure processing is done so we - * detect that and take appropriate action. + * If there is an error, process it. Some errors require us to run + * callbacks after failure processing is done so we detect that and take + * appropriate action. */ if (bp->b_error && xfs_buf_iodone_callback_error(bp)) return; @@ -1188,9 +1183,34 @@ xfs_buf_iodone_callbacks( bp->b_log_item = NULL; list_del_init(&bp->b_li_list); bp->b_iodone = NULL; +} + +/* + * This is the iodone() function for buffers which have had callbacks attached + * to them by xfs_buf_attach_iodone(). We need to iterate the items on the + * callback list, mark the buffer as having no more callbacks and then push the + * buffer through IO completion processing. + */ +void +xfs_buf_iodone_callbacks( + struct xfs_buf *bp) +{ + xfs_buf_run_callbacks(bp); xfs_buf_ioend(bp); } +/* + * Inode buffer iodone callback function. + */ +void +xfs_buf_inode_iodone( + struct xfs_buf *bp) +{ + xfs_buf_run_callbacks(bp); + xfs_buf_ioend_finish(bp); +} + + /* * This is the iodone() function for buffers which have been * logged. It is called when they are eventually flushed out. 
diff --git a/fs/xfs/xfs_buf_item.h b/fs/xfs/xfs_buf_item.h index c9c57e2da9327..a342933ad9b8d 100644 --- a/fs/xfs/xfs_buf_item.h +++ b/fs/xfs/xfs_buf_item.h @@ -59,6 +59,7 @@ void xfs_buf_attach_iodone(struct xfs_buf *, struct xfs_log_item *); void xfs_buf_iodone_callbacks(struct xfs_buf *); void xfs_buf_iodone(struct xfs_buf *, struct xfs_log_item *); +void xfs_buf_inode_iodone(struct xfs_buf *); bool xfs_buf_log_check_iovec(struct xfs_log_iovec *iovec); extern kmem_zone_t *xfs_buf_item_zone; diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c index ac3c8af8c9a14..d5dee57f914a9 100644 --- a/fs/xfs/xfs_inode.c +++ b/fs/xfs/xfs_inode.c @@ -3860,13 +3860,13 @@ xfs_iflush_int( * completion on the buffer to remove the inode from the AIL and release * the flush lock. */ + bp->b_flags |= _XBF_INODES; xfs_buf_attach_iodone(bp, xfs_iflush_done, &iip->ili_item); /* generate the checksum. */ xfs_dinode_calc_crc(mp, dip); ASSERT(!list_empty(&bp->b_li_list)); - ASSERT(bp->b_iodone != NULL); return error; } diff --git a/fs/xfs/xfs_trans_buf.c b/fs/xfs/xfs_trans_buf.c index 08174ffa21189..552d0869aa0fe 100644 --- a/fs/xfs/xfs_trans_buf.c +++ b/fs/xfs/xfs_trans_buf.c @@ -626,6 +626,7 @@ xfs_trans_inode_buf( ASSERT(atomic_read(&bip->bli_refcount) > 0); bip->bli_flags |= XFS_BLI_INODE_BUF; + bp->b_flags |= _XBF_INODES; xfs_trans_buf_set_type(tp, bp, XFS_BLFT_DINO_BUF); } @@ -651,6 +652,7 @@ xfs_trans_stale_inode_buf( bip->bli_flags |= XFS_BLI_STALE_INODE; bip->bli_item.li_cb = xfs_buf_iodone; + bp->b_flags |= _XBF_INODES; xfs_trans_buf_set_type(tp, bp, XFS_BLFT_DINO_BUF); } @@ -675,6 +677,7 @@ xfs_trans_inode_alloc_buf( ASSERT(atomic_read(&bip->bli_refcount) > 0); bip->bli_flags |= XFS_BLI_INODE_ALLOC_BUF; + bp->b_flags |= _XBF_INODES; xfs_trans_buf_set_type(tp, bp, XFS_BLFT_DINO_BUF); } From patchwork Thu Jun 4 07:45:41 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Dave Chinner X-Patchwork-Id: 11587229 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 902F0913 for ; Thu, 4 Jun 2020 07:46:12 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 826D3207D5 for ; Thu, 4 Jun 2020 07:46:12 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1726537AbgFDHqL (ORCPT ); Thu, 4 Jun 2020 03:46:11 -0400 Received: from mail108.syd.optusnet.com.au ([211.29.132.59]:46784 "EHLO mail108.syd.optusnet.com.au" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726983AbgFDHqL (ORCPT ); Thu, 4 Jun 2020 03:46:11 -0400 Received: from dread.disaster.area (pa49-180-124-177.pa.nsw.optusnet.com.au [49.180.124.177]) by mail108.syd.optusnet.com.au (Postfix) with ESMTPS id 3F9DA1A8715 for ; Thu, 4 Jun 2020 17:46:08 +1000 (AEST) Received: from discord.disaster.area ([192.168.253.110]) by dread.disaster.area with esmtp (Exim 4.92.3) (envelope-from ) id 1jgkZj-00049i-J9 for linux-xfs@vger.kernel.org; Thu, 04 Jun 2020 17:46:07 +1000 Received: from dave by discord.disaster.area with local (Exim 4.93) (envelope-from ) id 1jgkZj-0017Gr-AJ for linux-xfs@vger.kernel.org; Thu, 04 Jun 2020 17:46:07 +1000 From: Dave Chinner To: linux-xfs@vger.kernel.org Subject: [PATCH 05/30] xfs: mark dquot buffers in cache Date: Thu, 4 Jun 2020 17:45:41 +1000 Message-Id: <20200604074606.266213-6-david@fromorbit.com> X-Mailer: git-send-email 2.26.2.761.g0e0b3e54be 
In-Reply-To: <20200604074606.266213-1-david@fromorbit.com> References: <20200604074606.266213-1-david@fromorbit.com> MIME-Version: 1.0 X-Optus-CM-Score: 0 X-Optus-CM-Analysis: v=2.3 cv=X6os11be c=1 sm=1 tr=0 a=k3aV/LVJup6ZGWgigO6cSA==:117 a=k3aV/LVJup6ZGWgigO6cSA==:17 a=nTHF0DUjJn0A:10 a=20KFwNOVAAAA:8 a=yPCof4ZbAAAA:8 a=nDrCKdGSDarCGLS7qDwA:9 Sender: linux-xfs-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-xfs@vger.kernel.org Dquot buffers always have write IO callbacks, so by marking them directly we can avoid needing to attach ->b_iodone functions to them. This avoids an indirect call, and makes future modifications much simpler. This is largely a rearrangement of the code at this point - there are no IO completion functionality changes, just a change to how the code is run. Signed-off-by: Dave Chinner Reviewed-by: Brian Foster Reviewed-by: Darrick J. Wong --- fs/xfs/xfs_buf.c | 5 +++++ fs/xfs/xfs_buf.h | 2 ++ fs/xfs/xfs_buf_item.c | 10 ++++++++++ fs/xfs/xfs_buf_item.h | 1 + fs/xfs/xfs_dquot.c | 1 + fs/xfs/xfs_trans_buf.c | 1 + 6 files changed, 20 insertions(+) diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c index fcf650575be61..3bffde8640a52 100644 --- a/fs/xfs/xfs_buf.c +++ b/fs/xfs/xfs_buf.c @@ -1212,6 +1212,11 @@ xfs_buf_ioend( return; } + if (bp->b_flags & _XBF_DQUOTS) { + xfs_buf_dquot_iodone(bp); + return; + } + if (bp->b_iodone) { (*(bp->b_iodone))(bp); return; diff --git a/fs/xfs/xfs_buf.h b/fs/xfs/xfs_buf.h index 2400cb90a04c6..c1d0843206dd6 100644 --- a/fs/xfs/xfs_buf.h +++ b/fs/xfs/xfs_buf.h @@ -32,6 +32,7 @@ /* buffer type flags for write callbacks */ #define _XBF_INODES (1 << 16)/* inode buffer */ +#define _XBF_DQUOTS (1 << 17)/* dquot buffer */ /* flags used only internally */ #define _XBF_PAGES (1 << 20)/* backed by refcounted pages */ @@ -54,6 +55,7 @@ typedef unsigned int xfs_buf_flags_t; { XBF_STALE, "STALE" }, \ { XBF_WRITE_FAIL, "WRITE_FAIL" }, \ { _XBF_INODES, "INODES" }, \ + { _XBF_DQUOTS, "DQUOTS" }, \ { _XBF_PAGES, "PAGES" }, \ { _XBF_KMEM, "KMEM" }, \ { _XBF_DELWRI_Q, "DELWRI_Q" }, \ diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c index 8659cf4282a64..a42cdf9ccc47d 100644 --- a/fs/xfs/xfs_buf_item.c +++ b/fs/xfs/xfs_buf_item.c @@ -1210,6 +1210,16 @@ xfs_buf_inode_iodone( xfs_buf_ioend_finish(bp); } +/* + * Dquot buffer iodone callback function. + */ +void +xfs_buf_dquot_iodone( + struct xfs_buf *bp) +{ + xfs_buf_run_callbacks(bp); + xfs_buf_ioend_finish(bp); +} /* * This is the iodone() function for buffers which have been diff --git a/fs/xfs/xfs_buf_item.h b/fs/xfs/xfs_buf_item.h index a342933ad9b8d..27d13d29b5bbb 100644 --- a/fs/xfs/xfs_buf_item.h +++ b/fs/xfs/xfs_buf_item.h @@ -60,6 +60,7 @@ void xfs_buf_attach_iodone(struct xfs_buf *, void xfs_buf_iodone_callbacks(struct xfs_buf *); void xfs_buf_iodone(struct xfs_buf *, struct xfs_log_item *); void xfs_buf_inode_iodone(struct xfs_buf *); +void xfs_buf_dquot_iodone(struct xfs_buf *); bool xfs_buf_log_check_iovec(struct xfs_log_iovec *iovec); extern kmem_zone_t *xfs_buf_item_zone; diff --git a/fs/xfs/xfs_dquot.c b/fs/xfs/xfs_dquot.c index d5b7f03e93c8d..2e2146fa0914c 100644 --- a/fs/xfs/xfs_dquot.c +++ b/fs/xfs/xfs_dquot.c @@ -1179,6 +1179,7 @@ xfs_qm_dqflush( * Attach an iodone routine so that we can remove this dquot from the * AIL and release the flush lock once the dquot is synced to disk.
*/ + bp->b_flags |= _XBF_DQUOTS; xfs_buf_attach_iodone(bp, xfs_qm_dqflush_done, &dqp->q_logitem.qli_item); diff --git a/fs/xfs/xfs_trans_buf.c b/fs/xfs/xfs_trans_buf.c index 552d0869aa0fe..93d62cb864c15 100644 --- a/fs/xfs/xfs_trans_buf.c +++ b/fs/xfs/xfs_trans_buf.c @@ -788,5 +788,6 @@ xfs_trans_dquot_buf( break; } + bp->b_flags |= _XBF_DQUOTS; xfs_trans_buf_set_type(tp, bp, type); } From patchwork Thu Jun 4 07:45:42 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Dave Chinner X-Patchwork-Id: 11587227 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 58AAE1392 for ; Thu, 4 Jun 2020 07:46:12 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 45D512072E for ; Thu, 4 Jun 2020 07:46:12 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1727768AbgFDHqL (ORCPT ); Thu, 4 Jun 2020 03:46:11 -0400 Received: from mail104.syd.optusnet.com.au ([211.29.132.246]:33589 "EHLO mail104.syd.optusnet.com.au" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726537AbgFDHqL (ORCPT ); Thu, 4 Jun 2020 03:46:11 -0400 Received: from dread.disaster.area (pa49-180-124-177.pa.nsw.optusnet.com.au [49.180.124.177]) by mail104.syd.optusnet.com.au (Postfix) with ESMTPS id 5724382122B for ; Thu, 4 Jun 2020 17:46:08 +1000 (AEST) Received: from discord.disaster.area ([192.168.253.110]) by dread.disaster.area with esmtp (Exim 4.92.3) (envelope-from ) id 1jgkZj-00049k-KS for linux-xfs@vger.kernel.org; Thu, 04 Jun 2020 17:46:07 +1000 Received: from dave by discord.disaster.area with local (Exim 4.93) (envelope-from ) id 1jgkZj-0017Gw-BS for linux-xfs@vger.kernel.org; Thu, 04 Jun 2020 17:46:07 +1000 From: Dave Chinner To: linux-xfs@vger.kernel.org Subject: [PATCH 06/30] xfs: mark log recovery buffers for completion Date: Thu, 4 Jun 2020 17:45:42 +1000 Message-Id: <20200604074606.266213-7-david@fromorbit.com> X-Mailer: git-send-email 2.26.2.761.g0e0b3e54be In-Reply-To: <20200604074606.266213-1-david@fromorbit.com> References: <20200604074606.266213-1-david@fromorbit.com> MIME-Version: 1.0 X-Optus-CM-Score: 0 X-Optus-CM-Analysis: v=2.3 cv=QIgWuTDL c=1 sm=1 tr=0 a=k3aV/LVJup6ZGWgigO6cSA==:117 a=k3aV/LVJup6ZGWgigO6cSA==:17 a=nTHF0DUjJn0A:10 a=20KFwNOVAAAA:8 a=yPCof4ZbAAAA:8 a=J8PYUobFq9NN7i4cLVkA:9 Sender: linux-xfs-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-xfs@vger.kernel.org From: Dave Chinner Log recovery has its own buffer write completion handler for buffers that it directly recovers. Convert these to direct calls by flagging these buffers as being log recovery buffers. The flag will get cleared by the log recovery IO completion routine, so it will never leak out of log recovery. Signed-off-by: Dave Chinner Reviewed-by: Brian Foster Reviewed-by: Darrick J.
Wong --- fs/xfs/xfs_buf.c | 10 ++++++++++ fs/xfs/xfs_buf.h | 2 ++ fs/xfs/xfs_buf_item_recover.c | 5 ++--- fs/xfs/xfs_dquot_item_recover.c | 2 +- fs/xfs/xfs_inode_item_recover.c | 2 +- fs/xfs/xfs_log_recover.c | 5 ++--- 6 files changed, 18 insertions(+), 8 deletions(-) diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c index 3bffde8640a52..0a69de674af9d 100644 --- a/fs/xfs/xfs_buf.c +++ b/fs/xfs/xfs_buf.c @@ -14,6 +14,7 @@ #include "xfs_mount.h" #include "xfs_trace.h" #include "xfs_log.h" +#include "xfs_log_recover.h" #include "xfs_trans.h" #include "xfs_buf_item.h" #include "xfs_errortag.h" @@ -1207,6 +1208,15 @@ xfs_buf_ioend( if (read) goto out_finish; + /* + * If this is a log recovery buffer, we aren't doing transactional IO + * yet so we need to let it handle IO completions. + */ + if (bp->b_flags & _XBF_LOGRECOVERY) { + xlog_recover_iodone(bp); + return; + } + if (bp->b_flags & _XBF_INODES) { xfs_buf_inode_iodone(bp); return; diff --git a/fs/xfs/xfs_buf.h b/fs/xfs/xfs_buf.h index c1d0843206dd6..30dabc5bae96d 100644 --- a/fs/xfs/xfs_buf.h +++ b/fs/xfs/xfs_buf.h @@ -33,6 +33,7 @@ /* buffer type flags for write callbacks */ #define _XBF_INODES (1 << 16)/* inode buffer */ #define _XBF_DQUOTS (1 << 17)/* dquot buffer */ +#define _XBF_LOGRECOVERY (1 << 18)/* log recovery buffer */ /* flags used only internally */ #define _XBF_PAGES (1 << 20)/* backed by refcounted pages */ @@ -56,6 +57,7 @@ typedef unsigned int xfs_buf_flags_t; { XBF_WRITE_FAIL, "WRITE_FAIL" }, \ { _XBF_INODES, "INODES" }, \ { _XBF_DQUOTS, "DQUOTS" }, \ + { _XBF_LOGRECOVERY, "LOG_RECOVERY" }, \ { _XBF_PAGES, "PAGES" }, \ { _XBF_KMEM, "KMEM" }, \ { _XBF_DELWRI_Q, "DELWRI_Q" }, \ diff --git a/fs/xfs/xfs_buf_item_recover.c b/fs/xfs/xfs_buf_item_recover.c index 04faa7310c4f0..74c851f60eeeb 100644 --- a/fs/xfs/xfs_buf_item_recover.c +++ b/fs/xfs/xfs_buf_item_recover.c @@ -419,8 +419,7 @@ xlog_recover_validate_buf_type( if (bp->b_ops) { struct xfs_buf_log_item *bip; - ASSERT(!bp->b_iodone || bp->b_iodone == xlog_recover_iodone); - bp->b_iodone = xlog_recover_iodone; + bp->b_flags |= _XBF_LOGRECOVERY; xfs_buf_item_init(bp, mp); bip = bp->b_log_item; bip->bli_item.li_lsn = current_lsn; @@ -963,7 +962,7 @@ xlog_recover_buf_commit_pass2( error = xfs_bwrite(bp); } else { ASSERT(bp->b_mount == mp); - bp->b_iodone = xlog_recover_iodone; + bp->b_flags |= _XBF_LOGRECOVERY; xfs_buf_delwri_queue(bp, buffer_list); } diff --git a/fs/xfs/xfs_dquot_item_recover.c b/fs/xfs/xfs_dquot_item_recover.c index 3400be4c88f08..f9ea9f55aa7cc 100644 --- a/fs/xfs/xfs_dquot_item_recover.c +++ b/fs/xfs/xfs_dquot_item_recover.c @@ -153,7 +153,7 @@ xlog_recover_dquot_commit_pass2( ASSERT(dq_f->qlf_size == 2); ASSERT(bp->b_mount == mp); - bp->b_iodone = xlog_recover_iodone; + bp->b_flags |= _XBF_LOGRECOVERY; xfs_buf_delwri_queue(bp, buffer_list); out_release: diff --git a/fs/xfs/xfs_inode_item_recover.c b/fs/xfs/xfs_inode_item_recover.c index dc3e26ff16c90..5e0d291835b35 100644 --- a/fs/xfs/xfs_inode_item_recover.c +++ b/fs/xfs/xfs_inode_item_recover.c @@ -376,7 +376,7 @@ xlog_recover_inode_commit_pass2( xfs_dinode_calc_crc(log->l_mp, dip); ASSERT(bp->b_mount == mp); - bp->b_iodone = xlog_recover_iodone; + bp->b_flags |= _XBF_LOGRECOVERY; xfs_buf_delwri_queue(bp, buffer_list); out_release: diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c index ec015df55b77a..52a65a74208ff 100644 --- a/fs/xfs/xfs_log_recover.c +++ b/fs/xfs/xfs_log_recover.c @@ -287,9 +287,8 @@ xlog_recover_iodone( if (bp->b_log_item) xfs_buf_item_relse(bp); ASSERT(bp->b_log_item == 
NULL); - - bp->b_iodone = NULL; - xfs_buf_ioend(bp); + bp->b_flags &= ~_XBF_LOGRECOVERY; + xfs_buf_ioend_finish(bp); } /* From patchwork Thu Jun 4 07:45:43 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Dave Chinner X-Patchwork-Id: 11587235 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 89E72913 for ; Thu, 4 Jun 2020 07:46:13 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 79A332072E for ; Thu, 4 Jun 2020 07:46:13 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1727829AbgFDHqM (ORCPT ); Thu, 4 Jun 2020 03:46:12 -0400 Received: from mail107.syd.optusnet.com.au ([211.29.132.53]:48562 "EHLO mail107.syd.optusnet.com.au" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726047AbgFDHqM (ORCPT ); Thu, 4 Jun 2020 03:46:12 -0400 Received: from dread.disaster.area (pa49-180-124-177.pa.nsw.optusnet.com.au [49.180.124.177]) by mail107.syd.optusnet.com.au (Postfix) with ESMTPS id 3D1FAD5986D for ; Thu, 4 Jun 2020 17:46:08 +1000 (AEST) Received: from discord.disaster.area ([192.168.253.110]) by dread.disaster.area with esmtp (Exim 4.92.3) (envelope-from ) id 1jgkZj-00049o-Lx for linux-xfs@vger.kernel.org; Thu, 04 Jun 2020 17:46:07 +1000 Received: from dave by discord.disaster.area with local (Exim 4.93) (envelope-from ) id 1jgkZj-0017H1-Cj for linux-xfs@vger.kernel.org; Thu, 04 Jun 2020 17:46:07 +1000 From: Dave Chinner To: linux-xfs@vger.kernel.org Subject: [PATCH 07/30] xfs: call xfs_buf_iodone directly Date: Thu, 4 Jun 2020 17:45:43 +1000 Message-Id: <20200604074606.266213-8-david@fromorbit.com> X-Mailer: git-send-email 2.26.2.761.g0e0b3e54be In-Reply-To: <20200604074606.266213-1-david@fromorbit.com> References: <20200604074606.266213-1-david@fromorbit.com> MIME-Version: 1.0 X-Optus-CM-Score: 0 X-Optus-CM-Analysis: v=2.3 cv=QIgWuTDL c=1 sm=1 tr=0 a=k3aV/LVJup6ZGWgigO6cSA==:117 a=k3aV/LVJup6ZGWgigO6cSA==:17 a=nTHF0DUjJn0A:10 a=20KFwNOVAAAA:8 a=yPCof4ZbAAAA:8 a=pGLkceISAAAA:8 a=IyPPYMNQrFCbev4FMNYA:9 a=wfjQprMUrlhGnqmd:21 a=cmJELCyg_pleF8qG:21 Sender: linux-xfs-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-xfs@vger.kernel.org From: Dave Chinner All unmarked dirty buffers should be in the AIL and have log items attached to them. Hence when they are written, we will run a callback to remove the item from the AIL if appropriate. Now that we've handled inode and dquot buffers, all remaining calls are to xfs_buf_iodone() and so we can hard code this rather than use an indirect call. Signed-off-by: Dave Chinner Reviewed-by: Darrick J. 
Wong Reviewed-by: Amir Goldstein Reviewed-by: Brian Foster --- fs/xfs/xfs_buf.c | 24 ++++++++---------------- fs/xfs/xfs_buf.h | 6 +----- fs/xfs/xfs_buf_item.c | 40 ++++++++++------------------------------ fs/xfs/xfs_buf_item.h | 4 ++-- fs/xfs/xfs_trans_buf.c | 13 +++---------- 5 files changed, 24 insertions(+), 63 deletions(-) diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c index 0a69de674af9d..d7695b638e994 100644 --- a/fs/xfs/xfs_buf.c +++ b/fs/xfs/xfs_buf.c @@ -658,7 +658,6 @@ xfs_buf_find( */ if (bp->b_flags & XBF_STALE) { ASSERT((bp->b_flags & _XBF_DELWRI_Q) == 0); - ASSERT(bp->b_iodone == NULL); bp->b_flags &= _XBF_KMEM | _XBF_PAGES; bp->b_ops = NULL; } @@ -1194,10 +1193,13 @@ xfs_buf_ioend( if (!bp->b_error && bp->b_io_error) xfs_buf_ioerror(bp, bp->b_io_error); - /* Only validate buffers that were read without errors */ - if (read && !bp->b_error && bp->b_ops) { - ASSERT(!bp->b_iodone); - bp->b_ops->verify_read(bp); + if (read) { + if (!bp->b_error && bp->b_ops) + bp->b_ops->verify_read(bp); + if (!bp->b_error) + bp->b_flags |= XBF_DONE; + xfs_buf_ioend_finish(bp); + return; } if (!bp->b_error) { @@ -1205,9 +1207,6 @@ xfs_buf_ioend( bp->b_flags |= XBF_DONE; } - if (read) - goto out_finish; - /* * If this is a log recovery buffer, we aren't doing transactional IO * yet so we need to let it handle IO completions. @@ -1226,14 +1225,7 @@ xfs_buf_ioend( xfs_buf_dquot_iodone(bp); return; } - - if (bp->b_iodone) { - (*(bp->b_iodone))(bp); - return; - } - -out_finish: - xfs_buf_ioend_finish(bp); + xfs_buf_iodone(bp); } static void diff --git a/fs/xfs/xfs_buf.h b/fs/xfs/xfs_buf.h index 30dabc5bae96d..755b652e695ac 100644 --- a/fs/xfs/xfs_buf.h +++ b/fs/xfs/xfs_buf.h @@ -18,6 +18,7 @@ /* * Base types */ +struct xfs_buf; #define XFS_BUF_DADDR_NULL ((xfs_daddr_t) (-1LL)) @@ -102,10 +103,6 @@ typedef struct xfs_buftarg { struct ratelimit_state bt_ioerror_rl; } xfs_buftarg_t; -struct xfs_buf; -typedef void (*xfs_buf_iodone_t)(struct xfs_buf *); - - #define XB_PAGES 2 struct xfs_buf_map { @@ -158,7 +155,6 @@ typedef struct xfs_buf { xfs_buftarg_t *b_target; /* buffer target (device) */ void *b_addr; /* virtual address of buffer */ struct work_struct b_ioend_work; - xfs_buf_iodone_t b_iodone; /* I/O completion function */ struct completion b_iowait; /* queue for I/O waiters */ struct xfs_buf_log_item *b_log_item; struct list_head b_li_list; /* Log items list head */ diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c index a42cdf9ccc47d..d87ae6363a130 100644 --- a/fs/xfs/xfs_buf_item.c +++ b/fs/xfs/xfs_buf_item.c @@ -460,7 +460,6 @@ xfs_buf_item_unpin( xfs_buf_do_callbacks(bp); bp->b_log_item = NULL; list_del_init(&bp->b_li_list); - bp->b_iodone = NULL; } else { xfs_trans_ail_delete(lip, SHUTDOWN_LOG_IO_ERROR); xfs_buf_item_relse(bp); @@ -936,11 +935,7 @@ xfs_buf_item_free( } /* - * This is called when the buf log item is no longer needed. It should - * free the buf log item associated with the given buffer and clear - * the buffer's pointer to the buf log item. If there are no more - * items in the list, clear the b_iodone field of the buffer (see - * xfs_buf_attach_iodone() below). + * xfs_buf_item_relse() is called when the buf log item is no longer needed. 
*/ void xfs_buf_item_relse( @@ -952,9 +947,6 @@ xfs_buf_item_relse( ASSERT(!test_bit(XFS_LI_IN_AIL, &bip->bli_item.li_flags)); bp->b_log_item = NULL; - if (list_empty(&bp->b_li_list)) - bp->b_iodone = NULL; - xfs_buf_rele(bp); xfs_buf_item_free(bip); } @@ -962,10 +954,7 @@ xfs_buf_item_relse( /* * Add the given log item with its callback to the list of callbacks - * to be called when the buffer's I/O completes. If it is not set - * already, set the buffer's b_iodone() routine to be - * xfs_buf_iodone_callbacks() and link the log item into the list of - * items rooted at b_li_list. + * to be called when the buffer's I/O completes. */ void xfs_buf_attach_iodone( @@ -977,10 +966,6 @@ xfs_buf_attach_iodone( lip->li_cb = cb; list_add_tail(&lip->li_bio_list, &bp->b_li_list); - - ASSERT(bp->b_iodone == NULL || - bp->b_iodone == xfs_buf_iodone_callbacks); - bp->b_iodone = xfs_buf_iodone_callbacks; } /* @@ -1096,7 +1081,6 @@ xfs_buf_iodone_callback_error( goto out_stale; trace_xfs_buf_item_iodone_async(bp, _RET_IP_); - ASSERT(bp->b_iodone != NULL); cfg = xfs_error_get_cfg(mp, XFS_ERR_METADATA, bp->b_error); @@ -1182,28 +1166,24 @@ xfs_buf_run_callbacks( xfs_buf_do_callbacks(bp); bp->b_log_item = NULL; list_del_init(&bp->b_li_list); - bp->b_iodone = NULL; } /* - * This is the iodone() function for buffers which have had callbacks attached - * to them by xfs_buf_attach_iodone(). We need to iterate the items on the - * callback list, mark the buffer as having no more callbacks and then push the - * buffer through IO completion processing. + * Inode buffer iodone callback function. */ void -xfs_buf_iodone_callbacks( +xfs_buf_inode_iodone( struct xfs_buf *bp) { xfs_buf_run_callbacks(bp); - xfs_buf_ioend(bp); + xfs_buf_ioend_finish(bp); } /* - * Inode buffer iodone callback function. + * Dquot buffer iodone callback function. */ void -xfs_buf_inode_iodone( +xfs_buf_dquot_iodone( struct xfs_buf *bp) { xfs_buf_run_callbacks(bp); @@ -1211,10 +1191,10 @@ xfs_buf_inode_iodone( } /* - * Dquot buffer iodone callback function. + * Dirty buffer iodone callback function. */ void -xfs_buf_dquot_iodone( +xfs_buf_iodone( struct xfs_buf *bp) { xfs_buf_run_callbacks(bp); @@ -1229,7 +1209,7 @@ xfs_buf_dquot_iodone( * care of cleaning up the buffer itself. 
*/ void -xfs_buf_iodone( +xfs_buf_item_iodone( struct xfs_buf *bp, struct xfs_log_item *lip) { diff --git a/fs/xfs/xfs_buf_item.h b/fs/xfs/xfs_buf_item.h index 27d13d29b5bbb..610cd00193289 100644 --- a/fs/xfs/xfs_buf_item.h +++ b/fs/xfs/xfs_buf_item.h @@ -57,10 +57,10 @@ bool xfs_buf_item_dirty_format(struct xfs_buf_log_item *); void xfs_buf_attach_iodone(struct xfs_buf *, void(*)(struct xfs_buf *, struct xfs_log_item *), struct xfs_log_item *); -void xfs_buf_iodone_callbacks(struct xfs_buf *); -void xfs_buf_iodone(struct xfs_buf *, struct xfs_log_item *); +void xfs_buf_item_iodone(struct xfs_buf *, struct xfs_log_item *); void xfs_buf_inode_iodone(struct xfs_buf *); void xfs_buf_dquot_iodone(struct xfs_buf *); +void xfs_buf_iodone(struct xfs_buf *); bool xfs_buf_log_check_iovec(struct xfs_log_iovec *iovec); extern kmem_zone_t *xfs_buf_item_zone; diff --git a/fs/xfs/xfs_trans_buf.c b/fs/xfs/xfs_trans_buf.c index 93d62cb864c15..6752676b94fe7 100644 --- a/fs/xfs/xfs_trans_buf.c +++ b/fs/xfs/xfs_trans_buf.c @@ -465,24 +465,17 @@ xfs_trans_dirty_buf( ASSERT(bp->b_transp == tp); ASSERT(bip != NULL); - ASSERT(bp->b_iodone == NULL || - bp->b_iodone == xfs_buf_iodone_callbacks); /* * Mark the buffer as needing to be written out eventually, * and set its iodone function to remove the buffer's buf log * item from the AIL and free it when the buffer is flushed - * to disk. See xfs_buf_attach_iodone() for more details - * on li_cb and xfs_buf_iodone_callbacks(). - * If we end up aborting this transaction, we trap this buffer - * inside the b_bdstrat callback so that this won't get written to - * disk. + * to disk. */ bp->b_flags |= XBF_DONE; ASSERT(atomic_read(&bip->bli_refcount) > 0); - bp->b_iodone = xfs_buf_iodone_callbacks; - bip->bli_item.li_cb = xfs_buf_iodone; + bip->bli_item.li_cb = xfs_buf_item_iodone; /* * If we invalidated the buffer within this transaction, then @@ -651,7 +644,7 @@ xfs_trans_stale_inode_buf( ASSERT(atomic_read(&bip->bli_refcount) > 0); bip->bli_flags |= XFS_BLI_STALE_INODE; - bip->bli_item.li_cb = xfs_buf_iodone; + bip->bli_item.li_cb = xfs_buf_item_iodone; bp->b_flags |= _XBF_INODES; xfs_trans_buf_set_type(tp, bp, XFS_BLFT_DINO_BUF); } From patchwork Thu Jun 4 07:45:44 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Dave Chinner X-Patchwork-Id: 11587269 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 7E62F159A for ; Thu, 4 Jun 2020 07:46:23 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 665282075B for ; Thu, 4 Jun 2020 07:46:23 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1727839AbgFDHqW (ORCPT ); Thu, 4 Jun 2020 03:46:22 -0400 Received: from mail109.syd.optusnet.com.au ([211.29.132.80]:50662 "EHLO mail109.syd.optusnet.com.au" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1727976AbgFDHqT (ORCPT ); Thu, 4 Jun 2020 03:46:19 -0400 Received: from dread.disaster.area (pa49-180-124-177.pa.nsw.optusnet.com.au [49.180.124.177]) by mail109.syd.optusnet.com.au (Postfix) with ESMTPS id 9EDF2D798CC for ; Thu, 4 Jun 2020 17:46:12 +1000 (AEST) Received: from discord.disaster.area ([192.168.253.110]) by dread.disaster.area with esmtp (Exim 4.92.3) (envelope-from ) id 1jgkZj-00049p-Mf for linux-xfs@vger.kernel.org; Thu, 04 Jun 2020 17:46:07 +1000 Received: from dave by 
discord.disaster.area with local (Exim 4.93) (envelope-from ) id 1jgkZj-0017H6-E3 for linux-xfs@vger.kernel.org; Thu, 04 Jun 2020 17:46:07 +1000 From: Dave Chinner To: linux-xfs@vger.kernel.org Subject: [PATCH 08/30] xfs: clean up whacky buffer log item list reinit Date: Thu, 4 Jun 2020 17:45:44 +1000 Message-Id: <20200604074606.266213-9-david@fromorbit.com> X-Mailer: git-send-email 2.26.2.761.g0e0b3e54be In-Reply-To: <20200604074606.266213-1-david@fromorbit.com> References: <20200604074606.266213-1-david@fromorbit.com> MIME-Version: 1.0 X-Optus-CM-Score: 0 X-Optus-CM-Analysis: v=2.3 cv=W5xGqiek c=1 sm=1 tr=0 a=k3aV/LVJup6ZGWgigO6cSA==:117 a=k3aV/LVJup6ZGWgigO6cSA==:17 a=nTHF0DUjJn0A:10 a=20KFwNOVAAAA:8 a=yPCof4ZbAAAA:8 a=bUt8L1G7oKE4gpPHD4QA:9 Sender: linux-xfs-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-xfs@vger.kernel.org From: Dave Chinner When we've emptied the buffer log item list, we do a list_del_init on the list head to reset its pointers to itself. This is unnecessary as the list is already empty at this point - it was a left-over fragment from the list_head conversion of the buffer log item list. Remove these calls. Signed-off-by: Dave Chinner Reviewed-by: Darrick J. Wong Reviewed-by: Christoph Hellwig Reviewed-by: Brian Foster --- fs/xfs/xfs_buf_item.c | 2 -- 1 file changed, 2 deletions(-) diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c index d87ae6363a130..5b3cd5e90947c 100644 --- a/fs/xfs/xfs_buf_item.c +++ b/fs/xfs/xfs_buf_item.c @@ -459,7 +459,6 @@ xfs_buf_item_unpin( if (bip->bli_flags & XFS_BLI_STALE_INODE) { xfs_buf_do_callbacks(bp); bp->b_log_item = NULL; - list_del_init(&bp->b_li_list); } else { xfs_trans_ail_delete(lip, SHUTDOWN_LOG_IO_ERROR); xfs_buf_item_relse(bp); @@ -1165,7 +1164,6 @@ xfs_buf_run_callbacks( xfs_buf_do_callbacks(bp); bp->b_log_item = NULL; - list_del_init(&bp->b_li_list); } /* From patchwork Thu Jun 4 07:45:45 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Dave Chinner X-Patchwork-Id: 11587261 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 9BE761392 for ; Thu, 4 Jun 2020 07:46:22 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 85C442072E for ; Thu, 4 Jun 2020 07:46:22 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1726246AbgFDHqV (ORCPT ); Thu, 4 Jun 2020 03:46:21 -0400 Received: from mail108.syd.optusnet.com.au ([211.29.132.59]:47464 "EHLO mail108.syd.optusnet.com.au" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1727961AbgFDHqT (ORCPT ); Thu, 4 Jun 2020 03:46:19 -0400 Received: from dread.disaster.area (pa49-180-124-177.pa.nsw.optusnet.com.au [49.180.124.177]) by mail108.syd.optusnet.com.au (Postfix) with ESMTPS id 43D471A82B4 for ; Thu, 4 Jun 2020 17:46:15 +1000 (AEST) Received: from discord.disaster.area ([192.168.253.110]) by dread.disaster.area with esmtp (Exim 4.92.3) (envelope-from ) id 1jgkZj-00049u-OO for linux-xfs@vger.kernel.org; Thu, 04 Jun 2020 17:46:07 +1000 Received: from dave by discord.disaster.area with local (Exim 4.93) (envelope-from ) id 1jgkZj-0017HB-F6 for linux-xfs@vger.kernel.org; Thu, 04 Jun 2020 17:46:07 +1000 From: Dave Chinner To: linux-xfs@vger.kernel.org Subject: [PATCH 09/30] xfs: make inode IO completion buffer centric Date: Thu, 4 Jun 2020 17:45:45 +1000 Message-Id:
<20200604074606.266213-10-david@fromorbit.com> X-Mailer: git-send-email 2.26.2.761.g0e0b3e54be In-Reply-To: <20200604074606.266213-1-david@fromorbit.com> References: <20200604074606.266213-1-david@fromorbit.com> MIME-Version: 1.0 X-Optus-CM-Score: 0 X-Optus-CM-Analysis: v=2.3 cv=X6os11be c=1 sm=1 tr=0 a=k3aV/LVJup6ZGWgigO6cSA==:117 a=k3aV/LVJup6ZGWgigO6cSA==:17 a=nTHF0DUjJn0A:10 a=20KFwNOVAAAA:8 a=yPCof4ZbAAAA:8 a=SUfCrG0rDGvUTUl_RgwA:9 Sender: linux-xfs-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-xfs@vger.kernel.org From: Dave Chinner Having different IO completion callbacks for different inode states makes things complex. We can detect if the inode is stale via the XFS_ISTALE flag in IO completion, so we don't need a special callback just for this. This means inodes only have a single iodone callback, and inode IO completion is entirely buffer centric at this point. Hence we no longer need to use a log item callback at all as we can just call xfs_iflush_done() directly from the buffer completions and walk the buffer log item list to complete all the inodes under IO. Signed-off-by: Dave Chinner Reviewed-by: Christoph Hellwig Reviewed-by: Darrick J. Wong Reviewed-by: Brian Foster --- fs/xfs/xfs_buf_item.c | 35 ++++++++++++++++++---- fs/xfs/xfs_inode.c | 6 ++-- fs/xfs/xfs_inode_item.c | 65 ++++++++++++++--------------------------- fs/xfs/xfs_inode_item.h | 5 ++-- 4 files changed, 56 insertions(+), 55 deletions(-) diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c index 5b3cd5e90947c..a4e416af5c614 100644 --- a/fs/xfs/xfs_buf_item.c +++ b/fs/xfs/xfs_buf_item.c @@ -13,6 +13,8 @@ #include "xfs_mount.h" #include "xfs_trans.h" #include "xfs_buf_item.h" +#include "xfs_inode.h" +#include "xfs_inode_item.h" #include "xfs_trans_priv.h" #include "xfs_trace.h" #include "xfs_log.h" @@ -457,7 +459,8 @@ xfs_buf_item_unpin( * the AIL lock. */ if (bip->bli_flags & XFS_BLI_STALE_INODE) { - xfs_buf_do_callbacks(bp); + lip->li_cb(bp, lip); + xfs_iflush_done(bp); bp->b_log_item = NULL; } else { xfs_trans_ail_delete(lip, SHUTDOWN_LOG_IO_ERROR); @@ -1141,8 +1144,8 @@ xfs_buf_iodone_callback_error( return false; } -static void -xfs_buf_run_callbacks( +static inline bool +xfs_buf_had_callback_errors( struct xfs_buf *bp) { @@ -1152,7 +1155,7 @@ xfs_buf_run_callbacks( * appropriate action. */ if (bp->b_error && xfs_buf_iodone_callback_error(bp)) - return; + return true; /* * Successful IO or permanent error.
Either way, we can clear the @@ -1161,7 +1164,16 @@ xfs_buf_run_callbacks( bp->b_last_error = 0; bp->b_retries = 0; bp->b_first_retry_time = 0; + return false; +} +static void +xfs_buf_run_callbacks( + struct xfs_buf *bp) +{ + + if (xfs_buf_had_callback_errors(bp)) + return; xfs_buf_do_callbacks(bp); bp->b_log_item = NULL; } @@ -1173,7 +1185,20 @@ void xfs_buf_inode_iodone( struct xfs_buf *bp) { - xfs_buf_run_callbacks(bp); + struct xfs_buf_log_item *blip = bp->b_log_item; + struct xfs_log_item *lip; + + if (xfs_buf_had_callback_errors(bp)) + return; + + /* If there is a buf_log_item attached, run its callback */ + if (blip) { + lip = &blip->bli_item; + lip->li_cb(bp, lip); + bp->b_log_item = NULL; + } + + xfs_iflush_done(bp); xfs_buf_ioend_finish(bp); } diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c index d5dee57f914a9..1b4e8e0bb0cf0 100644 --- a/fs/xfs/xfs_inode.c +++ b/fs/xfs/xfs_inode.c @@ -2677,7 +2677,6 @@ xfs_ifree_cluster( list_for_each_entry(lip, &bp->b_li_list, li_bio_list) { if (lip->li_type == XFS_LI_INODE) { iip = (struct xfs_inode_log_item *)lip; - lip->li_cb = xfs_istale_done; xfs_trans_ail_copy_lsn(mp->m_ail, &iip->ili_flush_lsn, &iip->ili_item.li_lsn); @@ -2710,8 +2709,7 @@ xfs_ifree_cluster( xfs_trans_ail_copy_lsn(mp->m_ail, &iip->ili_flush_lsn, &iip->ili_item.li_lsn); - xfs_buf_attach_iodone(bp, xfs_istale_done, - &iip->ili_item); + xfs_buf_attach_iodone(bp, NULL, &iip->ili_item); if (ip != free_ip) xfs_iunlock(ip, XFS_ILOCK_EXCL); @@ -3861,7 +3859,7 @@ xfs_iflush_int( * the flush lock. */ bp->b_flags |= _XBF_INODES; - xfs_buf_attach_iodone(bp, xfs_iflush_done, &iip->ili_item); + xfs_buf_attach_iodone(bp, NULL, &iip->ili_item); /* generate the checksum. */ xfs_dinode_calc_crc(mp, dip); diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c index 6ef9cbcfc94a7..7049f2ae8d186 100644 --- a/fs/xfs/xfs_inode_item.c +++ b/fs/xfs/xfs_inode_item.c @@ -668,40 +668,34 @@ xfs_inode_item_destroy( */ void xfs_iflush_done( - struct xfs_buf *bp, - struct xfs_log_item *lip) + struct xfs_buf *bp) { struct xfs_inode_log_item *iip; - struct xfs_log_item *blip, *n; - struct xfs_ail *ailp = lip->li_ailp; + struct xfs_log_item *lip, *n; + struct xfs_ail *ailp = bp->b_mount->m_ail; int need_ail = 0; LIST_HEAD(tmp); /* - * Scan the buffer IO completions for other inodes being completed and - * attach them to the current inode log item. + * Pull the attached inodes from the buffer one at a time and take the + * appropriate action on them. */ - - list_add_tail(&lip->li_bio_list, &tmp); - - list_for_each_entry_safe(blip, n, &bp->b_li_list, li_bio_list) { - if (lip->li_cb != xfs_iflush_done) + list_for_each_entry_safe(lip, n, &bp->b_li_list, li_bio_list) { + iip = INODE_ITEM(lip); + if (xfs_iflags_test(iip->ili_inode, XFS_ISTALE)) { + list_del_init(&lip->li_bio_list); + xfs_iflush_abort(iip->ili_inode); continue; + } - list_move_tail(&blip->li_bio_list, &tmp); + list_move_tail(&lip->li_bio_list, &tmp); /* Do an unlocked check for needing the AIL lock. */ - iip = INODE_ITEM(blip); - if (blip->li_lsn == iip->ili_flush_lsn || - test_bit(XFS_LI_FAILED, &blip->li_flags)) + if (lip->li_lsn == iip->ili_flush_lsn || + test_bit(XFS_LI_FAILED, &lip->li_flags)) need_ail++; } - - /* make sure we capture the state of the initial inode. 
*/ - iip = INODE_ITEM(lip); - if (lip->li_lsn == iip->ili_flush_lsn || - test_bit(XFS_LI_FAILED, &lip->li_flags)) - need_ail++; + ASSERT(list_empty(&bp->b_li_list)); /* * We only want to pull the item from the AIL if it is actually there @@ -713,19 +707,13 @@ xfs_iflush_done( /* this is an opencoded batch version of xfs_trans_ail_delete */ spin_lock(&ailp->ail_lock); - list_for_each_entry(blip, &tmp, li_bio_list) { - if (blip->li_lsn == INODE_ITEM(blip)->ili_flush_lsn) { - /* - * xfs_ail_update_finish() only cares about the - * lsn of the first tail item removed, any - * others will be at the same or higher lsn so - * we just ignore them. - */ - xfs_lsn_t lsn = xfs_ail_delete_one(ailp, blip); + list_for_each_entry(lip, &tmp, li_bio_list) { + if (lip->li_lsn == INODE_ITEM(lip)->ili_flush_lsn) { + xfs_lsn_t lsn = xfs_ail_delete_one(ailp, lip); if (!tail_lsn && lsn) tail_lsn = lsn; } else { - xfs_clear_li_failed(blip); + xfs_clear_li_failed(lip); } } xfs_ail_update_finish(ailp, tail_lsn); @@ -736,9 +724,9 @@ xfs_iflush_done( * ili_last_fields bits now that we know that the data corresponding to * them is safely on disk. */ - list_for_each_entry_safe(blip, n, &tmp, li_bio_list) { - list_del_init(&blip->li_bio_list); - iip = INODE_ITEM(blip); + list_for_each_entry_safe(lip, n, &tmp, li_bio_list) { + list_del_init(&lip->li_bio_list); + iip = INODE_ITEM(lip); spin_lock(&iip->ili_lock); iip->ili_last_fields = 0; @@ -746,7 +734,6 @@ xfs_iflush_done( xfs_ifunlock(iip->ili_inode); } - list_del(&tmp); } /* @@ -779,14 +766,6 @@ xfs_iflush_abort( xfs_ifunlock(ip); } -void -xfs_istale_done( - struct xfs_buf *bp, - struct xfs_log_item *lip) -{ - xfs_iflush_abort(INODE_ITEM(lip)->ili_inode); -} - /* * convert an xfs_inode_log_format struct from the old 32 bit version * (which can have different field alignments) to the native 64 bit version diff --git a/fs/xfs/xfs_inode_item.h b/fs/xfs/xfs_inode_item.h index 4a10a1b92ee99..048b5e7dee901 100644 --- a/fs/xfs/xfs_inode_item.h +++ b/fs/xfs/xfs_inode_item.h @@ -36,15 +36,14 @@ struct xfs_inode_log_item { xfs_lsn_t ili_last_lsn; /* lsn at last transaction */ }; -static inline int xfs_inode_clean(xfs_inode_t *ip) +static inline int xfs_inode_clean(struct xfs_inode *ip) { return !ip->i_itemp || !(ip->i_itemp->ili_fields & XFS_ILOG_ALL); } extern void xfs_inode_item_init(struct xfs_inode *, struct xfs_mount *); extern void xfs_inode_item_destroy(struct xfs_inode *); -extern void xfs_iflush_done(struct xfs_buf *, struct xfs_log_item *); -extern void xfs_istale_done(struct xfs_buf *, struct xfs_log_item *); +extern void xfs_iflush_done(struct xfs_buf *); extern void xfs_iflush_abort(struct xfs_inode *); extern int xfs_inode_item_format_convert(xfs_log_iovec_t *, struct xfs_inode_log_format *); From patchwork Thu Jun 4 07:45:46 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Dave Chinner X-Patchwork-Id: 11587263 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id AED8B159A for ; Thu, 4 Jun 2020 07:46:22 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id A0697206C3 for ; Thu, 4 Jun 2020 07:46:22 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1727961AbgFDHqV (ORCPT ); Thu, 4 Jun 2020 03:46:21 -0400 Received: from mail109.syd.optusnet.com.au ([211.29.132.80]:50660 "EHLO 
mail109.syd.optusnet.com.au" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1727777AbgFDHqT (ORCPT ); Thu, 4 Jun 2020 03:46:19 -0400 Received: from dread.disaster.area (pa49-180-124-177.pa.nsw.optusnet.com.au [49.180.124.177]) by mail109.syd.optusnet.com.au (Postfix) with ESMTPS id B7BE1D79C3A for ; Thu, 4 Jun 2020 17:46:12 +1000 (AEST) Received: from discord.disaster.area ([192.168.253.110]) by dread.disaster.area with esmtp (Exim 4.92.3) (envelope-from ) id 1jgkZj-00049w-Pt for linux-xfs@vger.kernel.org; Thu, 04 Jun 2020 17:46:07 +1000 Received: from dave by discord.disaster.area with local (Exim 4.93) (envelope-from ) id 1jgkZj-0017HG-Gw for linux-xfs@vger.kernel.org; Thu, 04 Jun 2020 17:46:07 +1000 From: Dave Chinner To: linux-xfs@vger.kernel.org Subject: [PATCH 10/30] xfs: use direct calls for dquot IO completion Date: Thu, 4 Jun 2020 17:45:46 +1000 Message-Id: <20200604074606.266213-11-david@fromorbit.com> X-Mailer: git-send-email 2.26.2.761.g0e0b3e54be In-Reply-To: <20200604074606.266213-1-david@fromorbit.com> References: <20200604074606.266213-1-david@fromorbit.com> MIME-Version: 1.0 X-Optus-CM-Score: 0 X-Optus-CM-Analysis: v=2.3 cv=W5xGqiek c=1 sm=1 tr=0 a=k3aV/LVJup6ZGWgigO6cSA==:117 a=k3aV/LVJup6ZGWgigO6cSA==:17 a=nTHF0DUjJn0A:10 a=20KFwNOVAAAA:8 a=yPCof4ZbAAAA:8 a=RC6LCOMzik_5eL-T5OoA:9 Sender: linux-xfs-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-xfs@vger.kernel.org From: Dave Chinner Similar to inodes, we can call the dquot IO completion functions directly from the buffer completion code, removing another user of log item callbacks for IO completion processing. Signed-off-by: Dave Chinner Reviewed-by: Brian Foster Reviewed-by: Darrick J. Wong --- fs/xfs/xfs_buf_item.c | 18 +++++++++++++++++- fs/xfs/xfs_dquot.c | 18 ++++++++++++++---- fs/xfs/xfs_dquot.h | 1 + 3 files changed, 32 insertions(+), 5 deletions(-) diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c index a4e416af5c614..f46e5ec28111c 100644 --- a/fs/xfs/xfs_buf_item.c +++ b/fs/xfs/xfs_buf_item.c @@ -15,6 +15,9 @@ #include "xfs_buf_item.h" #include "xfs_inode.h" #include "xfs_inode_item.h" +#include "xfs_quota.h" +#include "xfs_dquot_item.h" +#include "xfs_dquot.h" #include "xfs_trans_priv.h" #include "xfs_trace.h" #include "xfs_log.h" @@ -1209,7 +1212,20 @@ void xfs_buf_dquot_iodone( struct xfs_buf *bp) { - xfs_buf_run_callbacks(bp); + struct xfs_buf_log_item *blip = bp->b_log_item; + struct xfs_log_item *lip; + + if (xfs_buf_had_callback_errors(bp)) + return; + + /* a newly allocated dquot buffer might have a log item attached */ + if (blip) { + lip = &blip->bli_item; + lip->li_cb(bp, lip); + bp->b_log_item = NULL; + } + + xfs_dquot_done(bp); xfs_buf_ioend_finish(bp); } diff --git a/fs/xfs/xfs_dquot.c b/fs/xfs/xfs_dquot.c index 2e2146fa0914c..403bc4e9f21ff 100644 --- a/fs/xfs/xfs_dquot.c +++ b/fs/xfs/xfs_dquot.c @@ -1048,9 +1048,8 @@ xfs_qm_dqrele( * from the AIL if it has not been re-logged, and unlocking the dquot's * flush lock. This behavior is very similar to that of inodes.. */ -STATIC void +static void xfs_qm_dqflush_done( - struct xfs_buf *bp, struct xfs_log_item *lip) { struct xfs_dq_logitem *qip = (struct xfs_dq_logitem *)lip; @@ -1091,6 +1090,18 @@ xfs_qm_dqflush_done( xfs_dqfunlock(dqp); } +void +xfs_dquot_done( + struct xfs_buf *bp) +{ + struct xfs_log_item *lip, *n; + + list_for_each_entry_safe(lip, n, &bp->b_li_list, li_bio_list) { + list_del_init(&lip->li_bio_list); + xfs_qm_dqflush_done(lip); + } +} + /* * Write a modified dquot to disk. 
* The dquot must be locked and the flush lock too taken by caller. @@ -1180,8 +1191,7 @@ xfs_qm_dqflush( * AIL and release the flush lock once the dquot is synced to disk. */ bp->b_flags |= _XBF_DQUOTS; - xfs_buf_attach_iodone(bp, xfs_qm_dqflush_done, - &dqp->q_logitem.qli_item); + xfs_buf_attach_iodone(bp, NULL, &dqp->q_logitem.qli_item); /* * If the buffer is pinned then push on the log so we won't diff --git a/fs/xfs/xfs_dquot.h b/fs/xfs/xfs_dquot.h index 71e36c85e20b6..fe9cc3e08ed6d 100644 --- a/fs/xfs/xfs_dquot.h +++ b/fs/xfs/xfs_dquot.h @@ -174,6 +174,7 @@ void xfs_qm_dqput(struct xfs_dquot *dqp); void xfs_dqlock2(struct xfs_dquot *, struct xfs_dquot *); void xfs_dquot_set_prealloc_limits(struct xfs_dquot *); +void xfs_dquot_done(struct xfs_buf *); static inline struct xfs_dquot *xfs_qm_dqhold(struct xfs_dquot *dqp) { From patchwork Thu Jun 4 07:45:47 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Dave Chinner X-Patchwork-Id: 11587287 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 337C1159A for ; Thu, 4 Jun 2020 07:46:27 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 260AF2072E for ; Thu, 4 Jun 2020 07:46:27 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1728035AbgFDHqZ (ORCPT ); Thu, 4 Jun 2020 03:46:25 -0400 Received: from mail105.syd.optusnet.com.au ([211.29.132.249]:41580 "EHLO mail105.syd.optusnet.com.au" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1728025AbgFDHqY (ORCPT ); Thu, 4 Jun 2020 03:46:24 -0400 Received: from dread.disaster.area (pa49-180-124-177.pa.nsw.optusnet.com.au [49.180.124.177]) by mail105.syd.optusnet.com.au (Postfix) with ESMTPS id 6247E3A436B for ; Thu, 4 Jun 2020 17:46:13 +1000 (AEST) Received: from discord.disaster.area ([192.168.253.110]) by dread.disaster.area with esmtp (Exim 4.92.3) (envelope-from ) id 1jgkZj-0004A0-Rd for linux-xfs@vger.kernel.org; Thu, 04 Jun 2020 17:46:07 +1000 Received: from dave by discord.disaster.area with local (Exim 4.93) (envelope-from ) id 1jgkZj-0017HL-IO for linux-xfs@vger.kernel.org; Thu, 04 Jun 2020 17:46:07 +1000 From: Dave Chinner To: linux-xfs@vger.kernel.org Subject: [PATCH 11/30] xfs: clean up the buffer iodone callback functions Date: Thu, 4 Jun 2020 17:45:47 +1000 Message-Id: <20200604074606.266213-12-david@fromorbit.com> X-Mailer: git-send-email 2.26.2.761.g0e0b3e54be In-Reply-To: <20200604074606.266213-1-david@fromorbit.com> References: <20200604074606.266213-1-david@fromorbit.com> MIME-Version: 1.0 X-Optus-CM-Score: 0 X-Optus-CM-Analysis: v=2.3 cv=X6os11be c=1 sm=1 tr=0 a=k3aV/LVJup6ZGWgigO6cSA==:117 a=k3aV/LVJup6ZGWgigO6cSA==:17 a=nTHF0DUjJn0A:10 a=20KFwNOVAAAA:8 a=yPCof4ZbAAAA:8 a=QrwaDmlbODsJXtoaXeEA:9 Sender: linux-xfs-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-xfs@vger.kernel.org From: Dave Chinner Now that we've sorted inode and dquot buffers, we can apply the same cleanups to dirty buffers with buffer log items. They only have one callback, too, so we don't need the log item callback. Collapse the iodone functions and remove all the now unnecessary infrastructure around callback processing. Signed-off-by: Dave Chinner Reviewed-by: Christoph Hellwig Reviewed-by: Darrick J. 
Wong Reviewed-by: Brian Foster --- fs/xfs/xfs_buf_item.c | 140 +++++++++-------------------------------- fs/xfs/xfs_buf_item.h | 1 - fs/xfs/xfs_trans_buf.c | 2 - 3 files changed, 29 insertions(+), 114 deletions(-) diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c index f46e5ec28111c..0ece5de9dd711 100644 --- a/fs/xfs/xfs_buf_item.c +++ b/fs/xfs/xfs_buf_item.c @@ -30,7 +30,7 @@ static inline struct xfs_buf_log_item *BUF_ITEM(struct xfs_log_item *lip) return container_of(lip, struct xfs_buf_log_item, bli_item); } -STATIC void xfs_buf_do_callbacks(struct xfs_buf *bp); +static void xfs_buf_item_done(struct xfs_buf *bp); /* Is this log iovec plausibly large enough to contain the buffer log format? */ bool @@ -462,9 +462,8 @@ xfs_buf_item_unpin( * the AIL lock. */ if (bip->bli_flags & XFS_BLI_STALE_INODE) { - lip->li_cb(bp, lip); + xfs_buf_item_done(bp); xfs_iflush_done(bp); - bp->b_log_item = NULL; } else { xfs_trans_ail_delete(lip, SHUTDOWN_LOG_IO_ERROR); xfs_buf_item_relse(bp); @@ -973,46 +972,6 @@ xfs_buf_attach_iodone( list_add_tail(&lip->li_bio_list, &bp->b_li_list); } -/* - * We can have many callbacks on a buffer. Running the callbacks individually - * can cause a lot of contention on the AIL lock, so we allow for a single - * callback to be able to scan the remaining items in bp->b_li_list for other - * items of the same type and callback to be processed in the first call. - * - * As a result, the loop walking the callback list below will also modify the - * list. it removes the first item from the list and then runs the callback. - * The loop then restarts from the new first item int the list. This allows the - * callback to scan and modify the list attached to the buffer and we don't - * have to care about maintaining a next item pointer. - */ -STATIC void -xfs_buf_do_callbacks( - struct xfs_buf *bp) -{ - struct xfs_buf_log_item *blip = bp->b_log_item; - struct xfs_log_item *lip; - - /* If there is a buf_log_item attached, run its callback */ - if (blip) { - lip = &blip->bli_item; - lip->li_cb(bp, lip); - } - - while (!list_empty(&bp->b_li_list)) { - lip = list_first_entry(&bp->b_li_list, struct xfs_log_item, - li_bio_list); - - /* - * Remove the item from the list, so we don't have any - * confusion if the item is added to another buf. - * Don't touch the log item after calling its - * callback, because it could have freed itself. - */ - list_del_init(&lip->li_bio_list); - lip->li_cb(bp, lip); - } -} - /* * Invoke the error state callback for each log item affected by the failed I/O. * @@ -1025,8 +984,8 @@ STATIC void xfs_buf_do_callbacks_fail( struct xfs_buf *bp) { + struct xfs_ail *ailp = bp->b_mount->m_ail; struct xfs_log_item *lip; - struct xfs_ail *ailp; /* * Buffer log item errors are handled directly by xfs_buf_item_push() @@ -1036,9 +995,6 @@ xfs_buf_do_callbacks_fail( if (list_empty(&bp->b_li_list)) return; - lip = list_first_entry(&bp->b_li_list, struct xfs_log_item, - li_bio_list); - ailp = lip->li_ailp; spin_lock(&ailp->ail_lock); list_for_each_entry(lip, &bp->b_li_list, li_bio_list) { if (lip->li_ops->iop_error) @@ -1051,22 +1007,11 @@ static bool xfs_buf_iodone_callback_error( struct xfs_buf *bp) { - struct xfs_buf_log_item *bip = bp->b_log_item; - struct xfs_log_item *lip; - struct xfs_mount *mp; + struct xfs_mount *mp = bp->b_mount; static ulong lasttime; static xfs_buftarg_t *lasttarg; struct xfs_error_cfg *cfg; - /* - * The failed buffer might not have a buf_log_item attached or the - * log_item list might be empty. 
Get the mp from the available - * xfs_log_item - */ - lip = list_first_entry_or_null(&bp->b_li_list, struct xfs_log_item, - li_bio_list); - mp = lip ? lip->li_mountp : bip->bli_item.li_mountp; - /* * If we've already decided to shutdown the filesystem because of * I/O errors, there's no point in giving this a retry. @@ -1171,14 +1116,27 @@ xfs_buf_had_callback_errors( } static void -xfs_buf_run_callbacks( +xfs_buf_item_done( struct xfs_buf *bp) { + struct xfs_buf_log_item *bip = bp->b_log_item; - if (xfs_buf_had_callback_errors(bp)) + if (!bip) return; - xfs_buf_do_callbacks(bp); + + /* + * If we are forcibly shutting down, this may well be off the AIL + * already. That's because we simulate the log-committed callbacks to + * unpin these buffers. Or we may never have put this item on AIL + * because of the transaction was aborted forcibly. + * xfs_trans_ail_delete() takes care of these. + * + * Either way, AIL is useless if we're forcing a shutdown. + */ + xfs_trans_ail_delete(&bip->bli_item, SHUTDOWN_CORRUPT_INCORE); bp->b_log_item = NULL; + xfs_buf_item_free(bip); + xfs_buf_rele(bp); } /* @@ -1188,19 +1146,10 @@ void xfs_buf_inode_iodone( struct xfs_buf *bp) { - struct xfs_buf_log_item *blip = bp->b_log_item; - struct xfs_log_item *lip; - if (xfs_buf_had_callback_errors(bp)) return; - /* If there is a buf_log_item attached, run its callback */ - if (blip) { - lip = &blip->bli_item; - lip->li_cb(bp, lip); - bp->b_log_item = NULL; - } - + xfs_buf_item_done(bp); xfs_iflush_done(bp); xfs_buf_ioend_finish(bp); } @@ -1212,59 +1161,28 @@ void xfs_buf_dquot_iodone( struct xfs_buf *bp) { - struct xfs_buf_log_item *blip = bp->b_log_item; - struct xfs_log_item *lip; - if (xfs_buf_had_callback_errors(bp)) return; /* a newly allocated dquot buffer might have a log item attached */ - if (blip) { - lip = &blip->bli_item; - lip->li_cb(bp, lip); - bp->b_log_item = NULL; - } - + xfs_buf_item_done(bp); xfs_dquot_done(bp); xfs_buf_ioend_finish(bp); } /* * Dirty buffer iodone callback function. + * + * Note that for things like remote attribute buffers, there may not be a buffer + * log item here, so processing the buffer log item must remain be optional. */ void xfs_buf_iodone( struct xfs_buf *bp) { - xfs_buf_run_callbacks(bp); - xfs_buf_ioend_finish(bp); -} - -/* - * This is the iodone() function for buffers which have been - * logged. It is called when they are eventually flushed out. - * It should remove the buf item from the AIL, and free the buf item. - * It is called by xfs_buf_iodone_callbacks() above which will take - * care of cleaning up the buffer itself. - */ -void -xfs_buf_item_iodone( - struct xfs_buf *bp, - struct xfs_log_item *lip) -{ - ASSERT(BUF_ITEM(lip)->bli_buf == bp); - - xfs_buf_rele(bp); + if (xfs_buf_had_callback_errors(bp)) + return; - /* - * If we are forcibly shutting down, this may well be off the AIL - * already. That's because we simulate the log-committed callbacks to - * unpin these buffers. Or we may never have put this item on AIL - * because of the transaction was aborted forcibly. - * xfs_trans_ail_delete() takes care of these. - * - * Either way, AIL is useless if we're forcing a shutdown. 
- */ - xfs_trans_ail_delete(lip, SHUTDOWN_CORRUPT_INCORE); - xfs_buf_item_free(BUF_ITEM(lip)); + xfs_buf_item_done(bp); + xfs_buf_ioend_finish(bp); } diff --git a/fs/xfs/xfs_buf_item.h b/fs/xfs/xfs_buf_item.h index 610cd00193289..7c0bd2a210aff 100644 --- a/fs/xfs/xfs_buf_item.h +++ b/fs/xfs/xfs_buf_item.h @@ -57,7 +57,6 @@ bool xfs_buf_item_dirty_format(struct xfs_buf_log_item *); void xfs_buf_attach_iodone(struct xfs_buf *, void(*)(struct xfs_buf *, struct xfs_log_item *), struct xfs_log_item *); -void xfs_buf_item_iodone(struct xfs_buf *, struct xfs_log_item *); void xfs_buf_inode_iodone(struct xfs_buf *); void xfs_buf_dquot_iodone(struct xfs_buf *); void xfs_buf_iodone(struct xfs_buf *); diff --git a/fs/xfs/xfs_trans_buf.c b/fs/xfs/xfs_trans_buf.c index 6752676b94fe7..11cd666cd99a6 100644 --- a/fs/xfs/xfs_trans_buf.c +++ b/fs/xfs/xfs_trans_buf.c @@ -475,7 +475,6 @@ xfs_trans_dirty_buf( bp->b_flags |= XBF_DONE; ASSERT(atomic_read(&bip->bli_refcount) > 0); - bip->bli_item.li_cb = xfs_buf_item_iodone; /* * If we invalidated the buffer within this transaction, then @@ -644,7 +643,6 @@ xfs_trans_stale_inode_buf( ASSERT(atomic_read(&bip->bli_refcount) > 0); bip->bli_flags |= XFS_BLI_STALE_INODE; - bip->bli_item.li_cb = xfs_buf_item_iodone; bp->b_flags |= _XBF_INODES; xfs_trans_buf_set_type(tp, bp, XFS_BLFT_DINO_BUF); } From patchwork Thu Jun 4 07:45:48 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Dave Chinner X-Patchwork-Id: 11587283 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 6F2F4175A for ; Thu, 4 Jun 2020 07:46:26 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 6199F2075B for ; Thu, 4 Jun 2020 07:46:26 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1727941AbgFDHqZ (ORCPT ); Thu, 4 Jun 2020 03:46:25 -0400 Received: from mail105.syd.optusnet.com.au ([211.29.132.249]:41764 "EHLO mail105.syd.optusnet.com.au" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1728035AbgFDHqX (ORCPT ); Thu, 4 Jun 2020 03:46:23 -0400 Received: from dread.disaster.area (pa49-180-124-177.pa.nsw.optusnet.com.au [49.180.124.177]) by mail105.syd.optusnet.com.au (Postfix) with ESMTPS id 910753A4510 for ; Thu, 4 Jun 2020 17:46:13 +1000 (AEST) Received: from discord.disaster.area ([192.168.253.110]) by dread.disaster.area with esmtp (Exim 4.92.3) (envelope-from ) id 1jgkZj-0004A3-TZ for linux-xfs@vger.kernel.org; Thu, 04 Jun 2020 17:46:07 +1000 Received: from dave by discord.disaster.area with local (Exim 4.93) (envelope-from ) id 1jgkZj-0017HQ-KO for linux-xfs@vger.kernel.org; Thu, 04 Jun 2020 17:46:07 +1000 From: Dave Chinner To: linux-xfs@vger.kernel.org Subject: [PATCH 12/30] xfs: get rid of log item callbacks Date: Thu, 4 Jun 2020 17:45:48 +1000 Message-Id: <20200604074606.266213-13-david@fromorbit.com> X-Mailer: git-send-email 2.26.2.761.g0e0b3e54be In-Reply-To: <20200604074606.266213-1-david@fromorbit.com> References: <20200604074606.266213-1-david@fromorbit.com> MIME-Version: 1.0 X-Optus-CM-Score: 0 X-Optus-CM-Analysis: v=2.3 cv=QIgWuTDL c=1 sm=1 tr=0 a=k3aV/LVJup6ZGWgigO6cSA==:117 a=k3aV/LVJup6ZGWgigO6cSA==:17 a=nTHF0DUjJn0A:10 a=20KFwNOVAAAA:8 a=yPCof4ZbAAAA:8 a=IxSPRu1M28NFb9wgbagA:9 Sender: linux-xfs-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-xfs@vger.kernel.org From: Dave 
Chinner They are not used anymore, so remove them from the log item and the buffer iodone attachment interfaces. Signed-off-by: Dave Chinner Reviewed-by: Christoph Hellwig Reviewed-by: Darrick J. Wong Reviewed-by: Brian Foster --- fs/xfs/xfs_buf_item.c | 17 ----------------- fs/xfs/xfs_buf_item.h | 3 --- fs/xfs/xfs_dquot.c | 6 +++--- fs/xfs/xfs_inode.c | 5 +++-- fs/xfs/xfs_trans.h | 4 ---- 5 files changed, 6 insertions(+), 29 deletions(-) diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c index 0ece5de9dd711..09bfe9c52dbdb 100644 --- a/fs/xfs/xfs_buf_item.c +++ b/fs/xfs/xfs_buf_item.c @@ -955,23 +955,6 @@ xfs_buf_item_relse( xfs_buf_item_free(bip); } - -/* - * Add the given log item with its callback to the list of callbacks - * to be called when the buffer's I/O completes. - */ -void -xfs_buf_attach_iodone( - struct xfs_buf *bp, - void (*cb)(struct xfs_buf *, struct xfs_log_item *), - struct xfs_log_item *lip) -{ - ASSERT(xfs_buf_islocked(bp)); - - lip->li_cb = cb; - list_add_tail(&lip->li_bio_list, &bp->b_li_list); -} - /* * Invoke the error state callback for each log item affected by the failed I/O. * diff --git a/fs/xfs/xfs_buf_item.h b/fs/xfs/xfs_buf_item.h index 7c0bd2a210aff..23507cbb4c413 100644 --- a/fs/xfs/xfs_buf_item.h +++ b/fs/xfs/xfs_buf_item.h @@ -54,9 +54,6 @@ void xfs_buf_item_relse(struct xfs_buf *); bool xfs_buf_item_put(struct xfs_buf_log_item *); void xfs_buf_item_log(struct xfs_buf_log_item *, uint, uint); bool xfs_buf_item_dirty_format(struct xfs_buf_log_item *); -void xfs_buf_attach_iodone(struct xfs_buf *, - void(*)(struct xfs_buf *, struct xfs_log_item *), - struct xfs_log_item *); void xfs_buf_inode_iodone(struct xfs_buf *); void xfs_buf_dquot_iodone(struct xfs_buf *); void xfs_buf_iodone(struct xfs_buf *); diff --git a/fs/xfs/xfs_dquot.c b/fs/xfs/xfs_dquot.c index 403bc4e9f21ff..d5984a926d1d0 100644 --- a/fs/xfs/xfs_dquot.c +++ b/fs/xfs/xfs_dquot.c @@ -1187,11 +1187,11 @@ xfs_qm_dqflush( } /* - * Attach an iodone routine so that we can remove this dquot from the - * AIL and release the flush lock once the dquot is synced to disk. + * Attach the dquot to the buffer so that we can remove this dquot from + * the AIL and release the flush lock once the dquot is synced to disk. */ bp->b_flags |= _XBF_DQUOTS; - xfs_buf_attach_iodone(bp, NULL, &dqp->q_logitem.qli_item); + list_add_tail(&dqp->q_logitem.qli_item.li_bio_list, &bp->b_li_list); /* * If the buffer is pinned then push on the log so we won't diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c index 1b4e8e0bb0cf0..272b54cf97000 100644 --- a/fs/xfs/xfs_inode.c +++ b/fs/xfs/xfs_inode.c @@ -2709,7 +2709,8 @@ xfs_ifree_cluster( xfs_trans_ail_copy_lsn(mp->m_ail, &iip->ili_flush_lsn, &iip->ili_item.li_lsn); - xfs_buf_attach_iodone(bp, NULL, &iip->ili_item); + list_add_tail(&iip->ili_item.li_bio_list, + &bp->b_li_list); if (ip != free_ip) xfs_iunlock(ip, XFS_ILOCK_EXCL); @@ -3859,7 +3860,7 @@ xfs_iflush_int( * the flush lock. */ bp->b_flags |= _XBF_INODES; - xfs_buf_attach_iodone(bp, NULL, &iip->ili_item); + list_add_tail(&iip->ili_item.li_bio_list, &bp->b_li_list); /* generate the checksum. 
*/ xfs_dinode_calc_crc(mp, dip); diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h index 8308bf6d7e404..99a9ab9cab25b 100644 --- a/fs/xfs/xfs_trans.h +++ b/fs/xfs/xfs_trans.h @@ -37,10 +37,6 @@ struct xfs_log_item { unsigned long li_flags; /* misc flags */ struct xfs_buf *li_buf; /* real buffer pointer */ struct list_head li_bio_list; /* buffer item list */ - void (*li_cb)(struct xfs_buf *, - struct xfs_log_item *); - /* buffer item iodone */ - /* callback func */ const struct xfs_item_ops *li_ops; /* function list */ /* delayed logging */ From patchwork Thu Jun 4 07:45:49 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Dave Chinner X-Patchwork-Id: 11587257 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 6516C913 for ; Thu, 4 Jun 2020 07:46:21 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 55979206C3 for ; Thu, 4 Jun 2020 07:46:21 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1727043AbgFDHqU (ORCPT ); Thu, 4 Jun 2020 03:46:20 -0400 Received: from mail106.syd.optusnet.com.au ([211.29.132.42]:49998 "EHLO mail106.syd.optusnet.com.au" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1727941AbgFDHqS (ORCPT ); Thu, 4 Jun 2020 03:46:18 -0400 Received: from dread.disaster.area (pa49-180-124-177.pa.nsw.optusnet.com.au [49.180.124.177]) by mail106.syd.optusnet.com.au (Postfix) with ESMTPS id 4F1A25AAE7A for ; Thu, 4 Jun 2020 17:46:09 +1000 (AEST) Received: from discord.disaster.area ([192.168.253.110]) by dread.disaster.area with esmtp (Exim 4.92.3) (envelope-from ) id 1jgkZj-0004A5-Um for linux-xfs@vger.kernel.org; Thu, 04 Jun 2020 17:46:07 +1000 Received: from dave by discord.disaster.area with local (Exim 4.93) (envelope-from ) id 1jgkZj-0017HV-Lw for linux-xfs@vger.kernel.org; Thu, 04 Jun 2020 17:46:07 +1000 From: Dave Chinner To: linux-xfs@vger.kernel.org Subject: [PATCH 13/30] xfs: handle buffer log item IO errors directly Date: Thu, 4 Jun 2020 17:45:49 +1000 Message-Id: <20200604074606.266213-14-david@fromorbit.com> X-Mailer: git-send-email 2.26.2.761.g0e0b3e54be In-Reply-To: <20200604074606.266213-1-david@fromorbit.com> References: <20200604074606.266213-1-david@fromorbit.com> MIME-Version: 1.0 X-Optus-CM-Score: 0 X-Optus-CM-Analysis: v=2.3 cv=X6os11be c=1 sm=1 tr=0 a=k3aV/LVJup6ZGWgigO6cSA==:117 a=k3aV/LVJup6ZGWgigO6cSA==:17 a=nTHF0DUjJn0A:10 a=20KFwNOVAAAA:8 a=1Txx6-oBkrdBIOE7CJkA:9 Sender: linux-xfs-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-xfs@vger.kernel.org From: Dave Chinner Currently, when a buffer with attached log items has an IO error, we call ->iop_error for each attached log item. These all call xfs_set_li_failed() to handle the error, but we are about to change the way log items manage buffers. Hence we first need to remove the per-item dependency on buffer handling done by xfs_set_li_failed(). We already have specific buffer type IO completion routines, so move the log item error handling out of the generic error handling and into the log item specific functions so we can implement per-type error handling easily. This requires a more complex return value from the error handling code so that we can take the correct action that failure handling requires.
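In outline, every per-type iodone function then performs the same dispatch on that return value. The following standalone sketch models only the dispatch pattern: the XBF_IOERROR_* names match the enum this patch introduces below, while the cut-down struct buf and the one-retry policy in buf_iodone_error() are invented stand-ins for illustration, not XFS code.

#include <stdio.h>

enum {
	XBF_IOERROR_FINISH,	/* clear retry state and run completions */
	XBF_IOERROR_DONE,	/* IO resubmitted, do not run completions */
	XBF_IOERROR_FAIL,	/* transient failure, flag items and release */
};

/* Stand-in buffer: only the error/retry state matters for this sketch. */
struct buf {
	int b_error;		/* current IO error, 0 when none */
	int b_retries;		/* failed write attempts so far */
};

/*
 * Stand-in for xfs_buf_iodone_error(): classify the failure. The policy
 * here is deliberately trivial (resubmit once, then report a transient
 * failure); the real function consults the filesystem error configuration.
 */
static int
buf_iodone_error(struct buf *bp)
{
	if (bp->b_retries++ == 0) {
		bp->b_error = 0;	/* clear error state, resubmit write */
		return XBF_IOERROR_DONE;
	}
	return XBF_IOERROR_FAIL;
}

/* The dispatch shape shared by the per-type iodone functions. */
static void
buf_type_iodone(struct buf *bp)
{
	if (bp->b_error) {
		int ret = buf_iodone_error(bp);

		if (ret == XBF_IOERROR_DONE)
			return;			/* write was resubmitted */
		if (ret == XBF_IOERROR_FAIL) {
			printf("run failure callbacks, release buffer\n");
			return;
		}
		/* XBF_IOERROR_FINISH: fall through to normal completion */
	}
	printf("run type-specific IO completion\n");
}

int
main(void)
{
	struct buf bp = { -5, 0 };	/* pretend the first write got EIO */

	buf_type_iodone(&bp);	/* first failure: write is resubmitted */
	buf_type_iodone(&bp);	/* resubmit succeeded: normal completion */
	return 0;
}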
This results in some repeated boilerplate in the functions, but that can be cleaned up later once all the changes cascade through this code. Signed-off-by: Dave Chinner --- fs/xfs/xfs_buf_item.c | 215 ++++++++++++++++++++++++++++-------------- 1 file changed, 145 insertions(+), 70 deletions(-) diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c index 09bfe9c52dbdb..3560993847b7c 100644 --- a/fs/xfs/xfs_buf_item.c +++ b/fs/xfs/xfs_buf_item.c @@ -986,21 +986,24 @@ xfs_buf_do_callbacks_fail( spin_unlock(&ailp->ail_lock); } +/* + * Decide if we're going to retry the write after a failure, and prepare + * the buffer for retrying the write. + */ static bool -xfs_buf_iodone_callback_error( +xfs_buf_ioerror_fail_without_retry( struct xfs_buf *bp) { struct xfs_mount *mp = bp->b_mount; static ulong lasttime; static xfs_buftarg_t *lasttarg; - struct xfs_error_cfg *cfg; /* * If we've already decided to shutdown the filesystem because of * I/O errors, there's no point in giving this a retry. */ if (XFS_FORCED_SHUTDOWN(mp)) - goto out_stale; + return true; if (bp->b_target != lasttarg || time_after(jiffies, (lasttime + 5*HZ))) { @@ -1011,91 +1014,115 @@ xfs_buf_iodone_callback_error( /* synchronous writes will have callers process the error */ if (!(bp->b_flags & XBF_ASYNC)) - goto out_stale; - - trace_xfs_buf_item_iodone_async(bp, _RET_IP_); - - cfg = xfs_error_get_cfg(mp, XFS_ERR_METADATA, bp->b_error); + return true; + return false; +} - /* - * If the write was asynchronous then no one will be looking for the - * error. If this is the first failure of this type, clear the error - * state and write the buffer out again. This means we always retry an - * async write failure at least once, but we also need to set the buffer - * up to behave correctly now for repeated failures. - */ - if (!(bp->b_flags & (XBF_STALE | XBF_WRITE_FAIL)) || - bp->b_last_error != bp->b_error) { - bp->b_flags |= (XBF_WRITE | XBF_DONE | XBF_WRITE_FAIL); - bp->b_last_error = bp->b_error; - if (cfg->retry_timeout != XFS_ERR_RETRY_FOREVER && - !bp->b_first_retry_time) - bp->b_first_retry_time = jiffies; +static bool +xfs_buf_ioerror_retry( + struct xfs_buf *bp, + struct xfs_error_cfg *cfg) +{ + if (bp->b_flags & (XBF_STALE | XBF_WRITE_FAIL)) + return false; + if (bp->b_last_error == bp->b_error) + return false; - xfs_buf_ioerror(bp, 0); - xfs_buf_submit(bp); - return true; - } + bp->b_flags |= (XBF_WRITE | XBF_DONE | XBF_WRITE_FAIL); + bp->b_last_error = bp->b_error; + if (cfg->retry_timeout != XFS_ERR_RETRY_FOREVER && + !bp->b_first_retry_time) + bp->b_first_retry_time = jiffies; + return true; +} - /* - * Repeated failure on an async write. Take action according to the - * error configuration we have been set up to use. - */ +/* + * Account for this latest trip around the retry handler, and decide if + * we've failed enough times to constitute a permanent failure. 
+ */ +static bool +xfs_buf_ioerror_permanent( + struct xfs_buf *bp, + struct xfs_error_cfg *cfg) +{ + struct xfs_mount *mp = bp->b_mount; if (cfg->max_retries != XFS_ERR_RETRY_FOREVER && ++bp->b_retries > cfg->max_retries) - goto permanent_error; + return true; if (cfg->retry_timeout != XFS_ERR_RETRY_FOREVER && time_after(jiffies, cfg->retry_timeout + bp->b_first_retry_time)) - goto permanent_error; + return true; /* At unmount we may treat errors differently */ if ((mp->m_flags & XFS_MOUNT_UNMOUNTING) && mp->m_fail_unmount) - goto permanent_error; - - /* - * Still a transient error, run IO completion failure callbacks and let - * the higher layers retry the buffer. - */ - xfs_buf_do_callbacks_fail(bp); - xfs_buf_ioerror(bp, 0); - xfs_buf_relse(bp); - return true; + return true; - /* - * Permanent error - we need to trigger a shutdown if we haven't already - * to indicate that inconsistency will result from this action. - */ -permanent_error: - xfs_force_shutdown(mp, SHUTDOWN_META_IO_ERROR); -out_stale: - xfs_buf_stale(bp); - bp->b_flags |= XBF_DONE; - trace_xfs_buf_error_relse(bp, _RET_IP_); return false; } -static inline bool -xfs_buf_had_callback_errors( +/* + * On a sync write or shutdown we just want to stale the buffer and let the + * caller handle the error in bp->b_error appropriately. + * + * If the write was asynchronous then no one will be looking for the error. If + * this is the first failure of this type, clear the error state and write the + * buffer out again. This means we always retry an async write failure at least + * once, but we also need to set the buffer up to behave correctly now for + * repeated failures. + * + * If we get repeated async write failures, then we take action according to the + * error configuration we have been set up to use. + * + * Multi-state return value: + * + * XBF_IOERROR_FINISH: clear IO error retry state and run callback completions + * XBF_IOERROR_DONE: resubmitted immediately, do not run any completions + * XBF_IOERROR_FAIL: transient error, run failure callback completions and then + * release the buffer + */ +enum { + XBF_IOERROR_FINISH, + XBF_IOERROR_DONE, + XBF_IOERROR_FAIL, +}; + +static int +xfs_buf_iodone_error( struct xfs_buf *bp) { + struct xfs_mount *mp = bp->b_mount; + struct xfs_error_cfg *cfg; - /* - * If there is an error, process it. Some errors require us to run - * callbacks after failure processing is done so we detect that and take - * appropriate action. - */ - if (bp->b_error && xfs_buf_iodone_callback_error(bp)) - return true; + if (xfs_buf_ioerror_fail_without_retry(bp)) + goto out_stale; + + trace_xfs_buf_item_iodone_async(bp, _RET_IP_); + + cfg = xfs_error_get_cfg(mp, XFS_ERR_METADATA, bp->b_error); + if (xfs_buf_ioerror_retry(bp, cfg)) { + xfs_buf_ioerror(bp, 0); + xfs_buf_submit(bp); + return XBF_IOERROR_DONE; + } /* - * Successful IO or permanent error. Either way, we can clear the - * retry state here in preparation for the next error that may occur. + * Permanent error - we need to trigger a shutdown if we haven't already + * to indicate that inconsistency will result from this action. */ - bp->b_last_error = 0; - bp->b_retries = 0; - bp->b_first_retry_time = 0; - return false; + if (xfs_buf_ioerror_permanent(bp, cfg)) { + xfs_force_shutdown(mp, SHUTDOWN_META_IO_ERROR); + goto out_stale; + } + + /* Still considered a transient error. Caller will schedule retries. 
*/ + return XBF_IOERROR_FAIL; + +out_stale: + xfs_buf_stale(bp); + bp->b_flags |= XBF_DONE; + trace_xfs_buf_error_relse(bp, _RET_IP_); + return XBF_IOERROR_FINISH; } static void @@ -1122,6 +1149,15 @@ xfs_buf_item_done( xfs_buf_rele(bp); } +static inline void +xfs_buf_clear_ioerror_retry_state( + struct xfs_buf *bp) +{ + bp->b_last_error = 0; + bp->b_retries = 0; + bp->b_first_retry_time = 0; +} + /* * Inode buffer iodone callback function. */ @@ -1129,9 +1165,22 @@ void xfs_buf_inode_iodone( struct xfs_buf *bp) { - if (xfs_buf_had_callback_errors(bp)) + if (bp->b_error) { + int ret = xfs_buf_iodone_error(bp); + + if (ret == XBF_IOERROR_FINISH) + goto finish_iodone; + if (ret == XBF_IOERROR_DONE) + return; + ASSERT(ret == XBF_IOERROR_FAIL); + xfs_buf_do_callbacks_fail(bp); + xfs_buf_ioerror(bp, 0); + xfs_buf_relse(bp); return; + } +finish_iodone: + xfs_buf_clear_ioerror_retry_state(bp); xfs_buf_item_done(bp); xfs_iflush_done(bp); xfs_buf_ioend_finish(bp); @@ -1144,9 +1193,22 @@ void xfs_buf_dquot_iodone( struct xfs_buf *bp) { - if (xfs_buf_had_callback_errors(bp)) + if (bp->b_error) { + int ret = xfs_buf_iodone_error(bp); + + if (ret == XBF_IOERROR_FINISH) + goto finish_iodone; + if (ret == XBF_IOERROR_DONE) + return; + ASSERT(ret == XBF_IOERROR_FAIL); + xfs_buf_do_callbacks_fail(bp); + xfs_buf_ioerror(bp, 0); + xfs_buf_relse(bp); return; + } +finish_iodone: + xfs_buf_clear_ioerror_retry_state(bp); /* a newly allocated dquot buffer might have a log item attached */ xfs_buf_item_done(bp); xfs_dquot_done(bp); @@ -1163,9 +1225,22 @@ void xfs_buf_iodone( struct xfs_buf *bp) { - if (xfs_buf_had_callback_errors(bp)) + if (bp->b_error) { + int ret = xfs_buf_iodone_error(bp); + + if (ret == XBF_IOERROR_FINISH) + goto finish_iodone; + if (ret == XBF_IOERROR_DONE) + return; + ASSERT(ret == XBF_IOERROR_FAIL); + xfs_buf_do_callbacks_fail(bp); + xfs_buf_ioerror(bp, 0); + xfs_buf_relse(bp); return; + } +finish_iodone: + xfs_buf_clear_ioerror_retry_state(bp); xfs_buf_item_done(bp); xfs_buf_ioend_finish(bp); } From patchwork Thu Jun 4 07:45:50 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Dave Chinner X-Patchwork-Id: 11587277 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 1E76C913 for ; Thu, 4 Jun 2020 07:46:25 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 10178206C3 for ; Thu, 4 Jun 2020 07:46:25 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1728029AbgFDHqX (ORCPT ); Thu, 4 Jun 2020 03:46:23 -0400 Received: from mail105.syd.optusnet.com.au ([211.29.132.249]:41578 "EHLO mail105.syd.optusnet.com.au" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1728019AbgFDHqW (ORCPT ); Thu, 4 Jun 2020 03:46:22 -0400 Received: from dread.disaster.area (pa49-180-124-177.pa.nsw.optusnet.com.au [49.180.124.177]) by mail105.syd.optusnet.com.au (Postfix) with ESMTPS id 28A7F3A40C7 for ; Thu, 4 Jun 2020 17:46:13 +1000 (AEST) Received: from discord.disaster.area ([192.168.253.110]) by dread.disaster.area with esmtp (Exim 4.92.3) (envelope-from ) id 1jgkZk-0004A9-0L for linux-xfs@vger.kernel.org; Thu, 04 Jun 2020 17:46:08 +1000 Received: from dave by discord.disaster.area with local (Exim 4.93) (envelope-from ) id 1jgkZj-0017Ha-NU for linux-xfs@vger.kernel.org; Thu, 04 Jun 2020 17:46:07 +1000 From: Dave Chinner To: 
linux-xfs@vger.kernel.org Subject: [PATCH 14/30] xfs: unwind log item error flagging Date: Thu, 4 Jun 2020 17:45:50 +1000 Message-Id: <20200604074606.266213-15-david@fromorbit.com> X-Mailer: git-send-email 2.26.2.761.g0e0b3e54be In-Reply-To: <20200604074606.266213-1-david@fromorbit.com> References: <20200604074606.266213-1-david@fromorbit.com> MIME-Version: 1.0 X-Optus-CM-Score: 0 X-Optus-CM-Analysis: v=2.3 cv=W5xGqiek c=1 sm=1 tr=0 a=k3aV/LVJup6ZGWgigO6cSA==:117 a=k3aV/LVJup6ZGWgigO6cSA==:17 a=nTHF0DUjJn0A:10 a=20KFwNOVAAAA:8 a=yPCof4ZbAAAA:8 a=T7xEnian507jIie1A9MA:9 Sender: linux-xfs-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-xfs@vger.kernel.org From: Dave Chinner When a buffer IO error occurs, we want to mark all the log items attached to the buffer as failed. Open code the error handling loop so that we can modify the flagging for the different types of objects directly and independently of each other. This also allows us to remove the ->iop_error method from the log item operations. Signed-off-by: Dave Chinner Reviewed-by: Brian Foster Reviewed-by: Darrick J. Wong --- fs/xfs/xfs_buf_item.c | 48 ++++++++++++----------------------------- fs/xfs/xfs_dquot_item.c | 18 ---------------- fs/xfs/xfs_inode_item.c | 18 ---------------- fs/xfs/xfs_trans.h | 1 - 4 files changed, 14 insertions(+), 71 deletions(-) diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c index 3560993847b7c..bc8f8df6cc4c6 100644 --- a/fs/xfs/xfs_buf_item.c +++ b/fs/xfs/xfs_buf_item.c @@ -12,6 +12,7 @@ #include "xfs_bit.h" #include "xfs_mount.h" #include "xfs_trans.h" +#include "xfs_trans_priv.h" #include "xfs_buf_item.h" #include "xfs_inode.h" #include "xfs_inode_item.h" @@ -955,37 +956,6 @@ xfs_buf_item_relse( xfs_buf_item_free(bip); } -/* - * Invoke the error state callback for each log item affected by the failed I/O. - * - * If a metadata buffer write fails with a non-permanent error, the buffer is - * eventually resubmitted and so the completion callbacks are not run. The error - * state may need to be propagated to the log items attached to the buffer, - * however, so the next AIL push of the item knows hot to handle it correctly. - */ -STATIC void -xfs_buf_do_callbacks_fail( - struct xfs_buf *bp) -{ - struct xfs_ail *ailp = bp->b_mount->m_ail; - struct xfs_log_item *lip; - - /* - * Buffer log item errors are handled directly by xfs_buf_item_push() - * and xfs_buf_iodone_callback_error, and they have no IO error - * callbacks. Check only for items in b_li_list. - */ - if (list_empty(&bp->b_li_list)) - return; - - spin_lock(&ailp->ail_lock); - list_for_each_entry(lip, &bp->b_li_list, li_bio_list) { - if (lip->li_ops->iop_error) - lip->li_ops->iop_error(lip, bp); - } - spin_unlock(&ailp->ail_lock); -} - /* * Decide if we're going to retry the write after a failure, and prepare * the buffer for retrying the write.
@@ -1166,6 +1136,7 @@ xfs_buf_inode_iodone( struct xfs_buf *bp) { if (bp->b_error) { + struct xfs_log_item *lip; int ret = xfs_buf_iodone_error(bp); if (ret == XBF_IOERROR_FINISH) @@ -1173,7 +1144,11 @@ xfs_buf_inode_iodone( if (ret == XBF_IOERROR_DONE) return; ASSERT(ret == XBF_IOERROR_FAIL); - xfs_buf_do_callbacks_fail(bp); + spin_lock(&bp->b_mount->m_ail->ail_lock); + list_for_each_entry(lip, &bp->b_li_list, li_bio_list) { + xfs_set_li_failed(lip, bp); + } + spin_unlock(&bp->b_mount->m_ail->ail_lock); xfs_buf_ioerror(bp, 0); xfs_buf_relse(bp); return; @@ -1194,6 +1169,7 @@ xfs_buf_dquot_iodone( struct xfs_buf *bp) { if (bp->b_error) { + struct xfs_log_item *lip; int ret = xfs_buf_iodone_error(bp); if (ret == XBF_IOERROR_FINISH) @@ -1201,7 +1177,11 @@ xfs_buf_dquot_iodone( if (ret == XBF_IOERROR_DONE) return; ASSERT(ret == XBF_IOERROR_FAIL); - xfs_buf_do_callbacks_fail(bp); + spin_lock(&bp->b_mount->m_ail->ail_lock); + list_for_each_entry(lip, &bp->b_li_list, li_bio_list) { + xfs_set_li_failed(lip, bp); + } + spin_unlock(&bp->b_mount->m_ail->ail_lock); xfs_buf_ioerror(bp, 0); xfs_buf_relse(bp); return; @@ -1233,7 +1213,7 @@ xfs_buf_iodone( if (ret == XBF_IOERROR_DONE) return; ASSERT(ret == XBF_IOERROR_FAIL); - xfs_buf_do_callbacks_fail(bp); + ASSERT(list_empty(&bp->b_li_list)); xfs_buf_ioerror(bp, 0); xfs_buf_relse(bp); return; diff --git a/fs/xfs/xfs_dquot_item.c b/fs/xfs/xfs_dquot_item.c index 349c92d26570c..d7e4de7151d7f 100644 --- a/fs/xfs/xfs_dquot_item.c +++ b/fs/xfs/xfs_dquot_item.c @@ -113,23 +113,6 @@ xfs_qm_dqunpin_wait( wait_event(dqp->q_pinwait, (atomic_read(&dqp->q_pincount) == 0)); } -/* - * Callback used to mark a buffer with XFS_LI_FAILED when items in the buffer - * have been failed during writeback - * - * this informs the AIL that the dquot is already flush locked on the next push, - * and acquires a hold on the buffer to ensure that it isn't reclaimed before - * dirty data makes it to disk. - */ -STATIC void -xfs_dquot_item_error( - struct xfs_log_item *lip, - struct xfs_buf *bp) -{ - ASSERT(!completion_done(&DQUOT_ITEM(lip)->qli_dquot->q_flush)); - xfs_set_li_failed(lip, bp); -} - STATIC uint xfs_qm_dquot_logitem_push( struct xfs_log_item *lip, @@ -216,7 +199,6 @@ static const struct xfs_item_ops xfs_dquot_item_ops = { .iop_release = xfs_qm_dquot_logitem_release, .iop_committing = xfs_qm_dquot_logitem_committing, .iop_push = xfs_qm_dquot_logitem_push, - .iop_error = xfs_dquot_item_error }; /* diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c index 7049f2ae8d186..86c783dec2bac 100644 --- a/fs/xfs/xfs_inode_item.c +++ b/fs/xfs/xfs_inode_item.c @@ -464,23 +464,6 @@ xfs_inode_item_unpin( wake_up_bit(&ip->i_flags, __XFS_IPINNED_BIT); } -/* - * Callback used to mark a buffer with XFS_LI_FAILED when items in the buffer - * have been failed during writeback - * - * This informs the AIL that the inode is already flush locked on the next push, - * and acquires a hold on the buffer to ensure that it isn't reclaimed before - * dirty data makes it to disk. 
- */ -STATIC void -xfs_inode_item_error( - struct xfs_log_item *lip, - struct xfs_buf *bp) -{ - ASSERT(xfs_isiflocked(INODE_ITEM(lip)->ili_inode)); - xfs_set_li_failed(lip, bp); -} - STATIC uint xfs_inode_item_push( struct xfs_log_item *lip, @@ -619,7 +602,6 @@ static const struct xfs_item_ops xfs_inode_item_ops = { .iop_committed = xfs_inode_item_committed, .iop_push = xfs_inode_item_push, .iop_committing = xfs_inode_item_committing, - .iop_error = xfs_inode_item_error }; diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h index 99a9ab9cab25b..b752501818d25 100644 --- a/fs/xfs/xfs_trans.h +++ b/fs/xfs/xfs_trans.h @@ -74,7 +74,6 @@ struct xfs_item_ops { void (*iop_committing)(struct xfs_log_item *, xfs_lsn_t commit_lsn); void (*iop_release)(struct xfs_log_item *); xfs_lsn_t (*iop_committed)(struct xfs_log_item *, xfs_lsn_t); - void (*iop_error)(struct xfs_log_item *, xfs_buf_t *); int (*iop_recover)(struct xfs_log_item *lip, struct xfs_trans *tp); bool (*iop_match)(struct xfs_log_item *item, uint64_t id); }; From patchwork Thu Jun 4 07:45:51 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Dave Chinner X-Patchwork-Id: 11587233 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 51AED1392 for ; Thu, 4 Jun 2020 07:46:13 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 44489206C3 for ; Thu, 4 Jun 2020 07:46:13 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1726967AbgFDHqM (ORCPT ); Thu, 4 Jun 2020 03:46:12 -0400 Received: from mail107.syd.optusnet.com.au ([211.29.132.53]:48572 "EHLO mail107.syd.optusnet.com.au" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726246AbgFDHqL (ORCPT ); Thu, 4 Jun 2020 03:46:11 -0400 Received: from dread.disaster.area (pa49-180-124-177.pa.nsw.optusnet.com.au [49.180.124.177]) by mail107.syd.optusnet.com.au (Postfix) with ESMTPS id A67CDD598F3 for ; Thu, 4 Jun 2020 17:46:08 +1000 (AEST) Received: from discord.disaster.area ([192.168.253.110]) by dread.disaster.area with esmtp (Exim 4.92.3) (envelope-from ) id 1jgkZk-0004AB-1Q for linux-xfs@vger.kernel.org; Thu, 04 Jun 2020 17:46:08 +1000 Received: from dave by discord.disaster.area with local (Exim 4.93) (envelope-from ) id 1jgkZj-0017Hf-Om for linux-xfs@vger.kernel.org; Thu, 04 Jun 2020 17:46:07 +1000 From: Dave Chinner To: linux-xfs@vger.kernel.org Subject: [PATCH 15/30] xfs: move xfs_clear_li_failed out of xfs_ail_delete_one() Date: Thu, 4 Jun 2020 17:45:51 +1000 Message-Id: <20200604074606.266213-16-david@fromorbit.com> X-Mailer: git-send-email 2.26.2.761.g0e0b3e54be In-Reply-To: <20200604074606.266213-1-david@fromorbit.com> References: <20200604074606.266213-1-david@fromorbit.com> MIME-Version: 1.0 X-Optus-CM-Score: 0 X-Optus-CM-Analysis: v=2.3 cv=W5xGqiek c=1 sm=1 tr=0 a=k3aV/LVJup6ZGWgigO6cSA==:117 a=k3aV/LVJup6ZGWgigO6cSA==:17 a=nTHF0DUjJn0A:10 a=20KFwNOVAAAA:8 a=yPCof4ZbAAAA:8 a=n6ZPMNil_ZAWRd6T3lcA:9 a=pHzHmUro8NiASowvMSCR:22 a=nt3jZW36AmriUCFCBwmW:22 Sender: linux-xfs-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-xfs@vger.kernel.org From: Dave Chinner xfs_ail_delete_one() is called directly from dquot and inode IO completion, as well as from the generic xfs_trans_ail_delete() function. Inodes are about to have their own failure handling, and dquots will in future, too. 
Pull the clearing of the LI_FAILED flag up into the callers so we can customise the code appropriately. Signed-off-by: Dave Chinner Reviewed-by: Brian Foster Reviewed-by: Darrick J. Wong --- fs/xfs/xfs_dquot.c | 6 +----- fs/xfs/xfs_inode_item.c | 3 +-- fs/xfs/xfs_trans_ail.c | 2 +- 3 files changed, 3 insertions(+), 8 deletions(-) diff --git a/fs/xfs/xfs_dquot.c b/fs/xfs/xfs_dquot.c index d5984a926d1d0..76353c9a723ee 100644 --- a/fs/xfs/xfs_dquot.c +++ b/fs/xfs/xfs_dquot.c @@ -1070,16 +1070,12 @@ xfs_qm_dqflush_done( test_bit(XFS_LI_FAILED, &lip->li_flags))) { spin_lock(&ailp->ail_lock); + xfs_clear_li_failed(lip); if (lip->li_lsn == qip->qli_flush_lsn) { /* xfs_ail_update_finish() drops the AIL lock */ tail_lsn = xfs_ail_delete_one(ailp, lip); xfs_ail_update_finish(ailp, tail_lsn); } else { - /* - * Clear the failed state since we are about to drop the - * flush lock - */ - xfs_clear_li_failed(lip); spin_unlock(&ailp->ail_lock); } } diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c index 86c783dec2bac..0ba75764a8dc5 100644 --- a/fs/xfs/xfs_inode_item.c +++ b/fs/xfs/xfs_inode_item.c @@ -690,12 +690,11 @@ xfs_iflush_done( /* this is an opencoded batch version of xfs_trans_ail_delete */ spin_lock(&ailp->ail_lock); list_for_each_entry(lip, &tmp, li_bio_list) { + xfs_clear_li_failed(lip); if (lip->li_lsn == INODE_ITEM(lip)->ili_flush_lsn) { xfs_lsn_t lsn = xfs_ail_delete_one(ailp, lip); if (!tail_lsn && lsn) tail_lsn = lsn; - } else { - xfs_clear_li_failed(lip); } } xfs_ail_update_finish(ailp, tail_lsn); diff --git a/fs/xfs/xfs_trans_ail.c b/fs/xfs/xfs_trans_ail.c index ac5019361a139..ac33f6393f99c 100644 --- a/fs/xfs/xfs_trans_ail.c +++ b/fs/xfs/xfs_trans_ail.c @@ -843,7 +843,6 @@ xfs_ail_delete_one( trace_xfs_ail_delete(lip, mlip->li_lsn, lip->li_lsn); xfs_ail_delete(ailp, lip); - xfs_clear_li_failed(lip); clear_bit(XFS_LI_IN_AIL, &lip->li_flags); lip->li_lsn = 0; @@ -874,6 +873,7 @@ xfs_trans_ail_delete( } /* xfs_ail_update_finish() drops the AIL lock */ + xfs_clear_li_failed(lip); tail_lsn = xfs_ail_delete_one(ailp, lip); xfs_ail_update_finish(ailp, tail_lsn); } From patchwork Thu Jun 4 07:45:52 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Dave Chinner X-Patchwork-Id: 11587271 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id EDCDC1392 for ; Thu, 4 Jun 2020 07:46:23 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id D9C9D2072E for ; Thu, 4 Jun 2020 07:46:23 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1727058AbgFDHqW (ORCPT ); Thu, 4 Jun 2020 03:46:22 -0400 Received: from mail106.syd.optusnet.com.au ([211.29.132.42]:49996 "EHLO mail106.syd.optusnet.com.au" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1727921AbgFDHqU (ORCPT ); Thu, 4 Jun 2020 03:46:20 -0400 Received: from dread.disaster.area (pa49-180-124-177.pa.nsw.optusnet.com.au [49.180.124.177]) by mail106.syd.optusnet.com.au (Postfix) with ESMTPS id 5D67B5AAE47 for ; Thu, 4 Jun 2020 17:46:09 +1000 (AEST) Received: from discord.disaster.area ([192.168.253.110]) by dread.disaster.area with esmtp (Exim 4.92.3) (envelope-from ) id 1jgkZk-0004AF-35 for linux-xfs@vger.kernel.org; Thu, 04 Jun 2020 17:46:08 +1000 Received: from dave by discord.disaster.area with local (Exim 4.93) (envelope-from ) id 1jgkZj-0017Hk-Q9 for 
linux-xfs@vger.kernel.org; Thu, 04 Jun 2020 17:46:07 +1000 From: Dave Chinner To: linux-xfs@vger.kernel.org Subject: [PATCH 16/30] xfs: pin inode backing buffer to the inode log item Date: Thu, 4 Jun 2020 17:45:52 +1000 Message-Id: <20200604074606.266213-17-david@fromorbit.com> X-Mailer: git-send-email 2.26.2.761.g0e0b3e54be In-Reply-To: <20200604074606.266213-1-david@fromorbit.com> References: <20200604074606.266213-1-david@fromorbit.com> MIME-Version: 1.0 X-Optus-CM-Score: 0 X-Optus-CM-Analysis: v=2.3 cv=QIgWuTDL c=1 sm=1 tr=0 a=k3aV/LVJup6ZGWgigO6cSA==:117 a=k3aV/LVJup6ZGWgigO6cSA==:17 a=nTHF0DUjJn0A:10 a=20KFwNOVAAAA:8 a=BWTjdfWTP5q8G_7E9WIA:9 Sender: linux-xfs-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-xfs@vger.kernel.org From: Dave Chinner When we dirty an inode, we are going to have to write it to disk at some point in the near future. This requires the inode cluster backing buffer to be present in memory. Unfortunately, under severe memory pressure we can reclaim the inode backing buffer while the inode is dirty in memory, stalling AIL pushing because it then has to do a read-modify-write cycle on the cluster buffer. When we have no memory available, the read of the cluster buffer blocks the AIL pushing process, and this causes all sorts of issues for memory reclaim as it requires inode writeback to make forwards progress. Allocating a cluster buffer causes more memory pressure, and results in more cluster buffers being reclaimed, which in turn means more RMW cycles done in the AIL context, and everything then backs up on AIL progress. Only the synchronous inode cluster writeback in the inode reclaim code provides some level of forwards progress guarantee that prevents OOM-killer rampages in this situation. Fix this by pinning the inode backing buffer to the inode log item when the inode is first dirtied (i.e. in xfs_trans_log_inode()). This may mean that the first modification of an inode that has been held in cache for a long time blocks on a cluster buffer read, but we can do that in transaction context and block safely until the buffer has been allocated and read. Once we have the cluster buffer, the inode log item takes a reference to it, pinning it in memory, and attaches it to the log item for future reference. This means we can always grab the cluster buffer from the inode log item when we need it. When the inode is finally cleaned and removed from the AIL, we can drop the reference the inode log item holds on the cluster buffer. Once all inodes on the cluster buffer are clean, the cluster buffer will be unpinned and it will be available for memory reclaim to reclaim again. This avoids the issues with needing to do RMW cycles in the AIL pushing context, and hence allows complete non-blocking inode flushing to be performed by the AIL pushing context.
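In code terms, the core of the change to xfs_trans_log_inode() boils down to roughly the following (a condensed sketch of the hunk in the diff below, with the longer comments and asserts trimmed):

	if (!iip->ili_item.li_buf) {
		struct xfs_buf	*bp;
		int		error;

		/*
		 * We hold the ILOCK and no buffer is attached, so no IO is
		 * in progress on this inode. Drop the spinlock and read the
		 * cluster buffer in transaction context, where blocking is
		 * safe.
		 */
		spin_unlock(&iip->ili_lock);
		error = xfs_imap_to_bp(ip->i_mount, tp, &ip->i_imap, NULL,
				&bp, 0);
		if (error) {
			/* a read error here acts like a dirty transaction abort */
			xfs_force_shutdown(ip->i_mount, SHUTDOWN_META_IO_ERROR);
			return;
		}

		/*
		 * Take an explicit buffer reference for the log item and
		 * release the transaction's reference, so the hold outlives
		 * the transaction commit.
		 */
		xfs_buf_hold(bp);
		xfs_trans_brelse(tp, bp);

		spin_lock(&iip->ili_lock);
		iip->ili_item.li_buf = bp;	/* held until the inode is clean */
	}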
Signed-off-by: Dave Chinner Reviewed-by: Brian Foster --- fs/xfs/libxfs/xfs_inode_buf.c | 3 +- fs/xfs/libxfs/xfs_trans_inode.c | 53 ++++++++++++++++++++++++---- fs/xfs/xfs_buf_item.c | 4 +-- fs/xfs/xfs_inode_item.c | 61 ++++++++++++++++++++++++++------- fs/xfs/xfs_trans_ail.c | 8 +++-- 5 files changed, 105 insertions(+), 24 deletions(-) diff --git a/fs/xfs/libxfs/xfs_inode_buf.c b/fs/xfs/libxfs/xfs_inode_buf.c index 6f84ea85fdd83..1af97235785c8 100644 --- a/fs/xfs/libxfs/xfs_inode_buf.c +++ b/fs/xfs/libxfs/xfs_inode_buf.c @@ -176,7 +176,8 @@ xfs_imap_to_bp( } *bpp = bp; - *dipp = xfs_buf_offset(bp, imap->im_boffset); + if (dipp) + *dipp = xfs_buf_offset(bp, imap->im_boffset); return 0; } diff --git a/fs/xfs/libxfs/xfs_trans_inode.c b/fs/xfs/libxfs/xfs_trans_inode.c index c66d9d1dd58b9..ad5974365c589 100644 --- a/fs/xfs/libxfs/xfs_trans_inode.c +++ b/fs/xfs/libxfs/xfs_trans_inode.c @@ -8,6 +8,8 @@ #include "xfs_shared.h" #include "xfs_format.h" #include "xfs_log_format.h" +#include "xfs_trans_resv.h" +#include "xfs_mount.h" #include "xfs_inode.h" #include "xfs_trans.h" #include "xfs_trans_priv.h" @@ -72,13 +74,19 @@ xfs_trans_ichgtime( } /* - * This is called to mark the fields indicated in fieldmask as needing - * to be logged when the transaction is committed. The inode must - * already be associated with the given transaction. + * This is called to mark the fields indicated in fieldmask as needing to be + * logged when the transaction is committed. The inode must already be + * associated with the given transaction. * - * The values for fieldmask are defined in xfs_inode_item.h. We always - * log all of the core inode if any of it has changed, and we always log - * all of the inline data/extents/b-tree root if any of them has changed. + * The values for fieldmask are defined in xfs_inode_item.h. We always log all + * of the core inode if any of it has changed, and we always log all of the + * inline data/extents/b-tree root if any of them has changed. + * + * Grab and pin the cluster buffer associated with this inode to avoid RMW + * cycles at inode writeback time. Avoid the need to add error handling to every + * xfs_trans_log_inode() call by shutting down on read error. This will cause + * transactions to fail and everything to error out, just like if we return a + * read error in a dirty transaction and cancel it. */ void xfs_trans_log_inode( @@ -131,6 +139,39 @@ xfs_trans_log_inode( spin_lock(&iip->ili_lock); iip->ili_fsync_fields |= flags; + if (!iip->ili_item.li_buf) { + struct xfs_buf *bp; + int error; + + /* + * We hold the ILOCK here, so this inode is not going to be + * flushed while we are here. Further, because there is no + * buffer attached to the item, we know that there is no IO in + * progress, so nothing will clear the ili_fields while we read + * in the buffer. Hence we can safely drop the spin lock and + * read the buffer knowing that the state will not change from + * here. + */ + spin_unlock(&iip->ili_lock); + error = xfs_imap_to_bp(ip->i_mount, tp, &ip->i_imap, NULL, + &bp, 0); + if (error) { + xfs_force_shutdown(ip->i_mount, SHUTDOWN_META_IO_ERROR); + return; + } + + /* + * We need an explicit buffer reference for the log item but + * don't want the buffer to remain attached to the transaction. + * Hold the buffer but release the transaction reference. + */ + xfs_buf_hold(bp); + xfs_trans_brelse(tp, bp); + + spin_lock(&iip->ili_lock); + iip->ili_item.li_buf = bp; + } + /* * Always OR in the bits from the ili_last_fields field. 
This is to * coordinate with the xfs_iflush() and xfs_iflush_done() routines in diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c index bc8f8df6cc4c6..a8c5070376b21 100644 --- a/fs/xfs/xfs_buf_item.c +++ b/fs/xfs/xfs_buf_item.c @@ -1144,11 +1144,9 @@ xfs_buf_inode_iodone( if (ret == XBF_IOERROR_DONE) return; ASSERT(ret == XBF_IOERROR_FAIL); - spin_lock(&bp->b_mount->m_ail->ail_lock); list_for_each_entry(lip, &bp->b_li_list, li_bio_list) { - xfs_set_li_failed(lip, bp); + set_bit(XFS_LI_FAILED, &lip->li_flags); } - spin_unlock(&bp->b_mount->m_ail->ail_lock); xfs_buf_ioerror(bp, 0); xfs_buf_relse(bp); return; diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c index 0ba75764a8dc5..64bdda72f7b27 100644 --- a/fs/xfs/xfs_inode_item.c +++ b/fs/xfs/xfs_inode_item.c @@ -439,6 +439,7 @@ xfs_inode_item_pin( struct xfs_inode *ip = INODE_ITEM(lip)->ili_inode; ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL)); + ASSERT(lip->li_buf); trace_xfs_inode_pin(ip, _RET_IP_); atomic_inc(&ip->i_pincount); @@ -450,6 +451,12 @@ xfs_inode_item_pin( * item which was previously pinned with a call to xfs_inode_item_pin(). * * Also wake up anyone in xfs_iunpin_wait() if the count goes to 0. + * + * Note that unpin can race with inode cluster buffer freeing marking the buffer + * stale. In that case, flush completions are run from the buffer unpin call, + * which may happen before the inode is unpinned. If we lose the race, there + * will be no buffer attached to the log item, but the inode will be marked + * XFS_ISTALE. */ STATIC void xfs_inode_item_unpin( @@ -459,6 +466,7 @@ xfs_inode_item_unpin( struct xfs_inode *ip = INODE_ITEM(lip)->ili_inode; trace_xfs_inode_unpin(ip, _RET_IP_); + ASSERT(lip->li_buf || xfs_iflags_test(ip, XFS_ISTALE)); ASSERT(atomic_read(&ip->i_pincount) > 0); if (atomic_dec_and_test(&ip->i_pincount)) wake_up_bit(&ip->i_flags, __XFS_IPINNED_BIT); @@ -629,10 +637,15 @@ xfs_inode_item_init( */ void xfs_inode_item_destroy( - xfs_inode_t *ip) + struct xfs_inode *ip) { - kmem_free(ip->i_itemp->ili_item.li_lv_shadow); - kmem_cache_free(xfs_ili_zone, ip->i_itemp); + struct xfs_inode_log_item *iip = ip->i_itemp; + + ASSERT(iip->ili_item.li_buf == NULL); + + ip->i_itemp = NULL; + kmem_free(iip->ili_item.li_lv_shadow); + kmem_cache_free(xfs_ili_zone, iip); } @@ -673,11 +686,10 @@ xfs_iflush_done( list_move_tail(&lip->li_bio_list, &tmp); /* Do an unlocked check for needing the AIL lock. */ - if (lip->li_lsn == iip->ili_flush_lsn || + if (iip->ili_flush_lsn == lip->li_lsn || test_bit(XFS_LI_FAILED, &lip->li_flags)) need_ail++; } - ASSERT(list_empty(&bp->b_li_list)); /* * We only want to pull the item from the AIL if it is actually there @@ -690,7 +702,7 @@ xfs_iflush_done( /* this is an opencoded batch version of xfs_trans_ail_delete */ spin_lock(&ailp->ail_lock); list_for_each_entry(lip, &tmp, li_bio_list) { - xfs_clear_li_failed(lip); + clear_bit(XFS_LI_FAILED, &lip->li_flags); if (lip->li_lsn == INODE_ITEM(lip)->ili_flush_lsn) { xfs_lsn_t lsn = xfs_ail_delete_one(ailp, lip); if (!tail_lsn && lsn) @@ -706,14 +718,29 @@ xfs_iflush_done( * them is safely on disk. */ list_for_each_entry_safe(lip, n, &tmp, li_bio_list) { + bool drop_buffer = false; + list_del_init(&lip->li_bio_list); iip = INODE_ITEM(lip); spin_lock(&iip->ili_lock); + + /* + * Remove the reference to the cluster buffer if the inode is + * clean in memory. Drop the buffer reference once we've dropped + * the locks we hold. 
+ */ + ASSERT(iip->ili_item.li_buf == bp); + if (!iip->ili_fields) { + iip->ili_item.li_buf = NULL; + drop_buffer = true; + } iip->ili_last_fields = 0; + iip->ili_flush_lsn = 0; spin_unlock(&iip->ili_lock); - xfs_ifunlock(iip->ili_inode); + if (drop_buffer) + xfs_buf_rele(bp); } } @@ -725,12 +752,20 @@ xfs_iflush_done( */ void xfs_iflush_abort( - struct xfs_inode *ip) + struct xfs_inode *ip) { - struct xfs_inode_log_item *iip = ip->i_itemp; + struct xfs_inode_log_item *iip = ip->i_itemp; + struct xfs_buf *bp = NULL; if (iip) { + /* + * Clear the failed bit before removing the item from the AIL so + * xfs_trans_ail_delete() doesn't try to clear and release the + * buffer attached to the log item before we are done with it. + */ + clear_bit(XFS_LI_FAILED, &iip->ili_item.li_flags); xfs_trans_ail_delete(&iip->ili_item, 0); + /* * Clear the inode logging fields so no more flushes are * attempted. @@ -739,12 +774,14 @@ xfs_iflush_abort( iip->ili_last_fields = 0; iip->ili_fields = 0; iip->ili_fsync_fields = 0; + iip->ili_flush_lsn = 0; + bp = iip->ili_item.li_buf; + iip->ili_item.li_buf = NULL; spin_unlock(&iip->ili_lock); } - /* - * Release the inode's flush lock since we're done with it. - */ xfs_ifunlock(ip); + if (bp) + xfs_buf_rele(bp); } /* diff --git a/fs/xfs/xfs_trans_ail.c b/fs/xfs/xfs_trans_ail.c index ac33f6393f99c..c3be6e4401343 100644 --- a/fs/xfs/xfs_trans_ail.c +++ b/fs/xfs/xfs_trans_ail.c @@ -377,8 +377,12 @@ xfsaild_resubmit_item( } /* protected by ail_lock */ - list_for_each_entry(lip, &bp->b_li_list, li_bio_list) - xfs_clear_li_failed(lip); + list_for_each_entry(lip, &bp->b_li_list, li_bio_list) { + if (bp->b_flags & _XBF_INODES) + clear_bit(XFS_LI_FAILED, &lip->li_flags); + else + xfs_clear_li_failed(lip); + } xfs_buf_unlock(bp); return XFS_ITEM_SUCCESS; From patchwork Thu Jun 4 07:45:53 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Dave Chinner X-Patchwork-Id: 11587273 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 5F8C9913 for ; Thu, 4 Jun 2020 07:46:24 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 4EA212067B for ; Thu, 4 Jun 2020 07:46:24 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1728021AbgFDHqX (ORCPT ); Thu, 4 Jun 2020 03:46:23 -0400 Received: from mail109.syd.optusnet.com.au ([211.29.132.80]:50662 "EHLO mail109.syd.optusnet.com.au" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1728029AbgFDHqW (ORCPT ); Thu, 4 Jun 2020 03:46:22 -0400 Received: from dread.disaster.area (pa49-180-124-177.pa.nsw.optusnet.com.au [49.180.124.177]) by mail109.syd.optusnet.com.au (Postfix) with ESMTPS id 2D6DCD79D5A for ; Thu, 4 Jun 2020 17:46:13 +1000 (AEST) Received: from discord.disaster.area ([192.168.253.110]) by dread.disaster.area with esmtp (Exim 4.92.3) (envelope-from ) id 1jgkZk-0004AJ-5A for linux-xfs@vger.kernel.org; Thu, 04 Jun 2020 17:46:08 +1000 Received: from dave by discord.disaster.area with local (Exim 4.93) (envelope-from ) id 1jgkZj-0017Hp-S8 for linux-xfs@vger.kernel.org; Thu, 04 Jun 2020 17:46:07 +1000 From: Dave Chinner To: linux-xfs@vger.kernel.org Subject: [PATCH 17/30] xfs: make inode reclaim almost non-blocking Date: Thu, 4 Jun 2020 17:45:53 +1000 Message-Id: <20200604074606.266213-18-david@fromorbit.com> X-Mailer: git-send-email 
2.26.2.761.g0e0b3e54be In-Reply-To: <20200604074606.266213-1-david@fromorbit.com> References: <20200604074606.266213-1-david@fromorbit.com> MIME-Version: 1.0 X-Optus-CM-Score: 0 X-Optus-CM-Analysis: v=2.3 cv=QIgWuTDL c=1 sm=1 tr=0 a=k3aV/LVJup6ZGWgigO6cSA==:117 a=k3aV/LVJup6ZGWgigO6cSA==:17 a=nTHF0DUjJn0A:10 a=20KFwNOVAAAA:8 a=pGLkceISAAAA:8 a=yPCof4ZbAAAA:8 a=A2ig2MbBs_zKB9zLTN4A:9 Sender: linux-xfs-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-xfs@vger.kernel.org From: Dave Chinner Now that dirty inode writeback doesn't cause read-modify-write cycles on the inode cluster buffer under memory pressure, the need to throttle memory reclaim to the rate at which we can clean dirty inodes goes away. This is because we no longer thrash inode cluster buffers under memory pressure to clean dirty inodes. Inode writeback therefore no longer stalls on memory allocation or read IO, and hence can be done asynchronously without generating memory pressure. As a result, blocking inode writeback in reclaim is no longer necessary to prevent reclaim priority windup, as cleaning dirty inodes is no longer dependent on having memory reserves available for the filesystem to make progress reclaiming inodes. Hence we can convert inode reclaim to be non-blocking for shrinker callouts, both for direct reclaim and kswapd. On a vanilla kernel, running a 16-way fsmark create workload on a 4 node/16p/16GB RAM machine, I can reliably pin 14.75GB of RAM via userspace mlock(). The OOM killer gets invoked at 15GB of pinned RAM. With this patch alone, pinning memory triggers premature OOM killer invocation, sometimes with as much as 45% of RAM being free. It's trivially easy to trigger the OOM killer when reclaim does not block. With inode cluster pinning as well as this patch, I can reliably pin 14.5GB of RAM and still have the fsmark workload run to completion. The OOM killer gets invoked at 14.75GB of pinned RAM, which is only a small amount of memory less than the vanilla kernel, and it is much more reliable than with async reclaim alone. simoops shows that allocation stalls go away when async reclaim is used.
Vanilla kernel:
Run time: 1924 seconds
Read latency (p50: 3,305,472) (p95: 3,723,264) (p99: 4,001,792)
Write latency (p50: 184,064) (p95: 553,984) (p99: 807,936)
Allocation latency (p50: 2,641,920) (p95: 3,911,680) (p99: 4,464,640)
work rate = 13.45/sec (avg 13.44/sec) (p50: 13.46) (p95: 13.58) (p99: 13.70)
alloc stall rate = 3.80/sec (avg: 2.59) (p50: 2.54) (p95: 2.96) (p99: 3.02)
With inode cluster pinning and async reclaim:
Run time: 1924 seconds
Read latency (p50: 3,305,472) (p95: 3,715,072) (p99: 3,977,216)
Write latency (p50: 187,648) (p95: 553,984) (p99: 789,504)
Allocation latency (p50: 2,748,416) (p95: 3,919,872) (p99: 4,448,256)
work rate = 13.28/sec (avg 13.32/sec) (p50: 13.26) (p95: 13.34) (p99: 13.34)
alloc stall rate = 0.02/sec (avg: 0.02) (p50: 0.01) (p95: 0.03) (p99: 0.03)
Latencies don't really change much, nor does the work rate. However, allocation almost never stalls with these changes, whilst the vanilla kernel sometimes reports 20 stalls/s over a 60s sample period. This difference is due to inode reclaim being largely non-blocking now. IOWs, once we have pinned inode cluster buffers, we can make inode reclaim non-blocking without a major risk of premature and/or spurious OOM killer invocation, and without any changes to memory reclaim infrastructure.
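Concretely, the shrinker callout after this patch is a single non-blocking pass. A sketch of xfs_reclaim_inodes_nr() as it stands after the one-line change below (assembled from the diffs in this series; later patches rework the flags and return value further):

	long
	xfs_reclaim_inodes_nr(
		struct xfs_mount	*mp,
		int			nr_to_scan)
	{
		/* kick background reclaim and push the AIL to clean dirty inodes */
		xfs_reclaim_work_queue(mp);
		xfs_ail_push_all(mp->m_ail);

		/* one trylock-only scan; busy or dirty inodes are skipped */
		return xfs_reclaim_inodes_ag(mp, SYNC_TRYLOCK, &nr_to_scan);
	}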
Signed-off-by: Dave Chinner Reviewed-by: Amir Goldstein Reviewed-by: Darrick J. Wong Reviewed-by: Brian Foster --- fs/xfs/xfs_icache.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c index dbba4c1946386..a6780942034fc 100644 --- a/fs/xfs/xfs_icache.c +++ b/fs/xfs/xfs_icache.c @@ -1402,7 +1402,7 @@ xfs_reclaim_inodes_nr( xfs_reclaim_work_queue(mp); xfs_ail_push_all(mp->m_ail); - return xfs_reclaim_inodes_ag(mp, SYNC_TRYLOCK | SYNC_WAIT, &nr_to_scan); + return xfs_reclaim_inodes_ag(mp, SYNC_TRYLOCK, &nr_to_scan); } /* From patchwork Thu Jun 4 07:45:54 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Dave Chinner X-Patchwork-Id: 11587281 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id DB5CA159A for ; Thu, 4 Jun 2020 07:46:25 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id CD2E32075B for ; Thu, 4 Jun 2020 07:46:25 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1727997AbgFDHqZ (ORCPT ); Thu, 4 Jun 2020 03:46:25 -0400 Received: from mail105.syd.optusnet.com.au ([211.29.132.249]:41762 "EHLO mail105.syd.optusnet.com.au" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1727941AbgFDHqW (ORCPT ); Thu, 4 Jun 2020 03:46:22 -0400 Received: from dread.disaster.area (pa49-180-124-177.pa.nsw.optusnet.com.au [49.180.124.177]) by mail105.syd.optusnet.com.au (Postfix) with ESMTPS id B1A7F3A479C for ; Thu, 4 Jun 2020 17:46:13 +1000 (AEST) Received: from discord.disaster.area ([192.168.253.110]) by dread.disaster.area with esmtp (Exim 4.92.3) (envelope-from ) id 1jgkZk-0004Ac-6X for linux-xfs@vger.kernel.org; Thu, 04 Jun 2020 17:46:08 +1000 Received: from dave by discord.disaster.area with local (Exim 4.93) (envelope-from ) id 1jgkZj-0017Hu-TW for linux-xfs@vger.kernel.org; Thu, 04 Jun 2020 17:46:07 +1000 From: Dave Chinner To: linux-xfs@vger.kernel.org Subject: [PATCH 18/30] xfs: remove IO submission from xfs_reclaim_inode() Date: Thu, 4 Jun 2020 17:45:54 +1000 Message-Id: <20200604074606.266213-19-david@fromorbit.com> X-Mailer: git-send-email 2.26.2.761.g0e0b3e54be In-Reply-To: <20200604074606.266213-1-david@fromorbit.com> References: <20200604074606.266213-1-david@fromorbit.com> MIME-Version: 1.0 X-Optus-CM-Score: 0 X-Optus-CM-Analysis: v=2.3 cv=W5xGqiek c=1 sm=1 tr=0 a=k3aV/LVJup6ZGWgigO6cSA==:117 a=k3aV/LVJup6ZGWgigO6cSA==:17 a=nTHF0DUjJn0A:10 a=20KFwNOVAAAA:8 a=yPCof4ZbAAAA:8 a=1aYX5UUvoHHSElsYj00A:9 Sender: linux-xfs-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-xfs@vger.kernel.org From: Dave Chinner We no longer need to issue IO from shrinker-based inode reclaim to prevent spurious OOM killer invocation. This leaves only the global filesystem management operations, such as unmount, needing to write back dirty inodes and reclaim them. Instead of using the reclaim pass to write dirty inodes before reclaiming them, use the AIL to push all the dirty inodes before we try to reclaim them. This allows us to remove all the conditional SYNC_WAIT locking and the writeback code from xfs_reclaim_inode() and greatly simplify the checks we need to do to reclaim an inode. Signed-off-by: Dave Chinner Reviewed-by: Darrick J.
Wong Reviewed-by: Brian Foster --- fs/xfs/xfs_icache.c | 117 ++++++++++++-------------------------------- 1 file changed, 31 insertions(+), 86 deletions(-) diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c index a6780942034fc..74032316ce5cc 100644 --- a/fs/xfs/xfs_icache.c +++ b/fs/xfs/xfs_icache.c @@ -1111,24 +1111,17 @@ xfs_reclaim_inode_grab( * dirty, async => requeue * dirty, sync => flush, wait and reclaim */ -STATIC int +static bool xfs_reclaim_inode( struct xfs_inode *ip, struct xfs_perag *pag, int sync_mode) { - struct xfs_buf *bp = NULL; xfs_ino_t ino = ip->i_ino; /* for radix_tree_delete */ - int error; -restart: - error = 0; xfs_ilock(ip, XFS_ILOCK_EXCL); - if (!xfs_iflock_nowait(ip)) { - if (!(sync_mode & SYNC_WAIT)) - goto out; - xfs_iflock(ip); - } + if (!xfs_iflock_nowait(ip)) + goto out; if (XFS_FORCED_SHUTDOWN(ip->i_mount)) { xfs_iunpin_wait(ip); @@ -1136,52 +1129,12 @@ xfs_reclaim_inode( xfs_iflush_abort(ip); goto reclaim; } - if (xfs_ipincount(ip)) { - if (!(sync_mode & SYNC_WAIT)) - goto out_ifunlock; - xfs_iunpin_wait(ip); - } - if (xfs_inode_clean(ip)) { - xfs_ifunlock(ip); - goto reclaim; - } - - /* - * Never flush out dirty data during non-blocking reclaim, as it would - * just contend with AIL pushing trying to do the same job. - */ - if (!(sync_mode & SYNC_WAIT)) + if (xfs_ipincount(ip)) + goto out_ifunlock; + if (!xfs_inode_clean(ip)) goto out_ifunlock; - /* - * Now we have an inode that needs flushing. - * - * Note that xfs_iflush will never block on the inode buffer lock, as - * xfs_ifree_cluster() can lock the inode buffer before it locks the - * ip->i_lock, and we are doing the exact opposite here. As a result, - * doing a blocking xfs_imap_to_bp() to get the cluster buffer would - * result in an ABBA deadlock with xfs_ifree_cluster(). - * - * As xfs_ifree_cluser() must gather all inodes that are active in the - * cache to mark them stale, if we hit this case we don't actually want - * to do IO here - we want the inode marked stale so we can simply - * reclaim it. Hence if we get an EAGAIN error here, just unlock the - * inode, back off and try again. Hopefully the next pass through will - * see the stale flag set on the inode. - */ - error = xfs_iflush(ip, &bp); - if (error == -EAGAIN) { - xfs_iunlock(ip, XFS_ILOCK_EXCL); - /* backoff longer than in xfs_ifree_cluster */ - delay(2); - goto restart; - } - - if (!error) { - error = xfs_bwrite(bp); - xfs_buf_relse(bp); - } - + xfs_ifunlock(ip); reclaim: ASSERT(!xfs_isiflocked(ip)); @@ -1231,21 +1184,14 @@ xfs_reclaim_inode( ASSERT(xfs_inode_clean(ip)); __xfs_inode_free(ip); - return error; + return true; out_ifunlock: xfs_ifunlock(ip); out: - xfs_iflags_clear(ip, XFS_IRECLAIM); xfs_iunlock(ip, XFS_ILOCK_EXCL); - /* - * We could return -EAGAIN here to make reclaim rescan the inode tree in - * a short while. However, this just burns CPU time scanning the tree - * waiting for IO to complete and the reclaim work never goes back to - * the idle state. Instead, return 0 to let the next scheduled - * background reclaim attempt to reclaim the inode again. - */ - return 0; + xfs_iflags_clear(ip, XFS_IRECLAIM); + return false; } /* @@ -1253,21 +1199,22 @@ xfs_reclaim_inode( * corrupted, we still want to try to reclaim all the inodes. If we don't, * then a shut down during filesystem unmount reclaim walk leak all the * unreclaimed inodes. 
+ * + * Returns non-zero if any AGs or inodes were skipped in the reclaim pass + * so that callers that want to block until all dirty inodes are written back + * and reclaimed can sanely loop. */ -STATIC int +static int xfs_reclaim_inodes_ag( struct xfs_mount *mp, int flags, int *nr_to_scan) { struct xfs_perag *pag; - int error = 0; - int last_error = 0; xfs_agnumber_t ag; int trylock = flags & SYNC_TRYLOCK; int skipped; -restart: ag = 0; skipped = 0; while ((pag = xfs_perag_get_tag(mp, ag, XFS_ICI_RECLAIM_TAG))) { @@ -1341,9 +1288,8 @@ xfs_reclaim_inodes_ag( for (i = 0; i < nr_found; i++) { if (!batch[i]) continue; - error = xfs_reclaim_inode(batch[i], pag, flags); - if (error && last_error != -EFSCORRUPTED) - last_error = error; + if (!xfs_reclaim_inode(batch[i], pag, flags)) + skipped++; } *nr_to_scan -= XFS_LOOKUP_BATCH; @@ -1359,19 +1305,7 @@ xfs_reclaim_inodes_ag( mutex_unlock(&pag->pag_ici_reclaim_lock); xfs_perag_put(pag); } - - /* - * if we skipped any AG, and we still have scan count remaining, do - * another pass this time using blocking reclaim semantics (i.e - * waiting on the reclaim locks and ignoring the reclaim cursors). This - * ensure that when we get more reclaimers than AGs we block rather - * than spin trying to execute reclaim. - */ - if (skipped && (flags & SYNC_WAIT) && *nr_to_scan > 0) { - trylock = 0; - goto restart; - } - return last_error; + return skipped; } int @@ -1380,8 +1314,18 @@ xfs_reclaim_inodes( int mode) { int nr_to_scan = INT_MAX; + int skipped; - return xfs_reclaim_inodes_ag(mp, mode, &nr_to_scan); + xfs_reclaim_inodes_ag(mp, mode, &nr_to_scan); + if (!(mode & SYNC_WAIT)) + return 0; + + do { + xfs_ail_push_all_sync(mp->m_ail); + skipped = xfs_reclaim_inodes_ag(mp, mode, &nr_to_scan); + } while (skipped > 0); + + return 0; } /* @@ -1402,7 +1346,8 @@ xfs_reclaim_inodes_nr( xfs_reclaim_work_queue(mp); xfs_ail_push_all(mp->m_ail); - return xfs_reclaim_inodes_ag(mp, SYNC_TRYLOCK, &nr_to_scan); + xfs_reclaim_inodes_ag(mp, SYNC_TRYLOCK, &nr_to_scan); + return 0; } /* From patchwork Thu Jun 4 07:45:55 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Dave Chinner X-Patchwork-Id: 11587239 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id A10D6159A for ; Thu, 4 Jun 2020 07:46:14 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 930852067B for ; Thu, 4 Jun 2020 07:46:14 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1727854AbgFDHqN (ORCPT ); Thu, 4 Jun 2020 03:46:13 -0400 Received: from mail106.syd.optusnet.com.au ([211.29.132.42]:49522 "EHLO mail106.syd.optusnet.com.au" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1727058AbgFDHqN (ORCPT ); Thu, 4 Jun 2020 03:46:13 -0400 Received: from dread.disaster.area (pa49-180-124-177.pa.nsw.optusnet.com.au [49.180.124.177]) by mail106.syd.optusnet.com.au (Postfix) with ESMTPS id 41BE95AACE2 for ; Thu, 4 Jun 2020 17:46:10 +1000 (AEST) Received: from discord.disaster.area ([192.168.253.110]) by dread.disaster.area with esmtp (Exim 4.92.3) (envelope-from ) id 1jgkZk-0004Af-7O for linux-xfs@vger.kernel.org; Thu, 04 Jun 2020 17:46:08 +1000 Received: from dave by discord.disaster.area with local (Exim 4.93) (envelope-from ) id 1jgkZj-0017Hz-VE for linux-xfs@vger.kernel.org; Thu, 04 Jun 2020 17:46:07 +1000 From: Dave 
Chinner To: linux-xfs@vger.kernel.org Subject: [PATCH 19/30] xfs: allow multiple reclaimers per AG Date: Thu, 4 Jun 2020 17:45:55 +1000 Message-Id: <20200604074606.266213-20-david@fromorbit.com> X-Mailer: git-send-email 2.26.2.761.g0e0b3e54be In-Reply-To: <20200604074606.266213-1-david@fromorbit.com> References: <20200604074606.266213-1-david@fromorbit.com> MIME-Version: 1.0 X-Optus-CM-Score: 0 X-Optus-CM-Analysis: v=2.3 cv=QIgWuTDL c=1 sm=1 tr=0 a=k3aV/LVJup6ZGWgigO6cSA==:117 a=k3aV/LVJup6ZGWgigO6cSA==:17 a=nTHF0DUjJn0A:10 a=20KFwNOVAAAA:8 a=yPCof4ZbAAAA:8 a=vMY9R6GoBIkpaNy9teYA:9 Sender: linux-xfs-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-xfs@vger.kernel.org From: Dave Chinner Inode reclaim will still throttle direct reclaim on the per-ag reclaim locks. This is no longer necessary as reclaim can run non-blocking now. Hence we can remove these locks so that we don't arbitrarily block reclaimers just because there are more direct reclaimers than there are AGs. This can result in multiple reclaimers working on the same range of an AG, but this doesn't cause any apparent issues. Optimising the spread of concurrent reclaimers for best efficiency can be done in a future patchset. Signed-off-by: Dave Chinner Reviewed-by: Darrick J. Wong Reviewed-by: Brian Foster --- fs/xfs/xfs_icache.c | 31 ++++++++++++------------------- fs/xfs/xfs_mount.c | 4 ---- fs/xfs/xfs_mount.h | 1 - 3 files changed, 12 insertions(+), 24 deletions(-) diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c index 74032316ce5cc..c4ba8d7bc45bc 100644 --- a/fs/xfs/xfs_icache.c +++ b/fs/xfs/xfs_icache.c @@ -1211,12 +1211,9 @@ xfs_reclaim_inodes_ag( int *nr_to_scan) { struct xfs_perag *pag; - xfs_agnumber_t ag; - int trylock = flags & SYNC_TRYLOCK; - int skipped; + xfs_agnumber_t ag = 0; + int skipped = 0; - ag = 0; - skipped = 0; while ((pag = xfs_perag_get_tag(mp, ag, XFS_ICI_RECLAIM_TAG))) { unsigned long first_index = 0; int done = 0; @@ -1224,15 +1221,13 @@ xfs_reclaim_inodes_ag( ag = pag->pag_agno + 1; - if (trylock) { - if (!mutex_trylock(&pag->pag_ici_reclaim_lock)) { - skipped++; - xfs_perag_put(pag); - continue; - } - first_index = pag->pag_ici_reclaim_cursor; - } else - mutex_lock(&pag->pag_ici_reclaim_lock); + /* + * If the cursor is not zero, we haven't scanned the whole AG + * so we might have skipped inodes here. 
+ */ + first_index = READ_ONCE(pag->pag_ici_reclaim_cursor); + if (first_index) + skipped++; do { struct xfs_inode *batch[XFS_LOOKUP_BATCH]; @@ -1298,11 +1293,9 @@ xfs_reclaim_inodes_ag( } while (nr_found && !done && *nr_to_scan > 0); - if (trylock && !done) - pag->pag_ici_reclaim_cursor = first_index; - else - pag->pag_ici_reclaim_cursor = 0; - mutex_unlock(&pag->pag_ici_reclaim_lock); + if (done) + first_index = 0; + WRITE_ONCE(pag->pag_ici_reclaim_cursor, first_index); xfs_perag_put(pag); } return skipped; diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c index d5dcf98698600..03158b42a1943 100644 --- a/fs/xfs/xfs_mount.c +++ b/fs/xfs/xfs_mount.c @@ -148,7 +148,6 @@ xfs_free_perag( ASSERT(atomic_read(&pag->pag_ref) == 0); xfs_iunlink_destroy(pag); xfs_buf_hash_destroy(pag); - mutex_destroy(&pag->pag_ici_reclaim_lock); call_rcu(&pag->rcu_head, __xfs_free_perag); } } @@ -200,7 +199,6 @@ xfs_initialize_perag( pag->pag_agno = index; pag->pag_mount = mp; spin_lock_init(&pag->pag_ici_lock); - mutex_init(&pag->pag_ici_reclaim_lock); INIT_RADIX_TREE(&pag->pag_ici_root, GFP_ATOMIC); if (xfs_buf_hash_init(pag)) goto out_free_pag; @@ -242,7 +240,6 @@ xfs_initialize_perag( out_hash_destroy: xfs_buf_hash_destroy(pag); out_free_pag: - mutex_destroy(&pag->pag_ici_reclaim_lock); kmem_free(pag); out_unwind_new_pags: /* unwind any prior newly initialized pags */ @@ -252,7 +249,6 @@ xfs_initialize_perag( break; xfs_buf_hash_destroy(pag); xfs_iunlink_destroy(pag); - mutex_destroy(&pag->pag_ici_reclaim_lock); kmem_free(pag); } return error; diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h index 3725d25ad97e8..a72cfcaa4ad12 100644 --- a/fs/xfs/xfs_mount.h +++ b/fs/xfs/xfs_mount.h @@ -354,7 +354,6 @@ typedef struct xfs_perag { spinlock_t pag_ici_lock; /* incore inode cache lock */ struct radix_tree_root pag_ici_root; /* incore inode cache root */ int pag_ici_reclaimable; /* reclaimable inodes */ - struct mutex pag_ici_reclaim_lock; /* serialisation point */ unsigned long pag_ici_reclaim_cursor; /* reclaim restart point */ /* buffer cache index */ From patchwork Thu Jun 4 07:45:56 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Dave Chinner X-Patchwork-Id: 11587243 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id D55491392 for ; Thu, 4 Jun 2020 07:46:15 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id BE473206C3 for ; Thu, 4 Jun 2020 07:46:15 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1727016AbgFDHqO (ORCPT ); Thu, 4 Jun 2020 03:46:14 -0400 Received: from mail110.syd.optusnet.com.au ([211.29.132.97]:34225 "EHLO mail110.syd.optusnet.com.au" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726246AbgFDHqO (ORCPT ); Thu, 4 Jun 2020 03:46:14 -0400 Received: from dread.disaster.area (pa49-180-124-177.pa.nsw.optusnet.com.au [49.180.124.177]) by mail110.syd.optusnet.com.au (Postfix) with ESMTPS id E5CFA107D0D for ; Thu, 4 Jun 2020 17:46:11 +1000 (AEST) Received: from discord.disaster.area ([192.168.253.110]) by dread.disaster.area with esmtp (Exim 4.92.3) (envelope-from ) id 1jgkZk-0004Aj-8o for linux-xfs@vger.kernel.org; Thu, 04 Jun 2020 17:46:08 +1000 Received: from dave by discord.disaster.area with local (Exim 4.93) (envelope-from ) id 1jgkZk-0017I4-08 for linux-xfs@vger.kernel.org; Thu, 04 Jun 2020 
17:46:08 +1000 From: Dave Chinner To: linux-xfs@vger.kernel.org Subject: [PATCH 20/30] xfs: don't block inode reclaim on the ILOCK Date: Thu, 4 Jun 2020 17:45:56 +1000 Message-Id: <20200604074606.266213-21-david@fromorbit.com> X-Mailer: git-send-email 2.26.2.761.g0e0b3e54be In-Reply-To: <20200604074606.266213-1-david@fromorbit.com> References: <20200604074606.266213-1-david@fromorbit.com> MIME-Version: 1.0 X-Optus-CM-Score: 0 X-Optus-CM-Analysis: v=2.3 cv=W5xGqiek c=1 sm=1 tr=0 a=k3aV/LVJup6ZGWgigO6cSA==:117 a=k3aV/LVJup6ZGWgigO6cSA==:17 a=nTHF0DUjJn0A:10 a=20KFwNOVAAAA:8 a=yPCof4ZbAAAA:8 a=tOYOrV8LyEHl0yQJZvYA:9 Sender: linux-xfs-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-xfs@vger.kernel.org From: Dave Chinner When we attempt to reclaim an inode, the first thing we do is take the inode lock. This is blocking right now, so if the inode is being accessed by something else (e.g. being flushed to the cluster buffer) we will block here. Change this to a trylock so that we do not block inode reclaim unnecessarily here. Signed-off-by: Dave Chinner Reviewed-by: Darrick J. Wong Reviewed-by: Brian Foster --- fs/xfs/xfs_icache.c | 8 +++++--- 1 file changed, 5 insertions(+), 3 deletions(-) diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c index c4ba8d7bc45bc..d1c47a0e0b0ec 100644 --- a/fs/xfs/xfs_icache.c +++ b/fs/xfs/xfs_icache.c @@ -1119,9 +1119,10 @@ xfs_reclaim_inode( { xfs_ino_t ino = ip->i_ino; /* for radix_tree_delete */ - xfs_ilock(ip, XFS_ILOCK_EXCL); - if (!xfs_iflock_nowait(ip)) + if (!xfs_ilock_nowait(ip, XFS_ILOCK_EXCL)) goto out; + if (!xfs_iflock_nowait(ip)) + goto out_iunlock; if (XFS_FORCED_SHUTDOWN(ip->i_mount)) { xfs_iunpin_wait(ip); @@ -1188,8 +1189,9 @@ xfs_reclaim_inode( out_ifunlock: xfs_ifunlock(ip); -out: +out_iunlock: xfs_iunlock(ip, XFS_ILOCK_EXCL); +out: xfs_iflags_clear(ip, XFS_IRECLAIM); return false; } From patchwork Thu Jun 4 07:45:57 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Dave Chinner X-Patchwork-Id: 11587275 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 81477175A for ; Thu, 4 Jun 2020 07:46:24 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 68581207D3 for ; Thu, 4 Jun 2020 07:46:24 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1727921AbgFDHqX (ORCPT ); Thu, 4 Jun 2020 03:46:23 -0400 Received: from mail107.syd.optusnet.com.au ([211.29.132.53]:49320 "EHLO mail107.syd.optusnet.com.au" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1728021AbgFDHqV (ORCPT ); Thu, 4 Jun 2020 03:46:21 -0400 Received: from dread.disaster.area (pa49-180-124-177.pa.nsw.optusnet.com.au [49.180.124.177]) by mail107.syd.optusnet.com.au (Postfix) with ESMTPS id 57EE8D5995F for ; Thu, 4 Jun 2020 17:46:13 +1000 (AEST) Received: from discord.disaster.area ([192.168.253.110]) by dread.disaster.area with esmtp (Exim 4.92.3) (envelope-from ) id 1jgkZk-0004Ak-9Q for linux-xfs@vger.kernel.org; Thu, 04 Jun 2020 17:46:08 +1000 Received: from dave by discord.disaster.area with local (Exim 4.93) (envelope-from ) id 1jgkZk-0017I9-16 for linux-xfs@vger.kernel.org; Thu, 04 Jun 2020 17:46:08 +1000 From: Dave Chinner To: linux-xfs@vger.kernel.org Subject: [PATCH 21/30] xfs: remove SYNC_TRYLOCK from inode reclaim Date: Thu, 4 Jun 2020 17:45:57 +1000 Message-Id:
<20200604074606.266213-22-david@fromorbit.com> X-Mailer: git-send-email 2.26.2.761.g0e0b3e54be In-Reply-To: <20200604074606.266213-1-david@fromorbit.com> References: <20200604074606.266213-1-david@fromorbit.com> MIME-Version: 1.0 X-Optus-CM-Score: 0 X-Optus-CM-Analysis: v=2.3 cv=QIgWuTDL c=1 sm=1 tr=0 a=k3aV/LVJup6ZGWgigO6cSA==:117 a=k3aV/LVJup6ZGWgigO6cSA==:17 a=nTHF0DUjJn0A:10 a=20KFwNOVAAAA:8 a=yPCof4ZbAAAA:8 a=wwBL6EluFSDcluHKyHoA:9 Sender: linux-xfs-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-xfs@vger.kernel.org From: Dave Chinner All background reclaim is SYNC_TRYLOCK already, and even blocking reclaim (SYNC_WAIT) can use trylock mechanisms as xfs_reclaim_inodes_ag() will keep cycling until there are no more reclaimable inodes. Hence we can kill SYNC_TRYLOCK from inode reclaim and make everything unconditionally non-blocking. Signed-off-by: Dave Chinner Reviewed-by: Darrick J. Wong Reviewed-by: Brian Foster --- fs/xfs/xfs_icache.c | 29 ++++++++++++----------------- 1 file changed, 12 insertions(+), 17 deletions(-) diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c index d1c47a0e0b0ec..ebe55124d6cb8 100644 --- a/fs/xfs/xfs_icache.c +++ b/fs/xfs/xfs_icache.c @@ -174,7 +174,7 @@ xfs_reclaim_worker( struct xfs_mount *mp = container_of(to_delayed_work(work), struct xfs_mount, m_reclaim_work); - xfs_reclaim_inodes(mp, SYNC_TRYLOCK); + xfs_reclaim_inodes(mp, 0); xfs_reclaim_work_queue(mp); } @@ -1030,10 +1030,9 @@ xfs_cowblocks_worker( * Grab the inode for reclaim exclusively. * Return 0 if we grabbed it, non-zero otherwise. */ -STATIC int +static int xfs_reclaim_inode_grab( - struct xfs_inode *ip, - int flags) + struct xfs_inode *ip) { ASSERT(rcu_read_lock_held()); @@ -1042,12 +1041,10 @@ xfs_reclaim_inode_grab( return 1; /* - * If we are asked for non-blocking operation, do unlocked checks to - * see if the inode already is being flushed or in reclaim to avoid - * lock traffic. + * Do unlocked checks to see if the inode already is being flushed or in + * reclaim to avoid lock traffic. 
*/ - if ((flags & SYNC_TRYLOCK) && - __xfs_iflags_test(ip, XFS_IFLOCK | XFS_IRECLAIM)) + if (__xfs_iflags_test(ip, XFS_IFLOCK | XFS_IRECLAIM)) return 1; /* @@ -1114,8 +1111,7 @@ xfs_reclaim_inode_grab( static bool xfs_reclaim_inode( struct xfs_inode *ip, - struct xfs_perag *pag, - int sync_mode) + struct xfs_perag *pag) { xfs_ino_t ino = ip->i_ino; /* for radix_tree_delete */ @@ -1209,7 +1205,6 @@ xfs_reclaim_inode( static int xfs_reclaim_inodes_ag( struct xfs_mount *mp, - int flags, int *nr_to_scan) { struct xfs_perag *pag; @@ -1254,7 +1249,7 @@ xfs_reclaim_inodes_ag( for (i = 0; i < nr_found; i++) { struct xfs_inode *ip = batch[i]; - if (done || xfs_reclaim_inode_grab(ip, flags)) + if (done || xfs_reclaim_inode_grab(ip)) batch[i] = NULL; /* @@ -1285,7 +1280,7 @@ xfs_reclaim_inodes_ag( for (i = 0; i < nr_found; i++) { if (!batch[i]) continue; - if (!xfs_reclaim_inode(batch[i], pag, flags)) + if (!xfs_reclaim_inode(batch[i], pag)) skipped++; } @@ -1311,13 +1306,13 @@ xfs_reclaim_inodes( int nr_to_scan = INT_MAX; int skipped; - xfs_reclaim_inodes_ag(mp, mode, &nr_to_scan); + xfs_reclaim_inodes_ag(mp, &nr_to_scan); if (!(mode & SYNC_WAIT)) return 0; do { xfs_ail_push_all_sync(mp->m_ail); - skipped = xfs_reclaim_inodes_ag(mp, mode, &nr_to_scan); + skipped = xfs_reclaim_inodes_ag(mp, &nr_to_scan); } while (skipped > 0); return 0; @@ -1341,7 +1336,7 @@ xfs_reclaim_inodes_nr( xfs_reclaim_work_queue(mp); xfs_ail_push_all(mp->m_ail); - xfs_reclaim_inodes_ag(mp, SYNC_TRYLOCK, &nr_to_scan); + xfs_reclaim_inodes_ag(mp, &nr_to_scan); return 0; } From patchwork Thu Jun 4 07:45:58 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Dave Chinner X-Patchwork-Id: 11587253 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id BBF4E1392 for ; Thu, 4 Jun 2020 07:46:20 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id AD0AA2072E for ; Thu, 4 Jun 2020 07:46:20 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1727881AbgFDHqU (ORCPT ); Thu, 4 Jun 2020 03:46:20 -0400 Received: from mail106.syd.optusnet.com.au ([211.29.132.42]:50240 "EHLO mail106.syd.optusnet.com.au" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1727944AbgFDHqR (ORCPT ); Thu, 4 Jun 2020 03:46:17 -0400 Received: from dread.disaster.area (pa49-180-124-177.pa.nsw.optusnet.com.au [49.180.124.177]) by mail106.syd.optusnet.com.au (Postfix) with ESMTPS id E5AB85AAE5E for ; Thu, 4 Jun 2020 17:46:14 +1000 (AEST) Received: from discord.disaster.area ([192.168.253.110]) by dread.disaster.area with esmtp (Exim 4.92.3) (envelope-from ) id 1jgkZk-0004Ap-BG for linux-xfs@vger.kernel.org; Thu, 04 Jun 2020 17:46:08 +1000 Received: from dave by discord.disaster.area with local (Exim 4.93) (envelope-from ) id 1jgkZk-0017IF-2A for linux-xfs@vger.kernel.org; Thu, 04 Jun 2020 17:46:08 +1000 From: Dave Chinner To: linux-xfs@vger.kernel.org Subject: [PATCH 22/30] xfs: remove SYNC_WAIT from xfs_reclaim_inodes() Date: Thu, 4 Jun 2020 17:45:58 +1000 Message-Id: <20200604074606.266213-23-david@fromorbit.com> X-Mailer: git-send-email 2.26.2.761.g0e0b3e54be In-Reply-To: <20200604074606.266213-1-david@fromorbit.com> References: <20200604074606.266213-1-david@fromorbit.com> MIME-Version: 1.0 X-Optus-CM-Score: 0 X-Optus-CM-Analysis: v=2.3 cv=X6os11be c=1 sm=1 tr=0 
a=k3aV/LVJup6ZGWgigO6cSA==:117 a=k3aV/LVJup6ZGWgigO6cSA==:17 a=nTHF0DUjJn0A:10 a=20KFwNOVAAAA:8 a=yPCof4ZbAAAA:8 a=oUvfQH9j-zolZsU4VWYA:9 a=DiKeHqHhRZ4A:10 Sender: linux-xfs-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-xfs@vger.kernel.org From: Dave Chinner Clean up xfs_reclaim_inodes() callers. Most callers want blocking behaviour, so just make the existing SYNC_WAIT behaviour the default. For the xfs_reclaim_worker(), just call xfs_reclaim_inodes_ag() directly because we just want optimistic clean inode reclaim to be done in the background. For xfs_quiesce_attr() we can just remove the inode reclaim calls as they are a historic relic that was required to flush dirty inodes that contained unlogged changes. We now log all changes to the inodes, so the sync AIL push from xfs_log_quiesce() called by xfs_quiesce_attr() will do all the required inode writeback for freeze. Signed-off-by: Dave Chinner Reviewed-by: Darrick J. Wong Reviewed-by: Brian Foster --- fs/xfs/xfs_icache.c | 48 ++++++++++++++++++++------------------------- fs/xfs/xfs_icache.h | 2 +- fs/xfs/xfs_mount.c | 11 +++++------ fs/xfs/xfs_super.c | 3 --- 4 files changed, 27 insertions(+), 37 deletions(-) diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c index ebe55124d6cb8..a27470fc201ff 100644 --- a/fs/xfs/xfs_icache.c +++ b/fs/xfs/xfs_icache.c @@ -160,24 +160,6 @@ xfs_reclaim_work_queue( rcu_read_unlock(); } -/* - * This is a fast pass over the inode cache to try to get reclaim moving on as - * many inodes as possible in a short period of time. It kicks itself every few - * seconds, as well as being kicked by the inode cache shrinker when memory - * goes low. It scans as quickly as possible avoiding locked inodes or those - * already being flushed, and once done schedules a future pass. - */ -void -xfs_reclaim_worker( - struct work_struct *work) -{ - struct xfs_mount *mp = container_of(to_delayed_work(work), - struct xfs_mount, m_reclaim_work); - - xfs_reclaim_inodes(mp, 0); - xfs_reclaim_work_queue(mp); -} - static void xfs_perag_set_reclaim_tag( struct xfs_perag *pag) @@ -1298,24 +1280,17 @@ xfs_reclaim_inodes_ag( return skipped; } -int +void xfs_reclaim_inodes( - xfs_mount_t *mp, - int mode) + struct xfs_mount *mp) { int nr_to_scan = INT_MAX; int skipped; - xfs_reclaim_inodes_ag(mp, &nr_to_scan); - if (!(mode & SYNC_WAIT)) - return 0; - do { xfs_ail_push_all_sync(mp->m_ail); skipped = xfs_reclaim_inodes_ag(mp, &nr_to_scan); } while (skipped > 0); - - return 0; } /* @@ -1434,6 +1409,25 @@ xfs_inode_matches_eofb( return true; } +/* + * This is a fast pass over the inode cache to try to get reclaim moving on as + * many inodes as possible in a short period of time. It kicks itself every few + * seconds, as well as being kicked by the inode cache shrinker when memory + * goes low. It scans as quickly as possible avoiding locked inodes or those + * already being flushed, and once done schedules a future pass. 
+ */ +void +xfs_reclaim_worker( + struct work_struct *work) +{ + struct xfs_mount *mp = container_of(to_delayed_work(work), + struct xfs_mount, m_reclaim_work); + int nr_to_scan = INT_MAX; + + xfs_reclaim_inodes_ag(mp, &nr_to_scan); + xfs_reclaim_work_queue(mp); +} + STATIC int xfs_inode_free_eofblocks( struct xfs_inode *ip, diff --git a/fs/xfs/xfs_icache.h b/fs/xfs/xfs_icache.h index 93b54e7d55f0d..ae92ca53de423 100644 --- a/fs/xfs/xfs_icache.h +++ b/fs/xfs/xfs_icache.h @@ -51,7 +51,7 @@ void xfs_inode_free(struct xfs_inode *ip); void xfs_reclaim_worker(struct work_struct *work); -int xfs_reclaim_inodes(struct xfs_mount *mp, int mode); +void xfs_reclaim_inodes(struct xfs_mount *mp); int xfs_reclaim_inodes_count(struct xfs_mount *mp); long xfs_reclaim_inodes_nr(struct xfs_mount *mp, int nr_to_scan); diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c index 03158b42a1943..c8ae49a1e99c3 100644 --- a/fs/xfs/xfs_mount.c +++ b/fs/xfs/xfs_mount.c @@ -1011,7 +1011,7 @@ xfs_mountfs( * quota inodes. */ cancel_delayed_work_sync(&mp->m_reclaim_work); - xfs_reclaim_inodes(mp, SYNC_WAIT); + xfs_reclaim_inodes(mp); xfs_health_unmount(mp); out_log_dealloc: mp->m_flags |= XFS_MOUNT_UNMOUNTING; @@ -1088,13 +1088,12 @@ xfs_unmountfs( xfs_ail_push_all_sync(mp->m_ail); /* - * And reclaim all inodes. At this point there should be no dirty - * inodes and none should be pinned or locked, but use synchronous - * reclaim just to be sure. We can stop background inode reclaim - * here as well if it is still running. + * Reclaim all inodes. At this point there should be no dirty inodes and + * none should be pinned or locked. Stop background inode reclaim here + * if it is still running. */ cancel_delayed_work_sync(&mp->m_reclaim_work); - xfs_reclaim_inodes(mp, SYNC_WAIT); + xfs_reclaim_inodes(mp); xfs_health_unmount(mp); xfs_qm_unmount(mp); diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c index fa58cb07c8fdf..9b03ea43f4fe7 100644 --- a/fs/xfs/xfs_super.c +++ b/fs/xfs/xfs_super.c @@ -890,9 +890,6 @@ xfs_quiesce_attr( /* force the log to unpin objects from the now complete transactions */ xfs_log_force(mp, XFS_LOG_SYNC); - /* reclaim inodes to do any IO before the freeze completes */ - xfs_reclaim_inodes(mp, 0); - xfs_reclaim_inodes(mp, SYNC_WAIT); /* Push the superblock and write an unmount record */ error = xfs_log_sbcount(mp); From patchwork Thu Jun 4 07:45:59 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Dave Chinner X-Patchwork-Id: 11587249 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 11D1F913 for ; Thu, 4 Jun 2020 07:46:20 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 03332206C3 for ; Thu, 4 Jun 2020 07:46:20 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1727990AbgFDHqR (ORCPT ); Thu, 4 Jun 2020 03:46:17 -0400 Received: from mail107.syd.optusnet.com.au ([211.29.132.53]:48984 "EHLO mail107.syd.optusnet.com.au" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1727898AbgFDHqQ (ORCPT ); Thu, 4 Jun 2020 03:46:16 -0400 Received: from dread.disaster.area (pa49-180-124-177.pa.nsw.optusnet.com.au [49.180.124.177]) by mail107.syd.optusnet.com.au (Postfix) with ESMTPS id 67E2BD58F24 for ; Thu, 4 Jun 2020 17:46:13 +1000 (AEST) Received: from discord.disaster.area ([192.168.253.110]) by dread.disaster.area 
with esmtp (Exim 4.92.3) (envelope-from ) id 1jgkZk-0004As-CZ for linux-xfs@vger.kernel.org; Thu, 04 Jun 2020 17:46:08 +1000 Received: from dave by discord.disaster.area with local (Exim 4.93) (envelope-from ) id 1jgkZk-0017IK-3r for linux-xfs@vger.kernel.org; Thu, 04 Jun 2020 17:46:08 +1000 From: Dave Chinner To: linux-xfs@vger.kernel.org Subject: [PATCH 23/30] xfs: clean up inode reclaim comments Date: Thu, 4 Jun 2020 17:45:59 +1000 Message-Id: <20200604074606.266213-24-david@fromorbit.com> X-Mailer: git-send-email 2.26.2.761.g0e0b3e54be In-Reply-To: <20200604074606.266213-1-david@fromorbit.com> References: <20200604074606.266213-1-david@fromorbit.com> MIME-Version: 1.0 X-Optus-CM-Score: 0 X-Optus-CM-Analysis: v=2.3 cv=X6os11be c=1 sm=1 tr=0 a=k3aV/LVJup6ZGWgigO6cSA==:117 a=k3aV/LVJup6ZGWgigO6cSA==:17 a=nTHF0DUjJn0A:10 a=20KFwNOVAAAA:8 a=yPCof4ZbAAAA:8 a=hsgbSS1IDo3W-jZ95UsA:9 a=pWZLxvMJ_RWUTdsa:21 a=O3O4emV_o1Te1xPj:21 Sender: linux-xfs-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-xfs@vger.kernel.org From: Dave Chinner Inode reclaim is quite different now to the way described in various comments, so update all the comments explaining what it does and how it works. Signed-off-by: Dave Chinner Reviewed-by: Darrick J. Wong Reviewed-by: Brian Foster --- fs/xfs/xfs_icache.c | 128 ++++++++++++-------------------------------- 1 file changed, 35 insertions(+), 93 deletions(-) diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c index a27470fc201ff..4fe6f250e8448 100644 --- a/fs/xfs/xfs_icache.c +++ b/fs/xfs/xfs_icache.c @@ -141,11 +141,8 @@ xfs_inode_free( } /* - * Queue a new inode reclaim pass if there are reclaimable inodes and there - * isn't a reclaim pass already in progress. By default it runs every 5s based - * on the xfs periodic sync default of 30s. Perhaps this should have it's own - * tunable, but that can be done if this method proves to be ineffective or too - * aggressive. + * Queue background inode reclaim work if there are reclaimable inodes and there + * isn't reclaim work already scheduled or in progress. */ static void xfs_reclaim_work_queue( @@ -600,48 +597,31 @@ xfs_iget_cache_miss( } /* - * Look up an inode by number in the given file system. - * The inode is looked up in the cache held in each AG. - * If the inode is found in the cache, initialise the vfs inode - * if necessary. + * Look up an inode by number in the given file system. The inode is looked up + * in the cache held in each AG. If the inode is found in the cache, initialise + * the vfs inode if necessary. * - * If it is not in core, read it in from the file system's device, - * add it to the cache and initialise the vfs inode. + * If it is not in core, read it in from the file system's device, add it to the + * cache and initialise the vfs inode. * * The inode is locked according to the value of the lock_flags parameter. - * This flag parameter indicates how and if the inode's IO lock and inode lock - * should be taken. - * - * mp -- the mount point structure for the current file system. It points - * to the inode hash table. - * tp -- a pointer to the current transaction if there is one. This is - * simply passed through to the xfs_iread() call. - * ino -- the number of the inode desired. This is the unique identifier - * within the file system for the inode being requested. - * lock_flags -- flags indicating how to lock the inode. See the comment - * for xfs_ilock() for a list of valid values. 
+ * Inode lookup is only done during metadata operations and not as part of the + * data IO path. Hence we only allow locking of the XFS_ILOCK during lookup. */ int xfs_iget( - xfs_mount_t *mp, - xfs_trans_t *tp, - xfs_ino_t ino, - uint flags, - uint lock_flags, - xfs_inode_t **ipp) + struct xfs_mount *mp, + struct xfs_trans *tp, + xfs_ino_t ino, + uint flags, + uint lock_flags, + struct xfs_inode **ipp) { - xfs_inode_t *ip; - int error; - xfs_perag_t *pag; - xfs_agino_t agino; + struct xfs_inode *ip; + struct xfs_perag *pag; + xfs_agino_t agino; + int error; - /* - * xfs_reclaim_inode() uses the ILOCK to ensure an inode - * doesn't get freed while it's being referenced during a - * radix tree traversal here. It assumes this function - * aqcuires only the ILOCK (and therefore it has no need to - * involve the IOLOCK in this synchronization). - */ ASSERT((lock_flags & (XFS_IOLOCK_EXCL | XFS_IOLOCK_SHARED)) == 0); /* reject inode numbers outside existing AGs */ @@ -758,15 +738,7 @@ xfs_inode_walk_ag_grab( ASSERT(rcu_read_lock_held()); - /* - * check for stale RCU freed inode - * - * If the inode has been reallocated, it doesn't matter if it's not in - * the AG we are walking - we are walking for writeback, so if it - * passes all the "valid inode" checks and is dirty, then we'll write - * it back anyway. If it has been reallocated and still being - * initialised, the XFS_INEW check below will catch it. - */ + /* Check for stale RCU freed inode */ spin_lock(&ip->i_flags_lock); if (!ip->i_ino) goto out_unlock_noent; @@ -1052,43 +1024,16 @@ xfs_reclaim_inode_grab( } /* - * Inodes in different states need to be treated differently. The following - * table lists the inode states and the reclaim actions necessary: - * - * inode state iflush ret required action - * --------------- ---------- --------------- - * bad - reclaim - * shutdown EIO unpin and reclaim - * clean, unpinned 0 reclaim - * stale, unpinned 0 reclaim - * clean, pinned(*) 0 requeue - * stale, pinned EAGAIN requeue - * dirty, async - requeue - * dirty, sync 0 reclaim + * Inode reclaim is non-blocking, so the default action if progress cannot be + * made is to "requeue" the inode for reclaim by unlocking it and clearing the + * XFS_IRECLAIM flag. If we are in a shutdown state, we don't care about + * blocking anymore and hence we can wait for the inode to be able to reclaim + * it. * - * (*) dgc: I don't think the clean, pinned state is possible but it gets - * handled anyway given the order of checks implemented. - * - * Also, because we get the flush lock first, we know that any inode that has - * been flushed delwri has had the flush completed by the time we check that - * the inode is clean. - * - * Note that because the inode is flushed delayed write by AIL pushing, the - * flush lock may already be held here and waiting on it can result in very - * long latencies. Hence for sync reclaims, where we wait on the flush lock, - * the caller should push the AIL first before trying to reclaim inodes to - * minimise the amount of time spent waiting. For background relaim, we only - * bother to reclaim clean inodes anyway. 
- * - * Hence the order of actions after gaining the locks should be: - * bad => reclaim - * shutdown => unpin and reclaim - * pinned, async => requeue - * pinned, sync => unpin - * stale => reclaim - * clean => reclaim - * dirty, async => requeue - * dirty, sync => flush, wait and reclaim + * We do no IO here - if callers require inodes to be cleaned they must push the + * AIL first to trigger writeback of dirty inodes. This enables writeback to be + * done in the background in a non-blocking manner, and enables memory reclaim + * to make progress without blocking. */ static bool xfs_reclaim_inode( @@ -1294,13 +1239,11 @@ xfs_reclaim_inodes( } /* - * Scan a certain number of inodes for reclaim. - * - * When called we make sure that there is a background (fast) inode reclaim in - * progress, while we will throttle the speed of reclaim via doing synchronous - * reclaim of inodes. That means if we come across dirty inodes, we wait for - * them to be cleaned, which we hope will not be very long due to the - * background walker having already kicked the IO off on those dirty inodes. + * The shrinker infrastructure determines how many inodes we should scan for + * reclaim. We want as many clean inodes ready to reclaim as possible, so we + * push the AIL here. We also want to proactively free up memory if we can to + * minimise the amount of work memory reclaim has to do so we kick the + * background reclaim if it isn't already scheduled. */ long xfs_reclaim_inodes_nr( @@ -1413,8 +1356,7 @@ xfs_inode_matches_eofb( * This is a fast pass over the inode cache to try to get reclaim moving on as * many inodes as possible in a short period of time. It kicks itself every few * seconds, as well as being kicked by the inode cache shrinker when memory - * goes low. It scans as quickly as possible avoiding locked inodes or those - * already being flushed, and once done schedules a future pass. + * goes low. 
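[To make the non-blocking reclaim model described in the rewritten comments concrete, a rough sketch of the requeue-on-contention shape follows. The helper names are the real ones; the function itself is hypothetical and simplified, not the actual xfs_reclaim_inode() implementation.

	/* Return true if reclaimed, false if requeued for a later pass. */
	static bool
	try_reclaim_inode(			/* illustrative only */
		struct xfs_inode	*ip)
	{
		if (!xfs_ilock_nowait(ip, XFS_ILOCK_EXCL))
			goto requeue;
		if (!xfs_iflock_nowait(ip))
			goto requeue_iunlock;
		if (xfs_ipincount(ip) || !xfs_inode_clean(ip)) {
			/* pinned or dirty - we do no IO here, try later */
			xfs_ifunlock(ip);
			goto requeue_iunlock;
		}
		xfs_ifunlock(ip);
		xfs_iunlock(ip, XFS_ILOCK_EXCL);
		/* ... remove from the perag tree and free the inode ... */
		return true;

	requeue_iunlock:
		xfs_iunlock(ip, XFS_ILOCK_EXCL);
	requeue:
		xfs_iflags_clear(ip, XFS_IRECLAIM);
		return false;
	}
]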
*/ void xfs_reclaim_worker( From patchwork Thu Jun 4 07:46:00 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Dave Chinner X-Patchwork-Id: 11587285 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 20882913 for ; Thu, 4 Jun 2020 07:46:27 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 0E3BC2072E for ; Thu, 4 Jun 2020 07:46:27 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1728025AbgFDHq0 (ORCPT ); Thu, 4 Jun 2020 03:46:26 -0400 Received: from mail105.syd.optusnet.com.au ([211.29.132.249]:41578 "EHLO mail105.syd.optusnet.com.au" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1728043AbgFDHqZ (ORCPT ); Thu, 4 Jun 2020 03:46:25 -0400 Received: from dread.disaster.area (pa49-180-124-177.pa.nsw.optusnet.com.au [49.180.124.177]) by mail105.syd.optusnet.com.au (Postfix) with ESMTPS id E3B733A3D34 for ; Thu, 4 Jun 2020 17:46:14 +1000 (AEST) Received: from discord.disaster.area ([192.168.253.110]) by dread.disaster.area with esmtp (Exim 4.92.3) (envelope-from ) id 1jgkZk-0004Av-Dv for linux-xfs@vger.kernel.org; Thu, 04 Jun 2020 17:46:08 +1000 Received: from dave by discord.disaster.area with local (Exim 4.93) (envelope-from ) id 1jgkZk-0017IP-5S for linux-xfs@vger.kernel.org; Thu, 04 Jun 2020 17:46:08 +1000 From: Dave Chinner To: linux-xfs@vger.kernel.org Subject: [PATCH 24/30] xfs: rework stale inodes in xfs_ifree_cluster Date: Thu, 4 Jun 2020 17:46:00 +1000 Message-Id: <20200604074606.266213-25-david@fromorbit.com> X-Mailer: git-send-email 2.26.2.761.g0e0b3e54be In-Reply-To: <20200604074606.266213-1-david@fromorbit.com> References: <20200604074606.266213-1-david@fromorbit.com> MIME-Version: 1.0 X-Optus-CM-Score: 0 X-Optus-CM-Analysis: v=2.3 cv=QIgWuTDL c=1 sm=1 tr=0 a=k3aV/LVJup6ZGWgigO6cSA==:117 a=k3aV/LVJup6ZGWgigO6cSA==:17 a=nTHF0DUjJn0A:10 a=20KFwNOVAAAA:8 a=yPCof4ZbAAAA:8 a=sEHBG_JH86uubuPEzfwA:9 a=EmdybhaG4RS5P0Sq:21 a=QT12gNWgP9Hs2AXr:21 Sender: linux-xfs-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-xfs@vger.kernel.org From: Dave Chinner Once we have inodes pinning the cluster buffer and attached whenever they are dirty, we no longer have a guarantee that the items are flush locked when we lock the cluster buffer. Hence we cannot just walk the buffer log item list and modify the attached inodes. If the inode is not flush locked, we have to ILOCK it first and then flush lock it to do all the prerequisite checks needed to avoid races with other code. This is already handled by xfs_ifree_get_one_inode(), so rework the inode iteration loop and function to update all inodes in cache whether they are attached to the buffer or not. Note: we also remove the copying of the log item lsn to the ili_flush_lsn as xfs_iflush_done() now uses the XFS_ISTALE flag to trigger aborts and so flush lsn matching is not needed in IO completion for processing freed inodes. Signed-off-by: Dave Chinner Reviewed-by: Darrick J. 
Wong Reviewed-by: Brian Foster --- fs/xfs/xfs_inode.c | 158 ++++++++++++++++++--------------------------- 1 file changed, 62 insertions(+), 96 deletions(-) diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c index 272b54cf97000..fb4c614c64fda 100644 --- a/fs/xfs/xfs_inode.c +++ b/fs/xfs/xfs_inode.c @@ -2517,17 +2517,19 @@ xfs_iunlink_remove( } /* - * Look up the inode number specified and mark it stale if it is found. If it is - * dirty, return the inode so it can be attached to the cluster buffer so it can - * be processed appropriately when the cluster free transaction completes. + * Look up the inode number specified and if it is not already marked XFS_ISTALE + * mark it stale. We should only find clean inodes in this lookup that aren't + * already stale. */ -static struct xfs_inode * -xfs_ifree_get_one_inode( - struct xfs_perag *pag, +static void +xfs_ifree_mark_inode_stale( + struct xfs_buf *bp, struct xfs_inode *free_ip, xfs_ino_t inum) { - struct xfs_mount *mp = pag->pag_mount; + struct xfs_mount *mp = bp->b_mount; + struct xfs_perag *pag = bp->b_pag; + struct xfs_inode_log_item *iip; struct xfs_inode *ip; retry: @@ -2535,8 +2537,10 @@ xfs_ifree_get_one_inode( ip = radix_tree_lookup(&pag->pag_ici_root, XFS_INO_TO_AGINO(mp, inum)); /* Inode not in memory, nothing to do */ - if (!ip) - goto out_rcu_unlock; + if (!ip) { + rcu_read_unlock(); + return; + } /* * because this is an RCU protected lookup, we could find a recently @@ -2547,9 +2551,9 @@ xfs_ifree_get_one_inode( spin_lock(&ip->i_flags_lock); if (ip->i_ino != inum || __xfs_iflags_test(ip, XFS_ISTALE)) { spin_unlock(&ip->i_flags_lock); - goto out_rcu_unlock; + rcu_read_unlock(); + return; } - spin_unlock(&ip->i_flags_lock); /* * Don't try to lock/unlock the current inode, but we _cannot_ skip the @@ -2559,43 +2563,53 @@ xfs_ifree_get_one_inode( */ if (ip != free_ip) { if (!xfs_ilock_nowait(ip, XFS_ILOCK_EXCL)) { + spin_unlock(&ip->i_flags_lock); rcu_read_unlock(); delay(1); goto retry; } - - /* - * Check the inode number again in case we're racing with - * freeing in xfs_reclaim_inode(). See the comments in that - * function for more information as to why the initial check is - * not sufficient. - */ - if (ip->i_ino != inum) { - xfs_iunlock(ip, XFS_ILOCK_EXCL); - goto out_rcu_unlock; - } } + ip->i_flags |= XFS_ISTALE; + spin_unlock(&ip->i_flags_lock); rcu_read_unlock(); - xfs_iflock(ip); - xfs_iflags_set(ip, XFS_ISTALE); + /* + * If we can't get the flush lock, the inode is already attached. All + * we needed to do here is mark the inode stale so buffer IO completion + * will remove it from the AIL. + */ + iip = ip->i_itemp; + if (!xfs_iflock_nowait(ip)) { + ASSERT(!list_empty(&iip->ili_item.li_bio_list)); + ASSERT(iip->ili_last_fields); + goto out_iunlock; + } + ASSERT(!iip || list_empty(&iip->ili_item.li_bio_list)); /* - * We don't need to attach clean inodes or those only with unlogged - * changes (which we throw away, anyway). + * Clean inodes can be released immediately. Everything else has to go + * through xfs_iflush_abort() on journal commit as the flock + * synchronises removal of the inode from the cluster buffer against + * inode reclaim. */ - if (!ip->i_itemp || xfs_inode_clean(ip)) { - ASSERT(ip != free_ip); + if (xfs_inode_clean(ip)) { xfs_ifunlock(ip); - xfs_iunlock(ip, XFS_ILOCK_EXCL); - goto out_no_inode; + goto out_iunlock; } - return ip; -out_rcu_unlock: - rcu_read_unlock(); -out_no_inode: - return NULL; + /* we have a dirty inode in memory that has not yet been flushed. 
*/ + ASSERT(iip->ili_fields); + spin_lock(&iip->ili_lock); + iip->ili_last_fields = iip->ili_fields; + iip->ili_fields = 0; + iip->ili_fsync_fields = 0; + spin_unlock(&iip->ili_lock); + list_add_tail(&iip->ili_item.li_bio_list, &bp->b_li_list); + ASSERT(iip->ili_last_fields); + +out_iunlock: + if (ip != free_ip) + xfs_iunlock(ip, XFS_ILOCK_EXCL); } /* @@ -2605,26 +2619,20 @@ xfs_ifree_get_one_inode( */ STATIC int xfs_ifree_cluster( - xfs_inode_t *free_ip, - xfs_trans_t *tp, + struct xfs_inode *free_ip, + struct xfs_trans *tp, struct xfs_icluster *xic) { - xfs_mount_t *mp = free_ip->i_mount; + struct xfs_mount *mp = free_ip->i_mount; + struct xfs_ino_geometry *igeo = M_IGEO(mp); + struct xfs_buf *bp; + xfs_daddr_t blkno; + xfs_ino_t inum = xic->first_ino; int nbufs; int i, j; int ioffset; - xfs_daddr_t blkno; - xfs_buf_t *bp; - xfs_inode_t *ip; - struct xfs_inode_log_item *iip; - struct xfs_log_item *lip; - struct xfs_perag *pag; - struct xfs_ino_geometry *igeo = M_IGEO(mp); - xfs_ino_t inum; int error; - inum = xic->first_ino; - pag = xfs_perag_get(mp, XFS_INO_TO_AGNO(mp, inum)); nbufs = igeo->ialloc_blks / igeo->blocks_per_cluster; for (j = 0; j < nbufs; j++, inum += igeo->inodes_per_cluster) { @@ -2668,59 +2676,16 @@ xfs_ifree_cluster( bp->b_ops = &xfs_inode_buf_ops; /* - * Walk the inodes already attached to the buffer and mark them - * stale. These will all have the flush locks held, so an - * in-memory inode walk can't lock them. By marking them all - * stale first, we will not attempt to lock them in the loop - * below as the XFS_ISTALE flag will be set. - */ - list_for_each_entry(lip, &bp->b_li_list, li_bio_list) { - if (lip->li_type == XFS_LI_INODE) { - iip = (struct xfs_inode_log_item *)lip; - xfs_trans_ail_copy_lsn(mp->m_ail, - &iip->ili_flush_lsn, - &iip->ili_item.li_lsn); - xfs_iflags_set(iip->ili_inode, XFS_ISTALE); - } - } - - - /* - * For each inode in memory attempt to add it to the inode - * buffer and set it up for being staled on buffer IO - * completion. This is safe as we've locked out tail pushing - * and flushing by locking the buffer. - * - * We have already marked every inode that was part of a - * transaction stale above, which means there is no point in - * even trying to lock them. + * Now we need to set all the cached clean inodes as XFS_ISTALE, + * too. This requires lookups, and will skip inodes that we've + * already marked XFS_ISTALE. 
*/ - for (i = 0; i < igeo->inodes_per_cluster; i++) { - ip = xfs_ifree_get_one_inode(pag, free_ip, inum + i); - if (!ip) - continue; - - iip = ip->i_itemp; - spin_lock(&iip->ili_lock); - iip->ili_last_fields = iip->ili_fields; - iip->ili_fields = 0; - iip->ili_fsync_fields = 0; - spin_unlock(&iip->ili_lock); - xfs_trans_ail_copy_lsn(mp->m_ail, &iip->ili_flush_lsn, - &iip->ili_item.li_lsn); - - list_add_tail(&iip->ili_item.li_bio_list, - &bp->b_li_list); - - if (ip != free_ip) - xfs_iunlock(ip, XFS_ILOCK_EXCL); - } + for (i = 0; i < igeo->inodes_per_cluster; i++) + xfs_ifree_mark_inode_stale(bp, free_ip, inum + i); xfs_trans_stale_inode_buf(tp, bp); xfs_trans_binval(tp, bp); } - - xfs_perag_put(pag); return 0; } @@ -3845,6 +3810,7 @@ xfs_iflush_int( iip->ili_fields = 0; iip->ili_fsync_fields = 0; spin_unlock(&iip->ili_lock); + ASSERT(iip->ili_last_fields); /* * Store the current LSN of the inode so that we can tell whether the From patchwork Thu Jun 4 07:46:01 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Dave Chinner X-Patchwork-Id: 11587231 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id D161E175A for ; Thu, 4 Jun 2020 07:46:12 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id C10BF207D3 for ; Thu, 4 Jun 2020 07:46:12 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1726983AbgFDHqM (ORCPT ); Thu, 4 Jun 2020 03:46:12 -0400 Received: from mail104.syd.optusnet.com.au ([211.29.132.246]:33635 "EHLO mail104.syd.optusnet.com.au" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726967AbgFDHqM (ORCPT ); Thu, 4 Jun 2020 03:46:12 -0400 Received: from dread.disaster.area (pa49-180-124-177.pa.nsw.optusnet.com.au [49.180.124.177]) by mail104.syd.optusnet.com.au (Postfix) with ESMTPS id F10C382107C for ; Thu, 4 Jun 2020 17:46:08 +1000 (AEST) Received: from discord.disaster.area ([192.168.253.110]) by dread.disaster.area with esmtp (Exim 4.92.3) (envelope-from ) id 1jgkZk-0004Ay-Fr for linux-xfs@vger.kernel.org; Thu, 04 Jun 2020 17:46:08 +1000 Received: from dave by discord.disaster.area with local (Exim 4.93) (envelope-from ) id 1jgkZk-0017IU-6l for linux-xfs@vger.kernel.org; Thu, 04 Jun 2020 17:46:08 +1000 From: Dave Chinner To: linux-xfs@vger.kernel.org Subject: [PATCH 25/30] xfs: attach inodes to the cluster buffer when dirtied Date: Thu, 4 Jun 2020 17:46:01 +1000 Message-Id: <20200604074606.266213-26-david@fromorbit.com> X-Mailer: git-send-email 2.26.2.761.g0e0b3e54be In-Reply-To: <20200604074606.266213-1-david@fromorbit.com> References: <20200604074606.266213-1-david@fromorbit.com> MIME-Version: 1.0 X-Optus-CM-Score: 0 X-Optus-CM-Analysis: v=2.3 cv=X6os11be c=1 sm=1 tr=0 a=k3aV/LVJup6ZGWgigO6cSA==:117 a=k3aV/LVJup6ZGWgigO6cSA==:17 a=nTHF0DUjJn0A:10 a=20KFwNOVAAAA:8 a=yPCof4ZbAAAA:8 a=qUdrDcP34_7PpI3gcEgA:9 Sender: linux-xfs-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-xfs@vger.kernel.org From: Dave Chinner Rather than attach inodes to the cluster buffer just when we are doing IO, attach the inodes to the cluster buffer when they are dirtied. This means the buffer always carries a list of dirty inodes that reference it, and we can use that list to make more fundamental changes to inode writeback that aren't otherwise possible.
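To state the resulting invariant as code (a sketch only - the equivalent ASSERT is actually added to xfs_inode_item_release() by the next patch in the series): a dirty inode log item now always has a cluster buffer reference behind it:

	ASSERT(lip->li_buf || !test_bit(XFS_LI_DIRTY, &lip->li_flags));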
Signed-off-by: Dave Chinner Reviewed-by: Darrick J. Wong Reviewed-by: Brian Foster --- fs/xfs/libxfs/xfs_trans_inode.c | 9 ++++++--- fs/xfs/xfs_buf_item.c | 1 + fs/xfs/xfs_icache.c | 1 + fs/xfs/xfs_inode.c | 24 +++++------------------- fs/xfs/xfs_inode_item.c | 16 ++++++++++++++-- 5 files changed, 27 insertions(+), 24 deletions(-) diff --git a/fs/xfs/libxfs/xfs_trans_inode.c b/fs/xfs/libxfs/xfs_trans_inode.c index ad5974365c589..e15129647e00c 100644 --- a/fs/xfs/libxfs/xfs_trans_inode.c +++ b/fs/xfs/libxfs/xfs_trans_inode.c @@ -163,13 +163,16 @@ xfs_trans_log_inode( /* * We need an explicit buffer reference for the log item but * don't want the buffer to remain attached to the transaction. - * Hold the buffer but release the transaction reference. + * Hold the buffer but release the transaction reference once + * we've attached the inode log item to the buffer log item + * list. */ xfs_buf_hold(bp); - xfs_trans_brelse(tp, bp); - spin_lock(&iip->ili_lock); iip->ili_item.li_buf = bp; + bp->b_flags |= _XBF_INODES; + list_add_tail(&iip->ili_item.li_bio_list, &bp->b_li_list); + xfs_trans_brelse(tp, bp); } /* diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c index a8c5070376b21..e5c57c4a03f62 100644 --- a/fs/xfs/xfs_buf_item.c +++ b/fs/xfs/xfs_buf_item.c @@ -465,6 +465,7 @@ xfs_buf_item_unpin( if (bip->bli_flags & XFS_BLI_STALE_INODE) { xfs_buf_item_done(bp); xfs_iflush_done(bp); + ASSERT(list_empty(&bp->b_li_list)); } else { xfs_trans_ail_delete(lip, SHUTDOWN_LOG_IO_ERROR); xfs_buf_item_relse(bp); diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c index 4fe6f250e8448..ed386bc930977 100644 --- a/fs/xfs/xfs_icache.c +++ b/fs/xfs/xfs_icache.c @@ -115,6 +115,7 @@ __xfs_inode_free( { /* asserts to verify all state is correct here */ ASSERT(atomic_read(&ip->i_pincount) == 0); + ASSERT(!ip->i_itemp || list_empty(&ip->i_itemp->ili_item.li_bio_list)); XFS_STATS_DEC(ip->i_mount, vn_active); call_rcu(&VFS_I(ip)->i_rcu, xfs_inode_free_callback); diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c index fb4c614c64fda..af65acd24ec4e 100644 --- a/fs/xfs/xfs_inode.c +++ b/fs/xfs/xfs_inode.c @@ -2584,27 +2584,24 @@ xfs_ifree_mark_inode_stale( ASSERT(iip->ili_last_fields); goto out_iunlock; } - ASSERT(!iip || list_empty(&iip->ili_item.li_bio_list)); /* - * Clean inodes can be released immediately. Everything else has to go - * through xfs_iflush_abort() on journal commit as the flock - * synchronises removal of the inode from the cluster buffer against - * inode reclaim. + * Inodes not attached to the buffer can be released immediately. + * Everything else has to go through xfs_iflush_abort() on journal + * commit as the flock synchronises removal of the inode from the + * cluster buffer against inode reclaim. */ - if (xfs_inode_clean(ip)) { + if (!iip || list_empty(&iip->ili_item.li_bio_list)) { xfs_ifunlock(ip); goto out_iunlock; } /* we have a dirty inode in memory that has not yet been flushed. */ - ASSERT(iip->ili_fields); spin_lock(&iip->ili_lock); iip->ili_last_fields = iip->ili_fields; iip->ili_fields = 0; iip->ili_fsync_fields = 0; spin_unlock(&iip->ili_lock); - list_add_tail(&iip->ili_item.li_bio_list, &bp->b_li_list); ASSERT(iip->ili_last_fields); out_iunlock: @@ -3819,19 +3816,8 @@ xfs_iflush_int( xfs_trans_ail_copy_lsn(mp->m_ail, &iip->ili_flush_lsn, &iip->ili_item.li_lsn); - /* - * Attach the inode item callback to the buffer whether the flush - * succeeded or not. 
If not, the caller will shut down and fail I/O - * completion on the buffer to remove the inode from the AIL and release - * the flush lock. - */ - bp->b_flags |= _XBF_INODES; - list_add_tail(&iip->ili_item.li_bio_list, &bp->b_li_list); - /* generate the checksum. */ xfs_dinode_calc_crc(mp, dip); - - ASSERT(!list_empty(&bp->b_li_list)); return error; } diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c index 64bdda72f7b27..697248b7eb2be 100644 --- a/fs/xfs/xfs_inode_item.c +++ b/fs/xfs/xfs_inode_item.c @@ -660,6 +660,10 @@ xfs_inode_item_destroy( * list for other inodes that will run this function. We remove them from the * buffer list so we can process all the inode IO completions in one AIL lock * traversal. + * + * Note: Now that we attach the log item to the buffer when we first log the + * inode in memory, we can have unflushed inodes on the buffer list here. These + * inodes will have a zero ili_last_fields, so skip over them here. */ void xfs_iflush_done( @@ -677,12 +681,15 @@ xfs_iflush_done( */ list_for_each_entry_safe(lip, n, &bp->b_li_list, li_bio_list) { iip = INODE_ITEM(lip); + if (xfs_iflags_test(iip->ili_inode, XFS_ISTALE)) { - list_del_init(&lip->li_bio_list); xfs_iflush_abort(iip->ili_inode); continue; } + if (!iip->ili_last_fields) + continue; + list_move_tail(&lip->li_bio_list, &tmp); /* Do an unlocked check for needing the AIL lock. */ @@ -728,12 +735,16 @@ xfs_iflush_done( /* * Remove the reference to the cluster buffer if the inode is * clean in memory. Drop the buffer reference once we've dropped - * the locks we hold. + * the locks we hold. If the inode is dirty in memory, we need + * to put the inode item back on the buffer list for another + * pass through the flush machinery. */ ASSERT(iip->ili_item.li_buf == bp); if (!iip->ili_fields) { iip->ili_item.li_buf = NULL; drop_buffer = true; + } else { + list_add(&lip->li_bio_list, &bp->b_li_list); } iip->ili_last_fields = 0; iip->ili_flush_lsn = 0; @@ -777,6 +788,7 @@ xfs_iflush_abort( iip->ili_flush_lsn = 0; bp = iip->ili_item.li_buf; iip->ili_item.li_buf = NULL; + list_del_init(&iip->ili_item.li_bio_list); spin_unlock(&iip->ili_lock); } xfs_ifunlock(ip); From patchwork Thu Jun 4 07:46:02 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Dave Chinner X-Patchwork-Id: 11587247 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 22B48159A for ; Thu, 4 Jun 2020 07:46:18 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 0AC5B206C3 for ; Thu, 4 Jun 2020 07:46:18 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1727779AbgFDHqQ (ORCPT ); Thu, 4 Jun 2020 03:46:16 -0400 Received: from mail104.syd.optusnet.com.au ([211.29.132.246]:33881 "EHLO mail104.syd.optusnet.com.au" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1727777AbgFDHqQ (ORCPT ); Thu, 4 Jun 2020 03:46:16 -0400 Received: from dread.disaster.area (pa49-180-124-177.pa.nsw.optusnet.com.au [49.180.124.177]) by mail104.syd.optusnet.com.au (Postfix) with ESMTPS id 082078207AB for ; Thu, 4 Jun 2020 17:46:09 +1000 (AEST) Received: from discord.disaster.area ([192.168.253.110]) by dread.disaster.area with esmtp (Exim 4.92.3) (envelope-from ) id 1jgkZk-0004B1-Hl for linux-xfs@vger.kernel.org; Thu, 04 Jun 2020 17:46:08 +1000 Received: from dave by 
discord.disaster.area with local (Exim 4.93) (envelope-from ) id 1jgkZk-0017IZ-8V for linux-xfs@vger.kernel.org; Thu, 04 Jun 2020 17:46:08 +1000 From: Dave Chinner To: linux-xfs@vger.kernel.org Subject: [PATCH 26/30] xfs: xfs_iflush() is no longer necessary Date: Thu, 4 Jun 2020 17:46:02 +1000 Message-Id: <20200604074606.266213-27-david@fromorbit.com> X-Mailer: git-send-email 2.26.2.761.g0e0b3e54be In-Reply-To: <20200604074606.266213-1-david@fromorbit.com> References: <20200604074606.266213-1-david@fromorbit.com> MIME-Version: 1.0 X-Optus-CM-Score: 0 X-Optus-CM-Analysis: v=2.3 cv=W5xGqiek c=1 sm=1 tr=0 a=k3aV/LVJup6ZGWgigO6cSA==:117 a=k3aV/LVJup6ZGWgigO6cSA==:17 a=nTHF0DUjJn0A:10 a=20KFwNOVAAAA:8 a=yPCof4ZbAAAA:8 a=FPQKXuVpiB5N0kaoULwA:9 a=qR0udBaQOyvojzcw:21 a=ulzJpOM_6WnMERZw:21 Sender: linux-xfs-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-xfs@vger.kernel.org From: Dave Chinner Now we have a cached buffer on inode log items, we don't need to do buffer lookups when flushing inodes anymore - all we need to do is lock the buffer and we are ready to go. This largely gets rid of the need for xfs_iflush(), which is essentially just a mechanism to look up the buffer and flush the inode to it. Instead, we can just call xfs_iflush_cluster() with a few modifications to ensure it also flushes the inode we already hold locked. This allows the AIL inode item pushing to be almost entirely non-blocking in XFS - we won't block unless memory allocation for the cluster inode lookup blocks or the block device queues are full. Writeback during inode reclaim becomes a little more complex because we now have to lock the buffer ourselves, but otherwise this change is largely a functional no-op that removes a whole lot of code. Signed-off-by: Dave Chinner Reviewed-by: Darrick J. Wong --- fs/xfs/xfs_inode.c | 106 ++++++---------------------------------- fs/xfs/xfs_inode.h | 2 +- fs/xfs/xfs_inode_item.c | 54 +++++++++----------- 3 files changed, 37 insertions(+), 125 deletions(-) diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c index af65acd24ec4e..61c872e4ee157 100644 --- a/fs/xfs/xfs_inode.c +++ b/fs/xfs/xfs_inode.c @@ -3450,7 +3450,18 @@ xfs_rename( return error; } -STATIC int +/* + * Non-blocking flush of dirty inode metadata into the backing buffer. + * + * The caller must have a reference to the inode and hold the cluster buffer + * locked. The function will walk across all the inodes on the cluster buffer it + * can find and lock without blocking, and flush them to the cluster buffer. + * + * On success, the caller must write out the buffer returned in *bp and + * release it. On failure, the filesystem will be shut down, the buffer will + * have been unlocked and released, and EFSCORRUPTED will be returned. + */ +int xfs_iflush_cluster( struct xfs_inode *ip, struct xfs_buf *bp) @@ -3485,8 +3496,6 @@ xfs_iflush_cluster( for (i = 0; i < nr_found; i++) { cip = cilist[i]; - if (cip == ip) - continue; /* * because this is an RCU protected lookup, we could find a @@ -3577,99 +3586,11 @@ xfs_iflush_cluster( kmem_free(cilist); out_put: xfs_perag_put(pag); - return error; -} - -/* - * Flush dirty inode metadata into the backing buffer. - * - * The caller must have the inode lock and the inode flush lock held. The - * inode lock will still be held upon return to the caller, and the inode - * flush lock will be released after the inode has reached the disk. - * - * The caller must write out the buffer returned in *bpp and release it. 
- */ -int -xfs_iflush( - struct xfs_inode *ip, - struct xfs_buf **bpp) -{ - struct xfs_mount *mp = ip->i_mount; - struct xfs_buf *bp = NULL; - struct xfs_dinode *dip; - int error; - - XFS_STATS_INC(mp, xs_iflush_count); - - ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL|XFS_ILOCK_SHARED)); - ASSERT(xfs_isiflocked(ip)); - ASSERT(ip->i_df.if_format != XFS_DINODE_FMT_BTREE || - ip->i_df.if_nextents > XFS_IFORK_MAXEXT(ip, XFS_DATA_FORK)); - - *bpp = NULL; - - xfs_iunpin_wait(ip); - - /* - * For stale inodes we cannot rely on the backing buffer remaining - * stale in cache for the remaining life of the stale inode and so - * xfs_imap_to_bp() below may give us a buffer that no longer contains - * inodes below. We have to check this after ensuring the inode is - * unpinned so that it is safe to reclaim the stale inode after the - * flush call. - */ - if (xfs_iflags_test(ip, XFS_ISTALE)) { - xfs_ifunlock(ip); - return 0; - } - - /* - * Get the buffer containing the on-disk inode. We are doing a try-lock - * operation here, so we may get an EAGAIN error. In that case, return - * leaving the inode dirty. - * - * If we get any other error, we effectively have a corruption situation - * and we cannot flush the inode. Abort the flush and shut down. - */ - error = xfs_imap_to_bp(mp, NULL, &ip->i_imap, &dip, &bp, XBF_TRYLOCK); - if (error == -EAGAIN) { - xfs_ifunlock(ip); - return error; - } - if (error) - goto abort; - - /* - * If the buffer is pinned then push on the log now so we won't - * get stuck waiting in the write for too long. - */ - if (xfs_buf_ispinned(bp)) - xfs_log_force(mp, 0); - - /* - * Flush the provided inode then attempt to gather others from the - * cluster into the write. - * - * Note: Once we attempt to flush an inode, we must run buffer - * completion callbacks on any failure. If this fails, simulate an I/O - * failure on the buffer and shut down. 
- */ - error = xfs_iflush_int(ip, bp); - if (!error) - error = xfs_iflush_cluster(ip, bp); if (error) { bp->b_flags |= XBF_ASYNC; xfs_buf_ioend_fail(bp); - goto shutdown; + xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE); } - - *bpp = bp; - return 0; - -abort: - xfs_iflush_abort(ip); -shutdown: - xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE); return error; } @@ -3688,6 +3609,7 @@ xfs_iflush_int( ASSERT(ip->i_df.if_format != XFS_DINODE_FMT_BTREE || ip->i_df.if_nextents > XFS_IFORK_MAXEXT(ip, XFS_DATA_FORK)); ASSERT(iip != NULL && iip->ili_fields != 0); + ASSERT(iip->ili_item.li_buf == bp); dip = xfs_buf_offset(bp, ip->i_imap.im_boffset); diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h index dadcf19458960..d1109eb13ba2e 100644 --- a/fs/xfs/xfs_inode.h +++ b/fs/xfs/xfs_inode.h @@ -427,7 +427,7 @@ int xfs_log_force_inode(struct xfs_inode *ip); void xfs_iunpin_wait(xfs_inode_t *); #define xfs_ipincount(ip) ((unsigned int) atomic_read(&ip->i_pincount)) -int xfs_iflush(struct xfs_inode *, struct xfs_buf **); +int xfs_iflush_cluster(struct xfs_inode *, struct xfs_buf *); void xfs_lock_two_inodes(struct xfs_inode *ip0, uint ip0_mode, struct xfs_inode *ip1, uint ip1_mode); diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c index 697248b7eb2be..326547e89cb6b 100644 --- a/fs/xfs/xfs_inode_item.c +++ b/fs/xfs/xfs_inode_item.c @@ -485,53 +485,42 @@ xfs_inode_item_push( uint rval = XFS_ITEM_SUCCESS; int error; - if (xfs_ipincount(ip) > 0) + ASSERT(iip->ili_item.li_buf); + + if (xfs_ipincount(ip) > 0 || xfs_buf_ispinned(bp) || + (ip->i_flags & XFS_ISTALE)) return XFS_ITEM_PINNED; - if (!xfs_ilock_nowait(ip, XFS_ILOCK_SHARED)) - return XFS_ITEM_LOCKED; + /* If the inode is already flush locked, we're already flushing. */ + if (xfs_isiflocked(ip)) + return XFS_ITEM_FLUSHING; - /* - * Re-check the pincount now that we stabilized the value by - * taking the ilock. - */ - if (xfs_ipincount(ip) > 0) { - rval = XFS_ITEM_PINNED; - goto out_unlock; - } + if (!xfs_buf_trylock(bp)) + return XFS_ITEM_LOCKED; - /* - * Stale inode items should force out the iclog. - */ - if (ip->i_flags & XFS_ISTALE) { - rval = XFS_ITEM_PINNED; - goto out_unlock; + if (bp->b_flags & _XBF_DELWRI_Q) { + xfs_buf_unlock(bp); + return XFS_ITEM_FLUSHING; } + spin_unlock(&lip->li_ailp->ail_lock); /* - * Someone else is already flushing the inode. Nothing we can do - * here but wait for the flush to finish and remove the item from - * the AIL. + * We need to hold a reference for flushing the cluster buffer as it may + * fail the buffer without IO submission. In which case, we better get a + * reference for that completion because otherwise we don't get a + * reference for IO until we queue the buffer for delwri submission. 
*/ - if (!xfs_iflock_nowait(ip)) { - rval = XFS_ITEM_FLUSHING; - goto out_unlock; - } - - ASSERT(iip->ili_fields != 0 || XFS_FORCED_SHUTDOWN(ip->i_mount)); - spin_unlock(&lip->li_ailp->ail_lock); - - error = xfs_iflush(ip, &bp); + xfs_buf_hold(bp); + error = xfs_iflush_cluster(ip, bp); if (!error) { if (!xfs_buf_delwri_queue(bp, buffer_list)) rval = XFS_ITEM_FLUSHING; xfs_buf_relse(bp); - } else if (error == -EAGAIN) + } else { rval = XFS_ITEM_LOCKED; + } spin_lock(&lip->li_ailp->ail_lock); -out_unlock: - xfs_iunlock(ip, XFS_ILOCK_SHARED); return rval; } @@ -548,6 +537,7 @@ xfs_inode_item_release( ASSERT(ip->i_itemp != NULL); ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL)); + ASSERT(lip->li_buf || !test_bit(XFS_LI_DIRTY, &lip->li_flags)); lock_flags = iip->ili_lock_flags; iip->ili_lock_flags = 0; From patchwork Thu Jun 4 07:46:03 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Dave Chinner X-Patchwork-Id: 11587241 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 787A6913 for ; Thu, 4 Jun 2020 07:46:15 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 62731206C3 for ; Thu, 4 Jun 2020 07:46:15 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1727928AbgFDHqO (ORCPT ); Thu, 4 Jun 2020 03:46:14 -0400 Received: from mail109.syd.optusnet.com.au ([211.29.132.80]:50048 "EHLO mail109.syd.optusnet.com.au" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1727016AbgFDHqO (ORCPT ); Thu, 4 Jun 2020 03:46:14 -0400 Received: from dread.disaster.area (pa49-180-124-177.pa.nsw.optusnet.com.au [49.180.124.177]) by mail109.syd.optusnet.com.au (Postfix) with ESMTPS id 12721D737DC for ; Thu, 4 Jun 2020 17:46:09 +1000 (AEST) Received: from discord.disaster.area ([192.168.253.110]) by dread.disaster.area with esmtp (Exim 4.92.3) (envelope-from ) id 1jgkZk-0004B4-JF for linux-xfs@vger.kernel.org; Thu, 04 Jun 2020 17:46:08 +1000 Received: from dave by discord.disaster.area with local (Exim 4.93) (envelope-from ) id 1jgkZk-0017Ie-Ac for linux-xfs@vger.kernel.org; Thu, 04 Jun 2020 17:46:08 +1000 From: Dave Chinner To: linux-xfs@vger.kernel.org Subject: [PATCH 27/30] xfs: rename xfs_iflush_int() Date: Thu, 4 Jun 2020 17:46:03 +1000 Message-Id: <20200604074606.266213-28-david@fromorbit.com> X-Mailer: git-send-email 2.26.2.761.g0e0b3e54be In-Reply-To: <20200604074606.266213-1-david@fromorbit.com> References: <20200604074606.266213-1-david@fromorbit.com> MIME-Version: 1.0 X-Optus-CM-Score: 0 X-Optus-CM-Analysis: v=2.3 cv=QIgWuTDL c=1 sm=1 tr=0 a=k3aV/LVJup6ZGWgigO6cSA==:117 a=k3aV/LVJup6ZGWgigO6cSA==:17 a=nTHF0DUjJn0A:10 a=20KFwNOVAAAA:8 a=pGLkceISAAAA:8 a=yPCof4ZbAAAA:8 a=FsWqk0_xLHmNMxEvWM8A:9 a=QnvvxhyabaL76C8d:21 a=e8aoB_H6GwcmPgQX:21 Sender: linux-xfs-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-xfs@vger.kernel.org From: Dave Chinner With xfs_iflush() gone, we can rename xfs_iflush_int() back to xfs_iflush(). Also move it up above xfs_iflush_cluster() so we don't need the forward definition any more. Signed-off-by: Dave Chinner Reviewed-by: Amir Goldstein Reviewed-by: Darrick J.
Wong Reviewed-by: Brian Foster --- fs/xfs/xfs_inode.c | 293 ++++++++++++++++++++++----------------------- 1 file changed, 146 insertions(+), 147 deletions(-) diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c index 61c872e4ee157..8566bd0f4334d 100644 --- a/fs/xfs/xfs_inode.c +++ b/fs/xfs/xfs_inode.c @@ -44,7 +44,6 @@ kmem_zone_t *xfs_inode_zone; */ #define XFS_ITRUNC_MAX_EXTENTS 2 -STATIC int xfs_iflush_int(struct xfs_inode *, struct xfs_buf *); STATIC int xfs_iunlink(struct xfs_trans *, struct xfs_inode *); STATIC int xfs_iunlink_remove(struct xfs_trans *, struct xfs_inode *); @@ -3450,152 +3449,8 @@ xfs_rename( return error; } -/* - * Non-blocking flush of dirty inode metadata into the backing buffer. - * - * The caller must have a reference to the inode and hold the cluster buffer - * locked. The function will walk across all the inodes on the cluster buffer it - * can find and lock without blocking, and flush them to the cluster buffer. - * - * On success, the caller must write out the buffer returned in *bp and - * release it. On failure, the filesystem will be shut down, the buffer will - * have been unlocked and released, and EFSCORRUPTED will be returned. - */ -int -xfs_iflush_cluster( - struct xfs_inode *ip, - struct xfs_buf *bp) -{ - struct xfs_mount *mp = ip->i_mount; - struct xfs_perag *pag; - unsigned long first_index, mask; - int cilist_size; - struct xfs_inode **cilist; - struct xfs_inode *cip; - struct xfs_ino_geometry *igeo = M_IGEO(mp); - int error = 0; - int nr_found; - int clcount = 0; - int i; - - pag = xfs_perag_get(mp, XFS_INO_TO_AGNO(mp, ip->i_ino)); - - cilist_size = igeo->inodes_per_cluster * sizeof(struct xfs_inode *); - cilist = kmem_alloc(cilist_size, KM_MAYFAIL|KM_NOFS); - if (!cilist) - goto out_put; - - mask = ~(igeo->inodes_per_cluster - 1); - first_index = XFS_INO_TO_AGINO(mp, ip->i_ino) & mask; - rcu_read_lock(); - /* really need a gang lookup range call here */ - nr_found = radix_tree_gang_lookup(&pag->pag_ici_root, (void**)cilist, - first_index, igeo->inodes_per_cluster); - if (nr_found == 0) - goto out_free; - - for (i = 0; i < nr_found; i++) { - cip = cilist[i]; - - /* - * because this is an RCU protected lookup, we could find a - * recently freed or even reallocated inode during the lookup. - * We need to check under the i_flags_lock for a valid inode - * here. Skip it if it is not valid or the wrong inode. - */ - spin_lock(&cip->i_flags_lock); - if (!cip->i_ino || - __xfs_iflags_test(cip, XFS_ISTALE)) { - spin_unlock(&cip->i_flags_lock); - continue; - } - - /* - * Once we fall off the end of the cluster, no point checking - * any more inodes in the list because they will also all be - * outside the cluster. - */ - if ((XFS_INO_TO_AGINO(mp, cip->i_ino) & mask) != first_index) { - spin_unlock(&cip->i_flags_lock); - break; - } - spin_unlock(&cip->i_flags_lock); - - /* - * Do an un-protected check to see if the inode is dirty and - * is a candidate for flushing. These checks will be repeated - * later after the appropriate locks are acquired. - */ - if (xfs_inode_clean(cip) && xfs_ipincount(cip) == 0) - continue; - - /* - * Try to get locks. If any are unavailable or it is pinned, - * then this inode cannot be flushed and is skipped. 
- */ - - if (!xfs_ilock_nowait(cip, XFS_ILOCK_SHARED)) - continue; - if (!xfs_iflock_nowait(cip)) { - xfs_iunlock(cip, XFS_ILOCK_SHARED); - continue; - } - if (xfs_ipincount(cip)) { - xfs_ifunlock(cip); - xfs_iunlock(cip, XFS_ILOCK_SHARED); - continue; - } - - - /* - * Check the inode number again, just to be certain we are not - * racing with freeing in xfs_reclaim_inode(). See the comments - * in that function for more information as to why the initial - * check is not sufficient. - */ - if (!cip->i_ino) { - xfs_ifunlock(cip); - xfs_iunlock(cip, XFS_ILOCK_SHARED); - continue; - } - - /* - * arriving here means that this inode can be flushed. First - * re-check that it's dirty before flushing. - */ - if (!xfs_inode_clean(cip)) { - error = xfs_iflush_int(cip, bp); - if (error) { - xfs_iunlock(cip, XFS_ILOCK_SHARED); - goto out_free; - } - clcount++; - } else { - xfs_ifunlock(cip); - } - xfs_iunlock(cip, XFS_ILOCK_SHARED); - } - - if (clcount) { - XFS_STATS_INC(mp, xs_icluster_flushcnt); - XFS_STATS_ADD(mp, xs_icluster_flushinode, clcount); - } - -out_free: - rcu_read_unlock(); - kmem_free(cilist); -out_put: - xfs_perag_put(pag); - if (error) { - bp->b_flags |= XBF_ASYNC; - xfs_buf_ioend_fail(bp); - xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE); - } - return error; -} - -STATIC int -xfs_iflush_int( +static int +xfs_iflush( struct xfs_inode *ip, struct xfs_buf *bp) { @@ -3743,6 +3598,150 @@ xfs_iflush_int( return error; } +/* + * Non-blocking flush of dirty inode metadata into the backing buffer. + * + * The caller must have a reference to the inode and hold the cluster buffer + * locked. The function will walk across all the inodes on the cluster buffer it + * can find and lock without blocking, and flush them to the cluster buffer. + * + * On success, the caller must write out the buffer returned in *bp and + * release it. On failure, the filesystem will be shut down, the buffer will + * have been unlocked and released, and EFSCORRUPTED will be returned. + */ +int +xfs_iflush_cluster( + struct xfs_inode *ip, + struct xfs_buf *bp) +{ + struct xfs_mount *mp = ip->i_mount; + struct xfs_perag *pag; + unsigned long first_index, mask; + int cilist_size; + struct xfs_inode **cilist; + struct xfs_inode *cip; + struct xfs_ino_geometry *igeo = M_IGEO(mp); + int error = 0; + int nr_found; + int clcount = 0; + int i; + + pag = xfs_perag_get(mp, XFS_INO_TO_AGNO(mp, ip->i_ino)); + + cilist_size = igeo->inodes_per_cluster * sizeof(struct xfs_inode *); + cilist = kmem_alloc(cilist_size, KM_MAYFAIL|KM_NOFS); + if (!cilist) + goto out_put; + + mask = ~(igeo->inodes_per_cluster - 1); + first_index = XFS_INO_TO_AGINO(mp, ip->i_ino) & mask; + rcu_read_lock(); + /* really need a gang lookup range call here */ + nr_found = radix_tree_gang_lookup(&pag->pag_ici_root, (void**)cilist, + first_index, igeo->inodes_per_cluster); + if (nr_found == 0) + goto out_free; + + for (i = 0; i < nr_found; i++) { + cip = cilist[i]; + + /* + * because this is an RCU protected lookup, we could find a + * recently freed or even reallocated inode during the lookup. + * We need to check under the i_flags_lock for a valid inode + * here. Skip it if it is not valid or the wrong inode. + */ + spin_lock(&cip->i_flags_lock); + if (!cip->i_ino || + __xfs_iflags_test(cip, XFS_ISTALE)) { + spin_unlock(&cip->i_flags_lock); + continue; + } + + /* + * Once we fall off the end of the cluster, no point checking + * any more inodes in the list because they will also all be + * outside the cluster. 
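[A quick worked example of the mask arithmetic in the code being moved here. The geometry is illustrative - 32 inodes per cluster - since the real value depends on inode size and cluster buffer size:

	mask        = ~(32 - 1);	/* = ~0x1f                          */
	first_index = 0x47 & mask;	/* agino 0x47 -> cluster start 0x40 */
	/*
	 * The gang lookup then spans aginos 0x40..0x5f, and the check
	 * above breaks out of the walk as soon as agino & mask no longer
	 * equals first_index.
	 */
]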
+ */ + if ((XFS_INO_TO_AGINO(mp, cip->i_ino) & mask) != first_index) { + spin_unlock(&cip->i_flags_lock); + break; + } + spin_unlock(&cip->i_flags_lock); + + /* + * Do an un-protected check to see if the inode is dirty and + * is a candidate for flushing. These checks will be repeated + * later after the appropriate locks are acquired. + */ + if (xfs_inode_clean(cip) && xfs_ipincount(cip) == 0) + continue; + + /* + * Try to get locks. If any are unavailable or it is pinned, + * then this inode cannot be flushed and is skipped. + */ + + if (!xfs_ilock_nowait(cip, XFS_ILOCK_SHARED)) + continue; + if (!xfs_iflock_nowait(cip)) { + xfs_iunlock(cip, XFS_ILOCK_SHARED); + continue; + } + if (xfs_ipincount(cip)) { + xfs_ifunlock(cip); + xfs_iunlock(cip, XFS_ILOCK_SHARED); + continue; + } + + + /* + * Check the inode number again, just to be certain we are not + * racing with freeing in xfs_reclaim_inode(). See the comments + * in that function for more information as to why the initial + * check is not sufficient. + */ + if (!cip->i_ino) { + xfs_ifunlock(cip); + xfs_iunlock(cip, XFS_ILOCK_SHARED); + continue; + } + + /* + * arriving here means that this inode can be flushed. First + * re-check that it's dirty before flushing. + */ + if (!xfs_inode_clean(cip)) { + error = xfs_iflush(cip, bp); + if (error) { + xfs_iunlock(cip, XFS_ILOCK_SHARED); + goto out_free; + } + clcount++; + } else { + xfs_ifunlock(cip); + } + xfs_iunlock(cip, XFS_ILOCK_SHARED); + } + + if (clcount) { + XFS_STATS_INC(mp, xs_icluster_flushcnt); + XFS_STATS_ADD(mp, xs_icluster_flushinode, clcount); + } + +out_free: + rcu_read_unlock(); + kmem_free(cilist); +out_put: + xfs_perag_put(pag); + if (error) { + bp->b_flags |= XBF_ASYNC; + xfs_buf_ioend_fail(bp); + xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE); + } + return error; +} + /* Release an inode. 
*/ void xfs_irele( From patchwork Thu Jun 4 07:46:04 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Dave Chinner X-Patchwork-Id: 11587279 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id B3AB81392 for ; Thu, 4 Jun 2020 07:46:25 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id A08472072E for ; Thu, 4 Jun 2020 07:46:25 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1728019AbgFDHqY (ORCPT ); Thu, 4 Jun 2020 03:46:24 -0400 Received: from mail107.syd.optusnet.com.au ([211.29.132.53]:49322 "EHLO mail107.syd.optusnet.com.au" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1727997AbgFDHqW (ORCPT ); Thu, 4 Jun 2020 03:46:22 -0400 Received: from dread.disaster.area (pa49-180-124-177.pa.nsw.optusnet.com.au [49.180.124.177]) by mail107.syd.optusnet.com.au (Postfix) with ESMTPS id 83E9ED597B4 for ; Thu, 4 Jun 2020 17:46:13 +1000 (AEST) Received: from discord.disaster.area ([192.168.253.110]) by dread.disaster.area with esmtp (Exim 4.92.3) (envelope-from ) id 1jgkZk-0004B6-KR for linux-xfs@vger.kernel.org; Thu, 04 Jun 2020 17:46:08 +1000 Received: from dave by discord.disaster.area with local (Exim 4.93) (envelope-from ) id 1jgkZk-0017Ij-C6 for linux-xfs@vger.kernel.org; Thu, 04 Jun 2020 17:46:08 +1000 From: Dave Chinner To: linux-xfs@vger.kernel.org Subject: [PATCH 28/30] xfs: rework xfs_iflush_cluster() dirty inode iteration Date: Thu, 4 Jun 2020 17:46:04 +1000 Message-Id: <20200604074606.266213-29-david@fromorbit.com> X-Mailer: git-send-email 2.26.2.761.g0e0b3e54be In-Reply-To: <20200604074606.266213-1-david@fromorbit.com> References: <20200604074606.266213-1-david@fromorbit.com> MIME-Version: 1.0 X-Optus-CM-Score: 0 X-Optus-CM-Analysis: v=2.3 cv=X6os11be c=1 sm=1 tr=0 a=k3aV/LVJup6ZGWgigO6cSA==:117 a=k3aV/LVJup6ZGWgigO6cSA==:17 a=nTHF0DUjJn0A:10 a=20KFwNOVAAAA:8 a=yPCof4ZbAAAA:8 a=ccWGV2Zw5G_KkjRFBUYA:9 Sender: linux-xfs-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-xfs@vger.kernel.org From: Dave Chinner Now that we have all the dirty inodes attached to the cluster buffer, we don't actually have to do radix tree lookups to find them. Sure, the radix tree is efficient, but walking a linked list of just the dirty inodes attached to the buffer is much better. We are also no longer dependent on having a locked inode passed into the function to determine where to start the lookup. This means we can drop it from the function call and treat all inodes the same. We also make xfs_iflush_cluster skip inodes marked with XFS_IRECLAIM. This way we avoid races with inodes that reclaim is actively referencing or are being re-initialised by inode lookup. If they are actually dirty, they'll get written by a future cluster flush.... We also add a shutdown check after obtaining the flush lock so that we catch inodes that are dirty in memory and may have inconsistent state due to the shutdown in progress. We abort these inodes directly and so they remove themselves from the buffer list and the AIL rather than having to wait for the buffer to be failed and callbacks run to be processed correctly. Signed-off-by: Dave Chinner Reviewed-by: Darrick J.
Wong --- fs/xfs/xfs_inode.c | 148 ++++++++++++++++------------------------ fs/xfs/xfs_inode.h | 2 +- fs/xfs/xfs_inode_item.c | 2 +- 3 files changed, 62 insertions(+), 90 deletions(-) diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c index 8566bd0f4334d..931a483d5b316 100644 --- a/fs/xfs/xfs_inode.c +++ b/fs/xfs/xfs_inode.c @@ -3611,117 +3611,94 @@ xfs_iflush( */ int xfs_iflush_cluster( - struct xfs_inode *ip, struct xfs_buf *bp) { - struct xfs_mount *mp = ip->i_mount; - struct xfs_perag *pag; - unsigned long first_index, mask; - int cilist_size; - struct xfs_inode **cilist; - struct xfs_inode *cip; - struct xfs_ino_geometry *igeo = M_IGEO(mp); - int error = 0; - int nr_found; + struct xfs_mount *mp = bp->b_mount; + struct xfs_log_item *lip, *n; + struct xfs_inode *ip; + struct xfs_inode_log_item *iip; int clcount = 0; - int i; - - pag = xfs_perag_get(mp, XFS_INO_TO_AGNO(mp, ip->i_ino)); - - cilist_size = igeo->inodes_per_cluster * sizeof(struct xfs_inode *); - cilist = kmem_alloc(cilist_size, KM_MAYFAIL|KM_NOFS); - if (!cilist) - goto out_put; - - mask = ~(igeo->inodes_per_cluster - 1); - first_index = XFS_INO_TO_AGINO(mp, ip->i_ino) & mask; - rcu_read_lock(); - /* really need a gang lookup range call here */ - nr_found = radix_tree_gang_lookup(&pag->pag_ici_root, (void**)cilist, - first_index, igeo->inodes_per_cluster); - if (nr_found == 0) - goto out_free; + int error = 0; - for (i = 0; i < nr_found; i++) { - cip = cilist[i]; + /* + * We must use the safe variant here as on shutdown xfs_iflush_abort() + * can remove itself from the list. + */ + list_for_each_entry_safe(lip, n, &bp->b_li_list, li_bio_list) { + iip = (struct xfs_inode_log_item *)lip; + ip = iip->ili_inode; /* - * because this is an RCU protected lookup, we could find a - * recently freed or even reallocated inode during the lookup. - * We need to check under the i_flags_lock for a valid inode - * here. Skip it if it is not valid or the wrong inode. + * Quick and dirty check to avoid locks if possible. */ - spin_lock(&cip->i_flags_lock); - if (!cip->i_ino || - __xfs_iflags_test(cip, XFS_ISTALE)) { - spin_unlock(&cip->i_flags_lock); + if (__xfs_iflags_test(ip, XFS_IRECLAIM | XFS_IFLOCK)) + continue; + if (xfs_ipincount(ip)) continue; - } /* - * Once we fall off the end of the cluster, no point checking - * any more inodes in the list because they will also all be - * outside the cluster. + * The inode is still attached to the buffer, which means it is + * dirty but reclaim might try to grab it. Check carefully for + * that, and grab the ilock while still holding the i_flags_lock + * to guarantee reclaim will not be able to reclaim this inode + * once we drop the i_flags_lock. */ - if ((XFS_INO_TO_AGINO(mp, cip->i_ino) & mask) != first_index) { - spin_unlock(&cip->i_flags_lock); - break; + spin_lock(&ip->i_flags_lock); + ASSERT(!__xfs_iflags_test(ip, XFS_ISTALE)); + if (__xfs_iflags_test(ip, XFS_IRECLAIM | XFS_IFLOCK)) { + spin_unlock(&ip->i_flags_lock); + continue; } - spin_unlock(&cip->i_flags_lock); /* - * Do an un-protected check to see if the inode is dirty and - * is a candidate for flushing. These checks will be repeated - * later after the appropriate locks are acquired. + * ILOCK will pin the inode against reclaim and prevent + * concurrent transactions modifying the inode while we are + * flushing the inode. 
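[Restating the locking dance above as a sketch, since the ordering is the whole trick - this is a condensed view of the hunk, not new code. Reclaim sets XFS_IRECLAIM under i_flags_lock and then needs the ILOCK, so trylocking the ILOCK while still holding i_flags_lock guarantees reclaim cannot win after we drop the spinlock:

	spin_lock(&ip->i_flags_lock);
	if (__xfs_iflags_test(ip, XFS_IRECLAIM | XFS_IFLOCK))
		goto skip;		/* drop the spinlock and continue */
	if (!xfs_ilock_nowait(ip, XFS_ILOCK_SHARED))
		goto skip;
	spin_unlock(&ip->i_flags_lock);
	/* reclaim needs the ILOCK we now hold; the inode cannot go away */
]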
*/ - if (xfs_inode_clean(cip) && xfs_ipincount(cip) == 0) + if (!xfs_ilock_nowait(ip, XFS_ILOCK_SHARED)) { + spin_unlock(&ip->i_flags_lock); continue; + } + spin_unlock(&ip->i_flags_lock); /* - * Try to get locks. If any are unavailable or it is pinned, - * then this inode cannot be flushed and is skipped. + * Skip inodes that are already flush locked as they have + * already been written to the buffer. */ - - if (!xfs_ilock_nowait(cip, XFS_ILOCK_SHARED)) - continue; - if (!xfs_iflock_nowait(cip)) { - xfs_iunlock(cip, XFS_ILOCK_SHARED); - continue; - } - if (xfs_ipincount(cip)) { - xfs_ifunlock(cip); - xfs_iunlock(cip, XFS_ILOCK_SHARED); + if (!xfs_iflock_nowait(ip)) { + xfs_iunlock(ip, XFS_ILOCK_SHARED); continue; } - /* - * Check the inode number again, just to be certain we are not - * racing with freeing in xfs_reclaim_inode(). See the comments - * in that function for more information as to why the initial - * check is not sufficient. + * If we are shut down, unpin and abort the inode now as there + * is no point in flushing it to the buffer just to get an IO + * completion to abort the buffer and remove it from the AIL. */ - if (!cip->i_ino) { - xfs_ifunlock(cip); - xfs_iunlock(cip, XFS_ILOCK_SHARED); + if (XFS_FORCED_SHUTDOWN(mp)) { + xfs_iunpin_wait(ip); + /* xfs_iflush_abort() drops the flush lock */ + xfs_iflush_abort(ip); + xfs_iunlock(ip, XFS_ILOCK_SHARED); + error = -EIO; continue; } - /* - * arriving here means that this inode can be flushed. First - * re-check that it's dirty before flushing. - */ - if (!xfs_inode_clean(cip)) { - error = xfs_iflush(cip, bp); - if (error) { - xfs_iunlock(cip, XFS_ILOCK_SHARED); - goto out_free; - } - clcount++; - } else { - xfs_ifunlock(cip); + /* don't block waiting on a log force to unpin dirty inodes */ + if (xfs_ipincount(ip)) { + xfs_ifunlock(ip); + xfs_iunlock(ip, XFS_ILOCK_SHARED); + continue; } - xfs_iunlock(cip, XFS_ILOCK_SHARED); + + if (!xfs_inode_clean(ip)) + error = xfs_iflush(ip, bp); + else + xfs_ifunlock(ip); + xfs_iunlock(ip, XFS_ILOCK_SHARED); + if (error) + break; + clcount++; } if (clcount) { @@ -3729,11 +3706,6 @@ xfs_iflush_cluster( XFS_STATS_ADD(mp, xs_icluster_flushinode, clcount); } -out_free: - rcu_read_unlock(); - kmem_free(cilist); -out_put: - xfs_perag_put(pag); if (error) { bp->b_flags |= XBF_ASYNC; xfs_buf_ioend_fail(bp); diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h index d1109eb13ba2e..b93cf9076df8a 100644 --- a/fs/xfs/xfs_inode.h +++ b/fs/xfs/xfs_inode.h @@ -427,7 +427,7 @@ int xfs_log_force_inode(struct xfs_inode *ip); void xfs_iunpin_wait(xfs_inode_t *); #define xfs_ipincount(ip) ((unsigned int) atomic_read(&ip->i_pincount)) -int xfs_iflush_cluster(struct xfs_inode *, struct xfs_buf *); +int xfs_iflush_cluster(struct xfs_buf *); void xfs_lock_two_inodes(struct xfs_inode *ip0, uint ip0_mode, struct xfs_inode *ip1, uint ip1_mode); diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c index 326547e89cb6b..dee7385466f83 100644 --- a/fs/xfs/xfs_inode_item.c +++ b/fs/xfs/xfs_inode_item.c @@ -511,7 +511,7 @@ xfs_inode_item_push( * reference for IO until we queue the buffer for delwri submission. 
*/ xfs_buf_hold(bp); - error = xfs_iflush_cluster(ip, bp); + error = xfs_iflush_cluster(bp); if (!error) { if (!xfs_buf_delwri_queue(bp, buffer_list)) rval = XFS_ITEM_FLUSHING; From patchwork Thu Jun 4 07:46:05 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Dave Chinner X-Patchwork-Id: 11587259 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id BF2F6159A for ; Thu, 4 Jun 2020 07:46:21 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id AEA912072E for ; Thu, 4 Jun 2020 07:46:21 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1728042AbgFDHqU (ORCPT ); Thu, 4 Jun 2020 03:46:20 -0400 Received: from mail110.syd.optusnet.com.au ([211.29.132.97]:34417 "EHLO mail110.syd.optusnet.com.au" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726246AbgFDHqS (ORCPT ); Thu, 4 Jun 2020 03:46:18 -0400 Received: from dread.disaster.area (pa49-180-124-177.pa.nsw.optusnet.com.au [49.180.124.177]) by mail110.syd.optusnet.com.au (Postfix) with ESMTPS id 7F722107C04 for ; Thu, 4 Jun 2020 17:46:12 +1000 (AEST) Received: from discord.disaster.area ([192.168.253.110]) by dread.disaster.area with esmtp (Exim 4.92.3) (envelope-from ) id 1jgkZk-0004B9-M2 for linux-xfs@vger.kernel.org; Thu, 04 Jun 2020 17:46:08 +1000 Received: from dave by discord.disaster.area with local (Exim 4.93) (envelope-from ) id 1jgkZk-0017Io-DM for linux-xfs@vger.kernel.org; Thu, 04 Jun 2020 17:46:08 +1000 From: Dave Chinner To: linux-xfs@vger.kernel.org Subject: [PATCH 29/30] xfs: factor xfs_iflush_done Date: Thu, 4 Jun 2020 17:46:05 +1000 Message-Id: <20200604074606.266213-30-david@fromorbit.com> X-Mailer: git-send-email 2.26.2.761.g0e0b3e54be In-Reply-To: <20200604074606.266213-1-david@fromorbit.com> References: <20200604074606.266213-1-david@fromorbit.com> MIME-Version: 1.0 X-Optus-CM-Score: 0 X-Optus-CM-Analysis: v=2.3 cv=W5xGqiek c=1 sm=1 tr=0 a=k3aV/LVJup6ZGWgigO6cSA==:117 a=k3aV/LVJup6ZGWgigO6cSA==:17 a=nTHF0DUjJn0A:10 a=20KFwNOVAAAA:8 a=yPCof4ZbAAAA:8 a=5dYq1025zWcnlPueRbEA:9 Sender: linux-xfs-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-xfs@vger.kernel.org From: Dave Chinner xfs_iflush_done() does 3 distinct operations to the inodes attached to the buffer. Separate these operations out into functions so that it is easier to modify these operations independently in future. Signed-off-by: Dave Chinner Reviewed-by: Darrick J. Wong --- fs/xfs/xfs_inode_item.c | 154 +++++++++++++++++++++------------------- 1 file changed, 81 insertions(+), 73 deletions(-) diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c index dee7385466f83..3894d190ea5b9 100644 --- a/fs/xfs/xfs_inode_item.c +++ b/fs/xfs/xfs_inode_item.c @@ -640,101 +640,64 @@ xfs_inode_item_destroy( /* - * This is the inode flushing I/O completion routine. It is called - * from interrupt level when the buffer containing the inode is - * flushed to disk. It is responsible for removing the inode item - * from the AIL if it has not been re-logged, and unlocking the inode's - * flush lock. - * - * To reduce AIL lock traffic as much as possible, we scan the buffer log item - * list for other inodes that will run this function. We remove them from the - * buffer list so we can process all the inode IO completions in one AIL lock - * traversal. 
- * - * Note: Now that we attach the log item to the buffer when we first log the - * inode in memory, we can have unflushed inodes on the buffer list here. These - * inodes will have a zero ili_last_fields, so skip over them here. + * We only want to pull the item from the AIL if it is actually there + * and its location in the log has not changed since we started the + * flush. Thus, we only bother if the inode's lsn has not changed. */ void -xfs_iflush_done( - struct xfs_buf *bp) +xfs_iflush_ail_updates( + struct xfs_ail *ailp, + struct list_head *list) { - struct xfs_inode_log_item *iip; - struct xfs_log_item *lip, *n; - struct xfs_ail *ailp = bp->b_mount->m_ail; - int need_ail = 0; - LIST_HEAD(tmp); + struct xfs_log_item *lip; + xfs_lsn_t tail_lsn = 0; - /* - * Pull the attached inodes from the buffer one at a time and take the - * appropriate action on them. - */ - list_for_each_entry_safe(lip, n, &bp->b_li_list, li_bio_list) { - iip = INODE_ITEM(lip); + /* this is an opencoded batch version of xfs_trans_ail_delete */ + spin_lock(&ailp->ail_lock); + list_for_each_entry(lip, list, li_bio_list) { + xfs_lsn_t lsn; - if (xfs_iflags_test(iip->ili_inode, XFS_ISTALE)) { - xfs_iflush_abort(iip->ili_inode); + if (INODE_ITEM(lip)->ili_flush_lsn != lip->li_lsn) { + clear_bit(XFS_LI_FAILED, &lip->li_flags); continue; } - if (!iip->ili_last_fields) - continue; - - list_move_tail(&lip->li_bio_list, &tmp); - - /* Do an unlocked check for needing the AIL lock. */ - if (iip->ili_flush_lsn == lip->li_lsn || - test_bit(XFS_LI_FAILED, &lip->li_flags)) - need_ail++; + lsn = xfs_ail_delete_one(ailp, lip); + if (!tail_lsn && lsn) + tail_lsn = lsn; } + xfs_ail_update_finish(ailp, tail_lsn); +} - /* - * We only want to pull the item from the AIL if it is actually there - * and its location in the log has not changed since we started the - * flush. Thus, we only bother if the inode's lsn has not changed. - */ - if (need_ail) { - xfs_lsn_t tail_lsn = 0; - - /* this is an opencoded batch version of xfs_trans_ail_delete */ - spin_lock(&ailp->ail_lock); - list_for_each_entry(lip, &tmp, li_bio_list) { - clear_bit(XFS_LI_FAILED, &lip->li_flags); - if (lip->li_lsn == INODE_ITEM(lip)->ili_flush_lsn) { - xfs_lsn_t lsn = xfs_ail_delete_one(ailp, lip); - if (!tail_lsn && lsn) - tail_lsn = lsn; - } - } - xfs_ail_update_finish(ailp, tail_lsn); - } +/* + * Walk the list of inodes that have completed their IOs. If they are clean + * remove them from the list and dissociate them from the buffer. Buffers that + * are still dirty remain linked to the buffer and on the list. Caller must + * handle them appropriately. + */ +void +xfs_iflush_finish( + struct xfs_buf *bp, + struct list_head *list) +{ + struct xfs_log_item *lip, *n; - /* - * Clean up and unlock the flush lock now we are done. We can clear the - * ili_last_fields bits now that we know that the data corresponding to - * them is safely on disk. - */ - list_for_each_entry_safe(lip, n, &tmp, li_bio_list) { + list_for_each_entry_safe(lip, n, list, li_bio_list) { + struct xfs_inode_log_item *iip = INODE_ITEM(lip); bool drop_buffer = false; - list_del_init(&lip->li_bio_list); - iip = INODE_ITEM(lip); - spin_lock(&iip->ili_lock); /* * Remove the reference to the cluster buffer if the inode is - * clean in memory. Drop the buffer reference once we've dropped - * the locks we hold. If the inode is dirty in memory, we need - * to put the inode item back on the buffer list for another - * pass through the flush machinery. 
+		 * clean in memory and drop the buffer reference once we've
+		 * dropped the locks we hold.
 		 */
 		ASSERT(iip->ili_item.li_buf == bp);
 		if (!iip->ili_fields) {
 			iip->ili_item.li_buf = NULL;
+			list_del_init(&lip->li_bio_list);
 			drop_buffer = true;
-		} else {
-			list_add(&lip->li_bio_list, &bp->b_li_list);
 		}
 		iip->ili_last_fields = 0;
 		iip->ili_flush_lsn = 0;
@@ -745,6 +708,51 @@ xfs_iflush_done(
 	}
 }
 
+/*
+ * Inode buffer IO completion routine. It is responsible for removing inodes
+ * attached to the buffer from the AIL if they have not been re-logged, as
+ * well as completing the flush and unlocking the inode.
+ */
+void
+xfs_iflush_done(
+	struct xfs_buf		*bp)
+{
+	struct xfs_log_item	*lip, *n;
+	LIST_HEAD(flushed_inodes);
+	LIST_HEAD(ail_updates);
+
+	/*
+	 * Pull the attached inodes from the buffer one at a time and take the
+	 * appropriate action on them.
+	 */
+	list_for_each_entry_safe(lip, n, &bp->b_li_list, li_bio_list) {
+		struct xfs_inode_log_item *iip = INODE_ITEM(lip);
+
+		if (xfs_iflags_test(iip->ili_inode, XFS_ISTALE)) {
+			xfs_iflush_abort(iip->ili_inode);
+			continue;
+		}
+		if (!iip->ili_last_fields)
+			continue;
+
+		/* Do an unlocked check for needing the AIL lock. */
+		if (iip->ili_flush_lsn == lip->li_lsn ||
+		    test_bit(XFS_LI_FAILED, &lip->li_flags))
+			list_move_tail(&lip->li_bio_list, &ail_updates);
+		else
+			list_move_tail(&lip->li_bio_list, &flushed_inodes);
+	}
+
+	if (!list_empty(&ail_updates)) {
+		xfs_iflush_ail_updates(bp->b_mount->m_ail, &ail_updates);
+		list_splice_tail(&ail_updates, &flushed_inodes);
+	}
+
+	xfs_iflush_finish(bp, &flushed_inodes);
+	if (!list_empty(&flushed_inodes))
+		list_splice_tail(&flushed_inodes, &bp->b_li_list);
+}
+
 /*
  * This is the inode flushing abort routine. It is called from xfs_iflush when
  * the filesystem is shutting down to clean up the inode state.
  * It is

From patchwork Thu Jun 4 07:46:06 2020
X-Patchwork-Submitter: Dave Chinner
X-Patchwork-Id: 11587251
From: Dave Chinner
To: linux-xfs@vger.kernel.org
Subject: [PATCH 30/30] xfs: remove xfs_inobp_check()
Date: Thu, 4 Jun 2020 17:46:06 +1000
Message-Id: <20200604074606.266213-31-david@fromorbit.com>
In-Reply-To: <20200604074606.266213-1-david@fromorbit.com>
References: <20200604074606.266213-1-david@fromorbit.com>

From: Dave Chinner

This debug code is called on every xfs_iflush() call, which then checks
every inode in the buffer for a non-zero unlinked list field. Hence it
checks every inode in the cluster buffer every time a single inode on
that cluster is flushed. This results in:

-   38.91%     5.33%  [kernel]  [k] xfs_iflush
   - 17.70% xfs_iflush
      - 9.93% xfs_inobp_check
           4.36% xfs_buf_offset

10% of the CPU time spent flushing inodes is repeatedly checking
unlinked fields in the buffer. We don't need to do this.

The other place we call xfs_inobp_check() is
xfs_iunlink_update_dinode(), and this is after we've done this assert
for the agino we are about to write into that inode:

	ASSERT(xfs_verify_agino_or_null(mp, agno, next_agino));

which means we've already checked that the agino we are about to write
is not 0 on debug kernels. The inode buffer verifiers do everything
else we need, so let's just remove this debug code.

Signed-off-by: Dave Chinner
Reviewed-by: Christoph Hellwig
Reviewed-by: Darrick J. Wong
Reviewed-by: Brian Foster
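In effect, the patch trades an O(inodes-per-cluster) debug scan on
every flush for a check at the single site that writes the field, with
the buffer verifiers covering everything else. The standalone C sketch
below illustrates only that trade-off; SLOTS_PER_BUF, struct slot,
check_all_slots() and update_slot() are invented stand-ins, not XFS
code.

#include <assert.h>
#include <stdio.h>

#define SLOTS_PER_BUF	32	/* stand-in for inodes_per_cluster */

struct slot {
	unsigned int	next_unlinked;	/* 0 is the bogus value */
};

/* Old shape: every flush re-walks every slot in the buffer. */
static void check_all_slots(const struct slot *buf)
{
	for (int i = 0; i < SLOTS_PER_BUF; i++)
		assert(buf[i].next_unlinked != 0);
}

/* New shape: validate once, at the only place the field is written. */
static void update_slot(struct slot *s, unsigned int next)
{
	/* stand-in for ASSERT(xfs_verify_agino_or_null(...)) */
	assert(next != 0);
	s->next_unlinked = next;
}

int main(void)
{
	struct slot buf[SLOTS_PER_BUF];

	for (int i = 0; i < SLOTS_PER_BUF; i++)
		update_slot(&buf[i], 1u);	/* O(1) check per write */
	check_all_slots(buf);			/* O(n) check per flush */
	printf("per-write validation makes the per-flush scan redundant\n");
	return 0;
}

Validating at the write site checks the invariant exactly once per
state change, so on debug kernels the repeated whole-buffer walk adds
no extra coverage.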
---
 fs/xfs/libxfs/xfs_inode_buf.c | 24 ------------------------
 fs/xfs/libxfs/xfs_inode_buf.h |  6 ------
 fs/xfs/xfs_inode.c            |  2 --
 3 files changed, 32 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_inode_buf.c b/fs/xfs/libxfs/xfs_inode_buf.c
index 1af97235785c8..6b6f67595bf4e 100644
--- a/fs/xfs/libxfs/xfs_inode_buf.c
+++ b/fs/xfs/libxfs/xfs_inode_buf.c
@@ -20,30 +20,6 @@
 #include 
 
-/*
- * Check that none of the inode's in the buffer have a next
- * unlinked field of 0.
- */
-#if defined(DEBUG)
-void
-xfs_inobp_check(
-	xfs_mount_t	*mp,
-	xfs_buf_t	*bp)
-{
-	int		i;
-	xfs_dinode_t	*dip;
-
-	for (i = 0; i < M_IGEO(mp)->inodes_per_cluster; i++) {
-		dip = xfs_buf_offset(bp, i * mp->m_sb.sb_inodesize);
-		if (!dip->di_next_unlinked) {
-			xfs_alert(mp,
-	"Detected bogus zero next_unlinked field in inode %d buffer 0x%llx.",
-				i, (long long)bp->b_bn);
-		}
-	}
-}
-#endif
-
 /*
  * If we are doing readahead on an inode buffer, we might be in log recovery
  * reading an inode allocation buffer that hasn't yet been replayed, and hence
diff --git a/fs/xfs/libxfs/xfs_inode_buf.h b/fs/xfs/libxfs/xfs_inode_buf.h
index 865ac493c72a2..6b08b9d060c2e 100644
--- a/fs/xfs/libxfs/xfs_inode_buf.h
+++ b/fs/xfs/libxfs/xfs_inode_buf.h
@@ -52,12 +52,6 @@ int xfs_inode_from_disk(struct xfs_inode *ip, struct xfs_dinode *from);
 void xfs_log_dinode_to_disk(struct xfs_log_dinode *from,
 			    struct xfs_dinode *to);
 
-#if defined(DEBUG)
-void	xfs_inobp_check(struct xfs_mount *, struct xfs_buf *);
-#else
-#define	xfs_inobp_check(mp, bp)
-#endif /* DEBUG */
-
 xfs_failaddr_t xfs_dinode_verify(struct xfs_mount *mp, xfs_ino_t ino,
 			   struct xfs_dinode *dip);
 xfs_failaddr_t xfs_inode_validate_extsize(struct xfs_mount *mp,
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index 931a483d5b316..9400c2e0b0c4a 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -2165,7 +2165,6 @@ xfs_iunlink_update_dinode(
 	xfs_dinode_calc_crc(mp, dip);
 	xfs_trans_inode_buf(tp, ibp);
 	xfs_trans_log_buf(tp, ibp, offset, offset + sizeof(xfs_agino_t) - 1);
-	xfs_inobp_check(mp, ibp);
 }
 
 /* Set an in-core inode's unlinked pointer and return the old value. */
@@ -3559,7 +3558,6 @@ xfs_iflush(
 	xfs_iflush_fork(ip, dip, iip, XFS_DATA_FORK);
 	if (XFS_IFORK_Q(ip))
 		xfs_iflush_fork(ip, dip, iip, XFS_ATTR_FORK);
-	xfs_inobp_check(mp, bp);
 
 	/*
 	 * We've recorded everything logged in the inode, so we'd like to clear