[28/28] xfs: rework unreferenced inode lookups

From: Dave Chinner <dchinner@redhat.com>

Looking up an unreferenced inode in the inode cache is a bit hairy.
We do this for inode invalidation and writeback clustering purposes,
which is all invisible to the VFS. Hence we can't take reference
counts to the inode and so must be very careful how we do it.

There are several different places that all do the lookups and
checks slightly differently. Fundamentally, though, they are all
racy and inode reclaim has to block waiting for the inode lock if it
loses the race. This is far from optimal given all the work we've
already done to make reclaim non-blocking.

We can make the reclaim process non-blocking with a couple of simple
changes. If we define the unreferenced lookup so that it either
always grabs an inode in a way that reclaim will notice and skip, or
notices that reclaim has grabbed the inode and skips it, then there
is no need for reclaim to cycle the inode ILOCK at all.

Selecting an inode for reclaim is already non-blocking, so if the
ILOCK is held the inode will be skipped. If we ensure that reclaim
holds the ILOCK until the inode is freed, then we can do the same
thing in the unreferenced lookup to avoid inodes in reclaim. We can
do this simply by holding the ILOCK until the RCU grace period
expires and the inode freeing callback is run. As all unreferenced
lookups have to hold the rcu_read_lock(), we are guaranteed that
a reclaimed inode will be noticed as the trylock will fail.
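
As a rough illustration of the protocol described above, here is a
minimal single-threaded userspace model in C. It is not the kernel
code: struct model_inode and every model_* function are hypothetical
stand-ins, the ILOCK is reduced to a boolean, and the RCU grace period
is represented only by the point at which model_rcu_free_callback()
is called.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct xfs_inode; only what the model needs. */
struct model_inode {
	bool	ilock_held;	/* models the ILOCK being held exclusively */
	bool	freed;		/* set once the modelled RCU callback has run */
};

/* Models xfs_ilock_nowait(): take the lock only if it is currently free. */
static bool model_ilock_nowait(struct model_inode *ip)
{
	if (ip->ilock_held)
		return false;
	ip->ilock_held = true;
	return true;
}

static void model_iunlock(struct model_inode *ip)
{
	assert(ip->ilock_held);
	ip->ilock_held = false;
}

/*
 * Reclaim selects the inode without blocking. On success it keeps the
 * lock; it does not drop it before the inode is freed.
 */
static bool model_reclaim_isolate(struct model_inode *ip)
{
	return model_ilock_nowait(ip);
}

/*
 * Stand-in for the inode freeing callback that runs after the grace
 * period: the lock is only released here, as the inode goes away.
 */
static void model_rcu_free_callback(struct model_inode *ip)
{
	model_iunlock(ip);
	ip->freed = true;
}

/*
 * Unreferenced lookup. The caller is assumed to be inside an RCU read
 * section, so the inode memory cannot be freed while this runs.
 */
static bool model_unreferenced_grab(struct model_inode *ip)
{
	return model_ilock_nowait(ip);
}
```

In this model, whichever side takes the lock first wins and the other
side skips the inode; neither side ever waits.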

Additional research notes on final reclaim locking before free
--------------------------------------------------------------

2016: 1f2dcfe89eda ("xfs: xfs_inode_free() isn't RCU safe")

Fixes the situation where the inode is found during an RCU lookup
within the freeing grace period, but critical structures have already
been freed. Lookup code that has this problem includes
xfs_iflush_cluster().

2008: 455486b9ccdd ("[XFS] avoid all reclaimable inodes in xfs_sync_inodes_ag")

Prior to this commit, the flushing of inodes required serialisation
with xfs_ireclaim(), which cycled the inode lock (lock, then unlock)
to ensure that it waited for flushing in xfs_sync_inodes_ag() to
complete before freeing the inode:

                /*
-                * If we can't get a reference on the VFS_I, the inode must be
-                * in reclaim. If we can get the inode lock without blocking,
-                * it is safe to flush the inode because we hold the tree lock
-                * and xfs_iextract will block right now. Hence if we lock the
-                * inode while holding the tree lock, xfs_ireclaim() is
-                * guaranteed to block on the inode lock we now hold and hence
-                * it is safe to reference the inode until we drop the inode
-                * locks completely.
+                * If we can't get a reference on the inode, it must be
+                * in reclaim. Leave it for the reclaim code to flush.
                 */

This case is completely gone from the modern code.

The lock/unlock cycle exists at the start of the git era. Switching
to the archive tree.

This xfs_sync() functionality goes back to 1994 when inode
writeback was first introduced by:

47ac6d60 ("Add support to xfs_ireclaim() needed for xfs_sync().")

So it has been there forever - let's see if we can get rid of it.

State of existing code:

- xfs_iflush_cluster() does not check for the XFS_IRECLAIM inode flag
  while holding rcu_read_lock()/i_flags_lock, so it doesn't avoid
  reclaimable inodes or inodes that are in the process of being
  reclaimed. Inodes at this point of reclaim are clean, so if
  xfs_iflush_cluster() wins the race to the ILOCK, then inode reclaim
  has to wait for the lock to be dropped by xfs_iflush_cluster() once
  it detects the inode is clean.

- xfs_ifree_cluster() has similar logic based around XFS_ISTALE,
  which results in similar race conditions that require inode reclaim
  to cycle the ILOCK to serialise against them.

- xfs_inode_ag_walk() uses xfs_inode_ag_walk_grab(), and it checks
  XFS_IRECLAIM under RCU. It then tries to take a reference to the
  VFS inode via igrab(), which will fail if the inode is either
  XFS_IRECLAIMABLE or XFS_IRECLAIM. If it races, igrab() will still
  fail because the inode has I_FREEING set, so it's protected
  against reclaim races.
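
The two-step check described for xfs_inode_ag_walk_grab() can be
sketched as a small userspace model in C. This is an assumption-laden
illustration, not the kernel function: the MODEL_* flags, struct
model_inode, model_igrab() and model_walk_grab() are all hypothetical
names, and the RCU and i_flags_lock protection of the real code is
only noted in comments.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag values standing in for the XFS inode flags. */
#define MODEL_IRECLAIM		(1u << 0)
#define MODEL_IRECLAIMABLE	(1u << 1)

/* Hypothetical stand-in for the XFS inode and its embedded VFS inode. */
struct model_inode {
	unsigned int	flags;		/* models the XFS inode flags */
	bool		vfs_freeing;	/* models I_FREEING on the VFS inode */
	int		refcount;	/* models the VFS inode reference count */
};

/* Models igrab(): refuse to take a reference on an inode being freed. */
static bool model_igrab(struct model_inode *ip)
{
	if (ip->vfs_freeing)
		return false;
	ip->refcount++;
	return true;
}

/*
 * Models the grab step of the walk. In the kernel the flag check is
 * described as running under RCU; here it is a plain read.
 */
static bool model_walk_grab(struct model_inode *ip)
{
	assert(ip);

	/* Step 1: skip inodes that reclaim has already claimed. */
	if (ip->flags & MODEL_IRECLAIM)
		return false;

	/*
	 * Step 2: even if the flag check raced, taking the reference
	 * fails while the freeing state is still set.
	 */
	return model_igrab(ip);
}
```

The point of the second step is that a stale flag read cannot hand out
a reference to an inode that is already on its way out.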

That leaves xfs_iflush_cluster() + xfs_ifree_cluster() to be
modified to do reclaim-safe lookups. With respect to the new inode
reclaim LRU isolate function:

	1. inode can be referenced while rcu_read_lock() is held.

	2. XFS_IRECLAIM means inode has been fully locked down and
	   has been placed on the dispose list, and will be freed soon.
		- ilock_nowait() will fail once IRECLAIM is set due
		  to lock order in isolation code.

	3. ip->i_ino == 0 means it's been removed from the dispose
	   list and is about to or has been removed from the radix
	   tree and may have already been queued on the rcu freeing
	   list to be freed at the end of the current grace period.

		- the old xfs_ireclaim() code will have dropped the
		  ILOCK here, and so there's a race between checking
		  IRECLAIM, grabbing ilock_nowait() and reclaim
		  freeing the inode.
		- this is what the spurious lock/unlock avoids.

	4. if xfs_ilock_nowait() fails before the rcu grace period
	   expires, it doesn't matter if we race between checking
	   IRECLAIM and failing the lock attempt. In fact, we don't
	   even have to check XFS_IRECLAIM - just failing
	   xfs_ilock_nowait() is sufficient to avoid inodes being
	   reclaimed.

	   Hence when xfs_ilock_nowait() fails, we can either drop the
	   rcu_read_lock at that point and restart the inode lookup,
	   or we just skip the inode altogether. If we raced with
	   reclaim, the retry will not find the inode in reclaim
	   again. If we raced with some other lock holder, then
	   we'll find the inode and try to lock it again.

		- Requires holding ILOCK into rcu freeing callback
		  and dropping it there. i.e. inode to be reclaimed
		  remains locked until grace period expires.
		- No window at all between IRECLAIM being set and
		  visible to other CPUs and the inode being removed
		  from the cache and freed where ilock_nowait will
		  succeed.
		- simple, effective, reliable.
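
The skip-or-retry behaviour in point 4 can be sketched with another
small userspace model in C. Everything here is hypothetical and only
for illustration: struct model_inode, the model_owner values,
model_walk() and model_settle() are not kernel names, and
model_settle() simply stands in for time passing (the grace period
expiring and other lock holders finishing).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Who currently holds the modelled ILOCK. */
enum model_owner { MODEL_UNLOCKED, MODEL_LOOKUP, MODEL_RECLAIM, MODEL_OTHER };

/* Hypothetical stand-in for a cached inode. */
struct model_inode {
	unsigned long		ino;
	enum model_owner	ilock_owner;
	bool			in_cache;	/* still findable by lookups */
};

/* Models xfs_ilock_nowait() as called by the unreferenced lookup. */
static bool model_ilock_nowait(struct model_inode *ip)
{
	if (ip->ilock_owner != MODEL_UNLOCKED)
		return false;
	ip->ilock_owner = MODEL_LOOKUP;
	return true;
}

/*
 * One lookup pass. An inode whose trylock fails is skipped; the walk
 * never waits for it and never inspects why the lock was unavailable.
 * Returns the number of inodes locked, recording their numbers.
 */
static size_t model_walk(struct model_inode *cache, size_t nr,
			 unsigned long *locked, size_t max)
{
	size_t found = 0;

	assert(cache && locked);
	for (size_t i = 0; i < nr && found < max; i++) {
		struct model_inode *ip = &cache[i];

		if (!ip->in_cache)
			continue;
		if (!model_ilock_nowait(ip))
			continue;	/* reclaim or another holder: skip */
		locked[found++] = ip->ino;
	}
	return found;
}

/*
 * Models time passing between passes: inodes held by reclaim are
 * freed and leave the cache, other holders drop their locks.
 */
static void model_settle(struct model_inode *cache, size_t nr)
{
	for (size_t i = 0; i < nr; i++) {
		struct model_inode *ip = &cache[i];

		if (ip->ilock_owner == MODEL_RECLAIM) {
			ip->in_cache = false;
			ip->ilock_owner = MODEL_UNLOCKED;
		} else if (ip->ilock_owner == MODEL_OTHER) {
			ip->ilock_owner = MODEL_UNLOCKED;
		}
	}
}
```

A second model_walk() after model_settle() corresponds to the retry
case: the inode that was in reclaim is no longer found, while the one
held by another lock holder is found and locked.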

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/mrlock.h     |  27 +++++++++
 fs/xfs/xfs_icache.c |  88 +++++++++++++++++++++--------
 fs/xfs/xfs_inode.c  | 131 +++++++++++++++++++++-----------------------
 3 files changed, 153 insertions(+), 93 deletions(-)

Message ID	20191031234618.15403-29-david@fromorbit.com (mailing list archive)
State	Deferred, archived
Date	Fri, 1 Nov 2019 10:46:18 +1100
To	linux-xfs@vger.kernel.org
Cc	linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org
Series	mm, xfs: non-blocking inode reclaim
