diff mbox series

[1/1] xfs: fix potential AGI <-> ILOCK ABBA deadlock in xrep_dinode_findmode_walk_directory

Message ID 171150379387.3216268.6890967813601957901.stgit@frogsfrogsfrogs (mailing list archive)
State Superseded
Headers show
Series [1/1] xfs: fix potential AGI <-> ILOCK ABBA deadlock in xrep_dinode_findmode_walk_directory | expand

Commit Message

Darrick J. Wong March 27, 2024, 1:50 a.m. UTC
From: Darrick J. Wong <djwong@kernel.org>

xfs/399 found the following deadlock when fuzzing core.mode = ones:

/proc/20506/task/20558/stack :
[<0>] xfs_ilock+0xa0/0x240 [xfs]
[<0>] xfs_ilock_data_map_shared+0x1b/0x20 [xfs]
[<0>] xrep_dinode_findmode_walk_directory+0x69/0xe0 [xfs]
[<0>] xrep_dinode_find_mode+0x103/0x2a0 [xfs]
[<0>] xrep_dinode_mode+0x7c/0x120 [xfs]
[<0>] xrep_dinode_core+0xed/0x2b0 [xfs]
[<0>] xrep_dinode_problems+0x10/0x80 [xfs]
[<0>] xrep_inode+0x6c/0xc0 [xfs]
[<0>] xrep_attempt+0x64/0x1d0 [xfs]
[<0>] xfs_scrub_metadata+0x365/0x840 [xfs]
[<0>] xfs_scrubv_metadata+0x282/0x430 [xfs]
[<0>] xfs_ioc_scrubv_metadata+0x149/0x1a0 [xfs]
[<0>] xfs_file_ioctl+0xc68/0x1780 [xfs]
/proc/20506/task/20559/stack :
[<0>] xfs_buf_lock+0x3b/0x110 [xfs]
[<0>] xfs_buf_find_lock+0x66/0x1c0 [xfs]
[<0>] xfs_buf_get_map+0x208/0xc00 [xfs]
[<0>] xfs_buf_read_map+0x5d/0x2c0 [xfs]
[<0>] xfs_trans_read_buf_map+0x1b0/0x4c0 [xfs]
[<0>] xfs_read_agi+0xbd/0x190 [xfs]
[<0>] xfs_ialloc_read_agi+0x47/0x160 [xfs]
[<0>] xfs_imap_lookup+0x69/0x1f0 [xfs]
[<0>] xfs_imap+0x1fc/0x3d0 [xfs]
[<0>] xfs_iget+0x357/0xd50 [xfs]
[<0>] xchk_dir_actor+0x16e/0x330 [xfs]
[<0>] xchk_dir_walk_block+0x164/0x1e0 [xfs]
[<0>] xchk_dir_walk+0x13a/0x190 [xfs]
[<0>] xchk_directory+0x1a2/0x2b0 [xfs]
[<0>] xfs_scrub_metadata+0x2f4/0x840 [xfs]
[<0>] xfs_scrubv_metadata+0x282/0x430 [xfs]
[<0>] xfs_ioc_scrubv_metadata+0x149/0x1a0 [xfs]
[<0>] xfs_file_ioctl+0xc68/0x1780 [xfs]

Thread 20558 holds an AGI buffer and is trying to grab the ILOCK of the
root directory.  Thread 20559 holds the root directory ILOCK and is
trying to grab the AGI of an inode that is one of the root directory's
children.  The AGI held by 20558 is the same buffer that 20559 is trying
to acquire.  In other words, this is an ABBA deadlock.

In general, the lock order is ILOCK and then AGI -- rename does this
while preparing for an operation involving whiteouts or renaming files
out of existence; and unlink does this when moving an inode to the
unlinked list.  The only place where we do it in the opposite order is
on the child during an icreate, but at that point the child is marked
INEW and is not visible to other threads.

Work around this deadlock by replacing the blocking ilock attempt with a
nonblocking loop that aborts after 30 seconds.  Relax for a jiffy after
a failed lock attempt.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/scrub/inode_repair.c |   48 ++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 47 insertions(+), 1 deletion(-)

Comments

Christoph Hellwig March 27, 2024, 4:56 p.m. UTC | #1
> Thread 20558 holds an AGI buffer and is trying to grab the ILOCK of the
> root directory.  Thread 20559 holds the root directory ILOCK and is
> trying to grab the AGI of an inode that is one of the root directory's
> children.  The AGI held by 20558 is the same buffer that 20559 is trying
> to acquire.  In other words, this is an ABBA deadlock.
> 
> In general, the lock order is ILOCK and then AGI -- rename does this
> while preparing for an operation involving whiteouts or renaming files
> out of existence; and unlink does this when moving an inode to the
> unlinked list.  The only place where we do it in the opposite order is
> on the child during an icreate, but at that point the child is marked
> INEW and is not visible to other threads.
> 
> Work around this deadlock by replacing the blocking ilock attempt with a
> nonblocking loop that aborts after 30 seconds.  Relax for a jiffy after
> a failed lock attempt.

Trylock and wait schemes are sketchy as hell.  Why do we need to hold
the AGI lock when walking the directory?
Darrick J. Wong March 29, 2024, 6:38 p.m. UTC | #2
On Wed, Mar 27, 2024 at 09:56:35AM -0700, Christoph Hellwig wrote:
> > Thread 20558 holds an AGI buffer and is trying to grab the ILOCK of the
> > root directory.  Thread 20559 holds the root directory ILOCK and is
> > trying to grab the AGI of an inode that is one of the root directory's
> > children.  The AGI held by 20558 is the same buffer that 20559 is trying
> > to acquire.  In other words, this is an ABBA deadlock.
> > 
> > In general, the lock order is ILOCK and then AGI -- rename does this
> > while preparing for an operation involving whiteouts or renaming files
> > out of existence; and unlink does this when moving an inode to the
> > unlinked list.  The only place where we do it in the opposite order is
> > on the child during an icreate, but at that point the child is marked
> > INEW and is not visible to other threads.
> > 
> > Work around this deadlock by replacing the blocking ilock attempt with a
> > nonblocking loop that aborts after 30 seconds.  Relax for a jiffy after
> > a failed lock attempt.
> 
> Trylock and wait schemes are sketchy as hell.  Why do we need to hold
> the AGI lock when walking the directory?

The short answer is that we're holding the AGI to quiesce inode cache
activity in the AG containing the inode that xrep_dinode* is trying to
fix.  The goal of xrep_dinode* functions is to get the ondisk inode into
good enough shape that we can iget the inode and continue repairs with
the cached inode and all the functionality that you get with a cached
inode.

Longer answer:

When the xchk_setup_inode function fails to iget an inode, it grabs the
AGI buffer, computes the xfs_imap of the affected inode, and hands
things over to repair.  At this point, we've prevented any other threads
from trying to allocate or free an inode in that AG.

Repair uses the xfs_imap to read the inode cluster buffer, so now it
holds the top and the bottom of the inode structure.  One of two things
can happen:

1) If xrep_dinode_mode decides it doesn't need to do anything, we
continue correcting problems in the rest of the xfs_dinode, commit the
cluster buffer, and retry the untrusted iget.  Repair still holds the
AGI and the icluster buffer, so we know that nobody else could have
started a walk.  Therefore, we cannot deadlock with another thread
calling iget.

2) If xrep_dinode_mode does decide to scan the filesystem to try to
recover i_mode from ftypes, now we need to do untrusted igets of
every directory in the filesystem.  However, we still need to hold the
AGI and the cluster buffer.

Oh.  The xchk_iscan_iter in xrep_dinode_find_mode does a blocking
acquisition of every AGI in the filesystem.  If the busted inode is in a
high AGI, we'll end up taking AGIs in the wrong order.  Ok, I need to
fix that too.

--D
diff mbox series

Patch

diff --git a/fs/xfs/scrub/inode_repair.c b/fs/xfs/scrub/inode_repair.c
index eab380e95ef40..96c5763dc3839 100644
--- a/fs/xfs/scrub/inode_repair.c
+++ b/fs/xfs/scrub/inode_repair.c
@@ -282,6 +282,50 @@  xrep_dinode_findmode_dirent(
 	return 0;
 }
 
+/* Try to lock a directory, or wait a jiffy. */
+static inline int
+xrep_dinode_ilock_nowait(
+	struct xfs_inode	*dp,
+	unsigned int		lock_mode)
+{
+	if (xfs_ilock_nowait(dp, lock_mode))
+		return true;
+
+	schedule_timeout_killable(1);
+	return false;
+}
+
+/*
+ * Try to lock a directory to look for ftype hints.  Since we already hold the
+ * AGI buffer, we cannot block waiting for the ILOCK because rename can take
+ * the ILOCK and then try to lock AGIs.
+ */
+STATIC int
+xrep_dinode_trylock_directory(
+	struct xrep_inode	*ri,
+	struct xfs_inode	*dp,
+	unsigned int		*lock_modep)
+{
+	unsigned long		deadline = jiffies + msecs_to_jiffies(30000);
+	unsigned int		lock_mode;
+	int			error = 0;
+
+	do {
+		if (xchk_should_terminate(ri->sc, &error))
+			return error;
+		if (time_is_before_jiffies(deadline))
+			return -EBUSY;
+
+		if (xfs_need_iread_extents(&dp->i_df))
+			lock_mode = XFS_ILOCK_EXCL;
+		else
+			lock_mode = XFS_ILOCK_SHARED;
+	} while (!xrep_dinode_ilock_nowait(dp, lock_mode));
+
+	*lock_modep = lock_mode;
+	return 0;
+}
+
 /*
  * If this is a directory, walk the dirents looking for any that point to the
  * scrub target inode.
@@ -299,7 +343,9 @@  xrep_dinode_findmode_walk_directory(
 	 * Scan the directory to see if there it contains an entry pointing to
 	 * the directory that we are repairing.
 	 */
-	lock_mode = xfs_ilock_data_map_shared(dp);
+	error = xrep_dinode_trylock_directory(ri, dp, &lock_mode);
+	if (error)
+		return error;
 
 	/*
 	 * If this directory is known to be sick, we cannot scan it reliably