[08/21] xfs: defer iput on certain inodes while scrub / repair are running

Message ID 152986826018.3155.9241833069276452949.stgit@magnolia (mailing list archive)
State New, archived

Commit Message

Darrick J. Wong June 24, 2018, 7:24 p.m. UTC
From: Darrick J. Wong <darrick.wong@oracle.com>

Destroying an incore inode sometimes requires some work to be done on
the inode.  For example, post-EOF blocks on a non-PREALLOC inode are
trimmed, and copy-on-write staging extents are freed.  This work is done
in separate transactions, which is bad for scrub and repair because (a)
we already have a transaction and can't nest them, and (b) if we've
frozen the filesystem for scrub/repair work, that (regular) transaction
allocation will block on the freeze.

Therefore, if we detect that work has to be done to destroy the incore
inode, we'll just hang on to the reference until after the scrub is
finished.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/scrub/common.c |   52 +++++++++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/scrub/common.h |    1 +
 fs/xfs/scrub/dir.c    |    2 +-
 fs/xfs/scrub/parent.c |    6 +++---
 fs/xfs/scrub/scrub.c  |   20 +++++++++++++++++++
 fs/xfs/scrub/scrub.h  |    9 ++++++++
 fs/xfs/scrub/trace.h  |   30 ++++++++++++++++++++++++++++
 7 files changed, 116 insertions(+), 4 deletions(-)
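
The deferral mechanism in the patch below threads a singly linked list
through each inode's i_private pointer, with the scrub context itself
serving as the end-of-list sentinel, so a NULL i_private always means
"not on the list".  A minimal standalone C model of that list
discipline (stand-in types only, not kernel code):

#include <stdio.h>

struct inode {
	void	*i_private;	/* next deferred inode, or the sentinel */
	int	ino;
};

struct scrub_ctx {
	void	*deferred_iput_list;	/* points back at itself when empty */
};

static void
defer_iput(struct scrub_ctx *sc, struct inode *inode)
{
	if (inode->i_private)		/* non-NULL: already deferred */
		return;
	inode->i_private = sc->deferred_iput_list;
	sc->deferred_iput_list = inode;
}

static void
iput_deferred(struct scrub_ctx *sc)
{
	struct inode	*inode = sc->deferred_iput_list;
	struct inode	*next;

	while (inode != (struct inode *)sc) {	/* sc is the sentinel */
		next = inode->i_private;
		inode->i_private = NULL;
		printf("iput inode %d\n", inode->ino);	/* stands in for iput() */
		inode = next;
	}
	sc->deferred_iput_list = sc;		/* list is empty again */
}

int
main(void)
{
	struct scrub_ctx	sc = { .deferred_iput_list = &sc };
	struct inode		a = { .ino = 1 }, b = { .ino = 2 };

	defer_iput(&sc, &a);
	defer_iput(&sc, &b);
	defer_iput(&sc, &a);	/* no-op: already on the list */
	iput_deferred(&sc);	/* releases 2, then 1 (LIFO order) */
	return 0;
}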




Comments

Dave Chinner June 28, 2018, 11:37 p.m. UTC | #1
On Sun, Jun 24, 2018 at 12:24:20PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <darrick.wong@oracle.com>
> 
> Destroying an incore inode sometimes requires some work to be done on
> the inode.  For example, post-EOF blocks on a non-PREALLOC inode are
> trimmed, and copy-on-write staging extents are freed.  This work is done
> in separate transactions, which is bad for scrub and repair because (a)
> we already have a transaction and can't nest them, and (b) if we've
> frozen the filesystem for scrub/repair work, that (regular) transaction
> allocation will block on the freeze.
> 
> Therefore, if we detect that work has to be done to destroy the incore
> inode, we'll just hang on to the reference until after the scrub is
> finished.
> 
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>

Darrick, I'll just repeat what we discussed on #xfs here so we have
it in the archive and everyone else knows why this is probably going
to be done differently.

I think we should move deferred inode inactivation processing into
the background reclaim radix tree walker rather than introduce a
special new "don't iput this inode yet" state. We're really only
trying to prevent the transactions that xfs_inactive() may run
through iput() when the filesystem is frozen, and we already stop
background reclaim processing when the fs is frozen.

I've always intended that xfs_fs_destroy_inode() basically becomes a
no-op that just queues the inode for final inactivation, freeing and
reclaim - right now it only does the reclaim work in the background.
I first proposed this back in ~2008 here:

http://xfs.org/index.php/Improving_inode_Caching#Inode_Unlink

At this point, it really only requires a new inode flag to indicate
that it has an inactivation pending - we set that if xfs_inactive
needs to do work before the inode can be reclaimed, and have a
separate per-ag work queue that walks the inode radix tree finding
reclaimable inodes that have the NEED_INACTIVATION inode flag set.
This way background reclaim doesn't get stuck on them.

This has benefits for many operations, e.g. bulk processing of
inode inactivation and freeing either concurrently with or after
rm -rf rather than at unlink syscall exit, the VFS inode cache
shrinker never blocking on inactivation needing to run
transactions, etc.

It also allows us to turn off inactivation on a per-AG basis,
meaning that when we are rebuilding an AG structure in repair (e.g.
the rmap btree) we can turn off inode inactivation and reclaim for
that AG rather than needing to freeze the entire filesystem....
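
Concretely, the destroy path would end up looking something like this
(untested sketch only - xfs_inactive_needs_work() and
XFS_NEED_INACTIVATION are made-up names here, they don't exist in the
tree today):

STATIC void
xfs_fs_destroy_inode(
	struct inode		*inode)
{
	struct xfs_inode	*ip = XFS_I(inode);

	trace_xfs_destroy_inode(ip);

	if (xfs_inactive_needs_work(ip)) {
		/*
		 * Defer the transactional part of inactivation to the
		 * per-ag background worker, which stops while the fs
		 * is frozen.
		 */
		xfs_iflags_set(ip, XFS_NEED_INACTIVATION);
		xfs_inode_set_reclaim_tag(ip);
		return;
	}

	xfs_inactive(ip);
	xfs_inode_set_reclaim_tag(ip);
}

The per-ag worker then walks the radix tree looking for inodes with
XFS_NEED_INACTIVATION set, runs xfs_inactive() on each one and clears
the flag so background reclaim can finally free them.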

Cheers,

Dave.
Darrick J. Wong June 29, 2018, 2:49 p.m. UTC | #2
On Fri, Jun 29, 2018 at 09:37:21AM +1000, Dave Chinner wrote:
> On Sun, Jun 24, 2018 at 12:24:20PM -0700, Darrick J. Wong wrote:
> > From: Darrick J. Wong <darrick.wong@oracle.com>
> > 
> > Destroying an incore inode sometimes requires some work to be done on
> > the inode.  For example, post-EOF blocks on a non-PREALLOC inode are
> > trimmed, and copy-on-write staging extents are freed.  This work is done
> > in separate transactions, which is bad for scrub and repair because (a)
> > we already have a transaction and can't nest them, and (b) if we've
> > frozen the filesystem for scrub/repair work, that (regular) transaction
> > allocation will block on the freeze.
> > 
> > Therefore, if we detect that work has to be done to destroy the incore
> > inode, we'll just hang on to the reference until after the scrub is
> > finished.
> > 
> > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> 
> Darrick, I'll just repeat what we discussed on #xfs here so we have
> it in the archive and everyone else knows why this is probably going
> to be done differently.
> 
> I think we should move deferred inode inactivation processing into
> the background reclaim radix tree walker rather than introduce a
> special new "don't iput this inode yet" state. We're really only
> trying to prevent the transactions that xfs_inactive() may run
> through iput() when the filesystem is frozen, and we already stop
> background reclaim processing when the fs is frozen.
> 
> I've always intended that xfs_fs_destroy_inode() basically becomes a
> no-op that just queues the inode for final inactivation, freeing and
> reclaim - right now it only does the reclaim work in the background.
> I first proposed this back in ~2008 here:
> 
> http://xfs.org/index.php/Improving_inode_Caching#Inode_Unlink
> 
> At this point, it really only requires a new inode flag to indicate
> that it has an inactivation pending - we set that if xfs_inactive
> needs to do work before the inode can be reclaimed, and have a
> separate per-ag work queue that walks the inode radix tree finding
> reclaimable inodes that have the NEED_INACTIVATION inode flag set.
> This way background reclaim doesn't get stuck on them.
> 
> This has benefits for many operations, e.g. bulk processing of
> inode inactivation and freeing either concurrently with or after
> rm -rf rather than at unlink syscall exit, the VFS inode cache
> shrinker never blocking on inactivation needing to run
> transactions, etc.
> 
> It also allows us to turn off inactivation on a per-AG basis,
> meaning that when we are rebuilding an AG structure in repair (e.g.
> the rmap btree) we can turn off inode inactivation and reclaim for
> that AG rather than needing to freeze the entire filesystem....

So although I've been off playing a JavaScript monkey this week, I should
note that the past few months I've also been slowly combing through all
the past online repair fuzz test output to see what's still majorly
broken.  I've noticed that the bmbt fuzzers have a particular failure
pattern that leads to shutdown, which is:

1) Fuzz a bmbt.br_blockcount value to a large enough value that we now
have a giant post-eof extent.

2) Mount filesystem.

3) Run xfs_scrub, which loads said inode, checks the bad bmbt, and tells
userspace it's broken...

4) ...and releases the inode.

5) Memory reclaim or someone comes along and calls xfs_inactive, which
says "Hey, nice post-EOF extent, let's trim that off!"  The extent free
code then freaks out "ZOMG, that extent is already free!"

6) Bam, filesystem shuts down.

7) xfs_scrub retries the bmbt scrub, this time with IFLAG_REPAIR set,
but by now the fs has already gone down, and sadness.

I've had a thought lurking around in my head for a while that perhaps we
should have a second SKIP_INACTIVATION iflag that indicates that the
inode is corrupt and we should skip post-eof inactivation to avoid fs
shutdowns.  We'd still have to take the risk of cleaning out the cow
fork (because that metadata is never persisted) but we could at least
avoid a shutdown.
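
Roughly (sketch only - XFS_SKIP_INACTIVATION is hypothetical and
doesn't exist yet), the top of xfs_inactive() would grow something
like:

	/*
	 * Scrub found this inode's mappings to be corrupt, so don't
	 * trust the bmbt enough to trim post-EOF blocks.  The CoW fork
	 * is never persisted, so cleaning it out is still safe.
	 */
	if (xfs_iflags_test(ip, XFS_SKIP_INACTIVATION)) {
		xfs_reflink_cancel_cow_range(ip, 0, NULLFILEOFF, true);
		return;
	}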

--D

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

Patch

diff --git a/fs/xfs/scrub/common.c b/fs/xfs/scrub/common.c
index c1132a40a366..9740c28384b6 100644
--- a/fs/xfs/scrub/common.c
+++ b/fs/xfs/scrub/common.c
@@ -22,6 +22,7 @@ 
 #include "xfs_alloc_btree.h"
 #include "xfs_bmap.h"
 #include "xfs_bmap_btree.h"
+#include "xfs_bmap_util.h"
 #include "xfs_ialloc.h"
 #include "xfs_ialloc_btree.h"
 #include "xfs_refcount.h"
@@ -890,3 +891,54 @@  xfs_scrub_ilock_inverted(
 	}
 	return -EDEADLOCK;
 }
+
+/*
+ * Release a reference to an inode while the fs is running a scrub or repair.
+ * If we anticipate that destroying the incore inode will require work to be
+ * done, we'll defer the iput until after the scrub/repair releases the
+ * transaction.
+ */
+void
+xfs_scrub_iput(
+	struct xfs_scrub_context	*sc,
+	struct xfs_inode		*ip)
+{
+	/*
+	 * If this file doesn't have any blocks to be freed at release time,
+	 * go straight to iput.
+	 */
+	if (!xfs_can_free_eofblocks(ip, true))
+		goto iput;
+
+	/*
+	 * Any real/unwritten extents in the CoW fork will have to be freed
+	 * so iput if there aren't any.
+	 */
+	if (!xfs_inode_has_cow_blocks(ip))
+		goto iput;
+
+	/*
+	 * Any blocks after the end of the file will have to be freed so iput
+	 * if there aren't any.
+	 */
+	if (!xfs_inode_has_posteof_blocks(ip))
+		goto iput;
+
+	/*
+	 * There are no other users of i_private in XFS so if it's non-NULL
+	 * this inode is already on the deferred iput list and we can release
+	 * this reference.
+	 */
+	if (VFS_I(ip)->i_private)
+		goto iput;
+
+	/* Otherwise, add it to the deferred iput list. */
+	trace_xfs_scrub_iput_defer(ip, __return_address);
+	VFS_I(ip)->i_private = sc->deferred_iput_list;
+	sc->deferred_iput_list = VFS_I(ip);
+	return;
+
+iput:
+	trace_xfs_scrub_iput_now(ip, __return_address);
+	iput(VFS_I(ip));
+}
diff --git a/fs/xfs/scrub/common.h b/fs/xfs/scrub/common.h
index 2172bd5361e2..ca9e15af2a4f 100644
--- a/fs/xfs/scrub/common.h
+++ b/fs/xfs/scrub/common.h
@@ -140,5 +140,6 @@  static inline bool xfs_scrub_skip_xref(struct xfs_scrub_metadata *sm)
 
 int xfs_scrub_metadata_inode_forks(struct xfs_scrub_context *sc);
 int xfs_scrub_ilock_inverted(struct xfs_inode *ip, uint lock_mode);
+void xfs_scrub_iput(struct xfs_scrub_context *sc, struct xfs_inode *ip);
 
 #endif	/* __XFS_SCRUB_COMMON_H__ */
diff --git a/fs/xfs/scrub/dir.c b/fs/xfs/scrub/dir.c
index 86324775fc9b..5cb371576732 100644
--- a/fs/xfs/scrub/dir.c
+++ b/fs/xfs/scrub/dir.c
@@ -87,7 +87,7 @@  xfs_scrub_dir_check_ftype(
 			xfs_mode_to_ftype(VFS_I(ip)->i_mode));
 	if (ino_dtype != dtype)
 		xfs_scrub_fblock_set_corrupt(sdc->sc, XFS_DATA_FORK, offset);
-	iput(VFS_I(ip));
+	xfs_scrub_iput(sdc->sc, ip);
 out:
 	return error;
 }
diff --git a/fs/xfs/scrub/parent.c b/fs/xfs/scrub/parent.c
index e2bda58c32f0..fd0b2bfb8f18 100644
--- a/fs/xfs/scrub/parent.c
+++ b/fs/xfs/scrub/parent.c
@@ -230,11 +230,11 @@  xfs_scrub_parent_validate(
 
 	/* Drat, parent changed.  Try again! */
 	if (dnum != dp->i_ino) {
-		iput(VFS_I(dp));
+		xfs_scrub_iput(sc, dp);
 		*try_again = true;
 		return 0;
 	}
-	iput(VFS_I(dp));
+	xfs_scrub_iput(sc, dp);
 
 	/*
 	 * '..' didn't change, so check that there was only one entry
@@ -247,7 +247,7 @@  xfs_scrub_parent_validate(
 out_unlock:
 	xfs_iunlock(dp, XFS_IOLOCK_SHARED);
 out_rele:
-	iput(VFS_I(dp));
+	xfs_scrub_iput(sc, dp);
 out:
 	return error;
 }
diff --git a/fs/xfs/scrub/scrub.c b/fs/xfs/scrub/scrub.c
index fec0e130f19e..b66cfbc56a34 100644
--- a/fs/xfs/scrub/scrub.c
+++ b/fs/xfs/scrub/scrub.c
@@ -157,6 +157,24 @@  xfs_scrub_probe(
 
 /* Scrub setup and teardown */
 
+/* Release all references to inodes we encountered needing deferred iput. */
+STATIC void
+xfs_scrub_iput_deferred(
+	struct xfs_scrub_context	*sc)
+{
+	struct inode			*inode, *next;
+
+	inode = sc->deferred_iput_list;
+	while (inode != (struct inode *)sc) {
+		next = inode->i_private;
+		inode->i_private = NULL;
+		trace_xfs_scrub_iput_deferred(XFS_I(inode), __return_address);
+		iput(inode);
+		inode = next;
+	}
+	sc->deferred_iput_list = sc;
+}
+
 /* Free all the resources and finish the transactions. */
 STATIC int
 xfs_scrub_teardown(
@@ -180,6 +198,7 @@  xfs_scrub_teardown(
 			iput(VFS_I(sc->ip));
 		sc->ip = NULL;
 	}
+	xfs_scrub_iput_deferred(sc);
 	if (sc->has_quotaofflock)
 		mutex_unlock(&sc->mp->m_quotainfo->qi_quotaofflock);
 	if (sc->buf) {
@@ -506,6 +525,7 @@  xfs_scrub_metadata(
 	sc.ops = &meta_scrub_ops[sm->sm_type];
 	sc.try_harder = try_harder;
 	sc.sa.agno = NULLAGNUMBER;
+	sc.deferred_iput_list = &sc;
 	error = sc.ops->setup(&sc, ip);
 	if (error)
 		goto out_teardown;
diff --git a/fs/xfs/scrub/scrub.h b/fs/xfs/scrub/scrub.h
index b295edd5fc0e..69eee2ffed29 100644
--- a/fs/xfs/scrub/scrub.h
+++ b/fs/xfs/scrub/scrub.h
@@ -65,6 +65,15 @@  struct xfs_scrub_context {
 	bool				try_harder;
 	bool				has_quotaofflock;
 
+	/*
+	 * List of inodes which cannot be released (by scrub) until after the
+	 * scrub operation concludes because we'd have to do some work to the
+	 * inode to destroy its incore representation (cow blocks, posteof
+	 * blocks, etc.).  Each inode's i_private points to the next inode, or
+	 * to the scrub context as a sentinel for the end of the list.
+	 */
+	void				*deferred_iput_list;
+
 	/* State tracking for single-AG operations. */
 	struct xfs_scrub_ag		sa;
 };
diff --git a/fs/xfs/scrub/trace.h b/fs/xfs/scrub/trace.h
index cec3e5ece5a1..a050a00fc258 100644
--- a/fs/xfs/scrub/trace.h
+++ b/fs/xfs/scrub/trace.h
@@ -480,6 +480,36 @@  TRACE_EVENT(xfs_scrub_xref_error,
 		  __entry->ret_ip)
 );
 
+DECLARE_EVENT_CLASS(xfs_scrub_iref_class,
+	TP_PROTO(struct xfs_inode *ip, xfs_failaddr_t caller_ip),
+	TP_ARGS(ip, caller_ip),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(xfs_ino_t, ino)
+		__field(int, count)
+		__field(xfs_failaddr_t, caller_ip)
+	),
+	TP_fast_assign(
+		__entry->dev = VFS_I(ip)->i_sb->s_dev;
+		__entry->ino = ip->i_ino;
+		__entry->count = atomic_read(&VFS_I(ip)->i_count);
+		__entry->caller_ip = caller_ip;
+	),
+	TP_printk("dev %d:%d ino 0x%llx count %d caller %pS",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->ino,
+		  __entry->count,
+		  __entry->caller_ip)
+)
+
+#define DEFINE_SCRUB_IREF_EVENT(name) \
+DEFINE_EVENT(xfs_scrub_iref_class, name, \
+	TP_PROTO(struct xfs_inode *ip, xfs_failaddr_t caller_ip), \
+	TP_ARGS(ip, caller_ip))
+DEFINE_SCRUB_IREF_EVENT(xfs_scrub_iput_deferred);
+DEFINE_SCRUB_IREF_EVENT(xfs_scrub_iput_defer);
+DEFINE_SCRUB_IREF_EVENT(xfs_scrub_iput_now);
+
 /* repair tracepoints */
 #if IS_ENABLED(CONFIG_XFS_ONLINE_REPAIR)