diff mbox series

xfs: fix online repair probing when CONFIG_XFS_ONLINE_REPAIR=n

Message ID 20250114224819.GD2103004@frogsfrogsfrogs (mailing list archive)
State New
Headers show
Series xfs: fix online repair probing when CONFIG_XFS_ONLINE_REPAIR=n | expand

Commit Message

Darrick J. Wong Jan. 14, 2025, 10:48 p.m. UTC
From: Darrick J. Wong <djwong@kernel.org>

I received a report from the release engineering side of the house that
xfs_scrub without the -n flag (aka fix it mode) would try to fix a
broken filesystem even on a kernel that doesn't have online repair built
into it:

 # xfs_scrub -dTvn /mnt/test
 EXPERIMENTAL xfs_scrub program in use! Use at your own risk!
 Phase 1: Find filesystem geometry.
 /mnt/test: using 1 threads to scrub.
 Phase 1: Memory used: 132k/0k (108k/25k), time:  0.00/ 0.00/ 0.00s
 <snip>
 Phase 4: Repair filesystem.
 <snip>
 Info: /mnt/test/some/victimdir directory entries: Attempting repair. (repair.c line 351)
 Corruption: /mnt/test/some/victimdir directory entries: Repair unsuccessful; offline repair required. (repair.c line 204)

Source: https://blogs.oracle.com/linux/post/xfs-online-filesystem-repair

It is strange that xfs_scrub doesn't refuse to run, because the kernel
is supposed to return EOPNOTSUPP if we actually needed to run a repair,
and xfs_io's repair subcommand will perror that.  And yet:

 # xfs_io -x -c 'repair probe' /mnt/test
 #

The first problem is commit dcb660f9222fd9 (4.15) which should have had
xchk_probe set the CORRUPT OFLAG so that any of the repair machinery
will get called at all.

It turns out that some refactoring that happened in the 6.6-6.8 era
broke the operation of this corner case.  What we *really* want to
happen is that all the predicates that would steer xfs_scrub_metadata()
towards calling xrep_attempt() should function the same way that they do
when repair is compiled in; and then xrep_attempt gets to return the
fatal EOPNOTSUPP error code that causes the probe to fail.

Instead, commit 8336a64eb75cba (6.6) started the failwhale swimming by
hoisting OFLAG checking logic into a helper whose non-repair stub always
returns false, causing scrub to return "repair not needed" when in fact
the repair is not supported.  Prior to that commit, the oflag checking
that was open-coded in scrub.c worked correctly.

Similarly, in commit 4bdfd7d15747b1 (6.8) we hoisted the IFLAG_REPAIR
and ALREADY_FIXED logic into a helper whose non-repair stub always
returns false, so we never enter the if test body that would have called
xrep_attempt, let alone fail to decode the OFLAGs correctly.

The final insult (yes, we're doing The Naked Gun now) is commit
48a72f60861f79 (6.8) in which we hoisted the "are we going to try a
repair?" predicate into yet another function with a non-repair stub
always returns false.

So.  Fix xchk_probe to trigger xrep_probe, then fix all the other helper
predicates in scrub.h so that we always point to xrep_attempt, even if
online repair is disabled.  Commit 48a72 is tagged here because the
scrub code prior to LTS 6.12 are incomplete and not worth patching.

Reported-by: David Flynn <david.flynn@oracle.com>
Cc: <stable@vger.kernel.org> # v6.8
Fixes: 48a72f60861f79 ("xfs: don't complain about unfixed metadata when repairs were injected")
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 fs/xfs/scrub/common.h |    5 -----
 fs/xfs/scrub/repair.h |   11 ++++++++++-
 fs/xfs/scrub/scrub.c  |    9 +++++++++
 3 files changed, 19 insertions(+), 6 deletions(-)

Comments

Christoph Hellwig Jan. 15, 2025, 6:06 a.m. UTC | #1
On Tue, Jan 14, 2025 at 02:48:19PM -0800, Darrick J. Wong wrote:
> index 950f5a58dcd967..09468f50781b24 100644
> --- a/fs/xfs/scrub/scrub.c
> +++ b/fs/xfs/scrub/scrub.c
> @@ -149,6 +149,15 @@ xchk_probe(
>  	if (xchk_should_terminate(sc, &error))
>  		return error;
>  
> +	/*
> +	 * If the caller is probing to see if repair works, set the CORRUPT
> +	 * flag (without any of the usual tracing/logging) to force us into
> +	 * the repair codepaths.  If repair is compiled into the kernel, we'll
> +	 * call xrep_probe and simulate a repair; otherwise, the repair
> +	 * codepaths return EOPNOTSUPP.
> +	 */
> +	if (xchk_could_repair(sc))
> +		sc->sm->sm_flags |= XFS_SCRUB_OFLAG_CORRUPT;

Stupid question: what is the point in not just directly returning
-EOPNOTSUPP here when online repair is not supported?
Darrick J. Wong Jan. 15, 2025, 6:20 a.m. UTC | #2
On Wed, Jan 15, 2025 at 07:06:15AM +0100, Christoph Hellwig wrote:
> On Tue, Jan 14, 2025 at 02:48:19PM -0800, Darrick J. Wong wrote:
> > index 950f5a58dcd967..09468f50781b24 100644
> > --- a/fs/xfs/scrub/scrub.c
> > +++ b/fs/xfs/scrub/scrub.c
> > @@ -149,6 +149,15 @@ xchk_probe(
> >  	if (xchk_should_terminate(sc, &error))
> >  		return error;
> >  
> > +	/*
> > +	 * If the caller is probing to see if repair works, set the CORRUPT
> > +	 * flag (without any of the usual tracing/logging) to force us into
> > +	 * the repair codepaths.  If repair is compiled into the kernel, we'll
> > +	 * call xrep_probe and simulate a repair; otherwise, the repair
> > +	 * codepaths return EOPNOTSUPP.
> > +	 */
> > +	if (xchk_could_repair(sc))
> > +		sc->sm->sm_flags |= XFS_SCRUB_OFLAG_CORRUPT;
> 
> Stupid question: what is the point in not just directly returning
> -EOPNOTSUPP here when online repair is not supported?

Good point, we could cut it off right then and there.  Though this seems
a little gross:

	if (xchk_could_repair(sc))
#ifdef CONFIG_XFS_ONLINE_REPAIR
		sc->sm->sm_flags |= XFS_SCRUB_OFLAG_CORRUPT;
#else
		return -EOPNOTSUPP;
#endif
	return 0;

but I don't mind.  Some day the stubs will go away, fingers crossed.

--D
Christoph Hellwig Jan. 15, 2025, 6:27 a.m. UTC | #3
On Tue, Jan 14, 2025 at 10:20:37PM -0800, Darrick J. Wong wrote:
> Good point, we could cut it off right then and there.  Though this seems
> a little gross:
> 
> 	if (xchk_could_repair(sc))
> #ifdef CONFIG_XFS_ONLINE_REPAIR
> 		sc->sm->sm_flags |= XFS_SCRUB_OFLAG_CORRUPT;
> #else
> 		return -EOPNOTSUPP;
> #endif
> 	return 0;
> 
> but I don't mind.  Some day the stubs will go away, fingers crossed.

We'll I'd write it as:

	if (xchk_could_repair(sc)) {
		if (!IS_ENABLED(CONFIG_XFS_ONLINE_REPAIR))
			return -EOPNOTSUPP;
		sc->sm->sm_flags |= XFS_SCRUB_OFLAG_CORRUPT;
	}

but I'm fine with either version:

Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong Jan. 15, 2025, 6:29 a.m. UTC | #4
On Wed, Jan 15, 2025 at 07:27:43AM +0100, Christoph Hellwig wrote:
> On Tue, Jan 14, 2025 at 10:20:37PM -0800, Darrick J. Wong wrote:
> > Good point, we could cut it off right then and there.  Though this seems
> > a little gross:
> > 
> > 	if (xchk_could_repair(sc))
> > #ifdef CONFIG_XFS_ONLINE_REPAIR
> > 		sc->sm->sm_flags |= XFS_SCRUB_OFLAG_CORRUPT;
> > #else
> > 		return -EOPNOTSUPP;
> > #endif
> > 	return 0;
> > 
> > but I don't mind.  Some day the stubs will go away, fingers crossed.
> 
> We'll I'd write it as:
> 
> 	if (xchk_could_repair(sc)) {
> 		if (!IS_ENABLED(CONFIG_XFS_ONLINE_REPAIR))
> 			return -EOPNOTSUPP;
> 		sc->sm->sm_flags |= XFS_SCRUB_OFLAG_CORRUPT;
> 	}
> 
> but I'm fine with either version:

I like your version /much/ better.

> Reviewed-by: Christoph Hellwig <hch@lst.de>

Thanks!

--D
diff mbox series

Patch

diff --git a/fs/xfs/scrub/common.h b/fs/xfs/scrub/common.h
index eecb6d54c0f953..53a6b34ce54271 100644
--- a/fs/xfs/scrub/common.h
+++ b/fs/xfs/scrub/common.h
@@ -212,7 +212,6 @@  static inline bool xchk_skip_xref(struct xfs_scrub_metadata *sm)
 bool xchk_dir_looks_zapped(struct xfs_inode *dp);
 bool xchk_pptr_looks_zapped(struct xfs_inode *ip);
 
-#ifdef CONFIG_XFS_ONLINE_REPAIR
 /* Decide if a repair is required. */
 static inline bool xchk_needs_repair(const struct xfs_scrub_metadata *sm)
 {
@@ -232,10 +231,6 @@  static inline bool xchk_could_repair(const struct xfs_scrub *sc)
 	return (sc->sm->sm_flags & XFS_SCRUB_IFLAG_REPAIR) &&
 		!(sc->flags & XREP_ALREADY_FIXED);
 }
-#else
-# define xchk_needs_repair(sc)		(false)
-# define xchk_could_repair(sc)		(false)
-#endif /* CONFIG_XFS_ONLINE_REPAIR */
 
 int xchk_metadata_inode_forks(struct xfs_scrub *sc);
 
diff --git a/fs/xfs/scrub/repair.h b/fs/xfs/scrub/repair.h
index b649da1a93eb8c..b3b1fe62814e7b 100644
--- a/fs/xfs/scrub/repair.h
+++ b/fs/xfs/scrub/repair.h
@@ -173,7 +173,16 @@  bool xrep_buf_verify_struct(struct xfs_buf *bp, const struct xfs_buf_ops *ops);
 #else
 
 #define xrep_ino_dqattach(sc)	(0)
-#define xrep_will_attempt(sc)	(false)
+
+/*
+ * When online repair is not built into the kernel, we still want to attempt
+ * the repair so that the stub xrep_attempt below will return EOPNOTSUPP.
+ */
+static inline bool xrep_will_attempt(const struct xfs_scrub *sc)
+{
+	return (sc->sm->sm_flags & XFS_SCRUB_IFLAG_FORCE_REBUILD) ||
+		xchk_needs_repair(sc->sm);
+}
 
 static inline int
 xrep_attempt(
diff --git a/fs/xfs/scrub/scrub.c b/fs/xfs/scrub/scrub.c
index 950f5a58dcd967..09468f50781b24 100644
--- a/fs/xfs/scrub/scrub.c
+++ b/fs/xfs/scrub/scrub.c
@@ -149,6 +149,15 @@  xchk_probe(
 	if (xchk_should_terminate(sc, &error))
 		return error;
 
+	/*
+	 * If the caller is probing to see if repair works, set the CORRUPT
+	 * flag (without any of the usual tracing/logging) to force us into
+	 * the repair codepaths.  If repair is compiled into the kernel, we'll
+	 * call xrep_probe and simulate a repair; otherwise, the repair
+	 * codepaths return EOPNOTSUPP.
+	 */
+	if (xchk_could_repair(sc))
+		sc->sm->sm_flags |= XFS_SCRUB_OFLAG_CORRUPT;
 	return 0;
 }