diff mbox series

[7/7] xfs_scrub: create a new category for unfixable errors

Message ID 157177017442.1460394.7425325906254151917.stgit@magnolia (mailing list archive)
State Accepted
Headers show
Series xfs_scrub: clean up error classifications | expand

Commit Message

Darrick J. Wong Oct. 22, 2019, 6:49 p.m. UTC
From: Darrick J. Wong <darrick.wong@oracle.com>

There's nothing that xfs_scrub (or XFS) can do about media errors for
data file blocks -- the data are gone.  Create a new category for these
unfixable errors so that we don't advise the user to take further action
that won't fix the problem.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 scrub/common.c    |    8 +++++++-
 scrub/common.h    |    3 +++
 scrub/phase4.c    |    5 ++++-
 scrub/phase5.c    |    2 +-
 scrub/phase6.c    |    2 +-
 scrub/xfs_scrub.c |   17 +++++++++++++----
 scrub/xfs_scrub.h |    1 +
 7 files changed, 30 insertions(+), 8 deletions(-)

Comments

Eric Sandeen Nov. 1, 2019, 9:32 p.m. UTC | #1
On 10/22/19 1:49 PM, Darrick J. Wong wrote:
> From: Darrick J. Wong <darrick.wong@oracle.com>
> 
> There's nothing that xfs_scrub (or XFS) can do about media errors for
> data file blocks -- the data are gone.  Create a new category for these
> unfixable errors so that we don't advise the user to take further action
> that won't fix the problem.
> 
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>

are "unfixable errors" /exclusively/ file data media errors?

> diff --git a/scrub/phase4.c b/scrub/phase4.c
> index 1cf3f6b7..a276bc32 100644
> --- a/scrub/phase4.c
> +++ b/scrub/phase4.c
> @@ -99,7 +99,10 @@ xfs_process_action_items(
>  	workqueue_destroy(&wq);
>  
>  	pthread_mutex_lock(&ctx->lock);
> -	if (moveon && ctx->corruptions_found == 0 && want_fstrim) {
> +	if (moveon &&
> +	    ctx->corruptions_found == 0 &&
> +	    ctx->unfixable_errors == 0 &&
> +	    want_fstrim) {
>  		fstrim(ctx);
>  		progress_add(1);
>  	}


why would a file data media error preclude fstrim?

> diff --git a/scrub/phase5.c b/scrub/phase5.c
> index dc0ee5e8..e0c4a3df 100644
> --- a/scrub/phase5.c
> +++ b/scrub/phase5.c
> @@ -336,7 +336,7 @@ xfs_scan_connections(
>  	bool			moveon = true;
>  	bool			ret;
>  
> -	if (ctx->corruptions_found) {
> +	if (ctx->corruptions_found || ctx->unfixable_errors) {
>  		str_info(ctx, ctx->mntpoint,
>  _("Filesystem has errors, skipping connectivity checks."));

why would a file data media error stop the connectivity check?

unless "unfixable" may be other types, in which case it makes sense.

But the commit log indicates it's just for a file data media error.

What's the intent?

-Eric
Darrick J. Wong Nov. 1, 2019, 10:10 p.m. UTC | #2
On Fri, Nov 01, 2019 at 04:32:28PM -0500, Eric Sandeen wrote:
> On 10/22/19 1:49 PM, Darrick J. Wong wrote:
> > From: Darrick J. Wong <darrick.wong@oracle.com>
> > 
> > There's nothing that xfs_scrub (or XFS) can do about media errors for
> > data file blocks -- the data are gone.  Create a new category for these
> > unfixable errors so that we don't advise the user to take further action
> > that won't fix the problem.
> > 
> > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> 
> are "unfixable errors" /exclusively/ file data media errors?

The intent behind this new classification is for things that we can't
ever fix or rebuild in the filesystem.  However, at least for now, file
data loss is the only source of unfixable errors.

> > diff --git a/scrub/phase4.c b/scrub/phase4.c
> > index 1cf3f6b7..a276bc32 100644
> > --- a/scrub/phase4.c
> > +++ b/scrub/phase4.c
> > @@ -99,7 +99,10 @@ xfs_process_action_items(
> >  	workqueue_destroy(&wq);
> >  
> >  	pthread_mutex_lock(&ctx->lock);
> > -	if (moveon && ctx->corruptions_found == 0 && want_fstrim) {
> > +	if (moveon &&
> > +	    ctx->corruptions_found == 0 &&
> > +	    ctx->unfixable_errors == 0 &&
> > +	    want_fstrim) {
> >  		fstrim(ctx);
> >  		progress_add(1);
> >  	}
> 
> 
> why would a file data media error preclude fstrim?

If there's even the slightest hint of things still being wrong with the
filesystem or the underlying storage, we should avoid writing to the
storage (e.g. FITRIM) because that could screw things up even more.

Note that this fstrim happens at the end of phase 4, so that means that
if we have any corruptions or unfixable errors at this point, they're in
the metadata, our attempts to fix them have failed, and so we absolutely
should not go writing more things to the disk.

> > diff --git a/scrub/phase5.c b/scrub/phase5.c
> > index dc0ee5e8..e0c4a3df 100644
> > --- a/scrub/phase5.c
> > +++ b/scrub/phase5.c
> > @@ -336,7 +336,7 @@ xfs_scan_connections(
> >  	bool			moveon = true;
> >  	bool			ret;
> >  
> > -	if (ctx->corruptions_found) {
> > +	if (ctx->corruptions_found || ctx->unfixable_errors) {
> >  		str_info(ctx, ctx->mntpoint,
> >  _("Filesystem has errors, skipping connectivity checks."));
> 
> why would a file data media error stop the connectivity check?

It won't, because the media scan happens during phase 6.

> unless "unfixable" may be other types, in which case it makes sense.

They will, presumably.  For an unfixable error to stop the connectivity
checks (phase 5), the unfixable error would have to be observed during
phases 1-4.  Those first four phases exclusively look for (and repair)
internal fs metadata, so if we reach phase 5 with an unfixable error
that means the fs metadata are trashed and xfs_scrub is going to fail
anyway.

Note that I haven't actually tried to write a directory repair function
yet.  The last time I asked viro about it he wondered what in the hell I
was going to do about the dentry cache and doubted that it could be
done... so either I have to prove him wrong, or maybe directories /will/
become the first source of unfixable errors for phase 1-4.

So yes, the 'unfixable' category is supposed to be broad enough to cover
more than just file data loss, though for now it's only used for that.

--D

> But the commit log indicates it's just for a file data media error.
> 
> What's the intent?
>
> -Eric
Eric Sandeen Nov. 1, 2019, 10:15 p.m. UTC | #3
On 10/22/19 1:49 PM, Darrick J. Wong wrote:
> From: Darrick J. Wong <darrick.wong@oracle.com>
> 
> There's nothing that xfs_scrub (or XFS) can do about media errors for
> data file blocks -- the data are gone.  Create a new category for these
> unfixable errors so that we don't advise the user to take further action
> that won't fix the problem.

Ok.  I'll just slightly edit this to indicate that it's not exclusively
media errors and the rest makes sense.

Reviewed-by: Eric Sandeen <sandeen@redhat.com>
diff mbox series

Patch

diff --git a/scrub/common.c b/scrub/common.c
index 261c6bb2..e72ae540 100644
--- a/scrub/common.c
+++ b/scrub/common.c
@@ -43,7 +43,7 @@  xfs_scrub_excessive_errors(
 		return false;
 
 	pthread_mutex_lock(&ctx->lock);
-	errors_seen = ctx->corruptions_found;
+	errors_seen = ctx->corruptions_found + ctx->unfixable_errors;
 	pthread_mutex_unlock(&ctx->lock);
 
 	return errors_seen >= ctx->max_errors;
@@ -61,6 +61,10 @@  static struct {
 		.string = "Corruption",
 		.loglevel = LOG_ERR,
 	},
+	[S_UNFIXABLE] = {
+		.string = "Unfixable Error",
+		.loglevel = LOG_ERR,
+	},
 	[S_WARN]   = {
 		.string = "Warning",
 		.loglevel = LOG_WARNING,
@@ -136,6 +140,8 @@  __str_out(
 		ctx->runtime_errors++;
 	else if (level == S_CORRUPT)
 		ctx->corruptions_found++;
+	else if (level == S_UNFIXABLE)
+		ctx->unfixable_errors++;
 	else if (level == S_WARN)
 		ctx->warnings_found++;
 	else if (level == S_REPAIR)
diff --git a/scrub/common.h b/scrub/common.h
index b1f2ea2c..cfd9f186 100644
--- a/scrub/common.h
+++ b/scrub/common.h
@@ -18,6 +18,7 @@  bool xfs_scrub_excessive_errors(struct scrub_ctx *ctx);
 enum error_level {
 	S_ERROR	= 0,
 	S_CORRUPT,
+	S_UNFIXABLE,
 	S_WARN,
 	S_INFO,
 	S_REPAIR,
@@ -43,6 +44,8 @@  void __str_out(struct scrub_ctx *ctx, const char *descr, enum error_level level,
 	__str_out(ctx, str, S_REPAIR,	0,	__FILE__, __LINE__, __VA_ARGS__)
 #define record_preen(ctx, str, ...) \
 	__str_out(ctx, str, S_PREEN,	0,	__FILE__, __LINE__, __VA_ARGS__)
+#define str_unfixable_error(ctx, str, ...) \
+	__str_out(ctx, str, S_UNFIXABLE, 0,	__FILE__, __LINE__, __VA_ARGS__)
 
 #define dbg_printf(fmt, ...) \
 	do {if (debug > 1) {printf(fmt, __VA_ARGS__);}} while (0)
diff --git a/scrub/phase4.c b/scrub/phase4.c
index 1cf3f6b7..a276bc32 100644
--- a/scrub/phase4.c
+++ b/scrub/phase4.c
@@ -99,7 +99,10 @@  xfs_process_action_items(
 	workqueue_destroy(&wq);
 
 	pthread_mutex_lock(&ctx->lock);
-	if (moveon && ctx->corruptions_found == 0 && want_fstrim) {
+	if (moveon &&
+	    ctx->corruptions_found == 0 &&
+	    ctx->unfixable_errors == 0 &&
+	    want_fstrim) {
 		fstrim(ctx);
 		progress_add(1);
 	}
diff --git a/scrub/phase5.c b/scrub/phase5.c
index dc0ee5e8..e0c4a3df 100644
--- a/scrub/phase5.c
+++ b/scrub/phase5.c
@@ -336,7 +336,7 @@  xfs_scan_connections(
 	bool			moveon = true;
 	bool			ret;
 
-	if (ctx->corruptions_found) {
+	if (ctx->corruptions_found || ctx->unfixable_errors) {
 		str_info(ctx, ctx->mntpoint,
 _("Filesystem has errors, skipping connectivity checks."));
 		return true;
diff --git a/scrub/phase6.c b/scrub/phase6.c
index bb159641..aae6b7d8 100644
--- a/scrub/phase6.c
+++ b/scrub/phase6.c
@@ -161,7 +161,7 @@  report_badfile(
 	bad_length = min(start + length,
 			 br->bmap->bm_physical + br->bmap->bm_length) - start;
 
-	str_error(br->ctx, br->descr,
+	str_unfixable_error(br->ctx, br->descr,
 _("media error at data offset %llu length %llu."),
 			br->bmap->bm_offset + bad_offset, bad_length);
 	return 0;
diff --git a/scrub/xfs_scrub.c b/scrub/xfs_scrub.c
index 222daae1..963d0d70 100644
--- a/scrub/xfs_scrub.c
+++ b/scrub/xfs_scrub.c
@@ -511,15 +511,24 @@  static void
 report_outcome(
 	struct scrub_ctx	*ctx)
 {
-	unsigned long long	total_errors;
+	unsigned long long	actionable_errors;
 
-	total_errors = ctx->corruptions_found + ctx->runtime_errors;
+	actionable_errors = ctx->corruptions_found + ctx->runtime_errors;
 
-	if (total_errors == 0 && ctx->warnings_found == 0) {
+	if (actionable_errors == 0 &&
+	    ctx->unfixable_errors == 0 &&
+	    ctx->warnings_found == 0) {
 		log_info(ctx, _("No problems found."));
 		return;
 	}
 
+	if (ctx->unfixable_errors) {
+		fprintf(stderr, _("%s: unfixable errors found: %llu\n"),
+				ctx->mntpoint, ctx->unfixable_errors);
+		log_err(ctx, _("unfixable errors found: %llu"),
+				ctx->unfixable_errors);
+	}
+
 	if (ctx->corruptions_found > 0) {
 		fprintf(stderr, _("%s: corruptions found: %llu\n"),
 				ctx->mntpoint, ctx->corruptions_found);
@@ -545,7 +554,7 @@  report_outcome(
 	 * setting up the scrub and we actually saw corruptions.  Warnings
 	 * are not corruptions.
 	 */
-	if (ctx->scrub_setup_succeeded && total_errors > 0) {
+	if (ctx->scrub_setup_succeeded && actionable_errors > 0) {
 		char		*msg;
 
 		if (ctx->mode == SCRUB_MODE_DRY_RUN)
diff --git a/scrub/xfs_scrub.h b/scrub/xfs_scrub.h
index 5abc41fd..61831c92 100644
--- a/scrub/xfs_scrub.h
+++ b/scrub/xfs_scrub.h
@@ -74,6 +74,7 @@  struct scrub_ctx {
 	unsigned long long	max_errors;
 	unsigned long long	runtime_errors;
 	unsigned long long	corruptions_found;
+	unsigned long long	unfixable_errors;
 	unsigned long long	warnings_found;
 	unsigned long long	inodes_checked;
 	unsigned long long	bytes_checked;