[7/7] btrfs: scrub: fix the frequency of error messages

Message ID	c46343633e6198f315876179d31b3a66b66a8aa2.1710409033.git.wqu@suse.com (mailing list archive)
State	New, archived
Headers	show Received: from smtp-out2.suse.de (smtp-out2.suse.de [195.135.223.131]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id D95FB6E2B3 for <linux-btrfs@vger.kernel.org>; Thu, 14 Mar 2024 09:50:36 +0000 (UTC) From: Qu Wenruo <wqu@suse.com> To: linux-btrfs@vger.kernel.org Subject: [PATCH 7/7] btrfs: scrub: fix the frequency of error messages Date: Thu, 14 Mar 2024 20:20:20 +1030 Message-ID: <c46343633e6198f315876179d31b3a66b66a8aa2.1710409033.git.wqu@suse.com> In-Reply-To: <cover.1710409033.git.wqu@suse.com> References: <cover.1710409033.git.wqu@suse.com> Precedence: bulk MIME-Version: 1.0 Content-Transfer-Encoding: 8bit
Series	btrfs: scrub: refine the error messages \| expand [0/6] btrfs: scrub: refine the error messages [1/7] btrfs: scrub: fix incorrectly reported logical/physical address [2/7] btrfs: reduce the log level for btrfs_dev_stat_inc_and_print() [3/7] btrfs: scrub: remove unused is_super parameter from scrub_print_common_warning() [4/7] btrfs: scrub: remove unnecessary dev/physical lookup for scrub_stripe_report_errors() [5/7] btrfs: scrub: simplify the inode iteration output [6/7] btrfs: scrub: unify and shorten the error message [7/7] btrfs: scrub: fix the frequency of error messages

Message ID

c46343633e6198f315876179d31b3a66b66a8aa2.1710409033.git.wqu@suse.com (mailing list archive)

State

New, archived

Headers

From: Qu Wenruo <wqu@suse.com>
To: linux-btrfs@vger.kernel.org
Subject: [PATCH 7/7] btrfs: scrub: fix the frequency of error messages
Date: Thu, 14 Mar 2024 20:20:20 +1030
Message-ID: 
 <c46343633e6198f315876179d31b3a66b66a8aa2.1710409033.git.wqu@suse.com>
In-Reply-To: <cover.1710409033.git.wqu@suse.com>
References: <cover.1710409033.git.wqu@suse.com>
Precedence: bulk
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit

Series

btrfs: scrub: refine the error messages | expand

Commit Message

Qu Wenruo March 14, 2024, 9:50 a.m. UTC

For btrfs scrub error messages, I have hit a lot of support cases on
older kernels where no filename is outputted at all, with only error
messages like:

 BTRFS error (device dm-0): bdev /dev/mapper/sys-rootlv errs: wr 0, rd 0, flush 0, corrupt 2823, gen 0
 BTRFS error (device dm-0): unable to fixup (regular) error at logical 1563504640 on dev /dev/mapper/sys-rootlv
 BTRFS error (device dm-0): bdev /dev/mapper/sys-rootlv errs: wr 0, rd 0, flush 0, corrupt 2824, gen 0
 BTRFS error (device dm-0): unable to fixup (regular) error at logical 1579016192 on dev /dev/mapper/sys-rootlv
 BTRFS error (device dm-0): bdev /dev/mapper/sys-rootlv errs: wr 0, rd 0, flush 0, corrupt 2825, gen 0

The "unable to fixup (regular) error" line shows we hit an unrepairable
error, then normally we would do data/metadata backref walk to grab the
correct info.

But we can hit cases like the inode is already orphan (unlinked from any
parent inode), or even the subvolume is orphan (unlinked and waiting to
be deleted).

In that case we're not sure what's the proper way to continue (Is it
some critical data/metadata? Would it prevent the system from booting?)

To improve the situation, this patch would:

- Ensure we at least output one message for each unrepairable error
  If by somehow we didn't output any error message for the error, we
  always fallback to the basic logical/physical error output, with its
  type (data or metadata).

- Remove the "unable to fixup" error message
  Since we have ensured at least one error message is outputted for each
  unrepairable corruption, there is no need for that fallback one.

Now one unrepairable corruption would output a more refined and less
duplicated error message:

 BTRFS info (device dm-2): scrub: started on devid 1
 BTRFS warning (device dm-2): checksum error at inode 5/257(file1) fileoff 16384, logical 13647872(1) physical 1(/dev/mapper/test-scratch1)13647872
 BTRFS info (device dm-2): scrub: finished on devid 1 with status: 0

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/scrub.c | 42 ++++++++++++++++++++++++++----------------
 1 file changed, 26 insertions(+), 16 deletions(-)

Comments

Filipe Manana March 14, 2024, 5:51 p.m. UTC | #1

On Thu, Mar 14, 2024 at 9:51 AM Qu Wenruo <wqu@suse.com> wrote:
>
> For btrfs scrub error messages, I have hit a lot of support cases on
> older kernels where no filename is outputted at all, with only error
> messages like:
>
>  BTRFS error (device dm-0): bdev /dev/mapper/sys-rootlv errs: wr 0, rd 0, flush 0, corrupt 2823, gen 0
>  BTRFS error (device dm-0): unable to fixup (regular) error at logical 1563504640 on dev /dev/mapper/sys-rootlv
>  BTRFS error (device dm-0): bdev /dev/mapper/sys-rootlv errs: wr 0, rd 0, flush 0, corrupt 2824, gen 0
>  BTRFS error (device dm-0): unable to fixup (regular) error at logical 1579016192 on dev /dev/mapper/sys-rootlv
>  BTRFS error (device dm-0): bdev /dev/mapper/sys-rootlv errs: wr 0, rd 0, flush 0, corrupt 2825, gen 0
>
> The "unable to fixup (regular) error" line shows we hit an unrepairable
> error, then normally we would do data/metadata backref walk to grab the
> correct info.
>
> But we can hit cases like the inode is already orphan (unlinked from any
> parent inode), or even the subvolume is orphan (unlinked and waiting to
> be deleted).
>
> In that case we're not sure what's the proper way to continue (Is it
> some critical data/metadata? Would it prevent the system from booting?)
>
> To improve the situation, this patch would:
>
> - Ensure we at least output one message for each unrepairable error
>   If by somehow we didn't output any error message for the error, we
>   always fallback to the basic logical/physical error output, with its
>   type (data or metadata).
>
> - Remove the "unable to fixup" error message
>   Since we have ensured at least one error message is outputted for each
>   unrepairable corruption, there is no need for that fallback one.
>
> Now one unrepairable corruption would output a more refined and less
> duplicated error message:
>
>  BTRFS info (device dm-2): scrub: started on devid 1
>  BTRFS warning (device dm-2): checksum error at inode 5/257(file1) fileoff 16384, logical 13647872(1) physical 1(/dev/mapper/test-scratch1)13647872
>  BTRFS info (device dm-2): scrub: finished on devid 1 with status: 0
>
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
>  fs/btrfs/scrub.c | 42 ++++++++++++++++++++++++++----------------
>  1 file changed, 26 insertions(+), 16 deletions(-)
>
> diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
> index 17e492076bf8..84617a64b2d4 100644
> --- a/fs/btrfs/scrub.c
> +++ b/fs/btrfs/scrub.c
> @@ -227,6 +227,7 @@ struct scrub_warning {
>         u64                     logical;
>         struct btrfs_device     *dev;
>         int                     mirror_num;
> +       bool                    message_printed;
>  };
>
>  static void release_scrub_stripe(struct scrub_stripe *stripe)
> @@ -426,26 +427,29 @@ static int scrub_print_warning_inode(u64 inum, u64 offset, u64 num_bytes,
>          * we deliberately ignore the bit ipath might have been too small to
>          * hold all of the paths here
>          */
> -       for (i = 0; i < ipath->fspath->elem_cnt; ++i)
> -               btrfs_warn_in_rcu(fs_info,
> +       for (i = 0; i < ipath->fspath->elem_cnt; ++i) {
> +               btrfs_warn_rl_in_rcu(fs_info,
>  "%s at inode %lld/%llu(%s) fileoff %llu, logical %llu(%u) physical %llu(%s)%llu",
>                                   swarn->errstr, root, inum,
>                                   (char *)ipath->fspath->val[i], offset,
>                                   swarn->logical, swarn->mirror_num,
>                                   swarn->dev->devid, btrfs_dev_name(swarn->dev),
>                                   swarn->physical);
> +               swarn->message_printed = true;
> +       }
>
>         btrfs_put_root(local_root);
>         free_ipath(ipath);
>         return 0;
>
>  err:
> -       btrfs_warn_in_rcu(fs_info,
> +       btrfs_warn_rl_in_rcu(fs_info,
>         "%s at inode %lld/%llu fileoff %llu, logical %llu(%u) physical %llu(%s)%llu",
>                           swarn->errstr, root, inum, offset,
>                           swarn->logical, swarn->mirror_num,
>                           swarn->dev->devid, btrfs_dev_name(swarn->dev),
>                           swarn->physical);
> +       swarn->message_printed = true;
>         free_ipath(ipath);
>         return 0;
>  }
> @@ -472,6 +476,7 @@ static void scrub_print_common_warning(const char *errstr, struct btrfs_device *
>         swarn.errstr = errstr;
>         swarn.dev = NULL;
>         swarn.mirror_num = mirror_num;
> +       swarn.message_printed = false;
>
>         ret = extent_from_logical(fs_info, swarn.logical, path, &found_key,
>                                   &flags);
> @@ -488,26 +493,31 @@ static void scrub_print_common_warning(const char *errstr, struct btrfs_device *
>                 unsigned long ptr = 0;
>                 u8 ref_level;
>                 u64 ref_root;
> +               bool message_printed = false;
>
>                 while (true) {
>                         ret = tree_backref_for_extent(&ptr, eb, &found_key, ei,
>                                                       item_size, &ref_root,
>                                                       &ref_level);
> -                       if (ret < 0) {
> -                               btrfs_warn(fs_info,
> -                               "failed to resolve tree backref for logical %llu: %d",
> -                                                 swarn.logical, ret);
> +                       if (ret < 0)
>                                 break;
> -                       }
>                         if (ret > 0)
>                                 break;
> -                       btrfs_warn_in_rcu(fs_info,
> +                       btrfs_warn_rl_in_rcu(fs_info,
>  "%s at metadata in tree %llu(%u), logical %llu(%u) physical %llu(%s)%llu",
>                                 errstr, ref_root, ref_level,
>                                 swarn.logical, swarn.mirror_num,
>                                 dev->devid, btrfs_dev_name(dev),
>                                 swarn.physical);
> +                       message_printed = true;
>                 }
> +               if (!message_printed)
> +                       btrfs_warn_rl_in_rcu(fs_info,
> +"%s at metadata, logical %llu(%u) physical %llu(%s)%llu",
> +                               errstr, swarn.logical, swarn.mirror_num,
> +                               dev->devid, btrfs_dev_name(dev),
> +                               swarn.physical);
> +
>                 btrfs_release_path(path);
>         } else {
>                 struct btrfs_backref_walk_ctx ctx = { 0 };
> @@ -522,8 +532,14 @@ static void scrub_print_common_warning(const char *errstr, struct btrfs_device *
>                 swarn.dev = dev;
>
>                 iterate_extent_inodes(&ctx, true, scrub_print_warning_inode, &swarn);
> +               if (!swarn.message_printed) {
> +                       btrfs_warn_rl_in_rcu(fs_info,
> +"%s at data, unresolved path name, logical %llu(%u) physical %llu(%s)%llu",
> +                               errstr, swarn.logical, swarn.mirror_num,
> +                               dev->devid, btrfs_dev_name(dev),
> +                               swarn.physical);
> +               }

No need for the curly braces here, it's a single statement, so we
follow the style of dropping them.


>         }
> -
>  out:
>         btrfs_free_path(path);
>  }
> @@ -889,12 +905,6 @@ static void scrub_stripe_report_errors(struct scrub_ctx *sctx,
>                 }
>
>                 /* The remaining are all for unrepaired. */
> -               btrfs_err_rl_in_rcu(fs_info,
> -       "unable to fixup (regular) error at logical %llu(%u) physical %llu(%s)%llu",
> -                                           logical, stripe->mirror_num,
> -                                           dev->devid, btrfs_dev_name(dev),
> -                                           physical);
> -

This hunk seems unintentional?

Thanks.

>                 if (test_bit(sector_nr, &stripe->io_error_bitmap))
>                         scrub_print_common_warning("i/o error", dev,
>                                                      logical, physical,
> --
> 2.44.0
>
>

Qu Wenruo March 14, 2024, 8:32 p.m. UTC | #2

在 2024/3/15 04:21, Filipe Manana 写道:
[...]
>> +               }
>
> No need for the curly braces here, it's a single statement, so we
> follow the style of dropping them.
>
>
>>          }
>> -
>>   out:
>>          btrfs_free_path(path);
>>   }
>> @@ -889,12 +905,6 @@ static void scrub_stripe_report_errors(struct scrub_ctx *sctx,
>>                  }
>>
>>                  /* The remaining are all for unrepaired. */
>> -               btrfs_err_rl_in_rcu(fs_info,
>> -       "unable to fixup (regular) error at logical %llu(%u) physical %llu(%s)%llu",
>> -                                           logical, stripe->mirror_num,
>> -                                           dev->devid, btrfs_dev_name(dev),
>> -                                           physical);
>> -
>
> This hunk seems unintentional?

This is intentional:

  - Remove the "unable to fixup" error message
   Since we have ensured at least one error message is outputted for each
   unrepairable corruption, there is no need for that fallback one.

Since we have ensured we would at least output one line for each
unrepaired corruption (before rate limit though), that line is duplicated.

Thanks,
Qu
>
> Thanks.
>
>>                  if (test_bit(sector_nr, &stripe->io_error_bitmap))
>>                          scrub_print_common_warning("i/o error", dev,
>>                                                       logical, physical,
>> --
>> 2.44.0
>>
>>
>

diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index 17e492076bf8..84617a64b2d4 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -227,6 +227,7 @@  struct scrub_warning {
 	u64			logical;
 	struct btrfs_device	*dev;
 	int			mirror_num;
+	bool			message_printed;
 };
 
 static void release_scrub_stripe(struct scrub_stripe *stripe)
@@ -426,26 +427,29 @@  static int scrub_print_warning_inode(u64 inum, u64 offset, u64 num_bytes,
 	 * we deliberately ignore the bit ipath might have been too small to
 	 * hold all of the paths here
 	 */
-	for (i = 0; i < ipath->fspath->elem_cnt; ++i)
-		btrfs_warn_in_rcu(fs_info,
+	for (i = 0; i < ipath->fspath->elem_cnt; ++i) {
+		btrfs_warn_rl_in_rcu(fs_info,
 "%s at inode %lld/%llu(%s) fileoff %llu, logical %llu(%u) physical %llu(%s)%llu",
 				  swarn->errstr, root, inum,
 				  (char *)ipath->fspath->val[i], offset,
 				  swarn->logical, swarn->mirror_num,
 				  swarn->dev->devid, btrfs_dev_name(swarn->dev),
 				  swarn->physical);
+		swarn->message_printed = true;
+	}
 
 	btrfs_put_root(local_root);
 	free_ipath(ipath);
 	return 0;
 
 err:
-	btrfs_warn_in_rcu(fs_info,
+	btrfs_warn_rl_in_rcu(fs_info,
 	"%s at inode %lld/%llu fileoff %llu, logical %llu(%u) physical %llu(%s)%llu",
 			  swarn->errstr, root, inum, offset,
 			  swarn->logical, swarn->mirror_num,
 			  swarn->dev->devid, btrfs_dev_name(swarn->dev),
 			  swarn->physical);
+	swarn->message_printed = true;
 	free_ipath(ipath);
 	return 0;
 }
@@ -472,6 +476,7 @@  static void scrub_print_common_warning(const char *errstr, struct btrfs_device *
 	swarn.errstr = errstr;
 	swarn.dev = NULL;
 	swarn.mirror_num = mirror_num;
+	swarn.message_printed = false;
 
 	ret = extent_from_logical(fs_info, swarn.logical, path, &found_key,
 				  &flags);
@@ -488,26 +493,31 @@  static void scrub_print_common_warning(const char *errstr, struct btrfs_device *
 		unsigned long ptr = 0;
 		u8 ref_level;
 		u64 ref_root;
+		bool message_printed = false;
 
 		while (true) {
 			ret = tree_backref_for_extent(&ptr, eb, &found_key, ei,
 						      item_size, &ref_root,
 						      &ref_level);
-			if (ret < 0) {
-				btrfs_warn(fs_info,
-				"failed to resolve tree backref for logical %llu: %d",
-						  swarn.logical, ret);
+			if (ret < 0)
 				break;
-			}
 			if (ret > 0)
 				break;
-			btrfs_warn_in_rcu(fs_info,
+			btrfs_warn_rl_in_rcu(fs_info,
 "%s at metadata in tree %llu(%u), logical %llu(%u) physical %llu(%s)%llu",
 				errstr, ref_root, ref_level,
 				swarn.logical, swarn.mirror_num,
 				dev->devid, btrfs_dev_name(dev),
 				swarn.physical);
+			message_printed = true;
 		}
+		if (!message_printed)
+			btrfs_warn_rl_in_rcu(fs_info,
+"%s at metadata, logical %llu(%u) physical %llu(%s)%llu",
+				errstr, swarn.logical, swarn.mirror_num,
+				dev->devid, btrfs_dev_name(dev),
+				swarn.physical);
+
 		btrfs_release_path(path);
 	} else {
 		struct btrfs_backref_walk_ctx ctx = { 0 };
@@ -522,8 +532,14 @@  static void scrub_print_common_warning(const char *errstr, struct btrfs_device *
 		swarn.dev = dev;
 
 		iterate_extent_inodes(&ctx, true, scrub_print_warning_inode, &swarn);
+		if (!swarn.message_printed) {
+			btrfs_warn_rl_in_rcu(fs_info,
+"%s at data, unresolved path name, logical %llu(%u) physical %llu(%s)%llu",
+				errstr, swarn.logical, swarn.mirror_num,
+				dev->devid, btrfs_dev_name(dev),
+				swarn.physical);
+		}
 	}
-
 out:
 	btrfs_free_path(path);
 }
@@ -889,12 +905,6 @@  static void scrub_stripe_report_errors(struct scrub_ctx *sctx,
 		}
 
 		/* The remaining are all for unrepaired. */
-		btrfs_err_rl_in_rcu(fs_info,
-	"unable to fixup (regular) error at logical %llu(%u) physical %llu(%s)%llu",
-					    logical, stripe->mirror_num,
-					    dev->devid, btrfs_dev_name(dev),
-					    physical);
-
 		if (test_bit(sector_nr, &stripe->io_error_bitmap))
 			scrub_print_common_warning("i/o error", dev,
 						     logical, physical,

[7/7] btrfs: scrub: fix the frequency of error messages

Commit Message

Comments

Patch