diff mbox series

btrfs: scrub: also report errors hit during the initial read

Message ID 40578035047b80505e2adf02978c4c293abb44c9.1686035069.git.wqu@suse.com (mailing list archive)
State New, archived
Headers show
Series btrfs: scrub: also report errors hit during the initial read | expand

Commit Message

Qu Wenruo June 6, 2023, 7:08 a.m. UTC
[BUG]
After the recent scrub rework introduced in commit e02ee89baa66 ("btrfs:
scrub: switch scrub_simple_mirror() to scrub_stripe infrastructure"),
btrfs scrub no longer reports repaired errors any more:

 # mkfs.btrfs -f $dev -d DUP
 # mount $dev $mnt
 # xfs_io -f -d -c "pwrite -b 64K -S 0xaa 0 64" $mnt/file
 # umount $dev
 # xfs_io -f -c "pwrite -S 0xff $phy1 64K" $dev # Corrupt the first mirror
 # mount $dev $mnt
 # btrfs scrub start -BR $mnt
 scrub done for 725e7cb7-8a4a-4c77-9f2a-86943619e218
 Scrub started:    Tue Jun  6 14:56:50 2023
 Status:           finished
 Duration:         0:00:00
 	data_extents_scrubbed: 2
 	tree_extents_scrubbed: 18
 	data_bytes_scrubbed: 131072
 	tree_bytes_scrubbed: 294912
 	read_errors: 0
 	csum_errors: 0 <<< No errors here
 	verify_errors: 0
	[...]
 	uncorrectable_errors: 0
 	unverified_errors: 0
 	corrected_errors: 16 <<< Only corrected errors
 	last_physical: 2723151872

This can confuse btrfs-progs, as it relies on the csum_errors to
determine if there is anything wrong.

While on v6.3.x kernels, the report is different:

 	csum_errors: 16 <<<
 	verify_errors: 0
	[...]
 	uncorrectable_errors: 0
 	unverified_errors: 0
 	corrected_errors: 16 <<<

[CAUSE]
In the reworked scrub, we update the scrub progress inside
scrub_stripe_report_errors(), using various bitmaps to update the
result.

For example for csum_errors, we use bitmap_weight() of
stripe->csum_error_bitmap.

Unfortunately at that stage, all error bitmaps (except
init_error_bitmap) are the result of the latest repair attempt, thus if
the stripe is fully repaired, those error bitmaps will all be empty,
resulting the above output mismatch.

To fix this, record the number of errors into stripe->init_nr_*_errors.
Since we don't really care about where those errors are, we only need to
record the number of errors.

Then in scrub_stripe_report_errors(), use those initial numbers to
update the progress other than using the latest error bitmaps.

Fixes: e02ee89baa66 ("btrfs: scrub: switch scrub_simple_mirror() to scrub_stripe infrastructure")
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/scrub.c | 24 ++++++++++++++++++------
 1 file changed, 18 insertions(+), 6 deletions(-)

Comments

David Sterba June 8, 2023, 12:32 p.m. UTC | #1
On Tue, Jun 06, 2023 at 03:08:28PM +0800, Qu Wenruo wrote:
> [BUG]
> After the recent scrub rework introduced in commit e02ee89baa66 ("btrfs:
> scrub: switch scrub_simple_mirror() to scrub_stripe infrastructure"),
> btrfs scrub no longer reports repaired errors any more:
> 
>  # mkfs.btrfs -f $dev -d DUP
>  # mount $dev $mnt
>  # xfs_io -f -d -c "pwrite -b 64K -S 0xaa 0 64" $mnt/file
>  # umount $dev
>  # xfs_io -f -c "pwrite -S 0xff $phy1 64K" $dev # Corrupt the first mirror
>  # mount $dev $mnt
>  # btrfs scrub start -BR $mnt
>  scrub done for 725e7cb7-8a4a-4c77-9f2a-86943619e218
>  Scrub started:    Tue Jun  6 14:56:50 2023
>  Status:           finished
>  Duration:         0:00:00
>  	data_extents_scrubbed: 2
>  	tree_extents_scrubbed: 18
>  	data_bytes_scrubbed: 131072
>  	tree_bytes_scrubbed: 294912
>  	read_errors: 0
>  	csum_errors: 0 <<< No errors here
>  	verify_errors: 0
> 	[...]
>  	uncorrectable_errors: 0
>  	unverified_errors: 0
>  	corrected_errors: 16 <<< Only corrected errors
>  	last_physical: 2723151872
> 
> This can confuse btrfs-progs, as it relies on the csum_errors to
> determine if there is anything wrong.
> 
> While on v6.3.x kernels, the report is different:
> 
>  	csum_errors: 16 <<<
>  	verify_errors: 0
> 	[...]
>  	uncorrectable_errors: 0
>  	unverified_errors: 0
>  	corrected_errors: 16 <<<
> 
> [CAUSE]
> In the reworked scrub, we update the scrub progress inside
> scrub_stripe_report_errors(), using various bitmaps to update the
> result.
> 
> For example for csum_errors, we use bitmap_weight() of
> stripe->csum_error_bitmap.
> 
> Unfortunately at that stage, all error bitmaps (except
> init_error_bitmap) are the result of the latest repair attempt, thus if
> the stripe is fully repaired, those error bitmaps will all be empty,
> resulting the above output mismatch.
> 
> To fix this, record the number of errors into stripe->init_nr_*_errors.
> Since we don't really care about where those errors are, we only need to
> record the number of errors.
> 
> Then in scrub_stripe_report_errors(), use those initial numbers to
> update the progress other than using the latest error bitmaps.
> 
> Fixes: e02ee89baa66 ("btrfs: scrub: switch scrub_simple_mirror() to scrub_stripe infrastructure")
> Signed-off-by: Qu Wenruo <wqu@suse.com>

Added to misc-next, thanks. Also queued with the other scrub patch for 6.4
diff mbox series

Patch

diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index 375c1f8fef4d..f79f611bde74 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -134,8 +134,14 @@  struct scrub_stripe {
 	 * The errors hit during the initial read of the stripe.
 	 *
 	 * Would be utilized for error reporting and repair.
+	 *
+	 * The remaining init_nr_* records the number of errors hit, only
+	 * used by error reporting.
 	 */
 	unsigned long init_error_bitmap;
+	unsigned int init_nr_io_errors;
+	unsigned int init_nr_csum_errors;
+	unsigned int init_nr_meta_errors;
 
 	/*
 	 * The following error bitmaps are all for the current status.
@@ -968,12 +974,9 @@  static void scrub_stripe_report_errors(struct scrub_ctx *sctx,
 	sctx->stat.data_bytes_scrubbed += nr_data_sectors << fs_info->sectorsize_bits;
 	sctx->stat.tree_bytes_scrubbed += nr_meta_sectors << fs_info->sectorsize_bits;
 	sctx->stat.no_csum += nr_nodatacsum_sectors;
-	sctx->stat.read_errors +=
-		bitmap_weight(&stripe->io_error_bitmap, stripe->nr_sectors);
-	sctx->stat.csum_errors +=
-		bitmap_weight(&stripe->csum_error_bitmap, stripe->nr_sectors);
-	sctx->stat.verify_errors +=
-		bitmap_weight(&stripe->meta_error_bitmap, stripe->nr_sectors);
+	sctx->stat.read_errors += stripe->init_nr_io_errors;
+	sctx->stat.csum_errors += stripe->init_nr_csum_errors;
+	sctx->stat.verify_errors += stripe->init_nr_meta_errors;
 	sctx->stat.uncorrectable_errors +=
 		bitmap_weight(&stripe->error_bitmap, stripe->nr_sectors);
 	sctx->stat.corrected_errors += nr_repaired_sectors;
@@ -1006,6 +1009,12 @@  static void scrub_stripe_read_repair_worker(struct work_struct *work)
 	scrub_verify_one_stripe(stripe, stripe->extent_sector_bitmap);
 	/* Save the initial failed bitmap for later repair and report usage. */
 	stripe->init_error_bitmap = stripe->error_bitmap;
+	stripe->init_nr_io_errors = bitmap_weight(&stripe->io_error_bitmap,
+						  stripe->nr_sectors);
+	stripe->init_nr_csum_errors = bitmap_weight(&stripe->csum_error_bitmap,
+						    stripe->nr_sectors);
+	stripe->init_nr_meta_errors = bitmap_weight(&stripe->meta_error_bitmap,
+						    stripe->nr_sectors);
 
 	if (bitmap_empty(&stripe->init_error_bitmap, stripe->nr_sectors))
 		goto out;
@@ -1455,6 +1464,9 @@  static void scrub_stripe_reset_bitmaps(struct scrub_stripe *stripe)
 {
 	stripe->extent_sector_bitmap = 0;
 	stripe->init_error_bitmap = 0;
+	stripe->init_nr_io_errors = 0;
+	stripe->init_nr_csum_errors = 0;
+	stripe->init_nr_meta_errors = 0;
 	stripe->error_bitmap = 0;
 	stripe->io_error_bitmap = 0;
 	stripe->csum_error_bitmap = 0;