Message ID | c46343633e6198f315876179d31b3a66b66a8aa2.1710409033.git.wqu@suse.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | btrfs: scrub: refine the error messages | expand |
On Thu, Mar 14, 2024 at 9:51 AM Qu Wenruo <wqu@suse.com> wrote: > > For btrfs scrub error messages, I have hit a lot of support cases on > older kernels where no filename is outputted at all, with only error > messages like: > > BTRFS error (device dm-0): bdev /dev/mapper/sys-rootlv errs: wr 0, rd 0, flush 0, corrupt 2823, gen 0 > BTRFS error (device dm-0): unable to fixup (regular) error at logical 1563504640 on dev /dev/mapper/sys-rootlv > BTRFS error (device dm-0): bdev /dev/mapper/sys-rootlv errs: wr 0, rd 0, flush 0, corrupt 2824, gen 0 > BTRFS error (device dm-0): unable to fixup (regular) error at logical 1579016192 on dev /dev/mapper/sys-rootlv > BTRFS error (device dm-0): bdev /dev/mapper/sys-rootlv errs: wr 0, rd 0, flush 0, corrupt 2825, gen 0 > > The "unable to fixup (regular) error" line shows we hit an unrepairable > error, then normally we would do data/metadata backref walk to grab the > correct info. > > But we can hit cases like the inode is already orphan (unlinked from any > parent inode), or even the subvolume is orphan (unlinked and waiting to > be deleted). > > In that case we're not sure what's the proper way to continue (Is it > some critical data/metadata? Would it prevent the system from booting?) > > To improve the situation, this patch would: > > - Ensure we at least output one message for each unrepairable error > If by somehow we didn't output any error message for the error, we > always fallback to the basic logical/physical error output, with its > type (data or metadata). > > - Remove the "unable to fixup" error message > Since we have ensured at least one error message is outputted for each > unrepairable corruption, there is no need for that fallback one. > > Now one unrepairable corruption would output a more refined and less > duplicated error message: > > BTRFS info (device dm-2): scrub: started on devid 1 > BTRFS warning (device dm-2): checksum error at inode 5/257(file1) fileoff 16384, logical 13647872(1) physical 1(/dev/mapper/test-scratch1)13647872 > BTRFS info (device dm-2): scrub: finished on devid 1 with status: 0 > > Signed-off-by: Qu Wenruo <wqu@suse.com> > --- > fs/btrfs/scrub.c | 42 ++++++++++++++++++++++++++---------------- > 1 file changed, 26 insertions(+), 16 deletions(-) > > diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c > index 17e492076bf8..84617a64b2d4 100644 > --- a/fs/btrfs/scrub.c > +++ b/fs/btrfs/scrub.c > @@ -227,6 +227,7 @@ struct scrub_warning { > u64 logical; > struct btrfs_device *dev; > int mirror_num; > + bool message_printed; > }; > > static void release_scrub_stripe(struct scrub_stripe *stripe) > @@ -426,26 +427,29 @@ static int scrub_print_warning_inode(u64 inum, u64 offset, u64 num_bytes, > * we deliberately ignore the bit ipath might have been too small to > * hold all of the paths here > */ > - for (i = 0; i < ipath->fspath->elem_cnt; ++i) > - btrfs_warn_in_rcu(fs_info, > + for (i = 0; i < ipath->fspath->elem_cnt; ++i) { > + btrfs_warn_rl_in_rcu(fs_info, > "%s at inode %lld/%llu(%s) fileoff %llu, logical %llu(%u) physical %llu(%s)%llu", > swarn->errstr, root, inum, > (char *)ipath->fspath->val[i], offset, > swarn->logical, swarn->mirror_num, > swarn->dev->devid, btrfs_dev_name(swarn->dev), > swarn->physical); > + swarn->message_printed = true; > + } > > btrfs_put_root(local_root); > free_ipath(ipath); > return 0; > > err: > - btrfs_warn_in_rcu(fs_info, > + btrfs_warn_rl_in_rcu(fs_info, > "%s at inode %lld/%llu fileoff %llu, logical %llu(%u) physical %llu(%s)%llu", > swarn->errstr, root, inum, offset, > swarn->logical, swarn->mirror_num, > swarn->dev->devid, btrfs_dev_name(swarn->dev), > swarn->physical); > + swarn->message_printed = true; > free_ipath(ipath); > return 0; > } > @@ -472,6 +476,7 @@ static void scrub_print_common_warning(const char *errstr, struct btrfs_device * > swarn.errstr = errstr; > swarn.dev = NULL; > swarn.mirror_num = mirror_num; > + swarn.message_printed = false; > > ret = extent_from_logical(fs_info, swarn.logical, path, &found_key, > &flags); > @@ -488,26 +493,31 @@ static void scrub_print_common_warning(const char *errstr, struct btrfs_device * > unsigned long ptr = 0; > u8 ref_level; > u64 ref_root; > + bool message_printed = false; > > while (true) { > ret = tree_backref_for_extent(&ptr, eb, &found_key, ei, > item_size, &ref_root, > &ref_level); > - if (ret < 0) { > - btrfs_warn(fs_info, > - "failed to resolve tree backref for logical %llu: %d", > - swarn.logical, ret); > + if (ret < 0) > break; > - } > if (ret > 0) > break; > - btrfs_warn_in_rcu(fs_info, > + btrfs_warn_rl_in_rcu(fs_info, > "%s at metadata in tree %llu(%u), logical %llu(%u) physical %llu(%s)%llu", > errstr, ref_root, ref_level, > swarn.logical, swarn.mirror_num, > dev->devid, btrfs_dev_name(dev), > swarn.physical); > + message_printed = true; > } > + if (!message_printed) > + btrfs_warn_rl_in_rcu(fs_info, > +"%s at metadata, logical %llu(%u) physical %llu(%s)%llu", > + errstr, swarn.logical, swarn.mirror_num, > + dev->devid, btrfs_dev_name(dev), > + swarn.physical); > + > btrfs_release_path(path); > } else { > struct btrfs_backref_walk_ctx ctx = { 0 }; > @@ -522,8 +532,14 @@ static void scrub_print_common_warning(const char *errstr, struct btrfs_device * > swarn.dev = dev; > > iterate_extent_inodes(&ctx, true, scrub_print_warning_inode, &swarn); > + if (!swarn.message_printed) { > + btrfs_warn_rl_in_rcu(fs_info, > +"%s at data, unresolved path name, logical %llu(%u) physical %llu(%s)%llu", > + errstr, swarn.logical, swarn.mirror_num, > + dev->devid, btrfs_dev_name(dev), > + swarn.physical); > + } No need for the curly braces here, it's a single statement, so we follow the style of dropping them. > } > - > out: > btrfs_free_path(path); > } > @@ -889,12 +905,6 @@ static void scrub_stripe_report_errors(struct scrub_ctx *sctx, > } > > /* The remaining are all for unrepaired. */ > - btrfs_err_rl_in_rcu(fs_info, > - "unable to fixup (regular) error at logical %llu(%u) physical %llu(%s)%llu", > - logical, stripe->mirror_num, > - dev->devid, btrfs_dev_name(dev), > - physical); > - This hunk seems unintentional? Thanks. > if (test_bit(sector_nr, &stripe->io_error_bitmap)) > scrub_print_common_warning("i/o error", dev, > logical, physical, > -- > 2.44.0 > >
在 2024/3/15 04:21, Filipe Manana 写道: [...] >> + } > > No need for the curly braces here, it's a single statement, so we > follow the style of dropping them. > > >> } >> - >> out: >> btrfs_free_path(path); >> } >> @@ -889,12 +905,6 @@ static void scrub_stripe_report_errors(struct scrub_ctx *sctx, >> } >> >> /* The remaining are all for unrepaired. */ >> - btrfs_err_rl_in_rcu(fs_info, >> - "unable to fixup (regular) error at logical %llu(%u) physical %llu(%s)%llu", >> - logical, stripe->mirror_num, >> - dev->devid, btrfs_dev_name(dev), >> - physical); >> - > > This hunk seems unintentional? This is intentional: - Remove the "unable to fixup" error message Since we have ensured at least one error message is outputted for each unrepairable corruption, there is no need for that fallback one. Since we have ensured we would at least output one line for each unrepaired corruption (before rate limit though), that line is duplicated. Thanks, Qu > > Thanks. > >> if (test_bit(sector_nr, &stripe->io_error_bitmap)) >> scrub_print_common_warning("i/o error", dev, >> logical, physical, >> -- >> 2.44.0 >> >> >
diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c index 17e492076bf8..84617a64b2d4 100644 --- a/fs/btrfs/scrub.c +++ b/fs/btrfs/scrub.c @@ -227,6 +227,7 @@ struct scrub_warning { u64 logical; struct btrfs_device *dev; int mirror_num; + bool message_printed; }; static void release_scrub_stripe(struct scrub_stripe *stripe) @@ -426,26 +427,29 @@ static int scrub_print_warning_inode(u64 inum, u64 offset, u64 num_bytes, * we deliberately ignore the bit ipath might have been too small to * hold all of the paths here */ - for (i = 0; i < ipath->fspath->elem_cnt; ++i) - btrfs_warn_in_rcu(fs_info, + for (i = 0; i < ipath->fspath->elem_cnt; ++i) { + btrfs_warn_rl_in_rcu(fs_info, "%s at inode %lld/%llu(%s) fileoff %llu, logical %llu(%u) physical %llu(%s)%llu", swarn->errstr, root, inum, (char *)ipath->fspath->val[i], offset, swarn->logical, swarn->mirror_num, swarn->dev->devid, btrfs_dev_name(swarn->dev), swarn->physical); + swarn->message_printed = true; + } btrfs_put_root(local_root); free_ipath(ipath); return 0; err: - btrfs_warn_in_rcu(fs_info, + btrfs_warn_rl_in_rcu(fs_info, "%s at inode %lld/%llu fileoff %llu, logical %llu(%u) physical %llu(%s)%llu", swarn->errstr, root, inum, offset, swarn->logical, swarn->mirror_num, swarn->dev->devid, btrfs_dev_name(swarn->dev), swarn->physical); + swarn->message_printed = true; free_ipath(ipath); return 0; } @@ -472,6 +476,7 @@ static void scrub_print_common_warning(const char *errstr, struct btrfs_device * swarn.errstr = errstr; swarn.dev = NULL; swarn.mirror_num = mirror_num; + swarn.message_printed = false; ret = extent_from_logical(fs_info, swarn.logical, path, &found_key, &flags); @@ -488,26 +493,31 @@ static void scrub_print_common_warning(const char *errstr, struct btrfs_device * unsigned long ptr = 0; u8 ref_level; u64 ref_root; + bool message_printed = false; while (true) { ret = tree_backref_for_extent(&ptr, eb, &found_key, ei, item_size, &ref_root, &ref_level); - if (ret < 0) { - btrfs_warn(fs_info, - "failed to resolve tree backref for logical %llu: %d", - swarn.logical, ret); + if (ret < 0) break; - } if (ret > 0) break; - btrfs_warn_in_rcu(fs_info, + btrfs_warn_rl_in_rcu(fs_info, "%s at metadata in tree %llu(%u), logical %llu(%u) physical %llu(%s)%llu", errstr, ref_root, ref_level, swarn.logical, swarn.mirror_num, dev->devid, btrfs_dev_name(dev), swarn.physical); + message_printed = true; } + if (!message_printed) + btrfs_warn_rl_in_rcu(fs_info, +"%s at metadata, logical %llu(%u) physical %llu(%s)%llu", + errstr, swarn.logical, swarn.mirror_num, + dev->devid, btrfs_dev_name(dev), + swarn.physical); + btrfs_release_path(path); } else { struct btrfs_backref_walk_ctx ctx = { 0 }; @@ -522,8 +532,14 @@ static void scrub_print_common_warning(const char *errstr, struct btrfs_device * swarn.dev = dev; iterate_extent_inodes(&ctx, true, scrub_print_warning_inode, &swarn); + if (!swarn.message_printed) { + btrfs_warn_rl_in_rcu(fs_info, +"%s at data, unresolved path name, logical %llu(%u) physical %llu(%s)%llu", + errstr, swarn.logical, swarn.mirror_num, + dev->devid, btrfs_dev_name(dev), + swarn.physical); + } } - out: btrfs_free_path(path); } @@ -889,12 +905,6 @@ static void scrub_stripe_report_errors(struct scrub_ctx *sctx, } /* The remaining are all for unrepaired. */ - btrfs_err_rl_in_rcu(fs_info, - "unable to fixup (regular) error at logical %llu(%u) physical %llu(%s)%llu", - logical, stripe->mirror_num, - dev->devid, btrfs_dev_name(dev), - physical); - if (test_bit(sector_nr, &stripe->io_error_bitmap)) scrub_print_common_warning("i/o error", dev, logical, physical,
For btrfs scrub error messages, I have hit a lot of support cases on older kernels where no filename is outputted at all, with only error messages like: BTRFS error (device dm-0): bdev /dev/mapper/sys-rootlv errs: wr 0, rd 0, flush 0, corrupt 2823, gen 0 BTRFS error (device dm-0): unable to fixup (regular) error at logical 1563504640 on dev /dev/mapper/sys-rootlv BTRFS error (device dm-0): bdev /dev/mapper/sys-rootlv errs: wr 0, rd 0, flush 0, corrupt 2824, gen 0 BTRFS error (device dm-0): unable to fixup (regular) error at logical 1579016192 on dev /dev/mapper/sys-rootlv BTRFS error (device dm-0): bdev /dev/mapper/sys-rootlv errs: wr 0, rd 0, flush 0, corrupt 2825, gen 0 The "unable to fixup (regular) error" line shows we hit an unrepairable error, then normally we would do data/metadata backref walk to grab the correct info. But we can hit cases like the inode is already orphan (unlinked from any parent inode), or even the subvolume is orphan (unlinked and waiting to be deleted). In that case we're not sure what's the proper way to continue (Is it some critical data/metadata? Would it prevent the system from booting?) To improve the situation, this patch would: - Ensure we at least output one message for each unrepairable error If by somehow we didn't output any error message for the error, we always fallback to the basic logical/physical error output, with its type (data or metadata). - Remove the "unable to fixup" error message Since we have ensured at least one error message is outputted for each unrepairable corruption, there is no need for that fallback one. Now one unrepairable corruption would output a more refined and less duplicated error message: BTRFS info (device dm-2): scrub: started on devid 1 BTRFS warning (device dm-2): checksum error at inode 5/257(file1) fileoff 16384, logical 13647872(1) physical 1(/dev/mapper/test-scratch1)13647872 BTRFS info (device dm-2): scrub: finished on devid 1 with status: 0 Signed-off-by: Qu Wenruo <wqu@suse.com> --- fs/btrfs/scrub.c | 42 ++++++++++++++++++++++++++---------------- 1 file changed, 26 insertions(+), 16 deletions(-)