[0/6] btrfs: scrub: refine the error messages

Message ID	cover.1710409033.git.wqu@suse.com (mailing list archive)
Headers	show Received: from smtp-out1.suse.de (smtp-out1.suse.de [195.135.223.130]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 955D36CDCF for <linux-btrfs@vger.kernel.org>; Thu, 14 Mar 2024 09:50:26 +0000 (UTC) From: Qu Wenruo <wqu@suse.com> To: linux-btrfs@vger.kernel.org Subject: [PATCH 0/6] btrfs: scrub: refine the error messages Date: Thu, 14 Mar 2024 20:20:13 +1030 Message-ID: <cover.1710409033.git.wqu@suse.com> Precedence: bulk MIME-Version: 1.0 Content-Transfer-Encoding: 8bit
Series	btrfs: scrub: refine the error messages \| expand [0/6] btrfs: scrub: refine the error messages [1/7] btrfs: scrub: fix incorrectly reported logical/physical address [2/7] btrfs: reduce the log level for btrfs_dev_stat_inc_and_print() [3/7] btrfs: scrub: remove unused is_super parameter from scrub_print_common_warning() [4/7] btrfs: scrub: remove unnecessary dev/physical lookup for scrub_stripe_report_errors() [5/7] btrfs: scrub: simplify the inode iteration output [6/7] btrfs: scrub: unify and shorten the error message [7/7] btrfs: scrub: fix the frequency of error messages

Message ID

cover.1710409033.git.wqu@suse.com (mailing list archive)

Headers

From: Qu Wenruo <wqu@suse.com>
To: linux-btrfs@vger.kernel.org
Subject: [PATCH 0/6] btrfs: scrub: refine the error messages
Date: Thu, 14 Mar 2024 20:20:13 +1030
Message-ID: <cover.1710409033.git.wqu@suse.com>
Precedence: bulk
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit

Series

btrfs: scrub: refine the error messages | expand

Message

Qu Wenruo March 14, 2024, 9:50 a.m. UTC

During some support sessions, I found older kernels are always report
very unuseful scrub error messages like:

 BTRFS error (device dm-0): bdev /dev/mapper/sys-rootlv errs: wr 0, rd 0, flush 0, corrupt 2823, gen 0
 BTRFS error (device dm-0): unable to fixup (regular) error at logical 1563504640 on dev /dev/mapper/sys-rootlv
 BTRFS error (device dm-0): bdev /dev/mapper/sys-rootlv errs: wr 0, rd 0, flush 0, corrupt 2824, gen 0
 BTRFS error (device dm-0): unable to fixup (regular) error at logical 1579016192 on dev /dev/mapper/sys-rootlv

There are two problems:
- No proper details about the corruption
  No metadata or data indication, nor the filename or the tree id.
  Even the involved kernel (and newer kernels after the scrub rework)
  has the ability to do backref walk and find out the file path or the
  tree backref.

  My guess is, for data sometimes the corrupted sector is no longer
  referred by any data extent.

- Too noisy and useless error message from
  btrfs_dev_stat_inc_and_print()
  I'd argue we should not output any error message just for
  btrfs_dev_stat_inc_and_print().

After the series, the error message would look like this:

 BTRFS warning (device dm-2): checksum error at inode 5/257(file1) fileoff 16384, logical 13647872(1) physical 1(/dev/mapper/test-scratch1)13647872

This involves the following enhancement:

- Single line
  And we ensure we output at least one line for the error we hit.
  No more unrelated btrfs_dev_stat_inc_and_print() output.

- Proper fall backs
  We have the following different fall back options
  * Repaired
    Just a line of something repaired for logical/physical address.

  * Detailed data info
    Including the following elements (if possible), and if higher
    priority ones can not be fetched, it would be skipped and try
    lower priority items:
    + file path (can have multiple ones)
    + root/inode number and offset
    + logical/physical address (always output)

  * Detaile metadata info
    The priority list is:
    + root ref/level
    + logical/physical address (always output)

  For the worst case of data corruption, we would still have some like:

   BTRFS warning (device dm-2): checksum error at data, logical 13647872(1) physical 1(/dev/mapper/test-scratch1)13647872

  And similar ones for metadata:

   BTRFS warning (device dm-2): checksum error at meta, logical 13647872(1) physical 1(/dev/mapper/test-scratch1)13647872

The first patch is fixing a regression in the error message, which leads
to bad logical/physical bytenr.
The second one would reduce the log level for
btrfs_dev_stat_inc_and_print().

Path 3~4 are cleanups to remove unnecessary parameters.

The remaining reworks the format and refine the error message frequency.

Qu Wenruo (7):
  btrfs: scrub: fix incorrectly reported logical/physical address
  btrfs: reduce the log level for btrfs_dev_stat_inc_and_print()
  btrfs: scrub: remove unused is_super parameter from
    scrub_print_common_warning()
  btrfs: scrub: remove unnecessary dev/physical lookup for
    scrub_stripe_report_errors()
  btrfs: scrub: simplify the inode iteration output
  btrfs: scrub: unify and shorten the error message
  btrfs: scrub: fix the frequency of error messages

 fs/btrfs/scrub.c   | 185 ++++++++++++++++-----------------------------
 fs/btrfs/volumes.c |   2 +-
 2 files changed, 66 insertions(+), 121 deletions(-)

Comments

Boris Burkov March 14, 2024, 5:35 p.m. UTC | #1

On Thu, Mar 14, 2024 at 08:20:13PM +1030, Qu Wenruo wrote:
> During some support sessions, I found older kernels are always report
> very unuseful scrub error messages like:
> 
>  BTRFS error (device dm-0): bdev /dev/mapper/sys-rootlv errs: wr 0, rd 0, flush 0, corrupt 2823, gen 0
>  BTRFS error (device dm-0): unable to fixup (regular) error at logical 1563504640 on dev /dev/mapper/sys-rootlv
>  BTRFS error (device dm-0): bdev /dev/mapper/sys-rootlv errs: wr 0, rd 0, flush 0, corrupt 2824, gen 0
>  BTRFS error (device dm-0): unable to fixup (regular) error at logical 1579016192 on dev /dev/mapper/sys-rootlv
> 
> There are two problems:
> - No proper details about the corruption
>   No metadata or data indication, nor the filename or the tree id.
>   Even the involved kernel (and newer kernels after the scrub rework)
>   has the ability to do backref walk and find out the file path or the
>   tree backref.
> 
>   My guess is, for data sometimes the corrupted sector is no longer
>   referred by any data extent.
> 
> - Too noisy and useless error message from
>   btrfs_dev_stat_inc_and_print()
>   I'd argue we should not output any error message just for
>   btrfs_dev_stat_inc_and_print().
> 
> After the series, the error message would look like this:
> 
>  BTRFS warning (device dm-2): checksum error at inode 5/257(file1) fileoff 16384, logical 13647872(1) physical 1(/dev/mapper/test-scratch1)13647872

Stoked on this series, thanks for doing it!

qq: would it be helpful to also include the actual/expected csum? I
think it's particularly helpful when one or the other is either zeros or
the checksum of the zero block.

> 
> This involves the following enhancement:
> 
> - Single line
>   And we ensure we output at least one line for the error we hit.
>   No more unrelated btrfs_dev_stat_inc_and_print() output.
> 
> - Proper fall backs
>   We have the following different fall back options
>   * Repaired
>     Just a line of something repaired for logical/physical address.
> 
>   * Detailed data info
>     Including the following elements (if possible), and if higher
>     priority ones can not be fetched, it would be skipped and try
>     lower priority items:
>     + file path (can have multiple ones)
>     + root/inode number and offset
>     + logical/physical address (always output)
> 
>   * Detaile metadata info
>     The priority list is:
>     + root ref/level
>     + logical/physical address (always output)
> 
>   For the worst case of data corruption, we would still have some like:
> 
>    BTRFS warning (device dm-2): checksum error at data, logical 13647872(1) physical 1(/dev/mapper/test-scratch1)13647872
> 
>   And similar ones for metadata:
> 
>    BTRFS warning (device dm-2): checksum error at meta, logical 13647872(1) physical 1(/dev/mapper/test-scratch1)13647872
> 
> The first patch is fixing a regression in the error message, which leads
> to bad logical/physical bytenr.
> The second one would reduce the log level for
> btrfs_dev_stat_inc_and_print().
> 
> Path 3~4 are cleanups to remove unnecessary parameters.
> 
> The remaining reworks the format and refine the error message frequency.
> 
> Qu Wenruo (7):
>   btrfs: scrub: fix incorrectly reported logical/physical address
>   btrfs: reduce the log level for btrfs_dev_stat_inc_and_print()
>   btrfs: scrub: remove unused is_super parameter from
>     scrub_print_common_warning()
>   btrfs: scrub: remove unnecessary dev/physical lookup for
>     scrub_stripe_report_errors()
>   btrfs: scrub: simplify the inode iteration output
>   btrfs: scrub: unify and shorten the error message
>   btrfs: scrub: fix the frequency of error messages
> 
>  fs/btrfs/scrub.c   | 185 ++++++++++++++++-----------------------------
>  fs/btrfs/volumes.c |   2 +-
>  2 files changed, 66 insertions(+), 121 deletions(-)
> 
> -- 
> 2.44.0
>

Qu Wenruo March 14, 2024, 8:30 p.m. UTC | #2

在 2024/3/15 04:05, Boris Burkov 写道:
> On Thu, Mar 14, 2024 at 08:20:13PM +1030, Qu Wenruo wrote:
>> During some support sessions, I found older kernels are always report
>> very unuseful scrub error messages like:
>>
>>   BTRFS error (device dm-0): bdev /dev/mapper/sys-rootlv errs: wr 0, rd 0, flush 0, corrupt 2823, gen 0
>>   BTRFS error (device dm-0): unable to fixup (regular) error at logical 1563504640 on dev /dev/mapper/sys-rootlv
>>   BTRFS error (device dm-0): bdev /dev/mapper/sys-rootlv errs: wr 0, rd 0, flush 0, corrupt 2824, gen 0
>>   BTRFS error (device dm-0): unable to fixup (regular) error at logical 1579016192 on dev /dev/mapper/sys-rootlv
>>
>> There are two problems:
>> - No proper details about the corruption
>>    No metadata or data indication, nor the filename or the tree id.
>>    Even the involved kernel (and newer kernels after the scrub rework)
>>    has the ability to do backref walk and find out the file path or the
>>    tree backref.
>>
>>    My guess is, for data sometimes the corrupted sector is no longer
>>    referred by any data extent.
>>
>> - Too noisy and useless error message from
>>    btrfs_dev_stat_inc_and_print()
>>    I'd argue we should not output any error message just for
>>    btrfs_dev_stat_inc_and_print().
>>
>> After the series, the error message would look like this:
>>
>>   BTRFS warning (device dm-2): checksum error at inode 5/257(file1) fileoff 16384, logical 13647872(1) physical 1(/dev/mapper/test-scratch1)13647872
>
> Stoked on this series, thanks for doing it!
>
> qq: would it be helpful to also include the actual/expected csum? I
> think it's particularly helpful when one or the other is either zeros or
> the checksum of the zero block.

It's a little too long to add (especially considering SHA256).
And even with CRC32C, I have difficulty to expose any all zero/one checksum.
(Maybe I did too few support recently?)

Thanks,
Qu

>
>>
>> This involves the following enhancement:
>>
>> - Single line
>>    And we ensure we output at least one line for the error we hit.
>>    No more unrelated btrfs_dev_stat_inc_and_print() output.
>>
>> - Proper fall backs
>>    We have the following different fall back options
>>    * Repaired
>>      Just a line of something repaired for logical/physical address.
>>
>>    * Detailed data info
>>      Including the following elements (if possible), and if higher
>>      priority ones can not be fetched, it would be skipped and try
>>      lower priority items:
>>      + file path (can have multiple ones)
>>      + root/inode number and offset
>>      + logical/physical address (always output)
>>
>>    * Detaile metadata info
>>      The priority list is:
>>      + root ref/level
>>      + logical/physical address (always output)
>>
>>    For the worst case of data corruption, we would still have some like:
>>
>>     BTRFS warning (device dm-2): checksum error at data, logical 13647872(1) physical 1(/dev/mapper/test-scratch1)13647872
>>
>>    And similar ones for metadata:
>>
>>     BTRFS warning (device dm-2): checksum error at meta, logical 13647872(1) physical 1(/dev/mapper/test-scratch1)13647872
>>
>> The first patch is fixing a regression in the error message, which leads
>> to bad logical/physical bytenr.
>> The second one would reduce the log level for
>> btrfs_dev_stat_inc_and_print().
>>
>> Path 3~4 are cleanups to remove unnecessary parameters.
>>
>> The remaining reworks the format and refine the error message frequency.
>>
>> Qu Wenruo (7):
>>    btrfs: scrub: fix incorrectly reported logical/physical address
>>    btrfs: reduce the log level for btrfs_dev_stat_inc_and_print()
>>    btrfs: scrub: remove unused is_super parameter from
>>      scrub_print_common_warning()
>>    btrfs: scrub: remove unnecessary dev/physical lookup for
>>      scrub_stripe_report_errors()
>>    btrfs: scrub: simplify the inode iteration output
>>    btrfs: scrub: unify and shorten the error message
>>    btrfs: scrub: fix the frequency of error messages
>>
>>   fs/btrfs/scrub.c   | 185 ++++++++++++++++-----------------------------
>>   fs/btrfs/volumes.c |   2 +-
>>   2 files changed, 66 insertions(+), 121 deletions(-)
>>
>> --
>> 2.44.0
>>
>