btrfs: Do mandatory tree block check before submitting bio
diff mbox series

Message ID 20190116115308.8687-1-wqu@suse.com
State New
Headers show
Series
  • btrfs: Do mandatory tree block check before submitting bio
Related show

Commit Message

Qu Wenruo Jan. 16, 2019, 11:53 a.m. UTC
There are at least 2 reports about memory bit flip sneaking into on-disk
data.

Currently we only have a relaxed check triggered at
btrfs_mark_buffer_dirty() time, as it's not mandatory, only for
CONFIG_BTRFS_FS_CHECK_INTEGRITY enabled build.

This patch will address the hole by triggering comprehensive check on
tree blocks before writing it back to disk.

The timing is set to csum_tree_block() where @verify == 0.
At that timing, we're generation csum for tree blocks before submitting
the metadata bio, so we could avoid all the unnecessary calls at
btrfs_mark_buffer_dirty(), but still catch enough error.

Reported-by: Leonard Lausen <leonard@lausen.nl>
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/disk-io.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

Comments

Hugo Mills Jan. 16, 2019, 12:01 p.m. UTC | #1
Hi, Qu,

On Wed, Jan 16, 2019 at 07:53:08PM +0800, Qu Wenruo wrote:
> There are at least 2 reports about memory bit flip sneaking into on-disk
> data.
> 
> Currently we only have a relaxed check triggered at
> btrfs_mark_buffer_dirty() time, as it's not mandatory, only for
> CONFIG_BTRFS_FS_CHECK_INTEGRITY enabled build.
> 
> This patch will address the hole by triggering comprehensive check on
> tree blocks before writing it back to disk.
> 
> The timing is set to csum_tree_block() where @verify == 0.
> At that timing, we're generation csum for tree blocks before submitting
> the metadata bio, so we could avoid all the unnecessary calls at
> btrfs_mark_buffer_dirty(), but still catch enough error.

   I agree wholeheartedly with the idea of this change. Just one
question:

   How does this get reported to the user/administrator? As I
understand it, a detectably corrupt metadata page will generate an I/O
error from the filesystem before it's written to disk? How will this
show up in kernel logs? Is it distinguishable in any way from a
similar error that was generated on reading such a corrupt metadata
node from the disk?

   Basically, I want to be able to distinguish this case (error
detected when writing) from the existing case (error detected when
reading) when someone shows up on IRC with a "broken filesystem".

   Hugo.

> Reported-by: Leonard Lausen <leonard@lausen.nl>
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
>  fs/btrfs/disk-io.c | 10 ++++++++++
>  1 file changed, 10 insertions(+)
> 
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index 8da2f380d3c0..45bf6be9e751 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -313,6 +313,16 @@ static int csum_tree_block(struct btrfs_fs_info *fs_info,
>  			return -EUCLEAN;
>  		}
>  	} else {
> +		/*
> +		 * Here we're calculating csum before writing it to disk,
> +		 * do comprenhensive check here to catch memory corruption
> +		 */
> +		if (btrfs_header_level(buf))
> +			err = btrfs_check_node(fs_info, buf);
> +		else
> +			err = btrfs_check_leaf_full(fs_info, buf);
> +		if (err < 0)
> +			return err;
>  		write_extent_buffer(buf, result, 0, csum_size);
>  	}
>
Qu Wenruo Jan. 16, 2019, 12:26 p.m. UTC | #2
On 2019/1/16 下午8:01, Hugo Mills wrote:
>    Hi, Qu,
> 
> On Wed, Jan 16, 2019 at 07:53:08PM +0800, Qu Wenruo wrote:
>> There are at least 2 reports about memory bit flip sneaking into on-disk
>> data.
>>
>> Currently we only have a relaxed check triggered at
>> btrfs_mark_buffer_dirty() time, as it's not mandatory, only for
>> CONFIG_BTRFS_FS_CHECK_INTEGRITY enabled build.
>>
>> This patch will address the hole by triggering comprehensive check on
>> tree blocks before writing it back to disk.
>>
>> The timing is set to csum_tree_block() where @verify == 0.
>> At that timing, we're generation csum for tree blocks before submitting
>> the metadata bio, so we could avoid all the unnecessary calls at
>> btrfs_mark_buffer_dirty(), but still catch enough error.
> 
>    I agree wholeheartedly with the idea of this change. Just one
> question:
> 
>    How does this get reported to the user/administrator? As I
> understand it, a detectably corrupt metadata page will generate an I/O
> error from the filesystem before it's written to disk? How will this
> show up in kernel logs?

Well, you caught me.

I haven't try the error case, and in fact if it fails, it fails by
triggering kernel BUG_ON(), thus you may not have a chance to see the
error message from btrfs module.

> Is it distinguishable in any way from a
> similar error that was generated on reading such a corrupt metadata
> node from the disk?

Kind of distinguishable for this patch, when you hit kernel BUG_ON() at
fs/btrfs/extent_io.c:4016 then it's definitely from this patch. :P

> 
>    Basically, I want to be able to distinguish this case (error
> detected when writing) from the existing case (error detected when
> reading) when someone shows up on IRC with a "broken filesystem".

Definitely, I'll address this and the BUG_ON() in next version.

Thanks,
Qu

> 
>    Hugo.
> 
>> Reported-by: Leonard Lausen <leonard@lausen.nl>
>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>> ---
>>  fs/btrfs/disk-io.c | 10 ++++++++++
>>  1 file changed, 10 insertions(+)
>>
>> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
>> index 8da2f380d3c0..45bf6be9e751 100644
>> --- a/fs/btrfs/disk-io.c
>> +++ b/fs/btrfs/disk-io.c
>> @@ -313,6 +313,16 @@ static int csum_tree_block(struct btrfs_fs_info *fs_info,
>>  			return -EUCLEAN;
>>  		}
>>  	} else {
>> +		/*
>> +		 * Here we're calculating csum before writing it to disk,
>> +		 * do comprenhensive check here to catch memory corruption
>> +		 */
>> +		if (btrfs_header_level(buf))
>> +			err = btrfs_check_node(fs_info, buf);
>> +		else
>> +			err = btrfs_check_leaf_full(fs_info, buf);
>> +		if (err < 0)
>> +			return err;
>>  		write_extent_buffer(buf, result, 0, csum_size);
>>  	}
>>  
>
Hugo Mills Jan. 16, 2019, 1:07 p.m. UTC | #3
On Wed, Jan 16, 2019 at 08:26:35PM +0800, Qu Wenruo wrote:
> 
> 
> On 2019/1/16 下午8:01, Hugo Mills wrote:
> >    Hi, Qu,
> > 
> > On Wed, Jan 16, 2019 at 07:53:08PM +0800, Qu Wenruo wrote:
> >> There are at least 2 reports about memory bit flip sneaking into on-disk
> >> data.
> >>
> >> Currently we only have a relaxed check triggered at
> >> btrfs_mark_buffer_dirty() time, as it's not mandatory, only for
> >> CONFIG_BTRFS_FS_CHECK_INTEGRITY enabled build.
> >>
> >> This patch will address the hole by triggering comprehensive check on
> >> tree blocks before writing it back to disk.
> >>
> >> The timing is set to csum_tree_block() where @verify == 0.
> >> At that timing, we're generation csum for tree blocks before submitting
> >> the metadata bio, so we could avoid all the unnecessary calls at
> >> btrfs_mark_buffer_dirty(), but still catch enough error.
> > 
> >    I agree wholeheartedly with the idea of this change. Just one
> > question:
> > 
> >    How does this get reported to the user/administrator? As I
> > understand it, a detectably corrupt metadata page will generate an I/O
> > error from the filesystem before it's written to disk? How will this
> > show up in kernel logs?
> 
> Well, you caught me.
> 
> I haven't try the error case, and in fact if it fails, it fails by
> triggering kernel BUG_ON(), thus you may not have a chance to see the
> error message from btrfs module.
> 
> > Is it distinguishable in any way from a
> > similar error that was generated on reading such a corrupt metadata
> > node from the disk?
> 
> Kind of distinguishable for this patch, when you hit kernel BUG_ON() at
> fs/btrfs/extent_io.c:4016 then it's definitely from this patch. :P

   Haha. :)

> >    Basically, I want to be able to distinguish this case (error
> > detected when writing) from the existing case (error detected when
> > reading) when someone shows up on IRC with a "broken filesystem".
> 
> Definitely, I'll address this and the BUG_ON() in next version.

   The error-on-read case gives us a pretty good report on what's
wrong and where. Typically something like "bad key order" or "wrong
item size", plus the logical address of the metadata chunk so we can
dump it with debug-tree -b.

   Having the same messages from this call-site to indicate the kind
of error found, and also something to indicate that it was detected
before hitting disk (i.e. which call-site for the checks triggered the
error) would be, I think, the minimal information needed. If we could
also have the human-readable dump of the full metadata page logged as
well, that would be ideal -- we won't be able to use debug-tree to
diagnose the issue afterwards, as it won't have reached the disk.

   Thanks,
   Hugo.

> Thanks,
> Qu
> 
> > 
> >    Hugo.
> > 
> >> Reported-by: Leonard Lausen <leonard@lausen.nl>
> >> Signed-off-by: Qu Wenruo <wqu@suse.com>
> >> ---
> >>  fs/btrfs/disk-io.c | 10 ++++++++++
> >>  1 file changed, 10 insertions(+)
> >>
> >> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> >> index 8da2f380d3c0..45bf6be9e751 100644
> >> --- a/fs/btrfs/disk-io.c
> >> +++ b/fs/btrfs/disk-io.c
> >> @@ -313,6 +313,16 @@ static int csum_tree_block(struct btrfs_fs_info *fs_info,
> >>  			return -EUCLEAN;
> >>  		}
> >>  	} else {
> >> +		/*
> >> +		 * Here we're calculating csum before writing it to disk,
> >> +		 * do comprenhensive check here to catch memory corruption
> >> +		 */
> >> +		if (btrfs_header_level(buf))
> >> +			err = btrfs_check_node(fs_info, buf);
> >> +		else
> >> +			err = btrfs_check_leaf_full(fs_info, buf);
> >> +		if (err < 0)
> >> +			return err;
> >>  		write_extent_buffer(buf, result, 0, csum_size);
> >>  	}
> >>  
> > 
>

Patch
diff mbox series

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 8da2f380d3c0..45bf6be9e751 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -313,6 +313,16 @@  static int csum_tree_block(struct btrfs_fs_info *fs_info,
 			return -EUCLEAN;
 		}
 	} else {
+		/*
+		 * Here we're calculating csum before writing it to disk,
+		 * do comprenhensive check here to catch memory corruption
+		 */
+		if (btrfs_header_level(buf))
+			err = btrfs_check_node(fs_info, buf);
+		else
+			err = btrfs_check_leaf_full(fs_info, buf);
+		if (err < 0)
+			return err;
 		write_extent_buffer(buf, result, 0, csum_size);
 	}