mbox series

[v5,00/12] btrfs: Enhancement to tree block validation

Message ID 20190215105044.17619-1-wqu@suse.com (mailing list archive)
Headers show
Series btrfs: Enhancement to tree block validation | expand

Message

Qu Wenruo Feb. 15, 2019, 10:50 a.m. UTC
Patchset can be fetched from github:
https://github.com/adam900710/linux/tree/write_time_tree_checker
Which is based on v5.0-rc1 tag.
Also there is no conflict rebasing the patchset to misc-next.

This patchset has the following 3 features:
- Tree block validation output enhancement
  * Output validation failure timing (write time or read time)
  * Always output tree block level/key mismatch error message
    This part is already submitted and reviewed.

- Write time tree block validation check
  To catch memory corruption either from hardware or kernel.
  Example output would be:

    BTRFS critical (device dm-3): corrupt leaf: root=2 block=1350630375424 slot=68, bad key order, prev (10510212874240 169 0) current (1714119868416 169 0)
    BTRFS error (device dm-3): write time tree block corruption detected
    BTRFS: error (device dm-3) in btrfs_commit_transaction:2220: errno=-5 IO failure (Error while writing out transaction)
    BTRFS info (device dm-3): forced readonly
    BTRFS warning (device dm-3): Skipping commit of aborted transaction.
    BTRFS: error (device dm-3) in cleanup_transaction:1839: errno=-5 IO failure
    BTRFS info (device dm-3): delayed_refs has NO entry

- Better error handling before calling flush_write_bio()
  One hidden reason of calling flush_write_bio() under all cases is,
  flush_write_bio() will trigger endio function and endio function of
  epd->bio will free the bio under all cases.
  So we're in fact abusing flush_write_bio() as cleanup.

  Since now flush_write_bio() has its own return value, we shouldn't call
  flush_write_bio() no-brain, here we introduce proper cleanup helper,
  end_write_bio(). Now we call flush_write_bio() like:
              New                 |           Old
  --------------------------------------------------------------
  ret = do_some_evil(&epd);       | ret = do_some_evil(&epd);
  if (ret < 0) {                  | flush_write_bio(&epd);
  	end_write_bio(&epd, ret); | ^^^ submitting half-backed epd->bio?
  	return ret;               | return ret;
  }                               |
  ret = flush_write_bio(&epd);    |
  return ret;                     |

  Above code should be more streamline for the error handling part.

Changelog:
v2:
- Unlock locked pages in lock_extent_buffer_for_io() for error handling.
- Added Reviewed-by tags.

v3:
- Remove duplicated error message.
- Use IS_ENABLED() macro to replace #ifdef.
- Added Reviewed-by tags.

v4:
- Re-organized patch split
  Now each BUG_ON() cleanup has its own patch
- Dig much further into the call sites to eliminate unexpected >0 return
  May be a little paranoid and abuse some ASSERT(), but it should be
  much safer against further code change.
- Fix the false alert caused by balance and memory pressure
  The fix it skip owner checker for non-essential tree at write time.
  Since owner root can't always be reliable, either due to commit root
  created in current transaction or balance + memory pressure.

v5:
- Do proper error-out handling other than relying on flush_write_bio()
  to clean up.
  This has a side effect that no Reviewed-by tags for modified patches.
- New comment for why we don't need to do anything about ebp->bio when
  submit_one_bio() fails.
- Add some Reviewed-by tag.

Qu Wenruo (12):
  btrfs: Always output error message when key/level verification fails
  btrfs: extent_io: Kill the forward declaration of flush_write_bio()
  btrfs: disk-io: Show the timing of corrupted tree block explicitly
  btrfs: extent_io: Move the BUG_ON() in flush_write_bio() one level up
  btrfs: extent_io: Handle error better in extent_write_full_page()
  btrfs: extent_io: Handle error better in btree_write_cache_pages()
  btrfs: extent_io: Kill the dead branch in extent_write_cache_pages()
  btrfs: extent_io: Handle error better in extent_write_locked_range()
  btrfs: extent_io: Kill the BUG_ON() in lock_extent_buffer_for_io()
  btrfs: extent_io: Kill the BUG_ON() in extent_write_cache_pages()
  btrfs: extent_io: Handle error better in extent_writepages()
  btrfs: Do mandatory tree block check before submitting bio

 fs/btrfs/disk-io.c      |  21 +++--
 fs/btrfs/extent_io.c    | 168 ++++++++++++++++++++++++++++------------
 fs/btrfs/tree-checker.c |  24 +++++-
 fs/btrfs/tree-checker.h |   8 ++
 4 files changed, 162 insertions(+), 59 deletions(-)

Comments

Nikolay Borisov Feb. 15, 2019, 1:10 p.m. UTC | #1
On 15.02.19 г. 12:50 ч., Qu Wenruo wrote:
> Patchset can be fetched from github:
> https://github.com/adam900710/linux/tree/write_time_tree_checker
> Which is based on v5.0-rc1 tag.
> Also there is no conflict rebasing the patchset to misc-next.
> 
> This patchset has the following 3 features:
> - Tree block validation output enhancement
>   * Output validation failure timing (write time or read time)
>   * Always output tree block level/key mismatch error message
>     This part is already submitted and reviewed.
> 
> - Write time tree block validation check
>   To catch memory corruption either from hardware or kernel.
>   Example output would be:
> 
>     BTRFS critical (device dm-3): corrupt leaf: root=2 block=1350630375424 slot=68, bad key order, prev (10510212874240 169 0) current (1714119868416 169 0)
>     BTRFS error (device dm-3): write time tree block corruption detected
This is not good.  Those two error messages should be collapsed into
one. Otherwise it's hard to actually match them up. Better output will
be "Corrupt leaf detected during writing: root=..." and eliminate "write
time tree block corruption detected" line. Is that feasible?

>     BTRFS: error (device dm-3) in btrfs_commit_transaction:2220: errno=-5 IO failure (Error while writing out transaction)
>     BTRFS info (device dm-3): forced readonly
>     BTRFS warning (device dm-3): Skipping commit of aborted transaction.
>     BTRFS: error (device dm-3) in cleanup_transaction:1839: errno=-5 IO failure
>     BTRFS info (device dm-3): delayed_refs has NO entry
> 
> - Better error handling before calling flush_write_bio()
>   One hidden reason of calling flush_write_bio() under all cases is,
>   flush_write_bio() will trigger endio function and endio function of
>   epd->bio will free the bio under all cases.
>   So we're in fact abusing flush_write_bio() as cleanup.
> 
>   Since now flush_write_bio() has its own return value, we shouldn't call
>   flush_write_bio() no-brain, here we introduce proper cleanup helper,
>   end_write_bio(). Now we call flush_write_bio() like:
>               New                 |           Old
>   --------------------------------------------------------------
>   ret = do_some_evil(&epd);       | ret = do_some_evil(&epd);
>   if (ret < 0) {                  | flush_write_bio(&epd);
>   	end_write_bio(&epd, ret); | ^^^ submitting half-backed epd->bio?
>   	return ret;               | return ret;
>   }                               |
>   ret = flush_write_bio(&epd);    |
>   return ret;                     |
> 
>   Above code should be more streamline for the error handling part.
> 
> Changelog:
> v2:
> - Unlock locked pages in lock_extent_buffer_for_io() for error handling.
> - Added Reviewed-by tags.
> 
> v3:
> - Remove duplicated error message.
> - Use IS_ENABLED() macro to replace #ifdef.
> - Added Reviewed-by tags.
> 
> v4:
> - Re-organized patch split
>   Now each BUG_ON() cleanup has its own patch
> - Dig much further into the call sites to eliminate unexpected >0 return
>   May be a little paranoid and abuse some ASSERT(), but it should be
>   much safer against further code change.
> - Fix the false alert caused by balance and memory pressure
>   The fix it skip owner checker for non-essential tree at write time.
>   Since owner root can't always be reliable, either due to commit root
>   created in current transaction or balance + memory pressure.
> 
> v5:
> - Do proper error-out handling other than relying on flush_write_bio()
>   to clean up.
>   This has a side effect that no Reviewed-by tags for modified patches.
> - New comment for why we don't need to do anything about ebp->bio when
>   submit_one_bio() fails.
> - Add some Reviewed-by tag.
> 
> Qu Wenruo (12):
>   btrfs: Always output error message when key/level verification fails
>   btrfs: extent_io: Kill the forward declaration of flush_write_bio()
>   btrfs: disk-io: Show the timing of corrupted tree block explicitly
>   btrfs: extent_io: Move the BUG_ON() in flush_write_bio() one level up
>   btrfs: extent_io: Handle error better in extent_write_full_page()
>   btrfs: extent_io: Handle error better in btree_write_cache_pages()
>   btrfs: extent_io: Kill the dead branch in extent_write_cache_pages()
>   btrfs: extent_io: Handle error better in extent_write_locked_range()
>   btrfs: extent_io: Kill the BUG_ON() in lock_extent_buffer_for_io()
>   btrfs: extent_io: Kill the BUG_ON() in extent_write_cache_pages()
>   btrfs: extent_io: Handle error better in extent_writepages()
>   btrfs: Do mandatory tree block check before submitting bio
> 
>  fs/btrfs/disk-io.c      |  21 +++--
>  fs/btrfs/extent_io.c    | 168 ++++++++++++++++++++++++++++------------
>  fs/btrfs/tree-checker.c |  24 +++++-
>  fs/btrfs/tree-checker.h |   8 ++
>  4 files changed, 162 insertions(+), 59 deletions(-)
>
Qu Wenruo Feb. 15, 2019, 1:18 p.m. UTC | #2
On 2019/2/15 下午9:10, Nikolay Borisov wrote:
> 
> 
> On 15.02.19 г. 12:50 ч., Qu Wenruo wrote:
>> Patchset can be fetched from github:
>> https://github.com/adam900710/linux/tree/write_time_tree_checker
>> Which is based on v5.0-rc1 tag.
>> Also there is no conflict rebasing the patchset to misc-next.
>>
>> This patchset has the following 3 features:
>> - Tree block validation output enhancement
>>   * Output validation failure timing (write time or read time)
>>   * Always output tree block level/key mismatch error message
>>     This part is already submitted and reviewed.
>>
>> - Write time tree block validation check
>>   To catch memory corruption either from hardware or kernel.
>>   Example output would be:
>>
>>     BTRFS critical (device dm-3): corrupt leaf: root=2 block=1350630375424 slot=68, bad key order, prev (10510212874240 169 0) current (1714119868416 169 0)
>>     BTRFS error (device dm-3): write time tree block corruption detected
> This is not good.  Those two error messages should be collapsed into
> one. Otherwise it's hard to actually match them up.

That shouldn't be a problem, since the error won't happen so frequently
there is no other error message that could interrupt these 2 lines.

> Better output will
> be "Corrupt leaf detected during writing: root=..." and eliminate "write
> time tree block corruption detected" line. Is that feasible?

Feasible, currently tree checker only get called in 3 locations:
1) read time full checker
2) mark dirty time basic checker
3) write time full checker

And they all have different internal bool to indicate the timing, so
it's possible to output the timing.

But that needs to pass the internal bool down a long long way, for all
the output help to accept an extra string.
I'm not a big fan for that, and prefer a timing neutral tree checker.

Thanks,
Qu

> 
>>     BTRFS: error (device dm-3) in btrfs_commit_transaction:2220: errno=-5 IO failure (Error while writing out transaction)
>>     BTRFS info (device dm-3): forced readonly
>>     BTRFS warning (device dm-3): Skipping commit of aborted transaction.
>>     BTRFS: error (device dm-3) in cleanup_transaction:1839: errno=-5 IO failure
>>     BTRFS info (device dm-3): delayed_refs has NO entry
>>
>> - Better error handling before calling flush_write_bio()
>>   One hidden reason of calling flush_write_bio() under all cases is,
>>   flush_write_bio() will trigger endio function and endio function of
>>   epd->bio will free the bio under all cases.
>>   So we're in fact abusing flush_write_bio() as cleanup.
>>
>>   Since now flush_write_bio() has its own return value, we shouldn't call
>>   flush_write_bio() no-brain, here we introduce proper cleanup helper,
>>   end_write_bio(). Now we call flush_write_bio() like:
>>               New                 |           Old
>>   --------------------------------------------------------------
>>   ret = do_some_evil(&epd);       | ret = do_some_evil(&epd);
>>   if (ret < 0) {                  | flush_write_bio(&epd);
>>   	end_write_bio(&epd, ret); | ^^^ submitting half-backed epd->bio?
>>   	return ret;               | return ret;
>>   }                               |
>>   ret = flush_write_bio(&epd);    |
>>   return ret;                     |
>>
>>   Above code should be more streamline for the error handling part.
>>
>> Changelog:
>> v2:
>> - Unlock locked pages in lock_extent_buffer_for_io() for error handling.
>> - Added Reviewed-by tags.
>>
>> v3:
>> - Remove duplicated error message.
>> - Use IS_ENABLED() macro to replace #ifdef.
>> - Added Reviewed-by tags.
>>
>> v4:
>> - Re-organized patch split
>>   Now each BUG_ON() cleanup has its own patch
>> - Dig much further into the call sites to eliminate unexpected >0 return
>>   May be a little paranoid and abuse some ASSERT(), but it should be
>>   much safer against further code change.
>> - Fix the false alert caused by balance and memory pressure
>>   The fix it skip owner checker for non-essential tree at write time.
>>   Since owner root can't always be reliable, either due to commit root
>>   created in current transaction or balance + memory pressure.
>>
>> v5:
>> - Do proper error-out handling other than relying on flush_write_bio()
>>   to clean up.
>>   This has a side effect that no Reviewed-by tags for modified patches.
>> - New comment for why we don't need to do anything about ebp->bio when
>>   submit_one_bio() fails.
>> - Add some Reviewed-by tag.
>>
>> Qu Wenruo (12):
>>   btrfs: Always output error message when key/level verification fails
>>   btrfs: extent_io: Kill the forward declaration of flush_write_bio()
>>   btrfs: disk-io: Show the timing of corrupted tree block explicitly
>>   btrfs: extent_io: Move the BUG_ON() in flush_write_bio() one level up
>>   btrfs: extent_io: Handle error better in extent_write_full_page()
>>   btrfs: extent_io: Handle error better in btree_write_cache_pages()
>>   btrfs: extent_io: Kill the dead branch in extent_write_cache_pages()
>>   btrfs: extent_io: Handle error better in extent_write_locked_range()
>>   btrfs: extent_io: Kill the BUG_ON() in lock_extent_buffer_for_io()
>>   btrfs: extent_io: Kill the BUG_ON() in extent_write_cache_pages()
>>   btrfs: extent_io: Handle error better in extent_writepages()
>>   btrfs: Do mandatory tree block check before submitting bio
>>
>>  fs/btrfs/disk-io.c      |  21 +++--
>>  fs/btrfs/extent_io.c    | 168 ++++++++++++++++++++++++++++------------
>>  fs/btrfs/tree-checker.c |  24 +++++-
>>  fs/btrfs/tree-checker.h |   8 ++
>>  4 files changed, 162 insertions(+), 59 deletions(-)
>>
David Sterba Feb. 15, 2019, 5:19 p.m. UTC | #3
On Fri, Feb 15, 2019 at 09:18:03PM +0800, Qu Wenruo wrote:
> 
> 
> On 2019/2/15 下午9:10, Nikolay Borisov wrote:
> > 
> > 
> > On 15.02.19 г. 12:50 ч., Qu Wenruo wrote:
> >> Patchset can be fetched from github:
> >> https://github.com/adam900710/linux/tree/write_time_tree_checker
> >> Which is based on v5.0-rc1 tag.
> >> Also there is no conflict rebasing the patchset to misc-next.
> >>
> >> This patchset has the following 3 features:
> >> - Tree block validation output enhancement
> >>   * Output validation failure timing (write time or read time)
> >>   * Always output tree block level/key mismatch error message
> >>     This part is already submitted and reviewed.
> >>
> >> - Write time tree block validation check
> >>   To catch memory corruption either from hardware or kernel.
> >>   Example output would be:
> >>
> >>     BTRFS critical (device dm-3): corrupt leaf: root=2 block=1350630375424 slot=68, bad key order, prev (10510212874240 169 0) current (1714119868416 169 0)
> >>     BTRFS error (device dm-3): write time tree block corruption detected
> > This is not good.  Those two error messages should be collapsed into
> > one. Otherwise it's hard to actually match them up.
> 
> That shouldn't be a problem, since the error won't happen so frequently
> there is no other error message that could interrupt these 2 lines.
> 
> > Better output will
> > be "Corrupt leaf detected during writing: root=..." and eliminate "write
> > time tree block corruption detected" line. Is that feasible?
> 
> Feasible, currently tree checker only get called in 3 locations:
> 1) read time full checker
> 2) mark dirty time basic checker
> 3) write time full checker
> 
> And they all have different internal bool to indicate the timing, so
> it's possible to output the timing.
> 
> But that needs to pass the internal bool down a long long way, for all
> the output help to accept an extra string.
> I'm not a big fan for that, and prefer a timing neutral tree checker.

I'd rather not merge the error messages, as we'll possibly add more
sanity checks to various functions so there could be a list of problems
and there's one final note about when it happened (read time/write
time).

Matching the lines together is desirable though, so if the block number
could be part of all messages, I hope this makes it usable for analysis.

Reading btree_readpage_end_io_hook, the message should be under the err:
label, as there are 3 other possible messages printed (bad block start,
fsid and level).
Qu Wenruo Feb. 16, 2019, 6:49 a.m. UTC | #4
On 2019/2/16 上午1:19, David Sterba wrote:
> On Fri, Feb 15, 2019 at 09:18:03PM +0800, Qu Wenruo wrote:
>>
>>
>> On 2019/2/15 下午9:10, Nikolay Borisov wrote:
>>>
>>>
>>> On 15.02.19 г. 12:50 ч., Qu Wenruo wrote:
>>>> Patchset can be fetched from github:
>>>> https://github.com/adam900710/linux/tree/write_time_tree_checker
>>>> Which is based on v5.0-rc1 tag.
>>>> Also there is no conflict rebasing the patchset to misc-next.
>>>>
>>>> This patchset has the following 3 features:
>>>> - Tree block validation output enhancement
>>>>   * Output validation failure timing (write time or read time)
>>>>   * Always output tree block level/key mismatch error message
>>>>     This part is already submitted and reviewed.
>>>>
>>>> - Write time tree block validation check
>>>>   To catch memory corruption either from hardware or kernel.
>>>>   Example output would be:
>>>>
>>>>     BTRFS critical (device dm-3): corrupt leaf: root=2 block=1350630375424 slot=68, bad key order, prev (10510212874240 169 0) current (1714119868416 169 0)
>>>>     BTRFS error (device dm-3): write time tree block corruption detected
>>> This is not good.  Those two error messages should be collapsed into
>>> one. Otherwise it's hard to actually match them up.
>>
>> That shouldn't be a problem, since the error won't happen so frequently
>> there is no other error message that could interrupt these 2 lines.
>>
>>> Better output will
>>> be "Corrupt leaf detected during writing: root=..." and eliminate "write
>>> time tree block corruption detected" line. Is that feasible?
>>
>> Feasible, currently tree checker only get called in 3 locations:
>> 1) read time full checker
>> 2) mark dirty time basic checker
>> 3) write time full checker
>>
>> And they all have different internal bool to indicate the timing, so
>> it's possible to output the timing.
>>
>> But that needs to pass the internal bool down a long long way, for all
>> the output help to accept an extra string.
>> I'm not a big fan for that, and prefer a timing neutral tree checker.
> 
> I'd rather not merge the error messages, as we'll possibly add more
> sanity checks to various functions so there could be a list of problems
> and there's one final note about when it happened (read time/write
> time).
> 
> Matching the lines together is desirable though, so if the block number
> could be part of all messages, I hope this makes it usable for analysis.

This looks much better.
I'll change the timing line to show extra info to match them.

Thanks,
Qu
> 
> Reading btree_readpage_end_io_hook, the message should be under the err:
> label, as there are 3 other possible messages printed (bad block start,
> fsid and level).
>