[v3] btrfs: Don't submit any btree write bio after transaction is aborted

Message ID 20200205071015.19621-1-wqu@suse.com
State New

Commit Message

Qu Wenruo Feb. 5, 2020, 7:10 a.m. UTC
[BUG]
There is a fuzzed image which can cause a KASAN report at unmount time.

  ==================================================================
  BUG: KASAN: use-after-free in btrfs_queue_work+0x2c1/0x390
  Read of size 8 at addr ffff888067cf6848 by task umount/1922

  CPU: 0 PID: 1922 Comm: umount Tainted: G        W         5.0.21 #1
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
  Call Trace:
   dump_stack+0x5b/0x8b
   print_address_description+0x70/0x280
   kasan_report+0x13a/0x19b
   btrfs_queue_work+0x2c1/0x390
   btrfs_wq_submit_bio+0x1cd/0x240
   btree_submit_bio_hook+0x18c/0x2a0
   submit_one_bio+0x1be/0x320
   flush_write_bio.isra.41+0x2c/0x70
   btree_write_cache_pages+0x3bb/0x7f0
   do_writepages+0x5c/0x130
   __writeback_single_inode+0xa3/0x9a0
   writeback_single_inode+0x23d/0x390
   write_inode_now+0x1b5/0x280
   iput+0x2ef/0x600
   close_ctree+0x341/0x750
   generic_shutdown_super+0x126/0x370
   kill_anon_super+0x31/0x50
   btrfs_kill_super+0x36/0x2b0
   deactivate_locked_super+0x80/0xc0
   deactivate_super+0x13c/0x150
   cleanup_mnt+0x9a/0x130
   task_work_run+0x11a/0x1b0
   exit_to_usermode_loop+0x107/0x130
   do_syscall_64+0x1e5/0x280
   entry_SYSCALL_64_after_hwframe+0x44/0xa9

[CAUSE]
The fuzzed image has a completely screwed up extent tree:
  leaf 29421568 gen 8 total ptrs 6 free space 3587 owner EXTENT_TREE
  refs 2 lock (w:0 r:0 bw:0 br:0 sw:0 sr:0) lock_owner 0 current 5938
          item 0 key (12587008 168 4096) itemoff 3942 itemsize 53
                  extent refs 1 gen 9 flags 1
                  ref#0: extent data backref root 5 objectid 259 offset 0 count 1
          item 1 key (12591104 168 8192) itemoff 3889 itemsize 53
                  extent refs 1 gen 9 flags 1
                  ref#0: extent data backref root 5 objectid 271 offset 0 count 1
          item 2 key (12599296 168 4096) itemoff 3836 itemsize 53
                  extent refs 1 gen 9 flags 1
                  ref#0: extent data backref root 5 objectid 259 offset 4096 count 1
          item 3 key (29360128 169 0) itemoff 3803 itemsize 33
                  extent refs 1 gen 9 flags 2
                  ref#0: tree block backref root 5
          item 4 key (29368320 169 1) itemoff 3770 itemsize 33
                  extent refs 1 gen 9 flags 2
                  ref#0: tree block backref root 5
          item 5 key (29372416 169 0) itemoff 3737 itemsize 33
                  extent refs 1 gen 9 flags 2
                  ref#0: tree block backref root 5

Note that leaf 29421568 doesn't have a backref in the extent tree.
Thus the extent allocator can re-allocate leaf 29421568 for other trees.

Short version of the corruption:
- Extent tree corruption
  An existing tree block X can be allocated as a new tree block.

- Tree block X allocated to log tree
  The generation of tree block X gets bumped, and it is now tracked by
  log_root->dirty_log_pages.

- Log tree writes tree blocks
  log_root->dirty_log_pages is cleaned.

- The original owner of tree block X wants to modify its content
  Instead of COWing tree block X to a new eb, tree block X is reused
  as is due to the bumped generation.

  Btrfs believes tree block X has already been dirtied in this
  transaction due to its transid, but it is not tracked by
  transaction->dirty_pages.

- Tree block X is now dirty but wild
  Neither transaction->dirty_pages nor log_root->dirty_log_pages
  tracks it.

- Transaction aborted due to extent tree corruption
  Tree block X is not cleaned up, because nothing tracks it.

- Fs unmount
  Workers get freed first, then iput() is called on btree_inode.
  But tree block X is still dirty, so writeback is triggered after the
  workers are freed, triggering the use-after-free bug.

This shows that, if the extent tree is corrupted, there are always ways
for wild tree blocks to sneak in without being properly tracked.

We do the cleanup properly, but that "properly" relies on the assumption
that log tree blocks are never shared with existing tree blocks (i.e.
that the fs is more or less sane).

If that assumption is broken, the existing tracking and cleanup no
longer make sense.

[FIX]
To fix this problem, make btree_write_cache_pages() check whether the
transaction is aborted before submitting the write bio.

This is the last safety net in case all other cleanup fails to catch
such a problem.

Link: https://github.com/bobfuzzer/CVE/tree/master/CVE-2019-19377
CVE: CVE-2019-19377
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
changelog:
v2:
- More detailed explanation of why the dirty pages are not cleaned up
  Regular cleanup methods won't work in this extremely corrupted case,
  thus we still need this last-resort method to prevent the
  use-after-free.

v3:
- Dig further to find out the cause
  It's the log tree bumping the transid of existing tree blocks that
  causes the problem.
  This breaks the COW condition, making btrfs dirty the eb without
  tracking it.

  The existing cleanup for the log tree is fine for a sane fs.
  But when the fs goes insane, no sane cleanup makes sense anymore.
---
 fs/btrfs/extent_io.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

Comments

Josef Bacik Feb. 6, 2020, 4 p.m. UTC | #1
On 2/5/20 2:10 AM, Qu Wenruo wrote:
> [...]
> 
> - The original owner of tree block X wants to modify its content
>    Instead of COW tree block X to a new eb, due to the bumped
>    generation, tree block X is reused as is.
> 
>    Btrfs believes tree block X is already dirtied due to its transid,
>    but it is not tracked by transaction->dirty_pages.

But at the write part we should have gotten BTRFS_HEADER_FLAG_WRITTEN, so we 
should have cow'ed this block.  So this isn't what's happening, right?  Or is 
something else clearing the BTRFS_HEADER_FLAG_WRITTEN in between the writeout 
and this part?  Thanks,

Josef
Qu Wenruo Feb. 7, 2020, 12:24 a.m. UTC | #2
On 2020/2/7 12:00 AM, Josef Bacik wrote:
> On 2/5/20 2:10 AM, Qu Wenruo wrote:
>> [...]
>>
>> - The original owner of tree block X wants to modify its content
>>    Instead of COW tree block X to a new eb, due to the bumped
>>    generation, tree block X is reused as is.
>>
>>    Btrfs believes tree block X is already dirtied due to its transid,
>>    but it is not tracked by transaction->dirty_pages.
>>
> 
> But at the write part we should have gotten BTRFS_HEADER_FLAG_WRITTEN,
> so we should have cow'ed this block.  So this isn't what's happening,
> right?

From my debugging, that's not the case. Somehow, after the log tree
writes back, the tree block just gets reused.

>  Or is something else clearing the BTRFS_HEADER_FLAG_WRITTEN in
> between the writeout and this part?  Thanks,

It didn't occur to me at the time of writing, but is it possible that
the log tree gets freed, so tree block X is considered free and gets
re-allocated to the extent tree again?

Digging into this problem is really killing me.
Can't we use this last-resort method anyway? The corrupted extent tree
is causing all kinds of issues we didn't expect...

Thanks,
Qu

> 
> Josef
Josef Bacik Feb. 7, 2020, 12:37 a.m. UTC | #3
On 2/6/20 7:24 PM, Qu Wenruo wrote:
> 
> On 2020/2/7 12:00 AM, Josef Bacik wrote:
>> [...]
>>
>> But at the write part we should have gotten BTRFS_HEADER_FLAG_WRITTEN,
>> so we should have cow'ed this block.  So this isn't what's happening,
>> right?
> 
>  From my debugging, it's not the case. By somehow, after log tree writes
> back, the tree block just got reused.
> 
>>    Or is something else clearing the BTRFS_HEADER_FLAG_WRITTEN in
>> between the writeout and this part?  Thanks,
> 
> It didn't occur to me at the time of writing, is it possible that log
> tree get freed, thus that tree block X is considered free, and get
> re-allocated to extent tree again?
> 

Yeah, but then they'd go onto the dirty pages radix tree properly, because it 
would be the correct root, and we wouldn't have this problem.

> The problem is really killing me to digging.
> Can't we use this last-resort method anyway? The corrupted extent tree
> is really causing all kinds of issues we didn't expect...

Which is why I want the real root cause and a real fix, not something that's 
papering over the problem.  Thanks,

Josef
Qu Wenruo Feb. 8, 2020, 6:28 a.m. UTC | #4
On 2020/2/7 8:37 AM, Josef Bacik wrote:
> On 2/6/20 7:24 PM, Qu Wenruo wrote:
>>
>> On 2020/2/7 12:00 AM, Josef Bacik wrote:
>>> [...]
>>>
>>> But at the write part we should have gotten BTRFS_HEADER_FLAG_WRITTEN,
>>> so we should have cow'ed this block.  So this isn't what's happening,
>>> right?
>>
>>  From my debugging, it's not the case. By somehow, after log tree writes
>> back, the tree block just got reused.
>>
>>>    Or is something else clearing the BTRFS_HEADER_FLAG_WRITTEN in
>>> between the writeout and this part?  Thanks,
>>
>> It didn't occur to me at the time of writing, is it possible that log
>> tree get freed, thus that tree block X is considered free, and get
>> re-allocated to extent tree again?
>>
> 
> Yeah, but then they'd go onto the dirty pages radix tree properly,
> because it would be the correct root, and we wouldn't have this problem.
> 
>> The problem is really killing me to digging.
>> Can't we use this last-resort method anyway? The corrupted extent tree
>> is really causing all kinds of issues we didn't expect...
> 
> Which is why I want the real root cause and a real fix, not something
> that's papering over the problem.  Thanks,

OK, this fuzzed image just isn't making sense to me now.

During my debugging, the weirdest thing happened.

With the following diff applied, the problem just disappears.

diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
index 968faaec0e39..0d37768003a5 100644
--- a/fs/btrfs/ctree.c
+++ b/fs/btrfs/ctree.c
@@ -1447,6 +1447,7 @@ static inline int should_cow_block(struct btrfs_trans_handle *trans,
                                   struct btrfs_root *root,
                                   struct extent_buffer *buf)
 {
+       int ret;
        if (btrfs_is_testing(root->fs_info))
                return 0;

@@ -1469,8 +1470,9 @@ static inline int should_cow_block(struct btrfs_trans_handle *trans,
            !(root->root_key.objectid != BTRFS_TREE_RELOC_OBJECTID &&
              btrfs_header_flag(buf, BTRFS_HEADER_FLAG_RELOC)) &&
            !test_bit(BTRFS_ROOT_FORCE_COW, &root->state))
-               return 0;
-       return 1;
+               ret = 0;
+       ret = 1;
+       return ret;
 }

 /*

How could this happen?!?!?

Thanks,
Qu
> 
> Josef

Patch

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 2f4802f405a2..0c58c7c230e6 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -3927,6 +3927,7 @@  int btree_write_cache_pages(struct address_space *mapping,
 		.extent_locked = 0,
 		.sync_io = wbc->sync_mode == WB_SYNC_ALL,
 	};
+	struct btrfs_fs_info *fs_info = tree->fs_info;
 	int ret = 0;
 	int done = 0;
 	int nr_to_write_done = 0;
@@ -4036,7 +4037,12 @@  int btree_write_cache_pages(struct address_space *mapping,
 		end_write_bio(&epd, ret);
 		return ret;
 	}
-	ret = flush_write_bio(&epd);
+	if (!test_bit(BTRFS_FS_STATE_TRANS_ABORTED, &fs_info->fs_state)) {
+		ret = flush_write_bio(&epd);
+	} else {
+		ret = -EUCLEAN;
+		end_write_bio(&epd, ret);
+	}
 	return ret;
 }