diff mbox series

[RFC] btrfs: locking: Add extra check in btrfs_init_new_buffer() to avoid deadlock

Message ID 20180814055121.6077-1-wqu@suse.com (mailing list archive)
State New, archived
Headers show
Series [RFC] btrfs: locking: Add extra check in btrfs_init_new_buffer() to avoid deadlock | expand

Commit Message

Qu Wenruo Aug. 14, 2018, 5:51 a.m. UTC
[BUG]
For certains crafted image, whose csum root leaf has missing backref, if
we try to trigger write with data csum, it could cause deadlock with the
following kernel WARN_ON():
------
WARNING: CPU: 1 PID: 41 at fs/btrfs/locking.c:230 btrfs_tree_lock+0x3e2/0x400
CPU: 1 PID: 41 Comm: kworker/u4:1 Not tainted 4.18.0-rc1+ #8
Workqueue: btrfs-endio-write btrfs_endio_write_helper
RIP: 0010:btrfs_tree_lock+0x3e2/0x400
Call Trace:
 btrfs_alloc_tree_block+0x39f/0x770
 __btrfs_cow_block+0x285/0x9e0
 btrfs_cow_block+0x191/0x2e0
 btrfs_search_slot+0x492/0x1160
 btrfs_lookup_csum+0xec/0x280
 btrfs_csum_file_blocks+0x2be/0xa60
 add_pending_csums+0xaf/0xf0
 btrfs_finish_ordered_io+0x74b/0xc90
 finish_ordered_fn+0x15/0x20
 normal_work_helper+0xf6/0x500
 btrfs_endio_write_helper+0x12/0x20
 process_one_work+0x302/0x770
 worker_thread+0x81/0x6d0
 kthread+0x180/0x1d0
 ret_from_fork+0x35/0x40
---[ end trace 2e85051acb5f6dc1 ]---
------

[CAUSE]
That crafted image has missing backref for csum tree root leaf.
And when we try to allocate new tree block, since there is no
EXTENT/METADATA_ITEM for csum tree root, btrfs consider it's free slot
and use it.

The extent tree of the image looks like:
Normal image                      |       This fuzzed image
----------------------------------+--------------------------------
BG 29360128                       | BG 29360128
 One empty slot                   |  One empty slot
29364224: backref to UUID tree    | 29364224: backref to UUID tree
 Two empty slots                  |  Two empty slots
29376512: backref to CSUM tree    |  One empty slot (bad type) <<<
29380608: backref to D_RELOC tree | 29380608: backref to D_RELOC tree
...                               | ...

Since bytenr 29376512 has no METADATA/EXTENT_ITEM, when btrfs try to
alloc tree block, it's an valid slot for btrfs.

And for finish_ordered_write, when we need to insert csum, we try to CoW
csum tree root.

By *COINCIDENT*, empty slots at bytenr BG_OFFSET, BG_OFFSET + 8K,
BG_OFFSET + 12K is already used by tree block COW for other trees,
the next empty slot is BG_OFFSET + 16K, which should be the backref for
CSUM tree.

But due to the bad type, btrfs can recognize it and still consider it as
an empty slot, and will try to use it for csum tree CoW.

Then in the following call trace, we will try to lock the new tree
block, which turns out to be the old csum tree root which is already
locked:

btrfs_search_slot() called on csum tree root, which is at 29376512
|- btrfs_cow_block()
   |- btrfs_set_lock_block()
   |  |- Now locks tree block 29376512 (old csum tree root)
   |- __btrfs_cow_block()
      |- btrfs_alloc_tree_block()
         |- btrfs_reserve_extent()
            | Now it returns tree block 29376512, which extent tree
            | shows its empty slot, but it's already hold by csum tree
            |- btrfs_init_new_buffer()
               |- btrfs_tree_lock()
                  | Triggers WARN_ON(eb->lock_owner == current->pid)
                  |- wait_event()
                     Wait lock owner to release the lock, but it's
                     locked by ourself, so it will deadlock

[FIX]
This patch will do the lock_owner and current->pid check at
btrfs_init_new_buffer().
So above deadlock can be avoided.

Since such problem can only happen in crafted image, we will still
trigger kernel warning, but with a little more meaningful warning
message.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=200405
Reported-by: Xu Wen <wen.xu@gatech.edu>
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
changelog:
v2:
  Modify btrfs_tree_lock() to be able to return int to detect possible
  deadlock and return.
  Titled as "btrfs: locking: Allow btrfs_tree_lock() to return error to avoid deadlock"
v3:
  Instead of modify all btrfs_tree_lock() callers, only check possible
  deadlock at btrfs_init_new_buffer(), as the bug only happens for newly
  allocated tree block.
  With better commit message describing the on-disk extent tree
  corruption along with the call trace of how dead lock happens.

Hi David,

This v3 should explain the bug with more details and bring a minimal
impact to existing function callers.

Thanks,
Qu
---
 fs/btrfs/extent-tree.c | 13 +++++++++++++
 1 file changed, 13 insertions(+)

Comments

David Sterba Aug. 20, 2018, 4:38 p.m. UTC | #1
On Tue, Aug 14, 2018 at 01:51:21PM +0800, Qu Wenruo wrote:
> [BUG]
> For certains crafted image, whose csum root leaf has missing backref, if
> we try to trigger write with data csum, it could cause deadlock with the
> following kernel WARN_ON():
> ------
> WARNING: CPU: 1 PID: 41 at fs/btrfs/locking.c:230 btrfs_tree_lock+0x3e2/0x400
> CPU: 1 PID: 41 Comm: kworker/u4:1 Not tainted 4.18.0-rc1+ #8
> Workqueue: btrfs-endio-write btrfs_endio_write_helper
> RIP: 0010:btrfs_tree_lock+0x3e2/0x400
> Call Trace:
>  btrfs_alloc_tree_block+0x39f/0x770
>  __btrfs_cow_block+0x285/0x9e0
>  btrfs_cow_block+0x191/0x2e0
>  btrfs_search_slot+0x492/0x1160
>  btrfs_lookup_csum+0xec/0x280
>  btrfs_csum_file_blocks+0x2be/0xa60
>  add_pending_csums+0xaf/0xf0
>  btrfs_finish_ordered_io+0x74b/0xc90
>  finish_ordered_fn+0x15/0x20
>  normal_work_helper+0xf6/0x500
>  btrfs_endio_write_helper+0x12/0x20
>  process_one_work+0x302/0x770
>  worker_thread+0x81/0x6d0
>  kthread+0x180/0x1d0
>  ret_from_fork+0x35/0x40
> ---[ end trace 2e85051acb5f6dc1 ]---
> ------
> 
> [CAUSE]
> That crafted image has missing backref for csum tree root leaf.
> And when we try to allocate new tree block, since there is no
> EXTENT/METADATA_ITEM for csum tree root, btrfs consider it's free slot
> and use it.
> 
> The extent tree of the image looks like:
> Normal image                      |       This fuzzed image
> ----------------------------------+--------------------------------
> BG 29360128                       | BG 29360128
>  One empty slot                   |  One empty slot
> 29364224: backref to UUID tree    | 29364224: backref to UUID tree
>  Two empty slots                  |  Two empty slots
> 29376512: backref to CSUM tree    |  One empty slot (bad type) <<<
> 29380608: backref to D_RELOC tree | 29380608: backref to D_RELOC tree
> ...                               | ...
> 
> Since bytenr 29376512 has no METADATA/EXTENT_ITEM, when btrfs try to
> alloc tree block, it's an valid slot for btrfs.
> 
> And for finish_ordered_write, when we need to insert csum, we try to CoW
> csum tree root.
> 
> By *COINCIDENT*, empty slots at bytenr BG_OFFSET, BG_OFFSET + 8K,
> BG_OFFSET + 12K is already used by tree block COW for other trees,
> the next empty slot is BG_OFFSET + 16K, which should be the backref for
> CSUM tree.
> 
> But due to the bad type, btrfs can recognize it and still consider it as
> an empty slot, and will try to use it for csum tree CoW.
> 
> Then in the following call trace, we will try to lock the new tree
> block, which turns out to be the old csum tree root which is already
> locked:
> 
> btrfs_search_slot() called on csum tree root, which is at 29376512
> |- btrfs_cow_block()
>    |- btrfs_set_lock_block()
>    |  |- Now locks tree block 29376512 (old csum tree root)
>    |- __btrfs_cow_block()
>       |- btrfs_alloc_tree_block()
>          |- btrfs_reserve_extent()
>             | Now it returns tree block 29376512, which extent tree
>             | shows its empty slot, but it's already hold by csum tree
>             |- btrfs_init_new_buffer()
>                |- btrfs_tree_lock()
>                   | Triggers WARN_ON(eb->lock_owner == current->pid)
>                   |- wait_event()
>                      Wait lock owner to release the lock, but it's
>                      locked by ourself, so it will deadlock
> 
> [FIX]
> This patch will do the lock_owner and current->pid check at
> btrfs_init_new_buffer().
> So above deadlock can be avoided.
> 
> Since such problem can only happen in crafted image, we will still
> trigger kernel warning, but with a little more meaningful warning
> message.
> 
> Link: https://bugzilla.kernel.org/show_bug.cgi?id=200405
> Reported-by: Xu Wen <wen.xu@gatech.edu>
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
> changelog:
> v2:
>   Modify btrfs_tree_lock() to be able to return int to detect possible
>   deadlock and return.
>   Titled as "btrfs: locking: Allow btrfs_tree_lock() to return error to avoid deadlock"
> v3:
>   Instead of modify all btrfs_tree_lock() callers, only check possible
>   deadlock at btrfs_init_new_buffer(), as the bug only happens for newly
>   allocated tree block.
>   With better commit message describing the on-disk extent tree
>   corruption along with the call trace of how dead lock happens.
> 
> Hi David,
> 
> This v3 should explain the bug with more details and bring a minimal
> impact to existing function callers.

This is much better, thanks. I've checked the direct callers and a bit
up the call chaing, all seem to be ready for errors.

> Qu
> ---
>  fs/btrfs/extent-tree.c | 13 +++++++++++++
>  1 file changed, 13 insertions(+)
> 
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index e7559b89eaf7..c85ff8b7a95b 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -8302,6 +8302,19 @@ btrfs_init_new_buffer(struct btrfs_trans_handle *trans, struct btrfs_root *root,
>  	if (IS_ERR(buf))
>  		return buf;
>  
> +	/*
> +	 * Just extra safe net in case extent tree is corrupted and extent
> +	 * allocator chooses to use a tree block which is already used and
> +	 * locked.
> +	 */
> +	if (buf->lock_owner == current->pid) {
> +		WARN(1,

A btrfs_warn or btrfs_err would be better, to identify the corrupted
filesystem. Printing the stack trace might make sense in case we want to
know where exactly it failed as there are several paths that lead to
btrfs_init_new_buffer.

As every WARN, it needs to be considered because there are systems set
up to panic on warn and warnings are frequently confused as hard errors.

In this case I think it's worth to print the bytenr and root objectid
and that the stacktrace is not that important.

> +"tree block %llu already locked by pid=%d, extent tree corruption detected",
> +		     buf->start, current->pid);
> +		free_extent_buffer(buf);
> +		return ERR_PTR(-EUCLEAN);
> +	}
> +
>  	btrfs_set_header_generation(buf, trans->transid);
>  	btrfs_set_buffer_lockdep_class(root->root_key.objectid, buf, level);
>  	btrfs_tree_lock(buf);
> -- 
> 2.18.0
Qu Wenruo Aug. 21, 2018, 12:04 a.m. UTC | #2
On 2018/8/21 上午12:38, David Sterba wrote:
> On Tue, Aug 14, 2018 at 01:51:21PM +0800, Qu Wenruo wrote:
>> [BUG]
>> For certains crafted image, whose csum root leaf has missing backref, if
>> we try to trigger write with data csum, it could cause deadlock with the
>> following kernel WARN_ON():
>> ------
>> WARNING: CPU: 1 PID: 41 at fs/btrfs/locking.c:230 btrfs_tree_lock+0x3e2/0x400
>> CPU: 1 PID: 41 Comm: kworker/u4:1 Not tainted 4.18.0-rc1+ #8
>> Workqueue: btrfs-endio-write btrfs_endio_write_helper
>> RIP: 0010:btrfs_tree_lock+0x3e2/0x400
>> Call Trace:
>>  btrfs_alloc_tree_block+0x39f/0x770
>>  __btrfs_cow_block+0x285/0x9e0
>>  btrfs_cow_block+0x191/0x2e0
>>  btrfs_search_slot+0x492/0x1160
>>  btrfs_lookup_csum+0xec/0x280
>>  btrfs_csum_file_blocks+0x2be/0xa60
>>  add_pending_csums+0xaf/0xf0
>>  btrfs_finish_ordered_io+0x74b/0xc90
>>  finish_ordered_fn+0x15/0x20
>>  normal_work_helper+0xf6/0x500
>>  btrfs_endio_write_helper+0x12/0x20
>>  process_one_work+0x302/0x770
>>  worker_thread+0x81/0x6d0
>>  kthread+0x180/0x1d0
>>  ret_from_fork+0x35/0x40
>> ---[ end trace 2e85051acb5f6dc1 ]---
>> ------
>>
>> [CAUSE]
>> That crafted image has missing backref for csum tree root leaf.
>> And when we try to allocate new tree block, since there is no
>> EXTENT/METADATA_ITEM for csum tree root, btrfs consider it's free slot
>> and use it.
>>
>> The extent tree of the image looks like:
>> Normal image                      |       This fuzzed image
>> ----------------------------------+--------------------------------
>> BG 29360128                       | BG 29360128
>>  One empty slot                   |  One empty slot
>> 29364224: backref to UUID tree    | 29364224: backref to UUID tree
>>  Two empty slots                  |  Two empty slots
>> 29376512: backref to CSUM tree    |  One empty slot (bad type) <<<
>> 29380608: backref to D_RELOC tree | 29380608: backref to D_RELOC tree
>> ...                               | ...
>>
>> Since bytenr 29376512 has no METADATA/EXTENT_ITEM, when btrfs try to
>> alloc tree block, it's an valid slot for btrfs.
>>
>> And for finish_ordered_write, when we need to insert csum, we try to CoW
>> csum tree root.
>>
>> By *COINCIDENT*, empty slots at bytenr BG_OFFSET, BG_OFFSET + 8K,
>> BG_OFFSET + 12K is already used by tree block COW for other trees,
>> the next empty slot is BG_OFFSET + 16K, which should be the backref for
>> CSUM tree.
>>
>> But due to the bad type, btrfs can recognize it and still consider it as
>> an empty slot, and will try to use it for csum tree CoW.
>>
>> Then in the following call trace, we will try to lock the new tree
>> block, which turns out to be the old csum tree root which is already
>> locked:
>>
>> btrfs_search_slot() called on csum tree root, which is at 29376512
>> |- btrfs_cow_block()
>>    |- btrfs_set_lock_block()
>>    |  |- Now locks tree block 29376512 (old csum tree root)
>>    |- __btrfs_cow_block()
>>       |- btrfs_alloc_tree_block()
>>          |- btrfs_reserve_extent()
>>             | Now it returns tree block 29376512, which extent tree
>>             | shows its empty slot, but it's already hold by csum tree
>>             |- btrfs_init_new_buffer()
>>                |- btrfs_tree_lock()
>>                   | Triggers WARN_ON(eb->lock_owner == current->pid)
>>                   |- wait_event()
>>                      Wait lock owner to release the lock, but it's
>>                      locked by ourself, so it will deadlock
>>
>> [FIX]
>> This patch will do the lock_owner and current->pid check at
>> btrfs_init_new_buffer().
>> So above deadlock can be avoided.
>>
>> Since such problem can only happen in crafted image, we will still
>> trigger kernel warning, but with a little more meaningful warning
>> message.
>>
>> Link: https://bugzilla.kernel.org/show_bug.cgi?id=200405
>> Reported-by: Xu Wen <wen.xu@gatech.edu>
>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>> ---
>> changelog:
>> v2:
>>   Modify btrfs_tree_lock() to be able to return int to detect possible
>>   deadlock and return.
>>   Titled as "btrfs: locking: Allow btrfs_tree_lock() to return error to avoid deadlock"
>> v3:
>>   Instead of modify all btrfs_tree_lock() callers, only check possible
>>   deadlock at btrfs_init_new_buffer(), as the bug only happens for newly
>>   allocated tree block.
>>   With better commit message describing the on-disk extent tree
>>   corruption along with the call trace of how dead lock happens.
>>
>> Hi David,
>>
>> This v3 should explain the bug with more details and bring a minimal
>> impact to existing function callers.
> 
> This is much better, thanks. I've checked the direct callers and a bit
> up the call chaing, all seem to be ready for errors.
> 
>> Qu
>> ---
>>  fs/btrfs/extent-tree.c | 13 +++++++++++++
>>  1 file changed, 13 insertions(+)
>>
>> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
>> index e7559b89eaf7..c85ff8b7a95b 100644
>> --- a/fs/btrfs/extent-tree.c
>> +++ b/fs/btrfs/extent-tree.c
>> @@ -8302,6 +8302,19 @@ btrfs_init_new_buffer(struct btrfs_trans_handle *trans, struct btrfs_root *root,
>>  	if (IS_ERR(buf))
>>  		return buf;
>>  
>> +	/*
>> +	 * Just extra safe net in case extent tree is corrupted and extent
>> +	 * allocator chooses to use a tree block which is already used and
>> +	 * locked.
>> +	 */
>> +	if (buf->lock_owner == current->pid) {
>> +		WARN(1,
> 
> A btrfs_warn or btrfs_err would be better, to identify the corrupted
> filesystem. Printing the stack trace might make sense in case we want to
> know where exactly it failed as there are several paths that lead to
> btrfs_init_new_buffer.
> 
> As every WARN, it needs to be considered because there are systems set
> up to panic on warn and warnings are frequently confused as hard errors.

Oh right, abusing WARN() would cause problem on such systems, especially
when such problem can be handled more or less gracefully.

> 
> In this case I think it's worth to print the bytenr and root objectid
> and that the stacktrace is not that important.

Will add them of course.

Thanks,
Qu

> 
>> +"tree block %llu already locked by pid=%d, extent tree corruption detected",
>> +		     buf->start, current->pid);
>> +		free_extent_buffer(buf);
>> +		return ERR_PTR(-EUCLEAN);
>> +	}
>> +
>>  	btrfs_set_header_generation(buf, trans->transid);
>>  	btrfs_set_buffer_lockdep_class(root->root_key.objectid, buf, level);
>>  	btrfs_tree_lock(buf);
>> -- 
>> 2.18.0
diff mbox series

Patch

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index e7559b89eaf7..c85ff8b7a95b 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -8302,6 +8302,19 @@  btrfs_init_new_buffer(struct btrfs_trans_handle *trans, struct btrfs_root *root,
 	if (IS_ERR(buf))
 		return buf;
 
+	/*
+	 * Just extra safe net in case extent tree is corrupted and extent
+	 * allocator chooses to use a tree block which is already used and
+	 * locked.
+	 */
+	if (buf->lock_owner == current->pid) {
+		WARN(1,
+"tree block %llu already locked by pid=%d, extent tree corruption detected",
+		     buf->start, current->pid);
+		free_extent_buffer(buf);
+		return ERR_PTR(-EUCLEAN);
+	}
+
 	btrfs_set_header_generation(buf, trans->transid);
 	btrfs_set_buffer_lockdep_class(root->root_key.objectid, buf, level);
 	btrfs_tree_lock(buf);