mbox series

[v7,00/38] Cleanup error handling in relocation

Message ID cover.1608135849.git.josef@toxicpanda.com (mailing list archive)
Headers show
Series Cleanup error handling in relocation | expand

Message

Josef Bacik Dec. 16, 2020, 4:26 p.m. UTC
v6->v7:
- Broke up the series into 3 series, 1 for cosmetic things, 1 for all the major
  issues (including those reported on v6 of this set), and this new set which is
  solely the error handling related patches for relocation.  It's still a lot of
  patches, sorry about that.

v5->v6:
- Reworked "btrfs: handle errors from select_reloc_root()" because Zygo reported
  hitting an ASSERT(ret != -ENOENT) during his testing.  This was because I
  changed select_reloc_root() to return -ENOENT if we happened to race with
  somebody else who failed to init the reloc root, however we had an ASSERT() to
  check for this because it indicated corruption.  I modified that patch to move
  the ASSERT() to where the problem actually is, so select_reloc_root() can
  return whatever error and it'll pass it along.  I also removed Qu's
  reviewed-by for the patch because of the change.

v4->v5:
- Dropped "btrfs: fix error handling in commit_fs_roots" as it was merged.
- Fixed an ASSERT() that happened during relocation recovery that Zygo reported,
  I moved the error condition out of another condition which broke recovery if
  we had deleted subvols pending with relocation.

v3->v4:
- Squashed the __add_reloc_root error handling patches in
  btrfs_recover_relocation as they were small and in the same function.
- Squashed the record_root_in_trans failure handling patches for
  select_reloc_root as they were small and in the same function.
- Added a new patch to address an existing error handling problem with subvol
  creation.
- Fixed up the various cases that Qu noticed where I got things wrong, cleaning
  up a leaked root extent ref, a leaked inode item, and where I accidentally
  stopped dealing with errors from btrfs_drop_subtree.
- Reworked a bunch of the ASSERT()'s to do ASSERT(0) in their respective if
  statements.
- Added reviewed-bys.

v2->v3:
- A lot of extra patches fixing various things that I encountered while
  debugging the corruption problem that was uncovered by these patches.
- Fixed the panic that Zygo was seeing and other issues.
- Fixed up the comments from Nikolay and Filipe.

A slight note, the first set of patches could probably be taken now, and in fact

  btrfs: fix error handling in commit_fs_roots

Was sent earlier this week and is very important and needs to be reviewed and
merged ASAP.  The following are safe and could be merged outside of the rest of
this series

  btrfs: allow error injection for btrfs_search_slot and btrfs_cow_block
  btrfs: fix lockdep splat in btrfs_recover_relocation
  btrfs: keep track of the root owner for relocation reads
  btrfs: noinline btrfs_should_cancel_balance
  btrfs: do not cleanup upper nodes in btrfs_backref_cleanup_node
  btrfs: pass down the tree block level through ref-verify
  btrfs: make sure owner is set in ref-verify
  btrfs: don't clear ret in btrfs_start_dirty_block_groups

The rest obviously are all around the actual error handling.

v1->v2:
- fixed a bug where I accidentally dropped reading flags in relocate_block_group
  when I dropped the extra checks that we handle in the tree checker.

--- Original message ---
Hello,

Relocation is the last place that is not able to handle errors at all, which
results in all sorts of lovely panics if you encounter corruptions or IO errors.
I'm going to start cleaning up relocation, but before I move code around I want
the error handling to be somewhat sane, so I'm not changing behavior and error
handling at the same time.

These patches are purely about error handling, there is no behavior changing
other than returning errors up the chain properly.  There is a lot of room for
follow up cleanups, which will happen next.  However I wanted to get this series
done today and out so we could get it merged ASAP, and then the follow up
cleanups can happen later as they are less important and less critical.

The only exception to the above is the patch to add the error injection sites
for btrfs_cow_block and btrfs_search_slot, and a lockdep fix that I discovered
while running my tests, those are the first two patches in the series.

I tested this with my error injection stress test, where I keep track of all
stack traces that have been tested and only inject errors when we have a new
stack trace, which means I should have covered all of the various error
conditions.  With this patchset I'm no longer panicing while stressing the error
conditions.  Thanks,

Josef

Josef Bacik (38):
  btrfs: convert BUG_ON()'s in relocate_tree_block
  btrfs: return an error from btrfs_record_root_in_trans
  btrfs: handle errors from select_reloc_root()
  btrfs: convert BUG_ON()'s in select_reloc_root() to proper errors
  btrfs: check record_root_in_trans related failures in
    select_reloc_root
  btrfs: do proper error handling in record_reloc_root_in_trans
  btrfs: handle btrfs_record_root_in_trans failure in
    btrfs_rename_exchange
  btrfs: handle btrfs_record_root_in_trans failure in btrfs_rename
  btrfs: handle btrfs_record_root_in_trans failure in
    btrfs_delete_subvolume
  btrfs: handle btrfs_record_root_in_trans failure in
    btrfs_recover_log_trees
  btrfs: handle btrfs_record_root_in_trans failure in create_subvol
  btrfs: btrfs: handle btrfs_record_root_in_trans failure in
    relocate_tree_block
  btrfs: handle btrfs_record_root_in_trans failure in start_transaction
  btrfs: handle record_root_in_trans failure in qgroup_account_snapshot
  btrfs: handle record_root_in_trans failure in
    btrfs_record_root_in_trans
  btrfs: handle record_root_in_trans failure in create_pending_snapshot
  btrfs: do not panic in __add_reloc_root
  btrfs: have proper error handling in btrfs_init_reloc_root
  btrfs: do proper error handling in create_reloc_root
  btrfs: validate ->reloc_root after recording root in trans
  btrfs: handle btrfs_update_reloc_root failure in commit_fs_roots
  btrfs: change insert_dirty_subvol to return errors
  btrfs: handle btrfs_update_reloc_root failure in insert_dirty_subvol
  btrfs: handle btrfs_update_reloc_root failure in prepare_to_merge
  btrfs: do proper error handling in btrfs_update_reloc_root
  btrfs: convert logic BUG_ON()'s in replace_path to ASSERT()'s
  btrfs: handle btrfs_cow_block errors in replace_path
  btrfs: handle btrfs_search_slot failure in replace_path
  btrfs: handle errors in reference count manipulation in replace_path
  btrfs: handle extent reference errors in do_relocation
  btrfs: check for BTRFS_BLOCK_FLAG_FULL_BACKREF being set improperly
  btrfs: remove the extent item sanity checks in relocate_block_group
  btrfs: do proper error handling in create_reloc_inode
  btrfs: handle __add_reloc_root failures in btrfs_recover_relocation
  btrfs: cleanup error handling in prepare_to_merge
  btrfs: handle extent corruption with select_one_root properly
  btrfs: do proper error handling in merge_reloc_roots
  btrfs: check return value of btrfs_commit_transaction in relocation

 fs/btrfs/inode.c        |  20 +-
 fs/btrfs/ioctl.c        |   7 +-
 fs/btrfs/relocation.c   | 414 +++++++++++++++++++++++++++++++---------
 fs/btrfs/transaction.c  |  37 ++--
 fs/btrfs/tree-checker.c |   5 +
 fs/btrfs/tree-log.c     |   8 +-
 6 files changed, 382 insertions(+), 109 deletions(-)

Comments

Zygo Blaxell Dec. 16, 2020, 7:56 p.m. UTC | #1
On Wed, Dec 16, 2020 at 11:26:16AM -0500, Josef Bacik wrote:
> v6->v7:
> - Broke up the series into 3 series, 1 for cosmetic things, 1 for all the major
>   issues (including those reported on v6 of this set), and this new set which is
>   solely the error handling related patches for relocation.  It's still a lot of
>   patches, sorry about that.

So far it lockdepped, but it is still running:

	[Wed Dec 16 13:30:45 2020] irq event stamp: 5875656

	[Wed Dec 16 13:30:45 2020] ======================================================
	[Wed Dec 16 13:30:45 2020] hardirqs last  enabled at (5875655): [<ffffffff98a82e37>] _raw_spin_unlock_irqrestore+0x47/0x50
	[Wed Dec 16 13:30:45 2020] WARNING: possible circular locking dependency detected
	[Wed Dec 16 13:30:45 2020] hardirqs last disabled at (5875656): [<ffffffff98a8368d>] _raw_spin_lock_irqsave+0x7d/0xa0
	[Wed Dec 16 13:30:45 2020] softirqs last  enabled at (5874382): [<ffffffff98e0044f>] __do_softirq+0x44f/0x5a7
	[Wed Dec 16 13:30:45 2020] 5.10.0-39fbe74d1bbc-josef+ #1 Tainted: G        W        
	[Wed Dec 16 13:30:45 2020] softirqs last disabled at (5874369): [<ffffffff98c01122>] asm_call_irq_on_stack+0x12/0x20
	[Wed Dec 16 13:30:45 2020] ------------------------------------------------------
	[Wed Dec 16 13:30:45 2020] btrfs/6366 is trying to acquire lock:
	[Wed Dec 16 13:30:45 2020] ffff88816f146a28 (btrfs-treloc-03){+.+.}-{3:3}, at: __btrfs_tree_lock+0x29/0x190
	[Wed Dec 16 13:30:45 2020]
				   but task is already holding lock:
	[Wed Dec 16 13:30:45 2020] ffff88810167c598 (btrfs-tree-02){++++}-{3:3}, at: __btrfs_tree_lock+0x29/0x190
	[Wed Dec 16 13:30:45 2020]
				   which lock already depends on the new lock.

	[Wed Dec 16 13:30:45 2020]
				   the existing dependency chain (in reverse order) is:
	[Wed Dec 16 13:30:45 2020]
				   -> #1 (btrfs-tree-02){++++}-{3:3}:
	[Wed Dec 16 13:30:45 2020]        down_write_nested+0xa6/0x2d0
	[Wed Dec 16 13:30:45 2020]        __btrfs_tree_lock+0x29/0x190
	[Wed Dec 16 13:30:45 2020]        btrfs_tree_lock+0x10/0x20
	[Wed Dec 16 13:30:45 2020]        btrfs_search_slot+0x474/0x1090
	[Wed Dec 16 13:30:45 2020]        do_relocation+0x2a0/0xc20
	[Wed Dec 16 13:30:45 2020]        relocate_tree_blocks+0x733/0xb90
	[Wed Dec 16 13:30:45 2020]        relocate_block_group+0x2eb/0x780
	[Wed Dec 16 13:30:45 2020]        btrfs_relocate_block_group+0x26e/0x4c0
	[Wed Dec 16 13:30:45 2020]        btrfs_relocate_chunk+0x52/0x120
	[Wed Dec 16 13:30:45 2020]        btrfs_balance+0xe2e/0x1900
	[Wed Dec 16 13:30:45 2020]        btrfs_ioctl_balance+0x1f9/0x460
	[Wed Dec 16 13:30:45 2020]        btrfs_ioctl+0xe9c/0x4360
	[Wed Dec 16 13:30:45 2020]        __x64_sys_ioctl+0xc3/0x100
	[Wed Dec 16 13:30:45 2020]        do_syscall_64+0x37/0x80
	[Wed Dec 16 13:30:45 2020]        entry_SYSCALL_64_after_hwframe+0x44/0xa9
	[Wed Dec 16 13:30:45 2020]
				   -> #0 (btrfs-treloc-03){+.+.}-{3:3}:
	[Wed Dec 16 13:30:45 2020]        __lock_acquire+0x1ebb/0x2880
	[Wed Dec 16 13:30:45 2020]        lock_acquire+0x192/0x550
	[Wed Dec 16 13:30:45 2020]        down_write_nested+0xa6/0x2d0
	[Wed Dec 16 13:30:45 2020]        __btrfs_tree_lock+0x29/0x190
	[Wed Dec 16 13:30:45 2020]        btrfs_lock_root_node+0x5b/0x190
	[Wed Dec 16 13:30:45 2020]        btrfs_search_slot+0xc8d/0x1090
	[Wed Dec 16 13:30:45 2020]        replace_path.isra.36+0x8a2/0xee0
	[Wed Dec 16 13:30:45 2020]        merge_reloc_root+0x58c/0xc10
	[Wed Dec 16 13:30:45 2020]        merge_reloc_roots+0x1e6/0x4a0
	[Wed Dec 16 13:30:45 2020]        relocate_block_group+0x3d2/0x780
	[Wed Dec 16 13:30:45 2020]        btrfs_relocate_block_group+0x26e/0x4c0
	[Wed Dec 16 13:30:45 2020]        btrfs_relocate_chunk+0x52/0x120
	[Wed Dec 16 13:30:45 2020]        btrfs_balance+0xe2e/0x1900
	[Wed Dec 16 13:30:45 2020]        btrfs_ioctl_balance+0x1f9/0x460
	[Wed Dec 16 13:30:45 2020]        btrfs_ioctl+0xe9c/0x4360
	[Wed Dec 16 13:30:45 2020]        __x64_sys_ioctl+0xc3/0x100
	[Wed Dec 16 13:30:45 2020]        do_syscall_64+0x37/0x80
	[Wed Dec 16 13:30:45 2020]        entry_SYSCALL_64_after_hwframe+0x44/0xa9
	[Wed Dec 16 13:30:45 2020]
				   other info that might help us debug this:

	[Wed Dec 16 13:30:45 2020]  Possible unsafe locking scenario:

	[Wed Dec 16 13:30:45 2020]        CPU0                    CPU1
	[Wed Dec 16 13:30:45 2020]        ----                    ----
	[Wed Dec 16 13:30:45 2020]   lock(btrfs-tree-02);
	[Wed Dec 16 13:30:45 2020]                                lock(btrfs-treloc-03);
	[Wed Dec 16 13:30:45 2020]                                lock(btrfs-tree-02);
	[Wed Dec 16 13:30:45 2020]   lock(btrfs-treloc-03);
	[Wed Dec 16 13:30:45 2020]
				    *** DEADLOCK ***

	[Wed Dec 16 13:30:45 2020] 5 locks held by btrfs/6366:
	[Wed Dec 16 13:30:45 2020]  #0: ffff888128318498 (sb_writers#13){.+.+}-{0:0}, at: btrfs_ioctl_balance+0x71/0x460
	[Wed Dec 16 13:30:45 2020]  #1: ffff88810d2f22c8 (&fs_info->delete_unused_bgs_mutex){+.+.}-{3:3}, at: btrfs_balance+0xa54/0x1900
	[Wed Dec 16 13:30:45 2020]  #2: ffff88810d2f08f0 (&fs_info->cleaner_mutex){+.+.}-{3:3}, at: btrfs_relocate_block_group+0x266/0x4c0
	[Wed Dec 16 13:30:45 2020]  #3: ffff8881283186b8 (sb_internal#2){.+.+}-{0:0}, at: btrfs_start_transaction+0x1e/0x20
	[Wed Dec 16 13:30:45 2020]  #4: ffff88810167c598 (btrfs-tree-02){++++}-{3:3}, at: __btrfs_tree_lock+0x29/0x190
	[Wed Dec 16 13:30:45 2020]
				   stack backtrace:
	[Wed Dec 16 13:30:45 2020] CPU: 0 PID: 6366 Comm: btrfs Tainted: G        W         5.10.0-39fbe74d1bbc-josef+ #1
	[Wed Dec 16 13:30:45 2020] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
	[Wed Dec 16 13:30:45 2020] Call Trace:
	[Wed Dec 16 13:30:45 2020]  dump_stack+0xbc/0xf9
	[Wed Dec 16 13:30:45 2020]  print_circular_bug.isra.42.cold.67+0x146/0x14b
	[Wed Dec 16 13:30:45 2020]  check_noncircular+0x219/0x250
	[Wed Dec 16 13:30:45 2020]  ? print_circular_bug.isra.42+0x230/0x230
	[Wed Dec 16 13:30:45 2020]  ? pvclock_clocksource_read+0xeb/0x190
	[Wed Dec 16 13:30:45 2020]  ? __kasan_check_write+0x14/0x20
	[Wed Dec 16 13:30:45 2020]  ? lockdep_lock+0xaa/0x160
	[Wed Dec 16 13:30:45 2020]  __lock_acquire+0x1ebb/0x2880
	[Wed Dec 16 13:30:45 2020]  ? register_lock_class+0x8f0/0x8f0
	[Wed Dec 16 13:30:45 2020]  ? rcu_read_lock_sched_held+0xa1/0xd0
	[Wed Dec 16 13:30:45 2020]  ? rcu_read_lock_bh_held+0xb0/0xb0
	[Wed Dec 16 13:30:45 2020]  lock_acquire+0x192/0x550
	[Wed Dec 16 13:30:45 2020]  ? __btrfs_tree_lock+0x29/0x190
	[Wed Dec 16 13:30:45 2020]  ? check_flags+0x30/0x30
	[Wed Dec 16 13:30:45 2020]  ? ___might_sleep+0x10f/0x1e0
	[Wed Dec 16 13:30:46 2020]  ? __might_sleep+0x71/0xe0
	[Wed Dec 16 13:30:46 2020]  down_write_nested+0xa6/0x2d0
	[Wed Dec 16 13:30:46 2020]  ? __btrfs_tree_lock+0x29/0x190
	[Wed Dec 16 13:30:46 2020]  ? btrfs_root_node+0x23/0x200
	[Wed Dec 16 13:30:46 2020]  ? _down_write_nest_lock+0x2c0/0x2c0
	[Wed Dec 16 13:30:46 2020]  ? rcu_read_lock_sched_held+0xd0/0xd0
	[Wed Dec 16 13:30:46 2020]  __btrfs_tree_lock+0x29/0x190
	[Wed Dec 16 13:30:46 2020]  btrfs_lock_root_node+0x5b/0x190
	[Wed Dec 16 13:30:46 2020]  btrfs_search_slot+0xc8d/0x1090
	[Wed Dec 16 13:30:46 2020]  ? __kasan_check_read+0x11/0x20
	[Wed Dec 16 13:30:46 2020]  ? check_flags.part.50+0x6c/0x1e0
	[Wed Dec 16 13:30:46 2020]  ? split_leaf+0xa20/0xa20
	[Wed Dec 16 13:30:46 2020]  ? _raw_spin_unlock+0x22/0x30
	[Wed Dec 16 13:30:46 2020]  ? release_extent_buffer+0x225/0x280
	[Wed Dec 16 13:30:46 2020]  ? __kasan_check_write+0x14/0x20
	[Wed Dec 16 13:30:46 2020]  ? free_extent_buffer.part.53+0x90/0x140
	[Wed Dec 16 13:30:46 2020]  ? free_extent_buffer+0x13/0x20
	[Wed Dec 16 13:30:46 2020]  replace_path.isra.36+0x8a2/0xee0
	[Wed Dec 16 13:30:46 2020]  ? describe_relocation.isra.30+0xf0/0xf0
	[Wed Dec 16 13:30:46 2020]  ? check_setget_bounds+0x2a/0x60
	[Wed Dec 16 13:30:46 2020]  ? btrfs_get_64+0x1e8/0x200
	[Wed Dec 16 13:30:46 2020]  ? btrfs_get_token_64+0x350/0x350
	[Wed Dec 16 13:30:46 2020]  ? memcpy+0x4d/0x60
	[Wed Dec 16 13:30:46 2020]  merge_reloc_root+0x58c/0xc10
	[Wed Dec 16 13:30:46 2020]  ? prepare_to_merge+0x660/0x660
	[Wed Dec 16 13:30:46 2020]  ? btrfs_lookup_fs_root+0x113/0x180
	[Wed Dec 16 13:30:46 2020]  ? btrfs_end_super_write+0x3c0/0x3c0
	[Wed Dec 16 13:30:46 2020]  ? mutex_lock_io_nested+0xc20/0xc20
	[Wed Dec 16 13:30:46 2020]  ? btrfs_get_root_ref+0x24a/0x470
	[Wed Dec 16 13:30:46 2020]  ? __mutex_unlock_slowpath+0xb2/0x400
	[Wed Dec 16 13:30:46 2020]  ? btrfs_find_highest_objectid+0x1b0/0x1b0
	[Wed Dec 16 13:30:46 2020]  ? btrfs_init_reloc_root+0x370/0x370
	[Wed Dec 16 13:30:46 2020]  ? __kasan_check_write+0x14/0x20
	[Wed Dec 16 13:30:46 2020]  merge_reloc_roots+0x1e6/0x4a0
	[Wed Dec 16 13:30:46 2020]  ? merge_reloc_root+0xc10/0xc10
	[Wed Dec 16 13:30:46 2020]  ? btrfs_block_rsv_release+0x3a7/0x620
	[Wed Dec 16 13:30:46 2020]  relocate_block_group+0x3d2/0x780
	[Wed Dec 16 13:30:46 2020]  ? btrfs_defrag_leaves+0x250/0x650
	[Wed Dec 16 13:30:46 2020]  ? merge_reloc_roots+0x4a0/0x4a0
	[Wed Dec 16 13:30:46 2020]  btrfs_relocate_block_group+0x26e/0x4c0
	[Wed Dec 16 13:30:46 2020]  btrfs_relocate_chunk+0x52/0x120
	[Wed Dec 16 13:30:46 2020]  btrfs_balance+0xe2e/0x1900
	[Wed Dec 16 13:30:46 2020]  ? pvclock_clocksource_read+0xeb/0x190
	[Wed Dec 16 13:30:46 2020]  ? btrfs_relocate_chunk+0x120/0x120
	[Wed Dec 16 13:30:46 2020]  ? lock_acquired+0x7f/0x5e0
	[Wed Dec 16 13:30:46 2020]  ? do_raw_spin_lock+0x1e0/0x1e0
	[Wed Dec 16 13:30:46 2020]  ? do_raw_spin_unlock+0xa8/0x140
	[Wed Dec 16 13:30:46 2020]  btrfs_ioctl_balance+0x1f9/0x460
	[Wed Dec 16 13:30:46 2020]  btrfs_ioctl+0xe9c/0x4360
	[Wed Dec 16 13:30:46 2020]  ? __kasan_check_read+0x11/0x20
	[Wed Dec 16 13:30:46 2020]  ? check_chain_key+0x1f4/0x2f0
	[Wed Dec 16 13:30:46 2020]  ? __asan_loadN+0xf/0x20
	[Wed Dec 16 13:30:46 2020]  ? btrfs_ioctl_get_supported_features+0x30/0x30
	[Wed Dec 16 13:30:46 2020]  ? kvm_sched_clock_read+0x18/0x30
	[Wed Dec 16 13:30:46 2020]  ? check_chain_key+0x1f4/0x2f0
	[Wed Dec 16 13:30:46 2020]  ? lock_downgrade+0x3f0/0x3f0
	[Wed Dec 16 13:30:46 2020]  ? handle_mm_fault+0xad6/0x2150
	[Wed Dec 16 13:30:46 2020]  ? do_vfs_ioctl+0xfc/0x9d0
	[Wed Dec 16 13:30:46 2020]  ? ioctl_file_clone+0xe0/0xe0
	[Wed Dec 16 13:30:46 2020]  ? check_flags.part.50+0x6c/0x1e0
	[Wed Dec 16 13:30:46 2020]  ? check_flags.part.50+0x6c/0x1e0
	[Wed Dec 16 13:30:46 2020]  ? check_flags+0x26/0x30
	[Wed Dec 16 13:30:46 2020]  ? lock_is_held_type+0xc3/0xf0
	[Wed Dec 16 13:30:46 2020]  ? syscall_enter_from_user_mode+0x1b/0x60
	[Wed Dec 16 13:30:46 2020]  ? do_syscall_64+0x13/0x80
	[Wed Dec 16 13:30:46 2020]  ? rcu_read_lock_sched_held+0xa1/0xd0
	[Wed Dec 16 13:30:46 2020]  ? __kasan_check_read+0x11/0x20
	[Wed Dec 16 13:30:46 2020]  ? __fget_light+0xae/0x110
	[Wed Dec 16 13:30:46 2020]  __x64_sys_ioctl+0xc3/0x100
	[Wed Dec 16 13:30:46 2020]  do_syscall_64+0x37/0x80
	[Wed Dec 16 13:30:46 2020]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
	[Wed Dec 16 13:30:46 2020] RIP: 0033:0x7f7c98b07427
	[Wed Dec 16 13:30:46 2020] Code: 00 00 90 48 8b 05 69 aa 0c 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 39 aa 0c 00 f7 d8 64 89 01 48
	[Wed Dec 16 13:30:46 2020] RSP: 002b:00007fff0cb87938 EFLAGS: 00000202 ORIG_RAX: 0000000000000010
	[Wed Dec 16 13:30:46 2020] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f7c98b07427
	[Wed Dec 16 13:30:46 2020] RDX: 00007fff0cb87948 RSI: 00000000c4009420 RDI: 0000000000000003
	[Wed Dec 16 13:30:46 2020] RBP: 0000000000000003 R08: 0000000000000003 R09: 0000000000000078
	[Wed Dec 16 13:30:46 2020] R10: fffffffffffff59d R11: 0000000000000202 R12: 0000000000000002
	[Wed Dec 16 13:30:46 2020] R13: 00007fff0cb89a3f R14: 0000557c21f8c119 R15: 0000000000000000
	[Wed Dec 16 13:31:03 2020] BTRFS info (device dm-0): found 2 extents, loops 2, stage: update data pointers
	[Wed Dec 16 13:31:59 2020] BTRFS info (device dm-0): relocating block group 981598208000 flags data
Zygo Blaxell Dec. 18, 2020, 4:25 p.m. UTC | #2
On Wed, Dec 16, 2020 at 02:56:03PM -0500, Zygo Blaxell wrote:
> On Wed, Dec 16, 2020 at 11:26:16AM -0500, Josef Bacik wrote:
> > v6->v7:
> > - Broke up the series into 3 series, 1 for cosmetic things, 1 for all the major
> >   issues (including those reported on v6 of this set), and this new set which is
> >   solely the error handling related patches for relocation.  It's still a lot of
> >   patches, sorry about that.
> 
> So far it lockdepped, but it is still running:
> 
> 	[Wed Dec 16 13:30:45 2020] irq event stamp: 5875656

...and now it's dead, looks like tree mod log strikes again:

	[145504.989768][ T3280] BTRFS info (device dm-0): found 13271 extents, loops 2, stage: update data pointers
	[145622.222898][ T3280] avg_delayed_ref_runtime = 743782, time = 772386615466, count = 1038457
	[145664.364729][ T4659] ------------[ cut here ]------------
	[145664.373144][ T4659] kernel BUG at fs/btrfs/ctree.c:1208!
	[145664.377059][ T4659] invalid opcode: 0000 [#1] SMP KASAN PTI
	[145664.379909][ T4659] CPU: 1 PID: 4659 Comm: crawl_258 Tainted: G        W         5.10.0-39fbe74d1bbc-josef+ #1
	[145664.383114][ T4659] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
	[145664.386201][ T4659] RIP: 0010:__tree_mod_log_rewind+0x3b1/0x3c0
	[145664.388238][ T4659] Code: 05 48 8d 74 10 65 ba 19 00 00 00 e8 49 e7 06 00 e9 a7 fd ff ff 4c 8d 7b 2c 4c 89 ff e8 f8 3e c9 ff 48 63 43 2c e9 a2 fe ff ff <0f> 0b 0f 0b 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48
	[145664.394705][ T4659] RSP: 0018:ffffc90001c870e8 EFLAGS: 00010297
	[145664.396777][ T4659] RAX: 0000000000000000 RBX: ffff888004946f00 RCX: ffffffff978792f6
	[145664.399482][ T4659] RDX: 0000000000000007 RSI: dffffc0000000000 RDI: ffff888004946f2c
	[145664.402117][ T4659] RBP: ffffc90001c87138 R08: 1ffff1100da2481c R09: ffffed100da2481c
	[145664.404763][ T4659] R10: ffff88806d1240d8 R11: ffffed100da2481b R12: 000000000000000b
	[145664.407401][ T4659] R13: ffff888062123000 R14: ffff88800f6a7a00 R15: ffff888004946f2c
	[145664.410047][ T4659] FS:  00007f0c92e1f700(0000) GS:ffff8881f5600000(0000) knlGS:0000000000000000
	[145664.413001][ T4659] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
	[145664.415177][ T4659] CR2: 00007fc867df6100 CR3: 00000001038e0004 CR4: 0000000000170ee0
	[145664.417822][ T4659] Call Trace:
	[145664.418935][ T4659]  btrfs_search_old_slot+0x265/0x10d0
	[145664.420747][ T4659]  ? lock_acquired+0xbb/0x5e0
	[145664.422335][ T4659]  ? btrfs_search_slot+0x1090/0x1090
	[145664.424102][ T4659]  ? free_extent_buffer.part.53+0xd7/0x140
	[145664.426077][ T4659]  ? free_extent_buffer+0x13/0x20
	[145664.427810][ T4659]  resolve_indirect_refs+0x3e9/0xfc0
	[145664.429610][ T4659]  ? down_write_nested+0x2d0/0x2d0
	[145664.431310][ T4659]  ? __kasan_check_read+0x11/0x20
	[145664.432998][ T4659]  ? __kasan_check_read+0x11/0x20
	[145664.434654][ T4659]  ? add_prelim_ref.part.12+0x150/0x150
	[145664.436511][ T4659]  ? lock_downgrade+0x3f0/0x3f0
	[145664.438140][ T4659]  ? __kasan_check_read+0x11/0x20
	[145664.439818][ T4659]  ? lock_acquired+0xbb/0x5e0
	[145664.441384][ T4659]  ? free_extent_buffer.part.53+0xb1/0x140
	[145664.443416][ T4659]  ? do_raw_spin_unlock+0xa8/0x140
	[145664.445231][ T4659]  ? rb_insert_color+0x310/0x360
	[145664.446928][ T4659]  ? prelim_ref_insert+0x12d/0x430
	[145664.448694][ T4659]  find_parent_nodes+0x5c3/0x1830
	[145664.450439][ T4659]  ? resolve_indirect_refs+0xfc0/0xfc0
	[145664.452355][ T4659]  ? iterate_inodes_from_logical+0x129/0x170
	[145664.454282][ T4659]  ? btrfs_ioctl+0x237e/0x4360
	[145664.457750][ T4659]  ? __x64_sys_ioctl+0xc3/0x100
	[145664.459969][ T4659]  ? do_syscall_64+0x37/0x80
	[145664.462279][ T4659]  ? entry_SYSCALL_64_after_hwframe+0x44/0xa9
	[145664.465078][ T4659]  ? resolve_indirect_refs+0xf90/0xfc0
	[145664.467560][ T4659]  ? __kasan_check_read+0x11/0x20
	[145664.471069][ T4659]  ? kasan_unpoison_shadow+0x35/0x50
	[145664.472861][ T4659]  ? trace_hardirqs_on+0x55/0x120
	[145664.474528][ T4659]  btrfs_find_all_roots_safe+0x142/0x1e0
	[145664.476661][ T4659]  ? find_parent_nodes+0x1830/0x1830
	[145664.478378][ T4659]  ? btrfs_inode_flags_to_xflags+0x50/0x50
	[145664.480274][ T4659]  iterate_extent_inodes+0x20e/0x580
	[145664.481995][ T4659]  ? tree_backref_for_extent+0x230/0x230
	[145664.483514][ T4659]  ? release_extent_buffer+0x225/0x280
	[145664.485246][ T4659]  ? read_extent_buffer+0xdd/0x110
	[145664.486916][ T4659]  ? lock_downgrade+0x3f0/0x3f0
	[145664.490569][ T4659]  ? __kasan_check_read+0x11/0x20
	[145664.492257][ T4659]  ? lock_acquired+0xbb/0x5e0
	[145664.493876][ T4659]  ? free_extent_buffer.part.53+0xb1/0x140
	[145664.495842][ T4659]  ? do_raw_spin_unlock+0xa8/0x140
	[145664.497540][ T4659]  ? _raw_spin_unlock+0x22/0x30
	[145664.499181][ T4659]  ? release_extent_buffer+0x225/0x280
	[145664.501022][ T4659]  iterate_inodes_from_logical+0x129/0x170
	[145664.502926][ T4659]  ? iterate_inodes_from_logical+0x129/0x170
	[145664.505126][ T4659]  ? btrfs_inode_flags_to_xflags+0x50/0x50
	[145664.507147][ T4659]  ? iterate_extent_inodes+0x580/0x580
	[145664.509104][ T4659]  ? __vmalloc_node+0x92/0xb0
	[145664.510806][ T4659]  ? init_data_container+0x34/0xb0
	[145664.512612][ T4659]  ? init_data_container+0x34/0xb0
	[145664.514452][ T4659]  ? kvmalloc_node+0x60/0x80
	[145664.516057][ T4659]  btrfs_ioctl_logical_to_ino+0x139/0x1e0
	[145664.518043][ T4659]  btrfs_ioctl+0x237e/0x4360
	[145664.519622][ T4659]  ? __kasan_check_write+0x14/0x20
	[145664.521403][ T4659]  ? mmput+0x3b/0x220
	[145664.522780][ T4659]  ? btrfs_ioctl_get_supported_features+0x30/0x30
	[145664.525000][ T4659]  ? __kasan_check_read+0x11/0x20
	[145664.526713][ T4659]  ? lock_release+0xc8/0x640
	[145664.528308][ T4659]  ? __might_fault+0x64/0xd0
	[145664.529894][ T4659]  ? __kasan_check_read+0x11/0x20
	[145664.531564][ T4659]  ? lock_downgrade+0x3f0/0x3f0
	[145664.533187][ T4659]  ? check_flags+0x30/0x30
	[145664.534701][ T4659]  ? check_flags+0x30/0x30
	[145664.536227][ T4659]  ? __kasan_check_read+0x11/0x20
	[145664.537952][ T4659]  ? lock_release+0xc8/0x640
	[145664.539497][ T4659]  ? do_vfs_ioctl+0xfc/0x9d0
	[145664.541020][ T4659]  ? __fget_files+0x151/0x250
	[145664.542641][ T4659]  ? ioctl_file_clone+0xe0/0xe0
	[145664.544304][ T4659]  ? lock_downgrade+0x3f0/0x3f0
	[145664.545917][ T4659]  ? check_flags+0x30/0x30
	[145664.547160][ T4659]  ? __kasan_check_read+0x11/0x20
	[145664.548782][ T4659]  ? lock_release+0xc8/0x640
	[145664.550404][ T4659]  ? __task_pid_nr_ns+0xd3/0x250
	[145664.553313][ T4659]  ? __kasan_check_read+0x11/0x20
	[145664.555669][ T4659]  ? __fget_files+0x170/0x250
	[145664.557893][ T4659]  ? __fget_light+0xf2/0x110
	[145664.560061][ T4659]  __x64_sys_ioctl+0xc3/0x100
	[145664.562178][ T4659]  do_syscall_64+0x37/0x80
	[145664.563689][ T4659]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
	[145664.565964][ T4659] RIP: 0033:0x7f0c94f19427
	[145664.567420][ T4659] Code: 00 00 90 48 8b 05 69 aa 0c 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 39 aa 0c 00 f7 d8 64 89 01 48
	[145664.574088][ T4659] RSP: 002b:00007f0c92e1cca8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
	[145664.577078][ T4659] RAX: ffffffffffffffda RBX: 00007f0c92e1cee0 RCX: 00007f0c94f19427
	[145664.579999][ T4659] RDX: 00007f0c92e1cee8 RSI: 00000000c038943b RDI: 0000000000000004
	[145664.582957][ T4659] RBP: 0000000001000000 R08: 0000000000000000 R09: 00007f0c92e1d0c0
	[145664.585592][ T4659] R10: 000055b378eb9c40 R11: 0000000000000246 R12: 0000000000000004
	[145664.588153][ T4659] R13: 00007f0c92e1cee8 R14: 00007f0c92e1cff0 R15: 00007f0c92e1cee0
	[145664.591569][ T4659] Modules linked in:
	[145664.592962][ T4659] ---[ end trace 21a31c4983711212 ]---
	[145664.594870][ T4659] RIP: 0010:__tree_mod_log_rewind+0x3b1/0x3c0
	[145664.597141][ T4659] Code: 05 48 8d 74 10 65 ba 19 00 00 00 e8 49 e7 06 00 e9 a7 fd ff ff 4c 8d 7b 2c 4c 89 ff e8 f8 3e c9 ff 48 63 43 2c e9 a2 fe ff ff <0f> 0b 0f 0b 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48
	[145664.604814][ T4659] RSP: 0018:ffffc90001c870e8 EFLAGS: 00010297
	[145664.606882][ T4659] RAX: 0000000000000000 RBX: ffff888004946f00 RCX: ffffffff978792f6
	[145664.611246][ T4659] RDX: 0000000000000007 RSI: dffffc0000000000 RDI: ffff888004946f2c
	[145664.613859][ T4659] RBP: ffffc90001c87138 R08: 1ffff1100da2481c R09: ffffed100da2481c
	[145664.617160][ T4659] R10: ffff88806d1240d8 R11: ffffed100da2481b R12: 000000000000000b
	[145664.619988][ T4659] R13: ffff888062123000 R14: ffff88800f6a7a00 R15: ffff888004946f2c
	[145664.622597][ T4659] FS:  00007f0c92e1f700(0000) GS:ffff8881f5600000(0000) knlGS:0000000000000000
	[145664.625590][ T4659] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
	[145664.627890][ T4659] CR2: 00007fc867df6100 CR3: 00000001038e0004 CR4: 0000000000170ee0
	[145664.630639][ T4659] note: crawl_258[4659] exited with preempt_count 1