diff mbox series

Btrfs: fix deadlock on tree root leaf when finding free extent

Message ID 20181022090946.1150-1-fdmanana@kernel.org (mailing list archive)
State New, archived
Headers show
Series Btrfs: fix deadlock on tree root leaf when finding free extent | expand

Commit Message

Filipe Manana Oct. 22, 2018, 9:09 a.m. UTC
From: Filipe Manana <fdmanana@suse.com>

When we are writing out a free space cache, during the transaction commit
phase, we can end up in a deadlock which results in a stack trace like the
following:

 schedule+0x28/0x80
 btrfs_tree_read_lock+0x8e/0x120 [btrfs]
 ? finish_wait+0x80/0x80
 btrfs_read_lock_root_node+0x2f/0x40 [btrfs]
 btrfs_search_slot+0xf6/0x9f0 [btrfs]
 ? evict_refill_and_join+0xd0/0xd0 [btrfs]
 ? inode_insert5+0x119/0x190
 btrfs_lookup_inode+0x3a/0xc0 [btrfs]
 ? kmem_cache_alloc+0x166/0x1d0
 btrfs_iget+0x113/0x690 [btrfs]
 __lookup_free_space_inode+0xd8/0x150 [btrfs]
 lookup_free_space_inode+0x5b/0xb0 [btrfs]
 load_free_space_cache+0x7c/0x170 [btrfs]
 ? cache_block_group+0x72/0x3b0 [btrfs]
 cache_block_group+0x1b3/0x3b0 [btrfs]
 ? finish_wait+0x80/0x80
 find_free_extent+0x799/0x1010 [btrfs]
 btrfs_reserve_extent+0x9b/0x180 [btrfs]
 btrfs_alloc_tree_block+0x1b3/0x4f0 [btrfs]
 __btrfs_cow_block+0x11d/0x500 [btrfs]
 btrfs_cow_block+0xdc/0x180 [btrfs]
 btrfs_search_slot+0x3bd/0x9f0 [btrfs]
 btrfs_lookup_inode+0x3a/0xc0 [btrfs]
 ? kmem_cache_alloc+0x166/0x1d0
 btrfs_update_inode_item+0x46/0x100 [btrfs]
 cache_save_setup+0xe4/0x3a0 [btrfs]
 btrfs_start_dirty_block_groups+0x1be/0x480 [btrfs]
 btrfs_commit_transaction+0xcb/0x8b0 [btrfs]

At cache_save_setup() we need to update the inode item of a block group's
cache which is located in the tree root (fs_info->tree_root), which means
that it may result in COWing a leaf from that tree. If that happens we
need to find a free metadata extent and while looking for one, if we find
a block group which was not cached yet we attempt to load its cache by
calling cache_block_group(). However this function will try to load the
inode of the free space cache, which requires finding the matching inode
item in the tree root - if that inode item is located in the same leaf as
the inode item of the space cache we are updating at cache_save_setup(),
we end up in a deadlock, since we try to obtain a read lock on the same
extent buffer that we previously write locked.

So fix this by skipping the loading of free space caches of any block
groups that are not yet cached (rare cases) if we are updating the inode
of a free space cache. This is a rare case and its downside is failure to
find a free extent (return -ENOSPC) when all the already cached block
groups have no free extents.

Reported-by: Andrew Nelson <andrew.s.nelson@gmail.com>
Link: https://lore.kernel.org/linux-btrfs/CAPTELenq9x5KOWuQ+fa7h1r3nsJG8vyiTH8+ifjURc_duHh2Wg@mail.gmail.com/
Fixes: 9d66e233c704 ("Btrfs: load free space cache if it exists")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/ctree.h       |  3 +++
 fs/btrfs/disk-io.c     |  2 ++
 fs/btrfs/extent-tree.c | 22 +++++++++++++++++++++-
 3 files changed, 26 insertions(+), 1 deletion(-)

Comments

Filipe Manana Oct. 22, 2018, 9:53 a.m. UTC | #1
On Mon, Oct 22, 2018 at 10:10 AM <fdmanana@kernel.org> wrote:
>
> From: Filipe Manana <fdmanana@suse.com>
>
> When we are writing out a free space cache, during the transaction commit
> phase, we can end up in a deadlock which results in a stack trace like the
> following:
>
>  schedule+0x28/0x80
>  btrfs_tree_read_lock+0x8e/0x120 [btrfs]
>  ? finish_wait+0x80/0x80
>  btrfs_read_lock_root_node+0x2f/0x40 [btrfs]
>  btrfs_search_slot+0xf6/0x9f0 [btrfs]
>  ? evict_refill_and_join+0xd0/0xd0 [btrfs]
>  ? inode_insert5+0x119/0x190
>  btrfs_lookup_inode+0x3a/0xc0 [btrfs]
>  ? kmem_cache_alloc+0x166/0x1d0
>  btrfs_iget+0x113/0x690 [btrfs]
>  __lookup_free_space_inode+0xd8/0x150 [btrfs]
>  lookup_free_space_inode+0x5b/0xb0 [btrfs]
>  load_free_space_cache+0x7c/0x170 [btrfs]
>  ? cache_block_group+0x72/0x3b0 [btrfs]
>  cache_block_group+0x1b3/0x3b0 [btrfs]
>  ? finish_wait+0x80/0x80
>  find_free_extent+0x799/0x1010 [btrfs]
>  btrfs_reserve_extent+0x9b/0x180 [btrfs]
>  btrfs_alloc_tree_block+0x1b3/0x4f0 [btrfs]
>  __btrfs_cow_block+0x11d/0x500 [btrfs]
>  btrfs_cow_block+0xdc/0x180 [btrfs]
>  btrfs_search_slot+0x3bd/0x9f0 [btrfs]
>  btrfs_lookup_inode+0x3a/0xc0 [btrfs]
>  ? kmem_cache_alloc+0x166/0x1d0
>  btrfs_update_inode_item+0x46/0x100 [btrfs]
>  cache_save_setup+0xe4/0x3a0 [btrfs]
>  btrfs_start_dirty_block_groups+0x1be/0x480 [btrfs]
>  btrfs_commit_transaction+0xcb/0x8b0 [btrfs]
>
> At cache_save_setup() we need to update the inode item of a block group's
> cache which is located in the tree root (fs_info->tree_root), which means
> that it may result in COWing a leaf from that tree. If that happens we
> need to find a free metadata extent and while looking for one, if we find
> a block group which was not cached yet we attempt to load its cache by
> calling cache_block_group(). However this function will try to load the
> inode of the free space cache, which requires finding the matching inode
> item in the tree root - if that inode item is located in the same leaf as
> the inode item of the space cache we are updating at cache_save_setup(),
> we end up in a deadlock, since we try to obtain a read lock on the same
> extent buffer that we previously write locked.
>
> So fix this by skipping the loading of free space caches of any block
> groups that are not yet cached (rare cases) if we are updating the inode
> of a free space cache. This is a rare case and its downside is failure to
> find a free extent (return -ENOSPC) when all the already cached block
> groups have no free extents.
>
> Reported-by: Andrew Nelson <andrew.s.nelson@gmail.com>
> Link: https://lore.kernel.org/linux-btrfs/CAPTELenq9x5KOWuQ+fa7h1r3nsJG8vyiTH8+ifjURc_duHh2Wg@mail.gmail.com/
> Fixes: 9d66e233c704 ("Btrfs: load free space cache if it exists")
> Signed-off-by: Filipe Manana <fdmanana@suse.com>

Andrew Nelson <andrew.s.nelson@gmail.com>


> ---
>  fs/btrfs/ctree.h       |  3 +++
>  fs/btrfs/disk-io.c     |  2 ++
>  fs/btrfs/extent-tree.c | 22 +++++++++++++++++++++-
>  3 files changed, 26 insertions(+), 1 deletion(-)
>
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index 2cddfe7806a4..d23ee26eb17d 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -1121,6 +1121,9 @@ struct btrfs_fs_info {
>         u32 sectorsize;
>         u32 stripesize;
>
> +       /* The task currently updating a free space cache inode item. */
> +       struct task_struct *space_cache_updater;
> +
>  #ifdef CONFIG_BTRFS_FS_REF_VERIFY
>         spinlock_t ref_verify_lock;
>         struct rb_root block_tree;
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index 05dc3c17cb62..aa5e9a91e560 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -2782,6 +2782,8 @@ int open_ctree(struct super_block *sb,
>         fs_info->sectorsize = 4096;
>         fs_info->stripesize = 4096;
>
> +       fs_info->space_cache_updater = NULL;
> +
>         ret = btrfs_alloc_stripe_hash_table(fs_info);
>         if (ret) {
>                 err = ret;
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 577878324799..e93040449771 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -3364,7 +3364,9 @@ static int cache_save_setup(struct btrfs_block_group_cache *block_group,
>          * time.
>          */
>         BTRFS_I(inode)->generation = 0;
> +       fs_info->space_cache_updater = current;
>         ret = btrfs_update_inode(trans, root, inode);
> +       fs_info->space_cache_updater = NULL;
>         if (ret) {
>                 /*
>                  * So theoretically we could recover from this, simply set the
> @@ -7366,7 +7368,25 @@ static noinline int find_free_extent(struct btrfs_fs_info *fs_info,
>
>  have_block_group:
>                 cached = block_group_cache_done(block_group);
> -               if (unlikely(!cached)) {
> +               /*
> +                * If we are updating the inode of a free space cache, we can
> +                * not start the caching of any block group because we could
> +                * deadlock on an extent buffer of the root tree.
> +                * At cache_save_setup() we update the inode item of a free
> +                * space cache, so we may need to COW a leaf of the root tree,
> +                * which implies finding a free metadata extent. So if when
> +                * searching for such an extent we find a block group that was
> +                * not yet cached (which is unlikely), we can not start loading
> +                * or building its free space cache because that implies reading
> +                * its inode from disk (load_free_space_cache()) which implies
> +                * searching the root tree for its inode item, which can be
> +                * located in the same leaf that we previously locked at
> +                * cache_save_setup() for updating the inode item of the former
> +                * free space cache, therefore leading to an attempt to lock the
> +                * same leaf twice.
> +                */
> +               if (unlikely(!cached) &&
> +                   fs_info->space_cache_updater != current) {
>                         have_caching_bg = true;
>                         ret = cache_block_group(block_group, 0);
>                         BUG_ON(ret < 0);
> --
> 2.11.0
>
Filipe Manana Oct. 22, 2018, 9:54 a.m. UTC | #2
On Mon, Oct 22, 2018 at 10:10 AM <fdmanana@kernel.org> wrote:
>
> From: Filipe Manana <fdmanana@suse.com>
>
> When we are writing out a free space cache, during the transaction commit
> phase, we can end up in a deadlock which results in a stack trace like the
> following:
>
>  schedule+0x28/0x80
>  btrfs_tree_read_lock+0x8e/0x120 [btrfs]
>  ? finish_wait+0x80/0x80
>  btrfs_read_lock_root_node+0x2f/0x40 [btrfs]
>  btrfs_search_slot+0xf6/0x9f0 [btrfs]
>  ? evict_refill_and_join+0xd0/0xd0 [btrfs]
>  ? inode_insert5+0x119/0x190
>  btrfs_lookup_inode+0x3a/0xc0 [btrfs]
>  ? kmem_cache_alloc+0x166/0x1d0
>  btrfs_iget+0x113/0x690 [btrfs]
>  __lookup_free_space_inode+0xd8/0x150 [btrfs]
>  lookup_free_space_inode+0x5b/0xb0 [btrfs]
>  load_free_space_cache+0x7c/0x170 [btrfs]
>  ? cache_block_group+0x72/0x3b0 [btrfs]
>  cache_block_group+0x1b3/0x3b0 [btrfs]
>  ? finish_wait+0x80/0x80
>  find_free_extent+0x799/0x1010 [btrfs]
>  btrfs_reserve_extent+0x9b/0x180 [btrfs]
>  btrfs_alloc_tree_block+0x1b3/0x4f0 [btrfs]
>  __btrfs_cow_block+0x11d/0x500 [btrfs]
>  btrfs_cow_block+0xdc/0x180 [btrfs]
>  btrfs_search_slot+0x3bd/0x9f0 [btrfs]
>  btrfs_lookup_inode+0x3a/0xc0 [btrfs]
>  ? kmem_cache_alloc+0x166/0x1d0
>  btrfs_update_inode_item+0x46/0x100 [btrfs]
>  cache_save_setup+0xe4/0x3a0 [btrfs]
>  btrfs_start_dirty_block_groups+0x1be/0x480 [btrfs]
>  btrfs_commit_transaction+0xcb/0x8b0 [btrfs]
>
> At cache_save_setup() we need to update the inode item of a block group's
> cache which is located in the tree root (fs_info->tree_root), which means
> that it may result in COWing a leaf from that tree. If that happens we
> need to find a free metadata extent and while looking for one, if we find
> a block group which was not cached yet we attempt to load its cache by
> calling cache_block_group(). However this function will try to load the
> inode of the free space cache, which requires finding the matching inode
> item in the tree root - if that inode item is located in the same leaf as
> the inode item of the space cache we are updating at cache_save_setup(),
> we end up in a deadlock, since we try to obtain a read lock on the same
> extent buffer that we previously write locked.
>
> So fix this by skipping the loading of free space caches of any block
> groups that are not yet cached (rare cases) if we are updating the inode
> of a free space cache. This is a rare case and its downside is failure to
> find a free extent (return -ENOSPC) when all the already cached block
> groups have no free extents.
>
> Reported-by: Andrew Nelson <andrew.s.nelson@gmail.com>
> Link: https://lore.kernel.org/linux-btrfs/CAPTELenq9x5KOWuQ+fa7h1r3nsJG8vyiTH8+ifjURc_duHh2Wg@mail.gmail.com/
> Fixes: 9d66e233c704 ("Btrfs: load free space cache if it exists")
> Signed-off-by: Filipe Manana <fdmanana@suse.com>

Tested-by: Andrew Nelson <andrew.s.nelson@gmail.com>

> ---
>  fs/btrfs/ctree.h       |  3 +++
>  fs/btrfs/disk-io.c     |  2 ++
>  fs/btrfs/extent-tree.c | 22 +++++++++++++++++++++-
>  3 files changed, 26 insertions(+), 1 deletion(-)
>
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index 2cddfe7806a4..d23ee26eb17d 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -1121,6 +1121,9 @@ struct btrfs_fs_info {
>         u32 sectorsize;
>         u32 stripesize;
>
> +       /* The task currently updating a free space cache inode item. */
> +       struct task_struct *space_cache_updater;
> +
>  #ifdef CONFIG_BTRFS_FS_REF_VERIFY
>         spinlock_t ref_verify_lock;
>         struct rb_root block_tree;
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index 05dc3c17cb62..aa5e9a91e560 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -2782,6 +2782,8 @@ int open_ctree(struct super_block *sb,
>         fs_info->sectorsize = 4096;
>         fs_info->stripesize = 4096;
>
> +       fs_info->space_cache_updater = NULL;
> +
>         ret = btrfs_alloc_stripe_hash_table(fs_info);
>         if (ret) {
>                 err = ret;
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 577878324799..e93040449771 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -3364,7 +3364,9 @@ static int cache_save_setup(struct btrfs_block_group_cache *block_group,
>          * time.
>          */
>         BTRFS_I(inode)->generation = 0;
> +       fs_info->space_cache_updater = current;
>         ret = btrfs_update_inode(trans, root, inode);
> +       fs_info->space_cache_updater = NULL;
>         if (ret) {
>                 /*
>                  * So theoretically we could recover from this, simply set the
> @@ -7366,7 +7368,25 @@ static noinline int find_free_extent(struct btrfs_fs_info *fs_info,
>
>  have_block_group:
>                 cached = block_group_cache_done(block_group);
> -               if (unlikely(!cached)) {
> +               /*
> +                * If we are updating the inode of a free space cache, we can
> +                * not start the caching of any block group because we could
> +                * deadlock on an extent buffer of the root tree.
> +                * At cache_save_setup() we update the inode item of a free
> +                * space cache, so we may need to COW a leaf of the root tree,
> +                * which implies finding a free metadata extent. So if when
> +                * searching for such an extent we find a block group that was
> +                * not yet cached (which is unlikely), we can not start loading
> +                * or building its free space cache because that implies reading
> +                * its inode from disk (load_free_space_cache()) which implies
> +                * searching the root tree for its inode item, which can be
> +                * located in the same leaf that we previously locked at
> +                * cache_save_setup() for updating the inode item of the former
> +                * free space cache, therefore leading to an attempt to lock the
> +                * same leaf twice.
> +                */
> +               if (unlikely(!cached) &&
> +                   fs_info->space_cache_updater != current) {
>                         have_caching_bg = true;
>                         ret = cache_block_group(block_group, 0);
>                         BUG_ON(ret < 0);
> --
> 2.11.0
>
Josef Bacik Oct. 22, 2018, 6:07 p.m. UTC | #3
On Mon, Oct 22, 2018 at 10:09:46AM +0100, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
> 
> When we are writing out a free space cache, during the transaction commit
> phase, we can end up in a deadlock which results in a stack trace like the
> following:
> 
>  schedule+0x28/0x80
>  btrfs_tree_read_lock+0x8e/0x120 [btrfs]
>  ? finish_wait+0x80/0x80
>  btrfs_read_lock_root_node+0x2f/0x40 [btrfs]
>  btrfs_search_slot+0xf6/0x9f0 [btrfs]
>  ? evict_refill_and_join+0xd0/0xd0 [btrfs]
>  ? inode_insert5+0x119/0x190
>  btrfs_lookup_inode+0x3a/0xc0 [btrfs]
>  ? kmem_cache_alloc+0x166/0x1d0
>  btrfs_iget+0x113/0x690 [btrfs]
>  __lookup_free_space_inode+0xd8/0x150 [btrfs]
>  lookup_free_space_inode+0x5b/0xb0 [btrfs]
>  load_free_space_cache+0x7c/0x170 [btrfs]
>  ? cache_block_group+0x72/0x3b0 [btrfs]
>  cache_block_group+0x1b3/0x3b0 [btrfs]
>  ? finish_wait+0x80/0x80
>  find_free_extent+0x799/0x1010 [btrfs]
>  btrfs_reserve_extent+0x9b/0x180 [btrfs]
>  btrfs_alloc_tree_block+0x1b3/0x4f0 [btrfs]
>  __btrfs_cow_block+0x11d/0x500 [btrfs]
>  btrfs_cow_block+0xdc/0x180 [btrfs]
>  btrfs_search_slot+0x3bd/0x9f0 [btrfs]
>  btrfs_lookup_inode+0x3a/0xc0 [btrfs]
>  ? kmem_cache_alloc+0x166/0x1d0
>  btrfs_update_inode_item+0x46/0x100 [btrfs]
>  cache_save_setup+0xe4/0x3a0 [btrfs]
>  btrfs_start_dirty_block_groups+0x1be/0x480 [btrfs]
>  btrfs_commit_transaction+0xcb/0x8b0 [btrfs]
> 
> At cache_save_setup() we need to update the inode item of a block group's
> cache which is located in the tree root (fs_info->tree_root), which means
> that it may result in COWing a leaf from that tree. If that happens we
> need to find a free metadata extent and while looking for one, if we find
> a block group which was not cached yet we attempt to load its cache by
> calling cache_block_group(). However this function will try to load the
> inode of the free space cache, which requires finding the matching inode
> item in the tree root - if that inode item is located in the same leaf as
> the inode item of the space cache we are updating at cache_save_setup(),
> we end up in a deadlock, since we try to obtain a read lock on the same
> extent buffer that we previously write locked.
> 
> So fix this by skipping the loading of free space caches of any block
> groups that are not yet cached (rare cases) if we are updating the inode
> of a free space cache. This is a rare case and its downside is failure to
> find a free extent (return -ENOSPC) when all the already cached block
> groups have no free extents.
> 

Actually isn't this a problem for anything that tries to allocate an extent
while in the tree_root?  Like we snapshot or make a subvolume or anything?  We
should just disallow if root == tree_root.  But even then we only need to do
this if we're using SPACE_CACHE, using the ye-olde caching or the free space
tree are both ok.  Let's just limit it to those cases.  Thanks,

Josef
Filipe Manana Oct. 22, 2018, 6:51 p.m. UTC | #4
On Mon, Oct 22, 2018 at 7:07 PM Josef Bacik <josef@toxicpanda.com> wrote:
>
> On Mon, Oct 22, 2018 at 10:09:46AM +0100, fdmanana@kernel.org wrote:
> > From: Filipe Manana <fdmanana@suse.com>
> >
> > When we are writing out a free space cache, during the transaction commit
> > phase, we can end up in a deadlock which results in a stack trace like the
> > following:
> >
> >  schedule+0x28/0x80
> >  btrfs_tree_read_lock+0x8e/0x120 [btrfs]
> >  ? finish_wait+0x80/0x80
> >  btrfs_read_lock_root_node+0x2f/0x40 [btrfs]
> >  btrfs_search_slot+0xf6/0x9f0 [btrfs]
> >  ? evict_refill_and_join+0xd0/0xd0 [btrfs]
> >  ? inode_insert5+0x119/0x190
> >  btrfs_lookup_inode+0x3a/0xc0 [btrfs]
> >  ? kmem_cache_alloc+0x166/0x1d0
> >  btrfs_iget+0x113/0x690 [btrfs]
> >  __lookup_free_space_inode+0xd8/0x150 [btrfs]
> >  lookup_free_space_inode+0x5b/0xb0 [btrfs]
> >  load_free_space_cache+0x7c/0x170 [btrfs]
> >  ? cache_block_group+0x72/0x3b0 [btrfs]
> >  cache_block_group+0x1b3/0x3b0 [btrfs]
> >  ? finish_wait+0x80/0x80
> >  find_free_extent+0x799/0x1010 [btrfs]
> >  btrfs_reserve_extent+0x9b/0x180 [btrfs]
> >  btrfs_alloc_tree_block+0x1b3/0x4f0 [btrfs]
> >  __btrfs_cow_block+0x11d/0x500 [btrfs]
> >  btrfs_cow_block+0xdc/0x180 [btrfs]
> >  btrfs_search_slot+0x3bd/0x9f0 [btrfs]
> >  btrfs_lookup_inode+0x3a/0xc0 [btrfs]
> >  ? kmem_cache_alloc+0x166/0x1d0
> >  btrfs_update_inode_item+0x46/0x100 [btrfs]
> >  cache_save_setup+0xe4/0x3a0 [btrfs]
> >  btrfs_start_dirty_block_groups+0x1be/0x480 [btrfs]
> >  btrfs_commit_transaction+0xcb/0x8b0 [btrfs]
> >
> > At cache_save_setup() we need to update the inode item of a block group's
> > cache which is located in the tree root (fs_info->tree_root), which means
> > that it may result in COWing a leaf from that tree. If that happens we
> > need to find a free metadata extent and while looking for one, if we find
> > a block group which was not cached yet we attempt to load its cache by
> > calling cache_block_group(). However this function will try to load the
> > inode of the free space cache, which requires finding the matching inode
> > item in the tree root - if that inode item is located in the same leaf as
> > the inode item of the space cache we are updating at cache_save_setup(),
> > we end up in a deadlock, since we try to obtain a read lock on the same
> > extent buffer that we previously write locked.
> >
> > So fix this by skipping the loading of free space caches of any block
> > groups that are not yet cached (rare cases) if we are updating the inode
> > of a free space cache. This is a rare case and its downside is failure to
> > find a free extent (return -ENOSPC) when all the already cached block
> > groups have no free extents.
> >
>
> Actually isn't this a problem for anything that tries to allocate an extent
> while in the tree_root?  Like we snapshot or make a subvolume or anything?

Indeed. Initially I considered making it more generic (like the recent
fix for deadlock when cowing from extent/chunk/device tree) but I
totally forgot about the other cases like you mentioned.

>  We
> should just disallow if root == tree_root.  But even then we only need to do
> this if we're using SPACE_CACHE, using the ye-olde caching or the free space
> tree are both ok.  Let's just limit it to those cases.  Thanks,

Yep, makes all sense.

Thanks! V2 sent out.

>
> Josef
diff mbox series

Patch

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 2cddfe7806a4..d23ee26eb17d 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1121,6 +1121,9 @@  struct btrfs_fs_info {
 	u32 sectorsize;
 	u32 stripesize;
 
+	/* The task currently updating a free space cache inode item. */
+	struct task_struct *space_cache_updater;
+
 #ifdef CONFIG_BTRFS_FS_REF_VERIFY
 	spinlock_t ref_verify_lock;
 	struct rb_root block_tree;
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 05dc3c17cb62..aa5e9a91e560 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -2782,6 +2782,8 @@  int open_ctree(struct super_block *sb,
 	fs_info->sectorsize = 4096;
 	fs_info->stripesize = 4096;
 
+	fs_info->space_cache_updater = NULL;
+
 	ret = btrfs_alloc_stripe_hash_table(fs_info);
 	if (ret) {
 		err = ret;
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 577878324799..e93040449771 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -3364,7 +3364,9 @@  static int cache_save_setup(struct btrfs_block_group_cache *block_group,
 	 * time.
 	 */
 	BTRFS_I(inode)->generation = 0;
+	fs_info->space_cache_updater = current;
 	ret = btrfs_update_inode(trans, root, inode);
+	fs_info->space_cache_updater = NULL;
 	if (ret) {
 		/*
 		 * So theoretically we could recover from this, simply set the
@@ -7366,7 +7368,25 @@  static noinline int find_free_extent(struct btrfs_fs_info *fs_info,
 
 have_block_group:
 		cached = block_group_cache_done(block_group);
-		if (unlikely(!cached)) {
+		/*
+		 * If we are updating the inode of a free space cache, we can
+		 * not start the caching of any block group because we could
+		 * deadlock on an extent buffer of the root tree.
+		 * At cache_save_setup() we update the inode item of a free
+		 * space cache, so we may need to COW a leaf of the root tree,
+		 * which implies finding a free metadata extent. So if when
+		 * searching for such an extent we find a block group that was
+		 * not yet cached (which is unlikely), we can not start loading
+		 * or building its free space cache because that implies reading
+		 * its inode from disk (load_free_space_cache()) which implies
+		 * searching the root tree for its inode item, which can be
+		 * located in the same leaf that we previously locked at
+		 * cache_save_setup() for updating the inode item of the former
+		 * free space cache, therefore leading to an attempt to lock the
+		 * same leaf twice.
+		 */
+		if (unlikely(!cached) &&
+		    fs_info->space_cache_updater != current) {
 			have_caching_bg = true;
 			ret = cache_block_group(block_group, 0);
 			BUG_ON(ret < 0);