diff mbox series

Btrfs: fix race between block group removal and block group allocation

Message ID 20190612100542.1848-1-fdmanana@kernel.org (mailing list archive)
State New, archived
Headers show
Series Btrfs: fix race between block group removal and block group allocation | expand

Commit Message

Filipe Manana June 12, 2019, 10:05 a.m. UTC
From: Filipe Manana <fdmanana@suse.com>

If a task is removing the block group that currently has the highest start
offset amongst all existing block groups, there is a short time window
where it races with a concurrent block group allocation, resulting in a
transaction abort with an error code of EEXIST.

The following diagram explains the race in detail:

      Task A                                                        Task B

 btrfs_remove_block_group(bg offset X)

   remove_extent_mapping(em offset X)
     -> removes extent map X from the
        tree of extent maps
        (fs_info->mapping_tree), so the
        next call to find_next_chunk()
        will return offset X

                                                   btrfs_alloc_chunk()
                                                     find_next_chunk()
                                                       --> returns offset X

                                                     __btrfs_alloc_chunk(offset X)
                                                       btrfs_make_block_group()
                                                         btrfs_create_block_group_cache()
                                                           --> creates btrfs_block_group_cache
                                                               object with a key corresponding
                                                               to the block group item in the
                                                               extent, the key is:
                                                               (offset X, BTRFS_BLOCK_GROUP_ITEM_KEY, 1G)

                                                         --> adds the btrfs_block_group_cache object
                                                             to the list new_bgs of the transaction
                                                             handle

                                                   btrfs_end_transaction(trans handle)
                                                     __btrfs_end_transaction()
                                                       btrfs_create_pending_block_groups()
                                                         --> sees the new btrfs_block_group_cache
                                                             in the new_bgs list of the transaction
                                                             handle
                                                         --> its call to btrfs_insert_item() fails
                                                             with -EEXIST when attempting to insert
                                                             the block group item key
                                                             (offset X, BTRFS_BLOCK_GROUP_ITEM_KEY, 1G)
                                                             because task A has not removed that key yet
                                                         --> aborts the running transaction with
                                                             error -EEXIST

   btrfs_del_item()
     -> removes the block group's key from
        the extent tree, key is
        (offset X, BTRFS_BLOCK_GROUP_ITEM_KEY, 1G)

A sample transaction abort trace:

  [78912.403537] ------------[ cut here ]------------
  [78912.403811] BTRFS: Transaction aborted (error -17)
  [78912.404082] WARNING: CPU: 2 PID: 20465 at fs/btrfs/extent-tree.c:10551 btrfs_create_pending_block_groups+0x196/0x250 [btrfs]
  (...)
  [78912.405642] CPU: 2 PID: 20465 Comm: btrfs Tainted: G        W         5.0.0-btrfs-next-46 #1
  [78912.405941] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626ccb91-prebuilt.qemu-project.org 04/01/2014
  [78912.406586] RIP: 0010:btrfs_create_pending_block_groups+0x196/0x250 [btrfs]
  (...)
  [78912.407636] RSP: 0018:ffff9d3d4b7e3b08 EFLAGS: 00010282
  [78912.407997] RAX: 0000000000000000 RBX: ffff90959a3796f0 RCX: 0000000000000006
  [78912.408369] RDX: 0000000000000007 RSI: 0000000000000001 RDI: ffff909636b16860
  [78912.408746] RBP: ffff909626758a58 R08: 0000000000000000 R09: 0000000000000000
  [78912.409144] R10: ffff9095ff462400 R11: 0000000000000000 R12: ffff90959a379588
  [78912.409521] R13: ffff909626758ab0 R14: ffff9095036c0000 R15: ffff9095299e1158
  [78912.409899] FS:  00007f387f16f700(0000) GS:ffff909636b00000(0000) knlGS:0000000000000000
  [78912.410285] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  [78912.410673] CR2: 00007f429fc87cbc CR3: 000000014440a004 CR4: 00000000003606e0
  [78912.411095] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  [78912.411496] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
  [78912.411898] Call Trace:
  [78912.412318]  __btrfs_end_transaction+0x5b/0x1c0 [btrfs]
  [78912.412746]  btrfs_inc_block_group_ro+0xcf/0x160 [btrfs]
  [78912.413179]  scrub_enumerate_chunks+0x188/0x5b0 [btrfs]
  [78912.413622]  ? __mutex_unlock_slowpath+0x100/0x2a0
  [78912.414078]  btrfs_scrub_dev+0x2ef/0x720 [btrfs]
  [78912.414535]  ? __sb_start_write+0xd4/0x1c0
  [78912.414963]  ? mnt_want_write_file+0x24/0x50
  [78912.415403]  btrfs_ioctl+0x17fb/0x3120 [btrfs]
  [78912.415832]  ? lock_acquire+0xa6/0x190
  [78912.416256]  ? do_vfs_ioctl+0xa2/0x6f0
  [78912.416685]  ? btrfs_ioctl_get_supported_features+0x30/0x30 [btrfs]
  [78912.417116]  do_vfs_ioctl+0xa2/0x6f0
  [78912.417534]  ? __fget+0x113/0x200
  [78912.417954]  ksys_ioctl+0x70/0x80
  [78912.418369]  __x64_sys_ioctl+0x16/0x20
  [78912.418812]  do_syscall_64+0x60/0x1b0
  [78912.419231]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
  [78912.419644] RIP: 0033:0x7f3880252dd7
  (...)
  [78912.420957] RSP: 002b:00007f387f16ed68 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
  [78912.421426] RAX: ffffffffffffffda RBX: 000055f5becc1df0 RCX: 00007f3880252dd7
  [78912.421889] RDX: 000055f5becc1df0 RSI: 00000000c400941b RDI: 0000000000000003
  [78912.422354] RBP: 0000000000000000 R08: 00007f387f16f700 R09: 0000000000000000
  [78912.422790] R10: 00007f387f16f700 R11: 0000000000000246 R12: 0000000000000000
  [78912.423202] R13: 00007ffda49c266f R14: 0000000000000000 R15: 00007f388145e040
  [78912.425505] ---[ end trace eb9bfe7c426fc4d3 ]---

Fix this by calling remove_extent_mapping(), at btrfs_remove_block_group(),
only at the very end, after removing the block group item key from the
extent tree (and removing the free space tree entry if we are using the
free space tree feature).

Fixes: 04216820fe83d5 ("Btrfs: fix race between fs trimming and block group remove/allocation")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
---

NOTE: this applies only to a 5.2 kernel, although the problem exists in previous
      kernels starting from 3.19, due to recent changes in the 5.2 merge window
      that removed the fs_info->pending_chunks, a slightly different version of
      this patch is needed, one which locks and unlocks the chunk mutex inside
      the moved block. Such patch version can be found here:

      https://www.dropbox.com/s/1sv0hd2xbsn9jrt/Btrfs-fix-race-between-block-group-removal-and-block.patch?dl=0

 fs/btrfs/extent-tree.c | 24 +++++++++++++-----------
 1 file changed, 13 insertions(+), 11 deletions(-)

Comments

David Sterba June 12, 2019, 1:23 p.m. UTC | #1
On Wed, Jun 12, 2019 at 11:05:42AM +0100, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>

> Fixes: 04216820fe83d5 ("Btrfs: fix race between fs trimming and block group remove/allocation")
> Signed-off-by: Filipe Manana <fdmanana@suse.com>
> ---
> 
> NOTE: this applies only to a 5.2 kernel, although the problem exists in previous
>       kernels starting from 3.19, due to recent changes in the 5.2 merge window
>       that removed the fs_info->pending_chunks, a slightly different version of
>       this patch is needed, one which locks and unlocks the chunk mutex inside
>       the moved block.

Thanks, I'll add it to misc-5.2, the type of the fix qualifies for a
late rc, so I'll send it by the end of the week.


>       Such patch version can be found here:
> 
>       https://www.dropbox.com/s/1sv0hd2xbsn9jrt/Btrfs-fix-race-between-block-group-removal-and-block.patch?dl=0

I'd prefer to get patches instead of links to dropbox. You can send more
patches in a series or as a reply to one of them and put the expected
target version to the subject to eg. like [PATCH for 5.2] or [PATCH for
<= 5.1] and put the note like the above with further details. Thanks.
Filipe Manana June 12, 2019, 1:34 p.m. UTC | #2
On Wed, Jun 12, 2019 at 2:22 PM David Sterba <dsterba@suse.cz> wrote:
>
> On Wed, Jun 12, 2019 at 11:05:42AM +0100, fdmanana@kernel.org wrote:
> > From: Filipe Manana <fdmanana@suse.com>
>
> > Fixes: 04216820fe83d5 ("Btrfs: fix race between fs trimming and block group remove/allocation")
> > Signed-off-by: Filipe Manana <fdmanana@suse.com>
> > ---
> >
> > NOTE: this applies only to a 5.2 kernel, although the problem exists in previous
> >       kernels starting from 3.19, due to recent changes in the 5.2 merge window
> >       that removed the fs_info->pending_chunks, a slightly different version of
> >       this patch is needed, one which locks and unlocks the chunk mutex inside
> >       the moved block.
>
> Thanks, I'll add it to misc-5.2, the type of the fix qualifies for a
> late rc, so I'll send it by the end of the week.
>
>
> >       Such patch version can be found here:
> >
> >       https://www.dropbox.com/s/1sv0hd2xbsn9jrt/Btrfs-fix-race-between-block-group-removal-and-block.patch?dl=0
>
> I'd prefer to get patches instead of links to dropbox. You can send more
> patches in a series or as a reply to one of them and put the expected
> target version to the subject to eg. like [PATCH for 5.2] or [PATCH for
> <= 5.1] and put the note like the above with further details. Thanks.

It was just a note, I meant to send a proper patch once I got mail
from the stable guys mentioning that the patch fails to apply.
Thanks.
diff mbox series

Patch

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 1aee51a9f3bf..3a26a4ab9cb8 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -10831,17 +10831,6 @@  int btrfs_remove_block_group(struct btrfs_trans_handle *trans,
 	remove_em = (atomic_read(&block_group->trimming) == 0);
 	spin_unlock(&block_group->lock);
 
-	if (remove_em) {
-		struct extent_map_tree *em_tree;
-
-		em_tree = &fs_info->mapping_tree.map_tree;
-		write_lock(&em_tree->lock);
-		remove_extent_mapping(em_tree, em);
-		write_unlock(&em_tree->lock);
-		/* once for the tree */
-		free_extent_map(em);
-	}
-
 	mutex_unlock(&fs_info->chunk_mutex);
 
 	ret = remove_block_group_free_space(trans, block_group);
@@ -10858,6 +10847,19 @@  int btrfs_remove_block_group(struct btrfs_trans_handle *trans,
 		goto out;
 
 	ret = btrfs_del_item(trans, root, path);
+	if (ret)
+		goto out;
+
+	if (remove_em) {
+		struct extent_map_tree *em_tree;
+
+		em_tree = &fs_info->mapping_tree.map_tree;
+		write_lock(&em_tree->lock);
+		remove_extent_mapping(em_tree, em);
+		write_unlock(&em_tree->lock);
+		/* once for the tree */
+		free_extent_map(em);
+	}
 out:
 	if (remove_rsv)
 		btrfs_delayed_refs_rsv_release(fs_info, 1);