Btrfs: fix unprotected list move from unused_bgs to deleted_bgs list

Message ID 1448642305-26199-1-git-send-email-fdmanana@kernel.org (mailing list archive)
State Accepted

Commit Message

Filipe Manana Nov. 27, 2015, 4:38 p.m. UTC
From: Filipe Manana <fdmanana@suse.com>

As of my previous change titled "Btrfs: fix scrub preventing unused block
groups from being deleted", the following warning at
extent-tree.c:btrfs_delete_unused_bgs() can be hit when we mount a
filesystem with "-o discard":

 10263  void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info)
 10264  {
 (...)
 10405                  if (trimming) {
 10406                          WARN_ON(!list_empty(&block_group->bg_list));
 10407                          spin_lock(&trans->transaction->deleted_bgs_lock);
 10408                          list_move(&block_group->bg_list,
 10409                                    &trans->transaction->deleted_bgs);
 10410                          spin_unlock(&trans->transaction->deleted_bgs_lock);
 10411                          btrfs_get_block_group(block_group);
 10412                  }
 (...)

This happens because scrub can now add the block group back to the list of
unused block groups (fs_info->unused_bgs). This is dangerous because we
are moving the block group from the unused block groups list to the list
of deleted block groups without holding the lock that protects the source
list (fs_info->unused_bgs_lock).
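
To see why, recall that list_move() is an unlink from the source list
followed by a link into the destination, so it writes to nodes that are
still part of fs_info->unused_bgs. A simplified sketch of the helpers
(based on include/linux/list.h; the kernel's actual list_move() is
slightly more indirect):

static inline void __list_del(struct list_head *prev, struct list_head *next)
{
	next->prev = prev;	/* rewrites a node still linked in the source list */
	prev->next = next;	/* races with a concurrent list_add() on that list */
}

static inline void list_move(struct list_head *list, struct list_head *head)
{
	__list_del(list->prev, list->next);	/* unlink from fs_info->unused_bgs */
	list_add(list, head);			/* link into transaction->deleted_bgs */
}

Holding deleted_bgs_lock serializes only the list_add() half; the
__list_del() half touches the unused_bgs list, which scrub modifies
under fs_info->unused_bgs_lock.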

The following diagram illustrates how this happens:

            CPU 1                                     CPU 2

 cleaner_kthread()
   btrfs_delete_unused_bgs()

     sees bg X in list
      fs_info->unused_bgs

     deletes bg X from list
      fs_info->unused_bgs

                                            scrub_enumerate_chunks()

                                              searches device tree using
                                              its commit root

                                              finds device extent for
                                              block group X

                                              gets block group X from the tree
                                              fs_info->block_group_cache_tree
                                              (via btrfs_lookup_block_group())

                                              sets bg X to RO (again)

                                              scrub_chunk(bg X)

                                              sets bg X back to RW mode

                                              adds bg X to the list
                                              fs_info->unused_bgs again,
                                              since it's still unused and
                                              currently not in that list

     sets bg X to RO mode

     btrfs_remove_chunk(bg X)

     --> discard is enabled and bg X
         is in the fs_info->unused_bgs
         list again so the warning is
         triggered
     --> we move it from that list into
          the transaction's deleted_bgs
         list, but we can have another
         task currently manipulating
         the first list (fs_info->unused_bgs)

Fix this by using the same lock (fs_info->unused_bgs_lock) to protect both
the list of unused block groups and the list of deleted block groups. This
makes the move safe, and there's not much worry about extra lock contention,
as this lock is seldom used and only the cleaner kthread adds elements to the
list of deleted block groups. The warning is removed too, since it checked
for a case that used to be impossible (and would have been better expressed
as a BUG_ON/ASSERT) but isn't impossible anymore.
Reproduced with fstest btrfs/073 (using MOUNT_OPTIONS="-o discard").

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/extent-tree.c | 10 +++++++---
 fs/btrfs/transaction.c |  1 -
 fs/btrfs/transaction.h |  2 +-
 3 files changed, 8 insertions(+), 5 deletions(-)

Comments

Duncan Nov. 28, 2015, 7:34 a.m. UTC | #1
fdmanana posted on Fri, 27 Nov 2015 16:38:25 +0000 as excerpted:

> As of my previous change titled "Btrfs: fix scrub preventing unused
> block groups from being deleted", the following warning at
> extent-tree.c:btrfs_delete_unused_bgs() can be hit when we mount a
> filesystem with "-o discard":

This set of questions doesn't have anything directly to do with your 
patch, but it does have to do with unused blockgroup deletion, which of 
course your patch deals with too.

When does unused blockgroup deletion occur?  Is there a waiting period 
after a new block group is created before it's deleted?  (I'd suppose so, 
so there's no race between creation in order to actually use it, and 
immediate deletion before it can be used.)  How long?

I suppose you've seen the discussion on defrag apparently not having as 
many uncongested data chunks to work with now, as they're all deleted, 
making it less effective since the available blocks of free space are 
likely to be much smaller now, whereas previously it could very 
frequently find whole gigs free to use at a time.  First of all, I've not 
seen a direct confirmation from a dev on that theory yet, tho IIRC Hugo 
said it made sense, so that'd be useful.  Second, how trivial/major is 
patching defrag to be rather greedier with chunk allocation, when it can't 
find existing free space blocks of sufficient size?  Is that going to 
have to wait for the eventual defrag rewrite to enable reflink/snapshot 
awareness again?

And while waiting for that, is my suggested but as yet untested 
workaround of doing a bunch of 1 GiB file truncates to force data chunk 
allocation, syncing, then either deleting them (if the wait before empty-
chunk deletion will allow time for the defrag to claim them), or 
truncating them to say 3.9 KiB (just under a single 4 KiB block) to keep 
the chunks claimed while freeing most of them (if the wait before empty-
chunk deletion is too short to allow defrag to claim entirely empty 
chunks), followed by another sync, before the defrag, likely to work?  If 
not, is there some other way to force data chunk allocation and/or 
temporarily suspend empty-data-chunk deletion to give defrag some 
uncongested free space to work with?
Filipe Manana Nov. 28, 2015, 10:35 a.m. UTC | #2
On Sat, Nov 28, 2015 at 7:34 AM, Duncan <1i5t5.duncan@cox.net> wrote:
> fdmanana posted on Fri, 27 Nov 2015 16:38:25 +0000 as excerpted:
>
>> As of my previous change titled "Btrfs: fix scrub preventing unused
>> block groups from being deleted", the following warning at
>> extent-tree.c:btrfs_delete_unused_bgs() can be hit when we mount a
>> filesystem with "-o discard":

Hi Duncan,

>
> This set of questions doesn't have anything directly to do with your
> patch, but it does have to do with unused blockgroup deletion, which of
> course your patch does too.

If it's unrelated to the patch please keep this in a separate thread
(which, ironically, I'm not doing now as this client doesn't seem to
allow me to).

>
> When does unused blockgroup deletion occur?  Is there a waiting period
> after a new block group is created before it's deleted?  (I'd suppose so,
> so there's no race between creation in order to actually use it, and
> immediate deletion before it can be used.)  How long?

There's no guarantee about when. It can be as soon as the block group
becomes empty/unused or it might happen some time after the current
transaction is committed. The cleaner kthread is woken up at transaction
commit time, then does several things like deleting snapshots and
unused block groups, and then it sleeps if it has nothing more to do. A
block group might become empty/unused while the cleaner kthread is
active or while it's inactive (in the former case the deletion can
happen immediately, while in the latter case the block group deletion
happens sometime after the current transaction is committed).
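
For reference, the loop looks roughly like the following (a simplified
sketch loosely based on cleaner_kthread() in fs/btrfs/disk-io.c around
this kernel version - wakeup and rescheduling details are elided, so
don't treat it as the exact current code):

#include <linux/freezer.h>
#include <linux/kthread.h>
#include <linux/sched.h>

static int cleaner_kthread(void *arg)
{
	struct btrfs_root *root = arg;

	do {
		/* Woken up around transaction commit time. */
		if (mutex_trylock(&root->fs_info->cleaner_mutex)) {
			btrfs_run_delayed_iputs(root);
			btrfs_clean_one_deleted_snapshot(root);
			mutex_unlock(&root->fs_info->cleaner_mutex);
			/* Unused block groups are reclaimed here. */
			btrfs_delete_unused_bgs(root->fs_info);
		}
		set_current_state(TASK_INTERRUPTIBLE);
		if (!try_to_freeze() && !kthread_should_stop())
			schedule();
		__set_current_state(TASK_RUNNING);
	} while (!kthread_should_stop());

	return 0;
}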

>
> I suppose you've seen the discussion on defrag apparently not having as

Nope, I haven't. I don't always have the time to read through long
threads with a lot of theories and speculation.

> many uncongested data chunks to work with now, as they're all deleted,
> making it less effective since the available blocks of free space are
> likely to be much smaller now, whereas previously it could very
> frequently find whole gigs free to use at a time.

I don't get that. Defrag basically marks as dirty a range of
consecutive pages that have physical (on disk) extents which aren't
contiguous - you can think of it as almost like rewriting the same
data in a file region from userspace (try that: create a fragmented
file, then read its content and write that same content back to the
file and call 'sync' - the file will now have fewer extents unless you
were low on free space).
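
In userspace terms the analogy is just this (a minimal sketch, not
btrfs-specific; error handling is omitted and the whole file is assumed
to fit in memory):

#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	int fd;
	off_t size;
	char *buf;

	if (argc < 2)
		return 1;
	fd = open(argv[1], O_RDWR);
	size = lseek(fd, 0, SEEK_END);
	buf = malloc(size);

	pread(fd, buf, size, 0);	/* read the (fragmented) contents */
	pwrite(fd, buf, size, 0);	/* dirty the same range again */
	fsync(fd);			/* writeback re-allocates the extents */
	free(buf);
	close(fd);
	return 0;
}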

When writeback happens (in response to fsync, sync or periodically
triggered by the kernel for several reasons) the allocator is called
to find space to write the data to - it tries its best to find
contiguous space (preferably a single extent, otherwise multiple) -
this is the same mechanism regardless of the pages becoming dirty due
to regular writes from user space or pages becoming dirty from a
defrag operation. If the allocator can't find space, it tries to
allocate new block groups/chunks - a block group deletion that later
forces the allocator to allocate a new block group is no different
from forcing it to allocate a new one because all the current ones
are full.

> First of all, I've not
> seen a direct confirmation from a dev on that theory yet, tho IIRC Hugo
> said it made sense, so that'd be useful.  Second, how trivial/major is
> patching defrag to be rather greeder with chunk allocation, when it can't
> find existing free space blocks of sufficient size?  Is that going to
> have to wait for the eventual defrag rewrite to enable reflink/snapshot
> awareness again?

Snapshot-aware defrag has some extra stuff I didn't mention above, but
I'm not describing it here (I would have to go and read its
implementation first) as it's been disabled for almost 2 years now
(with the disabling patch backported to stable).

>
> And while waiting for that, is my suggested but as yet untested
> workaround of doing a bunch of 1 GiB file truncates to force data chunk
> allocation, syncing, then either deleting them (if the wait before empty-
> chunk deletion will allow time for the defrag to claim them), or
> truncating them to say 3.9 KiB (just under a single 4 KiB block) to keep
> the chunks claimed while freeing most of them (if the wait before empty-
> chunk deletion is too short to allow defrag to claim entirely empty
> chunks), followed by another sync, before the defrag, likely to work?  If
> not, is there some other way to force data chunk allocation and/or
> temporarily suspend empty-data-chunk deletion to give defrag some
> uncongested free space to work with?

Patch

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 4b89680..c4661db 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -10480,11 +10480,15 @@  void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info)
 		 * until transaction commit to do the actual discard.
 		 */
 		if (trimming) {
-			WARN_ON(!list_empty(&block_group->bg_list));
-			spin_lock(&trans->transaction->deleted_bgs_lock);
+			spin_lock(&fs_info->unused_bgs_lock);
+			/*
+			 * A concurrent scrub might have added us to the list
+			 * fs_info->unused_bgs, so use a list_move operation
+			 * to add the block group to the deleted_bgs list.
+			 */
 			list_move(&block_group->bg_list,
 				  &trans->transaction->deleted_bgs);
-			spin_unlock(&trans->transaction->deleted_bgs_lock);
+			spin_unlock(&fs_info->unused_bgs_lock);
 			btrfs_get_block_group(block_group);
 		}
 end_trans:
diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index 3367a3c..be8eae8 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -274,7 +274,6 @@  loop:
 	cur_trans->num_dirty_bgs = 0;
 	spin_lock_init(&cur_trans->dirty_bgs_lock);
 	INIT_LIST_HEAD(&cur_trans->deleted_bgs);
-	spin_lock_init(&cur_trans->deleted_bgs_lock);
 	spin_lock_init(&cur_trans->dropped_roots_lock);
 	list_add_tail(&cur_trans->list, &fs_info->trans_list);
 	extent_io_tree_init(&cur_trans->dirty_pages,
diff --git a/fs/btrfs/transaction.h b/fs/btrfs/transaction.h
index 0da21ca..64c8221 100644
--- a/fs/btrfs/transaction.h
+++ b/fs/btrfs/transaction.h
@@ -77,8 +77,8 @@  struct btrfs_transaction {
 	 */
 	struct mutex cache_write_mutex;
 	spinlock_t dirty_bgs_lock;
+	/* Protected by spin lock fs_info->unused_bgs_lock. */
 	struct list_head deleted_bgs;
-	spinlock_t deleted_bgs_lock;
 	spinlock_t dropped_roots_lock;
 	struct btrfs_delayed_ref_root delayed_refs;
 	int aborted;