diff mbox

[3/3] btrfs: add missing discards when unpinning extents with -o discard

Message ID 1433342874-7099-4-git-send-email-jeffm@suse.com (mailing list archive)
State Superseded
Headers show

Commit Message

Jeff Mahoney June 3, 2015, 2:47 p.m. UTC
From: Jeff Mahoney <jeffm@suse.com>

When we clear the dirty bits in btrfs_delete_unused_bgs for extents
in the empty block group, it results in btrfs_finish_extent_commit being
unable to discard the freed extents.

The block group removal patch added an alternate path to forget extents
other than btrfs_finish_extent_commit.  As a result, any extents that
would be freed when the block group is removed aren't discarded.  In my
test run, with a large copy of mixed sized files followed by removal, it
left nearly 2/3 of extents undiscarded.

To clean up the block groups, we add the removed block group onto a list
that will be discarded after transaction commit.

v1->v2:
- Fix ordering to ensure that we don't discard extents freed in an
  uncommitted transaction.

v2->v3:
- Factor out get/put block_group->trimming to ensure that cleanup always
  happens at the last reference drop.
- Cleanup the free space cache on the last reference drop.
- Use list_move instead of list_add in case of multiple adds.  We still
  issue a warning, but we shouldn't fall over.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
---
 fs/btrfs/ctree.h            |  3 ++
 fs/btrfs/extent-tree.c      | 69 ++++++++++++++++++++++++++++++++++++++++++---
 fs/btrfs/free-space-cache.c | 57 +++++++++++++++++++++----------------
 fs/btrfs/super.c            |  2 +-
 fs/btrfs/transaction.c      |  2 ++
 fs/btrfs/transaction.h      |  2 ++
 6 files changed, 106 insertions(+), 29 deletions(-)

Comments

Filipe Manana June 8, 2015, 2:12 p.m. UTC | #1
On Wed, Jun 3, 2015 at 3:47 PM,  <jeffm@suse.com> wrote:
> From: Jeff Mahoney <jeffm@suse.com>
>
> When we clear the dirty bits in btrfs_delete_unused_bgs for extents
> in the empty block group, it results in btrfs_finish_extent_commit being
> unable to discard the freed extents.
>
> The block group removal patch added an alternate path to forget extents
> other than btrfs_finish_extent_commit.  As a result, any extents that
> would be freed when the block group is removed aren't discarded.  In my
> test run, with a large copy of mixed sized files followed by removal, it
> left nearly 2/3 of extents undiscarded.
>
> To clean up the block groups, we add the removed block group onto a list
> that will be discarded after transaction commit.
>
> v1->v2:
> - Fix ordering to ensure that we don't discard extents freed in an
>   uncommitted transaction.
>
> v2->v3:
> - Factor out get/put block_group->trimming to ensure that cleanup always
>   happens at the last reference drop.
> - Cleanup the free space cache on the last reference drop.
> - Use list_move instead of list_add in case of multiple adds.  We still
>   issue a warning, but we shouldn't fall over.
>
> Signed-off-by: Jeff Mahoney <jeffm@suse.com>

Hi Jeff,

The changes look good from eyeballing, and I gave it a try together
with the previous 2 patches in the corresponding series.
However, this patch in the series fails tests btrfs/065 to btrfs/072.

Running the following in a loop until it returns failure ($? != 0)
triggers an apparent corruption issue most of the time:

root 17:16:48 /home/fdmanana/git/hub/xfstests (up_master_2)>
MKFS_OPTIONS="-O skinny-metadata" MOUNT_OPTIONS="-o discard" ./check
btrfs/065
FSTYP         -- btrfs
PLATFORM      -- Linux/x86_64 debian3 4.1.0-rc5-btrfs-next-10+
MKFS_OPTIONS  -- -O skinny-metadata /dev/sdc
MOUNT_OPTIONS -- -o discard /dev/sdc /home/fdmanana/btrfs-tests/scratch_1

btrfs/065 130s ... [failed, exit status 1] - output mismatch (see
/home/fdmanana/git/hub/xfstests/results//btrfs/065.out.bad)
    --- tests/btrfs/065.out 2014-11-17 20:59:51.282203001 +0000
    +++ /home/fdmanana/git/hub/xfstests/results//btrfs/065.out.bad
2015-06-07 17:37:40.841293317 +0100
    @@ -1,2 +1,3 @@
     QA output created by 065
     Silence is golden
    +_check_btrfs_filesystem: filesystem on /dev/sdc is inconsistent
(see /home/fdmanana/git/hub/xfstests/results//btrfs/065.full)
    ...
    (Run 'diff -u tests/btrfs/065.out
/home/fdmanana/git/hub/xfstests/results//btrfs/065.out.bad'  to see
the entire diff)
Ran: btrfs/065
Failures: btrfs/065
Failed 1 of 1 tests

root 17:37:41 /home/fdmanana/git/hub/xfstests (up_master_2)> btrfsck /dev/sdc
No valid Btrfs found on /dev/sdc
Couldn't open file system
root 17:37:43 /home/fdmanana/git/hub/xfstests (up_master_2)>


With only patches 1 and 2 in the series (or without them) this never
happens (well, at least not in ~50 iterations when running btrfs/065).
And I've never seen this before testing this patch.

It also makes generic/251 "never" finish apparently - it didn't finish
here after 9h running, and without any of these patches in the series
it used to finish in less than half an hour.

Maybe the apparent corruption is due to some interaction with device
replace or scrub, I haven't had the time to analyze/determine what it
could be.

thanks




> ---
>  fs/btrfs/ctree.h            |  3 ++
>  fs/btrfs/extent-tree.c      | 69 ++++++++++++++++++++++++++++++++++++++++++---
>  fs/btrfs/free-space-cache.c | 57 +++++++++++++++++++++----------------
>  fs/btrfs/super.c            |  2 +-
>  fs/btrfs/transaction.c      |  2 ++
>  fs/btrfs/transaction.h      |  2 ++
>  6 files changed, 106 insertions(+), 29 deletions(-)
>
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index 6f364e1..780acf1 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -3438,6 +3438,8 @@ int btrfs_remove_block_group(struct btrfs_trans_handle *trans,
>                              struct btrfs_root *root, u64 group_start,
>                              struct extent_map *em);
>  void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info);
> +void btrfs_get_block_group_trimming(struct btrfs_block_group_cache *cache);
> +void btrfs_put_block_group_trimming(struct btrfs_block_group_cache *cache);
>  void btrfs_create_pending_block_groups(struct btrfs_trans_handle *trans,
>                                        struct btrfs_root *root);
>  u64 btrfs_get_alloc_profile(struct btrfs_root *root, int data);
> @@ -4068,6 +4070,7 @@ __printf(5, 6)
>  void __btrfs_std_error(struct btrfs_fs_info *fs_info, const char *function,
>                      unsigned int line, int errno, const char *fmt, ...);
>
> +const char *btrfs_decode_error(int errno);
>
>  void __btrfs_abort_transaction(struct btrfs_trans_handle *trans,
>                                struct btrfs_root *root, const char *function,
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index fb4dbfc..e53dbe9 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -6034,20 +6034,19 @@ int btrfs_finish_extent_commit(struct btrfs_trans_handle *trans,
>                                struct btrfs_root *root)
>  {
>         struct btrfs_fs_info *fs_info = root->fs_info;
> +       struct btrfs_block_group_cache *block_group, *tmp;
> +       struct list_head *deleted_bgs;
>         struct extent_io_tree *unpin;
>         u64 start;
>         u64 end;
>         int ret;
>
> -       if (trans->aborted)
> -               return 0;
> -
>         if (fs_info->pinned_extents == &fs_info->freed_extents[0])
>                 unpin = &fs_info->freed_extents[1];
>         else
>                 unpin = &fs_info->freed_extents[0];
>
> -       while (1) {
> +       while (!trans->aborted) {
>                 mutex_lock(&fs_info->unused_bg_unpin_mutex);
>                 ret = find_first_extent_bit(unpin, 0, &start, &end,
>                                             EXTENT_DIRTY, NULL);
> @@ -6066,6 +6065,34 @@ int btrfs_finish_extent_commit(struct btrfs_trans_handle *trans,
>                 cond_resched();
>         }
>
> +       /*
> +        * Transaction is finished.  We don't need the lock anymore.  We
> +        * do need to clean up the block groups in case of a transaction
> +        * abort.
> +        */
> +       deleted_bgs = &trans->transaction->deleted_bgs;
> +       list_for_each_entry_safe(block_group, tmp, deleted_bgs, bg_list) {
> +               u64 trimmed = 0;
> +
> +               ret = -EROFS;
> +               if (!trans->aborted)
> +                       ret = btrfs_discard_extent(root,
> +                                                  block_group->key.objectid,
> +                                                  block_group->key.offset,
> +                                                  &trimmed);
> +
> +               list_del_init(&block_group->bg_list);
> +               btrfs_put_block_group_trimming(block_group);
> +               btrfs_put_block_group(block_group);
> +
> +               if (ret) {
> +                       const char *errstr = btrfs_decode_error(ret);
> +                       btrfs_warn(fs_info,
> +                                  "Discard failed while removing blockgroup: errno=%d %s\n",
> +                                  ret, errstr);
> +               }
> +       }
> +
>         return 0;
>  }
>
> @@ -9845,6 +9872,11 @@ int btrfs_remove_block_group(struct btrfs_trans_handle *trans,
>          * currently running transaction might finish and a new one start,
>          * allowing for new block groups to be created that can reuse the same
>          * physical device locations unless we take this special care.
> +        *
> +        * There may also be an implicit trim operation if the file system
> +        * is mounted with -odiscard. The same protections must remain
> +        * in place until the extents have been discarded completely when
> +        * the transaction commit has completed.
>          */
>         remove_em = (atomic_read(&block_group->trimming) == 0);
>         /*
> @@ -9917,8 +9949,10 @@ void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info)
>                 return;
>
>         spin_lock(&fs_info->unused_bgs_lock);
> +
>         while (!list_empty(&fs_info->unused_bgs)) {
>                 u64 start, end;
> +               int trimming;
>
>                 block_group = list_first_entry(&fs_info->unused_bgs,
>                                                struct btrfs_block_group_cache,
> @@ -10016,12 +10050,39 @@ void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info)
>                 spin_unlock(&block_group->lock);
>                 spin_unlock(&space_info->lock);
>
> +               /* DISCARD can flip during remount */
> +               trimming = btrfs_test_opt(root, DISCARD);
> +
> +               /* Implicit trim during transaction commit. */
> +               if (trimming)
> +                       btrfs_get_block_group_trimming(block_group);
> +
>                 /*
>                  * Btrfs_remove_chunk will abort the transaction if things go
>                  * horribly wrong.
>                  */
>                 ret = btrfs_remove_chunk(trans, root,
>                                          block_group->key.objectid);
> +
> +               if (ret) {
> +                       if (trimming)
> +                               btrfs_put_block_group_trimming(block_group);
> +                       goto end_trans;
> +               }
> +
> +               /*
> +                * If we're not mounted with -odiscard, we can just forget
> +                * about this block group. Otherwise we'll need to wait
> +                * until transaction commit to do the actual discard.
> +                */
> +               if (trimming) {
> +                       WARN_ON(!list_empty(&block_group->bg_list));
> +                       spin_lock(&trans->transaction->deleted_bgs_lock);
> +                       list_move(&block_group->bg_list,
> +                                 &trans->transaction->deleted_bgs);
> +                       spin_unlock(&trans->transaction->deleted_bgs_lock);
> +                       btrfs_get_block_group(block_group);
> +               }
>  end_trans:
>                 btrfs_end_transaction(trans, root);
>  next:
> diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
> index 9dbe5b5..c79253e 100644
> --- a/fs/btrfs/free-space-cache.c
> +++ b/fs/btrfs/free-space-cache.c
> @@ -3274,35 +3274,23 @@ next:
>         return ret;
>  }
>
> -int btrfs_trim_block_group(struct btrfs_block_group_cache *block_group,
> -                          u64 *trimmed, u64 start, u64 end, u64 minlen)
> +void btrfs_get_block_group_trimming(struct btrfs_block_group_cache *cache)
>  {
> -       int ret;
> +       atomic_inc(&cache->trimming);
> +}
>
> -       *trimmed = 0;
> +void btrfs_put_block_group_trimming(struct btrfs_block_group_cache *block_group)
> +{
> +       struct extent_map_tree *em_tree;
> +       struct extent_map *em;
> +       bool cleanup;
>
>         spin_lock(&block_group->lock);
> -       if (block_group->removed) {
> -               spin_unlock(&block_group->lock);
> -               return 0;
> -       }
> -       atomic_inc(&block_group->trimming);
> +       cleanup = (atomic_dec_and_test(&block_group->trimming) &&
> +                  block_group->removed);
>         spin_unlock(&block_group->lock);
>
> -       ret = trim_no_bitmap(block_group, trimmed, start, end, minlen);
> -       if (ret)
> -               goto out;
> -
> -       ret = trim_bitmaps(block_group, trimmed, start, end, minlen);
> -out:
> -       spin_lock(&block_group->lock);
> -       if (atomic_dec_and_test(&block_group->trimming) &&
> -           block_group->removed) {
> -               struct extent_map_tree *em_tree;
> -               struct extent_map *em;
> -
> -               spin_unlock(&block_group->lock);
> -
> +       if (cleanup) {
>                 lock_chunks(block_group->fs_info->chunk_root);
>                 em_tree = &block_group->fs_info->mapping_tree.map_tree;
>                 write_lock(&em_tree->lock);
> @@ -3326,10 +3314,31 @@ out:
>                  * this block group have left 1 entry each one. Free them.
>                  */
>                 __btrfs_remove_free_space_cache(block_group->free_space_ctl);
> -       } else {
> +       }
> +}
> +
> +int btrfs_trim_block_group(struct btrfs_block_group_cache *block_group,
> +                          u64 *trimmed, u64 start, u64 end, u64 minlen)
> +{
> +       int ret;
> +
> +       *trimmed = 0;
> +
> +       spin_lock(&block_group->lock);
> +       if (block_group->removed) {
>                 spin_unlock(&block_group->lock);
> +               return 0;
>         }
> +       btrfs_get_block_group_trimming(block_group);
> +       spin_unlock(&block_group->lock);
> +
> +       ret = trim_no_bitmap(block_group, trimmed, start, end, minlen);
> +       if (ret)
> +               goto out;
>
> +       ret = trim_bitmaps(block_group, trimmed, start, end, minlen);
> +out:
> +       btrfs_put_block_group_trimming(block_group);
>         return ret;
>  }
>
> diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
> index 2ccd8d4..a80da03 100644
> --- a/fs/btrfs/super.c
> +++ b/fs/btrfs/super.c
> @@ -69,7 +69,7 @@ static struct file_system_type btrfs_fs_type;
>
>  static int btrfs_remount(struct super_block *sb, int *flags, char *data);
>
> -static const char *btrfs_decode_error(int errno)
> +const char *btrfs_decode_error(int errno)
>  {
>         char *errstr = "unknown";
>
> diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
> index 5628e25..2005262 100644
> --- a/fs/btrfs/transaction.c
> +++ b/fs/btrfs/transaction.c
> @@ -256,6 +256,8 @@ loop:
>         mutex_init(&cur_trans->cache_write_mutex);
>         cur_trans->num_dirty_bgs = 0;
>         spin_lock_init(&cur_trans->dirty_bgs_lock);
> +       INIT_LIST_HEAD(&cur_trans->deleted_bgs);
> +       spin_lock_init(&cur_trans->deleted_bgs_lock);
>         list_add_tail(&cur_trans->list, &fs_info->trans_list);
>         extent_io_tree_init(&cur_trans->dirty_pages,
>                              fs_info->btree_inode->i_mapping);
> diff --git a/fs/btrfs/transaction.h b/fs/btrfs/transaction.h
> index 0b24755..14325f2 100644
> --- a/fs/btrfs/transaction.h
> +++ b/fs/btrfs/transaction.h
> @@ -74,6 +74,8 @@ struct btrfs_transaction {
>          */
>         struct mutex cache_write_mutex;
>         spinlock_t dirty_bgs_lock;
> +       struct list_head deleted_bgs;
> +       spinlock_t deleted_bgs_lock;
>         struct btrfs_delayed_ref_root delayed_refs;
>         int aborted;
>         int dirty_bg_run;
> --
> 1.8.4.5
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jeff Mahoney June 9, 2015, 8:48 p.m. UTC | #2
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 6/8/15 10:12 AM, Filipe David Manana wrote:
> On Wed, Jun 3, 2015 at 3:47 PM,  <jeffm@suse.com> wrote:
>> From: Jeff Mahoney <jeffm@suse.com>
>> 
>> When we clear the dirty bits in btrfs_delete_unused_bgs for
>> extents in the empty block group, it results in
>> btrfs_finish_extent_commit being unable to discard the freed
>> extents.
>> 
>> The block group removal patch added an alternate path to forget
>> extents other than btrfs_finish_extent_commit.  As a result, any
>> extents that would be freed when the block group is removed
>> aren't discarded.  In my test run, with a large copy of mixed
>> sized files followed by removal, it left nearly 2/3 of extents
>> undiscarded.
>> 
>> To clean up the block groups, we add the removed block group onto
>> a list that will be discarded after transaction commit.
>> 
>> v1->v2: - Fix ordering to ensure that we don't discard extents
>> freed in an uncommitted transaction.
>> 
>> v2->v3: - Factor out get/put block_group->trimming to ensure that
>> cleanup always happens at the last reference drop. - Cleanup the
>> free space cache on the last reference drop. - Use list_move
>> instead of list_add in case of multiple adds.  We still issue a
>> warning, but we shouldn't fall over.
>> 
>> Signed-off-by: Jeff Mahoney <jeffm@suse.com>
> 
> Hi Jeff,
> 
> The changes look good from eyeballing, and I gave it a try
> together with the previous 2 patches in the corresponding series. 
> However, this patch in the series fails tests btrfs/065 to
> btrfs/072.
> 
> Running the following in a loop until it returns failure ($? != 0) 
> triggers an apparent corruption issue most of the time:

Ok, I didn't have my test environment set up to do the multidevice
tests by default, and definitely didn't have it set up to do 065,
since that requires 5+ devices of identical size. Now that I do have
that set up, I see this:

  8,37   2     1442    22.727796582 15400  D   D 419434496 + 8192 [umoun
t]

419434496 is the start of /dev/sdc5 and we're blowing away 4 MB.  That
happens to coincide with the system (single) block group.  In my
testing, I found (IIRC) that system (single) and metadata (single)
were always in the unused list but were getting skipped by the cleaner
thread.  It looks like the cleanup block groups on close_ctree is
misbehaving.  Whatever condition that was causing those to be skipped
must be unset for 065.  It wasn't a problem during normal runtime, but
now that it's called in close_ctree, it's introducing a problem.  I'll
start digging.

Thanks for the review!

- -Jeff

> root 17:16:48 /home/fdmanana/git/hub/xfstests (up_master_2)> 
> MKFS_OPTIONS="-O skinny-metadata" MOUNT_OPTIONS="-o discard"
> ./check btrfs/065 FSTYP         -- btrfs PLATFORM      --
> Linux/x86_64 debian3 4.1.0-rc5-btrfs-next-10+ MKFS_OPTIONS  -- -O
> skinny-metadata /dev/sdc MOUNT_OPTIONS -- -o discard /dev/sdc
> /home/fdmanana/btrfs-tests/scratch_1
> 
> btrfs/065 130s ... [failed, exit status 1] - output mismatch (see 
> /home/fdmanana/git/hub/xfstests/results//btrfs/065.out.bad) ---
> tests/btrfs/065.out 2014-11-17 20:59:51.282203001 +0000 +++
> /home/fdmanana/git/hub/xfstests/results//btrfs/065.out.bad 
> 2015-06-07 17:37:40.841293317 +0100 @@ -1,2 +1,3 @@ QA output
> created by 065 Silence is golden +_check_btrfs_filesystem:
> filesystem on /dev/sdc is inconsistent (see
> /home/fdmanana/git/hub/xfstests/results//btrfs/065.full) ... (Run
> 'diff -u tests/btrfs/065.out 
> /home/fdmanana/git/hub/xfstests/results//btrfs/065.out.bad'  to
> see the entire diff) Ran: btrfs/065 Failures: btrfs/065 Failed 1 of
> 1 tests
> 
> root 17:37:41 /home/fdmanana/git/hub/xfstests (up_master_2)>
> btrfsck /dev/sdc No valid Btrfs found on /dev/sdc Couldn't open
> file system root 17:37:43 /home/fdmanana/git/hub/xfstests
> (up_master_2)>
> 
> 
> With only patches 1 and 2 in the series (or without them) this
> never happens (well, at least not in ~50 iterations when running
> btrfs/065). And I've never seen this before testing this patch.
> 
> It also makes generic/251 "never" finish apparently - it didn't
> finish here after 9h running, and without any of these patches in
> the series it used to finish in less than half an hour.
> 
> Maybe the apparent corruption is due to some interaction with
> device replace or scrub, I haven't had the time to
> analyze/determine what it could be.
> 
> thanks
> 
> 
> 
> 
>> --- fs/btrfs/ctree.h            |  3 ++ fs/btrfs/extent-tree.c
>> | 69 ++++++++++++++++++++++++++++++++++++++++++--- 
>> fs/btrfs/free-space-cache.c | 57
>> +++++++++++++++++++++---------------- fs/btrfs/super.c
>> |  2 +- fs/btrfs/transaction.c      |  2 ++ 
>> fs/btrfs/transaction.h      |  2 ++ 6 files changed, 106
>> insertions(+), 29 deletions(-)
>> 
>> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index
>> 6f364e1..780acf1 100644 --- a/fs/btrfs/ctree.h +++
>> b/fs/btrfs/ctree.h @@ -3438,6 +3438,8 @@ int
>> btrfs_remove_block_group(struct btrfs_trans_handle *trans, struct
>> btrfs_root *root, u64 group_start, struct extent_map *em); void
>> btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info); +void
>> btrfs_get_block_group_trimming(struct btrfs_block_group_cache
>> *cache); +void btrfs_put_block_group_trimming(struct
>> btrfs_block_group_cache *cache); void
>> btrfs_create_pending_block_groups(struct btrfs_trans_handle
>> *trans, struct btrfs_root *root); u64
>> btrfs_get_alloc_profile(struct btrfs_root *root, int data); @@
>> -4068,6 +4070,7 @@ __printf(5, 6) void __btrfs_std_error(struct
>> btrfs_fs_info *fs_info, const char *function, unsigned int line,
>> int errno, const char *fmt, ...);
>> 
>> +const char *btrfs_decode_error(int errno);
>> 
>> void __btrfs_abort_transaction(struct btrfs_trans_handle *trans, 
>> struct btrfs_root *root, const char *function, diff --git
>> a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index
>> fb4dbfc..e53dbe9 100644 --- a/fs/btrfs/extent-tree.c +++
>> b/fs/btrfs/extent-tree.c @@ -6034,20 +6034,19 @@ int
>> btrfs_finish_extent_commit(struct btrfs_trans_handle *trans, 
>> struct btrfs_root *root) { struct btrfs_fs_info *fs_info =
>> root->fs_info; +       struct btrfs_block_group_cache
>> *block_group, *tmp; +       struct list_head *deleted_bgs; struct
>> extent_io_tree *unpin; u64 start; u64 end; int ret;
>> 
>> -       if (trans->aborted) -               return 0; - if
>> (fs_info->pinned_extents == &fs_info->freed_extents[0]) unpin =
>> &fs_info->freed_extents[1]; else unpin =
>> &fs_info->freed_extents[0];
>> 
>> -       while (1) { +       while (!trans->aborted) { 
>> mutex_lock(&fs_info->unused_bg_unpin_mutex); ret =
>> find_first_extent_bit(unpin, 0, &start, &end, EXTENT_DIRTY,
>> NULL); @@ -6066,6 +6065,34 @@ int
>> btrfs_finish_extent_commit(struct btrfs_trans_handle *trans, 
>> cond_resched(); }
>> 
>> +       /* +        * Transaction is finished.  We don't need the
>> lock anymore.  We +        * do need to clean up the block groups
>> in case of a transaction +        * abort. +        */ +
>> deleted_bgs = &trans->transaction->deleted_bgs; +
>> list_for_each_entry_safe(block_group, tmp, deleted_bgs, bg_list)
>> { +               u64 trimmed = 0; + +               ret =
>> -EROFS; +               if (!trans->aborted) +
>> ret = btrfs_discard_extent(root, +
>> block_group->key.objectid, +
>> block_group->key.offset, +
>> &trimmed); + +
>> list_del_init(&block_group->bg_list); +
>> btrfs_put_block_group_trimming(block_group); +
>> btrfs_put_block_group(block_group); + +               if (ret) { 
>> +                       const char *errstr =
>> btrfs_decode_error(ret); +
>> btrfs_warn(fs_info, +                                  "Discard
>> failed while removing blockgroup: errno=%d %s\n", +
>> ret, errstr); +               } +       } + return 0; }
>> 
>> @@ -9845,6 +9872,11 @@ int btrfs_remove_block_group(struct
>> btrfs_trans_handle *trans, * currently running transaction might
>> finish and a new one start, * allowing for new block groups to be
>> created that can reuse the same * physical device locations
>> unless we take this special care. +        * +        * There may
>> also be an implicit trim operation if the file system +        *
>> is mounted with -odiscard. The same protections must remain +
>> * in place until the extents have been discarded completely when 
>> +        * the transaction commit has completed. */ remove_em =
>> (atomic_read(&block_group->trimming) == 0); /* @@ -9917,8
>> +9949,10 @@ void btrfs_delete_unused_bgs(struct btrfs_fs_info
>> *fs_info) return;
>> 
>> spin_lock(&fs_info->unused_bgs_lock); + while
>> (!list_empty(&fs_info->unused_bgs)) { u64 start, end; +
>> int trimming;
>> 
>> block_group = list_first_entry(&fs_info->unused_bgs, struct
>> btrfs_block_group_cache, @@ -10016,12 +10050,39 @@ void
>> btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info) 
>> spin_unlock(&block_group->lock); spin_unlock(&space_info->lock);
>> 
>> +               /* DISCARD can flip during remount */ +
>> trimming = btrfs_test_opt(root, DISCARD); + +               /*
>> Implicit trim during transaction commit. */ +               if
>> (trimming) +
>> btrfs_get_block_group_trimming(block_group); + /* *
>> Btrfs_remove_chunk will abort the transaction if things go *
>> horribly wrong. */ ret = btrfs_remove_chunk(trans, root, 
>> block_group->key.objectid); + +               if (ret) { +
>> if (trimming) +
>> btrfs_put_block_group_trimming(block_group); +
>> goto end_trans; +               } + +               /* +
>> * If we're not mounted with -odiscard, we can just forget +
>> * about this block group. Otherwise we'll need to wait +
>> * until transaction commit to do the actual discard. +
>> */ +               if (trimming) { +
>> WARN_ON(!list_empty(&block_group->bg_list)); +
>> spin_lock(&trans->transaction->deleted_bgs_lock); +
>> list_move(&block_group->bg_list, +
>> &trans->transaction->deleted_bgs); +
>> spin_unlock(&trans->transaction->deleted_bgs_lock); +
>> btrfs_get_block_group(block_group); +               } end_trans: 
>> btrfs_end_transaction(trans, root); next: diff --git
>> a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c index
>> 9dbe5b5..c79253e 100644 --- a/fs/btrfs/free-space-cache.c +++
>> b/fs/btrfs/free-space-cache.c @@ -3274,35 +3274,23 @@ next: 
>> return ret; }
>> 
>> -int btrfs_trim_block_group(struct btrfs_block_group_cache
>> *block_group, -                          u64 *trimmed, u64 start,
>> u64 end, u64 minlen) +void btrfs_get_block_group_trimming(struct
>> btrfs_block_group_cache *cache) { -       int ret; +
>> atomic_inc(&cache->trimming); +}
>> 
>> -       *trimmed = 0; +void btrfs_put_block_group_trimming(struct
>> btrfs_block_group_cache *block_group) +{ +       struct
>> extent_map_tree *em_tree; +       struct extent_map *em; +
>> bool cleanup;
>> 
>> spin_lock(&block_group->lock); -       if (block_group->removed)
>> { -               spin_unlock(&block_group->lock); -
>> return 0; -       } -       atomic_inc(&block_group->trimming); +
>> cleanup = (atomic_dec_and_test(&block_group->trimming) && +
>> block_group->removed); spin_unlock(&block_group->lock);
>> 
>> -       ret = trim_no_bitmap(block_group, trimmed, start, end,
>> minlen); -       if (ret) -               goto out; - -       ret
>> = trim_bitmaps(block_group, trimmed, start, end, minlen); -out: -
>> spin_lock(&block_group->lock); -       if
>> (atomic_dec_and_test(&block_group->trimming) && -
>> block_group->removed) { -               struct extent_map_tree
>> *em_tree; -               struct extent_map *em; - -
>> spin_unlock(&block_group->lock); - +       if (cleanup) { 
>> lock_chunks(block_group->fs_info->chunk_root); em_tree =
>> &block_group->fs_info->mapping_tree.map_tree; 
>> write_lock(&em_tree->lock); @@ -3326,10 +3314,31 @@ out: * this
>> block group have left 1 entry each one. Free them. */ 
>> __btrfs_remove_free_space_cache(block_group->free_space_ctl); -
>> } else { +       } +} + +int btrfs_trim_block_group(struct
>> btrfs_block_group_cache *block_group, +
>> u64 *trimmed, u64 start, u64 end, u64 minlen) +{ +       int
>> ret; + +       *trimmed = 0; + +
>> spin_lock(&block_group->lock); +       if (block_group->removed)
>> { spin_unlock(&block_group->lock); +               return 0; } +
>> btrfs_get_block_group_trimming(block_group); +
>> spin_unlock(&block_group->lock); + +       ret =
>> trim_no_bitmap(block_group, trimmed, start, end, minlen); +
>> if (ret) +               goto out;
>> 
>> +       ret = trim_bitmaps(block_group, trimmed, start, end,
>> minlen); +out: +
>> btrfs_put_block_group_trimming(block_group); return ret; }
>> 
>> diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c index
>> 2ccd8d4..a80da03 100644 --- a/fs/btrfs/super.c +++
>> b/fs/btrfs/super.c @@ -69,7 +69,7 @@ static struct
>> file_system_type btrfs_fs_type;
>> 
>> static int btrfs_remount(struct super_block *sb, int *flags, char
>> *data);
>> 
>> -static const char *btrfs_decode_error(int errno) +const char
>> *btrfs_decode_error(int errno) { char *errstr = "unknown";
>> 
>> diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c 
>> index 5628e25..2005262 100644 --- a/fs/btrfs/transaction.c +++
>> b/fs/btrfs/transaction.c @@ -256,6 +256,8 @@ loop: 
>> mutex_init(&cur_trans->cache_write_mutex); 
>> cur_trans->num_dirty_bgs = 0; 
>> spin_lock_init(&cur_trans->dirty_bgs_lock); +
>> INIT_LIST_HEAD(&cur_trans->deleted_bgs); +
>> spin_lock_init(&cur_trans->deleted_bgs_lock); 
>> list_add_tail(&cur_trans->list, &fs_info->trans_list); 
>> extent_io_tree_init(&cur_trans->dirty_pages, 
>> fs_info->btree_inode->i_mapping); diff --git
>> a/fs/btrfs/transaction.h b/fs/btrfs/transaction.h index
>> 0b24755..14325f2 100644 --- a/fs/btrfs/transaction.h +++
>> b/fs/btrfs/transaction.h @@ -74,6 +74,8 @@ struct
>> btrfs_transaction { */ struct mutex cache_write_mutex; spinlock_t
>> dirty_bgs_lock; +       struct list_head deleted_bgs; +
>> spinlock_t deleted_bgs_lock; struct btrfs_delayed_ref_root
>> delayed_refs; int aborted; int dirty_bg_run; -- 1.8.4.5
>> 
>> -- To unsubscribe from this list: send the line "unsubscribe
>> linux-btrfs" in the body of a message to
>> majordomo@vger.kernel.org More majordomo info at
>> http://vger.kernel.org/majordomo-info.html
> 
> 
> 


- -- 
Jeff Mahoney
SUSE Labs
-----BEGIN PGP SIGNATURE-----
Version: GnuPG/MacGPG2 v2.0.19 (Darwin)

iQIcBAEBAgAGBQJVd1EWAAoJEB57S2MheeWyqbUP/jtZ7rAsqGPmZAOH3O5CYfp4
m9kacFUoPCVlMNK3Tpv6n/RQrDbwZ9BLRs6J0+a37JZAf+BwN/cTlc8ANOHQeBxH
O9VnTKUa+NHGviS7vgOpY9fV9jMVsgN6s89z2cPI7huXoovKum1zwuyMbrSvb92G
/Q8Rc6/eIW6kKnOqtBTi3l/gXeTUX3j9l3HlFY9SxkE2gRF+FRqWzBZBhDHS8jVs
S9ACTm7IaeOzs84NAGaQCSs5KGXZYjbYYhaUh9hacKWnOuvZ2MJLenyQuKb3JThL
cgaNu6+0Tl58SXh0hh2FrNTHKfCNrhs2jSAZQxGdEd9keViCmGMVgYbS8fyqKMw5
9LvKQB1NYWoKiCc7x+zK0GGU5p0MIWNWRqm02b/yp25LJKPwBHpX8hrseb7tHYdj
MSAqlHBUyfZR/nh6vywSeIsV4inPRIQYQKXoG5hechXQt7ivifXDtFMEa21PiLdR
bHxt2mgn6uqFBh/2atdYCzqOjH4QG3kQ9kVOdHxDbHk1sTFY26Q79xUImTA3qp7E
pI3Ckn9uG5b1m3PDdUFWBUkN6DffpUnsa1DAQBnI69/W6KY/VMKPaPrppYlQ5S8w
cBMhnfDxfyjfisUAS+mUmq4GcbpSt95KsVKizz7TlK2VxtPCWOywK0shRmoWIgME
62rJ9pbtLvoHY9GLNXnS
=E81+
-----END PGP SIGNATURE-----
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jeff Mahoney June 10, 2015, 12:05 a.m. UTC | #3
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 6/9/15 4:48 PM, Jeff Mahoney wrote:
> On 6/8/15 10:12 AM, Filipe David Manana wrote:
>> On Wed, Jun 3, 2015 at 3:47 PM,  <jeffm@suse.com> wrote:
>>> From: Jeff Mahoney <jeffm@suse.com>
>>> 
>>> When we clear the dirty bits in btrfs_delete_unused_bgs for 
>>> extents in the empty block group, it results in 
>>> btrfs_finish_extent_commit being unable to discard the freed 
>>> extents.
>>> 
>>> The block group removal patch added an alternate path to
>>> forget extents other than btrfs_finish_extent_commit.  As a
>>> result, any extents that would be freed when the block group is
>>> removed aren't discarded.  In my test run, with a large copy of
>>> mixed sized files followed by removal, it left nearly 2/3 of
>>> extents undiscarded.
>>> 
>>> To clean up the block groups, we add the removed block group
>>> onto a list that will be discarded after transaction commit.
>>> 
>>> v1->v2: - Fix ordering to ensure that we don't discard extents 
>>> freed in an uncommitted transaction.
>>> 
>>> v2->v3: - Factor out get/put block_group->trimming to ensure
>>> that cleanup always happens at the last reference drop. -
>>> Cleanup the free space cache on the last reference drop. - Use
>>> list_move instead of list_add in case of multiple adds.  We
>>> still issue a warning, but we shouldn't fall over.
>>> 
>>> Signed-off-by: Jeff Mahoney <jeffm@suse.com>
> 
>> Hi Jeff,
> 
>> The changes look good from eyeballing, and I gave it a try 
>> together with the previous 2 patches in the corresponding series.
>>  However, this patch in the series fails tests btrfs/065 to 
>> btrfs/072.
> 
>> Running the following in a loop until it returns failure ($? !=
>> 0) triggers an apparent corruption issue most of the time:
> 
> Ok, I didn't have my test environment set up to do the multidevice 
> tests by default, and definitely didn't have it set up to do 065, 
> since that requires 5+ devices of identical size. Now that I do
> have that set up, I see this:
> 
> 8,37   2     1442    22.727796582 15400  D   D 419434496 + 8192
> [umoun t]
> 
> 419434496 is the start of /dev/sdc5 and we're blowing away 4 MB.
> That happens to coincide with the system (single) block group.  In
> my testing, I found (IIRC) that system (single) and metadata
> (single) were always in the unused list but were getting skipped by
> the cleaner thread.  It looks like the cleanup block groups on
> close_ctree is misbehaving.  Whatever condition that was causing
> those to be skipped must be unset for 065.  It wasn't a problem
> during normal runtime, but now that it's called in close_ctree,
> it's introducing a problem.  I'll start digging.
> 
> Thanks for the review!

<...>-5472  [014] ....   390.430881: btrfs_chunk_free: root =
3(CHUNK_TREE), offset = 0, size = 4194304, num_stripes = 1,
sub_stripes = 0, type = SYSTEM

That'd do it. If I do a btrfs-debug-tree on the file system without
discard, it works fine because the superblock doesn't get discarded.
That 0-4MB chunk _is_ released and unused.

- -Jeff

> -Jeff
> 
>> root 17:16:48 /home/fdmanana/git/hub/xfstests (up_master_2)> 
>> MKFS_OPTIONS="-O skinny-metadata" MOUNT_OPTIONS="-o discard" 
>> ./check btrfs/065 FSTYP         -- btrfs PLATFORM      -- 
>> Linux/x86_64 debian3 4.1.0-rc5-btrfs-next-10+ MKFS_OPTIONS  --
>> -O skinny-metadata /dev/sdc MOUNT_OPTIONS -- -o discard /dev/sdc 
>> /home/fdmanana/btrfs-tests/scratch_1
> 
>> btrfs/065 130s ... [failed, exit status 1] - output mismatch (see
>>  /home/fdmanana/git/hub/xfstests/results//btrfs/065.out.bad) --- 
>> tests/btrfs/065.out 2014-11-17 20:59:51.282203001 +0000 +++ 
>> /home/fdmanana/git/hub/xfstests/results//btrfs/065.out.bad 
>> 2015-06-07 17:37:40.841293317 +0100 @@ -1,2 +1,3 @@ QA output 
>> created by 065 Silence is golden +_check_btrfs_filesystem: 
>> filesystem on /dev/sdc is inconsistent (see 
>> /home/fdmanana/git/hub/xfstests/results//btrfs/065.full) ...
>> (Run 'diff -u tests/btrfs/065.out 
>> /home/fdmanana/git/hub/xfstests/results//btrfs/065.out.bad'  to 
>> see the entire diff) Ran: btrfs/065 Failures: btrfs/065 Failed 1
>> of 1 tests
> 
>> root 17:37:41 /home/fdmanana/git/hub/xfstests (up_master_2)> 
>> btrfsck /dev/sdc No valid Btrfs found on /dev/sdc Couldn't open 
>> file system root 17:37:43 /home/fdmanana/git/hub/xfstests 
>> (up_master_2)>
> 
> 
>> With only patches 1 and 2 in the series (or without them) this 
>> never happens (well, at least not in ~50 iterations when running 
>> btrfs/065). And I've never seen this before testing this patch.
> 
>> It also makes generic/251 "never" finish apparently - it didn't 
>> finish here after 9h running, and without any of these patches
>> in the series it used to finish in less than half an hour.
> 
>> Maybe the apparent corruption is due to some interaction with 
>> device replace or scrub, I haven't had the time to 
>> analyze/determine what it could be.
> 
>> thanks
> 
> 
> 
> 
>>> --- fs/btrfs/ctree.h            |  3 ++ fs/btrfs/extent-tree.c 
>>> | 69 ++++++++++++++++++++++++++++++++++++++++++--- 
>>> fs/btrfs/free-space-cache.c | 57 
>>> +++++++++++++++++++++---------------- fs/btrfs/super.c |  2 +-
>>> fs/btrfs/transaction.c      |  2 ++ fs/btrfs/transaction.h
>>> |  2 ++ 6 files changed, 106 insertions(+), 29 deletions(-)
>>> 
>>> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index 
>>> 6f364e1..780acf1 100644 --- a/fs/btrfs/ctree.h +++ 
>>> b/fs/btrfs/ctree.h @@ -3438,6 +3438,8 @@ int 
>>> btrfs_remove_block_group(struct btrfs_trans_handle *trans,
>>> struct btrfs_root *root, u64 group_start, struct extent_map
>>> *em); void btrfs_delete_unused_bgs(struct btrfs_fs_info
>>> *fs_info); +void btrfs_get_block_group_trimming(struct
>>> btrfs_block_group_cache *cache); +void
>>> btrfs_put_block_group_trimming(struct btrfs_block_group_cache
>>> *cache); void btrfs_create_pending_block_groups(struct
>>> btrfs_trans_handle *trans, struct btrfs_root *root); u64 
>>> btrfs_get_alloc_profile(struct btrfs_root *root, int data); @@ 
>>> -4068,6 +4070,7 @@ __printf(5, 6) void
>>> __btrfs_std_error(struct btrfs_fs_info *fs_info, const char
>>> *function, unsigned int line, int errno, const char *fmt,
>>> ...);
>>> 
>>> +const char *btrfs_decode_error(int errno);
>>> 
>>> void __btrfs_abort_transaction(struct btrfs_trans_handle
>>> *trans, struct btrfs_root *root, const char *function, diff
>>> --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index 
>>> fb4dbfc..e53dbe9 100644 --- a/fs/btrfs/extent-tree.c +++ 
>>> b/fs/btrfs/extent-tree.c @@ -6034,20 +6034,19 @@ int 
>>> btrfs_finish_extent_commit(struct btrfs_trans_handle *trans, 
>>> struct btrfs_root *root) { struct btrfs_fs_info *fs_info = 
>>> root->fs_info; +       struct btrfs_block_group_cache 
>>> *block_group, *tmp; +       struct list_head *deleted_bgs;
>>> struct extent_io_tree *unpin; u64 start; u64 end; int ret;
>>> 
>>> -       if (trans->aborted) -               return 0; - if 
>>> (fs_info->pinned_extents == &fs_info->freed_extents[0]) unpin
>>> = &fs_info->freed_extents[1]; else unpin = 
>>> &fs_info->freed_extents[0];
>>> 
>>> -       while (1) { +       while (!trans->aborted) { 
>>> mutex_lock(&fs_info->unused_bg_unpin_mutex); ret = 
>>> find_first_extent_bit(unpin, 0, &start, &end, EXTENT_DIRTY, 
>>> NULL); @@ -6066,6 +6065,34 @@ int 
>>> btrfs_finish_extent_commit(struct btrfs_trans_handle *trans, 
>>> cond_resched(); }
>>> 
>>> +       /* +        * Transaction is finished.  We don't need
>>> the lock anymore.  We +        * do need to clean up the block
>>> groups in case of a transaction +        * abort. +        */
>>> + deleted_bgs = &trans->transaction->deleted_bgs; + 
>>> list_for_each_entry_safe(block_group, tmp, deleted_bgs,
>>> bg_list) { +               u64 trimmed = 0; + +
>>> ret = -EROFS; +               if (!trans->aborted) + ret =
>>> btrfs_discard_extent(root, + block_group->key.objectid, + 
>>> block_group->key.offset, + &trimmed); + + 
>>> list_del_init(&block_group->bg_list); + 
>>> btrfs_put_block_group_trimming(block_group); + 
>>> btrfs_put_block_group(block_group); + +               if (ret)
>>> { +                       const char *errstr = 
>>> btrfs_decode_error(ret); + btrfs_warn(fs_info, +
>>> "Discard failed while removing blockgroup: errno=%d %s\n", + 
>>> ret, errstr); +               } +       } + return 0; }
>>> 
>>> @@ -9845,6 +9872,11 @@ int btrfs_remove_block_group(struct 
>>> btrfs_trans_handle *trans, * currently running transaction
>>> might finish and a new one start, * allowing for new block
>>> groups to be created that can reuse the same * physical device
>>> locations unless we take this special care. +        * +
>>> * There may also be an implicit trim operation if the file
>>> system +        * is mounted with -odiscard. The same
>>> protections must remain + * in place until the extents have
>>> been discarded completely when +        * the transaction
>>> commit has completed. */ remove_em = 
>>> (atomic_read(&block_group->trimming) == 0); /* @@ -9917,8 
>>> +9949,10 @@ void btrfs_delete_unused_bgs(struct btrfs_fs_info 
>>> *fs_info) return;
>>> 
>>> spin_lock(&fs_info->unused_bgs_lock); + while 
>>> (!list_empty(&fs_info->unused_bgs)) { u64 start, end; + int
>>> trimming;
>>> 
>>> block_group = list_first_entry(&fs_info->unused_bgs, struct 
>>> btrfs_block_group_cache, @@ -10016,12 +10050,39 @@ void 
>>> btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info) 
>>> spin_unlock(&block_group->lock);
>>> spin_unlock(&space_info->lock);
>>> 
>>> +               /* DISCARD can flip during remount */ + 
>>> trimming = btrfs_test_opt(root, DISCARD); + +               /* 
>>> Implicit trim during transaction commit. */ +               if 
>>> (trimming) + btrfs_get_block_group_trimming(block_group); + /*
>>> * Btrfs_remove_chunk will abort the transaction if things go * 
>>> horribly wrong. */ ret = btrfs_remove_chunk(trans, root, 
>>> block_group->key.objectid); + +               if (ret) { + if
>>> (trimming) + btrfs_put_block_group_trimming(block_group); + 
>>> goto end_trans; +               } + +               /* + * If
>>> we're not mounted with -odiscard, we can just forget + * about
>>> this block group. Otherwise we'll need to wait + * until
>>> transaction commit to do the actual discard. + */ +
>>> if (trimming) { + WARN_ON(!list_empty(&block_group->bg_list));
>>> + spin_lock(&trans->transaction->deleted_bgs_lock); + 
>>> list_move(&block_group->bg_list, + 
>>> &trans->transaction->deleted_bgs); + 
>>> spin_unlock(&trans->transaction->deleted_bgs_lock); + 
>>> btrfs_get_block_group(block_group); +               }
>>> end_trans: btrfs_end_transaction(trans, root); next: diff
>>> --git a/fs/btrfs/free-space-cache.c
>>> b/fs/btrfs/free-space-cache.c index 9dbe5b5..c79253e 100644 ---
>>> a/fs/btrfs/free-space-cache.c +++ b/fs/btrfs/free-space-cache.c
>>> @@ -3274,35 +3274,23 @@ next: return ret; }
>>> 
>>> -int btrfs_trim_block_group(struct btrfs_block_group_cache 
>>> *block_group, -                          u64 *trimmed, u64
>>> start, u64 end, u64 minlen) +void
>>> btrfs_get_block_group_trimming(struct btrfs_block_group_cache
>>> *cache) { -       int ret; + atomic_inc(&cache->trimming); +}
>>> 
>>> -       *trimmed = 0; +void
>>> btrfs_put_block_group_trimming(struct btrfs_block_group_cache
>>> *block_group) +{ +       struct extent_map_tree *em_tree; +
>>> struct extent_map *em; + bool cleanup;
>>> 
>>> spin_lock(&block_group->lock); -       if
>>> (block_group->removed) { -
>>> spin_unlock(&block_group->lock); - return 0; -       } -
>>> atomic_inc(&block_group->trimming); + cleanup =
>>> (atomic_dec_and_test(&block_group->trimming) && + 
>>> block_group->removed); spin_unlock(&block_group->lock);
>>> 
>>> -       ret = trim_no_bitmap(block_group, trimmed, start, end, 
>>> minlen); -       if (ret) -               goto out; - -
>>> ret = trim_bitmaps(block_group, trimmed, start, end, minlen);
>>> -out: - spin_lock(&block_group->lock); -       if 
>>> (atomic_dec_and_test(&block_group->trimming) && - 
>>> block_group->removed) { -               struct extent_map_tree 
>>> *em_tree; -               struct extent_map *em; - - 
>>> spin_unlock(&block_group->lock); - +       if (cleanup) { 
>>> lock_chunks(block_group->fs_info->chunk_root); em_tree = 
>>> &block_group->fs_info->mapping_tree.map_tree; 
>>> write_lock(&em_tree->lock); @@ -3326,10 +3314,31 @@ out: *
>>> this block group have left 1 entry each one. Free them. */ 
>>> __btrfs_remove_free_space_cache(block_group->free_space_ctl);
>>> - } else { +       } +} + +int btrfs_trim_block_group(struct 
>>> btrfs_block_group_cache *block_group, + u64 *trimmed, u64
>>> start, u64 end, u64 minlen) +{ +       int ret; + +
>>> *trimmed = 0; + + spin_lock(&block_group->lock); +       if
>>> (block_group->removed) { spin_unlock(&block_group->lock); +
>>> return 0; } + btrfs_get_block_group_trimming(block_group); + 
>>> spin_unlock(&block_group->lock); + +       ret = 
>>> trim_no_bitmap(block_group, trimmed, start, end, minlen); + if
>>> (ret) +               goto out;
>>> 
>>> +       ret = trim_bitmaps(block_group, trimmed, start, end, 
>>> minlen); +out: + btrfs_put_block_group_trimming(block_group);
>>> return ret; }
>>> 
>>> diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c index 
>>> 2ccd8d4..a80da03 100644 --- a/fs/btrfs/super.c +++ 
>>> b/fs/btrfs/super.c @@ -69,7 +69,7 @@ static struct 
>>> file_system_type btrfs_fs_type;
>>> 
>>> static int btrfs_remount(struct super_block *sb, int *flags,
>>> char *data);
>>> 
>>> -static const char *btrfs_decode_error(int errno) +const char 
>>> *btrfs_decode_error(int errno) { char *errstr = "unknown";
>>> 
>>> diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c 
>>> index 5628e25..2005262 100644 --- a/fs/btrfs/transaction.c +++ 
>>> b/fs/btrfs/transaction.c @@ -256,6 +256,8 @@ loop: 
>>> mutex_init(&cur_trans->cache_write_mutex); 
>>> cur_trans->num_dirty_bgs = 0; 
>>> spin_lock_init(&cur_trans->dirty_bgs_lock); + 
>>> INIT_LIST_HEAD(&cur_trans->deleted_bgs); + 
>>> spin_lock_init(&cur_trans->deleted_bgs_lock); 
>>> list_add_tail(&cur_trans->list, &fs_info->trans_list); 
>>> extent_io_tree_init(&cur_trans->dirty_pages, 
>>> fs_info->btree_inode->i_mapping); diff --git 
>>> a/fs/btrfs/transaction.h b/fs/btrfs/transaction.h index 
>>> 0b24755..14325f2 100644 --- a/fs/btrfs/transaction.h +++ 
>>> b/fs/btrfs/transaction.h @@ -74,6 +74,8 @@ struct 
>>> btrfs_transaction { */ struct mutex cache_write_mutex;
>>> spinlock_t dirty_bgs_lock; +       struct list_head
>>> deleted_bgs; + spinlock_t deleted_bgs_lock; struct
>>> btrfs_delayed_ref_root delayed_refs; int aborted; int
>>> dirty_bg_run; -- 1.8.4.5
>>> 
>>> -- To unsubscribe from this list: send the line "unsubscribe 
>>> linux-btrfs" in the body of a message to 
>>> majordomo@vger.kernel.org More majordomo info at 
>>> http://vger.kernel.org/majordomo-info.html
> 
> 
> 
> 
> 
> -- To unsubscribe from this list: send the line "unsubscribe
> linux-btrfs" in the body of a message to majordomo@vger.kernel.org 
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

- -- 
Jeff Mahoney
SUSE Labs
-----BEGIN PGP SIGNATURE-----
Version: GnuPG/MacGPG2 v2.0.22 (Darwin)

iQIcBAEBAgAGBQJVd39VAAoJEB57S2MheeWyROkP/3yZKTetC+Ael7NKpQ8TLk6I
usu3EWAz3mkg1Xmd/Isb9HLrCcxFwl4RPE2DORyXpZOF8Vx/2TfwZ9GWeziQsSUl
ArTemnfRS5eZ1Vwh2MiY0ONJrR7OegJwtJ2z2p2w5XjvLj7v/0iBBvBrdWo4Zasl
l4n6othn771D3744sm69WpmUeU+KsdQf8UgSlfpe2vDMounFrPQE91b1HTCrnM/S
w0ccLNju2HWnEqP3GuukFyi3UtylEz8NfhNzWbSyFKlc5Z7LvT5JTU1AcT5U8vSG
G5aTDEZnAuUVx74o7+lL4/kE855vyeHc3Lag6Uc86ZMOjrR8AW2daeLFGhFqxGby
mDPx4wnOBx6kHuZyAFVZk3SXOUdK6jGWIwLnkJQXCNFry8v0pFpt+e+tPrwMYsnp
tBhH9sQeHp9F0U+RPsG18glgpN/xCv9/CEtoOteVZfSlRlv7ktpbYuEITd4y9p0O
x62NTi7oJEqGugsVPV3Y8gsrCW5oQIAA/YaoQFEP5hOqacsS/OV5bbdCvn7qD4Qw
mtF0gNNzvXLy9fIGw/P2NY2fwHHUSVZ21cSBKNxMn2KD9eMrXSnw+WdWeLTfn4v6
Lv7tdOSSd1JCSNUbfLMBXGRKm5E4nmn9idi69R0zyJ61P7fvpqxRyqtxqIpsHups
WKo867u3URagNmYuh6ld
=PZYC
-----END PGP SIGNATURE-----
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jeff Mahoney June 10, 2015, 10:52 p.m. UTC | #4
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 6/9/15 8:05 PM, Jeff Mahoney wrote:
> On 6/9/15 4:48 PM, Jeff Mahoney wrote:
>> On 6/8/15 10:12 AM, Filipe David Manana wrote:
>>> Running the following in a loop until it returns failure ($?
>>> != 0) triggers an apparent corruption issue most of the time:
> 
>> Ok, I didn't have my test environment set up to do the
>> multidevice tests by default, and definitely didn't have it set
>> up to do 065, since that requires 5+ devices of identical size.
>> Now that I do have that set up, I see this:
> 
>> 8,37   2     1442    22.727796582 15400  D   D 419434496 + 8192 
>> [umoun t]
> 
>> 419434496 is the start of /dev/sdc5 and we're blowing away 4 MB. 
>> That happens to coincide with the system (single) block group.
>> In my testing, I found (IIRC) that system (single) and metadata 
>> (single) were always in the unused list but were getting skipped
>> by the cleaner thread.  It looks like the cleanup block groups
>> on close_ctree is misbehaving.  Whatever condition that was
>> causing those to be skipped must be unset for 065.  It wasn't a
>> problem during normal runtime, but now that it's called in
>> close_ctree, it's introducing a problem.  I'll start digging.
> 
>> Thanks for the review!
> 
> <...>-5472  [014] ....   390.430881: btrfs_chunk_free: root = 
> 3(CHUNK_TREE), offset = 0, size = 4194304, num_stripes = 1, 
> sub_stripes = 0, type = SYSTEM
> 
> That'd do it. If I do a btrfs-debug-tree on the file system
> without discard, it works fine because the superblock doesn't get
> discarded. That 0-4MB chunk _is_ released and unused.

I have a patch up and running that skips the superblock (and mirrors)
during discard.  The superblocks are a special case in how we handle
blocks on the file system.  They exist in the middle of and completely
separate from the block groups.  The blocks are claimed when we read
in the block group by pinning them, but they aren't actually reserved
on disk via a proper extent.

My patch adds a filter to btrfs_issue_discard so that we skip the
superblocks.  Since the superblock locations are per-physical device
and are set whether or not there's a block group on top of them, it
made sense to put the filter there so that both fstrim and -odiscard
skip them without any other special code.

I'm running xfstests now and will post once that's completed.

- -Jeff

- -- 
Jeff Mahoney
SUSE Labs
-----BEGIN PGP SIGNATURE-----
Version: GnuPG/MacGPG2 v2.0.19 (Darwin)

iQIcBAEBAgAGBQJVeL+oAAoJEB57S2MheeWyYjIP/iefOYjXc0iYHl3aZLZXz/o0
2a4RcE1AntLCZMs5t6m2LLXfScVyYR84f26bQNdJOvIpMYj7ChMpKqyEQ582eGwD
DDAFwi1zGLATdLIDJSkL3U2YZfklTIoX8G3KJgSUA/rhiRV3bl37HnGqxpGIIlym
O2rNiMmF4iqdbF+e0ZpNmLWQ3wwmwDRyHM67nyIboe0Lui41le2EhLng4OFhJ/Qr
Ldnh58cPZs+hvOtW641SBgxvtt8NEDeVjbfJijSY/KgloAcbFq1hM7AKs51+HjPA
zzvHCvhLh5WCcorlpLz/a+Iy9QOx8Tbw5Ef6zOlvWQqPPEj/Q+PROeL6BK+KIDUh
BvaStlYCA68JmyLeSN1EpXvS2MNILVEWGqxcqC8F9JpgYC4/SzLQPAXLgL63UtQm
XQ/n3rg7P4BMegTfEqh57ExhDUUsorN29/2m4kHu1I19OQzIY0a3QTbJYBH1jTvB
nfBivkC6kyIc6aZBrOHY7M0mIzeAq+mCqbYvMWeYVruK+3iW8Vc9PKlE+RgvFvFP
JX7+v/N9Jl2GBObzGPBLUpxdvaU518sKMF/knZfH+8uVKLlDlWFDsH9ZroKp5TBj
GD0u03qiKESq3Z4MEqizHt0+FRwBbWmIEG+5YcNcKrL0gc6ljHoEn6Xp3SPtZkxn
w6Wo0XZyp+rIkxb5peBa
=Iv+E
-----END PGP SIGNATURE-----
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 6f364e1..780acf1 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -3438,6 +3438,8 @@  int btrfs_remove_block_group(struct btrfs_trans_handle *trans,
 			     struct btrfs_root *root, u64 group_start,
 			     struct extent_map *em);
 void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info);
+void btrfs_get_block_group_trimming(struct btrfs_block_group_cache *cache);
+void btrfs_put_block_group_trimming(struct btrfs_block_group_cache *cache);
 void btrfs_create_pending_block_groups(struct btrfs_trans_handle *trans,
 				       struct btrfs_root *root);
 u64 btrfs_get_alloc_profile(struct btrfs_root *root, int data);
@@ -4068,6 +4070,7 @@  __printf(5, 6)
 void __btrfs_std_error(struct btrfs_fs_info *fs_info, const char *function,
 		     unsigned int line, int errno, const char *fmt, ...);
 
+const char *btrfs_decode_error(int errno);
 
 void __btrfs_abort_transaction(struct btrfs_trans_handle *trans,
 			       struct btrfs_root *root, const char *function,
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index fb4dbfc..e53dbe9 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -6034,20 +6034,19 @@  int btrfs_finish_extent_commit(struct btrfs_trans_handle *trans,
 			       struct btrfs_root *root)
 {
 	struct btrfs_fs_info *fs_info = root->fs_info;
+	struct btrfs_block_group_cache *block_group, *tmp;
+	struct list_head *deleted_bgs;
 	struct extent_io_tree *unpin;
 	u64 start;
 	u64 end;
 	int ret;
 
-	if (trans->aborted)
-		return 0;
-
 	if (fs_info->pinned_extents == &fs_info->freed_extents[0])
 		unpin = &fs_info->freed_extents[1];
 	else
 		unpin = &fs_info->freed_extents[0];
 
-	while (1) {
+	while (!trans->aborted) {
 		mutex_lock(&fs_info->unused_bg_unpin_mutex);
 		ret = find_first_extent_bit(unpin, 0, &start, &end,
 					    EXTENT_DIRTY, NULL);
@@ -6066,6 +6065,34 @@  int btrfs_finish_extent_commit(struct btrfs_trans_handle *trans,
 		cond_resched();
 	}
 
+	/*
+	 * Transaction is finished.  We don't need the lock anymore.  We
+	 * do need to clean up the block groups in case of a transaction
+	 * abort.
+	 */
+	deleted_bgs = &trans->transaction->deleted_bgs;
+	list_for_each_entry_safe(block_group, tmp, deleted_bgs, bg_list) {
+		u64 trimmed = 0;
+
+		ret = -EROFS;
+		if (!trans->aborted)
+			ret = btrfs_discard_extent(root,
+						   block_group->key.objectid,
+						   block_group->key.offset,
+						   &trimmed);
+
+		list_del_init(&block_group->bg_list);
+		btrfs_put_block_group_trimming(block_group);
+		btrfs_put_block_group(block_group);
+
+		if (ret) {
+			const char *errstr = btrfs_decode_error(ret);
+			btrfs_warn(fs_info,
+				   "Discard failed while removing blockgroup: errno=%d %s\n",
+				   ret, errstr);
+		}
+	}
+
 	return 0;
 }
 
@@ -9845,6 +9872,11 @@  int btrfs_remove_block_group(struct btrfs_trans_handle *trans,
 	 * currently running transaction might finish and a new one start,
 	 * allowing for new block groups to be created that can reuse the same
 	 * physical device locations unless we take this special care.
+	 *
+	 * There may also be an implicit trim operation if the file system
+	 * is mounted with -odiscard. The same protections must remain
+	 * in place until the extents have been discarded completely when
+	 * the transaction commit has completed.
 	 */
 	remove_em = (atomic_read(&block_group->trimming) == 0);
 	/*
@@ -9917,8 +9949,10 @@  void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info)
 		return;
 
 	spin_lock(&fs_info->unused_bgs_lock);
+
 	while (!list_empty(&fs_info->unused_bgs)) {
 		u64 start, end;
+		int trimming;
 
 		block_group = list_first_entry(&fs_info->unused_bgs,
 					       struct btrfs_block_group_cache,
@@ -10016,12 +10050,39 @@  void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info)
 		spin_unlock(&block_group->lock);
 		spin_unlock(&space_info->lock);
 
+		/* DISCARD can flip during remount */
+		trimming = btrfs_test_opt(root, DISCARD);
+
+		/* Implicit trim during transaction commit. */
+		if (trimming)
+			btrfs_get_block_group_trimming(block_group);
+
 		/*
 		 * Btrfs_remove_chunk will abort the transaction if things go
 		 * horribly wrong.
 		 */
 		ret = btrfs_remove_chunk(trans, root,
 					 block_group->key.objectid);
+
+		if (ret) {
+			if (trimming)
+				btrfs_put_block_group_trimming(block_group);
+			goto end_trans;
+		}
+
+		/*
+		 * If we're not mounted with -odiscard, we can just forget
+		 * about this block group. Otherwise we'll need to wait
+		 * until transaction commit to do the actual discard.
+		 */
+		if (trimming) {
+			WARN_ON(!list_empty(&block_group->bg_list));
+			spin_lock(&trans->transaction->deleted_bgs_lock);
+			list_move(&block_group->bg_list,
+				  &trans->transaction->deleted_bgs);
+			spin_unlock(&trans->transaction->deleted_bgs_lock);
+			btrfs_get_block_group(block_group);
+		}
 end_trans:
 		btrfs_end_transaction(trans, root);
 next:
diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
index 9dbe5b5..c79253e 100644
--- a/fs/btrfs/free-space-cache.c
+++ b/fs/btrfs/free-space-cache.c
@@ -3274,35 +3274,23 @@  next:
 	return ret;
 }
 
-int btrfs_trim_block_group(struct btrfs_block_group_cache *block_group,
-			   u64 *trimmed, u64 start, u64 end, u64 minlen)
+void btrfs_get_block_group_trimming(struct btrfs_block_group_cache *cache)
 {
-	int ret;
+	atomic_inc(&cache->trimming);
+}
 
-	*trimmed = 0;
+void btrfs_put_block_group_trimming(struct btrfs_block_group_cache *block_group)
+{
+	struct extent_map_tree *em_tree;
+	struct extent_map *em;
+	bool cleanup;
 
 	spin_lock(&block_group->lock);
-	if (block_group->removed) {
-		spin_unlock(&block_group->lock);
-		return 0;
-	}
-	atomic_inc(&block_group->trimming);
+	cleanup = (atomic_dec_and_test(&block_group->trimming) &&
+		   block_group->removed);
 	spin_unlock(&block_group->lock);
 
-	ret = trim_no_bitmap(block_group, trimmed, start, end, minlen);
-	if (ret)
-		goto out;
-
-	ret = trim_bitmaps(block_group, trimmed, start, end, minlen);
-out:
-	spin_lock(&block_group->lock);
-	if (atomic_dec_and_test(&block_group->trimming) &&
-	    block_group->removed) {
-		struct extent_map_tree *em_tree;
-		struct extent_map *em;
-
-		spin_unlock(&block_group->lock);
-
+	if (cleanup) {
 		lock_chunks(block_group->fs_info->chunk_root);
 		em_tree = &block_group->fs_info->mapping_tree.map_tree;
 		write_lock(&em_tree->lock);
@@ -3326,10 +3314,31 @@  out:
 		 * this block group have left 1 entry each one. Free them.
 		 */
 		__btrfs_remove_free_space_cache(block_group->free_space_ctl);
-	} else {
+	}
+}
+
+int btrfs_trim_block_group(struct btrfs_block_group_cache *block_group,
+			   u64 *trimmed, u64 start, u64 end, u64 minlen)
+{
+	int ret;
+
+	*trimmed = 0;
+
+	spin_lock(&block_group->lock);
+	if (block_group->removed) {
 		spin_unlock(&block_group->lock);
+		return 0;
 	}
+	btrfs_get_block_group_trimming(block_group);
+	spin_unlock(&block_group->lock);
+
+	ret = trim_no_bitmap(block_group, trimmed, start, end, minlen);
+	if (ret)
+		goto out;
 
+	ret = trim_bitmaps(block_group, trimmed, start, end, minlen);
+out:
+	btrfs_put_block_group_trimming(block_group);
 	return ret;
 }
 
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 2ccd8d4..a80da03 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -69,7 +69,7 @@  static struct file_system_type btrfs_fs_type;
 
 static int btrfs_remount(struct super_block *sb, int *flags, char *data);
 
-static const char *btrfs_decode_error(int errno)
+const char *btrfs_decode_error(int errno)
 {
 	char *errstr = "unknown";
 
diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index 5628e25..2005262 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -256,6 +256,8 @@  loop:
 	mutex_init(&cur_trans->cache_write_mutex);
 	cur_trans->num_dirty_bgs = 0;
 	spin_lock_init(&cur_trans->dirty_bgs_lock);
+	INIT_LIST_HEAD(&cur_trans->deleted_bgs);
+	spin_lock_init(&cur_trans->deleted_bgs_lock);
 	list_add_tail(&cur_trans->list, &fs_info->trans_list);
 	extent_io_tree_init(&cur_trans->dirty_pages,
 			     fs_info->btree_inode->i_mapping);
diff --git a/fs/btrfs/transaction.h b/fs/btrfs/transaction.h
index 0b24755..14325f2 100644
--- a/fs/btrfs/transaction.h
+++ b/fs/btrfs/transaction.h
@@ -74,6 +74,8 @@  struct btrfs_transaction {
 	 */
 	struct mutex cache_write_mutex;
 	spinlock_t dirty_bgs_lock;
+	struct list_head deleted_bgs;
+	spinlock_t deleted_bgs_lock;
 	struct btrfs_delayed_ref_root delayed_refs;
 	int aborted;
 	int dirty_bg_run;