diff mbox

[2/2] btrfs: Fix lost-data-profile caused by balance bg

Message ID CAL3q7H6uShZiaj7wbVSopDWJ3gWDBypzcOuwGzM32oj8N2bFiA@mail.gmail.com (mailing list archive)
State Superseded
Headers show

Commit Message

Filipe Manana Sept. 30, 2015, 7:41 a.m. UTC
On Wed, Sep 30, 2015 at 5:20 AM, Zhao Lei <zhaolei@cn.fujitsu.com> wrote:
> Hi, Filipe Manana
>
> Thanks for reviewing.
>
>> -----Original Message-----
>> From: Filipe Manana [mailto:fdmanana@gmail.com]
>> Sent: Tuesday, September 29, 2015 11:48 PM
>> To: Zhao Lei <zhaolei@cn.fujitsu.com>
>> Cc: linux-btrfs@vger.kernel.org
>> Subject: Re: [PATCH 2/2] btrfs: Fix lost-data-profile caused by balance bg
>>
>> On Tue, Sep 29, 2015 at 2:51 PM, Zhao Lei <zhaolei@cn.fujitsu.com> wrote:
>> > Reproduce:
>> >  (In integration-4.3 branch)
>> >
>> >  TEST_DEV=(/dev/vdg /dev/vdh)
>> >  TEST_DIR=/mnt/tmp
>> >
>> >  umount "$TEST_DEV" >/dev/null
>> >  mkfs.btrfs -f -d raid1 "${TEST_DEV[@]}"
>> >
>> >  mount -o nospace_cache "$TEST_DEV" "$TEST_DIR"
>> >  btrfs balance start -dusage=0 $TEST_DIR  btrfs filesystem usage
>> > $TEST_DIR
>> >
>> >  dd if=/dev/zero of="$TEST_DIR"/file count=100  btrfs filesystem usage
>> > $TEST_DIR
>> >
>> > Result:
>> >  We can see "no data chunk" in first "btrfs filesystem usage":
>> >  # btrfs filesystem usage $TEST_DIR
>> >  Overall:
>> >     ...
>> >  Metadata,single: Size:8.00MiB, Used:0.00B
>> >     /dev/vdg        8.00MiB
>> >  Metadata,RAID1: Size:122.88MiB, Used:112.00KiB
>> >     /dev/vdg      122.88MiB
>> >     /dev/vdh      122.88MiB
>> >  System,single: Size:4.00MiB, Used:0.00B
>> >     /dev/vdg        4.00MiB
>> >  System,RAID1: Size:8.00MiB, Used:16.00KiB
>> >     /dev/vdg        8.00MiB
>> >     /dev/vdh        8.00MiB
>> >  Unallocated:
>> >     /dev/vdg        1.06GiB
>> >     /dev/vdh        1.07GiB
>> >
>> >  And "data chunks changed from raid1 to single" in second  "btrfs
>> > filesystem usage":
>> >  # btrfs filesystem usage $TEST_DIR
>> >  Overall:
>> >     ...
>> >  Data,single: Size:256.00MiB, Used:0.00B
>> >     /dev/vdh      256.00MiB
>> >  Metadata,single: Size:8.00MiB, Used:0.00B
>> >     /dev/vdg        8.00MiB
>> >  Metadata,RAID1: Size:122.88MiB, Used:112.00KiB
>> >     /dev/vdg      122.88MiB
>> >     /dev/vdh      122.88MiB
>> >  System,single: Size:4.00MiB, Used:0.00B
>> >     /dev/vdg        4.00MiB
>> >  System,RAID1: Size:8.00MiB, Used:16.00KiB
>> >     /dev/vdg        8.00MiB
>> >     /dev/vdh        8.00MiB
>> >  Unallocated:
>> >     /dev/vdg        1.06GiB
>> >     /dev/vdh      841.92MiB
>> >
>> > Reason:
>> >  btrfs balance delete last data chunk in case of no data in  the
>> > filesystem, then we can see "no data chunk" by "fi usage"
>> >  command.
>> >
>> >  And when we do write operation to fs, the only available data
>> > profile is 0x0, result is all new chunks are allocated single type.
>> >
>> > Fix:
>> >  Allocate a data chunk explicitly in balance operation, to reserve  at
>> > least one data chunk in the filesystem.
>>
>> Allocate a data chunk explicitly to ensure we don't lose the raid profile for data.
>>
>
> Thanks, will change in v2.
>
>> >
>> > Test:
>> >  Test by above script, and confirmed the logic by debug output.
>> >
>> > Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
>> > ---
>> >  fs/btrfs/volumes.c | 19 +++++++++++++++++++
>> >  1 file changed, 19 insertions(+)
>> >
>> > diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c index
>> > 6fc73586..3d5e41e 100644
>> > --- a/fs/btrfs/volumes.c
>> > +++ b/fs/btrfs/volumes.c
>> > @@ -3277,6 +3277,7 @@ static int __btrfs_balance(struct btrfs_fs_info
>> *fs_info)
>> >         u64 limit_data = bctl->data.limit;
>> >         u64 limit_meta = bctl->meta.limit;
>> >         u64 limit_sys = bctl->sys.limit;
>> > +       int chunk_reserved = 0;
>> >
>> >         /* step one make some room on all the devices */
>> >         devices = &fs_info->fs_devices->devices; @@ -3387,6 +3388,24
>> > @@ again:
>> >                         goto loop;
>> >                 }
>> >
>> > +               if (!chunk_reserved) {
>> > +                       trans = btrfs_start_transaction(chunk_root, 0);
>> > +                       if (IS_ERR(trans)) {
>> > +
>> mutex_unlock(&fs_info->delete_unused_bgs_mutex);
>> > +                               ret = PTR_ERR(trans);
>> > +                               goto error;
>> > +                       }
>> > +
>> > +                       ret = btrfs_force_chunk_alloc(trans,
>> > + chunk_root, 1);
>>
>> Can we please use the symbol BTRFS_BLOCK_GROUP_DATA instead of 1?
>>
> Thanks, will change in v2.
>
>
>> > +                       if (ret < 0) {
>> > +
>> mutex_unlock(&fs_info->delete_unused_bgs_mutex);
>> > +                               goto error;
>> > +                       }
>> > +
>> > +                       btrfs_end_transaction(trans, chunk_root);
>> > +                       chunk_reserved = 1;
>> > +               }
>>
>> Can we do this logic only if the block group is a data one? I.e. no point allocating
>> a data block group if our balance only touches metadata ones (due to some
>> filter for e.g.).
>>
> Thanks for point out it, will change in v2.

I find it somewhat awkward that we always allocate a new data block
group no matter what. Some balance operations that used to succeed in
the past may now fail with -ENOSPC for example...

How about making this simpler and not so special for data block
groups, like the following (compile tested only):

        extent_root = root->fs_info->extent_root;
@@ -2803,6 +2805,23 @@ static int btrfs_relocate_chunk(struct
btrfs_root *root, u64 chunk_offset)
        if (ret)
                return ret;

+       bg = btrfs_lookup_block_group(root->fs_info, chunk_offset);
+       ASSERT(bg);
+       down_read(&bg->space_info->groups_sem);
+       /*
+        * Do not remove the last block group of a kind to prevent losing
+        * raid profile information for future chunk allocations.
+        */
+       if (bg->list.next == bg->list.prev)
+               remove = false;
+       up_read(&bg->space_info->groups_sem);
+       if (!remove)
+               btrfs_dec_block_group_ro(extent_root, bg);
+       btrfs_put_block_group(bg);
+
+       if (!remove)
+               return 0;
+
        trans = btrfs_start_transaction(root, 0);
        if (IS_ERR(trans)) {
                ret = PTR_ERR(trans);

(also at https://friendpaste.com/5IeAIIzBv3oKhureKfvjwm)

thanks


>
>> Also, can't this be ineffective when the data block group we allocate ends up
>> with a key  in the chunk_root that is lower than the key we found in the
>> current iteration? I.e., key found in current iteration has object id B, we
>> allocate a new block group which gets a  key with object id A, where A < B and
>> the next iteration of our loop sees key A, deletes the respective block group A if
>> it's empty. In the end we have no data block groups, assuming that before A
>> there are no other non-empty data block groups.
>> Your example works because there's no free space before the offset where the
>> chunk starts in the device.
>>
> New chunk will always use increased logic address, even if there are "hole" before,
> so A's logic address will always >B.
> And current code of balance also use this feature to avoid balance new-created
> chunks(which was created by balance operation itself).
>
> Thanks
> Zhaolei
>
>> thanks
>>
>> > +
>> >                 ret = btrfs_relocate_chunk(chunk_root,
>> >                                            found_key.offset);
>> >                 mutex_unlock(&fs_info->delete_unused_bgs_mutex);
>> > --
>> > 1.8.5.1
>> >
>> > --
>> > To unsubscribe from this list: send the line "unsubscribe linux-btrfs"
>> > in the body of a message to majordomo@vger.kernel.org More majordomo
>> > info at  http://vger.kernel.org/majordomo-info.html
>>
>>
>>
>> --
>> Filipe David Manana,
>>
>> "Reasonable men adapt themselves to the world.
>>  Unreasonable men adapt the world to themselves.
>>  That's why all progress depends on unreasonable men."
>

Comments

Zhaolei Sept. 30, 2015, 8:26 a.m. UTC | #1
Hi, Filipe Manana

> -----Original Message-----
> From: Filipe Manana [mailto:fdmanana@gmail.com]
> Sent: Wednesday, September 30, 2015 3:41 PM
> To: Zhao Lei <zhaolei@cn.fujitsu.com>
> Cc: linux-btrfs@vger.kernel.org
> Subject: Re: [PATCH 2/2] btrfs: Fix lost-data-profile caused by balance bg
> 
> On Wed, Sep 30, 2015 at 5:20 AM, Zhao Lei <zhaolei@cn.fujitsu.com> wrote:
> > Hi, Filipe Manana
> >
> > Thanks for reviewing.
> >
> >> -----Original Message-----
> >> From: Filipe Manana [mailto:fdmanana@gmail.com]
> >> Sent: Tuesday, September 29, 2015 11:48 PM
> >> To: Zhao Lei <zhaolei@cn.fujitsu.com>
> >> Cc: linux-btrfs@vger.kernel.org
> >> Subject: Re: [PATCH 2/2] btrfs: Fix lost-data-profile caused by
> >> balance bg
> >>
> >> On Tue, Sep 29, 2015 at 2:51 PM, Zhao Lei <zhaolei@cn.fujitsu.com> wrote:
> >> > Reproduce:
> >> >  (In integration-4.3 branch)
> >> >
> >> >  TEST_DEV=(/dev/vdg /dev/vdh)
> >> >  TEST_DIR=/mnt/tmp
> >> >
> >> >  umount "$TEST_DEV" >/dev/null
> >> >  mkfs.btrfs -f -d raid1 "${TEST_DEV[@]}"
> >> >
> >> >  mount -o nospace_cache "$TEST_DEV" "$TEST_DIR"
> >> >  btrfs balance start -dusage=0 $TEST_DIR  btrfs filesystem usage
> >> > $TEST_DIR
> >> >
> >> >  dd if=/dev/zero of="$TEST_DIR"/file count=100  btrfs filesystem
> >> > usage $TEST_DIR
> >> >
> >> > Result:
> >> >  We can see "no data chunk" in first "btrfs filesystem usage":
> >> >  # btrfs filesystem usage $TEST_DIR
> >> >  Overall:
> >> >     ...
> >> >  Metadata,single: Size:8.00MiB, Used:0.00B
> >> >     /dev/vdg        8.00MiB
> >> >  Metadata,RAID1: Size:122.88MiB, Used:112.00KiB
> >> >     /dev/vdg      122.88MiB
> >> >     /dev/vdh      122.88MiB
> >> >  System,single: Size:4.00MiB, Used:0.00B
> >> >     /dev/vdg        4.00MiB
> >> >  System,RAID1: Size:8.00MiB, Used:16.00KiB
> >> >     /dev/vdg        8.00MiB
> >> >     /dev/vdh        8.00MiB
> >> >  Unallocated:
> >> >     /dev/vdg        1.06GiB
> >> >     /dev/vdh        1.07GiB
> >> >
> >> >  And "data chunks changed from raid1 to single" in second  "btrfs
> >> > filesystem usage":
> >> >  # btrfs filesystem usage $TEST_DIR
> >> >  Overall:
> >> >     ...
> >> >  Data,single: Size:256.00MiB, Used:0.00B
> >> >     /dev/vdh      256.00MiB
> >> >  Metadata,single: Size:8.00MiB, Used:0.00B
> >> >     /dev/vdg        8.00MiB
> >> >  Metadata,RAID1: Size:122.88MiB, Used:112.00KiB
> >> >     /dev/vdg      122.88MiB
> >> >     /dev/vdh      122.88MiB
> >> >  System,single: Size:4.00MiB, Used:0.00B
> >> >     /dev/vdg        4.00MiB
> >> >  System,RAID1: Size:8.00MiB, Used:16.00KiB
> >> >     /dev/vdg        8.00MiB
> >> >     /dev/vdh        8.00MiB
> >> >  Unallocated:
> >> >     /dev/vdg        1.06GiB
> >> >     /dev/vdh      841.92MiB
> >> >
> >> > Reason:
> >> >  btrfs balance delete last data chunk in case of no data in  the
> >> > filesystem, then we can see "no data chunk" by "fi usage"
> >> >  command.
> >> >
> >> >  And when we do write operation to fs, the only available data
> >> > profile is 0x0, result is all new chunks are allocated single type.
> >> >
> >> > Fix:
> >> >  Allocate a data chunk explicitly in balance operation, to reserve
> >> > at least one data chunk in the filesystem.
> >>
> >> Allocate a data chunk explicitly to ensure we don't lose the raid profile for
> data.
> >>
> >
> > Thanks, will change in v2.
> >
> >> >
> >> > Test:
> >> >  Test by above script, and confirmed the logic by debug output.
> >> >
> >> > Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
> >> > ---
> >> >  fs/btrfs/volumes.c | 19 +++++++++++++++++++
> >> >  1 file changed, 19 insertions(+)
> >> >
> >> > diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c index
> >> > 6fc73586..3d5e41e 100644
> >> > --- a/fs/btrfs/volumes.c
> >> > +++ b/fs/btrfs/volumes.c
> >> > @@ -3277,6 +3277,7 @@ static int __btrfs_balance(struct
> >> > btrfs_fs_info
> >> *fs_info)
> >> >         u64 limit_data = bctl->data.limit;
> >> >         u64 limit_meta = bctl->meta.limit;
> >> >         u64 limit_sys = bctl->sys.limit;
> >> > +       int chunk_reserved = 0;
> >> >
> >> >         /* step one make some room on all the devices */
> >> >         devices = &fs_info->fs_devices->devices; @@ -3387,6
> >> > +3388,24 @@ again:
> >> >                         goto loop;
> >> >                 }
> >> >
> >> > +               if (!chunk_reserved) {
> >> > +                       trans = btrfs_start_transaction(chunk_root,
> 0);
> >> > +                       if (IS_ERR(trans)) {
> >> > +
> >> mutex_unlock(&fs_info->delete_unused_bgs_mutex);
> >> > +                               ret = PTR_ERR(trans);
> >> > +                               goto error;
> >> > +                       }
> >> > +
> >> > +                       ret = btrfs_force_chunk_alloc(trans,
> >> > + chunk_root, 1);
> >>
> >> Can we please use the symbol BTRFS_BLOCK_GROUP_DATA instead of 1?
> >>
> > Thanks, will change in v2.
> >
> >
> >> > +                       if (ret < 0) {
> >> > +
> >> mutex_unlock(&fs_info->delete_unused_bgs_mutex);
> >> > +                               goto error;
> >> > +                       }
> >> > +
> >> > +                       btrfs_end_transaction(trans, chunk_root);
> >> > +                       chunk_reserved = 1;
> >> > +               }
> >>
> >> Can we do this logic only if the block group is a data one? I.e. no
> >> point allocating a data block group if our balance only touches
> >> metadata ones (due to some filter for e.g.).
> >>
> > Thanks for point out it, will change in v2.
> 
> I find it somewhat awkward that we always allocate a new data block group no
> matter what. Some balance operations that used to succeed in the past may
> now fail with -ENOSPC for example...
> 
> How about making this simpler and not so special for data block groups, like the
> following (compile tested only):
> 
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c index 644e070..067b1eb
> 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -2774,6 +2774,8 @@ static int btrfs_relocate_chunk(struct btrfs_root
> *root, u64 chunk_offset)
>         struct btrfs_root *extent_root;
>         struct btrfs_trans_handle *trans;
>         int ret;
> +       struct btrfs_block_group_cache *bg;
> +       bool remove = true;
> 
>         root = root->fs_info->chunk_root;
>         extent_root = root->fs_info->extent_root; @@ -2803,6 +2805,23
> @@ static int btrfs_relocate_chunk(struct btrfs_root *root, u64 chunk_offset)
>         if (ret)
>                 return ret;
> 
> +       bg = btrfs_lookup_block_group(root->fs_info, chunk_offset);
> +       ASSERT(bg);
> +       down_read(&bg->space_info->groups_sem);
> +       /*
> +        * Do not remove the last block group of a kind to prevent losing
> +        * raid profile information for future chunk allocations.
> +        */
> +       if (bg->list.next == bg->list.prev)
> +               remove = false;
> +       up_read(&bg->space_info->groups_sem);
> +       if (!remove)
> +               btrfs_dec_block_group_ro(extent_root, bg);
> +       btrfs_put_block_group(bg);
> +
> +       if (!remove)
> +               return 0;
> +
>         trans = btrfs_start_transaction(root, 0);
>         if (IS_ERR(trans)) {
>                 ret = PTR_ERR(trans);
> 

Thanks for your detailed review!

Reason of creating new bg is:
Balance operation may be used to change raid-profile of the filesystem.
If we want to change filesystem form raid1 to raid5, we need:
1: delete all raid1 bgs
2: move all data to raid5 bg if data exist in filesystem
3: reserve only one blank raid5 bg if no data in filesystem

If we use similar operation as patch 1/2(not delete lattest blockgroup),
we'll have following problem:
1: the old raid1 blockgroups are not all-deleted
2: if no data in filesystem, the profile will not changed from raid1 to raid5.

I understand your worry of "NO_SPACE" when create additional bg,
and I also considered it in making patch, but I choose to use this way because:
1: balance operation had check free space before operation
  (In front of __btrfs_balance)
2: for filesystem with data, we have to create target-chunk in balance operation,
  this patch only make "creating-chunk" earlier, and the created chunk will be used
  to save data in balance operation, and reduce one chunk-allocate in balance operation.
3: for a blank filesystem, it is necessary to create a blank chunk with of target raid profile.
  the old code hadn't create it, and this patch created. It is right operation even if
  cause no-space. Actually, create a chunk in empty filesystem rarely cause no-space,
  plus we have free-space check in 1.
4: 1~3 ensure this patch rarely cause additional no-space problem in logic.
  And if it really caused additional no-space(if out of my consideration),
  We can change code of 1 to avoid it.

Thanks
Zhaolei

> (also at https://friendpaste.com/5IeAIIzBv3oKhureKfvjwm)
> 
> thanks
> 
> 
> >
> >> Also, can't this be ineffective when the data block group we allocate
> >> ends up with a key  in the chunk_root that is lower than the key we
> >> found in the current iteration? I.e., key found in current iteration
> >> has object id B, we allocate a new block group which gets a  key with
> >> object id A, where A < B and the next iteration of our loop sees key
> >> A, deletes the respective block group A if it's empty. In the end we
> >> have no data block groups, assuming that before A there are no other
> non-empty data block groups.
> >> Your example works because there's no free space before the offset
> >> where the chunk starts in the device.
> >>
> > New chunk will always use increased logic address, even if there are
> > "hole" before, so A's logic address will always >B.
> > And current code of balance also use this feature to avoid balance
> > new-created chunks(which was created by balance operation itself).
> >
> > Thanks
> > Zhaolei
> >
> >> thanks
> >>
> >> > +
> >> >                 ret = btrfs_relocate_chunk(chunk_root,
> >> >                                            found_key.offset);
> >> >                 mutex_unlock(&fs_info->delete_unused_bgs_mutex);
> >> > --
> >> > 1.8.5.1
> >> >
> >> > --
> >> > To unsubscribe from this list: send the line "unsubscribe linux-btrfs"
> >> > in the body of a message to majordomo@vger.kernel.org More
> >> > majordomo info at  http://vger.kernel.org/majordomo-info.html
> >>
> >>
> >>
> >> --
> >> Filipe David Manana,
> >>
> >> "Reasonable men adapt themselves to the world.
> >>  Unreasonable men adapt the world to themselves.
> >>  That's why all progress depends on unreasonable men."
> >
> 
> 
> 
> --
> Filipe David Manana,
> 
> "Reasonable men adapt themselves to the world.
>  Unreasonable men adapt the world to themselves.
>  That's why all progress depends on unreasonable men."

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Filipe Manana Sept. 30, 2015, 9:44 a.m. UTC | #2
On Wed, Sep 30, 2015 at 9:26 AM, Zhao Lei <zhaolei@cn.fujitsu.com> wrote:
> Hi, Filipe Manana
>
>> -----Original Message-----
>> From: Filipe Manana [mailto:fdmanana@gmail.com]
>> Sent: Wednesday, September 30, 2015 3:41 PM
>> To: Zhao Lei <zhaolei@cn.fujitsu.com>
>> Cc: linux-btrfs@vger.kernel.org
>> Subject: Re: [PATCH 2/2] btrfs: Fix lost-data-profile caused by balance bg
>>
>> On Wed, Sep 30, 2015 at 5:20 AM, Zhao Lei <zhaolei@cn.fujitsu.com> wrote:
>> > Hi, Filipe Manana
>> >
>> > Thanks for reviewing.
>> >
>> >> -----Original Message-----
>> >> From: Filipe Manana [mailto:fdmanana@gmail.com]
>> >> Sent: Tuesday, September 29, 2015 11:48 PM
>> >> To: Zhao Lei <zhaolei@cn.fujitsu.com>
>> >> Cc: linux-btrfs@vger.kernel.org
>> >> Subject: Re: [PATCH 2/2] btrfs: Fix lost-data-profile caused by
>> >> balance bg
>> >>
>> >> On Tue, Sep 29, 2015 at 2:51 PM, Zhao Lei <zhaolei@cn.fujitsu.com> wrote:
>> >> > Reproduce:
>> >> >  (In integration-4.3 branch)
>> >> >
>> >> >  TEST_DEV=(/dev/vdg /dev/vdh)
>> >> >  TEST_DIR=/mnt/tmp
>> >> >
>> >> >  umount "$TEST_DEV" >/dev/null
>> >> >  mkfs.btrfs -f -d raid1 "${TEST_DEV[@]}"
>> >> >
>> >> >  mount -o nospace_cache "$TEST_DEV" "$TEST_DIR"
>> >> >  btrfs balance start -dusage=0 $TEST_DIR  btrfs filesystem usage
>> >> > $TEST_DIR
>> >> >
>> >> >  dd if=/dev/zero of="$TEST_DIR"/file count=100  btrfs filesystem
>> >> > usage $TEST_DIR
>> >> >
>> >> > Result:
>> >> >  We can see "no data chunk" in first "btrfs filesystem usage":
>> >> >  # btrfs filesystem usage $TEST_DIR
>> >> >  Overall:
>> >> >     ...
>> >> >  Metadata,single: Size:8.00MiB, Used:0.00B
>> >> >     /dev/vdg        8.00MiB
>> >> >  Metadata,RAID1: Size:122.88MiB, Used:112.00KiB
>> >> >     /dev/vdg      122.88MiB
>> >> >     /dev/vdh      122.88MiB
>> >> >  System,single: Size:4.00MiB, Used:0.00B
>> >> >     /dev/vdg        4.00MiB
>> >> >  System,RAID1: Size:8.00MiB, Used:16.00KiB
>> >> >     /dev/vdg        8.00MiB
>> >> >     /dev/vdh        8.00MiB
>> >> >  Unallocated:
>> >> >     /dev/vdg        1.06GiB
>> >> >     /dev/vdh        1.07GiB
>> >> >
>> >> >  And "data chunks changed from raid1 to single" in second  "btrfs
>> >> > filesystem usage":
>> >> >  # btrfs filesystem usage $TEST_DIR
>> >> >  Overall:
>> >> >     ...
>> >> >  Data,single: Size:256.00MiB, Used:0.00B
>> >> >     /dev/vdh      256.00MiB
>> >> >  Metadata,single: Size:8.00MiB, Used:0.00B
>> >> >     /dev/vdg        8.00MiB
>> >> >  Metadata,RAID1: Size:122.88MiB, Used:112.00KiB
>> >> >     /dev/vdg      122.88MiB
>> >> >     /dev/vdh      122.88MiB
>> >> >  System,single: Size:4.00MiB, Used:0.00B
>> >> >     /dev/vdg        4.00MiB
>> >> >  System,RAID1: Size:8.00MiB, Used:16.00KiB
>> >> >     /dev/vdg        8.00MiB
>> >> >     /dev/vdh        8.00MiB
>> >> >  Unallocated:
>> >> >     /dev/vdg        1.06GiB
>> >> >     /dev/vdh      841.92MiB
>> >> >
>> >> > Reason:
>> >> >  btrfs balance delete last data chunk in case of no data in  the
>> >> > filesystem, then we can see "no data chunk" by "fi usage"
>> >> >  command.
>> >> >
>> >> >  And when we do write operation to fs, the only available data
>> >> > profile is 0x0, result is all new chunks are allocated single type.
>> >> >
>> >> > Fix:
>> >> >  Allocate a data chunk explicitly in balance operation, to reserve
>> >> > at least one data chunk in the filesystem.
>> >>
>> >> Allocate a data chunk explicitly to ensure we don't lose the raid profile for
>> data.
>> >>
>> >
>> > Thanks, will change in v2.
>> >
>> >> >
>> >> > Test:
>> >> >  Test by above script, and confirmed the logic by debug output.
>> >> >
>> >> > Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
>> >> > ---
>> >> >  fs/btrfs/volumes.c | 19 +++++++++++++++++++
>> >> >  1 file changed, 19 insertions(+)
>> >> >
>> >> > diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c index
>> >> > 6fc73586..3d5e41e 100644
>> >> > --- a/fs/btrfs/volumes.c
>> >> > +++ b/fs/btrfs/volumes.c
>> >> > @@ -3277,6 +3277,7 @@ static int __btrfs_balance(struct
>> >> > btrfs_fs_info
>> >> *fs_info)
>> >> >         u64 limit_data = bctl->data.limit;
>> >> >         u64 limit_meta = bctl->meta.limit;
>> >> >         u64 limit_sys = bctl->sys.limit;
>> >> > +       int chunk_reserved = 0;
>> >> >
>> >> >         /* step one make some room on all the devices */
>> >> >         devices = &fs_info->fs_devices->devices; @@ -3387,6
>> >> > +3388,24 @@ again:
>> >> >                         goto loop;
>> >> >                 }
>> >> >
>> >> > +               if (!chunk_reserved) {
>> >> > +                       trans = btrfs_start_transaction(chunk_root,
>> 0);
>> >> > +                       if (IS_ERR(trans)) {
>> >> > +
>> >> mutex_unlock(&fs_info->delete_unused_bgs_mutex);
>> >> > +                               ret = PTR_ERR(trans);
>> >> > +                               goto error;
>> >> > +                       }
>> >> > +
>> >> > +                       ret = btrfs_force_chunk_alloc(trans,
>> >> > + chunk_root, 1);
>> >>
>> >> Can we please use the symbol BTRFS_BLOCK_GROUP_DATA instead of 1?
>> >>
>> > Thanks, will change in v2.
>> >
>> >
>> >> > +                       if (ret < 0) {
>> >> > +
>> >> mutex_unlock(&fs_info->delete_unused_bgs_mutex);
>> >> > +                               goto error;
>> >> > +                       }
>> >> > +
>> >> > +                       btrfs_end_transaction(trans, chunk_root);
>> >> > +                       chunk_reserved = 1;
>> >> > +               }
>> >>
>> >> Can we do this logic only if the block group is a data one? I.e. no
>> >> point allocating a data block group if our balance only touches
>> >> metadata ones (due to some filter for e.g.).
>> >>
>> > Thanks for point out it, will change in v2.
>>
>> I find it somewhat awkward that we always allocate a new data block group no
>> matter what. Some balance operations that used to succeed in the past may
>> now fail with -ENOSPC for example...
>>
>> How about making this simpler and not so special for data block groups, like the
>> following (compile tested only):
>>
>> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c index 644e070..067b1eb
>> 100644
>> --- a/fs/btrfs/volumes.c
>> +++ b/fs/btrfs/volumes.c
>> @@ -2774,6 +2774,8 @@ static int btrfs_relocate_chunk(struct btrfs_root
>> *root, u64 chunk_offset)
>>         struct btrfs_root *extent_root;
>>         struct btrfs_trans_handle *trans;
>>         int ret;
>> +       struct btrfs_block_group_cache *bg;
>> +       bool remove = true;
>>
>>         root = root->fs_info->chunk_root;
>>         extent_root = root->fs_info->extent_root; @@ -2803,6 +2805,23
>> @@ static int btrfs_relocate_chunk(struct btrfs_root *root, u64 chunk_offset)
>>         if (ret)
>>                 return ret;
>>
>> +       bg = btrfs_lookup_block_group(root->fs_info, chunk_offset);
>> +       ASSERT(bg);
>> +       down_read(&bg->space_info->groups_sem);
>> +       /*
>> +        * Do not remove the last block group of a kind to prevent losing
>> +        * raid profile information for future chunk allocations.
>> +        */
>> +       if (bg->list.next == bg->list.prev)
>> +               remove = false;
>> +       up_read(&bg->space_info->groups_sem);
>> +       if (!remove)
>> +               btrfs_dec_block_group_ro(extent_root, bg);
>> +       btrfs_put_block_group(bg);
>> +
>> +       if (!remove)
>> +               return 0;
>> +
>>         trans = btrfs_start_transaction(root, 0);
>>         if (IS_ERR(trans)) {
>>                 ret = PTR_ERR(trans);
>>
>
> Thanks for your detailed review!
>
> Reason of creating new bg is:
> Balance operation may be used to change raid-profile of the filesystem.
> If we want to change filesystem form raid1 to raid5, we need:
> 1: delete all raid1 bgs
> 2: move all data to raid5 bg if data exist in filesystem
> 3: reserve only one blank raid5 bg if no data in filesystem

I see. Thanks.

>
> If we use similar operation as patch 1/2(not delete lattest blockgroup),
> we'll have following problem:
> 1: the old raid1 blockgroups are not all-deleted
> 2: if no data in filesystem, the profile will not changed from raid1 to raid5.
>
> I understand your worry of "NO_SPACE" when create additional bg,
> and I also considered it in making patch, but I choose to use this way because:
> 1: balance operation had check free space before operation
>   (In front of __btrfs_balance)
> 2: for filesystem with data, we have to create target-chunk in balance operation,
>   this patch only make "creating-chunk" earlier, and the created chunk will be used
>   to save data in balance operation, and reduce one chunk-allocate in balance operation.
> 3: for a blank filesystem, it is necessary to create a blank chunk with of target raid profile.
>   the old code hadn't create it, and this patch created. It is right operation even if
>   cause no-space. Actually, create a chunk in empty filesystem rarely cause no-space,
>   plus we have free-space check in 1.
> 4: 1~3 ensure this patch rarely cause additional no-space problem in logic.
>   And if it really caused additional no-space(if out of my consideration),
>   We can change code of 1 to avoid it.
>
> Thanks
> Zhaolei
>
>> (also at https://friendpaste.com/5IeAIIzBv3oKhureKfvjwm)
>>
>> thanks
>>
>>
>> >
>> >> Also, can't this be ineffective when the data block group we allocate
>> >> ends up with a key  in the chunk_root that is lower than the key we
>> >> found in the current iteration? I.e., key found in current iteration
>> >> has object id B, we allocate a new block group which gets a  key with
>> >> object id A, where A < B and the next iteration of our loop sees key
>> >> A, deletes the respective block group A if it's empty. In the end we
>> >> have no data block groups, assuming that before A there are no other
>> non-empty data block groups.
>> >> Your example works because there's no free space before the offset
>> >> where the chunk starts in the device.
>> >>
>> > New chunk will always use increased logic address, even if there are
>> > "hole" before, so A's logic address will always >B.
>> > And current code of balance also use this feature to avoid balance
>> > new-created chunks(which was created by balance operation itself).
>> >
>> > Thanks
>> > Zhaolei
>> >
>> >> thanks
>> >>
>> >> > +
>> >> >                 ret = btrfs_relocate_chunk(chunk_root,
>> >> >                                            found_key.offset);
>> >> >                 mutex_unlock(&fs_info->delete_unused_bgs_mutex);
>> >> > --
>> >> > 1.8.5.1
>> >> >
>> >> > --
>> >> > To unsubscribe from this list: send the line "unsubscribe linux-btrfs"
>> >> > in the body of a message to majordomo@vger.kernel.org More
>> >> > majordomo info at  http://vger.kernel.org/majordomo-info.html
>> >>
>> >>
>> >>
>> >> --
>> >> Filipe David Manana,
>> >>
>> >> "Reasonable men adapt themselves to the world.
>> >>  Unreasonable men adapt the world to themselves.
>> >>  That's why all progress depends on unreasonable men."
>> >
>>
>>
>>
>> --
>> Filipe David Manana,
>>
>> "Reasonable men adapt themselves to the world.
>>  Unreasonable men adapt the world to themselves.
>>  That's why all progress depends on unreasonable men."
>
diff mbox

Patch

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 644e070..067b1eb 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -2774,6 +2774,8 @@  static int btrfs_relocate_chunk(struct
btrfs_root *root, u64 chunk_offset)
        struct btrfs_root *extent_root;
        struct btrfs_trans_handle *trans;
        int ret;
+       struct btrfs_block_group_cache *bg;
+       bool remove = true;

        root = root->fs_info->chunk_root;