
Different sized disks, allocation strategy

Message ID CANaSA1wZfn5Gxg_dU33WbamchVtWDU4GpXazn8ep-NJKGNaetA@mail.gmail.com (mailing list archive)
State New, archived

Commit Message

iker vagyok June 3, 2023, 6:52 p.m. UTC
Hello!

I am curious if the multiple disk allocation strategy for btrfs could
be improved for my use case. The current situation is:


# btrfs filesystem usage -T /
Overall:
    Device size:                  32.74TiB
    Device allocated:             15.66TiB
    Device unallocated:           17.08TiB
    Device missing:                  0.00B
    Device slack:                    0.00B
    Used:                         15.57TiB
    Free (estimated):              8.58TiB      (min: 4.31TiB)
    Free (statfs, df):             6.12TiB
    Data ratio:                       2.00
    Metadata ratio:                   4.00
    Global reserve:              512.00MiB      (used: 0.00B)
    Multiple profiles:                  no

             Data    Metadata System
Id Path      RAID1   RAID1C4  RAID1C4  Unallocated Total    Slack
-- --------- ------- -------- -------- ----------- -------- -----
 2 /dev/sdc2 4.18TiB 13.00GiB 32.00MiB     4.91TiB  9.09TiB     -
 3 /dev/sda2       -  7.00GiB        -     2.72TiB  2.72TiB     -
 6 /dev/sde2       -  6.00GiB 32.00MiB     2.72TiB  2.72TiB     -
 8 /dev/sdb2 5.72TiB 13.00GiB 32.00MiB     3.37TiB  9.09TiB     -
 9 /dev/sdd2 5.71TiB 13.00GiB 32.00MiB     3.37TiB  9.09TiB     -
-- --------- ------- -------- -------- ----------- -------- -----
   Total     7.80TiB 13.00GiB 32.00MiB    17.08TiB 32.74TiB 0.00B
   Used      7.77TiB  9.95GiB  1.19MiB


As you can see, my server has 2*3TB and 3*10TB HDDs and uses RAID1 for
data and RAID1C4 for metadata. This works fine and the smaller devices
are even used for RAID1C4 (metadata), as there are not enough big
drives to handle 4 copies. But the smaller drives are not used for
RAID1 (data) until all bigger disks are filled (with <3TB remaining
free). Only then will all the disks take part in the raid setup.
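
For reference, as far as I can tell the device sort used when picking
disks for a new chunk (btrfs_cmp_device_info() in fs/btrfs/volumes.c)
only looks at the absolute free space, roughly like this (paraphrased
from my reading of the code, so take it with a grain of salt):


static int btrfs_cmp_device_info(const void *a, const void *b)
{
        const struct btrfs_device_info *di_a = a;
        const struct btrfs_device_info *di_b = b;

        /* prefer the device with the most space available for the new chunk */
        if (di_a->max_avail > di_b->max_avail)
                return -1;
        if (di_a->max_avail < di_b->max_avail)
                return 1;
        /* tie-break on the total unallocated space of the device */
        if (di_a->total_avail > di_b->total_avail)
                return -1;
        if (di_a->total_avail < di_b->total_avail)
                return 1;
        return 0;
}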

I propose a simple change to the strategy where btrfs still fills up
the big drives faster, but doesn't wait until ~7TB is used on each big
drive before it starts to use the 3TB drives as well. It would be nice
if all disks were used in proportion to their relative free space. IMO
it wouldn't change the behavior for big setups with same-sized devices,
and it would make rebuilds and drive wear much easier to anticipate on
smaller setups. I browsed through the kernel code and believe that this
could be a very simple and contained change - but my C knowledge is
limited. Something in the ballpark of this:


        if (di_a->max_avail / di_a->total_avail > di_b->max_avail / di_b->total_avail)
                return -1;
        if (di_a->max_avail / di_a->total_avail < di_b->max_avail / di_b->total_avail)
                return 1;
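
One caveat I am not sure how to handle cleanly: max_avail and
total_avail are u64 byte counts, so the plain integer division above
would mostly truncate to zero. The ratios could instead be compared by
cross-multiplying, roughly like this (ignoring possible u64 overflow):

        /* a/b > c/d  <=>  a*d > c*b, avoids integer-division truncation */
        if (di_a->max_avail * di_b->total_avail >
            di_b->max_avail * di_a->total_avail)
                return -1;
        if (di_a->max_avail * di_b->total_avail <
            di_b->max_avail * di_a->total_avail)
                return 1;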


What do you think about the proposal?

Thank you and best regards,
Balázs

Comments

waxhead June 4, 2023, 9:20 a.m. UTC | #1
iker vagyok wrote:
> Hello!
> 
> I am curious if the multiple disk allocation strategy for btrfs could
> be improved for my use case. The current situation is:
> 
> [...]
> 
> As you can see, my server has 2*3TB and 3*10TB HDDs and uses RAID1 for
> data and RAID1C4 for metadata. This works fine and the smaller devices
> are even used for RAID1C4 (metadata), as there are not enough big
> drives to handle 4 copies. But the smaller drives are not used for
> RAID1 (data) until all bigger disks are filled (with <3TB remaining
> free). Only then will all the disks take part in the raid setup.
> 

First of all, I am not a btrfs dev, just a regular user.
This "issue" bugs me a bit as well and I remember that there was a 
proposal quite some time ago where someone wanted to use the percentage 
of free space as the allocation trigger. I believe that would have been 
a quite useful change.

On modern kernels (>= 5.15) the RAID10 profile "degrades" to RAID1
(which is perfectly fine from a reliability perspective), so you would
spread the data over all your disks a bit better while the filesystem
is "fresh". The opposite becomes more likely as the filesystem nears a
full state; then the three largest devices will be the ones taking all
the writes.

So assuming regular "home use" (fill up with data once, read often),
you roughly have the choice between:
A: RAID1  - use your largest devices first, then spread over all devices.
B: RAID10 - use all devices first, then only the largest devices once
   the small ones are full (conversion command sketched below).
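
If you want to try option B on an existing filesystem, converting the
data profile should be something along the lines of (from memory, so
check the manual page before running it):

# btrfs balance start -dconvert=raid10 /

That rewrites every data chunk, so expect it to take quite a while with
~8TiB of data on rotating disks.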

Patch

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 841e799dece5..db36c01a62be 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -5049,6 +5049,14 @@ static int btrfs_cmp_device_info(const void *a, const void *b)
        const struct btrfs_device_info *di_a = a;
        const struct btrfs_device_info *di_b = b;

+       if (di_a->total_avail > 0 && di_b->total_avail > 0)
+               {
+                       if (di_a->max_avail / di_a->total_avail > di_b->max_avail / di_b->total_avail)
+                               return -1;
+                       if (di_a->max_avail / di_a->total_avail < di_b->max_avail / di_b->total_avail)
+                               return 1;
+               };
        if (di_a->max_avail > di_b->max_avail)