[RFC,V8,0/5] btrfs: allocation_hint mode

Message ID	cover.1635089352.git.kreijack@inwind.it (mailing list archive)
Headers	show Return-Path: <linux-btrfs-owner@kernel.org> From: Goffredo Baroncelli <kreijack@tiscali.it> To: linux-btrfs@vger.kernel.org Cc: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>, Josef Bacik <josef@toxicpanda.com>, David Sterba <dsterba@suse.cz>, Sinnamohideen Shafeeq <shafeeqs@panasas.com>, Goffredo Baroncelli <kreijack@inwind.it> Subject: [RFC][V8][PATCH 0/5] btrfs: allocation_hint mode Date: Sun, 24 Oct 2021 17:31:03 +0200 Message-Id: <cover.1635089352.git.kreijack@inwind.it> Reply-To: Goffredo Baroncelli <kreijack@libero.it> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Precedence: bulk
Series	btrfs: allocation_hint mode \| expand [RFC,V8,0/5] btrfs: allocation_hint mode [1/4] btrfs: add flags to give an hint to the chunk allocator [2/4] btrfs: export dev_item.type in /sys/fs/btrfs/<uuid>/devinfo/<devid>/type [3/4] btrfs: change the DEV_ITEM 'type' field via sysfs [4/4] btrfs: add allocator_hint mode

Goffredo Baroncelli Oct. 24, 2021, 3:31 p.m. UTC

From: Goffredo Baroncelli <kreijack@inwind.it>

Hi all,

This patches set was born after some discussion between me, Zygo and Josef.
Some details can be found in https://github.com/btrfs/btrfs-todo/issues/19.

Some further information about a real use case can be found in
https://lore.kernel.org/linux-btrfs/20210116002533.GE31381@hungrycats.org/

Reently Shafeeq told me that he is interested too, due to the performance gain.

In this revision I switched away from an ioctl API in favor of a sysfs API (
see patch #2 and #3).

The idea behind this patches set, is to dedicate some disks (the fastest one)
to the metadata chunk. My initial idea was a "soft" hint. However Zygo
asked an option for a "strong" hint (== mandatory). The result is that
each disk can be "tagged" by one of the following flags:
- BTRFS_DEV_ALLOCATION_METADATA_ONLY
- BTRFS_DEV_ALLOCATION_PREFERRED_METADATA
- BTRFS_DEV_ALLOCATION_PREFERRED_DATA
- BTRFS_DEV_ALLOCATION_DATA_ONLY

When the chunk allocator search a disks to allocate a chunk, scans the disks
in an order decided by these tags. For metadata, the order is:
*_METADATA_ONLY
*_PREFERRED_METADATA
*_PREFERRED_DATA

The *_DATA_ONLY are not eligible from metadata chunk allocation.

For the data chunk, the order is reversed, and the *_METADATA_ONLY are
excluded.

The exact sort logic is to sort first for the "tag", and then for the space
available. If there is no space available, the next "tag" disks set are
selected.

To set these tags, a new property called "allocation_hint" was created.
There is a dedicated btrfs-prog patches set [[PATCH V5] btrfs-progs:
allocation_hint disk property].

$ sudo mount /dev/loop0 /mnt/test-btrfs/
$ for i in /dev/loop[0-9]; do sudo ./btrfs prop get $i allocation_hint; done
devid=1, path=/dev/loop0: allocation_hint=PREFERRED_METADATA
devid=2, path=/dev/loop1: allocation_hint=PREFERRED_METADATA
devid=3, path=/dev/loop2: allocation_hint=PREFERRED_DATA
devid=4, path=/dev/loop3: allocation_hint=PREFERRED_DATA
devid=5, path=/dev/loop4: allocation_hint=PREFERRED_DATA
devid=6, path=/dev/loop5: allocation_hint=DATA_ONLY
devid=7, path=/dev/loop6: allocation_hint=METADATA_ONLY
devid=8, path=/dev/loop7: allocation_hint=METADATA_ONLY

$ sudo ./btrfs fi us /mnt/test-btrfs/
Overall:
    Device size:           2.75GiB
    Device allocated:           1.34GiB
    Device unallocated:           1.41GiB
    Device missing:             0.00B
    Used:             400.89MiB
    Free (estimated):           1.04GiB    (min: 1.04GiB)
    Data ratio:                  2.00
    Metadata ratio:              1.00
    Global reserve:           3.25MiB    (used: 0.00B)
    Multiple profiles:                no

Data,RAID1: Size:542.00MiB, Used:200.25MiB (36.95%)
   /dev/loop0     288.00MiB
   /dev/loop1     288.00MiB
   /dev/loop2     127.00MiB
   /dev/loop3     127.00MiB
   /dev/loop4     127.00MiB
   /dev/loop5     127.00MiB

Metadata,single: Size:256.00MiB, Used:384.00KiB (0.15%)
   /dev/loop1     256.00MiB

System,single: Size:32.00MiB, Used:16.00KiB (0.05%)
   /dev/loop0      32.00MiB

Unallocated:
   /dev/loop0     704.00MiB
   /dev/loop1     480.00MiB
   /dev/loop2       1.00MiB
   /dev/loop3       1.00MiB
   /dev/loop4       1.00MiB
   /dev/loop5       1.00MiB
   /dev/loop6     128.00MiB
   /dev/loop7     128.00MiB

# change the tag of some disks

$ sudo ./btrfs prop set /dev/loop0 allocation_hint DATA_ONLY
$ sudo ./btrfs prop set /dev/loop1 allocation_hint DATA_ONLY
$ sudo ./btrfs prop set /dev/loop5 allocation_hint METADATA_ONLY

$ for i in /dev/loop[0-9]; do sudo ./btrfs prop get $i allocation_hint; done
devid=1, path=/dev/loop0: allocation_hint=DATA_ONLY
devid=2, path=/dev/loop1: allocation_hint=DATA_ONLY
devid=3, path=/dev/loop2: allocation_hint=PREFERRED_DATA
devid=4, path=/dev/loop3: allocation_hint=PREFERRED_DATA
devid=5, path=/dev/loop4: allocation_hint=PREFERRED_DATA
devid=6, path=/dev/loop5: allocation_hint=METADATA_ONLY
devid=7, path=/dev/loop6: allocation_hint=METADATA_ONLY
devid=8, path=/dev/loop7: allocation_hint=METADATA_ONLY

$ sudo btrfs bal start --full-balance /mnt/test-btrfs/
$ sudo ./btrfs fi us /mnt/test-btrfs/
Overall:
    Device size:           2.75GiB
    Device allocated:         735.00MiB
    Device unallocated:           2.03GiB
    Device missing:             0.00B
    Used:             400.72MiB
    Free (estimated):           1.10GiB    (min: 1.10GiB)
    Data ratio:                  2.00
    Metadata ratio:              1.00
    Global reserve:           3.25MiB    (used: 0.00B)
    Multiple profiles:                no

Data,RAID1: Size:288.00MiB, Used:200.19MiB (69.51%)
   /dev/loop0     288.00MiB
   /dev/loop1     288.00MiB

Metadata,single: Size:127.00MiB, Used:336.00KiB (0.26%)
   /dev/loop5     127.00MiB

System,single: Size:32.00MiB, Used:16.00KiB (0.05%)
   /dev/loop7      32.00MiB

Unallocated:
   /dev/loop0     736.00MiB
   /dev/loop1     736.00MiB
   /dev/loop2     128.00MiB
   /dev/loop3     128.00MiB
   /dev/loop4     128.00MiB
   /dev/loop5       1.00MiB
   /dev/loop6     128.00MiB
   /dev/loop7      96.00MiB


#As you can see all the metadata were placed on the disk loop5/loop7 even if
#the most empty one are loop0 and loop1.



TODO:
- more tests
- the tool which show the space available should consider the tagging (eg
  the disks tagged by _METADATA_ONLY should be excluded from the data
  availability)


Comments are welcome
BR
G.Baroncelli

Paul Jones Dec. 13, 2021, 9:39 a.m. UTC | #1

> -----Original Message-----
> From: Goffredo Baroncelli <kreijack@tiscali.it>
> Sent: Monday, 25 October 2021 2:31 AM
> To: linux-btrfs@vger.kernel.org
> Cc: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>; Josef Bacik
> <josef@toxicpanda.com>; David Sterba <dsterba@suse.cz>; Sinnamohideen
> Shafeeq <shafeeqs@panasas.com>; Goffredo Baroncelli
> <kreijack@inwind.it>
> Subject: [RFC][V8][PATCH 0/5] btrfs: allocation_hint mode
> 
> From: Goffredo Baroncelli <kreijack@inwind.it>
> 
> Hi all,
> 
> This patches set was born after some discussion between me, Zygo and
> Josef.
> Some details can be found in https://github.com/btrfs/btrfs-todo/issues/19.
> 
> Some further information about a real use case can be found in
> https://lore.kernel.org/linux-
> btrfs/20210116002533.GE31381@hungrycats.org/
> 
> Reently Shafeeq told me that he is interested too, due to the performance
> gain.
> 
> In this revision I switched away from an ioctl API in favor of a sysfs API ( see
> patch #2 and #3).
> 
> The idea behind this patches set, is to dedicate some disks (the fastest one)
> to the metadata chunk. My initial idea was a "soft" hint. However Zygo asked
> an option for a "strong" hint (== mandatory). The result is that each disk can
> be "tagged" by one of the following flags:
> - BTRFS_DEV_ALLOCATION_METADATA_ONLY
> - BTRFS_DEV_ALLOCATION_PREFERRED_METADATA
> - BTRFS_DEV_ALLOCATION_PREFERRED_DATA
> - BTRFS_DEV_ALLOCATION_DATA_ONLY
> 
> When the chunk allocator search a disks to allocate a chunk, scans the disks in
> an order decided by these tags. For metadata, the order is:
> *_METADATA_ONLY
> *_PREFERRED_METADATA
> *_PREFERRED_DATA
> 
> The *_DATA_ONLY are not eligible from metadata chunk allocation.
> 
> For the data chunk, the order is reversed, and the *_METADATA_ONLY are
> excluded.
> 
> The exact sort logic is to sort first for the "tag", and then for the space
> available. If there is no space available, the next "tag" disks set are selected.
> 
> To set these tags, a new property called "allocation_hint" was created.
> There is a dedicated btrfs-prog patches set [[PATCH V5] btrfs-progs:
> allocation_hint disk property].
> 
> $ sudo mount /dev/loop0 /mnt/test-btrfs/ $ for i in /dev/loop[0-9]; do sudo
> ./btrfs prop get $i allocation_hint; done devid=1, path=/dev/loop0:
> allocation_hint=PREFERRED_METADATA
> devid=2, path=/dev/loop1: allocation_hint=PREFERRED_METADATA
> devid=3, path=/dev/loop2: allocation_hint=PREFERRED_DATA devid=4,
> path=/dev/loop3: allocation_hint=PREFERRED_DATA devid=5,
> path=/dev/loop4: allocation_hint=PREFERRED_DATA devid=6,
> path=/dev/loop5: allocation_hint=DATA_ONLY devid=7, path=/dev/loop6:
> allocation_hint=METADATA_ONLY devid=8, path=/dev/loop7:
> allocation_hint=METADATA_ONLY
> 
> $ sudo ./btrfs fi us /mnt/test-btrfs/
> Overall:
>     Device size:           2.75GiB
>     Device allocated:           1.34GiB
>     Device unallocated:           1.41GiB
>     Device missing:             0.00B
>     Used:             400.89MiB
>     Free (estimated):           1.04GiB    (min: 1.04GiB)
>     Data ratio:                  2.00
>     Metadata ratio:              1.00
>     Global reserve:           3.25MiB    (used: 0.00B)
>     Multiple profiles:                no
> 
> Data,RAID1: Size:542.00MiB, Used:200.25MiB (36.95%)
>    /dev/loop0     288.00MiB
>    /dev/loop1     288.00MiB
>    /dev/loop2     127.00MiB
>    /dev/loop3     127.00MiB
>    /dev/loop4     127.00MiB
>    /dev/loop5     127.00MiB
> 
> Metadata,single: Size:256.00MiB, Used:384.00KiB (0.15%)
>    /dev/loop1     256.00MiB
> 
> System,single: Size:32.00MiB, Used:16.00KiB (0.05%)
>    /dev/loop0      32.00MiB
> 
> Unallocated:
>    /dev/loop0     704.00MiB
>    /dev/loop1     480.00MiB
>    /dev/loop2       1.00MiB
>    /dev/loop3       1.00MiB
>    /dev/loop4       1.00MiB
>    /dev/loop5       1.00MiB
>    /dev/loop6     128.00MiB
>    /dev/loop7     128.00MiB
> 
> # change the tag of some disks
> 
> $ sudo ./btrfs prop set /dev/loop0 allocation_hint DATA_ONLY $ sudo ./btrfs
> prop set /dev/loop1 allocation_hint DATA_ONLY $ sudo ./btrfs prop set
> /dev/loop5 allocation_hint METADATA_ONLY
> 
> $ for i in /dev/loop[0-9]; do sudo ./btrfs prop get $i allocation_hint; done
> devid=1, path=/dev/loop0: allocation_hint=DATA_ONLY devid=2,
> path=/dev/loop1: allocation_hint=DATA_ONLY devid=3, path=/dev/loop2:
> allocation_hint=PREFERRED_DATA devid=4, path=/dev/loop3:
> allocation_hint=PREFERRED_DATA devid=5, path=/dev/loop4:
> allocation_hint=PREFERRED_DATA devid=6, path=/dev/loop5:
> allocation_hint=METADATA_ONLY devid=7, path=/dev/loop6:
> allocation_hint=METADATA_ONLY devid=8, path=/dev/loop7:
> allocation_hint=METADATA_ONLY
> 
> $ sudo btrfs bal start --full-balance /mnt/test-btrfs/ $ sudo ./btrfs fi us
> /mnt/test-btrfs/
> Overall:
>     Device size:           2.75GiB
>     Device allocated:         735.00MiB
>     Device unallocated:           2.03GiB
>     Device missing:             0.00B
>     Used:             400.72MiB
>     Free (estimated):           1.10GiB    (min: 1.10GiB)
>     Data ratio:                  2.00
>     Metadata ratio:              1.00
>     Global reserve:           3.25MiB    (used: 0.00B)
>     Multiple profiles:                no
> 
> Data,RAID1: Size:288.00MiB, Used:200.19MiB (69.51%)
>    /dev/loop0     288.00MiB
>    /dev/loop1     288.00MiB
> 
> Metadata,single: Size:127.00MiB, Used:336.00KiB (0.26%)
>    /dev/loop5     127.00MiB
> 
> System,single: Size:32.00MiB, Used:16.00KiB (0.05%)
>    /dev/loop7      32.00MiB
> 
> Unallocated:
>    /dev/loop0     736.00MiB
>    /dev/loop1     736.00MiB
>    /dev/loop2     128.00MiB
>    /dev/loop3     128.00MiB
>    /dev/loop4     128.00MiB
>    /dev/loop5       1.00MiB
>    /dev/loop6     128.00MiB
>    /dev/loop7      96.00MiB
> 
> 
> #As you can see all the metadata were placed on the disk loop5/loop7 even if
> #the most empty one are loop0 and loop1.
> 
> 
> 
> TODO:
> - more tests
> - the tool which show the space available should consider the tagging (eg
>   the disks tagged by _METADATA_ONLY should be excluded from the data
>   availability)
> 
> 
> Comments are welcome
> BR
> G.Baroncelli


I've been running this patch series since about V4, works really well. Would be nice to have it merged eventually.

Tested By: Paul Jones <paul@pauljones.id.au>

Goffredo Baroncelli Dec. 13, 2021, 7:54 p.m. UTC | #2

Gentle ping :-)

Are there anyone of the mains developer interested in supporting this patch ?

I am open to improve it if required.

BR
G.Baroncelli

On 12/13/21 10:39, Paul Jones wrote:
>> -----Original Message-----
>> From: Goffredo Baroncelli <kreijack@tiscali.it>
>> Sent: Monday, 25 October 2021 2:31 AM
>> To: linux-btrfs@vger.kernel.org
>> Cc: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>; Josef Bacik
>> <josef@toxicpanda.com>; David Sterba <dsterba@suse.cz>; Sinnamohideen
>> Shafeeq <shafeeqs@panasas.com>; Goffredo Baroncelli
>> <kreijack@inwind.it>
>> Subject: [RFC][V8][PATCH 0/5] btrfs: allocation_hint mode
>>
>> From: Goffredo Baroncelli <kreijack@inwind.it>
>>
>> Hi all,
>>
>> This patches set was born after some discussion between me, Zygo and
>> Josef.
>> Some details can be found in https://github.com/btrfs/btrfs-todo/issues/19.
>>
>> Some further information about a real use case can be found in
>> https://lore.kernel.org/linux-
>> btrfs/20210116002533.GE31381@hungrycats.org/
>>
>> Reently Shafeeq told me that he is interested too, due to the performance
>> gain.
>>
>> In this revision I switched away from an ioctl API in favor of a sysfs API ( see
>> patch #2 and #3).
>>
>> The idea behind this patches set, is to dedicate some disks (the fastest one)
>> to the metadata chunk. My initial idea was a "soft" hint. However Zygo asked
>> an option for a "strong" hint (== mandatory). The result is that each disk can
>> be "tagged" by one of the following flags:
>> - BTRFS_DEV_ALLOCATION_METADATA_ONLY
>> - BTRFS_DEV_ALLOCATION_PREFERRED_METADATA
>> - BTRFS_DEV_ALLOCATION_PREFERRED_DATA
>> - BTRFS_DEV_ALLOCATION_DATA_ONLY
>>
>> When the chunk allocator search a disks to allocate a chunk, scans the disks in
>> an order decided by these tags. For metadata, the order is:
>> *_METADATA_ONLY
>> *_PREFERRED_METADATA
>> *_PREFERRED_DATA
>>
>> The *_DATA_ONLY are not eligible from metadata chunk allocation.
>>
>> For the data chunk, the order is reversed, and the *_METADATA_ONLY are
>> excluded.
>>
>> The exact sort logic is to sort first for the "tag", and then for the space
>> available. If there is no space available, the next "tag" disks set are selected.
>>
>> To set these tags, a new property called "allocation_hint" was created.
>> There is a dedicated btrfs-prog patches set [[PATCH V5] btrfs-progs:
>> allocation_hint disk property].
>>
>> $ sudo mount /dev/loop0 /mnt/test-btrfs/ $ for i in /dev/loop[0-9]; do sudo
>> ./btrfs prop get $i allocation_hint; done devid=1, path=/dev/loop0:
>> allocation_hint=PREFERRED_METADATA
>> devid=2, path=/dev/loop1: allocation_hint=PREFERRED_METADATA
>> devid=3, path=/dev/loop2: allocation_hint=PREFERRED_DATA devid=4,
>> path=/dev/loop3: allocation_hint=PREFERRED_DATA devid=5,
>> path=/dev/loop4: allocation_hint=PREFERRED_DATA devid=6,
>> path=/dev/loop5: allocation_hint=DATA_ONLY devid=7, path=/dev/loop6:
>> allocation_hint=METADATA_ONLY devid=8, path=/dev/loop7:
>> allocation_hint=METADATA_ONLY
>>
>> $ sudo ./btrfs fi us /mnt/test-btrfs/
>> Overall:
>>      Device size:           2.75GiB
>>      Device allocated:           1.34GiB
>>      Device unallocated:           1.41GiB
>>      Device missing:             0.00B
>>      Used:             400.89MiB
>>      Free (estimated):           1.04GiB    (min: 1.04GiB)
>>      Data ratio:                  2.00
>>      Metadata ratio:              1.00
>>      Global reserve:           3.25MiB    (used: 0.00B)
>>      Multiple profiles:                no
>>
>> Data,RAID1: Size:542.00MiB, Used:200.25MiB (36.95%)
>>     /dev/loop0     288.00MiB
>>     /dev/loop1     288.00MiB
>>     /dev/loop2     127.00MiB
>>     /dev/loop3     127.00MiB
>>     /dev/loop4     127.00MiB
>>     /dev/loop5     127.00MiB
>>
>> Metadata,single: Size:256.00MiB, Used:384.00KiB (0.15%)
>>     /dev/loop1     256.00MiB
>>
>> System,single: Size:32.00MiB, Used:16.00KiB (0.05%)
>>     /dev/loop0      32.00MiB
>>
>> Unallocated:
>>     /dev/loop0     704.00MiB
>>     /dev/loop1     480.00MiB
>>     /dev/loop2       1.00MiB
>>     /dev/loop3       1.00MiB
>>     /dev/loop4       1.00MiB
>>     /dev/loop5       1.00MiB
>>     /dev/loop6     128.00MiB
>>     /dev/loop7     128.00MiB
>>
>> # change the tag of some disks
>>
>> $ sudo ./btrfs prop set /dev/loop0 allocation_hint DATA_ONLY $ sudo ./btrfs
>> prop set /dev/loop1 allocation_hint DATA_ONLY $ sudo ./btrfs prop set
>> /dev/loop5 allocation_hint METADATA_ONLY
>>
>> $ for i in /dev/loop[0-9]; do sudo ./btrfs prop get $i allocation_hint; done
>> devid=1, path=/dev/loop0: allocation_hint=DATA_ONLY devid=2,
>> path=/dev/loop1: allocation_hint=DATA_ONLY devid=3, path=/dev/loop2:
>> allocation_hint=PREFERRED_DATA devid=4, path=/dev/loop3:
>> allocation_hint=PREFERRED_DATA devid=5, path=/dev/loop4:
>> allocation_hint=PREFERRED_DATA devid=6, path=/dev/loop5:
>> allocation_hint=METADATA_ONLY devid=7, path=/dev/loop6:
>> allocation_hint=METADATA_ONLY devid=8, path=/dev/loop7:
>> allocation_hint=METADATA_ONLY
>>
>> $ sudo btrfs bal start --full-balance /mnt/test-btrfs/ $ sudo ./btrfs fi us
>> /mnt/test-btrfs/
>> Overall:
>>      Device size:           2.75GiB
>>      Device allocated:         735.00MiB
>>      Device unallocated:           2.03GiB
>>      Device missing:             0.00B
>>      Used:             400.72MiB
>>      Free (estimated):           1.10GiB    (min: 1.10GiB)
>>      Data ratio:                  2.00
>>      Metadata ratio:              1.00
>>      Global reserve:           3.25MiB    (used: 0.00B)
>>      Multiple profiles:                no
>>
>> Data,RAID1: Size:288.00MiB, Used:200.19MiB (69.51%)
>>     /dev/loop0     288.00MiB
>>     /dev/loop1     288.00MiB
>>
>> Metadata,single: Size:127.00MiB, Used:336.00KiB (0.26%)
>>     /dev/loop5     127.00MiB
>>
>> System,single: Size:32.00MiB, Used:16.00KiB (0.05%)
>>     /dev/loop7      32.00MiB
>>
>> Unallocated:
>>     /dev/loop0     736.00MiB
>>     /dev/loop1     736.00MiB
>>     /dev/loop2     128.00MiB
>>     /dev/loop3     128.00MiB
>>     /dev/loop4     128.00MiB
>>     /dev/loop5       1.00MiB
>>     /dev/loop6     128.00MiB
>>     /dev/loop7      96.00MiB
>>
>>
>> #As you can see all the metadata were placed on the disk loop5/loop7 even if
>> #the most empty one are loop0 and loop1.
>>
>>
>>
>> TODO:
>> - more tests
>> - the tool which show the space available should consider the tagging (eg
>>    the disks tagged by _METADATA_ONLY should be excluded from the data
>>    availability)
>>
>>
>> Comments are welcome
>> BR
>> G.Baroncelli
> 
> 
> I've been running this patch series since about V4, works really well. Would be nice to have it merged eventually.
> 
> Tested By: Paul Jones <paul@pauljones.id.au>
>

Josef Bacik Dec. 13, 2021, 9:15 p.m. UTC | #3

On Mon, Dec 13, 2021 at 08:54:24PM +0100, Goffredo Baroncelli wrote:
> Gentle ping :-)
> 
> Are there anyone of the mains developer interested in supporting this patch ?
> 
> I am open to improve it if required.
> 

Sorry I missed this go by.  I like the interface, we don't have a use for
device->type yet, so this fits nicely.

I don't see the btrfs-progs patches in my inbox, and these don't apply, so
you'll definitely need to refresh for a proper review, but looking at these
patches they seem sane enough, and I like the interface.  I'd like to hear
Zygo's opinion as well.

If we're going to use device->type for this, and since we don't have a user of
device->type, I'd also like you to go ahead and re-name ->type to
->allocation_policy, that way it's clear what we're using it for now.

I'd also like some xfstests to validate the behavior so we're sure we're testing
this.  I'd want 1 test to just test the mechanics, like mkfs with different
policies and validate they're set right, change policies, add/remove disks with
different policies.

Then a second test to do something like fsstress with each set of allocation
policies to validate that we did actually allocate from the correct disks.  For
this please also test with compression on to make sure the test validation works
for both normal allocation and compression (ie it doesn't assume writing 5gib of
data == 5 gib of data usage, as compression chould give you a different value).

With that in place I think this is the correct way to implement this feature.
Thanks,

Josef

Zygo Blaxell Dec. 13, 2021, 10:49 p.m. UTC | #4

On Mon, Dec 13, 2021 at 04:15:14PM -0500, Josef Bacik wrote:
> On Mon, Dec 13, 2021 at 08:54:24PM +0100, Goffredo Baroncelli wrote:
> > Gentle ping :-)
> > 
> > Are there anyone of the mains developer interested in supporting this patch ?
> > 
> > I am open to improve it if required.
> > 
> 
> Sorry I missed this go by.  I like the interface, we don't have a use for
> device->type yet, so this fits nicely.
> 
> I don't see the btrfs-progs patches in my inbox, and these don't apply, so
> you'll definitely need to refresh for a proper review, but looking at these
> patches they seem sane enough, and I like the interface.  I'd like to hear
> Zygo's opinion as well.

I've been running earlier versions with modifications since summer 2020,
and this version mostly unmodified (rebase changes only) since it was
posted.  It seems to work, even in corner cases like converting balances,
replacing drives, and running out of space.  The "running out of space"
experience is on btrfs is weird at the best of times, and these patches
add some more new special cases, but it doesn't behave in ways that
would surprise a sysadmin familiar with how btrfs chunk allocation works.

One major piece that's missing is adjusting the statvfs (aka df)
available blocks field so that it doesn't include unallocated space on
any metadata-only devices.  Right now all the unallocated space on
metadata-only devices is counted as free even though it's impossible to
put a data block there, so anything that is triggered automatically
on "f_bavail < some_threshold" will be confused.

I don't think that piece has to block the rest of the patch series--if
you're not using the feature, df gives the right number (or at least the
same number it gave before), and if you are using the feature, you can
subtract the unavailable data space until a later patch comes along to
fix it.

I like

	echo data_only > /sys/fs/btrfs/$uuid/devinfo/3/type

more than patching btrfs-progs so I can use

	btrfs prop set /dev/... allocation_hint data_only

but I admit that might be because I'm weird.

> If we're going to use device->type for this, and since we don't have a user of
> device->type, I'd also like you to go ahead and re-name ->type to
> ->allocation_policy, that way it's clear what we're using it for now.
> 
> I'd also like some xfstests to validate the behavior so we're sure we're testing
> this.  I'd want 1 test to just test the mechanics, like mkfs with different
> policies and validate they're set right, change policies, add/remove disks with
> different policies.
> 
> Then a second test to do something like fsstress with each set of allocation
> policies to validate that we did actually allocate from the correct disks.  For
> this please also test with compression on to make sure the test validation works
> for both normal allocation and compression (ie it doesn't assume writing 5gib of
> data == 5 gib of data usage, as compression chould give you a different value).
> 
> With that in place I think this is the correct way to implement this feature.
> Thanks,
> 
> Josef

Sinnamohideen, Shafeeq Dec. 14, 2021, 1:03 a.m. UTC | #5

Panasas would be very happy with this patch - it solves a problem we have when delete-only workloads thrash the disks. Because it is equivalent to "rm -rf" of an old directory, caching does't help since the metadata of the files being deleted will have fallen out of bcache long ago. So giving up a small amount of SSD space to always hold the metadata yields a big performance improvement.

We got as far as verifying that it doesn't break existing xfstests, we are open to writing the test cases you listed and also to check that the device out-of-space cases behave as expected.

Thanks,
Shafeeq

-----Original Message-----
From: Josef Bacik <josef@toxicpanda.com> 
Sent: Monday, December 13, 2021 4:15 PM
To: kreijack@inwind.it
Cc: David Sterba <dsterba@suse.cz>; Sinnamohideen, Shafeeq <shafeeqs@panasas.com>; Paul Jones <paul@pauljones.id.au>; linux-btrfs@vger.kernel.org; Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
Subject: Re: [RFC][V8][PATCH 0/5] btrfs: allocation_hint mode

On Mon, Dec 13, 2021 at 08:54:24PM +0100, Goffredo Baroncelli wrote:
> Gentle ping :-)
> 
> Are there anyone of the mains developer interested in supporting this patch ?
> 
> I am open to improve it if required.
> 

Sorry I missed this go by.  I like the interface, we don't have a use for
device->type yet, so this fits nicely.

I don't see the btrfs-progs patches in my inbox, and these don't apply, so you'll definitely need to refresh for a proper review, but looking at these patches they seem sane enough, and I like the interface.  I'd like to hear Zygo's opinion as well.

If we're going to use device->type for this, and since we don't have a user of
device->type, I'd also like you to go ahead and re-name ->type to
->allocation_policy, that way it's clear what we're using it for now.

I'd also like some xfstests to validate the behavior so we're sure we're testing this.  I'd want 1 test to just test the mechanics, like mkfs with different policies and validate they're set right, change policies, add/remove disks with different policies.

Then a second test to do something like fsstress with each set of allocation policies to validate that we did actually allocate from the correct disks.  For this please also test with compression on to make sure the test validation works for both normal allocation and compression (ie it doesn't assume writing 5gib of data == 5 gib of data usage, as compression chould give you a different value).

With that in place I think this is the correct way to implement this feature.
Thanks,

Josef

Josef Bacik Dec. 14, 2021, 2:31 p.m. UTC | #6

On Mon, Dec 13, 2021 at 05:49:22PM -0500, Zygo Blaxell wrote:
> On Mon, Dec 13, 2021 at 04:15:14PM -0500, Josef Bacik wrote:
> > On Mon, Dec 13, 2021 at 08:54:24PM +0100, Goffredo Baroncelli wrote:
> > > Gentle ping :-)
> > > 
> > > Are there anyone of the mains developer interested in supporting this patch ?
> > > 
> > > I am open to improve it if required.
> > > 
> > 
> > Sorry I missed this go by.  I like the interface, we don't have a use for
> > device->type yet, so this fits nicely.
> > 
> > I don't see the btrfs-progs patches in my inbox, and these don't apply, so
> > you'll definitely need to refresh for a proper review, but looking at these
> > patches they seem sane enough, and I like the interface.  I'd like to hear
> > Zygo's opinion as well.
> 
> I've been running earlier versions with modifications since summer 2020,
> and this version mostly unmodified (rebase changes only) since it was
> posted.  It seems to work, even in corner cases like converting balances,
> replacing drives, and running out of space.  The "running out of space"
> experience is on btrfs is weird at the best of times, and these patches
> add some more new special cases, but it doesn't behave in ways that
> would surprise a sysadmin familiar with how btrfs chunk allocation works.
> 
> One major piece that's missing is adjusting the statvfs (aka df)
> available blocks field so that it doesn't include unallocated space on
> any metadata-only devices.  Right now all the unallocated space on
> metadata-only devices is counted as free even though it's impossible to
> put a data block there, so anything that is triggered automatically
> on "f_bavail < some_threshold" will be confused.
> 
> I don't think that piece has to block the rest of the patch series--if
> you're not using the feature, df gives the right number (or at least the
> same number it gave before), and if you are using the feature, you can
> subtract the unavailable data space until a later patch comes along to
> fix it.
> 
> I like
> 
> 	echo data_only > /sys/fs/btrfs/$uuid/devinfo/3/type
> 
> more than patching btrfs-progs so I can use
> 
> 	btrfs prop set /dev/... allocation_hint data_only
> 
> but I admit that might be because I'm weird.
> 

Ok this is good enough for me.  Refresh the series Goffredo and I'll review it,
get some xfstests as I described, and fix up btrfs-progs to show the preferences
via either btrfs filesystem show or some other mechanism since the df thing will
be confusing.  We'll assume if you're using this feature you are aware of the
space gotchas and then fix up bugs as they show up.  With the testing and user
feedback tho this looks good enough for me, I just want to make sure we have
tests and such in place before merging.  Thanks,

Josef

Goffredo Baroncelli Dec. 14, 2021, 6:53 p.m. UTC | #7

On 12/13/21 22:15, Josef Bacik wrote:
> On Mon, Dec 13, 2021 at 08:54:24PM +0100, Goffredo Baroncelli wrote:
>> Gentle ping :-)
>>
>> Are there anyone of the mains developer interested in supporting this patch ?
>>
>> I am open to improve it if required.
>>
> 
> Sorry I missed this go by.  I like the interface, we don't have a use for
> device->type yet, so this fits nicely.
> 
> I don't see the btrfs-progs patches in my inbox, and these don't apply, so
> you'll definitely need to refresh for a proper review, but looking at these
> patches they seem sane enough, and I like the interface.  I'd like to hear
> Zygo's opinion as well.
> 
> If we're going to use device->type for this, and since we don't have a user of
> device->type, I'd also like you to go ahead and re-name ->type to
> ->allocation_policy, that way it's clear what we're using it for now.

The allocation policy requires only few bits (3 or 4); instead "type" is 64 bit wide.
We can allocate more bits for future extension, but I don't think that it is necessary
to allocate all the bits to the "allocation_policy".

This to say that renaming the field as allocation_policy is too restrictive if in future
we will use the other bits for another purposes.

I don't like the terms "type". May be flags, is it better ?

> 
> I'd also like some xfstests to validate the behavior so we're sure we're testing
> this.  I'd want 1 test to just test the mechanics, like mkfs with different
> policies and validate they're set right, change policies, add/remove disks with
> different policies.
> 
> Then a second test to do something like fsstress with each set of allocation
> policies to validate that we did actually allocate from the correct disks.  For
> this please also test with compression on to make sure the test validation works
> for both normal allocation and compression (ie it doesn't assume writing 5gib of
> data == 5 gib of data usage, as compression chould give you a different value).
> 
> With that in place I think this is the correct way to implement this feature.
> Thanks,

Ok

> 
> Josef

Goffredo Baroncelli Dec. 14, 2021, 7:03 p.m. UTC | #8

On 12/13/21 23:49, Zygo Blaxell wrote:
> On Mon, Dec 13, 2021 at 04:15:14PM -0500, Josef Bacik wrote:
>> On Mon, Dec 13, 2021 at 08:54:24PM +0100, Goffredo Baroncelli wrote:
>>> Gentle ping :-)
>>>
>>> Are there anyone of the mains developer interested in supporting this patch ?
>>>
>>> I am open to improve it if required.
>>>
>>
>> Sorry I missed this go by.  I like the interface, we don't have a use for
>> device->type yet, so this fits nicely.
>>
>> I don't see the btrfs-progs patches in my inbox, and these don't apply, so
>> you'll definitely need to refresh for a proper review, but looking at these
>> patches they seem sane enough, and I like the interface.  I'd like to hear
>> Zygo's opinion as well.
> 
> I've been running earlier versions with modifications since summer 2020,
> and this version mostly unmodified (rebase changes only) since it was
> posted.  It seems to work, even in corner cases like converting balances,
> replacing drives, and running out of space.  The "running out of space"
> experience is on btrfs is weird at the best of times, and these patches
> add some more new special cases, but it doesn't behave in ways that
> would surprise a sysadmin familiar with how btrfs chunk allocation works.
> 
> One major piece that's missing is adjusting the statvfs (aka df)
> available blocks field so that it doesn't include unallocated space on
> any metadata-only devices.  Right now all the unallocated space on
> metadata-only devices is counted as free even though it's impossible to
> put a data block there, so anything that is triggered automatically
> on "f_bavail < some_threshold" will be confused.
> 
> I don't think that piece has to block the rest of the patch series--if
> you're not using the feature, df gives the right number (or at least the
> same number it gave before), and if you are using the feature, you can
> subtract the unavailable data space until a later patch comes along to
> fix it.
> 
> I like
> 
> 	echo data_only > /sys/fs/btrfs/$uuid/devinfo/3/type

Only to be clear, for now you can pass a numeric value to "type". Not a text
like your example.

However I want to put on the table another option: to not expose all the
"type" field, but only the "allocation policy"; we can add a new sysfs field
called "allocation policy" that internally change the dev_item->type field.

It is not only a "cosmetic" change. If we want to change the allocation
policy, now the correct way is:
- read the type field
- change the "allocation policy" bits
- write the type field

Which is race 'prone'

For now it is not a problem, because type contains only the allocation bits.
But in future when the type field will contains further properties this could
be a problem.

> 
> more than patching btrfs-progs so I can use
> 
> 	btrfs prop set /dev/... allocation_hint data_only
> 
> but I admit that might be because I'm weird.

I prefer the echo approach too; however it is not very ergonomics in conjunction
to sudo....

> 
>> If we're going to use device->type for this, and since we don't have a user of
>> device->type, I'd also like you to go ahead and re-name ->type to
>> ->allocation_policy, that way it's clear what we're using it for now.
>>
>> I'd also like some xfstests to validate the behavior so we're sure we're testing
>> this.  I'd want 1 test to just test the mechanics, like mkfs with different
>> policies and validate they're set right, change policies, add/remove disks with
>> different policies.
>>
>> Then a second test to do something like fsstress with each set of allocation
>> policies to validate that we did actually allocate from the correct disks.  For
>> this please also test with compression on to make sure the test validation works
>> for both normal allocation and compression (ie it doesn't assume writing 5gib of
>> data == 5 gib of data usage, as compression chould give you a different value).
>>
>> With that in place I think this is the correct way to implement this feature.
>> Thanks,
>>
>> Josef

Zygo Blaxell Dec. 14, 2021, 8:04 p.m. UTC | #9

On Tue, Dec 14, 2021 at 08:03:45PM +0100, Goffredo Baroncelli wrote:
> On 12/13/21 23:49, Zygo Blaxell wrote:
> > On Mon, Dec 13, 2021 at 04:15:14PM -0500, Josef Bacik wrote:
> > > On Mon, Dec 13, 2021 at 08:54:24PM +0100, Goffredo Baroncelli wrote:
> > > > Gentle ping :-)
> > > > 
> > > > Are there anyone of the mains developer interested in supporting this patch ?
> > > > 
> > > > I am open to improve it if required.
> > > > 
> > > 
> > > Sorry I missed this go by.  I like the interface, we don't have a use for
> > > device->type yet, so this fits nicely.
> > > 
> > > I don't see the btrfs-progs patches in my inbox, and these don't apply, so
> > > you'll definitely need to refresh for a proper review, but looking at these
> > > patches they seem sane enough, and I like the interface.  I'd like to hear
> > > Zygo's opinion as well.
> > 
> > I've been running earlier versions with modifications since summer 2020,
> > and this version mostly unmodified (rebase changes only) since it was
> > posted.  It seems to work, even in corner cases like converting balances,
> > replacing drives, and running out of space.  The "running out of space"
> > experience is on btrfs is weird at the best of times, and these patches
> > add some more new special cases, but it doesn't behave in ways that
> > would surprise a sysadmin familiar with how btrfs chunk allocation works.
> > 
> > One major piece that's missing is adjusting the statvfs (aka df)
> > available blocks field so that it doesn't include unallocated space on
> > any metadata-only devices.  Right now all the unallocated space on
> > metadata-only devices is counted as free even though it's impossible to
> > put a data block there, so anything that is triggered automatically
> > on "f_bavail < some_threshold" will be confused.
> > 
> > I don't think that piece has to block the rest of the patch series--if
> > you're not using the feature, df gives the right number (or at least the
> > same number it gave before), and if you are using the feature, you can
> > subtract the unavailable data space until a later patch comes along to
> > fix it.
> > 
> > I like
> > 
> > 	echo data_only > /sys/fs/btrfs/$uuid/devinfo/3/type
> 
> Only to be clear, for now you can pass a numeric value to "type". Not a text
> like your example.
> 
> However I want to put on the table another option: to not expose all the
> "type" field, but only the "allocation policy"; we can add a new sysfs field
> called "allocation policy" that internally change the dev_item->type field.
> 
> It is not only a "cosmetic" change. If we want to change the allocation
> policy, now the correct way is:
> - read the type field
> - change the "allocation policy" bits
> - write the type field
> 
> Which is race 'prone'

> For now it is not a problem, because type contains only the allocation bits.
> But in future when the type field will contains further properties this could
> be a problem.

Yeah, keep the interface very narrow, don't hand out access to random bits.

If the kernel supports additional bits, it should support additional
sysfs filenames to go with them.  Or it could put all the supported
options in the sysfs field, like block IO schedulers do, so you could
find this in the file by reading it:

	[prefer_data] prefer_metadata metadata_only data_only

> > more than patching btrfs-progs so I can use
> > 
> > 	btrfs prop set /dev/... allocation_hint data_only
> > 
> > but I admit that might be because I'm weird.
> 
> I prefer the echo approach too; however it is not very ergonomics in conjunction
> to sudo....

For /proc/sys/* we have the 'sysctl' tool, so you can write 'sysctl
vm.drop_caches=1' or 'sudo sysctl vm.drop_caches=1'.  For some reason
we don't have this for sysfs (or maybe it's just Debian...?) so we have
to write things like 'echo foo | sudo tee /sys/fs/...'.

Of course btrfs-progs could always open the
/sys/fs/btrfs/.../allocation_policy file and write to it.  But if we're
modifying btrfs-progs then we could use the ioctl interface anyway.

I don't have a strong preference for either sysfs or ioctl, nor am I
opposed to simply implementing both.  I'll let someone who does have
such a preference make their case.

> > > If we're going to use device->type for this, and since we don't have a user of
> > > device->type, I'd also like you to go ahead and re-name ->type to
> > > ->allocation_policy, that way it's clear what we're using it for now.
> > > 
> > > I'd also like some xfstests to validate the behavior so we're sure we're testing
> > > this.  I'd want 1 test to just test the mechanics, like mkfs with different
> > > policies and validate they're set right, change policies, add/remove disks with
> > > different policies.
> > > 
> > > Then a second test to do something like fsstress with each set of allocation
> > > policies to validate that we did actually allocate from the correct disks.  For
> > > this please also test with compression on to make sure the test validation works
> > > for both normal allocation and compression (ie it doesn't assume writing 5gib of
> > > data == 5 gib of data usage, as compression chould give you a different value).
> > > 
> > > With that in place I think this is the correct way to implement this feature.
> > > Thanks,
> > > 
> > > Josef
> 
> 
> -- 
> gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
> Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5

Josef Bacik Dec. 14, 2021, 8:34 p.m. UTC | #10

On Tue, Dec 14, 2021 at 03:04:32PM -0500, Zygo Blaxell wrote:
> On Tue, Dec 14, 2021 at 08:03:45PM +0100, Goffredo Baroncelli wrote:
> > On 12/13/21 23:49, Zygo Blaxell wrote:
> > > On Mon, Dec 13, 2021 at 04:15:14PM -0500, Josef Bacik wrote:
> > > > On Mon, Dec 13, 2021 at 08:54:24PM +0100, Goffredo Baroncelli wrote:
> > > > > Gentle ping :-)
> > > > > 
> > > > > Are there anyone of the mains developer interested in supporting this patch ?
> > > > > 
> > > > > I am open to improve it if required.
> > > > > 
> > > > 
> > > > Sorry I missed this go by.  I like the interface, we don't have a use for
> > > > device->type yet, so this fits nicely.
> > > > 
> > > > I don't see the btrfs-progs patches in my inbox, and these don't apply, so
> > > > you'll definitely need to refresh for a proper review, but looking at these
> > > > patches they seem sane enough, and I like the interface.  I'd like to hear
> > > > Zygo's opinion as well.
> > > 
> > > I've been running earlier versions with modifications since summer 2020,
> > > and this version mostly unmodified (rebase changes only) since it was
> > > posted.  It seems to work, even in corner cases like converting balances,
> > > replacing drives, and running out of space.  The "running out of space"
> > > experience is on btrfs is weird at the best of times, and these patches
> > > add some more new special cases, but it doesn't behave in ways that
> > > would surprise a sysadmin familiar with how btrfs chunk allocation works.
> > > 
> > > One major piece that's missing is adjusting the statvfs (aka df)
> > > available blocks field so that it doesn't include unallocated space on
> > > any metadata-only devices.  Right now all the unallocated space on
> > > metadata-only devices is counted as free even though it's impossible to
> > > put a data block there, so anything that is triggered automatically
> > > on "f_bavail < some_threshold" will be confused.
> > > 
> > > I don't think that piece has to block the rest of the patch series--if
> > > you're not using the feature, df gives the right number (or at least the
> > > same number it gave before), and if you are using the feature, you can
> > > subtract the unavailable data space until a later patch comes along to
> > > fix it.
> > > 
> > > I like
> > > 
> > > 	echo data_only > /sys/fs/btrfs/$uuid/devinfo/3/type
> > 
> > Only to be clear, for now you can pass a numeric value to "type". Not a text
> > like your example.
> > 
> > However I want to put on the table another option: to not expose all the
> > "type" field, but only the "allocation policy"; we can add a new sysfs field
> > called "allocation policy" that internally change the dev_item->type field.
> > 
> > It is not only a "cosmetic" change. If we want to change the allocation
> > policy, now the correct way is:
> > - read the type field
> > - change the "allocation policy" bits
> > - write the type field
> > 
> > Which is race 'prone'
> 
> > For now it is not a problem, because type contains only the allocation bits.
> > But in future when the type field will contains further properties this could
> > be a problem.
> 
> Yeah, keep the interface very narrow, don't hand out access to random bits.
> 
> If the kernel supports additional bits, it should support additional
> sysfs filenames to go with them.  Or it could put all the supported
> options in the sysfs field, like block IO schedulers do, so you could
> find this in the file by reading it:
> 
> 	[prefer_data] prefer_metadata metadata_only data_only
> 
> > > more than patching btrfs-progs so I can use
> > > 
> > > 	btrfs prop set /dev/... allocation_hint data_only
> > > 
> > > but I admit that might be because I'm weird.
> > 
> > I prefer the echo approach too; however it is not very ergonomics in conjunction
> > to sudo....
> 
> For /proc/sys/* we have the 'sysctl' tool, so you can write 'sysctl
> vm.drop_caches=1' or 'sudo sysctl vm.drop_caches=1'.  For some reason
> we don't have this for sysfs (or maybe it's just Debian...?) so we have
> to write things like 'echo foo | sudo tee /sys/fs/...'.
> 
> Of course btrfs-progs could always open the
> /sys/fs/btrfs/.../allocation_policy file and write to it.  But if we're
> modifying btrfs-progs then we could use the ioctl interface anyway.
> 
> I don't have a strong preference for either sysfs or ioctl, nor am I
> opposed to simply implementing both.  I'll let someone who does have
> such a preference make their case.

I think echo'ing a name into sysfs is better than bits for sure.  However I want
the ability to set the device properties via a btrfs-progs command offline so I
can setup the storage and then mount the file system.  I want

1) The sysfs interface so you can change things on the fly.  This stays
   persistent of course, so the way it works is perfect.

2) The btrfs-progs command sets it on offline devices.  If you point it at a
   live mounted fs it can simply use the sysfs thing to do it live.

Does this seem reasonable?  Thanks,

Josef

Josef Bacik Dec. 14, 2021, 8:35 p.m. UTC | #11

On Tue, Dec 14, 2021 at 07:53:34PM +0100, Goffredo Baroncelli wrote:
> On 12/13/21 22:15, Josef Bacik wrote:
> > On Mon, Dec 13, 2021 at 08:54:24PM +0100, Goffredo Baroncelli wrote:
> > > Gentle ping :-)
> > > 
> > > Are there anyone of the mains developer interested in supporting this patch ?
> > > 
> > > I am open to improve it if required.
> > > 
> > 
> > Sorry I missed this go by.  I like the interface, we don't have a use for
> > device->type yet, so this fits nicely.
> > 
> > I don't see the btrfs-progs patches in my inbox, and these don't apply, so
> > you'll definitely need to refresh for a proper review, but looking at these
> > patches they seem sane enough, and I like the interface.  I'd like to hear
> > Zygo's opinion as well.
> > 
> > If we're going to use device->type for this, and since we don't have a user of
> > device->type, I'd also like you to go ahead and re-name ->type to
> > ->allocation_policy, that way it's clear what we're using it for now.
> 
> The allocation policy requires only few bits (3 or 4); instead "type" is 64 bit wide.
> We can allocate more bits for future extension, but I don't think that it is necessary
> to allocate all the bits to the "allocation_policy".
> 
> This to say that renaming the field as allocation_policy is too restrictive if in future
> we will use the other bits for another purposes.
> 
> I don't like the terms "type". May be flags, is it better ?

Sure, rename it to flags and make the allocation policies flags.  The important
part is that "type" isn't accurate, and since it's not used by anything
currently we're free to rename it so it better represents the information it
holds.  Thanks,

Josef

Goffredo Baroncelli Dec. 14, 2021, 8:41 p.m. UTC | #12

On 12/14/21 21:34, Josef Bacik wrote:
> On Tue, Dec 14, 2021 at 03:04:32PM -0500, Zygo Blaxell wrote:
>> On Tue, Dec 14, 2021 at 08:03:45PM +0100, Goffredo Baroncelli wrote:

>>
>> I don't have a strong preference for either sysfs or ioctl, nor am I
>> opposed to simply implementing both.  I'll let someone who does have
>> such a preference make their case.
> 
> I think echo'ing a name into sysfs is better than bits for sure.  However I want
> the ability to set the device properties via a btrfs-progs command offline so I
> can setup the storage and then mount the file system.  I want
> 
> 1) The sysfs interface so you can change things on the fly.  This stays
>     persistent of course, so the way it works is perfect.
> 
> 2) The btrfs-progs command sets it on offline devices.  If you point it at a
>     live mounted fs it can simply use the sysfs thing to do it live.

#2 is currently not implemented. However I think that we should do.

The problem is that we need to update both:

- the superblock		(simple)
- the dev_item item		(not so simple...)

What about using only bits from the superblock to store this property ?

> 
> Does this seem reasonable?  Thanks,
> 
> Josef

Josef Bacik Dec. 15, 2021, 1:58 p.m. UTC | #13

On Tue, Dec 14, 2021 at 09:41:21PM +0100, Goffredo Baroncelli wrote:
> On 12/14/21 21:34, Josef Bacik wrote:
> > On Tue, Dec 14, 2021 at 03:04:32PM -0500, Zygo Blaxell wrote:
> > > On Tue, Dec 14, 2021 at 08:03:45PM +0100, Goffredo Baroncelli wrote:
> 
> > > 
> > > I don't have a strong preference for either sysfs or ioctl, nor am I
> > > opposed to simply implementing both.  I'll let someone who does have
> > > such a preference make their case.
> > 
> > I think echo'ing a name into sysfs is better than bits for sure.  However I want
> > the ability to set the device properties via a btrfs-progs command offline so I
> > can setup the storage and then mount the file system.  I want
> > 
> > 1) The sysfs interface so you can change things on the fly.  This stays
> >     persistent of course, so the way it works is perfect.
> > 
> > 2) The btrfs-progs command sets it on offline devices.  If you point it at a
> >     live mounted fs it can simply use the sysfs thing to do it live.
> 
> #2 is currently not implemented. However I think that we should do.
> 
> The problem is that we need to update both:
> 
> - the superblock		(simple)
> - the dev_item item		(not so simple...)
> 
> What about using only bits from the superblock to store this property ?

I'm looking at the patches and you only are updating the dev_item, am I missing
something for the super block?

For offline all you would need to do is do the normal open_ctree,
btrfs_search_slot to the item and update the device item type, that's
straightforward.

For online if you use btrfs prop you can see if the fs is mounted and just find
the sysfs file to modify and do it that way.

But this also brings up another point, we're going to want a compat bit for
this.  It doesn't make the fs unusable for old kernels, so just a normal
BTRFS_FS_COMPAT_<whatever> flag is fine.  If the setting gets set you set the
compat flag.

You'll also need to modify the tree checker to check the dev item allocation
hints to make sure they're valid, via check_dev_item.  Thanks,

Josef

Goffredo Baroncelli Dec. 15, 2021, 6:53 p.m. UTC | #14

On 12/15/21 14:58, Josef Bacik wrote:
> On Tue, Dec 14, 2021 at 09:41:21PM +0100, Goffredo Baroncelli wrote:
>> On 12/14/21 21:34, Josef Bacik wrote:
>>> On Tue, Dec 14, 2021 at 03:04:32PM -0500, Zygo Blaxell wrote:
>>>> On Tue, Dec 14, 2021 at 08:03:45PM +0100, Goffredo Baroncelli wrote:
>>
>>>>
>>>> I don't have a strong preference for either sysfs or ioctl, nor am I
>>>> opposed to simply implementing both.  I'll let someone who does have
>>>> such a preference make their case.
>>>
>>> I think echo'ing a name into sysfs is better than bits for sure.  However I want
>>> the ability to set the device properties via a btrfs-progs command offline so I
>>> can setup the storage and then mount the file system.  I want
>>>
>>> 1) The sysfs interface so you can change things on the fly.  This stays
>>>      persistent of course, so the way it works is perfect.
>>>
>>> 2) The btrfs-progs command sets it on offline devices.  If you point it at a
>>>      live mounted fs it can simply use the sysfs thing to do it live.
>>
>> #2 is currently not implemented. However I think that we should do.
>>
>> The problem is that we need to update both:
>>
>> - the superblock		(simple)
>> - the dev_item item		(not so simple...)
>>
>> What about using only bits from the superblock to store this property ?
> 
> I'm looking at the patches and you only are updating the dev_item, am I missing
> something for the super block?

When btrfs write the superblocks (see write_all_supers() in disk-io.c), it copies
the dev_item fields (contained in fs_info->fs_devices->devices lists) in each
superblock before updating it.

> 
> For offline all you would need to do is do the normal open_ctree,
> btrfs_search_slot to the item and update the device item type, that's
> straightforward.
> 
> For online if you use btrfs prop you can see if the fs is mounted and just find
> the sysfs file to modify and do it that way.
> 
> But this also brings up another point, we're going to want a compat bit for
> this.  It doesn't make the fs unusable for old kernels, so just a normal
> BTRFS_FS_COMPAT_<whatever> flag is fine.  If the setting gets set you set the
> compat flag.

Why we need a "compact" bit ? The new kernels know how treat the dev_item_type field.
The old kernels ignore it. The worst thing is that a filesystem may require a balance
before reaching a good shape (i.e. the metadata on ssd and the data on a spinning disk)
  



> You'll also need to modify the tree checker to check the dev item allocation
> hints to make sure they're valid, via check_dev_item.  Thanks,

Ok

> 
> Josef

Josef Bacik Dec. 16, 2021, 12:56 a.m. UTC | #15

On Wed, Dec 15, 2021 at 07:53:40PM +0100, Goffredo Baroncelli wrote:
> On 12/15/21 14:58, Josef Bacik wrote:
> > On Tue, Dec 14, 2021 at 09:41:21PM +0100, Goffredo Baroncelli wrote:
> > > On 12/14/21 21:34, Josef Bacik wrote:
> > > > On Tue, Dec 14, 2021 at 03:04:32PM -0500, Zygo Blaxell wrote:
> > > > > On Tue, Dec 14, 2021 at 08:03:45PM +0100, Goffredo Baroncelli wrote:
> > > 
> > > > > 
> > > > > I don't have a strong preference for either sysfs or ioctl, nor am I
> > > > > opposed to simply implementing both.  I'll let someone who does have
> > > > > such a preference make their case.
> > > > 
> > > > I think echo'ing a name into sysfs is better than bits for sure.  However I want
> > > > the ability to set the device properties via a btrfs-progs command offline so I
> > > > can setup the storage and then mount the file system.  I want
> > > > 
> > > > 1) The sysfs interface so you can change things on the fly.  This stays
> > > >      persistent of course, so the way it works is perfect.
> > > > 
> > > > 2) The btrfs-progs command sets it on offline devices.  If you point it at a
> > > >      live mounted fs it can simply use the sysfs thing to do it live.
> > > 
> > > #2 is currently not implemented. However I think that we should do.
> > > 
> > > The problem is that we need to update both:
> > > 
> > > - the superblock		(simple)
> > > - the dev_item item		(not so simple...)
> > > 
> > > What about using only bits from the superblock to store this property ?
> > 
> > I'm looking at the patches and you only are updating the dev_item, am I missing
> > something for the super block?
> 
> When btrfs write the superblocks (see write_all_supers() in disk-io.c), it copies
> the dev_item fields (contained in fs_info->fs_devices->devices lists) in each
> superblock before updating it.
> 

Oh right.  Still, I hope we're doing this correctly in btrfs-progs, if not
that's a problem.

> > 
> > For offline all you would need to do is do the normal open_ctree,
> > btrfs_search_slot to the item and update the device item type, that's
> > straightforward.
> > 
> > For online if you use btrfs prop you can see if the fs is mounted and just find
> > the sysfs file to modify and do it that way.
> > 
> > But this also brings up another point, we're going to want a compat bit for
> > this.  It doesn't make the fs unusable for old kernels, so just a normal
> > BTRFS_FS_COMPAT_<whatever> flag is fine.  If the setting gets set you set the
> > compat flag.
> 
> Why we need a "compact" bit ? The new kernels know how treat the dev_item_type field.
> The old kernels ignore it. The worst thing is that a filesystem may require a balance
> before reaching a good shape (i.e. the metadata on ssd and the data on a spinning disk)
> 

So you can do the validation below, tho I'm thinking I care about it less, if we
just make sure that type is correct regardless of the compat bit then that's
fine.  Thanks,

Josef

Paul Jones Dec. 16, 2021, 2:30 a.m. UTC | #16

> -----Original Message-----
> From: Josef Bacik <josef@toxicpanda.com>
> Sent: Thursday, 16 December 2021 12:58 AM
> To: kreijack@inwind.it
> Cc: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>; David Sterba
> <dsterba@suse.cz>; Sinnamohideen Shafeeq <shafeeqs@panasas.com>;
> Paul Jones <paul@pauljones.id.au>; linux-btrfs@vger.kernel.org
> Subject: Re: [RFC][V8][PATCH 0/5] btrfs: allocation_hint mode
> 
> On Tue, Dec 14, 2021 at 09:41:21PM +0100, Goffredo Baroncelli wrote:
> > On 12/14/21 21:34, Josef Bacik wrote:
> > > On Tue, Dec 14, 2021 at 03:04:32PM -0500, Zygo Blaxell wrote:
> > > > On Tue, Dec 14, 2021 at 08:03:45PM +0100, Goffredo Baroncelli wrote:
> >
> > > >
> > > > I don't have a strong preference for either sysfs or ioctl, nor am
> > > > I opposed to simply implementing both.  I'll let someone who does
> > > > have such a preference make their case.
> > >
> > > I think echo'ing a name into sysfs is better than bits for sure.
> > > However I want the ability to set the device properties via a
> > > btrfs-progs command offline so I can setup the storage and then
> > > mount the file system.  I want
> > >
> > > 1) The sysfs interface so you can change things on the fly.  This stays
> > >     persistent of course, so the way it works is perfect.
> > >
> > > 2) The btrfs-progs command sets it on offline devices.  If you point it at a
> > >     live mounted fs it can simply use the sysfs thing to do it live.
> >
> > #2 is currently not implemented. However I think that we should do.
> >
> > The problem is that we need to update both:
> >
> > - the superblock		(simple)
> > - the dev_item item		(not so simple...)
> >
> > What about using only bits from the superblock to store this property ?
> 
> I'm looking at the patches and you only are updating the dev_item, am I
> missing something for the super block?
> 
> For offline all you would need to do is do the normal open_ctree,
> btrfs_search_slot to the item and update the device item type, that's
> straightforward.
> 
> For online if you use btrfs prop you can see if the fs is mounted and just find
> the sysfs file to modify and do it that way.
> 
> But this also brings up another point, we're going to want a compat bit for
> this.  It doesn't make the fs unusable for old kernels, so just a normal
> BTRFS_FS_COMPAT_<whatever> flag is fine.  If the setting gets set you set
> the compat flag.

I don't think that is necessary - I've remounted my filesystems a few times with an unpatched kernel, and all that happens is new blocks get allocated without any hints - I like that simplicity and compatibility. As long as it's mentioned in the docs I think it's fine.

Having a balance filter that only rebalances wrongly allocated blocks would be a nice addition though.


Paul.

Zygo Blaxell Dec. 17, 2021, 5:40 a.m. UTC | #17

On Wed, Dec 15, 2021 at 07:56:32PM -0500, Josef Bacik wrote:
> On Wed, Dec 15, 2021 at 07:53:40PM +0100, Goffredo Baroncelli wrote:
> > On 12/15/21 14:58, Josef Bacik wrote:
> > > On Tue, Dec 14, 2021 at 09:41:21PM +0100, Goffredo Baroncelli wrote:
> > > > On 12/14/21 21:34, Josef Bacik wrote:
> > > > > On Tue, Dec 14, 2021 at 03:04:32PM -0500, Zygo Blaxell wrote:
> > > > > > On Tue, Dec 14, 2021 at 08:03:45PM +0100, Goffredo Baroncelli wrote:
> > > > 
> > > > > > 
> > > > > > I don't have a strong preference for either sysfs or ioctl, nor am I
> > > > > > opposed to simply implementing both.  I'll let someone who does have
> > > > > > such a preference make their case.
> > > > > 
> > > > > I think echo'ing a name into sysfs is better than bits for sure.  However I want
> > > > > the ability to set the device properties via a btrfs-progs command offline so I
> > > > > can setup the storage and then mount the file system.  I want
> > > > > 
> > > > > 1) The sysfs interface so you can change things on the fly.  This stays
> > > > >      persistent of course, so the way it works is perfect.
> > > > > 
> > > > > 2) The btrfs-progs command sets it on offline devices.  If you point it at a
> > > > >      live mounted fs it can simply use the sysfs thing to do it live.
> > > > 
> > > > #2 is currently not implemented. However I think that we should do.
> > > > 
> > > > The problem is that we need to update both:
> > > > 
> > > > - the superblock		(simple)
> > > > - the dev_item item		(not so simple...)
> > > > 
> > > > What about using only bits from the superblock to store this property ?
> > > 
> > > I'm looking at the patches and you only are updating the dev_item, am I missing
> > > something for the super block?
> > 
> > When btrfs write the superblocks (see write_all_supers() in disk-io.c), it copies
> > the dev_item fields (contained in fs_info->fs_devices->devices lists) in each
> > superblock before updating it.
> > 
> 
> Oh right.  Still, I hope we're doing this correctly in btrfs-progs, if not
> that's a problem.
> 
> > > 
> > > For offline all you would need to do is do the normal open_ctree,
> > > btrfs_search_slot to the item and update the device item type, that's
> > > straightforward.
> > > 
> > > For online if you use btrfs prop you can see if the fs is mounted and just find
> > > the sysfs file to modify and do it that way.
> > > 
> > > But this also brings up another point, we're going to want a compat bit for
> > > this.  It doesn't make the fs unusable for old kernels, so just a normal
> > > BTRFS_FS_COMPAT_<whatever> flag is fine.  If the setting gets set you set the
> > > compat flag.
> > 
> > Why we need a "compact" bit ? The new kernels know how treat the dev_item_type field.
> > The old kernels ignore it. The worst thing is that a filesystem may require a balance
> > before reaching a good shape (i.e. the metadata on ssd and the data on a spinning disk)
> 
> So you can do the validation below, tho I'm thinking I care about it less, if we
> just make sure that type is correct regardless of the compat bit then that's
> fine.  Thanks,

In theory if you get stuck in an impossible allocation situation (like all
your disks are data-only and you run out of metadata space) then one way
to recover from it is to mount with an old kernel which doesn't respect
the type bits.  Another way to recover would be to flip the type bits
while the filesystem is offline with btrfs-progs.  A third way would be to
have a mount option for newer kernels to ignore the allocation bits like
old kernels do (yes I know I already said I didn't like that idea).

If we have a bit that says "old kernels don't mount this filesystem any
more" then we lose one of those recovery options, and the other options
aren't implemented yet.

While I think of it, the metadata reservation system eventually needs
to know that it can't use data-only devices for metadata, the same way
that df eventually needs to know about metadata-only devices.


> Josef
>

Josef Bacik Dec. 17, 2021, 2:48 p.m. UTC | #18

On Fri, Dec 17, 2021 at 12:40:55AM -0500, Zygo Blaxell wrote:
> On Wed, Dec 15, 2021 at 07:56:32PM -0500, Josef Bacik wrote:
> > On Wed, Dec 15, 2021 at 07:53:40PM +0100, Goffredo Baroncelli wrote:
> > > On 12/15/21 14:58, Josef Bacik wrote:
> > > > On Tue, Dec 14, 2021 at 09:41:21PM +0100, Goffredo Baroncelli wrote:
> > > > > On 12/14/21 21:34, Josef Bacik wrote:
> > > > > > On Tue, Dec 14, 2021 at 03:04:32PM -0500, Zygo Blaxell wrote:
> > > > > > > On Tue, Dec 14, 2021 at 08:03:45PM +0100, Goffredo Baroncelli wrote:
> > > > > 
> > > > > > > 
> > > > > > > I don't have a strong preference for either sysfs or ioctl, nor am I
> > > > > > > opposed to simply implementing both.  I'll let someone who does have
> > > > > > > such a preference make their case.
> > > > > > 
> > > > > > I think echo'ing a name into sysfs is better than bits for sure.  However I want
> > > > > > the ability to set the device properties via a btrfs-progs command offline so I
> > > > > > can setup the storage and then mount the file system.  I want
> > > > > > 
> > > > > > 1) The sysfs interface so you can change things on the fly.  This stays
> > > > > >      persistent of course, so the way it works is perfect.
> > > > > > 
> > > > > > 2) The btrfs-progs command sets it on offline devices.  If you point it at a
> > > > > >      live mounted fs it can simply use the sysfs thing to do it live.
> > > > > 
> > > > > #2 is currently not implemented. However I think that we should do.
> > > > > 
> > > > > The problem is that we need to update both:
> > > > > 
> > > > > - the superblock		(simple)
> > > > > - the dev_item item		(not so simple...)
> > > > > 
> > > > > What about using only bits from the superblock to store this property ?
> > > > 
> > > > I'm looking at the patches and you only are updating the dev_item, am I missing
> > > > something for the super block?
> > > 
> > > When btrfs write the superblocks (see write_all_supers() in disk-io.c), it copies
> > > the dev_item fields (contained in fs_info->fs_devices->devices lists) in each
> > > superblock before updating it.
> > > 
> > 
> > Oh right.  Still, I hope we're doing this correctly in btrfs-progs, if not
> > that's a problem.
> > 
> > > > 
> > > > For offline all you would need to do is do the normal open_ctree,
> > > > btrfs_search_slot to the item and update the device item type, that's
> > > > straightforward.
> > > > 
> > > > For online if you use btrfs prop you can see if the fs is mounted and just find
> > > > the sysfs file to modify and do it that way.
> > > > 
> > > > But this also brings up another point, we're going to want a compat bit for
> > > > this.  It doesn't make the fs unusable for old kernels, so just a normal
> > > > BTRFS_FS_COMPAT_<whatever> flag is fine.  If the setting gets set you set the
> > > > compat flag.
> > > 
> > > Why we need a "compact" bit ? The new kernels know how treat the dev_item_type field.
> > > The old kernels ignore it. The worst thing is that a filesystem may require a balance
> > > before reaching a good shape (i.e. the metadata on ssd and the data on a spinning disk)
> > 
> > So you can do the validation below, tho I'm thinking I care about it less, if we
> > just make sure that type is correct regardless of the compat bit then that's
> > fine.  Thanks,
> 
> In theory if you get stuck in an impossible allocation situation (like all
> your disks are data-only and you run out of metadata space) then one way
> to recover from it is to mount with an old kernel which doesn't respect
> the type bits.  Another way to recover would be to flip the type bits
> while the filesystem is offline with btrfs-progs.  A third way would be to
> have a mount option for newer kernels to ignore the allocation bits like
> old kernels do (yes I know I already said I didn't like that idea).
> 

This is the "preferred" vs "only" category.  You set "preferred" if you want to
be able to handle failures cleanly, you set "only" if you'd rather reprovision a
box than lose performance.

> If we have a bit that says "old kernels don't mount this filesystem any
> more" then we lose one of those recovery options, and the other options
> aren't implemented yet.
> 
> While I think of it, the metadata reservation system eventually needs
> to know that it can't use data-only devices for metadata, the same way
> that df eventually needs to know about metadata-only devices.
> 

Yeah, I'm willing to throw that into the power user bucket.  It really only
affects overcommit, because once we've allocated all our chunks we'll stop
overcommitting.  There may be some corners that need to be addressed, but we can
burn those bridges when we get to them.  Thanks,

Josef

Zygo Blaxell Dec. 17, 2021, 4:31 p.m. UTC | #19

On Fri, Dec 17, 2021 at 09:48:40AM -0500, Josef Bacik wrote:
> On Fri, Dec 17, 2021 at 12:40:55AM -0500, Zygo Blaxell wrote:
> > On Wed, Dec 15, 2021 at 07:56:32PM -0500, Josef Bacik wrote:
> > > On Wed, Dec 15, 2021 at 07:53:40PM +0100, Goffredo Baroncelli wrote:
> > > > On 12/15/21 14:58, Josef Bacik wrote:
> > > > > On Tue, Dec 14, 2021 at 09:41:21PM +0100, Goffredo Baroncelli wrote:
> > > > > > On 12/14/21 21:34, Josef Bacik wrote:
> > > > > > > On Tue, Dec 14, 2021 at 03:04:32PM -0500, Zygo Blaxell wrote:
> > > > > > > > On Tue, Dec 14, 2021 at 08:03:45PM +0100, Goffredo Baroncelli wrote:
> > > > > > 
> > > > > > > > 
> > > > > > > > I don't have a strong preference for either sysfs or ioctl, nor am I
> > > > > > > > opposed to simply implementing both.  I'll let someone who does have
> > > > > > > > such a preference make their case.
> > > > > > > 
> > > > > > > I think echo'ing a name into sysfs is better than bits for sure.  However I want
> > > > > > > the ability to set the device properties via a btrfs-progs command offline so I
> > > > > > > can setup the storage and then mount the file system.  I want
> > > > > > > 
> > > > > > > 1) The sysfs interface so you can change things on the fly.  This stays
> > > > > > >      persistent of course, so the way it works is perfect.
> > > > > > > 
> > > > > > > 2) The btrfs-progs command sets it on offline devices.  If you point it at a
> > > > > > >      live mounted fs it can simply use the sysfs thing to do it live.
> > > > > > 
> > > > > > #2 is currently not implemented. However I think that we should do.
> > > > > > 
> > > > > > The problem is that we need to update both:
> > > > > > 
> > > > > > - the superblock		(simple)
> > > > > > - the dev_item item		(not so simple...)
> > > > > > 
> > > > > > What about using only bits from the superblock to store this property ?
> > > > > 
> > > > > I'm looking at the patches and you only are updating the dev_item, am I missing
> > > > > something for the super block?
> > > > 
> > > > When btrfs write the superblocks (see write_all_supers() in disk-io.c), it copies
> > > > the dev_item fields (contained in fs_info->fs_devices->devices lists) in each
> > > > superblock before updating it.
> > > > 
> > > 
> > > Oh right.  Still, I hope we're doing this correctly in btrfs-progs, if not
> > > that's a problem.
> > > 
> > > > > 
> > > > > For offline all you would need to do is do the normal open_ctree,
> > > > > btrfs_search_slot to the item and update the device item type, that's
> > > > > straightforward.
> > > > > 
> > > > > For online if you use btrfs prop you can see if the fs is mounted and just find
> > > > > the sysfs file to modify and do it that way.
> > > > > 
> > > > > But this also brings up another point, we're going to want a compat bit for
> > > > > this.  It doesn't make the fs unusable for old kernels, so just a normal
> > > > > BTRFS_FS_COMPAT_<whatever> flag is fine.  If the setting gets set you set the
> > > > > compat flag.
> > > > 
> > > > Why we need a "compact" bit ? The new kernels know how treat the dev_item_type field.
> > > > The old kernels ignore it. The worst thing is that a filesystem may require a balance
> > > > before reaching a good shape (i.e. the metadata on ssd and the data on a spinning disk)
> > > 
> > > So you can do the validation below, tho I'm thinking I care about it less, if we
> > > just make sure that type is correct regardless of the compat bit then that's
> > > fine.  Thanks,
> > 
> > In theory if you get stuck in an impossible allocation situation (like all
> > your disks are data-only and you run out of metadata space) then one way
> > to recover from it is to mount with an old kernel which doesn't respect
> > the type bits.  Another way to recover would be to flip the type bits
> > while the filesystem is offline with btrfs-progs.  A third way would be to
> > have a mount option for newer kernels to ignore the allocation bits like
> > old kernels do (yes I know I already said I didn't like that idea).
> > 
> 
> This is the "preferred" vs "only" category.  You set "preferred" if you want to
> be able to handle failures cleanly, you set "only" if you'd rather reprovision a
> box than lose performance.

Yeah, you're right here...and I was right the first time.  ;)
In case I forget again, I'll write down why...

It's a corner of a corner of a corner case.  To get there, we have to 1)
set up the unsatisfiable allocation preferences, 2) run out of existing
metadata allocations so we're one big commit away from metadata ENOSPC,
and 3) in a small commit, write something to the filesystem that forces
big metadata allocations during the next mount (like a deleted but
uncleaned snapshot, or an orphan inode).  It's not a new case: steps 2
and 3 are sufficient to get into this corner, and allocation preferences
in step 1 only get us to step 2 faster.  At any time before step 3,
we can change the allocation preferences and avoid the whole thing.

mkfs changes this slightly, since it's possible in mkfs to set up
condition 1 without having any metadata on the filesystem; however,
the filesystem would be unmountable, and it would require a separate bug
in mkfs to create the filesystem at all.  So the filesystem blows up,
but it can only happen on an empty filesystem.  I can live with that.

It would be nice to have a way to avoid step 3 in general (like disable
orphan inode cleanup and snapshot delete on mount, so we can arrange
for some more metadata storage before those run) but that's completely
orthogonal to the allocation preferences patch, solving a problem that
existed before and after.

> > If we have a bit that says "old kernels don't mount this filesystem any
> > more" then we lose one of those recovery options, and the other options
> > aren't implemented yet.
> > 
> > While I think of it, the metadata reservation system eventually needs
> > to know that it can't use data-only devices for metadata, the same way
> > that df eventually needs to know about metadata-only devices.
> 
> Yeah, I'm willing to throw that into the power user bucket.  It really only
> affects overcommit, because once we've allocated all our chunks we'll stop
> overcommitting.  There may be some corners that need to be addressed, but we can
> burn those bridges when we get to them.  Thanks,

> Josef
>

Goffredo Baroncelli Dec. 17, 2021, 6:08 p.m. UTC | #20

On 12/17/21 06:40, Zygo Blaxell wrote:

> 
> In theory if you get stuck in an impossible allocation situation (like all
> your disks are data-only and you run out of metadata space) then one way
> to recover from it is to mount with an old kernel which doesn't respect
> the type bits.  

> Another way to recover would be to flip the type bits
> while the filesystem is offline with btrfs-progs.  

Unfortunately this would not doable; type is backup-ped in the dev_item
which are stored in the metadata.

So despite the fact that the filesystem is mounted or not, changing
"type" would require updating metadata which in turn may require to
allocate a new metadata BG.

> A third way would be to
> have a mount option for newer kernels to ignore the allocation bits like
> old kernels do (yes I know I already said I didn't like that idea).

The patch is still available. However I have to point out that this is not
a new problem. Will be always some cases where the metadata space is
exhausted and is not possible to make the needing changes to create space.

If the user uses the _ONLY flags, it is shooting in its feet. We should warn
in the manual to avoid that. But even if the user do that we still have the
problem. Frankly speaking I fatigue to see an use case where _ONLY flags are
needed. IMHO _PREFERRED are enough; what it is needed is to improve the
reporting to the user about these cases: something like that for each
btrfs-prog invocation a warning is raised when the metadata space is near to
exhaustion...

> 
> If we have a bit that says "old kernels don't mount this filesystem any
> more" then we lose one of those recovery options, and the other options
> aren't implemented yet.

Using an old kernel is not a solution. Sooner another new non-backward
compatible change will prevent that.

> 
> While I think of it, the metadata reservation system eventually needs
> to know that it can't use data-only devices for metadata, the same way
> that df eventually needs to know about metadata-only devices.
> 
> 
>> Josef
>>