Message ID | cover.1635089352.git.kreijack@inwind.it (mailing list archive) |
---|---|
Headers | show |
Series | btrfs: allocation_hint mode | expand |
> -----Original Message----- > From: Goffredo Baroncelli <kreijack@tiscali.it> > Sent: Monday, 25 October 2021 2:31 AM > To: linux-btrfs@vger.kernel.org > Cc: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>; Josef Bacik > <josef@toxicpanda.com>; David Sterba <dsterba@suse.cz>; Sinnamohideen > Shafeeq <shafeeqs@panasas.com>; Goffredo Baroncelli > <kreijack@inwind.it> > Subject: [RFC][V8][PATCH 0/5] btrfs: allocation_hint mode > > From: Goffredo Baroncelli <kreijack@inwind.it> > > Hi all, > > This patches set was born after some discussion between me, Zygo and > Josef. > Some details can be found in https://github.com/btrfs/btrfs-todo/issues/19. > > Some further information about a real use case can be found in > https://lore.kernel.org/linux- > btrfs/20210116002533.GE31381@hungrycats.org/ > > Reently Shafeeq told me that he is interested too, due to the performance > gain. > > In this revision I switched away from an ioctl API in favor of a sysfs API ( see > patch #2 and #3). > > The idea behind this patches set, is to dedicate some disks (the fastest one) > to the metadata chunk. My initial idea was a "soft" hint. However Zygo asked > an option for a "strong" hint (== mandatory). The result is that each disk can > be "tagged" by one of the following flags: > - BTRFS_DEV_ALLOCATION_METADATA_ONLY > - BTRFS_DEV_ALLOCATION_PREFERRED_METADATA > - BTRFS_DEV_ALLOCATION_PREFERRED_DATA > - BTRFS_DEV_ALLOCATION_DATA_ONLY > > When the chunk allocator search a disks to allocate a chunk, scans the disks in > an order decided by these tags. For metadata, the order is: > *_METADATA_ONLY > *_PREFERRED_METADATA > *_PREFERRED_DATA > > The *_DATA_ONLY are not eligible from metadata chunk allocation. > > For the data chunk, the order is reversed, and the *_METADATA_ONLY are > excluded. > > The exact sort logic is to sort first for the "tag", and then for the space > available. If there is no space available, the next "tag" disks set are selected. > > To set these tags, a new property called "allocation_hint" was created. > There is a dedicated btrfs-prog patches set [[PATCH V5] btrfs-progs: > allocation_hint disk property]. > > $ sudo mount /dev/loop0 /mnt/test-btrfs/ $ for i in /dev/loop[0-9]; do sudo > ./btrfs prop get $i allocation_hint; done devid=1, path=/dev/loop0: > allocation_hint=PREFERRED_METADATA > devid=2, path=/dev/loop1: allocation_hint=PREFERRED_METADATA > devid=3, path=/dev/loop2: allocation_hint=PREFERRED_DATA devid=4, > path=/dev/loop3: allocation_hint=PREFERRED_DATA devid=5, > path=/dev/loop4: allocation_hint=PREFERRED_DATA devid=6, > path=/dev/loop5: allocation_hint=DATA_ONLY devid=7, path=/dev/loop6: > allocation_hint=METADATA_ONLY devid=8, path=/dev/loop7: > allocation_hint=METADATA_ONLY > > $ sudo ./btrfs fi us /mnt/test-btrfs/ > Overall: > Device size: 2.75GiB > Device allocated: 1.34GiB > Device unallocated: 1.41GiB > Device missing: 0.00B > Used: 400.89MiB > Free (estimated): 1.04GiB (min: 1.04GiB) > Data ratio: 2.00 > Metadata ratio: 1.00 > Global reserve: 3.25MiB (used: 0.00B) > Multiple profiles: no > > Data,RAID1: Size:542.00MiB, Used:200.25MiB (36.95%) > /dev/loop0 288.00MiB > /dev/loop1 288.00MiB > /dev/loop2 127.00MiB > /dev/loop3 127.00MiB > /dev/loop4 127.00MiB > /dev/loop5 127.00MiB > > Metadata,single: Size:256.00MiB, Used:384.00KiB (0.15%) > /dev/loop1 256.00MiB > > System,single: Size:32.00MiB, Used:16.00KiB (0.05%) > /dev/loop0 32.00MiB > > Unallocated: > /dev/loop0 704.00MiB > /dev/loop1 480.00MiB > /dev/loop2 1.00MiB > /dev/loop3 1.00MiB > /dev/loop4 1.00MiB > /dev/loop5 1.00MiB > /dev/loop6 128.00MiB > /dev/loop7 128.00MiB > > # change the tag of some disks > > $ sudo ./btrfs prop set /dev/loop0 allocation_hint DATA_ONLY $ sudo ./btrfs > prop set /dev/loop1 allocation_hint DATA_ONLY $ sudo ./btrfs prop set > /dev/loop5 allocation_hint METADATA_ONLY > > $ for i in /dev/loop[0-9]; do sudo ./btrfs prop get $i allocation_hint; done > devid=1, path=/dev/loop0: allocation_hint=DATA_ONLY devid=2, > path=/dev/loop1: allocation_hint=DATA_ONLY devid=3, path=/dev/loop2: > allocation_hint=PREFERRED_DATA devid=4, path=/dev/loop3: > allocation_hint=PREFERRED_DATA devid=5, path=/dev/loop4: > allocation_hint=PREFERRED_DATA devid=6, path=/dev/loop5: > allocation_hint=METADATA_ONLY devid=7, path=/dev/loop6: > allocation_hint=METADATA_ONLY devid=8, path=/dev/loop7: > allocation_hint=METADATA_ONLY > > $ sudo btrfs bal start --full-balance /mnt/test-btrfs/ $ sudo ./btrfs fi us > /mnt/test-btrfs/ > Overall: > Device size: 2.75GiB > Device allocated: 735.00MiB > Device unallocated: 2.03GiB > Device missing: 0.00B > Used: 400.72MiB > Free (estimated): 1.10GiB (min: 1.10GiB) > Data ratio: 2.00 > Metadata ratio: 1.00 > Global reserve: 3.25MiB (used: 0.00B) > Multiple profiles: no > > Data,RAID1: Size:288.00MiB, Used:200.19MiB (69.51%) > /dev/loop0 288.00MiB > /dev/loop1 288.00MiB > > Metadata,single: Size:127.00MiB, Used:336.00KiB (0.26%) > /dev/loop5 127.00MiB > > System,single: Size:32.00MiB, Used:16.00KiB (0.05%) > /dev/loop7 32.00MiB > > Unallocated: > /dev/loop0 736.00MiB > /dev/loop1 736.00MiB > /dev/loop2 128.00MiB > /dev/loop3 128.00MiB > /dev/loop4 128.00MiB > /dev/loop5 1.00MiB > /dev/loop6 128.00MiB > /dev/loop7 96.00MiB > > > #As you can see all the metadata were placed on the disk loop5/loop7 even if > #the most empty one are loop0 and loop1. > > > > TODO: > - more tests > - the tool which show the space available should consider the tagging (eg > the disks tagged by _METADATA_ONLY should be excluded from the data > availability) > > > Comments are welcome > BR > G.Baroncelli I've been running this patch series since about V4, works really well. Would be nice to have it merged eventually. Tested By: Paul Jones <paul@pauljones.id.au>
Gentle ping :-) Are there anyone of the mains developer interested in supporting this patch ? I am open to improve it if required. BR G.Baroncelli On 12/13/21 10:39, Paul Jones wrote: >> -----Original Message----- >> From: Goffredo Baroncelli <kreijack@tiscali.it> >> Sent: Monday, 25 October 2021 2:31 AM >> To: linux-btrfs@vger.kernel.org >> Cc: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>; Josef Bacik >> <josef@toxicpanda.com>; David Sterba <dsterba@suse.cz>; Sinnamohideen >> Shafeeq <shafeeqs@panasas.com>; Goffredo Baroncelli >> <kreijack@inwind.it> >> Subject: [RFC][V8][PATCH 0/5] btrfs: allocation_hint mode >> >> From: Goffredo Baroncelli <kreijack@inwind.it> >> >> Hi all, >> >> This patches set was born after some discussion between me, Zygo and >> Josef. >> Some details can be found in https://github.com/btrfs/btrfs-todo/issues/19. >> >> Some further information about a real use case can be found in >> https://lore.kernel.org/linux- >> btrfs/20210116002533.GE31381@hungrycats.org/ >> >> Reently Shafeeq told me that he is interested too, due to the performance >> gain. >> >> In this revision I switched away from an ioctl API in favor of a sysfs API ( see >> patch #2 and #3). >> >> The idea behind this patches set, is to dedicate some disks (the fastest one) >> to the metadata chunk. My initial idea was a "soft" hint. However Zygo asked >> an option for a "strong" hint (== mandatory). The result is that each disk can >> be "tagged" by one of the following flags: >> - BTRFS_DEV_ALLOCATION_METADATA_ONLY >> - BTRFS_DEV_ALLOCATION_PREFERRED_METADATA >> - BTRFS_DEV_ALLOCATION_PREFERRED_DATA >> - BTRFS_DEV_ALLOCATION_DATA_ONLY >> >> When the chunk allocator search a disks to allocate a chunk, scans the disks in >> an order decided by these tags. For metadata, the order is: >> *_METADATA_ONLY >> *_PREFERRED_METADATA >> *_PREFERRED_DATA >> >> The *_DATA_ONLY are not eligible from metadata chunk allocation. >> >> For the data chunk, the order is reversed, and the *_METADATA_ONLY are >> excluded. >> >> The exact sort logic is to sort first for the "tag", and then for the space >> available. If there is no space available, the next "tag" disks set are selected. >> >> To set these tags, a new property called "allocation_hint" was created. >> There is a dedicated btrfs-prog patches set [[PATCH V5] btrfs-progs: >> allocation_hint disk property]. >> >> $ sudo mount /dev/loop0 /mnt/test-btrfs/ $ for i in /dev/loop[0-9]; do sudo >> ./btrfs prop get $i allocation_hint; done devid=1, path=/dev/loop0: >> allocation_hint=PREFERRED_METADATA >> devid=2, path=/dev/loop1: allocation_hint=PREFERRED_METADATA >> devid=3, path=/dev/loop2: allocation_hint=PREFERRED_DATA devid=4, >> path=/dev/loop3: allocation_hint=PREFERRED_DATA devid=5, >> path=/dev/loop4: allocation_hint=PREFERRED_DATA devid=6, >> path=/dev/loop5: allocation_hint=DATA_ONLY devid=7, path=/dev/loop6: >> allocation_hint=METADATA_ONLY devid=8, path=/dev/loop7: >> allocation_hint=METADATA_ONLY >> >> $ sudo ./btrfs fi us /mnt/test-btrfs/ >> Overall: >> Device size: 2.75GiB >> Device allocated: 1.34GiB >> Device unallocated: 1.41GiB >> Device missing: 0.00B >> Used: 400.89MiB >> Free (estimated): 1.04GiB (min: 1.04GiB) >> Data ratio: 2.00 >> Metadata ratio: 1.00 >> Global reserve: 3.25MiB (used: 0.00B) >> Multiple profiles: no >> >> Data,RAID1: Size:542.00MiB, Used:200.25MiB (36.95%) >> /dev/loop0 288.00MiB >> /dev/loop1 288.00MiB >> /dev/loop2 127.00MiB >> /dev/loop3 127.00MiB >> /dev/loop4 127.00MiB >> /dev/loop5 127.00MiB >> >> Metadata,single: Size:256.00MiB, Used:384.00KiB (0.15%) >> /dev/loop1 256.00MiB >> >> System,single: Size:32.00MiB, Used:16.00KiB (0.05%) >> /dev/loop0 32.00MiB >> >> Unallocated: >> /dev/loop0 704.00MiB >> /dev/loop1 480.00MiB >> /dev/loop2 1.00MiB >> /dev/loop3 1.00MiB >> /dev/loop4 1.00MiB >> /dev/loop5 1.00MiB >> /dev/loop6 128.00MiB >> /dev/loop7 128.00MiB >> >> # change the tag of some disks >> >> $ sudo ./btrfs prop set /dev/loop0 allocation_hint DATA_ONLY $ sudo ./btrfs >> prop set /dev/loop1 allocation_hint DATA_ONLY $ sudo ./btrfs prop set >> /dev/loop5 allocation_hint METADATA_ONLY >> >> $ for i in /dev/loop[0-9]; do sudo ./btrfs prop get $i allocation_hint; done >> devid=1, path=/dev/loop0: allocation_hint=DATA_ONLY devid=2, >> path=/dev/loop1: allocation_hint=DATA_ONLY devid=3, path=/dev/loop2: >> allocation_hint=PREFERRED_DATA devid=4, path=/dev/loop3: >> allocation_hint=PREFERRED_DATA devid=5, path=/dev/loop4: >> allocation_hint=PREFERRED_DATA devid=6, path=/dev/loop5: >> allocation_hint=METADATA_ONLY devid=7, path=/dev/loop6: >> allocation_hint=METADATA_ONLY devid=8, path=/dev/loop7: >> allocation_hint=METADATA_ONLY >> >> $ sudo btrfs bal start --full-balance /mnt/test-btrfs/ $ sudo ./btrfs fi us >> /mnt/test-btrfs/ >> Overall: >> Device size: 2.75GiB >> Device allocated: 735.00MiB >> Device unallocated: 2.03GiB >> Device missing: 0.00B >> Used: 400.72MiB >> Free (estimated): 1.10GiB (min: 1.10GiB) >> Data ratio: 2.00 >> Metadata ratio: 1.00 >> Global reserve: 3.25MiB (used: 0.00B) >> Multiple profiles: no >> >> Data,RAID1: Size:288.00MiB, Used:200.19MiB (69.51%) >> /dev/loop0 288.00MiB >> /dev/loop1 288.00MiB >> >> Metadata,single: Size:127.00MiB, Used:336.00KiB (0.26%) >> /dev/loop5 127.00MiB >> >> System,single: Size:32.00MiB, Used:16.00KiB (0.05%) >> /dev/loop7 32.00MiB >> >> Unallocated: >> /dev/loop0 736.00MiB >> /dev/loop1 736.00MiB >> /dev/loop2 128.00MiB >> /dev/loop3 128.00MiB >> /dev/loop4 128.00MiB >> /dev/loop5 1.00MiB >> /dev/loop6 128.00MiB >> /dev/loop7 96.00MiB >> >> >> #As you can see all the metadata were placed on the disk loop5/loop7 even if >> #the most empty one are loop0 and loop1. >> >> >> >> TODO: >> - more tests >> - the tool which show the space available should consider the tagging (eg >> the disks tagged by _METADATA_ONLY should be excluded from the data >> availability) >> >> >> Comments are welcome >> BR >> G.Baroncelli > > > I've been running this patch series since about V4, works really well. Would be nice to have it merged eventually. > > Tested By: Paul Jones <paul@pauljones.id.au> >
On Mon, Dec 13, 2021 at 08:54:24PM +0100, Goffredo Baroncelli wrote: > Gentle ping :-) > > Are there anyone of the mains developer interested in supporting this patch ? > > I am open to improve it if required. > Sorry I missed this go by. I like the interface, we don't have a use for device->type yet, so this fits nicely. I don't see the btrfs-progs patches in my inbox, and these don't apply, so you'll definitely need to refresh for a proper review, but looking at these patches they seem sane enough, and I like the interface. I'd like to hear Zygo's opinion as well. If we're going to use device->type for this, and since we don't have a user of device->type, I'd also like you to go ahead and re-name ->type to ->allocation_policy, that way it's clear what we're using it for now. I'd also like some xfstests to validate the behavior so we're sure we're testing this. I'd want 1 test to just test the mechanics, like mkfs with different policies and validate they're set right, change policies, add/remove disks with different policies. Then a second test to do something like fsstress with each set of allocation policies to validate that we did actually allocate from the correct disks. For this please also test with compression on to make sure the test validation works for both normal allocation and compression (ie it doesn't assume writing 5gib of data == 5 gib of data usage, as compression chould give you a different value). With that in place I think this is the correct way to implement this feature. Thanks, Josef
On Mon, Dec 13, 2021 at 04:15:14PM -0500, Josef Bacik wrote: > On Mon, Dec 13, 2021 at 08:54:24PM +0100, Goffredo Baroncelli wrote: > > Gentle ping :-) > > > > Are there anyone of the mains developer interested in supporting this patch ? > > > > I am open to improve it if required. > > > > Sorry I missed this go by. I like the interface, we don't have a use for > device->type yet, so this fits nicely. > > I don't see the btrfs-progs patches in my inbox, and these don't apply, so > you'll definitely need to refresh for a proper review, but looking at these > patches they seem sane enough, and I like the interface. I'd like to hear > Zygo's opinion as well. I've been running earlier versions with modifications since summer 2020, and this version mostly unmodified (rebase changes only) since it was posted. It seems to work, even in corner cases like converting balances, replacing drives, and running out of space. The "running out of space" experience is on btrfs is weird at the best of times, and these patches add some more new special cases, but it doesn't behave in ways that would surprise a sysadmin familiar with how btrfs chunk allocation works. One major piece that's missing is adjusting the statvfs (aka df) available blocks field so that it doesn't include unallocated space on any metadata-only devices. Right now all the unallocated space on metadata-only devices is counted as free even though it's impossible to put a data block there, so anything that is triggered automatically on "f_bavail < some_threshold" will be confused. I don't think that piece has to block the rest of the patch series--if you're not using the feature, df gives the right number (or at least the same number it gave before), and if you are using the feature, you can subtract the unavailable data space until a later patch comes along to fix it. I like echo data_only > /sys/fs/btrfs/$uuid/devinfo/3/type more than patching btrfs-progs so I can use btrfs prop set /dev/... allocation_hint data_only but I admit that might be because I'm weird. > If we're going to use device->type for this, and since we don't have a user of > device->type, I'd also like you to go ahead and re-name ->type to > ->allocation_policy, that way it's clear what we're using it for now. > > I'd also like some xfstests to validate the behavior so we're sure we're testing > this. I'd want 1 test to just test the mechanics, like mkfs with different > policies and validate they're set right, change policies, add/remove disks with > different policies. > > Then a second test to do something like fsstress with each set of allocation > policies to validate that we did actually allocate from the correct disks. For > this please also test with compression on to make sure the test validation works > for both normal allocation and compression (ie it doesn't assume writing 5gib of > data == 5 gib of data usage, as compression chould give you a different value). > > With that in place I think this is the correct way to implement this feature. > Thanks, > > Josef
Panasas would be very happy with this patch - it solves a problem we have when delete-only workloads thrash the disks. Because it is equivalent to "rm -rf" of an old directory, caching does't help since the metadata of the files being deleted will have fallen out of bcache long ago. So giving up a small amount of SSD space to always hold the metadata yields a big performance improvement. We got as far as verifying that it doesn't break existing xfstests, we are open to writing the test cases you listed and also to check that the device out-of-space cases behave as expected. Thanks, Shafeeq -----Original Message----- From: Josef Bacik <josef@toxicpanda.com> Sent: Monday, December 13, 2021 4:15 PM To: kreijack@inwind.it Cc: David Sterba <dsterba@suse.cz>; Sinnamohideen, Shafeeq <shafeeqs@panasas.com>; Paul Jones <paul@pauljones.id.au>; linux-btrfs@vger.kernel.org; Zygo Blaxell <ce3g8jdj@umail.furryterror.org> Subject: Re: [RFC][V8][PATCH 0/5] btrfs: allocation_hint mode On Mon, Dec 13, 2021 at 08:54:24PM +0100, Goffredo Baroncelli wrote: > Gentle ping :-) > > Are there anyone of the mains developer interested in supporting this patch ? > > I am open to improve it if required. > Sorry I missed this go by. I like the interface, we don't have a use for device->type yet, so this fits nicely. I don't see the btrfs-progs patches in my inbox, and these don't apply, so you'll definitely need to refresh for a proper review, but looking at these patches they seem sane enough, and I like the interface. I'd like to hear Zygo's opinion as well. If we're going to use device->type for this, and since we don't have a user of device->type, I'd also like you to go ahead and re-name ->type to ->allocation_policy, that way it's clear what we're using it for now. I'd also like some xfstests to validate the behavior so we're sure we're testing this. I'd want 1 test to just test the mechanics, like mkfs with different policies and validate they're set right, change policies, add/remove disks with different policies. Then a second test to do something like fsstress with each set of allocation policies to validate that we did actually allocate from the correct disks. For this please also test with compression on to make sure the test validation works for both normal allocation and compression (ie it doesn't assume writing 5gib of data == 5 gib of data usage, as compression chould give you a different value). With that in place I think this is the correct way to implement this feature. Thanks, Josef
On Mon, Dec 13, 2021 at 05:49:22PM -0500, Zygo Blaxell wrote: > On Mon, Dec 13, 2021 at 04:15:14PM -0500, Josef Bacik wrote: > > On Mon, Dec 13, 2021 at 08:54:24PM +0100, Goffredo Baroncelli wrote: > > > Gentle ping :-) > > > > > > Are there anyone of the mains developer interested in supporting this patch ? > > > > > > I am open to improve it if required. > > > > > > > Sorry I missed this go by. I like the interface, we don't have a use for > > device->type yet, so this fits nicely. > > > > I don't see the btrfs-progs patches in my inbox, and these don't apply, so > > you'll definitely need to refresh for a proper review, but looking at these > > patches they seem sane enough, and I like the interface. I'd like to hear > > Zygo's opinion as well. > > I've been running earlier versions with modifications since summer 2020, > and this version mostly unmodified (rebase changes only) since it was > posted. It seems to work, even in corner cases like converting balances, > replacing drives, and running out of space. The "running out of space" > experience is on btrfs is weird at the best of times, and these patches > add some more new special cases, but it doesn't behave in ways that > would surprise a sysadmin familiar with how btrfs chunk allocation works. > > One major piece that's missing is adjusting the statvfs (aka df) > available blocks field so that it doesn't include unallocated space on > any metadata-only devices. Right now all the unallocated space on > metadata-only devices is counted as free even though it's impossible to > put a data block there, so anything that is triggered automatically > on "f_bavail < some_threshold" will be confused. > > I don't think that piece has to block the rest of the patch series--if > you're not using the feature, df gives the right number (or at least the > same number it gave before), and if you are using the feature, you can > subtract the unavailable data space until a later patch comes along to > fix it. > > I like > > echo data_only > /sys/fs/btrfs/$uuid/devinfo/3/type > > more than patching btrfs-progs so I can use > > btrfs prop set /dev/... allocation_hint data_only > > but I admit that might be because I'm weird. > Ok this is good enough for me. Refresh the series Goffredo and I'll review it, get some xfstests as I described, and fix up btrfs-progs to show the preferences via either btrfs filesystem show or some other mechanism since the df thing will be confusing. We'll assume if you're using this feature you are aware of the space gotchas and then fix up bugs as they show up. With the testing and user feedback tho this looks good enough for me, I just want to make sure we have tests and such in place before merging. Thanks, Josef
On 12/13/21 22:15, Josef Bacik wrote: > On Mon, Dec 13, 2021 at 08:54:24PM +0100, Goffredo Baroncelli wrote: >> Gentle ping :-) >> >> Are there anyone of the mains developer interested in supporting this patch ? >> >> I am open to improve it if required. >> > > Sorry I missed this go by. I like the interface, we don't have a use for > device->type yet, so this fits nicely. > > I don't see the btrfs-progs patches in my inbox, and these don't apply, so > you'll definitely need to refresh for a proper review, but looking at these > patches they seem sane enough, and I like the interface. I'd like to hear > Zygo's opinion as well. > > If we're going to use device->type for this, and since we don't have a user of > device->type, I'd also like you to go ahead and re-name ->type to > ->allocation_policy, that way it's clear what we're using it for now. The allocation policy requires only few bits (3 or 4); instead "type" is 64 bit wide. We can allocate more bits for future extension, but I don't think that it is necessary to allocate all the bits to the "allocation_policy". This to say that renaming the field as allocation_policy is too restrictive if in future we will use the other bits for another purposes. I don't like the terms "type". May be flags, is it better ? > > I'd also like some xfstests to validate the behavior so we're sure we're testing > this. I'd want 1 test to just test the mechanics, like mkfs with different > policies and validate they're set right, change policies, add/remove disks with > different policies. > > Then a second test to do something like fsstress with each set of allocation > policies to validate that we did actually allocate from the correct disks. For > this please also test with compression on to make sure the test validation works > for both normal allocation and compression (ie it doesn't assume writing 5gib of > data == 5 gib of data usage, as compression chould give you a different value). > > With that in place I think this is the correct way to implement this feature. > Thanks, Ok > > Josef
On 12/13/21 23:49, Zygo Blaxell wrote: > On Mon, Dec 13, 2021 at 04:15:14PM -0500, Josef Bacik wrote: >> On Mon, Dec 13, 2021 at 08:54:24PM +0100, Goffredo Baroncelli wrote: >>> Gentle ping :-) >>> >>> Are there anyone of the mains developer interested in supporting this patch ? >>> >>> I am open to improve it if required. >>> >> >> Sorry I missed this go by. I like the interface, we don't have a use for >> device->type yet, so this fits nicely. >> >> I don't see the btrfs-progs patches in my inbox, and these don't apply, so >> you'll definitely need to refresh for a proper review, but looking at these >> patches they seem sane enough, and I like the interface. I'd like to hear >> Zygo's opinion as well. > > I've been running earlier versions with modifications since summer 2020, > and this version mostly unmodified (rebase changes only) since it was > posted. It seems to work, even in corner cases like converting balances, > replacing drives, and running out of space. The "running out of space" > experience is on btrfs is weird at the best of times, and these patches > add some more new special cases, but it doesn't behave in ways that > would surprise a sysadmin familiar with how btrfs chunk allocation works. > > One major piece that's missing is adjusting the statvfs (aka df) > available blocks field so that it doesn't include unallocated space on > any metadata-only devices. Right now all the unallocated space on > metadata-only devices is counted as free even though it's impossible to > put a data block there, so anything that is triggered automatically > on "f_bavail < some_threshold" will be confused. > > I don't think that piece has to block the rest of the patch series--if > you're not using the feature, df gives the right number (or at least the > same number it gave before), and if you are using the feature, you can > subtract the unavailable data space until a later patch comes along to > fix it. > > I like > > echo data_only > /sys/fs/btrfs/$uuid/devinfo/3/type Only to be clear, for now you can pass a numeric value to "type". Not a text like your example. However I want to put on the table another option: to not expose all the "type" field, but only the "allocation policy"; we can add a new sysfs field called "allocation policy" that internally change the dev_item->type field. It is not only a "cosmetic" change. If we want to change the allocation policy, now the correct way is: - read the type field - change the "allocation policy" bits - write the type field Which is race 'prone' For now it is not a problem, because type contains only the allocation bits. But in future when the type field will contains further properties this could be a problem. > > more than patching btrfs-progs so I can use > > btrfs prop set /dev/... allocation_hint data_only > > but I admit that might be because I'm weird. I prefer the echo approach too; however it is not very ergonomics in conjunction to sudo.... > >> If we're going to use device->type for this, and since we don't have a user of >> device->type, I'd also like you to go ahead and re-name ->type to >> ->allocation_policy, that way it's clear what we're using it for now. >> >> I'd also like some xfstests to validate the behavior so we're sure we're testing >> this. I'd want 1 test to just test the mechanics, like mkfs with different >> policies and validate they're set right, change policies, add/remove disks with >> different policies. >> >> Then a second test to do something like fsstress with each set of allocation >> policies to validate that we did actually allocate from the correct disks. For >> this please also test with compression on to make sure the test validation works >> for both normal allocation and compression (ie it doesn't assume writing 5gib of >> data == 5 gib of data usage, as compression chould give you a different value). >> >> With that in place I think this is the correct way to implement this feature. >> Thanks, >> >> Josef
On Tue, Dec 14, 2021 at 08:03:45PM +0100, Goffredo Baroncelli wrote: > On 12/13/21 23:49, Zygo Blaxell wrote: > > On Mon, Dec 13, 2021 at 04:15:14PM -0500, Josef Bacik wrote: > > > On Mon, Dec 13, 2021 at 08:54:24PM +0100, Goffredo Baroncelli wrote: > > > > Gentle ping :-) > > > > > > > > Are there anyone of the mains developer interested in supporting this patch ? > > > > > > > > I am open to improve it if required. > > > > > > > > > > Sorry I missed this go by. I like the interface, we don't have a use for > > > device->type yet, so this fits nicely. > > > > > > I don't see the btrfs-progs patches in my inbox, and these don't apply, so > > > you'll definitely need to refresh for a proper review, but looking at these > > > patches they seem sane enough, and I like the interface. I'd like to hear > > > Zygo's opinion as well. > > > > I've been running earlier versions with modifications since summer 2020, > > and this version mostly unmodified (rebase changes only) since it was > > posted. It seems to work, even in corner cases like converting balances, > > replacing drives, and running out of space. The "running out of space" > > experience is on btrfs is weird at the best of times, and these patches > > add some more new special cases, but it doesn't behave in ways that > > would surprise a sysadmin familiar with how btrfs chunk allocation works. > > > > One major piece that's missing is adjusting the statvfs (aka df) > > available blocks field so that it doesn't include unallocated space on > > any metadata-only devices. Right now all the unallocated space on > > metadata-only devices is counted as free even though it's impossible to > > put a data block there, so anything that is triggered automatically > > on "f_bavail < some_threshold" will be confused. > > > > I don't think that piece has to block the rest of the patch series--if > > you're not using the feature, df gives the right number (or at least the > > same number it gave before), and if you are using the feature, you can > > subtract the unavailable data space until a later patch comes along to > > fix it. > > > > I like > > > > echo data_only > /sys/fs/btrfs/$uuid/devinfo/3/type > > Only to be clear, for now you can pass a numeric value to "type". Not a text > like your example. > > However I want to put on the table another option: to not expose all the > "type" field, but only the "allocation policy"; we can add a new sysfs field > called "allocation policy" that internally change the dev_item->type field. > > It is not only a "cosmetic" change. If we want to change the allocation > policy, now the correct way is: > - read the type field > - change the "allocation policy" bits > - write the type field > > Which is race 'prone' > For now it is not a problem, because type contains only the allocation bits. > But in future when the type field will contains further properties this could > be a problem. Yeah, keep the interface very narrow, don't hand out access to random bits. If the kernel supports additional bits, it should support additional sysfs filenames to go with them. Or it could put all the supported options in the sysfs field, like block IO schedulers do, so you could find this in the file by reading it: [prefer_data] prefer_metadata metadata_only data_only > > more than patching btrfs-progs so I can use > > > > btrfs prop set /dev/... allocation_hint data_only > > > > but I admit that might be because I'm weird. > > I prefer the echo approach too; however it is not very ergonomics in conjunction > to sudo.... For /proc/sys/* we have the 'sysctl' tool, so you can write 'sysctl vm.drop_caches=1' or 'sudo sysctl vm.drop_caches=1'. For some reason we don't have this for sysfs (or maybe it's just Debian...?) so we have to write things like 'echo foo | sudo tee /sys/fs/...'. Of course btrfs-progs could always open the /sys/fs/btrfs/.../allocation_policy file and write to it. But if we're modifying btrfs-progs then we could use the ioctl interface anyway. I don't have a strong preference for either sysfs or ioctl, nor am I opposed to simply implementing both. I'll let someone who does have such a preference make their case. > > > If we're going to use device->type for this, and since we don't have a user of > > > device->type, I'd also like you to go ahead and re-name ->type to > > > ->allocation_policy, that way it's clear what we're using it for now. > > > > > > I'd also like some xfstests to validate the behavior so we're sure we're testing > > > this. I'd want 1 test to just test the mechanics, like mkfs with different > > > policies and validate they're set right, change policies, add/remove disks with > > > different policies. > > > > > > Then a second test to do something like fsstress with each set of allocation > > > policies to validate that we did actually allocate from the correct disks. For > > > this please also test with compression on to make sure the test validation works > > > for both normal allocation and compression (ie it doesn't assume writing 5gib of > > > data == 5 gib of data usage, as compression chould give you a different value). > > > > > > With that in place I think this is the correct way to implement this feature. > > > Thanks, > > > > > > Josef > > > -- > gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it> > Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5
On Tue, Dec 14, 2021 at 03:04:32PM -0500, Zygo Blaxell wrote: > On Tue, Dec 14, 2021 at 08:03:45PM +0100, Goffredo Baroncelli wrote: > > On 12/13/21 23:49, Zygo Blaxell wrote: > > > On Mon, Dec 13, 2021 at 04:15:14PM -0500, Josef Bacik wrote: > > > > On Mon, Dec 13, 2021 at 08:54:24PM +0100, Goffredo Baroncelli wrote: > > > > > Gentle ping :-) > > > > > > > > > > Are there anyone of the mains developer interested in supporting this patch ? > > > > > > > > > > I am open to improve it if required. > > > > > > > > > > > > > Sorry I missed this go by. I like the interface, we don't have a use for > > > > device->type yet, so this fits nicely. > > > > > > > > I don't see the btrfs-progs patches in my inbox, and these don't apply, so > > > > you'll definitely need to refresh for a proper review, but looking at these > > > > patches they seem sane enough, and I like the interface. I'd like to hear > > > > Zygo's opinion as well. > > > > > > I've been running earlier versions with modifications since summer 2020, > > > and this version mostly unmodified (rebase changes only) since it was > > > posted. It seems to work, even in corner cases like converting balances, > > > replacing drives, and running out of space. The "running out of space" > > > experience is on btrfs is weird at the best of times, and these patches > > > add some more new special cases, but it doesn't behave in ways that > > > would surprise a sysadmin familiar with how btrfs chunk allocation works. > > > > > > One major piece that's missing is adjusting the statvfs (aka df) > > > available blocks field so that it doesn't include unallocated space on > > > any metadata-only devices. Right now all the unallocated space on > > > metadata-only devices is counted as free even though it's impossible to > > > put a data block there, so anything that is triggered automatically > > > on "f_bavail < some_threshold" will be confused. > > > > > > I don't think that piece has to block the rest of the patch series--if > > > you're not using the feature, df gives the right number (or at least the > > > same number it gave before), and if you are using the feature, you can > > > subtract the unavailable data space until a later patch comes along to > > > fix it. > > > > > > I like > > > > > > echo data_only > /sys/fs/btrfs/$uuid/devinfo/3/type > > > > Only to be clear, for now you can pass a numeric value to "type". Not a text > > like your example. > > > > However I want to put on the table another option: to not expose all the > > "type" field, but only the "allocation policy"; we can add a new sysfs field > > called "allocation policy" that internally change the dev_item->type field. > > > > It is not only a "cosmetic" change. If we want to change the allocation > > policy, now the correct way is: > > - read the type field > > - change the "allocation policy" bits > > - write the type field > > > > Which is race 'prone' > > > For now it is not a problem, because type contains only the allocation bits. > > But in future when the type field will contains further properties this could > > be a problem. > > Yeah, keep the interface very narrow, don't hand out access to random bits. > > If the kernel supports additional bits, it should support additional > sysfs filenames to go with them. Or it could put all the supported > options in the sysfs field, like block IO schedulers do, so you could > find this in the file by reading it: > > [prefer_data] prefer_metadata metadata_only data_only > > > > more than patching btrfs-progs so I can use > > > > > > btrfs prop set /dev/... allocation_hint data_only > > > > > > but I admit that might be because I'm weird. > > > > I prefer the echo approach too; however it is not very ergonomics in conjunction > > to sudo.... > > For /proc/sys/* we have the 'sysctl' tool, so you can write 'sysctl > vm.drop_caches=1' or 'sudo sysctl vm.drop_caches=1'. For some reason > we don't have this for sysfs (or maybe it's just Debian...?) so we have > to write things like 'echo foo | sudo tee /sys/fs/...'. > > Of course btrfs-progs could always open the > /sys/fs/btrfs/.../allocation_policy file and write to it. But if we're > modifying btrfs-progs then we could use the ioctl interface anyway. > > I don't have a strong preference for either sysfs or ioctl, nor am I > opposed to simply implementing both. I'll let someone who does have > such a preference make their case. I think echo'ing a name into sysfs is better than bits for sure. However I want the ability to set the device properties via a btrfs-progs command offline so I can setup the storage and then mount the file system. I want 1) The sysfs interface so you can change things on the fly. This stays persistent of course, so the way it works is perfect. 2) The btrfs-progs command sets it on offline devices. If you point it at a live mounted fs it can simply use the sysfs thing to do it live. Does this seem reasonable? Thanks, Josef
On Tue, Dec 14, 2021 at 07:53:34PM +0100, Goffredo Baroncelli wrote: > On 12/13/21 22:15, Josef Bacik wrote: > > On Mon, Dec 13, 2021 at 08:54:24PM +0100, Goffredo Baroncelli wrote: > > > Gentle ping :-) > > > > > > Are there anyone of the mains developer interested in supporting this patch ? > > > > > > I am open to improve it if required. > > > > > > > Sorry I missed this go by. I like the interface, we don't have a use for > > device->type yet, so this fits nicely. > > > > I don't see the btrfs-progs patches in my inbox, and these don't apply, so > > you'll definitely need to refresh for a proper review, but looking at these > > patches they seem sane enough, and I like the interface. I'd like to hear > > Zygo's opinion as well. > > > > If we're going to use device->type for this, and since we don't have a user of > > device->type, I'd also like you to go ahead and re-name ->type to > > ->allocation_policy, that way it's clear what we're using it for now. > > The allocation policy requires only few bits (3 or 4); instead "type" is 64 bit wide. > We can allocate more bits for future extension, but I don't think that it is necessary > to allocate all the bits to the "allocation_policy". > > This to say that renaming the field as allocation_policy is too restrictive if in future > we will use the other bits for another purposes. > > I don't like the terms "type". May be flags, is it better ? Sure, rename it to flags and make the allocation policies flags. The important part is that "type" isn't accurate, and since it's not used by anything currently we're free to rename it so it better represents the information it holds. Thanks, Josef
On 12/14/21 21:34, Josef Bacik wrote: > On Tue, Dec 14, 2021 at 03:04:32PM -0500, Zygo Blaxell wrote: >> On Tue, Dec 14, 2021 at 08:03:45PM +0100, Goffredo Baroncelli wrote: >> >> I don't have a strong preference for either sysfs or ioctl, nor am I >> opposed to simply implementing both. I'll let someone who does have >> such a preference make their case. > > I think echo'ing a name into sysfs is better than bits for sure. However I want > the ability to set the device properties via a btrfs-progs command offline so I > can setup the storage and then mount the file system. I want > > 1) The sysfs interface so you can change things on the fly. This stays > persistent of course, so the way it works is perfect. > > 2) The btrfs-progs command sets it on offline devices. If you point it at a > live mounted fs it can simply use the sysfs thing to do it live. #2 is currently not implemented. However I think that we should do. The problem is that we need to update both: - the superblock (simple) - the dev_item item (not so simple...) What about using only bits from the superblock to store this property ? > > Does this seem reasonable? Thanks, > > Josef
On Tue, Dec 14, 2021 at 09:41:21PM +0100, Goffredo Baroncelli wrote: > On 12/14/21 21:34, Josef Bacik wrote: > > On Tue, Dec 14, 2021 at 03:04:32PM -0500, Zygo Blaxell wrote: > > > On Tue, Dec 14, 2021 at 08:03:45PM +0100, Goffredo Baroncelli wrote: > > > > > > > I don't have a strong preference for either sysfs or ioctl, nor am I > > > opposed to simply implementing both. I'll let someone who does have > > > such a preference make their case. > > > > I think echo'ing a name into sysfs is better than bits for sure. However I want > > the ability to set the device properties via a btrfs-progs command offline so I > > can setup the storage and then mount the file system. I want > > > > 1) The sysfs interface so you can change things on the fly. This stays > > persistent of course, so the way it works is perfect. > > > > 2) The btrfs-progs command sets it on offline devices. If you point it at a > > live mounted fs it can simply use the sysfs thing to do it live. > > #2 is currently not implemented. However I think that we should do. > > The problem is that we need to update both: > > - the superblock (simple) > - the dev_item item (not so simple...) > > What about using only bits from the superblock to store this property ? I'm looking at the patches and you only are updating the dev_item, am I missing something for the super block? For offline all you would need to do is do the normal open_ctree, btrfs_search_slot to the item and update the device item type, that's straightforward. For online if you use btrfs prop you can see if the fs is mounted and just find the sysfs file to modify and do it that way. But this also brings up another point, we're going to want a compat bit for this. It doesn't make the fs unusable for old kernels, so just a normal BTRFS_FS_COMPAT_<whatever> flag is fine. If the setting gets set you set the compat flag. You'll also need to modify the tree checker to check the dev item allocation hints to make sure they're valid, via check_dev_item. Thanks, Josef
On 12/15/21 14:58, Josef Bacik wrote: > On Tue, Dec 14, 2021 at 09:41:21PM +0100, Goffredo Baroncelli wrote: >> On 12/14/21 21:34, Josef Bacik wrote: >>> On Tue, Dec 14, 2021 at 03:04:32PM -0500, Zygo Blaxell wrote: >>>> On Tue, Dec 14, 2021 at 08:03:45PM +0100, Goffredo Baroncelli wrote: >> >>>> >>>> I don't have a strong preference for either sysfs or ioctl, nor am I >>>> opposed to simply implementing both. I'll let someone who does have >>>> such a preference make their case. >>> >>> I think echo'ing a name into sysfs is better than bits for sure. However I want >>> the ability to set the device properties via a btrfs-progs command offline so I >>> can setup the storage and then mount the file system. I want >>> >>> 1) The sysfs interface so you can change things on the fly. This stays >>> persistent of course, so the way it works is perfect. >>> >>> 2) The btrfs-progs command sets it on offline devices. If you point it at a >>> live mounted fs it can simply use the sysfs thing to do it live. >> >> #2 is currently not implemented. However I think that we should do. >> >> The problem is that we need to update both: >> >> - the superblock (simple) >> - the dev_item item (not so simple...) >> >> What about using only bits from the superblock to store this property ? > > I'm looking at the patches and you only are updating the dev_item, am I missing > something for the super block? When btrfs write the superblocks (see write_all_supers() in disk-io.c), it copies the dev_item fields (contained in fs_info->fs_devices->devices lists) in each superblock before updating it. > > For offline all you would need to do is do the normal open_ctree, > btrfs_search_slot to the item and update the device item type, that's > straightforward. > > For online if you use btrfs prop you can see if the fs is mounted and just find > the sysfs file to modify and do it that way. > > But this also brings up another point, we're going to want a compat bit for > this. It doesn't make the fs unusable for old kernels, so just a normal > BTRFS_FS_COMPAT_<whatever> flag is fine. If the setting gets set you set the > compat flag. Why we need a "compact" bit ? The new kernels know how treat the dev_item_type field. The old kernels ignore it. The worst thing is that a filesystem may require a balance before reaching a good shape (i.e. the metadata on ssd and the data on a spinning disk) > You'll also need to modify the tree checker to check the dev item allocation > hints to make sure they're valid, via check_dev_item. Thanks, Ok > > Josef
On Wed, Dec 15, 2021 at 07:53:40PM +0100, Goffredo Baroncelli wrote: > On 12/15/21 14:58, Josef Bacik wrote: > > On Tue, Dec 14, 2021 at 09:41:21PM +0100, Goffredo Baroncelli wrote: > > > On 12/14/21 21:34, Josef Bacik wrote: > > > > On Tue, Dec 14, 2021 at 03:04:32PM -0500, Zygo Blaxell wrote: > > > > > On Tue, Dec 14, 2021 at 08:03:45PM +0100, Goffredo Baroncelli wrote: > > > > > > > > > > > > > I don't have a strong preference for either sysfs or ioctl, nor am I > > > > > opposed to simply implementing both. I'll let someone who does have > > > > > such a preference make their case. > > > > > > > > I think echo'ing a name into sysfs is better than bits for sure. However I want > > > > the ability to set the device properties via a btrfs-progs command offline so I > > > > can setup the storage and then mount the file system. I want > > > > > > > > 1) The sysfs interface so you can change things on the fly. This stays > > > > persistent of course, so the way it works is perfect. > > > > > > > > 2) The btrfs-progs command sets it on offline devices. If you point it at a > > > > live mounted fs it can simply use the sysfs thing to do it live. > > > > > > #2 is currently not implemented. However I think that we should do. > > > > > > The problem is that we need to update both: > > > > > > - the superblock (simple) > > > - the dev_item item (not so simple...) > > > > > > What about using only bits from the superblock to store this property ? > > > > I'm looking at the patches and you only are updating the dev_item, am I missing > > something for the super block? > > When btrfs write the superblocks (see write_all_supers() in disk-io.c), it copies > the dev_item fields (contained in fs_info->fs_devices->devices lists) in each > superblock before updating it. > Oh right. Still, I hope we're doing this correctly in btrfs-progs, if not that's a problem. > > > > For offline all you would need to do is do the normal open_ctree, > > btrfs_search_slot to the item and update the device item type, that's > > straightforward. > > > > For online if you use btrfs prop you can see if the fs is mounted and just find > > the sysfs file to modify and do it that way. > > > > But this also brings up another point, we're going to want a compat bit for > > this. It doesn't make the fs unusable for old kernels, so just a normal > > BTRFS_FS_COMPAT_<whatever> flag is fine. If the setting gets set you set the > > compat flag. > > Why we need a "compact" bit ? The new kernels know how treat the dev_item_type field. > The old kernels ignore it. The worst thing is that a filesystem may require a balance > before reaching a good shape (i.e. the metadata on ssd and the data on a spinning disk) > So you can do the validation below, tho I'm thinking I care about it less, if we just make sure that type is correct regardless of the compat bit then that's fine. Thanks, Josef
> -----Original Message----- > From: Josef Bacik <josef@toxicpanda.com> > Sent: Thursday, 16 December 2021 12:58 AM > To: kreijack@inwind.it > Cc: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>; David Sterba > <dsterba@suse.cz>; Sinnamohideen Shafeeq <shafeeqs@panasas.com>; > Paul Jones <paul@pauljones.id.au>; linux-btrfs@vger.kernel.org > Subject: Re: [RFC][V8][PATCH 0/5] btrfs: allocation_hint mode > > On Tue, Dec 14, 2021 at 09:41:21PM +0100, Goffredo Baroncelli wrote: > > On 12/14/21 21:34, Josef Bacik wrote: > > > On Tue, Dec 14, 2021 at 03:04:32PM -0500, Zygo Blaxell wrote: > > > > On Tue, Dec 14, 2021 at 08:03:45PM +0100, Goffredo Baroncelli wrote: > > > > > > > > > > I don't have a strong preference for either sysfs or ioctl, nor am > > > > I opposed to simply implementing both. I'll let someone who does > > > > have such a preference make their case. > > > > > > I think echo'ing a name into sysfs is better than bits for sure. > > > However I want the ability to set the device properties via a > > > btrfs-progs command offline so I can setup the storage and then > > > mount the file system. I want > > > > > > 1) The sysfs interface so you can change things on the fly. This stays > > > persistent of course, so the way it works is perfect. > > > > > > 2) The btrfs-progs command sets it on offline devices. If you point it at a > > > live mounted fs it can simply use the sysfs thing to do it live. > > > > #2 is currently not implemented. However I think that we should do. > > > > The problem is that we need to update both: > > > > - the superblock (simple) > > - the dev_item item (not so simple...) > > > > What about using only bits from the superblock to store this property ? > > I'm looking at the patches and you only are updating the dev_item, am I > missing something for the super block? > > For offline all you would need to do is do the normal open_ctree, > btrfs_search_slot to the item and update the device item type, that's > straightforward. > > For online if you use btrfs prop you can see if the fs is mounted and just find > the sysfs file to modify and do it that way. > > But this also brings up another point, we're going to want a compat bit for > this. It doesn't make the fs unusable for old kernels, so just a normal > BTRFS_FS_COMPAT_<whatever> flag is fine. If the setting gets set you set > the compat flag. I don't think that is necessary - I've remounted my filesystems a few times with an unpatched kernel, and all that happens is new blocks get allocated without any hints - I like that simplicity and compatibility. As long as it's mentioned in the docs I think it's fine. Having a balance filter that only rebalances wrongly allocated blocks would be a nice addition though. Paul.
On Wed, Dec 15, 2021 at 07:56:32PM -0500, Josef Bacik wrote: > On Wed, Dec 15, 2021 at 07:53:40PM +0100, Goffredo Baroncelli wrote: > > On 12/15/21 14:58, Josef Bacik wrote: > > > On Tue, Dec 14, 2021 at 09:41:21PM +0100, Goffredo Baroncelli wrote: > > > > On 12/14/21 21:34, Josef Bacik wrote: > > > > > On Tue, Dec 14, 2021 at 03:04:32PM -0500, Zygo Blaxell wrote: > > > > > > On Tue, Dec 14, 2021 at 08:03:45PM +0100, Goffredo Baroncelli wrote: > > > > > > > > > > > > > > > > I don't have a strong preference for either sysfs or ioctl, nor am I > > > > > > opposed to simply implementing both. I'll let someone who does have > > > > > > such a preference make their case. > > > > > > > > > > I think echo'ing a name into sysfs is better than bits for sure. However I want > > > > > the ability to set the device properties via a btrfs-progs command offline so I > > > > > can setup the storage and then mount the file system. I want > > > > > > > > > > 1) The sysfs interface so you can change things on the fly. This stays > > > > > persistent of course, so the way it works is perfect. > > > > > > > > > > 2) The btrfs-progs command sets it on offline devices. If you point it at a > > > > > live mounted fs it can simply use the sysfs thing to do it live. > > > > > > > > #2 is currently not implemented. However I think that we should do. > > > > > > > > The problem is that we need to update both: > > > > > > > > - the superblock (simple) > > > > - the dev_item item (not so simple...) > > > > > > > > What about using only bits from the superblock to store this property ? > > > > > > I'm looking at the patches and you only are updating the dev_item, am I missing > > > something for the super block? > > > > When btrfs write the superblocks (see write_all_supers() in disk-io.c), it copies > > the dev_item fields (contained in fs_info->fs_devices->devices lists) in each > > superblock before updating it. > > > > Oh right. Still, I hope we're doing this correctly in btrfs-progs, if not > that's a problem. > > > > > > > For offline all you would need to do is do the normal open_ctree, > > > btrfs_search_slot to the item and update the device item type, that's > > > straightforward. > > > > > > For online if you use btrfs prop you can see if the fs is mounted and just find > > > the sysfs file to modify and do it that way. > > > > > > But this also brings up another point, we're going to want a compat bit for > > > this. It doesn't make the fs unusable for old kernels, so just a normal > > > BTRFS_FS_COMPAT_<whatever> flag is fine. If the setting gets set you set the > > > compat flag. > > > > Why we need a "compact" bit ? The new kernels know how treat the dev_item_type field. > > The old kernels ignore it. The worst thing is that a filesystem may require a balance > > before reaching a good shape (i.e. the metadata on ssd and the data on a spinning disk) > > So you can do the validation below, tho I'm thinking I care about it less, if we > just make sure that type is correct regardless of the compat bit then that's > fine. Thanks, In theory if you get stuck in an impossible allocation situation (like all your disks are data-only and you run out of metadata space) then one way to recover from it is to mount with an old kernel which doesn't respect the type bits. Another way to recover would be to flip the type bits while the filesystem is offline with btrfs-progs. A third way would be to have a mount option for newer kernels to ignore the allocation bits like old kernels do (yes I know I already said I didn't like that idea). If we have a bit that says "old kernels don't mount this filesystem any more" then we lose one of those recovery options, and the other options aren't implemented yet. While I think of it, the metadata reservation system eventually needs to know that it can't use data-only devices for metadata, the same way that df eventually needs to know about metadata-only devices. > Josef >
On Fri, Dec 17, 2021 at 12:40:55AM -0500, Zygo Blaxell wrote: > On Wed, Dec 15, 2021 at 07:56:32PM -0500, Josef Bacik wrote: > > On Wed, Dec 15, 2021 at 07:53:40PM +0100, Goffredo Baroncelli wrote: > > > On 12/15/21 14:58, Josef Bacik wrote: > > > > On Tue, Dec 14, 2021 at 09:41:21PM +0100, Goffredo Baroncelli wrote: > > > > > On 12/14/21 21:34, Josef Bacik wrote: > > > > > > On Tue, Dec 14, 2021 at 03:04:32PM -0500, Zygo Blaxell wrote: > > > > > > > On Tue, Dec 14, 2021 at 08:03:45PM +0100, Goffredo Baroncelli wrote: > > > > > > > > > > > > > > > > > > > I don't have a strong preference for either sysfs or ioctl, nor am I > > > > > > > opposed to simply implementing both. I'll let someone who does have > > > > > > > such a preference make their case. > > > > > > > > > > > > I think echo'ing a name into sysfs is better than bits for sure. However I want > > > > > > the ability to set the device properties via a btrfs-progs command offline so I > > > > > > can setup the storage and then mount the file system. I want > > > > > > > > > > > > 1) The sysfs interface so you can change things on the fly. This stays > > > > > > persistent of course, so the way it works is perfect. > > > > > > > > > > > > 2) The btrfs-progs command sets it on offline devices. If you point it at a > > > > > > live mounted fs it can simply use the sysfs thing to do it live. > > > > > > > > > > #2 is currently not implemented. However I think that we should do. > > > > > > > > > > The problem is that we need to update both: > > > > > > > > > > - the superblock (simple) > > > > > - the dev_item item (not so simple...) > > > > > > > > > > What about using only bits from the superblock to store this property ? > > > > > > > > I'm looking at the patches and you only are updating the dev_item, am I missing > > > > something for the super block? > > > > > > When btrfs write the superblocks (see write_all_supers() in disk-io.c), it copies > > > the dev_item fields (contained in fs_info->fs_devices->devices lists) in each > > > superblock before updating it. > > > > > > > Oh right. Still, I hope we're doing this correctly in btrfs-progs, if not > > that's a problem. > > > > > > > > > > For offline all you would need to do is do the normal open_ctree, > > > > btrfs_search_slot to the item and update the device item type, that's > > > > straightforward. > > > > > > > > For online if you use btrfs prop you can see if the fs is mounted and just find > > > > the sysfs file to modify and do it that way. > > > > > > > > But this also brings up another point, we're going to want a compat bit for > > > > this. It doesn't make the fs unusable for old kernels, so just a normal > > > > BTRFS_FS_COMPAT_<whatever> flag is fine. If the setting gets set you set the > > > > compat flag. > > > > > > Why we need a "compact" bit ? The new kernels know how treat the dev_item_type field. > > > The old kernels ignore it. The worst thing is that a filesystem may require a balance > > > before reaching a good shape (i.e. the metadata on ssd and the data on a spinning disk) > > > > So you can do the validation below, tho I'm thinking I care about it less, if we > > just make sure that type is correct regardless of the compat bit then that's > > fine. Thanks, > > In theory if you get stuck in an impossible allocation situation (like all > your disks are data-only and you run out of metadata space) then one way > to recover from it is to mount with an old kernel which doesn't respect > the type bits. Another way to recover would be to flip the type bits > while the filesystem is offline with btrfs-progs. A third way would be to > have a mount option for newer kernels to ignore the allocation bits like > old kernels do (yes I know I already said I didn't like that idea). > This is the "preferred" vs "only" category. You set "preferred" if you want to be able to handle failures cleanly, you set "only" if you'd rather reprovision a box than lose performance. > If we have a bit that says "old kernels don't mount this filesystem any > more" then we lose one of those recovery options, and the other options > aren't implemented yet. > > While I think of it, the metadata reservation system eventually needs > to know that it can't use data-only devices for metadata, the same way > that df eventually needs to know about metadata-only devices. > Yeah, I'm willing to throw that into the power user bucket. It really only affects overcommit, because once we've allocated all our chunks we'll stop overcommitting. There may be some corners that need to be addressed, but we can burn those bridges when we get to them. Thanks, Josef
On Fri, Dec 17, 2021 at 09:48:40AM -0500, Josef Bacik wrote: > On Fri, Dec 17, 2021 at 12:40:55AM -0500, Zygo Blaxell wrote: > > On Wed, Dec 15, 2021 at 07:56:32PM -0500, Josef Bacik wrote: > > > On Wed, Dec 15, 2021 at 07:53:40PM +0100, Goffredo Baroncelli wrote: > > > > On 12/15/21 14:58, Josef Bacik wrote: > > > > > On Tue, Dec 14, 2021 at 09:41:21PM +0100, Goffredo Baroncelli wrote: > > > > > > On 12/14/21 21:34, Josef Bacik wrote: > > > > > > > On Tue, Dec 14, 2021 at 03:04:32PM -0500, Zygo Blaxell wrote: > > > > > > > > On Tue, Dec 14, 2021 at 08:03:45PM +0100, Goffredo Baroncelli wrote: > > > > > > > > > > > > > > > > > > > > > > I don't have a strong preference for either sysfs or ioctl, nor am I > > > > > > > > opposed to simply implementing both. I'll let someone who does have > > > > > > > > such a preference make their case. > > > > > > > > > > > > > > I think echo'ing a name into sysfs is better than bits for sure. However I want > > > > > > > the ability to set the device properties via a btrfs-progs command offline so I > > > > > > > can setup the storage and then mount the file system. I want > > > > > > > > > > > > > > 1) The sysfs interface so you can change things on the fly. This stays > > > > > > > persistent of course, so the way it works is perfect. > > > > > > > > > > > > > > 2) The btrfs-progs command sets it on offline devices. If you point it at a > > > > > > > live mounted fs it can simply use the sysfs thing to do it live. > > > > > > > > > > > > #2 is currently not implemented. However I think that we should do. > > > > > > > > > > > > The problem is that we need to update both: > > > > > > > > > > > > - the superblock (simple) > > > > > > - the dev_item item (not so simple...) > > > > > > > > > > > > What about using only bits from the superblock to store this property ? > > > > > > > > > > I'm looking at the patches and you only are updating the dev_item, am I missing > > > > > something for the super block? > > > > > > > > When btrfs write the superblocks (see write_all_supers() in disk-io.c), it copies > > > > the dev_item fields (contained in fs_info->fs_devices->devices lists) in each > > > > superblock before updating it. > > > > > > > > > > Oh right. Still, I hope we're doing this correctly in btrfs-progs, if not > > > that's a problem. > > > > > > > > > > > > > For offline all you would need to do is do the normal open_ctree, > > > > > btrfs_search_slot to the item and update the device item type, that's > > > > > straightforward. > > > > > > > > > > For online if you use btrfs prop you can see if the fs is mounted and just find > > > > > the sysfs file to modify and do it that way. > > > > > > > > > > But this also brings up another point, we're going to want a compat bit for > > > > > this. It doesn't make the fs unusable for old kernels, so just a normal > > > > > BTRFS_FS_COMPAT_<whatever> flag is fine. If the setting gets set you set the > > > > > compat flag. > > > > > > > > Why we need a "compact" bit ? The new kernels know how treat the dev_item_type field. > > > > The old kernels ignore it. The worst thing is that a filesystem may require a balance > > > > before reaching a good shape (i.e. the metadata on ssd and the data on a spinning disk) > > > > > > So you can do the validation below, tho I'm thinking I care about it less, if we > > > just make sure that type is correct regardless of the compat bit then that's > > > fine. Thanks, > > > > In theory if you get stuck in an impossible allocation situation (like all > > your disks are data-only and you run out of metadata space) then one way > > to recover from it is to mount with an old kernel which doesn't respect > > the type bits. Another way to recover would be to flip the type bits > > while the filesystem is offline with btrfs-progs. A third way would be to > > have a mount option for newer kernels to ignore the allocation bits like > > old kernels do (yes I know I already said I didn't like that idea). > > > > This is the "preferred" vs "only" category. You set "preferred" if you want to > be able to handle failures cleanly, you set "only" if you'd rather reprovision a > box than lose performance. Yeah, you're right here...and I was right the first time. ;) In case I forget again, I'll write down why... It's a corner of a corner of a corner case. To get there, we have to 1) set up the unsatisfiable allocation preferences, 2) run out of existing metadata allocations so we're one big commit away from metadata ENOSPC, and 3) in a small commit, write something to the filesystem that forces big metadata allocations during the next mount (like a deleted but uncleaned snapshot, or an orphan inode). It's not a new case: steps 2 and 3 are sufficient to get into this corner, and allocation preferences in step 1 only get us to step 2 faster. At any time before step 3, we can change the allocation preferences and avoid the whole thing. mkfs changes this slightly, since it's possible in mkfs to set up condition 1 without having any metadata on the filesystem; however, the filesystem would be unmountable, and it would require a separate bug in mkfs to create the filesystem at all. So the filesystem blows up, but it can only happen on an empty filesystem. I can live with that. It would be nice to have a way to avoid step 3 in general (like disable orphan inode cleanup and snapshot delete on mount, so we can arrange for some more metadata storage before those run) but that's completely orthogonal to the allocation preferences patch, solving a problem that existed before and after. > > If we have a bit that says "old kernels don't mount this filesystem any > > more" then we lose one of those recovery options, and the other options > > aren't implemented yet. > > > > While I think of it, the metadata reservation system eventually needs > > to know that it can't use data-only devices for metadata, the same way > > that df eventually needs to know about metadata-only devices. > > Yeah, I'm willing to throw that into the power user bucket. It really only > affects overcommit, because once we've allocated all our chunks we'll stop > overcommitting. There may be some corners that need to be addressed, but we can > burn those bridges when we get to them. Thanks, > Josef >
On 12/17/21 06:40, Zygo Blaxell wrote: > > In theory if you get stuck in an impossible allocation situation (like all > your disks are data-only and you run out of metadata space) then one way > to recover from it is to mount with an old kernel which doesn't respect > the type bits. > Another way to recover would be to flip the type bits > while the filesystem is offline with btrfs-progs. Unfortunately this would not doable; type is backup-ped in the dev_item which are stored in the metadata. So despite the fact that the filesystem is mounted or not, changing "type" would require updating metadata which in turn may require to allocate a new metadata BG. > A third way would be to > have a mount option for newer kernels to ignore the allocation bits like > old kernels do (yes I know I already said I didn't like that idea). The patch is still available. However I have to point out that this is not a new problem. Will be always some cases where the metadata space is exhausted and is not possible to make the needing changes to create space. If the user uses the _ONLY flags, it is shooting in its feet. We should warn in the manual to avoid that. But even if the user do that we still have the problem. Frankly speaking I fatigue to see an use case where _ONLY flags are needed. IMHO _PREFERRED are enough; what it is needed is to improve the reporting to the user about these cases: something like that for each btrfs-prog invocation a warning is raised when the metadata space is near to exhaustion... > > If we have a bit that says "old kernels don't mount this filesystem any > more" then we lose one of those recovery options, and the other options > aren't implemented yet. Using an old kernel is not a solution. Sooner another new non-backward compatible change will prevent that. > > While I think of it, the metadata reservation system eventually needs > to know that it can't use data-only devices for metadata, the same way > that df eventually needs to know about metadata-only devices. > > >> Josef >>
From: Goffredo Baroncelli <kreijack@inwind.it> Hi all, This patches set was born after some discussion between me, Zygo and Josef. Some details can be found in https://github.com/btrfs/btrfs-todo/issues/19. Some further information about a real use case can be found in https://lore.kernel.org/linux-btrfs/20210116002533.GE31381@hungrycats.org/ Reently Shafeeq told me that he is interested too, due to the performance gain. In this revision I switched away from an ioctl API in favor of a sysfs API ( see patch #2 and #3). The idea behind this patches set, is to dedicate some disks (the fastest one) to the metadata chunk. My initial idea was a "soft" hint. However Zygo asked an option for a "strong" hint (== mandatory). The result is that each disk can be "tagged" by one of the following flags: - BTRFS_DEV_ALLOCATION_METADATA_ONLY - BTRFS_DEV_ALLOCATION_PREFERRED_METADATA - BTRFS_DEV_ALLOCATION_PREFERRED_DATA - BTRFS_DEV_ALLOCATION_DATA_ONLY When the chunk allocator search a disks to allocate a chunk, scans the disks in an order decided by these tags. For metadata, the order is: *_METADATA_ONLY *_PREFERRED_METADATA *_PREFERRED_DATA The *_DATA_ONLY are not eligible from metadata chunk allocation. For the data chunk, the order is reversed, and the *_METADATA_ONLY are excluded. The exact sort logic is to sort first for the "tag", and then for the space available. If there is no space available, the next "tag" disks set are selected. To set these tags, a new property called "allocation_hint" was created. There is a dedicated btrfs-prog patches set [[PATCH V5] btrfs-progs: allocation_hint disk property]. $ sudo mount /dev/loop0 /mnt/test-btrfs/ $ for i in /dev/loop[0-9]; do sudo ./btrfs prop get $i allocation_hint; done devid=1, path=/dev/loop0: allocation_hint=PREFERRED_METADATA devid=2, path=/dev/loop1: allocation_hint=PREFERRED_METADATA devid=3, path=/dev/loop2: allocation_hint=PREFERRED_DATA devid=4, path=/dev/loop3: allocation_hint=PREFERRED_DATA devid=5, path=/dev/loop4: allocation_hint=PREFERRED_DATA devid=6, path=/dev/loop5: allocation_hint=DATA_ONLY devid=7, path=/dev/loop6: allocation_hint=METADATA_ONLY devid=8, path=/dev/loop7: allocation_hint=METADATA_ONLY $ sudo ./btrfs fi us /mnt/test-btrfs/ Overall: Device size: 2.75GiB Device allocated: 1.34GiB Device unallocated: 1.41GiB Device missing: 0.00B Used: 400.89MiB Free (estimated): 1.04GiB (min: 1.04GiB) Data ratio: 2.00 Metadata ratio: 1.00 Global reserve: 3.25MiB (used: 0.00B) Multiple profiles: no Data,RAID1: Size:542.00MiB, Used:200.25MiB (36.95%) /dev/loop0 288.00MiB /dev/loop1 288.00MiB /dev/loop2 127.00MiB /dev/loop3 127.00MiB /dev/loop4 127.00MiB /dev/loop5 127.00MiB Metadata,single: Size:256.00MiB, Used:384.00KiB (0.15%) /dev/loop1 256.00MiB System,single: Size:32.00MiB, Used:16.00KiB (0.05%) /dev/loop0 32.00MiB Unallocated: /dev/loop0 704.00MiB /dev/loop1 480.00MiB /dev/loop2 1.00MiB /dev/loop3 1.00MiB /dev/loop4 1.00MiB /dev/loop5 1.00MiB /dev/loop6 128.00MiB /dev/loop7 128.00MiB # change the tag of some disks $ sudo ./btrfs prop set /dev/loop0 allocation_hint DATA_ONLY $ sudo ./btrfs prop set /dev/loop1 allocation_hint DATA_ONLY $ sudo ./btrfs prop set /dev/loop5 allocation_hint METADATA_ONLY $ for i in /dev/loop[0-9]; do sudo ./btrfs prop get $i allocation_hint; done devid=1, path=/dev/loop0: allocation_hint=DATA_ONLY devid=2, path=/dev/loop1: allocation_hint=DATA_ONLY devid=3, path=/dev/loop2: allocation_hint=PREFERRED_DATA devid=4, path=/dev/loop3: allocation_hint=PREFERRED_DATA devid=5, path=/dev/loop4: allocation_hint=PREFERRED_DATA devid=6, path=/dev/loop5: allocation_hint=METADATA_ONLY devid=7, path=/dev/loop6: allocation_hint=METADATA_ONLY devid=8, path=/dev/loop7: allocation_hint=METADATA_ONLY $ sudo btrfs bal start --full-balance /mnt/test-btrfs/ $ sudo ./btrfs fi us /mnt/test-btrfs/ Overall: Device size: 2.75GiB Device allocated: 735.00MiB Device unallocated: 2.03GiB Device missing: 0.00B Used: 400.72MiB Free (estimated): 1.10GiB (min: 1.10GiB) Data ratio: 2.00 Metadata ratio: 1.00 Global reserve: 3.25MiB (used: 0.00B) Multiple profiles: no Data,RAID1: Size:288.00MiB, Used:200.19MiB (69.51%) /dev/loop0 288.00MiB /dev/loop1 288.00MiB Metadata,single: Size:127.00MiB, Used:336.00KiB (0.26%) /dev/loop5 127.00MiB System,single: Size:32.00MiB, Used:16.00KiB (0.05%) /dev/loop7 32.00MiB Unallocated: /dev/loop0 736.00MiB /dev/loop1 736.00MiB /dev/loop2 128.00MiB /dev/loop3 128.00MiB /dev/loop4 128.00MiB /dev/loop5 1.00MiB /dev/loop6 128.00MiB /dev/loop7 96.00MiB #As you can see all the metadata were placed on the disk loop5/loop7 even if #the most empty one are loop0 and loop1. TODO: - more tests - the tool which show the space available should consider the tagging (eg the disks tagged by _METADATA_ONLY should be excluded from the data availability) Comments are welcome BR G.Baroncelli