Message ID: 36be817396956bffe981a69ea0b8796c44153fa5.1418203063.git.yangds.fnst@cn.fujitsu.com (mailing list archive)
State: New, archived
On 12/11/2014 09:31 AM, Dongsheng Yang wrote: > When function btrfs_statfs() calculate the tatol size of fs, it is calculating > the total size of disks and then dividing it by a factor. But in some usecase, > the result is not good to user. I am checking it; to me it seems a good improvement. However I noticed that df now doesn't seem to report anymore the space consumed by the meta-data chunk; eg: # I have two disks of 5GB each $ sudo ~/mkfs.btrfs -f -m raid1 -d raid1 /dev/vgtest/disk /dev/vgtest/disk1 $ df -h /mnt/btrfs1/ Filesystem Size Used Avail Use% Mounted on /dev/mapper/vgtest-disk 5.0G 1.1G 4.0G 21% /mnt/btrfs1 $ sudo btrfs fi show Label: none uuid: 884414c6-9374-40af-a5be-3949cdf6ad0b Total devices 2 FS bytes used 640.00KB devid 2 size 5.00GB used 2.01GB path /dev/dm-1 devid 1 size 5.00GB used 2.03GB path /dev/dm-0 $ sudo ./btrfs fi df /mnt/btrfs1/ Data, RAID1: total=1.00GiB, used=512.00KiB Data, single: total=8.00MiB, used=0.00B System, RAID1: total=8.00MiB, used=16.00KiB System, single: total=4.00MiB, used=0.00B Metadata, RAID1: total=1.00GiB, used=112.00KiB Metadata, single: total=8.00MiB, used=0.00B GlobalReserve, single: total=16.00MiB, used=0.00B In this case the filesystem is empty (it was a new filesystem !). However a 1G metadata chunk was already allocated. This is the reasons why the free space is only 4Gb. On my system the ratio metadata/data is 234MB/8.82GB = ~3%, so ignoring the metadata chunk from the free space may not be a big problem. > Example: > # mkfs.btrfs -f /dev/vdf1 /dev/vdf2 -d raid1 > # mount /dev/vdf1 /mnt > # dd if=/dev/zero of=/mnt/zero bs=1M count=1000 > # df -h /mnt > Filesystem Size Used Avail Use% Mounted on > /dev/vdf1 3.0G 1018M 1.3G 45% /mnt > # btrfs fi show /dev/vdf1 > Label: none uuid: f85d93dc-81f4-445d-91e5-6a5cd9563294 > Total devices 2 FS bytes used 1001.53MiB > devid 1 size 2.00GiB used 1.85GiB path /dev/vdf1 > devid 2 size 4.00GiB used 1.83GiB path /dev/vdf2 > a. df -h should report Size as 2GiB rather than as 3GiB. > Because this is 2 device raid1, the limiting factor is devid 1 @2GiB. > > b. df -h should report Avail as 0.97GiB or less, rather than as 1.3GiB. > 1.85 (the capacity of the allocated chunk) > -1.018 (the file stored) > +(2-1.85=0.15) (the residual capacity of the disks > considering a raid1 fs) > --------------- > = 0.97 > > This patch drops the factor at all and calculate the size observable to > user without considering which raid level the data is in and what's the > size exactly in disk. > After this patch applied: > # mkfs.btrfs -f /dev/vdf1 /dev/vdf2 -d raid1 > # mount /dev/vdf1 /mnt > # dd if=/dev/zero of=/mnt/zero bs=1M count=1000 > # df -h /mnt > Filesystem Size Used Avail Use% Mounted on > /dev/vdf1 2.0G 1.3G 713M 66% /mnt > # df /mnt > Filesystem 1K-blocks Used Available Use% Mounted on > /dev/vdf1 2097152 1359424 729536 66% /mnt > # btrfs fi show /dev/vdf1 > Label: none uuid: e98c1321-645f-4457-b20d-4f41dc1cf2f4 > Total devices 2 FS bytes used 1001.55MiB > devid 1 size 2.00GiB used 1.85GiB path /dev/vdf1 > devid 2 size 4.00GiB used 1.83GiB path /dev/vdf2 > a). The @Size is 2G as we expected. > b). @Available is 700M = 1.85G - 1.3G + (2G - 1.85G). > c). @Used is changed to 1.3G rather than 1018M as above. Because > this patch do not treat the free space in metadata chunk > and system chunk as available to user. It's true, user can > not use these space to store data, then it should not be > thought as available. At the same time, it will make the > @Used + @Available == @Size as possible to user. 
> > Signed-off-by: Dongsheng Yang <yangds.fnst@cn.fujitsu.com> > --- > Changelog: > v1 -> v2: > a. account the total_bytes in medadata chunk and > system chunk as used to user. > b. set data_stripes to the correct value in RAID0. > > fs/btrfs/extent-tree.c | 13 ++---------- > fs/btrfs/super.c | 56 ++++++++++++++++++++++---------------------------- > 2 files changed, 26 insertions(+), 43 deletions(-) > > diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c > index a84e00d..9954d60 100644 > --- a/fs/btrfs/extent-tree.c > +++ b/fs/btrfs/extent-tree.c > @@ -8571,7 +8571,6 @@ static u64 __btrfs_get_ro_block_group_free_space(struct list_head *groups_list) > { > struct btrfs_block_group_cache *block_group; > u64 free_bytes = 0; > - int factor; > > list_for_each_entry(block_group, groups_list, list) { > spin_lock(&block_group->lock); > @@ -8581,16 +8580,8 @@ static u64 __btrfs_get_ro_block_group_free_space(struct list_head *groups_list) > continue; > } > > - if (block_group->flags & (BTRFS_BLOCK_GROUP_RAID1 | > - BTRFS_BLOCK_GROUP_RAID10 | > - BTRFS_BLOCK_GROUP_DUP)) > - factor = 2; > - else > - factor = 1; > - > - free_bytes += (block_group->key.offset - > - btrfs_block_group_used(&block_group->item)) * > - factor; > + free_bytes += block_group->key.offset - > + btrfs_block_group_used(&block_group->item); > > spin_unlock(&block_group->lock); > } > diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c > index 54bd91e..40f41a2 100644 > --- a/fs/btrfs/super.c > +++ b/fs/btrfs/super.c > @@ -1641,6 +1641,8 @@ static int btrfs_calc_avail_data_space(struct btrfs_root *root, u64 *free_bytes) > u64 used_space; > u64 min_stripe_size; > int min_stripes = 1, num_stripes = 1; > + /* How many stripes used to store data, without considering mirrors. */ > + int data_stripes = 1; > int i = 0, nr_devices; > int ret; > > @@ -1657,12 +1659,15 @@ static int btrfs_calc_avail_data_space(struct btrfs_root *root, u64 *free_bytes) > if (type & BTRFS_BLOCK_GROUP_RAID0) { > min_stripes = 2; > num_stripes = nr_devices; > + data_stripes = num_stripes; > } else if (type & BTRFS_BLOCK_GROUP_RAID1) { > min_stripes = 2; > num_stripes = 2; > + data_stripes = 1; > } else if (type & BTRFS_BLOCK_GROUP_RAID10) { > min_stripes = 4; > num_stripes = 4; > + data_stripes = 2; > } > > if (type & BTRFS_BLOCK_GROUP_DUP) > @@ -1733,14 +1738,17 @@ static int btrfs_calc_avail_data_space(struct btrfs_root *root, u64 *free_bytes) > i = nr_devices - 1; > avail_space = 0; > while (nr_devices >= min_stripes) { > - if (num_stripes > nr_devices) > + if (num_stripes > nr_devices) { > num_stripes = nr_devices; > + if (type & BTRFS_BLOCK_GROUP_RAID0) > + data_stripes = num_stripes; > + } > > if (devices_info[i].max_avail >= min_stripe_size) { > int j; > u64 alloc_size; > > - avail_space += devices_info[i].max_avail * num_stripes; > + avail_space += devices_info[i].max_avail * data_stripes; > alloc_size = devices_info[i].max_avail; > for (j = i + 1 - num_stripes; j <= i; j++) > devices_info[j].max_avail -= alloc_size; > @@ -1772,15 +1780,13 @@ static int btrfs_calc_avail_data_space(struct btrfs_root *root, u64 *free_bytes) > static int btrfs_statfs(struct dentry *dentry, struct kstatfs *buf) > { > struct btrfs_fs_info *fs_info = btrfs_sb(dentry->d_sb); > - struct btrfs_super_block *disk_super = fs_info->super_copy; > struct list_head *head = &fs_info->space_info; > struct btrfs_space_info *found; > u64 total_used = 0; > + u64 total_alloc = 0; > u64 total_free_data = 0; > int bits = dentry->d_sb->s_blocksize_bits; > __be32 *fsid = (__be32 
*)fs_info->fsid; > - unsigned factor = 1; > - struct btrfs_block_rsv *block_rsv = &fs_info->global_block_rsv; > int ret; > > /* > @@ -1792,38 +1798,20 @@ static int btrfs_statfs(struct dentry *dentry, struct kstatfs *buf) > rcu_read_lock(); > list_for_each_entry_rcu(found, head, list) { > if (found->flags & BTRFS_BLOCK_GROUP_DATA) { > - int i; > - > - total_free_data += found->disk_total - found->disk_used; > + total_free_data += found->total_bytes - found->bytes_used; > total_free_data -= > btrfs_account_ro_block_groups_free_space(found); > - > - for (i = 0; i < BTRFS_NR_RAID_TYPES; i++) { > - if (!list_empty(&found->block_groups[i])) { > - switch (i) { > - case BTRFS_RAID_DUP: > - case BTRFS_RAID_RAID1: > - case BTRFS_RAID_RAID10: > - factor = 2; > - } > - } > - } > + total_used += found->bytes_used; > + } else { > + /* For metadata and system, we treat the total_bytes as > + * not available to file data. So show it as Used in df. > + **/ > + total_used += found->total_bytes; > } > - > - total_used += found->disk_used; > + total_alloc += found->total_bytes; > } > - > rcu_read_unlock(); > > - buf->f_blocks = div_u64(btrfs_super_total_bytes(disk_super), factor); > - buf->f_blocks >>= bits; > - buf->f_bfree = buf->f_blocks - (div_u64(total_used, factor) >> bits); > - > - /* Account global block reserve as used, it's in logical size already */ > - spin_lock(&block_rsv->lock); > - buf->f_bfree -= block_rsv->size >> bits; > - spin_unlock(&block_rsv->lock); > - > buf->f_bavail = total_free_data; > ret = btrfs_calc_avail_data_space(fs_info->tree_root, &total_free_data); > if (ret) { > @@ -1831,8 +1819,12 @@ static int btrfs_statfs(struct dentry *dentry, struct kstatfs *buf) > mutex_unlock(&fs_info->fs_devices->device_list_mutex); > return ret; > } > - buf->f_bavail += div_u64(total_free_data, factor); > + buf->f_bavail += total_free_data; > buf->f_bavail = buf->f_bavail >> bits; > + buf->f_blocks = total_alloc + total_free_data; > + buf->f_blocks >>= bits; > + buf->f_bfree = buf->f_blocks - (total_used >> bits); > + > mutex_unlock(&fs_info->chunk_mutex); > mutex_unlock(&fs_info->fs_devices->device_list_mutex); > >
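A note on the data_stripes counter the quoted hunk introduces: it counts only the stripes that hold distinct file data, since mirror copies add redundancy rather than capacity. Below is a minimal user-space sketch of that idea (illustration only, not kernel code; the profile values are taken from the quoted hunk and the helper name is made up):

#include <stdio.h>
#include <stdint.h>

/*
 * Illustration only: how much file data fits when one allocation consumes
 * `per_dev` GiB on each participating device.  data_stripes counts the
 * stripes that hold distinct data; mirror copies do not add capacity.
 */
static uint64_t data_capacity(uint64_t per_dev, int data_stripes)
{
    return per_dev * data_stripes;
}

int main(void)
{
    struct { const char *profile; int num_stripes; int data_stripes; } p[] = {
        { "RAID0 (2 devices)",  2, 2 },  /* striping only            */
        { "RAID1",              2, 1 },  /* one mirror copy          */
        { "RAID10 (4 devices)", 4, 2 },  /* striped pairs of mirrors */
    };

    for (unsigned i = 0; i < sizeof(p) / sizeof(p[0]); i++)
        printf("%-18s uses %d devices, stores %llu GiB of data per 1 GiB of free space per device\n",
               p[i].profile, p[i].num_stripes,
               (unsigned long long)data_capacity(1, p[i].data_stripes));
    return 0;
}

As far as the quoted diff shows, the old code multiplied per-device free space by num_stripes and relied on a later division by the RAID factor; the patch folds that into data_stripes so no factor is needed afterwards.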
On 12/11/2014 09:31 AM, Dongsheng Yang wrote: > When function btrfs_statfs() calculate the tatol size of fs, it is calculating > the total size of disks and then dividing it by a factor. But in some usecase, > the result is not good to user. I Yang; during my test I discovered an error: $ sudo lvcreate -L +10G -n disk vgtest $ sudo /home/ghigo/mkfs.btrfs -f /dev/vgtest/disk Btrfs v3.17 See http://btrfs.wiki.kernel.org for more information. Turning ON incompat feature 'extref': increased hardlink limit per file to 65536 fs created label (null) on /dev/vgtest/disk nodesize 16384 leafsize 16384 sectorsize 4096 size 10.00GiB $ sudo mount /dev/vgtest/disk /mnt/btrfs1/ $ df /mnt/btrfs1/ Filesystem 1K-blocks Used Available Use% Mounted on /dev/mapper/vgtest-disk 9428992 1069312 8359680 12% /mnt/btrfs1 $ sudo ~/btrfs fi df /mnt/btrfs1/ Data, single: total=8.00MiB, used=256.00KiB System, DUP: total=8.00MiB, used=16.00KiB System, single: total=4.00MiB, used=0.00B Metadata, DUP: total=1.00GiB, used=112.00KiB Metadata, single: total=8.00MiB, used=0.00B GlobalReserve, single: total=16.00MiB, used=0.00B What seems me strange is the 9428992KiB of total disk space as reported by df. About 600MiB are missing ! Without your patch, I got: $ df /mnt/btrfs1/ Filesystem 1K-blocks Used Available Use% Mounted on /dev/mapper/vgtest-disk 10485760 16896 8359680 1% /mnt/btrfs1 > Example: > # mkfs.btrfs -f /dev/vdf1 /dev/vdf2 -d raid1 > # mount /dev/vdf1 /mnt > # dd if=/dev/zero of=/mnt/zero bs=1M count=1000 > # df -h /mnt > Filesystem Size Used Avail Use% Mounted on > /dev/vdf1 3.0G 1018M 1.3G 45% /mnt > # btrfs fi show /dev/vdf1 > Label: none uuid: f85d93dc-81f4-445d-91e5-6a5cd9563294 > Total devices 2 FS bytes used 1001.53MiB > devid 1 size 2.00GiB used 1.85GiB path /dev/vdf1 > devid 2 size 4.00GiB used 1.83GiB path /dev/vdf2 > a. df -h should report Size as 2GiB rather than as 3GiB. > Because this is 2 device raid1, the limiting factor is devid 1 @2GiB. > > b. df -h should report Avail as 0.97GiB or less, rather than as 1.3GiB. > 1.85 (the capacity of the allocated chunk) > -1.018 (the file stored) > +(2-1.85=0.15) (the residual capacity of the disks > considering a raid1 fs) > --------------- > = 0.97 > > This patch drops the factor at all and calculate the size observable to > user without considering which raid level the data is in and what's the > size exactly in disk. > After this patch applied: > # mkfs.btrfs -f /dev/vdf1 /dev/vdf2 -d raid1 > # mount /dev/vdf1 /mnt > # dd if=/dev/zero of=/mnt/zero bs=1M count=1000 > # df -h /mnt > Filesystem Size Used Avail Use% Mounted on > /dev/vdf1 2.0G 1.3G 713M 66% /mnt > # df /mnt > Filesystem 1K-blocks Used Available Use% Mounted on > /dev/vdf1 2097152 1359424 729536 66% /mnt > # btrfs fi show /dev/vdf1 > Label: none uuid: e98c1321-645f-4457-b20d-4f41dc1cf2f4 > Total devices 2 FS bytes used 1001.55MiB > devid 1 size 2.00GiB used 1.85GiB path /dev/vdf1 > devid 2 size 4.00GiB used 1.83GiB path /dev/vdf2 > a). The @Size is 2G as we expected. > b). @Available is 700M = 1.85G - 1.3G + (2G - 1.85G). > c). @Used is changed to 1.3G rather than 1018M as above. Because > this patch do not treat the free space in metadata chunk > and system chunk as available to user. It's true, user can > not use these space to store data, then it should not be > thought as available. At the same time, it will make the > @Used + @Available == @Size as possible to user. > > Signed-off-by: Dongsheng Yang <yangds.fnst@cn.fujitsu.com> > --- > Changelog: > v1 -> v2: > a. 
account the total_bytes in medadata chunk and > system chunk as used to user. > b. set data_stripes to the correct value in RAID0. > > fs/btrfs/extent-tree.c | 13 ++---------- > fs/btrfs/super.c | 56 ++++++++++++++++++++++---------------------------- > 2 files changed, 26 insertions(+), 43 deletions(-) > > diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c > index a84e00d..9954d60 100644 > --- a/fs/btrfs/extent-tree.c > +++ b/fs/btrfs/extent-tree.c > @@ -8571,7 +8571,6 @@ static u64 __btrfs_get_ro_block_group_free_space(struct list_head *groups_list) > { > struct btrfs_block_group_cache *block_group; > u64 free_bytes = 0; > - int factor; > > list_for_each_entry(block_group, groups_list, list) { > spin_lock(&block_group->lock); > @@ -8581,16 +8580,8 @@ static u64 __btrfs_get_ro_block_group_free_space(struct list_head *groups_list) > continue; > } > > - if (block_group->flags & (BTRFS_BLOCK_GROUP_RAID1 | > - BTRFS_BLOCK_GROUP_RAID10 | > - BTRFS_BLOCK_GROUP_DUP)) > - factor = 2; > - else > - factor = 1; > - > - free_bytes += (block_group->key.offset - > - btrfs_block_group_used(&block_group->item)) * > - factor; > + free_bytes += block_group->key.offset - > + btrfs_block_group_used(&block_group->item); > > spin_unlock(&block_group->lock); > } > diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c > index 54bd91e..40f41a2 100644 > --- a/fs/btrfs/super.c > +++ b/fs/btrfs/super.c > @@ -1641,6 +1641,8 @@ static int btrfs_calc_avail_data_space(struct btrfs_root *root, u64 *free_bytes) > u64 used_space; > u64 min_stripe_size; > int min_stripes = 1, num_stripes = 1; > + /* How many stripes used to store data, without considering mirrors. */ > + int data_stripes = 1; > int i = 0, nr_devices; > int ret; > > @@ -1657,12 +1659,15 @@ static int btrfs_calc_avail_data_space(struct btrfs_root *root, u64 *free_bytes) > if (type & BTRFS_BLOCK_GROUP_RAID0) { > min_stripes = 2; > num_stripes = nr_devices; > + data_stripes = num_stripes; > } else if (type & BTRFS_BLOCK_GROUP_RAID1) { > min_stripes = 2; > num_stripes = 2; > + data_stripes = 1; > } else if (type & BTRFS_BLOCK_GROUP_RAID10) { > min_stripes = 4; > num_stripes = 4; > + data_stripes = 2; > } > > if (type & BTRFS_BLOCK_GROUP_DUP) > @@ -1733,14 +1738,17 @@ static int btrfs_calc_avail_data_space(struct btrfs_root *root, u64 *free_bytes) > i = nr_devices - 1; > avail_space = 0; > while (nr_devices >= min_stripes) { > - if (num_stripes > nr_devices) > + if (num_stripes > nr_devices) { > num_stripes = nr_devices; > + if (type & BTRFS_BLOCK_GROUP_RAID0) > + data_stripes = num_stripes; > + } > > if (devices_info[i].max_avail >= min_stripe_size) { > int j; > u64 alloc_size; > > - avail_space += devices_info[i].max_avail * num_stripes; > + avail_space += devices_info[i].max_avail * data_stripes; > alloc_size = devices_info[i].max_avail; > for (j = i + 1 - num_stripes; j <= i; j++) > devices_info[j].max_avail -= alloc_size; > @@ -1772,15 +1780,13 @@ static int btrfs_calc_avail_data_space(struct btrfs_root *root, u64 *free_bytes) > static int btrfs_statfs(struct dentry *dentry, struct kstatfs *buf) > { > struct btrfs_fs_info *fs_info = btrfs_sb(dentry->d_sb); > - struct btrfs_super_block *disk_super = fs_info->super_copy; > struct list_head *head = &fs_info->space_info; > struct btrfs_space_info *found; > u64 total_used = 0; > + u64 total_alloc = 0; > u64 total_free_data = 0; > int bits = dentry->d_sb->s_blocksize_bits; > __be32 *fsid = (__be32 *)fs_info->fsid; > - unsigned factor = 1; > - struct btrfs_block_rsv *block_rsv = &fs_info->global_block_rsv; > 
int ret; > > /* > @@ -1792,38 +1798,20 @@ static int btrfs_statfs(struct dentry *dentry, struct kstatfs *buf) > rcu_read_lock(); > list_for_each_entry_rcu(found, head, list) { > if (found->flags & BTRFS_BLOCK_GROUP_DATA) { > - int i; > - > - total_free_data += found->disk_total - found->disk_used; > + total_free_data += found->total_bytes - found->bytes_used; > total_free_data -= > btrfs_account_ro_block_groups_free_space(found); > - > - for (i = 0; i < BTRFS_NR_RAID_TYPES; i++) { > - if (!list_empty(&found->block_groups[i])) { > - switch (i) { > - case BTRFS_RAID_DUP: > - case BTRFS_RAID_RAID1: > - case BTRFS_RAID_RAID10: > - factor = 2; > - } > - } > - } > + total_used += found->bytes_used; > + } else { > + /* For metadata and system, we treat the total_bytes as > + * not available to file data. So show it as Used in df. > + **/ > + total_used += found->total_bytes; > } > - > - total_used += found->disk_used; > + total_alloc += found->total_bytes; > } > - > rcu_read_unlock(); > > - buf->f_blocks = div_u64(btrfs_super_total_bytes(disk_super), factor); > - buf->f_blocks >>= bits; > - buf->f_bfree = buf->f_blocks - (div_u64(total_used, factor) >> bits); > - > - /* Account global block reserve as used, it's in logical size already */ > - spin_lock(&block_rsv->lock); > - buf->f_bfree -= block_rsv->size >> bits; > - spin_unlock(&block_rsv->lock); > - > buf->f_bavail = total_free_data; > ret = btrfs_calc_avail_data_space(fs_info->tree_root, &total_free_data); > if (ret) { > @@ -1831,8 +1819,12 @@ static int btrfs_statfs(struct dentry *dentry, struct kstatfs *buf) > mutex_unlock(&fs_info->fs_devices->device_list_mutex); > return ret; > } > - buf->f_bavail += div_u64(total_free_data, factor); > + buf->f_bavail += total_free_data; > buf->f_bavail = buf->f_bavail >> bits; > + buf->f_blocks = total_alloc + total_free_data; > + buf->f_blocks >>= bits; > + buf->f_bfree = buf->f_blocks - (total_used >> bits); > + > mutex_unlock(&fs_info->chunk_mutex); > mutex_unlock(&fs_info->fs_devices->device_list_mutex); > >
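Working backwards from the two reports above, and consistent with Yang's explanation later in the thread, the gap appears to be exactly the second copy of the DUP chunks. A stand-alone arithmetic check (not kernel code; all figures are KiB copied from the quoted output):

#include <stdio.h>
#include <stdint.h>

/*
 * Reproduce the df numbers above from the "btrfs fi df" output, using the
 * accounting the patch introduces.  Arithmetic only, not kernel code.
 */
int main(void)
{
    uint64_t device    = 10485760;                 /* 10 GiB LV                       */
    uint64_t data_tot  = 8192, data_used = 256;    /* Data, single                    */
    uint64_t meta_tot  = 1048576 + 8192;           /* Metadata DUP + single (logical) */
    uint64_t sys_tot   = 8192 + 4096;              /* System DUP + single (logical)   */
    /* raw space consumed on the device: each DUP chunk is stored twice */
    uint64_t on_disk   = data_tot + 2 * 1048576 + 8192 + 2 * 8192 + 4096;
    uint64_t unalloc   = device - on_disk;         /* 8351744 KiB unallocated         */

    printf("size  = %llu KiB (df: 9428992)\n",
           (unsigned long long)(data_tot + meta_tot + sys_tot + unalloc));
    printf("avail = %llu KiB (df: 8359680)\n",
           (unsigned long long)(data_tot - data_used + unalloc));
    printf("used  = %llu KiB (df: 1069312)\n",
           (unsigned long long)(data_used + meta_tot + sys_tot));
    printf("hidden DUP copies = %llu KiB\n",
           (unsigned long long)(device - (data_tot + meta_tot + sys_tot + unalloc)));
    return 0;
}

By this arithmetic the hidden space is 1056768 KiB (the 1 GiB metadata DUP mirror plus the 8 MiB system DUP mirror), so roughly 1 GiB rather than ~600 MiB, and the three df columns come out exactly as reported.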
Goffredo Baroncelli posted on Fri, 12 Dec 2014 19:00:20 +0100 as excerpted:

> $ sudo ./btrfs fi df /mnt/btrfs1/
> Data, RAID1: total=1.00GiB, used=512.00KiB
> Data, single: total=8.00MiB, used=0.00B
> System, RAID1: total=8.00MiB, used=16.00KiB
> System, single: total=4.00MiB, used=0.00B
> Metadata, RAID1: total=1.00GiB, used=112.00KiB
> Metadata, single: total=8.00MiB, used=0.00B
> GlobalReserve, single: total=16.00MiB, used=0.00B
>
> In this case the filesystem is empty (it was a new filesystem !).
> However a 1G metadata chunk was already allocated. This is the reasons
> why the free space is only 4Gb.

Trivial(?) correction.

Metadata chunks are quarter-gig, not 1 gig. So that's 4 quarter-gig metadata chunks allocated, not a (one/single) 1-gig metadata chunk.

> On my system the ratio metadata/data is 234MB/8.82GB = ~3%, so ignoring
> the metadata chunk from the free space may not be a big problem.

Presumably your use-case is primarily reasonably large files; too large for their data to be tucked directly into metadata instead of allocating an extent from a data chunk.

That's not always going to be the case. And given the multi-device default allocation of raid1 metadata, single data, files small enough to fit into metadata have a default size effect of double their actual size! (Tho it can be noted that given btrfs' 4 KiB standard block size, without metadata packing there'd still be an outsized effect for files smaller than half that, 2 KiB or under, but there it'd be in data chunks, not metadata.)
On 12/13/2014 02:00 AM, Goffredo Baroncelli wrote: > On 12/11/2014 09:31 AM, Dongsheng Yang wrote: >> When function btrfs_statfs() calculate the tatol size of fs, it is calculating >> the total size of disks and then dividing it by a factor. But in some usecase, >> the result is not good to user. > I am checking it; to me it seems a good improvement. However > I noticed that df now doesn't seem to report anymore the space consumed > by the meta-data chunk; eg: > > # I have two disks of 5GB each > $ sudo ~/mkfs.btrfs -f -m raid1 -d raid1 /dev/vgtest/disk /dev/vgtest/disk1 > > $ df -h /mnt/btrfs1/ > Filesystem Size Used Avail Use% Mounted on > /dev/mapper/vgtest-disk 5.0G 1.1G 4.0G 21% /mnt/btrfs1 > > $ sudo btrfs fi show > Label: none uuid: 884414c6-9374-40af-a5be-3949cdf6ad0b > Total devices 2 FS bytes used 640.00KB > devid 2 size 5.00GB used 2.01GB path /dev/dm-1 > devid 1 size 5.00GB used 2.03GB path /dev/dm-0 > > $ sudo ./btrfs fi df /mnt/btrfs1/ > Data, RAID1: total=1.00GiB, used=512.00KiB > Data, single: total=8.00MiB, used=0.00B > System, RAID1: total=8.00MiB, used=16.00KiB > System, single: total=4.00MiB, used=0.00B > Metadata, RAID1: total=1.00GiB, used=112.00KiB > Metadata, single: total=8.00MiB, used=0.00B > GlobalReserve, single: total=16.00MiB, used=0.00B > > In this case the filesystem is empty (it was a new filesystem !). However a > 1G metadata chunk was already allocated. This is the reasons why the free > space is only 4Gb. Actually, in the original btrfs_statfs(), the space in metadata chunk was also not be considered as available. But you will get 5G in this case with original btrfs_statfs() because there is a bug in it. As I use another implementation to calculate the @available in df command, I did not mention this problem. I can describe it as below. In original btrfs_statfs(), we only consider the free space in data chunk as available. <<<<<<<<<<<<<<<<<<<<<<<<<<<<< list_for_each_entry_rcu(found, head, list) { if (found->flags & BTRFS_BLOCK_GROUP_DATA) { int i; total_free_data += found->disk_total - found->disk_used; >>>>>>>>>>>>>>>>>>>>>>>>>>>>> In the later, we will add the total_free_data to @available in output of df. buf->f_bavail = total_free_data; BUT: This is incorrect! It should be: buf->f_bavail = div_u64(total_free_data, factor); That said this bug will add one more (data_chunk_disk_total - data_chunk_disk_used = 1G) to @available by mistake. Unfortunately, the free space in metadata_chunk is also 1G. Then we will get 5G in this case you provided here. Conclusion: Even in the original btrfs_statfs(), the free space in metadata is also considered as not available. But one bug in it make it to be misunderstood. My new btrfs_statfs() still consider the free space in metadata as not available, furthermore, I add it into @used. > > On my system the ratio metadata/data is 234MB/8.82GB = ~3%, so ignoring the > metadata chunk from the free space may not be a big problem. > > > > > >> Example: >> # mkfs.btrfs -f /dev/vdf1 /dev/vdf2 -d raid1 >> # mount /dev/vdf1 /mnt >> # dd if=/dev/zero of=/mnt/zero bs=1M count=1000 >> # df -h /mnt >> Filesystem Size Used Avail Use% Mounted on >> /dev/vdf1 3.0G 1018M 1.3G 45% /mnt >> # btrfs fi show /dev/vdf1 >> Label: none uuid: f85d93dc-81f4-445d-91e5-6a5cd9563294 >> Total devices 2 FS bytes used 1001.53MiB >> devid 1 size 2.00GiB used 1.85GiB path /dev/vdf1 >> devid 2 size 4.00GiB used 1.83GiB path /dev/vdf2 >> a. df -h should report Size as 2GiB rather than as 3GiB. 
>> Because this is 2 device raid1, the limiting factor is devid 1 @2GiB. >> >> b. df -h should report Avail as 0.97GiB or less, rather than as 1.3GiB. >> 1.85 (the capacity of the allocated chunk) >> -1.018 (the file stored) >> +(2-1.85=0.15) (the residual capacity of the disks >> considering a raid1 fs) >> --------------- >> = 0.97 >> >> This patch drops the factor at all and calculate the size observable to >> user without considering which raid level the data is in and what's the >> size exactly in disk. >> After this patch applied: >> # mkfs.btrfs -f /dev/vdf1 /dev/vdf2 -d raid1 >> # mount /dev/vdf1 /mnt >> # dd if=/dev/zero of=/mnt/zero bs=1M count=1000 >> # df -h /mnt >> Filesystem Size Used Avail Use% Mounted on >> /dev/vdf1 2.0G 1.3G 713M 66% /mnt >> # df /mnt >> Filesystem 1K-blocks Used Available Use% Mounted on >> /dev/vdf1 2097152 1359424 729536 66% /mnt >> # btrfs fi show /dev/vdf1 >> Label: none uuid: e98c1321-645f-4457-b20d-4f41dc1cf2f4 >> Total devices 2 FS bytes used 1001.55MiB >> devid 1 size 2.00GiB used 1.85GiB path /dev/vdf1 >> devid 2 size 4.00GiB used 1.83GiB path /dev/vdf2 >> a). The @Size is 2G as we expected. >> b). @Available is 700M = 1.85G - 1.3G + (2G - 1.85G). >> c). @Used is changed to 1.3G rather than 1018M as above. Because >> this patch do not treat the free space in metadata chunk >> and system chunk as available to user. It's true, user can >> not use these space to store data, then it should not be >> thought as available. At the same time, it will make the >> @Used + @Available == @Size as possible to user. >> >> Signed-off-by: Dongsheng Yang <yangds.fnst@cn.fujitsu.com> >> --- >> Changelog: >> v1 -> v2: >> a. account the total_bytes in medadata chunk and >> system chunk as used to user. >> b. set data_stripes to the correct value in RAID0. >> >> fs/btrfs/extent-tree.c | 13 ++---------- >> fs/btrfs/super.c | 56 ++++++++++++++++++++++---------------------------- >> 2 files changed, 26 insertions(+), 43 deletions(-) >> >> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c >> index a84e00d..9954d60 100644 >> --- a/fs/btrfs/extent-tree.c >> +++ b/fs/btrfs/extent-tree.c >> @@ -8571,7 +8571,6 @@ static u64 __btrfs_get_ro_block_group_free_space(struct list_head *groups_list) >> { >> struct btrfs_block_group_cache *block_group; >> u64 free_bytes = 0; >> - int factor; >> >> list_for_each_entry(block_group, groups_list, list) { >> spin_lock(&block_group->lock); >> @@ -8581,16 +8580,8 @@ static u64 __btrfs_get_ro_block_group_free_space(struct list_head *groups_list) >> continue; >> } >> >> - if (block_group->flags & (BTRFS_BLOCK_GROUP_RAID1 | >> - BTRFS_BLOCK_GROUP_RAID10 | >> - BTRFS_BLOCK_GROUP_DUP)) >> - factor = 2; >> - else >> - factor = 1; >> - >> - free_bytes += (block_group->key.offset - >> - btrfs_block_group_used(&block_group->item)) * >> - factor; >> + free_bytes += block_group->key.offset - >> + btrfs_block_group_used(&block_group->item); >> >> spin_unlock(&block_group->lock); >> } >> diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c >> index 54bd91e..40f41a2 100644 >> --- a/fs/btrfs/super.c >> +++ b/fs/btrfs/super.c >> @@ -1641,6 +1641,8 @@ static int btrfs_calc_avail_data_space(struct btrfs_root *root, u64 *free_bytes) >> u64 used_space; >> u64 min_stripe_size; >> int min_stripes = 1, num_stripes = 1; >> + /* How many stripes used to store data, without considering mirrors. 
*/ >> + int data_stripes = 1; >> int i = 0, nr_devices; >> int ret; >> >> @@ -1657,12 +1659,15 @@ static int btrfs_calc_avail_data_space(struct btrfs_root *root, u64 *free_bytes) >> if (type & BTRFS_BLOCK_GROUP_RAID0) { >> min_stripes = 2; >> num_stripes = nr_devices; >> + data_stripes = num_stripes; >> } else if (type & BTRFS_BLOCK_GROUP_RAID1) { >> min_stripes = 2; >> num_stripes = 2; >> + data_stripes = 1; >> } else if (type & BTRFS_BLOCK_GROUP_RAID10) { >> min_stripes = 4; >> num_stripes = 4; >> + data_stripes = 2; >> } >> >> if (type & BTRFS_BLOCK_GROUP_DUP) >> @@ -1733,14 +1738,17 @@ static int btrfs_calc_avail_data_space(struct btrfs_root *root, u64 *free_bytes) >> i = nr_devices - 1; >> avail_space = 0; >> while (nr_devices >= min_stripes) { >> - if (num_stripes > nr_devices) >> + if (num_stripes > nr_devices) { >> num_stripes = nr_devices; >> + if (type & BTRFS_BLOCK_GROUP_RAID0) >> + data_stripes = num_stripes; >> + } >> >> if (devices_info[i].max_avail >= min_stripe_size) { >> int j; >> u64 alloc_size; >> >> - avail_space += devices_info[i].max_avail * num_stripes; >> + avail_space += devices_info[i].max_avail * data_stripes; >> alloc_size = devices_info[i].max_avail; >> for (j = i + 1 - num_stripes; j <= i; j++) >> devices_info[j].max_avail -= alloc_size; >> @@ -1772,15 +1780,13 @@ static int btrfs_calc_avail_data_space(struct btrfs_root *root, u64 *free_bytes) >> static int btrfs_statfs(struct dentry *dentry, struct kstatfs *buf) >> { >> struct btrfs_fs_info *fs_info = btrfs_sb(dentry->d_sb); >> - struct btrfs_super_block *disk_super = fs_info->super_copy; >> struct list_head *head = &fs_info->space_info; >> struct btrfs_space_info *found; >> u64 total_used = 0; >> + u64 total_alloc = 0; >> u64 total_free_data = 0; >> int bits = dentry->d_sb->s_blocksize_bits; >> __be32 *fsid = (__be32 *)fs_info->fsid; >> - unsigned factor = 1; >> - struct btrfs_block_rsv *block_rsv = &fs_info->global_block_rsv; >> int ret; >> >> /* >> @@ -1792,38 +1798,20 @@ static int btrfs_statfs(struct dentry *dentry, struct kstatfs *buf) >> rcu_read_lock(); >> list_for_each_entry_rcu(found, head, list) { >> if (found->flags & BTRFS_BLOCK_GROUP_DATA) { >> - int i; >> - >> - total_free_data += found->disk_total - found->disk_used; >> + total_free_data += found->total_bytes - found->bytes_used; >> total_free_data -= >> btrfs_account_ro_block_groups_free_space(found); >> - >> - for (i = 0; i < BTRFS_NR_RAID_TYPES; i++) { >> - if (!list_empty(&found->block_groups[i])) { >> - switch (i) { >> - case BTRFS_RAID_DUP: >> - case BTRFS_RAID_RAID1: >> - case BTRFS_RAID_RAID10: >> - factor = 2; >> - } >> - } >> - } >> + total_used += found->bytes_used; >> + } else { >> + /* For metadata and system, we treat the total_bytes as >> + * not available to file data. So show it as Used in df. 
>> + **/ >> + total_used += found->total_bytes; >> } >> - >> - total_used += found->disk_used; >> + total_alloc += found->total_bytes; >> } >> - >> rcu_read_unlock(); >> >> - buf->f_blocks = div_u64(btrfs_super_total_bytes(disk_super), factor); >> - buf->f_blocks >>= bits; >> - buf->f_bfree = buf->f_blocks - (div_u64(total_used, factor) >> bits); >> - >> - /* Account global block reserve as used, it's in logical size already */ >> - spin_lock(&block_rsv->lock); >> - buf->f_bfree -= block_rsv->size >> bits; >> - spin_unlock(&block_rsv->lock); >> - >> buf->f_bavail = total_free_data; >> ret = btrfs_calc_avail_data_space(fs_info->tree_root, &total_free_data); >> if (ret) { >> @@ -1831,8 +1819,12 @@ static int btrfs_statfs(struct dentry *dentry, struct kstatfs *buf) >> mutex_unlock(&fs_info->fs_devices->device_list_mutex); >> return ret; >> } >> - buf->f_bavail += div_u64(total_free_data, factor); >> + buf->f_bavail += total_free_data; >> buf->f_bavail = buf->f_bavail >> bits; >> + buf->f_blocks = total_alloc + total_free_data; >> + buf->f_blocks >>= bits; >> + buf->f_bfree = buf->f_blocks - (total_used >> bits); >> + >> mutex_unlock(&fs_info->chunk_mutex); >> mutex_unlock(&fs_info->fs_devices->device_list_mutex); >> >> > -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
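To make the factor bug described above concrete, here is a toy calculation (invented numbers, not kernel code): for RAID1 data chunks, disk_total and disk_used count raw bytes on both mirrors, so the free data space added to f_bavail has to be divided by the profile factor, which the quoted original code did not do.

#include <stdio.h>
#include <stdint.h>

/*
 * Toy illustration of the factor bug: RAID1/RAID10/DUP data chunks store
 * two copies, so raw free chunk space overstates what the user can write.
 */
int main(void)
{
    uint64_t disk_total = (uint64_t)2 << 30;  /* 2 GiB raw RAID1 data chunk space */
    uint64_t disk_used  = 0;                  /* empty filesystem                 */
    unsigned factor     = 2;                  /* two copies per logical byte      */

    uint64_t raw_free     = disk_total - disk_used;            /* old f_bavail contribution */
    uint64_t logical_free = (disk_total - disk_used) / factor; /* corrected contribution    */

    printf("old f_bavail contribution: %llu MiB\n",
           (unsigned long long)(raw_free >> 20));
    printf("corrected contribution   : %llu MiB\n",
           (unsigned long long)(logical_free >> 20));
    return 0;
}

In Goffredo's 2x5 GiB example that spurious extra 1 GiB happens to be the same size as the allocated metadata chunk, which is why the old numbers looked as if metadata were still counted as free.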
On 12/13/2014 08:50 AM, Duncan wrote:
> Goffredo Baroncelli posted on Fri, 12 Dec 2014 19:00:20 +0100 as
> excerpted:
>
>> $ sudo ./btrfs fi df /mnt/btrfs1/
>> Data, RAID1: total=1.00GiB, used=512.00KiB
>> Data, single: total=8.00MiB, used=0.00B
>> System, RAID1: total=8.00MiB, used=16.00KiB
>> System, single: total=4.00MiB, used=0.00B
>> Metadata, RAID1: total=1.00GiB, used=112.00KiB
>> Metadata, single: total=8.00MiB, used=0.00B
>> GlobalReserve, single: total=16.00MiB, used=0.00B
>>
>> In this case the filesystem is empty (it was a new filesystem !).
>> However a 1G metadata chunk was already allocated. This is the reasons
>> why the free space is only 4Gb.
> Trivial(?) correction.
>
> Metadata chunks are quarter-gig, not 1 gig. So that's 4 quarter-gig
> metadata chunks allocated, not a (one/single) 1-gig metadata chunk.

Sorry, but from my reading of the code, I have to say that in this case it is 1G. Maybe I am wrong, but I would say: yes, the default chunk size for a btrfs smaller than 50G is 256M, but mkfs allocates 1G for the first metadata chunk.

>
>> On my system the ratio metadata/data is 234MB/8.82GB = ~3%, so ignoring
>> the metadata chunk from the free space may not be a big problem.
> Presumably your use-case is primarily reasonably large files; too large
> for their data to be tucked directly into metadata instead of allocating
> an extent from a data chunk.
>
> That's not always going to be the case. And given the multi-device
> default allocation of raid1 metadata, single data, files small enough to
> fit into metadata have a default size effect double their actual size!
> (Tho it can be noted that given btrfs' 4 KiB standard block size, without
> metadata packing there'd still be an outsized effect for files smaller
> than half that, 2 KiB or under, but there it'd be in data chunks, not
> metadata.)
On Sat, Dec 13, 2014 at 3:25 AM, Goffredo Baroncelli <kreijack@inwind.it> wrote: > On 12/11/2014 09:31 AM, Dongsheng Yang wrote: >> When function btrfs_statfs() calculate the tatol size of fs, it is calculating >> the total size of disks and then dividing it by a factor. But in some usecase, >> the result is not good to user. > > > I Yang; during my test I discovered an error: > > $ sudo lvcreate -L +10G -n disk vgtest > $ sudo /home/ghigo/mkfs.btrfs -f /dev/vgtest/disk > Btrfs v3.17 > See http://btrfs.wiki.kernel.org for more information. > > Turning ON incompat feature 'extref': increased hardlink limit per file to 65536 > fs created label (null) on /dev/vgtest/disk > nodesize 16384 leafsize 16384 sectorsize 4096 size 10.00GiB > $ sudo mount /dev/vgtest/disk /mnt/btrfs1/ > $ df /mnt/btrfs1/ > Filesystem 1K-blocks Used Available Use% Mounted on > /dev/mapper/vgtest-disk 9428992 1069312 8359680 12% /mnt/btrfs1 > $ sudo ~/btrfs fi df /mnt/btrfs1/ > Data, single: total=8.00MiB, used=256.00KiB > System, DUP: total=8.00MiB, used=16.00KiB > System, single: total=4.00MiB, used=0.00B > Metadata, DUP: total=1.00GiB, used=112.00KiB > Metadata, single: total=8.00MiB, used=0.00B > GlobalReserve, single: total=16.00MiB, used=0.00B > > What seems me strange is the 9428992KiB of total disk space as reported > by df. About 600MiB are missing ! > > Without your patch, I got: > > $ df /mnt/btrfs1/ > Filesystem 1K-blocks Used Available Use% Mounted on > /dev/mapper/vgtest-disk 10485760 16896 8359680 1% /mnt/btrfs1 > Hi Goffredo, thanx for pointing this. Let me try to show the reason of it. In this case you provided here, the metadata is 1G in DUP. It means we spent 2G device space for it. But from the view point of user, they can only see 9G size for this fs. It's show as below: 1G -----|---------------- Filesystem space (9G) | \ | \ | \ | 2G \ -----|----|----------------- Device space (10G) DUP | The main idea about the new btrfs_statfs() is hiding the detail in Device space and only show the information in the FS space level. Actually, the similar idea is adopted in btrfs fi df. Example: # lvcreate -L +10G -n disk Group0 # mkfs.btrfs -f /dev/Group0/disk # dd if=/dev/zero of=/mnt/data bs=1M count=10240 dd: error writing ‘/mnt/data’: No space left on device 8163+0 records in 8162+0 records out 8558477312 bytes (8.6 GB) copied, 207.565 s, 41.2 MB/s # df /mnt Filesystem 1K-blocks Used Available Use% Mounted on /dev/mapper/Group0-disk 9428992 8382896 2048 100% /mnt # btrfs fi df /mnt Data, single: total=7.97GiB, used=7.97GiB System, DUP: total=8.00MiB, used=16.00KiB System, single: total=4.00MiB, used=0.00B Metadata, DUP: total=1.00GiB, used=8.41MiB Metadata, single: total=8.00MiB, used=0.00B GlobalReserve, single: total=16.00MiB, used=0.00B Now, the all space is used and got a ENOSPC error. But from the output of btrfs fi df. We can only find almost 9G (data 8G + metadata 1G) is used. The difference is that it show the DUP here to make it more clear. Wish this description is clear to you :). If there is anything confusing or sounds incorrect to you, please point it out. 
Thanx Yang > >> Example: >> # mkfs.btrfs -f /dev/vdf1 /dev/vdf2 -d raid1 >> # mount /dev/vdf1 /mnt >> # dd if=/dev/zero of=/mnt/zero bs=1M count=1000 >> # df -h /mnt >> Filesystem Size Used Avail Use% Mounted on >> /dev/vdf1 3.0G 1018M 1.3G 45% /mnt >> # btrfs fi show /dev/vdf1 >> Label: none uuid: f85d93dc-81f4-445d-91e5-6a5cd9563294 >> Total devices 2 FS bytes used 1001.53MiB >> devid 1 size 2.00GiB used 1.85GiB path /dev/vdf1 >> devid 2 size 4.00GiB used 1.83GiB path /dev/vdf2 >> a. df -h should report Size as 2GiB rather than as 3GiB. >> Because this is 2 device raid1, the limiting factor is devid 1 @2GiB. >> >> b. df -h should report Avail as 0.97GiB or less, rather than as 1.3GiB. >> 1.85 (the capacity of the allocated chunk) >> -1.018 (the file stored) >> +(2-1.85=0.15) (the residual capacity of the disks >> considering a raid1 fs) >> --------------- >> = 0.97 >> >> This patch drops the factor at all and calculate the size observable to >> user without considering which raid level the data is in and what's the >> size exactly in disk. >> After this patch applied: >> # mkfs.btrfs -f /dev/vdf1 /dev/vdf2 -d raid1 >> # mount /dev/vdf1 /mnt >> # dd if=/dev/zero of=/mnt/zero bs=1M count=1000 >> # df -h /mnt >> Filesystem Size Used Avail Use% Mounted on >> /dev/vdf1 2.0G 1.3G 713M 66% /mnt >> # df /mnt >> Filesystem 1K-blocks Used Available Use% Mounted on >> /dev/vdf1 2097152 1359424 729536 66% /mnt >> # btrfs fi show /dev/vdf1 >> Label: none uuid: e98c1321-645f-4457-b20d-4f41dc1cf2f4 >> Total devices 2 FS bytes used 1001.55MiB >> devid 1 size 2.00GiB used 1.85GiB path /dev/vdf1 >> devid 2 size 4.00GiB used 1.83GiB path /dev/vdf2 >> a). The @Size is 2G as we expected. >> b). @Available is 700M = 1.85G - 1.3G + (2G - 1.85G). >> c). @Used is changed to 1.3G rather than 1018M as above. Because >> this patch do not treat the free space in metadata chunk >> and system chunk as available to user. It's true, user can >> not use these space to store data, then it should not be >> thought as available. At the same time, it will make the >> @Used + @Available == @Size as possible to user. >> >> Signed-off-by: Dongsheng Yang <yangds.fnst@cn.fujitsu.com> >> --- >> Changelog: >> v1 -> v2: >> a. account the total_bytes in medadata chunk and >> system chunk as used to user. >> b. set data_stripes to the correct value in RAID0. 
>> >> fs/btrfs/extent-tree.c | 13 ++---------- >> fs/btrfs/super.c | 56 ++++++++++++++++++++++---------------------------- >> 2 files changed, 26 insertions(+), 43 deletions(-) >> >> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c >> index a84e00d..9954d60 100644 >> --- a/fs/btrfs/extent-tree.c >> +++ b/fs/btrfs/extent-tree.c >> @@ -8571,7 +8571,6 @@ static u64 __btrfs_get_ro_block_group_free_space(struct list_head *groups_list) >> { >> struct btrfs_block_group_cache *block_group; >> u64 free_bytes = 0; >> - int factor; >> >> list_for_each_entry(block_group, groups_list, list) { >> spin_lock(&block_group->lock); >> @@ -8581,16 +8580,8 @@ static u64 __btrfs_get_ro_block_group_free_space(struct list_head *groups_list) >> continue; >> } >> >> - if (block_group->flags & (BTRFS_BLOCK_GROUP_RAID1 | >> - BTRFS_BLOCK_GROUP_RAID10 | >> - BTRFS_BLOCK_GROUP_DUP)) >> - factor = 2; >> - else >> - factor = 1; >> - >> - free_bytes += (block_group->key.offset - >> - btrfs_block_group_used(&block_group->item)) * >> - factor; >> + free_bytes += block_group->key.offset - >> + btrfs_block_group_used(&block_group->item); >> >> spin_unlock(&block_group->lock); >> } >> diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c >> index 54bd91e..40f41a2 100644 >> --- a/fs/btrfs/super.c >> +++ b/fs/btrfs/super.c >> @@ -1641,6 +1641,8 @@ static int btrfs_calc_avail_data_space(struct btrfs_root *root, u64 *free_bytes) >> u64 used_space; >> u64 min_stripe_size; >> int min_stripes = 1, num_stripes = 1; >> + /* How many stripes used to store data, without considering mirrors. */ >> + int data_stripes = 1; >> int i = 0, nr_devices; >> int ret; >> >> @@ -1657,12 +1659,15 @@ static int btrfs_calc_avail_data_space(struct btrfs_root *root, u64 *free_bytes) >> if (type & BTRFS_BLOCK_GROUP_RAID0) { >> min_stripes = 2; >> num_stripes = nr_devices; >> + data_stripes = num_stripes; >> } else if (type & BTRFS_BLOCK_GROUP_RAID1) { >> min_stripes = 2; >> num_stripes = 2; >> + data_stripes = 1; >> } else if (type & BTRFS_BLOCK_GROUP_RAID10) { >> min_stripes = 4; >> num_stripes = 4; >> + data_stripes = 2; >> } >> >> if (type & BTRFS_BLOCK_GROUP_DUP) >> @@ -1733,14 +1738,17 @@ static int btrfs_calc_avail_data_space(struct btrfs_root *root, u64 *free_bytes) >> i = nr_devices - 1; >> avail_space = 0; >> while (nr_devices >= min_stripes) { >> - if (num_stripes > nr_devices) >> + if (num_stripes > nr_devices) { >> num_stripes = nr_devices; >> + if (type & BTRFS_BLOCK_GROUP_RAID0) >> + data_stripes = num_stripes; >> + } >> >> if (devices_info[i].max_avail >= min_stripe_size) { >> int j; >> u64 alloc_size; >> >> - avail_space += devices_info[i].max_avail * num_stripes; >> + avail_space += devices_info[i].max_avail * data_stripes; >> alloc_size = devices_info[i].max_avail; >> for (j = i + 1 - num_stripes; j <= i; j++) >> devices_info[j].max_avail -= alloc_size; >> @@ -1772,15 +1780,13 @@ static int btrfs_calc_avail_data_space(struct btrfs_root *root, u64 *free_bytes) >> static int btrfs_statfs(struct dentry *dentry, struct kstatfs *buf) >> { >> struct btrfs_fs_info *fs_info = btrfs_sb(dentry->d_sb); >> - struct btrfs_super_block *disk_super = fs_info->super_copy; >> struct list_head *head = &fs_info->space_info; >> struct btrfs_space_info *found; >> u64 total_used = 0; >> + u64 total_alloc = 0; >> u64 total_free_data = 0; >> int bits = dentry->d_sb->s_blocksize_bits; >> __be32 *fsid = (__be32 *)fs_info->fsid; >> - unsigned factor = 1; >> - struct btrfs_block_rsv *block_rsv = &fs_info->global_block_rsv; >> int ret; >> >> /* >> @@ -1792,38 
+1798,20 @@ static int btrfs_statfs(struct dentry *dentry, struct kstatfs *buf) >> rcu_read_lock(); >> list_for_each_entry_rcu(found, head, list) { >> if (found->flags & BTRFS_BLOCK_GROUP_DATA) { >> - int i; >> - >> - total_free_data += found->disk_total - found->disk_used; >> + total_free_data += found->total_bytes - found->bytes_used; >> total_free_data -= >> btrfs_account_ro_block_groups_free_space(found); >> - >> - for (i = 0; i < BTRFS_NR_RAID_TYPES; i++) { >> - if (!list_empty(&found->block_groups[i])) { >> - switch (i) { >> - case BTRFS_RAID_DUP: >> - case BTRFS_RAID_RAID1: >> - case BTRFS_RAID_RAID10: >> - factor = 2; >> - } >> - } >> - } >> + total_used += found->bytes_used; >> + } else { >> + /* For metadata and system, we treat the total_bytes as >> + * not available to file data. So show it as Used in df. >> + **/ >> + total_used += found->total_bytes; >> } >> - >> - total_used += found->disk_used; >> + total_alloc += found->total_bytes; >> } >> - >> rcu_read_unlock(); >> >> - buf->f_blocks = div_u64(btrfs_super_total_bytes(disk_super), factor); >> - buf->f_blocks >>= bits; >> - buf->f_bfree = buf->f_blocks - (div_u64(total_used, factor) >> bits); >> - >> - /* Account global block reserve as used, it's in logical size already */ >> - spin_lock(&block_rsv->lock); >> - buf->f_bfree -= block_rsv->size >> bits; >> - spin_unlock(&block_rsv->lock); >> - >> buf->f_bavail = total_free_data; >> ret = btrfs_calc_avail_data_space(fs_info->tree_root, &total_free_data); >> if (ret) { >> @@ -1831,8 +1819,12 @@ static int btrfs_statfs(struct dentry *dentry, struct kstatfs *buf) >> mutex_unlock(&fs_info->fs_devices->device_list_mutex); >> return ret; >> } >> - buf->f_bavail += div_u64(total_free_data, factor); >> + buf->f_bavail += total_free_data; >> buf->f_bavail = buf->f_bavail >> bits; >> + buf->f_blocks = total_alloc + total_free_data; >> + buf->f_blocks >>= bits; >> + buf->f_bfree = buf->f_blocks - (total_used >> bits); >> + >> mutex_unlock(&fs_info->chunk_mutex); >> mutex_unlock(&fs_info->fs_devices->device_list_mutex); >> >> > > > -- > gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it> > Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5 > -- > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
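Putting the whole df picture together at the "filesystem space" level described above: chunk totals are counted once (logically), metadata and system chunks are reported as used, and unallocated device space is added via the data-space estimate. A rough user-space model follows; the function and field names are invented, and the inputs are approximate MiB figures read off the 2x5 GiB RAID1 report earlier in the thread, so it is a sketch rather than a re-implementation.

#include <stdio.h>
#include <stdint.h>

/*
 * Rough model of how the patched btrfs_statfs() assembles the df values:
 * chunk totals are logical (one copy), metadata/system chunks count as
 * used, and the unallocated-space estimate comes from the allocator model.
 */
struct df_view {
    uint64_t size;   /* f_blocks           */
    uint64_t used;   /* f_blocks - f_bfree */
    uint64_t avail;  /* f_bavail           */
};

static struct df_view model_statfs(uint64_t data_total, uint64_t data_used,
                                   uint64_t meta_total, uint64_t sys_total,
                                   uint64_t est_unallocated)
{
    struct df_view v;

    v.size  = data_total + meta_total + sys_total + est_unallocated;
    v.avail = (data_total - data_used) + est_unallocated;
    v.used  = data_used + meta_total + sys_total;
    return v;
}

int main(void)
{
    /*
     * Approximate MiB figures for the two 5 GiB RAID1 devices reported
     * earlier: ~1032 MiB data chunks, ~1032 MiB metadata chunks, ~12 MiB
     * system chunks, ~3040 MiB usable unallocated space (RAID1-paired).
     */
    struct df_view v = model_statfs(1032, 1, 1032, 12, 3040);

    printf("size=%llu MiB used=%llu MiB avail=%llu MiB\n",
           (unsigned long long)v.size, (unsigned long long)v.used,
           (unsigned long long)v.avail);
    return 0;
}

With these inputs the model lands at roughly 5.0G size, 1.0G used and 4.0G available, close to the df line Goffredo posted (the Used column differs only by rounding).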
Hi, I see another problem on 1 device fs after applying this patch. I set up 30GB system partition: in gdisk key 'i' Partition size: 62914560 sectors (30.0 GiB) -> 62914560 * 512 = 32212254720 = 30.0GiB before applying the patch df -B1 shows Filesystem 1B-blocks Used Available Use% Mounted on /dev/sda3 32212254720 18149384192 12821815296 59% / The total size matches exactly the one reported by gdisk. After applying the patch df -B1 shows Filesystem 1B-blocks Used Available Use% Mounted on /dev/sda3 29761732608 17037860864 12723871744 58% / The total size is 29761732608 bytes now, which is 2337 MiB (2.28GB) smaller then the partition size. IMHO the total size should correspond exactly to the partition size, at least in the case of fs on one device, and anything used by the file system and user should go to the used space. Am I missing something here? Thanks, Grzegorz On Sun, Dec 14, 2014 at 12:27 PM, Grzegorz Kowal <custos.mentis@gmail.com> wrote: > Hi, > > I see another problem on 1 device fs after applying this patch. I set up > 30GB system partition: > > in gdisk key 'i' > Partition size: 62914560 sectors (30.0 GiB) -> 62914560 * 512 = 32212254720 > = 30.0GiB > > before applying the patch df -B1 shows > > Filesystem 1B-blocks Used Available Use% Mounted on > /dev/sda3 32212254720 18149384192 12821815296 59% / > > The total size matches exactly the one reported by gdisk. > > After applying the patch df -B1 shows > > Filesystem 1B-blocks Used Available Use% Mounted on > /dev/sda3 29761732608 17037860864 12723871744 58% / > > The total size is 29761732608 bytes now, which is 2337 MiB (2.28GB) smaller > then the partition size. > > IMHO the total size should correspond exactly to the partition size, at > least in the case of fs on one device, and anything used by the file system > and user should go to the used space. > > Am I missing something here? > > Thanks, > Grzegorz > > > > > > On Sun, Dec 14, 2014 at 9:29 AM, Dongsheng Yang <dongsheng081251@gmail.com> > wrote: >> >> On Sat, Dec 13, 2014 at 3:25 AM, Goffredo Baroncelli <kreijack@inwind.it> >> wrote: >> > On 12/11/2014 09:31 AM, Dongsheng Yang wrote: >> >> When function btrfs_statfs() calculate the tatol size of fs, it is >> >> calculating >> >> the total size of disks and then dividing it by a factor. But in some >> >> usecase, >> >> the result is not good to user. >> > >> > >> > I Yang; during my test I discovered an error: >> > >> > $ sudo lvcreate -L +10G -n disk vgtest >> > $ sudo /home/ghigo/mkfs.btrfs -f /dev/vgtest/disk >> > Btrfs v3.17 >> > See http://btrfs.wiki.kernel.org for more information. >> > >> > Turning ON incompat feature 'extref': increased hardlink limit per file >> > to 65536 >> > fs created label (null) on /dev/vgtest/disk >> > nodesize 16384 leafsize 16384 sectorsize 4096 size 10.00GiB >> > $ sudo mount /dev/vgtest/disk /mnt/btrfs1/ >> > $ df /mnt/btrfs1/ >> > Filesystem 1K-blocks Used Available Use% Mounted on >> > /dev/mapper/vgtest-disk 9428992 1069312 8359680 12% /mnt/btrfs1 >> > $ sudo ~/btrfs fi df /mnt/btrfs1/ >> > Data, single: total=8.00MiB, used=256.00KiB >> > System, DUP: total=8.00MiB, used=16.00KiB >> > System, single: total=4.00MiB, used=0.00B >> > Metadata, DUP: total=1.00GiB, used=112.00KiB >> > Metadata, single: total=8.00MiB, used=0.00B >> > GlobalReserve, single: total=16.00MiB, used=0.00B >> > >> > What seems me strange is the 9428992KiB of total disk space as reported >> > by df. About 600MiB are missing ! 
>> > >> > Without your patch, I got: >> > >> > $ df /mnt/btrfs1/ >> > Filesystem 1K-blocks Used Available Use% Mounted on >> > /dev/mapper/vgtest-disk 10485760 16896 8359680 1% /mnt/btrfs1 >> > >> >> Hi Goffredo, thanx for pointing this. Let me try to show the reason of it. >> >> In this case you provided here, the metadata is 1G in DUP. It means >> we spent 2G device space for it. But from the view point of user, they >> can only see 9G size for this fs. It's show as below: >> 1G >> -----|---------------- Filesystem space (9G) >> | \ >> | \ >> | \ >> | 2G \ >> -----|----|----------------- Device space (10G) >> DUP | >> >> The main idea about the new btrfs_statfs() is hiding the detail in Device >> space and only show the information in the FS space level. >> >> Actually, the similar idea is adopted in btrfs fi df. >> >> Example: >> # lvcreate -L +10G -n disk Group0 >> # mkfs.btrfs -f /dev/Group0/disk >> # dd if=/dev/zero of=/mnt/data bs=1M count=10240 >> dd: error writing ‘/mnt/data’: No space left on device >> 8163+0 records in >> 8162+0 records out >> 8558477312 bytes (8.6 GB) copied, 207.565 s, 41.2 MB/s >> # df /mnt >> Filesystem 1K-blocks Used Available Use% Mounted on >> /dev/mapper/Group0-disk 9428992 8382896 2048 100% /mnt >> # btrfs fi df /mnt >> Data, single: total=7.97GiB, used=7.97GiB >> System, DUP: total=8.00MiB, used=16.00KiB >> System, single: total=4.00MiB, used=0.00B >> Metadata, DUP: total=1.00GiB, used=8.41MiB >> Metadata, single: total=8.00MiB, used=0.00B >> GlobalReserve, single: total=16.00MiB, used=0.00B >> >> Now, the all space is used and got a ENOSPC error. >> But from the output of btrfs fi df. We can only find almost 9G (data >> 8G + metadata 1G) >> is used. The difference is that it show the DUP here to make >> it more clear. >> >> Wish this description is clear to you :). >> >> If there is anything confusing or sounds incorrect to you, please point it >> out. >> >> Thanx >> Yang >> >> > >> >> Example: >> >> # mkfs.btrfs -f /dev/vdf1 /dev/vdf2 -d raid1 >> >> # mount /dev/vdf1 /mnt >> >> # dd if=/dev/zero of=/mnt/zero bs=1M count=1000 >> >> # df -h /mnt >> >> Filesystem Size Used Avail Use% Mounted on >> >> /dev/vdf1 3.0G 1018M 1.3G 45% /mnt >> >> # btrfs fi show /dev/vdf1 >> >> Label: none uuid: f85d93dc-81f4-445d-91e5-6a5cd9563294 >> >> Total devices 2 FS bytes used 1001.53MiB >> >> devid 1 size 2.00GiB used 1.85GiB path /dev/vdf1 >> >> devid 2 size 4.00GiB used 1.83GiB path /dev/vdf2 >> >> a. df -h should report Size as 2GiB rather than as 3GiB. >> >> Because this is 2 device raid1, the limiting factor is devid 1 @2GiB. >> >> >> >> b. df -h should report Avail as 0.97GiB or less, rather than as 1.3GiB. >> >> 1.85 (the capacity of the allocated chunk) >> >> -1.018 (the file stored) >> >> +(2-1.85=0.15) (the residual capacity of the disks >> >> considering a raid1 fs) >> >> --------------- >> >> = 0.97 >> >> >> >> This patch drops the factor at all and calculate the size observable to >> >> user without considering which raid level the data is in and what's the >> >> size exactly in disk. 
>> >> After this patch applied: >> >> # mkfs.btrfs -f /dev/vdf1 /dev/vdf2 -d raid1 >> >> # mount /dev/vdf1 /mnt >> >> # dd if=/dev/zero of=/mnt/zero bs=1M count=1000 >> >> # df -h /mnt >> >> Filesystem Size Used Avail Use% Mounted on >> >> /dev/vdf1 2.0G 1.3G 713M 66% /mnt >> >> # df /mnt >> >> Filesystem 1K-blocks Used Available Use% Mounted on >> >> /dev/vdf1 2097152 1359424 729536 66% /mnt >> >> # btrfs fi show /dev/vdf1 >> >> Label: none uuid: e98c1321-645f-4457-b20d-4f41dc1cf2f4 >> >> Total devices 2 FS bytes used 1001.55MiB >> >> devid 1 size 2.00GiB used 1.85GiB path /dev/vdf1 >> >> devid 2 size 4.00GiB used 1.83GiB path /dev/vdf2 >> >> a). The @Size is 2G as we expected. >> >> b). @Available is 700M = 1.85G - 1.3G + (2G - 1.85G). >> >> c). @Used is changed to 1.3G rather than 1018M as above. Because >> >> this patch do not treat the free space in metadata chunk >> >> and system chunk as available to user. It's true, user can >> >> not use these space to store data, then it should not be >> >> thought as available. At the same time, it will make the >> >> @Used + @Available == @Size as possible to user. >> >> >> >> Signed-off-by: Dongsheng Yang <yangds.fnst@cn.fujitsu.com> >> >> --- >> >> Changelog: >> >> v1 -> v2: >> >> a. account the total_bytes in medadata chunk and >> >> system chunk as used to user. >> >> b. set data_stripes to the correct value in RAID0. >> >> >> >> fs/btrfs/extent-tree.c | 13 ++---------- >> >> fs/btrfs/super.c | 56 >> >> ++++++++++++++++++++++---------------------------- >> >> 2 files changed, 26 insertions(+), 43 deletions(-) >> >> >> >> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c >> >> index a84e00d..9954d60 100644 >> >> --- a/fs/btrfs/extent-tree.c >> >> +++ b/fs/btrfs/extent-tree.c >> >> @@ -8571,7 +8571,6 @@ static u64 >> >> __btrfs_get_ro_block_group_free_space(struct list_head *groups_list) >> >> { >> >> struct btrfs_block_group_cache *block_group; >> >> u64 free_bytes = 0; >> >> - int factor; >> >> >> >> list_for_each_entry(block_group, groups_list, list) { >> >> spin_lock(&block_group->lock); >> >> @@ -8581,16 +8580,8 @@ static u64 >> >> __btrfs_get_ro_block_group_free_space(struct list_head *groups_list) >> >> continue; >> >> } >> >> >> >> - if (block_group->flags & (BTRFS_BLOCK_GROUP_RAID1 | >> >> - BTRFS_BLOCK_GROUP_RAID10 | >> >> - BTRFS_BLOCK_GROUP_DUP)) >> >> - factor = 2; >> >> - else >> >> - factor = 1; >> >> - >> >> - free_bytes += (block_group->key.offset - >> >> - >> >> btrfs_block_group_used(&block_group->item)) * >> >> - factor; >> >> + free_bytes += block_group->key.offset - >> >> + btrfs_block_group_used(&block_group->item); >> >> >> >> spin_unlock(&block_group->lock); >> >> } >> >> diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c >> >> index 54bd91e..40f41a2 100644 >> >> --- a/fs/btrfs/super.c >> >> +++ b/fs/btrfs/super.c >> >> @@ -1641,6 +1641,8 @@ static int btrfs_calc_avail_data_space(struct >> >> btrfs_root *root, u64 *free_bytes) >> >> u64 used_space; >> >> u64 min_stripe_size; >> >> int min_stripes = 1, num_stripes = 1; >> >> + /* How many stripes used to store data, without considering >> >> mirrors. 
*/ >> >> + int data_stripes = 1; >> >> int i = 0, nr_devices; >> >> int ret; >> >> >> >> @@ -1657,12 +1659,15 @@ static int btrfs_calc_avail_data_space(struct >> >> btrfs_root *root, u64 *free_bytes) >> >> if (type & BTRFS_BLOCK_GROUP_RAID0) { >> >> min_stripes = 2; >> >> num_stripes = nr_devices; >> >> + data_stripes = num_stripes; >> >> } else if (type & BTRFS_BLOCK_GROUP_RAID1) { >> >> min_stripes = 2; >> >> num_stripes = 2; >> >> + data_stripes = 1; >> >> } else if (type & BTRFS_BLOCK_GROUP_RAID10) { >> >> min_stripes = 4; >> >> num_stripes = 4; >> >> + data_stripes = 2; >> >> } >> >> >> >> if (type & BTRFS_BLOCK_GROUP_DUP) >> >> @@ -1733,14 +1738,17 @@ static int btrfs_calc_avail_data_space(struct >> >> btrfs_root *root, u64 *free_bytes) >> >> i = nr_devices - 1; >> >> avail_space = 0; >> >> while (nr_devices >= min_stripes) { >> >> - if (num_stripes > nr_devices) >> >> + if (num_stripes > nr_devices) { >> >> num_stripes = nr_devices; >> >> + if (type & BTRFS_BLOCK_GROUP_RAID0) >> >> + data_stripes = num_stripes; >> >> + } >> >> >> >> if (devices_info[i].max_avail >= min_stripe_size) { >> >> int j; >> >> u64 alloc_size; >> >> >> >> - avail_space += devices_info[i].max_avail * >> >> num_stripes; >> >> + avail_space += devices_info[i].max_avail * >> >> data_stripes; >> >> alloc_size = devices_info[i].max_avail; >> >> for (j = i + 1 - num_stripes; j <= i; j++) >> >> devices_info[j].max_avail -= alloc_size; >> >> @@ -1772,15 +1780,13 @@ static int btrfs_calc_avail_data_space(struct >> >> btrfs_root *root, u64 *free_bytes) >> >> static int btrfs_statfs(struct dentry *dentry, struct kstatfs *buf) >> >> { >> >> struct btrfs_fs_info *fs_info = btrfs_sb(dentry->d_sb); >> >> - struct btrfs_super_block *disk_super = fs_info->super_copy; >> >> struct list_head *head = &fs_info->space_info; >> >> struct btrfs_space_info *found; >> >> u64 total_used = 0; >> >> + u64 total_alloc = 0; >> >> u64 total_free_data = 0; >> >> int bits = dentry->d_sb->s_blocksize_bits; >> >> __be32 *fsid = (__be32 *)fs_info->fsid; >> >> - unsigned factor = 1; >> >> - struct btrfs_block_rsv *block_rsv = &fs_info->global_block_rsv; >> >> int ret; >> >> >> >> /* >> >> @@ -1792,38 +1798,20 @@ static int btrfs_statfs(struct dentry *dentry, >> >> struct kstatfs *buf) >> >> rcu_read_lock(); >> >> list_for_each_entry_rcu(found, head, list) { >> >> if (found->flags & BTRFS_BLOCK_GROUP_DATA) { >> >> - int i; >> >> - >> >> - total_free_data += found->disk_total - >> >> found->disk_used; >> >> + total_free_data += found->total_bytes - >> >> found->bytes_used; >> >> total_free_data -= >> >> >> >> btrfs_account_ro_block_groups_free_space(found); >> >> - >> >> - for (i = 0; i < BTRFS_NR_RAID_TYPES; i++) { >> >> - if (!list_empty(&found->block_groups[i])) >> >> { >> >> - switch (i) { >> >> - case BTRFS_RAID_DUP: >> >> - case BTRFS_RAID_RAID1: >> >> - case BTRFS_RAID_RAID10: >> >> - factor = 2; >> >> - } >> >> - } >> >> - } >> >> + total_used += found->bytes_used; >> >> + } else { >> >> + /* For metadata and system, we treat the >> >> total_bytes as >> >> + * not available to file data. So show it as Used >> >> in df. 
>> >> + **/ >> >> + total_used += found->total_bytes; >> >> } >> >> - >> >> - total_used += found->disk_used; >> >> + total_alloc += found->total_bytes; >> >> } >> >> - >> >> rcu_read_unlock(); >> >> >> >> - buf->f_blocks = div_u64(btrfs_super_total_bytes(disk_super), >> >> factor); >> >> - buf->f_blocks >>= bits; >> >> - buf->f_bfree = buf->f_blocks - (div_u64(total_used, factor) >> >> >> bits); >> >> - >> >> - /* Account global block reserve as used, it's in logical size >> >> already */ >> >> - spin_lock(&block_rsv->lock); >> >> - buf->f_bfree -= block_rsv->size >> bits; >> >> - spin_unlock(&block_rsv->lock); >> >> - >> >> buf->f_bavail = total_free_data; >> >> ret = btrfs_calc_avail_data_space(fs_info->tree_root, >> >> &total_free_data); >> >> if (ret) { >> >> @@ -1831,8 +1819,12 @@ static int btrfs_statfs(struct dentry *dentry, >> >> struct kstatfs *buf) >> >> mutex_unlock(&fs_info->fs_devices->device_list_mutex); >> >> return ret; >> >> } >> >> - buf->f_bavail += div_u64(total_free_data, factor); >> >> + buf->f_bavail += total_free_data; >> >> buf->f_bavail = buf->f_bavail >> bits; >> >> + buf->f_blocks = total_alloc + total_free_data; >> >> + buf->f_blocks >>= bits; >> >> + buf->f_bfree = buf->f_blocks - (total_used >> bits); >> >> + >> >> mutex_unlock(&fs_info->chunk_mutex); >> >> mutex_unlock(&fs_info->fs_devices->device_list_mutex); >> >> >> >> >> > >> > >> > -- >> > gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it> >> > Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5 >> > -- >> > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" >> > in >> > the body of a message to majordomo@vger.kernel.org >> > More majordomo info at http://vger.kernel.org/majordomo-info.html >> -- >> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in >> the body of a message to majordomo@vger.kernel.org >> More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
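In outline, the accounting the v2 patch performs in btrfs_statfs() reduces to the arithmetic below. This is only a simplified sketch to make the bookkeeping easier to follow; the struct and function names (space_info_sample, sketch_statfs, unalloc_data) are invented for illustration, and the sketch ignores the locking and the read-only block-group adjustment that the real code applies.

struct space_info_sample {
	int is_data;                     /* BTRFS_BLOCK_GROUP_DATA chunks? */
	unsigned long long total_bytes;  /* chunk space allocated, FS-level view */
	unsigned long long bytes_used;   /* bytes actually used inside those chunks */
};

/*
 * unalloc_data stands in for what btrfs_calc_avail_data_space() returns:
 * device space not yet allocated to any chunk, already reduced to the
 * FS-level view (scaled by data_stripes rather than num_stripes).
 */
static void sketch_statfs(const struct space_info_sample *info, int n,
			  unsigned long long unalloc_data,
			  unsigned long long *size,
			  unsigned long long *used,
			  unsigned long long *avail)
{
	unsigned long long total_alloc = 0, total_used = 0, free_data = 0;
	int i;

	for (i = 0; i < n; i++) {
		if (info[i].is_data) {
			free_data  += info[i].total_bytes - info[i].bytes_used;
			total_used += info[i].bytes_used;
		} else {
			/* metadata and system chunks are counted as used in full */
			total_used += info[i].total_bytes;
		}
		total_alloc += info[i].total_bytes;
	}

	*avail = free_data + unalloc_data;    /* becomes f_bavail */
	*size  = total_alloc + unalloc_data;  /* becomes f_blocks */
	*used  = total_used;                  /* f_bfree = *size - *used */
}

Plugging in the raid1 example from the changelog (about 1.85G of chunks allocated, 1.3G of that counted as used, and roughly 0.15G of device space still unallocated on the limiting device) gives @size = 1.85 + 0.15 = 2.0G and @available = (1.85 - 1.3) + 0.15 = 0.7G, which is the df output quoted above.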
On 12/14/2014 10:32 PM, Grzegorz Kowal wrote: > Hi, > > I see another problem on 1 device fs after applying this patch. I set > up 30GB system partition: > > in gdisk key 'i' > Partition size: 62914560 sectors (30.0 GiB) -> 62914560 * 512 = > 32212254720 = 30.0GiB > > before applying the patch df -B1 shows > > Filesystem 1B-blocks Used Available Use% Mounted on > /dev/sda3 32212254720 18149384192 12821815296 59% / > > The total size matches exactly the one reported by gdisk. > > After applying the patch df -B1 shows > > Filesystem 1B-blocks Used Available Use% Mounted on > /dev/sda3 29761732608 17037860864 12723871744 58% / > > The total size is 29761732608 bytes now, which is 2337 MiB (2.28GB) > smaller then the partition size. Hi Grzegorz, this is very similar with the question discussed in my last mail. I believe the 2.28GB is used for metadata in DUP level. That said, there is a space of almost 4.56GB (2.28*2GB) used for metadata and the metadata is in DUP level. Then user can only see a 2.28GB from the FS space level. 2.28G -----|---------------- Filesystem space (27.72G) | \ | \ | \ | 4.56GB \ -----|----|----------------- Device space (30.0G) DUP | The main idea in new btrfs_statfs() is to consider the space information in a FS level rather than the device space level. Then the @size in df is the size in FS space level. As we all agree to report @size as 5G in the case of "mkfs.btrfs /dev/vdb (5G) /dev/vdc (5G) -d raid1", this means we are agreed to think the df in a FS space without showing the detail in device space. In this time, the total size in FS space is 27.72G, so the @size in df is 27.72G. Does it make sense to you? As there are 2 complaints for the same change of @size in df, I have to say it maybe not so easy to understand. Anyone have some suggestion about it? Thanx Yang > > IMHO the total size should correspond exactly to the partition size, > at least in the case of fs on one device, and anything used by the > file system and user should go to the used space. > > Am I missing something here? > > Thanks, > Grzegorz > > On Sun, Dec 14, 2014 at 12:27 PM, Grzegorz Kowal > <custos.mentis@gmail.com> wrote: >> Hi, >> >> I see another problem on 1 device fs after applying this patch. I set up >> 30GB system partition: >> >> in gdisk key 'i' >> Partition size: 62914560 sectors (30.0 GiB) -> 62914560 * 512 = 32212254720 >> = 30.0GiB >> >> before applying the patch df -B1 shows >> >> Filesystem 1B-blocks Used Available Use% Mounted on >> /dev/sda3 32212254720 18149384192 12821815296 59% / >> >> The total size matches exactly the one reported by gdisk. >> >> After applying the patch df -B1 shows >> >> Filesystem 1B-blocks Used Available Use% Mounted on >> /dev/sda3 29761732608 17037860864 12723871744 58% / >> >> The total size is 29761732608 bytes now, which is 2337 MiB (2.28GB) smaller >> then the partition size. >> >> IMHO the total size should correspond exactly to the partition size, at >> least in the case of fs on one device, and anything used by the file system >> and user should go to the used space. >> >> Am I missing something here? >> >> Thanks, >> Grzegorz >> >> >> >> >> >> On Sun, Dec 14, 2014 at 9:29 AM, Dongsheng Yang <dongsheng081251@gmail.com> >> wrote: >>> On Sat, Dec 13, 2014 at 3:25 AM, Goffredo Baroncelli <kreijack@inwind.it> >>> wrote: >>>> On 12/11/2014 09:31 AM, Dongsheng Yang wrote: >>>>> When function btrfs_statfs() calculate the tatol size of fs, it is >>>>> calculating >>>>> the total size of disks and then dividing it by a factor. 
But in some >>>>> usecase, >>>>> the result is not good to user. >>>> >>>> I Yang; during my test I discovered an error: >>>> >>>> $ sudo lvcreate -L +10G -n disk vgtest >>>> $ sudo /home/ghigo/mkfs.btrfs -f /dev/vgtest/disk >>>> Btrfs v3.17 >>>> See http://btrfs.wiki.kernel.org for more information. >>>> >>>> Turning ON incompat feature 'extref': increased hardlink limit per file >>>> to 65536 >>>> fs created label (null) on /dev/vgtest/disk >>>> nodesize 16384 leafsize 16384 sectorsize 4096 size 10.00GiB >>>> $ sudo mount /dev/vgtest/disk /mnt/btrfs1/ >>>> $ df /mnt/btrfs1/ >>>> Filesystem 1K-blocks Used Available Use% Mounted on >>>> /dev/mapper/vgtest-disk 9428992 1069312 8359680 12% /mnt/btrfs1 >>>> $ sudo ~/btrfs fi df /mnt/btrfs1/ >>>> Data, single: total=8.00MiB, used=256.00KiB >>>> System, DUP: total=8.00MiB, used=16.00KiB >>>> System, single: total=4.00MiB, used=0.00B >>>> Metadata, DUP: total=1.00GiB, used=112.00KiB >>>> Metadata, single: total=8.00MiB, used=0.00B >>>> GlobalReserve, single: total=16.00MiB, used=0.00B >>>> >>>> What seems me strange is the 9428992KiB of total disk space as reported >>>> by df. About 600MiB are missing ! >>>> >>>> Without your patch, I got: >>>> >>>> $ df /mnt/btrfs1/ >>>> Filesystem 1K-blocks Used Available Use% Mounted on >>>> /dev/mapper/vgtest-disk 10485760 16896 8359680 1% /mnt/btrfs1 >>>> >>> Hi Goffredo, thanx for pointing this. Let me try to show the reason of it. >>> >>> In this case you provided here, the metadata is 1G in DUP. It means >>> we spent 2G device space for it. But from the view point of user, they >>> can only see 9G size for this fs. It's show as below: >>> 1G >>> -----|---------------- Filesystem space (9G) >>> | \ >>> | \ >>> | \ >>> | 2G \ >>> -----|----|----------------- Device space (10G) >>> DUP | >>> >>> The main idea about the new btrfs_statfs() is hiding the detail in Device >>> space and only show the information in the FS space level. >>> >>> Actually, the similar idea is adopted in btrfs fi df. >>> >>> Example: >>> # lvcreate -L +10G -n disk Group0 >>> # mkfs.btrfs -f /dev/Group0/disk >>> # dd if=/dev/zero of=/mnt/data bs=1M count=10240 >>> dd: error writing ‘/mnt/data’: No space left on device >>> 8163+0 records in >>> 8162+0 records out >>> 8558477312 bytes (8.6 GB) copied, 207.565 s, 41.2 MB/s >>> # df /mnt >>> Filesystem 1K-blocks Used Available Use% Mounted on >>> /dev/mapper/Group0-disk 9428992 8382896 2048 100% /mnt >>> # btrfs fi df /mnt >>> Data, single: total=7.97GiB, used=7.97GiB >>> System, DUP: total=8.00MiB, used=16.00KiB >>> System, single: total=4.00MiB, used=0.00B >>> Metadata, DUP: total=1.00GiB, used=8.41MiB >>> Metadata, single: total=8.00MiB, used=0.00B >>> GlobalReserve, single: total=16.00MiB, used=0.00B >>> >>> Now, the all space is used and got a ENOSPC error. >>> But from the output of btrfs fi df. We can only find almost 9G (data >>> 8G + metadata 1G) >>> is used. The difference is that it show the DUP here to make >>> it more clear. >>> >>> Wish this description is clear to you :). >>> >>> If there is anything confusing or sounds incorrect to you, please point it >>> out. 
>>> >>> Thanx >>> Yang >>> >>>>> Example: >>>>> # mkfs.btrfs -f /dev/vdf1 /dev/vdf2 -d raid1 >>>>> # mount /dev/vdf1 /mnt >>>>> # dd if=/dev/zero of=/mnt/zero bs=1M count=1000 >>>>> # df -h /mnt >>>>> Filesystem Size Used Avail Use% Mounted on >>>>> /dev/vdf1 3.0G 1018M 1.3G 45% /mnt >>>>> # btrfs fi show /dev/vdf1 >>>>> Label: none uuid: f85d93dc-81f4-445d-91e5-6a5cd9563294 >>>>> Total devices 2 FS bytes used 1001.53MiB >>>>> devid 1 size 2.00GiB used 1.85GiB path /dev/vdf1 >>>>> devid 2 size 4.00GiB used 1.83GiB path /dev/vdf2 >>>>> a. df -h should report Size as 2GiB rather than as 3GiB. >>>>> Because this is 2 device raid1, the limiting factor is devid 1 @2GiB. >>>>> >>>>> b. df -h should report Avail as 0.97GiB or less, rather than as 1.3GiB. >>>>> 1.85 (the capacity of the allocated chunk) >>>>> -1.018 (the file stored) >>>>> +(2-1.85=0.15) (the residual capacity of the disks >>>>> considering a raid1 fs) >>>>> --------------- >>>>> = 0.97 >>>>> >>>>> This patch drops the factor at all and calculate the size observable to >>>>> user without considering which raid level the data is in and what's the >>>>> size exactly in disk. >>>>> After this patch applied: >>>>> # mkfs.btrfs -f /dev/vdf1 /dev/vdf2 -d raid1 >>>>> # mount /dev/vdf1 /mnt >>>>> # dd if=/dev/zero of=/mnt/zero bs=1M count=1000 >>>>> # df -h /mnt >>>>> Filesystem Size Used Avail Use% Mounted on >>>>> /dev/vdf1 2.0G 1.3G 713M 66% /mnt >>>>> # df /mnt >>>>> Filesystem 1K-blocks Used Available Use% Mounted on >>>>> /dev/vdf1 2097152 1359424 729536 66% /mnt >>>>> # btrfs fi show /dev/vdf1 >>>>> Label: none uuid: e98c1321-645f-4457-b20d-4f41dc1cf2f4 >>>>> Total devices 2 FS bytes used 1001.55MiB >>>>> devid 1 size 2.00GiB used 1.85GiB path /dev/vdf1 >>>>> devid 2 size 4.00GiB used 1.83GiB path /dev/vdf2 >>>>> a). The @Size is 2G as we expected. >>>>> b). @Available is 700M = 1.85G - 1.3G + (2G - 1.85G). >>>>> c). @Used is changed to 1.3G rather than 1018M as above. Because >>>>> this patch do not treat the free space in metadata chunk >>>>> and system chunk as available to user. It's true, user can >>>>> not use these space to store data, then it should not be >>>>> thought as available. At the same time, it will make the >>>>> @Used + @Available == @Size as possible to user. >>>>> >>>>> Signed-off-by: Dongsheng Yang <yangds.fnst@cn.fujitsu.com> >>>>> --- >>>>> Changelog: >>>>> v1 -> v2: >>>>> a. account the total_bytes in medadata chunk and >>>>> system chunk as used to user. >>>>> b. set data_stripes to the correct value in RAID0. 
>>>>> >>>>> fs/btrfs/extent-tree.c | 13 ++---------- >>>>> fs/btrfs/super.c | 56 >>>>> ++++++++++++++++++++++---------------------------- >>>>> 2 files changed, 26 insertions(+), 43 deletions(-) >>>>> >>>>> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c >>>>> index a84e00d..9954d60 100644 >>>>> --- a/fs/btrfs/extent-tree.c >>>>> +++ b/fs/btrfs/extent-tree.c >>>>> @@ -8571,7 +8571,6 @@ static u64 >>>>> __btrfs_get_ro_block_group_free_space(struct list_head *groups_list) >>>>> { >>>>> struct btrfs_block_group_cache *block_group; >>>>> u64 free_bytes = 0; >>>>> - int factor; >>>>> >>>>> list_for_each_entry(block_group, groups_list, list) { >>>>> spin_lock(&block_group->lock); >>>>> @@ -8581,16 +8580,8 @@ static u64 >>>>> __btrfs_get_ro_block_group_free_space(struct list_head *groups_list) >>>>> continue; >>>>> } >>>>> >>>>> - if (block_group->flags & (BTRFS_BLOCK_GROUP_RAID1 | >>>>> - BTRFS_BLOCK_GROUP_RAID10 | >>>>> - BTRFS_BLOCK_GROUP_DUP)) >>>>> - factor = 2; >>>>> - else >>>>> - factor = 1; >>>>> - >>>>> - free_bytes += (block_group->key.offset - >>>>> - >>>>> btrfs_block_group_used(&block_group->item)) * >>>>> - factor; >>>>> + free_bytes += block_group->key.offset - >>>>> + btrfs_block_group_used(&block_group->item); >>>>> >>>>> spin_unlock(&block_group->lock); >>>>> } >>>>> diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c >>>>> index 54bd91e..40f41a2 100644 >>>>> --- a/fs/btrfs/super.c >>>>> +++ b/fs/btrfs/super.c >>>>> @@ -1641,6 +1641,8 @@ static int btrfs_calc_avail_data_space(struct >>>>> btrfs_root *root, u64 *free_bytes) >>>>> u64 used_space; >>>>> u64 min_stripe_size; >>>>> int min_stripes = 1, num_stripes = 1; >>>>> + /* How many stripes used to store data, without considering >>>>> mirrors. */ >>>>> + int data_stripes = 1; >>>>> int i = 0, nr_devices; >>>>> int ret; >>>>> >>>>> @@ -1657,12 +1659,15 @@ static int btrfs_calc_avail_data_space(struct >>>>> btrfs_root *root, u64 *free_bytes) >>>>> if (type & BTRFS_BLOCK_GROUP_RAID0) { >>>>> min_stripes = 2; >>>>> num_stripes = nr_devices; >>>>> + data_stripes = num_stripes; >>>>> } else if (type & BTRFS_BLOCK_GROUP_RAID1) { >>>>> min_stripes = 2; >>>>> num_stripes = 2; >>>>> + data_stripes = 1; >>>>> } else if (type & BTRFS_BLOCK_GROUP_RAID10) { >>>>> min_stripes = 4; >>>>> num_stripes = 4; >>>>> + data_stripes = 2; >>>>> } >>>>> >>>>> if (type & BTRFS_BLOCK_GROUP_DUP) >>>>> @@ -1733,14 +1738,17 @@ static int btrfs_calc_avail_data_space(struct >>>>> btrfs_root *root, u64 *free_bytes) >>>>> i = nr_devices - 1; >>>>> avail_space = 0; >>>>> while (nr_devices >= min_stripes) { >>>>> - if (num_stripes > nr_devices) >>>>> + if (num_stripes > nr_devices) { >>>>> num_stripes = nr_devices; >>>>> + if (type & BTRFS_BLOCK_GROUP_RAID0) >>>>> + data_stripes = num_stripes; >>>>> + } >>>>> >>>>> if (devices_info[i].max_avail >= min_stripe_size) { >>>>> int j; >>>>> u64 alloc_size; >>>>> >>>>> - avail_space += devices_info[i].max_avail * >>>>> num_stripes; >>>>> + avail_space += devices_info[i].max_avail * >>>>> data_stripes; >>>>> alloc_size = devices_info[i].max_avail; >>>>> for (j = i + 1 - num_stripes; j <= i; j++) >>>>> devices_info[j].max_avail -= alloc_size; >>>>> @@ -1772,15 +1780,13 @@ static int btrfs_calc_avail_data_space(struct >>>>> btrfs_root *root, u64 *free_bytes) >>>>> static int btrfs_statfs(struct dentry *dentry, struct kstatfs *buf) >>>>> { >>>>> struct btrfs_fs_info *fs_info = btrfs_sb(dentry->d_sb); >>>>> - struct btrfs_super_block *disk_super = fs_info->super_copy; >>>>> struct list_head *head = 
&fs_info->space_info; >>>>> struct btrfs_space_info *found; >>>>> u64 total_used = 0; >>>>> + u64 total_alloc = 0; >>>>> u64 total_free_data = 0; >>>>> int bits = dentry->d_sb->s_blocksize_bits; >>>>> __be32 *fsid = (__be32 *)fs_info->fsid; >>>>> - unsigned factor = 1; >>>>> - struct btrfs_block_rsv *block_rsv = &fs_info->global_block_rsv; >>>>> int ret; >>>>> >>>>> /* >>>>> @@ -1792,38 +1798,20 @@ static int btrfs_statfs(struct dentry *dentry, >>>>> struct kstatfs *buf) >>>>> rcu_read_lock(); >>>>> list_for_each_entry_rcu(found, head, list) { >>>>> if (found->flags & BTRFS_BLOCK_GROUP_DATA) { >>>>> - int i; >>>>> - >>>>> - total_free_data += found->disk_total - >>>>> found->disk_used; >>>>> + total_free_data += found->total_bytes - >>>>> found->bytes_used; >>>>> total_free_data -= >>>>> >>>>> btrfs_account_ro_block_groups_free_space(found); >>>>> - >>>>> - for (i = 0; i < BTRFS_NR_RAID_TYPES; i++) { >>>>> - if (!list_empty(&found->block_groups[i])) >>>>> { >>>>> - switch (i) { >>>>> - case BTRFS_RAID_DUP: >>>>> - case BTRFS_RAID_RAID1: >>>>> - case BTRFS_RAID_RAID10: >>>>> - factor = 2; >>>>> - } >>>>> - } >>>>> - } >>>>> + total_used += found->bytes_used; >>>>> + } else { >>>>> + /* For metadata and system, we treat the >>>>> total_bytes as >>>>> + * not available to file data. So show it as Used >>>>> in df. >>>>> + **/ >>>>> + total_used += found->total_bytes; >>>>> } >>>>> - >>>>> - total_used += found->disk_used; >>>>> + total_alloc += found->total_bytes; >>>>> } >>>>> - >>>>> rcu_read_unlock(); >>>>> >>>>> - buf->f_blocks = div_u64(btrfs_super_total_bytes(disk_super), >>>>> factor); >>>>> - buf->f_blocks >>= bits; >>>>> - buf->f_bfree = buf->f_blocks - (div_u64(total_used, factor) >> >>>>> bits); >>>>> - >>>>> - /* Account global block reserve as used, it's in logical size >>>>> already */ >>>>> - spin_lock(&block_rsv->lock); >>>>> - buf->f_bfree -= block_rsv->size >> bits; >>>>> - spin_unlock(&block_rsv->lock); >>>>> - >>>>> buf->f_bavail = total_free_data; >>>>> ret = btrfs_calc_avail_data_space(fs_info->tree_root, >>>>> &total_free_data); >>>>> if (ret) { >>>>> @@ -1831,8 +1819,12 @@ static int btrfs_statfs(struct dentry *dentry, >>>>> struct kstatfs *buf) >>>>> mutex_unlock(&fs_info->fs_devices->device_list_mutex); >>>>> return ret; >>>>> } >>>>> - buf->f_bavail += div_u64(total_free_data, factor); >>>>> + buf->f_bavail += total_free_data; >>>>> buf->f_bavail = buf->f_bavail >> bits; >>>>> + buf->f_blocks = total_alloc + total_free_data; >>>>> + buf->f_blocks >>= bits; >>>>> + buf->f_bfree = buf->f_blocks - (total_used >> bits); >>>>> + >>>>> mutex_unlock(&fs_info->chunk_mutex); >>>>> mutex_unlock(&fs_info->fs_devices->device_list_mutex); >>>>> >>>>> >>>> >>>> -- >>>> gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it> >>>> Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5 >>>> -- >>>> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" >>>> in >>>> the body of a message to majordomo@vger.kernel.org >>>> More majordomo info at http://vger.kernel.org/majordomo-info.html >>> -- >>> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in >>> the body of a message to majordomo@vger.kernel.org >>> More majordomo info at http://vger.kernel.org/majordomo-info.html > -- > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > . 
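To put numbers on the two reports discussed in this mail (a rough cross-check; Grzegorz's chunk layout is not quoted, so the DUP figure for his case is inferred the same way Yang infers it above): for the 10 GiB volume in the quoted test, 10485760 KiB - 9428992 KiB = 1056768 KiB = 1032 MiB, which is exactly the second DUP copy of the 1.00 GiB metadata chunk plus the second copy of the 8 MiB DUP system chunk, i.e. the device space the FS-level view no longer counts. For the 30.0 GiB root partition, 32212254720 - 29761732608 = 2450522112 bytes, about 2.28 GiB, consistent with roughly 2.28 GiB of DUP metadata/system chunks occupying about 4.56 GiB of the device, as in the diagram above.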
On 12/14/2014 05:21 PM, Dongsheng Yang wrote: > > Does it make sense to you? I understood what you were saying but it didn't make sense to me... > As there are 2 complaints for the same change of @size in df, I have to > say it maybe not so easy to understand. > Anyone have some suggestion about it? ABSTRACT:: Stop being clever, just give the raw values. That's what you should be doing anyway. There are no other correct values to give that doesn't blow someone's paradigm somewhere. ITEM #1 :: In my humble opinion (ha ha) the size column should never change unless you add or remove actual storage. It should approximate the raw block size of the device on initial creation, and it should adjust to the size changes that happen when you semantically resize the filesystem with e.g. btrfs resize. RATIONALE :: (1a) The actual definition of size for df is not well defined, so the best definition of size is the one that produces the "most useful information". IMHO the number I've always wanted to see from df is the value SIZE I would supply in order to safely dd the entire filesystem from one place to another. That number would be, on a single drive filesystem, "total_bytes" from the superblock as scaled by the necessary block size etc. ITEM #2 :: The idea of "blocks used" is iffy as well. In particular I don't care how or why those blocks have been used. And almost all filesystems have this same issue. If I write a 1GiB file to ext2 my blocks used doesn't go down by exactly 1GiB, it goes down by 1GiB plus all the indirect indexing blocks needed to reference that 1GiB. RATIONALE :: (2a) "Blocks Used" is not, and wasn't particularly meant to be "Blocks Used By Data Alone". (2b) Many filesystems have, historically, per-subtracted the fixed overhead of their layout such as removing inode table regions. But the it became "stupid" and "anti-helpful" but remained un-redressed when advancements were made that let data be stored directly in the inodes for small files. So now you can technically fit more data in an EXT* filesystem than you could fit in SIZE*BLOCKSIZE bytes. Even before compression. (2c) The "fixed" blocks-used size of BTRFS is technically sizeof(SuperBlock)*num_supers. Everything else is up for grabs. Some is, indeed, pre-grabbed but so what? ITEM #3 :: The idea of Blocks Available should be Blocks - BlocksUsed and _nothing_ _more_. RATIONALE :: (3a) Just like Blocks Used isn't just about blocks used for data, Blocks Available isn't about how much more user data can be stuffed into the filesystem (3b) Any attempt to treat Blocks Available as some sort of guarantee will be meaningless for some significant number of people and usages. ITEM #4 :: Blocks available to unprivileged users is pretty "iffy" since unprivileged users cannot write to the filesystem. This datum doesn't have a "plain reading". I'd start with filesystem total blocks, then subtract the total blocks used by all trees nodes in all trees. (e.g. Nodes * 0x1000 or whatever node size is) then shave off the N superblocks, then subtract the number of blocks already allocated in data extents. And you're done. ITEM #5 :: A plain reading of the comments in the code cry out "stop trying to help me make predictions". Just serve up the nice raw numbers. RATIONALE :: (5a) We have _all_ suffered under the merciless tyranny of some system or another that was being "too helpful to be useful". Once a block of code tries to "help you" and enforces that help, then you are doomed to suffer under that help. See "Clippy". 
(5b) The code has a plain reading. It doesn't say anything about how things will be used. "Available" is _available_. If you have chosen to use it at a 2x rate (e.g. 200%, e.g. RAID1) or a 1.25 rate (five media in RAID5) or an N+2/N rate (e.g. RAID6), or a 4x rate (RAID10)... well that was your choice. (5c) If your metadata rate is different than your data rate, then there is _absolutely_ no way to _programatically_ predict how the data _might_ be used, and this is the _default_ usage model. Literally the hardest model is the normal model. There is actually no predictive solution. So why are we putting in predictions at all when they _must_ be wrong. struct statfs { __SWORD_TYPE f_type; /* type of filesystem (see below) */ __SWORD_TYPE f_bsize; /* optimal transfer block size */ fsblkcnt_t f_blocks; /* total data blocks in filesystem */ fsblkcnt_t f_bfree; /* free blocks in fs */ fsblkcnt_t f_bavail; /* free blocks available to unprivileged user */ fsfilcnt_t f_files; /* total file nodes in filesystem */ fsfilcnt_t f_ffree; /* free file nodes in fs */ fsid_t f_fsid; /* filesystem id */ __SWORD_TYPE f_namelen; /* maximum length of filenames */ __SWORD_TYPE f_frsize; /* fragment size (since Linux 2.6) */ __SWORD_TYPE f_spare[5]; }; The datum provided is _supposed_ to be simple. "total blocks in file system" "free blocks in file system". "Blocks available to unprivileged users" is the only tricky one. Id limit that to all unallocated blocks inside data extents and all blocks not part of any extent. "Unprivileged users" cannot, after all, actually allocate blocks in the various trees even if the system ends up doing it for them. Fortunately (or hopefully) that's not the datum /bin/df usually returns. SUMMARY :: No fudge factor or backwards-reasoning is going to satisfy more than half the people. Trying to gestimate the users intentions is impossible. Like all filesystems except the most simplistic ones (vfat etc) or read-only ones (squashfs), getting any answer "near perfect" is not likely, nor particularly helpful. It's really _not_ the implementors job to guess at how the user is going to use the system. Just as EXT before us didn't bother trying to put in a fudge factor that guessed what percentage of files would end up needing indirect blocks, we shouldn't be in the business of trying to back-figure cost-of-storage. The raw numbers are _more_ useful in many circumstances. The raw blocks used, for example, will tell me what I need to know for thin provisioning on other media, for example. Literally nothing else exposes that sort of information. Just put a prominent notice that the user needs to remember to factor their choice of redundancy et al into the numbers. Noticing that my RAID1 costs two 1k blocks to store 1K of data is _their_ _job_ when it comes down to it. That's because we are giving "them" insight to the filessytem _and_ the storage management. Same for the benefits of compression etc. We can recognize that this is "harder" than some other filesystems because, frankly, it is... Once we decided to get into the business of fusing the file system with the storage management system we _accepted_ that burden of difficulty. Users who never go beyond core usage (single data plus "overhead" from DUP metadata) will still get the same numbers for their simple case. 
People who start doing RAID5+1 or whatever (assuming our implementation gets that far) across 22 media are just going to have to remember to do the math to figure their 10% overhead cost when looking at "blocks available" just like I had to do my S=N*log(N) estimates while laying out Oracle table spaces on my sun stations back in the eighties. Any "clever" answer to any one model will be wrong for _every_ _other_ model. IN MY HUMBLE OPINION, of course... 8-) -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
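For reference while reading the struct above, the three df columns being argued about are derived mechanically from those fields. A minimal userspace sketch of that mapping follows (it uses the portable statvfs(3) interface rather than the raw statfs(2) struct quoted above, and it illustrates how df consumes the numbers; it is not df's actual source):

#include <stdio.h>
#include <sys/statvfs.h>

int main(int argc, char **argv)
{
	struct statvfs s;
	unsigned long long bs, size, free_root, avail_user;

	if (argc < 2) {
		fprintf(stderr, "usage: %s <mountpoint>\n", argv[0]);
		return 1;
	}
	if (statvfs(argv[1], &s) != 0) {
		perror("statvfs");
		return 1;
	}

	bs         = s.f_frsize ? s.f_frsize : s.f_bsize;
	size       = (unsigned long long)s.f_blocks * bs; /* df "Size"                */
	free_root  = (unsigned long long)s.f_bfree  * bs; /* free, including reserved */
	avail_user = (unsigned long long)s.f_bavail * bs; /* df "Avail"               */

	/* df reports Used as Size minus f_bfree, not Size minus f_bavail */
	printf("size=%llu used=%llu avail=%llu\n",
	       size, size - free_root, avail_user);
	return 0;
}

Whatever btrfs_statfs() chooses to put in f_blocks, f_bfree and f_bavail is therefore all df ever sees; the Used column is not reported by the kernel at all but inferred as f_blocks - f_bfree, which is one reason @Used + @Available == @Size only holds when f_bfree and f_bavail agree.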
On 12/14/2014 10:06 PM, Robert White wrote:
> On 12/14/2014 05:21 PM, Dongsheng Yang wrote:
>> Anyone have some suggestion about it?
> (... strong advocacy for raw numbers...)

Concise Example to attempt to be clearer:

/dev/sda == 1TiB
/dev/sdb == 2TiB
/dev/sdc == 3TiB
/dev/sdd == 3TiB

mkfs.btrfs /dev/sd{a..d} -d raid0
mount /dev/sda /mnt

Now compare ::

#!/bin/bash
dd if=/dev/urandom of=/mnt/example bs=1G

vs

#!/bin/bash
typeset -i counter
for ((counter=0;;counter++)); do
  dd if=/dev/urandom of=/mnt/example$counter bs=44 count=1
done

vs

#!/bin/bash
typeset -i counter
for ((counter=0;;counter++)); do
  dd if=/dev/urandom of=/mnt/example$counter bs=44 count=1
done &
dd if=/dev/urandom of=/mnt/example bs=1G

Now repeat the above 3 models for
mkfs.btrfs /dev/sd{a..d} -d raid5

......

As you watch these six examples evolve you can ponder the ultimate futility of doing adaptive prediction within statfs().

Then go back and change the metadata from the default of RAID1 to RAID5 or RAID6 or RAID10.

Then go back and try

mkfs.btrfs /dev/sd{a..d} -d raid10

then balance when the big file runs out of space, then resume the big file with oflag=append

......

Unlike _all_ our predecessors, we are active at both the semantic file storage level _and_ the physical media management level.

None of the prior filesystems match this new ground exactly.

The only real option is to expose the raw numbers and then tell people the corner cases.

Absolutely unavailable blocks, such as the massive waste of 5TiB in the above sized media if raid10 were selected for both data and metadata, would be subtracted from size if and only if it's _impossible_ for it to be accessed by this sort of restriction. But even in this case, the correct answer for size is 4TiB because that exactly answers "how big is this filesystem".

It might be worth having a "dev_item.bytes_excluded" or unusable or whatever to account for the difference between total_bytes and bytes_used and the implicit bytes available. This would account for the 0,1,2,2 TiB that a raid10 of the example sizes could never reach in the current geometry. I'm betting that this sort of number also shows up as some number of sectors in any filesystem that has an odd tidbit of size up at the top where no structure is ever going to fit. That's just a feature of the way disks use GB instead of GiB and msdos style partitions love the number 63.

So resize sets the size. Geometry limitations may reduce the effective size by some, or a _lot_, but then the used-vs-available should _not_ try to correct for whatever geometry is in use. Even when it might be simple because if it does it well in the simple cases like raid10/raid10, it would have to botch it up on the hard cases.

-- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 12/15/2014 03:49 PM, Robert White wrote: > On 12/14/2014 10:06 PM, Robert White wrote: >> On 12/14/2014 05:21 PM, Dongsheng Yang wrote: >>> Anyone have some suggestion about it? >> (... strong advocacy for raw numbers...) Hi Robert, thanx for your so detailed reply. You are proposing to report the raw numbers in df command, right? Let's compare the space information in FS level and Device level. Example: /dev/sda == 1TiB /dev/sdb == 2TiB mkfs.btrfs /dev/sda /dev/sdb -d raid1 (1). If we report the raw numbers in df command, we will get the result of @size=3T, @used=0 @available=3T. It's not a bad idea until now, as you said user can consider the raid when they are using the fs. Then if we fill 1T data into it. we will get @size=3, @used=2T, @avalable=1T. And at this moment, we will get ENOSPC when writing any more data. It's unacceptable. Why you tell me there is 1T space available, but I can't write one byte into it? (2). Actually, there was an elder btrfs_statfs(), it reported the raw numbers to user. To solve the problem mentioned in (1), we need report the space information in the FS level. Current btrfs_statfs() is designed like this, but not working in any cases. My new btrfs_statfs() here is following this design and implementing it to show a *better* output to user. Thanx Yang > > Concise Example to attempt to be clearer: > > /dev/sda == 1TiB > /dev/sdb == 2TiB > /dev/sdc == 3TiB > /dev/sdd == 3TiB > > mkfs.btrfs /dev/sd{a..d} -d raid0 > mount /dev/sda /mnt > > Now compare :: > > #!/bin/bash > dd if=/dev/urandom of=/mnt/example bs=1G > > vs > > #!/bin/bash > typeset -i counter > for ((counter=0;;counter++)); do > dd if=/dev/urandom of=/mnt/example$conter bs=44 count=1 > done > > vs > > #!/bin/bash > typeset -i counter > for ((counter=0;;counter++)); do > dd if=/dev/urandom of=/mnt/example$conter bs=44 count=1 > done & > dd if=/dev/urandom of=/mnt/example bs=1G > > Now repeat the above 3 models for > mkfs.btrfs /dev/sd{a..d} -d raid5 > > > ...... > > As you watch these six examples evolve you can ponder the ultimate > futility of doing adaptive prediction within statfs(). > > Then go back and change the metadata from the default of RAID1 to > RAID5 or RAID6 or RAID10. > > Then go back and try > > mkfs.btrfs /dev/sd{a..d} -d raid10 > > then balance when the big file runs out of space, then resume the big > file with oflag=append > > ...... > > Unlike _all_ our predecessors, we are active at both the semantic file > storage level _and_ the physical media management level. > > None of the prior filesystems match this new ground exactly. > > The only real option is to expose the raw numbers and then tell people > the corner cases. > > Absolutely unavailable blocks, such as the massive waste of 5TiB in > the above sized media if raid10 were selected for both data and > metadata would be subtracted from size if and only if it's > _impossible_ for it to be accessed by this sort of restriction. But > even in this case, the correct answer for size is 4TiB because that > exactly answers "how big is this filesystem". > > It might be worth having a "dev_item.bytes_excluded" or unusable or > whatever to account for the difference between total_bytes and > bytes_used and the implicit bytes available. This would account for > the 0,1,2,2 TiB that a raid10 of the example sizes could never reach > in the current geometry. 
I'm betting that this sort of number also > shows up as some number of sectors in any filesystem that has an odd > tidbit of size up at the top where no structure is ever gong to fit. > That's just a feature of the way disks use GB instead of GiB and msdos > style partitions love the number 63. > > So resize sets the size. Geometry limitations may reduce the effective > size by some, or a _lot_, but then the used-vs-available should _not_ > try to correct for whatever geometry is in use. Even when it might be > simple because if it does it well in the simple cases like > raid10/raid10, it would have to botch it up on the hard cases. > > > . > -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
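For contrast, the FS-level accounting of the patch would describe the same 1TiB + 2TiB raid1 pair roughly as follows (ignoring metadata/system chunks and the exact rounding of btrfs_calc_avail_data_space(), so the figures are only indicative): raid1 needs both copies on different devices, so only about 1TiB of data can ever be mirrored; df would start near @size=1T, @used=0, @available=1T, and after the 1T of data is written it would show @size=1T, @used=1T, @available=0, so the ENOSPC no longer contradicts df.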
On 12/15/2014 12:26 AM, Dongsheng Yang wrote: > On 12/15/2014 03:49 PM, Robert White wrote: >> On 12/14/2014 10:06 PM, Robert White wrote: >>> On 12/14/2014 05:21 PM, Dongsheng Yang wrote: >>>> Anyone have some suggestion about it? >>> (... strong advocacy for raw numbers...) > > Hi Robert, thanx for your so detailed reply. > > You are proposing to report the raw numbers in df command, right? Wrong. Well partly wrong. You didn't include the "geometrically unavailable" calculation from my suggestion. And you've largely ignored the commentary and justifications from the first email. See, all the solutions being discussed basically drop out the hard parts. For instance god save your existing patch from a partially complete conversion. I'll bet we could get into negative @available numbers if we abort halfway through conversion between raid levels zero and one. If we have two disks of 5GiB and they _were_ at RAID0 and we start a -dconvert to RAID1 and abort it or run out of space because there were 7GiB in use... then what? If size is now 5GiB and used is now 6GiB but we've still got 1GiB available... well _how_ exactly is that _better_ information? > Let's compare the space information in FS level and Device level. > > Example: > /dev/sda == 1TiB > /dev/sdb == 2TiB > > mkfs.btrfs /dev/sda /dev/sdb -d raid1 > > (1). If we report the raw numbers in df command, we will get the result of > @size=3T, @used=0 @available=3T. It's not a bad idea until now, as you said > user can consider the raid when they are using the fs. Then if we fill > 1T data > into it. we will get @size=3, @used=2T, @avalable=1T. And at this > moment, we > will get ENOSPC when writing any more data. It's unacceptable. Why you > tell me there is 1T space available, but I can't write one byte into it? See the last segment below where I addressed the idea of geometrically unavailable space. It explains why, in this case, available would have been 0TiB not 1TiB. The solution has to be correct for all uses and BTRFS _WILL_ eventually reach a condition where _either_ @available will be zero but you can still add some small files, or @available will be non-zero but you _can't_ any large files. It is _impossible_ _to_ _avoid_ both outcomes because the two modalities of file storage are incompatable. We have mixed the metaphor of file storage and raw storage. We have done so with a dynamically self-restructuring system. As such we don't have a completely consistent option. For instance, what happens when the user adds a third 1TiB drive to your example? So we need to account, at a system-wide level, for every byte of storage. We need to then make as few assumptions as possible, but no fewer, when reporting the subsets of information to the tools that were invented before we stretched the paradigm outside that box. > > (2). Actually, there was an elder btrfs_statfs(), it reported the raw > numbers to user. > To solve the problem mentioned in (1), we need report the space > information in the FS level. No, we need to report the availability at the _raw_ level, but subtract the "absolutely inaccessable" tidbits. In short we need to report at the _storage_ _management_ _level_ w So your example sizes in my methodology would report SIZE=2TiB USED=0 AVAILABLE=2TiB And if you wrote exactly 1GiB of data to that filesystem it would show SIZE=2TiB USED=2GiB AVAILABLE=1.99TiB because that one 1GiB took 2GiB to store. No attempt would be made to make the filesystem appear 1TiB-ish. 
The operator knows its RAID1 so he knows it will be sucked down at a 2x rate. And when the user said "WTF" regarding the missing 1TiB he'd be directed to show-super and the 1TiB "unavailable because of geometry". But if the user built the filesystem with btrfs device add and never rebalanced it, absolutely no attempt would be made to "protect him" from the knowledge that he could "run out of space" while still having blocks available. mkfs.btrfs /dev/sda (time pases) btrfs device add /dev/sdd /mountpoint Would show SIZE=3TiB (etc) and would take no steps to warn about the possibly jammed up condition caused by the existing concentration duplicate metata on /dev/sda. > Current btrfs_statfs() is designed like this, but not working in any cases. > My new btrfs_statfs() here is following this design and implementing it > to show a *better* output to user. > > Thanx > Yang >> >> Concise Example to attempt to be clearer: >> >> /dev/sda == 1TiB >> /dev/sdb == 2TiB >> /dev/sdc == 3TiB >> /dev/sdd == 3TiB >> >> mkfs.btrfs /dev/sd{a..d} -d raid0 >> mount /dev/sda /mnt >> >> Now compare :: >> >> #!/bin/bash >> dd if=/dev/urandom of=/mnt/example bs=1G >> >> vs >> >> #!/bin/bash >> typeset -i counter >> for ((counter=0;;counter++)); do >> dd if=/dev/urandom of=/mnt/example$conter bs=44 count=1 >> done >> >> vs >> >> #!/bin/bash >> typeset -i counter >> for ((counter=0;;counter++)); do >> dd if=/dev/urandom of=/mnt/example$conter bs=44 count=1 >> done & >> dd if=/dev/urandom of=/mnt/example bs=1G >> >> Now repeat the above 3 models for >> mkfs.btrfs /dev/sd{a..d} -d raid5 >> >> >> ...... >> >> As you watch these six examples evolve you can ponder the ultimate >> futility of doing adaptive prediction within statfs(). >> >> Then go back and change the metadata from the default of RAID1 to >> RAID5 or RAID6 or RAID10. >> >> Then go back and try >> >> mkfs.btrfs /dev/sd{a..d} -d raid10 >> >> then balance when the big file runs out of space, then resume the big >> file with oflag=append >> >> ...... >> >> Unlike _all_ our predecessors, we are active at both the semantic file >> storage level _and_ the physical media management level. >> >> None of the prior filesystems match this new ground exactly. >> >> The only real option is to expose the raw numbers and then tell people >> the corner cases. >> === Re-read from here down === >> Absolutely unavailable blocks, such as the massive waste of 5TiB in >> the above sized media if raid10 were selected for both data and >> metadata would be subtracted from size if and only if it's >> _impossible_ for it to be accessed by this sort of restriction. But >> even in this case, the correct answer for size is 4TiB because that >> exactly answers "how big is this filesystem". >> >> It might be worth having a "dev_item.bytes_excluded" or unusable or >> whatever to account for the difference between total_bytes and >> bytes_used and the implicit bytes available. This would account for >> the 0,1,2,2 TiB that a raid10 of the example sizes could never reach >> in the current geometry. I'm betting that this sort of number also >> shows up as some number of sectors in any filesystem that has an odd >> tidbit of size up at the top where no structure is ever gong to fit. >> That's just a feature of the way disks use GB instead of GiB and msdos >> style partitions love the number 63. >> >> So resize sets the size. Geometry limitations may reduce the effective >> size by some, or a _lot_, but then the used-vs-available should _not_ >> try to correct for whatever geometry is in use. 
Even when it might be >> simple because if it does it well in the simple cases like >> raid10/raid10, it would have to botch it up on the hard cases. >> >> >> . >> > > So we don't just hand-wave over statfs(). We include the dev_item.bytes_excluded in the superblock and we decide once-and-for-all (with any geometry creation, or completed conversion) how many bytes just _can't_ be reached but only once we _know_ they cant be reached. And we memorialize that unreachable data in the superblocks. Thereafter we report the raw numbers after subtracting anything we know cannot be reached. All other "helpful" solutions are NP-complete and insoluble. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
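For the two-copy raid1 case used in the example above, the "geometrically unreachable" amount is easy to compute. A small sketch of just that one case follows; raid1_waste is an invented helper for illustration, not the proposed dev_item.bytes_excluded machinery, and other profiles need their own formulas:

#include <stdio.h>

/*
 * raid1 keeps two copies, so a byte on the largest device is only usable
 * if it can be paired with a byte on some other device; whatever the
 * largest device exceeds the sum of the others by can never be reached.
 */
static unsigned long long raid1_waste(const unsigned long long *dev, int n)
{
	unsigned long long sum = 0, largest = 0;
	int i;

	for (i = 0; i < n; i++) {
		sum += dev[i];
		if (dev[i] > largest)
			largest = dev[i];
	}
	return largest > sum - largest ? largest - (sum - largest) : 0;
}

int main(void)
{
	unsigned long long dev[] = { 1ULL << 40, 2ULL << 40 }; /* 1TiB + 2TiB */
	unsigned long long total = dev[0] + dev[1];
	unsigned long long waste = raid1_waste(dev, 2);

	/* size as proposed above: raw total minus the unreachable space */
	printf("raw=%lluTiB waste=%lluTiB size=%lluTiB\n",
	       total >> 40, waste >> 40, (total - waste) >> 40);
	return 0;
}

For the 1TiB + 2TiB pair this prints raw=3TiB waste=1TiB size=2TiB, matching the SIZE=2TiB figure above.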
On 12/15/2014 01:36 AM, Robert White wrote: > So we don't just hand-wave over statfs(). We include the > dev_item.bytes_excluded in the superblock and we decide once-and-for-all > (with any geometry creation, or completed conversion) how many bytes > just _can't_ be reached but only once we _know_ they cant be reached. > And we memorialize that unreachable data in the superblocks. > > Thereafter we report the raw numbers after subtracting anything we know > cannot be reached. > > All other "helpful" solutions are NP-complete and insoluble. On multiple re-readings of my own words and running off to the POSIX definitions _and_ kernel sources (which don't agree). The practical bits first :: I would add a "-c | --compatable" option to btrfs fi df that let it produce /bin/df format-compatable output that gave the "real" numbers as defined near the end. /dev/sda 1TiB /dev/sdb 2TiB mkfs.btrfs /dev/sd{a,b} -d raid1 @size=3TiB @used=0TiB @available=2TiB The above would be ideal. But POSIX says "no". f_blocks is defined (only in the comments) as "total data blocks in the filesystem" and /bin/df pivots on that assumption, so the only usable option left is :: @size=2TiB @used=0TiB @available=2TiB After which @used would be the real, raw space consumed. If it takes 2GiB or 4GiB to store 1GiB (q.v. RAID 1 and 10) then @used would go up by that 2 or 4 GiB. Given the not-displayed, not reported, excluded_by_geometry values (e.g. @waste) the equation should always be :: @size - @waste = @used + @available The fact that /bin/df doesn't display all four values is just tough, The fact that it calculates one "for us" is really annoying, show-super would be the place to go find the truth. The @waste value is soft because while 1TiB of /dev/sdb that is not going to be used isn't a _particular_ 1TiB. It could be low blocks or high blocks or randomly distributed blocks that end up not having data. So keeping with my thought that (ideally) @size should be the "safe dd size" for doing a raw-block transcribe of the devices and filesystem, it is most correct for @size to be real storage size. But sadly, posix didn't define that value for that role, so we are forced to munge around. (particularly since /bin/df calculates stuff "for us"). Calculation of the @waste would have to happen in two phases. At initiation phase of any convert @waste would be set to zero. At completion of any _full_ convert, when we know that there are no leftover bits that could lead to rampant mis-report, @waste would be calculated for each device as a dev_item. Then the total would be stored as a global item. btrfs tools would report all four items. statfs() would have to report (@size-@waste) and @available, but that's a problem with limits to the assumptions made by statfs() designers two decades ago. I don't know which numbers we keep on hand and which we derive so... @available, if calculated dynamically would be sum(@size, -@waste, -@used). @used, if calculated dynamically, would be sum(@size, -@waste, -@available). This would also keep all the determinations of @waste well defined and relegated to specific, infrequently executed blocks of code. GIVEN ALSO :: The BTRFS dynamic internal layout allows for completely valid states that are inconsistent with the current filesystem flags... Such as it is legal to set the RAID1 mode for data but still having RAID0, RAID5, and any manner of other extents present... there is no finite solution to every particular layout that exists. This condition is even _mandatory_ in an evolving system. 
May persist if conversion is interrupted and then the balance is aborted. And might be purely legal if you supply a convert option and limit the number of blocks to process in the same run. Each individual extent block is it's own master in terms of what "mode the filesystem is actally in" when that extent is being accessed. This fact is _unchangeable_. STANDARDS REFERENCES and Issues... The actual standard from POSIX at The Open Group refers to f_blocks as "Total number of blocks on file system in units of f_frsize". See :: http://pubs.opengroup.org/onlinepubs/009695399/basedefs/sys/statvfs.h.html The linux kernel source and man pages say "total data blocks in filesystem". I don't know where/when/why the "total blocks" got re-qualified as "total data blocks" in the linux history, but it's probably incorrect on plain reading. The df command itself suffers a similar problem as the POSIX standard doesn't talk about "data blocks" etc. Problematically, of course, the statfs() call doesn't really allow for any means to address slack/waste space and the reverse calculation for us becomes impossible. This gets back to the "no right answer in BTRFS" issue. There is a lot of missing magic here. Back when INODES where just one thing with one size statfs results were probably either-or and "Everybody Knew" how to turn the inode count into a block count and history just rolled on. I think the real answer would be to invent an expanded statfs() call that returned the real numbers for @total_size, @system_overhead_used, @waste_space, @unusable_space, etc -- that is to come up with a generic model for a modern storage system -- and let real calculations take place. But I don't have the "community chops" to start that ball rolling. CONCLUSIONS :: Given the inherent assumptions of statfs(), there is _no_ solution that will be correct in all cases. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
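As a quick check of that identity with the 1TiB + 2TiB raid1 figures used earlier in the thread: @size (raw) = 3TiB and @waste = 1TiB; after writing 1GiB of data, which consumes 2GiB at raid1, @used = 2GiB and @available = 2TiB - 2GiB, so @size - @waste = 3TiB - 1TiB = 2TiB = @used + @available, in line with the SIZE=2TiB / USED=2GiB / AVAILABLE=1.99TiB example in the previous mail.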
On 12/15/2014 07:30 PM, Robert White wrote:
> The above would be ideal. But POSIX says "no". f_blocks is defined (only

Correction: the linux kernel says "total data blocks", POSIX says "total blocks" -- it was a mental typo... 8-)

> in the comments) as "total data blocks in the filesystem" and /bin/df
> pivots on that assumption, so the only usable option left is ::

-- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 12/16/2014 11:30 AM, Robert White wrote: > On 12/15/2014 01:36 AM, Robert White wrote: >> So we don't just hand-wave over statfs(). We include the >> dev_item.bytes_excluded in the superblock and we decide once-and-for-all >> (with any geometry creation, or completed conversion) how many bytes >> just _can't_ be reached but only once we _know_ they cant be reached. >> And we memorialize that unreachable data in the superblocks. >> >> Thereafter we report the raw numbers after subtracting anything we know >> cannot be reached. >> >> All other "helpful" solutions are NP-complete and insoluble. > > On multiple re-readings of my own words and running off to the POSIX > definitions _and_ kernel sources (which don't agree). > > The practical bits first :: > > I would add a "-c | --compatable" option to btrfs fi df > that let it produce /bin/df format-compatable output that gave the > "real" numbers as defined near the end. > > > /dev/sda 1TiB > /dev/sdb 2TiB > > > mkfs.btrfs /dev/sd{a,b} -d raid1 > > @size=3TiB @used=0TiB @available=2TiB > > The above would be ideal. But POSIX says "no". f_blocks is defined > (only in the comments) as "total data blocks in the filesystem" and > /bin/df pivots on that assumption, so the only usable option left is :: > > @size=2TiB @used=0TiB @available=2TiB > > After which @used would be the real, raw space consumed. If it takes > 2GiB or 4GiB to store 1GiB (q.v. RAID 1 and 10) then @used would go up > by that 2 or 4 GiB. Hi Robert, thanx for your proposal about this. IMHO, output of df command shoud be more friendly to user. Well, I think we have a disagreement on this point, let's take a look at what the zfs is doing. /dev/sda7- 10G /dev/sda8- 10G # zpool create myzpool mirror /dev/sda7 /dev/sda8 -f # df -h /myzpool/ Filesystem Size Used Avail Use% Mounted on myzpool 9.8G 21K 9.8G 1% /myzpool That said that df command should tell user the space info they can see. It means the output is the information from the FS level rather than device level or _storage_manager level. Thanx Yang > > Given the not-displayed, not reported, excluded_by_geometry values > (e.g. @waste) the equation should always be :: > > @size - @waste = @used + @available > > The fact that /bin/df doesn't display all four values is just tough, > The fact that it calculates one "for us" is really annoying, > show-super would be the place to go find the truth. > > The @waste value is soft because while 1TiB of /dev/sdb that is not > going to be used isn't a _particular_ 1TiB. It could be low blocks or > high blocks or randomly distributed blocks that end up not having data. > > So keeping with my thought that (ideally) @size should be the "safe dd > size" for doing a raw-block transcribe of the devices and filesystem, > it is most correct for @size to be real storage size. But sadly, posix > didn't define that value for that role, so we are forced to munge > around. (particularly since /bin/df calculates stuff "for us"). > > > Calculation of the @waste would have to happen in two phases. At > initiation phase of any convert @waste would be set to zero. At > completion of any _full_ convert, when we know that there are no > leftover bits that could lead to rampant mis-report, @waste would be > calculated for each device as a dev_item. Then the total would be > stored as a global item. > > btrfs tools would report all four items. > > statfs() would have to report (@size-@waste) and @available, but > that's a problem with limits to the assumptions made by statfs() > designers two decades ago. 
> > I don't know which numbers we keep on hand and which we derive so... > > @available, if calculated dynamically would be > sum(@size, -@waste, -@used). > > @used, if calculated dynamically, would be > sum(@size, -@waste, -@available). > > This would also keep all the determinations of @waste well defined and > relegated to specific, infrequently executed blocks of code. > > GIVEN ALSO :: > > The BTRFS dynamic internal layout allows for completely valid states > that are inconsistent with the current filesystem flags... Such as it > is legal to set the RAID1 mode for data but still having RAID0, RAID5, > and any manner of other extents present... there is no finite solution > to every particular layout that exists. > > This condition is even _mandatory_ in an evolving system. May persist > if conversion is interrupted and then the balance is aborted. And > might be purely legal if you supply a convert option and limit the > number of blocks to process in the same run. > > Each individual extent block is it's own master in terms of what "mode > the filesystem is actally in" when that extent is being accessed. This > fact is _unchangeable_. > > > STANDARDS REFERENCES and Issues... > > The actual standard from POSIX at The Open Group refers to f_blocks as > "Total number of blocks on file system in units of f_frsize". > > See :: > http://pubs.opengroup.org/onlinepubs/009695399/basedefs/sys/statvfs.h.html > > The linux kernel source and man pages say "total data blocks in > filesystem". > > I don't know where/when/why the "total blocks" got re-qualified as > "total data blocks" in the linux history, but it's probably incorrect > on plain reading. > > The df command itself suffers a similar problem as the POSIX standard > doesn't talk about "data blocks" etc. > > Problematically, of course, the statfs() call doesn't really allow for > any means to address slack/waste space and the reverse calculation for > us becomes impossible. > > This gets back to the "no right answer in BTRFS" issue. > > There is a lot of missing magic here. Back when INODES where just one > thing with one size statfs results were probably either-or and > "Everybody Knew" how to turn the inode count into a block count and > history just rolled on. > > I think the real answer would be to invent an expanded statfs() call > that returned the real numbers for @total_size, @system_overhead_used, > @waste_space, @unusable_space, etc -- that is to come up with a > generic model for a modern storage system -- and let real calculations > take place. But I don't have the "community chops" to start that ball > rolling. > > CONCLUSIONS :: > > Given the inherent assumptions of statfs(), there is _no_ solution > that will be correct in all cases. > . > -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
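Read in the terms used elsewhere in the thread: the pool is built from 20G of raw device space, the mirror can hold at most min(10G, 10G) = 10G of data, and df shows that figure (less some zfs-internal reservation, hence 9.8G) rather than 20G, with the 2x mirroring cost hidden from the user; this is the same FS-level presentation the patch aims to give for btrfs raid1.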
On Tue, Dec 16, 2014 at 7:30 PM, Dongsheng Yang <yangds.fnst@cn.fujitsu.com> wrote: > On 12/16/2014 11:30 AM, Robert White wrote: >> >> On 12/15/2014 01:36 AM, Robert White wrote: >>> >>> So we don't just hand-wave over statfs(). We include the >>> dev_item.bytes_excluded in the superblock and we decide once-and-for-all >>> (with any geometry creation, or completed conversion) how many bytes >>> just _can't_ be reached but only once we _know_ they cant be reached. >>> And we memorialize that unreachable data in the superblocks. >>> >>> Thereafter we report the raw numbers after subtracting anything we know >>> cannot be reached. >>> >>> All other "helpful" solutions are NP-complete and insoluble. >> >> >> On multiple re-readings of my own words and running off to the POSIX >> definitions _and_ kernel sources (which don't agree). >> >> The practical bits first :: >> >> I would add a "-c | --compatable" option to btrfs fi df >> that let it produce /bin/df format-compatable output that gave the "real" >> numbers as defined near the end. >> >> >> /dev/sda 1TiB >> /dev/sdb 2TiB >> >> >> mkfs.btrfs /dev/sd{a,b} -d raid1 >> >> @size=3TiB @used=0TiB @available=2TiB >> >> The above would be ideal. But POSIX says "no". f_blocks is defined (only >> in the comments) as "total data blocks in the filesystem" and /bin/df pivots >> on that assumption, so the only usable option left is :: >> >> @size=2TiB @used=0TiB @available=2TiB >> >> After which @used would be the real, raw space consumed. If it takes 2GiB >> or 4GiB to store 1GiB (q.v. RAID 1 and 10) then @used would go up by that 2 >> or 4 GiB. > > > Hi Robert, thanx for your proposal about this. > > IMHO, output of df command shoud be more friendly to user. > Well, I think we have a disagreement on this point, let's take a look at > what the zfs is doing. > > /dev/sda7- 10G > /dev/sda8- 10G > # zpool create myzpool mirror /dev/sda7 /dev/sda8 -f > # df -h /myzpool/ > Filesystem Size Used Avail Use% Mounted on > myzpool 9.8G 21K 9.8G 1% /myzpool > > That said that df command should tell user the space info they can see. > It means the output is the information from the FS level rather than device > level or _storage_manager level. Addition: There are some other ways to get the space information in btrfs, btrfs fi df, btrfs fi show, btrfs-debug-tree. The df command we discussed here is on the top level which is directly facing the user. Let me try to show the difference in different level. 1), TOP level, for a linux user: df command. For a linux user, he does not care about the detail how the data is stored in devices. They do not care even not know what's Single, what does DUP mean, and how a fs implement the RAID10. What they want to know is *what is the size of the filesystem I am using and how much space is still available to me*. That's what I said by "FS space level) 2). Middle level, for a btrfs user: btrfs fi df/show. For a btrfs user, they know about the single, dup and RAIDX. When they want to know what's the raid level in each space info, they can use btrfs fi df to print the information they want. 3). Device level. for debugging. Sometimes, you need to know how the each chunk is stored in device. Please use btrfs-debug-tree to show details you want as more as possible. After all, I would say, the information you want to show is *not* incorrect to me, but it's not the business of df command. Thanx > > Thanx > Yang > >> >> Given the not-displayed, not reported, excluded_by_geometry values (e.g. 
>> @waste) the equation should always be :: >> >> @size - @waste = @used + @available >> >> The fact that /bin/df doesn't display all four values is just tough, The >> fact that it calculates one "for us" is really annoying, show-super would be >> the place to go find the truth. >> >> The @waste value is soft because while 1TiB of /dev/sdb that is not going >> to be used isn't a _particular_ 1TiB. It could be low blocks or high blocks >> or randomly distributed blocks that end up not having data. >> >> So keeping with my thought that (ideally) @size should be the "safe dd >> size" for doing a raw-block transcribe of the devices and filesystem, it is >> most correct for @size to be real storage size. But sadly, posix didn't >> define that value for that role, so we are forced to munge around. >> (particularly since /bin/df calculates stuff "for us"). >> >> >> Calculation of the @waste would have to happen in two phases. At >> initiation phase of any convert @waste would be set to zero. At completion >> of any _full_ convert, when we know that there are no leftover bits that >> could lead to rampant mis-report, @waste would be calculated for each device >> as a dev_item. Then the total would be stored as a global item. >> >> btrfs tools would report all four items. >> >> statfs() would have to report (@size-@waste) and @available, but that's a >> problem with limits to the assumptions made by statfs() designers two >> decades ago. >> >> I don't know which numbers we keep on hand and which we derive so... >> >> @available, if calculated dynamically would be >> sum(@size, -@waste, -@used). >> >> @used, if calculated dynamically, would be >> sum(@size, -@waste, -@available). >> >> This would also keep all the determinations of @waste well defined and >> relegated to specific, infrequently executed blocks of code. >> >> GIVEN ALSO :: >> >> The BTRFS dynamic internal layout allows for completely valid states that >> are inconsistent with the current filesystem flags... Such as it is legal to >> set the RAID1 mode for data but still having RAID0, RAID5, and any manner of >> other extents present... there is no finite solution to every particular >> layout that exists. >> >> This condition is even _mandatory_ in an evolving system. May persist if >> conversion is interrupted and then the balance is aborted. And might be >> purely legal if you supply a convert option and limit the number of blocks >> to process in the same run. >> >> Each individual extent block is it's own master in terms of what "mode the >> filesystem is actally in" when that extent is being accessed. This fact is >> _unchangeable_. >> >> >> STANDARDS REFERENCES and Issues... >> >> The actual standard from POSIX at The Open Group refers to f_blocks as >> "Total number of blocks on file system in units of f_frsize". >> >> See :: >> http://pubs.opengroup.org/onlinepubs/009695399/basedefs/sys/statvfs.h.html >> >> The linux kernel source and man pages say "total data blocks in >> filesystem". >> >> I don't know where/when/why the "total blocks" got re-qualified as "total >> data blocks" in the linux history, but it's probably incorrect on plain >> reading. >> >> The df command itself suffers a similar problem as the POSIX standard >> doesn't talk about "data blocks" etc. >> >> Problematically, of course, the statfs() call doesn't really allow for any >> means to address slack/waste space and the reverse calculation for us >> becomes impossible. >> >> This gets back to the "no right answer in BTRFS" issue. 
>> >> There is a lot of missing magic here. Back when INODES where just one >> thing with one size statfs results were probably either-or and "Everybody >> Knew" how to turn the inode count into a block count and history just rolled >> on. >> >> I think the real answer would be to invent an expanded statfs() call that >> returned the real numbers for @total_size, @system_overhead_used, >> @waste_space, @unusable_space, etc -- that is to come up with a generic >> model for a modern storage system -- and let real calculations take place. >> But I don't have the "community chops" to start that ball rolling. >> >> CONCLUSIONS :: >> >> Given the inherent assumptions of statfs(), there is _no_ solution that >> will be correct in all cases. >> . >> > > -- > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Hi Goffredo, On Tue, Dec 16, 2014 at 1:47 AM, Goffredo Baroncelli <kreijack@inwind.it> wrote: > I Yang, > > On 12/14/2014 12:29 PM, Dongsheng Yang wrote: >> On Sat, Dec 13, 2014 at 3:25 AM, Goffredo Baroncelli <kreijack@inwind.it> wrote: >>> On 12/11/2014 09:31 AM, Dongsheng Yang wrote: >>>> When function btrfs_statfs() calculate the tatol size of fs, it is calculating >>>> the total size of disks and then dividing it by a factor. But in some usecase, >>>> the result is not good to user. >>> >>> >>> I Yang; during my test I discovered an error: >>> >>> $ sudo lvcreate -L +10G -n disk vgtest >>> $ sudo /home/ghigo/mkfs.btrfs -f /dev/vgtest/disk >>> Btrfs v3.17 >>> See http://btrfs.wiki.kernel.org for more information. >>> >>> Turning ON incompat feature 'extref': increased hardlink limit per file to 65536 >>> fs created label (null) on /dev/vgtest/disk >>> nodesize 16384 leafsize 16384 sectorsize 4096 size 10.00GiB >>> $ sudo mount /dev/vgtest/disk /mnt/btrfs1/ >>> $ df /mnt/btrfs1/ >>> Filesystem 1K-blocks Used Available Use% Mounted on >>> /dev/mapper/vgtest-disk 9428992 1069312 8359680 12% /mnt/btrfs1 >>> $ sudo ~/btrfs fi df /mnt/btrfs1/ >>> Data, single: total=8.00MiB, used=256.00KiB >>> System, DUP: total=8.00MiB, used=16.00KiB >>> System, single: total=4.00MiB, used=0.00B >>> Metadata, DUP: total=1.00GiB, used=112.00KiB >>> Metadata, single: total=8.00MiB, used=0.00B >>> GlobalReserve, single: total=16.00MiB, used=0.00B >>> >>> What seems me strange is the 9428992KiB of total disk space as reported >>> by df. About 600MiB are missing ! >>> >>> Without your patch, I got: >>> >>> $ df /mnt/btrfs1/ >>> Filesystem 1K-blocks Used Available Use% Mounted on >>> /dev/mapper/vgtest-disk 10485760 16896 8359680 1% /mnt/btrfs1 >>> >> >> Hi Goffredo, thanx for pointing this. Let me try to show the reason of it. >> >> In this case you provided here, the metadata is 1G in DUP. It means >> we spent 2G device space for it. But from the view point of user, they >> can only see 9G size for this fs. It's show as below: >> 1G >> -----|---------------- Filesystem space (9G) >> | \ >> | \ >> | \ >> | 2G \ >> -----|----|----------------- Device space (10G) >> DUP | >> >> The main idea about the new btrfs_statfs() is hiding the detail in Device >> space and only show the information in the FS space level. > > I am not entirely convinced. On the basis of your statement, I would find a difference of 1G or 1G/2... but missing are 600MB... Ha, forgot to clarify this one. 9428992K is almost 9G actually. 9G = (9 * 2^20)K = 9437184K. So, as I showed in the graphic, @size is 9G. :) > Grzegorz found a difference of 2.3GB (but on a bigger disk). > > Looking at your patch set, it seems to me that the computation of the > "total disk space" is done differently than before. > > Before, the "total disk space" was > >>>> - buf->f_blocks = div_u64(btrfs_super_total_bytes(disk_super), factor); >>>> - buf->f_blocks >>= bits; > > I read this as: "total disk space" is the sum of the disk size (corrected by a > factor depending by the profile). > > Instead after your patches the "total disk space" is > >>>> + buf->f_blocks = total_alloc + total_free_data; > > This means that if a disk is not fully mapped by chunks, there is a difference > between the "total disk space" and the value reported by df. > > May be that you are highlighting a bug elsewhere (not caused by your patch) ? > > BTW > I will continue my tests... Thank you very much. :) Yang > > BR > G.Baroncelli > >> >> Actually, the similar idea is adopted in btrfs fi df. 
>> >> Example: >> # lvcreate -L +10G -n disk Group0 >> # mkfs.btrfs -f /dev/Group0/disk >> # dd if=/dev/zero of=/mnt/data bs=1M count=10240 >> dd: error writing ‘/mnt/data’: No space left on device >> 8163+0 records in >> 8162+0 records out >> 8558477312 bytes (8.6 GB) copied, 207.565 s, 41.2 MB/s >> # df /mnt >> Filesystem 1K-blocks Used Available Use% Mounted on >> /dev/mapper/Group0-disk 9428992 8382896 2048 100% /mnt >> # btrfs fi df /mnt >> Data, single: total=7.97GiB, used=7.97GiB >> System, DUP: total=8.00MiB, used=16.00KiB >> System, single: total=4.00MiB, used=0.00B >> Metadata, DUP: total=1.00GiB, used=8.41MiB >> Metadata, single: total=8.00MiB, used=0.00B >> GlobalReserve, single: total=16.00MiB, used=0.00B >> >> Now, the all space is used and got a ENOSPC error. >> But from the output of btrfs fi df. We can only find almost 9G (data >> 8G + metadata 1G) >> is used. The difference is that it show the DUP here to make >> it more clear. >> >> Wish this description is clear to you :). >> >> If there is anything confusing or sounds incorrect to you, please point it out. >> >> Thanx >> Yang >> >>> >>>> Example: >>>> # mkfs.btrfs -f /dev/vdf1 /dev/vdf2 -d raid1 >>>> # mount /dev/vdf1 /mnt >>>> # dd if=/dev/zero of=/mnt/zero bs=1M count=1000 >>>> # df -h /mnt >>>> Filesystem Size Used Avail Use% Mounted on >>>> /dev/vdf1 3.0G 1018M 1.3G 45% /mnt >>>> # btrfs fi show /dev/vdf1 >>>> Label: none uuid: f85d93dc-81f4-445d-91e5-6a5cd9563294 >>>> Total devices 2 FS bytes used 1001.53MiB >>>> devid 1 size 2.00GiB used 1.85GiB path /dev/vdf1 >>>> devid 2 size 4.00GiB used 1.83GiB path /dev/vdf2 >>>> a. df -h should report Size as 2GiB rather than as 3GiB. >>>> Because this is 2 device raid1, the limiting factor is devid 1 @2GiB. >>>> >>>> b. df -h should report Avail as 0.97GiB or less, rather than as 1.3GiB. >>>> 1.85 (the capacity of the allocated chunk) >>>> -1.018 (the file stored) >>>> +(2-1.85=0.15) (the residual capacity of the disks >>>> considering a raid1 fs) >>>> --------------- >>>> = 0.97 >>>> >>>> This patch drops the factor at all and calculate the size observable to >>>> user without considering which raid level the data is in and what's the >>>> size exactly in disk. >>>> After this patch applied: >>>> # mkfs.btrfs -f /dev/vdf1 /dev/vdf2 -d raid1 >>>> # mount /dev/vdf1 /mnt >>>> # dd if=/dev/zero of=/mnt/zero bs=1M count=1000 >>>> # df -h /mnt >>>> Filesystem Size Used Avail Use% Mounted on >>>> /dev/vdf1 2.0G 1.3G 713M 66% /mnt >>>> # df /mnt >>>> Filesystem 1K-blocks Used Available Use% Mounted on >>>> /dev/vdf1 2097152 1359424 729536 66% /mnt >>>> # btrfs fi show /dev/vdf1 >>>> Label: none uuid: e98c1321-645f-4457-b20d-4f41dc1cf2f4 >>>> Total devices 2 FS bytes used 1001.55MiB >>>> devid 1 size 2.00GiB used 1.85GiB path /dev/vdf1 >>>> devid 2 size 4.00GiB used 1.83GiB path /dev/vdf2 >>>> a). The @Size is 2G as we expected. >>>> b). @Available is 700M = 1.85G - 1.3G + (2G - 1.85G). >>>> c). @Used is changed to 1.3G rather than 1018M as above. Because >>>> this patch do not treat the free space in metadata chunk >>>> and system chunk as available to user. It's true, user can >>>> not use these space to store data, then it should not be >>>> thought as available. At the same time, it will make the >>>> @Used + @Available == @Size as possible to user. >>>> >>>> Signed-off-by: Dongsheng Yang <yangds.fnst@cn.fujitsu.com> >>>> --- >>>> Changelog: >>>> v1 -> v2: >>>> a. account the total_bytes in medadata chunk and >>>> system chunk as used to user. >>>> b. 
set data_stripes to the correct value in RAID0. >>>> >>>> fs/btrfs/extent-tree.c | 13 ++---------- >>>> fs/btrfs/super.c | 56 ++++++++++++++++++++++---------------------------- >>>> 2 files changed, 26 insertions(+), 43 deletions(-) >>>> >>>> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c >>>> index a84e00d..9954d60 100644 >>>> --- a/fs/btrfs/extent-tree.c >>>> +++ b/fs/btrfs/extent-tree.c >>>> @@ -8571,7 +8571,6 @@ static u64 __btrfs_get_ro_block_group_free_space(struct list_head *groups_list) >>>> { >>>> struct btrfs_block_group_cache *block_group; >>>> u64 free_bytes = 0; >>>> - int factor; >>>> >>>> list_for_each_entry(block_group, groups_list, list) { >>>> spin_lock(&block_group->lock); >>>> @@ -8581,16 +8580,8 @@ static u64 __btrfs_get_ro_block_group_free_space(struct list_head *groups_list) >>>> continue; >>>> } >>>> >>>> - if (block_group->flags & (BTRFS_BLOCK_GROUP_RAID1 | >>>> - BTRFS_BLOCK_GROUP_RAID10 | >>>> - BTRFS_BLOCK_GROUP_DUP)) >>>> - factor = 2; >>>> - else >>>> - factor = 1; >>>> - >>>> - free_bytes += (block_group->key.offset - >>>> - btrfs_block_group_used(&block_group->item)) * >>>> - factor; >>>> + free_bytes += block_group->key.offset - >>>> + btrfs_block_group_used(&block_group->item); >>>> >>>> spin_unlock(&block_group->lock); >>>> } >>>> diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c >>>> index 54bd91e..40f41a2 100644 >>>> --- a/fs/btrfs/super.c >>>> +++ b/fs/btrfs/super.c >>>> @@ -1641,6 +1641,8 @@ static int btrfs_calc_avail_data_space(struct btrfs_root *root, u64 *free_bytes) >>>> u64 used_space; >>>> u64 min_stripe_size; >>>> int min_stripes = 1, num_stripes = 1; >>>> + /* How many stripes used to store data, without considering mirrors. */ >>>> + int data_stripes = 1; >>>> int i = 0, nr_devices; >>>> int ret; >>>> >>>> @@ -1657,12 +1659,15 @@ static int btrfs_calc_avail_data_space(struct btrfs_root *root, u64 *free_bytes) >>>> if (type & BTRFS_BLOCK_GROUP_RAID0) { >>>> min_stripes = 2; >>>> num_stripes = nr_devices; >>>> + data_stripes = num_stripes; >>>> } else if (type & BTRFS_BLOCK_GROUP_RAID1) { >>>> min_stripes = 2; >>>> num_stripes = 2; >>>> + data_stripes = 1; >>>> } else if (type & BTRFS_BLOCK_GROUP_RAID10) { >>>> min_stripes = 4; >>>> num_stripes = 4; >>>> + data_stripes = 2; >>>> } >>>> >>>> if (type & BTRFS_BLOCK_GROUP_DUP) >>>> @@ -1733,14 +1738,17 @@ static int btrfs_calc_avail_data_space(struct btrfs_root *root, u64 *free_bytes) >>>> i = nr_devices - 1; >>>> avail_space = 0; >>>> while (nr_devices >= min_stripes) { >>>> - if (num_stripes > nr_devices) >>>> + if (num_stripes > nr_devices) { >>>> num_stripes = nr_devices; >>>> + if (type & BTRFS_BLOCK_GROUP_RAID0) >>>> + data_stripes = num_stripes; >>>> + } >>>> >>>> if (devices_info[i].max_avail >= min_stripe_size) { >>>> int j; >>>> u64 alloc_size; >>>> >>>> - avail_space += devices_info[i].max_avail * num_stripes; >>>> + avail_space += devices_info[i].max_avail * data_stripes; >>>> alloc_size = devices_info[i].max_avail; >>>> for (j = i + 1 - num_stripes; j <= i; j++) >>>> devices_info[j].max_avail -= alloc_size; >>>> @@ -1772,15 +1780,13 @@ static int btrfs_calc_avail_data_space(struct btrfs_root *root, u64 *free_bytes) >>>> static int btrfs_statfs(struct dentry *dentry, struct kstatfs *buf) >>>> { >>>> struct btrfs_fs_info *fs_info = btrfs_sb(dentry->d_sb); >>>> - struct btrfs_super_block *disk_super = fs_info->super_copy; >>>> struct list_head *head = &fs_info->space_info; >>>> struct btrfs_space_info *found; >>>> u64 total_used = 0; >>>> + u64 total_alloc = 0; >>>> u64 
total_free_data = 0; >>>> int bits = dentry->d_sb->s_blocksize_bits; >>>> __be32 *fsid = (__be32 *)fs_info->fsid; >>>> - unsigned factor = 1; >>>> - struct btrfs_block_rsv *block_rsv = &fs_info->global_block_rsv; >>>> int ret; >>>> >>>> /* >>>> @@ -1792,38 +1798,20 @@ static int btrfs_statfs(struct dentry *dentry, struct kstatfs *buf) >>>> rcu_read_lock(); >>>> list_for_each_entry_rcu(found, head, list) { >>>> if (found->flags & BTRFS_BLOCK_GROUP_DATA) { >>>> - int i; >>>> - >>>> - total_free_data += found->disk_total - found->disk_used; >>>> + total_free_data += found->total_bytes - found->bytes_used; >>>> total_free_data -= >>>> btrfs_account_ro_block_groups_free_space(found); >>>> - >>>> - for (i = 0; i < BTRFS_NR_RAID_TYPES; i++) { >>>> - if (!list_empty(&found->block_groups[i])) { >>>> - switch (i) { >>>> - case BTRFS_RAID_DUP: >>>> - case BTRFS_RAID_RAID1: >>>> - case BTRFS_RAID_RAID10: >>>> - factor = 2; >>>> - } >>>> - } >>>> - } >>>> + total_used += found->bytes_used; >>>> + } else { >>>> + /* For metadata and system, we treat the total_bytes as >>>> + * not available to file data. So show it as Used in df. >>>> + **/ >>>> + total_used += found->total_bytes; >>>> } >>>> - >>>> - total_used += found->disk_used; >>>> + total_alloc += found->total_bytes; >>>> } >>>> - >>>> rcu_read_unlock(); >>>> >>>> - buf->f_blocks = div_u64(btrfs_super_total_bytes(disk_super), factor); >>>> - buf->f_blocks >>= bits; >>>> - buf->f_bfree = buf->f_blocks - (div_u64(total_used, factor) >> bits); >>>> - >>>> - /* Account global block reserve as used, it's in logical size already */ >>>> - spin_lock(&block_rsv->lock); >>>> - buf->f_bfree -= block_rsv->size >> bits; >>>> - spin_unlock(&block_rsv->lock); >>>> - >>>> buf->f_bavail = total_free_data; >>>> ret = btrfs_calc_avail_data_space(fs_info->tree_root, &total_free_data); >>>> if (ret) { >>>> @@ -1831,8 +1819,12 @@ static int btrfs_statfs(struct dentry *dentry, struct kstatfs *buf) >>>> mutex_unlock(&fs_info->fs_devices->device_list_mutex); >>>> return ret; >>>> } >>>> - buf->f_bavail += div_u64(total_free_data, factor); >>>> + buf->f_bavail += total_free_data; >>>> buf->f_bavail = buf->f_bavail >> bits; >>>> + buf->f_blocks = total_alloc + total_free_data; >>>> + buf->f_blocks >>= bits; >>>> + buf->f_bfree = buf->f_blocks - (total_used >> bits); >>>> + >>>> mutex_unlock(&fs_info->chunk_mutex); >>>> mutex_unlock(&fs_info->fs_devices->device_list_mutex); >>>> >>>> >>> >>> >>> -- >>> gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it> >>> Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5 >>> -- >>> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in >>> the body of a message to majordomo@vger.kernel.org >>> More majordomo info at http://vger.kernel.org/majordomo-info.html >> -- >> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in >> the body of a message to majordomo@vger.kernel.org >> More majordomo info at http://vger.kernel.org/majordomo-info.html >> > > > -- > gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it> > Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
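To make the effect of the quoted v2 hunks easier to follow, here is a rough user-space restatement of the accounting btrfs_statfs() ends up doing with the patch applied. The space-info numbers and the unallocated-space figure are invented for the example; in the kernel the loop walks fs_info->space_info under RCU, the unallocated estimate comes from btrfs_calc_avail_data_space(), everything is shifted by s_blocksize_bits, and the read-only block-group correction is omitted here.

#include <stdint.h>
#include <stdio.h>

/* stand-in for the struct btrfs_space_info fields used by the patch */
struct space_info {
        int is_data;           /* BTRFS_BLOCK_GROUP_DATA set?            */
        uint64_t total_bytes;  /* chunk space allocated, filesystem view */
        uint64_t bytes_used;   /* of which actually used                 */
};

int main(void)
{
        /* invented layout: 1 GiB data chunk half full, 1 GiB metadata+system */
        struct space_info infos[] = {
                { 1, 1ULL << 30, 512ULL << 20 },
                { 0, 1ULL << 30,  16ULL << 20 },
        };
        /* invented stand-in for btrfs_calc_avail_data_space(): unallocated
         * device space already reduced to data bytes for the data profile */
        uint64_t unalloc_data = 6ULL << 30;

        uint64_t total_alloc = 0, total_used = 0, total_free_data = 0;
        for (unsigned i = 0; i < sizeof(infos) / sizeof(infos[0]); i++) {
                if (infos[i].is_data) {
                        total_free_data += infos[i].total_bytes - infos[i].bytes_used;
                        total_used += infos[i].bytes_used;
                } else {
                        /* metadata/system chunks are shown as used in df */
                        total_used += infos[i].total_bytes;
                }
                total_alloc += infos[i].total_bytes;
        }

        uint64_t f_blocks = total_alloc + unalloc_data;
        uint64_t f_bavail = total_free_data + unalloc_data;
        uint64_t f_bfree  = f_blocks - total_used;

        printf("size %llu MiB, used %llu MiB, free %llu MiB, avail %llu MiB\n",
               (unsigned long long)(f_blocks >> 20),
               (unsigned long long)(total_used >> 20),
               (unsigned long long)(f_bfree >> 20),
               (unsigned long long)(f_bavail >> 20));
        return 0;
}

With these made-up numbers it prints size 8192 MiB, used 1536 MiB, free 6656 MiB, avail 6656 MiB; @used + @avail equals @size, which is the property the changelog aims for.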
On 12/16/2014 03:30 AM, Dongsheng Yang wrote: > > Hi Robert, thanx for your proposal about this. > > IMHO, output of df command shoud be more friendly to user. > Well, I think we have a disagreement on this point, let's take a look at > what the zfs is doing. > > /dev/sda7- 10G > /dev/sda8- 10G > # zpool create myzpool mirror /dev/sda7 /dev/sda8 -f > # df -h /myzpool/ > Filesystem Size Used Avail Use% Mounted on > myzpool 9.8G 21K 9.8G 1% /myzpool > > That said that df command should tell user the space info they can see. > It means the output is the information from the FS level rather than > device level or _storage_manager level. That's great for ZFS, but ZFS isn't BTRFS. ZFS can't get caught halfway between existing modailties and sit there forever. ZFS doesn't restructure itself. So the simple answer you want isn't _possible_ outside very simple cases. So again, you've displayed a _simple_ case as if it covers or addresses all the complex cases. (I don't have the storage to actually do the exercise) But what do you propose the correct answer is for the following case: /dev/sda - 7.5G /dev/sdb - 7.5G mkfs.btrfs /dev/sd{a,b} -d raid0 mount /dev/sda /mnt dd if=/dev/urandom of=/mnt/consumed bs=1G count=7 btrfsck balance start -dconvert=raid1 -dlimit=1 /mnt (wait) /bin/df The filesystem is now in a superposition where all future blocks are going to be written as raid1, one 2G stripe has been converted into two two-gig stripes that have been mirrored, and six gig is still RAID0. In your proposal we now have @size=7G @used=??? (clearly 7G, at the least, is consumed) @filesize[consumed]=7G @available is really messed up since there is now _probably_ 1G of one of the original raid0 extents with free space and so available, almost all of the single RAID1 metadata block, Room for three more metadata stripes, and room for one more RAID1 extent. so @available=2-ish gigs. But since statfs() pivots on @size and @available /bin/df is going to report @used as 3-ish gigs even though we've got an uncompressed and uncompressable @7G file. NOW waht if we went the _other_ way? /dev/sda - 7.5G /dev/sdb - 7.5G mkfs.btrfs /dev/sd{a,b} -d raid1 mount /dev/sda /mnt dd if=/dev/urandom of=/mnt/consumed bs=1G count=7 btrfsck balance start -dconvert=raid0 -dlimit=1 /mnt (wait) /bin/df filesystem is _full_ when the convert starts. @size=14Gig @used=7G @actual_available=0 @reported_available=??? (at least 2x1G extents are up for grabs so minimum 2G) @reported_used=??? @calculated_used=??? We are either back to reporting available space when non-trivial allocation will report ENOSPC (if I did the math right etc). Now do partial conversions to other formats and repeat the exercise. Now add or remove storage here-or-there. The "working set" and the "current model" are not _required_ to be in harmony at any given time, so trying to analyze the working set based on the current model is NP-complete. In every modality we find that at some point we _can_ either report 0 available and still have room, or we report non-zero available and the user's going to get an ENOSPC. So fine, _IF_ btrfs disallowed conversion and fixed its overhead -- that is if it stopped bing btrfs -- we could safely report cooked numbers. But at that point, why BTRFS at all? As for being easier on the user, that just depends on which lie each user wants. 
People are smart enough to use a fair approximation, and the only fair approximation we have available are at the storage management level -- which is halfway between the raw blocks and the cooked file-system numbers. A BTRFS filesystem just _isn't_ fully cooked until the last data extent is allocated, and because of COW that's over-cooked as well. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 12/17/2014 03:52 AM, Robert White wrote: > On 12/16/2014 03:30 AM, Dongsheng Yang wrote: >> >> Hi Robert, thanx for your proposal about this. >> >> IMHO, output of df command shoud be more friendly to user. >> Well, I think we have a disagreement on this point, let's take a look at >> what the zfs is doing. >> >> /dev/sda7- 10G >> /dev/sda8- 10G >> # zpool create myzpool mirror /dev/sda7 /dev/sda8 -f >> # df -h /myzpool/ >> Filesystem Size Used Avail Use% Mounted on >> myzpool 9.8G 21K 9.8G 1% /myzpool >> >> That said that df command should tell user the space info they can see. >> It means the output is the information from the FS level rather than >> device level or _storage_manager level. > > That's great for ZFS, but ZFS isn't BTRFS. ZFS can't get caught > halfway between existing modailties and sit there forever. ZFS doesn't > restructure itself. So the simple answer you want isn't _possible_ > outside very simple cases. > > So again, you've displayed a _simple_ case as if it covers or > addresses all the complex cases. > > (I don't have the storage to actually do the exercise) But what do you > propose the correct answer is for the following case: > > > /dev/sda - 7.5G > /dev/sdb - 7.5G > > mkfs.btrfs /dev/sd{a,b} -d raid0 > mount /dev/sda /mnt > dd if=/dev/urandom of=/mnt/consumed bs=1G count=7 > btrfsck balance start -dconvert=raid1 -dlimit=1 /mnt > (wait) > /bin/df > > > The filesystem is now in a superposition where all future blocks are > going to be written as raid1, one 2G stripe has been converted into > two two-gig stripes that have been mirrored, and six gig is still RAID0. > Similar with your mail about inline file case, I think this is another thing. I really appreciate your persistence and so much analyse in semasiology. I think I will never convince you at this point even with thousands mails to express myself. What about implementing your proposal here and sending a patchset at least with RFC. Then it can be more clear to us and we can make the choice more easily. :) Again, below is all points from my side: The df command we discussed here is on the top level which is directly facing the user. 1), For a linux user, he does not care about the detail how the data is stored in devices. They do not care even not know what's Single, what does DUP mean, and how a fs implement the RAID10. What they want to know is *what is the size of the filesystem I am using and how much space is still available to me*. That's what I said by "FS space level" 2). For a btrfs user, they know about the single, dup and RAIDX. When they want to know what's the raid level in each space info, they can use btrfs fi df to print the information they want. 3). Device level. for debugging. Sometimes, you need to know how the each chunk is stored in device. Please use btrfs-debug-tree to show details you want as more as possible. 4). df in ZFS is showing the FS space information. 5). For the elder btrfs_statfs(), We have discussed about df command and chosen to hide the detail information to user. And only show the FS space information to user. Current btrfs_statfs() is working like this. 6). IIUC, you are saying that: a). BTRFS is not ZFS, df in zfs is not referenced to us. b). Current btrfs_statfs() is woring in a wrong way, we need to reimplement it from another new view point. I am not sure your proposal here is not better. As I said above, I am pleased to see a demo of it. If it works better, I am really very happy to agree with you in this argument. 
My patch could be objected, but the problem in current btrfs_statfs() should be fixed by some ways. If you submit a better solution, I am pleased to see btrfs becoming better. :) Thanx Yang > In your proposal we now have > @size=7G > @used=??? (clearly 7G, at the least, is consumed) > @filesize[consumed]=7G > > @available is really messed up since there is now _probably_ 1G of one > of the original raid0 extents with free space and so available, almost > all of the single RAID1 metadata block, Room for three more metadata > stripes, and room for one more RAID1 extent. > > so @available=2-ish gigs. > > But since statfs() pivots on @size and @available /bin/df is going to > report @used as 3-ish gigs even though we've got an uncompressed and > uncompressable @7G file. > > NOW waht if we went the _other_ way? > > /dev/sda - 7.5G > /dev/sdb - 7.5G > > mkfs.btrfs /dev/sd{a,b} -d raid1 > mount /dev/sda /mnt > dd if=/dev/urandom of=/mnt/consumed bs=1G count=7 > btrfsck balance start -dconvert=raid0 -dlimit=1 /mnt > (wait) > /bin/df > > filesystem is _full_ when the convert starts. > > @size=14Gig > @used=7G > @actual_available=0 > @reported_available=??? (at least 2x1G extents are up for grabs so > minimum 2G) > @reported_used=??? > @calculated_used=??? > > We are either back to reporting available space when non-trivial > allocation will report ENOSPC (if I did the math right etc). > > Now do partial conversions to other formats and repeat the exercise. > Now add or remove storage here-or-there. > > The "working set" and the "current model" are not _required_ to be in > harmony at any given time, so trying to analyze the working set based > on the current model is NP-complete. > > In every modality we find that at some point we _can_ either report 0 > available and still have room, or we report non-zero available and the > user's going to get an ENOSPC. > > > So fine, _IF_ btrfs disallowed conversion and fixed its overhead -- > that is if it stopped bing btrfs -- we could safely report cooked > numbers. > > But at that point, why BTRFS at all? > > As for being easier on the user, that just depends on which lie each > user wants. People are smart enough to use a fair approximation, and > the only fair approximation we have available are at the storage > management level -- which is halfway between the raw blocks and the > cooked file-system numbers. > > A BTRFS filesystem just _isn't_ fully cooked until the last data > extent is allocated, and because of COW that's over-cooked as well. > > > . > -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
I don't disagree with the _ideal_ of your patch. I just think that it's impossible to implement it without lying to the user or making things just as bad in a different way. I would _like_ you to be right. But my thing is finding and quantifying failure cases and the entire question is full of fail. This is not an attack on you personally, it's a mismatch between the storage and file system paradigms that we've seen first because we are the first to really blend the two fairly. Here is a completely legal BTRFS working set. (it's a little extreme.) /dev/sda :: '|Sf|Sf|Sp|0f|1f|0p|0p|Mf|Mf|Mp|1p|1.25GiB-unallocated| /dev/sdb :: '|0f|1f|0p|0p|Mp|1p| 4.75GiB-unalloated | Legend p == partial, about half full. f == full, or full enough to treat as full. S == Single allocated chunk 0 == RAID=0 allocated chunk 1 == RAID=1 allocated chunk M == metadata chunk History: This filesystem started out on a single drive, then it's bounced between RAID-0 and RAID-1 at least twice. The owner has _never_ let a conversion finish. Indeed this user has just changed modes a couple times. The current filesystem flag says RAID-1. But we currently have .5GiB of "single" slack, 2GiB of RAID-0 slack, 1GiB of RAID-1 slack, 2GiB of space where a total of 1GiB more RIAD1 extents can be created, and we have 3GiB of space on /dev/sdb that _can_ _not_ be allocated. We have room for 1 more metadata extent on each drive, but if we allocate two more metadat extents on each drive we will burn up 1.25 GiB by reducing it to 0.75GiB. First, a question. Will a BTRFS in RAID1 mode add file data to extents that are in other modes? That is, will the filesystem _use_ the 2.5GiB of available "single" and "RAID0" data? If no, then that's 2.5GiB of "phantom consumption" space that insn't "used" but also isn't usable. The size of the store is 20GiB. The default of 2x10GiB you propose would be 10GiB. But how do you identify the 3GiB "missing" because of the lopsided allocation history? Seem unlikely? The rotten cod example I've given is unlikely. But a more even case is downright common and likely. Say you run a nice old-fashoned MUTT mail-spool. "most" of your files are small enough to live in metadata. You start with one drive. and allocate 2 single-data and 10 metatata (5xDup). Then you add a second drive of equal size. (the metadata just switched to DUP-as-RAID1-alike mode) And then you do a dconvert=raid0. That uneven allocation of metadata will be a 2GiB difference between the two drives forever. So do you shave 2GiB off of your @size? Do you shave @2GiB off your @available? Do you overreport your available by @2GiB and end up _still_ having things "available" when you get your ENOSPC? How about this :: /dev/sda == |Sf|Sf|Mf|Mf|Mf|Mf|Sf|Sf|Sp|Mp|Mp| .5GiB free| /dev/sdb == |10 GiB free | Operator fills his drive, then adds a second one, then _foolishly_ tries to convert it to RAID0 when the power fails. In order to check the FS he boots with no_balance. Then his maintenance window closes and he has to go back into production, at which point he forgets (or isn't allowed) to do the balance. The flags are set but now no more extents can be allocated. Size is 20GiB, slack is 10.5GiB. Operator is about to get ENOSPACE. Yes a balance would fix it, but that's not the question. In the meantime what does your patch report? Or... /dev/sda == |Sf|Sf|Mf|Mf|Mf|Mf|Sf|Sf|Sp|Mp|Mp| .5GiB free| /dev/sdb == |10 GiB free | /dev/sdc == |10 GiB free | Does a -dconvert=raid5 and immediately gets ENOSPC for all the blocks. 
According to the flags we've got 10GiB free... Or we end up with an egregious metadata history from lots of small files and we've got a perfectly fine RAID1 with several GiB of slack but none of that slack is 1GiB contiguous. All the slack has just come from reclaiming metadata. /dev/sda == |Sf|Sf|Mp|Mp|Rx|Rx|Mp|Mp|Rx|Rx|Mp|Mp| N-free slack| (R == reclaimed, e.g. avalable to extent-tree.c for allocation) We have a 1.5GB of "poisoned" space here; it can hold metadata but not data. So is that 1.5 in your @available calculation? How do you mark it up as used. ... And I've been ingoring the Mp(s) completely. What if I've got a good two GiB of partial space in the metadata, but that's all I've got. You write a file of any size and you'll get ENOSPC even though you've got that GiB. Was it in @size? Is it in @avail? ... See you keep giving me these examples where the history of the filesystem is uniform. It was made a certain way and stayed that way. But in real life this sort of thing is going to happen and your patch simply report's a _different_ _wrong_ number. A _friendlier_ wrong number, I'll grant you that, but still wrong. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Robert White posted on Wed, 17 Dec 2014 20:07:27 -0800 as excerpted: > We have room for 1 more metadata extent on each > drive, but if we allocate two more metadat extents on each drive we will > burn up 1.25 GiB by reducing it to 0.75GiB. FWIW, at least the last chunk assignment can be smaller than normal. I believe I've seen it happen both here and on posted reports. The 0.75 GiB could thus be allocated as data if needed. I'm not actually sure how the allocator works with the last few GiB. Some developer comments have hinted that it starts carving smaller chunks before it actually has to, and I could imagine it dropping data chunks to a half gig, then a quarter gig, than 128 MiB, then 64 MiB... as space gets tight, and metadata chunks similarly but of course from a quarter gig down. That I'm not sure about. But I'm quite sure it will actually use the last little bit (provided it can properly fill its raid policy when doing so), as I'm quite sure I've seen it do it. I know for sure it does that in mixed-mode, as I have a 256 MiB mixed- mode dup /boot (and a backup /boot of the same size on the other device, so I can select the one that's booted from the BIOS), and they tend to be fully chunk-allocated. Note that even with mixed-mode, which defaults to metadata-sized-chunks, thus 256 MiB, on a 256 MiB device, by the time overhead, system, and reserve chunks are allocated, there's definitely NOT 256 MiB left for a data/metadata-mixed chunk, so if it couldn't allocate smaller bits it couldn't allocate even ONE chunk, let alone a pair (dup mode). And I think I've seen it happen on my larger (not mixed) filesystems of several GiB as well, tho I don't actually tend to fill them up quite so routinely, so it's more difficult to say for sure.
On Sun, Dec 14, 2014 at 10:06:50PM -0800, Robert White wrote: > ABSTRACT:: Stop being clever, just give the raw values. That's what > you should be doing anyway. There are no other correct values to > give that doesn't blow someone's paradigm somewhere. The trouble is a lot of existing software can't cope with the raw values without configuration changes and access to a bunch of out-of-band data. Nor should it. I thank Robert for providing so many pathological examples in this thread. They illustrate nicely why it's so important to provide adequately cooked values through statvfs, especially for f_bavail! > ITEM #1 :: In my humble opinion (ha ha) the size column should never > change unless you add or remove actual storage. It should > approximate the raw block size of the device on initial creation, > and it should adjust to the size changes that happen when you > semantically resize the filesystem with e.g. btrfs resize. The units for f_blocks and f_bavail should be roughly similar because software does calculate the ratio of those values (i.e. percentage of disk used); however, there is no strong accuracy requirement--it could be off by a few percent, and most software won't care. Some software will start to misbehave if the ratio error is too large, e.g. btrfs-RAID1 reporting the total disk size instead of the stored data size. This makes it necessary to scale f_blocks according to chunk profiles, at least to within a few percent of actual size. One critical rule is that (f_blocks - f_bavail) and (f_blocks - f_bfree) should never be negative, or software will break in assorted ways. > ITEM #4 :: Blocks available to unprivileged users is pretty "iffy" > since unprivileged users cannot write to the filesystem. This datum > doesn't have a "plain reading". I'd start with filesystem total > blocks, then subtract the total blocks used by all trees nodes in > all trees. (e.g. Nodes * 0x1000 or whatever node size is) then shave > off the N superblocks, then subtract the number of blocks already > allocated in data extents. And you're done. Blocks available (f_bavail) should be the intersection of all sets of blocks currently available for allocation by all users (ignoring quota for now). It must be an underestimate, so that ENOSPC implies f_bavail == 0. It is acceptable to report f_bavail = 0 but allow writes and allocations to succeed for a subset of cases (e.g. inline extents, metadata modifications). Historically, filesystems reserved arbitrary quantities of blocks that were available to unprivileged users, but not counted in f_bavail, in order to ensure the relation holds even in corner cases. Existing application software is well adapted to this behavior. Compressing filesystems underestimate free space on the filesystem, in the sense that N blocks written may result in M fewer free blocks where M < N. Compression in filesystems has been well tolerated by application software for years now. ENOSPC when f_bavail > 0 is very bad. 
Depending on how large f_bavail was at the ENOSPC failure, it means: - admins didn't get alerts triggered by a low space threshold and did not take action to avoid ENOSPC - software that garbage-collects based on available space wasn't activated in time to prevent ENOSPC - service applications have been starting transactions they believe they can finish because f_bavail is large enough, but failing when they hit ENOSPC instead f_bavail should be the minimum of the free data space and the free metadata space (counting free space in mixed-mode chunks and non-chunk disk space that can be used by the current profile in both sets). This provides a more accurate indication of the amount of free space in cases where metadata blocks are more scarce than data blocks. This breaks a common trap for btrfs users who are blindsided by metadata ENOSPC with gigabytes "free" in df. f_bavail should count only free space in chunks that are already allocated in the current allocation profile, or unallocated space that _could_ become a chunk of the currently selected profile (separately for data and metadata, then the lower of the two numbers kept). Free space not allocated to chunks could be difficult to measure precisely, so guess low. (Aside: would it make sense to have btrfs preallocate all available space just to make this calculation easier? We can now reallocate empty chunks on demand, so one reason _not_ to do this is already gone) Chunks of non-current profiles (e.g. raid0 when the current profile is raid1) should not be counted in f_bavail because such space can only become available for use after a balance/convert operation. Such space should appear in f_bavail after that operation occurs. This means that f_bavail will hit zero long before writes to the filesystem start to fail due to lack of space, so there will be plenty of warning for the administrator and software before space really runs out. Free metadata space is harder to deal with through statvfs. Currently f_files, f_favail, and f_ffree are all zeros--they could be overloaded to provide an estimate of metadata space separate from the estimate of data space. If the patch isn't taking those cases (and Robert's pathological examples) into account, it should. > Just as EXT before us didn't bother trying to put in a fudge factor > that guessed what percentage of files would end up needing indirect > blocks, we shouldn't be in the business of trying to back-figure > cost-of-storage. That fudge factor is already built into most application software, so it's not required for btrfs. 0.1% - 1% is the usual overhead for many filesystems, and most software adds an order of magnitude above that. 100% overhead (for btrfs-RAID1 with raw statvfs numbers) is well outside the accepted range and will cause problems. > The raw numbers are _more_ useful in many circumstances. The raw > blocks used, for example, will tell me what I need to know for thin > provisioning on other media, for example. Literally nothing else > exposes that sort of information. You'd need more structure in the data than statvfs provides to support that kind of decision, and btrfs has at least two specialized tools to expose it. statvfs is not the appropriate place for raw numbers. The numbers need to be cooked so existing applications can continue to make useful decisions with them without having to introduce awareness of the filesystem into their calculations. Or put another way, the filesystem should not expose its concrete implementation details through such an abstract interface. 
> Just put a prominent notice that the user needs to remember to > factor their choice of redundancy et al into the numbers. Why, when the filesystem knows that choice and can factor it into the numbers for us? Also, struct statvfs has no "prominent notice" field. ;) > (5c) If your metadata rate is different than your data rate, then there > is _absolutely_ no way to _programatically_ predict how the data _might_ > be used, and this is the _default_ usage model. Literally the hardest > model is the normal model. There is actually no predictive solution. So We only need the worst-case prediction for f_bavail, and that is the lower of two numbers that we certainly can calculate. > "Blocks available to unprivileged users" is the only tricky one.[...] > Fortunately (or hopefully) that's not the datum /bin/df usually returns. I think you'll find that f_bavail *is* the datum /bin/df usually returns. If you want f_bfree you need to use 'stat -f' or roll your own df.
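A minimal sketch of the f_bavail rule described above, assuming the four inputs are already known; the function and parameter names are invented for illustration and this is not existing btrfs code. The idea is simply that whichever of the data or metadata pools would run out first is what the user will actually hit.

#include <stdint.h>
#include <stdio.h>

static uint64_t min_u64(uint64_t a, uint64_t b)
{
        return a < b ? a : b;
}

/*
 * free_data_chunks: free bytes in already-allocated chunks of the current
 *                   data profile (wrong-profile chunks are *not* counted)
 * free_meta_chunks: free bytes in current-profile metadata chunks
 * unalloc_as_data:  unallocated device space, expressed as data bytes if
 *                   chunked with the current data profile (guess low)
 * unalloc_as_meta:  the same space expressed as metadata bytes
 */
static uint64_t cooked_f_bavail(uint64_t free_data_chunks,
                                uint64_t free_meta_chunks,
                                uint64_t unalloc_as_data,
                                uint64_t unalloc_as_meta)
{
        uint64_t data_side = free_data_chunks + unalloc_as_data;
        uint64_t meta_side = free_meta_chunks + unalloc_as_meta;

        /* whichever pool runs dry first is what the user actually hits */
        return min_u64(data_side, meta_side);
}

int main(void)
{
        /* invented numbers: plenty of data space, metadata nearly full */
        uint64_t avail = cooked_f_bavail(40ULL << 30, 64ULL << 20, 0, 0);

        printf("f_bavail ~ %llu MiB\n", (unsigned long long)(avail >> 20));
        return 0;
}

With these made-up inputs df would show roughly 64 MiB available even though tens of gigabytes of data space remain, which is exactly the early warning described above for the metadata-limited case.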
On 12/18/2014 12:07 PM, Robert White wrote: > I don't disagree with the _ideal_ of your patch. I just think that > it's impossible to implement it without lying to the user or making > things just as bad in a different way. I would _like_ you to be right. > But my thing is finding and quantifying failure cases and the entire > question is full of fail. > ... > > See you keep giving me these examples where the history of the > filesystem is uniform. It was made a certain way and stayed that way. > But in real life this sort of thing is going to happen and your patch > simply report's a _different_ _wrong_ number. A _friendlier_ wrong > number, I'll grant you that, but still wrong. > Hi Robert, sorry for the late. (It's busy to deal with some other work.) Yes, you are providing some examples here to show that in some cases the numbers from df is not helpful to understand the real space state in disk. But I'm afraid we can not blame df reporting a wrong number. You could say "Hey df, you are stupid, we can store the small file in metadata to exceed your @size.". But he just reports the information from what he is seeing from FS level honestly. He is reporting what he can see, and do not care about the other things out of his business. The all special cases you provided in this thread, actually lead to one result that "Btrfs is special, df command will report an improper in some case.". It means we need some other methods to tell user about the space info. And yes, we had. It's btrfs fi df. You are trying to make df showing information as more as possible, even change the behaviour of df in btrfs to show the numbers of different meaning with what it is in other filesystems. And my answer is keeping df command doing what he want, and solve the problems in our private "df" , btrfs fi df. Thanx Yang > . xaE > -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 12/23/2014 04:31 AM, Dongsheng Yang wrote: > On 12/18/2014 12:07 PM, Robert White wrote: >> I don't disagree with the _ideal_ of your patch. I just think that >> it's impossible to implement it without lying to the user or making >> things just as bad in a different way. I would _like_ you to be right. >> But my thing is finding and quantifying failure cases and the entire >> question is full of fail. >> > ... >> >> See you keep giving me these examples where the history of the >> filesystem is uniform. It was made a certain way and stayed that way. >> But in real life this sort of thing is going to happen and your patch >> simply report's a _different_ _wrong_ number. A _friendlier_ wrong >> number, I'll grant you that, but still wrong. >> > > Hi Robert, sorry for the late. (It's busy to deal with some other work.) > > Yes, you are providing some examples here to show that in some cases the > numbers from df is not helpful to understand the real space state in > disk. But > I'm afraid we can not blame df reporting a wrong number. You could say > "Hey df, you are stupid, we can store the small file in metadata to exceed > your @size.". But he just reports the information from what he is seeing > from FS level honestly. > He is reporting what he can see, and do not care about the other things > out of > his business. > > The all special cases you provided in this thread, actually lead to one > result that > "Btrfs is special, df command will report an improper in some case.". It > means > we need some other methods to tell user about the space info. And yes, > we had. It's > btrfs fi df. > > You are trying to make df showing information as more as possible, even > change > the behaviour of df in btrfs to show the numbers of different meaning > with what it > is in other filesystems. And my answer is keeping df command doing what > he want, > and solve the problems in our private "df" , btrfs fi df. > > Thanx > Yang >> . xaE >> > > Fine. but you still haven't told me/us what you thing df (really fsstat() ) should report in those cases. Go back to that email. Grab some of those cases. Then tell me what your patch will report and why it makes more sense than what is reported now. For code to be correct it _must_ be correct for the "special" cases. You don't get to just ignore the special cases. So what are your specific answers to those cases? -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, Dec 17, 2014 at 08:07:27PM -0800, Robert White wrote: >[...] There are a number of pathological examples in here, but I think there are justifiable correct answers for each of them that emerge from a single interpretation of the meanings of f_bavail, f_blocks, and f_bfree. One gotcha is that some of the numbers required may be difficult to calculate precisely before all space is allocated to chunks; however, some error is tolerable as long as free space is not overestimated. In other words: when in doubt, guess low. statvfs(2) gives us six numbers, three of which are block counts. Very few users or programs ever bother to look at the inode counts (f_files, f_ffree, f_favail), but they could be overloaded for metadata block counts. The f_blocks parameter is mostly irrelevant to application behavior, except to the extent that the ratio between f_bavail and f_blocks is used by applications to calculate a percentage of occupied or free space. f_blocks must always be greater than or equal to f_bavail and f_blocks, and preferably f_blocks would be scaled to use the same effective unit size as f_bavail and f_blocks within a percent or two. Nobody cares about f_bfree since traditionally only root could use the difference between f_bfree and f_bavail. f_bfree is effectively space conditionally available (e.g. if the process euid is root or the process egid matches a configured group id), while f_bavail is space available without conditions (e.g. processes without privilege can use it). The most important number is f_bavail. It's what a bunch of software (archive unpackers, assorted garbage collectors, email MTAs, snapshot remover scripts, download managers, etc) uses to estimate how much space is available without conditions (except quotas, although arguably those should be included too). Applications that are privileged still use the unprivileged f_bavail number so their decisions based on free space don't disrupt unprivileged applications. It's generally better to underestimate than to overestimate f_bavail. Historically filesystems have reserved extra space to avoid various problems in low-disk conditions, and application software has adapted to that well over the years. Also, admin people are more pleasantly surprised when it turns out that they had more space than f_bavail, instead of when they had less. The rule should be: if we have some space, but it is not available for data extents in the current allocation mode, don't add it to f_bavail in statvfs. I think this rule handles all of these examples well. That would mean that we get cases where we add a drive to a full filesystem and it doesn't immediately give you any new f_bavail space. That may be an unexpected result for a naive admin, but much less unexpected than having all the new space show up in f_bavail when it is not available for allocation in the current data profile! Better to have the surprising behavior earlier than later. On to examples... > But a more even case is downright common and likely. Say you run a > nice old-fashoned MUTT mail-spool. "most" of your files are small > enough to live in metadata. You start with one drive. and allocate 2 > single-data and 10 metatata (5xDup). Then you add a second drive of > equal size. (the metadata just switched to DUP-as-RAID1-alike mode) > And then you do a dconvert=raid0. > > That uneven allocation of metadata will be a 2GiB difference between > the two drives forever. > So do you shave 2GiB off of your @size? Yes. 
f_blocks is the total size of all allocated chunks plus all free space allocated by the current data profile. That 2GiB should disappear from such a calculation. > Do you shave @2GiB off your @available? Yes, because it's _not_ available until something changes to make it available (e.g. balance to get rid of the dup metadata, change the metadata profile to dup or single, or change the data profile to single). The 2GiB could be added to f_bfree, but that might still be confusing for people and software. > Do you overreport your available by @2GiB and end up _still_ having > things "available" when you get your ENOSPC? No. ENOSPC when f_bavail > 0 is very bad. Low-available-space admin alerts will not be triggered. Automated mitigation software will not be activated. Service daemons will start transactions they cannot complete. > How about this :: > > /dev/sda == |Sf|Sf|Mf|Mf|Mf|Mf|Sf|Sf|Sp|Mp|Mp| .5GiB free| > /dev/sdb == |10 GiB free | > > Operator fills his drive, then adds a second one, then _foolishly_ > tries to convert it to RAID0 when the power fails. In order to check > the FS he boots with no_balance. Then his maintenance window closes > and he has to go back into production, at which point he forgets (or > isn't allowed) to do the balance. The flags are set but now no more > extents can be allocated. > > Size is 20GiB, slack is 10.5GiB. Operator is about to get ENOSPACE. f_bavail should be 0.5GB or so. Operator is now aware that ENOSPC is imminent, and can report to whoever grants permission to do things that the machine will be continuing to balance outside of the maintenance window. This is much better than the alternative, which is that the lack of available space is detected by application failure outside of a maintenance window. Even better: if f_bavail is reflective of reality, the operator can minimize out-of-window balance time by monitoring f_bavail and pausing the balance when there is enough space to operate without ENOSPC until the next maintenance window. > Yes a balance would fix it, but that's not the question. The question is "how much space is available at this time?" and the correct answer is "almost none," and it stays that way until and unless someone runs a balance, adds more drives, deletes a lot of data, etc. balance changes the way space will be allocated, so it also changes the output of df to match. > In the meantime what does your patch report? It should report that there's almost no available space. If the patch doesn't report that, the patch needs rework. > Or... > > /dev/sda == |Sf|Sf|Mf|Mf|Mf|Mf|Sf|Sf|Sp|Mp|Mp| .5GiB free| > /dev/sdb == |10 GiB free | > /dev/sdc == |10 GiB free | > > Does a -dconvert=raid5 and immediately gets ENOSPC for all the > blocks. According to the flags we've got 10GiB free... 10.5GiB is correct: 1.0GiB in a 3-way RAID5 on sd[abc] (2x 0.5GiB data, 1x 0.5GiB parity), and 9.5GiB in a 2-way RAID5 (1x 9.5GiB data, 1x 9.5GiB parity) on sd[bc]. The current RAID5 implementation might not use space that way, but if that's true, that's arguably a bug in the current RAID5 implementation. If it was -dconvert=raid6, there would be only 0.5GiB in f_bavail (1x 0.5GiB data, 2x 0.5GiB parity) since raid6 requires 3 disks per chunk. > Or we end up with an egregious metadata history from lots of small > files and we've got a perfectly fine RAID1 with several GiB of slack > but none of that slack is 1GiB contiguous. All the slack has just > come from reclaiming metadata. 
> > /dev/sda == |Sf|Sf|Mp|Mp|Rx|Rx|Mp|Mp|Rx|Rx|Mp|Mp| N-free slack| > > (R == reclaimed, e.g. avalable to extent-tree.c for allocation) > > We have a 1.5GB of "poisoned" space here; it can hold metadata but > not data. So is that 1.5 in your @available calculation? How do you > mark it up as used. Overload f_favail to report metadata space maybe? That's kind of ugly, though, and nobody looks at f_favail anyway (or, everyone who looks at f_favail is going to be aware enough to look at btrfs fi df too). The metadata chunk free space could also go in f_bfree: it's available for use under some but not all conditions. In all cases, leave such space out of f_bavail. The space is only available under some conditions, and f_bavail is effectively about space available without conditions. Space allocated to metadata-only chunks should never be reported as available through f_bavail in statvfs. Also note that if free space in metadata chunks could be limiting allocations then it should be reported in f_bavail instead of free space in data chunks. That can be as simple as reporting the lesser of the two numbers when there is no free space on the filesystem left to allocate to chunks. > And I've been ingoring the Mp(s) completely. What if I've got a good > two GiB of partial space in the metadata, but that's all I've got. > You write a file of any size and you'll get ENOSPC even though > you've got that GiB. Was it in @size? Is it in @avail? @size should be the sum of the sizes of all allocated chunks in the filesystem, and all free space as if it was allocated with the current raid profile. It should change if there was unallocated space and the default profile changed, or if there was a balance converting existing chunks to a new profile, or if disks were added or removed. Don't report that GiB of metadata noise in f_bavail. This may mean that f_bavail is zero, but data can still be written in metadata chunks. That's OK--f_bavail is an underestimate. It's OK for the filesystem to report f_bavail = 0 but not ENOSPC-- people like storing bonus data without losing any "available" space, and we really can't know whether that space is available until after we've committed data to it. Since we can't commit in advance, that space isn't unconditionally available, and shouldn't be in f_bavail. It's not OK to report ENOSPC when f_bavail > 0. People hate failing to store data when they appear to have "free" space. > See you keep giving me these examples where the history of the > filesystem is uniform. It was made a certain way and stayed that > way. But in real life this sort of thing is going to happen and your > patch simply report's a _different_ _wrong_ number. A _friendlier_ > wrong number, I'll grant you that, but still wrong. Who cares if the number is wrong, as long as useful decisions can still be made with it? It doesn't have to be byte-accurate in all possible cases. Existing software and admin practice is OK with underreporting free space, but not overreporting it. All the errors should be biased in that direction.
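The 10.5GiB figure for the three-device raid5 example above falls out of a simple greedy calculation over per-device free space: repeatedly take the widest possible stripe, limited by its smallest member, and give one member's worth of each stripe to parity. A small stand-alone illustration of that arithmetic follows; it is not the allocator btrfs actually uses, and min_devs = 2 matches the 2-way raid5 in the example.

#include <stdio.h>
#include <stdlib.h>

static int cmp_desc(const void *a, const void *b)
{
        double x = *(const double *)a, y = *(const double *)b;
        return (x < y) - (x > y);
}

static double raid5_data_capacity(double *free_gib, int ndev, int min_devs)
{
        double data = 0.0;

        for (;;) {
                qsort(free_gib, ndev, sizeof(*free_gib), cmp_desc);

                /* count devices that still have free space */
                int n = 0;
                while (n < ndev && free_gib[n] > 0.0)
                        n++;
                if (n < min_devs)
                        break;

                /* widest possible stripe, limited by its smallest member */
                double slice = free_gib[n - 1];
                data += (n - 1) * slice;     /* one member goes to parity */
                for (int i = 0; i < n; i++)
                        free_gib[i] -= slice;
        }
        return data;
}

int main(void)
{
        double devs[] = { 0.5, 10.0, 10.0 };   /* sda, sdb, sdc free GiB */

        printf("raid5 data capacity: %.1f GiB\n",
               raid5_data_capacity(devs, 3, 2));
        return 0;
}

With 0.5, 10 and 10 GiB free this prints 10.5 GiB: 1.0 GiB from the 3-way stripes plus 9.5 GiB from the 2-way stripes, matching the estimate in the message.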
On 12/31/2014 08:15 AM, Zygo Blaxell wrote:
> On Wed, Dec 17, 2014 at 08:07:27PM -0800, Robert White wrote:
>> [...]
> There are a number of pathological examples in here, but I think there
> are justifiable correct answers for each of them that emerge from a
> single interpretation of the meanings of f_bavail, f_blocks, and f_bfree.
>
> One gotcha is that some of the numbers required may be difficult to
> calculate precisely before all space is allocated to chunks; however,
> some error is tolerable as long as free space is not overestimated.
> In other words: when in doubt, guess low.
>
> statvfs(2) gives us six numbers, three of which are block counts.
> Very few users or programs ever bother to look at the inode counts
> (f_files, f_ffree, f_favail), but they could be overloaded for metadata
> block counts.
>
> The f_blocks parameter is mostly irrelevant to application behavior,
> except to the extent that the ratio between f_bavail and f_blocks is
> used by applications to calculate a percentage of occupied or free space.
> f_blocks must always be greater than or equal to f_bavail and f_bfree,
> and preferably f_blocks would be scaled to use the same effective unit
> size as f_bavail and f_bfree within a percent or two.
>
> Nobody cares about f_bfree since traditionally only root could use the
> difference between f_bfree and f_bavail. f_bfree is effectively space
> conditionally available (e.g. if the process euid is root or the process
> egid matches a configured group id), while f_bavail is space available
> without conditions (e.g. processes without privilege can use it).
>
> The most important number is f_bavail. It's what a bunch of software
> (archive unpackers, assorted garbage collectors, email MTAs, snapshot
> remover scripts, download managers, etc) uses to estimate how much space
> is available without conditions (except quotas, although arguably those
> should be included too). Applications that are privileged still use
> the unprivileged f_bavail number so their decisions based on free space
> don't disrupt unprivileged applications.
>
> It's generally better to underestimate than to overestimate f_bavail.
> Historically filesystems have reserved extra space to avoid various
> problems in low-disk conditions, and application software has adapted
> to that well over the years. Also, admin people are more pleasantly
> surprised when it turns out that they had more space than f_bavail,
> instead of when they had less.
>
> The rule should be: if we have some space, but it is not available for
> data extents in the current allocation mode, don't add it to f_bavail
> in statvfs. I think this rule handles all of these examples well.
>
> That would mean that we get cases where we add a drive to a full
> filesystem and it doesn't immediately give you any new f_bavail space.
> That may be an unexpected result for a naive admin, but much less
> unexpected than having all the new space show up in f_bavail when it
> is not available for allocation in the current data profile! Better
> to have the surprising behavior earlier than later.
>
> On to examples...
>
>> But a more even case is downright common and likely. Say you run a
>> nice old-fashioned MUTT mail-spool. "most" of your files are small
>> enough to live in metadata. You start with one drive and allocate 2
>> single-data and 10 metadata (5xDup). Then you add a second drive of
>> equal size. (the metadata just switched to DUP-as-RAID1-alike mode)
>> And then you do a dconvert=raid0.
>>
>> That uneven allocation of metadata will be a 2GiB difference between
>> the two drives forever.
>> So do you shave 2GiB off of your @size?
> Yes. f_blocks is the total size of all allocated chunks plus all free
> space allocated by the current data profile.

Agreed. This is how my patch is designed.

> That 2GiB should disappear
> from such a calculation.
>
>> Do you shave @2GiB off your @available?
> Yes, because it's _not_ available until something changes to make it
> available (e.g. balance to get rid of the dup metadata, change the
> metadata profile to dup or single, or change the data profile to single).
>
> The 2GiB could be added to f_bfree, but that might still be confusing
> for people and software.
>
>> Do you overreport your available by @2GiB and end up _still_ having
>> things "available" when you get your ENOSPC?
> No. ENOSPC when f_bavail > 0 is very bad.

Yes, it is very bad.

> Low-available-space admin
> alerts will not be triggered. Automated mitigation software will not be
> activated. Service daemons will start transactions they cannot complete.
>
>> How about this ::
>>
>> /dev/sda == |Sf|Sf|Mf|Mf|Mf|Mf|Sf|Sf|Sp|Mp|Mp| .5GiB free|
>> /dev/sdb == |10 GiB free |
>>
>> Operator fills his drive, then adds a second one, then _foolishly_
>> tries to convert it to RAID0 when the power fails. In order to check
>> the FS he boots with no_balance. Then his maintenance window closes
>> and he has to go back into production, at which point he forgets (or
>> isn't allowed) to do the balance. The flags are set but now no more
>> extents can be allocated.
>>
>> Size is 20GiB, slack is 10.5GiB. Operator is about to get ENOSPACE.

I am not clear about this use case. Is the current profile raid0? If so, @available is 10.5G. If raid1, @available is 0.5G.

> f_bavail should be 0.5GB or so. Operator is now aware that ENOSPC is
> imminent, and can report to whoever grants permission to do things that
> the machine will be continuing to balance outside of the maintenance
> window. This is much better than the alternative, which is that the
> lack of available space is detected by application failure outside of
> a maintenance window.
>
> Even better: if f_bavail is reflective of reality, the operator can
> minimize out-of-window balance time by monitoring f_bavail and pausing
> the balance when there is enough space to operate without ENOSPC until
> the next maintenance window.
>
>> Yes a balance would fix it, but that's not the question.
> The question is "how much space is available at this time?" and the
> correct answer is "almost none," and it stays that way until and unless
> someone runs a balance, adds more drives, deletes a lot of data, etc.
>
> balance changes the way space will be allocated, so it also changes
> the output of df to match.
>
>> In the meantime what does your patch report?
> It should report that there's almost no available space.
> If the patch doesn't report that, the patch needs rework.
>
>> Or...
>>
>> /dev/sda == |Sf|Sf|Mf|Mf|Mf|Mf|Sf|Sf|Sp|Mp|Mp| .5GiB free|
>> /dev/sdb == |10 GiB free |
>> /dev/sdc == |10 GiB free |
>>
>> Does a -dconvert=raid5 and immediately gets ENOSPC for all the
>> blocks. According to the flags we've got 10GiB free...
> 10.5GiB is correct: 1.0GiB in a 3-way RAID5 on sd[abc] (2x 0.5GiB
> data, 1x 0.5GiB parity), and 9.5GiB in a 2-way RAID5 (1x 9.5GiB data,
> 1x 9.5GiB parity) on sd[bc]. The current RAID5 implementation might
> not use space that way, but if that's true, that's arguably a bug in
> the current RAID5 implementation.

The current RAID5 allocator does not work like this: it will allocate 10G (0.5G across sd[ab] + 9.5G across sd[bc]). The calculation in statfs() is the same as the calculation in the current allocator, so it will report 10G available. Yes, it would be better if we made the allocator more clever in these cases, but that is another topic, about the allocator.

> If it was -dconvert=raid6, there would be only 0.5GiB in f_bavail (1x
> 0.5GiB data, 2x 0.5GiB parity) since raid6 requires 3 disks per chunk.
>
>> See you keep giving me these examples where the history of the
>> filesystem is uniform. It was made a certain way and stayed that
>> way. But in real life this sort of thing is going to happen and your
>> patch simply reports a _different_ _wrong_ number. A _friendlier_
>> wrong number, I'll grant you that, but still wrong.
> Who cares if the number is wrong, as long as useful decisions can still be
> made with it? It doesn't have to be byte-accurate in all possible cases.
>
> Existing software and admin practice is OK with underreporting free
> space, but not overreporting it. All the errors should be biased in
> that direction.

Thanx Zygo and Robert, I agree that my patch did not cover the situation where block groups are in different raid levels. I will update my patch soon and send it out.

Thanx for your suggestion.
Yang
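As a cross-check of the numbers in this exchange, the following standalone sketch (an illustration only, not the kernel allocator; the free-space figures are taken from the /dev/sd[abc] example above) computes the "ideal" parity-RAID estimate Zygo describes, striping each allocation across every device that still has free space. It prints 10.5 GiB for raid5 and 0.5 GiB for raid6.

/*
 * Ideal parity-RAID free-space estimate: each pass stripes a chunk,
 * sized by the smallest remaining device, across all devices that
 * still have unallocated space; (n - parity) of those stripes carry
 * data.
 */
#include <stdio.h>

static double usable_gib(double *free_gib, int ndevs, int parity)
{
	double data = 0;

	for (;;) {
		int n = 0;
		double smallest = 0;

		for (int i = 0; i < ndevs; i++) {
			if (free_gib[i] <= 0)
				continue;
			if (n == 0 || free_gib[i] < smallest)
				smallest = free_gib[i];
			n++;
		}
		if (n <= parity)	/* need parity + 1 devices per chunk */
			break;
		data += smallest * (n - parity);
		for (int i = 0; i < ndevs; i++)
			if (free_gib[i] > 0)
				free_gib[i] -= smallest;
	}
	return data;
}

int main(void)
{
	double raid5[] = { 0.5, 10.0, 10.0 };	/* sda, sdb, sdc free space */
	double raid6[] = { 0.5, 10.0, 10.0 };

	printf("raid5: %.1f GiB\n", usable_gib(raid5, 3, 1));
	printf("raid6: %.1f GiB\n", usable_gib(raid6, 3, 2));
	return 0;
}

The 10G figure Yang gives for the current allocator corresponds to allocating the same layout two devices at a time (0.5G on sd[ab], then 9.5G on sd[bc]), which is the gap between the two answers discussed above.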
On 12/27/2014 09:10 AM, Robert White wrote:
> On 12/23/2014 04:31 AM, Dongsheng Yang wrote:
>> On 12/18/2014 12:07 PM, Robert White wrote:
...
>> Thanx
>> Yang
>
> Fine, but you still haven't told me/us what you think df (really
> statfs()) should report in those cases.
>
> Go back to that email. Grab some of those cases. Then tell me what
> your patch will report and why it makes more sense than what is
> reported now.
>
> For code to be correct it _must_ be correct for the "special" cases.
> You don't get to just ignore the special cases.
>
> So what are your specific answers to those cases?

Great to hear that we can stop the endless discussion of FS level vs. device level. Then we can come back to discussing the special cases you provided.

I have given some answers in another mail, and made a change in my patch to cover the case of an unfinished balance.

Thanx
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index a84e00d..9954d60 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -8571,7 +8571,6 @@ static u64 __btrfs_get_ro_block_group_free_space(struct list_head *groups_list)
 {
 	struct btrfs_block_group_cache *block_group;
 	u64 free_bytes = 0;
-	int factor;
 
 	list_for_each_entry(block_group, groups_list, list) {
 		spin_lock(&block_group->lock);
@@ -8581,16 +8580,8 @@ static u64 __btrfs_get_ro_block_group_free_space(struct list_head *groups_list)
 			continue;
 		}
 
-		if (block_group->flags & (BTRFS_BLOCK_GROUP_RAID1 |
-					  BTRFS_BLOCK_GROUP_RAID10 |
-					  BTRFS_BLOCK_GROUP_DUP))
-			factor = 2;
-		else
-			factor = 1;
-
-		free_bytes += (block_group->key.offset -
-			       btrfs_block_group_used(&block_group->item)) *
-			       factor;
+		free_bytes += block_group->key.offset -
+			      btrfs_block_group_used(&block_group->item);
 
 		spin_unlock(&block_group->lock);
 	}
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 54bd91e..40f41a2 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -1641,6 +1641,8 @@ static int btrfs_calc_avail_data_space(struct btrfs_root *root, u64 *free_bytes)
 	u64 used_space;
 	u64 min_stripe_size;
 	int min_stripes = 1, num_stripes = 1;
+	/* How many stripes used to store data, without considering mirrors. */
+	int data_stripes = 1;
 	int i = 0, nr_devices;
 	int ret;
 
@@ -1657,12 +1659,15 @@ static int btrfs_calc_avail_data_space(struct btrfs_root *root, u64 *free_bytes)
 	if (type & BTRFS_BLOCK_GROUP_RAID0) {
 		min_stripes = 2;
 		num_stripes = nr_devices;
+		data_stripes = num_stripes;
 	} else if (type & BTRFS_BLOCK_GROUP_RAID1) {
 		min_stripes = 2;
 		num_stripes = 2;
+		data_stripes = 1;
 	} else if (type & BTRFS_BLOCK_GROUP_RAID10) {
 		min_stripes = 4;
 		num_stripes = 4;
+		data_stripes = 2;
 	}
 
 	if (type & BTRFS_BLOCK_GROUP_DUP)
@@ -1733,14 +1738,17 @@ static int btrfs_calc_avail_data_space(struct btrfs_root *root, u64 *free_bytes)
 	i = nr_devices - 1;
 	avail_space = 0;
 	while (nr_devices >= min_stripes) {
-		if (num_stripes > nr_devices)
+		if (num_stripes > nr_devices) {
 			num_stripes = nr_devices;
+			if (type & BTRFS_BLOCK_GROUP_RAID0)
+				data_stripes = num_stripes;
+		}
 
 		if (devices_info[i].max_avail >= min_stripe_size) {
 			int j;
 			u64 alloc_size;
 
-			avail_space += devices_info[i].max_avail * num_stripes;
+			avail_space += devices_info[i].max_avail * data_stripes;
 			alloc_size = devices_info[i].max_avail;
 			for (j = i + 1 - num_stripes; j <= i; j++)
 				devices_info[j].max_avail -= alloc_size;
@@ -1772,15 +1780,13 @@ static int btrfs_calc_avail_data_space(struct btrfs_root *root, u64 *free_bytes)
 static int btrfs_statfs(struct dentry *dentry, struct kstatfs *buf)
 {
 	struct btrfs_fs_info *fs_info = btrfs_sb(dentry->d_sb);
-	struct btrfs_super_block *disk_super = fs_info->super_copy;
 	struct list_head *head = &fs_info->space_info;
 	struct btrfs_space_info *found;
 	u64 total_used = 0;
+	u64 total_alloc = 0;
 	u64 total_free_data = 0;
 	int bits = dentry->d_sb->s_blocksize_bits;
 	__be32 *fsid = (__be32 *)fs_info->fsid;
-	unsigned factor = 1;
-	struct btrfs_block_rsv *block_rsv = &fs_info->global_block_rsv;
 	int ret;
 
 	/*
@@ -1792,38 +1798,20 @@ static int btrfs_statfs(struct dentry *dentry, struct kstatfs *buf)
 	rcu_read_lock();
 	list_for_each_entry_rcu(found, head, list) {
 		if (found->flags & BTRFS_BLOCK_GROUP_DATA) {
-			int i;
-
-			total_free_data += found->disk_total - found->disk_used;
+			total_free_data += found->total_bytes - found->bytes_used;
 			total_free_data -=
 				btrfs_account_ro_block_groups_free_space(found);
-
-			for (i = 0; i < BTRFS_NR_RAID_TYPES; i++) {
-				if (!list_empty(&found->block_groups[i])) {
-					switch (i) {
-					case BTRFS_RAID_DUP:
-					case BTRFS_RAID_RAID1:
-					case BTRFS_RAID_RAID10:
-						factor = 2;
-					}
-				}
-			}
+			total_used += found->bytes_used;
+		} else {
+			/* For metadata and system, we treat the total_bytes as
+			 * not available to file data. So show it as Used in df.
+			 **/
+			total_used += found->total_bytes;
 		}
-
-		total_used += found->disk_used;
+		total_alloc += found->total_bytes;
 	}
-
 	rcu_read_unlock();
 
-	buf->f_blocks = div_u64(btrfs_super_total_bytes(disk_super), factor);
-	buf->f_blocks >>= bits;
-	buf->f_bfree = buf->f_blocks - (div_u64(total_used, factor) >> bits);
-
-	/* Account global block reserve as used, it's in logical size already */
-	spin_lock(&block_rsv->lock);
-	buf->f_bfree -= block_rsv->size >> bits;
-	spin_unlock(&block_rsv->lock);
-
 	buf->f_bavail = total_free_data;
 	ret = btrfs_calc_avail_data_space(fs_info->tree_root, &total_free_data);
 	if (ret) {
@@ -1831,8 +1819,12 @@ static int btrfs_statfs(struct dentry *dentry, struct kstatfs *buf)
 		mutex_unlock(&fs_info->fs_devices->device_list_mutex);
 		return ret;
 	}
-	buf->f_bavail += div_u64(total_free_data, factor);
+	buf->f_bavail += total_free_data;
 	buf->f_bavail = buf->f_bavail >> bits;
+	buf->f_blocks = total_alloc + total_free_data;
+	buf->f_blocks >>= bits;
+	buf->f_bfree = buf->f_blocks - (total_used >> bits);
+
 
 	mutex_unlock(&fs_info->chunk_mutex);
 	mutex_unlock(&fs_info->fs_devices->device_list_mutex);
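For readers skimming the diff, the key change in btrfs_calc_avail_data_space() is the new data_stripes value. The snippet below is a purely illustrative userspace restatement (not kernel code; the 1 GiB per-stripe figure is made up) of how many of the stripes written by each profile carry distinct user data, which is the multiplier the patched code now applies to max_avail:

/*
 * Illustrative only: the data_stripes mapping introduced by the patch.
 * num_stripes is how many stripes an allocation writes; data_stripes is
 * how many of them hold distinct data and therefore count toward the
 * space a user can store.
 */
#include <stdio.h>

int main(void)
{
	struct {
		const char *profile;
		int num_stripes;
		int data_stripes;
	} map[] = {
		{ "single",         1, 1 },
		{ "raid0 (4 devs)", 4, 4 },	/* num_stripes = nr_devices */
		{ "raid1",          2, 1 },
		{ "raid10",         4, 2 },
	};
	unsigned long long per_stripe = 1ULL << 30;	/* assume 1 GiB free per stripe */

	for (int i = 0; i < 4; i++)
		printf("%-15s writes %d stripes, %llu GiB visible to the user\n",
		       map[i].profile, map[i].num_stripes,
		       (per_stripe * map[i].data_stripes) >> 30);
	return 0;
}

DUP and single leave data_stripes at its default of 1, matching the hunks above.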
When function btrfs_statfs() calculates the total size of the fs, it calculates the total size of the disks and then divides it by a factor. But in some use cases the result is not good for the user.

Example:
 # mkfs.btrfs -f /dev/vdf1 /dev/vdf2 -d raid1
 # mount /dev/vdf1 /mnt
 # dd if=/dev/zero of=/mnt/zero bs=1M count=1000
 # df -h /mnt
 Filesystem      Size  Used Avail Use% Mounted on
 /dev/vdf1       3.0G 1018M  1.3G  45% /mnt
 # btrfs fi show /dev/vdf1
 Label: none  uuid: f85d93dc-81f4-445d-91e5-6a5cd9563294
         Total devices 2 FS bytes used 1001.53MiB
         devid    1 size 2.00GiB used 1.85GiB path /dev/vdf1
         devid    2 size 4.00GiB used 1.83GiB path /dev/vdf2

a. df -h should report Size as 2GiB rather than as 3GiB.
   Because this is a 2-device raid1, the limiting factor is devid 1 @ 2GiB.

b. df -h should report Avail as 0.97GiB or less, rather than as 1.3GiB.
      1.85            (the capacity of the allocated chunk)
     -1.018           (the file stored)
     +(2-1.85=0.15)   (the residual capacity of the disks
                       considering a raid1 fs)
     ---------------
    = 0.97

This patch drops the factor entirely and calculates the size observable to the user, without considering which raid level the data is in or what the exact size on disk is.

After this patch is applied:
 # mkfs.btrfs -f /dev/vdf1 /dev/vdf2 -d raid1
 # mount /dev/vdf1 /mnt
 # dd if=/dev/zero of=/mnt/zero bs=1M count=1000
 # df -h /mnt
 Filesystem      Size  Used Avail Use% Mounted on
 /dev/vdf1       2.0G  1.3G  713M  66% /mnt
 # df /mnt
 Filesystem     1K-blocks    Used Available Use% Mounted on
 /dev/vdf1        2097152 1359424    729536  66% /mnt
 # btrfs fi show /dev/vdf1
 Label: none  uuid: e98c1321-645f-4457-b20d-4f41dc1cf2f4
         Total devices 2 FS bytes used 1001.55MiB
         devid    1 size 2.00GiB used 1.85GiB path /dev/vdf1
         devid    2 size 4.00GiB used 1.83GiB path /dev/vdf2

a). The @Size is 2G, as we expected.
b). @Available is 700M = 1.85G - 1.3G + (2G - 1.85G).
c). @Used is changed to 1.3G rather than 1018M as above, because this patch does not treat the free space in the metadata chunk and the system chunk as available to the user. It's true that the user cannot use this space to store data, so it should not be counted as available. At the same time, this keeps @Used + @Available == @Size for the user as far as possible.

Signed-off-by: Dongsheng Yang <yangds.fnst@cn.fujitsu.com>
---
Changelog:
v1 -> v2:
 a. account the total_bytes of metadata chunks and system chunks as used to the user.
 b. set data_stripes to the correct value in RAID0.

 fs/btrfs/extent-tree.c | 13 ++----------
 fs/btrfs/super.c       | 56 ++++++++++++++++++++++----------------------
 2 files changed, 26 insertions(+), 43 deletions(-)