Message ID: 36be817396956bffe981a69ea0b8796c44153fa5.1418203063.git.yangds.fnst@cn.fujitsu.com (mailing list archive)
State: New, archived
On 12/11/2014 09:31 AM, Dongsheng Yang wrote: > When function btrfs_statfs() calculate the tatol size of fs, it is calculating > the total size of disks and then dividing it by a factor. But in some usecase, > the result is not good to user. I am checking it; to me it seems a good improvement. However I noticed that df now doesn't seem to report anymore the space consumed by the meta-data chunk; eg: # I have two disks of 5GB each $ sudo ~/mkfs.btrfs -f -m raid1 -d raid1 /dev/vgtest/disk /dev/vgtest/disk1 $ df -h /mnt/btrfs1/ Filesystem Size Used Avail Use% Mounted on /dev/mapper/vgtest-disk 5.0G 1.1G 4.0G 21% /mnt/btrfs1 $ sudo btrfs fi show Label: none uuid: 884414c6-9374-40af-a5be-3949cdf6ad0b Total devices 2 FS bytes used 640.00KB devid 2 size 5.00GB used 2.01GB path /dev/dm-1 devid 1 size 5.00GB used 2.03GB path /dev/dm-0 $ sudo ./btrfs fi df /mnt/btrfs1/ Data, RAID1: total=1.00GiB, used=512.00KiB Data, single: total=8.00MiB, used=0.00B System, RAID1: total=8.00MiB, used=16.00KiB System, single: total=4.00MiB, used=0.00B Metadata, RAID1: total=1.00GiB, used=112.00KiB Metadata, single: total=8.00MiB, used=0.00B GlobalReserve, single: total=16.00MiB, used=0.00B In this case the filesystem is empty (it was a new filesystem !). However a 1G metadata chunk was already allocated. This is the reasons why the free space is only 4Gb. On my system the ratio metadata/data is 234MB/8.82GB = ~3%, so ignoring the metadata chunk from the free space may not be a big problem. > Example: > # mkfs.btrfs -f /dev/vdf1 /dev/vdf2 -d raid1 > # mount /dev/vdf1 /mnt > # dd if=/dev/zero of=/mnt/zero bs=1M count=1000 > # df -h /mnt > Filesystem Size Used Avail Use% Mounted on > /dev/vdf1 3.0G 1018M 1.3G 45% /mnt > # btrfs fi show /dev/vdf1 > Label: none uuid: f85d93dc-81f4-445d-91e5-6a5cd9563294 > Total devices 2 FS bytes used 1001.53MiB > devid 1 size 2.00GiB used 1.85GiB path /dev/vdf1 > devid 2 size 4.00GiB used 1.83GiB path /dev/vdf2 > a. df -h should report Size as 2GiB rather than as 3GiB. > Because this is 2 device raid1, the limiting factor is devid 1 @2GiB. > > b. df -h should report Avail as 0.97GiB or less, rather than as 1.3GiB. > 1.85 (the capacity of the allocated chunk) > -1.018 (the file stored) > +(2-1.85=0.15) (the residual capacity of the disks > considering a raid1 fs) > --------------- > = 0.97 > > This patch drops the factor at all and calculate the size observable to > user without considering which raid level the data is in and what's the > size exactly in disk. > After this patch applied: > # mkfs.btrfs -f /dev/vdf1 /dev/vdf2 -d raid1 > # mount /dev/vdf1 /mnt > # dd if=/dev/zero of=/mnt/zero bs=1M count=1000 > # df -h /mnt > Filesystem Size Used Avail Use% Mounted on > /dev/vdf1 2.0G 1.3G 713M 66% /mnt > # df /mnt > Filesystem 1K-blocks Used Available Use% Mounted on > /dev/vdf1 2097152 1359424 729536 66% /mnt > # btrfs fi show /dev/vdf1 > Label: none uuid: e98c1321-645f-4457-b20d-4f41dc1cf2f4 > Total devices 2 FS bytes used 1001.55MiB > devid 1 size 2.00GiB used 1.85GiB path /dev/vdf1 > devid 2 size 4.00GiB used 1.83GiB path /dev/vdf2 > a). The @Size is 2G as we expected. > b). @Available is 700M = 1.85G - 1.3G + (2G - 1.85G). > c). @Used is changed to 1.3G rather than 1018M as above. Because > this patch do not treat the free space in metadata chunk > and system chunk as available to user. It's true, user can > not use these space to store data, then it should not be > thought as available. At the same time, it will make the > @Used + @Available == @Size as possible to user. 
> > Signed-off-by: Dongsheng Yang <yangds.fnst@cn.fujitsu.com> > --- > Changelog: > v1 -> v2: > a. account the total_bytes in medadata chunk and > system chunk as used to user. > b. set data_stripes to the correct value in RAID0. > > fs/btrfs/extent-tree.c | 13 ++---------- > fs/btrfs/super.c | 56 ++++++++++++++++++++++---------------------------- > 2 files changed, 26 insertions(+), 43 deletions(-) > > diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c > index a84e00d..9954d60 100644 > --- a/fs/btrfs/extent-tree.c > +++ b/fs/btrfs/extent-tree.c > @@ -8571,7 +8571,6 @@ static u64 __btrfs_get_ro_block_group_free_space(struct list_head *groups_list) > { > struct btrfs_block_group_cache *block_group; > u64 free_bytes = 0; > - int factor; > > list_for_each_entry(block_group, groups_list, list) { > spin_lock(&block_group->lock); > @@ -8581,16 +8580,8 @@ static u64 __btrfs_get_ro_block_group_free_space(struct list_head *groups_list) > continue; > } > > - if (block_group->flags & (BTRFS_BLOCK_GROUP_RAID1 | > - BTRFS_BLOCK_GROUP_RAID10 | > - BTRFS_BLOCK_GROUP_DUP)) > - factor = 2; > - else > - factor = 1; > - > - free_bytes += (block_group->key.offset - > - btrfs_block_group_used(&block_group->item)) * > - factor; > + free_bytes += block_group->key.offset - > + btrfs_block_group_used(&block_group->item); > > spin_unlock(&block_group->lock); > } > diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c > index 54bd91e..40f41a2 100644 > --- a/fs/btrfs/super.c > +++ b/fs/btrfs/super.c > @@ -1641,6 +1641,8 @@ static int btrfs_calc_avail_data_space(struct btrfs_root *root, u64 *free_bytes) > u64 used_space; > u64 min_stripe_size; > int min_stripes = 1, num_stripes = 1; > + /* How many stripes used to store data, without considering mirrors. */ > + int data_stripes = 1; > int i = 0, nr_devices; > int ret; > > @@ -1657,12 +1659,15 @@ static int btrfs_calc_avail_data_space(struct btrfs_root *root, u64 *free_bytes) > if (type & BTRFS_BLOCK_GROUP_RAID0) { > min_stripes = 2; > num_stripes = nr_devices; > + data_stripes = num_stripes; > } else if (type & BTRFS_BLOCK_GROUP_RAID1) { > min_stripes = 2; > num_stripes = 2; > + data_stripes = 1; > } else if (type & BTRFS_BLOCK_GROUP_RAID10) { > min_stripes = 4; > num_stripes = 4; > + data_stripes = 2; > } > > if (type & BTRFS_BLOCK_GROUP_DUP) > @@ -1733,14 +1738,17 @@ static int btrfs_calc_avail_data_space(struct btrfs_root *root, u64 *free_bytes) > i = nr_devices - 1; > avail_space = 0; > while (nr_devices >= min_stripes) { > - if (num_stripes > nr_devices) > + if (num_stripes > nr_devices) { > num_stripes = nr_devices; > + if (type & BTRFS_BLOCK_GROUP_RAID0) > + data_stripes = num_stripes; > + } > > if (devices_info[i].max_avail >= min_stripe_size) { > int j; > u64 alloc_size; > > - avail_space += devices_info[i].max_avail * num_stripes; > + avail_space += devices_info[i].max_avail * data_stripes; > alloc_size = devices_info[i].max_avail; > for (j = i + 1 - num_stripes; j <= i; j++) > devices_info[j].max_avail -= alloc_size; > @@ -1772,15 +1780,13 @@ static int btrfs_calc_avail_data_space(struct btrfs_root *root, u64 *free_bytes) > static int btrfs_statfs(struct dentry *dentry, struct kstatfs *buf) > { > struct btrfs_fs_info *fs_info = btrfs_sb(dentry->d_sb); > - struct btrfs_super_block *disk_super = fs_info->super_copy; > struct list_head *head = &fs_info->space_info; > struct btrfs_space_info *found; > u64 total_used = 0; > + u64 total_alloc = 0; > u64 total_free_data = 0; > int bits = dentry->d_sb->s_blocksize_bits; > __be32 *fsid = (__be32 
*)fs_info->fsid; > - unsigned factor = 1; > - struct btrfs_block_rsv *block_rsv = &fs_info->global_block_rsv; > int ret; > > /* > @@ -1792,38 +1798,20 @@ static int btrfs_statfs(struct dentry *dentry, struct kstatfs *buf) > rcu_read_lock(); > list_for_each_entry_rcu(found, head, list) { > if (found->flags & BTRFS_BLOCK_GROUP_DATA) { > - int i; > - > - total_free_data += found->disk_total - found->disk_used; > + total_free_data += found->total_bytes - found->bytes_used; > total_free_data -= > btrfs_account_ro_block_groups_free_space(found); > - > - for (i = 0; i < BTRFS_NR_RAID_TYPES; i++) { > - if (!list_empty(&found->block_groups[i])) { > - switch (i) { > - case BTRFS_RAID_DUP: > - case BTRFS_RAID_RAID1: > - case BTRFS_RAID_RAID10: > - factor = 2; > - } > - } > - } > + total_used += found->bytes_used; > + } else { > + /* For metadata and system, we treat the total_bytes as > + * not available to file data. So show it as Used in df. > + **/ > + total_used += found->total_bytes; > } > - > - total_used += found->disk_used; > + total_alloc += found->total_bytes; > } > - > rcu_read_unlock(); > > - buf->f_blocks = div_u64(btrfs_super_total_bytes(disk_super), factor); > - buf->f_blocks >>= bits; > - buf->f_bfree = buf->f_blocks - (div_u64(total_used, factor) >> bits); > - > - /* Account global block reserve as used, it's in logical size already */ > - spin_lock(&block_rsv->lock); > - buf->f_bfree -= block_rsv->size >> bits; > - spin_unlock(&block_rsv->lock); > - > buf->f_bavail = total_free_data; > ret = btrfs_calc_avail_data_space(fs_info->tree_root, &total_free_data); > if (ret) { > @@ -1831,8 +1819,12 @@ static int btrfs_statfs(struct dentry *dentry, struct kstatfs *buf) > mutex_unlock(&fs_info->fs_devices->device_list_mutex); > return ret; > } > - buf->f_bavail += div_u64(total_free_data, factor); > + buf->f_bavail += total_free_data; > buf->f_bavail = buf->f_bavail >> bits; > + buf->f_blocks = total_alloc + total_free_data; > + buf->f_blocks >>= bits; > + buf->f_bfree = buf->f_blocks - (total_used >> bits); > + > mutex_unlock(&fs_info->chunk_mutex); > mutex_unlock(&fs_info->fs_devices->device_list_mutex); > >
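A note on the data_stripes counter the quoted hunk introduces: it counts only the stripes that hold distinct file data, since mirror copies add redundancy rather than capacity. Below is a minimal user-space sketch of that idea (illustration only, not kernel code; the profile values are taken from the quoted hunk and the helper name is made up):

#include <stdio.h>
#include <stdint.h>

/*
 * Illustration only: how much file data fits when one allocation consumes
 * `per_dev` GiB on each participating device.  data_stripes counts the
 * stripes that hold distinct data; mirror copies do not add capacity.
 */
static uint64_t data_capacity(uint64_t per_dev, int data_stripes)
{
    return per_dev * data_stripes;
}

int main(void)
{
    struct { const char *profile; int num_stripes; int data_stripes; } p[] = {
        { "RAID0 (2 devices)",  2, 2 },  /* striping only            */
        { "RAID1",              2, 1 },  /* one mirror copy          */
        { "RAID10 (4 devices)", 4, 2 },  /* striped pairs of mirrors */
    };

    for (unsigned i = 0; i < sizeof(p) / sizeof(p[0]); i++)
        printf("%-18s uses %d devices, stores %llu GiB of data per 1 GiB of free space per device\n",
               p[i].profile, p[i].num_stripes,
               (unsigned long long)data_capacity(1, p[i].data_stripes));
    return 0;
}

As far as the quoted diff shows, the old code multiplied per-device free space by num_stripes and relied on a later division by the RAID factor; the patch folds that into data_stripes so no factor is needed afterwards.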
On 12/11/2014 09:31 AM, Dongsheng Yang wrote: > When function btrfs_statfs() calculate the tatol size of fs, it is calculating > the total size of disks and then dividing it by a factor. But in some usecase, > the result is not good to user. I Yang; during my test I discovered an error: $ sudo lvcreate -L +10G -n disk vgtest $ sudo /home/ghigo/mkfs.btrfs -f /dev/vgtest/disk Btrfs v3.17 See http://btrfs.wiki.kernel.org for more information. Turning ON incompat feature 'extref': increased hardlink limit per file to 65536 fs created label (null) on /dev/vgtest/disk nodesize 16384 leafsize 16384 sectorsize 4096 size 10.00GiB $ sudo mount /dev/vgtest/disk /mnt/btrfs1/ $ df /mnt/btrfs1/ Filesystem 1K-blocks Used Available Use% Mounted on /dev/mapper/vgtest-disk 9428992 1069312 8359680 12% /mnt/btrfs1 $ sudo ~/btrfs fi df /mnt/btrfs1/ Data, single: total=8.00MiB, used=256.00KiB System, DUP: total=8.00MiB, used=16.00KiB System, single: total=4.00MiB, used=0.00B Metadata, DUP: total=1.00GiB, used=112.00KiB Metadata, single: total=8.00MiB, used=0.00B GlobalReserve, single: total=16.00MiB, used=0.00B What seems me strange is the 9428992KiB of total disk space as reported by df. About 600MiB are missing ! Without your patch, I got: $ df /mnt/btrfs1/ Filesystem 1K-blocks Used Available Use% Mounted on /dev/mapper/vgtest-disk 10485760 16896 8359680 1% /mnt/btrfs1 > Example: > # mkfs.btrfs -f /dev/vdf1 /dev/vdf2 -d raid1 > # mount /dev/vdf1 /mnt > # dd if=/dev/zero of=/mnt/zero bs=1M count=1000 > # df -h /mnt > Filesystem Size Used Avail Use% Mounted on > /dev/vdf1 3.0G 1018M 1.3G 45% /mnt > # btrfs fi show /dev/vdf1 > Label: none uuid: f85d93dc-81f4-445d-91e5-6a5cd9563294 > Total devices 2 FS bytes used 1001.53MiB > devid 1 size 2.00GiB used 1.85GiB path /dev/vdf1 > devid 2 size 4.00GiB used 1.83GiB path /dev/vdf2 > a. df -h should report Size as 2GiB rather than as 3GiB. > Because this is 2 device raid1, the limiting factor is devid 1 @2GiB. > > b. df -h should report Avail as 0.97GiB or less, rather than as 1.3GiB. > 1.85 (the capacity of the allocated chunk) > -1.018 (the file stored) > +(2-1.85=0.15) (the residual capacity of the disks > considering a raid1 fs) > --------------- > = 0.97 > > This patch drops the factor at all and calculate the size observable to > user without considering which raid level the data is in and what's the > size exactly in disk. > After this patch applied: > # mkfs.btrfs -f /dev/vdf1 /dev/vdf2 -d raid1 > # mount /dev/vdf1 /mnt > # dd if=/dev/zero of=/mnt/zero bs=1M count=1000 > # df -h /mnt > Filesystem Size Used Avail Use% Mounted on > /dev/vdf1 2.0G 1.3G 713M 66% /mnt > # df /mnt > Filesystem 1K-blocks Used Available Use% Mounted on > /dev/vdf1 2097152 1359424 729536 66% /mnt > # btrfs fi show /dev/vdf1 > Label: none uuid: e98c1321-645f-4457-b20d-4f41dc1cf2f4 > Total devices 2 FS bytes used 1001.55MiB > devid 1 size 2.00GiB used 1.85GiB path /dev/vdf1 > devid 2 size 4.00GiB used 1.83GiB path /dev/vdf2 > a). The @Size is 2G as we expected. > b). @Available is 700M = 1.85G - 1.3G + (2G - 1.85G). > c). @Used is changed to 1.3G rather than 1018M as above. Because > this patch do not treat the free space in metadata chunk > and system chunk as available to user. It's true, user can > not use these space to store data, then it should not be > thought as available. At the same time, it will make the > @Used + @Available == @Size as possible to user. > > Signed-off-by: Dongsheng Yang <yangds.fnst@cn.fujitsu.com> > --- > Changelog: > v1 -> v2: > a. 
account the total_bytes in medadata chunk and > system chunk as used to user. > b. set data_stripes to the correct value in RAID0. > > fs/btrfs/extent-tree.c | 13 ++---------- > fs/btrfs/super.c | 56 ++++++++++++++++++++++---------------------------- > 2 files changed, 26 insertions(+), 43 deletions(-) > > diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c > index a84e00d..9954d60 100644 > --- a/fs/btrfs/extent-tree.c > +++ b/fs/btrfs/extent-tree.c > @@ -8571,7 +8571,6 @@ static u64 __btrfs_get_ro_block_group_free_space(struct list_head *groups_list) > { > struct btrfs_block_group_cache *block_group; > u64 free_bytes = 0; > - int factor; > > list_for_each_entry(block_group, groups_list, list) { > spin_lock(&block_group->lock); > @@ -8581,16 +8580,8 @@ static u64 __btrfs_get_ro_block_group_free_space(struct list_head *groups_list) > continue; > } > > - if (block_group->flags & (BTRFS_BLOCK_GROUP_RAID1 | > - BTRFS_BLOCK_GROUP_RAID10 | > - BTRFS_BLOCK_GROUP_DUP)) > - factor = 2; > - else > - factor = 1; > - > - free_bytes += (block_group->key.offset - > - btrfs_block_group_used(&block_group->item)) * > - factor; > + free_bytes += block_group->key.offset - > + btrfs_block_group_used(&block_group->item); > > spin_unlock(&block_group->lock); > } > diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c > index 54bd91e..40f41a2 100644 > --- a/fs/btrfs/super.c > +++ b/fs/btrfs/super.c > @@ -1641,6 +1641,8 @@ static int btrfs_calc_avail_data_space(struct btrfs_root *root, u64 *free_bytes) > u64 used_space; > u64 min_stripe_size; > int min_stripes = 1, num_stripes = 1; > + /* How many stripes used to store data, without considering mirrors. */ > + int data_stripes = 1; > int i = 0, nr_devices; > int ret; > > @@ -1657,12 +1659,15 @@ static int btrfs_calc_avail_data_space(struct btrfs_root *root, u64 *free_bytes) > if (type & BTRFS_BLOCK_GROUP_RAID0) { > min_stripes = 2; > num_stripes = nr_devices; > + data_stripes = num_stripes; > } else if (type & BTRFS_BLOCK_GROUP_RAID1) { > min_stripes = 2; > num_stripes = 2; > + data_stripes = 1; > } else if (type & BTRFS_BLOCK_GROUP_RAID10) { > min_stripes = 4; > num_stripes = 4; > + data_stripes = 2; > } > > if (type & BTRFS_BLOCK_GROUP_DUP) > @@ -1733,14 +1738,17 @@ static int btrfs_calc_avail_data_space(struct btrfs_root *root, u64 *free_bytes) > i = nr_devices - 1; > avail_space = 0; > while (nr_devices >= min_stripes) { > - if (num_stripes > nr_devices) > + if (num_stripes > nr_devices) { > num_stripes = nr_devices; > + if (type & BTRFS_BLOCK_GROUP_RAID0) > + data_stripes = num_stripes; > + } > > if (devices_info[i].max_avail >= min_stripe_size) { > int j; > u64 alloc_size; > > - avail_space += devices_info[i].max_avail * num_stripes; > + avail_space += devices_info[i].max_avail * data_stripes; > alloc_size = devices_info[i].max_avail; > for (j = i + 1 - num_stripes; j <= i; j++) > devices_info[j].max_avail -= alloc_size; > @@ -1772,15 +1780,13 @@ static int btrfs_calc_avail_data_space(struct btrfs_root *root, u64 *free_bytes) > static int btrfs_statfs(struct dentry *dentry, struct kstatfs *buf) > { > struct btrfs_fs_info *fs_info = btrfs_sb(dentry->d_sb); > - struct btrfs_super_block *disk_super = fs_info->super_copy; > struct list_head *head = &fs_info->space_info; > struct btrfs_space_info *found; > u64 total_used = 0; > + u64 total_alloc = 0; > u64 total_free_data = 0; > int bits = dentry->d_sb->s_blocksize_bits; > __be32 *fsid = (__be32 *)fs_info->fsid; > - unsigned factor = 1; > - struct btrfs_block_rsv *block_rsv = &fs_info->global_block_rsv; > 
int ret; > > /* > @@ -1792,38 +1798,20 @@ static int btrfs_statfs(struct dentry *dentry, struct kstatfs *buf) > rcu_read_lock(); > list_for_each_entry_rcu(found, head, list) { > if (found->flags & BTRFS_BLOCK_GROUP_DATA) { > - int i; > - > - total_free_data += found->disk_total - found->disk_used; > + total_free_data += found->total_bytes - found->bytes_used; > total_free_data -= > btrfs_account_ro_block_groups_free_space(found); > - > - for (i = 0; i < BTRFS_NR_RAID_TYPES; i++) { > - if (!list_empty(&found->block_groups[i])) { > - switch (i) { > - case BTRFS_RAID_DUP: > - case BTRFS_RAID_RAID1: > - case BTRFS_RAID_RAID10: > - factor = 2; > - } > - } > - } > + total_used += found->bytes_used; > + } else { > + /* For metadata and system, we treat the total_bytes as > + * not available to file data. So show it as Used in df. > + **/ > + total_used += found->total_bytes; > } > - > - total_used += found->disk_used; > + total_alloc += found->total_bytes; > } > - > rcu_read_unlock(); > > - buf->f_blocks = div_u64(btrfs_super_total_bytes(disk_super), factor); > - buf->f_blocks >>= bits; > - buf->f_bfree = buf->f_blocks - (div_u64(total_used, factor) >> bits); > - > - /* Account global block reserve as used, it's in logical size already */ > - spin_lock(&block_rsv->lock); > - buf->f_bfree -= block_rsv->size >> bits; > - spin_unlock(&block_rsv->lock); > - > buf->f_bavail = total_free_data; > ret = btrfs_calc_avail_data_space(fs_info->tree_root, &total_free_data); > if (ret) { > @@ -1831,8 +1819,12 @@ static int btrfs_statfs(struct dentry *dentry, struct kstatfs *buf) > mutex_unlock(&fs_info->fs_devices->device_list_mutex); > return ret; > } > - buf->f_bavail += div_u64(total_free_data, factor); > + buf->f_bavail += total_free_data; > buf->f_bavail = buf->f_bavail >> bits; > + buf->f_blocks = total_alloc + total_free_data; > + buf->f_blocks >>= bits; > + buf->f_bfree = buf->f_blocks - (total_used >> bits); > + > mutex_unlock(&fs_info->chunk_mutex); > mutex_unlock(&fs_info->fs_devices->device_list_mutex); > >
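Working backwards from the two reports above, and consistent with Yang's explanation later in the thread, the gap appears to be exactly the second copy of the DUP chunks. A stand-alone arithmetic check (not kernel code; all figures are KiB copied from the quoted output):

#include <stdio.h>
#include <stdint.h>

/*
 * Reproduce the df numbers above from the "btrfs fi df" output, using the
 * accounting the patch introduces.  Arithmetic only, not kernel code.
 */
int main(void)
{
    uint64_t device    = 10485760;                 /* 10 GiB LV                       */
    uint64_t data_tot  = 8192, data_used = 256;    /* Data, single                    */
    uint64_t meta_tot  = 1048576 + 8192;           /* Metadata DUP + single (logical) */
    uint64_t sys_tot   = 8192 + 4096;              /* System DUP + single (logical)   */
    /* raw space consumed on the device: each DUP chunk is stored twice */
    uint64_t on_disk   = data_tot + 2 * 1048576 + 8192 + 2 * 8192 + 4096;
    uint64_t unalloc   = device - on_disk;         /* 8351744 KiB unallocated         */

    printf("size  = %llu KiB (df: 9428992)\n",
           (unsigned long long)(data_tot + meta_tot + sys_tot + unalloc));
    printf("avail = %llu KiB (df: 8359680)\n",
           (unsigned long long)(data_tot - data_used + unalloc));
    printf("used  = %llu KiB (df: 1069312)\n",
           (unsigned long long)(data_used + meta_tot + sys_tot));
    printf("hidden DUP copies = %llu KiB\n",
           (unsigned long long)(device - (data_tot + meta_tot + sys_tot + unalloc)));
    return 0;
}

By this arithmetic the hidden space is 1056768 KiB (the 1 GiB metadata DUP mirror plus the 8 MiB system DUP mirror), so roughly 1 GiB rather than ~600 MiB, and the three df columns come out exactly as reported.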
Goffredo Baroncelli posted on Fri, 12 Dec 2014 19:00:20 +0100 as excerpted:

> $ sudo ./btrfs fi df /mnt/btrfs1/
> Data, RAID1: total=1.00GiB, used=512.00KiB
> Data, single: total=8.00MiB, used=0.00B
> System, RAID1: total=8.00MiB, used=16.00KiB
> System, single: total=4.00MiB, used=0.00B
> Metadata, RAID1: total=1.00GiB, used=112.00KiB
> Metadata, single: total=8.00MiB, used=0.00B
> GlobalReserve, single: total=16.00MiB, used=0.00B
>
> In this case the filesystem is empty (it was a new filesystem !).
> However a 1G metadata chunk was already allocated. This is the reasons
> why the free space is only 4Gb.

Trivial(?) correction.

Metadata chunks are quarter-gig, not 1 gig. So that's 4 quarter-gig metadata chunks allocated, not a (one/single) 1-gig metadata chunk.

> On my system the ratio metadata/data is 234MB/8.82GB = ~3%, so ignoring
> the metadata chunk from the free space may not be a big problem.

Presumably your use-case is primarily reasonably large files; too large for their data to be tucked directly into metadata instead of allocating an extent from a data chunk.

That's not always going to be the case. And given the multi-device default allocation of raid1 metadata, single data, files small enough to fit into metadata have a default size effect of double their actual size! (Tho it can be noted that given btrfs' 4 KiB standard block size, without metadata packing there'd still be an outsized effect for files smaller than half that, 2 KiB or under, but there it'd be in data chunks, not metadata.)
On 12/13/2014 02:00 AM, Goffredo Baroncelli wrote: > On 12/11/2014 09:31 AM, Dongsheng Yang wrote: >> When function btrfs_statfs() calculate the tatol size of fs, it is calculating >> the total size of disks and then dividing it by a factor. But in some usecase, >> the result is not good to user. > I am checking it; to me it seems a good improvement. However > I noticed that df now doesn't seem to report anymore the space consumed > by the meta-data chunk; eg: > > # I have two disks of 5GB each > $ sudo ~/mkfs.btrfs -f -m raid1 -d raid1 /dev/vgtest/disk /dev/vgtest/disk1 > > $ df -h /mnt/btrfs1/ > Filesystem Size Used Avail Use% Mounted on > /dev/mapper/vgtest-disk 5.0G 1.1G 4.0G 21% /mnt/btrfs1 > > $ sudo btrfs fi show > Label: none uuid: 884414c6-9374-40af-a5be-3949cdf6ad0b > Total devices 2 FS bytes used 640.00KB > devid 2 size 5.00GB used 2.01GB path /dev/dm-1 > devid 1 size 5.00GB used 2.03GB path /dev/dm-0 > > $ sudo ./btrfs fi df /mnt/btrfs1/ > Data, RAID1: total=1.00GiB, used=512.00KiB > Data, single: total=8.00MiB, used=0.00B > System, RAID1: total=8.00MiB, used=16.00KiB > System, single: total=4.00MiB, used=0.00B > Metadata, RAID1: total=1.00GiB, used=112.00KiB > Metadata, single: total=8.00MiB, used=0.00B > GlobalReserve, single: total=16.00MiB, used=0.00B > > In this case the filesystem is empty (it was a new filesystem !). However a > 1G metadata chunk was already allocated. This is the reasons why the free > space is only 4Gb. Actually, in the original btrfs_statfs(), the space in metadata chunk was also not be considered as available. But you will get 5G in this case with original btrfs_statfs() because there is a bug in it. As I use another implementation to calculate the @available in df command, I did not mention this problem. I can describe it as below. In original btrfs_statfs(), we only consider the free space in data chunk as available. <<<<<<<<<<<<<<<<<<<<<<<<<<<<< list_for_each_entry_rcu(found, head, list) { if (found->flags & BTRFS_BLOCK_GROUP_DATA) { int i; total_free_data += found->disk_total - found->disk_used; >>>>>>>>>>>>>>>>>>>>>>>>>>>>> In the later, we will add the total_free_data to @available in output of df. buf->f_bavail = total_free_data; BUT: This is incorrect! It should be: buf->f_bavail = div_u64(total_free_data, factor); That said this bug will add one more (data_chunk_disk_total - data_chunk_disk_used = 1G) to @available by mistake. Unfortunately, the free space in metadata_chunk is also 1G. Then we will get 5G in this case you provided here. Conclusion: Even in the original btrfs_statfs(), the free space in metadata is also considered as not available. But one bug in it make it to be misunderstood. My new btrfs_statfs() still consider the free space in metadata as not available, furthermore, I add it into @used. > > On my system the ratio metadata/data is 234MB/8.82GB = ~3%, so ignoring the > metadata chunk from the free space may not be a big problem. > > > > > >> Example: >> # mkfs.btrfs -f /dev/vdf1 /dev/vdf2 -d raid1 >> # mount /dev/vdf1 /mnt >> # dd if=/dev/zero of=/mnt/zero bs=1M count=1000 >> # df -h /mnt >> Filesystem Size Used Avail Use% Mounted on >> /dev/vdf1 3.0G 1018M 1.3G 45% /mnt >> # btrfs fi show /dev/vdf1 >> Label: none uuid: f85d93dc-81f4-445d-91e5-6a5cd9563294 >> Total devices 2 FS bytes used 1001.53MiB >> devid 1 size 2.00GiB used 1.85GiB path /dev/vdf1 >> devid 2 size 4.00GiB used 1.83GiB path /dev/vdf2 >> a. df -h should report Size as 2GiB rather than as 3GiB. 
>> Because this is 2 device raid1, the limiting factor is devid 1 @2GiB. >> >> b. df -h should report Avail as 0.97GiB or less, rather than as 1.3GiB. >> 1.85 (the capacity of the allocated chunk) >> -1.018 (the file stored) >> +(2-1.85=0.15) (the residual capacity of the disks >> considering a raid1 fs) >> --------------- >> = 0.97 >> >> This patch drops the factor at all and calculate the size observable to >> user without considering which raid level the data is in and what's the >> size exactly in disk. >> After this patch applied: >> # mkfs.btrfs -f /dev/vdf1 /dev/vdf2 -d raid1 >> # mount /dev/vdf1 /mnt >> # dd if=/dev/zero of=/mnt/zero bs=1M count=1000 >> # df -h /mnt >> Filesystem Size Used Avail Use% Mounted on >> /dev/vdf1 2.0G 1.3G 713M 66% /mnt >> # df /mnt >> Filesystem 1K-blocks Used Available Use% Mounted on >> /dev/vdf1 2097152 1359424 729536 66% /mnt >> # btrfs fi show /dev/vdf1 >> Label: none uuid: e98c1321-645f-4457-b20d-4f41dc1cf2f4 >> Total devices 2 FS bytes used 1001.55MiB >> devid 1 size 2.00GiB used 1.85GiB path /dev/vdf1 >> devid 2 size 4.00GiB used 1.83GiB path /dev/vdf2 >> a). The @Size is 2G as we expected. >> b). @Available is 700M = 1.85G - 1.3G + (2G - 1.85G). >> c). @Used is changed to 1.3G rather than 1018M as above. Because >> this patch do not treat the free space in metadata chunk >> and system chunk as available to user. It's true, user can >> not use these space to store data, then it should not be >> thought as available. At the same time, it will make the >> @Used + @Available == @Size as possible to user. >> >> Signed-off-by: Dongsheng Yang <yangds.fnst@cn.fujitsu.com> >> --- >> Changelog: >> v1 -> v2: >> a. account the total_bytes in medadata chunk and >> system chunk as used to user. >> b. set data_stripes to the correct value in RAID0. >> >> fs/btrfs/extent-tree.c | 13 ++---------- >> fs/btrfs/super.c | 56 ++++++++++++++++++++++---------------------------- >> 2 files changed, 26 insertions(+), 43 deletions(-) >> >> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c >> index a84e00d..9954d60 100644 >> --- a/fs/btrfs/extent-tree.c >> +++ b/fs/btrfs/extent-tree.c >> @@ -8571,7 +8571,6 @@ static u64 __btrfs_get_ro_block_group_free_space(struct list_head *groups_list) >> { >> struct btrfs_block_group_cache *block_group; >> u64 free_bytes = 0; >> - int factor; >> >> list_for_each_entry(block_group, groups_list, list) { >> spin_lock(&block_group->lock); >> @@ -8581,16 +8580,8 @@ static u64 __btrfs_get_ro_block_group_free_space(struct list_head *groups_list) >> continue; >> } >> >> - if (block_group->flags & (BTRFS_BLOCK_GROUP_RAID1 | >> - BTRFS_BLOCK_GROUP_RAID10 | >> - BTRFS_BLOCK_GROUP_DUP)) >> - factor = 2; >> - else >> - factor = 1; >> - >> - free_bytes += (block_group->key.offset - >> - btrfs_block_group_used(&block_group->item)) * >> - factor; >> + free_bytes += block_group->key.offset - >> + btrfs_block_group_used(&block_group->item); >> >> spin_unlock(&block_group->lock); >> } >> diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c >> index 54bd91e..40f41a2 100644 >> --- a/fs/btrfs/super.c >> +++ b/fs/btrfs/super.c >> @@ -1641,6 +1641,8 @@ static int btrfs_calc_avail_data_space(struct btrfs_root *root, u64 *free_bytes) >> u64 used_space; >> u64 min_stripe_size; >> int min_stripes = 1, num_stripes = 1; >> + /* How many stripes used to store data, without considering mirrors. 
*/ >> + int data_stripes = 1; >> int i = 0, nr_devices; >> int ret; >> >> @@ -1657,12 +1659,15 @@ static int btrfs_calc_avail_data_space(struct btrfs_root *root, u64 *free_bytes) >> if (type & BTRFS_BLOCK_GROUP_RAID0) { >> min_stripes = 2; >> num_stripes = nr_devices; >> + data_stripes = num_stripes; >> } else if (type & BTRFS_BLOCK_GROUP_RAID1) { >> min_stripes = 2; >> num_stripes = 2; >> + data_stripes = 1; >> } else if (type & BTRFS_BLOCK_GROUP_RAID10) { >> min_stripes = 4; >> num_stripes = 4; >> + data_stripes = 2; >> } >> >> if (type & BTRFS_BLOCK_GROUP_DUP) >> @@ -1733,14 +1738,17 @@ static int btrfs_calc_avail_data_space(struct btrfs_root *root, u64 *free_bytes) >> i = nr_devices - 1; >> avail_space = 0; >> while (nr_devices >= min_stripes) { >> - if (num_stripes > nr_devices) >> + if (num_stripes > nr_devices) { >> num_stripes = nr_devices; >> + if (type & BTRFS_BLOCK_GROUP_RAID0) >> + data_stripes = num_stripes; >> + } >> >> if (devices_info[i].max_avail >= min_stripe_size) { >> int j; >> u64 alloc_size; >> >> - avail_space += devices_info[i].max_avail * num_stripes; >> + avail_space += devices_info[i].max_avail * data_stripes; >> alloc_size = devices_info[i].max_avail; >> for (j = i + 1 - num_stripes; j <= i; j++) >> devices_info[j].max_avail -= alloc_size; >> @@ -1772,15 +1780,13 @@ static int btrfs_calc_avail_data_space(struct btrfs_root *root, u64 *free_bytes) >> static int btrfs_statfs(struct dentry *dentry, struct kstatfs *buf) >> { >> struct btrfs_fs_info *fs_info = btrfs_sb(dentry->d_sb); >> - struct btrfs_super_block *disk_super = fs_info->super_copy; >> struct list_head *head = &fs_info->space_info; >> struct btrfs_space_info *found; >> u64 total_used = 0; >> + u64 total_alloc = 0; >> u64 total_free_data = 0; >> int bits = dentry->d_sb->s_blocksize_bits; >> __be32 *fsid = (__be32 *)fs_info->fsid; >> - unsigned factor = 1; >> - struct btrfs_block_rsv *block_rsv = &fs_info->global_block_rsv; >> int ret; >> >> /* >> @@ -1792,38 +1798,20 @@ static int btrfs_statfs(struct dentry *dentry, struct kstatfs *buf) >> rcu_read_lock(); >> list_for_each_entry_rcu(found, head, list) { >> if (found->flags & BTRFS_BLOCK_GROUP_DATA) { >> - int i; >> - >> - total_free_data += found->disk_total - found->disk_used; >> + total_free_data += found->total_bytes - found->bytes_used; >> total_free_data -= >> btrfs_account_ro_block_groups_free_space(found); >> - >> - for (i = 0; i < BTRFS_NR_RAID_TYPES; i++) { >> - if (!list_empty(&found->block_groups[i])) { >> - switch (i) { >> - case BTRFS_RAID_DUP: >> - case BTRFS_RAID_RAID1: >> - case BTRFS_RAID_RAID10: >> - factor = 2; >> - } >> - } >> - } >> + total_used += found->bytes_used; >> + } else { >> + /* For metadata and system, we treat the total_bytes as >> + * not available to file data. So show it as Used in df. 
>> + **/ >> + total_used += found->total_bytes; >> } >> - >> - total_used += found->disk_used; >> + total_alloc += found->total_bytes; >> } >> - >> rcu_read_unlock(); >> >> - buf->f_blocks = div_u64(btrfs_super_total_bytes(disk_super), factor); >> - buf->f_blocks >>= bits; >> - buf->f_bfree = buf->f_blocks - (div_u64(total_used, factor) >> bits); >> - >> - /* Account global block reserve as used, it's in logical size already */ >> - spin_lock(&block_rsv->lock); >> - buf->f_bfree -= block_rsv->size >> bits; >> - spin_unlock(&block_rsv->lock); >> - >> buf->f_bavail = total_free_data; >> ret = btrfs_calc_avail_data_space(fs_info->tree_root, &total_free_data); >> if (ret) { >> @@ -1831,8 +1819,12 @@ static int btrfs_statfs(struct dentry *dentry, struct kstatfs *buf) >> mutex_unlock(&fs_info->fs_devices->device_list_mutex); >> return ret; >> } >> - buf->f_bavail += div_u64(total_free_data, factor); >> + buf->f_bavail += total_free_data; >> buf->f_bavail = buf->f_bavail >> bits; >> + buf->f_blocks = total_alloc + total_free_data; >> + buf->f_blocks >>= bits; >> + buf->f_bfree = buf->f_blocks - (total_used >> bits); >> + >> mutex_unlock(&fs_info->chunk_mutex); >> mutex_unlock(&fs_info->fs_devices->device_list_mutex); >> >> > -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
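To make the factor bug described above concrete, here is a toy calculation (invented numbers, not kernel code): for RAID1 data chunks, disk_total and disk_used count raw bytes on both mirrors, so the free data space added to f_bavail has to be divided by the profile factor, which the quoted original code did not do.

#include <stdio.h>
#include <stdint.h>

/*
 * Toy illustration of the factor bug: RAID1/RAID10/DUP data chunks store
 * two copies, so raw free chunk space overstates what the user can write.
 */
int main(void)
{
    uint64_t disk_total = (uint64_t)2 << 30;  /* 2 GiB raw RAID1 data chunk space */
    uint64_t disk_used  = 0;                  /* empty filesystem                 */
    unsigned factor     = 2;                  /* two copies per logical byte      */

    uint64_t raw_free     = disk_total - disk_used;            /* old f_bavail contribution */
    uint64_t logical_free = (disk_total - disk_used) / factor; /* corrected contribution    */

    printf("old f_bavail contribution: %llu MiB\n",
           (unsigned long long)(raw_free >> 20));
    printf("corrected contribution   : %llu MiB\n",
           (unsigned long long)(logical_free >> 20));
    return 0;
}

In Goffredo's 2x5 GiB example that spurious extra 1 GiB happens to be the same size as the allocated metadata chunk, which is why the old numbers looked as if metadata were still counted as free.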
On 12/13/2014 08:50 AM, Duncan wrote:
> Goffredo Baroncelli posted on Fri, 12 Dec 2014 19:00:20 +0100 as
> excerpted:
>
>> $ sudo ./btrfs fi df /mnt/btrfs1/
>> Data, RAID1: total=1.00GiB, used=512.00KiB
>> Data, single: total=8.00MiB, used=0.00B
>> System, RAID1: total=8.00MiB, used=16.00KiB
>> System, single: total=4.00MiB, used=0.00B
>> Metadata, RAID1: total=1.00GiB, used=112.00KiB
>> Metadata, single: total=8.00MiB, used=0.00B
>> GlobalReserve, single: total=16.00MiB, used=0.00B
>>
>> In this case the filesystem is empty (it was a new filesystem !).
>> However a 1G metadata chunk was already allocated. This is the reasons
>> why the free space is only 4Gb.
> Trivial(?) correction.
>
> Metadata chunks are quarter-gig, not 1 gig. So that's 4 quarter-gig
> metadata chunks allocated, not a (one/single) 1-gig metadata chunk.

Sorry, but from my reading of the code, I have to say that in this case it is 1G. Maybe I am wrong, but I would say: yes, the default chunk size for a btrfs smaller than 50G is 256M, but mkfs allocates 1G for the first metadata chunk.

>
>> On my system the ratio metadata/data is 234MB/8.82GB = ~3%, so ignoring
>> the metadata chunk from the free space may not be a big problem.
> Presumably your use-case is primarily reasonably large files; too large
> for their data to be tucked directly into metadata instead of allocating
> an extent from a data chunk.
>
> That's not always going to be the case. And given the multi-device
> default allocation of raid1 metadata, single data, files small enough to
> fit into metadata have a default size effect double their actual size!
> (Tho it can be noted that given btrfs' 4 KiB standard block size, without
> metadata packing there'd still be an outsized effect for files smaller
> than half that, 2 KiB or under, but there it'd be in data chunks, not
> metadata.)
On Sat, Dec 13, 2014 at 3:25 AM, Goffredo Baroncelli <kreijack@inwind.it> wrote: > On 12/11/2014 09:31 AM, Dongsheng Yang wrote: >> When function btrfs_statfs() calculate the tatol size of fs, it is calculating >> the total size of disks and then dividing it by a factor. But in some usecase, >> the result is not good to user. > > > I Yang; during my test I discovered an error: > > $ sudo lvcreate -L +10G -n disk vgtest > $ sudo /home/ghigo/mkfs.btrfs -f /dev/vgtest/disk > Btrfs v3.17 > See http://btrfs.wiki.kernel.org for more information. > > Turning ON incompat feature 'extref': increased hardlink limit per file to 65536 > fs created label (null) on /dev/vgtest/disk > nodesize 16384 leafsize 16384 sectorsize 4096 size 10.00GiB > $ sudo mount /dev/vgtest/disk /mnt/btrfs1/ > $ df /mnt/btrfs1/ > Filesystem 1K-blocks Used Available Use% Mounted on > /dev/mapper/vgtest-disk 9428992 1069312 8359680 12% /mnt/btrfs1 > $ sudo ~/btrfs fi df /mnt/btrfs1/ > Data, single: total=8.00MiB, used=256.00KiB > System, DUP: total=8.00MiB, used=16.00KiB > System, single: total=4.00MiB, used=0.00B > Metadata, DUP: total=1.00GiB, used=112.00KiB > Metadata, single: total=8.00MiB, used=0.00B > GlobalReserve, single: total=16.00MiB, used=0.00B > > What seems me strange is the 9428992KiB of total disk space as reported > by df. About 600MiB are missing ! > > Without your patch, I got: > > $ df /mnt/btrfs1/ > Filesystem 1K-blocks Used Available Use% Mounted on > /dev/mapper/vgtest-disk 10485760 16896 8359680 1% /mnt/btrfs1 > Hi Goffredo, thanx for pointing this. Let me try to show the reason of it. In this case you provided here, the metadata is 1G in DUP. It means we spent 2G device space for it. But from the view point of user, they can only see 9G size for this fs. It's show as below: 1G -----|---------------- Filesystem space (9G) | \ | \ | \ | 2G \ -----|----|----------------- Device space (10G) DUP | The main idea about the new btrfs_statfs() is hiding the detail in Device space and only show the information in the FS space level. Actually, the similar idea is adopted in btrfs fi df. Example: # lvcreate -L +10G -n disk Group0 # mkfs.btrfs -f /dev/Group0/disk # dd if=/dev/zero of=/mnt/data bs=1M count=10240 dd: error writing ‘/mnt/data’: No space left on device 8163+0 records in 8162+0 records out 8558477312 bytes (8.6 GB) copied, 207.565 s, 41.2 MB/s # df /mnt Filesystem 1K-blocks Used Available Use% Mounted on /dev/mapper/Group0-disk 9428992 8382896 2048 100% /mnt # btrfs fi df /mnt Data, single: total=7.97GiB, used=7.97GiB System, DUP: total=8.00MiB, used=16.00KiB System, single: total=4.00MiB, used=0.00B Metadata, DUP: total=1.00GiB, used=8.41MiB Metadata, single: total=8.00MiB, used=0.00B GlobalReserve, single: total=16.00MiB, used=0.00B Now, the all space is used and got a ENOSPC error. But from the output of btrfs fi df. We can only find almost 9G (data 8G + metadata 1G) is used. The difference is that it show the DUP here to make it more clear. Wish this description is clear to you :). If there is anything confusing or sounds incorrect to you, please point it out. 
Thanx Yang > >> Example: >> # mkfs.btrfs -f /dev/vdf1 /dev/vdf2 -d raid1 >> # mount /dev/vdf1 /mnt >> # dd if=/dev/zero of=/mnt/zero bs=1M count=1000 >> # df -h /mnt >> Filesystem Size Used Avail Use% Mounted on >> /dev/vdf1 3.0G 1018M 1.3G 45% /mnt >> # btrfs fi show /dev/vdf1 >> Label: none uuid: f85d93dc-81f4-445d-91e5-6a5cd9563294 >> Total devices 2 FS bytes used 1001.53MiB >> devid 1 size 2.00GiB used 1.85GiB path /dev/vdf1 >> devid 2 size 4.00GiB used 1.83GiB path /dev/vdf2 >> a. df -h should report Size as 2GiB rather than as 3GiB. >> Because this is 2 device raid1, the limiting factor is devid 1 @2GiB. >> >> b. df -h should report Avail as 0.97GiB or less, rather than as 1.3GiB. >> 1.85 (the capacity of the allocated chunk) >> -1.018 (the file stored) >> +(2-1.85=0.15) (the residual capacity of the disks >> considering a raid1 fs) >> --------------- >> = 0.97 >> >> This patch drops the factor at all and calculate the size observable to >> user without considering which raid level the data is in and what's the >> size exactly in disk. >> After this patch applied: >> # mkfs.btrfs -f /dev/vdf1 /dev/vdf2 -d raid1 >> # mount /dev/vdf1 /mnt >> # dd if=/dev/zero of=/mnt/zero bs=1M count=1000 >> # df -h /mnt >> Filesystem Size Used Avail Use% Mounted on >> /dev/vdf1 2.0G 1.3G 713M 66% /mnt >> # df /mnt >> Filesystem 1K-blocks Used Available Use% Mounted on >> /dev/vdf1 2097152 1359424 729536 66% /mnt >> # btrfs fi show /dev/vdf1 >> Label: none uuid: e98c1321-645f-4457-b20d-4f41dc1cf2f4 >> Total devices 2 FS bytes used 1001.55MiB >> devid 1 size 2.00GiB used 1.85GiB path /dev/vdf1 >> devid 2 size 4.00GiB used 1.83GiB path /dev/vdf2 >> a). The @Size is 2G as we expected. >> b). @Available is 700M = 1.85G - 1.3G + (2G - 1.85G). >> c). @Used is changed to 1.3G rather than 1018M as above. Because >> this patch do not treat the free space in metadata chunk >> and system chunk as available to user. It's true, user can >> not use these space to store data, then it should not be >> thought as available. At the same time, it will make the >> @Used + @Available == @Size as possible to user. >> >> Signed-off-by: Dongsheng Yang <yangds.fnst@cn.fujitsu.com> >> --- >> Changelog: >> v1 -> v2: >> a. account the total_bytes in medadata chunk and >> system chunk as used to user. >> b. set data_stripes to the correct value in RAID0. 
>> >> fs/btrfs/extent-tree.c | 13 ++---------- >> fs/btrfs/super.c | 56 ++++++++++++++++++++++---------------------------- >> 2 files changed, 26 insertions(+), 43 deletions(-) >> >> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c >> index a84e00d..9954d60 100644 >> --- a/fs/btrfs/extent-tree.c >> +++ b/fs/btrfs/extent-tree.c >> @@ -8571,7 +8571,6 @@ static u64 __btrfs_get_ro_block_group_free_space(struct list_head *groups_list) >> { >> struct btrfs_block_group_cache *block_group; >> u64 free_bytes = 0; >> - int factor; >> >> list_for_each_entry(block_group, groups_list, list) { >> spin_lock(&block_group->lock); >> @@ -8581,16 +8580,8 @@ static u64 __btrfs_get_ro_block_group_free_space(struct list_head *groups_list) >> continue; >> } >> >> - if (block_group->flags & (BTRFS_BLOCK_GROUP_RAID1 | >> - BTRFS_BLOCK_GROUP_RAID10 | >> - BTRFS_BLOCK_GROUP_DUP)) >> - factor = 2; >> - else >> - factor = 1; >> - >> - free_bytes += (block_group->key.offset - >> - btrfs_block_group_used(&block_group->item)) * >> - factor; >> + free_bytes += block_group->key.offset - >> + btrfs_block_group_used(&block_group->item); >> >> spin_unlock(&block_group->lock); >> } >> diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c >> index 54bd91e..40f41a2 100644 >> --- a/fs/btrfs/super.c >> +++ b/fs/btrfs/super.c >> @@ -1641,6 +1641,8 @@ static int btrfs_calc_avail_data_space(struct btrfs_root *root, u64 *free_bytes) >> u64 used_space; >> u64 min_stripe_size; >> int min_stripes = 1, num_stripes = 1; >> + /* How many stripes used to store data, without considering mirrors. */ >> + int data_stripes = 1; >> int i = 0, nr_devices; >> int ret; >> >> @@ -1657,12 +1659,15 @@ static int btrfs_calc_avail_data_space(struct btrfs_root *root, u64 *free_bytes) >> if (type & BTRFS_BLOCK_GROUP_RAID0) { >> min_stripes = 2; >> num_stripes = nr_devices; >> + data_stripes = num_stripes; >> } else if (type & BTRFS_BLOCK_GROUP_RAID1) { >> min_stripes = 2; >> num_stripes = 2; >> + data_stripes = 1; >> } else if (type & BTRFS_BLOCK_GROUP_RAID10) { >> min_stripes = 4; >> num_stripes = 4; >> + data_stripes = 2; >> } >> >> if (type & BTRFS_BLOCK_GROUP_DUP) >> @@ -1733,14 +1738,17 @@ static int btrfs_calc_avail_data_space(struct btrfs_root *root, u64 *free_bytes) >> i = nr_devices - 1; >> avail_space = 0; >> while (nr_devices >= min_stripes) { >> - if (num_stripes > nr_devices) >> + if (num_stripes > nr_devices) { >> num_stripes = nr_devices; >> + if (type & BTRFS_BLOCK_GROUP_RAID0) >> + data_stripes = num_stripes; >> + } >> >> if (devices_info[i].max_avail >= min_stripe_size) { >> int j; >> u64 alloc_size; >> >> - avail_space += devices_info[i].max_avail * num_stripes; >> + avail_space += devices_info[i].max_avail * data_stripes; >> alloc_size = devices_info[i].max_avail; >> for (j = i + 1 - num_stripes; j <= i; j++) >> devices_info[j].max_avail -= alloc_size; >> @@ -1772,15 +1780,13 @@ static int btrfs_calc_avail_data_space(struct btrfs_root *root, u64 *free_bytes) >> static int btrfs_statfs(struct dentry *dentry, struct kstatfs *buf) >> { >> struct btrfs_fs_info *fs_info = btrfs_sb(dentry->d_sb); >> - struct btrfs_super_block *disk_super = fs_info->super_copy; >> struct list_head *head = &fs_info->space_info; >> struct btrfs_space_info *found; >> u64 total_used = 0; >> + u64 total_alloc = 0; >> u64 total_free_data = 0; >> int bits = dentry->d_sb->s_blocksize_bits; >> __be32 *fsid = (__be32 *)fs_info->fsid; >> - unsigned factor = 1; >> - struct btrfs_block_rsv *block_rsv = &fs_info->global_block_rsv; >> int ret; >> >> /* >> @@ -1792,38 
+1798,20 @@ static int btrfs_statfs(struct dentry *dentry, struct kstatfs *buf) >> rcu_read_lock(); >> list_for_each_entry_rcu(found, head, list) { >> if (found->flags & BTRFS_BLOCK_GROUP_DATA) { >> - int i; >> - >> - total_free_data += found->disk_total - found->disk_used; >> + total_free_data += found->total_bytes - found->bytes_used; >> total_free_data -= >> btrfs_account_ro_block_groups_free_space(found); >> - >> - for (i = 0; i < BTRFS_NR_RAID_TYPES; i++) { >> - if (!list_empty(&found->block_groups[i])) { >> - switch (i) { >> - case BTRFS_RAID_DUP: >> - case BTRFS_RAID_RAID1: >> - case BTRFS_RAID_RAID10: >> - factor = 2; >> - } >> - } >> - } >> + total_used += found->bytes_used; >> + } else { >> + /* For metadata and system, we treat the total_bytes as >> + * not available to file data. So show it as Used in df. >> + **/ >> + total_used += found->total_bytes; >> } >> - >> - total_used += found->disk_used; >> + total_alloc += found->total_bytes; >> } >> - >> rcu_read_unlock(); >> >> - buf->f_blocks = div_u64(btrfs_super_total_bytes(disk_super), factor); >> - buf->f_blocks >>= bits; >> - buf->f_bfree = buf->f_blocks - (div_u64(total_used, factor) >> bits); >> - >> - /* Account global block reserve as used, it's in logical size already */ >> - spin_lock(&block_rsv->lock); >> - buf->f_bfree -= block_rsv->size >> bits; >> - spin_unlock(&block_rsv->lock); >> - >> buf->f_bavail = total_free_data; >> ret = btrfs_calc_avail_data_space(fs_info->tree_root, &total_free_data); >> if (ret) { >> @@ -1831,8 +1819,12 @@ static int btrfs_statfs(struct dentry *dentry, struct kstatfs *buf) >> mutex_unlock(&fs_info->fs_devices->device_list_mutex); >> return ret; >> } >> - buf->f_bavail += div_u64(total_free_data, factor); >> + buf->f_bavail += total_free_data; >> buf->f_bavail = buf->f_bavail >> bits; >> + buf->f_blocks = total_alloc + total_free_data; >> + buf->f_blocks >>= bits; >> + buf->f_bfree = buf->f_blocks - (total_used >> bits); >> + >> mutex_unlock(&fs_info->chunk_mutex); >> mutex_unlock(&fs_info->fs_devices->device_list_mutex); >> >> > > > -- > gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it> > Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5 > -- > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
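Putting the whole df picture together at the "filesystem space" level described above: chunk totals are counted once (logically), metadata and system chunks are reported as used, and unallocated device space is added via the data-space estimate. A rough user-space model follows; the function and field names are invented, and the inputs are approximate MiB figures read off the 2x5 GiB RAID1 report earlier in the thread, so it is a sketch rather than a re-implementation.

#include <stdio.h>
#include <stdint.h>

/*
 * Rough model of how the patched btrfs_statfs() assembles the df values:
 * chunk totals are logical (one copy), metadata/system chunks count as
 * used, and the unallocated-space estimate comes from the allocator model.
 */
struct df_view {
    uint64_t size;   /* f_blocks           */
    uint64_t used;   /* f_blocks - f_bfree */
    uint64_t avail;  /* f_bavail           */
};

static struct df_view model_statfs(uint64_t data_total, uint64_t data_used,
                                   uint64_t meta_total, uint64_t sys_total,
                                   uint64_t est_unallocated)
{
    struct df_view v;

    v.size  = data_total + meta_total + sys_total + est_unallocated;
    v.avail = (data_total - data_used) + est_unallocated;
    v.used  = data_used + meta_total + sys_total;
    return v;
}

int main(void)
{
    /*
     * Approximate MiB figures for the two 5 GiB RAID1 devices reported
     * earlier: ~1032 MiB data chunks, ~1032 MiB metadata chunks, ~12 MiB
     * system chunks, ~3040 MiB usable unallocated space (RAID1-paired).
     */
    struct df_view v = model_statfs(1032, 1, 1032, 12, 3040);

    printf("size=%llu MiB used=%llu MiB avail=%llu MiB\n",
           (unsigned long long)v.size, (unsigned long long)v.used,
           (unsigned long long)v.avail);
    return 0;
}

With these inputs the model lands at roughly 5.0G size, 1.0G used and 4.0G available, close to the df line Goffredo posted (the Used column differs only by rounding).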
Hi, I see another problem on 1 device fs after applying this patch. I set up 30GB system partition: in gdisk key 'i' Partition size: 62914560 sectors (30.0 GiB) -> 62914560 * 512 = 32212254720 = 30.0GiB before applying the patch df -B1 shows Filesystem 1B-blocks Used Available Use% Mounted on /dev/sda3 32212254720 18149384192 12821815296 59% / The total size matches exactly the one reported by gdisk. After applying the patch df -B1 shows Filesystem 1B-blocks Used Available Use% Mounted on /dev/sda3 29761732608 17037860864 12723871744 58% / The total size is 29761732608 bytes now, which is 2337 MiB (2.28GB) smaller then the partition size. IMHO the total size should correspond exactly to the partition size, at least in the case of fs on one device, and anything used by the file system and user should go to the used space. Am I missing something here? Thanks, Grzegorz On Sun, Dec 14, 2014 at 12:27 PM, Grzegorz Kowal <custos.mentis@gmail.com> wrote: > Hi, > > I see another problem on 1 device fs after applying this patch. I set up > 30GB system partition: > > in gdisk key 'i' > Partition size: 62914560 sectors (30.0 GiB) -> 62914560 * 512 = 32212254720 > = 30.0GiB > > before applying the patch df -B1 shows > > Filesystem 1B-blocks Used Available Use% Mounted on > /dev/sda3 32212254720 18149384192 12821815296 59% / > > The total size matches exactly the one reported by gdisk. > > After applying the patch df -B1 shows > > Filesystem 1B-blocks Used Available Use% Mounted on > /dev/sda3 29761732608 17037860864 12723871744 58% / > > The total size is 29761732608 bytes now, which is 2337 MiB (2.28GB) smaller > then the partition size. > > IMHO the total size should correspond exactly to the partition size, at > least in the case of fs on one device, and anything used by the file system > and user should go to the used space. > > Am I missing something here? > > Thanks, > Grzegorz > > > > > > On Sun, Dec 14, 2014 at 9:29 AM, Dongsheng Yang <dongsheng081251@gmail.com> > wrote: >> >> On Sat, Dec 13, 2014 at 3:25 AM, Goffredo Baroncelli <kreijack@inwind.it> >> wrote: >> > On 12/11/2014 09:31 AM, Dongsheng Yang wrote: >> >> When function btrfs_statfs() calculate the tatol size of fs, it is >> >> calculating >> >> the total size of disks and then dividing it by a factor. But in some >> >> usecase, >> >> the result is not good to user. >> > >> > >> > I Yang; during my test I discovered an error: >> > >> > $ sudo lvcreate -L +10G -n disk vgtest >> > $ sudo /home/ghigo/mkfs.btrfs -f /dev/vgtest/disk >> > Btrfs v3.17 >> > See http://btrfs.wiki.kernel.org for more information. >> > >> > Turning ON incompat feature 'extref': increased hardlink limit per file >> > to 65536 >> > fs created label (null) on /dev/vgtest/disk >> > nodesize 16384 leafsize 16384 sectorsize 4096 size 10.00GiB >> > $ sudo mount /dev/vgtest/disk /mnt/btrfs1/ >> > $ df /mnt/btrfs1/ >> > Filesystem 1K-blocks Used Available Use% Mounted on >> > /dev/mapper/vgtest-disk 9428992 1069312 8359680 12% /mnt/btrfs1 >> > $ sudo ~/btrfs fi df /mnt/btrfs1/ >> > Data, single: total=8.00MiB, used=256.00KiB >> > System, DUP: total=8.00MiB, used=16.00KiB >> > System, single: total=4.00MiB, used=0.00B >> > Metadata, DUP: total=1.00GiB, used=112.00KiB >> > Metadata, single: total=8.00MiB, used=0.00B >> > GlobalReserve, single: total=16.00MiB, used=0.00B >> > >> > What seems me strange is the 9428992KiB of total disk space as reported >> > by df. About 600MiB are missing ! 
>> > >> > Without your patch, I got: >> > >> > $ df /mnt/btrfs1/ >> > Filesystem 1K-blocks Used Available Use% Mounted on >> > /dev/mapper/vgtest-disk 10485760 16896 8359680 1% /mnt/btrfs1 >> > >> >> Hi Goffredo, thanx for pointing this. Let me try to show the reason of it. >> >> In this case you provided here, the metadata is 1G in DUP. It means >> we spent 2G device space for it. But from the view point of user, they >> can only see 9G size for this fs. It's show as below: >> 1G >> -----|---------------- Filesystem space (9G) >> | \ >> | \ >> | \ >> | 2G \ >> -----|----|----------------- Device space (10G) >> DUP | >> >> The main idea about the new btrfs_statfs() is hiding the detail in Device >> space and only show the information in the FS space level. >> >> Actually, the similar idea is adopted in btrfs fi df. >> >> Example: >> # lvcreate -L +10G -n disk Group0 >> # mkfs.btrfs -f /dev/Group0/disk >> # dd if=/dev/zero of=/mnt/data bs=1M count=10240 >> dd: error writing ‘/mnt/data’: No space left on device >> 8163+0 records in >> 8162+0 records out >> 8558477312 bytes (8.6 GB) copied, 207.565 s, 41.2 MB/s >> # df /mnt >> Filesystem 1K-blocks Used Available Use% Mounted on >> /dev/mapper/Group0-disk 9428992 8382896 2048 100% /mnt >> # btrfs fi df /mnt >> Data, single: total=7.97GiB, used=7.97GiB >> System, DUP: total=8.00MiB, used=16.00KiB >> System, single: total=4.00MiB, used=0.00B >> Metadata, DUP: total=1.00GiB, used=8.41MiB >> Metadata, single: total=8.00MiB, used=0.00B >> GlobalReserve, single: total=16.00MiB, used=0.00B >> >> Now, the all space is used and got a ENOSPC error. >> But from the output of btrfs fi df. We can only find almost 9G (data >> 8G + metadata 1G) >> is used. The difference is that it show the DUP here to make >> it more clear. >> >> Wish this description is clear to you :). >> >> If there is anything confusing or sounds incorrect to you, please point it >> out. >> >> Thanx >> Yang >> >> > >> >> Example: >> >> # mkfs.btrfs -f /dev/vdf1 /dev/vdf2 -d raid1 >> >> # mount /dev/vdf1 /mnt >> >> # dd if=/dev/zero of=/mnt/zero bs=1M count=1000 >> >> # df -h /mnt >> >> Filesystem Size Used Avail Use% Mounted on >> >> /dev/vdf1 3.0G 1018M 1.3G 45% /mnt >> >> # btrfs fi show /dev/vdf1 >> >> Label: none uuid: f85d93dc-81f4-445d-91e5-6a5cd9563294 >> >> Total devices 2 FS bytes used 1001.53MiB >> >> devid 1 size 2.00GiB used 1.85GiB path /dev/vdf1 >> >> devid 2 size 4.00GiB used 1.83GiB path /dev/vdf2 >> >> a. df -h should report Size as 2GiB rather than as 3GiB. >> >> Because this is 2 device raid1, the limiting factor is devid 1 @2GiB. >> >> >> >> b. df -h should report Avail as 0.97GiB or less, rather than as 1.3GiB. >> >> 1.85 (the capacity of the allocated chunk) >> >> -1.018 (the file stored) >> >> +(2-1.85=0.15) (the residual capacity of the disks >> >> considering a raid1 fs) >> >> --------------- >> >> = 0.97 >> >> >> >> This patch drops the factor at all and calculate the size observable to >> >> user without considering which raid level the data is in and what's the >> >> size exactly in disk. 
>> >> After this patch applied: >> >> # mkfs.btrfs -f /dev/vdf1 /dev/vdf2 -d raid1 >> >> # mount /dev/vdf1 /mnt >> >> # dd if=/dev/zero of=/mnt/zero bs=1M count=1000 >> >> # df -h /mnt >> >> Filesystem Size Used Avail Use% Mounted on >> >> /dev/vdf1 2.0G 1.3G 713M 66% /mnt >> >> # df /mnt >> >> Filesystem 1K-blocks Used Available Use% Mounted on >> >> /dev/vdf1 2097152 1359424 729536 66% /mnt >> >> # btrfs fi show /dev/vdf1 >> >> Label: none uuid: e98c1321-645f-4457-b20d-4f41dc1cf2f4 >> >> Total devices 2 FS bytes used 1001.55MiB >> >> devid 1 size 2.00GiB used 1.85GiB path /dev/vdf1 >> >> devid 2 size 4.00GiB used 1.83GiB path /dev/vdf2 >> >> a). The @Size is 2G as we expected. >> >> b). @Available is 700M = 1.85G - 1.3G + (2G - 1.85G). >> >> c). @Used is changed to 1.3G rather than 1018M as above. Because >> >> this patch do not treat the free space in metadata chunk >> >> and system chunk as available to user. It's true, user can >> >> not use these space to store data, then it should not be >> >> thought as available. At the same time, it will make the >> >> @Used + @Available == @Size as possible to user. >> >> >> >> Signed-off-by: Dongsheng Yang <yangds.fnst@cn.fujitsu.com> >> >> --- >> >> Changelog: >> >> v1 -> v2: >> >> a. account the total_bytes in medadata chunk and >> >> system chunk as used to user. >> >> b. set data_stripes to the correct value in RAID0. >> >> >> >> fs/btrfs/extent-tree.c | 13 ++---------- >> >> fs/btrfs/super.c | 56 >> >> ++++++++++++++++++++++---------------------------- >> >> 2 files changed, 26 insertions(+), 43 deletions(-) >> >> >> >> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c >> >> index a84e00d..9954d60 100644 >> >> --- a/fs/btrfs/extent-tree.c >> >> +++ b/fs/btrfs/extent-tree.c >> >> @@ -8571,7 +8571,6 @@ static u64 >> >> __btrfs_get_ro_block_group_free_space(struct list_head *groups_list) >> >> { >> >> struct btrfs_block_group_cache *block_group; >> >> u64 free_bytes = 0; >> >> - int factor; >> >> >> >> list_for_each_entry(block_group, groups_list, list) { >> >> spin_lock(&block_group->lock); >> >> @@ -8581,16 +8580,8 @@ static u64 >> >> __btrfs_get_ro_block_group_free_space(struct list_head *groups_list) >> >> continue; >> >> } >> >> >> >> - if (block_group->flags & (BTRFS_BLOCK_GROUP_RAID1 | >> >> - BTRFS_BLOCK_GROUP_RAID10 | >> >> - BTRFS_BLOCK_GROUP_DUP)) >> >> - factor = 2; >> >> - else >> >> - factor = 1; >> >> - >> >> - free_bytes += (block_group->key.offset - >> >> - >> >> btrfs_block_group_used(&block_group->item)) * >> >> - factor; >> >> + free_bytes += block_group->key.offset - >> >> + btrfs_block_group_used(&block_group->item); >> >> >> >> spin_unlock(&block_group->lock); >> >> } >> >> diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c >> >> index 54bd91e..40f41a2 100644 >> >> --- a/fs/btrfs/super.c >> >> +++ b/fs/btrfs/super.c >> >> @@ -1641,6 +1641,8 @@ static int btrfs_calc_avail_data_space(struct >> >> btrfs_root *root, u64 *free_bytes) >> >> u64 used_space; >> >> u64 min_stripe_size; >> >> int min_stripes = 1, num_stripes = 1; >> >> + /* How many stripes used to store data, without considering >> >> mirrors. 
*/ >> >> + int data_stripes = 1; >> >> int i = 0, nr_devices; >> >> int ret; >> >> >> >> @@ -1657,12 +1659,15 @@ static int btrfs_calc_avail_data_space(struct >> >> btrfs_root *root, u64 *free_bytes) >> >> if (type & BTRFS_BLOCK_GROUP_RAID0) { >> >> min_stripes = 2; >> >> num_stripes = nr_devices; >> >> + data_stripes = num_stripes; >> >> } else if (type & BTRFS_BLOCK_GROUP_RAID1) { >> >> min_stripes = 2; >> >> num_stripes = 2; >> >> + data_stripes = 1; >> >> } else if (type & BTRFS_BLOCK_GROUP_RAID10) { >> >> min_stripes = 4; >> >> num_stripes = 4; >> >> + data_stripes = 2; >> >> } >> >> >> >> if (type & BTRFS_BLOCK_GROUP_DUP) >> >> @@ -1733,14 +1738,17 @@ static int btrfs_calc_avail_data_space(struct >> >> btrfs_root *root, u64 *free_bytes) >> >> i = nr_devices - 1; >> >> avail_space = 0; >> >> while (nr_devices >= min_stripes) { >> >> - if (num_stripes > nr_devices) >> >> + if (num_stripes > nr_devices) { >> >> num_stripes = nr_devices; >> >> + if (type & BTRFS_BLOCK_GROUP_RAID0) >> >> + data_stripes = num_stripes; >> >> + } >> >> >> >> if (devices_info[i].max_avail >= min_stripe_size) { >> >> int j; >> >> u64 alloc_size; >> >> >> >> - avail_space += devices_info[i].max_avail * >> >> num_stripes; >> >> + avail_space += devices_info[i].max_avail * >> >> data_stripes; >> >> alloc_size = devices_info[i].max_avail; >> >> for (j = i + 1 - num_stripes; j <= i; j++) >> >> devices_info[j].max_avail -= alloc_size; >> >> @@ -1772,15 +1780,13 @@ static int btrfs_calc_avail_data_space(struct >> >> btrfs_root *root, u64 *free_bytes) >> >> static int btrfs_statfs(struct dentry *dentry, struct kstatfs *buf) >> >> { >> >> struct btrfs_fs_info *fs_info = btrfs_sb(dentry->d_sb); >> >> - struct btrfs_super_block *disk_super = fs_info->super_copy; >> >> struct list_head *head = &fs_info->space_info; >> >> struct btrfs_space_info *found; >> >> u64 total_used = 0; >> >> + u64 total_alloc = 0; >> >> u64 total_free_data = 0; >> >> int bits = dentry->d_sb->s_blocksize_bits; >> >> __be32 *fsid = (__be32 *)fs_info->fsid; >> >> - unsigned factor = 1; >> >> - struct btrfs_block_rsv *block_rsv = &fs_info->global_block_rsv; >> >> int ret; >> >> >> >> /* >> >> @@ -1792,38 +1798,20 @@ static int btrfs_statfs(struct dentry *dentry, >> >> struct kstatfs *buf) >> >> rcu_read_lock(); >> >> list_for_each_entry_rcu(found, head, list) { >> >> if (found->flags & BTRFS_BLOCK_GROUP_DATA) { >> >> - int i; >> >> - >> >> - total_free_data += found->disk_total - >> >> found->disk_used; >> >> + total_free_data += found->total_bytes - >> >> found->bytes_used; >> >> total_free_data -= >> >> >> >> btrfs_account_ro_block_groups_free_space(found); >> >> - >> >> - for (i = 0; i < BTRFS_NR_RAID_TYPES; i++) { >> >> - if (!list_empty(&found->block_groups[i])) >> >> { >> >> - switch (i) { >> >> - case BTRFS_RAID_DUP: >> >> - case BTRFS_RAID_RAID1: >> >> - case BTRFS_RAID_RAID10: >> >> - factor = 2; >> >> - } >> >> - } >> >> - } >> >> + total_used += found->bytes_used; >> >> + } else { >> >> + /* For metadata and system, we treat the >> >> total_bytes as >> >> + * not available to file data. So show it as Used >> >> in df. 
>> >> + **/ >> >> + total_used += found->total_bytes; >> >> } >> >> - >> >> - total_used += found->disk_used; >> >> + total_alloc += found->total_bytes; >> >> } >> >> - >> >> rcu_read_unlock(); >> >> >> >> - buf->f_blocks = div_u64(btrfs_super_total_bytes(disk_super), >> >> factor); >> >> - buf->f_blocks >>= bits; >> >> - buf->f_bfree = buf->f_blocks - (div_u64(total_used, factor) >> >> >> bits); >> >> - >> >> - /* Account global block reserve as used, it's in logical size >> >> already */ >> >> - spin_lock(&block_rsv->lock); >> >> - buf->f_bfree -= block_rsv->size >> bits; >> >> - spin_unlock(&block_rsv->lock); >> >> - >> >> buf->f_bavail = total_free_data; >> >> ret = btrfs_calc_avail_data_space(fs_info->tree_root, >> >> &total_free_data); >> >> if (ret) { >> >> @@ -1831,8 +1819,12 @@ static int btrfs_statfs(struct dentry *dentry, >> >> struct kstatfs *buf) >> >> mutex_unlock(&fs_info->fs_devices->device_list_mutex); >> >> return ret; >> >> } >> >> - buf->f_bavail += div_u64(total_free_data, factor); >> >> + buf->f_bavail += total_free_data; >> >> buf->f_bavail = buf->f_bavail >> bits; >> >> + buf->f_blocks = total_alloc + total_free_data; >> >> + buf->f_blocks >>= bits; >> >> + buf->f_bfree = buf->f_blocks - (total_used >> bits); >> >> + >> >> mutex_unlock(&fs_info->chunk_mutex); >> >> mutex_unlock(&fs_info->fs_devices->device_list_mutex); >> >> >> >> >> > >> > >> > -- >> > gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it> >> > Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5 >> > -- >> > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" >> > in >> > the body of a message to majordomo@vger.kernel.org >> > More majordomo info at http://vger.kernel.org/majordomo-info.html >> -- >> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in >> the body of a message to majordomo@vger.kernel.org >> More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
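In outline, the accounting the v2 patch performs in btrfs_statfs() reduces to the arithmetic below. This is only a simplified sketch to make the bookkeeping easier to follow; the struct and function names (space_info_sample, sketch_statfs, unalloc_data) are invented for illustration, and the sketch ignores the locking and the read-only block-group adjustment that the real code applies.

struct space_info_sample {
	int is_data;                     /* BTRFS_BLOCK_GROUP_DATA chunks? */
	unsigned long long total_bytes;  /* chunk space allocated, FS-level view */
	unsigned long long bytes_used;   /* bytes actually used inside those chunks */
};

/*
 * unalloc_data stands in for what btrfs_calc_avail_data_space() returns:
 * device space not yet allocated to any chunk, already reduced to the
 * FS-level view (scaled by data_stripes rather than num_stripes).
 */
static void sketch_statfs(const struct space_info_sample *info, int n,
			  unsigned long long unalloc_data,
			  unsigned long long *size,
			  unsigned long long *used,
			  unsigned long long *avail)
{
	unsigned long long total_alloc = 0, total_used = 0, free_data = 0;
	int i;

	for (i = 0; i < n; i++) {
		if (info[i].is_data) {
			free_data  += info[i].total_bytes - info[i].bytes_used;
			total_used += info[i].bytes_used;
		} else {
			/* metadata and system chunks are counted as used in full */
			total_used += info[i].total_bytes;
		}
		total_alloc += info[i].total_bytes;
	}

	*avail = free_data + unalloc_data;    /* becomes f_bavail */
	*size  = total_alloc + unalloc_data;  /* becomes f_blocks */
	*used  = total_used;                  /* f_bfree = *size - *used */
}

Plugging in the raid1 example from the changelog (about 1.85G of chunks allocated, 1.3G of that counted as used, and roughly 0.15G of device space still unallocated on the limiting device) gives @size = 1.85 + 0.15 = 2.0G and @available = (1.85 - 1.3) + 0.15 = 0.7G, which is the df output quoted above.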
On 12/14/2014 10:32 PM, Grzegorz Kowal wrote: > Hi, > > I see another problem on 1 device fs after applying this patch. I set > up 30GB system partition: > > in gdisk key 'i' > Partition size: 62914560 sectors (30.0 GiB) -> 62914560 * 512 = > 32212254720 = 30.0GiB > > before applying the patch df -B1 shows > > Filesystem 1B-blocks Used Available Use% Mounted on > /dev/sda3 32212254720 18149384192 12821815296 59% / > > The total size matches exactly the one reported by gdisk. > > After applying the patch df -B1 shows > > Filesystem 1B-blocks Used Available Use% Mounted on > /dev/sda3 29761732608 17037860864 12723871744 58% / > > The total size is 29761732608 bytes now, which is 2337 MiB (2.28GB) > smaller then the partition size. Hi Grzegorz, this is very similar with the question discussed in my last mail. I believe the 2.28GB is used for metadata in DUP level. That said, there is a space of almost 4.56GB (2.28*2GB) used for metadata and the metadata is in DUP level. Then user can only see a 2.28GB from the FS space level. 2.28G -----|---------------- Filesystem space (27.72G) | \ | \ | \ | 4.56GB \ -----|----|----------------- Device space (30.0G) DUP | The main idea in new btrfs_statfs() is to consider the space information in a FS level rather than the device space level. Then the @size in df is the size in FS space level. As we all agree to report @size as 5G in the case of "mkfs.btrfs /dev/vdb (5G) /dev/vdc (5G) -d raid1", this means we are agreed to think the df in a FS space without showing the detail in device space. In this time, the total size in FS space is 27.72G, so the @size in df is 27.72G. Does it make sense to you? As there are 2 complaints for the same change of @size in df, I have to say it maybe not so easy to understand. Anyone have some suggestion about it? Thanx Yang > > IMHO the total size should correspond exactly to the partition size, > at least in the case of fs on one device, and anything used by the > file system and user should go to the used space. > > Am I missing something here? > > Thanks, > Grzegorz > > On Sun, Dec 14, 2014 at 12:27 PM, Grzegorz Kowal > <custos.mentis@gmail.com> wrote: >> Hi, >> >> I see another problem on 1 device fs after applying this patch. I set up >> 30GB system partition: >> >> in gdisk key 'i' >> Partition size: 62914560 sectors (30.0 GiB) -> 62914560 * 512 = 32212254720 >> = 30.0GiB >> >> before applying the patch df -B1 shows >> >> Filesystem 1B-blocks Used Available Use% Mounted on >> /dev/sda3 32212254720 18149384192 12821815296 59% / >> >> The total size matches exactly the one reported by gdisk. >> >> After applying the patch df -B1 shows >> >> Filesystem 1B-blocks Used Available Use% Mounted on >> /dev/sda3 29761732608 17037860864 12723871744 58% / >> >> The total size is 29761732608 bytes now, which is 2337 MiB (2.28GB) smaller >> then the partition size. >> >> IMHO the total size should correspond exactly to the partition size, at >> least in the case of fs on one device, and anything used by the file system >> and user should go to the used space. >> >> Am I missing something here? >> >> Thanks, >> Grzegorz >> >> >> >> >> >> On Sun, Dec 14, 2014 at 9:29 AM, Dongsheng Yang <dongsheng081251@gmail.com> >> wrote: >>> On Sat, Dec 13, 2014 at 3:25 AM, Goffredo Baroncelli <kreijack@inwind.it> >>> wrote: >>>> On 12/11/2014 09:31 AM, Dongsheng Yang wrote: >>>>> When function btrfs_statfs() calculate the tatol size of fs, it is >>>>> calculating >>>>> the total size of disks and then dividing it by a factor. 
But in some >>>>> usecase, >>>>> the result is not good to user. >>>> >>>> I Yang; during my test I discovered an error: >>>> >>>> $ sudo lvcreate -L +10G -n disk vgtest >>>> $ sudo /home/ghigo/mkfs.btrfs -f /dev/vgtest/disk >>>> Btrfs v3.17 >>>> See http://btrfs.wiki.kernel.org for more information. >>>> >>>> Turning ON incompat feature 'extref': increased hardlink limit per file >>>> to 65536 >>>> fs created label (null) on /dev/vgtest/disk >>>> nodesize 16384 leafsize 16384 sectorsize 4096 size 10.00GiB >>>> $ sudo mount /dev/vgtest/disk /mnt/btrfs1/ >>>> $ df /mnt/btrfs1/ >>>> Filesystem 1K-blocks Used Available Use% Mounted on >>>> /dev/mapper/vgtest-disk 9428992 1069312 8359680 12% /mnt/btrfs1 >>>> $ sudo ~/btrfs fi df /mnt/btrfs1/ >>>> Data, single: total=8.00MiB, used=256.00KiB >>>> System, DUP: total=8.00MiB, used=16.00KiB >>>> System, single: total=4.00MiB, used=0.00B >>>> Metadata, DUP: total=1.00GiB, used=112.00KiB >>>> Metadata, single: total=8.00MiB, used=0.00B >>>> GlobalReserve, single: total=16.00MiB, used=0.00B >>>> >>>> What seems me strange is the 9428992KiB of total disk space as reported >>>> by df. About 600MiB are missing ! >>>> >>>> Without your patch, I got: >>>> >>>> $ df /mnt/btrfs1/ >>>> Filesystem 1K-blocks Used Available Use% Mounted on >>>> /dev/mapper/vgtest-disk 10485760 16896 8359680 1% /mnt/btrfs1 >>>> >>> Hi Goffredo, thanx for pointing this. Let me try to show the reason of it. >>> >>> In this case you provided here, the metadata is 1G in DUP. It means >>> we spent 2G device space for it. But from the view point of user, they >>> can only see 9G size for this fs. It's show as below: >>> 1G >>> -----|---------------- Filesystem space (9G) >>> | \ >>> | \ >>> | \ >>> | 2G \ >>> -----|----|----------------- Device space (10G) >>> DUP | >>> >>> The main idea about the new btrfs_statfs() is hiding the detail in Device >>> space and only show the information in the FS space level. >>> >>> Actually, the similar idea is adopted in btrfs fi df. >>> >>> Example: >>> # lvcreate -L +10G -n disk Group0 >>> # mkfs.btrfs -f /dev/Group0/disk >>> # dd if=/dev/zero of=/mnt/data bs=1M count=10240 >>> dd: error writing ‘/mnt/data’: No space left on device >>> 8163+0 records in >>> 8162+0 records out >>> 8558477312 bytes (8.6 GB) copied, 207.565 s, 41.2 MB/s >>> # df /mnt >>> Filesystem 1K-blocks Used Available Use% Mounted on >>> /dev/mapper/Group0-disk 9428992 8382896 2048 100% /mnt >>> # btrfs fi df /mnt >>> Data, single: total=7.97GiB, used=7.97GiB >>> System, DUP: total=8.00MiB, used=16.00KiB >>> System, single: total=4.00MiB, used=0.00B >>> Metadata, DUP: total=1.00GiB, used=8.41MiB >>> Metadata, single: total=8.00MiB, used=0.00B >>> GlobalReserve, single: total=16.00MiB, used=0.00B >>> >>> Now, the all space is used and got a ENOSPC error. >>> But from the output of btrfs fi df. We can only find almost 9G (data >>> 8G + metadata 1G) >>> is used. The difference is that it show the DUP here to make >>> it more clear. >>> >>> Wish this description is clear to you :). >>> >>> If there is anything confusing or sounds incorrect to you, please point it >>> out. 
>>> >>> Thanx >>> Yang >>> >>>>> Example: >>>>> # mkfs.btrfs -f /dev/vdf1 /dev/vdf2 -d raid1 >>>>> # mount /dev/vdf1 /mnt >>>>> # dd if=/dev/zero of=/mnt/zero bs=1M count=1000 >>>>> # df -h /mnt >>>>> Filesystem Size Used Avail Use% Mounted on >>>>> /dev/vdf1 3.0G 1018M 1.3G 45% /mnt >>>>> # btrfs fi show /dev/vdf1 >>>>> Label: none uuid: f85d93dc-81f4-445d-91e5-6a5cd9563294 >>>>> Total devices 2 FS bytes used 1001.53MiB >>>>> devid 1 size 2.00GiB used 1.85GiB path /dev/vdf1 >>>>> devid 2 size 4.00GiB used 1.83GiB path /dev/vdf2 >>>>> a. df -h should report Size as 2GiB rather than as 3GiB. >>>>> Because this is 2 device raid1, the limiting factor is devid 1 @2GiB. >>>>> >>>>> b. df -h should report Avail as 0.97GiB or less, rather than as 1.3GiB. >>>>> 1.85 (the capacity of the allocated chunk) >>>>> -1.018 (the file stored) >>>>> +(2-1.85=0.15) (the residual capacity of the disks >>>>> considering a raid1 fs) >>>>> --------------- >>>>> = 0.97 >>>>> >>>>> This patch drops the factor at all and calculate the size observable to >>>>> user without considering which raid level the data is in and what's the >>>>> size exactly in disk. >>>>> After this patch applied: >>>>> # mkfs.btrfs -f /dev/vdf1 /dev/vdf2 -d raid1 >>>>> # mount /dev/vdf1 /mnt >>>>> # dd if=/dev/zero of=/mnt/zero bs=1M count=1000 >>>>> # df -h /mnt >>>>> Filesystem Size Used Avail Use% Mounted on >>>>> /dev/vdf1 2.0G 1.3G 713M 66% /mnt >>>>> # df /mnt >>>>> Filesystem 1K-blocks Used Available Use% Mounted on >>>>> /dev/vdf1 2097152 1359424 729536 66% /mnt >>>>> # btrfs fi show /dev/vdf1 >>>>> Label: none uuid: e98c1321-645f-4457-b20d-4f41dc1cf2f4 >>>>> Total devices 2 FS bytes used 1001.55MiB >>>>> devid 1 size 2.00GiB used 1.85GiB path /dev/vdf1 >>>>> devid 2 size 4.00GiB used 1.83GiB path /dev/vdf2 >>>>> a). The @Size is 2G as we expected. >>>>> b). @Available is 700M = 1.85G - 1.3G + (2G - 1.85G). >>>>> c). @Used is changed to 1.3G rather than 1018M as above. Because >>>>> this patch do not treat the free space in metadata chunk >>>>> and system chunk as available to user. It's true, user can >>>>> not use these space to store data, then it should not be >>>>> thought as available. At the same time, it will make the >>>>> @Used + @Available == @Size as possible to user. >>>>> >>>>> Signed-off-by: Dongsheng Yang <yangds.fnst@cn.fujitsu.com> >>>>> --- >>>>> Changelog: >>>>> v1 -> v2: >>>>> a. account the total_bytes in medadata chunk and >>>>> system chunk as used to user. >>>>> b. set data_stripes to the correct value in RAID0. 
>>>>> >>>>> fs/btrfs/extent-tree.c | 13 ++---------- >>>>> fs/btrfs/super.c | 56 >>>>> ++++++++++++++++++++++---------------------------- >>>>> 2 files changed, 26 insertions(+), 43 deletions(-) >>>>> >>>>> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c >>>>> index a84e00d..9954d60 100644 >>>>> --- a/fs/btrfs/extent-tree.c >>>>> +++ b/fs/btrfs/extent-tree.c >>>>> @@ -8571,7 +8571,6 @@ static u64 >>>>> __btrfs_get_ro_block_group_free_space(struct list_head *groups_list) >>>>> { >>>>> struct btrfs_block_group_cache *block_group; >>>>> u64 free_bytes = 0; >>>>> - int factor; >>>>> >>>>> list_for_each_entry(block_group, groups_list, list) { >>>>> spin_lock(&block_group->lock); >>>>> @@ -8581,16 +8580,8 @@ static u64 >>>>> __btrfs_get_ro_block_group_free_space(struct list_head *groups_list) >>>>> continue; >>>>> } >>>>> >>>>> - if (block_group->flags & (BTRFS_BLOCK_GROUP_RAID1 | >>>>> - BTRFS_BLOCK_GROUP_RAID10 | >>>>> - BTRFS_BLOCK_GROUP_DUP)) >>>>> - factor = 2; >>>>> - else >>>>> - factor = 1; >>>>> - >>>>> - free_bytes += (block_group->key.offset - >>>>> - >>>>> btrfs_block_group_used(&block_group->item)) * >>>>> - factor; >>>>> + free_bytes += block_group->key.offset - >>>>> + btrfs_block_group_used(&block_group->item); >>>>> >>>>> spin_unlock(&block_group->lock); >>>>> } >>>>> diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c >>>>> index 54bd91e..40f41a2 100644 >>>>> --- a/fs/btrfs/super.c >>>>> +++ b/fs/btrfs/super.c >>>>> @@ -1641,6 +1641,8 @@ static int btrfs_calc_avail_data_space(struct >>>>> btrfs_root *root, u64 *free_bytes) >>>>> u64 used_space; >>>>> u64 min_stripe_size; >>>>> int min_stripes = 1, num_stripes = 1; >>>>> + /* How many stripes used to store data, without considering >>>>> mirrors. */ >>>>> + int data_stripes = 1; >>>>> int i = 0, nr_devices; >>>>> int ret; >>>>> >>>>> @@ -1657,12 +1659,15 @@ static int btrfs_calc_avail_data_space(struct >>>>> btrfs_root *root, u64 *free_bytes) >>>>> if (type & BTRFS_BLOCK_GROUP_RAID0) { >>>>> min_stripes = 2; >>>>> num_stripes = nr_devices; >>>>> + data_stripes = num_stripes; >>>>> } else if (type & BTRFS_BLOCK_GROUP_RAID1) { >>>>> min_stripes = 2; >>>>> num_stripes = 2; >>>>> + data_stripes = 1; >>>>> } else if (type & BTRFS_BLOCK_GROUP_RAID10) { >>>>> min_stripes = 4; >>>>> num_stripes = 4; >>>>> + data_stripes = 2; >>>>> } >>>>> >>>>> if (type & BTRFS_BLOCK_GROUP_DUP) >>>>> @@ -1733,14 +1738,17 @@ static int btrfs_calc_avail_data_space(struct >>>>> btrfs_root *root, u64 *free_bytes) >>>>> i = nr_devices - 1; >>>>> avail_space = 0; >>>>> while (nr_devices >= min_stripes) { >>>>> - if (num_stripes > nr_devices) >>>>> + if (num_stripes > nr_devices) { >>>>> num_stripes = nr_devices; >>>>> + if (type & BTRFS_BLOCK_GROUP_RAID0) >>>>> + data_stripes = num_stripes; >>>>> + } >>>>> >>>>> if (devices_info[i].max_avail >= min_stripe_size) { >>>>> int j; >>>>> u64 alloc_size; >>>>> >>>>> - avail_space += devices_info[i].max_avail * >>>>> num_stripes; >>>>> + avail_space += devices_info[i].max_avail * >>>>> data_stripes; >>>>> alloc_size = devices_info[i].max_avail; >>>>> for (j = i + 1 - num_stripes; j <= i; j++) >>>>> devices_info[j].max_avail -= alloc_size; >>>>> @@ -1772,15 +1780,13 @@ static int btrfs_calc_avail_data_space(struct >>>>> btrfs_root *root, u64 *free_bytes) >>>>> static int btrfs_statfs(struct dentry *dentry, struct kstatfs *buf) >>>>> { >>>>> struct btrfs_fs_info *fs_info = btrfs_sb(dentry->d_sb); >>>>> - struct btrfs_super_block *disk_super = fs_info->super_copy; >>>>> struct list_head *head = 
&fs_info->space_info; >>>>> struct btrfs_space_info *found; >>>>> u64 total_used = 0; >>>>> + u64 total_alloc = 0; >>>>> u64 total_free_data = 0; >>>>> int bits = dentry->d_sb->s_blocksize_bits; >>>>> __be32 *fsid = (__be32 *)fs_info->fsid; >>>>> - unsigned factor = 1; >>>>> - struct btrfs_block_rsv *block_rsv = &fs_info->global_block_rsv; >>>>> int ret; >>>>> >>>>> /* >>>>> @@ -1792,38 +1798,20 @@ static int btrfs_statfs(struct dentry *dentry, >>>>> struct kstatfs *buf) >>>>> rcu_read_lock(); >>>>> list_for_each_entry_rcu(found, head, list) { >>>>> if (found->flags & BTRFS_BLOCK_GROUP_DATA) { >>>>> - int i; >>>>> - >>>>> - total_free_data += found->disk_total - >>>>> found->disk_used; >>>>> + total_free_data += found->total_bytes - >>>>> found->bytes_used; >>>>> total_free_data -= >>>>> >>>>> btrfs_account_ro_block_groups_free_space(found); >>>>> - >>>>> - for (i = 0; i < BTRFS_NR_RAID_TYPES; i++) { >>>>> - if (!list_empty(&found->block_groups[i])) >>>>> { >>>>> - switch (i) { >>>>> - case BTRFS_RAID_DUP: >>>>> - case BTRFS_RAID_RAID1: >>>>> - case BTRFS_RAID_RAID10: >>>>> - factor = 2; >>>>> - } >>>>> - } >>>>> - } >>>>> + total_used += found->bytes_used; >>>>> + } else { >>>>> + /* For metadata and system, we treat the >>>>> total_bytes as >>>>> + * not available to file data. So show it as Used >>>>> in df. >>>>> + **/ >>>>> + total_used += found->total_bytes; >>>>> } >>>>> - >>>>> - total_used += found->disk_used; >>>>> + total_alloc += found->total_bytes; >>>>> } >>>>> - >>>>> rcu_read_unlock(); >>>>> >>>>> - buf->f_blocks = div_u64(btrfs_super_total_bytes(disk_super), >>>>> factor); >>>>> - buf->f_blocks >>= bits; >>>>> - buf->f_bfree = buf->f_blocks - (div_u64(total_used, factor) >> >>>>> bits); >>>>> - >>>>> - /* Account global block reserve as used, it's in logical size >>>>> already */ >>>>> - spin_lock(&block_rsv->lock); >>>>> - buf->f_bfree -= block_rsv->size >> bits; >>>>> - spin_unlock(&block_rsv->lock); >>>>> - >>>>> buf->f_bavail = total_free_data; >>>>> ret = btrfs_calc_avail_data_space(fs_info->tree_root, >>>>> &total_free_data); >>>>> if (ret) { >>>>> @@ -1831,8 +1819,12 @@ static int btrfs_statfs(struct dentry *dentry, >>>>> struct kstatfs *buf) >>>>> mutex_unlock(&fs_info->fs_devices->device_list_mutex); >>>>> return ret; >>>>> } >>>>> - buf->f_bavail += div_u64(total_free_data, factor); >>>>> + buf->f_bavail += total_free_data; >>>>> buf->f_bavail = buf->f_bavail >> bits; >>>>> + buf->f_blocks = total_alloc + total_free_data; >>>>> + buf->f_blocks >>= bits; >>>>> + buf->f_bfree = buf->f_blocks - (total_used >> bits); >>>>> + >>>>> mutex_unlock(&fs_info->chunk_mutex); >>>>> mutex_unlock(&fs_info->fs_devices->device_list_mutex); >>>>> >>>>> >>>> >>>> -- >>>> gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it> >>>> Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5 >>>> -- >>>> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" >>>> in >>>> the body of a message to majordomo@vger.kernel.org >>>> More majordomo info at http://vger.kernel.org/majordomo-info.html >>> -- >>> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in >>> the body of a message to majordomo@vger.kernel.org >>> More majordomo info at http://vger.kernel.org/majordomo-info.html > -- > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > . 
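To put numbers on the two reports discussed in this mail (a rough cross-check; Grzegorz's chunk layout is not quoted, so the DUP figure for his case is inferred the same way Yang infers it above): for the 10 GiB volume in the quoted test, 10485760 KiB - 9428992 KiB = 1056768 KiB = 1032 MiB, which is exactly the second DUP copy of the 1.00 GiB metadata chunk plus the second copy of the 8 MiB DUP system chunk, i.e. the device space the FS-level view no longer counts. For the 30.0 GiB root partition, 32212254720 - 29761732608 = 2450522112 bytes, about 2.28 GiB, consistent with roughly 2.28 GiB of DUP metadata/system chunks occupying about 4.56 GiB of the device, as in the diagram above.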
On 12/14/2014 05:21 PM, Dongsheng Yang wrote: > > Does it make sense to you? I understood what you were saying but it didn't make sense to me... > As there are 2 complaints for the same change of @size in df, I have to > say it maybe not so easy to understand. > Anyone have some suggestion about it? ABSTRACT:: Stop being clever, just give the raw values. That's what you should be doing anyway. There are no other correct values to give that doesn't blow someone's paradigm somewhere. ITEM #1 :: In my humble opinion (ha ha) the size column should never change unless you add or remove actual storage. It should approximate the raw block size of the device on initial creation, and it should adjust to the size changes that happen when you semantically resize the filesystem with e.g. btrfs resize. RATIONALE :: (1a) The actual definition of size for df is not well defined, so the best definition of size is the one that produces the "most useful information". IMHO the number I've always wanted to see from df is the value SIZE I would supply in order to safely dd the entire filesystem from one place to another. That number would be, on a single drive filesystem, "total_bytes" from the superblock as scaled by the necessary block size etc. ITEM #2 :: The idea of "blocks used" is iffy as well. In particular I don't care how or why those blocks have been used. And almost all filesystems have this same issue. If I write a 1GiB file to ext2 my blocks used doesn't go down by exactly 1GiB, it goes down by 1GiB plus all the indirect indexing blocks needed to reference that 1GiB. RATIONALE :: (2a) "Blocks Used" is not, and wasn't particularly meant to be "Blocks Used By Data Alone". (2b) Many filesystems have, historically, per-subtracted the fixed overhead of their layout such as removing inode table regions. But the it became "stupid" and "anti-helpful" but remained un-redressed when advancements were made that let data be stored directly in the inodes for small files. So now you can technically fit more data in an EXT* filesystem than you could fit in SIZE*BLOCKSIZE bytes. Even before compression. (2c) The "fixed" blocks-used size of BTRFS is technically sizeof(SuperBlock)*num_supers. Everything else is up for grabs. Some is, indeed, pre-grabbed but so what? ITEM #3 :: The idea of Blocks Available should be Blocks - BlocksUsed and _nothing_ _more_. RATIONALE :: (3a) Just like Blocks Used isn't just about blocks used for data, Blocks Available isn't about how much more user data can be stuffed into the filesystem (3b) Any attempt to treat Blocks Available as some sort of guarantee will be meaningless for some significant number of people and usages. ITEM #4 :: Blocks available to unprivileged users is pretty "iffy" since unprivileged users cannot write to the filesystem. This datum doesn't have a "plain reading". I'd start with filesystem total blocks, then subtract the total blocks used by all trees nodes in all trees. (e.g. Nodes * 0x1000 or whatever node size is) then shave off the N superblocks, then subtract the number of blocks already allocated in data extents. And you're done. ITEM #5 :: A plain reading of the comments in the code cry out "stop trying to help me make predictions". Just serve up the nice raw numbers. RATIONALE :: (5a) We have _all_ suffered under the merciless tyranny of some system or another that was being "too helpful to be useful". Once a block of code tries to "help you" and enforces that help, then you are doomed to suffer under that help. See "Clippy". 
(5b) The code has a plain reading. It doesn't say anything about how things will be used. "Available" is _available_. If you have chosen to use it at a 2x rate (e.g. 200%, e.g. RAID1) or a 1.25 rate (five media in RAID5) or an N+2/N rate (e.g. RAID6), or a 4x rate (RAID10)... well that was your choice. (5c) If your metadata rate is different than your data rate, then there is _absolutely_ no way to _programatically_ predict how the data _might_ be used, and this is the _default_ usage model. Literally the hardest model is the normal model. There is actually no predictive solution. So why are we putting in predictions at all when they _must_ be wrong. struct statfs { __SWORD_TYPE f_type; /* type of filesystem (see below) */ __SWORD_TYPE f_bsize; /* optimal transfer block size */ fsblkcnt_t f_blocks; /* total data blocks in filesystem */ fsblkcnt_t f_bfree; /* free blocks in fs */ fsblkcnt_t f_bavail; /* free blocks available to unprivileged user */ fsfilcnt_t f_files; /* total file nodes in filesystem */ fsfilcnt_t f_ffree; /* free file nodes in fs */ fsid_t f_fsid; /* filesystem id */ __SWORD_TYPE f_namelen; /* maximum length of filenames */ __SWORD_TYPE f_frsize; /* fragment size (since Linux 2.6) */ __SWORD_TYPE f_spare[5]; }; The datum provided is _supposed_ to be simple. "total blocks in file system" "free blocks in file system". "Blocks available to unprivileged users" is the only tricky one. Id limit that to all unallocated blocks inside data extents and all blocks not part of any extent. "Unprivileged users" cannot, after all, actually allocate blocks in the various trees even if the system ends up doing it for them. Fortunately (or hopefully) that's not the datum /bin/df usually returns. SUMMARY :: No fudge factor or backwards-reasoning is going to satisfy more than half the people. Trying to gestimate the users intentions is impossible. Like all filesystems except the most simplistic ones (vfat etc) or read-only ones (squashfs), getting any answer "near perfect" is not likely, nor particularly helpful. It's really _not_ the implementors job to guess at how the user is going to use the system. Just as EXT before us didn't bother trying to put in a fudge factor that guessed what percentage of files would end up needing indirect blocks, we shouldn't be in the business of trying to back-figure cost-of-storage. The raw numbers are _more_ useful in many circumstances. The raw blocks used, for example, will tell me what I need to know for thin provisioning on other media, for example. Literally nothing else exposes that sort of information. Just put a prominent notice that the user needs to remember to factor their choice of redundancy et al into the numbers. Noticing that my RAID1 costs two 1k blocks to store 1K of data is _their_ _job_ when it comes down to it. That's because we are giving "them" insight to the filessytem _and_ the storage management. Same for the benefits of compression etc. We can recognize that this is "harder" than some other filesystems because, frankly, it is... Once we decided to get into the business of fusing the file system with the storage management system we _accepted_ that burden of difficulty. Users who never go beyond core usage (single data plus "overhead" from DUP metadata) will still get the same numbers for their simple case. 
People who start doing RAID5+1 or whatever (assuming our implementation gets that far) across 22 media are just going to have to remember to do the math to figure their 10% overhead cost when looking at "blocks available" just like I had to do my S=N*log(N) estimates while laying out Oracle table spaces on my sun stations back in the eighties. Any "clever" answer to any one model will be wrong for _every_ _other_ model. IN MY HUMBLE OPINION, of course... 8-) -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
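For reference while reading the struct above, the three df columns being argued about are derived mechanically from those fields. A minimal userspace sketch of that mapping follows (it uses the portable statvfs(3) interface rather than the raw statfs(2) struct quoted above, and it illustrates how df consumes the numbers; it is not df's actual source):

#include <stdio.h>
#include <sys/statvfs.h>

int main(int argc, char **argv)
{
	struct statvfs s;
	unsigned long long bs, size, free_root, avail_user;

	if (argc < 2) {
		fprintf(stderr, "usage: %s <mountpoint>\n", argv[0]);
		return 1;
	}
	if (statvfs(argv[1], &s) != 0) {
		perror("statvfs");
		return 1;
	}

	bs         = s.f_frsize ? s.f_frsize : s.f_bsize;
	size       = (unsigned long long)s.f_blocks * bs; /* df "Size"                */
	free_root  = (unsigned long long)s.f_bfree  * bs; /* free, including reserved */
	avail_user = (unsigned long long)s.f_bavail * bs; /* df "Avail"               */

	/* df reports Used as Size minus f_bfree, not Size minus f_bavail */
	printf("size=%llu used=%llu avail=%llu\n",
	       size, size - free_root, avail_user);
	return 0;
}

Whatever btrfs_statfs() chooses to put in f_blocks, f_bfree and f_bavail is therefore all df ever sees; the Used column is not reported by the kernel at all but inferred as f_blocks - f_bfree, which is one reason @Used + @Available == @Size only holds when f_bfree and f_bavail agree.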
On 12/14/2014 10:06 PM, Robert White wrote:
> On 12/14/2014 05:21 PM, Dongsheng Yang wrote:
>> Anyone have some suggestion about it?
> (... strong advocacy for raw numbers...)

Concise Example to attempt to be clearer:

/dev/sda == 1TiB
/dev/sdb == 2TiB
/dev/sdc == 3TiB
/dev/sdd == 3TiB

mkfs.btrfs /dev/sd{a..d} -d raid0
mount /dev/sda /mnt

Now compare ::

#!/bin/bash
dd if=/dev/urandom of=/mnt/example bs=1G

vs

#!/bin/bash
typeset -i counter
for ((counter=0;;counter++)); do
  dd if=/dev/urandom of=/mnt/example$counter bs=44 count=1
done

vs

#!/bin/bash
typeset -i counter
for ((counter=0;;counter++)); do
  dd if=/dev/urandom of=/mnt/example$counter bs=44 count=1
done &
dd if=/dev/urandom of=/mnt/example bs=1G

Now repeat the above 3 models for
mkfs.btrfs /dev/sd{a..d} -d raid5

......

As you watch these six examples evolve you can ponder the ultimate futility of doing adaptive prediction within statfs().

Then go back and change the metadata from the default of RAID1 to RAID5 or RAID6 or RAID10.

Then go back and try

mkfs.btrfs /dev/sd{a..d} -d raid10

then balance when the big file runs out of space, then resume the big file with oflag=append

......

Unlike _all_ our predecessors, we are active at both the semantic file storage level _and_ the physical media management level.

None of the prior filesystems match this new ground exactly.

The only real option is to expose the raw numbers and then tell people the corner cases.

Absolutely unavailable blocks, such as the massive waste of 5TiB in the above sized media if raid10 were selected for both data and metadata, would be subtracted from size if and only if it's _impossible_ for it to be accessed by this sort of restriction. But even in this case, the correct answer for size is 4TiB because that exactly answers "how big is this filesystem".

It might be worth having a "dev_item.bytes_excluded" or unusable or whatever to account for the difference between total_bytes and bytes_used and the implicit bytes available. This would account for the 0,1,2,2 TiB that a raid10 of the example sizes could never reach in the current geometry. I'm betting that this sort of number also shows up as some number of sectors in any filesystem that has an odd tidbit of size up at the top where no structure is ever going to fit. That's just a feature of the way disks use GB instead of GiB and msdos style partitions love the number 63.

So resize sets the size. Geometry limitations may reduce the effective size by some, or a _lot_, but then the used-vs-available should _not_ try to correct for whatever geometry is in use. Even when it might be simple because if it does it well in the simple cases like raid10/raid10, it would have to botch it up on the hard cases.

-- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 12/15/2014 03:49 PM, Robert White wrote: > On 12/14/2014 10:06 PM, Robert White wrote: >> On 12/14/2014 05:21 PM, Dongsheng Yang wrote: >>> Anyone have some suggestion about it? >> (... strong advocacy for raw numbers...) Hi Robert, thanx for your so detailed reply. You are proposing to report the raw numbers in df command, right? Let's compare the space information in FS level and Device level. Example: /dev/sda == 1TiB /dev/sdb == 2TiB mkfs.btrfs /dev/sda /dev/sdb -d raid1 (1). If we report the raw numbers in df command, we will get the result of @size=3T, @used=0 @available=3T. It's not a bad idea until now, as you said user can consider the raid when they are using the fs. Then if we fill 1T data into it. we will get @size=3, @used=2T, @avalable=1T. And at this moment, we will get ENOSPC when writing any more data. It's unacceptable. Why you tell me there is 1T space available, but I can't write one byte into it? (2). Actually, there was an elder btrfs_statfs(), it reported the raw numbers to user. To solve the problem mentioned in (1), we need report the space information in the FS level. Current btrfs_statfs() is designed like this, but not working in any cases. My new btrfs_statfs() here is following this design and implementing it to show a *better* output to user. Thanx Yang > > Concise Example to attempt to be clearer: > > /dev/sda == 1TiB > /dev/sdb == 2TiB > /dev/sdc == 3TiB > /dev/sdd == 3TiB > > mkfs.btrfs /dev/sd{a..d} -d raid0 > mount /dev/sda /mnt > > Now compare :: > > #!/bin/bash > dd if=/dev/urandom of=/mnt/example bs=1G > > vs > > #!/bin/bash > typeset -i counter > for ((counter=0;;counter++)); do > dd if=/dev/urandom of=/mnt/example$conter bs=44 count=1 > done > > vs > > #!/bin/bash > typeset -i counter > for ((counter=0;;counter++)); do > dd if=/dev/urandom of=/mnt/example$conter bs=44 count=1 > done & > dd if=/dev/urandom of=/mnt/example bs=1G > > Now repeat the above 3 models for > mkfs.btrfs /dev/sd{a..d} -d raid5 > > > ...... > > As you watch these six examples evolve you can ponder the ultimate > futility of doing adaptive prediction within statfs(). > > Then go back and change the metadata from the default of RAID1 to > RAID5 or RAID6 or RAID10. > > Then go back and try > > mkfs.btrfs /dev/sd{a..d} -d raid10 > > then balance when the big file runs out of space, then resume the big > file with oflag=append > > ...... > > Unlike _all_ our predecessors, we are active at both the semantic file > storage level _and_ the physical media management level. > > None of the prior filesystems match this new ground exactly. > > The only real option is to expose the raw numbers and then tell people > the corner cases. > > Absolutely unavailable blocks, such as the massive waste of 5TiB in > the above sized media if raid10 were selected for both data and > metadata would be subtracted from size if and only if it's > _impossible_ for it to be accessed by this sort of restriction. But > even in this case, the correct answer for size is 4TiB because that > exactly answers "how big is this filesystem". > > It might be worth having a "dev_item.bytes_excluded" or unusable or > whatever to account for the difference between total_bytes and > bytes_used and the implicit bytes available. This would account for > the 0,1,2,2 TiB that a raid10 of the example sizes could never reach > in the current geometry. 
I'm betting that this sort of number also > shows up as some number of sectors in any filesystem that has an odd > tidbit of size up at the top where no structure is ever gong to fit. > That's just a feature of the way disks use GB instead of GiB and msdos > style partitions love the number 63. > > So resize sets the size. Geometry limitations may reduce the effective > size by some, or a _lot_, but then the used-vs-available should _not_ > try to correct for whatever geometry is in use. Even when it might be > simple because if it does it well in the simple cases like > raid10/raid10, it would have to botch it up on the hard cases. > > > . > -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
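For contrast, the FS-level accounting of the patch would describe the same 1TiB + 2TiB raid1 pair roughly as follows (ignoring metadata/system chunks and the exact rounding of btrfs_calc_avail_data_space(), so the figures are only indicative): raid1 needs both copies on different devices, so only about 1TiB of data can ever be mirrored; df would start near @size=1T, @used=0, @available=1T, and after the 1T of data is written it would show @size=1T, @used=1T, @available=0, so the ENOSPC no longer contradicts df.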
On 12/15/2014 12:26 AM, Dongsheng Yang wrote: > On 12/15/2014 03:49 PM, Robert White wrote: >> On 12/14/2014 10:06 PM, Robert White wrote: >>> On 12/14/2014 05:21 PM, Dongsheng Yang wrote: >>>> Anyone have some suggestion about it? >>> (... strong advocacy for raw numbers...) > > Hi Robert, thanx for your so detailed reply. > > You are proposing to report the raw numbers in df command, right? Wrong. Well partly wrong. You didn't include the "geometrically unavailable" calculation from my suggestion. And you've largely ignored the commentary and justifications from the first email. See, all the solutions being discussed basically drop out the hard parts. For instance god save your existing patch from a partially complete conversion. I'll bet we could get into negative @available numbers if we abort halfway through conversion between raid levels zero and one. If we have two disks of 5GiB and they _were_ at RAID0 and we start a -dconvert to RAID1 and abort it or run out of space because there were 7GiB in use... then what? If size is now 5GiB and used is now 6GiB but we've still got 1GiB available... well _how_ exactly is that _better_ information? > Let's compare the space information in FS level and Device level. > > Example: > /dev/sda == 1TiB > /dev/sdb == 2TiB > > mkfs.btrfs /dev/sda /dev/sdb -d raid1 > > (1). If we report the raw numbers in df command, we will get the result of > @size=3T, @used=0 @available=3T. It's not a bad idea until now, as you said > user can consider the raid when they are using the fs. Then if we fill > 1T data > into it. we will get @size=3, @used=2T, @avalable=1T. And at this > moment, we > will get ENOSPC when writing any more data. It's unacceptable. Why you > tell me there is 1T space available, but I can't write one byte into it? See the last segment below where I addressed the idea of geometrically unavailable space. It explains why, in this case, available would have been 0TiB not 1TiB. The solution has to be correct for all uses and BTRFS _WILL_ eventually reach a condition where _either_ @available will be zero but you can still add some small files, or @available will be non-zero but you _can't_ any large files. It is _impossible_ _to_ _avoid_ both outcomes because the two modalities of file storage are incompatable. We have mixed the metaphor of file storage and raw storage. We have done so with a dynamically self-restructuring system. As such we don't have a completely consistent option. For instance, what happens when the user adds a third 1TiB drive to your example? So we need to account, at a system-wide level, for every byte of storage. We need to then make as few assumptions as possible, but no fewer, when reporting the subsets of information to the tools that were invented before we stretched the paradigm outside that box. > > (2). Actually, there was an elder btrfs_statfs(), it reported the raw > numbers to user. > To solve the problem mentioned in (1), we need report the space > information in the FS level. No, we need to report the availability at the _raw_ level, but subtract the "absolutely inaccessable" tidbits. In short we need to report at the _storage_ _management_ _level_ w So your example sizes in my methodology would report SIZE=2TiB USED=0 AVAILABLE=2TiB And if you wrote exactly 1GiB of data to that filesystem it would show SIZE=2TiB USED=2GiB AVAILABLE=1.99TiB because that one 1GiB took 2GiB to store. No attempt would be made to make the filesystem appear 1TiB-ish. 
The operator knows its RAID1 so he knows it will be sucked down at a 2x rate. And when the user said "WTF" regarding the missing 1TiB he'd be directed to show-super and the 1TiB "unavailable because of geometry". But if the user built the filesystem with btrfs device add and never rebalanced it, absolutely no attempt would be made to "protect him" from the knowledge that he could "run out of space" while still having blocks available. mkfs.btrfs /dev/sda (time pases) btrfs device add /dev/sdd /mountpoint Would show SIZE=3TiB (etc) and would take no steps to warn about the possibly jammed up condition caused by the existing concentration duplicate metata on /dev/sda. > Current btrfs_statfs() is designed like this, but not working in any cases. > My new btrfs_statfs() here is following this design and implementing it > to show a *better* output to user. > > Thanx > Yang >> >> Concise Example to attempt to be clearer: >> >> /dev/sda == 1TiB >> /dev/sdb == 2TiB >> /dev/sdc == 3TiB >> /dev/sdd == 3TiB >> >> mkfs.btrfs /dev/sd{a..d} -d raid0 >> mount /dev/sda /mnt >> >> Now compare :: >> >> #!/bin/bash >> dd if=/dev/urandom of=/mnt/example bs=1G >> >> vs >> >> #!/bin/bash >> typeset -i counter >> for ((counter=0;;counter++)); do >> dd if=/dev/urandom of=/mnt/example$conter bs=44 count=1 >> done >> >> vs >> >> #!/bin/bash >> typeset -i counter >> for ((counter=0;;counter++)); do >> dd if=/dev/urandom of=/mnt/example$conter bs=44 count=1 >> done & >> dd if=/dev/urandom of=/mnt/example bs=1G >> >> Now repeat the above 3 models for >> mkfs.btrfs /dev/sd{a..d} -d raid5 >> >> >> ...... >> >> As you watch these six examples evolve you can ponder the ultimate >> futility of doing adaptive prediction within statfs(). >> >> Then go back and change the metadata from the default of RAID1 to >> RAID5 or RAID6 or RAID10. >> >> Then go back and try >> >> mkfs.btrfs /dev/sd{a..d} -d raid10 >> >> then balance when the big file runs out of space, then resume the big >> file with oflag=append >> >> ...... >> >> Unlike _all_ our predecessors, we are active at both the semantic file >> storage level _and_ the physical media management level. >> >> None of the prior filesystems match this new ground exactly. >> >> The only real option is to expose the raw numbers and then tell people >> the corner cases. >> === Re-read from here down === >> Absolutely unavailable blocks, such as the massive waste of 5TiB in >> the above sized media if raid10 were selected for both data and >> metadata would be subtracted from size if and only if it's >> _impossible_ for it to be accessed by this sort of restriction. But >> even in this case, the correct answer for size is 4TiB because that >> exactly answers "how big is this filesystem". >> >> It might be worth having a "dev_item.bytes_excluded" or unusable or >> whatever to account for the difference between total_bytes and >> bytes_used and the implicit bytes available. This would account for >> the 0,1,2,2 TiB that a raid10 of the example sizes could never reach >> in the current geometry. I'm betting that this sort of number also >> shows up as some number of sectors in any filesystem that has an odd >> tidbit of size up at the top where no structure is ever gong to fit. >> That's just a feature of the way disks use GB instead of GiB and msdos >> style partitions love the number 63. >> >> So resize sets the size. Geometry limitations may reduce the effective >> size by some, or a _lot_, but then the used-vs-available should _not_ >> try to correct for whatever geometry is in use. 
Even when it might be >> simple because if it does it well in the simple cases like >> raid10/raid10, it would have to botch it up on the hard cases. >> >> >> . >> > > So we don't just hand-wave over statfs(). We include the dev_item.bytes_excluded in the superblock and we decide once-and-for-all (with any geometry creation, or completed conversion) how many bytes just _can't_ be reached but only once we _know_ they cant be reached. And we memorialize that unreachable data in the superblocks. Thereafter we report the raw numbers after subtracting anything we know cannot be reached. All other "helpful" solutions are NP-complete and insoluble. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
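For the two-copy raid1 case used in the example above, the "geometrically unreachable" amount is easy to compute. A small sketch of just that one case follows; raid1_waste is an invented helper for illustration, not the proposed dev_item.bytes_excluded machinery, and other profiles need their own formulas:

#include <stdio.h>

/*
 * raid1 keeps two copies, so a byte on the largest device is only usable
 * if it can be paired with a byte on some other device; whatever the
 * largest device exceeds the sum of the others by can never be reached.
 */
static unsigned long long raid1_waste(const unsigned long long *dev, int n)
{
	unsigned long long sum = 0, largest = 0;
	int i;

	for (i = 0; i < n; i++) {
		sum += dev[i];
		if (dev[i] > largest)
			largest = dev[i];
	}
	return largest > sum - largest ? largest - (sum - largest) : 0;
}

int main(void)
{
	unsigned long long dev[] = { 1ULL << 40, 2ULL << 40 }; /* 1TiB + 2TiB */
	unsigned long long total = dev[0] + dev[1];
	unsigned long long waste = raid1_waste(dev, 2);

	/* size as proposed above: raw total minus the unreachable space */
	printf("raw=%lluTiB waste=%lluTiB size=%lluTiB\n",
	       total >> 40, waste >> 40, (total - waste) >> 40);
	return 0;
}

For the 1TiB + 2TiB pair this prints raw=3TiB waste=1TiB size=2TiB, matching the SIZE=2TiB figure above.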
On 12/15/2014 01:36 AM, Robert White wrote: > So we don't just hand-wave over statfs(). We include the > dev_item.bytes_excluded in the superblock and we decide once-and-for-all > (with any geometry creation, or completed conversion) how many bytes > just _can't_ be reached but only once we _know_ they cant be reached. > And we memorialize that unreachable data in the superblocks. > > Thereafter we report the raw numbers after subtracting anything we know > cannot be reached. > > All other "helpful" solutions are NP-complete and insoluble. On multiple re-readings of my own words and running off to the POSIX definitions _and_ kernel sources (which don't agree). The practical bits first :: I would add a "-c | --compatable" option to btrfs fi df that let it produce /bin/df format-compatable output that gave the "real" numbers as defined near the end. /dev/sda 1TiB /dev/sdb 2TiB mkfs.btrfs /dev/sd{a,b} -d raid1 @size=3TiB @used=0TiB @available=2TiB The above would be ideal. But POSIX says "no". f_blocks is defined (only in the comments) as "total data blocks in the filesystem" and /bin/df pivots on that assumption, so the only usable option left is :: @size=2TiB @used=0TiB @available=2TiB After which @used would be the real, raw space consumed. If it takes 2GiB or 4GiB to store 1GiB (q.v. RAID 1 and 10) then @used would go up by that 2 or 4 GiB. Given the not-displayed, not reported, excluded_by_geometry values (e.g. @waste) the equation should always be :: @size - @waste = @used + @available The fact that /bin/df doesn't display all four values is just tough, The fact that it calculates one "for us" is really annoying, show-super would be the place to go find the truth. The @waste value is soft because while 1TiB of /dev/sdb that is not going to be used isn't a _particular_ 1TiB. It could be low blocks or high blocks or randomly distributed blocks that end up not having data. So keeping with my thought that (ideally) @size should be the "safe dd size" for doing a raw-block transcribe of the devices and filesystem, it is most correct for @size to be real storage size. But sadly, posix didn't define that value for that role, so we are forced to munge around. (particularly since /bin/df calculates stuff "for us"). Calculation of the @waste would have to happen in two phases. At initiation phase of any convert @waste would be set to zero. At completion of any _full_ convert, when we know that there are no leftover bits that could lead to rampant mis-report, @waste would be calculated for each device as a dev_item. Then the total would be stored as a global item. btrfs tools would report all four items. statfs() would have to report (@size-@waste) and @available, but that's a problem with limits to the assumptions made by statfs() designers two decades ago. I don't know which numbers we keep on hand and which we derive so... @available, if calculated dynamically would be sum(@size, -@waste, -@used). @used, if calculated dynamically, would be sum(@size, -@waste, -@available). This would also keep all the determinations of @waste well defined and relegated to specific, infrequently executed blocks of code. GIVEN ALSO :: The BTRFS dynamic internal layout allows for completely valid states that are inconsistent with the current filesystem flags... Such as it is legal to set the RAID1 mode for data but still having RAID0, RAID5, and any manner of other extents present... there is no finite solution to every particular layout that exists. This condition is even _mandatory_ in an evolving system. 
May persist if conversion is interrupted and then the balance is aborted. And might be purely legal if you supply a convert option and limit the number of blocks to process in the same run. Each individual extent block is it's own master in terms of what "mode the filesystem is actally in" when that extent is being accessed. This fact is _unchangeable_. STANDARDS REFERENCES and Issues... The actual standard from POSIX at The Open Group refers to f_blocks as "Total number of blocks on file system in units of f_frsize". See :: http://pubs.opengroup.org/onlinepubs/009695399/basedefs/sys/statvfs.h.html The linux kernel source and man pages say "total data blocks in filesystem". I don't know where/when/why the "total blocks" got re-qualified as "total data blocks" in the linux history, but it's probably incorrect on plain reading. The df command itself suffers a similar problem as the POSIX standard doesn't talk about "data blocks" etc. Problematically, of course, the statfs() call doesn't really allow for any means to address slack/waste space and the reverse calculation for us becomes impossible. This gets back to the "no right answer in BTRFS" issue. There is a lot of missing magic here. Back when INODES where just one thing with one size statfs results were probably either-or and "Everybody Knew" how to turn the inode count into a block count and history just rolled on. I think the real answer would be to invent an expanded statfs() call that returned the real numbers for @total_size, @system_overhead_used, @waste_space, @unusable_space, etc -- that is to come up with a generic model for a modern storage system -- and let real calculations take place. But I don't have the "community chops" to start that ball rolling. CONCLUSIONS :: Given the inherent assumptions of statfs(), there is _no_ solution that will be correct in all cases. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
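As a quick check of that identity with the 1TiB + 2TiB raid1 figures used earlier in the thread: @size (raw) = 3TiB and @waste = 1TiB; after writing 1GiB of data, which consumes 2GiB at raid1, @used = 2GiB and @available = 2TiB - 2GiB, so @size - @waste = 3TiB - 1TiB = 2TiB = @used + @available, in line with the SIZE=2TiB / USED=2GiB / AVAILABLE=1.99TiB example in the previous mail.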
On 12/15/2014 07:30 PM, Robert White wrote:
> The above would be ideal. But POSIX says "no". f_blocks is defined (only

Correction: the linux kernel says "total data blocks", POSIX says "total blocks" -- it was a mental typo... 8-)

> in the comments) as "total data blocks in the filesystem" and /bin/df
> pivots on that assumption, so the only usable option left is ::

-- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 12/16/2014 11:30 AM, Robert White wrote: > On 12/15/2014 01:36 AM, Robert White wrote: >> So we don't just hand-wave over statfs(). We include the >> dev_item.bytes_excluded in the superblock and we decide once-and-for-all >> (with any geometry creation, or completed conversion) how many bytes >> just _can't_ be reached but only once we _know_ they cant be reached. >> And we memorialize that unreachable data in the superblocks. >> >> Thereafter we report the raw numbers after subtracting anything we know >> cannot be reached. >> >> All other "helpful" solutions are NP-complete and insoluble. > > On multiple re-readings of my own words and running off to the POSIX > definitions _and_ kernel sources (which don't agree). > > The practical bits first :: > > I would add a "-c | --compatable" option to btrfs fi df > that let it produce /bin/df format-compatable output that gave the > "real" numbers as defined near the end. > > > /dev/sda 1TiB > /dev/sdb 2TiB > > > mkfs.btrfs /dev/sd{a,b} -d raid1 > > @size=3TiB @used=0TiB @available=2TiB > > The above would be ideal. But POSIX says "no". f_blocks is defined > (only in the comments) as "total data blocks in the filesystem" and > /bin/df pivots on that assumption, so the only usable option left is :: > > @size=2TiB @used=0TiB @available=2TiB > > After which @used would be the real, raw space consumed. If it takes > 2GiB or 4GiB to store 1GiB (q.v. RAID 1 and 10) then @used would go up > by that 2 or 4 GiB. Hi Robert, thanx for your proposal about this. IMHO, output of df command shoud be more friendly to user. Well, I think we have a disagreement on this point, let's take a look at what the zfs is doing. /dev/sda7- 10G /dev/sda8- 10G # zpool create myzpool mirror /dev/sda7 /dev/sda8 -f # df -h /myzpool/ Filesystem Size Used Avail Use% Mounted on myzpool 9.8G 21K 9.8G 1% /myzpool That said that df command should tell user the space info they can see. It means the output is the information from the FS level rather than device level or _storage_manager level. Thanx Yang > > Given the not-displayed, not reported, excluded_by_geometry values > (e.g. @waste) the equation should always be :: > > @size - @waste = @used + @available > > The fact that /bin/df doesn't display all four values is just tough, > The fact that it calculates one "for us" is really annoying, > show-super would be the place to go find the truth. > > The @waste value is soft because while 1TiB of /dev/sdb that is not > going to be used isn't a _particular_ 1TiB. It could be low blocks or > high blocks or randomly distributed blocks that end up not having data. > > So keeping with my thought that (ideally) @size should be the "safe dd > size" for doing a raw-block transcribe of the devices and filesystem, > it is most correct for @size to be real storage size. But sadly, posix > didn't define that value for that role, so we are forced to munge > around. (particularly since /bin/df calculates stuff "for us"). > > > Calculation of the @waste would have to happen in two phases. At > initiation phase of any convert @waste would be set to zero. At > completion of any _full_ convert, when we know that there are no > leftover bits that could lead to rampant mis-report, @waste would be > calculated for each device as a dev_item. Then the total would be > stored as a global item. > > btrfs tools would report all four items. > > statfs() would have to report (@size-@waste) and @available, but > that's a problem with limits to the assumptions made by statfs() > designers two decades ago. 
> > I don't know which numbers we keep on hand and which we derive so... > > @available, if calculated dynamically would be > sum(@size, -@waste, -@used). > > @used, if calculated dynamically, would be > sum(@size, -@waste, -@available). > > This would also keep all the determinations of @waste well defined and > relegated to specific, infrequently executed blocks of code. > > GIVEN ALSO :: > > The BTRFS dynamic internal layout allows for completely valid states > that are inconsistent with the current filesystem flags... Such as it > is legal to set the RAID1 mode for data but still having RAID0, RAID5, > and any manner of other extents present... there is no finite solution > to every particular layout that exists. > > This condition is even _mandatory_ in an evolving system. May persist > if conversion is interrupted and then the balance is aborted. And > might be purely legal if you supply a convert option and limit the > number of blocks to process in the same run. > > Each individual extent block is it's own master in terms of what "mode > the filesystem is actally in" when that extent is being accessed. This > fact is _unchangeable_. > > > STANDARDS REFERENCES and Issues... > > The actual standard from POSIX at The Open Group refers to f_blocks as > "Total number of blocks on file system in units of f_frsize". > > See :: > http://pubs.opengroup.org/onlinepubs/009695399/basedefs/sys/statvfs.h.html > > The linux kernel source and man pages say "total data blocks in > filesystem". > > I don't know where/when/why the "total blocks" got re-qualified as > "total data blocks" in the linux history, but it's probably incorrect > on plain reading. > > The df command itself suffers a similar problem as the POSIX standard > doesn't talk about "data blocks" etc. > > Problematically, of course, the statfs() call doesn't really allow for > any means to address slack/waste space and the reverse calculation for > us becomes impossible. > > This gets back to the "no right answer in BTRFS" issue. > > There is a lot of missing magic here. Back when INODES where just one > thing with one size statfs results were probably either-or and > "Everybody Knew" how to turn the inode count into a block count and > history just rolled on. > > I think the real answer would be to invent an expanded statfs() call > that returned the real numbers for @total_size, @system_overhead_used, > @waste_space, @unusable_space, etc -- that is to come up with a > generic model for a modern storage system -- and let real calculations > take place. But I don't have the "community chops" to start that ball > rolling. > > CONCLUSIONS :: > > Given the inherent assumptions of statfs(), there is _no_ solution > that will be correct in all cases. > . > -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
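Read in the terms used elsewhere in the thread: the pool is built from 20G of raw device space, the mirror can hold at most min(10G, 10G) = 10G of data, and df shows that figure (less some zfs-internal reservation, hence 9.8G) rather than 20G, with the 2x mirroring cost hidden from the user; this is the same FS-level presentation the patch aims to give for btrfs raid1.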
On Tue, Dec 16, 2014 at 7:30 PM, Dongsheng Yang <yangds.fnst@cn.fujitsu.com> wrote: > On 12/16/2014 11:30 AM, Robert White wrote: >> >> On 12/15/2014 01:36 AM, Robert White wrote: >>> >>> So we don't just hand-wave over statfs(). We include the >>> dev_item.bytes_excluded in the superblock and we decide once-and-for-all >>> (with any geometry creation, or completed conversion) how many bytes >>> just _can't_ be reached but only once we _know_ they cant be reached. >>> And we memorialize that unreachable data in the superblocks. >>> >>> Thereafter we report the raw numbers after subtracting anything we know >>> cannot be reached. >>> >>> All other "helpful" solutions are NP-complete and insoluble. >> >> >> On multiple re-readings of my own words and running off to the POSIX >> definitions _and_ kernel sources (which don't agree). >> >> The practical bits first :: >> >> I would add a "-c | --compatable" option to btrfs fi df >> that let it produce /bin/df format-compatable output that gave the "real" >> numbers as defined near the end. >> >> >> /dev/sda 1TiB >> /dev/sdb 2TiB >> >> >> mkfs.btrfs /dev/sd{a,b} -d raid1 >> >> @size=3TiB @used=0TiB @available=2TiB >> >> The above would be ideal. But POSIX says "no". f_blocks is defined (only >> in the comments) as "total data blocks in the filesystem" and /bin/df pivots >> on that assumption, so the only usable option left is :: >> >> @size=2TiB @used=0TiB @available=2TiB >> >> After which @used would be the real, raw space consumed. If it takes 2GiB >> or 4GiB to store 1GiB (q.v. RAID 1 and 10) then @used would go up by that 2 >> or 4 GiB. > > > Hi Robert, thanx for your proposal about this. > > IMHO, output of df command shoud be more friendly to user. > Well, I think we have a disagreement on this point, let's take a look at > what the zfs is doing. > > /dev/sda7- 10G > /dev/sda8- 10G > # zpool create myzpool mirror /dev/sda7 /dev/sda8 -f > # df -h /myzpool/ > Filesystem Size Used Avail Use% Mounted on > myzpool 9.8G 21K 9.8G 1% /myzpool > > That said that df command should tell user the space info they can see. > It means the output is the information from the FS level rather than device > level or _storage_manager level. Addition: There are some other ways to get the space information in btrfs, btrfs fi df, btrfs fi show, btrfs-debug-tree. The df command we discussed here is on the top level which is directly facing the user. Let me try to show the difference in different level. 1), TOP level, for a linux user: df command. For a linux user, he does not care about the detail how the data is stored in devices. They do not care even not know what's Single, what does DUP mean, and how a fs implement the RAID10. What they want to know is *what is the size of the filesystem I am using and how much space is still available to me*. That's what I said by "FS space level) 2). Middle level, for a btrfs user: btrfs fi df/show. For a btrfs user, they know about the single, dup and RAIDX. When they want to know what's the raid level in each space info, they can use btrfs fi df to print the information they want. 3). Device level. for debugging. Sometimes, you need to know how the each chunk is stored in device. Please use btrfs-debug-tree to show details you want as more as possible. After all, I would say, the information you want to show is *not* incorrect to me, but it's not the business of df command. Thanx > > Thanx > Yang > >> >> Given the not-displayed, not reported, excluded_by_geometry values (e.g. 
>> @waste) the equation should always be :: >> >> @size - @waste = @used + @available >> >> The fact that /bin/df doesn't display all four values is just tough, The >> fact that it calculates one "for us" is really annoying, show-super would be >> the place to go find the truth. >> >> The @waste value is soft because while 1TiB of /dev/sdb that is not going >> to be used isn't a _particular_ 1TiB. It could be low blocks or high blocks >> or randomly distributed blocks that end up not having data. >> >> So keeping with my thought that (ideally) @size should be the "safe dd >> size" for doing a raw-block transcribe of the devices and filesystem, it is >> most correct for @size to be real storage size. But sadly, posix didn't >> define that value for that role, so we are forced to munge around. >> (particularly since /bin/df calculates stuff "for us"). >> >> >> Calculation of the @waste would have to happen in two phases. At >> initiation phase of any convert @waste would be set to zero. At completion >> of any _full_ convert, when we know that there are no leftover bits that >> could lead to rampant mis-report, @waste would be calculated for each device >> as a dev_item. Then the total would be stored as a global item. >> >> btrfs tools would report all four items. >> >> statfs() would have to report (@size-@waste) and @available, but that's a >> problem with limits to the assumptions made by statfs() designers two >> decades ago. >> >> I don't know which numbers we keep on hand and which we derive so... >> >> @available, if calculated dynamically would be >> sum(@size, -@waste, -@used). >> >> @used, if calculated dynamically, would be >> sum(@size, -@waste, -@available). >> >> This would also keep all the determinations of @waste well defined and >> relegated to specific, infrequently executed blocks of code. >> >> GIVEN ALSO :: >> >> The BTRFS dynamic internal layout allows for completely valid states that >> are inconsistent with the current filesystem flags... Such as it is legal to >> set the RAID1 mode for data but still having RAID0, RAID5, and any manner of >> other extents present... there is no finite solution to every particular >> layout that exists. >> >> This condition is even _mandatory_ in an evolving system. May persist if >> conversion is interrupted and then the balance is aborted. And might be >> purely legal if you supply a convert option and limit the number of blocks >> to process in the same run. >> >> Each individual extent block is it's own master in terms of what "mode the >> filesystem is actally in" when that extent is being accessed. This fact is >> _unchangeable_. >> >> >> STANDARDS REFERENCES and Issues... >> >> The actual standard from POSIX at The Open Group refers to f_blocks as >> "Total number of blocks on file system in units of f_frsize". >> >> See :: >> http://pubs.opengroup.org/onlinepubs/009695399/basedefs/sys/statvfs.h.html >> >> The linux kernel source and man pages say "total data blocks in >> filesystem". >> >> I don't know where/when/why the "total blocks" got re-qualified as "total >> data blocks" in the linux history, but it's probably incorrect on plain >> reading. >> >> The df command itself suffers a similar problem as the POSIX standard >> doesn't talk about "data blocks" etc. >> >> Problematically, of course, the statfs() call doesn't really allow for any >> means to address slack/waste space and the reverse calculation for us >> becomes impossible. >> >> This gets back to the "no right answer in BTRFS" issue. 
>> >> There is a lot of missing magic here. Back when INODES where just one >> thing with one size statfs results were probably either-or and "Everybody >> Knew" how to turn the inode count into a block count and history just rolled >> on. >> >> I think the real answer would be to invent an expanded statfs() call that >> returned the real numbers for @total_size, @system_overhead_used, >> @waste_space, @unusable_space, etc -- that is to come up with a generic >> model for a modern storage system -- and let real calculations take place. >> But I don't have the "community chops" to start that ball rolling. >> >> CONCLUSIONS :: >> >> Given the inherent assumptions of statfs(), there is _no_ solution that >> will be correct in all cases. >> . >> > > -- > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Hi Goffredo, On Tue, Dec 16, 2014 at 1:47 AM, Goffredo Baroncelli <kreijack@inwind.it> wrote: > I Yang, > > On 12/14/2014 12:29 PM, Dongsheng Yang wrote: >> On Sat, Dec 13, 2014 at 3:25 AM, Goffredo Baroncelli <kreijack@inwind.it> wrote: >>> On 12/11/2014 09:31 AM, Dongsheng Yang wrote: >>>> When function btrfs_statfs() calculate the tatol size of fs, it is calculating >>>> the total size of disks and then dividing it by a factor. But in some usecase, >>>> the result is not good to user. >>> >>> >>> I Yang; during my test I discovered an error: >>> >>> $ sudo lvcreate -L +10G -n disk vgtest >>> $ sudo /home/ghigo/mkfs.btrfs -f /dev/vgtest/disk >>> Btrfs v3.17 >>> See http://btrfs.wiki.kernel.org for more information. >>> >>> Turning ON incompat feature 'extref': increased hardlink limit per file to 65536 >>> fs created label (null) on /dev/vgtest/disk >>> nodesize 16384 leafsize 16384 sectorsize 4096 size 10.00GiB >>> $ sudo mount /dev/vgtest/disk /mnt/btrfs1/ >>> $ df /mnt/btrfs1/ >>> Filesystem 1K-blocks Used Available Use% Mounted on >>> /dev/mapper/vgtest-disk 9428992 1069312 8359680 12% /mnt/btrfs1 >>> $ sudo ~/btrfs fi df /mnt/btrfs1/ >>> Data, single: total=8.00MiB, used=256.00KiB >>> System, DUP: total=8.00MiB, used=16.00KiB >>> System, single: total=4.00MiB, used=0.00B >>> Metadata, DUP: total=1.00GiB, used=112.00KiB >>> Metadata, single: total=8.00MiB, used=0.00B >>> GlobalReserve, single: total=16.00MiB, used=0.00B >>> >>> What seems me strange is the 9428992KiB of total disk space as reported >>> by df. About 600MiB are missing ! >>> >>> Without your patch, I got: >>> >>> $ df /mnt/btrfs1/ >>> Filesystem 1K-blocks Used Available Use% Mounted on >>> /dev/mapper/vgtest-disk 10485760 16896 8359680 1% /mnt/btrfs1 >>> >> >> Hi Goffredo, thanx for pointing this. Let me try to show the reason of it. >> >> In this case you provided here, the metadata is 1G in DUP. It means >> we spent 2G device space for it. But from the view point of user, they >> can only see 9G size for this fs. It's show as below: >> 1G >> -----|---------------- Filesystem space (9G) >> | \ >> | \ >> | \ >> | 2G \ >> -----|----|----------------- Device space (10G) >> DUP | >> >> The main idea about the new btrfs_statfs() is hiding the detail in Device >> space and only show the information in the FS space level. > > I am not entirely convinced. On the basis of your statement, I would find a difference of 1G or 1G/2... but missing are 600MB... Ha, forgot to clarify this one. 9428992K is almost 9G actually. 9G = (9 * 2^20)K = 9437184K. So, as I showed in the graphic, @size is 9G. :) > Grzegorz found a difference of 2.3GB (but on a bigger disk). > > Looking at your patch set, it seems to me that the computation of the > "total disk space" is done differently than before. > > Before, the "total disk space" was > >>>> - buf->f_blocks = div_u64(btrfs_super_total_bytes(disk_super), factor); >>>> - buf->f_blocks >>= bits; > > I read this as: "total disk space" is the sum of the disk size (corrected by a > factor depending by the profile). > > Instead after your patches the "total disk space" is > >>>> + buf->f_blocks = total_alloc + total_free_data; > > This means that if a disk is not fully mapped by chunks, there is a difference > between the "total disk space" and the value reported by df. > > May be that you are highlighting a bug elsewhere (not caused by your patch) ? > > BTW > I will continue my tests... Thank you very much. :) Yang > > BR > G.Baroncelli > >> >> Actually, the similar idea is adopted in btrfs fi df. 
>> >> Example: >> # lvcreate -L +10G -n disk Group0 >> # mkfs.btrfs -f /dev/Group0/disk >> # dd if=/dev/zero of=/mnt/data bs=1M count=10240 >> dd: error writing ‘/mnt/data’: No space left on device >> 8163+0 records in >> 8162+0 records out >> 8558477312 bytes (8.6 GB) copied, 207.565 s, 41.2 MB/s >> # df /mnt >> Filesystem 1K-blocks Used Available Use% Mounted on >> /dev/mapper/Group0-disk 9428992 8382896 2048 100% /mnt >> # btrfs fi df /mnt >> Data, single: total=7.97GiB, used=7.97GiB >> System, DUP: total=8.00MiB, used=16.00KiB >> System, single: total=4.00MiB, used=0.00B >> Metadata, DUP: total=1.00GiB, used=8.41MiB >> Metadata, single: total=8.00MiB, used=0.00B >> GlobalReserve, single: total=16.00MiB, used=0.00B >> >> Now, the all space is used and got a ENOSPC error. >> But from the output of btrfs fi df. We can only find almost 9G (data >> 8G + metadata 1G) >> is used. The difference is that it show the DUP here to make >> it more clear. >> >> Wish this description is clear to you :). >> >> If there is anything confusing or sounds incorrect to you, please point it out. >> >> Thanx >> Yang >> >>> >>>> Example: >>>> # mkfs.btrfs -f /dev/vdf1 /dev/vdf2 -d raid1 >>>> # mount /dev/vdf1 /mnt >>>> # dd if=/dev/zero of=/mnt/zero bs=1M count=1000 >>>> # df -h /mnt >>>> Filesystem Size Used Avail Use% Mounted on >>>> /dev/vdf1 3.0G 1018M 1.3G 45% /mnt >>>> # btrfs fi show /dev/vdf1 >>>> Label: none uuid: f85d93dc-81f4-445d-91e5-6a5cd9563294 >>>> Total devices 2 FS bytes used 1001.53MiB >>>> devid 1 size 2.00GiB used 1.85GiB path /dev/vdf1 >>>> devid 2 size 4.00GiB used 1.83GiB path /dev/vdf2 >>>> a. df -h should report Size as 2GiB rather than as 3GiB. >>>> Because this is 2 device raid1, the limiting factor is devid 1 @2GiB. >>>> >>>> b. df -h should report Avail as 0.97GiB or less, rather than as 1.3GiB. >>>> 1.85 (the capacity of the allocated chunk) >>>> -1.018 (the file stored) >>>> +(2-1.85=0.15) (the residual capacity of the disks >>>> considering a raid1 fs) >>>> --------------- >>>> = 0.97 >>>> >>>> This patch drops the factor at all and calculate the size observable to >>>> user without considering which raid level the data is in and what's the >>>> size exactly in disk. >>>> After this patch applied: >>>> # mkfs.btrfs -f /dev/vdf1 /dev/vdf2 -d raid1 >>>> # mount /dev/vdf1 /mnt >>>> # dd if=/dev/zero of=/mnt/zero bs=1M count=1000 >>>> # df -h /mnt >>>> Filesystem Size Used Avail Use% Mounted on >>>> /dev/vdf1 2.0G 1.3G 713M 66% /mnt >>>> # df /mnt >>>> Filesystem 1K-blocks Used Available Use% Mounted on >>>> /dev/vdf1 2097152 1359424 729536 66% /mnt >>>> # btrfs fi show /dev/vdf1 >>>> Label: none uuid: e98c1321-645f-4457-b20d-4f41dc1cf2f4 >>>> Total devices 2 FS bytes used 1001.55MiB >>>> devid 1 size 2.00GiB used 1.85GiB path /dev/vdf1 >>>> devid 2 size 4.00GiB used 1.83GiB path /dev/vdf2 >>>> a). The @Size is 2G as we expected. >>>> b). @Available is 700M = 1.85G - 1.3G + (2G - 1.85G). >>>> c). @Used is changed to 1.3G rather than 1018M as above. Because >>>> this patch do not treat the free space in metadata chunk >>>> and system chunk as available to user. It's true, user can >>>> not use these space to store data, then it should not be >>>> thought as available. At the same time, it will make the >>>> @Used + @Available == @Size as possible to user. >>>> >>>> Signed-off-by: Dongsheng Yang <yangds.fnst@cn.fujitsu.com> >>>> --- >>>> Changelog: >>>> v1 -> v2: >>>> a. account the total_bytes in medadata chunk and >>>> system chunk as used to user. >>>> b. 
set data_stripes to the correct value in RAID0. >>>> >>>> fs/btrfs/extent-tree.c | 13 ++---------- >>>> fs/btrfs/super.c | 56 ++++++++++++++++++++++---------------------------- >>>> 2 files changed, 26 insertions(+), 43 deletions(-) >>>> >>>> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c >>>> index a84e00d..9954d60 100644 >>>> --- a/fs/btrfs/extent-tree.c >>>> +++ b/fs/btrfs/extent-tree.c >>>> @@ -8571,7 +8571,6 @@ static u64 __btrfs_get_ro_block_group_free_space(struct list_head *groups_list) >>>> { >>>> struct btrfs_block_group_cache *block_group; >>>> u64 free_bytes = 0; >>>> - int factor; >>>> >>>> list_for_each_entry(block_group, groups_list, list) { >>>> spin_lock(&block_group->lock); >>>> @@ -8581,16 +8580,8 @@ static u64 __btrfs_get_ro_block_group_free_space(struct list_head *groups_list) >>>> continue; >>>> } >>>> >>>> - if (block_group->flags & (BTRFS_BLOCK_GROUP_RAID1 | >>>> - BTRFS_BLOCK_GROUP_RAID10 | >>>> - BTRFS_BLOCK_GROUP_DUP)) >>>> - factor = 2; >>>> - else >>>> - factor = 1; >>>> - >>>> - free_bytes += (block_group->key.offset - >>>> - btrfs_block_group_used(&block_group->item)) * >>>> - factor; >>>> + free_bytes += block_group->key.offset - >>>> + btrfs_block_group_used(&block_group->item); >>>> >>>> spin_unlock(&block_group->lock); >>>> } >>>> diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c >>>> index 54bd91e..40f41a2 100644 >>>> --- a/fs/btrfs/super.c >>>> +++ b/fs/btrfs/super.c >>>> @@ -1641,6 +1641,8 @@ static int btrfs_calc_avail_data_space(struct btrfs_root *root, u64 *free_bytes) >>>> u64 used_space; >>>> u64 min_stripe_size; >>>> int min_stripes = 1, num_stripes = 1; >>>> + /* How many stripes used to store data, without considering mirrors. */ >>>> + int data_stripes = 1; >>>> int i = 0, nr_devices; >>>> int ret; >>>> >>>> @@ -1657,12 +1659,15 @@ static int btrfs_calc_avail_data_space(struct btrfs_root *root, u64 *free_bytes) >>>> if (type & BTRFS_BLOCK_GROUP_RAID0) { >>>> min_stripes = 2; >>>> num_stripes = nr_devices; >>>> + data_stripes = num_stripes; >>>> } else if (type & BTRFS_BLOCK_GROUP_RAID1) { >>>> min_stripes = 2; >>>> num_stripes = 2; >>>> + data_stripes = 1; >>>> } else if (type & BTRFS_BLOCK_GROUP_RAID10) { >>>> min_stripes = 4; >>>> num_stripes = 4; >>>> + data_stripes = 2; >>>> } >>>> >>>> if (type & BTRFS_BLOCK_GROUP_DUP) >>>> @@ -1733,14 +1738,17 @@ static int btrfs_calc_avail_data_space(struct btrfs_root *root, u64 *free_bytes) >>>> i = nr_devices - 1; >>>> avail_space = 0; >>>> while (nr_devices >= min_stripes) { >>>> - if (num_stripes > nr_devices) >>>> + if (num_stripes > nr_devices) { >>>> num_stripes = nr_devices; >>>> + if (type & BTRFS_BLOCK_GROUP_RAID0) >>>> + data_stripes = num_stripes; >>>> + } >>>> >>>> if (devices_info[i].max_avail >= min_stripe_size) { >>>> int j; >>>> u64 alloc_size; >>>> >>>> - avail_space += devices_info[i].max_avail * num_stripes; >>>> + avail_space += devices_info[i].max_avail * data_stripes; >>>> alloc_size = devices_info[i].max_avail; >>>> for (j = i + 1 - num_stripes; j <= i; j++) >>>> devices_info[j].max_avail -= alloc_size; >>>> @@ -1772,15 +1780,13 @@ static int btrfs_calc_avail_data_space(struct btrfs_root *root, u64 *free_bytes) >>>> static int btrfs_statfs(struct dentry *dentry, struct kstatfs *buf) >>>> { >>>> struct btrfs_fs_info *fs_info = btrfs_sb(dentry->d_sb); >>>> - struct btrfs_super_block *disk_super = fs_info->super_copy; >>>> struct list_head *head = &fs_info->space_info; >>>> struct btrfs_space_info *found; >>>> u64 total_used = 0; >>>> + u64 total_alloc = 0; >>>> u64 
total_free_data = 0; >>>> int bits = dentry->d_sb->s_blocksize_bits; >>>> __be32 *fsid = (__be32 *)fs_info->fsid; >>>> - unsigned factor = 1; >>>> - struct btrfs_block_rsv *block_rsv = &fs_info->global_block_rsv; >>>> int ret; >>>> >>>> /* >>>> @@ -1792,38 +1798,20 @@ static int btrfs_statfs(struct dentry *dentry, struct kstatfs *buf) >>>> rcu_read_lock(); >>>> list_for_each_entry_rcu(found, head, list) { >>>> if (found->flags & BTRFS_BLOCK_GROUP_DATA) { >>>> - int i; >>>> - >>>> - total_free_data += found->disk_total - found->disk_used; >>>> + total_free_data += found->total_bytes - found->bytes_used; >>>> total_free_data -= >>>> btrfs_account_ro_block_groups_free_space(found); >>>> - >>>> - for (i = 0; i < BTRFS_NR_RAID_TYPES; i++) { >>>> - if (!list_empty(&found->block_groups[i])) { >>>> - switch (i) { >>>> - case BTRFS_RAID_DUP: >>>> - case BTRFS_RAID_RAID1: >>>> - case BTRFS_RAID_RAID10: >>>> - factor = 2; >>>> - } >>>> - } >>>> - } >>>> + total_used += found->bytes_used; >>>> + } else { >>>> + /* For metadata and system, we treat the total_bytes as >>>> + * not available to file data. So show it as Used in df. >>>> + **/ >>>> + total_used += found->total_bytes; >>>> } >>>> - >>>> - total_used += found->disk_used; >>>> + total_alloc += found->total_bytes; >>>> } >>>> - >>>> rcu_read_unlock(); >>>> >>>> - buf->f_blocks = div_u64(btrfs_super_total_bytes(disk_super), factor); >>>> - buf->f_blocks >>= bits; >>>> - buf->f_bfree = buf->f_blocks - (div_u64(total_used, factor) >> bits); >>>> - >>>> - /* Account global block reserve as used, it's in logical size already */ >>>> - spin_lock(&block_rsv->lock); >>>> - buf->f_bfree -= block_rsv->size >> bits; >>>> - spin_unlock(&block_rsv->lock); >>>> - >>>> buf->f_bavail = total_free_data; >>>> ret = btrfs_calc_avail_data_space(fs_info->tree_root, &total_free_data); >>>> if (ret) { >>>> @@ -1831,8 +1819,12 @@ static int btrfs_statfs(struct dentry *dentry, struct kstatfs *buf) >>>> mutex_unlock(&fs_info->fs_devices->device_list_mutex); >>>> return ret; >>>> } >>>> - buf->f_bavail += div_u64(total_free_data, factor); >>>> + buf->f_bavail += total_free_data; >>>> buf->f_bavail = buf->f_bavail >> bits; >>>> + buf->f_blocks = total_alloc + total_free_data; >>>> + buf->f_blocks >>= bits; >>>> + buf->f_bfree = buf->f_blocks - (total_used >> bits); >>>> + >>>> mutex_unlock(&fs_info->chunk_mutex); >>>> mutex_unlock(&fs_info->fs_devices->device_list_mutex); >>>> >>>> >>> >>> >>> -- >>> gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it> >>> Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5 >>> -- >>> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in >>> the body of a message to majordomo@vger.kernel.org >>> More majordomo info at http://vger.kernel.org/majordomo-info.html >> -- >> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in >> the body of a message to majordomo@vger.kernel.org >> More majordomo info at http://vger.kernel.org/majordomo-info.html >> > > > -- > gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it> > Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
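To make the effect of the quoted v2 hunks easier to follow, here is a rough user-space restatement of the accounting btrfs_statfs() ends up doing with the patch applied. The space-info numbers and the unallocated-space figure are invented for the example; in the kernel the loop walks fs_info->space_info under RCU, the unallocated estimate comes from btrfs_calc_avail_data_space(), everything is shifted by s_blocksize_bits, and the read-only block-group correction is omitted here.

#include <stdint.h>
#include <stdio.h>

/* stand-in for the struct btrfs_space_info fields used by the patch */
struct space_info {
        int is_data;           /* BTRFS_BLOCK_GROUP_DATA set?            */
        uint64_t total_bytes;  /* chunk space allocated, filesystem view */
        uint64_t bytes_used;   /* of which actually used                 */
};

int main(void)
{
        /* invented layout: 1 GiB data chunk half full, 1 GiB metadata+system */
        struct space_info infos[] = {
                { 1, 1ULL << 30, 512ULL << 20 },
                { 0, 1ULL << 30,  16ULL << 20 },
        };
        /* invented stand-in for btrfs_calc_avail_data_space(): unallocated
         * device space already reduced to data bytes for the data profile */
        uint64_t unalloc_data = 6ULL << 30;

        uint64_t total_alloc = 0, total_used = 0, total_free_data = 0;
        for (unsigned i = 0; i < sizeof(infos) / sizeof(infos[0]); i++) {
                if (infos[i].is_data) {
                        total_free_data += infos[i].total_bytes - infos[i].bytes_used;
                        total_used += infos[i].bytes_used;
                } else {
                        /* metadata/system chunks are shown as used in df */
                        total_used += infos[i].total_bytes;
                }
                total_alloc += infos[i].total_bytes;
        }

        uint64_t f_blocks = total_alloc + unalloc_data;
        uint64_t f_bavail = total_free_data + unalloc_data;
        uint64_t f_bfree  = f_blocks - total_used;

        printf("size %llu MiB, used %llu MiB, free %llu MiB, avail %llu MiB\n",
               (unsigned long long)(f_blocks >> 20),
               (unsigned long long)(total_used >> 20),
               (unsigned long long)(f_bfree >> 20),
               (unsigned long long)(f_bavail >> 20));
        return 0;
}

With these made-up numbers it prints size 8192 MiB, used 1536 MiB, free 6656 MiB, avail 6656 MiB; @used + @avail equals @size, which is the property the changelog aims for.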
On 12/16/2014 03:30 AM, Dongsheng Yang wrote: > > Hi Robert, thanx for your proposal about this. > > IMHO, output of df command shoud be more friendly to user. > Well, I think we have a disagreement on this point, let's take a look at > what the zfs is doing. > > /dev/sda7- 10G > /dev/sda8- 10G > # zpool create myzpool mirror /dev/sda7 /dev/sda8 -f > # df -h /myzpool/ > Filesystem Size Used Avail Use% Mounted on > myzpool 9.8G 21K 9.8G 1% /myzpool > > That said that df command should tell user the space info they can see. > It means the output is the information from the FS level rather than > device level or _storage_manager level. That's great for ZFS, but ZFS isn't BTRFS. ZFS can't get caught halfway between existing modailties and sit there forever. ZFS doesn't restructure itself. So the simple answer you want isn't _possible_ outside very simple cases. So again, you've displayed a _simple_ case as if it covers or addresses all the complex cases. (I don't have the storage to actually do the exercise) But what do you propose the correct answer is for the following case: /dev/sda - 7.5G /dev/sdb - 7.5G mkfs.btrfs /dev/sd{a,b} -d raid0 mount /dev/sda /mnt dd if=/dev/urandom of=/mnt/consumed bs=1G count=7 btrfsck balance start -dconvert=raid1 -dlimit=1 /mnt (wait) /bin/df The filesystem is now in a superposition where all future blocks are going to be written as raid1, one 2G stripe has been converted into two two-gig stripes that have been mirrored, and six gig is still RAID0. In your proposal we now have @size=7G @used=??? (clearly 7G, at the least, is consumed) @filesize[consumed]=7G @available is really messed up since there is now _probably_ 1G of one of the original raid0 extents with free space and so available, almost all of the single RAID1 metadata block, Room for three more metadata stripes, and room for one more RAID1 extent. so @available=2-ish gigs. But since statfs() pivots on @size and @available /bin/df is going to report @used as 3-ish gigs even though we've got an uncompressed and uncompressable @7G file. NOW waht if we went the _other_ way? /dev/sda - 7.5G /dev/sdb - 7.5G mkfs.btrfs /dev/sd{a,b} -d raid1 mount /dev/sda /mnt dd if=/dev/urandom of=/mnt/consumed bs=1G count=7 btrfsck balance start -dconvert=raid0 -dlimit=1 /mnt (wait) /bin/df filesystem is _full_ when the convert starts. @size=14Gig @used=7G @actual_available=0 @reported_available=??? (at least 2x1G extents are up for grabs so minimum 2G) @reported_used=??? @calculated_used=??? We are either back to reporting available space when non-trivial allocation will report ENOSPC (if I did the math right etc). Now do partial conversions to other formats and repeat the exercise. Now add or remove storage here-or-there. The "working set" and the "current model" are not _required_ to be in harmony at any given time, so trying to analyze the working set based on the current model is NP-complete. In every modality we find that at some point we _can_ either report 0 available and still have room, or we report non-zero available and the user's going to get an ENOSPC. So fine, _IF_ btrfs disallowed conversion and fixed its overhead -- that is if it stopped bing btrfs -- we could safely report cooked numbers. But at that point, why BTRFS at all? As for being easier on the user, that just depends on which lie each user wants. 
People are smart enough to use a fair approximation, and the only fair approximation we have available are at the storage management level -- which is halfway between the raw blocks and the cooked file-system numbers. A BTRFS filesystem just _isn't_ fully cooked until the last data extent is allocated, and because of COW that's over-cooked as well. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 12/17/2014 03:52 AM, Robert White wrote: > On 12/16/2014 03:30 AM, Dongsheng Yang wrote: >> >> Hi Robert, thanx for your proposal about this. >> >> IMHO, output of df command shoud be more friendly to user. >> Well, I think we have a disagreement on this point, let's take a look at >> what the zfs is doing. >> >> /dev/sda7- 10G >> /dev/sda8- 10G >> # zpool create myzpool mirror /dev/sda7 /dev/sda8 -f >> # df -h /myzpool/ >> Filesystem Size Used Avail Use% Mounted on >> myzpool 9.8G 21K 9.8G 1% /myzpool >> >> That said that df command should tell user the space info they can see. >> It means the output is the information from the FS level rather than >> device level or _storage_manager level. > > That's great for ZFS, but ZFS isn't BTRFS. ZFS can't get caught > halfway between existing modailties and sit there forever. ZFS doesn't > restructure itself. So the simple answer you want isn't _possible_ > outside very simple cases. > > So again, you've displayed a _simple_ case as if it covers or > addresses all the complex cases. > > (I don't have the storage to actually do the exercise) But what do you > propose the correct answer is for the following case: > > > /dev/sda - 7.5G > /dev/sdb - 7.5G > > mkfs.btrfs /dev/sd{a,b} -d raid0 > mount /dev/sda /mnt > dd if=/dev/urandom of=/mnt/consumed bs=1G count=7 > btrfsck balance start -dconvert=raid1 -dlimit=1 /mnt > (wait) > /bin/df > > > The filesystem is now in a superposition where all future blocks are > going to be written as raid1, one 2G stripe has been converted into > two two-gig stripes that have been mirrored, and six gig is still RAID0. > Similar with your mail about inline file case, I think this is another thing. I really appreciate your persistence and so much analyse in semasiology. I think I will never convince you at this point even with thousands mails to express myself. What about implementing your proposal here and sending a patchset at least with RFC. Then it can be more clear to us and we can make the choice more easily. :) Again, below is all points from my side: The df command we discussed here is on the top level which is directly facing the user. 1), For a linux user, he does not care about the detail how the data is stored in devices. They do not care even not know what's Single, what does DUP mean, and how a fs implement the RAID10. What they want to know is *what is the size of the filesystem I am using and how much space is still available to me*. That's what I said by "FS space level" 2). For a btrfs user, they know about the single, dup and RAIDX. When they want to know what's the raid level in each space info, they can use btrfs fi df to print the information they want. 3). Device level. for debugging. Sometimes, you need to know how the each chunk is stored in device. Please use btrfs-debug-tree to show details you want as more as possible. 4). df in ZFS is showing the FS space information. 5). For the elder btrfs_statfs(), We have discussed about df command and chosen to hide the detail information to user. And only show the FS space information to user. Current btrfs_statfs() is working like this. 6). IIUC, you are saying that: a). BTRFS is not ZFS, df in zfs is not referenced to us. b). Current btrfs_statfs() is woring in a wrong way, we need to reimplement it from another new view point. I am not sure your proposal here is not better. As I said above, I am pleased to see a demo of it. If it works better, I am really very happy to agree with you in this argument. 
My patch could be objected, but the problem in current btrfs_statfs() should be fixed by some ways. If you submit a better solution, I am pleased to see btrfs becoming better. :) Thanx Yang > In your proposal we now have > @size=7G > @used=??? (clearly 7G, at the least, is consumed) > @filesize[consumed]=7G > > @available is really messed up since there is now _probably_ 1G of one > of the original raid0 extents with free space and so available, almost > all of the single RAID1 metadata block, Room for three more metadata > stripes, and room for one more RAID1 extent. > > so @available=2-ish gigs. > > But since statfs() pivots on @size and @available /bin/df is going to > report @used as 3-ish gigs even though we've got an uncompressed and > uncompressable @7G file. > > NOW waht if we went the _other_ way? > > /dev/sda - 7.5G > /dev/sdb - 7.5G > > mkfs.btrfs /dev/sd{a,b} -d raid1 > mount /dev/sda /mnt > dd if=/dev/urandom of=/mnt/consumed bs=1G count=7 > btrfsck balance start -dconvert=raid0 -dlimit=1 /mnt > (wait) > /bin/df > > filesystem is _full_ when the convert starts. > > @size=14Gig > @used=7G > @actual_available=0 > @reported_available=??? (at least 2x1G extents are up for grabs so > minimum 2G) > @reported_used=??? > @calculated_used=??? > > We are either back to reporting available space when non-trivial > allocation will report ENOSPC (if I did the math right etc). > > Now do partial conversions to other formats and repeat the exercise. > Now add or remove storage here-or-there. > > The "working set" and the "current model" are not _required_ to be in > harmony at any given time, so trying to analyze the working set based > on the current model is NP-complete. > > In every modality we find that at some point we _can_ either report 0 > available and still have room, or we report non-zero available and the > user's going to get an ENOSPC. > > > So fine, _IF_ btrfs disallowed conversion and fixed its overhead -- > that is if it stopped bing btrfs -- we could safely report cooked > numbers. > > But at that point, why BTRFS at all? > > As for being easier on the user, that just depends on which lie each > user wants. People are smart enough to use a fair approximation, and > the only fair approximation we have available are at the storage > management level -- which is halfway between the raw blocks and the > cooked file-system numbers. > > A BTRFS filesystem just _isn't_ fully cooked until the last data > extent is allocated, and because of COW that's over-cooked as well. > > > . > -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
I don't disagree with the _ideal_ of your patch. I just think that it's impossible to implement it without lying to the user or making things just as bad in a different way. I would _like_ you to be right. But my thing is finding and quantifying failure cases and the entire question is full of fail. This is not an attack on you personally, it's a mismatch between the storage and file system paradigms that we've seen first because we are the first to really blend the two fairly. Here is a completely legal BTRFS working set. (it's a little extreme.) /dev/sda :: '|Sf|Sf|Sp|0f|1f|0p|0p|Mf|Mf|Mp|1p|1.25GiB-unallocated| /dev/sdb :: '|0f|1f|0p|0p|Mp|1p| 4.75GiB-unalloated | Legend p == partial, about half full. f == full, or full enough to treat as full. S == Single allocated chunk 0 == RAID=0 allocated chunk 1 == RAID=1 allocated chunk M == metadata chunk History: This filesystem started out on a single drive, then it's bounced between RAID-0 and RAID-1 at least twice. The owner has _never_ let a conversion finish. Indeed this user has just changed modes a couple times. The current filesystem flag says RAID-1. But we currently have .5GiB of "single" slack, 2GiB of RAID-0 slack, 1GiB of RAID-1 slack, 2GiB of space where a total of 1GiB more RIAD1 extents can be created, and we have 3GiB of space on /dev/sdb that _can_ _not_ be allocated. We have room for 1 more metadata extent on each drive, but if we allocate two more metadat extents on each drive we will burn up 1.25 GiB by reducing it to 0.75GiB. First, a question. Will a BTRFS in RAID1 mode add file data to extents that are in other modes? That is, will the filesystem _use_ the 2.5GiB of available "single" and "RAID0" data? If no, then that's 2.5GiB of "phantom consumption" space that insn't "used" but also isn't usable. The size of the store is 20GiB. The default of 2x10GiB you propose would be 10GiB. But how do you identify the 3GiB "missing" because of the lopsided allocation history? Seem unlikely? The rotten cod example I've given is unlikely. But a more even case is downright common and likely. Say you run a nice old-fashoned MUTT mail-spool. "most" of your files are small enough to live in metadata. You start with one drive. and allocate 2 single-data and 10 metatata (5xDup). Then you add a second drive of equal size. (the metadata just switched to DUP-as-RAID1-alike mode) And then you do a dconvert=raid0. That uneven allocation of metadata will be a 2GiB difference between the two drives forever. So do you shave 2GiB off of your @size? Do you shave @2GiB off your @available? Do you overreport your available by @2GiB and end up _still_ having things "available" when you get your ENOSPC? How about this :: /dev/sda == |Sf|Sf|Mf|Mf|Mf|Mf|Sf|Sf|Sp|Mp|Mp| .5GiB free| /dev/sdb == |10 GiB free | Operator fills his drive, then adds a second one, then _foolishly_ tries to convert it to RAID0 when the power fails. In order to check the FS he boots with no_balance. Then his maintenance window closes and he has to go back into production, at which point he forgets (or isn't allowed) to do the balance. The flags are set but now no more extents can be allocated. Size is 20GiB, slack is 10.5GiB. Operator is about to get ENOSPACE. Yes a balance would fix it, but that's not the question. In the meantime what does your patch report? Or... /dev/sda == |Sf|Sf|Mf|Mf|Mf|Mf|Sf|Sf|Sp|Mp|Mp| .5GiB free| /dev/sdb == |10 GiB free | /dev/sdc == |10 GiB free | Does a -dconvert=raid5 and immediately gets ENOSPC for all the blocks. 
According to the flags we've got 10GiB free... Or we end up with an egregious metadata history from lots of small files and we've got a perfectly fine RAID1 with several GiB of slack but none of that slack is 1GiB contiguous. All the slack has just come from reclaiming metadata. /dev/sda == |Sf|Sf|Mp|Mp|Rx|Rx|Mp|Mp|Rx|Rx|Mp|Mp| N-free slack| (R == reclaimed, e.g. avalable to extent-tree.c for allocation) We have a 1.5GB of "poisoned" space here; it can hold metadata but not data. So is that 1.5 in your @available calculation? How do you mark it up as used. ... And I've been ingoring the Mp(s) completely. What if I've got a good two GiB of partial space in the metadata, but that's all I've got. You write a file of any size and you'll get ENOSPC even though you've got that GiB. Was it in @size? Is it in @avail? ... See you keep giving me these examples where the history of the filesystem is uniform. It was made a certain way and stayed that way. But in real life this sort of thing is going to happen and your patch simply report's a _different_ _wrong_ number. A _friendlier_ wrong number, I'll grant you that, but still wrong. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Robert White posted on Wed, 17 Dec 2014 20:07:27 -0800 as excerpted: > We have room for 1 more metadata extent on each > drive, but if we allocate two more metadat extents on each drive we will > burn up 1.25 GiB by reducing it to 0.75GiB. FWIW, at least the last chunk assignment can be smaller than normal. I believe I've seen it happen both here and on posted reports. The 0.75 GiB could thus be allocated as data if needed. I'm not actually sure how the allocator works with the last few GiB. Some developer comments have hinted that it starts carving smaller chunks before it actually has to, and I could imagine it dropping data chunks to a half gig, then a quarter gig, than 128 MiB, then 64 MiB... as space gets tight, and metadata chunks similarly but of course from a quarter gig down. That I'm not sure about. But I'm quite sure it will actually use the last little bit (provided it can properly fill its raid policy when doing so), as I'm quite sure I've seen it do it. I know for sure it does that in mixed-mode, as I have a 256 MiB mixed- mode dup /boot (and a backup /boot of the same size on the other device, so I can select the one that's booted from the BIOS), and they tend to be fully chunk-allocated. Note that even with mixed-mode, which defaults to metadata-sized-chunks, thus 256 MiB, on a 256 MiB device, by the time overhead, system, and reserve chunks are allocated, there's definitely NOT 256 MiB left for a data/metadata-mixed chunk, so if it couldn't allocate smaller bits it couldn't allocate even ONE chunk, let alone a pair (dup mode). And I think I've seen it happen on my larger (not mixed) filesystems of several GiB as well, tho I don't actually tend to fill them up quite so routinely, so it's more difficult to say for sure.
On Sun, Dec 14, 2014 at 10:06:50PM -0800, Robert White wrote: > ABSTRACT:: Stop being clever, just give the raw values. That's what > you should be doing anyway. There are no other correct values to > give that doesn't blow someone's paradigm somewhere. The trouble is a lot of existing software can't cope with the raw values without configuration changes and access to a bunch of out-of-band data. Nor should it. I thank Robert for providing so many pathological examples in this thread. They illustrate nicely why it's so important to provide adequately cooked values through statvfs, especially for f_bavail! > ITEM #1 :: In my humble opinion (ha ha) the size column should never > change unless you add or remove actual storage. It should > approximate the raw block size of the device on initial creation, > and it should adjust to the size changes that happen when you > semantically resize the filesystem with e.g. btrfs resize. The units for f_blocks and f_bavail should be roughly similar because software does calculate the ratio of those values (i.e. percentage of disk used); however, there is no strong accuracy requirement--it could be off by a few percent, and most software won't care. Some software will start to misbehave if the ratio error is too large, e.g. btrfs-RAID1 reporting the total disk size instead of the stored data size. This makes it necessary to scale f_blocks according to chunk profiles, at least to within a few percent of actual size. One critical rule is that (f_blocks - f_bavail) and (f_blocks - f_bfree) should never be negative, or software will break in assorted ways. > ITEM #4 :: Blocks available to unprivileged users is pretty "iffy" > since unprivileged users cannot write to the filesystem. This datum > doesn't have a "plain reading". I'd start with filesystem total > blocks, then subtract the total blocks used by all trees nodes in > all trees. (e.g. Nodes * 0x1000 or whatever node size is) then shave > off the N superblocks, then subtract the number of blocks already > allocated in data extents. And you're done. Blocks available (f_bavail) should be the intersection of all sets of blocks currently available for allocation by all users (ignoring quota for now). It must be an underestimate, so that ENOSPC implies f_bavail == 0. It is acceptable to report f_bavail = 0 but allow writes and allocations to succeed for a subset of cases (e.g. inline extents, metadata modifications). Historically, filesystems reserved arbitrary quantities of blocks that were available to unprivileged users, but not counted in f_bavail, in order to ensure the relation holds even in corner cases. Existing application software is well adapted to this behavior. Compressing filesystems underestimate free space on the filesystem, in the sense that N blocks written may result in M fewer free blocks where M < N. Compression in filesystems has been well tolerated by application software for years now. ENOSPC when f_bavail > 0 is very bad. 
Depending on how large f_bavail was at the ENOSPC failure, it means: - admins didn't get alerts triggered by a low space threshold and did not take action to avoid ENOSPC - software that garbage-collects based on available space wasn't activated in time to prevent ENOSPC - service applications have been starting transactions they believe they can finish because f_bavail is large enough, but failing when they hit ENOSPC instead f_bavail should be the minimum of the free data space and the free metadata space (counting free space in mixed-mode chunks and non-chunk disk space that can be used by the current profile in both sets). This provides a more accurate indication of the amount of free space in cases where metadata blocks are more scarce than data blocks. This breaks a common trap for btrfs users who are blindsided by metadata ENOSPC with gigabytes "free" in df. f_bavail should count only free space in chunks that are already allocated in the current allocation profile, or unallocated space that _could_ become a chunk of the currently selected profile (separately for data and metadata, then the lower of the two numbers kept). Free space not allocated to chunks could be difficult to measure precisely, so guess low. (Aside: would it make sense to have btrfs preallocate all available space just to make this calculation easier? We can now reallocate empty chunks on demand, so one reason _not_ to do this is already gone) Chunks of non-current profiles (e.g. raid0 when the current profile is raid1) should not be counted in f_bavail because such space can only become available for use after a balance/convert operation. Such space should appear in f_bavail after that operation occurs. This means that f_bavail will hit zero long before writes to the filesystem start to fail due to lack of space, so there will be plenty of warning for the administrator and software before space really runs out. Free metadata space is harder to deal with through statvfs. Currently f_files, f_favail, and f_ffree are all zeros--they could be overloaded to provide an estimate of metadata space separate from the estimate of data space. If the patch isn't taking those cases (and Robert's pathological examples) into account, it should. > Just as EXT before us didn't bother trying to put in a fudge factor > that guessed what percentage of files would end up needing indirect > blocks, we shouldn't be in the business of trying to back-figure > cost-of-storage. That fudge factor is already built into most application software, so it's not required for btrfs. 0.1% - 1% is the usual overhead for many filesystems, and most software adds an order of magnitude above that. 100% overhead (for btrfs-RAID1 with raw statvfs numbers) is well outside the accepted range and will cause problems. > The raw numbers are _more_ useful in many circumstances. The raw > blocks used, for example, will tell me what I need to know for thin > provisioning on other media, for example. Literally nothing else > exposes that sort of information. You'd need more structure in the data than statvfs provides to support that kind of decision, and btrfs has at least two specialized tools to expose it. statvfs is not the appropriate place for raw numbers. The numbers need to be cooked so existing applications can continue to make useful decisions with them without having to introduce awareness of the filesystem into their calculations. Or put another way, the filesystem should not expose its concrete implementation details through such an abstract interface. 
> Just put a prominent notice that the user needs to remember to > factor their choice of redundancy et al into the numbers. Why, when the filesystem knows that choice and can factor it into the numbers for us? Also, struct statvfs has no "prominent notice" field. ;) > (5c) If your metadata rate is different than your data rate, then there > is _absolutely_ no way to _programatically_ predict how the data _might_ > be used, and this is the _default_ usage model. Literally the hardest > model is the normal model. There is actually no predictive solution. So We only need the worst-case prediction for f_bavail, and that is the lower of two numbers that we certainly can calculate. > "Blocks available to unprivileged users" is the only tricky one.[...] > Fortunately (or hopefully) that's not the datum /bin/df usually returns. I think you'll find that f_bavail *is* the datum /bin/df usually returns. If you want f_bfree you need to use 'stat -f' or roll your own df.
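A minimal sketch of the f_bavail rule described above, assuming the four inputs are already known; the function and parameter names are invented for illustration and this is not existing btrfs code. The idea is simply that whichever of the data or metadata pools would run out first is what the user will actually hit.

#include <stdint.h>
#include <stdio.h>

static uint64_t min_u64(uint64_t a, uint64_t b)
{
        return a < b ? a : b;
}

/*
 * free_data_chunks: free bytes in already-allocated chunks of the current
 *                   data profile (wrong-profile chunks are *not* counted)
 * free_meta_chunks: free bytes in current-profile metadata chunks
 * unalloc_as_data:  unallocated device space, expressed as data bytes if
 *                   chunked with the current data profile (guess low)
 * unalloc_as_meta:  the same space expressed as metadata bytes
 */
static uint64_t cooked_f_bavail(uint64_t free_data_chunks,
                                uint64_t free_meta_chunks,
                                uint64_t unalloc_as_data,
                                uint64_t unalloc_as_meta)
{
        uint64_t data_side = free_data_chunks + unalloc_as_data;
        uint64_t meta_side = free_meta_chunks + unalloc_as_meta;

        /* whichever pool runs dry first is what the user actually hits */
        return min_u64(data_side, meta_side);
}

int main(void)
{
        /* invented numbers: plenty of data space, metadata nearly full */
        uint64_t avail = cooked_f_bavail(40ULL << 30, 64ULL << 20, 0, 0);

        printf("f_bavail ~ %llu MiB\n", (unsigned long long)(avail >> 20));
        return 0;
}

With these made-up inputs df would show roughly 64 MiB available even though tens of gigabytes of data space remain, which is exactly the early warning described above for the metadata-limited case.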
On 12/18/2014 12:07 PM, Robert White wrote: > I don't disagree with the _ideal_ of your patch. I just think that > it's impossible to implement it without lying to the user or making > things just as bad in a different way. I would _like_ you to be right. > But my thing is finding and quantifying failure cases and the entire > question is full of fail. > ... > > See you keep giving me these examples where the history of the > filesystem is uniform. It was made a certain way and stayed that way. > But in real life this sort of thing is going to happen and your patch > simply report's a _different_ _wrong_ number. A _friendlier_ wrong > number, I'll grant you that, but still wrong. > Hi Robert, sorry for the late. (It's busy to deal with some other work.) Yes, you are providing some examples here to show that in some cases the numbers from df is not helpful to understand the real space state in disk. But I'm afraid we can not blame df reporting a wrong number. You could say "Hey df, you are stupid, we can store the small file in metadata to exceed your @size.". But he just reports the information from what he is seeing from FS level honestly. He is reporting what he can see, and do not care about the other things out of his business. The all special cases you provided in this thread, actually lead to one result that "Btrfs is special, df command will report an improper in some case.". It means we need some other methods to tell user about the space info. And yes, we had. It's btrfs fi df. You are trying to make df showing information as more as possible, even change the behaviour of df in btrfs to show the numbers of different meaning with what it is in other filesystems. And my answer is keeping df command doing what he want, and solve the problems in our private "df" , btrfs fi df. Thanx Yang > . xaE > -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 12/23/2014 04:31 AM, Dongsheng Yang wrote: > On 12/18/2014 12:07 PM, Robert White wrote: >> I don't disagree with the _ideal_ of your patch. I just think that >> it's impossible to implement it without lying to the user or making >> things just as bad in a different way. I would _like_ you to be right. >> But my thing is finding and quantifying failure cases and the entire >> question is full of fail. >> > ... >> >> See you keep giving me these examples where the history of the >> filesystem is uniform. It was made a certain way and stayed that way. >> But in real life this sort of thing is going to happen and your patch >> simply report's a _different_ _wrong_ number. A _friendlier_ wrong >> number, I'll grant you that, but still wrong. >> > > Hi Robert, sorry for the late. (It's busy to deal with some other work.) > > Yes, you are providing some examples here to show that in some cases the > numbers from df is not helpful to understand the real space state in > disk. But > I'm afraid we can not blame df reporting a wrong number. You could say > "Hey df, you are stupid, we can store the small file in metadata to exceed > your @size.". But he just reports the information from what he is seeing > from FS level honestly. > He is reporting what he can see, and do not care about the other things > out of > his business. > > The all special cases you provided in this thread, actually lead to one > result that > "Btrfs is special, df command will report an improper in some case.". It > means > we need some other methods to tell user about the space info. And yes, > we had. It's > btrfs fi df. > > You are trying to make df showing information as more as possible, even > change > the behaviour of df in btrfs to show the numbers of different meaning > with what it > is in other filesystems. And my answer is keeping df command doing what > he want, > and solve the problems in our private "df" , btrfs fi df. > > Thanx > Yang >> . xaE >> > > Fine. but you still haven't told me/us what you thing df (really fsstat() ) should report in those cases. Go back to that email. Grab some of those cases. Then tell me what your patch will report and why it makes more sense than what is reported now. For code to be correct it _must_ be correct for the "special" cases. You don't get to just ignore the special cases. So what are your specific answers to those cases? -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, Dec 17, 2014 at 08:07:27PM -0800, Robert White wrote: >[...] There are a number of pathological examples in here, but I think there are justifiable correct answers for each of them that emerge from a single interpretation of the meanings of f_bavail, f_blocks, and f_bfree. One gotcha is that some of the numbers required may be difficult to calculate precisely before all space is allocated to chunks; however, some error is tolerable as long as free space is not overestimated. In other words: when in doubt, guess low. statvfs(2) gives us six numbers, three of which are block counts. Very few users or programs ever bother to look at the inode counts (f_files, f_ffree, f_favail), but they could be overloaded for metadata block counts. The f_blocks parameter is mostly irrelevant to application behavior, except to the extent that the ratio between f_bavail and f_blocks is used by applications to calculate a percentage of occupied or free space. f_blocks must always be greater than or equal to f_bavail and f_blocks, and preferably f_blocks would be scaled to use the same effective unit size as f_bavail and f_blocks within a percent or two. Nobody cares about f_bfree since traditionally only root could use the difference between f_bfree and f_bavail. f_bfree is effectively space conditionally available (e.g. if the process euid is root or the process egid matches a configured group id), while f_bavail is space available without conditions (e.g. processes without privilege can use it). The most important number is f_bavail. It's what a bunch of software (archive unpackers, assorted garbage collectors, email MTAs, snapshot remover scripts, download managers, etc) uses to estimate how much space is available without conditions (except quotas, although arguably those should be included too). Applications that are privileged still use the unprivileged f_bavail number so their decisions based on free space don't disrupt unprivileged applications. It's generally better to underestimate than to overestimate f_bavail. Historically filesystems have reserved extra space to avoid various problems in low-disk conditions, and application software has adapted to that well over the years. Also, admin people are more pleasantly surprised when it turns out that they had more space than f_bavail, instead of when they had less. The rule should be: if we have some space, but it is not available for data extents in the current allocation mode, don't add it to f_bavail in statvfs. I think this rule handles all of these examples well. That would mean that we get cases where we add a drive to a full filesystem and it doesn't immediately give you any new f_bavail space. That may be an unexpected result for a naive admin, but much less unexpected than having all the new space show up in f_bavail when it is not available for allocation in the current data profile! Better to have the surprising behavior earlier than later. On to examples... > But a more even case is downright common and likely. Say you run a > nice old-fashoned MUTT mail-spool. "most" of your files are small > enough to live in metadata. You start with one drive. and allocate 2 > single-data and 10 metatata (5xDup). Then you add a second drive of > equal size. (the metadata just switched to DUP-as-RAID1-alike mode) > And then you do a dconvert=raid0. > > That uneven allocation of metadata will be a 2GiB difference between > the two drives forever. > So do you shave 2GiB off of your @size? Yes. 
f_blocks is the total size of all allocated chunks plus all free space allocated by the current data profile. That 2GiB should disappear from such a calculation. > Do you shave @2GiB off your @available? Yes, because it's _not_ available until something changes to make it available (e.g. balance to get rid of the dup metadata, change the metadata profile to dup or single, or change the data profile to single). The 2GiB could be added to f_bfree, but that might still be confusing for people and software. > Do you overreport your available by @2GiB and end up _still_ having > things "available" when you get your ENOSPC? No. ENOSPC when f_bavail > 0 is very bad. Low-available-space admin alerts will not be triggered. Automated mitigation software will not be activated. Service daemons will start transactions they cannot complete. > How about this :: > > /dev/sda == |Sf|Sf|Mf|Mf|Mf|Mf|Sf|Sf|Sp|Mp|Mp| .5GiB free| > /dev/sdb == |10 GiB free | > > Operator fills his drive, then adds a second one, then _foolishly_ > tries to convert it to RAID0 when the power fails. In order to check > the FS he boots with no_balance. Then his maintenance window closes > and he has to go back into production, at which point he forgets (or > isn't allowed) to do the balance. The flags are set but now no more > extents can be allocated. > > Size is 20GiB, slack is 10.5GiB. Operator is about to get ENOSPACE. f_bavail should be 0.5GB or so. Operator is now aware that ENOSPC is imminent, and can report to whoever grants permission to do things that the machine will be continuing to balance outside of the maintenance window. This is much better than the alternative, which is that the lack of available space is detected by application failure outside of a maintenance window. Even better: if f_bavail is reflective of reality, the operator can minimize out-of-window balance time by monitoring f_bavail and pausing the balance when there is enough space to operate without ENOSPC until the next maintenance window. > Yes a balance would fix it, but that's not the question. The question is "how much space is available at this time?" and the correct answer is "almost none," and it stays that way until and unless someone runs a balance, adds more drives, deletes a lot of data, etc. balance changes the way space will be allocated, so it also changes the output of df to match. > In the meantime what does your patch report? It should report that there's almost no available space. If the patch doesn't report that, the patch needs rework. > Or... > > /dev/sda == |Sf|Sf|Mf|Mf|Mf|Mf|Sf|Sf|Sp|Mp|Mp| .5GiB free| > /dev/sdb == |10 GiB free | > /dev/sdc == |10 GiB free | > > Does a -dconvert=raid5 and immediately gets ENOSPC for all the > blocks. According to the flags we've got 10GiB free... 10.5GiB is correct: 1.0GiB in a 3-way RAID5 on sd[abc] (2x 0.5GiB data, 1x 0.5GiB parity), and 9.5GiB in a 2-way RAID5 (1x 9.5GiB data, 1x 9.5GiB parity) on sd[bc]. The current RAID5 implementation might not use space that way, but if that's true, that's arguably a bug in the current RAID5 implementation. If it was -dconvert=raid6, there would be only 0.5GiB in f_bavail (1x 0.5GiB data, 2x 0.5GiB parity) since raid6 requires 3 disks per chunk. > Or we end up with an egregious metadata history from lots of small > files and we've got a perfectly fine RAID1 with several GiB of slack > but none of that slack is 1GiB contiguous. All the slack has just > come from reclaiming metadata. 
> > /dev/sda == |Sf|Sf|Mp|Mp|Rx|Rx|Mp|Mp|Rx|Rx|Mp|Mp| N-free slack| > > (R == reclaimed, e.g. avalable to extent-tree.c for allocation) > > We have a 1.5GB of "poisoned" space here; it can hold metadata but > not data. So is that 1.5 in your @available calculation? How do you > mark it up as used. Overload f_favail to report metadata space maybe? That's kind of ugly, though, and nobody looks at f_favail anyway (or, everyone who looks at f_favail is going to be aware enough to look at btrfs fi df too). The metadata chunk free space could also go in f_bfree: it's available for use under some but not all conditions. In all cases, leave such space out of f_bavail. The space is only available under some conditions, and f_bavail is effectively about space available without conditions. Space allocated to metadata-only chunks should never be reported as available through f_bavail in statvfs. Also note that if free space in metadata chunks could be limiting allocations then it should be reported in f_bavail instead of free space in data chunks. That can be as simple as reporting the lesser of the two numbers when there is no free space on the filesystem left to allocate to chunks. > And I've been ingoring the Mp(s) completely. What if I've got a good > two GiB of partial space in the metadata, but that's all I've got. > You write a file of any size and you'll get ENOSPC even though > you've got that GiB. Was it in @size? Is it in @avail? @size should be the sum of the sizes of all allocated chunks in the filesystem, and all free space as if it was allocated with the current raid profile. It should change if there was unallocated space and the default profile changed, or if there was a balance converting existing chunks to a new profile, or if disks were added or removed. Don't report that GiB of metadata noise in f_bavail. This may mean that f_bavail is zero, but data can still be written in metadata chunks. That's OK--f_bavail is an underestimate. It's OK for the filesystem to report f_bavail = 0 but not ENOSPC-- people like storing bonus data without losing any "available" space, and we really can't know whether that space is available until after we've committed data to it. Since we can't commit in advance, that space isn't unconditionally available, and shouldn't be in f_bavail. It's not OK to report ENOSPC when f_bavail > 0. People hate failing to store data when they appear to have "free" space. > See you keep giving me these examples where the history of the > filesystem is uniform. It was made a certain way and stayed that > way. But in real life this sort of thing is going to happen and your > patch simply report's a _different_ _wrong_ number. A _friendlier_ > wrong number, I'll grant you that, but still wrong. Who cares if the number is wrong, as long as useful decisions can still be made with it? It doesn't have to be byte-accurate in all possible cases. Existing software and admin practice is OK with underreporting free space, but not overreporting it. All the errors should be biased in that direction.
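The 10.5GiB figure for the three-device raid5 example above falls out of a simple greedy calculation over per-device free space: repeatedly take the widest possible stripe, limited by its smallest member, and give one member's worth of each stripe to parity. A small stand-alone illustration of that arithmetic follows; it is not the allocator btrfs actually uses, and min_devs = 2 matches the 2-way raid5 in the example.

#include <stdio.h>
#include <stdlib.h>

static int cmp_desc(const void *a, const void *b)
{
        double x = *(const double *)a, y = *(const double *)b;
        return (x < y) - (x > y);
}

static double raid5_data_capacity(double *free_gib, int ndev, int min_devs)
{
        double data = 0.0;

        for (;;) {
                qsort(free_gib, ndev, sizeof(*free_gib), cmp_desc);

                /* count devices that still have free space */
                int n = 0;
                while (n < ndev && free_gib[n] > 0.0)
                        n++;
                if (n < min_devs)
                        break;

                /* widest possible stripe, limited by its smallest member */
                double slice = free_gib[n - 1];
                data += (n - 1) * slice;     /* one member goes to parity */
                for (int i = 0; i < n; i++)
                        free_gib[i] -= slice;
        }
        return data;
}

int main(void)
{
        double devs[] = { 0.5, 10.0, 10.0 };   /* sda, sdb, sdc free GiB */

        printf("raid5 data capacity: %.1f GiB\n",
               raid5_data_capacity(devs, 3, 2));
        return 0;
}

With 0.5, 10 and 10 GiB free this prints 10.5 GiB: 1.0 GiB from the 3-way stripes plus 9.5 GiB from the 2-way stripes, matching the estimate in the message.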
On 12/31/2014 08:15 AM, Zygo Blaxell wrote:
> On Wed, Dec 17, 2014 at 08:07:27PM -0800, Robert White wrote:
>> [...]
> There are a number of pathological examples in here, but I think there
> are justifiable correct answers for each of them that emerge from a
> single interpretation of the meanings of f_bavail, f_blocks, and f_bfree.
>
> One gotcha is that some of the numbers required may be difficult to
> calculate precisely before all space is allocated to chunks; however,
> some error is tolerable as long as free space is not overestimated.
> In other words: when in doubt, guess low.
>
> statvfs(2) gives us six numbers, three of which are block counts.
> Very few users or programs ever bother to look at the inode counts
> (f_files, f_ffree, f_favail), but they could be overloaded for metadata
> block counts.
>
> The f_blocks parameter is mostly irrelevant to application behavior,
> except to the extent that the ratio between f_bavail and f_blocks is
> used by applications to calculate a percentage of occupied or free space.
> f_blocks must always be greater than or equal to f_bavail and f_bfree,
> and preferably f_blocks would be scaled to use the same effective unit
> size as f_bavail and f_bfree within a percent or two.
>
> Nobody cares about f_bfree since traditionally only root could use the
> difference between f_bfree and f_bavail. f_bfree is effectively space
> conditionally available (e.g. if the process euid is root or the process
> egid matches a configured group id), while f_bavail is space available
> without conditions (e.g. processes without privilege can use it).
>
> The most important number is f_bavail. It's what a bunch of software
> (archive unpackers, assorted garbage collectors, email MTAs, snapshot
> remover scripts, download managers, etc) uses to estimate how much space
> is available without conditions (except quotas, although arguably those
> should be included too). Applications that are privileged still use
> the unprivileged f_bavail number so their decisions based on free space
> don't disrupt unprivileged applications.
>
> It's generally better to underestimate than to overestimate f_bavail.
> Historically filesystems have reserved extra space to avoid various
> problems in low-disk conditions, and application software has adapted
> to that well over the years. Also, admin people are more pleasantly
> surprised when it turns out that they had more space than f_bavail,
> instead of when they had less.
>
> The rule should be: if we have some space, but it is not available for
> data extents in the current allocation mode, don't add it to f_bavail
> in statvfs. I think this rule handles all of these examples well.
>
> That would mean that we get cases where we add a drive to a full
> filesystem and it doesn't immediately give you any new f_bavail space.
> That may be an unexpected result for a naive admin, but much less
> unexpected than having all the new space show up in f_bavail when it
> is not available for allocation in the current data profile! Better
> to have the surprising behavior earlier than later.
>
> On to examples...
>
>> But a more even case is downright common and likely. Say you run a
>> nice old-fashioned MUTT mail-spool. "most" of your files are small
>> enough to live in metadata. You start with one drive and allocate 2
>> single-data and 10 metadata (5xDup). Then you add a second drive of
>> equal size. (the metadata just switched to DUP-as-RAID1-alike mode)
>> And then you do a dconvert=raid0.
>>
>> That uneven allocation of metadata will be a 2GiB difference between
>> the two drives forever.
>> So do you shave 2GiB off of your @size?
> Yes. f_blocks is the total size of all allocated chunks plus all free
> space allocated by the current data profile.

Agreed. This is how my patch is designed.

> That 2GiB should disappear
> from such a calculation.
>
>> Do you shave @2GiB off your @available?
> Yes, because it's _not_ available until something changes to make it
> available (e.g. balance to get rid of the dup metadata, change the
> metadata profile to dup or single, or change the data profile to single).
>
> The 2GiB could be added to f_bfree, but that might still be confusing
> for people and software.
>
>> Do you overreport your available by @2GiB and end up _still_ having
>> things "available" when you get your ENOSPC?
> No. ENOSPC when f_bavail > 0 is very bad.

Yes, it is very bad.

> Low-available-space admin
> alerts will not be triggered. Automated mitigation software will not be
> activated. Service daemons will start transactions they cannot complete.
>
>> How about this ::
>>
>> /dev/sda == |Sf|Sf|Mf|Mf|Mf|Mf|Sf|Sf|Sp|Mp|Mp| .5GiB free|
>> /dev/sdb == |10 GiB free |
>>
>> Operator fills his drive, then adds a second one, then _foolishly_
>> tries to convert it to RAID0 when the power fails. In order to check
>> the FS he boots with no_balance. Then his maintenance window closes
>> and he has to go back into production, at which point he forgets (or
>> isn't allowed) to do the balance. The flags are set but now no more
>> extents can be allocated.
>>
>> Size is 20GiB, slack is 10.5GiB. Operator is about to get ENOSPACE.

I am not clear about this use case. Is the current profile raid0? If so, @available is 10.5G. If raid1, @available is 0.5G.

> f_bavail should be 0.5GB or so. Operator is now aware that ENOSPC is
> imminent, and can report to whoever grants permission to do things that
> the machine will be continuing to balance outside of the maintenance
> window. This is much better than the alternative, which is that the
> lack of available space is detected by application failure outside of
> a maintenance window.
>
> Even better: if f_bavail is reflective of reality, the operator can
> minimize out-of-window balance time by monitoring f_bavail and pausing
> the balance when there is enough space to operate without ENOSPC until
> the next maintenance window.
>
>> Yes a balance would fix it, but that's not the question.
> The question is "how much space is available at this time?" and the
> correct answer is "almost none," and it stays that way until and unless
> someone runs a balance, adds more drives, deletes a lot of data, etc.
>
> balance changes the way space will be allocated, so it also changes
> the output of df to match.
>
>> In the meantime what does your patch report?
> It should report that there's almost no available space.
> If the patch doesn't report that, the patch needs rework.
>
>> Or...
>>
>> /dev/sda == |Sf|Sf|Mf|Mf|Mf|Mf|Sf|Sf|Sp|Mp|Mp| .5GiB free|
>> /dev/sdb == |10 GiB free |
>> /dev/sdc == |10 GiB free |
>>
>> Does a -dconvert=raid5 and immediately gets ENOSPC for all the
>> blocks. According to the flags we've got 10GiB free...
> 10.5GiB is correct: 1.0GiB in a 3-way RAID5 on sd[abc] (2x 0.5GiB
> data, 1x 0.5GiB parity), and 9.5GiB in a 2-way RAID5 (1x 9.5GiB data,
> 1x 9.5GiB parity) on sd[bc]. The current RAID5 implementation might
> not use space that way, but if that's true, that's arguably a bug in
> the current RAID5 implementation.

The current RAID5 allocator does not work like this: it will allocate 10G (0.5G across sd[ab] + 9.5G across sd[bc]). The calculation in statfs() is the same as the calculation in the current allocator, so it will report 10G available. Yes, it would be better if we made the allocator more clever in these cases, but that is another topic, about the allocator.

> If it was -dconvert=raid6, there would be only 0.5GiB in f_bavail (1x
> 0.5GiB data, 2x 0.5GiB parity) since raid6 requires 3 disks per chunk.
>
>> See you keep giving me these examples where the history of the
>> filesystem is uniform. It was made a certain way and stayed that
>> way. But in real life this sort of thing is going to happen and your
>> patch simply reports a _different_ _wrong_ number. A _friendlier_
>> wrong number, I'll grant you that, but still wrong.
> Who cares if the number is wrong, as long as useful decisions can still be
> made with it? It doesn't have to be byte-accurate in all possible cases.
>
> Existing software and admin practice is OK with underreporting free
> space, but not overreporting it. All the errors should be biased in
> that direction.

Thanx Zygo and Robert, I agree that my patch did not cover the situation where block groups are in different raid levels. I will update my patch soon and send it out.

Thanx for your suggestion.
Yang
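As a cross-check of the numbers in this exchange, the following standalone sketch (an illustration only, not the kernel allocator; the free-space figures are taken from the /dev/sd[abc] example above) computes the "ideal" parity-RAID estimate Zygo describes, striping each allocation across every device that still has free space. It prints 10.5 GiB for raid5 and 0.5 GiB for raid6.

/*
 * Ideal parity-RAID free-space estimate: each pass stripes a chunk,
 * sized by the smallest remaining device, across all devices that
 * still have unallocated space; (n - parity) of those stripes carry
 * data.
 */
#include <stdio.h>

static double usable_gib(double *free_gib, int ndevs, int parity)
{
	double data = 0;

	for (;;) {
		int n = 0;
		double smallest = 0;

		for (int i = 0; i < ndevs; i++) {
			if (free_gib[i] <= 0)
				continue;
			if (n == 0 || free_gib[i] < smallest)
				smallest = free_gib[i];
			n++;
		}
		if (n <= parity)	/* need parity + 1 devices per chunk */
			break;
		data += smallest * (n - parity);
		for (int i = 0; i < ndevs; i++)
			if (free_gib[i] > 0)
				free_gib[i] -= smallest;
	}
	return data;
}

int main(void)
{
	double raid5[] = { 0.5, 10.0, 10.0 };	/* sda, sdb, sdc free space */
	double raid6[] = { 0.5, 10.0, 10.0 };

	printf("raid5: %.1f GiB\n", usable_gib(raid5, 3, 1));
	printf("raid6: %.1f GiB\n", usable_gib(raid6, 3, 2));
	return 0;
}

The 10G figure Yang gives for the current allocator corresponds to allocating the same layout two devices at a time (0.5G on sd[ab], then 9.5G on sd[bc]), which is the gap between the two answers discussed above.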
On 12/27/2014 09:10 AM, Robert White wrote:
> On 12/23/2014 04:31 AM, Dongsheng Yang wrote:
>> On 12/18/2014 12:07 PM, Robert White wrote:
...
>> Thanx
>> Yang
>
> Fine, but you still haven't told me/us what you think df (really
> statfs()) should report in those cases.
>
> Go back to that email. Grab some of those cases. Then tell me what
> your patch will report and why it makes more sense than what is
> reported now.
>
> For code to be correct it _must_ be correct for the "special" cases.
> You don't get to just ignore the special cases.
>
> So what are your specific answers to those cases?

Great to hear that we can stop the endless discussion of FS level vs. device level. Then we can come back to discussing the special cases you provided.

I have given some answers in another mail, and made a change in my patch to cover the case of an unfinished balance.

Thanx
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index a84e00d..9954d60 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -8571,7 +8571,6 @@ static u64 __btrfs_get_ro_block_group_free_space(struct list_head *groups_list)
 {
 	struct btrfs_block_group_cache *block_group;
 	u64 free_bytes = 0;
-	int factor;
 
 	list_for_each_entry(block_group, groups_list, list) {
 		spin_lock(&block_group->lock);
@@ -8581,16 +8580,8 @@ static u64 __btrfs_get_ro_block_group_free_space(struct list_head *groups_list)
 			continue;
 		}
 
-		if (block_group->flags & (BTRFS_BLOCK_GROUP_RAID1 |
-					  BTRFS_BLOCK_GROUP_RAID10 |
-					  BTRFS_BLOCK_GROUP_DUP))
-			factor = 2;
-		else
-			factor = 1;
-
-		free_bytes += (block_group->key.offset -
-			       btrfs_block_group_used(&block_group->item)) *
-			       factor;
+		free_bytes += block_group->key.offset -
+			      btrfs_block_group_used(&block_group->item);
 
 		spin_unlock(&block_group->lock);
 	}
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 54bd91e..40f41a2 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -1641,6 +1641,8 @@ static int btrfs_calc_avail_data_space(struct btrfs_root *root, u64 *free_bytes)
 	u64 used_space;
 	u64 min_stripe_size;
 	int min_stripes = 1, num_stripes = 1;
+	/* How many stripes used to store data, without considering mirrors. */
+	int data_stripes = 1;
 	int i = 0, nr_devices;
 	int ret;
 
@@ -1657,12 +1659,15 @@ static int btrfs_calc_avail_data_space(struct btrfs_root *root, u64 *free_bytes)
 	if (type & BTRFS_BLOCK_GROUP_RAID0) {
 		min_stripes = 2;
 		num_stripes = nr_devices;
+		data_stripes = num_stripes;
 	} else if (type & BTRFS_BLOCK_GROUP_RAID1) {
 		min_stripes = 2;
 		num_stripes = 2;
+		data_stripes = 1;
 	} else if (type & BTRFS_BLOCK_GROUP_RAID10) {
 		min_stripes = 4;
 		num_stripes = 4;
+		data_stripes = 2;
 	}
 
 	if (type & BTRFS_BLOCK_GROUP_DUP)
@@ -1733,14 +1738,17 @@ static int btrfs_calc_avail_data_space(struct btrfs_root *root, u64 *free_bytes)
 	i = nr_devices - 1;
 	avail_space = 0;
 	while (nr_devices >= min_stripes) {
-		if (num_stripes > nr_devices)
+		if (num_stripes > nr_devices) {
 			num_stripes = nr_devices;
+			if (type & BTRFS_BLOCK_GROUP_RAID0)
+				data_stripes = num_stripes;
+		}
 
 		if (devices_info[i].max_avail >= min_stripe_size) {
 			int j;
 			u64 alloc_size;
 
-			avail_space += devices_info[i].max_avail * num_stripes;
+			avail_space += devices_info[i].max_avail * data_stripes;
 			alloc_size = devices_info[i].max_avail;
 			for (j = i + 1 - num_stripes; j <= i; j++)
 				devices_info[j].max_avail -= alloc_size;
@@ -1772,15 +1780,13 @@ static int btrfs_calc_avail_data_space(struct btrfs_root *root, u64 *free_bytes)
 static int btrfs_statfs(struct dentry *dentry, struct kstatfs *buf)
 {
 	struct btrfs_fs_info *fs_info = btrfs_sb(dentry->d_sb);
-	struct btrfs_super_block *disk_super = fs_info->super_copy;
 	struct list_head *head = &fs_info->space_info;
 	struct btrfs_space_info *found;
 	u64 total_used = 0;
+	u64 total_alloc = 0;
 	u64 total_free_data = 0;
 	int bits = dentry->d_sb->s_blocksize_bits;
 	__be32 *fsid = (__be32 *)fs_info->fsid;
-	unsigned factor = 1;
-	struct btrfs_block_rsv *block_rsv = &fs_info->global_block_rsv;
 	int ret;
 
 	/*
@@ -1792,38 +1798,20 @@ static int btrfs_statfs(struct dentry *dentry, struct kstatfs *buf)
 	rcu_read_lock();
 	list_for_each_entry_rcu(found, head, list) {
 		if (found->flags & BTRFS_BLOCK_GROUP_DATA) {
-			int i;
-
-			total_free_data += found->disk_total - found->disk_used;
+			total_free_data += found->total_bytes - found->bytes_used;
 			total_free_data -=
 				btrfs_account_ro_block_groups_free_space(found);
-
-			for (i = 0; i < BTRFS_NR_RAID_TYPES; i++) {
-				if (!list_empty(&found->block_groups[i])) {
-					switch (i) {
-					case BTRFS_RAID_DUP:
-					case BTRFS_RAID_RAID1:
-					case BTRFS_RAID_RAID10:
-						factor = 2;
-					}
-				}
-			}
+			total_used += found->bytes_used;
+		} else {
+			/* For metadata and system, we treat the total_bytes as
+			 * not available to file data. So show it as Used in df.
+			 **/
+			total_used += found->total_bytes;
 		}
-
-		total_used += found->disk_used;
+		total_alloc += found->total_bytes;
 	}
-
 	rcu_read_unlock();
 
-	buf->f_blocks = div_u64(btrfs_super_total_bytes(disk_super), factor);
-	buf->f_blocks >>= bits;
-	buf->f_bfree = buf->f_blocks - (div_u64(total_used, factor) >> bits);
-
-	/* Account global block reserve as used, it's in logical size already */
-	spin_lock(&block_rsv->lock);
-	buf->f_bfree -= block_rsv->size >> bits;
-	spin_unlock(&block_rsv->lock);
-
 	buf->f_bavail = total_free_data;
 	ret = btrfs_calc_avail_data_space(fs_info->tree_root, &total_free_data);
 	if (ret) {
@@ -1831,8 +1819,12 @@ static int btrfs_statfs(struct dentry *dentry, struct kstatfs *buf)
 		mutex_unlock(&fs_info->fs_devices->device_list_mutex);
 		return ret;
 	}
-	buf->f_bavail += div_u64(total_free_data, factor);
+	buf->f_bavail += total_free_data;
 	buf->f_bavail = buf->f_bavail >> bits;
+	buf->f_blocks = total_alloc + total_free_data;
+	buf->f_blocks >>= bits;
+	buf->f_bfree = buf->f_blocks - (total_used >> bits);
+
 
 	mutex_unlock(&fs_info->chunk_mutex);
 	mutex_unlock(&fs_info->fs_devices->device_list_mutex);
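For readers skimming the diff, the key change in btrfs_calc_avail_data_space() is the new data_stripes value. The snippet below is a purely illustrative userspace restatement (not kernel code; the 1 GiB per-stripe figure is made up) of how many of the stripes written by each profile carry distinct user data, which is the multiplier the patched code now applies to max_avail:

/*
 * Illustrative only: the data_stripes mapping introduced by the patch.
 * num_stripes is how many stripes an allocation writes; data_stripes is
 * how many of them hold distinct data and therefore count toward the
 * space a user can store.
 */
#include <stdio.h>

int main(void)
{
	struct {
		const char *profile;
		int num_stripes;
		int data_stripes;
	} map[] = {
		{ "single",         1, 1 },
		{ "raid0 (4 devs)", 4, 4 },	/* num_stripes = nr_devices */
		{ "raid1",          2, 1 },
		{ "raid10",         4, 2 },
	};
	unsigned long long per_stripe = 1ULL << 30;	/* assume 1 GiB free per stripe */

	for (int i = 0; i < 4; i++)
		printf("%-15s writes %d stripes, %llu GiB visible to the user\n",
		       map[i].profile, map[i].num_stripes,
		       (per_stripe * map[i].data_stripes) >> 30);
	return 0;
}

DUP and single leave data_stripes at its default of 1, matching the hunks above.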
When function btrfs_statfs() calculates the total size of the fs, it calculates the total size of the disks and then divides it by a factor. But in some use cases the result is not good for the user.

Example:
 # mkfs.btrfs -f /dev/vdf1 /dev/vdf2 -d raid1
 # mount /dev/vdf1 /mnt
 # dd if=/dev/zero of=/mnt/zero bs=1M count=1000
 # df -h /mnt
 Filesystem      Size  Used Avail Use% Mounted on
 /dev/vdf1       3.0G 1018M  1.3G  45% /mnt
 # btrfs fi show /dev/vdf1
 Label: none  uuid: f85d93dc-81f4-445d-91e5-6a5cd9563294
         Total devices 2 FS bytes used 1001.53MiB
         devid    1 size 2.00GiB used 1.85GiB path /dev/vdf1
         devid    2 size 4.00GiB used 1.83GiB path /dev/vdf2

a. df -h should report Size as 2GiB rather than as 3GiB.
   Because this is a 2-device raid1, the limiting factor is devid 1 @ 2GiB.

b. df -h should report Avail as 0.97GiB or less, rather than as 1.3GiB.
      1.85            (the capacity of the allocated chunk)
     -1.018           (the file stored)
     +(2-1.85=0.15)   (the residual capacity of the disks
                       considering a raid1 fs)
     ---------------
    = 0.97

This patch drops the factor entirely and calculates the size observable to the user, without considering which raid level the data is in or what the exact size on disk is.

After this patch is applied:
 # mkfs.btrfs -f /dev/vdf1 /dev/vdf2 -d raid1
 # mount /dev/vdf1 /mnt
 # dd if=/dev/zero of=/mnt/zero bs=1M count=1000
 # df -h /mnt
 Filesystem      Size  Used Avail Use% Mounted on
 /dev/vdf1       2.0G  1.3G  713M  66% /mnt
 # df /mnt
 Filesystem     1K-blocks    Used Available Use% Mounted on
 /dev/vdf1        2097152 1359424    729536  66% /mnt
 # btrfs fi show /dev/vdf1
 Label: none  uuid: e98c1321-645f-4457-b20d-4f41dc1cf2f4
         Total devices 2 FS bytes used 1001.55MiB
         devid    1 size 2.00GiB used 1.85GiB path /dev/vdf1
         devid    2 size 4.00GiB used 1.83GiB path /dev/vdf2

a). The @Size is 2G, as we expected.
b). @Available is 700M = 1.85G - 1.3G + (2G - 1.85G).
c). @Used is changed to 1.3G rather than 1018M as above, because this patch does not treat the free space in the metadata chunk and the system chunk as available to the user. It's true that the user cannot use this space to store data, so it should not be counted as available. At the same time, this keeps @Used + @Available == @Size for the user as far as possible.

Signed-off-by: Dongsheng Yang <yangds.fnst@cn.fujitsu.com>
---
Changelog:
v1 -> v2:
 a. account the total_bytes of metadata chunks and system chunks as used to the user.
 b. set data_stripes to the correct value in RAID0.

 fs/btrfs/extent-tree.c | 13 ++----------
 fs/btrfs/super.c       | 56 ++++++++++++++++++++++----------------------
 2 files changed, 26 insertions(+), 43 deletions(-)