[2/2] btrfs: provide an estimated number of inodes for statfs

Message ID 20191024154455.19370-3-jthumshirn@suse.de (mailing list archive)
State New, archived
Series Provide an estimation of (free/total) inodes in statfs

Commit Message

Johannes Thumshirn Oct. 24, 2019, 3:44 p.m. UTC
On the BeeGFS Mailing list there is a report claiming BTRFS is not usable
with BeeGFS, as BeeGFS is using statfs output to determine the number of
total and free inodes. BeeGFS needs the number of free inodes as it stores
its meta-data either in extended attributes of the underlying file-system
or directly in an inline inode. According to the BeeGFS Server Tuning
Guide:

"""
BeeGFS metadata is stored as extended attributes (EAs) on the underlying
file system to optimal performance. One metadata file will be created for
each file that a user creates. About extended attributes usage: BeeGFS
Metadata files have a size of 0 bytes (i.e. no normal file contents).

Access to extended attributes is possible with the getfattr tool.

If the inodes of the underlying file system are sufficiently large, EAs
can be inlined into the inode of the underlying file system.  Additional
data blocks are then not required anymore and metadata disk usage will be
reduced.  With EAs inlined into the inode, access latencies are reduced as
seeking to an extra data block is not required anymore.
"""

Provide some estimated numbers of total and free inodes in statfs by
dividing the number of blocks by the size of an inode-item for the total
number of possible inodes and for the number of free inodes divide the
number of free blocks by the size of an inode-item, similar to what other
file-systems without a fixed number of inodes do.

This is of course just an estimation and should not be relied upon.
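
The arithmetic described above can be sketched in plain userspace C. This is only an illustration, not the kernel code: the function name estimate_inodes and the 160-byte inode-item size are assumptions made for this example (the patch itself uses sizeof(struct btrfs_inode_item)).

```c
#include <assert.h>
#include <stdint.h>

/* Assumption for this sketch: on-disk size of struct btrfs_inode_item. */
#define ASSUMED_INODE_ITEM_SIZE 160u

/*
 * Hypothetical helper mirroring the patch: divide a block count by the
 * inode-item size. Pass the total block count for the f_files estimate
 * and the free block count for the f_ffree estimate.
 */
uint64_t estimate_inodes(uint64_t blocks, uint64_t inode_item_size)
{
	assert(inode_item_size > 0);
	return blocks / inode_item_size;
}
```

For a hypothetical filesystem of 262144 blocks, estimate_inodes(262144, ASSUMED_INODE_ITEM_SIZE) yields 1638. Note that the input is a count of blocks, not bytes, so the result is a rough figure rather than a capacity bound.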

Without the patch applied:
rapido1:/# df -hTi /mnt/test
Filesystem     Type     Inodes IUsed IFree IUse% Mounted on
/mnt/test      btrfs         0     0     0     - /mnt/test

With the patch applied on an empty fs:
rapido1:/# df -hTi /mnt/test
Filesystem     Type     Inodes IUsed IFree IUse% Mounted on
/dev/zram0     btrfs      1.6K     0  1.6K    0% /mnt/test

With the patch applied on a dirty fs:
rapido1:/# df -hTi /mnt/test
Filesystem     Type     Inodes IUsed IFree IUse% Mounted on
/dev/zram0     btrfs      1.6K  1.5K   197   88% /mnt/test

Link: https://groups.google.com/forum/#!msg/fhgfs-user/IJqGS5o1UD0/8ftDdUI3AQAJ
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
---
 fs/btrfs/super.c | 2 ++
 1 file changed, 2 insertions(+)

Comments

Qu Wenruo Oct. 25, 2019, 12:56 a.m. UTC | #1
On 2019/10/24 11:44 PM, Johannes Thumshirn wrote:
> On the BeeGFS Mailing list there is a report claiming BTRFS is not usable
> with BeeGFS, as BeeGFS is using statfs output to determine the number of
> total and free inodes. BeeGFS needs the number of free inodes as it stores
> its meta-data either in extended attributes of the underlying file-system
> or directly in an inline inode. According to the BeeGFS Server Tuning
> Guide:
> 
> """
> BeeGFS metadata is stored as extended attributes (EAs) on the underlying
> file system to optimal performance. One metadata file will be created for
> each file that a user creates. About extended attributes usage: BeeGFS
> Metadata files have a size of 0 bytes (i.e. no normal file contents).
> 
> Access to extended attributes is possible with the getfattr tool.
> 
> If the inodes of the underlying file system are sufficiently large, EAs
> can be inlined into the inode of the underlying file system.  Additional
> data blocks are then not required anymore and metadata disk usage will be
> reduced.  With EAs inlined into the inode, access latencies are reduced as
> seeking to an extra data block is not required anymore.
> """

Personally speaking, reporting 0 used and 0 free should be the proper
way. Users of the fs should be aware of dynamic filesystems that don't
use a fixed number of inodes.

I really think it's BeeGFS's job to change their behavior.

There are more things to consider when faking the used/free inodes.

> 
> Provide some estimated numbers of total and free inodes in statfs by
> dividing the number of blocks by the size of an inode-item for the total
> number of possible inodes and for the number of free inodes divide the
> number of free blocks by the size of an inode-item, similar to what other
> file-systems without a fixed number of inodes do.
> 
> This is of course just an estimation and should not be relied upon.
> 
> Without the patch applied:
> rapido1:/# df -hTi /mnt/test
> Filesystem     Type     Inodes IUsed IFree IUse% Mounted on
> /mnt/test      btrfs         0     0     0     - /mnt/test
> 
> With the patch applied on an empty fs:
> rapido1:/# df -hTi /mnt/test
> Filesystem     Type     Inodes IUsed IFree IUse% Mounted on
> /dev/zram0     btrfs      1.6K     0  1.6K    0% /mnt/test
> 
> With the patch applied on a dirty fs:
> rapido1:/# df -hTi /mnt/test
> Filesystem     Type     Inodes IUsed IFree IUse% Mounted on
> /dev/zram0     btrfs      1.6K  1.5K   197   88% /mnt/test
> 
> Link: https://groups.google.com/forum/#!msg/fhgfs-user/IJqGS5o1UD0/8ftDdUI3AQAJ
> Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
> ---
>  fs/btrfs/super.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
> index b818f764c1c9..6f6f6a70eb1e 100644
> --- a/fs/btrfs/super.c
> +++ b/fs/btrfs/super.c
> @@ -2068,6 +2068,8 @@ static int btrfs_statfs(struct dentry *dentry, struct kstatfs *buf)
>  	buf->f_blocks = div_u64(btrfs_super_total_bytes(disk_super), factor);
>  	buf->f_blocks >>= bits;
>  	buf->f_bfree = buf->f_blocks - (div_u64(total_used, factor) >> bits);
> +	buf->f_files = div_u64(buf->f_blocks, sizeof(struct btrfs_inode_item));

That's too optimistic. (I'd call it even beyond Elon Musk's schedule)

We have tree block header overhead, and as the number of tree blocks
grows, the extent tree grows too and brings its own overhead.

In the long run, users will report that the reported inode usage grows
faster than what they actually used.
Calculating such an estimate would be hell, and we will never reach
a good enough point for that.
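
To make the header overhead concrete, here is a rough userspace sketch of how many inode items fit into a single leaf. All sizes are assumptions for this example (101-byte tree block header, 25-byte item header, 160-byte inode item), and the helper name is hypothetical. It still ignores inode refs, other item types, and extent tree growth, so real capacity is lower.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed on-disk sizes in bytes; labeled assumptions, not taken from the patch. */
#define ASSUMED_HEADER_SIZE     101u /* struct btrfs_header */
#define ASSUMED_ITEM_SIZE        25u /* struct btrfs_item, one per leaf entry */
#define ASSUMED_INODE_ITEM_SIZE 160u /* struct btrfs_inode_item */

/*
 * Hypothetical helper: upper bound on inode items in one leaf of the
 * given nodesize, after subtracting the tree block header and charging
 * each entry its item header.
 */
uint64_t inode_items_per_leaf(uint64_t nodesize)
{
	assert(nodesize > ASSUMED_HEADER_SIZE);
	return (nodesize - ASSUMED_HEADER_SIZE) /
	       (ASSUMED_ITEM_SIZE + ASSUMED_INODE_ITEM_SIZE);
}
```

With a 16384-byte nodesize this gives 88 items per leaf, while the naive 16384 / 160 gives 102, so even this best case is noticeably below the simple division.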

> +	buf->f_ffree = div_u64(buf->f_bfree, sizeof(struct btrfs_inode_item));

The same applies to ffree: it will decrease faster than real usage.

Whatever distributed fs is using ffree/files as an indicator,
it's not reliable anyway. And if they accept such an unreliable indicator,
they'd better think twice before using it.

Thanks,
Qu

>  
>  	/* Account global block reserve as used, it's in logical size already */
>  	spin_lock(&block_rsv->lock);
>
Johannes Thumshirn Oct. 25, 2019, 8:55 a.m. UTC | #2
On 25/10/2019 02:56, Qu Wenruo wrote:
[...]
> Personally speaking, reporting 0 used and 0 free should be the proper
> way. Users of the fs should be aware of dynamic filesystems that don't
> use a fixed number of inodes.
> 
> I really think it's BeeGFS's job to change their behavior.
> 
> There are more things to consider when faking the used/free inodes.

I'm with you on this. It is something BeeGFS has to fix, but judging
from what other file-systems do, some do have a real fixed number of
inodes, some assign 0 or -1, some do not touch the variable at all and
some (e.g. xfs) fake a number. My role model was xfs here.
David Sterba Oct. 25, 2019, 10:05 a.m. UTC | #3
On Thu, Oct 24, 2019 at 05:44:55PM +0200, Johannes Thumshirn wrote:
> On the BeeGFS Mailing list there is a report claiming BTRFS is not usable
> with BeeGFS, as BeeGFS is using statfs output to determine the number of
> total and free inodes. BeeGFS needs the number of free inodes as it stores
> its meta-data either in extended attributes of the underlying file-system
> or directly in an inline inode. According to the BeeGFS Server Tuning
> Guide:
> 
> """
> BeeGFS metadata is stored as extended attributes (EAs) on the underlying
> file system to optimal performance. One metadata file will be created for
> each file that a user creates. About extended attributes usage: BeeGFS
> Metadata files have a size of 0 bytes (i.e. no normal file contents).

That's not really a typical use of files, and the 'optimal performance'
claim would need some clarification.

> Access to extended attributes is possible with the getfattr tool.
> 
> If the inodes of the underlying file system are sufficiently large, EAs
> can be inlined into the inode of the underlying file system.  Additional
> data blocks are then not required anymore and metadata disk usage will be
> reduced.  With EAs inlined into the inode, access latencies are reduced as
> seeking to an extra data block is not required anymore.

So this describes how it's implemented in EXT4, and BeeGFS is
probably tuned to work 'optimally' there.

> """
> 
> Provide some estimated numbers of total and free inodes in statfs by
> dividing the number of blocks by the size of an inode-item for the total
> number of possible inodes and for the number of free inodes divide the
> number of free blocks by the size of an inode-item, similar to what other
> file-systems without a fixed number of inodes do.
> 
> This is of course just an estimation and should not be relied upon.

This is the most problematic part. The inode counts cannot be calculated
exactly on btrfs, because of the dynamic nature of the space usage. We
can only give rough estimates "how the rest of unallocated space would
be used if [assumptions]". We have this problem with explaining 'df'
values and now somebody is asking for the same with 'df -i'.

The Inode/IFree numbers are intentionally zero, to keep monitoring tools
from being confused into reporting low inode counts. Though I can't find a
documented and standardized interpretation of the numbers, the manual page
of statfs only says

	fsfilcnt_t f_files;   /* Total file nodes in filesystem */
	fsfilcnt_t f_ffree;   /* Free file nodes in filesystem */

for the respective fields. And nothing else.

For traditional filesystems, and EXT2/3/4 in particular, the inodes are
preallocated at creation time so calculating the numbers is easy.

I believe XFS does that too without the option inode64, so users are
used to seeing non-zero values and nowadays they have to be faked. That
makes sense from a backward compatibility POV. But still the numbers are
made up and can change unexpectedly.

Btrfs has reported 0/0/0 since the beginning to not confuse monitoring
tools, yet this is exactly what can be seen at

https://groups.google.com/forum/#!msg/fhgfs-user/IJqGS5o1UD0/8ftDdUI3AQAJ

I'd say fix your monitoring tool not to interpret 0% free inodes in case
there's also 0 in total. This is not even a btrfs-specific fix, IMHO this
is interpreting the numbers in the wrong way.

> Without the patch applied:
> rapido1:/# df -hTi /mnt/test
> Filesystem     Type     Inodes IUsed IFree IUse% Mounted on
> /mnt/test      btrfs         0     0     0     - /mnt/test
> 
> With the patch applied on an empty fs:
> rapido1:/# df -hTi /mnt/test
> Filesystem     Type     Inodes IUsed IFree IUse% Mounted on
> /dev/zram0     btrfs      1.6K     0  1.6K    0% /mnt/test
> 
> With the patch applied on a dirty fs:
> rapido1:/# df -hTi /mnt/test
> Filesystem     Type     Inodes IUsed IFree IUse% Mounted on
> /dev/zram0     btrfs      1.6K  1.5K   197   88% /mnt/test

At the moment I object to conjuring up numbers like that. It's
perhaps going to silence some tools but would cause lots of questions
because the numbers otherwise don't reflect reality, not even close.

We try hard to make the regular Allocated/Free space numbers match
users' expectations, but it's not perfect and can't be made much better.
And I'm glad we have a simple answer to the inode counts.

Should the discussion continue, it would be good to have interested
people from BeeGFS on CC.
Patch

diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index b818f764c1c9..6f6f6a70eb1e 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -2068,6 +2068,8 @@  static int btrfs_statfs(struct dentry *dentry, struct kstatfs *buf)
 	buf->f_blocks = div_u64(btrfs_super_total_bytes(disk_super), factor);
 	buf->f_blocks >>= bits;
 	buf->f_bfree = buf->f_blocks - (div_u64(total_used, factor) >> bits);
+	buf->f_files = div_u64(buf->f_blocks, sizeof(struct btrfs_inode_item));
+	buf->f_ffree = div_u64(buf->f_bfree, sizeof(struct btrfs_inode_item));
 
 	/* Account global block reserve as used, it's in logical size already */
 	spin_lock(&block_rsv->lock);