btrfs: statfs: Don't reset f_bavail if we're over committing metadata space

Message ID 20200115034128.32889-1-wqu@suse.com
State New
Series
  • btrfs: statfs: Don't reset f_bavail if we're over committing metadata space

Commit Message

Qu Wenruo Jan. 15, 2020, 3:41 a.m. UTC
[BUG]
When a lot of metadata space is reserved, e.g. after balancing a data
block group with many extents, vanilla df would report 0 available space.

[CAUSE]
btrfs_statfs() reports 0 available space when its metadata space is
exhausted.
The calculation is based on the currently reserved space vs the on-disk
available space, with a small headroom as a buffer.
When there is not enough headroom, btrfs_statfs() will report 0
available space.

The problem is that, since commit ef1317a1b9a3 ("btrfs: do not allow
reservations if we have pending tickets"), we allow btrfs to over-commit
metadata space as long as we have enough space to allocate new metadata
chunks.

This makes the old calculation unreliable, reporting a false 0 available
space.

[FIX]
Don't do such a naive check in btrfs_statfs() anymore.
Also remove the comment about "0 available space when metadata is
exhausted".

Please note that this is just a quick fix. There are still a lot of
things to be improved.

Fixes: ef1317a1b9a3 ("btrfs: do not allow reservations if we have pending tickets")
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/super.c | 21 ---------------------
 1 file changed, 21 deletions(-)

Comments

Qu Wenruo Jan. 15, 2020, 11:40 a.m. UTC | #1
BTW, this patch is for the v5.5-rc release.

Some GUI programs, e.g. nemo from cinnamon, check the statfs result
before doing any write operation, so this could cause false early
ENOSPC errors.

Thanks,
Qu

On 2020/1/15 11:41 AM, Qu Wenruo wrote:
> [BUG]
> When there are a lot of metadata space reserved, e.g. after balancing a
> data block with many extents, vanilla df would report 0 available space.
> 
> [CAUSE]
> btrfs_statfs() would report 0 available space if its metadata space is
> exhausted.
> And the calculation is based on currently reserved space vs on-disk
> available space, with a small headroom as buffer.
> When there is not enough headroom, btrfs_statfs() will report 0
> available space.
> 
> The problem is, since commit ef1317a1b9a3 ("btrfs: do not allow
> reservations if we have pending tickets"), we allow btrfs to over commit
> metadata space, as long as we have enough space to allocate new metadata
> chunks.
> 
> This makes old calculation unreliable and report false 0 available space.
> 
> [FIX]
> Don't do such naive check anymore for btrfs_statfs().
> Also remove the comment about "0 available space when metadata is
> exhausted".
> 
> Please note that, this is a just a quick fix. There are still a lot of
> things to be improved.
> 
> Fixes: ef1317a1b9a3 ("btrfs: do not allow reservations if we have pending tickets")
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
>  fs/btrfs/super.c | 21 ---------------------
>  1 file changed, 21 deletions(-)
> 
> diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
> index f452a94abdc3..ca1a26b3e884 100644
> --- a/fs/btrfs/super.c
> +++ b/fs/btrfs/super.c
> @@ -2018,8 +2018,6 @@ static inline int btrfs_calc_avail_data_space(struct btrfs_fs_info *fs_info,
>   * algorithm that respects the device sizes and order of allocations.  This is
>   * a close approximation of the actual use but there are other factors that may
>   * change the result (like a new metadata chunk).
> - *
> - * If metadata is exhausted, f_bavail will be 0.
>   */
>  static int btrfs_statfs(struct dentry *dentry, struct kstatfs *buf)
>  {
> @@ -2034,7 +2032,6 @@ static int btrfs_statfs(struct dentry *dentry, struct kstatfs *buf)
>  	unsigned factor = 1;
>  	struct btrfs_block_rsv *block_rsv = &fs_info->global_block_rsv;
>  	int ret;
> -	u64 thresh = 0;
>  	int mixed = 0;
>  
>  	rcu_read_lock();
> @@ -2089,24 +2086,6 @@ static int btrfs_statfs(struct dentry *dentry, struct kstatfs *buf)
>  	buf->f_bavail += div_u64(total_free_data, factor);
>  	buf->f_bavail = buf->f_bavail >> bits;
>  
> -	/*
> -	 * We calculate the remaining metadata space minus global reserve. If
> -	 * this is (supposedly) smaller than zero, there's no space. But this
> -	 * does not hold in practice, the exhausted state happens where's still
> -	 * some positive delta. So we apply some guesswork and compare the
> -	 * delta to a 4M threshold.  (Practically observed delta was ~2M.)
> -	 *
> -	 * We probably cannot calculate the exact threshold value because this
> -	 * depends on the internal reservations requested by various
> -	 * operations, so some operations that consume a few metadata will
> -	 * succeed even if the Avail is zero. But this is better than the other
> -	 * way around.
> -	 */
> -	thresh = SZ_4M;
> -
> -	if (!mixed && total_free_meta - thresh < block_rsv->size)
> -		buf->f_bavail = 0;
> -
>  	buf->f_type = BTRFS_SUPER_MAGIC;
>  	buf->f_bsize = dentry->d_sb->s_blocksize;
>  	buf->f_namelen = BTRFS_NAME_LEN;
>
David Sterba Jan. 16, 2020, 2:29 p.m. UTC | #2
On Wed, Jan 15, 2020 at 11:41:28AM +0800, Qu Wenruo wrote:
> [BUG]
> When there are a lot of metadata space reserved, e.g. after balancing a
> data block with many extents, vanilla df would report 0 available space.
> 
> [CAUSE]
> btrfs_statfs() would report 0 available space if its metadata space is
> exhausted.
> And the calculation is based on currently reserved space vs on-disk
> available space, with a small headroom as buffer.
> When there is not enough headroom, btrfs_statfs() will report 0
> available space.
> 
> The problem is, since commit ef1317a1b9a3 ("btrfs: do not allow
> reservations if we have pending tickets"), we allow btrfs to over commit
> metadata space, as long as we have enough space to allocate new metadata
> chunks.
> 
> This makes old calculation unreliable and report false 0 available space.
> 
> [FIX]
> Don't do such naive check anymore for btrfs_statfs().
> Also remove the comment about "0 available space when metadata is
> exhausted".

This is intentional and was added to prevent a situation where 'df'
reports available space but exhausted metadata doesn't allow creating a
new inode.

If it gets removed you are trading one bug for another. With the changed
logic in the referenced commit, the metadata exhaustion is more likely
but it's also temporary.

The overcommit and overestimated reservations make it hard if not
impossible to do any accurate calculation in statfs/df. From the
usability side, there are 2 options:

a) return 0 free, while it's still possible to eg. create files
b) return >0 free, but no new file can be created

The user report I got was for b), so that's what the guesswork fixes by
doing a). The idea behind that is that there appears to be really low
space, but with the overreservation caused by balance it's not actually
that low.

I don't see a good way out of that which could be solved inside statfs;
it only interprets the numbers in the best way it can under the
circumstances. We don't have the exact reservation, and we don't have a
delta of reserved vs. requested (to check how much the reservation is
off).
Qu Wenruo Jan. 17, 2020, 12:54 a.m. UTC | #3
On 2020/1/16 10:29 PM, David Sterba wrote:
> On Wed, Jan 15, 2020 at 11:41:28AM +0800, Qu Wenruo wrote:
>> [BUG]
>> When there are a lot of metadata space reserved, e.g. after balancing a
>> data block with many extents, vanilla df would report 0 available space.
>>
>> [CAUSE]
>> btrfs_statfs() would report 0 available space if its metadata space is
>> exhausted.
>> And the calculation is based on currently reserved space vs on-disk
>> available space, with a small headroom as buffer.
>> When there is not enough headroom, btrfs_statfs() will report 0
>> available space.
>>
>> The problem is, since commit ef1317a1b9a3 ("btrfs: do not allow
>> reservations if we have pending tickets"), we allow btrfs to over commit
>> metadata space, as long as we have enough space to allocate new metadata
>> chunks.
>>
>> This makes old calculation unreliable and report false 0 available space.
>>
>> [FIX]
>> Don't do such naive check anymore for btrfs_statfs().
>> Also remove the comment about "0 available space when metadata is
>> exhausted".
> 
> This is intentional and was added to prevent a situation where 'df'
> reports available space but exhausted metadata don't allow to create new
> inode.

But this behavior itself is not accurate.

We have the global reservation, which is normally always larger than
the hardcoded 4M number.

So that check will never really be triggered, which invalidates most of
your argument.

Thanks,
Qu

> 
> If it gets removed you are trading one bug for another. With the changed
> logic in the referenced commit, the metadata exhaustion is more likely
> but it's also temporary.
> 
> The overcommit and overestimated reservations make it hard if not
> impossible to do any accurate calculation in statfs/df. From the
> usability side, there are 2 options:
> 
> a) return 0 free, while it's still possible to eg. create files
> b) return >0 free, but no new file can be created
> 
> The user report I got was for b) so that's what the guesswork fixes and
> does a). The idea behind that is that there's really low space, but with
> the overreservation caused by balance it's not.
> 
> I don't see a good way out of that which could be solved inside statfs,
> it only interprets the numbers in the best way under circumstances. We
> don't have exact reservation, don't have a delta of the
> reserved-requested (to check how much the reservation is off).
>
Qu Wenruo Jan. 17, 2020, 1:32 a.m. UTC | #4
On 2020/1/17 8:54 AM, Qu Wenruo wrote:
> 
> 
> On 2020/1/16 10:29 PM, David Sterba wrote:
>> On Wed, Jan 15, 2020 at 11:41:28AM +0800, Qu Wenruo wrote:
>>> [BUG]
>>> When there are a lot of metadata space reserved, e.g. after balancing a
>>> data block with many extents, vanilla df would report 0 available space.
>>>
>>> [CAUSE]
>>> btrfs_statfs() would report 0 available space if its metadata space is
>>> exhausted.
>>> And the calculation is based on currently reserved space vs on-disk
>>> available space, with a small headroom as buffer.
>>> When there is not enough headroom, btrfs_statfs() will report 0
>>> available space.
>>>
>>> The problem is, since commit ef1317a1b9a3 ("btrfs: do not allow
>>> reservations if we have pending tickets"), we allow btrfs to over commit
>>> metadata space, as long as we have enough space to allocate new metadata
>>> chunks.
>>>
>>> This makes old calculation unreliable and report false 0 available space.
>>>
>>> [FIX]
>>> Don't do such naive check anymore for btrfs_statfs().
>>> Also remove the comment about "0 available space when metadata is
>>> exhausted".
>>
>> This is intentional and was added to prevent a situation where 'df'
>> reports available space but exhausted metadata don't allow to create new
>> inode.
> 
> But this behavior itself is not accurate.
> 
> We have global reservation, which is normally always larger than the
> immediate number 4M.
> 
> So that check will never really be triggered.
> 
> Thus invalidating most of your argument.
> 
> Thanks,
> Qu
> 
>>
>> If it gets removed you are trading one bug for another. With the changed
>> logic in the referenced commit, the metadata exhaustion is more likely
>> but it's also temporary.

Furthermore, the point of the patch is that the current check doesn't
play well with metadata over-commit.

If this were before v5.4, I wouldn't touch the check, considering it
will never be hit anyway.

But now for v5.4, either:
- We over-commit metadata
  Meaning we have unallocated space, nothing to worry about

- No more space for over-commit
  But in that case, we still have the global rsv to update essential
  trees.
  Please note that btrfs should never fall into a state where no files
  can be deleted.

Considering all of this, we're no longer able to really hit that case.

So that's why I'm proposing deleting that check. I see no reason why
that magic number 4M would still work nowadays.

Thanks,
Qu

>>
>> The overcommit and overestimated reservations make it hard if not
>> impossible to do any accurate calculation in statfs/df. From the
>> usability side, there are 2 options:
>>
>> a) return 0 free, while it's still possible to eg. create files
>> b) return >0 free, but no new file can be created
>>
>> The user report I got was for b) so that's what the guesswork fixes and
>> does a). The idea behind that is that there's really low space, but with
>> the overreservation caused by balance it's not.
>>
>> I don't see a good way out of that which could be solved inside statfs,
>> it only interprets the numbers in the best way under circumstances. We
>> don't have exact reservation, don't have a delta of the
>> reserved-requested (to check how much the reservation is off).
>>
>
David Sterba Jan. 17, 2020, 2:02 p.m. UTC | #5
On Fri, Jan 17, 2020 at 08:54:35AM +0800, Qu Wenruo wrote:
> 
> 
> On 2020/1/16 下午10:29, David Sterba wrote:
> > On Wed, Jan 15, 2020 at 11:41:28AM +0800, Qu Wenruo wrote:
> >> [BUG]
> >> When there are a lot of metadata space reserved, e.g. after balancing a
> >> data block with many extents, vanilla df would report 0 available space.
> >>
> >> [CAUSE]
> >> btrfs_statfs() would report 0 available space if its metadata space is
> >> exhausted.
> >> And the calculation is based on currently reserved space vs on-disk
> >> available space, with a small headroom as buffer.
> >> When there is not enough headroom, btrfs_statfs() will report 0
> >> available space.
> >>
> >> The problem is, since commit ef1317a1b9a3 ("btrfs: do not allow
> >> reservations if we have pending tickets"), we allow btrfs to over commit
> >> metadata space, as long as we have enough space to allocate new metadata
> >> chunks.
> >>
> >> This makes old calculation unreliable and report false 0 available space.
> >>
> >> [FIX]
> >> Don't do such naive check anymore for btrfs_statfs().
> >> Also remove the comment about "0 available space when metadata is
> >> exhausted".
> > 
> > This is intentional and was added to prevent a situation where 'df'
> > reports available space but exhausted metadata don't allow to create new
> > inode.
> 
> But this behavior itself is not accurate.
> 
> We have global reservation, which is normally always larger than the
> immediate number 4M.

The global block reserve is subtracted from the metadata accounted from
the block groups. And after that, if there's only little space left, the
check triggers. Because at this point any new metadata reservation
cannot be satisfied from the remaining space, yet there's >0 reported.

> So that check will never really be triggered.
> 
> Thus invalidating most of your argument.

Please read the current comment and code in statfs again.
David Sterba Jan. 17, 2020, 2:10 p.m. UTC | #6
On Fri, Jan 17, 2020 at 09:32:49AM +0800, Qu Wenruo wrote:
> On 2020/1/17 8:54 AM, Qu Wenruo wrote:
> > On 2020/1/16 10:29 PM, David Sterba wrote:
> >> On Wed, Jan 15, 2020 at 11:41:28AM +0800, Qu Wenruo wrote:
> >>> [BUG]
> >>> When there are a lot of metadata space reserved, e.g. after balancing a
> >>> data block with many extents, vanilla df would report 0 available space.
> >>>
> >>> [CAUSE]
> >>> btrfs_statfs() would report 0 available space if its metadata space is
> >>> exhausted.
> >>> And the calculation is based on currently reserved space vs on-disk
> >>> available space, with a small headroom as buffer.
> >>> When there is not enough headroom, btrfs_statfs() will report 0
> >>> available space.
> >>>
> >>> The problem is, since commit ef1317a1b9a3 ("btrfs: do not allow
> >>> reservations if we have pending tickets"), we allow btrfs to over commit
> >>> metadata space, as long as we have enough space to allocate new metadata
> >>> chunks.
> >>>
> >>> This makes old calculation unreliable and report false 0 available space.
> >>>
> >>> [FIX]
> >>> Don't do such naive check anymore for btrfs_statfs().
> >>> Also remove the comment about "0 available space when metadata is
> >>> exhausted".
> >>
> >> This is intentional and was added to prevent a situation where 'df'
> >> reports available space but exhausted metadata don't allow to create new
> >> inode.
> > 
> > But this behavior itself is not accurate.
> > 
> > We have global reservation, which is normally always larger than the
> > immediate number 4M.
> > 
> > So that check will never really be triggered.
> > 
> > Thus invalidating most of your argument.
> >>
> >> If it gets removed you are trading one bug for another. With the changed
> >> logic in the referenced commit, the metadata exhaustion is more likely
> >> but it's also temporary.
> 
> Furthermore, the point of the patch is, current check doesn't play well
> with metadata over-commit.

The recent overcommit updates broke statfs in a new way and left us
almost nothing to make it better.

> If it's before v5.4, I won't touch the check considering it will never
> hit anyway.
> 
> But now for v5.4, either:
> - We over-commit metadata
>   Meaning we have unallocated space, nothing to worry

Can we estimate how much unallocated data are there? I don't know how,
and "nothing to worry" always worries me.

> - No more space for over-commit
>   But in that case, we still have global rsv to update essential trees.
>   Please note that, btrfs should never fall into a status where no files
>   can be deleted.

Of course, the global reserve is there for last resort actions and will
be used for deletion and updating essential trees. What statfs says is
how much data is there left for the user. New files, writing more data
etc.

> Consider all these, we're no longer able to really hit that case.
> 
> So that's why I'm purposing deleting that. I see no reason why that
> magic number 4M would still work nowadays.

So, the corner case that resulted in the guesswork needs to be
reevaluated then; the space reservations and related updates clearly
affect that. That's out of the 5.5-rc timeframe though.
Qu Wenruo Jan. 17, 2020, 2:16 p.m. UTC | #7
On 2020/1/17 10:02 PM, David Sterba wrote:
> On Fri, Jan 17, 2020 at 08:54:35AM +0800, Qu Wenruo wrote:
>>
>>
>> On 2020/1/16 10:29 PM, David Sterba wrote:
>>> On Wed, Jan 15, 2020 at 11:41:28AM +0800, Qu Wenruo wrote:
>>>> [BUG]
>>>> When there are a lot of metadata space reserved, e.g. after balancing a
>>>> data block with many extents, vanilla df would report 0 available space.
>>>>
>>>> [CAUSE]
>>>> btrfs_statfs() would report 0 available space if its metadata space is
>>>> exhausted.
>>>> And the calculation is based on currently reserved space vs on-disk
>>>> available space, with a small headroom as buffer.
>>>> When there is not enough headroom, btrfs_statfs() will report 0
>>>> available space.
>>>>
>>>> The problem is, since commit ef1317a1b9a3 ("btrfs: do not allow
>>>> reservations if we have pending tickets"), we allow btrfs to over commit
>>>> metadata space, as long as we have enough space to allocate new metadata
>>>> chunks.
>>>>
>>>> This makes old calculation unreliable and report false 0 available space.
>>>>
>>>> [FIX]
>>>> Don't do such naive check anymore for btrfs_statfs().
>>>> Also remove the comment about "0 available space when metadata is
>>>> exhausted".
>>>
>>> This is intentional and was added to prevent a situation where 'df'
>>> reports available space but exhausted metadata don't allow to create new
>>> inode.
>>
>> But this behavior itself is not accurate.
>>
>> We have global reservation, which is normally always larger than the
>> immediate number 4M.
> 
> The global block reserve is subtracted from the metadata accounted from
> the block groups. And after that, if there's only little space left, the
> check triggers. Because at this point any new metadata reservation
> cannot be satisfied from the remaining space, yet there's >0 reported.

OK, then we need to do the over-commit calculation here, along with the
4M calculation.

The quick solution I can think of would be to go back to Josef's
solution of exporting can_overcommit() to do the calculation.


But my biggest problem is: do we really need to go through all this
hassle?
My argument is that other filesystems like ext4/xfs still have their
inode number limits, and they don't report 0 avail when those get
exhausted.
(Although statfs() does have such a reporting mechanism for them.)

If it's a different resource making us unable to write data, I believe
it should be reported in a different way.

Thanks,
Qu

> 
>> So that check will never really be triggered.
>>
>> Thus invalidating most of your argument.
> 
> Please read the current comment and code in statfs again.
>
Qu Wenruo Jan. 17, 2020, 2:22 p.m. UTC | #8
On 2020/1/17 10:10 PM, David Sterba wrote:
> On Fri, Jan 17, 2020 at 09:32:49AM +0800, Qu Wenruo wrote:
>> On 2020/1/17 8:54 AM, Qu Wenruo wrote:
>>> On 2020/1/16 10:29 PM, David Sterba wrote:
>>>> On Wed, Jan 15, 2020 at 11:41:28AM +0800, Qu Wenruo wrote:
>>>>> [BUG]
>>>>> When there are a lot of metadata space reserved, e.g. after balancing a
>>>>> data block with many extents, vanilla df would report 0 available space.
>>>>>
>>>>> [CAUSE]
>>>>> btrfs_statfs() would report 0 available space if its metadata space is
>>>>> exhausted.
>>>>> And the calculation is based on currently reserved space vs on-disk
>>>>> available space, with a small headroom as buffer.
>>>>> When there is not enough headroom, btrfs_statfs() will report 0
>>>>> available space.
>>>>>
>>>>> The problem is, since commit ef1317a1b9a3 ("btrfs: do not allow
>>>>> reservations if we have pending tickets"), we allow btrfs to over commit
>>>>> metadata space, as long as we have enough space to allocate new metadata
>>>>> chunks.
>>>>>
>>>>> This makes old calculation unreliable and report false 0 available space.
>>>>>
>>>>> [FIX]
>>>>> Don't do such naive check anymore for btrfs_statfs().
>>>>> Also remove the comment about "0 available space when metadata is
>>>>> exhausted".
>>>>
>>>> This is intentional and was added to prevent a situation where 'df'
>>>> reports available space but exhausted metadata don't allow to create new
>>>> inode.
>>>
>>> But this behavior itself is not accurate.
>>>
>>> We have global reservation, which is normally always larger than the
>>> immediate number 4M.
>>>
>>> So that check will never really be triggered.
>>>
>>> Thus invalidating most of your argument.
>>>>
>>>> If it gets removed you are trading one bug for another. With the changed
>>>> logic in the referenced commit, the metadata exhaustion is more likely
>>>> but it's also temporary.
>>
>> Furthermore, the point of the patch is, current check doesn't play well
>> with metadata over-commit.
> 
> The recent overcommit updates broke statfs in a new way and left us
> almost nothing to make it better.

It's not impossible to solve in fact.

Exporting can_overcommit() can do pretty well in this particular case.

> 
>> If it's before v5.4, I won't touch the check considering it will never
>> hit anyway.
>>
>> But now for v5.4, either:
>> - We over-commit metadata
>>   Meaning we have unallocated space, nothing to worry
> 
> Can we estimate how much unallocated data are there? I don't know how,
> and "nothing to worry" always worries me.

Data is never over-committed. We always ensure there are enough data
chunks allocated before we allocate data extents.

> 
>> - No more space for over-commit
>>   But in that case, we still have global rsv to update essential trees.
>>   Please note that, btrfs should never fall into a status where no files
>>   can be deleted.
> 
> Of course, the global reserve is there for last resort actions and will
> be used for deletion and updating essential trees. What statfs says is
> how much data is there left for the user. New files, writing more data
> etc.
> 
>> Consider all these, we're no longer able to really hit that case.
>>
>> So that's why I'm purposing deleting that. I see no reason why that
>> magic number 4M would still work nowadays.
> 
> So, the corner case that resulted in the guesswork needs to be
> reevaluated then, the space reservations and related updates clearly
> affect that. That's out of 5.5-rc timeframe though.

Although we can still solve the problem using only the facilities in
v5.5, I'm still not happy with the idea of "one exhausted resource
resulting in a different resource being reported as exhausted".

I still believe that we should report different things independently.
(Which would obviously make our lives easier in the statfs case.)

That's also why we require reporters to include the 'btrfs fi df'
result rather than just vanilla 'df': because we have different
internals.

Or, can we reuse the f_files/f_ffree facility to report metadata space,
and forget all this mess?

Thanks,
Qu
David Sterba Jan. 29, 2020, 3:38 p.m. UTC | #9
On Fri, Jan 17, 2020 at 10:22:46PM +0800, Qu Wenruo wrote:
> >>>> If it gets removed you are trading one bug for another. With the changed
> >>>> logic in the referenced commit, the metadata exhaustion is more likely
> >>>> but it's also temporary.
> >>
> >> Furthermore, the point of the patch is, current check doesn't play well
> >> with metadata over-commit.
> > 
> > The recent overcommit updates broke statfs in a new way and left us
> > almost nothing to make it better.
> 
> It's not impossible to solve in fact.
> 
> Exporting can_overcommit() can do pretty well in this particular case.

can_overcommit will be exported by the block group read-only fixes,
pending for 5.6, so it might be used for statfs if need be.

> >> If it's before v5.4, I won't touch the check considering it will never
> >> hit anyway.
> >>
> >> But now for v5.4, either:
> >> - We over-commit metadata
> >>   Meaning we have unallocated space, nothing to worry
> > 
> > Can we estimate how much unallocated data are there? I don't know how,
> > and "nothing to worry" always worries me.
> 
> Data never over-commit. We always ensure there are enough data chunk
> allocated before we allocate data extents.
> 
> > 
> >> - No more space for over-commit
> >>   But in that case, we still have global rsv to update essential trees.
> >>   Please note that, btrfs should never fall into a status where no files
> >>   can be deleted.
> > 
> > Of course, the global reserve is there for last resort actions and will
> > be used for deletion and updating essential trees. What statfs says is
> > how much data is there left for the user. New files, writing more data
> > etc.
> > 
> >> Consider all these, we're no longer able to really hit that case.
> >>
> >> So that's why I'm purposing deleting that. I see no reason why that
> >> magic number 4M would still work nowadays.
> > 
> > So, the corner case that resulted in the guesswork needs to be
> > reevaluated then, the space reservations and related updates clearly
> > affect that. That's out of 5.5-rc timeframe though.
> 
> Although we can still solve the problem only using facility in v5.5, I'm
> still not happy enough with the idea of "one exhausted resource would
> result a different resource exhausted"
> 
> I still believe in that we should report different things independently.
> (Which obviously makes our lives easier in statfs case).
> 
> That's also why we require reporters to include 'btrfs fi df' result
> other than vanilla 'df', because we have different internals.
> 
> Or, can we reuse the f_files/f_free facility to report metadata space,
> and forgot all these mess?

Requiring filesystem-specific interpretation of f_files is a mess too.
That statfs, which is a syscall where we can't change anything at the
interface level, is a severe limitation for presenting the space is a
well-known problem, yeah.

The patch is still in the game; I got some feedback on IRC.
Comparing the 2 corner cases, the one I was aiming to fix is harder to
hit than the inflated metadata during balance.
David Sterba Jan. 29, 2020, 4:01 p.m. UTC | #10
On Fri, Jan 17, 2020 at 10:16:45PM +0800, Qu Wenruo wrote:
> >> But this behavior itself is not accurate.
> >>
> >> We have global reservation, which is normally always larger than the
> >> immediate number 4M.
> > 
> > The global block reserve is subtracted from the metadata accounted from
> > the block groups. And after that, if there's only little space left, the
> > check triggers. Because at this point any new metadata reservation
> > cannot be satisfied from the remaining space, yet there's >0 reported.
> 
> OK, then we need to do over-commit calculation here, and do the 4M
> calculation.
> 
> The quick solution I can think of would go back to Josef's solution by
> exporting can_overcommit() to do the calculation.
> 
> 
> But my biggest problem is, do we really need to do all these hassle?

As far as I know we did not ask for it, it's caused by limitations of
the public interfaces (statfs).

> My argument is, other fs like ext4/xfs still has their inode number
> limits, and they don't report 0 avail when  that get exhausted.
> (Although statfs() has such report mechanism for them though).

Maybe that's also the perception of what "space" is for users. The data
and metadata distinction is an implementation detail. So if 'df' tells
me there's space, I should be able to create new files, right? Or write
more data, while still looking at the same free space number.

For ext2 or ext3 it should be easier to see if it's possible to create
new inodes, the values of 'df -i' are likely accurate because of the
mkfs-time preallocation.

Newer features on ext4 and xfs allow dynamic creation of inodes, you can
find guesswork regarding the f_files estimates.

I vaguely remember that it's possible to get ENOSPC on ext4 by
exhausting metadata yet statfs will still tell you there's free space.
This is confusing too.
Josef Bacik Jan. 30, 2020, 9:05 p.m. UTC | #11
On 1/14/20 10:41 PM, Qu Wenruo wrote:
> [BUG]
> When there are a lot of metadata space reserved, e.g. after balancing a
> data block with many extents, vanilla df would report 0 available space.
> 
> [CAUSE]
> btrfs_statfs() would report 0 available space if its metadata space is
> exhausted.
> And the calculation is based on currently reserved space vs on-disk
> available space, with a small headroom as buffer.
> When there is not enough headroom, btrfs_statfs() will report 0
> available space.
> 
> The problem is, since commit ef1317a1b9a3 ("btrfs: do not allow
> reservations if we have pending tickets"), we allow btrfs to over commit
> metadata space, as long as we have enough space to allocate new metadata
> chunks.
> 
> This makes old calculation unreliable and report false 0 available space.
> 
> [FIX]
> Don't do such naive check anymore for btrfs_statfs().
> Also remove the comment about "0 available space when metadata is
> exhausted".
> 
> Please note that, this is a just a quick fix. There are still a lot of
> things to be improved.
> 
> Fixes: ef1317a1b9a3 ("btrfs: do not allow reservations if we have pending tickets")

This isn't the patch that broke it.  The patch that broke it is the patch that 
introduced this code in the first place.

And this isn't the proper fix either, because technically we have 0 available if 
we don't have enough space for our global reserve _and_ we don't have any 
unallocated space.  So for now the best "quick" fix would be to make the 
condition something like

if (!mixed && block_rsv->space_info->full &&
     total_free_meta - thresh < block_rsv->size)

Thanks,

Josef
Anand Jain Jan. 30, 2020, 11:14 p.m. UTC | #12
On 1/31/20 5:05 AM, Josef Bacik wrote:
> On 1/14/20 10:41 PM, Qu Wenruo wrote:
>> [BUG]
>> When there are a lot of metadata space reserved, e.g. after balancing a
>> data block with many extents, vanilla df would report 0 available space.
>>
>> [CAUSE]
>> btrfs_statfs() would report 0 available space if its metadata space is
>> exhausted.
>> And the calculation is based on currently reserved space vs on-disk
>> available space, with a small headroom as buffer.
>> When there is not enough headroom, btrfs_statfs() will report 0
>> available space.
>>
>> The problem is, since commit ef1317a1b9a3 ("btrfs: do not allow
>> reservations if we have pending tickets"), we allow btrfs to over commit
>> metadata space, as long as we have enough space to allocate new metadata
>> chunks.
>>
>> This makes old calculation unreliable and report false 0 available space.
>>
>> [FIX]
>> Don't do such naive check anymore for btrfs_statfs().
>> Also remove the comment about "0 available space when metadata is
>> exhausted".
>>
>> Please note that, this is a just a quick fix. There are still a lot of
>> things to be improved.
>>
>> Fixes: ef1317a1b9a3 ("btrfs: do not allow reservations if we have 
>> pending tickets")
> 
> This isn't the patch that broke it.  The patch that broke it is the 
> patch that introduced this code in the first place.
> 
> And this isn't the proper fix either, because technically we have 0 
> available if we don't have enough space for our global reserve _and_ we 
> don't have any unallocated space.  So for now the best "quick" fix would 
> be to make the condition something like



> if (!mixed && block-rsv->space_info->full &&

  Makes sense. I didn't realize we have space_info::full. I was
  experimenting with creating a btrfs_calc_avail_data_space() equivalent
  (which we use a few lines above in this function) for the metadata type.

  This issue is easy to reproduce as below.

  mkfs.btrfs -fq -n64K -msingle /dev/sdb && mount /dev/sdb /btrfs && df -k /btrfs

  Filesystem     1K-blocks  Used Available Use% Mounted on
  /dev/sdb         3072000 13824         0 100% /btrfs

  So xfstests cases using _get_available_space() with the
  MKFS_OPTS="-m single" option are failing.

  Unfortunately, reverting patch 0096420adb03 (btrfs: do not account
  global reserve in can_overcommit) also fixes this, though I don't yet
  know how or why. Any idea?

  What's happening in the test case above is that %found->disk_total is
  8388608 (for single) while %block_rsv->size is 13631488 (for nodesize
  64K), so the check triggers. %block_rsv->size scales with the
  nodesize, whereas %found->disk_total doesn't, so with the default 16K
  nodesize this problem isn't reproducible.

Thanks, Anand


>      total_free_meta - thresh < block_rsv->size)

> Thanks,
> 
> Josef
Qu Wenruo Jan. 31, 2020, 12:35 a.m. UTC | #13
On 2020/1/31 上午5:05, Josef Bacik wrote:
> On 1/14/20 10:41 PM, Qu Wenruo wrote:
>> [BUG]
>> When there are a lot of metadata space reserved, e.g. after balancing a
>> data block with many extents, vanilla df would report 0 available space.
>>
>> [CAUSE]
>> btrfs_statfs() would report 0 available space if its metadata space is
>> exhausted.
>> And the calculation is based on currently reserved space vs on-disk
>> available space, with a small headroom as buffer.
>> When there is not enough headroom, btrfs_statfs() will report 0
>> available space.
>>
>> The problem is, since commit ef1317a1b9a3 ("btrfs: do not allow
>> reservations if we have pending tickets"), we allow btrfs to over commit
>> metadata space, as long as we have enough space to allocate new metadata
>> chunks.
>>
>> This makes old calculation unreliable and report false 0 available space.
>>
>> [FIX]
>> Don't do such naive check anymore for btrfs_statfs().
>> Also remove the comment about "0 available space when metadata is
>> exhausted".
>>
>> Please note that, this is a just a quick fix. There are still a lot of
>> things to be improved.
>>
>> Fixes: ef1317a1b9a3 ("btrfs: do not allow reservations if we have
>> pending tickets")
> 
> This isn't the patch that broke it.  The patch that broke it is the
> patch that introduced this code in the first place.
> 
> And this isn't the proper fix either, because technically we have 0
> available if we don't have enough space for our global reserve _and_ we
> don't have any unallocated space.  So for now the best "quick" fix would
> be to make the condition something like
> 
> if (!mixed && block-rsv->space_info->full &&
>     total_free_meta - thresh < block_rsv->size)

block_rsv->space_info->full is not reliable afaik.

For metadata we will never really reach full space info.

Anyway, the patch is already discarded, since it doesn't make sense if
it can't reach v5.5.

The proper fix will be in the per-profile availability patchset.

Thanks,
Qu

> 
> Thanks,
> 
> Josef
Zygo Blaxell Jan. 31, 2020, 2:23 a.m. UTC | #14
On Wed, Jan 29, 2020 at 05:01:47PM +0100, David Sterba wrote:
> On Fri, Jan 17, 2020 at 10:16:45PM +0800, Qu Wenruo wrote:
> > >> But this behavior itself is not accurate.
> > >>
> > >> We have global reservation, which is normally always larger than the
> > >> immediate number 4M.
> > > 
> > > The global block reserve is subtracted from the metadata accounted from
> > > the block groups. And after that, if there's only little space left, the
> > > check triggers. Because at this point any new metadata reservation
> > > cannot be satisfied from the remaining space, yet there's >0 reported.
> > 
> > OK, then we need to do over-commit calculation here, and do the 4M
> > calculation.
> > 
> > The quick solution I can think of would go back to Josef's solution by
> > exporting can_overcommit() to do the calculation.
> > 
> > 
> > But my biggest problem is, do we really need to do all these hassle?
> 
> As far as I know we did not ask for it, it's caused by limitations of
> the public interfaces (statfs).

We don't use half of the existing public interface (the f_files side).

Conflating data with metadata isn't helpful:  they need very different
remedial actions when you run out of each.  This is true even on other
filesystems, though other filesystems don't have remedial actions as
extremely different as "deleting files" and "btrfs data balance."

> > My argument is, other fs like ext4/xfs still has their inode number
> > limits, and they don't report 0 avail when  that get exhausted.
> > (Although statfs() has such report mechanism for them though).
> 
> Maybe that's also the perception of what "space" is for users. The data
> and metadata distinction is an implementation detail. So if 'df' tells
> me there's space I should be able to create new files, right? 

Not necessarily.  Someone else might allocate data between statfs check
and the write.  The new file's name may overflow existing directory
blocks and trigger a data block allocation (on filesystems that have
directory data blocks).  The filesystem might be out of inodes, if it's
a filesystem with static inode allocation.  Also recall that 'df' has the
'-i' option, so df does give you two numbers for space, and it has two
opportunities to tell you that you don't have any.

Seeing "72KB data blocks free" is not a guarantee that you can write
exactly 72KB of data.  It's an estimate that you can write _about_
72KB of data.  Some difference between the estimated free space and the
actual number of data blocks that can be written is expected, especially
when the filesystem is nearly full or has multiple agents writing to it
(and then there's compression, snapshots, dedupe, unreachable overwritten
extent blocks...).

We expect certainty when comparing numbers that are very different.
The closer you get to the reported number, the less certain we can be
about whether an ENOSPC will occur before you write that amount of data.
Examples: If you have 72K free in df, you _definitely_ can't install
that package that requires 860MB of space, but you probably can write
just one more 4K block to a log.  Maybe you can install a 60KB package
but not a 68KB one.  Users (and the robots they configure) can usually
cope with derating the numbers a little.

The recent problem in 5.4+ kernels is that this difference is now often
far too large--thousands of gigabytes, because btrfs flips from counting
"all the free data blocks in existing and future data block groups" to
"zero" at arbitrary times.  This is *much worse* than the nearly-full
case, where the correct and reported (zero) numbers are within a few GB
of each other.  The large errors in statfs reporting and large changes
in reported value result in large consequences--automated deletion of
data is the most serious one, where the bigger the error, the more data
is unnecessarily deleted.

The case where a filesystem runs out of unallocated space, has several
thousand GB of unused space in data block groups, and runs out of metadata
space is...tricky, because you correctly hit ENOSPC with thousands of GB
of free space on the drive, but it's not in a form btrfs can use because
it's all in data block groups.  The usual automated responses triggered
on statfs() reports of low data space don't work here.  Data balances are
required to free up space for metadata.  'unlink' or 'btrfs sub delete'
usually don't help in these cases, or help extremely inefficiently
(i.e. the kernel won't free data block groups until they're entirely
empty, so it's not sufficient to delete a few files--you have to delete
almost _all_ of the files).  'unlink' can even make the problem worse
(duplicating shared metadata pages, consuming even more metadata space).
The arbitrary reporting of "0 available data blocks" in df isn't helpful
in this case because the ENOSPC has nothing at all to do with data blocks.

One could say "well if we're out of metadata space and have free data
blocks, the correct thing to do is block writers and start balancing,
so that a free block is a free block."  There are obvious problems
with that, but it would make sense to consider reducing data blocks and
metadata free space to a single number in df only *after* they were made
transparently fungible.

> more data, but still looking at the same number of free space.
> 
> For ext2 or ext3 it should be easier to see if it's possible to create
> new inodes, the values of 'df -i' are likely accurate because of the
> mkfs-time preallocation.

> Newer features on ext4 and xfs allow dynamic creation of inodes, you can
> find guesswork regarding the f_files estimates.
> 
> I vaguely remember that it's possible to get ENOSPC on ext4 by
> exhausting metadata yet statfs will still tell you there's free space.

Yes, if you run out of f_favail on a traditional Unix filesystem, you
can't make any more inodes, but you can still write all the data blocks
to existing files you want (until f_bavail runs out).  Other filesystems
do _not_ set f_bavail to 0 when f_favail happens to be 0 too.

Also, on a traditional Unix filesystem, you can overwrite, truncate
(shrink, maybe not expand) or delete any files you want, no matter
what df says.  On btrfs those things are not all always possible, and
sometimes have a negative effect on free space.

> This is confusing too.

statfs on btrfs will tell you that you have free blocks (f_bavail),
but not free files.  Many users have never heard of f_favail because
some filesystems (including btrfs) don't implement it, and on the others
f_favail changes very slowly and rarely hits zero for typical workloads
when compared to f_bavail.

That said, f_favail is an ideal way to report to users that they are
running out of some scarce resource that isn't data blocks.  Thankfully
btrfs only has two kinds of scarce resource (for now), so we can still
use the statvfs structure.

On btrfs we could overload f_f{avail,free,iles} to count metadata
blocks (metadata_allocated - metadata_used + (all_unallocated /
raid_profile_parameters) - metadata_reserved).  That would provide
a number that proportionally decreases to zero as all options for
allocating new metadata blocks are exhausted.  When ENOSPC occurs,
either f_favail or f_bavail will end up being zero[1], no questionable
hacks required.  Automated low space responders will be able to apply
the correct proportional response (delete files/snapshots or balance
data for low data and metadata space, respectively) by looking at which
value is zero.

TBH I think the code change content of Qu's patch was fine as-is:
it mostly reverts ca8a51b3a9 "btrfs: statfs: report zero available if
metadata are exhausted", and I think that should be reverted regardless
of the other issues, simply and only because it conflates data blocks
with metadata.  It's reporting values in the wrong columns, so confusion
is (and bugs are) inevitable.


[1] If f_{b,f}avail ends up being below 0, make it 0.  Nothing is more
confusing or has less predictable effects than reports of negative free
disk space.
Qu Wenruo Jan. 31, 2020, 11:58 a.m. UTC | #15
On 2020/1/31 上午8:35, Qu Wenruo wrote:
> 
> 
> On 2020/1/31 上午5:05, Josef Bacik wrote:
>> On 1/14/20 10:41 PM, Qu Wenruo wrote:
>>> [BUG]
>>> When there are a lot of metadata space reserved, e.g. after balancing a
>>> data block with many extents, vanilla df would report 0 available space.
>>>
>>> [CAUSE]
>>> btrfs_statfs() would report 0 available space if its metadata space is
>>> exhausted.
>>> And the calculation is based on currently reserved space vs on-disk
>>> available space, with a small headroom as buffer.
>>> When there is not enough headroom, btrfs_statfs() will report 0
>>> available space.
>>>
>>> The problem is, since commit ef1317a1b9a3 ("btrfs: do not allow
>>> reservations if we have pending tickets"), we allow btrfs to over commit
>>> metadata space, as long as we have enough space to allocate new metadata
>>> chunks.
>>>
>>> This makes old calculation unreliable and report false 0 available space.
>>>
>>> [FIX]
>>> Don't do such naive check anymore for btrfs_statfs().
>>> Also remove the comment about "0 available space when metadata is
>>> exhausted".
>>>
>>> Please note that, this is a just a quick fix. There are still a lot of
>>> things to be improved.
>>>
>>> Fixes: ef1317a1b9a3 ("btrfs: do not allow reservations if we have
>>> pending tickets")
>>
>> This isn't the patch that broke it.  The patch that broke it is the
>> patch that introduced this code in the first place.
>>
>> And this isn't the proper fix either, because technically we have 0
>> available if we don't have enough space for our global reserve _and_ we
>> don't have any unallocated space.  So for now the best "quick" fix would
>> be to make the condition something like
>>
>> if (!mixed && block-rsv->space_info->full &&
>>     total_free_meta - thresh < block_rsv->size)
> 
> block-rsv->space_info->full is not reliable afaik.

My bad, I should double check the code before writing bullsh**.

That full is set as long as we can't allocate a new chunk for that profile.

So that fix is pretty good.

I'll update the patch with Josef's condition and credit.

Thanks,
Qu

> 
> For metadata we will never really reach full space info.
> 
> Anyway, the patch is already discarded since it doesn't make sense if
> the patch can't reach v5.5.
> 
> The proper fix in the per-profile will be the fix.
> 
> Thanks,
> Qu
> 
>>
>> Thanks,
>>
>> Josef
>
David Sterba Jan. 31, 2020, 12:34 p.m. UTC | #16
On Thu, Jan 30, 2020 at 04:05:10PM -0500, Josef Bacik wrote:
> On 1/14/20 10:41 PM, Qu Wenruo wrote:
> > [BUG]
> > When there are a lot of metadata space reserved, e.g. after balancing a
> > data block with many extents, vanilla df would report 0 available space.
> > 
> > [CAUSE]
> > btrfs_statfs() would report 0 available space if its metadata space is
> > exhausted.
> > And the calculation is based on currently reserved space vs on-disk
> > available space, with a small headroom as buffer.
> > When there is not enough headroom, btrfs_statfs() will report 0
> > available space.
> > 
> > The problem is, since commit ef1317a1b9a3 ("btrfs: do not allow
> > reservations if we have pending tickets"), we allow btrfs to over commit
> > metadata space, as long as we have enough space to allocate new metadata
> > chunks.
> > 
> > This makes old calculation unreliable and report false 0 available space.
> > 
> > [FIX]
> > Don't do such naive check anymore for btrfs_statfs().
> > Also remove the comment about "0 available space when metadata is
> > exhausted".
> > 
> > Please note that, this is a just a quick fix. There are still a lot of
> > things to be improved.
> > 
> > Fixes: ef1317a1b9a3 ("btrfs: do not allow reservations if we have pending tickets")
> 
> This isn't the patch that broke it.  The patch that broke it is the patch that 
> introduced this code in the first place.
> 
> And this isn't the proper fix either, because technically we have 0 available if 
> we don't have enough space for our global reserve _and_ we don't have any 
> unallocated space.  So for now the best "quick" fix would be to make the 
> condition something like
> 
> if (!mixed && block-rsv->space_info->full &&
>      total_free_meta - thresh < block_rsv->size)

Yes, that seems to be the missing part of my patch that added the above
check. In the testcase there was no remaining metadata space so the
->full == 1 was implied.

Patch
diff mbox series

diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index f452a94abdc3..ca1a26b3e884 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -2018,8 +2018,6 @@  static inline int btrfs_calc_avail_data_space(struct btrfs_fs_info *fs_info,
  * algorithm that respects the device sizes and order of allocations.  This is
  * a close approximation of the actual use but there are other factors that may
  * change the result (like a new metadata chunk).
- *
- * If metadata is exhausted, f_bavail will be 0.
  */
 static int btrfs_statfs(struct dentry *dentry, struct kstatfs *buf)
 {
@@ -2034,7 +2032,6 @@  static int btrfs_statfs(struct dentry *dentry, struct kstatfs *buf)
 	unsigned factor = 1;
 	struct btrfs_block_rsv *block_rsv = &fs_info->global_block_rsv;
 	int ret;
-	u64 thresh = 0;
 	int mixed = 0;
 
 	rcu_read_lock();
@@ -2089,24 +2086,6 @@  static int btrfs_statfs(struct dentry *dentry, struct kstatfs *buf)
 	buf->f_bavail += div_u64(total_free_data, factor);
 	buf->f_bavail = buf->f_bavail >> bits;
 
-	/*
-	 * We calculate the remaining metadata space minus global reserve. If
-	 * this is (supposedly) smaller than zero, there's no space. But this
-	 * does not hold in practice, the exhausted state happens where's still
-	 * some positive delta. So we apply some guesswork and compare the
-	 * delta to a 4M threshold.  (Practically observed delta was ~2M.)
-	 *
-	 * We probably cannot calculate the exact threshold value because this
-	 * depends on the internal reservations requested by various
-	 * operations, so some operations that consume a few metadata will
-	 * succeed even if the Avail is zero. But this is better than the other
-	 * way around.
-	 */
-	thresh = SZ_4M;
-
-	if (!mixed && total_free_meta - thresh < block_rsv->size)
-		buf->f_bavail = 0;
-
 	buf->f_type = BTRFS_SUPER_MAGIC;
 	buf->f_bsize = dentry->d_sb->s_blocksize;
 	buf->f_namelen = BTRFS_NAME_LEN;