generic/015: Change the test filesystem size to 101mb
diff mbox

Message ID 1515401010-26802-1-git-send-email-nborisov@suse.com
State New
Headers show

Commit Message

Nikolay Borisov Jan. 8, 2018, 8:43 a.m. UTC
This test has been failing for btrfs for quite some time,
at least since 4.7. There are 2 implementation details of btrfs that
it exposes:

1. Currently btrfs filesystem under 100mb are created in Mixed block
group mode. Freespace accounting for it is not 100% accurate - I've
observed around 100-200kb discrepancy between a newly created filesystem,
then writing a file and deleting it and checking the free space. This
falls within %3 and not %1 as hardcoded in the test. 

2. BTRFS won't flush it's delayed allocation on file deletion if less
than 32mb are deleted. On such files we need to perform sync (missing
in the test) or wait until time elapses for transaction commit.

Since mixed mode is somewhat deprecated and btrfs is not really intended
to be used on really small devices let's just adjust the test to
create a 101mb fs, which doesn't use mixed mode and really test
freespace accounting.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
---
 tests/generic/015 | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Comments

Qu Wenruo Jan. 8, 2018, 12:54 p.m. UTC | #1
On 2018年01月08日 16:43, Nikolay Borisov wrote:
> This test has been failing for btrfs for quite some time,
> at least since 4.7. There are 2 implementation details of btrfs that
> it exposes:
> 
> 1. Currently btrfs filesystem under 100mb are created in Mixed block
> group mode. Freespace accounting for it is not 100% accurate - I've
> observed around 100-200kb discrepancy between a newly created filesystem,
> then writing a file and deleting it and checking the free space. This
> falls within %3 and not %1 as hardcoded in the test.
> 
> 2. BTRFS won't flush it's delayed allocation on file deletion if less
> than 32mb are deleted. On such files we need to perform sync (missing
> in the test) or wait until time elapses for transaction commit.

I'm a little confused about the 32mb limit.

My personal guess about the reason to delay space freeing would be:
1) Performance
   Btrfs tree operation (at least for write) is slow due to its tree
   design.
   So it may makes sense to delay space freeing.

   But in that case, 32MB may seems to small to really improve the
   performance. (Max file extent size is 128M, delaying one item
   deletion doesn't really improve performance)

2) To avoid later new allocation to rewrite the data.
   It's possible that freed space of deleted inode A get allocated to
   new file extents. And a power loss happens before we commit the
   transaction.

   In that case, if everything else works fine, we should be reverted to
   previous transaction where deleted inode A still exists.
   But we lost its data, as its data is overwritten by other file
   extents. And any read will just cause csum error.

   But in that case, there shouldn't be any 32MB limit, but all deletion
   of orphan inodes should be delayed.

   And further more, this can be addressed using log tree, to log such
   deletion so at recovery time, we just delete that inode.

So I'm wonder if we can improve btrfs deletion behavior.


> 
> Since mixed mode is somewhat deprecated and btrfs is not really intended
> to be used on really small devices let's just adjust the test to
> create a 101mb fs, which doesn't use mixed mode and really test
> freespace accounting.

Despite of some btrfs related questions, I'm wondering if there is any
standard specifying (POSIX?) how a filesystem should behave when
unlinking a file.

Should the space freeing be synchronized? And how should statfs report
available space?

In short, I'm wondering if this test and its expected behavior is
generic enough for all filesystems.

Thanks,
Qu

> 
> Signed-off-by: Nikolay Borisov <nborisov@suse.com>
> ---
>  tests/generic/015 | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/tests/generic/015 b/tests/generic/015
> index 78f2b13..416c4ae 100755
> --- a/tests/generic/015
> +++ b/tests/generic/015
> @@ -53,7 +53,7 @@ _supported_os Linux
>  _require_scratch
>  _require_no_large_scratch_dev
>  
> -_scratch_mkfs_sized `expr 50 \* 1024 \* 1024` >/dev/null 2>&1 \
> +_scratch_mkfs_sized `expr 101 \* 1024 \* 1024` >/dev/null 2>&1 \
>      || _fail "mkfs failed"
>  _scratch_mount || _fail "mount failed"
>  out=$SCRATCH_MNT/fillup.$$
>
Timofey Titovets Jan. 8, 2018, 1:55 p.m. UTC | #2
2018-01-08 15:54 GMT+03:00 Qu Wenruo <quwenruo.btrfs@gmx.com>:
>
>
> On 2018年01月08日 16:43, Nikolay Borisov wrote:
>> This test has been failing for btrfs for quite some time,
>> at least since 4.7. There are 2 implementation details of btrfs that
>> it exposes:
>>
>> 1. Currently btrfs filesystem under 100mb are created in Mixed block
>> group mode. Freespace accounting for it is not 100% accurate - I've
>> observed around 100-200kb discrepancy between a newly created filesystem,
>> then writing a file and deleting it and checking the free space. This
>> falls within %3 and not %1 as hardcoded in the test.
>>
>> 2. BTRFS won't flush it's delayed allocation on file deletion if less
>> than 32mb are deleted. On such files we need to perform sync (missing
>> in the test) or wait until time elapses for transaction commit.
>
> I'm a little confused about the 32mb limit.
>
> My personal guess about the reason to delay space freeing would be:
> 1) Performance
>    Btrfs tree operation (at least for write) is slow due to its tree
>    design.
>    So it may makes sense to delay space freeing.
>
>    But in that case, 32MB may seems to small to really improve the
>    performance. (Max file extent size is 128M, delaying one item
>    deletion doesn't really improve performance)
>
> 2) To avoid later new allocation to rewrite the data.
>    It's possible that freed space of deleted inode A get allocated to
>    new file extents. And a power loss happens before we commit the
>    transaction.
>
>    In that case, if everything else works fine, we should be reverted to
>    previous transaction where deleted inode A still exists.
>    But we lost its data, as its data is overwritten by other file
>    extents. And any read will just cause csum error.
>
>    But in that case, there shouldn't be any 32MB limit, but all deletion
>    of orphan inodes should be delayed.
>
>    And further more, this can be addressed using log tree, to log such
>    deletion so at recovery time, we just delete that inode.
>
> So I'm wonder if we can improve btrfs deletion behavior.
>
>
>>
>> Since mixed mode is somewhat deprecated and btrfs is not really intended
>> to be used on really small devices let's just adjust the test to
>> create a 101mb fs, which doesn't use mixed mode and really test
>> freespace accounting.
>
> Despite of some btrfs related questions, I'm wondering if there is any
> standard specifying (POSIX?) how a filesystem should behave when
> unlinking a file.
>
> Should the space freeing be synchronized? And how should statfs report
> available space?
>
> In short, I'm wondering if this test and its expected behavior is
> generic enough for all filesystems.
>
> Thanks,
> Qu
>
>>
>> Signed-off-by: Nikolay Borisov <nborisov@suse.com>
>> ---
>>  tests/generic/015 | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/tests/generic/015 b/tests/generic/015
>> index 78f2b13..416c4ae 100755
>> --- a/tests/generic/015
>> +++ b/tests/generic/015
>> @@ -53,7 +53,7 @@ _supported_os Linux
>>  _require_scratch
>>  _require_no_large_scratch_dev
>>
>> -_scratch_mkfs_sized `expr 50 \* 1024 \* 1024` >/dev/null 2>&1 \
>> +_scratch_mkfs_sized `expr 101 \* 1024 \* 1024` >/dev/null 2>&1 \
>>      || _fail "mkfs failed"
>>  _scratch_mount || _fail "mount failed"
>>  out=$SCRATCH_MNT/fillup.$$
>>
>

All fs, including btrfs (AFAIK) return unlink(), (if file not open)
only then space has been freed.
So free space after return of unlink() must be freed.
Proofs: [1] [2] [3]
[4] - Posix, looks like do not describe behaviour.

1. http://man7.org/linux/man-pages/man2/unlink.2.html
2. https://stackoverflow.com/questions/31448693/why-system-call-unlink-so-slow
3. https://www.spinics.net/lists/linux-btrfs/msg59901.html
4. https://www.unix.com/man-page/posix/1P/unlink/

Thanks.
Nikolay Borisov Jan. 8, 2018, 2:17 p.m. UTC | #3
On  8.01.2018 14:54, Qu Wenruo wrote:
> 
> 
> On 2018年01月08日 16:43, Nikolay Borisov wrote:
>> This test has been failing for btrfs for quite some time,
>> at least since 4.7. There are 2 implementation details of btrfs that
>> it exposes:
>>
>> 1. Currently btrfs filesystem under 100mb are created in Mixed block
>> group mode. Freespace accounting for it is not 100% accurate - I've
>> observed around 100-200kb discrepancy between a newly created filesystem,
>> then writing a file and deleting it and checking the free space. This
>> falls within %3 and not %1 as hardcoded in the test.
>>
>> 2. BTRFS won't flush it's delayed allocation on file deletion if less
>> than 32mb are deleted. On such files we need to perform sync (missing
>> in the test) or wait until time elapses for transaction commit.
> 
> I'm a little confused about the 32mb limit.
> 
> My personal guess about the reason to delay space freeing would be:
> 1) Performance
>    Btrfs tree operation (at least for write) is slow due to its tree
>    design.
>    So it may makes sense to delay space freeing.
> 
>    But in that case, 32MB may seems to small to really improve the
>    performance. (Max file extent size is 128M, delaying one item
>    deletion doesn't really improve performance)
> 
> 2) To avoid later new allocation to rewrite the data.
>    It's possible that freed space of deleted inode A get allocated to
>    new file extents. And a power loss happens before we commit the
>    transaction.
> 
>    In that case, if everything else works fine, we should be reverted to
>    previous transaction where deleted inode A still exists.
>    But we lost its data, as its data is overwritten by other file
>    extents. And any read will just cause csum error.
> 
>    But in that case, there shouldn't be any 32MB limit, but all deletion
>    of orphan inodes should be delayed.
> 
>    And further more, this can be addressed using log tree, to log such
>    deletion so at recovery time, we just delete that inode.
> 
> So I'm wonder if we can improve btrfs deletion behavior.

So the initial commit I'm talking about is this one:
28ed1345a50491d78e1454ad4005dc5d3557a69e and the function in question
is: btrfs_truncate_inode_items
> 
> 
>>
>> Since mixed mode is somewhat deprecated and btrfs is not really intended
>> to be used on really small devices let's just adjust the test to
>> create a 101mb fs, which doesn't use mixed mode and really test
>> freespace accounting.
> 
> Despite of some btrfs related questions, I'm wondering if there is any
> standard specifying (POSIX?) how a filesystem should behave when
> unlinking a file.
> 
> Should the space freeing be synchronized? And how should statfs report
> available space?

While digging a bit more into this I also found that we alway have a
discrepancy of 1mb when reporting statfs if we have unallocated device
space and if everything is allocated to chunks. The reason being that if
we have unallocated space on the device then btrfs_calc_avail_data_space
subtracts 1mb from the unallocated space of the device. This is
apparently done to mimics the allocator's behavior. However, when
everything is reported based on the status of the space_info struct i.e
btrfs_calc_avail_data_space return 0 in free_bytes then we don't do this
subtraction. So in the end the issue might not really be in the test but
in the way we report freespace from df. And 101mb happens to work
because the 1mb discrepancy falls within %1 error margin on a fs of
101mb size

> 
> In short, I'm wondering if this test and its expected behavior is
> generic enough for all filesystems.
> 
> Thanks,
> Qu
> 
>>
>> Signed-off-by: Nikolay Borisov <nborisov@suse.com>
>> ---
>>  tests/generic/015 | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/tests/generic/015 b/tests/generic/015
>> index 78f2b13..416c4ae 100755
>> --- a/tests/generic/015
>> +++ b/tests/generic/015
>> @@ -53,7 +53,7 @@ _supported_os Linux
>>  _require_scratch
>>  _require_no_large_scratch_dev
>>  
>> -_scratch_mkfs_sized `expr 50 \* 1024 \* 1024` >/dev/null 2>&1 \
>> +_scratch_mkfs_sized `expr 101 \* 1024 \* 1024` >/dev/null 2>&1 \
>>      || _fail "mkfs failed"
>>  _scratch_mount || _fail "mount failed"
>>  out=$SCRATCH_MNT/fillup.$$
>>
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Qu Wenruo Jan. 9, 2018, 12:29 a.m. UTC | #4
On 2018年01月08日 21:55, Timofey Titovets wrote:
> 2018-01-08 15:54 GMT+03:00 Qu Wenruo <quwenruo.btrfs@gmx.com>:
>>
>>
>> On 2018年01月08日 16:43, Nikolay Borisov wrote:
>>> This test has been failing for btrfs for quite some time,
>>> at least since 4.7. There are 2 implementation details of btrfs that
>>> it exposes:
>>>
>>> 1. Currently btrfs filesystem under 100mb are created in Mixed block
>>> group mode. Freespace accounting for it is not 100% accurate - I've
>>> observed around 100-200kb discrepancy between a newly created filesystem,
>>> then writing a file and deleting it and checking the free space. This
>>> falls within %3 and not %1 as hardcoded in the test.
>>>
>>> 2. BTRFS won't flush it's delayed allocation on file deletion if less
>>> than 32mb are deleted. On such files we need to perform sync (missing
>>> in the test) or wait until time elapses for transaction commit.
>>
>> I'm a little confused about the 32mb limit.
>>
>> My personal guess about the reason to delay space freeing would be:
>> 1) Performance
>>    Btrfs tree operation (at least for write) is slow due to its tree
>>    design.
>>    So it may makes sense to delay space freeing.
>>
>>    But in that case, 32MB may seems to small to really improve the
>>    performance. (Max file extent size is 128M, delaying one item
>>    deletion doesn't really improve performance)
>>
>> 2) To avoid later new allocation to rewrite the data.
>>    It's possible that freed space of deleted inode A get allocated to
>>    new file extents. And a power loss happens before we commit the
>>    transaction.
>>
>>    In that case, if everything else works fine, we should be reverted to
>>    previous transaction where deleted inode A still exists.
>>    But we lost its data, as its data is overwritten by other file
>>    extents. And any read will just cause csum error.
>>
>>    But in that case, there shouldn't be any 32MB limit, but all deletion
>>    of orphan inodes should be delayed.
>>
>>    And further more, this can be addressed using log tree, to log such
>>    deletion so at recovery time, we just delete that inode.
>>
>> So I'm wonder if we can improve btrfs deletion behavior.
>>
>>
>>>
>>> Since mixed mode is somewhat deprecated and btrfs is not really intended
>>> to be used on really small devices let's just adjust the test to
>>> create a 101mb fs, which doesn't use mixed mode and really test
>>> freespace accounting.
>>
>> Despite of some btrfs related questions, I'm wondering if there is any
>> standard specifying (POSIX?) how a filesystem should behave when
>> unlinking a file.
>>
>> Should the space freeing be synchronized? And how should statfs report
>> available space?
>>
>> In short, I'm wondering if this test and its expected behavior is
>> generic enough for all filesystems.
>>
>> Thanks,
>> Qu
>>
>>>
>>> Signed-off-by: Nikolay Borisov <nborisov@suse.com>
>>> ---
>>>  tests/generic/015 | 2 +-
>>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>>
>>> diff --git a/tests/generic/015 b/tests/generic/015
>>> index 78f2b13..416c4ae 100755
>>> --- a/tests/generic/015
>>> +++ b/tests/generic/015
>>> @@ -53,7 +53,7 @@ _supported_os Linux
>>>  _require_scratch
>>>  _require_no_large_scratch_dev
>>>
>>> -_scratch_mkfs_sized `expr 50 \* 1024 \* 1024` >/dev/null 2>&1 \
>>> +_scratch_mkfs_sized `expr 101 \* 1024 \* 1024` >/dev/null 2>&1 \
>>>      || _fail "mkfs failed"
>>>  _scratch_mount || _fail "mount failed"
>>>  out=$SCRATCH_MNT/fillup.$$
>>>
>>
> 
> All fs, including btrfs (AFAIK) return unlink(), (if file not open)
> only then space has been freed.

AFAIK there is btrfs orphan inode item for delayed deletion.
So at least not for btrfs.

> So free space after return of unlink() must be freed.
> Proofs: [1] [2] [3]
> [4] - Posix, looks like do not describe behaviour.
> 
> 1. http://man7.org/linux/man-pages/man2/unlink.2.html

"If that name was the last link to a file and no processes have the file
open, the file is deleted and the space it was using is made available
for reuse."

But when? I think that's the point, and no exact explain on this.

Thanks,
Qu

> 2. https://stackoverflow.com/questions/31448693/why-system-call-unlink-so-slow
> 3. https://www.spinics.net/lists/linux-btrfs/msg59901.html
> 4. https://www.unix.com/man-page/posix/1P/unlink/
> 
> Thanks.
>
Nikolay Borisov Jan. 9, 2018, 2:59 p.m. UTC | #5
On  8.01.2018 10:43, Nikolay Borisov wrote:
> This test has been failing for btrfs for quite some time,
> at least since 4.7. There are 2 implementation details of btrfs that
> it exposes:
> 
> 1. Currently btrfs filesystem under 100mb are created in Mixed block
> group mode. Freespace accounting for it is not 100% accurate - I've
> observed around 100-200kb discrepancy between a newly created filesystem,
> then writing a file and deleting it and checking the free space. This
> falls within %3 and not %1 as hardcoded in the test. 
> 
> 2. BTRFS won't flush it's delayed allocation on file deletion if less
> than 32mb are deleted. On such files we need to perform sync (missing
> in the test) or wait until time elapses for transaction commit.
> 
> Since mixed mode is somewhat deprecated and btrfs is not really intended
> to be used on really small devices let's just adjust the test to
> create a 101mb fs, which doesn't use mixed mode and really test
> freespace accounting.
> 
> Signed-off-by: Nikolay Borisov <nborisov@suse.com>

After further experimentation this test is indeed hitting a very
pathological edge case. So when we create a btrfs file system in mixed
mode with default raid levels the system chunk occupies 0-4mb. As such
we can never allocate a chunk which spans this range so the -1mb in
btrfs_calc_avail_data_space is not required in this case. Anyway, I
believe this to be a really minor nuisance not worth complicating the
code to fix, so I'd suggest this test change be merged.

> ---
>  tests/generic/015 | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/tests/generic/015 b/tests/generic/015
> index 78f2b13..416c4ae 100755
> --- a/tests/generic/015
> +++ b/tests/generic/015
> @@ -53,7 +53,7 @@ _supported_os Linux
>  _require_scratch
>  _require_no_large_scratch_dev
>  
> -_scratch_mkfs_sized `expr 50 \* 1024 \* 1024` >/dev/null 2>&1 \
> +_scratch_mkfs_sized `expr 101 \* 1024 \* 1024` >/dev/null 2>&1 \
>      || _fail "mkfs failed"
>  _scratch_mount || _fail "mount failed"
>  out=$SCRATCH_MNT/fillup.$$
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eryu Guan Jan. 10, 2018, 4:05 a.m. UTC | #6
On Mon, Jan 08, 2018 at 10:43:30AM +0200, Nikolay Borisov wrote:
> This test has been failing for btrfs for quite some time,
> at least since 4.7. There are 2 implementation details of btrfs that
> it exposes:
> 
> 1. Currently btrfs filesystem under 100mb are created in Mixed block
> group mode. Freespace accounting for it is not 100% accurate - I've

mkfs.btrfs does this too? Because I noticed _scratch_mkfs_sized adds
'--mixed' mkfs option explicitly.

> observed around 100-200kb discrepancy between a newly created filesystem,
> then writing a file and deleting it and checking the free space. This
> falls within %3 and not %1 as hardcoded in the test. 
> 
> 2. BTRFS won't flush it's delayed allocation on file deletion if less
> than 32mb are deleted. On such files we need to perform sync (missing
> in the test) or wait until time elapses for transaction commit.
> 
> Since mixed mode is somewhat deprecated and btrfs is not really intended
> to be used on really small devices let's just adjust the test to
> create a 101mb fs, which doesn't use mixed mode and really test
> freespace accounting.
> 
> Signed-off-by: Nikolay Borisov <nborisov@suse.com>
> ---
>  tests/generic/015 | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/tests/generic/015 b/tests/generic/015
> index 78f2b13..416c4ae 100755
> --- a/tests/generic/015
> +++ b/tests/generic/015
> @@ -53,7 +53,7 @@ _supported_os Linux
>  _require_scratch
>  _require_no_large_scratch_dev
>  
> -_scratch_mkfs_sized `expr 50 \* 1024 \* 1024` >/dev/null 2>&1 \
> +_scratch_mkfs_sized `expr 101 \* 1024 \* 1024` >/dev/null 2>&1 \

Better to have some comments in the code too, to explain why we choose
101m filesystem size.

Thanks,
Eryu

>      || _fail "mkfs failed"
>  _scratch_mount || _fail "mount failed"
>  out=$SCRATCH_MNT/fillup.$$
> -- 
> 2.7.4
> 
> --
> To unsubscribe from this list: send the line "unsubscribe fstests" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Patch
diff mbox

diff --git a/tests/generic/015 b/tests/generic/015
index 78f2b13..416c4ae 100755
--- a/tests/generic/015
+++ b/tests/generic/015
@@ -53,7 +53,7 @@  _supported_os Linux
 _require_scratch
 _require_no_large_scratch_dev
 
-_scratch_mkfs_sized `expr 50 \* 1024 \* 1024` >/dev/null 2>&1 \
+_scratch_mkfs_sized `expr 101 \* 1024 \* 1024` >/dev/null 2>&1 \
     || _fail "mkfs failed"
 _scratch_mount || _fail "mount failed"
 out=$SCRATCH_MNT/fillup.$$