diff mbox

[RFC] btrfs: Fix slight df misreporting when listing mixed block groups fs

Message ID 1515513845-3749-1-git-send-email-nborisov@suse.com (mailing list archive)
State New, archived
Headers show

Commit Message

Nikolay Borisov Jan. 9, 2018, 4:04 p.m. UTC
Currently when df (resp. statfs) is executed on a btrfs instance the
free space is calculated as the freespace of the block groups +
any unallocated space on the devices, constituting this filesystem.
There is a catch, however, in that the unallocated space from the
devices has 1 mb subtracted from each device to closely follow btrfs'
allocator behavior. This is all good except that when we have a file
system and the system chunk's allocation profile is single (as is the
default in mixed mode) it occupies the range 0..4mb and thus implicitly
prevents allocation starting in the range 0..1mb. In those cases
additionally subtracting 1mb during statfs makes us misreport the
actual free space and this causes generic/015 to fail.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
---
Hello,                                                                          
                                                                                
So with this patch and adding sync to generic/015 (which is a different issue)  
I can make generic/015 pass. In any case I think this behavior is correct       
though frankly I feel it's not really worth it, in the worst case we are        
underreporting by at most 1mb per device.

 fs/btrfs/super.c | 25 +++++++++++++------------
 1 file changed, 13 insertions(+), 12 deletions(-)

Comments

Josef Bacik Jan. 9, 2018, 4:09 p.m. UTC | #1
On Tue, Jan 09, 2018 at 06:04:05PM +0200, Nikolay Borisov wrote:
> Currently when df (resp. statfs) is executed on a btrfs instance the
> free space is calculated as the freespace of the block groups +
> any unallocated space on the devices, constituting this filesystem.
> There is a catch, however, in that the unallocated space from the
> devices has 1 mb subtracted from each device to closely follow btrfs'
> allocator behavior. This is all good except that when we have a file
> system and the system chunk's allocation profile is single (as is the
> default in mixed mode) it occupies the range 0..4mb and thus implicitly
> prevents allocation starting in the range 0..1mb. In those cases
> additionally subtracting 1mb during statfs makes us misreport the
> actual free space and this causes generic/015 to fail.
> 
> Signed-off-by: Nikolay Borisov <nborisov@suse.com>

This is fine by me,

Reviewed-by: Josef Bacik <jbacik@fb.com>

Thanks,

Josef
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Qu Wenruo Jan. 10, 2018, 1:30 a.m. UTC | #2
On 2018年01月10日 00:04, Nikolay Borisov wrote:
> Currently when df (resp. statfs) is executed on a btrfs instance the
> free space is calculated as the freespace of the block groups +
> any unallocated space on the devices, constituting this filesystem.
> There is a catch, however, in that the unallocated space from the
> devices has 1 mb subtracted from each device to closely follow btrfs'
> allocator behavior. This is all good except that when we have a file
> system and the system chunk's allocation profile is single (as is the
> default in mixed mode) it occupies the range 0..4mb and thus implicitly
> prevents allocation starting in the range 0..1mb. In those cases
> additionally subtracting 1mb during statfs makes us misreport the
> actual free space and this causes generic/015 to fail.
> 
> Signed-off-by: Nikolay Borisov <nborisov@suse.com>
> ---
> Hello,                                                                          
>                                                                                 
> So with this patch and adding sync to generic/015 (which is a different issue)  
> I can make generic/015 pass. In any case I think this behavior is correct       
> though frankly I feel it's not really worth it, in the worst case we are        
> underreporting by at most 1mb per device.
> 
>  fs/btrfs/super.c | 25 +++++++++++++------------
>  1 file changed, 13 insertions(+), 12 deletions(-)
> 
> diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
> index f33a55272deb..a4dfc9701220 100644
> --- a/fs/btrfs/super.c
> +++ b/fs/btrfs/super.c
> @@ -1925,12 +1925,13 @@ static int btrfs_calc_avail_data_space(struct btrfs_fs_info *fs_info,
>  	struct btrfs_device_info *devices_info;
>  	struct btrfs_fs_devices *fs_devices = fs_info->fs_devices;
>  	struct btrfs_device *device;
> -	u64 skip_space;
>  	u64 type;
>  	u64 avail_space;
>  	u64 min_stripe_size;
>  	int min_stripes = 1, num_stripes = 1;
>  	int i = 0, nr_devices;
> +	bool sys_chunk_single = !(btrfs_system_alloc_profile(fs_info) &
> +				  ~BTRFS_BLOCK_GROUP_SYSTEM);
>  
>  	/*
>  	 * We aren't under the device list lock, so this is racy-ish, but good
> @@ -1988,18 +1989,18 @@ static int btrfs_calc_avail_data_space(struct btrfs_fs_info *fs_info,
>  		/*
>  		 * In order to avoid overwriting the superblock on the drive,
>  		 * btrfs starts at an offset of at least 1MB when doing chunk
> -		 * allocation.
> +		 * allocation. There is one notable exception, and that is
> +		 * when the System chunk has allocation profile of single. In
> +		 * this case it occupies 0..4m  and we don't really need to
> +		 * subtract 1mb as it's implied

This is in fact a bug in mkfs.btrfs.

The problem is, current chunk allocator in mkfs.btrfs doesn't really
avoid the first 1M.

So temporary (profile SINGLE) chunks can still using that reserved 1M,
while normally such chunks get removed so reserved 1M will not be used,
but if we specified -m single, such chunks will just be used as is.

And such behavior makes unallocated space calculation pretty nasty.

>  		 */
> -		skip_space = SZ_1M;
> -
> -		/*
> -		 * we can use the free space in [0, skip_space - 1], subtract
> -		 * it from the total.
> -		 */
> -		if (avail_space && avail_space >= skip_space)
> -			avail_space -= skip_space;
> -		else
> -			avail_space = 0;
> +		if (device->devid != 1 ||
> +		    (!sys_chunk_single && device->devid == 1)) {
> +			if (avail_space && avail_space >= SZ_1M)
> +				avail_space -= SZ_1M;
> +			else
> +				avail_space = 0;
> +		}

Due to above reason, using system chunk profile and devid to check if we
need to reduce 1M is not reliable.

That system chunk can be relocated by balance, and in that case new
chunk may start beyond 1M.

So the most reliable method would be manually checking the the first
device extent of this device, and to determine if we should minus that 1M.

If the first device extent starts beyond 1M, reducing 1M would be good.
Or don't modify it.

Thanks,
Qu

>  
>  		if (avail_space < min_stripe_size)
>  			continue;
>
Qu Wenruo Jan. 10, 2018, 1:49 a.m. UTC | #3
On 2018年01月10日 09:30, Qu Wenruo wrote:
> 
> 
> On 2018年01月10日 00:04, Nikolay Borisov wrote:
>> Currently when df (resp. statfs) is executed on a btrfs instance the
>> free space is calculated as the freespace of the block groups +
>> any unallocated space on the devices, constituting this filesystem.
>> There is a catch, however, in that the unallocated space from the
>> devices has 1 mb subtracted from each device to closely follow btrfs'
>> allocator behavior. This is all good except that when we have a file
>> system and the system chunk's allocation profile is single (as is the
>> default in mixed mode) it occupies the range 0..4mb and thus implicitly
>> prevents allocation starting in the range 0..1mb. In those cases
>> additionally subtracting 1mb during statfs makes us misreport the
>> actual free space and this causes generic/015 to fail.
>>
>> Signed-off-by: Nikolay Borisov <nborisov@suse.com>
>> ---
>> Hello,                                                                          
>>                                                                                 
>> So with this patch and adding sync to generic/015 (which is a different issue)  
>> I can make generic/015 pass. In any case I think this behavior is correct       
>> though frankly I feel it's not really worth it, in the worst case we are        
>> underreporting by at most 1mb per device.
>>
>>  fs/btrfs/super.c | 25 +++++++++++++------------
>>  1 file changed, 13 insertions(+), 12 deletions(-)
>>
>> diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
>> index f33a55272deb..a4dfc9701220 100644
>> --- a/fs/btrfs/super.c
>> +++ b/fs/btrfs/super.c
>> @@ -1925,12 +1925,13 @@ static int btrfs_calc_avail_data_space(struct btrfs_fs_info *fs_info,
>>  	struct btrfs_device_info *devices_info;
>>  	struct btrfs_fs_devices *fs_devices = fs_info->fs_devices;
>>  	struct btrfs_device *device;
>> -	u64 skip_space;
>>  	u64 type;
>>  	u64 avail_space;
>>  	u64 min_stripe_size;
>>  	int min_stripes = 1, num_stripes = 1;
>>  	int i = 0, nr_devices;
>> +	bool sys_chunk_single = !(btrfs_system_alloc_profile(fs_info) &
>> +				  ~BTRFS_BLOCK_GROUP_SYSTEM);
>>  
>>  	/*
>>  	 * We aren't under the device list lock, so this is racy-ish, but good
>> @@ -1988,18 +1989,18 @@ static int btrfs_calc_avail_data_space(struct btrfs_fs_info *fs_info,
>>  		/*
>>  		 * In order to avoid overwriting the superblock on the drive,
>>  		 * btrfs starts at an offset of at least 1MB when doing chunk
>> -		 * allocation.
>> +		 * allocation. There is one notable exception, and that is
>> +		 * when the System chunk has allocation profile of single. In
>> +		 * this case it occupies 0..4m  and we don't really need to
>> +		 * subtract 1mb as it's implied
> 
> This is in fact a bug in mkfs.btrfs.
> 
> The problem is, current chunk allocator in mkfs.btrfs doesn't really
> avoid the first 1M.

Just a side note, it's the custom allocator (Yep, custom allocator
again) ignoring the 1M limit.
While the generic btrfs_alloc_chunk() is completely fine.

The fix to mkfs.btrfs can arrive very soon.

This also reminds me about the long forgot project to get rid of such
SINGLE temporary chunks (not allocating them at all) and allocate them
in native profile at the very beginning of mkfs.btrfs.

Thanks,
Qu

> 
> So temporary (profile SINGLE) chunks can still using that reserved 1M,
> while normally such chunks get removed so reserved 1M will not be used,
> but if we specified -m single, such chunks will just be used as is.
> 
> And such behavior makes unallocated space calculation pretty nasty.
> 
>>  		 */
>> -		skip_space = SZ_1M;
>> -
>> -		/*
>> -		 * we can use the free space in [0, skip_space - 1], subtract
>> -		 * it from the total.
>> -		 */
>> -		if (avail_space && avail_space >= skip_space)
>> -			avail_space -= skip_space;
>> -		else
>> -			avail_space = 0;
>> +		if (device->devid != 1 ||
>> +		    (!sys_chunk_single && device->devid == 1)) {
>> +			if (avail_space && avail_space >= SZ_1M)
>> +				avail_space -= SZ_1M;
>> +			else
>> +				avail_space = 0;
>> +		}
> 
> Due to above reason, using system chunk profile and devid to check if we
> need to reduce 1M is not reliable.
> 
> That system chunk can be relocated by balance, and in that case new
> chunk may start beyond 1M.
> 
> So the most reliable method would be manually checking the the first
> device extent of this device, and to determine if we should minus that 1M.
> 
> If the first device extent starts beyond 1M, reducing 1M would be good.
> Or don't modify it.
> 
> Thanks,
> Qu
> 
>>  
>>  		if (avail_space < min_stripe_size)
>>  			continue;
>>
>
Nikolay Borisov Jan. 10, 2018, 3:33 p.m. UTC | #4
On 10.01.2018 03:30, Qu Wenruo wrote:
> That system chunk can be relocated by balance, and in that case new
> chunk may start beyond 1M.
> 
> So the most reliable method would be manually checking the the first
> device extent of this device, and to determine if we should minus that 1M.
> 
> If the first device extent starts beyond 1M, reducing 1M would be good.
> Or don't modify it.

So assuming I run generic/015 on a system with your mkfs patches to put
system chunk after 1mb and I implement a logic in
btrfs_calc_avail_data_space to subtract 1mb if the first extent of a
device starts at 1mb. In that case the test is still going to fail i.e
have the discrepancy of 1mb due to 1mb being subtracted first due to
there being unallocated space on the device, but then on the second df
btrfs_calc_avail_data_space will return 0 so df is going to report space
only from the block groups.


Thinking a bit more I think with your mkfs.btrfs patch there is no need
to check where the device extent starts, since now we are guaranteed to
always have a hole of at least 1mb at the beginning.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Sterba Jan. 10, 2018, 4:20 p.m. UTC | #5
On Wed, Jan 10, 2018 at 09:30:06AM +0800, Qu Wenruo wrote:
> 
> 
> On 2018年01月10日 00:04, Nikolay Borisov wrote:
> > Currently when df (resp. statfs) is executed on a btrfs instance the
> > free space is calculated as the freespace of the block groups +
> > any unallocated space on the devices, constituting this filesystem.
> > There is a catch, however, in that the unallocated space from the
> > devices has 1 mb subtracted from each device to closely follow btrfs'
> > allocator behavior. This is all good except that when we have a file
> > system and the system chunk's allocation profile is single (as is the
> > default in mixed mode) it occupies the range 0..4mb and thus implicitly
> > prevents allocation starting in the range 0..1mb. In those cases
> > additionally subtracting 1mb during statfs makes us misreport the
> > actual free space and this causes generic/015 to fail.
> > 
> > Signed-off-by: Nikolay Borisov <nborisov@suse.com>
> > ---
> > Hello,                                                                          
> >                                                                                 
> > So with this patch and adding sync to generic/015 (which is a different issue)  
> > I can make generic/015 pass. In any case I think this behavior is correct       
> > though frankly I feel it's not really worth it, in the worst case we are        
> > underreporting by at most 1mb per device.
> > 
> >  fs/btrfs/super.c | 25 +++++++++++++------------
> >  1 file changed, 13 insertions(+), 12 deletions(-)
> > 
> > diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
> > index f33a55272deb..a4dfc9701220 100644
> > --- a/fs/btrfs/super.c
> > +++ b/fs/btrfs/super.c
> > @@ -1925,12 +1925,13 @@ static int btrfs_calc_avail_data_space(struct btrfs_fs_info *fs_info,
> >  	struct btrfs_device_info *devices_info;
> >  	struct btrfs_fs_devices *fs_devices = fs_info->fs_devices;
> >  	struct btrfs_device *device;
> > -	u64 skip_space;
> >  	u64 type;
> >  	u64 avail_space;
> >  	u64 min_stripe_size;
> >  	int min_stripes = 1, num_stripes = 1;
> >  	int i = 0, nr_devices;
> > +	bool sys_chunk_single = !(btrfs_system_alloc_profile(fs_info) &
> > +				  ~BTRFS_BLOCK_GROUP_SYSTEM);
> >  
> >  	/*
> >  	 * We aren't under the device list lock, so this is racy-ish, but good
> > @@ -1988,18 +1989,18 @@ static int btrfs_calc_avail_data_space(struct btrfs_fs_info *fs_info,
> >  		/*
> >  		 * In order to avoid overwriting the superblock on the drive,
> >  		 * btrfs starts at an offset of at least 1MB when doing chunk
> > -		 * allocation.
> > +		 * allocation. There is one notable exception, and that is
> > +		 * when the System chunk has allocation profile of single. In
> > +		 * this case it occupies 0..4m  and we don't really need to
> > +		 * subtract 1mb as it's implied
> 
> This is in fact a bug in mkfs.btrfs.
> 
> The problem is, current chunk allocator in mkfs.btrfs doesn't really
> avoid the first 1M.
> 
> So temporary (profile SINGLE) chunks can still using that reserved 1M,
> while normally such chunks get removed so reserved 1M will not be used,
> but if we specified -m single, such chunks will just be used as is.
> 
> And such behavior makes unallocated space calculation pretty nasty.
> 
> >  		 */
> > -		skip_space = SZ_1M;
> > -
> > -		/*
> > -		 * we can use the free space in [0, skip_space - 1], subtract
> > -		 * it from the total.
> > -		 */
> > -		if (avail_space && avail_space >= skip_space)
> > -			avail_space -= skip_space;
> > -		else
> > -			avail_space = 0;
> > +		if (device->devid != 1 ||
> > +		    (!sys_chunk_single && device->devid == 1)) {
> > +			if (avail_space && avail_space >= SZ_1M)
> > +				avail_space -= SZ_1M;
> > +			else
> > +				avail_space = 0;
> > +		}
> 
> Due to above reason, using system chunk profile and devid to check if we
> need to reduce 1M is not reliable.
> 
> That system chunk can be relocated by balance, and in that case new
> chunk may start beyond 1M.

Agreed, the patch relies on some very specific layout and is not future
proof.

> So the most reliable method would be manually checking the the first
> device extent of this device, and to determine if we should minus that 1M.
> 
> If the first device extent starts beyond 1M, reducing 1M would be good.
> Or don't modify it.

This would need to get the physical mapping of the device chunks, right
now we're fine with sizes (in the logical units).
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Qu Wenruo Jan. 11, 2018, 1:30 a.m. UTC | #6
On 2018年01月10日 23:33, Nikolay Borisov wrote:
> 
> 
> On 10.01.2018 03:30, Qu Wenruo wrote:
>> That system chunk can be relocated by balance, and in that case new
>> chunk may start beyond 1M.
>>
>> So the most reliable method would be manually checking the the first
>> device extent of this device, and to determine if we should minus that 1M.
>>
>> If the first device extent starts beyond 1M, reducing 1M would be good.
>> Or don't modify it.
> 
> So assuming I run generic/015 on a system with your mkfs patches to put
> system chunk after 1mb and I implement a logic in
> btrfs_calc_avail_data_space to subtract 1mb if the first extent of a
> device starts at 1mb. In that case the test is still going to fail i.e
> have the discrepancy of 1mb due to 1mb being subtracted first due to
> there being unallocated space on the device, but then on the second df
> btrfs_calc_avail_data_space will return 0 so df is going to report space
> only from the block groups.
> 
> 
> Thinking a bit more I think with your mkfs.btrfs patch there is no need
> to check where the device extent starts, since now we are guaranteed to
> always have a hole of at least 1mb at the beginning.

The check here is mainly for old filesystems, whose device extent can
start in the reserved range.

But considering the difference will be no larger than 1MB, and using
small filesystem (100ish MiB) is already very rare in btrfs, such
difference won't really cause much problem, except for the test case.

Thanks,
Qu

> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
diff mbox

Patch

diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index f33a55272deb..a4dfc9701220 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -1925,12 +1925,13 @@  static int btrfs_calc_avail_data_space(struct btrfs_fs_info *fs_info,
 	struct btrfs_device_info *devices_info;
 	struct btrfs_fs_devices *fs_devices = fs_info->fs_devices;
 	struct btrfs_device *device;
-	u64 skip_space;
 	u64 type;
 	u64 avail_space;
 	u64 min_stripe_size;
 	int min_stripes = 1, num_stripes = 1;
 	int i = 0, nr_devices;
+	bool sys_chunk_single = !(btrfs_system_alloc_profile(fs_info) &
+				  ~BTRFS_BLOCK_GROUP_SYSTEM);
 
 	/*
 	 * We aren't under the device list lock, so this is racy-ish, but good
@@ -1988,18 +1989,18 @@  static int btrfs_calc_avail_data_space(struct btrfs_fs_info *fs_info,
 		/*
 		 * In order to avoid overwriting the superblock on the drive,
 		 * btrfs starts at an offset of at least 1MB when doing chunk
-		 * allocation.
+		 * allocation. There is one notable exception, and that is
+		 * when the System chunk has allocation profile of single. In
+		 * this case it occupies 0..4m  and we don't really need to
+		 * subtract 1mb as it's implied
 		 */
-		skip_space = SZ_1M;
-
-		/*
-		 * we can use the free space in [0, skip_space - 1], subtract
-		 * it from the total.
-		 */
-		if (avail_space && avail_space >= skip_space)
-			avail_space -= skip_space;
-		else
-			avail_space = 0;
+		if (device->devid != 1 ||
+		    (!sys_chunk_single && device->devid == 1)) {
+			if (avail_space && avail_space >= SZ_1M)
+				avail_space -= SZ_1M;
+			else
+				avail_space = 0;
+		}
 
 		if (avail_space < min_stripe_size)
 			continue;