| Message ID | 20140206021516.304732cd@natsu (mailing list archive) |
|---|---|
| State | New, archived |
On 2014/02/05 10:15 PM, Roman Mamedov wrote:
> Hello,
>
> On a freshly-created RAID1 filesystem of two 1TB disks:
>
> # df -h /mnt/p2/
> Filesystem      Size  Used Avail Use% Mounted on
> /dev/sda2       1.8T  1.1M  1.8T   1% /mnt/p2
>
> I cannot write 2TB of user data to that RAID1, so this estimate is clearly
> misleading. I got tired of looking at the bogus disk free space on all my
> RAID1 btrfs systems, so today I decided to do something about this:
...
> After:
>
> # df -h /mnt/p2/
> Filesystem      Size  Used Avail Use% Mounted on
> /dev/sda2       1.8T  1.1M  912G   1% /mnt/p2
>
> Until per-subvolume RAID profiles are implemented, this estimate will be
> correct, and even after, it should be closer to the truth than assuming the
> user will fill their RAID1 FS only with subvolumes of single or raid0 profiles.

This is a known issue:
https://btrfs.wiki.kernel.org/index.php/FAQ#Why_does_df_show_incorrect_free_space_for_my_RAID_volume.3F

Btrfs is still considered experimental - this is just one of those caveats
we've learned to adjust to.

The change could work well for now and I'm sure it has been considered. I
guess the biggest end-user issue is that you can, at a whim, change the model
for new blocks - raid0/5/6,single etc and your value from 5 minutes ago is
far out from your new value without having written anything or taken up any
space. Not a show-stopper problem, really.

The biggest dev issue is that future features will break this behaviour, such
as the "per-subvolume RAID profiles" you mentioned. It is difficult to
motivate including code (for which there's a known workaround) where we know
it will be obsoleted.
On Thu, 06 Feb 2014 09:38:15 +0200
Brendan Hide <brendan@swiftspirit.co.za> wrote:

> This is a known issue:
> https://btrfs.wiki.kernel.org/index.php/FAQ#Why_does_df_show_incorrect_free_space_for_my_RAID_volume.3F
> Btrfs is still considered experimental

It's long overdue to start tackling these snags and 'stop hiding behind the
experimental flag' [1], which is also no longer present as of 3.13.

[1] http://www.spinics.net/lists/linux-btrfs/msg30396.html

> this is just one of those caveats we've learned to adjust to.

Sure, but it's hard to dispute that this particular behavior is clearly
broken from the user perspective, even if it's "broken by design", and there
are a number of very "smart" and "future-proof" reasons to keep it broken
today. Personally I tired of trying to keep in mind which partitions are
btrfs raid1, and mentally dividing the displayed free space by two for those.
That's what computers are for, to keep track of such things.

> The change could work well for now and I'm sure it has been considered.
> I guess the biggest end-user issue is that you can, at a whim, change
> the model for new blocks - raid0/5/6,single etc and your value from 5
> minutes ago is far out from your new value without having written
> anything or taken up any space. Not a show-stopper problem, really.

Changing the allocation profile for new blocks is a serious change you don't
do accidentally; it's about the same importance level as e.g. resizing the
filesystem. And no one is really surprised when the 'df' result changes after
an FS resize.

> The biggest dev issue is that future features will break this behaviour,

That could be years away.

> such as the "per-subvolume RAID profiles" you mentioned. It is difficult
> to motivate including code (for which there's a known workaround) where
> we know it will be obsoleted.

There's not a lot of code to include (as my 3-line patch demonstrates), it
could just as easily be removed when it's obsolete. But I did not have any
high hopes of defeating the "broken by design" philosophy, that's why I didn't
submit it as a real patch for inclusion but rather just as a helpful hint for
people to add to their own kernels if they want this change to happen.
Hi Roman

On 02/06/2014 01:45 PM, Roman Mamedov wrote:
> On Thu, 06 Feb 2014 09:38:15 +0200
[...]
> There's not a lot of code to include (as my 3-line patch demonstrates), it
> could just as easily be removed when it's obsolete. But I did not have any
> high hopes of defeating the "broken by design" philosophy, that's why I didn't
> submit it as a real patch for inclusion but rather just as a helpful hint for
> people to add to their own kernels if they want this change to happen.

I agree with you about the needing of a solution. However your patch to me
seems even worse than the actual code.

For example you cannot take in account the mix of data/linear and
metadata/dup (with the pathological case of small files stored in the
metadata chunks ), nor different profile level like raid5/6 (or the future
raidNxM). And do not forget the compression...

The situation is very complex. I am inclined to use a different approach.

As you know, btrfs allocates space in chunks. Each chunk has its own ratio
between the data occupied on the disk, and the data available to the
filesystem. For SINGLE the ratio is 1, for DUP/RAID1/RAID10 the ratio is 2,
for raid 5 the ratio is n/(n-1) (where n is the stripes count), for raid 6
the ratio is n/(n-2)....

Because a filesystem could have chunks with different ratios, we can compute
a global ratio as the composition of each chunk's ratio:

    for_each_chunk: all_chunks_size += chunk_size[i]
    for_each_chunk: global_ratio += chunk_ratio[i] * chunk_size[i] / all_chunks_size

If we assume that this ratio is constant during the life of the filesystem,
we can use it to get an estimate of the space available to the users as:

    free_space = (all_disks_size - all_chunks_size) / global_ratio

The code above is a simplification, because we should also take into account
the space still available on each _already_allocated_ chunk. We could further
enhance this estimation, taking in account also the total files sizes and
their space consumed in the chunks (this could be different due to the
compression).

Even though not perfect, it would be a better estimation than the actual one.

BR
G.Baroncelli
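For readers who want to see the arithmetic end to end, here is a minimal,
self-contained sketch of the blended-ratio estimate Goffredo outlines. The
chunk list, sizes and names are invented for illustration; this is not kernel
code and does not read real btrfs structures.

/*
 * Minimal sketch of the blended-ratio estimate described above.
 * The chunk data is invented for the example; real code would read it
 * from the chunk tree. Ratios are raw bytes per byte of user data.
 */
#include <stdio.h>
#include <stdint.h>

struct chunk {
    uint64_t size;   /* raw bytes allocated on disk by this chunk */
    double   ratio;  /* 1 single, 2 DUP/RAID1/RAID10, n/(n-1) raid5, n/(n-2) raid6 */
};

int main(void)
{
    /* Example volume: two 1 TiB devices, 200 GiB of RAID1 data chunks
     * and 2 GiB of RAID1 metadata chunks already allocated. */
    struct chunk chunks[] = {
        { 200ULL << 30, 2.0 },
        {   2ULL << 30, 2.0 },
    };
    uint64_t all_disks_size = 2 * (1024ULL << 30);
    uint64_t all_chunks_size = 0;
    double global_ratio = 0.0;

    for (size_t i = 0; i < sizeof(chunks) / sizeof(chunks[0]); i++)
        all_chunks_size += chunks[i].size;
    for (size_t i = 0; i < sizeof(chunks) / sizeof(chunks[0]); i++)
        global_ratio += chunks[i].ratio * (double)chunks[i].size
                        / (double)all_chunks_size;

    /* Unallocated raw space, scaled down by the blended ratio, is the
     * estimate of user data that can still be written. */
    double free_space = (double)(all_disks_size - all_chunks_size) / global_ratio;
    printf("estimated writable space: %.1f GiB\n", free_space / (1ULL << 30));
    return 0;
}

With 202 GiB of RAID1 chunks already allocated on 2 TiB of raw storage, the
sketch reports roughly 923 GiB writable, which is the kind of answer the
unmodified statfs path never produces.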
On 02/05/2014 03:15 PM, Roman Mamedov wrote:
> Hello,
>
> On a freshly-created RAID1 filesystem of two 1TB disks:
>
> # df -h /mnt/p2/
> Filesystem      Size  Used Avail Use% Mounted on
> /dev/sda2       1.8T  1.1M  1.8T   1% /mnt/p2
>
> I cannot write 2TB of user data to that RAID1, so this estimate is clearly
> misleading. I got tired of looking at the bogus disk free space on all my
> RAID1 btrfs systems, so today I decided to do something about this:
>
> --- fs/btrfs/super.c.orig	2014-02-06 01:28:36.636164982 +0600
> +++ fs/btrfs/super.c	2014-02-06 01:28:58.304164370 +0600
> @@ -1481,6 +1481,11 @@
>  	}
>
>  	kfree(devices_info);
> +
> +	if (type & BTRFS_BLOCK_GROUP_RAID1) {
> +		do_div(avail_space, min_stripes);
> +	}
> +
>  	*free_bytes = avail_space;
>  	return 0;
>  }

This needs to be more flexible, and also this causes the problem where now
you show the actual usable amount of space _but_ you are also showing twice
the amount of used space. I'm ok with going in this direction, but we need to
convert everybody over so it works for raid10 as well and the used values
need to be adjusted.

Thanks,

Josef
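A sketch of the kind of generalisation Josef asks for might look like the
following - a userspace illustration only, with stand-in profile flags rather
than the kernel's BTRFS_BLOCK_GROUP_* constants, and without the adjustment
of the 'used' values he also wants.

/*
 * Userspace illustration: pick a divisor from the block-group profile
 * instead of hard-coding RAID1. The flag values are local stand-ins for
 * the BTRFS_BLOCK_GROUP_* bits; parity profiles would need their own
 * n/(n-1) and n/(n-2) handling.
 */
#include <stdio.h>
#include <stdint.h>

#define BG_RAID0   (1ULL << 3)
#define BG_RAID1   (1ULL << 4)
#define BG_DUP     (1ULL << 5)
#define BG_RAID10  (1ULL << 6)

static uint64_t usable_estimate(uint64_t raw_avail, uint64_t type)
{
    /* Mirrored profiles store every byte twice, so halve the estimate. */
    if (type & (BG_RAID1 | BG_DUP | BG_RAID10))
        return raw_avail / 2;
    return raw_avail;   /* single/RAID0: raw space is roughly usable space */
}

int main(void)
{
    uint64_t two_tb_raw = 2048ULL << 30;   /* 2 TiB of raw free space */

    printf("RAID1 : %llu GiB usable\n",
           (unsigned long long)(usable_estimate(two_tb_raw, BG_RAID1) >> 30));
    printf("RAID0 : %llu GiB usable\n",
           (unsigned long long)(usable_estimate(two_tb_raw, BG_RAID0) >> 30));
    return 0;
}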
On Thu, 06 Feb 2014 20:54:19 +0100
Goffredo Baroncelli <kreijack@libero.it> wrote:

> I agree with you about the needing of a solution. However your patch to me
> seems even worse than the actual code.
>
> For example you cannot take in account the mix of data/linear and
> metadata/dup (with the pathological case of small files stored in the
> metadata chunks ), nor different profile level like raid5/6 (or the future
> raidNxM)
> And do not forget the compression...

Every estimate first and foremost should be measured by how precise it is, or
in this case "wrong by how many gigabytes". The actual code returns a result
that is pretty much always wrong by 2x, after the patch it will be close
within gigabytes to the correct value in the most common use case (data raid1,
metadata raid1 and that's it). Of course that PoC is nowhere near the final
solution, what I can't agree with is "if another option is somewhat better,
but not ideally perfect, then it's worse than the current one", even
considering the current one is absolutely broken.

> The situation is very complex. I am inclined to use a different approach.
>
> As you know, btrfs allocates space in chunks. Each chunk has its own ratio
> between the data occupied on the disk, and the data available to the
> filesystem. For SINGLE the ratio is 1, for DUP/RAID1/RAID10 the ratio is 2,
> for raid 5 the ratio is n/(n-1) (where n is the stripes count), for raid 6
> the ratio is n/(n-2)....
>
> Because a filesystem could have chunks with different ratios, we can compute
> a global ratio as the composition of each chunk's ratio
> We could further enhance this estimation, taking in account also the total
> files sizes and their space consumed in the chunks (this could be different
> due to the compression)

I wonder what the performance implications of all that would be. I feel a
simpler approach could work.
On Feb 6, 2014, at 9:40 PM, Roman Mamedov <rm@romanrm.net> wrote:

> On Thu, 06 Feb 2014 20:54:19 +0100
> Goffredo Baroncelli <kreijack@libero.it> wrote:
>
>> I agree with you about the needing of a solution. However your patch to me
>> seems even worse than the actual code.
>>
>> For example you cannot take in account the mix of data/linear and
>> metadata/dup (with the pathological case of small files stored in the
>> metadata chunks ), nor different profile level like raid5/6 (or the future
>> raidNxM)
>> And do not forget the compression...
>
> Every estimate first and foremost should be measured by how precise it is, or
> in this case "wrong by how many gigabytes". The actual code returns a result
> that is pretty much always wrong by 2x, after the patch it will be close
> within gigabytes to the correct value in the most common use case (data raid1,
> metadata raid1 and that's it). Of course that PoC is nowhere near the final
> solution, what I can't agree with is "if another option is somewhat better,
> but not ideally perfect, then it's worse than the current one", even
> considering the current one is absolutely broken.

Is the glass half empty or is it half full?

From the original post, context is a 2x 1TB raid volume:

Filesystem      Size  Used Avail Use% Mounted on
/dev/sda2       1.8T  1.1M  1.8T   1% /mnt/p2

Earlier conventions would have stated Size ~900GB, and Avail ~900GB. But
that's not exactly true either, is it? It's merely a convention to cut the
storage available in half, while keeping data file sizes the same as if they
were on a single device without raid.

On Btrfs the file system Size is reported as the total storage stack size,
and that's not incorrect. And the amount Avail is likewise not wrong because
that space is "not otherwise occupied" which is the definition of available.
It's linguistically consistent, it's just not a familiar convention.

What I don't care for is the fact that btrfs fi df doesn't report total and
used for raid1, the user has to mentally double the displayed values. I think
the doubling should already be computed, that's what total and used mean,
rather than needing secret decoder ring knowledge to understand the
situation.

Anyway, there isn't a terribly good solution for this issue still. But I
don't find the argument that it's absolutely broken very compelling. And I
disagree that upending Used+Avail=Size as you suggest is a good alternative.

How is that going to work, by the way? Your idea:

Filesystem      Size  Used Avail Use% Mounted on
/dev/sda2       1.8T  1.1M  912G   1% /mnt/p2

If I copy 500GB to this file system, what do you propose df shows me? Clearly
Size stays the same, and Avail presumably becomes 412G. But what does Used go
to? 500G? Or 1T? And when full, will it say Size 1.8T, Used 900G, Avail 11M?
So why is the Size 1.8T, only 900G used and yet it's empty? That doesn't make
sense. Nor does Used increasing at twice the rate Avail goes down.

I also don't think it's useful to fix the problem more than once either.

Chris Murphy
On Thu, 6 Feb 2014 22:30:46 -0700 Chris Murphy <lists@colorremedies.com> wrote: > >From the original post, context is a 2x 1TB raid volume: > > Filesystem Size Used Avail Use% Mounted on > /dev/sda2 1.8T 1.1M 1.8T 1% /mnt/p2 > > Earlier conventions would have stated Size ~900GB, and Avail ~900GB. But that's not exactly true either, is it? Much better, and matching the user expectations of how RAID1 should behave, without a major "gotcha" blowing up into their face the first minute they are trying it out. In fact next step that I planned would be finding how to adjust also Size and Used on all my machines to show what you just mentioned. I get it that btrfs is special and its RAID1 is not the usual RAID1 either, but that's not a good reason to break the 'df' behavior; do whatever you want with in 'btrfs fi df', but if I'm not mistaken the UNIX 'df' always was about user data, how much of my data I have already stored on this partition and how much more can I store. If that's not possible to tell, then try to be reasonably close to the truth, not deliberately off by 2x. > On Btrfs ...the amount Avail is likewise not wrong because that space is "not otherwise occupied" which is the definition of available. That's not the definition of available that's directly useful to anyone, but rather a filesystem-designer level implementation detail, if anything. What usually interests me is, I have a 100 GB file, can I fit it on this filesystem, yes/no? Sure let's find out, just check 'df'. Oh wait, not so fast let's remember was this btrfs? Is that the one with RAID1 or not?... And what if I am accessing that partition on a server via a network CIFS/NFS share and don't even *have a way to find out* any of that.
Am Donnerstag, 6. Februar 2014, 22:30:46 schrieb Chris Murphy: > On Feb 6, 2014, at 9:40 PM, Roman Mamedov <rm@romanrm.net> wrote: > > On Thu, 06 Feb 2014 20:54:19 +0100 > > > > Goffredo Baroncelli <kreijack@libero.it> wrote: > > > > > >> I agree with you about the needing of a solution. However your patch to > >> me seems even worse than the actual code.>> > >> > >> > >> For example you cannot take in account the mix of data/linear and > >> metadata/dup (with the pathological case of small files stored in the > >> metadata chunks ), nor different profile level like raid5/6 (or the > >> future raidNxM) And do not forget the compression... > > > > > > > > Every estimate first and foremost should be measured by how precise it is, > > or in this case "wrong by how many gigabytes". The actual code returns a > > result that is pretty much always wrong by 2x, after the patch it will be > > close within gigabytes to the correct value in the most common use case > > (data raid1, metadata raid1 and that's it). Of course that PoC is nowhere > > near the final solution, what I can't agree with is "if another option is > > somewhat better, but not ideally perfect, then it's worse than the > > current one", even considering the current one is absolutely broken. > > Is the glass half empty or is it half full? > > From the original post, context is a 2x 1TB raid volume: > > Filesystem Size Used Avail Use% Mounted on > /dev/sda2 1.8T 1.1M 1.8T 1% /mnt/p2 > > Earlier conventions would have stated Size ~900GB, and Avail ~900GB. But > that's not exactly true either, is it? It's merely a convention to cut the > storage available in half, while keeping data file sizes the same as if > they were on a single device without raid. > > On Btrfs the file system Size is reported as the total storage stack size, > and that's not incorrect. And the amount Avail is likewise not wrong > because that space is "not otherwise occupied" which is the definition of > available. It's linguistically consistent, it's just not a familiar > convention. I see one issue with it: There are installers and applications that check available disk space prior to installing. This won´t work with current df figures that BTRFS delivers. While I understand that there is *never* a guarentee that a given free space can really be allocated by a process cause other processes can allocate space as well in the mean time, and while I understand that its difficult to provide an accurate to provide exact figures as soon as RAID settings can be set per subvolume, it still think its important to improve on the figures. In the longer term I´d like like a function / syscall to ask the filesystem the following question: I am about to write 200 MB in this directory, am I likely to succeed with that? This way an application can ask specific to a directory which allows BTRFS to provide a more accurate estimation. I understand that there is something like that for single files (fallocate), but there is nothing like this for writing a certain amount of data in several files / directories. Thanks,
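For the single-file case Martin mentions, an application can already ask for
an up-front reservation with posix_fallocate() and treat a failure such as
ENOSPC as "this write is unlikely to fit". A rough sketch, with an invented
probe path and size:

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    const char *probe = "/mnt/p2/.space-probe";   /* hypothetical target dir */
    off_t want = 200LL * 1024 * 1024;             /* "about to write 200 MB" */

    int fd = open(probe, O_CREAT | O_EXCL | O_WRONLY, 0600);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    /* posix_fallocate() returns 0 on success or an errno value (e.g. ENOSPC). */
    int err = posix_fallocate(fd, 0, want);

    close(fd);
    unlink(probe);   /* the reservation was only a test, drop it again */

    printf("%s\n", err == 0 ? "200 MB would probably fit"
                            : "200 MB would probably not fit");
    return err ? 1 : 0;
}

On btrfs this remains approximate - chunk allocation and metadata overhead
are outside what fallocate can see - which is exactly why a directory-level
"will this fit" interface, as Martin suggests, would be useful.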
On 06/02/14 19:54, Goffredo Baroncelli wrote:
> Hi Roman
>
> On 02/06/2014 01:45 PM, Roman Mamedov wrote:
>> There's not a lot of code to include (as my 3-line patch demonstrates), it
>> could just as easily be removed when it's obsolete. But I did not have any
>> high hopes of defeating the "broken by design" philosophy, that's why I didn't
>> submit it as a real patch for inclusion but rather just as a helpful hint for
>> people to add to their own kernels if they want this change to happen.
>
> I agree with you about the needing of a solution. However your patch to me
> seems even worse than the actual code.
>
> For example you cannot take in account the mix of data/linear and
> metadata/dup (with the pathological case of small files stored in the
> metadata chunks ), nor different profile level like raid5/6 (or the future
> raidNxM)
> And do not forget the compression...

Just because the solution that Roman provided is not perfect does not mean
that it is no good at all. For common use cases this will give a much better
estimate of free space than the current code does, at a trivial cost in code
size. It has the benefit of giving a simple estimate without doing any
further work or disk activity (no need to walk all chunks).

Adding a couple more lines of code will make it work equally well with other
RAID levels; maybe that would be more acceptable?

Frank
On Feb 6, 2014, at 11:08 PM, Roman Mamedov <rm@romanrm.net> wrote:

> And what
> if I am accessing that partition on a server via a network CIFS/NFS share and
> don't even *have a way to find out* any of that.

That's the strongest argument. And if the user is using
Explorer/Finder/Nautilus to copy files to the share, I'm pretty sure all
three determine if there's enough free space in advance of starting the copy.
So if it thinks there's free space, it will start to copy and then later fail
midstream when there's no more space. And then the user's copy task is in a
questionable state as to what's been copied, depending on how the file copies
are being threaded.

And due to Btrfs metadata requirements even when deleting, we actually need
an Avail estimate that accounts for "phantom future metadata" as if it's
currently in use, otherwise we don't really have the right indication of
whether or not files can be copied.

Chris Murphy
Josef Bacik <jbacik@fb.com> schrieb: > > On 02/05/2014 03:15 PM, Roman Mamedov wrote: >> Hello, >> >> On a freshly-created RAID1 filesystem of two 1TB disks: >> >> # df -h /mnt/p2/ >> Filesystem Size Used Avail Use% Mounted on >> /dev/sda2 1.8T 1.1M 1.8T 1% /mnt/p2 >> >> I cannot write 2TB of user data to that RAID1, so this estimate is >> clearly misleading. I got tired of looking at the bogus disk free space >> on all my RAID1 btrfs systems, so today I decided to do something about >> this: >> >> --- fs/btrfs/super.c.orig 2014-02-06 01:28:36.636164982 +0600 >> +++ fs/btrfs/super.c 2014-02-06 01:28:58.304164370 +0600 >> @@ -1481,6 +1481,11 @@ >> } >> >> kfree(devices_info); >> + >> + if (type & BTRFS_BLOCK_GROUP_RAID1) { >> + do_div(avail_space, min_stripes); >> + } >> + >> *free_bytes = avail_space; >> return 0; >> } > > This needs to be more flexible, and also this causes the problem where > now you show the actual usable amount of space _but_ you are also > showing twice the amount of used space. I'm ok with going in this > direction, but we need to convert everybody over so it works for raid10 > as well and the used values need to be adjusted. Thanks, It should show the raw space available. Btrfs also supports compression and doesn't try to be smart about how much compressed data would fit in the free space of the drive. If one is using RAID1, it's supposed to fill up with a rate of 2:1. If one is using compression, it's supposed to fill up with a rate of maybe 1:5 for mostly text files. However, btrfs should probably provide its own df utility (like "btrfs fi df") which is smart about disk usage and tries to predict the usable space. But df should stay with actually showing the _free_ space, not _usable_ space (the latter is unpredictable anyway based on the usage patterns applied to the drive, so it can be a rough guess as its best, like looking into the crystal ball). The point here is about "free" vs. "usable" space, the first being a known number, the latter only being a prediction based on current usage. I'd like to stay with "free space" actually showing raw free space.
On Fri, 07 Feb 2014 21:32:42 +0100
Kai Krakow <hurikhan77+btrfs@gmail.com> wrote:

> It should show the raw space available. Btrfs also supports compression and
> doesn't try to be smart about how much compressed data would fit in the free
> space of the drive. If one is using RAID1, it's supposed to fill up with a
> rate of 2:1. If one is using compression, it's supposed to fill up with a
> rate of maybe 1:5 for mostly text files.

Imagine a small business with some 30-40 employees. There is a piece of paper
near the door at the office so that everyone sees it when entering or
leaving, which says:

"Dear employees,

Please keep in mind that on the fileserver '\\DepartmentC', in the directory
'\PublicStorage7' the free space you see as being available needs to be
divided by two; On the server '\\DepartmentD', in '\StorageArchive' and
'\VideoFiles', multiplied by two-thirds. For more details please contact the
IT operations team. Further assistance will be provided at the monthly
training seminar.

Regards,
John S, CTO.'
On Sat, Feb 08, 2014 at 05:33:10PM +0600, Roman Mamedov wrote: > On Fri, 07 Feb 2014 21:32:42 +0100 > Kai Krakow <hurikhan77+btrfs@gmail.com> wrote: > > > It should show the raw space available. Btrfs also supports compression and > > doesn't try to be smart about how much compressed data would fit in the free > > space of the drive. If one is using RAID1, it's supposed to fill up with a > > rate of 2:1. If one is using compression, it's supposed to fill up with a > > rate of maybe 1:5 for mostly text files. > > Imagine a small business with some 30-40 employees. There is a piece of paper > near the door at the office so that everyone sees it when entering or leaving, > which says: > > "Dear employees, > > Please keep in mind that on the fileserver '\\DepartmentC', in the directory > '\PublicStorage7' the free space you see as being available needs to be divided > by two; On the server '\\DepartmentD', in '\StorageArchive' and '\VideoFiles', > multiplied by two-thirds. For more details please contact the IT operations > team. Further assistance will be provided at the monthly training seminar. > > Regards, > John S, CTO.' In my experience, nobody who uses a shared filesystem *ever* looks at the amount of free space on it, until it fills up, at which point they may look at the free space and see "0". Or most likely, they'll be alerted to the issue by an email from the systems people saying, "please will everyone delete unnecessary files from the shared drive, because it's full up." Having a more accurate estimate of the free space is a laudable aim, and in principle I agree with attempts to do it, but I think the argument above isn't exactly a strong one in practice. Even in the current code with only one RAID setting available for data, if you have parity RAID, you've got to look at the number of drives with available free space to make an estimate of available space. I think your best bet, ultimately, is to write code to give either a pessimistic (lower bound) or optimistic (upper bound) estimate of available space based on the profiles in use and the current distribution of free/unallocated space, and stick with that. I think I'd prefer to see a pessimistic bound, although that could break anything like an installer that attempts to see how much free space there is before proceeding. Hugo.
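A toy version of the two bounds Hugo describes, assuming a two-device volume
whose future chunks end up either fully mirrored (pessimistic) or
unreplicated (optimistic); the figures are invented:

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint64_t unallocated = 1024ULL << 30;   /* raw bytes not yet in any chunk */

    /* Raw-bytes-per-user-byte ratios for the profiles the volume allows. */
    double worst_ratio = 2.0;   /* everything becomes RAID1/DUP */
    double best_ratio  = 1.0;   /* everything becomes single/RAID0 */

    printf("pessimistic (lower bound): %.0f GiB\n",
           unallocated / worst_ratio / (1ULL << 30));
    printf("optimistic  (upper bound): %.0f GiB\n",
           unallocated / best_ratio / (1ULL << 30));
    return 0;
}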
Hugo Mills <hugo@carfax.org.uk> schrieb: > On Sat, Feb 08, 2014 at 05:33:10PM +0600, Roman Mamedov wrote: >> On Fri, 07 Feb 2014 21:32:42 +0100 >> Kai Krakow <hurikhan77+btrfs@gmail.com> wrote: >> >> > It should show the raw space available. Btrfs also supports compression >> > and doesn't try to be smart about how much compressed data would fit in >> > the free space of the drive. If one is using RAID1, it's supposed to >> > fill up with a rate of 2:1. If one is using compression, it's supposed >> > to fill up with a rate of maybe 1:5 for mostly text files. >> >> Imagine a small business with some 30-40 employees. There is a piece of >> paper near the door at the office so that everyone sees it when entering >> or leaving, which says: >> >> "Dear employees, >> >> Please keep in mind that on the fileserver '\\DepartmentC', in the >> directory '\PublicStorage7' the free space you see as being available >> needs to be divided by two; On the server '\\DepartmentD', in >> '\StorageArchive' and '\VideoFiles', multiplied by two-thirds. For more >> details please contact the IT operations team. Further assistance will be >> provided at the monthly training seminar. >> >> Regards, >> John S, CTO.' > > In my experience, nobody who uses a shared filesystem *ever* looks > at the amount of free space on it, until it fills up, at which point > they may look at the free space and see "0". Or most likely, they'll > be alerted to the issue by an email from the systems people saying, > "please will everyone delete unnecessary files from the shared drive, > because it's full up." Exactly that is the point from my practical experience. Only sysadmins watch these numbers, and they'd know how to handle them. Imagine the future: Btrfs supports different RAID levels per subvolume. We need to figure out where to place a new subvolume. I need raw numbers for it. Df won't tell me that now. Things become very difficult now. Free space is a number unimportant to end users. They won't look at it. They start to cry and call helpdesk if an application says: Disk is full. You cannot even save your unsaved document, because: Disk full. The only way to solve this, is to apply quotas to users and let the sysadmins do the space usage planning. That will work. I still think, there should be an extra utility which guesses the predicted usable free space - or an option added to df to show that. Roman's argument is only one view of the problem. My argument (sysadmin space planning) is exactly the opposite view. In the future, free space prediction will only become more complicated, involves more code, introduces bugs... It should be done in user space. Df should receive raw numbers. Storage space is cheap these days. You should just throw another disk at the array if free space falls below a certain threshold. End users do not care for free space. They just cry when it's full - no matter how accurate the numbers had been before. They will certainly not cry if they copied 2 MB to the disk but 4 MB had been taken. In a shared storage space this is probably always the case anyway, because just the very same moment someone else also copied 2 MB to the volume. So what? > Having a more accurate estimate of the free space is a laudable > aim, and in principle I agree with attempts to do it, but I think the > argument above isn't exactly a strong one in practice. I do not disagree, too. 
But I think it should go to a separate utility or there should be a new API call in the kernel to get predicted usable free space based on current usage pattern. Df is meant as a utility to get accurate numbers. It should not tell you guessed numbers. Whatever you design a df calculater in btrfs, it could always be too pessimistic or too optimistic (and could even switch unpredictably between both situations). So whatever you do: It is always inaccurate. It will never be able to exactly tell you the numbers you need. If disk space is low: Add disks. Clean up. Whatever. Just simply do not try to fill up your FS to just 1kb left. Btrfs doesn't like that anyway. So: Use quotas. Picking up the piece of paper example: You still have to tell your employees that the free space numbers aren't exact anyways, so their best chance is to simply not look at them and are better off with just trying to copy something. Besides: If you want to fix this, what about the early-ENOSPC problem which is there by design (allocation in chunks)? You'd need to fix that, too.
Chris Murphy <lists@colorremedies.com> schrieb: > > On Feb 6, 2014, at 11:08 PM, Roman Mamedov <rm@romanrm.net> wrote: > >> And what >> if I am accessing that partition on a server via a network CIFS/NFS share >> and don't even *have a way to find out* any of that. > > That's the strongest argument. And if the user is using > Explorer/Finder/Nautilus to copy files to the share, I'm pretty sure all > three determine if there's enough free space in advance of starting the > copy. So if it thinks there's free space, it will start to copy and then > later fail midstream when there's no more space. And then the user's copy > task is in a questionable state as to what's been copied, depending on how > the file copies are being threaded. This problem has already been solved for remote file systems maybe 20-30 years ago: You cannot know how much space is left at the end of the copy by looking at the numbers before the copy - it may have been used up by another user copying a file at the same time. The problem has been solved by applying hard and soft quotas: The sysadmin does an optimistic (or possibly even pessimistic) planning and applies quotas. Soft quotas can be passed for (maybe) 7 days after which you need to free up space again before adding new data. Hard quotas are the hard cutoff - you cannot pass that barrier. Df will show you what's free within your softquota. Problem solved. If you need better numbers, there are quota commands instead of df. Why break with this design choice? If you manage a central shared storage for end users, you should really start thinking about quotas. Without, you cannot even exactly plan your backups. If df shows transformed/guessed numbers to the sysadmins, things start to become very complicated and unpredictable.
Martin Steigerwald <Martin@lichtvoll.de> schrieb: > While I understand that there is *never* a guarentee that a given free > space can really be allocated by a process cause other processes can > allocate space as well in the mean time, and while I understand that its > difficult to provide an accurate to provide exact figures as soon as RAID > settings can be set per subvolume, it still think its important to improve > on the figures. The question here: Does the free space indicator "fail" predictably or inpredictably? It will do the latter with this change.
On Sat, 08 Feb 2014 22:35:40 +0100 Kai Krakow <hurikhan77+btrfs@gmail.com> wrote: > Imagine the future: Btrfs supports different RAID levels per subvolume. We > need to figure out where to place a new subvolume. I need raw numbers for > it. Df won't tell me that now. Things become very difficult now. If you need to perform a btrfs-specific operation, you can easily use the btrfs-specific tools to prepare for it, specifically use "btrfs fi df" which could give provide every imaginable interpretation of free space estimate and then some. UNIX 'df' and the 'statfs' call on the other hand should keep the behavior people are accustomized to rely on since 1970s.
Everyone who has actually looked at what the statfs syscall returns and how
df (and everyone else) uses it, keep talking. Everyone else, go read that
source code first.

There is _no_ combination of values you can return in statfs which will not
be grossly misleading in some common scenario that someone cares about.
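For anyone taking that suggestion up, the fields in question are few. A small
program shows everything df has to work with (statvfs(2) being the portable
wrapper over statfs(2)):

#include <stdio.h>
#include <sys/statvfs.h>

int main(int argc, char **argv)
{
    const char *path = argc > 1 ? argv[1] : "/";
    struct statvfs s;

    if (statvfs(path, &s) != 0) {
        perror("statvfs");
        return 1;
    }

    /* df derives its columns from just these counters:
     *   Size  = f_blocks * f_frsize
     *   Used  = (f_blocks - f_bfree) * f_frsize
     *   Avail = f_bavail * f_frsize
     * Whatever story a filesystem wants to tell has to fit into them. */
    unsigned long long unit = s.f_frsize;
    printf("Size:  %llu bytes\n", (unsigned long long)s.f_blocks * unit);
    printf("Used:  %llu bytes\n", (unsigned long long)(s.f_blocks - s.f_bfree) * unit);
    printf("Avail: %llu bytes\n", (unsigned long long)s.f_bavail * unit);
    return 0;
}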
Roman Mamedov <rm@romanrm.net> schrieb: >> It should show the raw space available. Btrfs also supports compression >> and doesn't try to be smart about how much compressed data would fit in >> the free space of the drive. If one is using RAID1, it's supposed to fill >> up with a rate of 2:1. If one is using compression, it's supposed to fill >> up with a rate of maybe 1:5 for mostly text files. > > Imagine a small business with some 30-40 employees. There is a piece of > paper near the door at the office so that everyone sees it when entering > or leaving, which says: > > "Dear employees, > > Please keep in mind that on the fileserver '\\DepartmentC', in the > directory '\PublicStorage7' the free space you see as being available > needs to be divided by two; On the server '\\DepartmentD', in > '\StorageArchive' and '\VideoFiles', multiplied by two-thirds. For more > details please contact the IT operations team. Further assistance will be > provided at the monthly training seminar. "Dear employees, Please keep in mind that when you run out of space on the fileserver '\\DepartmentC', when you free up space in the directory '\PublicStorage7' the free space you gain on '\StorageArchive' is only one third of the amount you deleted, and in '\VideoFiles', you gain only one half. For more details please contact the IT operations team. Further assistance will be provided at the monthly training seminar. Regards, John S, CTO." The exercise of why is left to the reader... The proposed fix simply does not fix the problem. It simply shifts it introducing the need for another fix somewhere else, which in turn probably also introduces another need for a fix, and so forth... This will become an endless effort of fixing and tuning. It simply does not work because btrfs' design does not allow that. Feel free to fix it but be prepared for the reincarnation of this problem when per- subvolume raid levels become introduced. The problem has to be fixed in user space or with a new API call.
cwillu <cwillu@cwillu.com> schrieb: > Everyone who has actually looked at what the statfs syscall returns > and how df (and everyone else) uses it, keep talking. Everyone else, > go read that source code first. > > There is _no_ combination of values you can return in statfs which > will not be grossly misleading in some common scenario that someone > cares about. Thanks man! statfs returns free blocks. So let's stick with that. The df command, as people try to understand it, is broken by design on btrfs. One has to live with that. The df command as it works since 1970 returns free blocks - and it does that perfectly fine on btrfs without that proposed "fix". User space should not try to be smart about how many blocks are written to the filesystem if it writes xyz bytes to the filesystem. It has been that way since 1970 (or whatever), and it will be that way in the future. And a good file copying GUI should give you the choice of "I know better, copy anyways" (like every other unix utility). Your pointer is everything to say about it.
Roman Mamedov <rm@romanrm.net> schrieb: > UNIX 'df' and the 'statfs' call on the other hand should keep the behavior > people are accustomized to rely on since 1970s. When I started to use unix, df returned blocks, not bytes. Without your proposed patch, it does that right. With your patch, it does it wrong.
On Sun, 09 Feb 2014 00:32:47 +0100 Kai Krakow <hurikhan77+btrfs@gmail.com> wrote: > When I started to use unix, df returned blocks, not bytes. Without your > proposed patch, it does that right. With your patch, it does it wrong. It returns total/used/available space that is usable/used/available by/for user data. Whether that be in sectors, blocks, kilobytes, megabytes or in some other unit, is a secondary detail which is also unrelated to the change being currently discussed and not affected by it.
On Sun, 09 Feb 2014 00:17:29 +0100
Kai Krakow <hurikhan77+btrfs@gmail.com> wrote:

> "Dear employees,
>
> Please keep in mind that when you run out of space on the fileserver
> '\\DepartmentC', when you free up space in the directory '\PublicStorage7'
> the free space you gain on '\StorageArchive' is only one third of the amount
> you deleted, and in '\VideoFiles', you gain only one half.

But that's simply incorrect. Looking at my 2nd patch which also changes the
total reported size and 'used' size, the 'total' space, 'used' space and
space freed up as 'available' after file deletion will all match up
perfectly.

> The exercise of why is left to the reader...
>
> The proposed fix simply does not fix the problem. It simply shifts it
> introducing the need for another fix somewhere else, which in turn probably
> also introduces another need for a fix, and so forth... This will become an
> endless effort of fixing and tuning.

Not sure what exactly becomes problematic if a 2-device RAID1 tells the user
they can store 1 TB of their data on it, and is no longer lying about the
possibility to store 2 TB on it as currently.

Two 1TB disks in RAID1. Total space 1TB. Can store of my data: 1TB.
Wrote 100 GB of files? 100 GB used, 900 GB available, 1TB total.
Deleted 50 GB of those? 50 GB used, 950 GB available, 1TB total.

Can't see anything horribly broken about this behavior. For when you need to
"get to the bottom of things", as mentioned earlier there's always
'btrfs fi df'.

> Feel free to fix it but be prepared for the reincarnation of this problem when
> per-subvolume raid levels become introduced.

AFAIK no one has even begun to write any code to implement those yet.
On Feb 8, 2014, at 6:55 PM, Roman Mamedov <rm@romanrm.net> wrote:

> Not sure what exactly becomes problematic if a 2-device RAID1 tells the user
> they can store 1 TB of their data on it, and is no longer lying about the
> possibility to store 2 TB on it as currently.
>
> Two 1TB disks in RAID1.

OK but while we don't have a top level switch for variable raid on a volume
yet, the on-disk format doesn't consider the device to be raid1 at all. Not
the device, nor the volume, nor the subvolume have this attribute. It's a
function of the data, metadata or system chunk via their profiles.

I can do a partial conversion on a volume, and even could do this multiple
times and end up with some chunks in every available option, some chunks
single, some raid1, some raid0, some raid5. All I have to do is cancel the
conversion before each conversion is complete, successively shortening the
time. And it's not fair to say this has no application because such
conversions take a long time. I might not want to fully do a conversion all
at once. There's no requirement that I do so.

In any case I object to the language being used that implicitly indicates the
'raidness' is a device or disk attribute.

Chris Murphy
On Feb 8, 2014, at 7:21 PM, Chris Murphy <lists@colorremedies.com> wrote:
> we don't have a top level switch for variable raid on a volume yet
This isn't good wording. We don't have a controllable way to set variable raid levels. The interrupted convert model I'd consider not controllable.
Chris Murphy
Roman Mamedov posted on Sun, 09 Feb 2014 04:10:50 +0600 as excerpted:

> If you need to perform a btrfs-specific operation, you can easily use
> the btrfs-specific tools to prepare for it, specifically use "btrfs fi
> df" which could give provide every imaginable interpretation of free
> space estimate and then some.
>
> UNIX 'df' and the 'statfs' call on the other hand should keep the
> behavior people are accustomized to rely on since 1970s.

Which it does... on filesystems that only have 1970s filesystem features. =:^)

RAID or multi-device filesystems aren't 1970s features and break 1970s
behavior and the assumptions associated with it. If you're not prepared to
deal with those broken assumptions, don't. Use mdraid or dmraid or lvm or
whatever to combine your multiple devices into one logical devices as
presented, and put your filesystem (either traditional filesystem, or even
btrfs using traditional single-device functionality) on top of the single
device the layer beneath the filesystem presents. Problem solved! =:^)

Note that df only lists a single device as well, not the multiple component
devices of the filesystem. That's broken functionality by your definition,
too, and again, using some other layer like lvm or mdraid to present multiple
devices as a single virtual device, with a traditional single-device
filesystem layout on top of that single device... solves the problem!

Meanwhile, what I've done here is use one of df's commandline options to set
its block size to 2 MiB, and further used bash's alias functionality to setup
an alias accordingly:

alias df='df -B2M'

$ df /h
Filesystem     2M-blocks  Used Available Use% Mounted on
/dev/sda6          20480 12186      7909  61% /h

$ sudo btrfs fi show /h
Label: hm0238gcnx+35l0  uuid: ce23242a-b0a9-423f-a9c3-7db2729f48d6
        Total devices 2 FS bytes used 11.90GiB
        devid    1 size 20.00GiB used 14.78GiB path /dev/sda6
        devid    2 size 20.00GiB used 14.78GiB path /dev/sdb6

$ sudo btrfs fi df /h
Data, RAID1: total=14.00GiB, used=11.49GiB
System, RAID1: total=32.00MiB, used=16.00KiB
Metadata, RAID1: total=768.00MiB, used=414.94MiB

On btrfs such as the above I can read the 2M blocks as 1M and be happy.

On btrfs such as my /boot, which aren't raid1 (I have two separate /boots,
one on each device, with grub2 configured separately for each to provide a
backup), or if I df my media partitions still on reiserfs on the old spinning
rust, I can either double the figures DF gives me, or add a second -B option
at the CLI, overriding the aliased option.

If I wanted something fully automated, it'd be easy enough to setup a script
that checked what filesystem I was df-ing, matched that against a table of
filesystems to preferred df block sizes, and supplied the appropriate -BxX
option accordingly. As I guess most admins after a few years, I've developed
quite a library of scripts/aliases for various things I do routinely enough
to warrant it, and this would be just one more joining the list. =:^)

But of course it's your system in question, and you can patch btrfs to output
anything you like, in any format you like. No need to bother with df's -B
option if you'd prefer to patch the kernel instead. Me, I'll stick to the -B
option. =:^)
On Sun, 9 Feb 2014 06:38:53 +0000 (UTC) Duncan <1i5t5.duncan@cox.net> wrote: > RAID or multi-device filesystems aren't 1970s features and break 1970s > behavior and the assumptions associated with it. If you're not prepared > to deal with those broken assumptions, don't. Use mdraid or dmraid or lvm > or whatever to combine your multiple devices into one logical devices as > presented, and put your filesystem (either traditional filesystem, or > even btrfs using traditional single-device functionality) on top of the > single device the layer beneath the filesystem presents. Problem solved! > =:^) > > Note that df only lists a single device as well, not the multiple > component devices of the filesystem. That's broken functionality by your > definition, too, and again, using some other layer like lvm or mdraid to > present multiple devices as a single virtual device, with a traditional > single-device filesystem layout on top of that single device... solves > the problem! No reason BTRFS can't work well in a similar simplistic usage scenario. You seem to insist there is no way around it being "too flexible for its own good", but all those advanced features absolutely don't *have* to get in the way of everyday usage for users who don't require them. > Meanwhile, what I've done here is use one of df's commandline options to > set its block size to 2 MiB, and further used bash's alias functionality > to setup an alias accordingly: > > alias df='df -B2M' > > $ df /h > Filesystem 2M-blocks Used Available Use% Mounted on > /dev/sda6 20480 12186 7909 61% /h > > $ sudo btrfs fi show /h > Label: hm0238gcnx+35l0 uuid: ce23242a-b0a9-423f-a9c3-7db2729f48d6 > Total devices 2 FS bytes used 11.90GiB > devid 1 size 20.00GiB used 14.78GiB path /dev/sda6 > devid 2 size 20.00GiB used 14.78GiB path /dev/sdb6 > > $ sudo btrfs fi df /h > Data, RAID1: total=14.00GiB, used=11.49GiB > System, RAID1: total=32.00MiB, used=16.00KiB > Metadata, RAID1: total=768.00MiB, used=414.94MiB > > > On btrfs such as the above I can read the 2M blocks as 1M and be happy. > > On btrfs such as my /boot, which aren't raid1 (I have two separate > /boots, one on each device, with grub2 configured separately for each to > provide a backup), or if I df my media partitions still on reiserfs on > the old spinning rust, I can either double the figures DF gives me, or > add a second -B option at the CLI, overriding the aliased option. Congratulations, you broke your df readings on all other filesystems to fix them on btrfs. > If I wanted something fully automated, it'd be easy enough to setup a > script that checked what filesystem I was df-ing, matched that against a > table of filesystems to preferred df block sizes, and supplied the > appropriate -BxX option accordingly. I am not sure this would work well in the network share scenario described earlier, with clients which in the real world are largely Windows-based.
Duncan <1i5t5.duncan@cox.net> schrieb: > Roman Mamedov posted on Sun, 09 Feb 2014 04:10:50 +0600 as excerpted: > >> If you need to perform a btrfs-specific operation, you can easily use >> the btrfs-specific tools to prepare for it, specifically use "btrfs fi >> df" which could give provide every imaginable interpretation of free >> space estimate and then some. >> >> UNIX 'df' and the 'statfs' call on the other hand should keep the >> behavior people are accustomized to rely on since 1970s. > > Which it does... on filesystems that only have 1970s filesystem features. > =:^) > > RAID or multi-device filesystems aren't 1970s features and break 1970s > behavior and the assumptions associated with it. If you're not prepared > to deal with those broken assumptions, don't. Use mdraid or dmraid or lvm > or whatever to combine your multiple devices into one logical devices as > presented, and put your filesystem (either traditional filesystem, or > even btrfs using traditional single-device functionality) on top of the > single device the layer beneath the filesystem presents. Problem solved! > =:^) > > Note that df only lists a single device as well, not the multiple > component devices of the filesystem. That's broken functionality by your > definition, too, and again, using some other layer like lvm or mdraid to > present multiple devices as a single virtual device, with a traditional > single-device filesystem layout on top of that single device... solves > the problem! > > > Meanwhile, what I've done here is use one of df's commandline options to > set its block size to 2 MiB, and further used bash's alias functionality > to setup an alias accordingly: > > alias df='df -B2M' > > $ df /h > Filesystem 2M-blocks Used Available Use% Mounted on > /dev/sda6 20480 12186 7909 61% /h > > $ sudo btrfs fi show /h > Label: hm0238gcnx+35l0 uuid: ce23242a-b0a9-423f-a9c3-7db2729f48d6 > Total devices 2 FS bytes used 11.90GiB > devid 1 size 20.00GiB used 14.78GiB path /dev/sda6 > devid 2 size 20.00GiB used 14.78GiB path /dev/sdb6 > > $ sudo btrfs fi df /h > Data, RAID1: total=14.00GiB, used=11.49GiB > System, RAID1: total=32.00MiB, used=16.00KiB > Metadata, RAID1: total=768.00MiB, used=414.94MiB > > > On btrfs such as the above I can read the 2M blocks as 1M and be happy. > On btrfs such as my /boot, which aren't raid1 (I have two separate > /boots, one on each device, with grub2 configured separately for each to > provide a backup), or if I df my media partitions still on reiserfs on > the old spinning rust, I can either double the figures DF gives me, or > add a second -B option at the CLI, overriding the aliased option. > > If I wanted something fully automated, it'd be easy enough to setup a > script that checked what filesystem I was df-ing, matched that against a > table of filesystems to preferred df block sizes, and supplied the > appropriate -BxX option accordingly. As I guess most admins after a few > years, I've developed quite a library of scripts/aliases for various > things I do routinely enough to warrant it, and this would be just one > more joining the list. =:^) Well done... And a good idea, I didn't think of it yet. But it's my idea of fixing it in user space. :-) I usually leave the discussion when people start to argument with pointers to unix tradition... That's like starting a systemd discussion and telling me that systemd is broken by design while mentioning in the same sentence that sysvinit is working perfectly fine. The latter doesn't do so. 
The first is a matter of personal taste but is in no case broken... But... Well... > But of course it's your system in question, and you can patch btrfs to > output anything you like, in any format you like. No need to bother with > df's -B option if you'd prefer to patch the kernel instead. Me, I'll > stick to the -B option. =:^) That's essentially the FOSS idea. Actually, I don't want df behavior being broken for me. It uses fstat syscall, that returns blocks. Cutting returned values into half lies about the properties of the device - for EVERY application out there, no matter which assumptions are being made about the returned values. This breaks the fstat syscall. User-space should simply not rely on the assumption that 1k of user data occupies 1k worth of blocks (that's not true anyways because meta-data has to be allocated, too). When I had contact with unix first, df returned used/free blocks - native BLOCKS! No option to make it human readable. No forced intention that it would show you usable space for actual written data. The blocks were given as 512-byte sectors. I've been okay with that. I knew: If I cut the values in half, I'd get about the size of data I perhabs could fit in the device. If it had been a property of the device that 512 byte of user data would write two blocks, nobody had cared about df displaying "wrong" values.
Roman Mamedov <rm@romanrm.net> schrieb: >> When I started to use unix, df returned blocks, not bytes. Without your >> proposed patch, it does that right. With your patch, it does it wrong. > > It returns total/used/available space that is usable/used/available by/for > user data. No, it does not. It returns space allocatable to the filesystem. That's user data and meta data. That can be far from your expectations depending on how allocation on the filesystem works.
Roman Mamedov posted on Sun, 09 Feb 2014 15:20:00 +0600 as excerpted: > On Sun, 9 Feb 2014 06:38:53 +0000 (UTC) > Duncan <1i5t5.duncan@cox.net> wrote: > >> RAID or multi-device filesystems aren't 1970s features and break 1970s >> behavior and the assumptions associated with it. If you're not >> prepared to deal with those broken assumptions, don't. Use mdraid or >> dmraid or lvm or whatever to combine your multiple devices into one >> logical devices as presented, and put your filesystem (either >> traditional filesystem, or even btrfs using traditional single-device >> functionality) on top of the single device the layer beneath the >> filesystem presents. Problem solved! =:^) > > No reason BTRFS can't work well in a similar simplistic usage scenario. > > You seem to insist there is no way around it being "too flexible for its > own good", but all those advanced features absolutely don't *have* to > get in the way of everyday usage for users who don't require them. Not really. I'm more insisting that I've not seen a good kernel-space solution to the problem yet, and believe that it's a userspace or wetware problem. And I provided a userspace/wetware solution that works for me, too. =:^) >> Meanwhile, what I've done here is use one of df's commandline options >> to set its block size to 2 MiB, and further used bash's alias >> functionality to setup an alias accordingly: >> >> alias df='df -B2M' >> >> $ df /h Filesystem 2M-blocks Used Available Use% Mounted on >> /dev/sda6 20480 12186 7909 61% /h >> On btrfs such as the above I can read the 2M blocks as 1M and be happy. >> >> On btrfs such as my /boot, which aren't raid1 (I have two separate >> /boots, one on each device, with grub2 configured separately for each >> to provide a backup), or if I df my media partitions still on reiserfs >> on the old spinning rust, I can either double the figures DF gives me, >> or add a second -B option at the CLI, overriding the aliased option. > > Congratulations, you broke your df readings on all other filesystems to > fix them on btrfs. No. It clearly says 2M blocks. Nothing's broken at all, except perhaps the user's wetware. I just find it a easier to do the doubling in wetware on the occasion it's needed, in MiB, then halving on more frequent occasions (since all my core automounted filesystems that I'd normally be doing df on are btrfs raid1), larger KiB or byte units, and don't need to do that wetware halving often enough to have gone to the trouble of setting up the software-scripted version I propose below. >> If I wanted something fully automated, it'd be easy enough to setup a >> script that checked what filesystem I was df-ing, matched that against >> a table of filesystems to preferred df block sizes, and supplied the >> appropriate -BxX option accordingly. > > I am not sure this would work well in the network share scenario > described earlier, with clients which in the real world are largely > Windows-based. So patch the window-based stuff... oh, you've let them be your master (in the context of my sig below) and you can't... Well, servant by choice, I guess... There's freedom if you want it... which in fact you are using to do your kernel patches. Try patching the MS Windows kernel and distributing those patches, and see how far you get! =:^( FWIW/IMO, in the business context Ernie Ball made the right decision. One BSA audit was enough. 
He said no more, and the company moved to free as in freedom software and isn't beholden to the whims of any servantware or the BSA auditors enforcing it, any longer. =:^) But as I said, your systems (or your company's systems), play servant with them and be subject to the BSA gestapo (or the equivalent in your country) if you will. No skin off my nose. <shrug> Meanwhile, you said it yourself, users aren't normally concerned about this. And others pointed out that to the degree users /are/ concerned, they should be looking at their quotas, not filesystem level usage. And admins, assuming they're proper admins, not the simple "here's my MCSE, I'm certified to do anything, and if I can't do it, it's not possible", types, should have the wetware resources to either deal with the problem there, or script their own solutions, offloading it from wetware to installation-specific userspace software scripts as necessary. All that said, it's worth noting that there ARE already API changes proposed and working their way thru the pipeline, that would expose various bits of necessary data to userspace in a standardized API that filesystems other than btrfs could make use of as well, with the intent of then updating coreutils (the package containing df) and friends to allow them to make use of the information exposed by this API to improve their default information output and allow for additional CLI level options as appropriate. Presumably other userspace apps, including the GUIs over time, would follow the same course. But the key is, getting a standardized modern API ready and the data exposed to userspace via said API in a standardized way, so that all userspace apps can make use of it the same way they use the existing APIs to provide used/free-space data today, but updated from the 1970s style APIs their currently using, to this standardized more modern one, in ordered to provide correct and accurate information to the user, regardless of whether they're using legacy filesystems or something more modern such as btrfs. The point being, it's early days yet, in terms of btrfs and similar modern multi-device flexible usage filesystems, it's basically early adopters having to deal with it now and /as/ early adopters, they can cope with it. Meanwhile, as we know, premature optimization is the root of evil, so be patient, and a reasonable solution, standardized so that other modern filesystems can use the same APIs and expose the same sort of information for userspace to work out the presentation of, will eventually appear.
On Mon, 10 Feb 2014 00:02:38 +0000 (UTC)
Duncan <1i5t5.duncan@cox.net> wrote:

> Meanwhile, you said it yourself, users aren't normally concerned about
> this.

I think you're being mistaken here, the point that "users aren't looking at
the free space, hence it is not important to provide a correct estimate" was
made by someone else, not me. Personally I found that to be just a bit too
surrealistic to try and seriously answer; much like the rest of your message.
--- fs/btrfs/super.c.orig	2014-02-06 01:28:36.636164982 +0600
+++ fs/btrfs/super.c	2014-02-06 01:28:58.304164370 +0600
@@ -1481,6 +1481,11 @@
 	}
 
 	kfree(devices_info);
+
+	if (type & BTRFS_BLOCK_GROUP_RAID1) {
+		do_div(avail_space, min_stripes);
+	}
+
 	*free_bytes = avail_space;
 	return 0;
 }