
btrfs: change max_inline default to 2048

Message ID 1455209730-22811-1-git-send-email-dsterba@suse.com (mailing list archive)
State Accepted

Commit Message

David Sterba Feb. 11, 2016, 4:55 p.m. UTC
The current practical default is ~4k on x86_64 (the logic is more complex,
simplified for brevity). The inlined files land in the metadata block group
and thus consume space that could be needed for the real metadata.

The inlining brings some usability surprises:

1) the difference in total space consumption, measured on various
   filesystems and on btrfs with DUP metadata, was quite visible because
   of the data duplicated within metadata

2) inlined data may exhaust the metadata space, which is more precious
   when the entire device space is already allocated to chunks (i.e.
   balance cannot make the space more compact)

3) performance suffers a bit as the inlined blocks are duplicated and
   stored far away on the device.

Proposed fix: set the default to 2048

This mainly fixes 1): the total filesystem space consumption will be on
par with other filesystems.

Partially fixes 2): more data are pushed to the data block groups.

How much 3) improves depends on the actual distribution of small file
sizes.

The change is independent of the metadata block group type (though it's
most visible with DUP) and of the system page size, as these parameters
are not trivial to find out, compared to the file size.

Signed-off-by: David Sterba <dsterba@suse.com>
---
 fs/btrfs/ctree.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Comments

Duncan Feb. 12, 2016, 7:10 a.m. UTC | #1
David Sterba posted on Thu, 11 Feb 2016 17:55:30 +0100 as excerpted:

> The current practical default is ~4k on x86_64 (the logic is more
> complex, simplified for brevity)

> Proposed fix: set the default to 2048

> Signed-off-by: David Sterba <dsterba@suse.com>
> ---
>  fs/btrfs/ctree.h | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index bfe4a337fb4d..6661ad8b4088 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -2252,7 +2252,7 @@ struct btrfs_ioctl_defrag_range_args {

> -#define BTRFS_DEFAULT_MAX_INLINE	(8192)
> +#define BTRFS_DEFAULT_MAX_INLINE	(2048)

Default?

For those who want to keep the current inline, what's the mkfs.btrfs or 
mount-option recipe to do so?  I don't see any code added for that, nor 
am I aware of any current options to change it, yet "default" indicates 
that it's possible to set it other than that default if desired.


Specifically what I'm looking at here is avoiding "tails", ala reiserfs. 

Except to my understanding, on btrfs, this feature doesn't avoid tails on 
large files at all -- they're unchanged and still take whole blocks even 
if for just a single byte over an even block size.  Rather, (my 
understanding of) what the feature does on btrfs is redirect whole files 
under a particular size to metadata.  While that won't change things for 
larger files, in general usage it /can/ still help quite a lot, as above 
some arbitrary cutoff (which is what this value ultimately becomes), a 
fraction of a block, on a file that's already say hundreds of blocks, 
doesn't make a lot of difference, while a fraction of a block on a file 
only a fraction of a block in size, makes ALL the difference, 
proportionally.  And given that a whole lot more small files can fit in 
whatever size compared to larger files...

Of course dup metadata with single data does screw up the figures, 
because any data that's stored in metadata then gets duped to twice the 
size it would take as data, so indeed, in that case, half a block's size 
(which is what your 2048 is) maximum makes sense, since above that, the 
file would take less space stored as data in a full block than it does 
squished into metadata but with metadata duped.
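(As a rough break-even check: a 2049-byte file stored as single data costs
one 4096-byte block, while inlined into DUP metadata it costs roughly
2 * 2049 = 4098 bytes of leaf space plus item overhead, so anything above
half a block is cheaper kept as data.)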

But there's a lot of users who choose to use the same replication for 
both data and metadata, on a single device either both single, or now 
that it's possible, both dup, and on multi-device, the same raid-whatever 
for both.  For those people, even a (small) multi-block setting makes 
sense, because for instance 16 KiB plus one byte becomes 20 KiB when 
stored as data in 4 KiB blocks, but it's still 16 KiB plus one byte as 
metadata, and the multiplier is the same for both, so...  And on raid1, 
of course that 4 KiB extra block becomes 8 KiB extra, 2 * 4 KiB blocks, 
32 KiB + 2 B total as metadata, 40 KiB total as data.

And of course we now have dup data as a single-device possibility, so 
people can set dup data /and/ metadata, now, yet another same replication 
case.

But there's some historical perspective to consider here as well.  Back 
when metadata nodes were 4 KiB by default too, I believe the result was 
something slightly under 2048 anyway, so the duped/raid1 metadata vs. 
single data case worked as expected, while now that metadata nodes are 16 
KiB by default, you indicate the practical result is near the 4 KiB block 
size, and you correctly point out the size-doubling implications of that 
on the default single-data, raid1/dup-metadata, compared to how it used 
to work.

So your size implications point is valid, and of course reliably getting/
calculating replication value is indeed problematic, too, as you say, 
so...

There is indeed a case to be made for a 2048 default, agreed.

But exposing this as an admin-settable value, so admins that know they've 
set a similar replication value for both data and metadata can optimize 
accordingly, makes a lot of sense as well.

(And come to think of it, now that I've argued that point, it occurs to 
me that maybe setting 32 KiB or even 64 KiB node size as opposed to 
keeping the 16 KiB default, may make sense in this regard, as it should 
allow larger max_inline values, to 16 KiB aka 4 * 4 KiB blocks, anyway, 
which as I pointed out could still cut down on waste rather dramatically, 
while still allowing the performance efficiency of separate data/metadata 
on files of any significant size, where the proportional space wastage of 
sub-block tails will be far smaller.)
David Sterba Feb. 15, 2016, 6:42 p.m. UTC | #2
On Fri, Feb 12, 2016 at 07:10:29AM +0000, Duncan wrote:
> Default?
> 
> For those who want to keep the current inline, what's the mkfs.btrfs or 
> mount-option recipe to do so?  I don't see any code added for that, nor 
> am I aware of any current options to change it, yet "default" indicates 
> that it's possible to set it other than that default if desired.

The name of the mount option is max_inline, as referred to in the subject.
This is a runtime option and is not affected by mkfs.
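For example (an illustrative invocation; the mount point is hypothetical),
something like

  mount -o remount,max_inline=8192 /mnt

should restore the previous 8k limit on an already-mounted filesystem.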

> Specifically what I'm looking at here is avoiding "tails", ala reiserfs. 

The inline data are not the same as reiserfs-style tail packing.
On btrfs only files smaller than the limit are inlined; file tails
allocate a full block.
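(Concretely: with a 4 KiB sectorsize a 4097-byte file always occupies two
full data blocks and its one-byte tail is never packed into metadata,
while a 1500-byte file under a 2048 limit can typically be stored entirely
inline.)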

> Except to my understanding, on btrfs, this feature doesn't avoid tails on 
> large files at all -- they're unchanged and still take whole blocks even 
> if for just a single byte over an even block size.  Rather, (my 
> understanding of) what the feature does on btrfs is redirect whole files 
> under a particular size to metadata. 

Right.

> While that won't change things for 
> larger files, in general usage it /can/ still help quite a lot, as above 
> some arbitrary cutoff (which is what this value ultimately becomes), a 
> fraction of a block, on a file that's already say hundreds of blocks, 
> doesn't make a lot of difference, while a fraction of a block on a file 
> only a fraction of a block in size, makes ALL the difference, 
> proportionally.  And given that a whole lot more small files can fit in 
> whatever size compared to larger files...
> 
> Of course dup metadata with single data does screw up the figures, 
> because any data that's stored in metadata then gets duped to twice the 
> size it would take as data, so indeed, in that case, half a block's size 
> (which is what your 2048 is) maximum makes sense, since above that, the 
> file would take less space stored as data in a full block than it does 
> squished into metadata but with metadata duped.
> 
> But there's a lot of users who choose to use the same replication for 
> both data and metadata, on a single device either both single, or now 
> that it's possible, both dup, and on multi-device, the same raid-whatever 
> for both.  For those people, even a (small) multi-block setting makes 
> sense, because for instance 16 KiB plus one byte becomes 20 KiB when 
> stored as data in 4 KiB blocks, but it's still 16 KiB plus one byte as 
> metadata,

Not exactly like that; the internal limits are the page size, the b-tree
leaf space and max_inline. So on common hardware the limit is 4k.
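
Roughly, as an illustrative sketch (not the kernel code; the per-item
leaf overhead is ignored and the helper name is made up):

  #include <stddef.h>

  /*
   * A file is a candidate for inline storage only if its whole size
   * fits under all three limits at once: the page size, the usable
   * space in a b-tree leaf, and the max_inline mount option.  With
   * 4k pages and 16k nodes the page size is the smallest of the
   * first two, hence the practical ~4k limit mentioned above.
   */
  static size_t effective_inline_limit(size_t page_size,
                                       size_t leaf_data_size,
                                       size_t max_inline)
  {
          size_t limit = page_size;

          if (leaf_data_size < limit)
                  limit = leaf_data_size;
          if (max_inline < limit)
                  limit = max_inline;
          return limit;
  }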

> and the multiplier is the same for both, so...  And on raid1, 
> of course that 4 KiB extra block becomes 8 KiB extra, 2 * 4 KiB blocks, 
> 32 KiB + 2 B total as metadata, 40 KiB total as data.
> 
> And of course we now have dup data as a single-device possibility, so 
> people can set dup data /and/ metadata, now, yet another same replication 
> case.

I think the replication is not strictly related to this patch. Yes, it
applies to the default DUP metadata profile, but that's rather a
coincidence. The primary purpose is to guarantee replication for the
metadata, not the inlined data.

> But there's some historical perspective to consider here as well.  Back 
> when metadata nodes were 4 KiB by default too, I believe the result was 
> something slightly under 2048 anyway,

In that case the b-tree leaf limit applies, so it's 4k - leaf header,
resulting in ~3918 bytes.

> so the duped/raid1 metadata vs. 
> single data case worked as expected, while now that metadata nodes are 16 
> KiB by default, you indicate the practical result is near the 4 KiB block 
> size,

Here the page size limit applies and it's 4k exactly.

> and you correctly point out the size-doubling implications of that 
> on the default single-data, raid1/dup-metadata, compared to how it used 
> to work.
> 
> So your size implications point is valid, and of course reliably getting/
> calculating replication value is indeed problematic, too, as you say, 
> so...
> 
> There is indeed a case to be made for a 2048 default, agreed.
> 
> But exposing this as an admin-settable value, so admins that know they've 
> set a similar replication value for both data and metadata can optimize 
> accordingly, makes a lot of sense as well.
> 
> (And come to think of it, now that I've argued that point, it occurs to 
> me that maybe setting 32 KiB or even 64 KiB node size as opposed to 
> keeping the 16 KiB default, may make sense in this regard, as it should 
> allow larger max_inline values, to 16 KiB aka 4 * 4 KiB blocks, anyway, 
> which as I pointed out could still cut down on waste rather dramatically, 
> while still allowing the performance efficiency of separate data/metadata 
> on files of any significant size, where the proportional space wastage of 
> sub-block tails will be far smaller.)

As stated above, unfortunately no, and what's worse, larger node sizes
cause an increase in memcpy overhead on all metadata changes. The number
of bytes saved would not, IMO, justify the performance drop. The tendency
is to lower the limit; there have been people asking how to turn inlining
off completely.
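
(For that use case, setting the option to zero, e.g.
mount -o remount,max_inline=0, should disable inlining of file data
entirely.)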
Chris Mason Feb. 15, 2016, 9:29 p.m. UTC | #3
On Thu, Feb 11, 2016 at 05:55:30PM +0100, David Sterba wrote:
> The current practical default is ~4k on x86_64 (the logic is more complex,
> simplified for brevity). The inlined files land in the metadata block group
> and thus consume space that could be needed for the real metadata.
> 
> The inlining brings some usability surprises:
> 
> 1) the difference in total space consumption, measured on various
>    filesystems and on btrfs with DUP metadata, was quite visible because
>    of the data duplicated within metadata
> 
> 2) inlined data may exhaust the metadata space, which is more precious
>    when the entire device space is already allocated to chunks (i.e.
>    balance cannot make the space more compact)
> 
> 3) performance suffers a bit as the inlined blocks are duplicated and
>    stored far away on the device.
> 
> Proposed fix: set the default to 2048
> 
> This mainly fixes 1): the total filesystem space consumption will be on
> par with other filesystems.
> 
> Partially fixes 2): more data are pushed to the data block groups.
> 
> How much 3) improves depends on the actual distribution of small file
> sizes.
> 
> The change is independent of the metadata block group type (though it's
> most visible with DUP) and of the system page size, as these parameters
> are not trivial to find out, compared to the file size.

This is a good compromise, and people that want higher numbers can
always use the mount option.

-chris

Patch

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index bfe4a337fb4d..6661ad8b4088 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -2252,7 +2252,7 @@ struct btrfs_ioctl_defrag_range_args {
 #define BTRFS_MOUNT_FREE_SPACE_TREE	(1 << 26)
 
 #define BTRFS_DEFAULT_COMMIT_INTERVAL	(30)
-#define BTRFS_DEFAULT_MAX_INLINE	(8192)
+#define BTRFS_DEFAULT_MAX_INLINE	(2048)
 
 #define btrfs_clear_opt(o, opt)		((o) &= ~BTRFS_MOUNT_##opt)
 #define btrfs_set_opt(o, opt)		((o) |= BTRFS_MOUNT_##opt)