Btrfs: Do not use data_alloc_cluster in ssd mode

Message ID 20170721114711.20229-1-hans.van.kranenburg@mendix.com
State New

Commit Message

Hans van Kranenburg July 21, 2017, 11:47 a.m. UTC
In the first year of btrfs development, around early 2008, btrfs
gained a mount option which enables specific functionality for
filesystems on solid state devices. The first occurrence of this
functionality is in commit e18e4809, labeled "Add mount -o ssd, which
includes optimizations for seek free storage".

The effect on allocating free space for doing (data) writes is to
'cluster' writes together, writing them out in contiguous space, as
opposed to a 'tetris' way of putting all separate writes into any free
space fragment that fits (which is what the -o nossd behaviour does).

A somewhat simplified explanation of what happens: when, for example,
the 'cluster' size is set to 2MiB and we do some writes, the data
allocator will search for a free space block that is 2MiB big and put
the writes in there. The ssd mode itself might allow a 2MiB cluster to
be composed of multiple free space extents with some existing data in
between, while the additional ssd_spread mount option rules this out
and requires fully free space.
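As an illustration of the difference, the two placement strategies can be sketched with a toy model (a flat extent list and invented function names; this is not the actual kernel allocator):

```c
/* Toy model of the two data placement strategies described above.
 * Free space is a list of (start, length) extents; "tetris" takes the
 * first fragment that fits the write, while the cluster strategy
 * (ssd_spread-style) only accepts a fully free region of at least
 * cluster_size.  All names here are illustrative, not kernel symbols. */
#include <stddef.h>

struct extent { unsigned long long start, len; };

/* nossd / "tetris": first free fragment big enough for this write */
static long long find_tetris(const struct extent *free_list, size_t n,
                             unsigned long long bytes)
{
    for (size_t i = 0; i < n; i++)
        if (free_list[i].len >= bytes)
            return (long long)free_list[i].start;
    return -1; /* no fit: caller would allocate a new chunk */
}

/* cluster mode: insist on a region of at least cluster_size */
static long long find_cluster(const struct extent *free_list, size_t n,
                              unsigned long long bytes,
                              unsigned long long cluster_size)
{
    unsigned long long want = bytes > cluster_size ? bytes : cluster_size;
    for (size_t i = 0; i < n; i++)
        if (free_list[i].len >= want)
            return (long long)free_list[i].start;
    return -1; /* small fragments are skipped, forcing new chunk allocation */
}
```

In this model a 16KiB write lands in the first 64KiB fragment under 'tetris', while the cluster strategy skips every fragment smaller than 2MiB.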

The idea behind this is (commit 536ac8ae): "The [...] clusters make it
more likely a given IO will completely overwrite the ssd block, so it
doesn't have to do an internal rwm cycle."; ssd block meaning nand erase
block. So, effectively this means applying a "locality based algorithm"
and trying to outsmart the actual ssd.

Since then, various changes have been made to the involved code, but the
basic idea is still present, and gets activated whenever the ssd mount
option is active. This also happens by default, when the rotational flag
as seen at /sys/block/<device>/queue/rotational is set to 0.

However, there are a number of problems with this approach.

First, what the optimization is trying to do is outsmart the ssd by
assuming there is a relation between the physical address space of the
block device as seen by btrfs and the actual physical storage of the
ssd, and then adjusting data placement. However, since the introduction
of the Flash Translation Layer (FTL) which is a part of the internal
controller of an ssd, these attempts are futile. The use of good quality
FTL in consumer ssd products might have been limited in 2008, but the
situation changed drastically soon afterwards. Today, even the flash
memory in your automatic cat feeding machine or your grandma's
wheelchair has a full-featured one.

Second, the behaviour as described above results in the filesystem being
filled up with badly fragmented free space extents because of relatively
small pieces of space that are freed up by deletes, but not selected
again as part of a 'cluster'. Since the algorithm prefers allocating a
new chunk over going back to tetris mode, the end result is a filesystem
in which all raw space is allocated, but which is composed of
underutilized chunks with a 'shotgun blast' pattern of fragmented free
space. Usually, the next problematic thing that happens is the
filesystem wanting to allocate new space for metadata, which causes the
filesystem to fail in spectacular ways.
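The arithmetic behind that end state can be sketched with a deliberately simplified model (2MiB cluster writes that each later free back a 1MiB fragment; the sizes and the bookkeeping are illustrative only, not kernel code):

```c
/* Toy accounting for the failure mode above.  Each round writes one
 * 2 MiB cluster and later frees a 1 MiB fragment out of it.  With
 * cluster_only != 0 (ssd-style data allocation) the freed fragments
 * are smaller than the cluster size and never reused, so every write
 * claims fresh raw space; with cluster_only == 0 (tetris mode) the
 * fragments are refilled first.  Returns fresh raw space claimed,
 * in MiB. */
static unsigned long long raw_highwater_mib(unsigned n_rounds,
                                            int cluster_only)
{
    unsigned long long frontier = 0;  /* fresh raw space consumed, MiB */
    unsigned long long free_frag = 0; /* reusable freed space, MiB */

    for (unsigned i = 0; i < n_rounds; i++) {
        unsigned long long need = 2;  /* one 2 MiB cluster write */
        if (!cluster_only) {
            /* tetris: fill small freed fragments before taking new space */
            unsigned long long reuse = free_frag < need ? free_frag : need;
            free_frag -= reuse;
            need -= reuse;
        }
        frontier += need;
        free_frag += 1;               /* this write later frees 1 MiB */
    }
    return frontier;
}
```

After 100 such rounds the cluster-only model has claimed 200MiB of fresh raw space while 100MiB of it sits in fragments too small to reuse; the tetris model stays around 101MiB.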

Third, the default mount options you get for an ssd ('ssd' mode enabled,
'discard' not enabled), in combination with spreading out writes over
the full address space and ignoring freed-up space, lead to worst-case
behaviour in providing information to the ssd itself, since it will
never learn that all the free space left behind is actually free.  There
are two ways to let an ssd know previously written data does not have to
be preserved, which are sending explicit signals using discard or
fstrim, or by simply overwriting the space with new data.  The worst
case behaviour is the btrfs ssd_spread mount option in combination with
not having discard enabled. It has a side effect of minimizing the reuse
of free space previously written in.
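Of the two ways mentioned, the explicit signal is what fstrim(8) sends via the FITRIM ioctl; a minimal sketch (Linux-only, error handling reduced):

```c
/* Ask the filesystem mounted at `path` to discard all of its free
 * space, the same request fstrim(8) makes.  Returns the number of
 * bytes the kernel reports as trimmed, or -1 on error (e.g. the
 * filesystem or device does not support discard). */
#include <fcntl.h>
#include <limits.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>

static long long trim_free_space(const char *path)
{
    struct fstrim_range range = {
        .start = 0, .len = ULLONG_MAX, .minlen = 0
    };
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    if (ioctl(fd, FITRIM, &range) < 0) {
        close(fd);
        return -1;
    }
    close(fd);
    return (long long)range.len; /* kernel writes back bytes trimmed */
}
```

Run against a mounted filesystem this requests a discard of all free space; whether anything is actually trimmed depends on the filesystem and the underlying device supporting it.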

Fourth, the rotational flag in /sys/ does not reliably indicate whether
the device is a locally attached ssd. For example, iSCSI or NBD devices
display as non-rotational, while a loop device on an ssd shows up as
rotational.
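The autodetection consults /sys/block/<device>/queue/rotational; reading it from userspace looks like the sketch below (the sysfs root is parameterized purely so the helper can be exercised on a fake tree; this is not kernel code):

```c
/* Read the rotational flag for a block device.  Returns 0
 * (non-rotational), 1 (rotational), or -1 on error.  On a live
 * system pass "/sys" as sysfs_root. */
#include <stdio.h>

static int read_rotational(const char *sysfs_root, const char *dev)
{
    char path[256];
    int v;

    snprintf(path, sizeof(path), "%s/block/%s/queue/rotational",
             sysfs_root, dev);
    FILE *f = fopen(path, "r");
    if (!f)
        return -1;
    if (fscanf(f, "%d", &v) != 1)
        v = -1;
    fclose(f);
    return v;
}
```

As the text notes, a 0 here only means the block layer believes seeks are cheap; it says nothing certain about the physical medium behind iSCSI, NBD, or a loop device.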

The combination of the second and third problem effectively means that
despite all the good intentions, the btrfs ssd mode reliably causes the
ssd hardware and the filesystem structures and performance to be choked
to death. The clickbait version of the title of this story would have
been "Btrfs ssd optimizations considered harmful for ssds".

The current nossd 'tetris' mode (even still without discard) allows
much more previously used space to be overwritten, causing many more
implicit discards to happen because of the overwrite information the
ssd gets. The actual location in the physical address space, as seen
from the point of view of btrfs, is irrelevant, because the actual writes
to the low level flash are reordered anyway thanks to the FTL.

So what now...?

The changes in here do the following:

1. Throw out the current ssd_spread behaviour.
2. Move the current ssd behaviour to the ssd_spread option.
3. Make ssd mode data allocation identical to tetris mode, like nossd.
4. Adjust and clean up filesystem mount messages so that we can easily
identify if a kernel has this patch applied or not, when providing
support to end users.

Instead of directly cutting out all code related to the data cluster,
it makes sense to take a gradual approach and give users who still have
a valid reason to prefer the current ssd mode a way to keep it, by
specifying the additional ssd_spread option.

Since there are other uses of the ssd mode, we keep the difference
between nossd and ssd mode. However, the usage of the rotational
attribute warrants some reconsideration in the future.

Notes for whoever wants to backport this patch on their 4.9 LTS kernel:
* First apply commit 8a83665a "btrfs: drop the nossd flag when
  remounting with -o ssd", or fixup the differences manually.
* The rest of the conflicts are because of the fs_info refactoring. So,
  for example, instead of using fs_info, it's root->fs_info in
  extent-tree.c

Signed-off-by: Hans van Kranenburg <hans.van.kranenburg@mendix.com>
---
 fs/btrfs/ctree.h            |  4 ++--
 fs/btrfs/disk-io.c          |  6 ++----
 fs/btrfs/extent-tree.c      | 11 ++++++-----
 fs/btrfs/free-space-cache.c | 11 ++++-------
 fs/btrfs/super.c            | 16 +++++++++-------
 5 files changed, 23 insertions(+), 25 deletions(-)

Comments

Adam Borowski July 21, 2017, 2:49 p.m. UTC | #1
On Fri, Jul 21, 2017 at 01:47:11PM +0200, Hans van Kranenburg wrote:
> The changes in here do the following:
> 
> 1. Throw out the current ssd_spread behaviour.
> 2. Move the current ssd behaviour to the ssd_spread option.
> 3. Make ssd mode data allocation identical to tetris mode, like nossd.
> 4. Adjust and clean up filesystem mount messages so that we can easily
> identify if a kernel has this patch applied or not, when providing
> support to end users.

> Notes for whoever wants to backport this patch on their 4.9 LTS kernel:
> * First apply commit 8a83665a "btrfs: drop the nossd flag when
>   remounting with -o ssd", or fixup the differences manually.

It's 951e7966 in mainline.  I see no 8a83665a anywhere -- most likely it's
what you got after cherry-picking to 4.9.
Hans van Kranenburg July 21, 2017, 3:16 p.m. UTC | #2
On 07/21/2017 04:49 PM, Adam Borowski wrote:
> On Fri, Jul 21, 2017 at 01:47:11PM +0200, Hans van Kranenburg wrote:
>> The changes in here do the following:
>>
>> 1. Throw out the current ssd_spread behaviour.
>> 2. Move the current ssd behaviour to the ssd_spread option.
>> 3. Make ssd mode data allocation identical to tetris mode, like nossd.
>> 4. Adjust and clean up filesystem mount messages so that we can easily
>> identify if a kernel has this patch applied or not, when providing
>> support to end users.
> 
>> Notes for whoever wants to backport this patch on their 4.9 LTS kernel:
>> * First apply commit 8a83665a "btrfs: drop the nossd flag when
>>   remounting with -o ssd", or fixup the differences manually.
> 
> It's 951e7966 in mainline.  I see no 8a83665a anywhere -- most likely it's
> what you got after cherry-picking to 4.9.

Yes, that's exactly where it went wrong, thanks.
Austin S. Hemmelgarn July 21, 2017, 3:50 p.m. UTC | #3
On 2017-07-21 07:47, Hans van Kranenburg wrote:
> [...]
> 
> Signed-off-by: Hans van Kranenburg <hans.van.kranenburg@mendix.com>
Behaves as advertised, and I'm not seeing any issues in my testing (and 
I can confirm from experience that this logic slows things down, I've 
been forcing '-o nossd' on my systems for a while now for the 
performance improvement), so you can add:
Tested-by: Austin S. Hemmelgarn <ahferroin7@gmail.com>

> ---
>   fs/btrfs/ctree.h            |  4 ++--
>   fs/btrfs/disk-io.c          |  6 ++----
>   fs/btrfs/extent-tree.c      | 11 ++++++-----
>   fs/btrfs/free-space-cache.c | 11 ++++-------
>   fs/btrfs/super.c            | 16 +++++++++-------
>   5 files changed, 23 insertions(+), 25 deletions(-)
> 
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index 4f8f75d9e839..b091dd3f5b38 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -472,7 +472,7 @@ struct btrfs_block_rsv {
>   /*
>    * free clusters are used to claim free space in relatively large chunks,
>    * allowing us to do less seeky writes.  They are used for all metadata
> - * allocations and data allocations in ssd mode.
> + * allocations and data allocations in ssd_spread mode.
>    */
>   struct btrfs_free_cluster {
>   	spinlock_t lock;
> @@ -976,7 +976,7 @@ struct btrfs_fs_info {
>   
>   	struct reloc_control *reloc_ctl;
>   
> -	/* data_alloc_cluster is only used in ssd mode */
> +	/* data_alloc_cluster is only used in ssd_spread mode */
>   	struct btrfs_free_cluster data_alloc_cluster;
>   
>   	/* all metadata allocations go through this cluster */
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index 5f678dcb20e6..97ede2de24a3 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -3061,11 +3061,9 @@ int open_ctree(struct super_block *sb,
>   	if (IS_ERR(fs_info->transaction_kthread))
>   		goto fail_cleaner;
>   
> -	if (!btrfs_test_opt(fs_info, SSD) &&
> -	    !btrfs_test_opt(fs_info, NOSSD) &&
> +	if (!btrfs_test_opt(fs_info, NOSSD) &&
>   	    !fs_info->fs_devices->rotating) {
> -		btrfs_info(fs_info, "detected SSD devices, enabling SSD mode");
> -		btrfs_set_opt(fs_info->mount_opt, SSD);
> +		btrfs_set_and_info(fs_info, SSD, "enabling ssd optimizations");
>   	}
>   
>   	/*
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 33d979e9ea2a..cb1855fede41 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -6609,19 +6609,20 @@ fetch_cluster_info(struct btrfs_fs_info *fs_info,
>   		   struct btrfs_space_info *space_info, u64 *empty_cluster)
>   {
>   	struct btrfs_free_cluster *ret = NULL;
> -	bool ssd = btrfs_test_opt(fs_info, SSD);
>   
>   	*empty_cluster = 0;
>   	if (btrfs_mixed_space_info(space_info))
>   		return ret;
>   
> -	if (ssd)
> -		*empty_cluster = SZ_2M;
>   	if (space_info->flags & BTRFS_BLOCK_GROUP_METADATA) {
>   		ret = &fs_info->meta_alloc_cluster;
> -		if (!ssd)
> +		if (btrfs_test_opt(fs_info, SSD))
> +			*empty_cluster = SZ_2M;
> +		else
>   			*empty_cluster = SZ_64K;
> -	} else if ((space_info->flags & BTRFS_BLOCK_GROUP_DATA) && ssd) {
> +	} else if ((space_info->flags & BTRFS_BLOCK_GROUP_DATA) &&
> +		   btrfs_test_opt(fs_info, SSD_SPREAD)) {
> +		*empty_cluster = SZ_2M;
>   		ret = &fs_info->data_alloc_cluster;
>   	}
>   
> diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
> index c5e6180cdb8c..8ebbb182ba94 100644
> --- a/fs/btrfs/free-space-cache.c
> +++ b/fs/btrfs/free-space-cache.c
> @@ -3034,14 +3034,11 @@ int btrfs_find_space_cluster(struct btrfs_fs_info *fs_info,
>   	int ret;
>   
>   	/*
> -	 * Choose the minimum extent size we'll require for this
> -	 * cluster.  For SSD_SPREAD, don't allow any fragmentation.
> -	 * For metadata, allow allocates with smaller extents.  For
> -	 * data, keep it dense.
> +	 * Choose the minimum extent size we'll require for this cluster.  For
> +	 * metadata, allow allocates with smaller extents.  For data, keep it
> +	 * dense.
>   	 */
> -	if (btrfs_test_opt(fs_info, SSD_SPREAD)) {
> -		cont1_bytes = min_bytes = bytes + empty_size;
> -	} else if (block_group->flags & BTRFS_BLOCK_GROUP_METADATA) {
> +	if (block_group->flags & BTRFS_BLOCK_GROUP_METADATA) {
>   		cont1_bytes = bytes;
>   		min_bytes = fs_info->sectorsize;
>   	} else {
> diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
> index 4f1cdd5058f1..01ebc93f994a 100644
> --- a/fs/btrfs/super.c
> +++ b/fs/btrfs/super.c
> @@ -548,20 +548,22 @@ int btrfs_parse_options(struct btrfs_fs_info *info, char *options,
>   			break;
>   		case Opt_ssd:
>   			btrfs_set_and_info(info, SSD,
> -					   "use ssd allocation scheme");
> +					   "enabling ssd optimizations");
>   			btrfs_clear_opt(info->mount_opt, NOSSD);
>   			break;
>   		case Opt_ssd_spread:
> +			btrfs_set_and_info(info, SSD,
> +					   "enabling ssd optimizations");
>   			btrfs_set_and_info(info, SSD_SPREAD,
> -					   "use spread ssd allocation scheme");
> -			btrfs_set_opt(info->mount_opt, SSD);
> +					   "using legacy ssd allocation scheme");
>   			btrfs_clear_opt(info->mount_opt, NOSSD);
>   			break;
>   		case Opt_nossd:
> -			btrfs_set_and_info(info, NOSSD,
> -					     "not using ssd allocation scheme");
> -			btrfs_clear_opt(info->mount_opt, SSD);
> -			btrfs_clear_opt(info->mount_opt, SSD_SPREAD);
> +			btrfs_set_opt(info->mount_opt, NOSSD);
> +			btrfs_clear_and_info(info, SSD,
> +					     "not using ssd optimizations");
> +			btrfs_clear_and_info(info, SSD_SPREAD,
> +					     "not using legacy ssd allocation scheme");
>   			break;
>   		case Opt_barrier:
>   			btrfs_clear_and_info(info, NOBARRIER,
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Hans van Kranenburg July 21, 2017, 11:21 p.m. UTC | #4
On 07/21/2017 05:50 PM, Austin S. Hemmelgarn wrote:
> On 2017-07-21 07:47, Hans van Kranenburg wrote:
>> [...]
>>
>> Signed-off-by: Hans van Kranenburg <hans.van.kranenburg@mendix.com>
> Behaves as advertised, and I'm not seeing any issues in my testing (and
> I can confirm from experience that this logic slows things down, I've
> been forcing '-o nossd' on my systems for a while now for the
> performance improvement), so you can add:
> Tested-by: Austin S. Hemmelgarn <ahferroin7@gmail.com>

Thanks for testing!

Would you mind sharing a bit about what kind of test this is, and how
you measure the difference?

Does it mean that you're throwing a lot of data at the disk, and that
you can measure that 'cluster' mode of figuring out where to put the
writes has to spend more effort than the 'tetris' mode, and that it
limits the write speed?

The only relevant difference I was aware of until now is just the
placement of the writes.
Austin S. Hemmelgarn July 24, 2017, 11:23 a.m. UTC | #5
On 2017-07-21 19:21, Hans van Kranenburg wrote:
> On 07/21/2017 05:50 PM, Austin S. Hemmelgarn wrote:
>> On 2017-07-21 07:47, Hans van Kranenburg wrote:
>>> [...]
>>>
>>> Signed-off-by: Hans van Kranenburg <hans.van.kranenburg@mendix.com>
>> Behaves as advertised, and I'm not seeing any issues in my testing (and
>> I can confirm from experience that this logic slows things down, I've
>> been forcing '-o nossd' on my systems for a while now for the
>> performance improvement), so you can add:
>> Tested-by: Austin S. Hemmelgarn <ahferroin7@gmail.com>
> 
> Thanks for testing!
> 
> Would you mind sharing a bit about what kind of test this is, and how
> you measure the difference?
> 
> Does it mean that you're throwing a lot of data at the disk, and that
> you can measure that 'cluster' mode of figuring out where to put the
> writes has to spend more effort than the 'tetris' mode, and that it
> limits the write speed?
> 
> The only relevant difference I was aware of until now is just the
> placement of the writes.
> 

For the generic testing, I've got a (mostly) automated testing system
setup on my home server that builds a kernel with the patches applied
and then runs xfstests and a handful of other tests I've got myself
(I really need to get around to pushing those upstream...).  At the
moment, it tests using 32 and 64-bit x86, ARMv7a, ARMv6stj-z, ARMv8
(whatever QEMU's default revision is), PPC64 (big and little endian),
and 32-bit MIPS (new ABI only, also big and little endian).

For the specific case here, back when I set up my systems to use 'nossd'
3 months ago, I specifically set up two of them that are otherwise
identical to test if there was any difference.  The systems in question
are a pair of Intel NUC's with Pentium N3700 CPU's (4 cores, no HT, 1.6GHz
normal, 2.4GHz boost), 4G of DDR3L-1600 RAM, and a 250GB Crucial MX200 SSD.
After 3 months, the one which has run with 'nossd' set in the mount options
is getting roughly 1-3% better write performance, equivalent read performance
(both measured with fio and iozone), and completes discard operations almost
twice as fast (based on timing `fstrim` with the same amount of space free on
both filesystems).
David Sterba July 24, 2017, 2:25 p.m. UTC | #6
On Fri, Jul 21, 2017 at 01:47:11PM +0200, Hans van Kranenburg wrote:
> [...]
> 
> Signed-off-by: Hans van Kranenburg <hans.van.kranenburg@mendix.com>

Thanks for the extensive historical summary, this change really deserves
it.

Decoupling the assumptions about the device's block management is really
a good thing, mount option 'ssd' should mean that the device just has
cheap seeks. Moving the allocation tweaks to ssd_spread provides a
way to keep the behaviour for anybody who wants it.

I'd like to push this change to 4.13-rc3, as I don't think we need more
time to let other users test this. The effects of the current ssd
implementation have been debated and debugged on IRC for a long time.

Reviewed-by: David Sterba <dsterba@suse.com>
Hans van Kranenburg July 24, 2017, 5:22 p.m. UTC | #7
On 07/24/2017 04:25 PM, David Sterba wrote:
> On Fri, Jul 21, 2017 at 01:47:11PM +0200, Hans van Kranenburg wrote:
>> [...]
>>
>>     So what now...?
>>
>> The changes in here do the following:
>>
>> 1. Throw out the current ssd_spread behaviour.
>> 2. Move the current ssd behaviour to the ssd_spread option.
>> 3. Make ssd mode data allocation identical to tetris mode, like nossd.
>> 4. Adjust and clean up filesystem mount messages so that we can easily
>> identify if a kernel has this patch applied or not, when providing
>> support to end users.
>>
>> Instead of directly cutting out all code related to the data cluster, it
>> makes sense to take a gradual approach and allow users who are still
>> able to find a valid reason to prefer the current ssd mode the means to
>> do so by specifiying the additional ssd_spread option.
>>
>> Since there are other uses of the ssd mode, we keep the difference
>> between nossd and ssd mode. However, the usage of the rotational
>> attribute warrants some reconsideration in the future.
>>
>> [...]
>> Signed-off-by: Hans van Kranenburg <hans.van.kranenburg@mendix.com>
> 
> Thanks for the extensive historical summary, this change really deserves
> it.
> 
> Decoupling the assumptions about the device's block management is really
> a good thing, mount option 'ssd' should mean that the device just has
> cheap seeks. Moving the allocation tweaks to ssd_spread provides a
> way to keep the behaviour for anybody who wants it.
> 
> I'd like to push this change to 4.13-rc3, as I don't think we need more
> time to let other users test this. The effects of the current ssd
> implementation have been debated and debugged on IRC for a long time.
> 
> Reviewed-by: David Sterba <dsterba@suse.com>
> 

Ok, thanks. I'll send out a v2 tonight with the typo / commit id fixes
etc from the other feedback.
David Sterba July 24, 2017, 5:52 p.m. UTC | #8
On Mon, Jul 24, 2017 at 07:22:03PM +0200, Hans van Kranenburg wrote:
> On 07/24/2017 04:25 PM, David Sterba wrote:
> > On Fri, Jul 21, 2017 at 01:47:11PM +0200, Hans van Kranenburg wrote:
> >> [...]
> >>
> >>     So what now...?
> >>
> >> The changes in here do the following:
> >>
> >> 1. Throw out the current ssd_spread behaviour.
> >> 2. Move the current ssd behaviour to the ssd_spread option.
> >> 3. Make ssd mode data allocation identical to tetris mode, like nossd.
> >> 4. Adjust and clean up filesystem mount messages so that we can easily
> >> identify if a kernel has this patch applied or not, when providing
> >> support to end users.
> >>
> >> Instead of directly cutting out all code related to the data cluster, it
> >> makes sense to take a gradual approach and allow users who are still
> >> able to find a valid reason to prefer the current ssd mode the means to
> >> do so by specifying the additional ssd_spread option.
> >>
> >> Since there are other uses of the ssd mode, we keep the difference
> >> between nossd and ssd mode. However, the usage of the rotational
> >> attribute warrants some reconsideration in the future.
> >>
> >> [...]
> >> Signed-off-by: Hans van Kranenburg <hans.van.kranenburg@mendix.com>
> > 
> > Thanks for the extensive historical summary, this change really deserves
> > it.
> > 
> > Decoupling the assumptions about the device's block management is really
> > a good thing, mount option 'ssd' should mean that the device just has
> > cheap seeks. Moving the allocation tweaks to ssd_spread provides a
> > way to keep the behaviour for anybody who wants it.
> > 
> > I'd like to push this change to 4.13-rc3, as I don't think we need more
> > time to let other users test this. The effects of the current ssd
> > implementation have been debated and debugged on IRC for a long time.
> > 
> > Reviewed-by: David Sterba <dsterba@suse.com>
> > 
> 
> Ok, thanks. I'll send out a v2 tonight with the typo / commit id fixes
> etc from the other feedback.

I've fixed them already.
Hans van Kranenburg July 24, 2017, 5:56 p.m. UTC | #9
On 07/24/2017 07:52 PM, David Sterba wrote:
> On Mon, Jul 24, 2017 at 07:22:03PM +0200, Hans van Kranenburg wrote:
>> On 07/24/2017 04:25 PM, David Sterba wrote:
>>> On Fri, Jul 21, 2017 at 01:47:11PM +0200, Hans van Kranenburg wrote:
>>>> [...]
>>>>
>>>>     So what now...?
>>>>
>>>> The changes in here do the following:
>>>>
>>>> 1. Throw out the current ssd_spread behaviour.
>>>> 2. Move the current ssd behaviour to the ssd_spread option.
>>>> 3. Make ssd mode data allocation identical to tetris mode, like nossd.
>>>> 4. Adjust and clean up filesystem mount messages so that we can easily
>>>> identify if a kernel has this patch applied or not, when providing
>>>> support to end users.
>>>>
>>>> Instead of directly cutting out all code related to the data cluster, it
>>>> makes sense to take a gradual approach and allow users who are still
>>>> able to find a valid reason to prefer the current ssd mode the means to
>>>> do so by specifying the additional ssd_spread option.
>>>>
>>>> Since there are other uses of the ssd mode, we keep the difference
>>>> between nossd and ssd mode. However, the usage of the rotational
>>>> attribute warrants some reconsideration in the future.
>>>>
>>>> [...]
>>>> Signed-off-by: Hans van Kranenburg <hans.van.kranenburg@mendix.com>
>>>
>>> Thanks for the extensive historical summary, this change really deserves
>>> it.
>>>
>>> Decoupling the assumptions about the device's block management is really
>>> a good thing, mount option 'ssd' should mean that the device just has
>>> cheap seeks. Moving the allocation tweaks to ssd_spread provides a
>>> way to keep the behaviour for anybody who wants it.
>>>
>>> I'd like to push this change to 4.13-rc3, as I don't think we need more
>>> time to let other users test this. The effects of the current ssd
>>> implementation have been debated and debugged on IRC for a long time.
>>>
>>> Reviewed-by: David Sterba <dsterba@suse.com>
>>>
>>
>> Ok, thanks. I'll send out a v2 tonight with the typo / commit id fixes
>> etc from the other feedback.
> 
> I've fixed them already.

Oh, sure, fine. The additional typo I found was s/condered/considered/
Chris Mason July 24, 2017, 6:01 p.m. UTC | #10
On 07/24/2017 10:25 AM, David Sterba wrote:

> Thanks for the extensive historical summary, this change really deserves
> it.
> 
> Decoupling the assumptions about the device's block management is really
> a good thing, mount option 'ssd' should mean that the device just has
> cheap seeks. Moving the allocation tweaks to ssd_spread provides a
> way to keep the behaviour for anybody who wants it.
> 
> I'd like to push this change to 4.13-rc3, as I don't think we need more
> time to let other users test this. The effects of the current ssd
> implementation have been debated and debugged on IRC for a long time.
> 

The description is great, but I'd love to see many more benchmarks.  At 
Facebook we use the current ssd_spread mode in production on top of 
hardware raid5/6 (spinning storage) because it dramatically reduces the 
read/modify/write cycles done for metadata writes.

If we're going to play around with these, we need a good way to measure 
free space fragmentation as part of benchmarks, as well as the IO 
patterns coming out of the allocator.

At least for our uses, ssd_spread matters much more for metadata than 
data (the data writes are large and metadata is small).

Thanks again,
Chris
David Sterba July 24, 2017, 6:41 p.m. UTC | #11
On Mon, Jul 24, 2017 at 02:01:07PM -0400, Chris Mason wrote:
> On 07/24/2017 10:25 AM, David Sterba wrote:
> 
> > Thanks for the extensive historical summary, this change really deserves
> > it.
> > 
> > Decoupling the assumptions about the device's block management is really
> > a good thing, mount option 'ssd' should mean that the device just has
> > cheap seeks. Moving the allocation tweaks to ssd_spread provides a
> > way to keep the behaviour for anybody who wants it.
> > 
> > I'd like to push this change to 4.13-rc3, as I don't think we need more
> > time to let other users test this. The effects of the current ssd
> > implementation have been debated and debugged on IRC for a long time.
> 
> The description is great, but I'd love to see many more benchmarks.  At 
> Facebook we use the current ssd_spread mode in production on top of 
> hardware raid5/6 (spinning storage) because it dramatically reduces the 
> read/modify/write cycles done for metadata writes.

Well, I think this is an example that ssd got misused because of the
side effects of the allocation. If you observe good patterns for raid5,
then the allocator should be adapted for that case, otherwise
ssd/ssd_spread should be independent of the raid level.

> If we're going to play around with these, we need a good way to measure 
> free space fragmentation as part of benchmarks, as well as the IO 
> patterns coming out of the allocator.

Hans has a tool that visualizes the fragmentation. Most complaints I've
seen were about 'ssd' itself, excessive fragmentation, early ENOSPC. Not
many people use ssd_spread, 'ssd' gets turned on automatically so it has
much wider impact.

> At least for our uses, ssd_spread matters much more for metadata than 
> data (the data writes are large and metadata is small).

From the changes overview:

> 1. Throw out the current ssd_spread behaviour.

would it be ok for you to keep ssd_spread working as before?

I'd really like to get this patch merged soon because "do not use ssd
mode for ssd" has started to be the recommended workaround. Once this
sticks, we won't need to have any ssd mode anymore ...
Chris Mason July 24, 2017, 6:53 p.m. UTC | #12
On 07/24/2017 02:41 PM, David Sterba wrote:
> On Mon, Jul 24, 2017 at 02:01:07PM -0400, Chris Mason wrote:
>> On 07/24/2017 10:25 AM, David Sterba wrote:
>>
>>> Thanks for the extensive historical summary, this change really deserves
>>> it.
>>>
>>> Decoupling the assumptions about the device's block management is really
>>> a good thing, mount option 'ssd' should mean that the device just has
>>> cheap seeks. Moving the allocation tweaks to ssd_spread provides a
>>> way to keep the behaviour for anybody who wants it.
>>>
>>> I'd like to push this change to 4.13-rc3, as I don't think we need more
>>> time to let other users test this. The effects of the current ssd
>>> implementation have been debated and debugged on IRC for a long time.
>>
>> The description is great, but I'd love to see many more benchmarks.  At
>> Facebook we use the current ssd_spread mode in production on top of
>> hardware raid5/6 (spinning storage) because it dramatically reduces the
>> read/modify/write cycles done for metadata writes.
> 
> Well, I think this is an example that ssd got misused because of the
> side effects of the allocation. If you observe good patterns for raid5,
> then the allocator should be adapted for that case, otherwise
> ssd/ssd_spread should be independent of the raid level.

Absolutely.  The optimizations that made ssd_spread useful for first 
generation flash are the same things that raid5/6 need.  Big writes, or 
said differently a minimum size for fast writes.

> 
>> If we're going to play around with these, we need a good way to measure
>> free space fragmentation as part of benchmarks, as well as the IO
>> patterns coming out of the allocator.
> 
> Hans has a tool that visualizes the fragmentation. Most complaints I've
> seen were about 'ssd' itself, excessive fragmentation, early ENOSPC. Not
> many people use ssd_spread, 'ssd' gets turned on automatically so it has
> much wider impact.
> 
>> At least for our uses, ssd_spread matters much more for metadata than
>> data (the data writes are large and metadata is small).
> 
>  From the changes overview:
> 
>> 1. Throw out the current ssd_spread behaviour.
> 
> would it be ok for you to keep ssd_spread working as before?
> 
> I'd really like to get this patch merged soon because "do not use ssd
> mode for ssd" has started to be the recommended workaround. Once this
> sticks, we won't need to have any ssd mode anymore ...

Works for me.  I do want to make sure that commits in this area include 
the workload they were targeting, how they were measured and what 
impacts they had.  That way when we go back to try and change this again 
we'll understand what profiles we want to preserve.

-chris
Austin S. Hemmelgarn July 24, 2017, 7:06 p.m. UTC | #13
On 2017-07-24 14:53, Chris Mason wrote:
> On 07/24/2017 02:41 PM, David Sterba wrote:
>>
>> would it be ok for you to keep ssd_spread working as before?
>>
>> I'd really like to get this patch merged soon because "do not use ssd
>> mode for ssd" has started to be the recommended workaround. Once this
>> sticks, we won't need to have any ssd mode anymore ...
> 
> Works for me.  I do want to make sure that commits in this area include 
> the workload they were targeting, how they were measured and what 
> impacts they had.  That way when we go back to try and change this again 
> we'll understand what profiles we want to preserve.
Just thinking long term here, but might it make sense to (eventually) 
allow the user to tune how big a chunk of space to look for?  I know 
that ext* have options to do this kind of thing, and I think XFS does 
too (but they do it at filesystem creation time), and I do know people 
who make use of those to make sure things are working at their absolute 
best.
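Such a knob could plausibly take a size with a unit suffix, the way ext*'s tuning options do. As a purely illustrative sketch (a hypothetical cluster=<size> option; nothing like it exists in btrfs today), parsing it would look like:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Parse a size with an optional K/M/G suffix, as a hypothetical
 * cluster=<size> mount option might accept. Returns 0 on bad input. */
static uint64_t parse_cluster_size(const char *s)
{
    char *end;
    uint64_t val = strtoull(s, &end, 10);

    if (end == s)
        return 0;
    switch (*end) {
    case 'G': case 'g': val <<= 30; end++; break;
    case 'M': case 'm': val <<= 20; end++; break;
    case 'K': case 'k': val <<= 10; end++; break;
    }
    /* reject trailing garbage after the number/suffix */
    return *end == '\0' ? val : 0;
}
```

The kernel already has a helper in this spirit (memparse) used by other mount option parsers, so plumbing such a value through would mostly be an on-disk/interface design question, as noted below.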
Chris Mason July 24, 2017, 7:18 p.m. UTC | #14
On 07/24/2017 03:06 PM, Austin S. Hemmelgarn wrote:
> On 2017-07-24 14:53, Chris Mason wrote:
>> On 07/24/2017 02:41 PM, David Sterba wrote:
>>>
>>> would it be ok for you to keep ssd_spread working as before?
>>>
>>> I'd really like to get this patch merged soon because "do not use ssd
>>> mode for ssd" has started to be the recommended workaround. Once this
>>> sticks, we won't need to have any ssd mode anymore ...
>>
>> Works for me.  I do want to make sure that commits in this area 
>> include the workload they were targeting, how they were measured and 
>> what impacts they had.  That way when we go back to try and change 
>> this again we'll understand what profiles we want to preserve.
> Just thinking long term here, but might it make sense to (eventually) 
> allow the user to tune how big a chunk of space to look for?  I know 
> that ext* have options to do this kind of thing, and I think XFS does 
> too (but they do it at filesystem creation time), and I do know people 
> who make use of those to make sure things are working at their absolute 
> best.

Agreed, we have space in the on-disk format to record those preferences, 
but we haven't done it in the past.

-chris
Hans van Kranenburg July 24, 2017, 10:42 p.m. UTC | #15
Hi Chris,

On 07/24/2017 08:53 PM, Chris Mason wrote:
> On 07/24/2017 02:41 PM, David Sterba wrote:
>> On Mon, Jul 24, 2017 at 02:01:07PM -0400, Chris Mason wrote:
>>> On 07/24/2017 10:25 AM, David Sterba wrote:
>>>
>>>> Thanks for the extensive historical summary, this change really
>>>> deserves
>>>> it.
>>>>
>>>> Decoupling the assumptions about the device's block management is
>>>> really
>>>> a good thing, mount option 'ssd' should mean that the device just has
>>>> cheap seeks. Moving the allocation tweaks to ssd_spread provides a
>>>> way to keep the behaviour for anybody who wants it.
>>>>
>>>> I'd like to push this change to 4.13-rc3, as I don't think we need more
>>>> time to let other users test this. The effects of the current ssd
>>>> implementation have been debated and debugged on IRC for a long time.
>>>
>>> The description is great, but I'd love to see many more benchmarks.

Before starting a visual guide through a subset of benchmark-ish
material that I gathered in the last months, I would first like to point
at a different kind of benchmark, which is not a technical one. It would
be about measuring the end user adoption of the btrfs file system.

There are weeks in which no day goes by on IRC without a user joining
the channel in tears, asking "why is btrfs telling me my disk is full
when df says it's only 50% used"? Every time the IRC tribe has to do the
whole story about chunks, allocated space, used space, balance, the user
understands it a bit, or the user might give up, being confused why all
of this is necessary and why btrfs cannot just manage its own space a
bit, etc. For the last few weeks the debugging sessions have become a
lot shorter. "Ok, remount -o nossd, now try balance again, yay now it
works without "another xyz enospc during balance" errors, and keep it in
your fstab, kthxbye."

These are the users that know IRC or try the mailinglist. Others just
rage quit btrfs again accompanied with some not so nice words in a
twitter message.

This patch solves a significant part of this bad 'out of the box' behaviour.

This behaviour (fully allocating raw space and then crashing) has also
been the main reason at work to only keep btrfs for a few specific use
cases, in a tightly controlled environment with extra monitoring on it.
With the -o nossd behaviour, I can start doing much more fun things with
it now, not having to walk around all day, using balance to clean up.

>>> At
>>> Facebook we use the current ssd_spread mode in production on top of
>>> hardware raid5/6 (spinning storage) because it dramatically reduces the
>>> read/modify/write cycles done for metadata writes.
>>
>> Well, I think this is an example that ssd got misused because of the
>> side effects of the allocation. If you observe good patterns for raid5,
>> then the allocator should be adapted for that case, otherwise
>> ssd/ssd_spread should be independent of the raid level.

If I have an iSCSI lun on a NetApp filer with rotating disks, which is
seek-free for random writes (they go to NVRAM and then to WAFL), but not
for random reads, does that make it an ssd, or not? (don't answer, it's
a useless question)

When writing this patch, I deliberately went for a scenario with minimal
impact in the code, but maximum impact in default behaviour for the end
user.

> Absolutely.  The optimizations that made ssd_spread useful for first
> generation flash are the same things that raid5/6 need.  Big writes, or
> said differently a minimum size for fast writes.
> 
>>
>>> If we're going to play around with these, we need a good way to measure
>>> free space fragmentation as part of benchmarks, as well as the IO
>>> patterns coming out of the allocator.

Yes! First a collection of use cases, then some reproducible
simulations, then measurements and timelapse pictures... Interesting.

By the way. In the last weeks I've been trying to debug excessive
metadata write IO behaviour of a large filesystem.

(from show_metadata_tree_sizes.py output:)
EXTENT_TREE  15.64GiB  0(1021576) 1(3692) 2(16) 3(1)

So far the results only point in the direction of btrfs meaning
butterfly effects, since I cannot seem to reliably cause things to
happen. However, I'll start a separate mail thread about the findings so
far, and some helper code for measuring things. Let's not do that in
here. This was about the data part at first.

>> Hans has a tool that visualizes the fragmentation. Most complaints I've
>> seen were about 'ssd' itself, excessive fragmentation, early ENOSPC. Not
>> many people use ssd_spread, 'ssd' gets turned on automatically so it has
>> much wider impact.
>>
>>> At least for our uses, ssd_spread matters much more for metadata than
>>> data (the data writes are large and metadata is small).

Exactly. Metadata. But that's a completely different story from data.
Actually, thanks for the idea, I'm gonna try a test with this later...
'nossd' data and 'ssd_spread' metadata.

>>  From the changes overview:
>>
>>> 1. Throw out the current ssd_spread behaviour.
>>
>> would it be ok for you to keep ssd_working as before?
>>
>> I'd really like to get this patch merged soon because "do not use ssd
>> mode for ssd" has started to be the recommended workaround. Once this
>> sticks, we won't need to have any ssd mode anymore ...

E.g. Debian Stretch has just been released with a 4.9 kernel. So, that
workaround will continue to be used for quite a while I guess...

> Works for me.  I do want to make sure that commits in this area include
> the workload they were targeting, how they were measured and what
> impacts they had.

For this one, it simply is... Btrfs default options that are supposed to
optimize things for an ssd must not exhibit behaviour that causes the
ssd to actually wear out faster.

> That way when we go back to try and change this again
> we'll understand what profiles we want to preserve.

Ok, there we go.

1. Made just after I finished chunk-level btrfs-heatmap. This shows a
continuous fight, using balance, against an -o ssd filesystem. 1 picture
per day, a 2.5TiB or so fs.

https://www.youtube.com/watch?v=Qj1lxAasytc

2. The behaviour of -o ssd on a filesystem with a postfix mailserver,
mailman mailinglist server and mail logging in /var/log. Data-extent
level btrfs-heatmap timelapse, 4 pictures per hour, 4 block groups with
the highest vaddr:

https://syrinx.knorrie.org/~knorrie/btrfs/keep/2017-01-19-noautodefrag-ichiban.mp4

3. The behaviour of -o nossd on the same filesystem as 2:

https://syrinx.knorrie.org/~knorrie/btrfs/keep/2017-04-08-ichiban-walk-nossd.mp4

4. Again, same fs as in 2 and 3. Guess at which point I switched from -o
ssd (automatically chosen, it's not even an ssd at all) to adding an
explicit -o nossd...

https://syrinx.knorrie.org/~knorrie/btrfs/keep/2017-07-19-btrfs_usage_ssd_vs_nossd.png

(When it goes down, that's using balance)

5. A video of a ~40TiB filesystem that switches from -o ssd (no, it's
not an ssd, it was automatically chosen for me) to explicit -o nossd at
some point. I think you should be able to guess when exactly this
happens. 1 picture per day, ordered on virtual address space.

https://syrinx.knorrie.org/~knorrie/btrfs/keep/2017-ssd-to-nossd.mp4

6. The effect of -o ssd on the filesystem in 5. At this point it was
already impossible to keep the allocated space down with balance.

https://syrinx.knorrie.org/~knorrie/btrfs/keep/2017-04-08-backups-16-Q23.png

7. A collection of pictures of what block groups look like after they've
been abused by -o ssd:

https://syrinx.knorrie.org/~knorrie/btrfs/fragmented/

These pictures are from the filesystem in 5. After going nossd, I wrote
a custom balance script that operates on free space fragmentation level,
and started cleaning up the mess, worst fragmented first. It turned out
that, for the fragmentation number it returns for a data block group
(the second number in the file name), > 200 was quite bad and < 200
was not that bad.
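The thread doesn't spell out how that fragmentation number is computed, but as an illustration (an assumed metric, not necessarily what the script actually does), a block group's free-space extent list can be boiled down to one score like this:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative fragmentation score for one block group: 0 when all free
 * space sits in a single extent, growing with both the number of free
 * fragments and the share of free space outside the largest fragment. */
static unsigned int frag_score(const uint64_t *free_ext, size_t n)
{
    uint64_t total = 0, largest = 0;

    for (size_t i = 0; i < n; i++) {
        total += free_ext[i];
        if (free_ext[i] > largest)
            largest = free_ext[i];
    }
    if (total == 0)
        return 0;
    /* n fragments, weighted by the percentage of free space that is
     * not part of the largest extent */
    return (unsigned int)(n * ((total - largest) * 100 / total));
}
```

With some threshold (like the ~200 mentioned above), a balance script can then feed the worst-scoring block groups to balance first.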

8. Same timespan as the timelapse in 5, displayed as totals.

https://syrinx.knorrie.org/~knorrie/btrfs/keep/2017-07-25-backups-17-Q23.png

All of the examples and more details about them can be found in mailing
list threads from the last months.

Have fun,
David Sterba July 26, 2017, 1:27 p.m. UTC | #16
On Mon, Jul 24, 2017 at 02:53:52PM -0400, Chris Mason wrote:
> On 07/24/2017 02:41 PM, David Sterba wrote:
> > On Mon, Jul 24, 2017 at 02:01:07PM -0400, Chris Mason wrote:
> >> On 07/24/2017 10:25 AM, David Sterba wrote:
> >>
> >>> Thanks for the extensive historical summary, this change really deserves
> >>> it.
> >>>
> >>> Decoupling the assumptions about the device's block management is really
> >>> a good thing, mount option 'ssd' should mean that the device just has
> >>> cheap seeks. Moving the allocation tweaks to ssd_spread provides a
> >>> way to keep the behaviour for anybody who wants it.
> >>>
> >>> I'd like to push this change to 4.13-rc3, as I don't think we need more
> >>> time to let other users test this. The effects of the current ssd
> >>> implementation have been debated and debugged on IRC for a long time.
> >>
> >> The description is great, but I'd love to see many more benchmarks.  At
> >> Facebook we use the current ssd_spread mode in production on top of
> >> hardware raid5/6 (spinning storage) because it dramatically reduces the
> >> read/modify/write cycles done for metadata writes.
> > 
> > Well, I think this is an example that ssd got misused because of the
> > side effects of the allocation. If you observe good patterns for raid5,
> > then the allocator should be adapted for that case, otherwise
> > ssd/ssd_spread should be independent of the raid level.
> 
> Absolutely.  The optimizations that made ssd_spread useful for first 
> generation flash are the same things that raid5/6 need.  Big writes, or 
> said differently a minimum size for fast writes.

Actually, you can do the alignments automatically if the block group is
raid56, and not rely on ssd_spread. This should be equivalent in
function, but a bit cleaner from the code and interface side.
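A minimal sketch of that idea (names and interface are made up, not the kernel's): derive the write-size floor from the raid geometry instead of from the ssd flags, rounding allocations up to a full stripe so the parity never needs a read-modify-write cycle:

```c
#include <assert.h>
#include <stdint.h>

/* Round an allocation size up to a full raid5/6 stripe: stripe_len
 * bytes per device times the number of data stripes (devices minus
 * parity devices). A write of exactly this granularity replaces the
 * whole stripe, so parity can be computed without reading old data. */
static uint64_t round_to_full_stripe(uint64_t bytes, uint64_t stripe_len,
                                     unsigned int nr_data_stripes)
{
    uint64_t full_stripe = stripe_len * nr_data_stripes;

    if (full_stripe == 0)
        return bytes;
    return (bytes + full_stripe - 1) / full_stripe * full_stripe;
}
```

For example, on a 4-device raid5 with 64KiB stripes (3 data stripes), any allocation would be rounded up to a multiple of 192KiB, regardless of whether ssd or ssd_spread is set.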

> >> If we're going to play around with these, we need a good way to measure
> >> free space fragmentation as part of benchmarks, as well as the IO
> >> patterns coming out of the allocator.
> > 
> > Hans has a tool that visualizes the fragmentation. Most complaints I've
> > seen were about 'ssd' itself, excessive fragmentation, early ENOSPC. Not
> > many people use ssd_spread, 'ssd' gets turned on automatically so it has
> > much wider impact.
> > 
> >> At least for our uses, ssd_spread matters much more for metadata than
> >> data (the data writes are large and metadata is small).
> > 
> >  From the changes overview:
> > 
> >> 1. Throw out the current ssd_spread behaviour.
> > 
> > would it be ok for you to keep ssd_working as before?
> > 
> > I'd really like to get this patch merged soon because "do not use ssd
> > mode for ssd" has started to be the recommended workaround. Once this
> > sticks, we won't need to have any ssd mode anymore ...
> 
> Works for me.  I do want to make sure that commits in this area include 
> the workload they were targeting, how they were measured and what 
> impacts they had.  That way when we go back to try and change this again 
> we'll understand what profiles we want to preserve.

So there are at least two ways how to look at the change:

performance - ssd + alignments gives better results under some
conditions, especially when there's enough space; a rough summary of my
recent measurements with dbench, a mailserver workload, and fio jobs
with random read/write

early ENOSPC (user experience) - i.e. when a fragmented filesystem can't
satisfy an allocation due to the constraints, although it would be
possible without the alignments; an aged filesystem, near-full

The target here is not a particular workload or profile, but the
behaviour in the near-full conditions, on a filesystem that's likely
fragmented and aged, mixed workload. The patch should fix it in a way
that will make it work at all. There will be some performance impact,
hard to measure or predict under the conditions.

The reports where nossd fixed the ENOSPC problem IMO validate the
change. We don't have performance characteristics attached to them, but
I think those aren't relevant when comparing 'balance finished in a few
hours' to 'balance failed too early, what next'.

Technically, this is an ENOSPC fix, and it took quite some time to
identify the cause.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Patch

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 4f8f75d9e839..b091dd3f5b38 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -472,7 +472,7 @@  struct btrfs_block_rsv {
 /*
  * free clusters are used to claim free space in relatively large chunks,
  * allowing us to do less seeky writes.  They are used for all metadata
- * allocations and data allocations in ssd mode.
+ * allocations and data allocations in ssd_spread mode.
  */
 struct btrfs_free_cluster {
 	spinlock_t lock;
@@ -976,7 +976,7 @@  struct btrfs_fs_info {
 
 	struct reloc_control *reloc_ctl;
 
-	/* data_alloc_cluster is only used in ssd mode */
+	/* data_alloc_cluster is only used in ssd_spread mode */
 	struct btrfs_free_cluster data_alloc_cluster;
 
 	/* all metadata allocations go through this cluster */
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 5f678dcb20e6..97ede2de24a3 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -3061,11 +3061,9 @@  int open_ctree(struct super_block *sb,
 	if (IS_ERR(fs_info->transaction_kthread))
 		goto fail_cleaner;
 
-	if (!btrfs_test_opt(fs_info, SSD) &&
-	    !btrfs_test_opt(fs_info, NOSSD) &&
+	if (!btrfs_test_opt(fs_info, NOSSD) &&
 	    !fs_info->fs_devices->rotating) {
-		btrfs_info(fs_info, "detected SSD devices, enabling SSD mode");
-		btrfs_set_opt(fs_info->mount_opt, SSD);
+		btrfs_set_and_info(fs_info, SSD, "enabling ssd optimizations");
 	}
 
 	/*
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 33d979e9ea2a..cb1855fede41 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -6609,19 +6609,20 @@  fetch_cluster_info(struct btrfs_fs_info *fs_info,
 		   struct btrfs_space_info *space_info, u64 *empty_cluster)
 {
 	struct btrfs_free_cluster *ret = NULL;
-	bool ssd = btrfs_test_opt(fs_info, SSD);
 
 	*empty_cluster = 0;
 	if (btrfs_mixed_space_info(space_info))
 		return ret;
 
-	if (ssd)
-		*empty_cluster = SZ_2M;
 	if (space_info->flags & BTRFS_BLOCK_GROUP_METADATA) {
 		ret = &fs_info->meta_alloc_cluster;
-		if (!ssd)
+		if (btrfs_test_opt(fs_info, SSD))
+			*empty_cluster = SZ_2M;
+		else
 			*empty_cluster = SZ_64K;
-	} else if ((space_info->flags & BTRFS_BLOCK_GROUP_DATA) && ssd) {
+	} else if ((space_info->flags & BTRFS_BLOCK_GROUP_DATA) &&
+		   btrfs_test_opt(fs_info, SSD_SPREAD)) {
+		*empty_cluster = SZ_2M;
 		ret = &fs_info->data_alloc_cluster;
 	}
 
diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
index c5e6180cdb8c..8ebbb182ba94 100644
--- a/fs/btrfs/free-space-cache.c
+++ b/fs/btrfs/free-space-cache.c
@@ -3034,14 +3034,11 @@  int btrfs_find_space_cluster(struct btrfs_fs_info *fs_info,
 	int ret;
 
 	/*
-	 * Choose the minimum extent size we'll require for this
-	 * cluster.  For SSD_SPREAD, don't allow any fragmentation.
-	 * For metadata, allow allocates with smaller extents.  For
-	 * data, keep it dense.
+	 * Choose the minimum extent size we'll require for this cluster.  For
+	 * metadata, allow allocates with smaller extents.  For data, keep it
+	 * dense.
 	 */
-	if (btrfs_test_opt(fs_info, SSD_SPREAD)) {
-		cont1_bytes = min_bytes = bytes + empty_size;
-	} else if (block_group->flags & BTRFS_BLOCK_GROUP_METADATA) {
+	if (block_group->flags & BTRFS_BLOCK_GROUP_METADATA) {
 		cont1_bytes = bytes;
 		min_bytes = fs_info->sectorsize;
 	} else {
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 4f1cdd5058f1..01ebc93f994a 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -548,20 +548,22 @@  int btrfs_parse_options(struct btrfs_fs_info *info, char *options,
 			break;
 		case Opt_ssd:
 			btrfs_set_and_info(info, SSD,
-					   "use ssd allocation scheme");
+					   "enabling ssd optimizations");
 			btrfs_clear_opt(info->mount_opt, NOSSD);
 			break;
 		case Opt_ssd_spread:
+			btrfs_set_and_info(info, SSD,
+					   "enabling ssd optimizations");
 			btrfs_set_and_info(info, SSD_SPREAD,
-					   "use spread ssd allocation scheme");
-			btrfs_set_opt(info->mount_opt, SSD);
+					   "using legacy ssd allocation scheme");
 			btrfs_clear_opt(info->mount_opt, NOSSD);
 			break;
 		case Opt_nossd:
-			btrfs_set_and_info(info, NOSSD,
-					     "not using ssd allocation scheme");
-			btrfs_clear_opt(info->mount_opt, SSD);
-			btrfs_clear_opt(info->mount_opt, SSD_SPREAD);
+			btrfs_set_opt(info->mount_opt, NOSSD);
+			btrfs_clear_and_info(info, SSD,
+					     "not using ssd optimizations");
+			btrfs_clear_and_info(info, SSD_SPREAD,
+					     "not using legacy ssd allocation scheme");
 			break;
 		case Opt_barrier:
 			btrfs_clear_and_info(info, NOBARRIER,