[RFC] btrfs: introduce rescue=onlyfs

Message ID 20200701144438.7613-1-josef@toxicpanda.com
State New

Commit Message

Josef Bacik July 1, 2020, 2:44 p.m. UTC
One of the things that came up consistently in talking with Fedora about
switching to btrfs as default is that btrfs is particularly vulnerable
to metadata corruption.  If any of the core global roots are corrupted,
the fs is unmountable and fsck can't usually do anything for you without
some special options.

Qu addressed this sort of with rescue=skipbg, but that's poorly named as
what it really does is just allow you to operate without an extent root.
However there are a lot of other roots, and I'd rather not have to do

mount -o rescue=skipbg,rescue=nocsum,rescue=nofreespacetree,rescue=blah

Instead take his original idea and modify it so it just works for
everything.  Turn it into rescue=onlyfs, and then any major root we fail
to read just gets left empty and we carry on.

Obviously if the fs roots are screwed then the user is in trouble, but
otherwise this makes it much easier to pull stuff off the disk without
needing our special rescue tools.  I tested this with my TEST_DEV that
had a bunch of data on it by corrupting the csum tree and then reading
files off the disk.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---

I'm not married to the rescue=onlyfs name, if we can think of something better
I'm good.

Also, rescue=skipbg is currently only sitting in misc-next, which is why I'm
killing it with this patch; we haven't sent it upstream, so we're good to change
it now before it lands.

 fs/btrfs/block-group.c |  2 +-
 fs/btrfs/ctree.h       |  2 +-
 fs/btrfs/disk-io.c     | 76 ++++++++++++++++++++++--------------------
 fs/btrfs/inode.c       |  6 +++-
 fs/btrfs/super.c       | 27 +++++++--------
 fs/btrfs/volumes.c     |  4 +--
 6 files changed, 63 insertions(+), 54 deletions(-)
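The fallback pattern the patch applies to each non-essential root can be sketched outside the kernel (hypothetical stand-in types and helper names; the real code uses struct btrfs_root, btrfs_read_tree_root() and the fs_info root pointers): a failed read is fatal on a normal mount, but with rescue=onlyfs the root pointer is simply left NULL and setup carries on.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-ins for struct btrfs_root and friends. */
struct root { int id; };

static struct root csum_root_storage = { 7 };

/* Simulate reading one tree root: fails with -EIO when corrupted. */
static struct root *read_tree_root(int corrupted, int *err)
{
	if (corrupted) {
		*err = -EIO;
		return NULL;
	}
	*err = 0;
	return &csum_root_storage;
}

/*
 * The rescue=onlyfs pattern from the patch: on a normal mount a failed
 * root read aborts; under onlyfs the root is left NULL and we keep going.
 */
static int read_roots(int onlyfs, int corrupted, struct root **csum_root)
{
	int err;
	struct root *root = read_tree_root(corrupted, &err);

	if (!root) {
		if (!onlyfs)
			return err;	/* normal mount: hard failure */
		/* onlyfs: leave *csum_root NULL, carry on */
	} else {
		*csum_root = root;
	}
	return 0;
}
```

Callers that later consult such a root then have to tolerate NULL, which is the crux of Qu's review comment further down the thread.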

Comments

Lukas Straub July 1, 2020, 3:22 p.m. UTC | #1
On Wed,  1 Jul 2020 10:44:38 -0400
Josef Bacik <josef@toxicpanda.com> wrote:

> ...

Maybe you could go a step further and automatically switch to rescue mode if something is corrupt. This is easier for the user than having to remember the mount flags.

Regards,
Lukas Straub

David Sterba July 1, 2020, 3:39 p.m. UTC | #2
On Wed, Jul 01, 2020 at 05:22:18PM +0200, Lukas Straub wrote:
> On Wed,  1 Jul 2020 10:44:38 -0400
> Josef Bacik <josef@toxicpanda.com> wrote:
> 
> > ...
> 
> Maybe you could go a step further and automatically switch to rescue
> mode if something is corrupt. This is easier for the user than having
> to remember the mount flags.

We don't want to do the auto-switching in general as it's a non-standard
situation.  It's better to get user attention than to silently mount
with limited capabilities and then let the user find out that something
went wrong, eg. system services randomly failing to start or work.
Josef Bacik July 1, 2020, 3:51 p.m. UTC | #3
On 7/1/20 11:39 AM, David Sterba wrote:
> On Wed, Jul 01, 2020 at 05:22:18PM +0200, Lukas Straub wrote:
>> On Wed,  1 Jul 2020 10:44:38 -0400
>> Josef Bacik <josef@toxicpanda.com> wrote:
>>
>>> ...
>>
>> Maybe you could go a step further and automatically switch to rescue
>> mode if something is corrupt. This is easier for the user than having
>> to remember the mount flags.
> 
> We don't want to do the auto-switching in general as it's a non-standard
> situation.  It's better to get user attention than to silently mount
> with limited capabilities and then let the user find out that something
> went wrong, eg. system services randomly failing to start or work.
> 

To elaborate further on this, systemd is still working on being able to (and 
maybe can now) boot a box with a read-only fs.  This is the sort of thing we 
want to make really clear to the user, so I'm all for giving them the ability 
to get to this rescue mode, but as a policy it needs to be handled in 
userspace, either via systemd or some other mechanism if people want it to be 
automatic.  Thanks,

Josef
Alberto Bursi July 1, 2020, 7:16 p.m. UTC | #4
On 01/07/20 17:39, David Sterba wrote:
> On Wed, Jul 01, 2020 at 05:22:18PM +0200, Lukas Straub wrote:
>> On Wed,  1 Jul 2020 10:44:38 -0400
>> Josef Bacik <josef@toxicpanda.com> wrote:
>>
>>> ...
>>
>> Maybe you could go a step further and automatically switch to rescue
>> mode if something is corrupt. This is easier for the user than having
>> to remember the mount flags.
> 
> We don't want to do the auto-switching in general as it's a non-standard
> situation.  It's better to get user attention than to silently mount
> with limited capabilities and then let the user find out that something
> went wrong, eg. system services randomly failing to start or work.
> 

Eh. I'm sure stopping boot and dropping to initramfs shell is a great 
way to get someone's attention.

Afaik in mdadm or hardware RAID the main way to notify the administrator 
of issues is sending an email, or sending the error through the server 
fleet management software.

-Alberto
waxhead July 1, 2020, 7:43 p.m. UTC | #5
Josef Bacik wrote:
> ...

Just an idea inspired from RAID1c3 and RAID1c3, how about introducing 
DUP2 and/or even DUP3 making multiple copies of the metadata to increase 
the chance to recover metadata on even a single storage device?
waxhead July 1, 2020, 7:45 p.m. UTC | #6
waxhead wrote:
> 
> 
> Josef Bacik wrote:
>> ...
> 
> Just an idea inspired from RAID1c3 and RAID1c3, how about introducing 
> DUP2 and/or even DUP3 making multiple copies of the metadata to increase 
> the chance to recover metadata on even a single storage device?
Obviously I am drunk... RAID1c3/RAID1c4 and DUP3/DUP4 of course... :P
Josef Bacik July 1, 2020, 7:53 p.m. UTC | #7
On 7/1/20 3:43 PM, waxhead wrote:
> 
> 
> Josef Bacik wrote:
>> ...
> 
> Just an idea inspired from RAID1c3 and RAID1c3, how about introducing DUP2 
> and/or even DUP3 making multiple copies of the metadata to increase the chance 
> to recover metadata on even a single storage device?

Because this only works on HDDs.  On SSDs, concurrent writes will often be 
shunted to the same erase block, and if the whole erase block goes, so do all of 
your copies.  This is why we default to 'single' for SSDs.

The one thing I _do_ want to do is make better use of the backup roots.  Right 
now we always free the pinned extents once the transaction commits, which makes 
the backup roots useless as we're likely to re-use those blocks.  With Nikolay's 
patches we can now async drop pinned extents, which I've implemented here for an 
unrelated issue.  We could take that work and simply hold pinned extents for 
several transactions so that old backup roots and all of their nodes don't get 
over-written until they cycle out.  This would go a long way towards making us 
more resilient under metadata corruption conditions.  Thanks,
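The "hold pinned extents for several transactions" idea can be sketched as a simple generation check (BACKUP_DEPTH, the struct layout, and the helper name are all hypothetical; this is only the release policy being proposed, not how the kernel actually tracks pinned extents):

```c
#include <assert.h>

/* Hypothetical: how many past transactions' backup roots to protect. */
#define BACKUP_DEPTH 4ULL

struct pinned_extent {
	unsigned long long bytenr;
	unsigned long long pin_gen;	/* transaction that pinned it */
};

/*
 * Release-policy sketch: an extent pinned in transaction pin_gen only
 * becomes reusable once BACKUP_DEPTH further transactions have
 * committed, so the nodes referenced by the last BACKUP_DEPTH backup
 * roots are never overwritten before those roots cycle out.
 */
static int extent_reusable(const struct pinned_extent *pe,
			   unsigned long long current_gen)
{
	return current_gen >= pe->pin_gen + BACKUP_DEPTH;
}
```

The cost of such a policy is that freed space stays unavailable for BACKUP_DEPTH commits, which is why the async pinned-extent dropping mentioned above matters.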

Josef
Hans van Kranenburg July 1, 2020, 9:17 p.m. UTC | #8
Hi!

On 7/1/20 9:53 PM, Josef Bacik wrote:
> On 7/1/20 3:43 PM, waxhead wrote:
>>
>>
>> Josef Bacik wrote:
>>> ...
>>
>> Just an idea inspired from RAID1c3 and RAID1c3, how about introducing DUP2 
>> and/or even DUP3 making multiple copies of the metadata to increase the chance 
>> to recover metadata on even a single storage device?
> 
> Because this only works on HDD.  On SSD's concurrent writes will often be 
> shunted to the same erase block, and if the whole erase block goes, so do all of 
> your copies.  This is why we default to 'single' for SSD's.
> 
> The one thing I _do_ want to do is make better use of the backup roots.  Right 
> now we always free the pinned extents once the transaction commits,

For other readers, who might think something actively needs to be
freed... sort of the opposite happens: AIUI, the in-memory yolo
blacklist gets emptied. The space was already officially freed, but it
was still blacklisted for new writes until the transaction commits.

This difference is essential to understanding that this in-memory
blacklist is also dropped when you reboot, and that that's fine. (currently)

> which makes 
> the backup roots useless as we're likely to re-use those blocks.  With Nikolay's 
> patches we can now async drop pinned extents, which I've implemented here for an 
> unrelated issue.  We could take that work and simply hold pinned extents

It could be called 'cow light'.

https://imgproc.airliners.net/photos/airliners/9/3/2/0693239.jpg?v=v40

Jokes aside, that would be great, of course, and much better than giving
up and removing all the backup roots related tooling because it's just
problematic right now.

About the 'simply' part of the story: I've been thinking of this while
writing the reply to "Buggy disk firmware (fsync/FUA) and power-loss
btrfs survability" and I ended up thinking about just dumping the
mappings to disk (since they need to be able to survive a reboot to
prevent leaking space). And, that means, space cache v1, but then for
pinned extents...

So, I'm curious to hear the out of the box idea about how to solve this
while not introducing a problem like space cache v1 again and also not
using proper metadata trees. :)

> for 
> several transactions so that old backup roots and all of their nodes don't get 
> over-written until they cycle out.  This would go a long way towards making us 
> more resilient under metadata corruption conditions.  Thanks,

K
Qu Wenruo July 2, 2020, 12:30 a.m. UTC | #9
On 2020/7/1 10:44 PM, Josef Bacik wrote:
> One of the things that came up consistently in talking with Fedora about
> switching to btrfs as default is that btrfs is particularly vulnerable
> to metadata corruption.  If any of the core global roots are corrupted,
> the fs is unmountable and fsck can't usually do anything for you without
> some special options.
> 
> Qu addressed this sort of with rescue=skipbg, but that's poorly named as
> what it really does is just allow you to operate without an extent root.
> However there are a lot of other roots, and I'd rather not have to do
> 
> mount -o rescue=skipbg,rescue=nocsum,rescue=nofreespacetree,rescue=blah

Yep, that's the problem.

The extent tree was chosen mostly because it seems to be the first victim
of transid errors.

I still like the idea to ignore more trees.

> 
> Instead take his original idea and modify it so it just works for
> everything.  Turn it into rescue=onlyfs, and then any major root we fail
> to read just gets left empty and we carry on.

That's indeed an improvement; the other trees, especially the extent, uuid,
reloc, and dev trees, are completely useless in a read-only environment.

> 
> Obviously if the fs roots are screwed then the user is in trouble, but
> otherwise this makes it much easier to pull stuff off the disk without
> needing our special rescue tools.  I tested this with my TEST_DEV that
> had a bunch of data on it by corrupting the csum tree and then reading
> files off the disk.

And since we still try to read the csum tree, if it's not corrupted it
still gets read and we do regular csum verification, so I'm completely
fine with this enhancement.

> 
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> ---
> 
> I'm not married to the rescue=onlyfs name, if we can think of something better
> I'm good.

Naming is never easy to do.

What about 'skipall'?

Despite that, I'm pretty happy with the enhancement, with only one
concern below.


...
> @@ -2275,21 +2274,27 @@ static int btrfs_read_roots(struct btrfs_fs_info *fs_info)
>  	location.objectid = BTRFS_DEV_TREE_OBJECTID;
>  	root = btrfs_read_tree_root(tree_root, &location);
>  	if (IS_ERR(root)) {
> -		ret = PTR_ERR(root);
> -		goto out;
> +		if (!btrfs_test_opt(fs_info, ONLYFS)) {
> +			ret = PTR_ERR(root);
> +			goto out;
> +		}
> +	} else {
> +		set_bit(BTRFS_ROOT_TRACK_DIRTY, &root->state);
> +		fs_info->dev_root = root;
> +		btrfs_init_devices_late(fs_info);
>  	}
> -	set_bit(BTRFS_ROOT_TRACK_DIRTY, &root->state);
> -	fs_info->dev_root = root;
> -	btrfs_init_devices_late(fs_info);
>  
>  	location.objectid = BTRFS_CSUM_TREE_OBJECTID;
>  	root = btrfs_read_tree_root(tree_root, &location);
>  	if (IS_ERR(root)) {
> -		ret = PTR_ERR(root);
> -		goto out;
> +		if (!btrfs_test_opt(fs_info, ONLYFS)) {
> +			ret = PTR_ERR(root);
> +			goto out;
> +		}

One concern I have here: I didn't see that btrfs_lookup_csums_range() can
skip a NULL csum root.

Wouldn't this cause a NULL pointer dereference if the csum root is corrupted?
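The guard being asked for would look something like this (a simplified, hypothetical signature; the real btrfs_lookup_csums_range() takes the root, a byte range, and a list to fill): every csum lookup path would need to bail out when rescue=onlyfs left the csum root NULL.

```c
#include <assert.h>
#include <stddef.h>

struct root { int dummy; };	/* stand-in for struct btrfs_root */

/*
 * Sketch of the missing NULL check: with rescue=onlyfs a corrupted
 * csum tree leaves the csum root NULL, so a lookup must return "no
 * csums found" instead of dereferencing the root.  Hypothetical
 * simplified signature.
 */
static int lookup_csums_range(struct root *csum_root)
{
	if (!csum_root)
		return 0;	/* no csum tree: nothing to verify */
	/* ... otherwise walk the csum tree ... */
	return 1;
}
```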

Thanks,
Qu

> +	} else {
> +		set_bit(BTRFS_ROOT_TRACK_DIRTY, &root->state);
> +		fs_info->csum_root = root;
>  	}
> -	set_bit(BTRFS_ROOT_TRACK_DIRTY, &root->state);
> -	fs_info->csum_root = root;
>  
>  	/*
>  	 * This tree can share blocks with some other fs tree during relocation
> @@ -2298,11 +2303,14 @@ static int btrfs_read_roots(struct btrfs_fs_info *fs_info)
>  	root = btrfs_get_fs_root(tree_root->fs_info,
>  				 BTRFS_DATA_RELOC_TREE_OBJECTID, true);
>  	if (IS_ERR(root)) {
> -		ret = PTR_ERR(root);
> -		goto out;
> +		if (!btrfs_test_opt(fs_info, ONLYFS)) {
> +			ret = PTR_ERR(root);
> +			goto out;
> +		}
> +	} else {
> +		set_bit(BTRFS_ROOT_TRACK_DIRTY, &root->state);
> +		fs_info->data_reloc_root = root;
>  	}
> -	set_bit(BTRFS_ROOT_TRACK_DIRTY, &root->state);
> -	fs_info->data_reloc_root = root;
>  
>  	location.objectid = BTRFS_QUOTA_TREE_OBJECTID;
>  	root = btrfs_read_tree_root(tree_root, &location);
> @@ -2315,9 +2323,11 @@ static int btrfs_read_roots(struct btrfs_fs_info *fs_info)
>  	location.objectid = BTRFS_UUID_TREE_OBJECTID;
>  	root = btrfs_read_tree_root(tree_root, &location);
>  	if (IS_ERR(root)) {
> -		ret = PTR_ERR(root);
> -		if (ret != -ENOENT)
> -			goto out;
> +		if (!btrfs_test_opt(fs_info, ONLYFS)) {
> +			ret = PTR_ERR(root);
> +			if (ret != -ENOENT)
> +				goto out;
> +		}
>  	} else {
>  		set_bit(BTRFS_ROOT_TRACK_DIRTY, &root->state);
>  		fs_info->uuid_root = root;
> @@ -2327,11 +2337,14 @@ static int btrfs_read_roots(struct btrfs_fs_info *fs_info)
>  		location.objectid = BTRFS_FREE_SPACE_TREE_OBJECTID;
>  		root = btrfs_read_tree_root(tree_root, &location);
>  		if (IS_ERR(root)) {
> -			ret = PTR_ERR(root);
> -			goto out;
> +			if (!btrfs_test_opt(fs_info, ONLYFS)) {
> +				ret = PTR_ERR(root);
> +				goto out;
> +			}
> +		}  else {
> +			set_bit(BTRFS_ROOT_TRACK_DIRTY, &root->state);
> +			fs_info->free_space_root = root;
>  		}
> -		set_bit(BTRFS_ROOT_TRACK_DIRTY, &root->state);
> -		fs_info->free_space_root = root;
>  	}
>  
>  	return 0;
> @@ -3047,20 +3060,11 @@ int __cold open_ctree(struct super_block *sb, struct btrfs_fs_devices *fs_device
>  	}
>  
>  	/* Skip bg needs RO and no tree-log to replay */
> -	if (btrfs_test_opt(fs_info, SKIPBG)) {
> -		if (!sb_rdonly(sb)) {
> -			btrfs_err(fs_info,
> -			"rescue=skipbg can only be used on read-only mount");
> -			err = -EINVAL;
> -			goto fail_alloc;
> -		}
> -		if (btrfs_super_log_root(disk_super) &&
> -		    !btrfs_test_opt(fs_info, NOLOGREPLAY)) {
> -			btrfs_err(fs_info,
> -"rescue=skipbg must be used with rescue=nologreplay when tree-log needs to replayed");
> -			err = -EINVAL;
> -			goto fail_alloc;
> -		}
> +	if (btrfs_test_opt(fs_info, ONLYFS) && !sb_rdonly(sb)) {
> +		btrfs_err(fs_info,
> +			  "rescue=onlyfs can only be used on read-only mount");
> +		err = -EINVAL;
> +		goto fail_alloc;
>  	}
>  
>  	ret = btrfs_init_workqueues(fs_info, fs_devices);
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index d301550b9c70..9f8ef22ac65e 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -2209,7 +2209,8 @@ static blk_status_t btrfs_submit_bio_hook(struct inode *inode, struct bio *bio,
>  	int skip_sum;
>  	int async = !atomic_read(&BTRFS_I(inode)->sync_writers);
>  
> -	skip_sum = BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM;
> +	skip_sum = (BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM) ||
> +		btrfs_test_opt(fs_info, ONLYFS);
>  
>  	if (btrfs_is_free_space_inode(BTRFS_I(inode)))
>  		metadata = BTRFS_WQ_ENDIO_FREE_SPACE;
> @@ -2866,6 +2867,9 @@ static int btrfs_readpage_end_io_hook(struct btrfs_io_bio *io_bio,
>  	if (BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM)
>  		return 0;
>  
> +	if (btrfs_test_opt(root->fs_info, ONLYFS))
> +		return 0;
> +
>  	if (root->root_key.objectid == BTRFS_DATA_RELOC_TREE_OBJECTID &&
>  	    test_range_bit(io_tree, start, end, EXTENT_NODATASUM, 1, NULL)) {
>  		clear_extent_bits(io_tree, start, end, EXTENT_NODATASUM);
> diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
> index 3c9ebd4f2b61..7ea9f8f53156 100644
> --- a/fs/btrfs/super.c
> +++ b/fs/btrfs/super.c
> @@ -345,7 +345,7 @@ enum {
>  	Opt_rescue,
>  	Opt_usebackuproot,
>  	Opt_nologreplay,
> -	Opt_rescue_skipbg,
> +	Opt_rescue_onlyfs,
>  
>  	/* Deprecated options */
>  	Opt_alloc_start,
> @@ -445,7 +445,7 @@ static const match_table_t tokens = {
>  static const match_table_t rescue_tokens = {
>  	{Opt_usebackuproot, "usebackuproot"},
>  	{Opt_nologreplay, "nologreplay"},
> -	{Opt_rescue_skipbg, "skipbg"},
> +	{Opt_rescue_onlyfs, "onlyfs"},
>  	{Opt_err, NULL},
>  };
>  
> @@ -478,9 +478,10 @@ static int parse_rescue_options(struct btrfs_fs_info *info, const char *options)
>  			btrfs_set_and_info(info, NOLOGREPLAY,
>  					   "disabling log replay at mount time");
>  			break;
> -		case Opt_rescue_skipbg:
> -			btrfs_set_and_info(info, SKIPBG,
> -				"skip mount time block group searching");
> +		case Opt_rescue_onlyfs:
> +			btrfs_set_and_info(info, ONLYFS,
> +					   "only reading fs roots, also setting nologreplay");
> +			btrfs_set_opt(info->mount_opt, NOLOGREPLAY);
>  			break;
>  		case Opt_err:
>  			btrfs_info(info, "unrecognized rescue option '%s'", p);
> @@ -1418,8 +1419,8 @@ static int btrfs_show_options(struct seq_file *seq, struct dentry *dentry)
>  		seq_puts(seq, ",notreelog");
>  	if (btrfs_test_opt(info, NOLOGREPLAY))
>  		seq_puts(seq, ",rescue=nologreplay");
> -	if (btrfs_test_opt(info, SKIPBG))
> -		seq_puts(seq, ",rescue=skipbg");
> +	if (btrfs_test_opt(info, ONLYFS))
> +		seq_puts(seq, ",rescue=onlyfs");
>  	if (btrfs_test_opt(info, FLUSHONCOMMIT))
>  		seq_puts(seq, ",flushoncommit");
>  	if (btrfs_test_opt(info, DISCARD_SYNC))
> @@ -1859,10 +1860,10 @@ static int btrfs_remount(struct super_block *sb, int *flags, char *data)
>  	if (ret)
>  		goto restore;
>  
> -	if (btrfs_test_opt(fs_info, SKIPBG) !=
> -	    (old_opts & BTRFS_MOUNT_SKIPBG)) {
> +	if (btrfs_test_opt(fs_info, ONLYFS) !=
> +	    (old_opts & BTRFS_MOUNT_ONLYFS)) {
>  		btrfs_err(fs_info,
> -		"rescue=skipbg mount option can't be changed during remount");
> +		"rescue=onlyfs mount option can't be changed during remount");
>  		ret = -EINVAL;
>  		goto restore;
>  	}
> @@ -1932,9 +1933,9 @@ static int btrfs_remount(struct super_block *sb, int *flags, char *data)
>  			goto restore;
>  		}
>  
> -		if (btrfs_test_opt(fs_info, SKIPBG)) {
> +		if (btrfs_test_opt(fs_info, ONLYFS)) {
>  			btrfs_err(fs_info,
> -		"remounting read-write with rescue=skipbg is not allowed");
> +		"remounting read-write with rescue=onlyfs is not allowed");
>  			ret = -EINVAL;
>  			goto restore;
>  		}
> @@ -2245,7 +2246,7 @@ static int btrfs_statfs(struct dentry *dentry, struct kstatfs *buf)
>  	 *
>  	 * Or if we're rescuing, set available to 0 anyway.
>  	 */
> -	if (btrfs_test_opt(fs_info, SKIPBG) ||
> +	if (btrfs_test_opt(fs_info, ONLYFS) ||
>  	    (!mixed && block_rsv->space_info->full &&
>  	     total_free_meta - thresh < block_rsv->size))
>  		buf->f_bavail = 0;
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index aabc6c922e04..a5d124f95ce2 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -7595,10 +7595,10 @@ int btrfs_verify_dev_extents(struct btrfs_fs_info *fs_info)
>  	int ret = 0;
>  
>  	/*
> -	 * For rescue=skipbg mount option, we're already RO and are salvaging
> +	 * For rescue=onlyfs mount option, we're already RO and are salvaging
>  	 * data, no need for such strict check.
>  	 */
> -	if (btrfs_test_opt(fs_info, SKIPBG))
> +	if (btrfs_test_opt(fs_info, ONLYFS))
>  		return 0;
>  
>  	key.objectid = 1;
>
Zygo Blaxell July 2, 2020, 2:05 a.m. UTC | #10
On Wed, Jul 01, 2020 at 05:39:19PM +0200, David Sterba wrote:
> On Wed, Jul 01, 2020 at 05:22:18PM +0200, Lukas Straub wrote:
> > On Wed,  1 Jul 2020 10:44:38 -0400
> > Josef Bacik <josef@toxicpanda.com> wrote:
> > 
> > > ...
> > 
> > Maybe you could go a step further and automatically switch to rescue
> > mode if something is corrupt. This is easier for the user than having
> > to remember the mount flags.
> 
> We don't want to do the auto-switching in general as it's a non-standard
> situation.  It's better to get user attention than to silently mount
> with limited capabilities and then let the user find out that something
> went wrong, eg. system services randomly failing to start or work.

To be fair, auto-switching is almost exactly what happens now, with mounts
being the exception.  If the tree corruption is detected after mounting,
we panic and flip the RO bit right in the middle of the work day.
When mounting with default options and a tree error detected during the
mount, we fail to mount at all--IMHO that's the non-standard situation.
Qu Wenruo July 2, 2020, 3:09 a.m. UTC | #11
On 2020/7/2 3:53 AM, Josef Bacik wrote:
> On 7/1/20 3:43 PM, waxhead wrote:
>>
>>
>> Josef Bacik wrote:
>>> [...]
>>
>> Just an idea inspired from RAID1c3 and RAID1c3, how about introducing
>> DUP2 and/or even DUP3 making multiple copies of the metadata to
>> increase the chance to recover metadata on even a single storage device?
> 
> Because this only works on HDD.  On SSD's concurrent writes will often
> be shunted to the same erase block, and if the whole erase block goes,
> so do all of your copies.  This is why we default to 'single' for SSD's.
> 
> The one thing I _do_ want to do is make better use of the backup roots. 
> Right now we always free the pinned extents once the transaction
> commits, which makes the backup roots useless as we're likely to re-use
> those blocks.

IIRC Filipe tried this before and didn't go that direction due to ENOSPC,
as we would need to commit multiple transactions to free the pinned extents.

But maybe the latest async pinned extent drop could solve the problem?

Thanks,
Qu

>  With Nikolay's patches we can now async drop pinned
> extents, which I've implemented here for an unrelated issue.  We could
> take that work and simply hold pinned extents for several transactions
> so that old backup roots and all of their nodes don't get over-written
> until they cycle out.  This would go a long way towards making us more
> resilient under metadata corruption conditions.  Thanks,
> 
> Josef
Zygo Blaxell July 2, 2020, 3:30 a.m. UTC | #12
On Wed, Jul 01, 2020 at 03:53:40PM -0400, Josef Bacik wrote:
> On 7/1/20 3:43 PM, waxhead wrote:
> > 
> > 
> > Josef Bacik wrote:
> > > [...]
> > 
> > Just an idea inspired from RAID1c3 and RAID1c3, how about introducing
> > DUP2 and/or even DUP3 making multiple copies of the metadata to increase
> > the chance to recover metadata on even a single storage device?

I don't think extra dup copies are very useful.  The disk firmware
behavior that breaks 2-copy dup will break 3-copy and 4-copy too.

> Because this only works on HDD.  On SSD's concurrent writes will often be
> shunted to the same erase block, and if the whole erase block goes, so do
> all of your copies.  This is why we default to 'single' for SSD's.

That's true on higher end SSDs that have "data reduction" feature sets
(compress and dedupe).  In theory these drives could dedupe metadata
pages even if they are slightly different, so schemes like labelling the
two dup copies won't work.  A sufficiently broken flash page will wipe
out both metadata copies, possibly even if you arrange a delay buffer
to write both copies several minutes apart.

On low-end SSD, though, there's not only no dedupe, there's also plenty of
single-bit (or sector) errors that don't destroy both copies of metadata.
It's one of the reasons why low-end drives have the write endurance of
mayflies compared to their high-end counterparts--even with the same
flash technology underneath, the high-end drives do clever things behind
the scenes, while the low-end drives just write every sector they're told
to, and the resulting TBW lifespan is very different.  I've had Kingston
SSDs recover from several single-sector csum failure events in btrfs
metadata blocks thanks to dup metadata.  Those would have been 'mkfs &
start over' events had I used default mkfs options with single metadata.

Dup metadata should be the default on all single-drive filesystems.
Single metadata should be reserved for extreme performance/robustness
tradeoffs, like the nobarrier mount option, where the filesystem is not
intended to survive a crash.  Dup metadata won't affect write endurance
or robustness on drives that dedupe writes, but it can save a filesystem
from destruction on drives that don't.
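The recovery Zygo describes can be sketched in userspace: with dup metadata there are two copies of each block on disk, and a reader only needs one copy whose stored checksum still matches its payload. The structures and checksum below are toys for illustration, not btrfs's actual on-disk format or kernel code:

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Toy rolling checksum standing in for the real btrfs csum. */
static uint32_t toy_csum(const uint8_t *data, size_t len)
{
	uint32_t c = 0xdeadbeef;

	for (size_t i = 0; i < len; i++)
		c = (c << 5) + c + data[i];
	return c;
}

/* One metadata block copy as written to disk: payload plus stored csum. */
struct copy {
	uint8_t data[16];
	uint32_t stored_csum;
};

/*
 * Return the index of the first copy whose stored checksum matches its
 * payload, mirroring how dup metadata lets a reader fall back to the
 * mirror when the primary copy fails csum verification; -1 means every
 * copy is corrupt and the block is lost.
 */
static int pick_good_copy(const struct copy *copies, int ncopies)
{
	for (int i = 0; i < ncopies; i++)
		if (toy_csum(copies[i].data, sizeof(copies[i].data)) ==
		    copies[i].stored_csum)
			return i;
	return -1;
}
```

A single-sector error in one copy leaves the other readable; only correlated damage to both copies (the dedupe/erase-block case discussed above) defeats the scheme.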

I think we might want to look into having some kind of delayed write
buffer to spread out writes to the same disk over a sufficiently long
period of time to defeat various firmware bugs.  e.g. I had a filesystem
killed by WD Black (HDD) firmware when it mishandled a UNC sector and
dropped the write cache during error handling.  If the two metadata copies
had been written on either side of the UNC error, the filesystem would
have survived, but since both metadata copies were destroyed by the
firmware bug, the filesystem was lost.

> The one thing I _do_ want to do is make better use of the backup roots.
> Right now we always free the pinned extents once the transaction commits,
> which makes the backup roots useless as we're likely to re-use those blocks.
> With Nikolay's patches we can now async drop pinned extents, which I've
> implemented here for an unrelated issue.  We could take that work and simply
> hold pinned extents for several transactions so that old backup roots and
> all of their nodes don't get over-written until they cycle out.  This would
> go a long way towards making us more resilient under metadata corruption
> conditions.  Thanks,

I have questions about this:  That would probably increase the size
required for global reserve?  RAM requirements for the pinned list?
What impact does this have when space is low, e.g. deleting a snapshot
on a full filesystem?  There are probably answers to these, but it
might mean spending some time making sure all the ENOSPC cases are
still recoverable.

Increasing the number of viable backup roots can work if we can figure
out how many trees we need to keep, but for real drives it's probably
the number and temporal spacing of pages written that matters, not
the number of transactions (assuming the drives just ignore write
barriers--if they didn't, then there's no need to keep backup roots
at all).  A full filesystem can get lots of short transactions, so if
the disk firmware bug is "throw the last 500 ms of ACKed writes away"
then keeping the last 8 trees committed in 50 ms each will not help,
because all of those as well as the 9th and 10th trees will be broken.
Same problem for fsync and the log tree, if the disk is going to corrupt
metadata it will corrupt data too.

What we really need is a usable fsck (possibly online or an interactive
tool) that can put a filesystem back together quickly when only a few
hundred metadata pages are broken.  The vast majority of btrfs filesystems
I've seen killed were due to btrfs and Linux kernel bugs, e.g. that delayed
reloc_roots bug in 5.1.  Second place is RAM failure leading to memory
corruption (a favorite in #btrfs IRC).  Third is disk failure beyond
array limits (mostly on SD cards, nothing btrfs can do when the whole
disk fails).  Firmware bugs in the disk eating the metadata is #4 (it
happens, but not often).  Keeping backup trees on the disk longer will
only help in the 4th case.  All of the other cases involve damage to trees
that were committed long ago, where none of the backup roots will work.

> Josef
Josef Bacik July 2, 2020, 3:28 p.m. UTC | #13
On 7/1/20 11:09 PM, Qu Wenruo wrote:
> 
> 
> On 2020/7/2 3:53 AM, Josef Bacik wrote:
>> On 7/1/20 3:43 PM, waxhead wrote:
>>>
>>>
>>> Josef Bacik wrote:
>>>> [...]
>>>
>>> Just an idea inspired from RAID1c3 and RAID1c3, how about introducing
>>> DUP2 and/or even DUP3 making multiple copies of the metadata to
>>> increase the chance to recover metadata on even a single storage device?
>>
>> Because this only works on HDD.  On SSD's concurrent writes will often
>> be shunted to the same erase block, and if the whole erase block goes,
>> so do all of your copies.  This is why we default to 'single' for SSD's.
>>
>> The one thing I _do_ want to do is make better use of the backup roots.
>> Right now we always free the pinned extents once the transaction
>> commits, which makes the backup roots useless as we're likely to re-use
>> those blocks.
> 
> IIRC Filipe tried this before and didn't go that direction due to ENOSPC.
> As we need to commit multiple transactions to free the pinned extents.
> 
> But maybe the latest async pinned extent drop could solve the problem?
> 

Yeah, before it was tricky, but Nikolay's work made async pinned extent drop 
possible; I've been testing that patch internally.

Now it's just a matter of keeping the last 4 transactions' worth of pinned 
extents around and only unpinning under ENOSPC conditions.  I'll dig out the 
async unpinning and send that up next week, since that's already valuable by 
itself, and then we can talk about wiring up the ENOSPC part of it.  Thanks,

Josef
Josef Bacik July 2, 2020, 3:35 p.m. UTC | #14
On 7/1/20 11:30 PM, Zygo Blaxell wrote:
> On Wed, Jul 01, 2020 at 03:53:40PM -0400, Josef Bacik wrote:
>> On 7/1/20 3:43 PM, waxhead wrote:
>>>
>>>
>>> Josef Bacik wrote:
>>>> [...]
>>>
>>> Just an idea inspired from RAID1c3 and RAID1c3, how about introducing
>>> DUP2 and/or even DUP3 making multiple copies of the metadata to increase
>>> the chance to recover metadata on even a single storage device?
> 
> I don't think extra dup copies are very useful.  The disk firmware
> behavior that breaks 2-copy dup will break 3-copy and 4-copy too.
> 
>> Because this only works on HDD.  On SSD's concurrent writes will often be
>> shunted to the same erase block, and if the whole erase block goes, so do
>> all of your copies.  This is why we default to 'single' for SSD's.
> 
> [...]
> 
>> The one thing I _do_ want to do is make better use of the backup roots.
>> Right now we always free the pinned extents once the transaction commits,
>> which makes the backup roots useless as we're likely to re-use those blocks.
>> With Nikolay's patches we can now async drop pinned extents, which I've
>> implemented here for an unrelated issue.  We could take that work and simply
>> hold pinned extents for several transactions so that old backup roots and
>> all of their nodes don't get over-written until they cycle out.  This would
>> go a long way towards making us more resilient under metadata corruption
>> conditions.  Thanks,
> 
> I have questions about this:  That would probably increase the size
> required for global reserve?  RAM requirements for the pinned list?
> What impact does this have when space is low, e.g. deleting a snapshot
> on a full filesystem?  There are probably answers to these, but it
> might mean spending some time making sure all the ENOSPC cases are
> still recoverable.

There are RAM requirements, but they're low; it's just a range tree with the 
bits set that need to be unpinned.  We wouldn't have to increase the size 
required for the global reserve.

Right now the way we deal with pinned extents is to commit the transaction, 
because once the transaction is committed all pinned extents are unpinned. 
Nikolay made it so we have a per-transaction pinned tree now instead of two that 
we flip flop on.

That enables us to async off the unpinning.  I've been doing that internally 
with WhatsApp because they see a huge amount of latency with unpinning.  My 
version of this still works fine with the transaction part, because we wait for 
the transaction to "complete", which only gets set once the unpin mechanism is 
done.  So although it's asynchronous, we still get the same behavior from an 
enospc perspective.

What I would do for this case is delay 4 transactions' worth of pinned 
extents.  Once we commit the 5th transaction, we unpin the oldest guy because 
we no longer need him.

But this creates ENOSPC issues, because now we have decoupled the transaction 
commit from the freeing of pinned extents.  But that's ok, we know where our 
pinned space is now.  So we just add a new flushing state in which we walk 
through the unpin list and unpin extents until our reservations are satisfied.

Now this does mean that under a space crunch we're not going to preserve our 
old roots as well, because we'll be re-using them, but that's the same 
situation that we have currently.  Most normal users will have plenty of space 
and thus will get the benefit of being able to recover from the backup roots.  
It's not a perfect solution, but it's muuuuch better than what we have 
currently.
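The scheme above — hold each transaction's pinned extents for a few more commits, and drain them early only under ENOSPC pressure — can be sketched in userspace. Everything here (names, the ring structure, the byte accounting) is illustrative, not the kernel implementation:

```c
#include <assert.h>
#include <stdint.h>

#define DELAY_TXNS 4	/* commits to hold pinned extents, per the sketch */

/*
 * Toy model: each committed transaction's pinned bytes sit in a small
 * ring for DELAY_TXNS more commits before being returned to the free
 * pool, so the blocks backing old backup roots are not immediately
 * reused.  Under ENOSPC pressure, flush_pinned() drains entries
 * oldest-first, degrading back to today's unpin-at-commit behavior.
 */
struct pin_ring {
	uint64_t pinned[DELAY_TXNS];	/* bytes pinned by the last N commits */
	int head;			/* slot holding the oldest entry */
	uint64_t free_bytes;		/* toy free-space counter */
};

static void commit_txn(struct pin_ring *r, uint64_t newly_pinned)
{
	/* The slot cycling out has aged DELAY_TXNS commits: unpin it. */
	r->free_bytes += r->pinned[r->head];
	r->pinned[r->head] = newly_pinned;
	r->head = (r->head + 1) % DELAY_TXNS;
}

/* New flushing state: unpin oldest-first until the reservation fits. */
static int flush_pinned(struct pin_ring *r, uint64_t need)
{
	for (int i = 0; i < DELAY_TXNS && r->free_bytes < need; i++) {
		int slot = (r->head + i) % DELAY_TXNS;

		r->free_bytes += r->pinned[slot];
		r->pinned[slot] = 0;
	}
	return r->free_bytes >= need ? 0 : -1; /* -1: genuine ENOSPC */
}
```

The tradeoff is visible in the model: with plenty of space the ring keeps the last few roots' blocks untouched, and only when a reservation cannot be satisfied do the old roots start getting sacrificed.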

> What we really need is a usable fsck (possibly online or an interactive
> tool) that can put a filesystem back together quickly when only a few
> hundred metadata pages are broken.  The vast majority of btrfs filesystems
> I've seen killed were due btrfs and Linux kernel bugs, e.g. that delayed
> reloc_roots bug in 5.1.  Second place is RAM failure leading to memory
> corruption (a favorite in #btrfs IRC).  Third is disk failure beyond
> array limits (mostly on SD cards, nothing btrfs can do when the whole
> disk fails).  Firmware bugs in the disk eating the metadata is #4 (it
> happens, but not often).  Keeping backup trees on the disk longer will
> only help in the 4th case.  All of the other cases involve damage to trees
> that were committed long ago, where none of the backup roots will work.
> 
Yeah I think we need to be smarter about detecting these cases with btrfsck. 
Like there's good reason to not default to --init-extent-tree or 
--init-csum-tree.  But if you can't read those roots at all?  Why make the user 
have to figure that stuff out?  --repair should be able to say "oh well these 
trees are just completely fucked, --init-extent-tree is getting turned on", 
without the user needing to know how to do it.  Thanks,

Josef
Zygo Blaxell July 2, 2020, 9:10 p.m. UTC | #15
On Thu, Jul 02, 2020 at 11:35:00AM -0400, Josef Bacik wrote:
> On 7/1/20 11:30 PM, Zygo Blaxell wrote:
> > On Wed, Jul 01, 2020 at 03:53:40PM -0400, Josef Bacik wrote:
> > > The one thing I _do_ want to do is make better use of the backup roots.
> > > Right now we always free the pinned extents once the transaction commits,
> > > which makes the backup roots useless as we're likely to re-use those blocks.
> > > With Nikolay's patches we can now async drop pinned extents, which I've
> > > implemented here for an unrelated issue.  We could take that work and simply
> > > hold pinned extents for several transactions so that old backup roots and
> > > all of their nodes don't get over-written until they cycle out.  This would
> > > go a long way towards making us more resilient under metadata corruption
> > > conditions.  Thanks,
> > 
> > I have questions about this:  That would probably increase the size
> > required for global reserve?  RAM requirements for the pinned list?
> > What impact does this have when space is low, e.g. deleting a snapshot
> > on a full filesystem?  There are probably answers to these, but it
> > might mean spending some time making sure all the ENOSPC cases are
> > still recoverable.
> 
> There's RAM requirements but they're low, its just a range tree with the
> bits set that need to be unpinned.  We wouldn't have to increase the size
> required for global reserve.
> 
> Right now the way we deal with pinned extents is to commit the transaction,
> because once the transaction is committed all pinned extents are unpinned.
> Nikolay made it so we have a per-transaction pinned tree now instead of two
> that we flip flop on.
> 
> That enables us to async off the unpinning.  I've been doing that internally
> with WhatsApp because they see a huge amount of latency with unpinning.  My
> version of this still works fine with the transaction part, because we wait
> for the transaction to "complete", which only gets set once the unpin
> mechanism is done.  So although it's asynchronous, we still get the same
> behavior from an enospc perspective.

That's interesting...does it mean that we do fewer reads in the critical
part of transaction commit too?  Like if we remove the last reference to
an extent, can we delete the extent and the csum item for the extent in
the background, while new transactions are running?  That would have a
huge latency benefit (which I guess is what WhatsApp wanted), regardless
of the effect on robustness.

> What I would do for this case is delay 4 transactions worth of pinned
> extents. Once we commit the 5th transaction, we unpin the oldest guy because
> we no longer need him.
> 
> But this creates ENOSPC issues, because now we have decoupled the
> transaction commit from the pinned extent thing.  But that's ok, we know
> where our pinned space is now.  So we just add a new flushing state that we
> start walking through the unpin list and unpinning stuff until our
> reservations are satisfied.

So under ENOSPC we smoothly transition back to current behavior.  Got it.

> Now this does mean that under a space crunch we're going to not be saving
> our old roots as well because we'll be re-using them, but that's the same
> situation that we have currently.  Most normal users will have plenty of
> space and thus will get the benefit of being able to recovery from the
> backup roots.  It's not a perfect solution, but it's muuuuch better than
> what we have currently.

I agree that's quantitatively better than 'btrfs gets destroyed
by an unclean shutdown on some drives', but it would mean that firmware
bug root corruption correlates with low disk space.  So it would be
"btrfs gets destroyed by an unclean shutdown while low on disk space on
some drives."  I guess it depends on where you want to draw the line on
support for hard drives that belong in the recycle bin.

> > What we really need is a usable fsck (possibly online or an interactive
> > tool) that can put a filesystem back together quickly when only a few
> > hundred metadata pages are broken.  The vast majority of btrfs filesystems
> > I've seen killed were due btrfs and Linux kernel bugs, e.g. that delayed
> > reloc_roots bug in 5.1.  Second place is RAM failure leading to memory
> > corruption (a favorite in #btrfs IRC).  Third is disk failure beyond
> > array limits (mostly on SD cards, nothing btrfs can do when the whole
> > disk fails).  Firmware bugs in the disk eating the metadata is #4 (it
> > happens, but not often).  Keeping backup trees on the disk longer will
> > only help in the 4th case.  All of the other cases involve damage to trees
> > that were committed long ago, where none of the backup roots will work.
> > 
> Yeah I think we need to be smarter about detecting these cases with btrfsck.
> Like there's good reason to not default to --init-extent-tree or
> --init-csum-tree.  But if you can't read those roots at all?  Why make the
> user have to figure that stuff out?  --repair should be able to say "oh well
> these trees are just completely fucked, --init-extent-tree is getting turned
> on", without the user needing to know how to do it.  Thanks,

That...isn't the direction I was thinking of.  '--init-csum-tree' burns
the data integrity checking in btrfs.  If you're going to do that, you
might as well just restore from backups; at least then you know that the
data wasn't corrupted by whatever corrupted the metadata on the same disk.

There have been some historical btrfs bugs where there's supposed to
be a csum item and there isn't, or there's not supposed to be a csum
item and there is.  Ideally, we'd have a way to tell btrfs check or the
kernel on a mounted filesystem "yes we know there's no csums for block
XYZ, we give permission to compute and insert (delete) those missing
(extra) csum items in this instance" so we don't have to take the
filesystem offline and let btrfs check spend days trying to discover
the one missing csum item on its own.

--init-extent-tree, sure, brute force scan and rebuild of all the subvol
metadata is the general solution to fix the extent tree.  If you keep the
csum tree, you can check your work afterwards.  If the disk is corrupting
one page, it's likely to corrupt several.  btrfs check should not (by
default) prevent scrub from providing a list of corrupted data blocks
when it's done.

I'm thinking more of the cases where there's 150GB of metadata,
and something went wrong with an extent data item in one leaf node.
--init-extent-tree will work, but it's massive overkill.  It currently
takes more than 11 days to run btrfs check on such filesystems (11 days being
the amount of time it took to build a new filesystem from backups, then turn
off and wipe the server that had been making no visible progress while
running btrfs check).  In some cases I can just delete the
offending items in a hex editor and all is well again (it takes hours to
do by hand, but still beats 11 days).  It would be nice to have the kernel
print a message like 'please run "btrfs check --repair --broken-nodes
1434238885120,4125924810240" and try again' so btrfs check doesn't have
to read the whole metadata to find and fix a single-page problem.
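The win from such a targeted mode is easy to quantify in a toy model: a full check visits every node, while a check driven by a kernel-reported list of suspect bytenrs visits only those. The `--broken-nodes` option above is hypothetical, as is everything in this sketch:

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/*
 * Toy node: a logical address plus a validity flag standing in for
 * "passes all of check's item-level verifications".
 */
struct node {
	uint64_t bytenr;
	int ok;
};

/* Full check: visit everything (the 11-day path on 150GB of metadata). */
static size_t check_all(const struct node *nodes, size_t n, size_t *bad)
{
	*bad = 0;
	for (size_t i = 0; i < n; i++)
		if (!nodes[i].ok)
			(*bad)++;
	return n;	/* nodes visited */
}

/*
 * Targeted check: visit only the bytenrs the kernel reported, as a
 * hypothetical --broken-nodes option might do.
 */
static size_t check_targets(const struct node *nodes, size_t n,
			    const uint64_t *targets, size_t ntargets,
			    size_t *bad)
{
	size_t visited = 0;

	*bad = 0;
	for (size_t t = 0; t < ntargets; t++)
		for (size_t i = 0; i < n; i++)
			if (nodes[i].bytenr == targets[t]) {
				visited++;
				if (!nodes[i].ok)
					(*bad)++;
			}
	return visited;
}
```

Both paths find the same damaged node, but the targeted one reads a handful of pages instead of the whole metadata — which is the entire point of having the kernel name the broken bytenrs in its error message.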

> Josef
Qu Wenruo July 2, 2020, 11:36 p.m. UTC | #16
On 2020/7/2 11:28 PM, Josef Bacik wrote:
> On 7/1/20 11:09 PM, Qu Wenruo wrote:
>>
>>
>> On 2020/7/2 3:53 AM, Josef Bacik wrote:
>>> On 7/1/20 3:43 PM, waxhead wrote:
>>>>
>>>>
>>>> Josef Bacik wrote:
>>>>> [...]
>>>>
>>>> Just an idea inspired from RAID1c3 and RAID1c3, how about introducing
>>>> DUP2 and/or even DUP3 making multiple copies of the metadata to
>>>> increase the chance to recover metadata on even a single storage
>>>> device?
>>>
>>> Because this only works on HDD.  On SSD's concurrent writes will often
>>> be shunted to the same erase block, and if the whole erase block goes,
>>> so do all of your copies.  This is why we default to 'single' for SSD's.
>>>
>>> The one thing I _do_ want to do is make better use of the backup roots.
>>> Right now we always free the pinned extents once the transaction
>>> commits, which makes the backup roots useless as we're likely to re-use
>>> those blocks.
>>
>> IIRC Filipe tried this before and didn't go that direction due to ENOSPC.
>> As we need to commit multiple transactions to free the pinned extents.
>>
>> But maybe the latest async pinned extent drop could solve the problem?
>>
> 
> Yeah before it was tricky, but with Nikolay's work it made async pinned
> extent drop possible, I've been testing that patch internally.
> 
> Now it's just a matter of keeping the last 4 transactions worth of
> pinned around and only unpinning under enospc conditions.  I'll dig out
> the async unpinning and send that up next week since that's already
> valuable by itself, and then we can talk about wiring up the ENOSPC part
> of it.  Thanks,

That's really awesome, let's make btrfs the most bulletproof fs then!

Thanks,
Qu

> 
> Josef
>
Neal Gompa July 3, 2020, 12:09 a.m. UTC | #17
On Thu, Jul 2, 2020 at 7:38 PM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>
>
>
> > On 2020/7/2 11:28 PM, Josef Bacik wrote:
> > On 7/1/20 11:09 PM, Qu Wenruo wrote:
> >>
> >>
> >> On 2020/7/2 3:53 AM, Josef Bacik wrote:
> >>> On 7/1/20 3:43 PM, waxhead wrote:
> >>>>
> >>>>
> >>>> Josef Bacik wrote:
> >>>>> One of the things that came up consistently in talking with Fedora
> >>>>> about
> >>>>> switching to btrfs as default is that btrfs is particularly vulnerable
> >>>>> to metadata corruption.  If any of the core global roots are
> >>>>> corrupted,
> >>>>> the fs is unmountable and fsck can't usually do anything for you
> >>>>> without
> >>>>> some special options.
> >>>>>
> >>>>> Qu addressed this sort of with rescue=skipbg, but that's poorly
> >>>>> named as
> >>>>> what it really does is just allow you to operate without an extent
> >>>>> root.
> >>>>> However there are a lot of other roots, and I'd rather not have to do
> >>>>>
> >>>>> mount -o
> >>>>> rescue=skipbg,rescue=nocsum,rescue=nofreespacetree,rescue=blah
> >>>>>
> >>>>> Instead take his original idea and modify it so it just works for
> >>>>> everything.  Turn it into rescue=onlyfs, and then any major root we
> >>>>> fail
> >>>>> to read just gets left empty and we carry on.
> >>>>>
> >>>>> Obviously if the fs roots are screwed then the user is in trouble, but
> >>>>> otherwise this makes it much easier to pull stuff off the disk without
> >>>>> needing our special rescue tools.  I tested this with my TEST_DEV that
> >>>>> had a bunch of data on it by corrupting the csum tree and then reading
> >>>>> files off the disk.
> >>>>>
> >>>>> Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> >>>>> ---
> >>>>
> >>>> Just an idea inspired by RAID1c3 and RAID1c4: how about introducing
> >>>> DUP2 and/or even DUP3, making multiple copies of the metadata to
> >>>> increase the chance of recovering metadata even on a single storage
> >>>> device?
> >>>
> >>> Because this only works on HDD.  On SSD's concurrent writes will often
> >>> be shunted to the same erase block, and if the whole erase block goes,
> >>> so do all of your copies.  This is why we default to 'single' for SSD's.
> >>>
> >>> The one thing I _do_ want to do is make better use of the backup roots.
> >>> Right now we always free the pinned extents once the transaction
> >>> commits, which makes the backup roots useless as we're likely to re-use
> >>> those blocks.
> >>
> >> IIRC Filipe tried this before and didn't go that direction due to ENOSPC.
> >> As we need to commit multiple transactions to free the pinned extents.
> >>
> >> But maybe the latest async pinned extent drop could solve the problem?
> >>
> >
> > Yeah before it was tricky, but with Nikolay's work it made async pinned
> > extent drop possible, I've been testing that patch internally.
> >
> > Now it's just a matter of keeping the last 4 transactions worth of
> > pinned around and only unpinning under enospc conditions.  I'll dig out
> > the async unpinning and send that up next week since that's already
> > valuable by itself, and then we can talk about wiring up the ENOSPC part
> > of it.  Thanks,
>
> That's really awesome, let's make btrfs the most bulletproof fs then!
>

Woohoo, yes! Go! 
David Sterba July 3, 2020, 9:16 a.m. UTC | #18
On Wed, Jul 01, 2020 at 09:16:55PM +0200, Alberto Bursi wrote:
> On 01/07/20 17:39, David Sterba wrote:
> > On Wed, Jul 01, 2020 at 05:22:18PM +0200, Lukas Straub wrote:
> >>> I'm not married to the rescue=onlyfs name, if we can think of something better
> >>> I'm good.
> >>
> >> Maybe you could go a step further and automatically switch to rescue
> >> mode if something is corrupt. This is easier for the user than having
> >> to remember the mount flags.
> > 
> > We don't want to do the auto-switching in general as it's a non-standard
> > situation.  It's better to get user attention than to silently mount
> > with limited capabilities and then let the user find out that something
> > went wrong, eg. system services randomly failing to start or work.
> 
> Eh. I'm sure stopping boot and dropping to initramfs shell is a great 
> way to get someone's attention.

I'm not saying it's pretty; these rescue tasks are only choosing the
lesser evil. As I understand it, users want a single option to do the
magic, i.e. fix all problems and continue, but the hard part is how to
implement that. The analysis is a necessary step, and it cannot always
be automated to the point where auto-repair will not cause more harm
than good.

> Afaik in mdadm or hardware raid the main way to notify the administrator 
> of issues is sending an email, or send the error through the server 
> fleet management software.

That's for monitoring tools; I'm not familiar with that topic, but it
seems it could be done. Let's say there's a problem during mount,
auto-repair tries something, and the system boots but is not in an ideal
state. The status is saved in the system log and perhaps in some sysfs
file to be grabbed any time later. The monitoring tools can then read
that and react (send email).
David Sterba July 3, 2020, 9:43 a.m. UTC | #19
On Wed, Jul 01, 2020 at 11:30:17PM -0400, Zygo Blaxell wrote:
> On Wed, Jul 01, 2020 at 03:53:40PM -0400, Josef Bacik wrote:
> > On 7/1/20 3:43 PM, waxhead wrote:
> > > Josef Bacik wrote:
> > > > One of the things that came up consistently in talking with Fedora about
> > > > switching to btrfs as default is that btrfs is particularly vulnerable
> > > > to metadata corruption.  If any of the core global roots are corrupted,
> > > > the fs is unmountable and fsck can't usually do anything for you without
> > > > some special options.
> > > > 
> > > > Qu addressed this sort of with rescue=skipbg, but that's poorly named as
> > > > what it really does is just allow you to operate without an extent root.
> > > > However there are a lot of other roots, and I'd rather not have to do
> > > > 
> > > > mount -o rescue=skipbg,rescue=nocsum,rescue=nofreespacetree,rescue=blah
> > > > 
> > > > Instead take his original idea and modify it so it just works for
> > > > everything.  Turn it into rescue=onlyfs, and then any major root we fail
> > > > to read just gets left empty and we carry on.
> > > > 
> > > > Obviously if the fs roots are screwed then the user is in trouble, but
> > > > otherwise this makes it much easier to pull stuff off the disk without
> > > > needing our special rescue tools.  I tested this with my TEST_DEV that
> > > > had a bunch of data on it by corrupting the csum tree and then reading
> > > > files off the disk.
> > > > 
> > > > Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> > > > ---
> > > 
> > > Just an idea inspired by RAID1c3 and RAID1c4: how about introducing
> > > DUP2 and/or even DUP3, making multiple copies of the metadata to increase
> > > the chance of recovering metadata even on a single storage device?
> 
> I don't think extra dup copies are very useful.  The disk firmware
> behavior that breaks 2-copy dup will break 3-copy and 4-copy too.
> 
> > Because this only works on HDD.  On SSD's concurrent writes will often be
> > shunted to the same erase block, and if the whole erase block goes, so do
> > all of your copies.  This is why we default to 'single' for SSD's.
> 
> That's true on higher end SSDs that have "data reduction" feature sets
> (compress and dedupe).  In theory these drives could dedupe metadata
> pages even if they are slightly different, so schemes like labelling the
> two dup copies won't work.  A sufficiently broken flash page will wipe
> out both metadata copies, possibly even if you arrange a delay buffer
> to write both copies several minutes apart.
> 
> On low-end SSD, though, there's not only no dedupe, there's also plenty of
> single-bit (or sector) errors that don't destroy both copies of metadata.
> It's one of the reasons why low-end drives have the write endurance of
> mayflies compared to their high-end counterparts--even with the same
> flash technology underneath, the high-end drives do clever things behind
> the scenes, while the low-end drives just write every sector they're told
> to, and the resulting TBW lifespan is very different.  I've had Kingston
> SSDs recover from several single-sector csum failure events in btrfs
> metadata blocks thanks to dup metadata.  Those would have been 'mkfs &
> start over' events had I used default mkfs options with single metadata.
> 
> Dup metadata should be the default on all single-drive filesystems.
> Single metadata should be reserved for extreme performance/robustness
> tradeoffs, like the nobarrier mount option, where the filesystem is not
> intended to survive a crash.  Dup metadata won't affect write endurance
> or robustness on drives that dedupe writes, but it can save a filesystem
> from destruction on drives that don't.
> 
> I think we might want to look into having some kind of delayed write
> buffer to spread out writes to the same disk over a sufficiently long
> period of time to defeat various firmware bugs.  e.g. I had a filesystem
> killed by WD Black (HDD) firmware when it mishandled a UNC sector and
> dropped the write cache during error handling.  If the two metadata copies
> had been written on either side of the UNC error, the filesystem would
> have survived, but since both metadata copies were destroyed by the
> firmware bug, the filesystem was lost.

From all that experience, the model of storage we've been working with
so far is insufficient: hardware is unreliable crap, and we need to
implement more filesystem features once we know how hardware creatively
fails.

The simplified model is that writes are not ordered, disks have caches
that need to be flushed, barriers can be relied on.

The errors that can happen are e.g. misdirected writes, lost writes and
bitflips; those should be covered by checksums and by storing the
generation and location of the data. Enough to catch bad hardware.

With the SSD optimizations it could become an arms race; so far we
avoid dup because of the drive deduplication. As you say, it depends on
class/model, and we've never had enough information, either for tools
or for users, to decide exactly what one can expect from the device.

If we want to go down the path of guessing what firmware does, delaying
writes, moving dup copies apart, that's still working with assumptions
and may help on average, at least until the drive firmware gets adapted
to it. But I'm not against implementing the countermeasures for the
time being.
David Sterba July 3, 2020, 9:47 a.m. UTC | #20
On Wed, Jul 01, 2020 at 03:53:40PM -0400, Josef Bacik wrote:
> On 7/1/20 3:43 PM, waxhead wrote:
> > Josef Bacik wrote:
> > Just an idea inspired by RAID1c3 and RAID1c4: how about introducing DUP2
> > and/or even DUP3, making multiple copies of the metadata to increase the chance
> > of recovering metadata even on a single storage device?
> 
> Because this only works on HDD.  On SSD's concurrent writes will often be 
> shunted to the same erase block, and if the whole erase block goes, so do all of 
> your copies.  This is why we default to 'single' for SSD's.
> 
> The one thing I _do_ want to do is make better use of the backup roots.

Oh please do, otherwise the backup roots are on track to be removed due
to the very small chance that they actually make things better. We've
heard complaints about that for a long time.
David Sterba July 7, 2020, 3:16 p.m. UTC | #21
On Wed, Jul 01, 2020 at 10:44:38AM -0400, Josef Bacik wrote:
> One of the things that came up consistently in talking with Fedora about
> switching to btrfs as default is that btrfs is particularly vulnerable
> to metadata corruption.  If any of the core global roots are corrupted,
> the fs is unmountable and fsck can't usually do anything for you without
> some special options.
> 
> Qu addressed this sort of with rescue=skipbg, but that's poorly named as
> what it really does is just allow you to operate without an extent root.
> However there are a lot of other roots, and I'd rather not have to do
> 
> mount -o rescue=skipbg,rescue=nocsum,rescue=nofreespacetree,rescue=blah
> 
> Instead take his original idea and modify it so it just works for
> everything.  Turn it into rescue=onlyfs, and then any major root we fail
> to read just gets left empty and we carry on.
> 
> Obviously if the fs roots are screwed then the user is in trouble, but
> otherwise this makes it much easier to pull stuff off the disk without
> needing our special rescue tools.  I tested this with my TEST_DEV that
> had a bunch of data on it by corrupting the csum tree and then reading
> files off the disk.
> 
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> ---
> 
> I'm not married to the rescue=onlyfs name, if we can think of something better
> I'm good.
> 
> Also rescue=skipbg is currently only sitting in misc-next, which is why I'm
> killing it with this patch, we haven't sent it upstream so we're good to change
> it now before it lands.

Right now we don't seem to have a consensus on what rescue= should or
should not do, and given that the patch is not in any released branch, I
can keep it in a for-next topic branch. We'll have something to test,
but until the final semantics are agreed on, I don't want to keep the
patch in misc-next, to avoid dealing with the fallout.

To be specific:

- patch "btrfs: introduce "rescue=" mount option" only wraps existing
  options so that one is probably ok to stay in misc-next

- rescue=skipbg would go to topic branch
waxhead July 7, 2020, 4:25 p.m. UTC | #22
David Sterba wrote:
> On Wed, Jul 01, 2020 at 10:44:38AM -0400, Josef Bacik wrote:
>> One of the things that came up consistently in talking with Fedora about
>> switching to btrfs as default is that btrfs is particularly vulnerable
>> to metadata corruption.  If any of the core global roots are corrupted,
>> the fs is unmountable and fsck can't usually do anything for you without
>> some special options.
>>
>> Qu addressed this sort of with rescue=skipbg, but that's poorly named as
>> what it really does is just allow you to operate without an extent root.
>> However there are a lot of other roots, and I'd rather not have to do
>>
>> mount -o rescue=skipbg,rescue=nocsum,rescue=nofreespacetree,rescue=blah
>>
>> Instead take his original idea and modify it so it just works for
>> everything.  Turn it into rescue=onlyfs, and then any major root we fail
>> to read just gets left empty and we carry on.
>>
>> Obviously if the fs roots are screwed then the user is in trouble, but
>> otherwise this makes it much easier to pull stuff off the disk without
>> needing our special rescue tools.  I tested this with my TEST_DEV that
>> had a bunch of data on it by corrupting the csum tree and then reading
>> files off the disk.
>>
>> Signed-off-by: Josef Bacik <josef@toxicpanda.com>
>> ---
>>
>> I'm not married to the rescue=onlyfs name, if we can think of something better
>> I'm good.
>>
>> Also rescue=skipbg is currently only sitting in misc-next, which is why I'm
>> killing it with this patch, we haven't sent it upstream so we're good to change
>> it now before it lands.
> 
> Right now we don't seem to have a consensus what rescue= should or
> should not do and given that the patch is not in any released branch I
> can keep it in for-next topic branch.  We'll have something to test but
> unless the final semantics is agreed on, I don't want to keep the patch
> in misc-next to avoid dealing with the fallouts.
> 
> To be specific:
> 
> - patch "btrfs: introduce "rescue=" mount option" only wraps existing
>    options so that one is probably ok to stay in misc-next
> 
> - rescue=skipbg would go to topic branch
> 
How about calling it recoverylevel=xyz or recoverymask=this,or,that?
Perhaps that would be better, with conservative settings as the default?

Patch
diff mbox series

diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
index 09b796a081dd..cb5608b2deec 100644
--- a/fs/btrfs/block-group.c
+++ b/fs/btrfs/block-group.c
@@ -2052,7 +2052,7 @@  int btrfs_read_block_groups(struct btrfs_fs_info *info)
 	int need_clear = 0;
 	u64 cache_gen;
 
-	if (btrfs_test_opt(info, SKIPBG))
+	if (btrfs_test_opt(info, ONLYFS))
 		return fill_dummy_bgs(info);
 
 	key.objectid = 0;
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index e40eb210670d..d888d08f7e0d 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1265,7 +1265,7 @@  static inline u32 BTRFS_MAX_XATTR_SIZE(const struct btrfs_fs_info *info)
 #define BTRFS_MOUNT_NOLOGREPLAY		(1 << 27)
 #define BTRFS_MOUNT_REF_VERIFY		(1 << 28)
 #define BTRFS_MOUNT_DISCARD_ASYNC	(1 << 29)
-#define BTRFS_MOUNT_SKIPBG		(1 << 30)
+#define BTRFS_MOUNT_ONLYFS		(1 << 30)
 
 #define BTRFS_DEFAULT_COMMIT_INTERVAL	(30)
 #define BTRFS_DEFAULT_MAX_INLINE	(2048)
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index c27022f13150..9cba0a66b3d5 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -2262,11 +2262,10 @@  static int btrfs_read_roots(struct btrfs_fs_info *fs_info)
 
 	root = btrfs_read_tree_root(tree_root, &location);
 	if (IS_ERR(root)) {
-		if (!btrfs_test_opt(fs_info, SKIPBG)) {
+		if (!btrfs_test_opt(fs_info, ONLYFS)) {
 			ret = PTR_ERR(root);
 			goto out;
 		}
-		fs_info->extent_root = NULL;
 	} else {
 		set_bit(BTRFS_ROOT_TRACK_DIRTY, &root->state);
 		fs_info->extent_root = root;
@@ -2275,21 +2274,27 @@  static int btrfs_read_roots(struct btrfs_fs_info *fs_info)
 	location.objectid = BTRFS_DEV_TREE_OBJECTID;
 	root = btrfs_read_tree_root(tree_root, &location);
 	if (IS_ERR(root)) {
-		ret = PTR_ERR(root);
-		goto out;
+		if (!btrfs_test_opt(fs_info, ONLYFS)) {
+			ret = PTR_ERR(root);
+			goto out;
+		}
+	} else {
+		set_bit(BTRFS_ROOT_TRACK_DIRTY, &root->state);
+		fs_info->dev_root = root;
+		btrfs_init_devices_late(fs_info);
 	}
-	set_bit(BTRFS_ROOT_TRACK_DIRTY, &root->state);
-	fs_info->dev_root = root;
-	btrfs_init_devices_late(fs_info);
 
 	location.objectid = BTRFS_CSUM_TREE_OBJECTID;
 	root = btrfs_read_tree_root(tree_root, &location);
 	if (IS_ERR(root)) {
-		ret = PTR_ERR(root);
-		goto out;
+		if (!btrfs_test_opt(fs_info, ONLYFS)) {
+			ret = PTR_ERR(root);
+			goto out;
+		}
+	} else {
+		set_bit(BTRFS_ROOT_TRACK_DIRTY, &root->state);
+		fs_info->csum_root = root;
 	}
-	set_bit(BTRFS_ROOT_TRACK_DIRTY, &root->state);
-	fs_info->csum_root = root;
 
 	/*
 	 * This tree can share blocks with some other fs tree during relocation
@@ -2298,11 +2303,14 @@  static int btrfs_read_roots(struct btrfs_fs_info *fs_info)
 	root = btrfs_get_fs_root(tree_root->fs_info,
 				 BTRFS_DATA_RELOC_TREE_OBJECTID, true);
 	if (IS_ERR(root)) {
-		ret = PTR_ERR(root);
-		goto out;
+		if (!btrfs_test_opt(fs_info, ONLYFS)) {
+			ret = PTR_ERR(root);
+			goto out;
+		}
+	} else {
+		set_bit(BTRFS_ROOT_TRACK_DIRTY, &root->state);
+		fs_info->data_reloc_root = root;
 	}
-	set_bit(BTRFS_ROOT_TRACK_DIRTY, &root->state);
-	fs_info->data_reloc_root = root;
 
 	location.objectid = BTRFS_QUOTA_TREE_OBJECTID;
 	root = btrfs_read_tree_root(tree_root, &location);
@@ -2315,9 +2323,11 @@  static int btrfs_read_roots(struct btrfs_fs_info *fs_info)
 	location.objectid = BTRFS_UUID_TREE_OBJECTID;
 	root = btrfs_read_tree_root(tree_root, &location);
 	if (IS_ERR(root)) {
-		ret = PTR_ERR(root);
-		if (ret != -ENOENT)
-			goto out;
+		if (!btrfs_test_opt(fs_info, ONLYFS)) {
+			ret = PTR_ERR(root);
+			if (ret != -ENOENT)
+				goto out;
+		}
 	} else {
 		set_bit(BTRFS_ROOT_TRACK_DIRTY, &root->state);
 		fs_info->uuid_root = root;
@@ -2327,11 +2337,14 @@  static int btrfs_read_roots(struct btrfs_fs_info *fs_info)
 		location.objectid = BTRFS_FREE_SPACE_TREE_OBJECTID;
 		root = btrfs_read_tree_root(tree_root, &location);
 		if (IS_ERR(root)) {
-			ret = PTR_ERR(root);
-			goto out;
+			if (!btrfs_test_opt(fs_info, ONLYFS)) {
+				ret = PTR_ERR(root);
+				goto out;
+			}
+		}  else {
+			set_bit(BTRFS_ROOT_TRACK_DIRTY, &root->state);
+			fs_info->free_space_root = root;
 		}
-		set_bit(BTRFS_ROOT_TRACK_DIRTY, &root->state);
-		fs_info->free_space_root = root;
 	}
 
 	return 0;
@@ -3047,20 +3060,11 @@  int __cold open_ctree(struct super_block *sb, struct btrfs_fs_devices *fs_device
 	}
 
 	/* Skip bg needs RO and no tree-log to replay */
-	if (btrfs_test_opt(fs_info, SKIPBG)) {
-		if (!sb_rdonly(sb)) {
-			btrfs_err(fs_info,
-			"rescue=skipbg can only be used on read-only mount");
-			err = -EINVAL;
-			goto fail_alloc;
-		}
-		if (btrfs_super_log_root(disk_super) &&
-		    !btrfs_test_opt(fs_info, NOLOGREPLAY)) {
-			btrfs_err(fs_info,
-"rescue=skipbg must be used with rescue=nologreplay when tree-log needs to replayed");
-			err = -EINVAL;
-			goto fail_alloc;
-		}
+	if (btrfs_test_opt(fs_info, ONLYFS) && !sb_rdonly(sb)) {
+		btrfs_err(fs_info,
+			  "rescue=onlyfs can only be used on read-only mount");
+		err = -EINVAL;
+		goto fail_alloc;
 	}
 
 	ret = btrfs_init_workqueues(fs_info, fs_devices);
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index d301550b9c70..9f8ef22ac65e 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -2209,7 +2209,8 @@  static blk_status_t btrfs_submit_bio_hook(struct inode *inode, struct bio *bio,
 	int skip_sum;
 	int async = !atomic_read(&BTRFS_I(inode)->sync_writers);
 
-	skip_sum = BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM;
+	skip_sum = (BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM) ||
+		btrfs_test_opt(fs_info, ONLYFS);
 
 	if (btrfs_is_free_space_inode(BTRFS_I(inode)))
 		metadata = BTRFS_WQ_ENDIO_FREE_SPACE;
@@ -2866,6 +2867,9 @@  static int btrfs_readpage_end_io_hook(struct btrfs_io_bio *io_bio,
 	if (BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM)
 		return 0;
 
+	if (btrfs_test_opt(root->fs_info, ONLYFS))
+		return 0;
+
 	if (root->root_key.objectid == BTRFS_DATA_RELOC_TREE_OBJECTID &&
 	    test_range_bit(io_tree, start, end, EXTENT_NODATASUM, 1, NULL)) {
 		clear_extent_bits(io_tree, start, end, EXTENT_NODATASUM);
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 3c9ebd4f2b61..7ea9f8f53156 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -345,7 +345,7 @@  enum {
 	Opt_rescue,
 	Opt_usebackuproot,
 	Opt_nologreplay,
-	Opt_rescue_skipbg,
+	Opt_rescue_onlyfs,
 
 	/* Deprecated options */
 	Opt_alloc_start,
@@ -445,7 +445,7 @@  static const match_table_t tokens = {
 static const match_table_t rescue_tokens = {
 	{Opt_usebackuproot, "usebackuproot"},
 	{Opt_nologreplay, "nologreplay"},
-	{Opt_rescue_skipbg, "skipbg"},
+	{Opt_rescue_onlyfs, "onlyfs"},
 	{Opt_err, NULL},
 };
 
@@ -478,9 +478,10 @@  static int parse_rescue_options(struct btrfs_fs_info *info, const char *options)
 			btrfs_set_and_info(info, NOLOGREPLAY,
 					   "disabling log replay at mount time");
 			break;
-		case Opt_rescue_skipbg:
-			btrfs_set_and_info(info, SKIPBG,
-				"skip mount time block group searching");
+		case Opt_rescue_onlyfs:
+			btrfs_set_and_info(info, ONLYFS,
+					   "only reading fs roots, also setting  nologreplay");
+			btrfs_set_opt(info->mount_opt, NOLOGREPLAY);
 			break;
 		case Opt_err:
 			btrfs_info(info, "unrecognized rescue option '%s'", p);
@@ -1418,8 +1419,8 @@  static int btrfs_show_options(struct seq_file *seq, struct dentry *dentry)
 		seq_puts(seq, ",notreelog");
 	if (btrfs_test_opt(info, NOLOGREPLAY))
 		seq_puts(seq, ",rescue=nologreplay");
-	if (btrfs_test_opt(info, SKIPBG))
-		seq_puts(seq, ",rescue=skipbg");
+	if (btrfs_test_opt(info, ONLYFS))
+		seq_puts(seq, ",rescue=onlyfs");
 	if (btrfs_test_opt(info, FLUSHONCOMMIT))
 		seq_puts(seq, ",flushoncommit");
 	if (btrfs_test_opt(info, DISCARD_SYNC))
@@ -1859,10 +1860,10 @@  static int btrfs_remount(struct super_block *sb, int *flags, char *data)
 	if (ret)
 		goto restore;
 
-	if (btrfs_test_opt(fs_info, SKIPBG) !=
-	    (old_opts & BTRFS_MOUNT_SKIPBG)) {
+	if (btrfs_test_opt(fs_info, ONLYFS) !=
+	    (old_opts & BTRFS_MOUNT_ONLYFS)) {
 		btrfs_err(fs_info,
-		"rescue=skipbg mount option can't be changed during remount");
+		"rescue=onlyfs mount option can't be changed during remount");
 		ret = -EINVAL;
 		goto restore;
 	}
@@ -1932,9 +1933,9 @@  static int btrfs_remount(struct super_block *sb, int *flags, char *data)
 			goto restore;
 		}
 
-		if (btrfs_test_opt(fs_info, SKIPBG)) {
+		if (btrfs_test_opt(fs_info, ONLYFS)) {
 			btrfs_err(fs_info,
-		"remounting read-write with rescue=skipbg is not allowed");
+		"remounting read-write with rescue=onlyfs is not allowed");
 			ret = -EINVAL;
 			goto restore;
 		}
@@ -2245,7 +2246,7 @@  static int btrfs_statfs(struct dentry *dentry, struct kstatfs *buf)
 	 *
 	 * Or if we're rescuing, set available to 0 anyway.
 	 */
-	if (btrfs_test_opt(fs_info, SKIPBG) ||
+	if (btrfs_test_opt(fs_info, ONLYFS) ||
 	    (!mixed && block_rsv->space_info->full &&
 	     total_free_meta - thresh < block_rsv->size))
 		buf->f_bavail = 0;
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index aabc6c922e04..a5d124f95ce2 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -7595,10 +7595,10 @@  int btrfs_verify_dev_extents(struct btrfs_fs_info *fs_info)
 	int ret = 0;
 
 	/*
-	 * For rescue=skipbg mount option, we're already RO and are salvaging
+	 * For rescue=onlyfs mount option, we're already RO and are salvaging
 	 * data, no need for such strict check.
 	 */
-	if (btrfs_test_opt(fs_info, SKIPBG))
+	if (btrfs_test_opt(fs_info, ONLYFS))
 		return 0;
 
 	key.objectid = 1;