[v2] btrfs: introduce rescue=onlyfs
diff mbox series

Message ID 20200721151057.9325-1-josef@toxicpanda.com
State New
Headers show
Series
  • [v2] btrfs: introduce rescue=onlyfs
Related show

Commit Message

Josef Bacik July 21, 2020, 3:10 p.m. UTC
One of the things that came up consistently in talking with Fedora about
switching to btrfs as default is that btrfs is particularly vulnerable
to metadata corruption.  If any of the core global roots are corrupted,
the fs is unmountable and fsck can't usually do anything for you without
some special options.

Qu addressed this sort of with rescue=skipbg, but that's poorly named as
what it really does is just allow you to operate without an extent root.
However there are a lot of other roots, and I'd rather not have to do

mount -o rescue=skipbg,rescue=nocsum,rescue=nofreespacetree,rescue=blah

Instead take his original idea and modify it so it just works for
everything.  Turn it into rescue=onlyfs, and then any major root we fail
to read just gets left empty and we carry on.

Obviously if the fs roots are screwed then the user is in trouble, but
otherwise this makes it much easier to pull stuff off the disk without
needing our special rescue tools.  I tested this with my TEST_DEV that
had a bunch of data on it by corrupting the csum tree and then reading
files off the disk.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
v1->v2:
- Rebase onto recent misc-next.
- Add in the fill_dummy_bgs since skipbg is no longer in this tree.

 fs/btrfs/block-group.c | 49 +++++++++++++++++++++++++++++
 fs/btrfs/ctree.h       |  1 +
 fs/btrfs/disk-io.c     | 71 +++++++++++++++++++++++++++++-------------
 fs/btrfs/inode.c       |  6 +++-
 fs/btrfs/super.c       | 29 +++++++++++++++--
 fs/btrfs/volumes.c     |  7 +++++
 6 files changed, 138 insertions(+), 25 deletions(-)

Comments

Graham Cobb July 21, 2020, 3:56 p.m. UTC | #1
On 21/07/2020 16:10, Josef Bacik wrote:
> One of the things that came up consistently in talking with Fedora about
> switching to btrfs as default is that btrfs is particularly vulnerable
> to metadata corruption.  If any of the core global roots are corrupted,
> the fs is unmountable and fsck can't usually do anything for you without
> some special options.
> 
> Qu addressed this sort of with rescue=skipbg, but that's poorly named as
> what it really does is just allow you to operate without an extent root.
> However there are a lot of other roots, and I'd rather not have to do
> 
> mount -o rescue=skipbg,rescue=nocsum,rescue=nofreespacetree,rescue=blah
> 
> Instead take his original idea and modify it so it just works for
> everything.  Turn it into rescue=onlyfs, and then any major root we fail
> to read just gets left empty and we carry on.

Am I the only one who dislikes the name? "onlyfs" does not seem at all
meaningful to me, as a system manager - the people it is apparently
aimed at. I really don't understand what it is supposed to mean and it
sounds like some developer debugging option or something.

If it means "only filesystem" that doesn't make sense to me - the whole
thing is the filesystem. I guess "only data" might be more meaningful
but if the aim is to turn on as much recovery as possible to help the
user to save their data then why not just say so?

Something like "rescue=max", "rescue=recoverymode", "rescue=dataonly",
"rescue=ignoreallerrors" or "rescue=emergency" might be more meaningful.
David Sterba July 21, 2020, 5:16 p.m. UTC | #2
On Tue, Jul 21, 2020 at 04:56:55PM +0100, Graham Cobb wrote:
> On 21/07/2020 16:10, Josef Bacik wrote:
> > One of the things that came up consistently in talking with Fedora about
> > switching to btrfs as default is that btrfs is particularly vulnerable
> > to metadata corruption.  If any of the core global roots are corrupted,
> > the fs is unmountable and fsck can't usually do anything for you without
> > some special options.
> > 
> > Qu addressed this sort of with rescue=skipbg, but that's poorly named as
> > what it really does is just allow you to operate without an extent root.
> > However there are a lot of other roots, and I'd rather not have to do
> > 
> > mount -o rescue=skipbg,rescue=nocsum,rescue=nofreespacetree,rescue=blah
> > 
> > Instead take his original idea and modify it so it just works for
> > everything.  Turn it into rescue=onlyfs, and then any major root we fail
> > to read just gets left empty and we carry on.
> 
> Am I the only one who dislikes the name? "onlyfs" does not seem at all
> meaningful to me, as a system manager - the people it is apparently
> aimed at. I really don't understand what it is supposed to mean and it
> sounds like some developer debugging option or something.

No, you're not the only one. Changelog points to 'skipbg' as poor naming
choice but 'onlyfs' is IMHO just as bad because it's supposed to be used
by users so the naming need to take that into account.

> If it means "only filesystem" that doesn't make sense to me - the whole
> thing is the filesystem. I guess "only data" might be more meaningful
> but if the aim is to turn on as much recovery as possible to help the
> user to save their data then why not just say so?
> 
> Something like "rescue=max", "rescue=recoverymode", "rescue=dataonly",
> "rescue=ignoreallerrors" or "rescue=emergency" might be more meaningful.

From user perspective the option should have a high level semantics,
like you suggest above. We should add individual options to try to work
around specific damage if not just for testing purposes, having more
flexibility is a good thing.
Josef Bacik July 21, 2020, 5:25 p.m. UTC | #3
On 7/21/20 1:16 PM, David Sterba wrote:
> On Tue, Jul 21, 2020 at 04:56:55PM +0100, Graham Cobb wrote:
>> On 21/07/2020 16:10, Josef Bacik wrote:
>>> One of the things that came up consistently in talking with Fedora about
>>> switching to btrfs as default is that btrfs is particularly vulnerable
>>> to metadata corruption.  If any of the core global roots are corrupted,
>>> the fs is unmountable and fsck can't usually do anything for you without
>>> some special options.
>>>
>>> Qu addressed this sort of with rescue=skipbg, but that's poorly named as
>>> what it really does is just allow you to operate without an extent root.
>>> However there are a lot of other roots, and I'd rather not have to do
>>>
>>> mount -o rescue=skipbg,rescue=nocsum,rescue=nofreespacetree,rescue=blah
>>>
>>> Instead take his original idea and modify it so it just works for
>>> everything.  Turn it into rescue=onlyfs, and then any major root we fail
>>> to read just gets left empty and we carry on.
>>
>> Am I the only one who dislikes the name? "onlyfs" does not seem at all
>> meaningful to me, as a system manager - the people it is apparently
>> aimed at. I really don't understand what it is supposed to mean and it
>> sounds like some developer debugging option or something.
> 
> No, you're not the only one. Changelog points to 'skipbg' as poor naming
> choice but 'onlyfs' is IMHO just as bad because it's supposed to be used
> by users so the naming need to take that into account.

I'm not married to the name, I think its awful but I had no better ideas nor 
suggestions from the last time I posted, tho I'm partial to 
rescue=allthefuckingthings.  However I'm fine with rescue=max also, is that 
everybody's favorite?  Thanks,

Josef
Graham Cobb July 21, 2020, 5:34 p.m. UTC | #4
On 21/07/2020 18:16, David Sterba wrote:
> On Tue, Jul 21, 2020 at 04:56:55PM +0100, Graham Cobb wrote:
>> If it means "only filesystem" that doesn't make sense to me - the whole
>> thing is the filesystem. I guess "only data" might be more meaningful
>> but if the aim is to turn on as much recovery as possible to help the
>> user to save their data then why not just say so?
>>
>> Something like "rescue=max", "rescue=recoverymode", "rescue=dataonly",
>> "rescue=ignoreallerrors" or "rescue=emergency" might be more meaningful.
> 
> From user perspective the option should have a high level semantics,
> like you suggest above. We should add individual options to try to work
> around specific damage if not just for testing purposes, having more
> flexibility is a good thing.

I would also prefer not to have checksum checking disabled by this "try
harder" option. I would imagine turning on "ignore whatever checks you
can to get me my data back mode", retrieving all the readable data with
valid checksums and getting errors for things which cannot be verified.
Then I would make a decision as to whether to enable another option to
even provide files which the filesystem cannot guarantee have not been
corrupted because it can't check checksums. Even if that is all the
files (because the checksum tree is destroyed) I should have to make an
explicit acknowledgement that I want that.
Josef Bacik July 21, 2020, 5:40 p.m. UTC | #5
On 7/21/20 1:34 PM, Graham Cobb wrote:
> On 21/07/2020 18:16, David Sterba wrote:
>> On Tue, Jul 21, 2020 at 04:56:55PM +0100, Graham Cobb wrote:
>>> If it means "only filesystem" that doesn't make sense to me - the whole
>>> thing is the filesystem. I guess "only data" might be more meaningful
>>> but if the aim is to turn on as much recovery as possible to help the
>>> user to save their data then why not just say so?
>>>
>>> Something like "rescue=max", "rescue=recoverymode", "rescue=dataonly",
>>> "rescue=ignoreallerrors" or "rescue=emergency" might be more meaningful.
>>
>>  From user perspective the option should have a high level semantics,
>> like you suggest above. We should add individual options to try to work
>> around specific damage if not just for testing purposes, having more
>> flexibility is a good thing.
> 
> I would also prefer not to have checksum checking disabled by this "try
> harder" option. I would imagine turning on "ignore whatever checks you
> can to get me my data back mode", retrieving all the readable data with
> valid checksums and getting errors for things which cannot be verified.
> Then I would make a decision as to whether to enable another option to
> even provide files which the filesystem cannot guarantee have not been
> corrupted because it can't check checksums. Even if that is all the
> files (because the checksum tree is destroyed) I should have to make an
> explicit acknowledgement that I want that.
> 

If somebody wants to add finer grained stuff later I'm fine with that, right now 
I'm addressing the case where a lot of things are dead.  Thanks,

Josef
Chris Murphy July 21, 2020, 9:20 p.m. UTC | #6
On Tue, Jul 21, 2020 at 11:25 AM Josef Bacik <josef@toxicpanda.com> wrote:
>
> On 7/21/20 1:16 PM, David Sterba wrote:

> > No, you're not the only one. Changelog points to 'skipbg' as poor naming
> > choice but 'onlyfs' is IMHO just as bad because it's supposed to be used
> > by users so the naming need to take that into account.
>
> I'm not married to the name, I think its awful but I had no better ideas nor
> suggestions from the last time I posted, tho I'm partial to
> rescue=allthefuckingthings.  However I'm fine with rescue=max also, is that
> everybody's favorite?  Thanks,

Or just rescue?

And then rescue= to specifically list the trees that are to be
ignored? Separated by comma or colon or whatever?
Chris Murphy July 21, 2020, 9:25 p.m. UTC | #7
On Tue, Jul 21, 2020 at 11:35 AM Graham Cobb <g.btrfs@cobb.uk.net> wrote:
>
> On 21/07/2020 18:16, David Sterba wrote:
> > On Tue, Jul 21, 2020 at 04:56:55PM +0100, Graham Cobb wrote:
> >> If it means "only filesystem" that doesn't make sense to me - the whole
> >> thing is the filesystem. I guess "only data" might be more meaningful
> >> but if the aim is to turn on as much recovery as possible to help the
> >> user to save their data then why not just say so?
> >>
> >> Something like "rescue=max", "rescue=recoverymode", "rescue=dataonly",
> >> "rescue=ignoreallerrors" or "rescue=emergency" might be more meaningful.
> >
> > From user perspective the option should have a high level semantics,
> > like you suggest above. We should add individual options to try to work
> > around specific damage if not just for testing purposes, having more
> > flexibility is a good thing.
>
> I would also prefer not to have checksum checking disabled by this "try
> harder" option. I would imagine turning on "ignore whatever checks you
> can to get me my data back mode", retrieving all the readable data with
> valid checksums and getting errors for things which cannot be verified.
> Then I would make a decision as to whether to enable another option to
> even provide files which the filesystem cannot guarantee have not been
> corrupted because it can't check checksums. Even if that is all the
> files (because the checksum tree is destroyed) I should have to make an
> explicit acknowledgement that I want that.

It would be nice if this rescue=all mount option, continues to spit
out noisy and scary warnings about the problems encountered. Including
corrupt files with path to the corrupt file. Just avoid face planting.
I assume rescue=all implies ro (possibly also nologreplay).
Qu Wenruo July 21, 2020, 11:05 p.m. UTC | #8
On 2020/7/21 下午11:10, Josef Bacik wrote:
> One of the things that came up consistently in talking with Fedora about
> switching to btrfs as default is that btrfs is particularly vulnerable
> to metadata corruption.  If any of the core global roots are corrupted,
> the fs is unmountable and fsck can't usually do anything for you without
> some special options.
> 
> Qu addressed this sort of with rescue=skipbg, but that's poorly named as
> what it really does is just allow you to operate without an extent root.
> However there are a lot of other roots, and I'd rather not have to do
> 
> mount -o rescue=skipbg,rescue=nocsum,rescue=nofreespacetree,rescue=blah
> 
> Instead take his original idea and modify it so it just works for
> everything.  Turn it into rescue=onlyfs, and then any major root we fail
> to read just gets left empty and we carry on.
> 
> Obviously if the fs roots are screwed then the user is in trouble, but
> otherwise this makes it much easier to pull stuff off the disk without
> needing our special rescue tools.  I tested this with my TEST_DEV that
> had a bunch of data on it by corrupting the csum tree and then reading
> files off the disk.
> 
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> ---
> v1->v2:
> - Rebase onto recent misc-next.
> - Add in the fill_dummy_bgs since skipbg is no longer in this tree.
> 
>  fs/btrfs/block-group.c | 49 +++++++++++++++++++++++++++++
>  fs/btrfs/ctree.h       |  1 +
>  fs/btrfs/disk-io.c     | 71 +++++++++++++++++++++++++++++-------------
>  fs/btrfs/inode.c       |  6 +++-
>  fs/btrfs/super.c       | 29 +++++++++++++++--
>  fs/btrfs/volumes.c     |  7 +++++
>  6 files changed, 138 insertions(+), 25 deletions(-)
> 
> diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
> index 884de28a41e4..416fc419fd95 100644
> --- a/fs/btrfs/block-group.c
> +++ b/fs/btrfs/block-group.c
> @@ -1997,6 +1997,52 @@ static int read_one_block_group(struct btrfs_fs_info *info,
>  	return ret;
>  }
>  
> +static int fill_dummy_bgs(struct btrfs_fs_info *fs_info)
> +{
> +	struct extent_map_tree *em_tree = &fs_info->mapping_tree;
> +	struct extent_map *em;
> +	struct map_lookup *map;
> +	struct btrfs_block_group *bg;
> +	struct btrfs_space_info *space_info;
> +	struct rb_node *node;
> +	int ret = 0;
> +
> +	read_lock(&em_tree->lock);
> +	for (node = rb_first_cached(&em_tree->map); node;
> +	     node = rb_next(node)) {
> +		em = rb_entry(node, struct extent_map, rb_node);
> +		map = em->map_lookup;
> +		bg = btrfs_create_block_group_cache(fs_info, em->start);
> +		if (!bg) {
> +			ret = -ENOMEM;
> +			goto out;
> +		}
> +
> +		/* Fill dummy cache as FULL */
> +		bg->length = em->len;
> +		bg->flags = map->type;
> +		bg->last_byte_to_unpin = (u64)-1;
> +		bg->cached = BTRFS_CACHE_FINISHED;
> +		bg->used = em->len;
> +		bg->flags = map->type;
> +		ret = btrfs_add_block_group_cache(fs_info, bg);
> +		if (ret) {
> +			btrfs_remove_free_space_cache(bg);
> +			btrfs_put_block_group(bg);
> +			goto out;
> +		}
> +		btrfs_update_space_info(fs_info, bg->flags, em->len, em->len,
> +					0, &space_info);
> +		bg->space_info = space_info;
> +		link_block_group(bg);
> +
> +		set_avail_alloc_bits(fs_info, bg->flags);
> +	}
> +out:
> +	read_unlock(&em_tree->lock);
> +	return ret;
> +}
> +
>  int btrfs_read_block_groups(struct btrfs_fs_info *info)
>  {
>  	struct btrfs_path *path;
> @@ -2007,6 +2053,9 @@ int btrfs_read_block_groups(struct btrfs_fs_info *info)
>  	int need_clear = 0;
>  	u64 cache_gen;
>  
> +	if (btrfs_test_opt(info, ONLYFS))
> +		return fill_dummy_bgs(info);
> +
>  	key.objectid = 0;
>  	key.offset = 0;
>  	key.type = BTRFS_BLOCK_GROUP_ITEM_KEY;
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index b70c2024296f..0be7d9461443 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -1266,6 +1266,7 @@ static inline u32 BTRFS_MAX_XATTR_SIZE(const struct btrfs_fs_info *info)
>  #define BTRFS_MOUNT_NOLOGREPLAY		(1 << 27)
>  #define BTRFS_MOUNT_REF_VERIFY		(1 << 28)
>  #define BTRFS_MOUNT_DISCARD_ASYNC	(1 << 29)
> +#define BTRFS_MOUNT_ONLYFS		(1 << 30)
>  
>  #define BTRFS_DEFAULT_COMMIT_INTERVAL	(30)
>  #define BTRFS_DEFAULT_MAX_INLINE	(2048)
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index c850d7f44fbe..0dffa4c12d8c 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -2326,8 +2326,13 @@ static int btrfs_read_roots(struct btrfs_fs_info *fs_info)
>  
>  	root = btrfs_read_tree_root(tree_root, &location);
>  	if (IS_ERR(root)) {
> -		ret = PTR_ERR(root);
> -		goto out;
> +		if (!btrfs_test_opt(fs_info, ONLYFS)) {
> +			ret = PTR_ERR(root);
> +			goto out;
> +		}
> +	} else {
> +		set_bit(BTRFS_ROOT_TRACK_DIRTY, &root->state);
> +		fs_info->extent_root = root;
>  	}
>  	set_bit(BTRFS_ROOT_TRACK_DIRTY, &root->state);
>  	fs_info->extent_root = root;
> @@ -2335,21 +2340,27 @@ static int btrfs_read_roots(struct btrfs_fs_info *fs_info)
>  	location.objectid = BTRFS_DEV_TREE_OBJECTID;
>  	root = btrfs_read_tree_root(tree_root, &location);
>  	if (IS_ERR(root)) {
> -		ret = PTR_ERR(root);
> -		goto out;
> +		if (!btrfs_test_opt(fs_info, ONLYFS)) {
> +			ret = PTR_ERR(root);
> +			goto out;
> +		}
> +	} else {
> +		set_bit(BTRFS_ROOT_TRACK_DIRTY, &root->state);
> +		fs_info->dev_root = root;
> +		btrfs_init_devices_late(fs_info);
>  	}
> -	set_bit(BTRFS_ROOT_TRACK_DIRTY, &root->state);
> -	fs_info->dev_root = root;
> -	btrfs_init_devices_late(fs_info);
>  
>  	location.objectid = BTRFS_CSUM_TREE_OBJECTID;
>  	root = btrfs_read_tree_root(tree_root, &location);
>  	if (IS_ERR(root)) {
> -		ret = PTR_ERR(root);
> -		goto out;
> +		if (!btrfs_test_opt(fs_info, ONLYFS)) {
> +			ret = PTR_ERR(root);
> +			goto out;
> +		}
> +	} else {
> +		set_bit(BTRFS_ROOT_TRACK_DIRTY, &root->state);
> +		fs_info->csum_root = root;
>  	}
> -	set_bit(BTRFS_ROOT_TRACK_DIRTY, &root->state);
> -	fs_info->csum_root = root;
>  
>  	/*
>  	 * This tree can share blocks with some other fs tree during relocation
> @@ -2358,11 +2369,14 @@ static int btrfs_read_roots(struct btrfs_fs_info *fs_info)
>  	root = btrfs_get_fs_root(tree_root->fs_info,
>  				 BTRFS_DATA_RELOC_TREE_OBJECTID, true);
>  	if (IS_ERR(root)) {
> -		ret = PTR_ERR(root);
> -		goto out;
> +		if (!btrfs_test_opt(fs_info, ONLYFS)) {
> +			ret = PTR_ERR(root);
> +			goto out;
> +		}
> +	} else {
> +		set_bit(BTRFS_ROOT_TRACK_DIRTY, &root->state);
> +		fs_info->data_reloc_root = root;
>  	}
> -	set_bit(BTRFS_ROOT_TRACK_DIRTY, &root->state);
> -	fs_info->data_reloc_root = root;
>  
>  	location.objectid = BTRFS_QUOTA_TREE_OBJECTID;
>  	root = btrfs_read_tree_root(tree_root, &location);
> @@ -2375,9 +2389,11 @@ static int btrfs_read_roots(struct btrfs_fs_info *fs_info)
>  	location.objectid = BTRFS_UUID_TREE_OBJECTID;
>  	root = btrfs_read_tree_root(tree_root, &location);
>  	if (IS_ERR(root)) {
> -		ret = PTR_ERR(root);
> -		if (ret != -ENOENT)
> -			goto out;
> +		if (!btrfs_test_opt(fs_info, ONLYFS)) {
> +			ret = PTR_ERR(root);
> +			if (ret != -ENOENT)
> +				goto out;
> +		}
>  	} else {
>  		set_bit(BTRFS_ROOT_TRACK_DIRTY, &root->state);
>  		fs_info->uuid_root = root;
> @@ -2387,11 +2403,14 @@ static int btrfs_read_roots(struct btrfs_fs_info *fs_info)
>  		location.objectid = BTRFS_FREE_SPACE_TREE_OBJECTID;
>  		root = btrfs_read_tree_root(tree_root, &location);
>  		if (IS_ERR(root)) {
> -			ret = PTR_ERR(root);
> -			goto out;
> +			if (!btrfs_test_opt(fs_info, ONLYFS)) {
> +				ret = PTR_ERR(root);
> +				goto out;
> +			}
> +		}  else {
> +			set_bit(BTRFS_ROOT_TRACK_DIRTY, &root->state);
> +			fs_info->free_space_root = root;
>  		}
> -		set_bit(BTRFS_ROOT_TRACK_DIRTY, &root->state);
> -		fs_info->free_space_root = root;
>  	}
>  
>  	return 0;
> @@ -3106,6 +3125,14 @@ int __cold open_ctree(struct super_block *sb, struct btrfs_fs_devices *fs_device
>  		goto fail_alloc;
>  	}
>  
> +	/* Skip bg needs RO and no tree-log to replay */

Comment is still the skip-bg version.

> +	if (btrfs_test_opt(fs_info, ONLYFS) && !sb_rdonly(sb)) {
> +		btrfs_err(fs_info,
> +			  "rescue=onlyfs can only be used on read-only mount");
> +		err = -EINVAL;
> +		goto fail_alloc;
> +	}
> +
>  	ret = btrfs_init_workqueues(fs_info, fs_devices);
>  	if (ret) {
>  		err = ret;
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 611b3412fbfd..49cd3eba651e 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -2191,7 +2191,8 @@ static blk_status_t btrfs_submit_bio_hook(struct inode *inode, struct bio *bio,
>  	int skip_sum;
>  	int async = !atomic_read(&BTRFS_I(inode)->sync_writers);
>  
> -	skip_sum = BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM;
> +	skip_sum = (BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM) ||
> +		btrfs_test_opt(fs_info, ONLYFS);

This means, if onlyfs is spcified, csum is completely skipped even it
matches.

I'm not yet determined whether it's preferred.

On one hand, even at recovery, user may want csum verification to detect
bad copy if possible, but on the other hand, since user is doing data
salvage, bothering csum could only lead to extra error.

Despite that, the patch looks pretty good and indeed an enhancement to
skipbg.

Thanks,
Qu
>  
>  	if (btrfs_is_free_space_inode(BTRFS_I(inode)))
>  		metadata = BTRFS_WQ_ENDIO_FREE_SPACE;
> @@ -2846,6 +2847,9 @@ static int btrfs_readpage_end_io_hook(struct btrfs_io_bio *io_bio,
>  	if (BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM)
>  		return 0;
>  
> +	if (btrfs_test_opt(root->fs_info, ONLYFS))
> +		return 0;
> +
>  	if (root->root_key.objectid == BTRFS_DATA_RELOC_TREE_OBJECTID &&
>  	    test_range_bit(io_tree, start, end, EXTENT_NODATASUM, 1, NULL)) {
>  		clear_extent_bits(io_tree, start, end, EXTENT_NODATASUM);
> diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
> index 58f890f73650..4dc3f35ca7e3 100644
> --- a/fs/btrfs/super.c
> +++ b/fs/btrfs/super.c
> @@ -345,6 +345,7 @@ enum {
>  	Opt_rescue,
>  	Opt_usebackuproot,
>  	Opt_nologreplay,
> +	Opt_rescue_onlyfs,
>  
>  	/* Deprecated options */
>  	Opt_recovery,
> @@ -440,6 +441,7 @@ static const match_table_t tokens = {
>  static const match_table_t rescue_tokens = {
>  	{Opt_usebackuproot, "usebackuproot"},
>  	{Opt_nologreplay, "nologreplay"},
> +	{Opt_rescue_onlyfs, "onlyfs"},
>  	{Opt_err, NULL},
>  };
>  
> @@ -472,6 +474,11 @@ static int parse_rescue_options(struct btrfs_fs_info *info, const char *options)
>  			btrfs_set_and_info(info, NOLOGREPLAY,
>  					   "disabling log replay at mount time");
>  			break;
> +		case Opt_rescue_onlyfs:
> +			btrfs_set_and_info(info, ONLYFS,
> +					   "only reading fs roots, also setting  nologreplay");
> +			btrfs_set_opt(info->mount_opt, NOLOGREPLAY);
> +			break;
>  		case Opt_err:
>  			btrfs_info(info, "unrecognized rescue option '%s'", p);
>  			ret = -EINVAL;
> @@ -1400,6 +1407,8 @@ static int btrfs_show_options(struct seq_file *seq, struct dentry *dentry)
>  		seq_puts(seq, ",notreelog");
>  	if (btrfs_test_opt(info, NOLOGREPLAY))
>  		seq_puts(seq, ",rescue=nologreplay");
> +	if (btrfs_test_opt(info, ONLYFS))
> +		seq_puts(seq, ",rescue=onlyfs");
>  	if (btrfs_test_opt(info, FLUSHONCOMMIT))
>  		seq_puts(seq, ",flushoncommit");
>  	if (btrfs_test_opt(info, DISCARD_SYNC))
> @@ -1839,6 +1848,14 @@ static int btrfs_remount(struct super_block *sb, int *flags, char *data)
>  	if (ret)
>  		goto restore;
>  
> +	if (btrfs_test_opt(fs_info, ONLYFS) !=
> +	    (old_opts & BTRFS_MOUNT_ONLYFS)) {
> +		btrfs_err(fs_info,
> +		"rescue=onlyfs mount option can't be changed during remount");
> +		ret = -EINVAL;
> +		goto restore;
> +	}
> +
>  	btrfs_remount_begin(fs_info, old_opts, *flags);
>  	btrfs_resize_thread_pool(fs_info,
>  		fs_info->thread_pool_size, old_thread_pool_size);
> @@ -1904,6 +1921,13 @@ static int btrfs_remount(struct super_block *sb, int *flags, char *data)
>  			goto restore;
>  		}
>  
> +		if (btrfs_test_opt(fs_info, ONLYFS)) {
> +			btrfs_err(fs_info,
> +		"remounting read-write with rescue=onlyfs is not allowed");
> +			ret = -EINVAL;
> +			goto restore;
> +		}
> +
>  		ret = btrfs_cleanup_fs_roots(fs_info);
>  		if (ret)
>  			goto restore;
> @@ -2208,8 +2232,9 @@ static int btrfs_statfs(struct dentry *dentry, struct kstatfs *buf)
>  	 * still can allocate chunks and thus are fine using the currently
>  	 * calculated f_bavail.
>  	 */
> -	if (!mixed && block_rsv->space_info->full &&
> -	    total_free_meta - thresh < block_rsv->size)
> +	if (btrfs_test_opt(fs_info, ONLYFS) ||
> +	    (!mixed && block_rsv->space_info->full &&
> +	     total_free_meta - thresh < block_rsv->size))
>  		buf->f_bavail = 0;
>  
>  	buf->f_type = BTRFS_SUPER_MAGIC;
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index 537ccf66ee20..e86f78579884 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -7628,6 +7628,13 @@ int btrfs_verify_dev_extents(struct btrfs_fs_info *fs_info)
>  	u64 prev_dev_ext_end = 0;
>  	int ret = 0;
>  
> +	/*
> +	 * For rescue=onlyfs mount option, we're already RO and are salvaging
> +	 * data, no need for such strict check.
> +	 */
> +	if (btrfs_test_opt(fs_info, ONLYFS))
> +		return 0;
> +
>  	key.objectid = 1;
>  	key.type = BTRFS_DEV_EXTENT_KEY;
>  	key.offset = 0;
>
Graham Cobb July 22, 2020, 9:36 a.m. UTC | #9
On 22/07/2020 00:05, Qu Wenruo wrote:
> 
> 
> On 2020/7/21 下午11:10, Josef Bacik wrote:
...
>> -	skip_sum = BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM;
>> +	skip_sum = (BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM) ||
>> +		btrfs_test_opt(fs_info, ONLYFS);
> 
> This means, if onlyfs is spcified, csum is completely skipped even it
> matches.
> 
> I'm not yet determined whether it's preferred.
> 
> On one hand, even at recovery, user may want csum verification to detect
> bad copy if possible, but on the other hand, since user is doing data
> salvage, bothering csum could only lead to extra error.

If we are using this option then something very bad has happened to the
filesystem. It would be useful to the user to know whether the checksum
on the file being read is valid or not as a failed checksum will be a
hint that the BAD THING has affected this file (maybe some extent got
incorrectly overwritten by the content of another file, for example). It
would tell the user that they may want to check the file in some other
way or, at least, not rely on its contents.

On the other hand, it may be that the file contents are fine, but the
checksum has been corrupted, so it should also be possible the retrieve
the contents anyway.

Of course, in the face of unknown filesystem corruption, a valid
checksum does not guarantee that the file is uncorrupted. But a bad
checksum gives a strong hint that the file needs additional checking.

I don't particularly care how the user finds out that the checksum is
bad. In my earlier mail, I suggested that this option does not imply
nodatasum and that the user has to specify nodatasum if that is what
they want. However, I would also be happy with (for example) a warning
in the logs if the checksum is not valid.

By the way, is it possible to do a scrub while the filesystem is mounted
with this option? What would happen, in that case, for either real bad
sectors or checksum errors caused by the filesystem corruption?

Patch
diff mbox series

diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
index 884de28a41e4..416fc419fd95 100644
--- a/fs/btrfs/block-group.c
+++ b/fs/btrfs/block-group.c
@@ -1997,6 +1997,52 @@  static int read_one_block_group(struct btrfs_fs_info *info,
 	return ret;
 }
 
+static int fill_dummy_bgs(struct btrfs_fs_info *fs_info)
+{
+	struct extent_map_tree *em_tree = &fs_info->mapping_tree;
+	struct extent_map *em;
+	struct map_lookup *map;
+	struct btrfs_block_group *bg;
+	struct btrfs_space_info *space_info;
+	struct rb_node *node;
+	int ret = 0;
+
+	read_lock(&em_tree->lock);
+	for (node = rb_first_cached(&em_tree->map); node;
+	     node = rb_next(node)) {
+		em = rb_entry(node, struct extent_map, rb_node);
+		map = em->map_lookup;
+		bg = btrfs_create_block_group_cache(fs_info, em->start);
+		if (!bg) {
+			ret = -ENOMEM;
+			goto out;
+		}
+
+		/* Fill dummy cache as FULL */
+		bg->length = em->len;
+		bg->flags = map->type;
+		bg->last_byte_to_unpin = (u64)-1;
+		bg->cached = BTRFS_CACHE_FINISHED;
+		bg->used = em->len;
+		bg->flags = map->type;
+		ret = btrfs_add_block_group_cache(fs_info, bg);
+		if (ret) {
+			btrfs_remove_free_space_cache(bg);
+			btrfs_put_block_group(bg);
+			goto out;
+		}
+		btrfs_update_space_info(fs_info, bg->flags, em->len, em->len,
+					0, &space_info);
+		bg->space_info = space_info;
+		link_block_group(bg);
+
+		set_avail_alloc_bits(fs_info, bg->flags);
+	}
+out:
+	read_unlock(&em_tree->lock);
+	return ret;
+}
+
 int btrfs_read_block_groups(struct btrfs_fs_info *info)
 {
 	struct btrfs_path *path;
@@ -2007,6 +2053,9 @@  int btrfs_read_block_groups(struct btrfs_fs_info *info)
 	int need_clear = 0;
 	u64 cache_gen;
 
+	if (btrfs_test_opt(info, ONLYFS))
+		return fill_dummy_bgs(info);
+
 	key.objectid = 0;
 	key.offset = 0;
 	key.type = BTRFS_BLOCK_GROUP_ITEM_KEY;
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index b70c2024296f..0be7d9461443 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1266,6 +1266,7 @@  static inline u32 BTRFS_MAX_XATTR_SIZE(const struct btrfs_fs_info *info)
 #define BTRFS_MOUNT_NOLOGREPLAY		(1 << 27)
 #define BTRFS_MOUNT_REF_VERIFY		(1 << 28)
 #define BTRFS_MOUNT_DISCARD_ASYNC	(1 << 29)
+#define BTRFS_MOUNT_ONLYFS		(1 << 30)
 
 #define BTRFS_DEFAULT_COMMIT_INTERVAL	(30)
 #define BTRFS_DEFAULT_MAX_INLINE	(2048)
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index c850d7f44fbe..0dffa4c12d8c 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -2326,8 +2326,13 @@  static int btrfs_read_roots(struct btrfs_fs_info *fs_info)
 
 	root = btrfs_read_tree_root(tree_root, &location);
 	if (IS_ERR(root)) {
-		ret = PTR_ERR(root);
-		goto out;
+		if (!btrfs_test_opt(fs_info, ONLYFS)) {
+			ret = PTR_ERR(root);
+			goto out;
+		}
+	} else {
+		set_bit(BTRFS_ROOT_TRACK_DIRTY, &root->state);
+		fs_info->extent_root = root;
 	}
 	set_bit(BTRFS_ROOT_TRACK_DIRTY, &root->state);
 	fs_info->extent_root = root;
@@ -2335,21 +2340,27 @@  static int btrfs_read_roots(struct btrfs_fs_info *fs_info)
 	location.objectid = BTRFS_DEV_TREE_OBJECTID;
 	root = btrfs_read_tree_root(tree_root, &location);
 	if (IS_ERR(root)) {
-		ret = PTR_ERR(root);
-		goto out;
+		if (!btrfs_test_opt(fs_info, ONLYFS)) {
+			ret = PTR_ERR(root);
+			goto out;
+		}
+	} else {
+		set_bit(BTRFS_ROOT_TRACK_DIRTY, &root->state);
+		fs_info->dev_root = root;
+		btrfs_init_devices_late(fs_info);
 	}
-	set_bit(BTRFS_ROOT_TRACK_DIRTY, &root->state);
-	fs_info->dev_root = root;
-	btrfs_init_devices_late(fs_info);
 
 	location.objectid = BTRFS_CSUM_TREE_OBJECTID;
 	root = btrfs_read_tree_root(tree_root, &location);
 	if (IS_ERR(root)) {
-		ret = PTR_ERR(root);
-		goto out;
+		if (!btrfs_test_opt(fs_info, ONLYFS)) {
+			ret = PTR_ERR(root);
+			goto out;
+		}
+	} else {
+		set_bit(BTRFS_ROOT_TRACK_DIRTY, &root->state);
+		fs_info->csum_root = root;
 	}
-	set_bit(BTRFS_ROOT_TRACK_DIRTY, &root->state);
-	fs_info->csum_root = root;
 
 	/*
 	 * This tree can share blocks with some other fs tree during relocation
@@ -2358,11 +2369,14 @@  static int btrfs_read_roots(struct btrfs_fs_info *fs_info)
 	root = btrfs_get_fs_root(tree_root->fs_info,
 				 BTRFS_DATA_RELOC_TREE_OBJECTID, true);
 	if (IS_ERR(root)) {
-		ret = PTR_ERR(root);
-		goto out;
+		if (!btrfs_test_opt(fs_info, ONLYFS)) {
+			ret = PTR_ERR(root);
+			goto out;
+		}
+	} else {
+		set_bit(BTRFS_ROOT_TRACK_DIRTY, &root->state);
+		fs_info->data_reloc_root = root;
 	}
-	set_bit(BTRFS_ROOT_TRACK_DIRTY, &root->state);
-	fs_info->data_reloc_root = root;
 
 	location.objectid = BTRFS_QUOTA_TREE_OBJECTID;
 	root = btrfs_read_tree_root(tree_root, &location);
@@ -2375,9 +2389,11 @@  static int btrfs_read_roots(struct btrfs_fs_info *fs_info)
 	location.objectid = BTRFS_UUID_TREE_OBJECTID;
 	root = btrfs_read_tree_root(tree_root, &location);
 	if (IS_ERR(root)) {
-		ret = PTR_ERR(root);
-		if (ret != -ENOENT)
-			goto out;
+		if (!btrfs_test_opt(fs_info, ONLYFS)) {
+			ret = PTR_ERR(root);
+			if (ret != -ENOENT)
+				goto out;
+		}
 	} else {
 		set_bit(BTRFS_ROOT_TRACK_DIRTY, &root->state);
 		fs_info->uuid_root = root;
@@ -2387,11 +2403,14 @@  static int btrfs_read_roots(struct btrfs_fs_info *fs_info)
 		location.objectid = BTRFS_FREE_SPACE_TREE_OBJECTID;
 		root = btrfs_read_tree_root(tree_root, &location);
 		if (IS_ERR(root)) {
-			ret = PTR_ERR(root);
-			goto out;
+			if (!btrfs_test_opt(fs_info, ONLYFS)) {
+				ret = PTR_ERR(root);
+				goto out;
+			}
+		}  else {
+			set_bit(BTRFS_ROOT_TRACK_DIRTY, &root->state);
+			fs_info->free_space_root = root;
 		}
-		set_bit(BTRFS_ROOT_TRACK_DIRTY, &root->state);
-		fs_info->free_space_root = root;
 	}
 
 	return 0;
@@ -3106,6 +3125,14 @@  int __cold open_ctree(struct super_block *sb, struct btrfs_fs_devices *fs_device
 		goto fail_alloc;
 	}
 
+	/* Skip bg needs RO and no tree-log to replay */
+	if (btrfs_test_opt(fs_info, ONLYFS) && !sb_rdonly(sb)) {
+		btrfs_err(fs_info,
+			  "rescue=onlyfs can only be used on read-only mount");
+		err = -EINVAL;
+		goto fail_alloc;
+	}
+
 	ret = btrfs_init_workqueues(fs_info, fs_devices);
 	if (ret) {
 		err = ret;
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 611b3412fbfd..49cd3eba651e 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -2191,7 +2191,8 @@  static blk_status_t btrfs_submit_bio_hook(struct inode *inode, struct bio *bio,
 	int skip_sum;
 	int async = !atomic_read(&BTRFS_I(inode)->sync_writers);
 
-	skip_sum = BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM;
+	skip_sum = (BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM) ||
+		btrfs_test_opt(fs_info, ONLYFS);
 
 	if (btrfs_is_free_space_inode(BTRFS_I(inode)))
 		metadata = BTRFS_WQ_ENDIO_FREE_SPACE;
@@ -2846,6 +2847,9 @@  static int btrfs_readpage_end_io_hook(struct btrfs_io_bio *io_bio,
 	if (BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM)
 		return 0;
 
+	if (btrfs_test_opt(root->fs_info, ONLYFS))
+		return 0;
+
 	if (root->root_key.objectid == BTRFS_DATA_RELOC_TREE_OBJECTID &&
 	    test_range_bit(io_tree, start, end, EXTENT_NODATASUM, 1, NULL)) {
 		clear_extent_bits(io_tree, start, end, EXTENT_NODATASUM);
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 58f890f73650..4dc3f35ca7e3 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -345,6 +345,7 @@  enum {
 	Opt_rescue,
 	Opt_usebackuproot,
 	Opt_nologreplay,
+	Opt_rescue_onlyfs,
 
 	/* Deprecated options */
 	Opt_recovery,
@@ -440,6 +441,7 @@  static const match_table_t tokens = {
 static const match_table_t rescue_tokens = {
 	{Opt_usebackuproot, "usebackuproot"},
 	{Opt_nologreplay, "nologreplay"},
+	{Opt_rescue_onlyfs, "onlyfs"},
 	{Opt_err, NULL},
 };
 
@@ -472,6 +474,11 @@  static int parse_rescue_options(struct btrfs_fs_info *info, const char *options)
 			btrfs_set_and_info(info, NOLOGREPLAY,
 					   "disabling log replay at mount time");
 			break;
+		case Opt_rescue_onlyfs:
+			btrfs_set_and_info(info, ONLYFS,
+					   "only reading fs roots, also setting  nologreplay");
+			btrfs_set_opt(info->mount_opt, NOLOGREPLAY);
+			break;
 		case Opt_err:
 			btrfs_info(info, "unrecognized rescue option '%s'", p);
 			ret = -EINVAL;
@@ -1400,6 +1407,8 @@  static int btrfs_show_options(struct seq_file *seq, struct dentry *dentry)
 		seq_puts(seq, ",notreelog");
 	if (btrfs_test_opt(info, NOLOGREPLAY))
 		seq_puts(seq, ",rescue=nologreplay");
+	if (btrfs_test_opt(info, ONLYFS))
+		seq_puts(seq, ",rescue=onlyfs");
 	if (btrfs_test_opt(info, FLUSHONCOMMIT))
 		seq_puts(seq, ",flushoncommit");
 	if (btrfs_test_opt(info, DISCARD_SYNC))
@@ -1839,6 +1848,14 @@  static int btrfs_remount(struct super_block *sb, int *flags, char *data)
 	if (ret)
 		goto restore;
 
+	if (btrfs_test_opt(fs_info, ONLYFS) !=
+	    (old_opts & BTRFS_MOUNT_ONLYFS)) {
+		btrfs_err(fs_info,
+		"rescue=onlyfs mount option can't be changed during remount");
+		ret = -EINVAL;
+		goto restore;
+	}
+
 	btrfs_remount_begin(fs_info, old_opts, *flags);
 	btrfs_resize_thread_pool(fs_info,
 		fs_info->thread_pool_size, old_thread_pool_size);
@@ -1904,6 +1921,13 @@  static int btrfs_remount(struct super_block *sb, int *flags, char *data)
 			goto restore;
 		}
 
+		if (btrfs_test_opt(fs_info, ONLYFS)) {
+			btrfs_err(fs_info,
+		"remounting read-write with rescue=onlyfs is not allowed");
+			ret = -EINVAL;
+			goto restore;
+		}
+
 		ret = btrfs_cleanup_fs_roots(fs_info);
 		if (ret)
 			goto restore;
@@ -2208,8 +2232,9 @@  static int btrfs_statfs(struct dentry *dentry, struct kstatfs *buf)
 	 * still can allocate chunks and thus are fine using the currently
 	 * calculated f_bavail.
 	 */
-	if (!mixed && block_rsv->space_info->full &&
-	    total_free_meta - thresh < block_rsv->size)
+	if (btrfs_test_opt(fs_info, ONLYFS) ||
+	    (!mixed && block_rsv->space_info->full &&
+	     total_free_meta - thresh < block_rsv->size))
 		buf->f_bavail = 0;
 
 	buf->f_type = BTRFS_SUPER_MAGIC;
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 537ccf66ee20..e86f78579884 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -7628,6 +7628,13 @@  int btrfs_verify_dev_extents(struct btrfs_fs_info *fs_info)
 	u64 prev_dev_ext_end = 0;
 	int ret = 0;
 
+	/*
+	 * For rescue=onlyfs mount option, we're already RO and are salvaging
+	 * data, no need for such strict check.
+	 */
+	if (btrfs_test_opt(fs_info, ONLYFS))
+		return 0;
+
 	key.objectid = 1;
 	key.type = BTRFS_DEV_EXTENT_KEY;
 	key.offset = 0;