Message ID | 20200721151057.9325-1-josef@toxicpanda.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | [v2] btrfs: introduce rescue=onlyfs | expand |
On 21/07/2020 16:10, Josef Bacik wrote: > One of the things that came up consistently in talking with Fedora about > switching to btrfs as default is that btrfs is particularly vulnerable > to metadata corruption. If any of the core global roots are corrupted, > the fs is unmountable and fsck can't usually do anything for you without > some special options. > > Qu addressed this sort of with rescue=skipbg, but that's poorly named as > what it really does is just allow you to operate without an extent root. > However there are a lot of other roots, and I'd rather not have to do > > mount -o rescue=skipbg,rescue=nocsum,rescue=nofreespacetree,rescue=blah > > Instead take his original idea and modify it so it just works for > everything. Turn it into rescue=onlyfs, and then any major root we fail > to read just gets left empty and we carry on. Am I the only one who dislikes the name? "onlyfs" does not seem at all meaningful to me, as a system manager - the people it is apparently aimed at. I really don't understand what it is supposed to mean and it sounds like some developer debugging option or something. If it means "only filesystem" that doesn't make sense to me - the whole thing is the filesystem. I guess "only data" might be more meaningful but if the aim is to turn on as much recovery as possible to help the user to save their data then why not just say so? Something like "rescue=max", "rescue=recoverymode", "rescue=dataonly", "rescue=ignoreallerrors" or "rescue=emergency" might be more meaningful.
On Tue, Jul 21, 2020 at 04:56:55PM +0100, Graham Cobb wrote: > On 21/07/2020 16:10, Josef Bacik wrote: > > One of the things that came up consistently in talking with Fedora about > > switching to btrfs as default is that btrfs is particularly vulnerable > > to metadata corruption. If any of the core global roots are corrupted, > > the fs is unmountable and fsck can't usually do anything for you without > > some special options. > > > > Qu addressed this sort of with rescue=skipbg, but that's poorly named as > > what it really does is just allow you to operate without an extent root. > > However there are a lot of other roots, and I'd rather not have to do > > > > mount -o rescue=skipbg,rescue=nocsum,rescue=nofreespacetree,rescue=blah > > > > Instead take his original idea and modify it so it just works for > > everything. Turn it into rescue=onlyfs, and then any major root we fail > > to read just gets left empty and we carry on. > > Am I the only one who dislikes the name? "onlyfs" does not seem at all > meaningful to me, as a system manager - the people it is apparently > aimed at. I really don't understand what it is supposed to mean and it > sounds like some developer debugging option or something. No, you're not the only one. Changelog points to 'skipbg' as poor naming choice but 'onlyfs' is IMHO just as bad because it's supposed to be used by users so the naming need to take that into account. > If it means "only filesystem" that doesn't make sense to me - the whole > thing is the filesystem. I guess "only data" might be more meaningful > but if the aim is to turn on as much recovery as possible to help the > user to save their data then why not just say so? > > Something like "rescue=max", "rescue=recoverymode", "rescue=dataonly", > "rescue=ignoreallerrors" or "rescue=emergency" might be more meaningful. From user perspective the option should have a high level semantics, like you suggest above. We should add individual options to try to work around specific damage if not just for testing purposes, having more flexibility is a good thing.
On 7/21/20 1:16 PM, David Sterba wrote: > On Tue, Jul 21, 2020 at 04:56:55PM +0100, Graham Cobb wrote: >> On 21/07/2020 16:10, Josef Bacik wrote: >>> One of the things that came up consistently in talking with Fedora about >>> switching to btrfs as default is that btrfs is particularly vulnerable >>> to metadata corruption. If any of the core global roots are corrupted, >>> the fs is unmountable and fsck can't usually do anything for you without >>> some special options. >>> >>> Qu addressed this sort of with rescue=skipbg, but that's poorly named as >>> what it really does is just allow you to operate without an extent root. >>> However there are a lot of other roots, and I'd rather not have to do >>> >>> mount -o rescue=skipbg,rescue=nocsum,rescue=nofreespacetree,rescue=blah >>> >>> Instead take his original idea and modify it so it just works for >>> everything. Turn it into rescue=onlyfs, and then any major root we fail >>> to read just gets left empty and we carry on. >> >> Am I the only one who dislikes the name? "onlyfs" does not seem at all >> meaningful to me, as a system manager - the people it is apparently >> aimed at. I really don't understand what it is supposed to mean and it >> sounds like some developer debugging option or something. > > No, you're not the only one. Changelog points to 'skipbg' as poor naming > choice but 'onlyfs' is IMHO just as bad because it's supposed to be used > by users so the naming need to take that into account. I'm not married to the name, I think its awful but I had no better ideas nor suggestions from the last time I posted, tho I'm partial to rescue=allthefuckingthings. However I'm fine with rescue=max also, is that everybody's favorite? Thanks, Josef
On 21/07/2020 18:16, David Sterba wrote: > On Tue, Jul 21, 2020 at 04:56:55PM +0100, Graham Cobb wrote: >> If it means "only filesystem" that doesn't make sense to me - the whole >> thing is the filesystem. I guess "only data" might be more meaningful >> but if the aim is to turn on as much recovery as possible to help the >> user to save their data then why not just say so? >> >> Something like "rescue=max", "rescue=recoverymode", "rescue=dataonly", >> "rescue=ignoreallerrors" or "rescue=emergency" might be more meaningful. > > From user perspective the option should have a high level semantics, > like you suggest above. We should add individual options to try to work > around specific damage if not just for testing purposes, having more > flexibility is a good thing. I would also prefer not to have checksum checking disabled by this "try harder" option. I would imagine turning on "ignore whatever checks you can to get me my data back mode", retrieving all the readable data with valid checksums and getting errors for things which cannot be verified. Then I would make a decision as to whether to enable another option to even provide files which the filesystem cannot guarantee have not been corrupted because it can't check checksums. Even if that is all the files (because the checksum tree is destroyed) I should have to make an explicit acknowledgement that I want that.
On 7/21/20 1:34 PM, Graham Cobb wrote: > On 21/07/2020 18:16, David Sterba wrote: >> On Tue, Jul 21, 2020 at 04:56:55PM +0100, Graham Cobb wrote: >>> If it means "only filesystem" that doesn't make sense to me - the whole >>> thing is the filesystem. I guess "only data" might be more meaningful >>> but if the aim is to turn on as much recovery as possible to help the >>> user to save their data then why not just say so? >>> >>> Something like "rescue=max", "rescue=recoverymode", "rescue=dataonly", >>> "rescue=ignoreallerrors" or "rescue=emergency" might be more meaningful. >> >> From user perspective the option should have a high level semantics, >> like you suggest above. We should add individual options to try to work >> around specific damage if not just for testing purposes, having more >> flexibility is a good thing. > > I would also prefer not to have checksum checking disabled by this "try > harder" option. I would imagine turning on "ignore whatever checks you > can to get me my data back mode", retrieving all the readable data with > valid checksums and getting errors for things which cannot be verified. > Then I would make a decision as to whether to enable another option to > even provide files which the filesystem cannot guarantee have not been > corrupted because it can't check checksums. Even if that is all the > files (because the checksum tree is destroyed) I should have to make an > explicit acknowledgement that I want that. > If somebody wants to add finer grained stuff later I'm fine with that, right now I'm addressing the case where a lot of things are dead. Thanks, Josef
On Tue, Jul 21, 2020 at 11:25 AM Josef Bacik <josef@toxicpanda.com> wrote: > > On 7/21/20 1:16 PM, David Sterba wrote: > > No, you're not the only one. Changelog points to 'skipbg' as poor naming > > choice but 'onlyfs' is IMHO just as bad because it's supposed to be used > > by users so the naming need to take that into account. > > I'm not married to the name, I think its awful but I had no better ideas nor > suggestions from the last time I posted, tho I'm partial to > rescue=allthefuckingthings. However I'm fine with rescue=max also, is that > everybody's favorite? Thanks, Or just rescue? And then rescue= to specifically list the trees that are to be ignored? Separated by comma or colon or whatever?
On Tue, Jul 21, 2020 at 11:35 AM Graham Cobb <g.btrfs@cobb.uk.net> wrote: > > On 21/07/2020 18:16, David Sterba wrote: > > On Tue, Jul 21, 2020 at 04:56:55PM +0100, Graham Cobb wrote: > >> If it means "only filesystem" that doesn't make sense to me - the whole > >> thing is the filesystem. I guess "only data" might be more meaningful > >> but if the aim is to turn on as much recovery as possible to help the > >> user to save their data then why not just say so? > >> > >> Something like "rescue=max", "rescue=recoverymode", "rescue=dataonly", > >> "rescue=ignoreallerrors" or "rescue=emergency" might be more meaningful. > > > > From user perspective the option should have a high level semantics, > > like you suggest above. We should add individual options to try to work > > around specific damage if not just for testing purposes, having more > > flexibility is a good thing. > > I would also prefer not to have checksum checking disabled by this "try > harder" option. I would imagine turning on "ignore whatever checks you > can to get me my data back mode", retrieving all the readable data with > valid checksums and getting errors for things which cannot be verified. > Then I would make a decision as to whether to enable another option to > even provide files which the filesystem cannot guarantee have not been > corrupted because it can't check checksums. Even if that is all the > files (because the checksum tree is destroyed) I should have to make an > explicit acknowledgement that I want that. It would be nice if this rescue=all mount option, continues to spit out noisy and scary warnings about the problems encountered. Including corrupt files with path to the corrupt file. Just avoid face planting. I assume rescue=all implies ro (possibly also nologreplay).
On 2020/7/21 下午11:10, Josef Bacik wrote: > One of the things that came up consistently in talking with Fedora about > switching to btrfs as default is that btrfs is particularly vulnerable > to metadata corruption. If any of the core global roots are corrupted, > the fs is unmountable and fsck can't usually do anything for you without > some special options. > > Qu addressed this sort of with rescue=skipbg, but that's poorly named as > what it really does is just allow you to operate without an extent root. > However there are a lot of other roots, and I'd rather not have to do > > mount -o rescue=skipbg,rescue=nocsum,rescue=nofreespacetree,rescue=blah > > Instead take his original idea and modify it so it just works for > everything. Turn it into rescue=onlyfs, and then any major root we fail > to read just gets left empty and we carry on. > > Obviously if the fs roots are screwed then the user is in trouble, but > otherwise this makes it much easier to pull stuff off the disk without > needing our special rescue tools. I tested this with my TEST_DEV that > had a bunch of data on it by corrupting the csum tree and then reading > files off the disk. > > Signed-off-by: Josef Bacik <josef@toxicpanda.com> > --- > v1->v2: > - Rebase onto recent misc-next. > - Add in the fill_dummy_bgs since skipbg is no longer in this tree. > > fs/btrfs/block-group.c | 49 +++++++++++++++++++++++++++++ > fs/btrfs/ctree.h | 1 + > fs/btrfs/disk-io.c | 71 +++++++++++++++++++++++++++++------------- > fs/btrfs/inode.c | 6 +++- > fs/btrfs/super.c | 29 +++++++++++++++-- > fs/btrfs/volumes.c | 7 +++++ > 6 files changed, 138 insertions(+), 25 deletions(-) > > diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c > index 884de28a41e4..416fc419fd95 100644 > --- a/fs/btrfs/block-group.c > +++ b/fs/btrfs/block-group.c > @@ -1997,6 +1997,52 @@ static int read_one_block_group(struct btrfs_fs_info *info, > return ret; > } > > +static int fill_dummy_bgs(struct btrfs_fs_info *fs_info) > +{ > + struct extent_map_tree *em_tree = &fs_info->mapping_tree; > + struct extent_map *em; > + struct map_lookup *map; > + struct btrfs_block_group *bg; > + struct btrfs_space_info *space_info; > + struct rb_node *node; > + int ret = 0; > + > + read_lock(&em_tree->lock); > + for (node = rb_first_cached(&em_tree->map); node; > + node = rb_next(node)) { > + em = rb_entry(node, struct extent_map, rb_node); > + map = em->map_lookup; > + bg = btrfs_create_block_group_cache(fs_info, em->start); > + if (!bg) { > + ret = -ENOMEM; > + goto out; > + } > + > + /* Fill dummy cache as FULL */ > + bg->length = em->len; > + bg->flags = map->type; > + bg->last_byte_to_unpin = (u64)-1; > + bg->cached = BTRFS_CACHE_FINISHED; > + bg->used = em->len; > + bg->flags = map->type; > + ret = btrfs_add_block_group_cache(fs_info, bg); > + if (ret) { > + btrfs_remove_free_space_cache(bg); > + btrfs_put_block_group(bg); > + goto out; > + } > + btrfs_update_space_info(fs_info, bg->flags, em->len, em->len, > + 0, &space_info); > + bg->space_info = space_info; > + link_block_group(bg); > + > + set_avail_alloc_bits(fs_info, bg->flags); > + } > +out: > + read_unlock(&em_tree->lock); > + return ret; > +} > + > int btrfs_read_block_groups(struct btrfs_fs_info *info) > { > struct btrfs_path *path; > @@ -2007,6 +2053,9 @@ int btrfs_read_block_groups(struct btrfs_fs_info *info) > int need_clear = 0; > u64 cache_gen; > > + if (btrfs_test_opt(info, ONLYFS)) > + return fill_dummy_bgs(info); > + > key.objectid = 0; > key.offset = 0; > key.type = BTRFS_BLOCK_GROUP_ITEM_KEY; > diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h > index b70c2024296f..0be7d9461443 100644 > --- a/fs/btrfs/ctree.h > +++ b/fs/btrfs/ctree.h > @@ -1266,6 +1266,7 @@ static inline u32 BTRFS_MAX_XATTR_SIZE(const struct btrfs_fs_info *info) > #define BTRFS_MOUNT_NOLOGREPLAY (1 << 27) > #define BTRFS_MOUNT_REF_VERIFY (1 << 28) > #define BTRFS_MOUNT_DISCARD_ASYNC (1 << 29) > +#define BTRFS_MOUNT_ONLYFS (1 << 30) > > #define BTRFS_DEFAULT_COMMIT_INTERVAL (30) > #define BTRFS_DEFAULT_MAX_INLINE (2048) > diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c > index c850d7f44fbe..0dffa4c12d8c 100644 > --- a/fs/btrfs/disk-io.c > +++ b/fs/btrfs/disk-io.c > @@ -2326,8 +2326,13 @@ static int btrfs_read_roots(struct btrfs_fs_info *fs_info) > > root = btrfs_read_tree_root(tree_root, &location); > if (IS_ERR(root)) { > - ret = PTR_ERR(root); > - goto out; > + if (!btrfs_test_opt(fs_info, ONLYFS)) { > + ret = PTR_ERR(root); > + goto out; > + } > + } else { > + set_bit(BTRFS_ROOT_TRACK_DIRTY, &root->state); > + fs_info->extent_root = root; > } > set_bit(BTRFS_ROOT_TRACK_DIRTY, &root->state); > fs_info->extent_root = root; > @@ -2335,21 +2340,27 @@ static int btrfs_read_roots(struct btrfs_fs_info *fs_info) > location.objectid = BTRFS_DEV_TREE_OBJECTID; > root = btrfs_read_tree_root(tree_root, &location); > if (IS_ERR(root)) { > - ret = PTR_ERR(root); > - goto out; > + if (!btrfs_test_opt(fs_info, ONLYFS)) { > + ret = PTR_ERR(root); > + goto out; > + } > + } else { > + set_bit(BTRFS_ROOT_TRACK_DIRTY, &root->state); > + fs_info->dev_root = root; > + btrfs_init_devices_late(fs_info); > } > - set_bit(BTRFS_ROOT_TRACK_DIRTY, &root->state); > - fs_info->dev_root = root; > - btrfs_init_devices_late(fs_info); > > location.objectid = BTRFS_CSUM_TREE_OBJECTID; > root = btrfs_read_tree_root(tree_root, &location); > if (IS_ERR(root)) { > - ret = PTR_ERR(root); > - goto out; > + if (!btrfs_test_opt(fs_info, ONLYFS)) { > + ret = PTR_ERR(root); > + goto out; > + } > + } else { > + set_bit(BTRFS_ROOT_TRACK_DIRTY, &root->state); > + fs_info->csum_root = root; > } > - set_bit(BTRFS_ROOT_TRACK_DIRTY, &root->state); > - fs_info->csum_root = root; > > /* > * This tree can share blocks with some other fs tree during relocation > @@ -2358,11 +2369,14 @@ static int btrfs_read_roots(struct btrfs_fs_info *fs_info) > root = btrfs_get_fs_root(tree_root->fs_info, > BTRFS_DATA_RELOC_TREE_OBJECTID, true); > if (IS_ERR(root)) { > - ret = PTR_ERR(root); > - goto out; > + if (!btrfs_test_opt(fs_info, ONLYFS)) { > + ret = PTR_ERR(root); > + goto out; > + } > + } else { > + set_bit(BTRFS_ROOT_TRACK_DIRTY, &root->state); > + fs_info->data_reloc_root = root; > } > - set_bit(BTRFS_ROOT_TRACK_DIRTY, &root->state); > - fs_info->data_reloc_root = root; > > location.objectid = BTRFS_QUOTA_TREE_OBJECTID; > root = btrfs_read_tree_root(tree_root, &location); > @@ -2375,9 +2389,11 @@ static int btrfs_read_roots(struct btrfs_fs_info *fs_info) > location.objectid = BTRFS_UUID_TREE_OBJECTID; > root = btrfs_read_tree_root(tree_root, &location); > if (IS_ERR(root)) { > - ret = PTR_ERR(root); > - if (ret != -ENOENT) > - goto out; > + if (!btrfs_test_opt(fs_info, ONLYFS)) { > + ret = PTR_ERR(root); > + if (ret != -ENOENT) > + goto out; > + } > } else { > set_bit(BTRFS_ROOT_TRACK_DIRTY, &root->state); > fs_info->uuid_root = root; > @@ -2387,11 +2403,14 @@ static int btrfs_read_roots(struct btrfs_fs_info *fs_info) > location.objectid = BTRFS_FREE_SPACE_TREE_OBJECTID; > root = btrfs_read_tree_root(tree_root, &location); > if (IS_ERR(root)) { > - ret = PTR_ERR(root); > - goto out; > + if (!btrfs_test_opt(fs_info, ONLYFS)) { > + ret = PTR_ERR(root); > + goto out; > + } > + } else { > + set_bit(BTRFS_ROOT_TRACK_DIRTY, &root->state); > + fs_info->free_space_root = root; > } > - set_bit(BTRFS_ROOT_TRACK_DIRTY, &root->state); > - fs_info->free_space_root = root; > } > > return 0; > @@ -3106,6 +3125,14 @@ int __cold open_ctree(struct super_block *sb, struct btrfs_fs_devices *fs_device > goto fail_alloc; > } > > + /* Skip bg needs RO and no tree-log to replay */ Comment is still the skip-bg version. > + if (btrfs_test_opt(fs_info, ONLYFS) && !sb_rdonly(sb)) { > + btrfs_err(fs_info, > + "rescue=onlyfs can only be used on read-only mount"); > + err = -EINVAL; > + goto fail_alloc; > + } > + > ret = btrfs_init_workqueues(fs_info, fs_devices); > if (ret) { > err = ret; > diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c > index 611b3412fbfd..49cd3eba651e 100644 > --- a/fs/btrfs/inode.c > +++ b/fs/btrfs/inode.c > @@ -2191,7 +2191,8 @@ static blk_status_t btrfs_submit_bio_hook(struct inode *inode, struct bio *bio, > int skip_sum; > int async = !atomic_read(&BTRFS_I(inode)->sync_writers); > > - skip_sum = BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM; > + skip_sum = (BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM) || > + btrfs_test_opt(fs_info, ONLYFS); This means, if onlyfs is spcified, csum is completely skipped even it matches. I'm not yet determined whether it's preferred. On one hand, even at recovery, user may want csum verification to detect bad copy if possible, but on the other hand, since user is doing data salvage, bothering csum could only lead to extra error. Despite that, the patch looks pretty good and indeed an enhancement to skipbg. Thanks, Qu > > if (btrfs_is_free_space_inode(BTRFS_I(inode))) > metadata = BTRFS_WQ_ENDIO_FREE_SPACE; > @@ -2846,6 +2847,9 @@ static int btrfs_readpage_end_io_hook(struct btrfs_io_bio *io_bio, > if (BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM) > return 0; > > + if (btrfs_test_opt(root->fs_info, ONLYFS)) > + return 0; > + > if (root->root_key.objectid == BTRFS_DATA_RELOC_TREE_OBJECTID && > test_range_bit(io_tree, start, end, EXTENT_NODATASUM, 1, NULL)) { > clear_extent_bits(io_tree, start, end, EXTENT_NODATASUM); > diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c > index 58f890f73650..4dc3f35ca7e3 100644 > --- a/fs/btrfs/super.c > +++ b/fs/btrfs/super.c > @@ -345,6 +345,7 @@ enum { > Opt_rescue, > Opt_usebackuproot, > Opt_nologreplay, > + Opt_rescue_onlyfs, > > /* Deprecated options */ > Opt_recovery, > @@ -440,6 +441,7 @@ static const match_table_t tokens = { > static const match_table_t rescue_tokens = { > {Opt_usebackuproot, "usebackuproot"}, > {Opt_nologreplay, "nologreplay"}, > + {Opt_rescue_onlyfs, "onlyfs"}, > {Opt_err, NULL}, > }; > > @@ -472,6 +474,11 @@ static int parse_rescue_options(struct btrfs_fs_info *info, const char *options) > btrfs_set_and_info(info, NOLOGREPLAY, > "disabling log replay at mount time"); > break; > + case Opt_rescue_onlyfs: > + btrfs_set_and_info(info, ONLYFS, > + "only reading fs roots, also setting nologreplay"); > + btrfs_set_opt(info->mount_opt, NOLOGREPLAY); > + break; > case Opt_err: > btrfs_info(info, "unrecognized rescue option '%s'", p); > ret = -EINVAL; > @@ -1400,6 +1407,8 @@ static int btrfs_show_options(struct seq_file *seq, struct dentry *dentry) > seq_puts(seq, ",notreelog"); > if (btrfs_test_opt(info, NOLOGREPLAY)) > seq_puts(seq, ",rescue=nologreplay"); > + if (btrfs_test_opt(info, ONLYFS)) > + seq_puts(seq, ",rescue=onlyfs"); > if (btrfs_test_opt(info, FLUSHONCOMMIT)) > seq_puts(seq, ",flushoncommit"); > if (btrfs_test_opt(info, DISCARD_SYNC)) > @@ -1839,6 +1848,14 @@ static int btrfs_remount(struct super_block *sb, int *flags, char *data) > if (ret) > goto restore; > > + if (btrfs_test_opt(fs_info, ONLYFS) != > + (old_opts & BTRFS_MOUNT_ONLYFS)) { > + btrfs_err(fs_info, > + "rescue=onlyfs mount option can't be changed during remount"); > + ret = -EINVAL; > + goto restore; > + } > + > btrfs_remount_begin(fs_info, old_opts, *flags); > btrfs_resize_thread_pool(fs_info, > fs_info->thread_pool_size, old_thread_pool_size); > @@ -1904,6 +1921,13 @@ static int btrfs_remount(struct super_block *sb, int *flags, char *data) > goto restore; > } > > + if (btrfs_test_opt(fs_info, ONLYFS)) { > + btrfs_err(fs_info, > + "remounting read-write with rescue=onlyfs is not allowed"); > + ret = -EINVAL; > + goto restore; > + } > + > ret = btrfs_cleanup_fs_roots(fs_info); > if (ret) > goto restore; > @@ -2208,8 +2232,9 @@ static int btrfs_statfs(struct dentry *dentry, struct kstatfs *buf) > * still can allocate chunks and thus are fine using the currently > * calculated f_bavail. > */ > - if (!mixed && block_rsv->space_info->full && > - total_free_meta - thresh < block_rsv->size) > + if (btrfs_test_opt(fs_info, ONLYFS) || > + (!mixed && block_rsv->space_info->full && > + total_free_meta - thresh < block_rsv->size)) > buf->f_bavail = 0; > > buf->f_type = BTRFS_SUPER_MAGIC; > diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c > index 537ccf66ee20..e86f78579884 100644 > --- a/fs/btrfs/volumes.c > +++ b/fs/btrfs/volumes.c > @@ -7628,6 +7628,13 @@ int btrfs_verify_dev_extents(struct btrfs_fs_info *fs_info) > u64 prev_dev_ext_end = 0; > int ret = 0; > > + /* > + * For rescue=onlyfs mount option, we're already RO and are salvaging > + * data, no need for such strict check. > + */ > + if (btrfs_test_opt(fs_info, ONLYFS)) > + return 0; > + > key.objectid = 1; > key.type = BTRFS_DEV_EXTENT_KEY; > key.offset = 0; >
On 22/07/2020 00:05, Qu Wenruo wrote: > > > On 2020/7/21 下午11:10, Josef Bacik wrote: ... >> - skip_sum = BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM; >> + skip_sum = (BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM) || >> + btrfs_test_opt(fs_info, ONLYFS); > > This means, if onlyfs is spcified, csum is completely skipped even it > matches. > > I'm not yet determined whether it's preferred. > > On one hand, even at recovery, user may want csum verification to detect > bad copy if possible, but on the other hand, since user is doing data > salvage, bothering csum could only lead to extra error. If we are using this option then something very bad has happened to the filesystem. It would be useful to the user to know whether the checksum on the file being read is valid or not as a failed checksum will be a hint that the BAD THING has affected this file (maybe some extent got incorrectly overwritten by the content of another file, for example). It would tell the user that they may want to check the file in some other way or, at least, not rely on its contents. On the other hand, it may be that the file contents are fine, but the checksum has been corrupted, so it should also be possible the retrieve the contents anyway. Of course, in the face of unknown filesystem corruption, a valid checksum does not guarantee that the file is uncorrupted. But a bad checksum gives a strong hint that the file needs additional checking. I don't particularly care how the user finds out that the checksum is bad. In my earlier mail, I suggested that this option does not imply nodatasum and that the user has to specify nodatasum if that is what they want. However, I would also be happy with (for example) a warning in the logs if the checksum is not valid. By the way, is it possible to do a scrub while the filesystem is mounted with this option? What would happen, in that case, for either real bad sectors or checksum errors caused by the filesystem corruption?
diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c index 884de28a41e4..416fc419fd95 100644 --- a/fs/btrfs/block-group.c +++ b/fs/btrfs/block-group.c @@ -1997,6 +1997,52 @@ static int read_one_block_group(struct btrfs_fs_info *info, return ret; } +static int fill_dummy_bgs(struct btrfs_fs_info *fs_info) +{ + struct extent_map_tree *em_tree = &fs_info->mapping_tree; + struct extent_map *em; + struct map_lookup *map; + struct btrfs_block_group *bg; + struct btrfs_space_info *space_info; + struct rb_node *node; + int ret = 0; + + read_lock(&em_tree->lock); + for (node = rb_first_cached(&em_tree->map); node; + node = rb_next(node)) { + em = rb_entry(node, struct extent_map, rb_node); + map = em->map_lookup; + bg = btrfs_create_block_group_cache(fs_info, em->start); + if (!bg) { + ret = -ENOMEM; + goto out; + } + + /* Fill dummy cache as FULL */ + bg->length = em->len; + bg->flags = map->type; + bg->last_byte_to_unpin = (u64)-1; + bg->cached = BTRFS_CACHE_FINISHED; + bg->used = em->len; + bg->flags = map->type; + ret = btrfs_add_block_group_cache(fs_info, bg); + if (ret) { + btrfs_remove_free_space_cache(bg); + btrfs_put_block_group(bg); + goto out; + } + btrfs_update_space_info(fs_info, bg->flags, em->len, em->len, + 0, &space_info); + bg->space_info = space_info; + link_block_group(bg); + + set_avail_alloc_bits(fs_info, bg->flags); + } +out: + read_unlock(&em_tree->lock); + return ret; +} + int btrfs_read_block_groups(struct btrfs_fs_info *info) { struct btrfs_path *path; @@ -2007,6 +2053,9 @@ int btrfs_read_block_groups(struct btrfs_fs_info *info) int need_clear = 0; u64 cache_gen; + if (btrfs_test_opt(info, ONLYFS)) + return fill_dummy_bgs(info); + key.objectid = 0; key.offset = 0; key.type = BTRFS_BLOCK_GROUP_ITEM_KEY; diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index b70c2024296f..0be7d9461443 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -1266,6 +1266,7 @@ static inline u32 BTRFS_MAX_XATTR_SIZE(const struct btrfs_fs_info *info) #define BTRFS_MOUNT_NOLOGREPLAY (1 << 27) #define BTRFS_MOUNT_REF_VERIFY (1 << 28) #define BTRFS_MOUNT_DISCARD_ASYNC (1 << 29) +#define BTRFS_MOUNT_ONLYFS (1 << 30) #define BTRFS_DEFAULT_COMMIT_INTERVAL (30) #define BTRFS_DEFAULT_MAX_INLINE (2048) diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c index c850d7f44fbe..0dffa4c12d8c 100644 --- a/fs/btrfs/disk-io.c +++ b/fs/btrfs/disk-io.c @@ -2326,8 +2326,13 @@ static int btrfs_read_roots(struct btrfs_fs_info *fs_info) root = btrfs_read_tree_root(tree_root, &location); if (IS_ERR(root)) { - ret = PTR_ERR(root); - goto out; + if (!btrfs_test_opt(fs_info, ONLYFS)) { + ret = PTR_ERR(root); + goto out; + } + } else { + set_bit(BTRFS_ROOT_TRACK_DIRTY, &root->state); + fs_info->extent_root = root; } set_bit(BTRFS_ROOT_TRACK_DIRTY, &root->state); fs_info->extent_root = root; @@ -2335,21 +2340,27 @@ static int btrfs_read_roots(struct btrfs_fs_info *fs_info) location.objectid = BTRFS_DEV_TREE_OBJECTID; root = btrfs_read_tree_root(tree_root, &location); if (IS_ERR(root)) { - ret = PTR_ERR(root); - goto out; + if (!btrfs_test_opt(fs_info, ONLYFS)) { + ret = PTR_ERR(root); + goto out; + } + } else { + set_bit(BTRFS_ROOT_TRACK_DIRTY, &root->state); + fs_info->dev_root = root; + btrfs_init_devices_late(fs_info); } - set_bit(BTRFS_ROOT_TRACK_DIRTY, &root->state); - fs_info->dev_root = root; - btrfs_init_devices_late(fs_info); location.objectid = BTRFS_CSUM_TREE_OBJECTID; root = btrfs_read_tree_root(tree_root, &location); if (IS_ERR(root)) { - ret = PTR_ERR(root); - goto out; + if (!btrfs_test_opt(fs_info, ONLYFS)) { + ret = PTR_ERR(root); + goto out; + } + } else { + set_bit(BTRFS_ROOT_TRACK_DIRTY, &root->state); + fs_info->csum_root = root; } - set_bit(BTRFS_ROOT_TRACK_DIRTY, &root->state); - fs_info->csum_root = root; /* * This tree can share blocks with some other fs tree during relocation @@ -2358,11 +2369,14 @@ static int btrfs_read_roots(struct btrfs_fs_info *fs_info) root = btrfs_get_fs_root(tree_root->fs_info, BTRFS_DATA_RELOC_TREE_OBJECTID, true); if (IS_ERR(root)) { - ret = PTR_ERR(root); - goto out; + if (!btrfs_test_opt(fs_info, ONLYFS)) { + ret = PTR_ERR(root); + goto out; + } + } else { + set_bit(BTRFS_ROOT_TRACK_DIRTY, &root->state); + fs_info->data_reloc_root = root; } - set_bit(BTRFS_ROOT_TRACK_DIRTY, &root->state); - fs_info->data_reloc_root = root; location.objectid = BTRFS_QUOTA_TREE_OBJECTID; root = btrfs_read_tree_root(tree_root, &location); @@ -2375,9 +2389,11 @@ static int btrfs_read_roots(struct btrfs_fs_info *fs_info) location.objectid = BTRFS_UUID_TREE_OBJECTID; root = btrfs_read_tree_root(tree_root, &location); if (IS_ERR(root)) { - ret = PTR_ERR(root); - if (ret != -ENOENT) - goto out; + if (!btrfs_test_opt(fs_info, ONLYFS)) { + ret = PTR_ERR(root); + if (ret != -ENOENT) + goto out; + } } else { set_bit(BTRFS_ROOT_TRACK_DIRTY, &root->state); fs_info->uuid_root = root; @@ -2387,11 +2403,14 @@ static int btrfs_read_roots(struct btrfs_fs_info *fs_info) location.objectid = BTRFS_FREE_SPACE_TREE_OBJECTID; root = btrfs_read_tree_root(tree_root, &location); if (IS_ERR(root)) { - ret = PTR_ERR(root); - goto out; + if (!btrfs_test_opt(fs_info, ONLYFS)) { + ret = PTR_ERR(root); + goto out; + } + } else { + set_bit(BTRFS_ROOT_TRACK_DIRTY, &root->state); + fs_info->free_space_root = root; } - set_bit(BTRFS_ROOT_TRACK_DIRTY, &root->state); - fs_info->free_space_root = root; } return 0; @@ -3106,6 +3125,14 @@ int __cold open_ctree(struct super_block *sb, struct btrfs_fs_devices *fs_device goto fail_alloc; } + /* Skip bg needs RO and no tree-log to replay */ + if (btrfs_test_opt(fs_info, ONLYFS) && !sb_rdonly(sb)) { + btrfs_err(fs_info, + "rescue=onlyfs can only be used on read-only mount"); + err = -EINVAL; + goto fail_alloc; + } + ret = btrfs_init_workqueues(fs_info, fs_devices); if (ret) { err = ret; diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index 611b3412fbfd..49cd3eba651e 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -2191,7 +2191,8 @@ static blk_status_t btrfs_submit_bio_hook(struct inode *inode, struct bio *bio, int skip_sum; int async = !atomic_read(&BTRFS_I(inode)->sync_writers); - skip_sum = BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM; + skip_sum = (BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM) || + btrfs_test_opt(fs_info, ONLYFS); if (btrfs_is_free_space_inode(BTRFS_I(inode))) metadata = BTRFS_WQ_ENDIO_FREE_SPACE; @@ -2846,6 +2847,9 @@ static int btrfs_readpage_end_io_hook(struct btrfs_io_bio *io_bio, if (BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM) return 0; + if (btrfs_test_opt(root->fs_info, ONLYFS)) + return 0; + if (root->root_key.objectid == BTRFS_DATA_RELOC_TREE_OBJECTID && test_range_bit(io_tree, start, end, EXTENT_NODATASUM, 1, NULL)) { clear_extent_bits(io_tree, start, end, EXTENT_NODATASUM); diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c index 58f890f73650..4dc3f35ca7e3 100644 --- a/fs/btrfs/super.c +++ b/fs/btrfs/super.c @@ -345,6 +345,7 @@ enum { Opt_rescue, Opt_usebackuproot, Opt_nologreplay, + Opt_rescue_onlyfs, /* Deprecated options */ Opt_recovery, @@ -440,6 +441,7 @@ static const match_table_t tokens = { static const match_table_t rescue_tokens = { {Opt_usebackuproot, "usebackuproot"}, {Opt_nologreplay, "nologreplay"}, + {Opt_rescue_onlyfs, "onlyfs"}, {Opt_err, NULL}, }; @@ -472,6 +474,11 @@ static int parse_rescue_options(struct btrfs_fs_info *info, const char *options) btrfs_set_and_info(info, NOLOGREPLAY, "disabling log replay at mount time"); break; + case Opt_rescue_onlyfs: + btrfs_set_and_info(info, ONLYFS, + "only reading fs roots, also setting nologreplay"); + btrfs_set_opt(info->mount_opt, NOLOGREPLAY); + break; case Opt_err: btrfs_info(info, "unrecognized rescue option '%s'", p); ret = -EINVAL; @@ -1400,6 +1407,8 @@ static int btrfs_show_options(struct seq_file *seq, struct dentry *dentry) seq_puts(seq, ",notreelog"); if (btrfs_test_opt(info, NOLOGREPLAY)) seq_puts(seq, ",rescue=nologreplay"); + if (btrfs_test_opt(info, ONLYFS)) + seq_puts(seq, ",rescue=onlyfs"); if (btrfs_test_opt(info, FLUSHONCOMMIT)) seq_puts(seq, ",flushoncommit"); if (btrfs_test_opt(info, DISCARD_SYNC)) @@ -1839,6 +1848,14 @@ static int btrfs_remount(struct super_block *sb, int *flags, char *data) if (ret) goto restore; + if (btrfs_test_opt(fs_info, ONLYFS) != + (old_opts & BTRFS_MOUNT_ONLYFS)) { + btrfs_err(fs_info, + "rescue=onlyfs mount option can't be changed during remount"); + ret = -EINVAL; + goto restore; + } + btrfs_remount_begin(fs_info, old_opts, *flags); btrfs_resize_thread_pool(fs_info, fs_info->thread_pool_size, old_thread_pool_size); @@ -1904,6 +1921,13 @@ static int btrfs_remount(struct super_block *sb, int *flags, char *data) goto restore; } + if (btrfs_test_opt(fs_info, ONLYFS)) { + btrfs_err(fs_info, + "remounting read-write with rescue=onlyfs is not allowed"); + ret = -EINVAL; + goto restore; + } + ret = btrfs_cleanup_fs_roots(fs_info); if (ret) goto restore; @@ -2208,8 +2232,9 @@ static int btrfs_statfs(struct dentry *dentry, struct kstatfs *buf) * still can allocate chunks and thus are fine using the currently * calculated f_bavail. */ - if (!mixed && block_rsv->space_info->full && - total_free_meta - thresh < block_rsv->size) + if (btrfs_test_opt(fs_info, ONLYFS) || + (!mixed && block_rsv->space_info->full && + total_free_meta - thresh < block_rsv->size)) buf->f_bavail = 0; buf->f_type = BTRFS_SUPER_MAGIC; diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c index 537ccf66ee20..e86f78579884 100644 --- a/fs/btrfs/volumes.c +++ b/fs/btrfs/volumes.c @@ -7628,6 +7628,13 @@ int btrfs_verify_dev_extents(struct btrfs_fs_info *fs_info) u64 prev_dev_ext_end = 0; int ret = 0; + /* + * For rescue=onlyfs mount option, we're already RO and are salvaging + * data, no need for such strict check. + */ + if (btrfs_test_opt(fs_info, ONLYFS)) + return 0; + key.objectid = 1; key.type = BTRFS_DEV_EXTENT_KEY; key.offset = 0;
One of the things that came up consistently in talking with Fedora about switching to btrfs as default is that btrfs is particularly vulnerable to metadata corruption. If any of the core global roots are corrupted, the fs is unmountable and fsck can't usually do anything for you without some special options. Qu addressed this sort of with rescue=skipbg, but that's poorly named as what it really does is just allow you to operate without an extent root. However there are a lot of other roots, and I'd rather not have to do mount -o rescue=skipbg,rescue=nocsum,rescue=nofreespacetree,rescue=blah Instead take his original idea and modify it so it just works for everything. Turn it into rescue=onlyfs, and then any major root we fail to read just gets left empty and we carry on. Obviously if the fs roots are screwed then the user is in trouble, but otherwise this makes it much easier to pull stuff off the disk without needing our special rescue tools. I tested this with my TEST_DEV that had a bunch of data on it by corrupting the csum tree and then reading files off the disk. Signed-off-by: Josef Bacik <josef@toxicpanda.com> --- v1->v2: - Rebase onto recent misc-next. - Add in the fill_dummy_bgs since skipbg is no longer in this tree. fs/btrfs/block-group.c | 49 +++++++++++++++++++++++++++++ fs/btrfs/ctree.h | 1 + fs/btrfs/disk-io.c | 71 +++++++++++++++++++++++++++++------------- fs/btrfs/inode.c | 6 +++- fs/btrfs/super.c | 29 +++++++++++++++-- fs/btrfs/volumes.c | 7 +++++ 6 files changed, 138 insertions(+), 25 deletions(-)