Message ID | d20f67adb1ab345a9af9e0262e1aba0772832751.1611327201.git.fdmanana@suse.com (mailing list archive) |
---|---|
State | Superseded, archived |
Headers | show |
Series | btrfs: a couple bug fixes for failures of log replay and mount | expand |
On 1/22/21 10:28 AM, fdmanana@kernel.org wrote: > From: Filipe Manana <fdmanana@suse.com> > > After a sudden power failure we may end up with a space cache on disk that > is not valid and needs to be rebuilt from scratch. > > If that happens, during log replay when we attempt to pin an extent buffer > from a log tree, at btrfs_pin_extent_for_log_replay(), we do not wait for > the space cache to be rebuilt through the call to: > > btrfs_cache_block_group(cache, 1); > > That is because that only waits for the task (work queue job) that loads > the space cache to change the cache state from BTRFS_CACHE_FAST to any > other value. That is ok when the space cache on disk exists and is valid, > but when the cache is not valid and needs to be rebuilt, it ends up > returning as soon as the cache state changes to BTRFS_CACHE_STARTED (done > at caching_thread()). > > So this means that we can end up trying to unpin a range which is not yet > marked as free in the block group. This results in the call to > btrfs_remove_free_space() to return -EINVAL to > btrfs_pin_extent_for_log_replay(), which in turn makes the log replay fail > as well as mounting the filesystem. When this happens we got the following > in dmesg/syslog: > > [72383.415114] BTRFS: device fsid 32b95b69-0ea9-496a-9f02-3f5a56dc9322 devid 1 transid 1432 /dev/sdb scanned by mount (3816007) > [72383.417837] BTRFS info (device sdb): disk space caching is enabled > [72383.418536] BTRFS info (device sdb): has skinny extents > [72383.423846] BTRFS info (device sdb): start tree-log replay > [72383.426416] BTRFS warning (device sdb): block group 30408704 has wrong amount of free space > [72383.427686] BTRFS warning (device sdb): failed to load free space cache for block group 30408704, rebuilding it now > [72383.454291] BTRFS: error (device sdb) in btrfs_recover_log_trees:6203: errno=-22 unknown (Failed to pin buffers while recovering log root tree.) > [72383.456725] BTRFS: error (device sdb) in btrfs_replay_log:2253: errno=-22 unknown (Failed to recover log tree) > [72383.460241] BTRFS error (device sdb): open_ctree failed > > So fix this by making sure that during log replay we wait for the caching > task to finish completely when we need to rebuild a space cache. > > Fixes: e747853cae3ae3 ("btrfs: load free space cache asynchronously") > Signed-off-by: Filipe Manana <fdmanana@suse.com> This could actually happen before, it was just less likely because we'd start the async thread from the callers context. I assume we're getting the -EINVAL from the remove_from_bitmap() function? So we've loaded part of the free space but not all of it, and thus get the -EINVAL. This probably needs the earlier Fixes, all the async patch did was make it easier to hit. <snip> > diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c > index 30b1a630dc2f..594534482ad3 100644 > --- a/fs/btrfs/extent-tree.c > +++ b/fs/btrfs/extent-tree.c > @@ -2600,6 +2600,7 @@ int btrfs_pin_extent_for_log_replay(struct btrfs_trans_handle *trans, > u64 bytenr, u64 num_bytes) > { > struct btrfs_block_group *cache; > + struct btrfs_caching_control *caching_ctl; > int ret; > > btrfs_add_excluded_extent(trans->fs_info, bytenr, num_bytes); > @@ -2615,6 +2616,13 @@ int btrfs_pin_extent_for_log_replay(struct btrfs_trans_handle *trans, > * the pinned extents. > */ > btrfs_cache_block_group(cache, 1); Instead we could probably just do if (btrfs_test_opt(fs_info, SPACE_CACHE)) { btrfs_cache_block_group(cache, 0); btrfs_wait_block_group_cache_done(cache); } here instead of changing all of the function arguments and such. Thanks, Josef
On Fri, Jan 22, 2021 at 4:43 PM Josef Bacik <josef@toxicpanda.com> wrote: > > On 1/22/21 10:28 AM, fdmanana@kernel.org wrote: > > From: Filipe Manana <fdmanana@suse.com> > > > > After a sudden power failure we may end up with a space cache on disk that > > is not valid and needs to be rebuilt from scratch. > > > > If that happens, during log replay when we attempt to pin an extent buffer > > from a log tree, at btrfs_pin_extent_for_log_replay(), we do not wait for > > the space cache to be rebuilt through the call to: > > > > btrfs_cache_block_group(cache, 1); > > > > That is because that only waits for the task (work queue job) that loads > > the space cache to change the cache state from BTRFS_CACHE_FAST to any > > other value. That is ok when the space cache on disk exists and is valid, > > but when the cache is not valid and needs to be rebuilt, it ends up > > returning as soon as the cache state changes to BTRFS_CACHE_STARTED (done > > at caching_thread()). > > > > So this means that we can end up trying to unpin a range which is not yet > > marked as free in the block group. This results in the call to > > btrfs_remove_free_space() to return -EINVAL to > > btrfs_pin_extent_for_log_replay(), which in turn makes the log replay fail > > as well as mounting the filesystem. When this happens we got the following > > in dmesg/syslog: > > > > [72383.415114] BTRFS: device fsid 32b95b69-0ea9-496a-9f02-3f5a56dc9322 devid 1 transid 1432 /dev/sdb scanned by mount (3816007) > > [72383.417837] BTRFS info (device sdb): disk space caching is enabled > > [72383.418536] BTRFS info (device sdb): has skinny extents > > [72383.423846] BTRFS info (device sdb): start tree-log replay > > [72383.426416] BTRFS warning (device sdb): block group 30408704 has wrong amount of free space > > [72383.427686] BTRFS warning (device sdb): failed to load free space cache for block group 30408704, rebuilding it now > > [72383.454291] BTRFS: error (device sdb) in btrfs_recover_log_trees:6203: errno=-22 unknown (Failed to pin buffers while recovering log root tree.) > > [72383.456725] BTRFS: error (device sdb) in btrfs_replay_log:2253: errno=-22 unknown (Failed to recover log tree) > > [72383.460241] BTRFS error (device sdb): open_ctree failed > > > > So fix this by making sure that during log replay we wait for the caching > > task to finish completely when we need to rebuild a space cache. > > > > Fixes: e747853cae3ae3 ("btrfs: load free space cache asynchronously") > > Signed-off-by: Filipe Manana <fdmanana@suse.com> > > This could actually happen before, it was just less likely because we'd start > the async thread from the callers context. Yes, I missed that initially, as discussed on slack in the meanwhile. > I assume we're getting the -EINVAL > from the remove_from_bitmap() function? Correct, I'll add more details on the next version (new patch actually). > So we've loaded part of the free space > but not all of it, and thus get the -EINVAL. This probably needs the earlier > Fixes, all the async patch did was make it easier to hit. > > <snip> > > > diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c > > index 30b1a630dc2f..594534482ad3 100644 > > --- a/fs/btrfs/extent-tree.c > > +++ b/fs/btrfs/extent-tree.c > > @@ -2600,6 +2600,7 @@ int btrfs_pin_extent_for_log_replay(struct btrfs_trans_handle *trans, > > u64 bytenr, u64 num_bytes) > > { > > struct btrfs_block_group *cache; > > + struct btrfs_caching_control *caching_ctl; > > int ret; > > > > btrfs_add_excluded_extent(trans->fs_info, bytenr, num_bytes); > > @@ -2615,6 +2616,13 @@ int btrfs_pin_extent_for_log_replay(struct btrfs_trans_handle *trans, > > * the pinned extents. > > */ > > btrfs_cache_block_group(cache, 1); > > Instead we could probably just do > > if (btrfs_test_opt(fs_info, SPACE_CACHE)) { > btrfs_cache_block_group(cache, 0); > btrfs_wait_block_group_cache_done(cache); > } > > here instead of changing all of the function arguments and such. Thanks, Yes, in fact I was leaking the refcount on the caching_ctl as well. btrfs_cache_block_group(cache, 1) needs to be called always, it's needed for the free space tree as well. Thanks, just sent another version that combines both patches into one. > > Josef
diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c index 0886e81e5540..895fc69543de 100644 --- a/fs/btrfs/block-group.c +++ b/fs/btrfs/block-group.c @@ -435,10 +435,19 @@ static bool space_cache_v1_done(struct btrfs_block_group *cache) return ret; } +/* + * Wait for the cache to be either loaded from disk only, in case it exists and + * is valid, or wait until the cache is fully built in case it is missing or it + * is invalid. + */ void btrfs_wait_space_cache_v1_finished(struct btrfs_block_group *cache, - struct btrfs_caching_control *caching_ctl) + struct btrfs_caching_control *caching_ctl, + bool load_only) { - wait_event(caching_ctl->wait, space_cache_v1_done(cache)); + if (load_only) + wait_event(caching_ctl->wait, space_cache_v1_done(cache)); + else + wait_event(caching_ctl->wait, btrfs_block_group_done(cache)); } #ifdef CONFIG_BTRFS_DEBUG @@ -757,7 +766,7 @@ int btrfs_cache_block_group(struct btrfs_block_group *cache, int load_cache_only btrfs_queue_work(fs_info->caching_workers, &caching_ctl->work); out: if (load_cache_only && caching_ctl) - btrfs_wait_space_cache_v1_finished(cache, caching_ctl); + btrfs_wait_space_cache_v1_finished(cache, caching_ctl, true); if (caching_ctl) btrfs_put_caching_control(caching_ctl); diff --git a/fs/btrfs/block-group.h b/fs/btrfs/block-group.h index 8f74a96074f7..fe4b92cdc08b 100644 --- a/fs/btrfs/block-group.h +++ b/fs/btrfs/block-group.h @@ -269,7 +269,8 @@ u64 btrfs_get_alloc_profile(struct btrfs_fs_info *fs_info, u64 orig_flags); void btrfs_put_block_group_cache(struct btrfs_fs_info *info); int btrfs_free_block_groups(struct btrfs_fs_info *info); void btrfs_wait_space_cache_v1_finished(struct btrfs_block_group *cache, - struct btrfs_caching_control *caching_ctl); + struct btrfs_caching_control *caching_ctl, + bool load_only); static inline u64 btrfs_data_alloc_profile(struct btrfs_fs_info *fs_info) { diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index 30b1a630dc2f..594534482ad3 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -2600,6 +2600,7 @@ int btrfs_pin_extent_for_log_replay(struct btrfs_trans_handle *trans, u64 bytenr, u64 num_bytes) { struct btrfs_block_group *cache; + struct btrfs_caching_control *caching_ctl; int ret; btrfs_add_excluded_extent(trans->fs_info, bytenr, num_bytes); @@ -2615,6 +2616,13 @@ int btrfs_pin_extent_for_log_replay(struct btrfs_trans_handle *trans, * the pinned extents. */ btrfs_cache_block_group(cache, 1); + /* + * Make sure we wait until the cache is comletely built in case it is + * missing or is invalid and therefore needs to be rebuilt. + */ + caching_ctl = btrfs_get_caching_control(cache); + if (caching_ctl) + btrfs_wait_space_cache_v1_finished(cache, caching_ctl, false); pin_down_extent(trans, cache, bytenr, num_bytes, 0); @@ -2647,7 +2655,7 @@ static int __exclude_logged_extent(struct btrfs_fs_info *fs_info, * We must wait for v1 caching to finish, otherwise we may not * remove our space. */ - btrfs_wait_space_cache_v1_finished(block_group, caching_ctl); + btrfs_wait_space_cache_v1_finished(block_group, caching_ctl, false); mutex_lock(&caching_ctl->mutex); if (start >= caching_ctl->progress) {