Message ID | bd295f0e2277e34008b4aa5648527d0394472de1.1689883754.git.josef@toxicpanda.com
---|---
State | New, archived
Series | btrfs: fix generic/475 hang
On Thu, Jul 20, 2023 at 04:12:14PM -0400, Josef Bacik wrote:
> Recently we've been having mysterious hangs while running generic/475 on
> the CI system. This turned out to be something like this
>
> Task 1
> dmsetup suspend --nolockfs
> -> __dm_suspend
> -> dm_wait_for_completion
> -> dm_wait_for_bios_completion
> -> Unable to complete because of IO's on a plug in Task 2
>
> Task 2
> wb_workfn
> -> wb_writeback
> -> blk_start_plug
> -> writeback_sb_inodes
> -> Infinite loop unable to make an allocation
>
> Task 3
> cache_block_group
> ->read_extent_buffer_pages
> ->Waiting for IO to complete that can't be submitted because Task 1
> suspended the DM device
>
> The problem here is that we need Task 2 to be scheduled completely for
> the blk plug to flush. Normally this would happen, we normally wait for
> the block group caching to finish (Task 3), and this schedule would
> result in the block plug flushing.
>
> However if there's enough free space available from the current caching
> to satisfy the allocation we won't actually wait for the caching to
> complete. This check however just checks that we have enough space, not
> that we can make the allocation. In this particular case we were trying
> to allocate 9mib, and we had 10mib of free space, but we didn't have
> 9mib of contiguous space to allocate, and thus the allocation failed and
> we looped.
>
> We specifically don't cycle through the FFE loop until we stop finding
> cached block groups because we don't want to allocate new block groups
> just because we're caching, so we short circuit the normal loop once we
> hit LOOP_CACHING_WAIT and we found a caching block group.
>
> This is normally fine, except in this particular case where the caching
> thread can't make progress because the dm device has been suspended.
>
> Fix this by adding another LOOP state that specifically waits for the
> block group to be completely cached before proceeding. This allows us
> to drop this particular optimization, and will give us the proper
> scheduling needed to finish the plug.
>
> The alternative here was to simply flush the plug if we need_resched(),
> but this is actually a sort of bad behavior from us where we assume that
> if the block group has enough free space to match our allocation we'll
> actually be successful. It is a good enough check for a quick pass to
> avoid the latency of a full wait, but free space != contiguous free
> space, so waiting is more appropriate.
>
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> ---
> fs/btrfs/extent-tree.c | 20 +++++++++++++-------
> 1 file changed, 13 insertions(+), 7 deletions(-)
>
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 04ceb9d25d3e..2850bd411a0e 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -3331,6 +3331,7 @@ int btrfs_free_extent(struct btrfs_trans_handle *trans, struct btrfs_ref *ref)
>  enum btrfs_loop_type {
>          LOOP_CACHING_NOWAIT,
>          LOOP_CACHING_WAIT,
> +        LOOP_CACHING_DONE,
>          LOOP_UNSET_SIZE_CLASS,
>          LOOP_ALLOC_CHUNK,
>          LOOP_WRONG_SIZE_CLASS,
> @@ -3920,9 +3921,6 @@ static int find_free_extent_update_loop(struct btrfs_fs_info *fs_info,
>                  return 0;
>          }
>
> -        if (ffe_ctl->loop >= LOOP_CACHING_WAIT && ffe_ctl->have_caching_bg)
> -                return 1;
> -

As far as I can tell, this change significantly diminishes the
meaning/value of LOOP_CACHING_WAIT.
It goes from a busy loop that continues until we stop failing on
uncached bgs (success or all bgs cached) to just a single retry before
moving on to LOOP_CACHING_DONE, which does the real wait_event for the
bg to get cached. I am not convinced that a single retry does much
compared to just going straight to the wait_event.

I think it should be reasonably easy to answer this question by
creating a big FS, mounting it and allocating right after, and seeing
whether the allocations succeed in LOOP_CACHING_WAIT or
LOOP_CACHING_DONE. Let's suppose that I guessed correctly and they
succeed in LOOP_CACHING_DONE. Let's also assume, based on prior
experience, that we can't always afford to wait for the bg to get fully
cached in.

Perhaps another option is to add a new event to the mix. It would be
signalled when we make progress caching a bg, and LOOP_CACHING_WAIT
could call the non-blocking btrfs_cache_block_group, then wait on that
event. This would have better granularity than waiting for the whole
block group while still having the desired scheduling behavior needed
to fix this bug. (A rough sketch of this idea follows at the end of
this reply.)

>          ffe_ctl->index++;
>          if (ffe_ctl->index < BTRFS_NR_RAID_TYPES)
>                  return 1;
> @@ -3931,6 +3929,8 @@ static int find_free_extent_update_loop(struct btrfs_fs_info *fs_info,
>           * LOOP_CACHING_NOWAIT, search partially cached block groups, kicking
>           * caching kthreads as we move along
>           * LOOP_CACHING_WAIT, search everything, and wait if our bg is caching
> +         * LOOP_CACHING_DONE, search everything, wait for the caching to
> +         * completely finish
>           * LOOP_UNSET_SIZE_CLASS, allow unset size class
>           * LOOP_ALLOC_CHUNK, force a chunk allocation and try again
>           * LOOP_NO_EMPTY_SIZE, set empty_size and empty_cluster to 0 and try
> @@ -3939,13 +3939,13 @@ static int find_free_extent_update_loop(struct btrfs_fs_info *fs_info,
>          if (ffe_ctl->loop < LOOP_NO_EMPTY_SIZE) {
>                  ffe_ctl->index = 0;
>                  /*
> -                 * We want to skip the LOOP_CACHING_WAIT step if we don't have
> +                 * We want to skip the LOOP_CACHING_* steps if we don't have
>                   * any uncached bgs and we've already done a full search
>                   * through.
>                   */
>                  if (ffe_ctl->loop == LOOP_CACHING_NOWAIT &&
>                      (!ffe_ctl->orig_have_caching_bg && full_search))
> -                        ffe_ctl->loop++;
> +                        ffe_ctl->loop = LOOP_CACHING_DONE;
>                  ffe_ctl->loop++;
>
>                  if (ffe_ctl->loop == LOOP_ALLOC_CHUNK) {
> @@ -4269,8 +4269,11 @@ static noinline int find_free_extent(struct btrfs_root *root,
>                  trace_find_free_extent_have_block_group(root, ffe_ctl, block_group);
>                  ffe_ctl->cached = btrfs_block_group_done(block_group);
>                  if (unlikely(!ffe_ctl->cached)) {
> -                        ffe_ctl->have_caching_bg = true;
> -                        ret = btrfs_cache_block_group(block_group, false);
> +                        bool wait = ffe_ctl->loop == LOOP_CACHING_DONE;

This feels quite unlikely, but I think it's also theoretically possible
that every single call to btrfs_cache_block_group in the loop will fail
with -ENOMEM, which we swallow. We then advance to the next big loop
with more caching outstanding that we no longer wait for in any way. I
think changing the wait check to >= LOOP_CACHING_DONE would fix this.

> +
> +                        if (!wait)
> +                                ffe_ctl->have_caching_bg = true;
> +                        ret = btrfs_cache_block_group(block_group, wait);

I think a comment somewhere explaining that the wait_event this
triggers is critical would be helpful.
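Roughly, the two suggestions above (the >= check and the comment) amount
to something like the following; this is only an untested sketch to show
the shape, not a drop-in replacement for the hunk:

                        bool wait = ffe_ctl->loop >= LOOP_CACHING_DONE;

                        if (!wait)
                                ffe_ctl->have_caching_bg = true;

                        /*
                         * When wait is true this call does not return until
                         * the block group is fully cached (or caching hit an
                         * error).  That sleep is the schedule point that lets
                         * a held blk plug flush, which is what breaks the
                         * generic/475 hang described in the changelog, so it
                         * must not be short-circuited.
                         */
                        ret = btrfs_cache_block_group(block_group, wait);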
>
>                          /*
>                           * If we get ENOMEM here or something else we want to
> @@ -4285,6 +4288,9 @@ static noinline int find_free_extent(struct btrfs_root *root,
>                                  ret = 0;
>                                  goto loop;
>                          }
> +
> +                        if (wait)
> +                                ffe_ctl->cached = btrfs_block_group_done(block_group);

Should we set have_caching_bg = false too?

>                          ret = 0;
>                  }
>
> --
> 2.41.0
>
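As promised above, here is a rough sketch of the partial-progress wait
idea. The progress_seq counter and progress_wait waitqueue do not exist
today; they would have to be added to struct btrfs_caching_control and
signalled from the caching thread, so treat those names as hypothetical:

        /* In find_free_extent(), for the LOOP_CACHING_WAIT pass: */
        struct btrfs_caching_control *caching_ctl;

        caching_ctl = btrfs_get_caching_control(block_group);
        if (!caching_ctl) {
                /* Caching already finished, nothing to wait for. */
                ffe_ctl->cached = btrfs_block_group_done(block_group);
        } else {
                /* Hypothetical progress counter, sampled before we kick caching. */
                unsigned int seq = atomic_read(&caching_ctl->progress_seq);

                /* Kick off / continue caching without blocking. */
                ret = btrfs_cache_block_group(block_group, false);

                /*
                 * Sleep until the caching thread has made some progress (or
                 * finished) rather than until the whole block group is
                 * cached.  This still gives up the CPU, so a held blk plug
                 * gets flushed, but with much finer granularity than
                 * LOOP_CACHING_DONE.
                 */
                if (!ret)
                        wait_event(caching_ctl->progress_wait,
                                   atomic_read(&caching_ctl->progress_seq) != seq ||
                                   btrfs_block_group_done(block_group));

                btrfs_put_caching_control(caching_ctl);
        }

On the caching side this would just mean bumping progress_seq and calling
wake_up(&caching_ctl->progress_wait) each time a batch of free space is
added to the block group.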