Message ID | 20191009164422.7202-1-fdmanana@kernel.org |
---|---|
State | Superseded, archived |
Series | Btrfs: fix negative subv_writers counter and data space leak after buffered write |
On Wed, Oct 09, 2019 at 05:44:22PM +0100, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
>
> When doing a buffered write it's possible to leave the subv_writers
> counter of the root, used for synchronization between buffered nocow
> writers and snapshotting, with a negative value. This happens in an
> exceptional case like the following:
>
> 1) We fail to allocate data space for the write, since there's not
> enough available data space nor enough unallocated space for allocating
> a new data block group;
>
> 2) Because of that failure, we try to go to NOCOW mode, which succeeds
> and therefore we set the local variable 'only_release_metadata' to true
> and set the root's subv_writers counter to 1 through the call to
> btrfs_start_write_no_snapshotting() made by check_can_nocow();
>
> 3) The call to btrfs_copy_from_user() returns zero, which is very unlikely
> to happen but not impossible;
>
> 4) No pages are copied because btrfs_copy_from_user() returned zero;
>
> 5) We call btrfs_end_write_no_snapshotting(), which decrements the root's
> subv_writers counter to 0;
>
> 6) We don't set 'only_release_metadata' back to 'false' because we do
> it only if 'copied', the value returned by btrfs_copy_from_user(), is
> greater than zero;
>
> 7) On the next iteration of the while loop, which processes the same
> page range, we are now able to allocate data space for the write (we
> got enough data space released in the meantime);
>
> 8) After this, if we fail at btrfs_delalloc_reserve_metadata(), because
> now there isn't enough free metadata space, or in some other place
> further below (prepare_pages(), lock_and_cleanup_extent_if_need(),
> btrfs_dirty_pages()), we break out of the while loop with
> 'only_release_metadata' having a value of 'true';
>
> 9) Because 'only_release_metadata' is 'true', we end up decrementing the
> root's subv_writers counter to -1, and we also end up not releasing the
> data space previously reserved through btrfs_check_data_free_space().
> As a consequence, the mechanism for synchronizing NOCOW buffered writes
> with snapshotting gets broken.
>
> Fix this by always setting 'only_release_metadata' to false whenever it
> currently has a true value, independently of having been able to copy any
> data to the pages.

Can we accomplish the same thing by just doing

	only_release_metadata = false;

at the start of the loop? That way we only ever deal with it in its current
scope? Thanks,

Josef
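The sequence in the quoted report can be replayed in a few lines of user-space C. This is a hypothetical model, not kernel code: `replay_report()`, `struct outcome` and `WRITE_BYTES` (assumed to be 4096 bytes) are illustrative names, and each kernel call is reduced to its effect on a plain counter or a reservation size. The `patched` parameter switches between the pre-fix reset rule and the rule proposed in the patch description.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Assumed size of the page range written in both iterations (hypothetical). */
#define WRITE_BYTES ((size_t)4096)

/* Hypothetical result type: final counter value and unreleased data space. */
struct outcome {
	int subv_writers;   /* stands in for the root's subv_writers counter */
	size_t leaked_data; /* data space reserved but never released */
};

/*
 * Replays steps 1-9 with plain integers instead of kernel state.
 * patched == false: only_release_metadata is reset only when copied > 0.
 * patched == true:  it is reset whenever it is set, as the fix proposes.
 */
static struct outcome replay_report(bool patched, size_t first_copied)
{
	struct outcome o = { 0, 0 };
	bool only_release_metadata = false;
	size_t reserved;

	assert(first_copied <= WRITE_BYTES);

	/* Iteration 1, steps 1-2: no data space, so fall back to NOCOW. */
	o.subv_writers++; /* models btrfs_start_write_no_snapshotting() */
	only_release_metadata = true;

	/* Steps 3-5: copy first_copied bytes, then end the NOCOW write. */
	if (only_release_metadata)
		o.subv_writers--; /* models btrfs_end_write_no_snapshotting() */

	/* Step 6: whether the flag is reset depends on the chosen rule. */
	if (only_release_metadata && (patched || first_copied > 0))
		only_release_metadata = false;

	/* Iteration 2, step 7: data space is now available and reserved. */
	reserved = WRITE_BYTES;

	/* Steps 8-9: a later step fails; cleanup is chosen by the flag. */
	if (only_release_metadata)
		o.subv_writers--; /* stale flag: the data space is never released */
	else
		reserved = 0;     /* data space released, as it should be */

	o.leaked_data = reserved;
	return o;
}
```

In this model, `replay_report(false, 0)` ends with the counter at -1 and `WRITE_BYTES` of data space unreleased, matching steps 8 and 9. Passing a non-zero `first_copied`, or `patched == true`, leaves both balanced, which is consistent with the report: the flag goes stale only when the copy returns zero.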
On Fri, Oct 11, 2019 at 2:27 PM Josef Bacik <josef@toxicpanda.com> wrote:
>
> On Wed, Oct 09, 2019 at 05:44:22PM +0100, fdmanana@kernel.org wrote:
> > From: Filipe Manana <fdmanana@suse.com>
> > [...]
>
> Can we accomplish the same thing by just doing
>
> 	only_release_metadata = false;
>
> at the start of the loop? That way we only ever deal with it in its current
> scope? Thanks,

Yeah, that's probably better. I just felt like leaving it closer to the last
place where it's used.

Thanks.

>
> Josef
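Josef's alternative, resetting the flag at the top of the loop, can be modeled the same way. This is again a hypothetical sketch with illustrative names (`struct iter_script`, `struct outcome`, `run_loop()`), not the code that was eventually merged. The flag is assigned, not re-declared, at the start of each iteration, following the suggestion literally, and the copy length is left out of the script because the reset no longer depends on it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical scripted outcome of one loop iteration. */
struct iter_script {
	bool data_space_ok; /* data space reservation succeeded */
	bool fails_later;   /* e.g. metadata reservation or page setup fails */
};

struct outcome {
	int subv_writers;   /* stands in for the root's subv_writers counter */
	size_t leaked_data; /* data space reserved but never released */
};

/*
 * Loop model with only_release_metadata reset at the top of every iteration,
 * so no iteration can observe a value left over from the previous one.
 */
static struct outcome run_loop(const struct iter_script *it, size_t n,
			       size_t write_bytes)
{
	struct outcome o = { 0, 0 };
	bool only_release_metadata;

	for (size_t i = 0; i < n; i++) {
		size_t reserved = 0;

		only_release_metadata = false; /* the suggested reset */

		if (it[i].data_space_ok) {
			reserved = write_bytes;
		} else {
			o.subv_writers++; /* NOCOW fallback starts the write */
			only_release_metadata = true;
		}

		if (it[i].fails_later) {
			if (only_release_metadata)
				o.subv_writers--; /* end the NOCOW write */
			else
				reserved = 0;     /* release the data space */
			o.leaked_data += reserved;
			break;
		}

		/* Success: pages are dirtied and any data reservation is consumed. */
		if (only_release_metadata)
			o.subv_writers--;
	}

	/* In this model every NOCOW start is matched by exactly one end. */
	assert(o.subv_writers == 0);
	return o;
}
```

Scripting the report's two iterations (a NOCOW iteration, then a COW iteration that fails) with `run_loop()` leaves the counter at 0 and no data space unreleased. The design point is scoping: whatever an earlier iteration did, each iteration starts from a known flag value.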
```diff
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 27e5b269e729..c98c1d10fd3a 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1780,18 +1780,19 @@ static noinline ssize_t btrfs_buffered_write(struct kiocb *iocb,
 		}
 
 		release_bytes = 0;
-		if (only_release_metadata)
+		if (only_release_metadata) {
 			btrfs_end_write_no_snapshotting(root);
-
-		if (only_release_metadata && copied > 0) {
-			lockstart = round_down(pos,
-					       fs_info->sectorsize);
-			lockend = round_up(pos + copied,
-					   fs_info->sectorsize) - 1;
-
-			set_extent_bit(&BTRFS_I(inode)->io_tree, lockstart,
-				       lockend, EXTENT_NORESERVE, NULL,
-				       NULL, GFP_NOFS);
+			if (copied > 0) {
+				lockstart = round_down(pos,
+						       fs_info->sectorsize);
+				lockend = round_up(pos + copied,
+						   fs_info->sectorsize) - 1;
+
+				set_extent_bit(&BTRFS_I(inode)->io_tree,
+					       lockstart, lockend,
+					       EXTENT_NORESERVE, NULL, NULL,
+					       GFP_NOFS);
+			}
 			only_release_metadata = false;
 		}
 
```
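When the flag is set, the patched hunk ends the NOCOW write and, only if something was copied, tags the sector-aligned range with EXTENT_NORESERVE. The flag reset now sits outside the `copied > 0` check. What follows is a minimal user-space sketch of that control flow and the range arithmetic, assuming a power-of-two sector size. `model_round_down()`, `model_round_up()`, `struct iter_state`, `struct noreserve_range` and `finish_iteration()` are hypothetical stand-ins; the sketch does not call `set_extent_bit()` or any other kernel function.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* User-space stand-ins for round_down()/round_up(); power-of-two align only. */
static uint64_t model_round_down(uint64_t x, uint64_t align)
{
	return x & ~(align - 1);
}

static uint64_t model_round_up(uint64_t x, uint64_t align)
{
	return (x + align - 1) & ~(align - 1);
}

/* Hypothetical per-iteration state touched by the patched block. */
struct iter_state {
	bool only_release_metadata;
	int subv_writers; /* stands in for the root's subv_writers counter */
};

/* Inclusive byte range that would be tagged EXTENT_NORESERVE. */
struct noreserve_range {
	bool valid; /* false when no range is tagged */
	uint64_t start;
	uint64_t end;
};

/*
 * Mirrors the control flow of the patched hunk: end the NOCOW write, tag the
 * copied range only if copied > 0, and reset the flag in every case.
 */
static struct noreserve_range finish_iteration(struct iter_state *st,
					       uint64_t pos, uint64_t copied,
					       uint64_t sectorsize)
{
	struct noreserve_range r = { false, 0, 0 };

	assert(sectorsize != 0 && (sectorsize & (sectorsize - 1)) == 0);

	if (st->only_release_metadata) {
		st->subv_writers--; /* models btrfs_end_write_no_snapshotting() */
		if (copied > 0) {
			r.valid = true;
			/* lockstart and lockend from the hunk */
			r.start = model_round_down(pos, sectorsize);
			r.end = model_round_up(pos + copied, sectorsize) - 1;
		}
		/* No longer tied to copied > 0. */
		st->only_release_metadata = false;
	}
	return r;
}
```

With a 4096-byte sector, a 3000-byte copy at offset 5000 yields the inclusive range 4096 to 8191, while a zero-byte copy tags nothing but still clears the flag, which is the case the patch addresses.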