| Message ID | 75e2c5599917601879ad120425376d54e95cb49c.1742889641.git.wqu@suse.com (mailing list archive) |
| --- | --- |
| State | New |
| Series | [v3] btrfs: simplify the reserved space handling inside copy_one_range() |
On 2025/3/25 18:31, Qu Wenruo wrote:
> Inside that function we have the following variables, all involved in
> handling the reserved space:
>
> - block_offset
> - reserve_bytes
> - dirty_blocks
> - num_blocks
> - release_bytes
>
> Many of them (block_offset, dirty_blocks, num_blocks) are only used
> once or twice as temporary variables.
>
> Furthermore the variable @release_bytes is used for two different
> purposes:
>
> - As a temporary variable to release the excess range if a short copy
>   happened.
>   After a short copy, @release_bytes is immediately re-calculated,
>   cancelling such temporary usage.
>
> - As a variable to record the length that will be released.
>
> And there is even a bug in the short-copy handling:
>
> - btrfs_delalloc_release_extents() is not called for short copies.
>   This looks like a bug from the very beginning, affecting both the
>   zero-bytes-copied and the partially-copied cases.
>   It can lead to rare generic/027 kernel warnings for subpage block
>   sizes.
>
> To fix all those unnecessary variables along with the inconsistent
> variable usage:
>
> - Introduce the variables @reserved_start and @reserved_len.
>   Both are used to track the current reserved range (block aligned).
>
> - Use the above variables to calculate the range which needs to be
>   shrunk.
>   When a short copy happens, we need to shrink the reserved space
>   beyond the last block.
>
> - Handle the zero-bytes-copied case and return immediately.
>   Not all functions handle a 0 length correctly, so to be extra safe,
>   just manually do the cleanup and exit.
>
> - Call btrfs_delalloc_release_extents() for short-copy cases.
>   This also fixes a bug in the original code, which didn't call
>   btrfs_delalloc_release_extents() to release the range beyond the
>   last block.
>
> - Remove the five variables we no longer need.
>
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
> Changelog:
> v3:
> - Do the cleanup and return directly if no byte is copied.
>
> - Call btrfs_delalloc_release_extents() when a short copy happened,
>   which was not done in the original code.
>
> v2:
> - Fix a bug that can be triggered by generic/027 with subpage block
>   size.
>   The fix is to keep the old behavior that the first block will only
>   be released by btrfs_delalloc_release_extents() after
>   btrfs_dirty_folio().
> ---
>  fs/btrfs/file.c | 79 ++++++++++++++++++++++++++-----------------------
>  1 file changed, 42 insertions(+), 37 deletions(-)
>
> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> index b72fc00bc2f6..0b637e61fdee 100644
> --- a/fs/btrfs/file.c
> +++ b/fs/btrfs/file.c
> @@ -1164,15 +1164,13 @@ static int copy_one_range(struct btrfs_inode *inode, struct iov_iter *i,
>  {
>  	struct btrfs_fs_info *fs_info = inode->root->fs_info;
>  	struct extent_state *cached_state = NULL;
> -	const size_t block_offset = start & (fs_info->sectorsize - 1);
> +	const u32 blocksize = fs_info->sectorsize;
>  	size_t write_bytes = min(iov_iter_count(i), PAGE_SIZE - offset_in_page(start));
> -	size_t reserve_bytes;
>  	size_t copied;
> -	size_t dirty_blocks;
> -	size_t num_blocks;
>  	struct folio *folio = NULL;
> -	u64 release_bytes;
>  	int extents_locked;
> +	const u64 reserved_start = round_down(start, blocksize);
> +	u64 reserved_len;
>  	u64 lockstart;
>  	u64 lockend;
>  	bool only_release_metadata = false;
> @@ -1190,23 +1188,25 @@ static int copy_one_range(struct btrfs_inode *inode, struct iov_iter *i,
>  			   &only_release_metadata);
>  	if (ret < 0)
>  		return ret;
> -	reserve_bytes = ret;
> -	release_bytes = reserve_bytes;
> +	reserved_len = ret;
> +	/* The write range must be inside the reserved range. */
> +	ASSERT(start >= reserved_start);
> +	ASSERT(start + write_bytes <= reserved_start + reserved_len);
>
>  again:
>  	ret = balance_dirty_pages_ratelimited_flags(inode->vfs_inode.i_mapping,
>  						    bdp_flags);
>  	if (ret) {
> -		btrfs_delalloc_release_extents(inode, reserve_bytes);
> -		release_space(inode, *data_reserved, start, release_bytes,
> +		btrfs_delalloc_release_extents(inode, reserved_len);
> +		release_space(inode, *data_reserved, reserved_start, reserved_len,
>  			      only_release_metadata);
>  		return ret;
>  	}
>
>  	ret = prepare_one_folio(&inode->vfs_inode, &folio, start, write_bytes, false);
>  	if (ret) {
> -		btrfs_delalloc_release_extents(inode, reserve_bytes);
> -		release_space(inode, *data_reserved, start, release_bytes,
> +		btrfs_delalloc_release_extents(inode, reserved_len);
> +		release_space(inode, *data_reserved, reserved_start, reserved_len,
>  			      only_release_metadata);
>  		return ret;
>  	}
> @@ -1217,8 +1217,8 @@ static int copy_one_range(struct btrfs_inode *inode, struct iov_iter *i,
>  		if (!nowait && extents_locked == -EAGAIN)
>  			goto again;
>
> -		btrfs_delalloc_release_extents(inode, reserve_bytes);
> -		release_space(inode, *data_reserved, start, release_bytes,
> +		btrfs_delalloc_release_extents(inode, reserved_len);
> +		release_space(inode, *data_reserved, reserved_start, reserved_len,
>  			      only_release_metadata);
>  		ret = extents_locked;
>  		return ret;
> @@ -1235,35 +1235,40 @@ static int copy_one_range(struct btrfs_inode *inode, struct iov_iter *i,
>  	 * sure they don't happen by forcing retry this copy.
>  	 */
>  	if (unlikely(copied < write_bytes)) {
> +		u64 last_block;
> +		u64 release_len;
> +
>  		if (!folio_test_uptodate(folio)) {
>  			iov_iter_revert(i, copied);
>  			copied = 0;
>  		}
> -	}
>
> -	num_blocks = BTRFS_BYTES_TO_BLKS(fs_info, reserve_bytes);
> -	dirty_blocks = round_up(copied + block_offset, fs_info->sectorsize);
> -	dirty_blocks = BTRFS_BYTES_TO_BLKS(fs_info, dirty_blocks);
> -
> -	if (copied == 0)
> -		dirty_blocks = 0;
> -
> -	if (num_blocks > dirty_blocks) {
> -		/* Release everything except the sectors we dirtied. */
> -		release_bytes -= dirty_blocks << fs_info->sectorsize_bits;
> -		if (only_release_metadata) {
> -			btrfs_delalloc_release_metadata(inode,
> -							release_bytes, true);
> -		} else {
> -			const u64 release_start = round_up(start + copied,
> -							   fs_info->sectorsize);
> -
> -			btrfs_delalloc_release_space(inode,
> -					*data_reserved, release_start,
> -					release_bytes, true);
> +		if (copied == 0) {
> +			btrfs_drop_folio(fs_info, folio, start, copied);
> +			if (extents_locked)
> +				unlock_extent(&inode->io_tree, lockstart, lockend,
> +					      &cached_state);
> +			else
> +				free_extent_state(cached_state);
> +			btrfs_delalloc_release_extents(inode, reserved_len);
> +			release_space(inode, *data_reserved, reserved_start,
> +				      reserved_len, only_release_metadata);
> +			return 0;
>  		}
> +
> +		/* Release space beyond the last block and continue. */
> +		last_block = round_up(start + copied, blocksize);
> +		release_len = reserved_start + reserved_len - last_block;
> +		btrfs_delalloc_release_extents(inode, release_len);

This is incorrect.

btrfs_delalloc_release_extents() is never as accurate as things like the
qgroup reserved space, and it cannot handle shrinking the reserved space
in this case.

It works purely on @num_bytes, and the number of extents to release is
always calculated against the maximum extent size. So in our current
code, btrfs_delalloc_release_extents() always decreases the outstanding
extent count by one.

Thus the following pattern will cause problems:

	btrfs_delalloc_reserve_metadata(reserved_start, reserved_len);
	btrfs_delalloc_release_extents(reserved_len - SZ_4K);
	btrfs_delalloc_release_extents(SZ_4K);

The reservation increases the outstanding extent count by one, but the
two release calls decrease it by two.

Thus the original code is correct: we should only call
btrfs_delalloc_release_extents() once after the space is reserved.

I'd better find a proper way to make btrfs_delalloc_release_extents()
as accurate as the qgroup reserved space. As for large data folios, we
have to over-reserve before grabbing the folio, then release the excess
before doing the copy.

Thanks,
Qu
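The double decrement described in the reply can be reproduced with a toy model. This is a minimal userspace sketch, not kernel code: it only assumes what the reply states, namely that each release call decrements the outstanding extent count by the number of maximum-sized extents covering @num_bytes (BTRFS_MAX_EXTENT_SIZE, 128M in current kernels); the helper names merely mirror the kernel's.

```c
#include <stdint.h>
#include <stdio.h>

#define MAX_EXTENT_SIZE	(128ULL * 1024 * 1024)	/* models BTRFS_MAX_EXTENT_SIZE */
#define SZ_4K		4096ULL

static int64_t outstanding_extents;

/* Rough model of count_max_extents(): max-sized extents needed to cover @bytes. */
static uint64_t count_max_extents(uint64_t bytes)
{
	return (bytes + MAX_EXTENT_SIZE - 1) / MAX_EXTENT_SIZE;
}

static void reserve(uint64_t bytes) { outstanding_extents += count_max_extents(bytes); }
static void release(uint64_t bytes) { outstanding_extents -= count_max_extents(bytes); }

int main(void)
{
	const uint64_t reserved_len = 16 * SZ_4K;

	reserve(reserved_len);		/* one reservation: +1 outstanding extent */
	release(reserved_len - SZ_4K);	/* shrink after a short copy: -1 */
	release(SZ_4K);			/* final release: -1 again */

	printf("outstanding_extents = %lld (expected 0)\n",
	       (long long)outstanding_extents);
	return 0;
}
```

Running this prints outstanding_extents = -1: the single reservation adds one outstanding extent, but splitting its release into two calls removes two, which is exactly the imbalance behind the objection.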
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index b72fc00bc2f6..0b637e61fdee 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1164,15 +1164,13 @@ static int copy_one_range(struct btrfs_inode *inode, struct iov_iter *i,
 {
 	struct btrfs_fs_info *fs_info = inode->root->fs_info;
 	struct extent_state *cached_state = NULL;
-	const size_t block_offset = start & (fs_info->sectorsize - 1);
+	const u32 blocksize = fs_info->sectorsize;
 	size_t write_bytes = min(iov_iter_count(i), PAGE_SIZE - offset_in_page(start));
-	size_t reserve_bytes;
 	size_t copied;
-	size_t dirty_blocks;
-	size_t num_blocks;
 	struct folio *folio = NULL;
-	u64 release_bytes;
 	int extents_locked;
+	const u64 reserved_start = round_down(start, blocksize);
+	u64 reserved_len;
 	u64 lockstart;
 	u64 lockend;
 	bool only_release_metadata = false;
@@ -1190,23 +1188,25 @@ static int copy_one_range(struct btrfs_inode *inode, struct iov_iter *i,
 			   &only_release_metadata);
 	if (ret < 0)
 		return ret;
-	reserve_bytes = ret;
-	release_bytes = reserve_bytes;
+	reserved_len = ret;
+	/* The write range must be inside the reserved range. */
+	ASSERT(start >= reserved_start);
+	ASSERT(start + write_bytes <= reserved_start + reserved_len);

 again:
 	ret = balance_dirty_pages_ratelimited_flags(inode->vfs_inode.i_mapping,
 						    bdp_flags);
 	if (ret) {
-		btrfs_delalloc_release_extents(inode, reserve_bytes);
-		release_space(inode, *data_reserved, start, release_bytes,
+		btrfs_delalloc_release_extents(inode, reserved_len);
+		release_space(inode, *data_reserved, reserved_start, reserved_len,
 			      only_release_metadata);
 		return ret;
 	}

 	ret = prepare_one_folio(&inode->vfs_inode, &folio, start, write_bytes, false);
 	if (ret) {
-		btrfs_delalloc_release_extents(inode, reserve_bytes);
-		release_space(inode, *data_reserved, start, release_bytes,
+		btrfs_delalloc_release_extents(inode, reserved_len);
+		release_space(inode, *data_reserved, reserved_start, reserved_len,
 			      only_release_metadata);
 		return ret;
 	}
@@ -1217,8 +1217,8 @@ static int copy_one_range(struct btrfs_inode *inode, struct iov_iter *i,
 		if (!nowait && extents_locked == -EAGAIN)
 			goto again;

-		btrfs_delalloc_release_extents(inode, reserve_bytes);
-		release_space(inode, *data_reserved, start, release_bytes,
+		btrfs_delalloc_release_extents(inode, reserved_len);
+		release_space(inode, *data_reserved, reserved_start, reserved_len,
 			      only_release_metadata);
 		ret = extents_locked;
 		return ret;
@@ -1235,35 +1235,40 @@ static int copy_one_range(struct btrfs_inode *inode, struct iov_iter *i,
 	 * sure they don't happen by forcing retry this copy.
 	 */
 	if (unlikely(copied < write_bytes)) {
+		u64 last_block;
+		u64 release_len;
+
 		if (!folio_test_uptodate(folio)) {
 			iov_iter_revert(i, copied);
 			copied = 0;
 		}
-	}

-	num_blocks = BTRFS_BYTES_TO_BLKS(fs_info, reserve_bytes);
-	dirty_blocks = round_up(copied + block_offset, fs_info->sectorsize);
-	dirty_blocks = BTRFS_BYTES_TO_BLKS(fs_info, dirty_blocks);
-
-	if (copied == 0)
-		dirty_blocks = 0;
-
-	if (num_blocks > dirty_blocks) {
-		/* Release everything except the sectors we dirtied. */
-		release_bytes -= dirty_blocks << fs_info->sectorsize_bits;
-		if (only_release_metadata) {
-			btrfs_delalloc_release_metadata(inode,
-							release_bytes, true);
-		} else {
-			const u64 release_start = round_up(start + copied,
-							   fs_info->sectorsize);
-
-			btrfs_delalloc_release_space(inode,
-					*data_reserved, release_start,
-					release_bytes, true);
+		if (copied == 0) {
+			btrfs_drop_folio(fs_info, folio, start, copied);
+			if (extents_locked)
+				unlock_extent(&inode->io_tree, lockstart, lockend,
+					      &cached_state);
+			else
+				free_extent_state(cached_state);
+			btrfs_delalloc_release_extents(inode, reserved_len);
+			release_space(inode, *data_reserved, reserved_start,
+				      reserved_len, only_release_metadata);
+			return 0;
 		}
+
+		/* Release space beyond the last block and continue. */
+		last_block = round_up(start + copied, blocksize);
+		release_len = reserved_start + reserved_len - last_block;
+		btrfs_delalloc_release_extents(inode, release_len);
+		if (only_release_metadata)
+			btrfs_delalloc_release_metadata(inode, release_len, true);
+		else
+			btrfs_delalloc_release_space(inode, *data_reserved,
+						     last_block, release_len, true);
+		/* We should have at least one block reserved. */
+		ASSERT(last_block > reserved_start);
+		reserved_len = last_block - reserved_start;
 	}

-	release_bytes = round_up(copied + block_offset, fs_info->sectorsize);
-
 	ret = btrfs_dirty_folio(inode, folio, start, copied, &cached_state,
 				only_release_metadata);
@@ -1280,10 +1285,10 @@ static int copy_one_range(struct btrfs_inode *inode, struct iov_iter *i,
 	else
 		free_extent_state(cached_state);

-	btrfs_delalloc_release_extents(inode, reserve_bytes);
+	btrfs_delalloc_release_extents(inode, reserved_len);
 	if (ret) {
 		btrfs_drop_folio(fs_info, folio, start, copied);
-		release_space(inode, *data_reserved, start, release_bytes,
+		release_space(inode, *data_reserved, reserved_start, reserved_len,
 			      only_release_metadata);
 		return ret;
 	}
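To make the @reserved_start / @reserved_len bookkeeping in the patch above concrete, here is a minimal userspace sketch of the shrink arithmetic after a short copy. It assumes a 4K block size on a larger page (the subpage case the changelog mentions); round_down()/round_up() are re-implemented locally and the specific offsets are invented for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Userspace stand-ins for the kernel's round_down()/round_up() (power-of-two align). */
static uint64_t round_down_u64(uint64_t x, uint64_t align) { return x & ~(align - 1); }
static uint64_t round_up_u64(uint64_t x, uint64_t align) { return (x + align - 1) & ~(align - 1); }

int main(void)
{
	const uint64_t blocksize = 4096;	/* subpage: 4K blocks on a 64K page */
	const uint64_t start = 5000;
	const uint64_t write_bytes = 10000;

	/* The reservation covers the block-aligned range around the write. */
	const uint64_t reserved_start = round_down_u64(start, blocksize);
	uint64_t reserved_len = round_up_u64(start + write_bytes, blocksize) - reserved_start;

	/* Short copy: only 1000 of the 10000 bytes were copied in. */
	const uint64_t copied = 1000;
	const uint64_t last_block = round_up_u64(start + copied, blocksize);
	const uint64_t release_len = reserved_start + reserved_len - last_block;

	/* We should have at least one block still reserved. */
	assert(last_block > reserved_start);
	reserved_len = last_block - reserved_start;

	printf("reserved [%llu, +%llu), released %llu bytes beyond the last block\n",
	       (unsigned long long)reserved_start,
	       (unsigned long long)reserved_len,
	       (unsigned long long)release_len);
	return 0;
}
```

With these numbers the reserved range [4096, 16384) shrinks to [4096, 8192): the two blocks beyond the last partially written block are released, while the first block stays reserved until after btrfs_dirty_folio(), matching the v2 changelog note.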