
[v3] btrfs: simplify the reserved space handling inside copy_one_range()

Message ID 75e2c5599917601879ad120425376d54e95cb49c.1742889641.git.wqu@suse.com (mailing list archive)
State New
Series [v3] btrfs: simplify the reserved space handling inside copy_one_range()

Commit Message

Qu Wenruo March 25, 2025, 8:01 a.m. UTC
Inside that function we have the following variables, all involved in
handling the reserved space:

- block_offset
- reserve_bytes
- dirty_blocks
- num_blocks
- release_bytes

Many of them (block_offset, dirty_blocks, num_blocks) are only utilized
once or twice as temporary variables.

Furthermore the variable @release_bytes is utilized for two different
purposes:

- As a temporary variable to release the exceeding range if a short copy
  happened
  After a short copy, @release_bytes is immediately re-calculated,
  cancelling that temporary usage.

- As a variable to record the length that will be released

And there is even a bug in the short-copy handling:

- btrfs_delalloc_release_extents() is not called for short copies
  This looks like a bug from the very beginning, affecting both the
  zero-bytes-copied and the partially-copied cases.
  It can lead to rare generic/027 kernel warnings for subpage block
  sizes.

To remove all those unnecessary variables and fix the inconsistent
variable usage:

- Introduce the variables @reserved_start and @reserved_len
  Both are utilized to track the currently reserved range (block
  aligned); see the sketch after this list.

- Use the above variables to calculate the range which needs to be shrunk
  When a short copy happens, we need to release the reserved space
  beyond the last block covered by the copied bytes.

- Handle the zero-bytes-copied case by returning immediately
  Not all functions handle a 0 length correctly, so to be extra safe,
  just do the cleanup manually and exit.

- Call btrfs_delalloc_release_extents() for short-copy cases
  This also fixes a bug in the original code, which did not call
  btrfs_delalloc_release_extents() to release the range beyond the
  last block.

- Remove the five variables we no longer need
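
A minimal sketch of the resulting bookkeeping, with made-up numbers (4K
block size, an unaligned write at offset 6144, a 1000-byte short copy).
This is plain C for illustration only, not the kernel code itself; the
alignment macros mirror the kernel's round_down()/round_up() for
power-of-two sizes:

	#include <stdio.h>

	#define round_down(x, y) ((x) & ~((unsigned long long)(y) - 1))
	#define round_up(x, y)   round_down((x) + (y) - 1, (y))

	int main(void)
	{
		const unsigned long long blocksize = 4096;
		unsigned long long start = 6144;  /* unaligned write offset */
		unsigned long long reserved_start = round_down(start, blocksize);
		unsigned long long reserved_len = 8192;  /* blocks [4096, 12288) */
		unsigned long long copied = 1000; /* short copy */

		/* Keep every block that now holds data, release the rest. */
		unsigned long long last_block = round_up(start + copied, blocksize);
		unsigned long long release_len =
			reserved_start + reserved_len - last_block;

		reserved_len = last_block - reserved_start;
		/* Prints: last_block=8192 release_len=4096 reserved_len=4096 */
		printf("last_block=%llu release_len=%llu reserved_len=%llu\n",
		       last_block, release_len, reserved_len);
		return 0;
	}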

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
Changelog:
v3:
- Do the cleanup and return directly if no bytes are copied

- Call btrfs_delalloc_release_extents() when a short copy happened
  This was not done in the original code.

v2:
- Fix a bug that can be triggered by generic/027 with subpage block sizes
  The fix is to keep the old behavior, where the first block is only
  released by btrfs_delalloc_release_extents() after
  btrfs_dirty_folio().
---
 fs/btrfs/file.c | 79 ++++++++++++++++++++++++++-----------------------
 1 file changed, 42 insertions(+), 37 deletions(-)

Comments

Qu Wenruo March 26, 2025, 6:19 a.m. UTC | #1
On 2025/3/25 18:31, Qu Wenruo wrote:
> [... commit message, changelog and earlier hunks snipped ...]
> @@ -1235,35 +1235,40 @@ static int copy_one_range(struct btrfs_inode *inode, struct iov_iter *i,
>   	 * sure they don't happen by forcing retry this copy.
>   	 */
>   	if (unlikely(copied < write_bytes)) {
> +		u64 last_block;
> +		u64 release_len;
> +
>   		if (!folio_test_uptodate(folio)) {
>   			iov_iter_revert(i, copied);
>   			copied = 0;
>   		}
> -	}
>   
> -	num_blocks = BTRFS_BYTES_TO_BLKS(fs_info, reserve_bytes);
> -	dirty_blocks = round_up(copied + block_offset, fs_info->sectorsize);
> -	dirty_blocks = BTRFS_BYTES_TO_BLKS(fs_info, dirty_blocks);
> -
> -	if (copied == 0)
> -		dirty_blocks = 0;
> -
> -	if (num_blocks > dirty_blocks) {
> -		/* Release everything except the sectors we dirtied. */
> -		release_bytes -= dirty_blocks << fs_info->sectorsize_bits;
> -		if (only_release_metadata) {
> -			btrfs_delalloc_release_metadata(inode,
> -						release_bytes, true);
> -		} else {
> -			const u64 release_start = round_up(start + copied,
> -							   fs_info->sectorsize);
> -
> -			btrfs_delalloc_release_space(inode,
> -					*data_reserved, release_start,
> -					release_bytes, true);
> +		if (copied == 0) {
> +			btrfs_drop_folio(fs_info, folio, start, copied);
> +			if (extents_locked)
> +				unlock_extent(&inode->io_tree, lockstart, lockend,
> +				      &cached_state);
> +			else
> +				free_extent_state(cached_state);
> +			btrfs_delalloc_release_extents(inode, reserved_len);
> +			release_space(inode, *data_reserved, reserved_start,
> +				      reserved_len, only_release_metadata);
> +			return 0;
>   		}
> +
> +		/* Release space beyond the last block and continue. */
> +		last_block = round_up(start + copied, blocksize);
> +		release_len = reserved_start + reserved_len - last_block;
> +		btrfs_delalloc_release_extents(inode, release_len);

This is incorrect.

The accounting of btrfs_delalloc_release_extents() is never as accurate
as that of the qgroup reserved space, and it cannot handle shrinking the
reserved space in this case.

It is based purely on @num_bytes, and the number of extents to release
is always calculated from the maximum extent size.

So for our current code, btrfs_delalloc_release_extents() will always
decrease the number of outstanding extents by one.

Thus the following pattern will cause problems:

	btrfs_delalloc_reserve_metadata(reserved_start, reserved_len);
	btrfs_delalloc_release_extents(reserved_len - SZ_4K);
	btrfs_delalloc_release_extents(SZ_4K);

Thus the original code is correct: we should call
btrfs_delalloc_release_extents() only once after the space is reserved.
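
A minimal, self-contained model of why that double release underflows
the counter (plain C; this count_max_extents() is a simplified stand-in
for the real helper, assuming all lengths involved stay below
BTRFS_MAX_EXTENT_SIZE):

	#include <stdio.h>

	#define BTRFS_MAX_EXTENT_SIZE (128ULL << 20)
	#define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))

	/* Simplified stand-in: one outstanding extent per max-extent chunk. */
	static unsigned int count_max_extents(unsigned long long num_bytes)
	{
		return DIV_ROUND_UP(num_bytes, BTRFS_MAX_EXTENT_SIZE);
	}

	int main(void)
	{
		long outstanding = 0;
		unsigned long long reserved_len = 64 * 1024;

		outstanding += count_max_extents(reserved_len);         /* +1 */
		outstanding -= count_max_extents(reserved_len - 4096);  /* -1 */
		outstanding -= count_max_extents(4096);                 /* -1 */
		/* Prints -1: one release too many for a single reservation. */
		printf("outstanding = %ld\n", outstanding);
		return 0;
	}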

I'd better find a proper way to make btrfs_delalloc_release_extents()
as accurate as the qgroup reserved space accounting.

As for large data folios, we will have to over-reserve before grabbing
the folio, then release the excess before doing the copy.

Thanks,
Qu

Patch

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index b72fc00bc2f6..0b637e61fdee 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1164,15 +1164,13 @@  static int copy_one_range(struct btrfs_inode *inode, struct iov_iter *i,
 {
 	struct btrfs_fs_info *fs_info = inode->root->fs_info;
 	struct extent_state *cached_state = NULL;
-	const size_t block_offset = start & (fs_info->sectorsize - 1);
+	const u32 blocksize = fs_info->sectorsize;
 	size_t write_bytes = min(iov_iter_count(i), PAGE_SIZE - offset_in_page(start));
-	size_t reserve_bytes;
 	size_t copied;
-	size_t dirty_blocks;
-	size_t num_blocks;
 	struct folio *folio = NULL;
-	u64 release_bytes;
 	int extents_locked;
+	const u64 reserved_start = round_down(start, blocksize);
+	u64 reserved_len;
 	u64 lockstart;
 	u64 lockend;
 	bool only_release_metadata = false;
@@ -1190,23 +1188,25 @@  static int copy_one_range(struct btrfs_inode *inode, struct iov_iter *i,
 			    &only_release_metadata);
 	if (ret < 0)
 		return ret;
-	reserve_bytes = ret;
-	release_bytes = reserve_bytes;
+	reserved_len = ret;
+	/* The write range must be inside the reserved range. */
+	ASSERT(start >= reserved_start);
+	ASSERT(start + write_bytes <= reserved_start + reserved_len);
 
 again:
 	ret = balance_dirty_pages_ratelimited_flags(inode->vfs_inode.i_mapping,
 						    bdp_flags);
 	if (ret) {
-		btrfs_delalloc_release_extents(inode, reserve_bytes);
-		release_space(inode, *data_reserved, start, release_bytes,
+		btrfs_delalloc_release_extents(inode, reserved_len);
+		release_space(inode, *data_reserved, reserved_start, reserved_len,
 			      only_release_metadata);
 		return ret;
 	}
 
 	ret = prepare_one_folio(&inode->vfs_inode, &folio, start, write_bytes, false);
 	if (ret) {
-		btrfs_delalloc_release_extents(inode, reserve_bytes);
-		release_space(inode, *data_reserved, start, release_bytes,
+		btrfs_delalloc_release_extents(inode, reserved_len);
+		release_space(inode, *data_reserved, reserved_start, reserved_len,
 			      only_release_metadata);
 		return ret;
 	}
@@ -1217,8 +1217,8 @@  static int copy_one_range(struct btrfs_inode *inode, struct iov_iter *i,
 		if (!nowait && extents_locked == -EAGAIN)
 			goto again;
 
-		btrfs_delalloc_release_extents(inode, reserve_bytes);
-		release_space(inode, *data_reserved, start, release_bytes,
+		btrfs_delalloc_release_extents(inode, reserved_len);
+		release_space(inode, *data_reserved, reserved_start, reserved_len,
 			      only_release_metadata);
 		ret = extents_locked;
 		return ret;
@@ -1235,35 +1235,40 @@  static int copy_one_range(struct btrfs_inode *inode, struct iov_iter *i,
 	 * sure they don't happen by forcing retry this copy.
 	 */
 	if (unlikely(copied < write_bytes)) {
+		u64 last_block;
+		u64 release_len;
+
 		if (!folio_test_uptodate(folio)) {
 			iov_iter_revert(i, copied);
 			copied = 0;
 		}
-	}
 
-	num_blocks = BTRFS_BYTES_TO_BLKS(fs_info, reserve_bytes);
-	dirty_blocks = round_up(copied + block_offset, fs_info->sectorsize);
-	dirty_blocks = BTRFS_BYTES_TO_BLKS(fs_info, dirty_blocks);
-
-	if (copied == 0)
-		dirty_blocks = 0;
-
-	if (num_blocks > dirty_blocks) {
-		/* Release everything except the sectors we dirtied. */
-		release_bytes -= dirty_blocks << fs_info->sectorsize_bits;
-		if (only_release_metadata) {
-			btrfs_delalloc_release_metadata(inode,
-						release_bytes, true);
-		} else {
-			const u64 release_start = round_up(start + copied,
-							   fs_info->sectorsize);
-
-			btrfs_delalloc_release_space(inode,
-					*data_reserved, release_start,
-					release_bytes, true);
+		if (copied == 0) {
+			btrfs_drop_folio(fs_info, folio, start, copied);
+			if (extents_locked)
+				unlock_extent(&inode->io_tree, lockstart, lockend,
+				      &cached_state);
+			else
+				free_extent_state(cached_state);
+			btrfs_delalloc_release_extents(inode, reserved_len);
+			release_space(inode, *data_reserved, reserved_start,
+				      reserved_len, only_release_metadata);
+			return 0;
 		}
+
+		/* Release space beyond the last block and continue. */
+		last_block = round_up(start + copied, blocksize);
+		release_len = reserved_start + reserved_len - last_block;
+		btrfs_delalloc_release_extents(inode, release_len);
+		if (only_release_metadata)
+			btrfs_delalloc_release_metadata(inode, release_len, true);
+		else
+			btrfs_delalloc_release_space(inode, *data_reserved,
+					last_block, release_len, true);
+		/* We should have at least one block reserved. */
+		ASSERT(last_block > reserved_start);
+		reserved_len = last_block - reserved_start;
 	}
-	release_bytes = round_up(copied + block_offset, fs_info->sectorsize);
 
 	ret = btrfs_dirty_folio(inode, folio, start, copied,
 				&cached_state, only_release_metadata);
@@ -1280,10 +1285,10 @@  static int copy_one_range(struct btrfs_inode *inode, struct iov_iter *i,
 	else
 		free_extent_state(cached_state);
 
-	btrfs_delalloc_release_extents(inode, reserve_bytes);
+	btrfs_delalloc_release_extents(inode, reserved_len);
 	if (ret) {
 		btrfs_drop_folio(fs_info, folio, start, copied);
-		release_space(inode, *data_reserved, start, release_bytes,
+		release_space(inode, *data_reserved, reserved_start, reserved_len,
 			      only_release_metadata);
 		return ret;
 	}