Message ID | 4baa663dcabe3a542e035ec725586118a78b0971.1743113694.git.wqu@suse.com (mailing list archive)
---|---
State | New
Series | btrfs: add the missing preparations exposed by initial large data folio support
On Thu, Mar 27, 2025 at 10:36 PM Qu Wenruo <wqu@suse.com> wrote:
>
> There are several things not ideal inside copy_one_range():
> [...]
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
> [...]
> +        spin_lock(&inode->lock);
> +        diff_num_extents = reserved_num_extents - new_num_extents;

This doesn't need to be done while holding the lock, and in fact should
be done outside of it to reduce the time spent in the critical section.
It's a simple expression that can be done when declaring the variable,
which can also be made const.

> +        btrfs_mod_outstanding_extents(inode, -diff_num_extents);

We can also make the sign change when declaring the variable, something
like:

const int diff_num_extents = -((int)(reserved_num_extents - new_num_extents));

The way it is, turning an unsigned into a negative is also a bit odd to
read; I think it is clearer if we have a signed type.

> [...]
> +static void shrink_reserved_space(struct btrfs_inode *inode,
> +                                  struct extent_changeset *data_reserved,
> +                                  u64 reserved_start, u64 reserved_len,
> +                                  u64 new_len, bool only_release_metadata)
> +{
> +        u64 diff = reserved_len - new_len;

Can be made const.

> +
> +        ASSERT(new_len <= reserved_len);
> +        btrfs_delalloc_shrink_extents(inode, reserved_len, new_len);
> +        if (only_release_metadata)
> +                btrfs_delalloc_release_metadata(inode, diff, true);
> +        else
> +                btrfs_delalloc_release_space(inode, data_reserved,
> +                                             reserved_start + new_len,
> +                                             diff, true);

This could fit in 2 lines instead of 3.

Anyway those are minor things, so:

Reviewed-by: Filipe Manana <fdmanana@suse.com>

Thanks.
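For illustration, folding both of the above suggestions into the new
helper might look like the sketch below: the extent-count delta is
computed (as a const, signed value) at declaration time, outside the
critical section. This is only a sketch of the reviewed code with the
feedback applied, not necessarily what was committed.

/* Shrink a previously reserved extent to a new length. */
void btrfs_delalloc_shrink_extents(struct btrfs_inode *inode, u64 reserved_len,
                                   u64 new_len)
{
        struct btrfs_fs_info *fs_info = inode->root->fs_info;
        const u32 reserved_num_extents = count_max_extents(fs_info, reserved_len);
        const u32 new_num_extents = count_max_extents(fs_info, new_len);
        /* Computed outside the lock; negative because we only shrink. */
        const int diff_num_extents = -((int)(reserved_num_extents - new_num_extents));

        ASSERT(new_len <= reserved_len);
        if (new_num_extents == reserved_num_extents)
                return;

        spin_lock(&inode->lock);
        btrfs_mod_outstanding_extents(inode, diff_num_extents);
        btrfs_calculate_inode_block_rsv_size(fs_info, inode);
        spin_unlock(&inode->lock);

        if (btrfs_is_testing(fs_info))
                return;

        btrfs_inode_rsv_release(inode, true);
}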
diff --git a/fs/btrfs/delalloc-space.c b/fs/btrfs/delalloc-space.c
index 88e900e5a43d..916b62221dde 100644
--- a/fs/btrfs/delalloc-space.c
+++ b/fs/btrfs/delalloc-space.c
@@ -439,6 +439,31 @@ void btrfs_delalloc_release_extents(struct btrfs_inode *inode, u64 num_bytes)
         btrfs_inode_rsv_release(inode, true);
 }
 
+/* Shrink a previously reserved extent to a new length. */
+void btrfs_delalloc_shrink_extents(struct btrfs_inode *inode, u64 reserved_len,
+                                   u64 new_len)
+{
+        struct btrfs_fs_info *fs_info = inode->root->fs_info;
+        const u32 reserved_num_extents = count_max_extents(fs_info, reserved_len);
+        const u32 new_num_extents = count_max_extents(fs_info, new_len);
+        u32 diff_num_extents;
+
+        ASSERT(new_len <= reserved_len);
+        if (new_num_extents == reserved_num_extents)
+                return;
+
+        spin_lock(&inode->lock);
+        diff_num_extents = reserved_num_extents - new_num_extents;
+        btrfs_mod_outstanding_extents(inode, -diff_num_extents);
+        btrfs_calculate_inode_block_rsv_size(fs_info, inode);
+        spin_unlock(&inode->lock);
+
+        if (btrfs_is_testing(fs_info))
+                return;
+
+        btrfs_inode_rsv_release(inode, true);
+}
+
 /*
  * Reserve data and metadata space for delalloc
  *
diff --git a/fs/btrfs/delalloc-space.h b/fs/btrfs/delalloc-space.h
index 3f32953c0a80..c61580c63caf 100644
--- a/fs/btrfs/delalloc-space.h
+++ b/fs/btrfs/delalloc-space.h
@@ -27,5 +27,6 @@ int btrfs_delalloc_reserve_space(struct btrfs_inode *inode,
 int btrfs_delalloc_reserve_metadata(struct btrfs_inode *inode, u64 num_bytes,
                                     u64 disk_num_bytes, bool noflush);
 void btrfs_delalloc_release_extents(struct btrfs_inode *inode, u64 num_bytes);
-
+void btrfs_delalloc_shrink_extents(struct btrfs_inode *inode, u64 reserved_len,
+                                   u64 new_len);
 #endif /* BTRFS_DELALLOC_SPACE_H */
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index b72fc00bc2f6..63c7a3294eb2 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1151,6 +1151,24 @@ static ssize_t reserve_space(struct btrfs_inode *inode,
         return reserve_bytes;
 }
 
+/* Shrink the reserved data and metadata space from @reserved_len to @new_len. */
+static void shrink_reserved_space(struct btrfs_inode *inode,
+                                  struct extent_changeset *data_reserved,
+                                  u64 reserved_start, u64 reserved_len,
+                                  u64 new_len, bool only_release_metadata)
+{
+        u64 diff = reserved_len - new_len;
+
+        ASSERT(new_len <= reserved_len);
+        btrfs_delalloc_shrink_extents(inode, reserved_len, new_len);
+        if (only_release_metadata)
+                btrfs_delalloc_release_metadata(inode, diff, true);
+        else
+                btrfs_delalloc_release_space(inode, data_reserved,
+                                             reserved_start + new_len,
+                                             diff, true);
+}
+
 /*
  * Do the heavy-lifting work to copy one range into one folio of the page cache.
  *
@@ -1164,14 +1182,11 @@ static int copy_one_range(struct btrfs_inode *inode, struct iov_iter *i,
 {
         struct btrfs_fs_info *fs_info = inode->root->fs_info;
         struct extent_state *cached_state = NULL;
-        const size_t block_offset = start & (fs_info->sectorsize - 1);
         size_t write_bytes = min(iov_iter_count(i), PAGE_SIZE - offset_in_page(start));
-        size_t reserve_bytes;
         size_t copied;
-        size_t dirty_blocks;
-        size_t num_blocks;
+        const u64 reserved_start = round_down(start, fs_info->sectorsize);
+        u64 reserved_len;
         struct folio *folio = NULL;
-        u64 release_bytes;
         int extents_locked;
         u64 lockstart;
         u64 lockend;
@@ -1190,23 +1205,25 @@ static int copy_one_range(struct btrfs_inode *inode, struct iov_iter *i,
                             &only_release_metadata);
         if (ret < 0)
                 return ret;
-        reserve_bytes = ret;
-        release_bytes = reserve_bytes;
+        reserved_len = ret;
+        /* Write range must be inside the reserved range. */
+        ASSERT(reserved_start <= start);
+        ASSERT(start + write_bytes <= reserved_start + reserved_len);
 
 again:
         ret = balance_dirty_pages_ratelimited_flags(inode->vfs_inode.i_mapping,
                                                     bdp_flags);
         if (ret) {
-                btrfs_delalloc_release_extents(inode, reserve_bytes);
-                release_space(inode, *data_reserved, start, release_bytes,
+                btrfs_delalloc_release_extents(inode, reserved_len);
+                release_space(inode, *data_reserved, reserved_start, reserved_len,
                               only_release_metadata);
                 return ret;
         }
 
         ret = prepare_one_folio(&inode->vfs_inode, &folio, start, write_bytes, false);
         if (ret) {
-                btrfs_delalloc_release_extents(inode, reserve_bytes);
-                release_space(inode, *data_reserved, start, release_bytes,
+                btrfs_delalloc_release_extents(inode, reserved_len);
+                release_space(inode, *data_reserved, reserved_start, reserved_len,
                               only_release_metadata);
                 return ret;
         }
@@ -1217,8 +1234,8 @@ static int copy_one_range(struct btrfs_inode *inode, struct iov_iter *i,
                 if (!nowait && extents_locked == -EAGAIN)
                         goto again;
 
-                btrfs_delalloc_release_extents(inode, reserve_bytes);
-                release_space(inode, *data_reserved, start, release_bytes,
+                btrfs_delalloc_release_extents(inode, reserved_len);
+                release_space(inode, *data_reserved, reserved_start, reserved_len,
                               only_release_metadata);
                 ret = extents_locked;
                 return ret;
@@ -1228,42 +1245,43 @@ static int copy_one_range(struct btrfs_inode *inode, struct iov_iter *i,
                                offset_in_folio(folio, start), write_bytes, i);
         flush_dcache_folio(folio);
 
-        /*
-         * If we get a partial write, we can end up with partially uptodate
-         * page. Although if sector size < page size we can handle it, but if
-         * it's not sector aligned it can cause a lot of complexity, so make
-         * sure they don't happen by forcing retry this copy.
-         */
         if (unlikely(copied < write_bytes)) {
+                u64 last_block;
+
+                /*
+                 * The original write range doesn't need an uptodate folio as
+                 * the range is block aligned. But now a short copy happened.
+                 * We can not handle it without an uptodate folio.
+                 *
+                 * So just revert the range and we will retry.
+                 */
                 if (!folio_test_uptodate(folio)) {
                         iov_iter_revert(i, copied);
                         copied = 0;
                 }
-        }
 
-        num_blocks = BTRFS_BYTES_TO_BLKS(fs_info, reserve_bytes);
-        dirty_blocks = round_up(copied + block_offset, fs_info->sectorsize);
-        dirty_blocks = BTRFS_BYTES_TO_BLKS(fs_info, dirty_blocks);
-
-        if (copied == 0)
-                dirty_blocks = 0;
-
-        if (num_blocks > dirty_blocks) {
-                /* Release everything except the sectors we dirtied. */
-                release_bytes -= dirty_blocks << fs_info->sectorsize_bits;
-                if (only_release_metadata) {
-                        btrfs_delalloc_release_metadata(inode,
-                                                        release_bytes, true);
-                } else {
-                        const u64 release_start = round_up(start + copied,
-                                                           fs_info->sectorsize);
-
-                        btrfs_delalloc_release_space(inode,
-                                        *data_reserved, release_start,
-                                        release_bytes, true);
+                /* No copied byte, unlock, release reserved space and exit. */
+                if (copied == 0) {
+                        if (extents_locked)
+                                unlock_extent(&inode->io_tree, lockstart, lockend,
+                                              &cached_state);
+                        else
+                                free_extent_state(cached_state);
+                        btrfs_delalloc_release_extents(inode, reserved_len);
+                        release_space(inode, *data_reserved, reserved_start, reserved_len,
+                                      only_release_metadata);
+                        btrfs_drop_folio(fs_info, folio, start, copied);
+                        return 0;
                 }
+
+                /* Release the reserved space beyond the last block. */
+                last_block = round_up(start + copied, fs_info->sectorsize);
+
+                shrink_reserved_space(inode, *data_reserved, reserved_start,
+                                      reserved_len, last_block - reserved_start,
+                                      only_release_metadata);
+                reserved_len = last_block - reserved_start;
         }
-        release_bytes = round_up(copied + block_offset, fs_info->sectorsize);
 
         ret = btrfs_dirty_folio(inode, folio, start, copied,
                                 &cached_state, only_release_metadata);
@@ -1280,10 +1298,10 @@ static int copy_one_range(struct btrfs_inode *inode, struct iov_iter *i,
         else
                 free_extent_state(cached_state);
 
-        btrfs_delalloc_release_extents(inode, reserve_bytes);
+        btrfs_delalloc_release_extents(inode, reserved_len);
         if (ret) {
                 btrfs_drop_folio(fs_info, folio, start, copied);
-                release_space(inode, *data_reserved, start, release_bytes,
+                release_space(inode, *data_reserved, reserved_start, reserved_len,
                               only_release_metadata);
                 return ret;
         }
There are several things not ideal inside copy_one_range():

- Unnecessary temporary variables
  * block_offset
  * reserve_bytes
  * dirty_blocks
  * num_blocks
  * release_bytes
  These are utilized to handle short-copy cases.

- Inconsistent handling of btrfs_delalloc_release_extents()
  There is a hidden behavior that, after reserving metadata for X bytes
  of data write, we have to call btrfs_delalloc_release_extents() with X
  once and only once.

  Calling btrfs_delalloc_release_extents(X - 4K) and
  btrfs_delalloc_release_extents(4K) will cause the outstanding extents
  accounting to go wrong.

  This is because the outstanding extents mechanism is not designed to
  handle shrinking of reserved space.

Improve the above situations by:

- Use a single @reserved_start and @reserved_len pair
  Now we reserve space for the initial range, and if a short copy
  happens and we need to shrink the reserved space, we can easily
  calculate the new length and update @reserved_len.

- Introduce helpers to shrink reserved data and metadata space
  This is done by two new helpers, shrink_reserved_space() and
  btrfs_delalloc_shrink_extents().

  The latter does a better calculation of whether we need to modify the
  outstanding extents, and the former is utilized inside
  copy_one_range().

- Manually unlock, release the reserved space and return if no byte was
  copied

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/delalloc-space.c |  25 +++++++++
 fs/btrfs/delalloc-space.h |   3 +-
 fs/btrfs/file.c           | 104 ++++++++++++++++++++++----------------
 3 files changed, 88 insertions(+), 44 deletions(-)
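The "once and only once" rule above follows from how the number of
outstanding extents is derived from a byte count. The sketch below is a
userspace illustration with a simplified count_max_extents() assuming
the default 128M maximum extent size (the kernel's version takes
fs_info and also handles zoned devices); it shows how splitting the
release over-counts the extents:

#include <stdio.h>

/* Simplified stand-in for the kernel's count_max_extents(): how many
 * outstanding extents a delalloc range of @len bytes accounts for.
 */
static unsigned int count_max_extents(unsigned long long len)
{
        const unsigned long long max_extent_size = 128ULL * 1024 * 1024;

        return (len + max_extent_size - 1) / max_extent_size;
}

int main(void)
{
        const unsigned long long x = 8 * 1024; /* reserve 8K: 1 extent */

        /* Releasing X in one call undoes exactly what was reserved. */
        printf("reserve %lluK -> %u extent(s)\n", x / 1024,
               count_max_extents(x));
        printf("release %lluK once -> %u extent(s)\n", x / 1024,
               count_max_extents(x));

        /* Splitting the release frees 1 + 1 extents even though only 1
         * was reserved, so the accounting underflows.
         */
        printf("release %lluK + %lluK -> %u + %u extent(s)\n",
               (x - 4096) / 1024, 4096ULL / 1024,
               count_max_extents(x - 4096), count_max_extents(4096));
        return 0;
}

This is why btrfs_delalloc_shrink_extents() compares the extent counts
of the old and new lengths and only adjusts by the difference, instead
of releasing the trimmed-off bytes as an independent range.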