[17/19] btrfs: shrink delayed allocation size in HMZONED mode

Message ID 20190607131025.31996-18-naohiro.aota@wdc.com
State New
Series
  • btrfs zoned block device support

Commit Message

Naohiro Aota June 7, 2019, 1:10 p.m. UTC
In a write heavy workload, the following scenario can occur:

1. Page #0 to page #2 (and their corresponding extent region) are marked
   dirty and become candidates for delayed allocation

pages    0 1 2 3 4
dirty    o o o - -
towrite  - - - - -
delayed  o o o - -
alloc

2. extent_write_cache_pages() marks the dirty pages as TOWRITE

pages    0 1 2 3 4
dirty    o o o - -
towrite  o o o - -
delayed  o o o - -
alloc

3. Meanwhile, another write dirties page #3 and page #4

pages    0 1 2 3 4
dirty    o o o o o
towrite  o o o - -
delayed  o o o o o
alloc

4. find_lock_delalloc_range() decides to allocate a region to write page #0
   to page #4
5. but extent_write_cache_pages() only initiates writes for the TOWRITE
   tagged pages (#0 to #2)

So the above process leaves page #3 and page #4 behind. Usually, the
periodic dirty flush kicks off write IOs for page #3 and #4. However, if
we try to mount a subvolume at this point, the mount process takes the
s_umount write lock, which blocks the periodic flush from running.

To deal with the problem, shrink the delayed allocation region so that it
contains only the pages that are expected to be written.

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 fs/btrfs/extent_io.c | 27 +++++++++++++++++++++++++++
 1 file changed, 27 insertions(+)
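
The five steps in the commit message can be replayed with a small
user-space model. This is only an illustrative sketch, not kernel code:
struct model_page and every model_* function are hypothetical names, and
the three flags simply mirror the rows of the diagrams above. It shows
that pages #3 and #4 end up inside the allocated region without a write
being issued for them.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_NR_PAGES 5

/* Hypothetical per-page state; the fields mirror the diagram rows. */
struct model_page {
	bool dirty;
	bool towrite;
	bool delalloc;
};

/* Steps 1 and 3: a buffered write dirties pages [first, last]. */
static void model_dirty_pages(struct model_page *pages, size_t first,
			      size_t last)
{
	assert(first <= last && last < MODEL_NR_PAGES);

	for (size_t i = first; i <= last; i++) {
		pages[i].dirty = true;
		pages[i].delalloc = true;
	}
}

/* Step 2: writeback tags every page that is dirty at this moment. */
static void model_tag_for_writeback(struct model_page *pages)
{
	for (size_t i = 0; i < MODEL_NR_PAGES; i++)
		pages[i].towrite = pages[i].dirty;
}

/*
 * Steps 4 and 5: the whole contiguous delalloc range starting at page 0
 * is allocated, but only TOWRITE-tagged pages are written. Returns the
 * number of pages that were allocated without being written.
 */
static size_t model_writeback(struct model_page *pages)
{
	size_t left_behind = 0;

	for (size_t i = 0; i < MODEL_NR_PAGES && pages[i].delalloc; i++) {
		pages[i].delalloc = false;	/* space is now allocated */
		if (pages[i].towrite) {
			pages[i].dirty = false;	/* write issued */
			pages[i].towrite = false;
		} else {
			left_behind++;		/* allocated, still dirty */
		}
	}
	return left_behind;
}
```

Calling model_dirty_pages(pages, 0, 2), model_tag_for_writeback(pages),
model_dirty_pages(pages, 3, 4) and then model_writeback(pages) on a
zero-initialized array reproduces the scenario: the model reports two
pages left behind, and they stay dirty.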

Comments

Josef Bacik June 13, 2019, 2:27 p.m. UTC | #1
On Fri, Jun 07, 2019 at 10:10:23PM +0900, Naohiro Aota wrote:
> In a write heavy workload, the following scenario can occur:
> 
> 1. mark page #0 to page #2 (and their corresponding extent region) as dirty
>    and candidate for delayed allocation
> 
> pages    0 1 2 3 4
> dirty    o o o - -
> towrite  - - - - -
> delayed  o o o - -
> alloc
> 
> 2. extent_write_cache_pages() mark dirty pages as TOWRITE
> 
> pages    0 1 2 3 4
> dirty    o o o - -
> towrite  o o o - -
> delayed  o o o - -
> alloc
> 
> 3. Meanwhile, another write dirties page #3 and page #4
> 
> pages    0 1 2 3 4
> dirty    o o o o o
> towrite  o o o - -
> delayed  o o o o o
> alloc
> 
> 4. find_lock_delalloc_range() decide to allocate a region to write page #0
>    to page #4
> 5. but, extent_write_cache_pages() only initiate write to TOWRITE tagged
>    pages (#0 to #2)
> 
> So the above process leaves page #3 and page #4 behind. Usually, the
> periodic dirty flush kicks write IOs for page #3 and #4. However, if we try
> to mount a subvolume at this timing, mount process takes s_umount write
> lock to block the periodic flush to come in.
> 
> To deal with the problem, shrink the delayed allocation region to have only
> expected to be written pages.
> 
> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
> ---
>  fs/btrfs/extent_io.c | 27 +++++++++++++++++++++++++++
>  1 file changed, 27 insertions(+)
> 
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index c73c69e2bef4..ea582ff85c73 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -3310,6 +3310,33 @@ static noinline_for_stack int writepage_delalloc(struct inode *inode,
>  			delalloc_start = delalloc_end + 1;
>  			continue;
>  		}
> +
> +		if (btrfs_fs_incompat(btrfs_sb(inode->i_sb), HMZONED) &&
> +		    (wbc->sync_mode == WB_SYNC_ALL || wbc->tagged_writepages) &&
> +		    ((delalloc_start >> PAGE_SHIFT) <
> +		     (delalloc_end >> PAGE_SHIFT))) {
> +			unsigned long i;
> +			unsigned long end_index = delalloc_end >> PAGE_SHIFT;
> +
> +			for (i = delalloc_start >> PAGE_SHIFT;
> +			     i <= end_index; i++)
> +				if (!xa_get_mark(&inode->i_mapping->i_pages, i,
> +						 PAGECACHE_TAG_TOWRITE))
> +					break;
> +
> +			if (i <= end_index) {
> +				u64 unlock_start = (u64)i << PAGE_SHIFT;
> +
> +				if (i == delalloc_start >> PAGE_SHIFT)
> +					unlock_start += PAGE_SIZE;
> +
> +				unlock_extent(tree, unlock_start, delalloc_end);
> +				__unlock_for_delalloc(inode, page, unlock_start,
> +						      delalloc_end);
> +				delalloc_end = unlock_start - 1;
> +			}
> +		}
> +

Helper please.  Really for all this hmzoned stuff I want it segregated as much
as possible so when I'm debugging or cleaning other stuff up I want to easily be
able to say "oh this is for zoned devices, it doesn't matter."  Thanks,

Josef
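
One possible shape for the helper requested above, written as a
user-space model so that it can be compiled and tested outside the
kernel. Everything here is an assumption made for illustration: the
model_* names and structs are hypothetical, the towrite array stands in
for xa_get_mark(..., PAGECACHE_TAG_TOWRITE), the two model_wbc flags
stand in for wbc->sync_mode == WB_SYNC_ALL and wbc->tagged_writepages,
and the unlock_extent() and __unlock_for_delalloc() calls are reduced to
a comment. The point is only that the HMZONED check becomes an early
return inside one function, so the caller stays free of zoned logic.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MODEL_PAGE_SHIFT 12	/* assumption: 4 KiB pages */
#define MODEL_PAGE_SIZE  (UINT64_C(1) << MODEL_PAGE_SHIFT)

/* Hypothetical stand-ins for the inode and writeback_control state. */
struct model_inode {
	bool hmzoned;		/* models btrfs_fs_incompat(..., HMZONED) */
	const bool *towrite;	/* models the per-page TOWRITE tag */
};

struct model_wbc {
	bool sync_all;		/* models sync_mode == WB_SYNC_ALL */
	bool tagged_writepages;
};

/* Returns the (possibly reduced) inclusive end of the delalloc range. */
static uint64_t model_hmzoned_shrink_delalloc(const struct model_inode *inode,
					      const struct model_wbc *wbc,
					      uint64_t delalloc_start,
					      uint64_t delalloc_end)
{
	uint64_t start_index = delalloc_start >> MODEL_PAGE_SHIFT;
	uint64_t end_index = delalloc_end >> MODEL_PAGE_SHIFT;
	uint64_t unlock_start;
	uint64_t i;

	assert(delalloc_start <= delalloc_end);

	/* Not a zoned filesystem: nothing to do. */
	if (!inode->hmzoned)
		return delalloc_end;
	if (!wbc->sync_all && !wbc->tagged_writepages)
		return delalloc_end;
	/* A range within a single page is never shrunk. */
	if (start_index >= end_index)
		return delalloc_end;

	/* Find the first page that is not tagged TOWRITE. */
	for (i = start_index; i <= end_index; i++)
		if (!inode->towrite[i])
			break;
	if (i > end_index)
		return delalloc_end;

	unlock_start = i << MODEL_PAGE_SHIFT;
	/* Always keep at least the first page of the range. */
	if (i == start_index)
		unlock_start += MODEL_PAGE_SIZE;

	/* Kernel code would unlock [unlock_start, delalloc_end] here. */
	return unlock_start - 1;
}
```

With pages #0 to #2 tagged and a range covering pages #0 to #4, the
model returns the last byte of page #2 on a zoned filesystem and leaves
the range untouched otherwise.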

Patch

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index c73c69e2bef4..ea582ff85c73 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -3310,6 +3310,33 @@  static noinline_for_stack int writepage_delalloc(struct inode *inode,
 			delalloc_start = delalloc_end + 1;
 			continue;
 		}
+
+		if (btrfs_fs_incompat(btrfs_sb(inode->i_sb), HMZONED) &&
+		    (wbc->sync_mode == WB_SYNC_ALL || wbc->tagged_writepages) &&
+		    ((delalloc_start >> PAGE_SHIFT) <
+		     (delalloc_end >> PAGE_SHIFT))) {
+			unsigned long i;
+			unsigned long end_index = delalloc_end >> PAGE_SHIFT;
+
+			for (i = delalloc_start >> PAGE_SHIFT;
+			     i <= end_index; i++)
+				if (!xa_get_mark(&inode->i_mapping->i_pages, i,
+						 PAGECACHE_TAG_TOWRITE))
+					break;
+
+			if (i <= end_index) {
+				u64 unlock_start = (u64)i << PAGE_SHIFT;
+
+				if (i == delalloc_start >> PAGE_SHIFT)
+					unlock_start += PAGE_SIZE;
+
+				unlock_extent(tree, unlock_start, delalloc_end);
+				__unlock_for_delalloc(inode, page, unlock_start,
+						      delalloc_end);
+				delalloc_end = unlock_start - 1;
+			}
+		}
+
 		ret = btrfs_run_delalloc_range(inode, page, delalloc_start,
 				delalloc_end, &page_started, nr_written, wbc);
 		/* File system has been set read-only */