
[RFC] btrfs: make extent_write_locked_range() handle subpage dirty correctly

Message ID: 7737c2e976c0bb2d36339ed0563cdbd07d846363.1709626757.git.wqu@suse.com (mailing list archive)
State: New, archived
Series: [RFC] btrfs: make extent_write_locked_range() handle subpage dirty correctly

Commit Message

Qu Wenruo March 5, 2024, 8:19 a.m. UTC
[BUG]
Even with my previous subpage delalloc rework, the following workload
can easily lead to leaked reserved space:

 # mkfs.btrfs -f $dev -s 4k > /dev/null
 # mount $dev $mnt
 # fsstress -v -w -n 8 -d $mnt -s 1709539240
 0/0: fiemap - no filename
 0/1: copyrange read - no filename
 0/2: write - no filename
 0/3: rename - no source filename
 0/4: creat f0 x:0 0 0
 0/4: creat add id=0,parent=-1
 0/5: writev f0[259 1 0 0 0 0] [778052,113,965] 0
 0/6: ioctl(FIEMAP) f0[259 1 0 0 224 887097] [1294220,2291618343991484791,0x10000] -1
 0/7: dwrite - xfsctl(XFS_IOC_DIOINFO) f0[259 1 0 0 224 887097] return 25, fallback to stat()
 0/7: dwrite f0[259 1 0 0 224 887097] [696320,102400] 0
 # umount $mnt

$dev is a tcmu-runner emulated zoned HDD with 80 zones and a max zone
append size of 64K.

[CAUSE]
Before the dwrite(), writev() would dirty the following 3 pages:

    720896           +64K           +128K             +192K
    |            |///|//////////////|/////|           |
                 +52K                     +164K

Then the dwrite() tries to drop the page cache for the first two
pages starting at 720896.

Now we trigger delalloc for the above two pages.
First, find_lock_delalloc_range() would return the range
[774144, 774144 + 64K) (note 774144 = 720896 + 52K), not the full dirty
range as in the non-zoned case.

This is because find_lock_delalloc_range() limits the range to the max
zone append size (64K), thus the range ends inside the 2nd page.
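
For context, the cap comes from the length limit fed into the delalloc
search; a paraphrased sketch of my reading of the mainline code around
v6.8 (treat the exact shape as paraphrase, not a verbatim copy):

    /* fs/btrfs/zoned.c, at mount time (paraphrased): zoned devices cap
     * the max extent size to the max zone append size. */
    fs_info->max_extent_size = min_not_zero(fs_info->max_extent_size,
                                            fs_info->max_zone_append_size);

    /* fs/btrfs/extent_io.c, find_lock_delalloc_range() (paraphrased):
     * the search accumulates at most @max_bytes of delalloc, so on this
     * setup the returned range can never exceed 64K. */
    const u64 max_bytes = fs_info->max_extent_size;
    found = btrfs_find_delalloc_range(tree, &delalloc_start, &delalloc_end,
                                      max_bytes, &cached_state);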

Then we start writeback through extent_write_locked_range(); since the
range [+52K, +116K) ends inside the second page, it would clear the
dirty flag for that whole page:

    720896           +64K           +128K             +192K
    |            |///|//////////|   |/////|           |
                 +52K           +116K     +164K

This is because extent_write_locked_range() uses the non-subpage
compatible call (clear_page_dirty_for_io()) to clear the folio dirty
flag.

Now at this stage, we leak the reserved space for the remaining dirty
part inside the second page.

[PREMATURE FIX]
For now, just change all the page/folio flag operations to their
subpage-aware versions.
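
To show why that matters: the subpage helpers track dirtiness with a
per-folio bitmap, one bit per sector, and only drop the folio-level
flag once no sector remains dirty. A minimal self-contained sketch of
the idea (the names below are made up for illustration; this is not the
btrfs API):

    #include <stdbool.h>
    #include <stdint.h>

    #define TOY_SECTORS_PER_FOLIO 16        /* 64K folio / 4K sector */

    struct toy_subpage {
            uint32_t dirty_bitmap;          /* bit i: sector i is dirty */
    };

    /* Clear dirty for sectors [first, first + nr) only. Returns true
     * when the folio-level dirty flag may also be cleared. */
    static bool toy_subpage_clear_dirty(struct toy_subpage *sp,
                                        unsigned int first, unsigned int nr)
    {
            uint32_t mask = ((1u << nr) - 1) << first;

            sp->dirty_bitmap &= ~mask;
            return sp->dirty_bitmap == 0;
    }

A plain clear_page_dirty_for_io() call skips that bookkeeping and wipes
the folio flag while sectors in [+116K, +128K) are still dirty, which is
exactly how the reserved space above gets orphaned.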

But this would not fully solve the problem, as there are others left,
like triggering the BUG_ON() inside mm/truncate.c on the second page of
the above range:

 ------------[ cut here ]------------
 kernel BUG at mm/truncate.c:577!
 Internal error: Oops - BUG: 00000000f2000800 [#1] SMP
 Dumping ftrace buffer:
 ---------------------------------
    <...>-2055      1d..2. 16142us : invalidate_inode_pages2_range: !!! folio=786432 !!!
 ---------------------------------
 Call trace:
  invalidate_inode_pages2_range+0x378/0x430
  kiocb_invalidate_pages+0x60/0x98
  __iomap_dio_rw+0x350/0x5d8
  btrfs_dio_write+0x50/0x88 [btrfs]
  btrfs_direct_write+0x128/0x328 [btrfs]
  btrfs_do_write_iter+0x174/0x1e8 [btrfs]
  btrfs_file_write_iter+0x1c/0x30 [btrfs]
  vfs_write+0x258/0x380
  ksys_write+0x80/0x120
  __arm64_sys_write+0x24/0x38
  invoke_syscall+0x78/0x100
  el0_svc_common.constprop.0+0x48/0xf0
  do_el0_svc+0x24/0x38
  el0_svc+0x3c/0x138
  el0t_64_sync_handler+0x120/0x130
  el0t_64_sync+0x194/0x198
 Code: 97fbf380 f9400380 f271041f 54fff7e0 (d4210000)
 ---[ end trace 0000000000000000 ]---

[REASON FOR RFC]
I really want a formal answer as to why zoned buffered writes need such
a dedicated (and completely subpage-incompatible)
run_delalloc_cow().

It looks like we already have more than enough infrastructure for zoned
devices, as find_lock_delalloc_range() would already return a range
no larger than the max zone append size.

Thus it looks to me that we are fine without a dedicated
run_delalloc_cow() just for zoned devices.

And since the delalloc range is locked during
find_lock_delalloc_range(), the contents of the pages cannot change,
so I don't see much reason why we cannot take the regular
cow_file_range() + __extent_writepage_io() path.

Of course I'll try to dig deeper to fully solve all the subpage + zoned
bugs, but if we have more shared code paths, it would make our lives
much easier.

Cc: Johannes Thumshirn <Johannes.Thumshirn@wdc.com>
Cc: Naohiro Aota <Naohiro.Aota@wdc.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/extent_io.c | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

Comments

Naohiro Aota March 6, 2024, 3:23 p.m. UTC | #1
On Tue, Mar 05, 2024 at 06:49:25PM +1030, Qu Wenruo wrote:
> [BUG]
> Even with my previous subpage delalloc rework, the following workload
> can easily lead to leaked reserved space:
> 
>  # mkfs.btrfs -f $dev -s 4k > /dev/null
>  # mount $dev $mnt
>  # fsstress -v -w -n 8 -d $mnt -s 1709539240
>  0/0: fiemap - no filename
>  0/1: copyrange read - no filename
>  0/2: write - no filename
>  0/3: rename - no source filename
>  0/4: creat f0 x:0 0 0
>  0/4: creat add id=0,parent=-1
>  0/5: writev f0[259 1 0 0 0 0] [778052,113,965] 0
>  0/6: ioctl(FIEMAP) f0[259 1 0 0 224 887097] [1294220,2291618343991484791,0x10000] -1
>  0/7: dwrite - xfsctl(XFS_IOC_DIOINFO) f0[259 1 0 0 224 887097] return 25, fallback to stat()
>  0/7: dwrite f0[259 1 0 0 224 887097] [696320,102400] 0
>  # umount $mnt
> 
> $dev is a tcmu-runner emulated zoned HDD with 80 zones and a max zone
> append size of 64K.
> 
> [CAUSE]
> Before the dwrite(), writev() would dirty the following 3 pages:
> 
>     720896           +64K           +128K             +192K
>     |            |///|//////////////|/////|           |
>                  +52K                     +164K
> 
> Then the dwrite() tries to drop the page cache for the first two
> pages starting at 720896.
> 
> Now we trigger delalloc for the above two pages.
> First, find_lock_delalloc_range() would return the range
> [774144, 774144 + 64K) (note 774144 = 720896 + 52K), not the full dirty
> range as in the non-zoned case.
>
> This is because find_lock_delalloc_range() limits the range to the max
> zone append size (64K), thus the range ends inside the 2nd page.
> 
> Then we start writeback through extent_write_locked_range(); since the
> range [+52K, +116K) ends inside the second page, it would clear the
> dirty flag for that whole page:
> 
>     720896           +64K           +128K             +192K
>     |            |///|//////////|   |/////|           |
>                  +52K           +116K     +164K
> 
> This is because extent_write_locked_range() uses the non-subpage
> compatible call (clear_page_dirty_for_io()) to clear the folio dirty
> flag.
>
> Now at this stage, we leak the reserved space for the remaining dirty
> part inside the second page.
> 
> [PREMATURE FIX]
> For now, just change all the page/folio flag operations to their
> subpage-aware versions.
> 
> But this would not fully solve the problem, as there are others left,
> like triggering the BUG_ON() inside mm/truncate.c on the second page of
> the above range:
> 
>  ------------[ cut here ]------------
>  kernel BUG at mm/truncate.c:577!
>  Internal error: Oops - BUG: 00000000f2000800 [#1] SMP
>  Dumping ftrace buffer:
>  ---------------------------------
>     <...>-2055      1d..2. 16142us : invalidate_inode_pages2_range: !!! folio=786432 !!!
>  ---------------------------------
>  Call trace:
>   invalidate_inode_pages2_range+0x378/0x430
>   kiocb_invalidate_pages+0x60/0x98
>   __iomap_dio_rw+0x350/0x5d8
>   btrfs_dio_write+0x50/0x88 [btrfs]
>   btrfs_direct_write+0x128/0x328 [btrfs]
>   btrfs_do_write_iter+0x174/0x1e8 [btrfs]
>   btrfs_file_write_iter+0x1c/0x30 [btrfs]
>   vfs_write+0x258/0x380
>   ksys_write+0x80/0x120
>   __arm64_sys_write+0x24/0x38
>   invoke_syscall+0x78/0x100
>   el0_svc_common.constprop.0+0x48/0xf0
>   do_el0_svc+0x24/0x38
>   el0_svc+0x3c/0x138
>   el0t_64_sync_handler+0x120/0x130
>   el0t_64_sync+0x194/0x198
>  Code: 97fbf380 f9400380 f271041f 54fff7e0 (d4210000)
>  ---[ end trace 0000000000000000 ]---
> 
> [REASON FOR RFC]
> I really want a formal answer as to why zoned buffered writes need such
> a dedicated (and completely subpage-incompatible)
> run_delalloc_cow().
> 
> It looks like we already have more than enough infrastructure for zoned
> devices, as find_lock_delalloc_range() would already return a range
> no larger than the max zone append size.

Yes, and basically it is there to ensure the one extent = one bio rule.

But, I'd like to lift that limitation. Even if one extent is handled
by multiple bios, the current extent split code (in the end_bio path) should
keep it merged if they are written sequentially. So, there is an
opportunity to have a larger extent == less metadata. However, I hit a bug
when I lifted the restriction, so that's not done yet.
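
As a toy illustration of why the rule exists (purely illustrative, not
btrfs code): with zone append the device picks the physical location at
completion time, so two bios for one ordered extent may not end up
physically contiguous, and then the extent has to be split:

    #include <stdbool.h>
    #include <stdint.h>

    struct toy_ordered_extent {
            uint64_t physical;      /* set when the first bio completes */
            uint64_t completed;     /* bytes completed so far */
    };

    /* Called when zone append completes a bio at @bio_phys: if it did
     * not land right after the bytes recorded so far, the remainder of
     * the extent must be split off, since one file extent can only map
     * one contiguous physical run. */
    static bool toy_needs_split(const struct toy_ordered_extent *oe,
                                uint64_t bio_phys)
    {
            return bio_phys != oe->physical + oe->completed;
    }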

> 
> Thus it looks to me that we are fine without a dedicated
> run_delalloc_cow() just for zoned devices.
> 
> And since the delalloc range is locked during
> find_lock_delalloc_range(), the contents of the pages cannot change,
> so I don't see much reason why we cannot take the regular
> cow_file_range() + __extent_writepage_io() path.

While the delalloc range is locked, pages except the first page are
unlocked in cow_file_range() -> extent_clear_unlock_delalloc(). That allows
another thread to grab the unlocked pages and send a bio from that thread.
That will break the one extent == one bio rule.

Well, as I said above, the current end_bio code can handle that case well
(not tested, though). So, nowadays it's more of an optimization, to have
fewer extent splits.

> 
> Of course I'll try to dig deeper to fully solve all the subpage + zoned
> bugs, but if we have more shared code paths, it would make our lives
> much easier.
> 
> Cc: Johannes Thumshirn <Johannes.Thumshirn@wdc.com>
> Cc: Naohiro Aota <Naohiro.Aota@wdc.com>
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
>  fs/btrfs/extent_io.c | 8 +++++---
>  1 file changed, 5 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index fb63055f42f3..e9850be26bb7 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -2286,13 +2286,15 @@ void extent_write_locked_range(struct inode *inode, struct page *locked_page,
>  		u64 cur_end = min(round_down(cur, PAGE_SIZE) + PAGE_SIZE - 1, end);
>  		u32 cur_len = cur_end + 1 - cur;
>  		struct page *page;
> +		struct folio *folio;
>  		int nr = 0;
>  
>  		page = find_get_page(mapping, cur >> PAGE_SHIFT);
>  		ASSERT(PageLocked(page));
> +		folio = page_folio(page);
>  		if (pages_dirty && page != locked_page) {
>  			ASSERT(PageDirty(page));
> -			clear_page_dirty_for_io(page);
> +			btrfs_folio_clear_dirty(fs_info, folio, cur, cur_len);
>  		}
>  
>  		ret = __extent_writepage_io(BTRFS_I(inode), page, cur, cur_len,
> @@ -2302,8 +2304,8 @@ void extent_write_locked_range(struct inode *inode, struct page *locked_page,
>  
>  		/* Make sure the mapping tag for page dirty gets cleared. */
>  		if (nr == 0) {
> -			set_page_writeback(page);
> -			end_page_writeback(page);
> +			btrfs_folio_set_writeback(fs_info, folio, cur, cur_len);
> +			btrfs_folio_clear_writeback(fs_info, folio, cur, cur_len);
>  		}
>  		if (ret) {
>  			btrfs_mark_ordered_io_finished(BTRFS_I(inode), page,
> -- 
> 2.44.0
>
Qu Wenruo March 6, 2024, 11:41 p.m. UTC | #2
On 2024/3/7 01:53, Naohiro Aota wrote:
> On Tue, Mar 05, 2024 at 06:49:25PM +1030, Qu Wenruo wrote:
>> [BUG]
[...]
>>
>> It looks like we already have more than enough infrastructure for zoned
>> devices, as find_lock_delalloc_range() would already return a range
>> no larger than the max zone append size.
> 
> Yes, and basically it is there to ensure the one extent = one bio rule.
> 
> But, I'd like to lift that limitation. Even if one extent is handled
> by multiple bios, the current extent split code (in the end_bio path) should
> keep it merged if they are written sequentially. So, there is an
> opportunity to have a larger extent == less metadata. However, I hit a bug
> when I lifted the restriction, so that's not done yet.

OK, so there are still some pitfalls, thanks for verifying that.

> 
>>
>> Thus it looks to me that we are fine without a dedicated
>> run_delalloc_cow() just for zoned devices.
>>
>> And since the delalloc range is locked during
>> find_lock_delalloc_range(), the contents of the pages cannot change,
>> so I don't see much reason why we cannot take the regular
>> cow_file_range() + __extent_writepage_io() path.
> 
> While the delalloc range is locked, pages except the first page are
> unlocked in cow_file_range() -> extent_clear_unlock_delalloc(). That allows
> another thread to grab the unlocked pages and send a bio from that thread.
> That will break the one extent == one bio rule.

Although cow_file_range() unlocks the remaining pages of the delalloc
range (except the first page), those pages are still dirty and not yet
marked for writeback.

And we only clear the dirty flag and set writeback on those (sub)pages
during __extent_writepage_io().

The only difference is in the page lock lifespan:

- The non-zoned COW path
   Pages locked during run_delalloc_range(), then unlocked.
   Relocked in extent_write_cache_pages() for submission.

- The zoned COW path
   Pages kept locked until submission is done.

Although I'd say the zoned COW path has a much simpler lifespan
(run_delalloc and submission happen for the same range, without page
unlocking in between).

Anyway, I have eventually pinned down the root cause of the subpage +
zoned error, and it's related to that full-page clear_page_dirty_for_io()
call inside extent_write_locked_range().

Personally speaking, I'm not a big fan of how we handle delalloc and
submission, and the zoned one looks a little more natural to me.
I'll check if we can make every write path follow the same behavior.

Thanks,
Qu
> 
> Well, as I said above, the current end_bio code can handle that case well
> (not tested, though). So, nowadays it's more of an optimization, to have
> fewer extent splits.
> 
>>
>> Of course I'll try to dig deeper to fully solve all the subpage + zoned
>> bugs, but if we have more shared code paths, it would make our lives
>> much easier.
>>
>> Cc: Johannes Thumshirn <Johannes.Thumshirn@wdc.com>
>> Cc: Naohiro Aota <Naohiro.Aota@wdc.com>
>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>> ---
>>   fs/btrfs/extent_io.c | 8 +++++---
>>   1 file changed, 5 insertions(+), 3 deletions(-)
>>
>> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
>> index fb63055f42f3..e9850be26bb7 100644
>> --- a/fs/btrfs/extent_io.c
>> +++ b/fs/btrfs/extent_io.c
>> @@ -2286,13 +2286,15 @@ void extent_write_locked_range(struct inode *inode, struct page *locked_page,
>>   		u64 cur_end = min(round_down(cur, PAGE_SIZE) + PAGE_SIZE - 1, end);
>>   		u32 cur_len = cur_end + 1 - cur;
>>   		struct page *page;
>> +		struct folio *folio;
>>   		int nr = 0;
>>   
>>   		page = find_get_page(mapping, cur >> PAGE_SHIFT);
>>   		ASSERT(PageLocked(page));
>> +		folio = page_folio(page);
>>   		if (pages_dirty && page != locked_page) {
>>   			ASSERT(PageDirty(page));
>> -			clear_page_dirty_for_io(page);
>> +			btrfs_folio_clear_dirty(fs_info, folio, cur, cur_len);
>>   		}
>>   
>>   		ret = __extent_writepage_io(BTRFS_I(inode), page, cur, cur_len,
>> @@ -2302,8 +2304,8 @@ void extent_write_locked_range(struct inode *inode, struct page *locked_page,
>>   
>>   		/* Make sure the mapping tag for page dirty gets cleared. */
>>   		if (nr == 0) {
>> -			set_page_writeback(page);
>> -			end_page_writeback(page);
>> +			btrfs_folio_set_writeback(fs_info, folio, cur, cur_len);
>> +			btrfs_folio_clear_writeback(fs_info, folio, cur, cur_len);
>>   		}
>>   		if (ret) {
>>   			btrfs_mark_ordered_io_finished(BTRFS_I(inode), page,
>> -- 
>> 2.44.0

Patch

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index fb63055f42f3..e9850be26bb7 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2286,13 +2286,15 @@  void extent_write_locked_range(struct inode *inode, struct page *locked_page,
 		u64 cur_end = min(round_down(cur, PAGE_SIZE) + PAGE_SIZE - 1, end);
 		u32 cur_len = cur_end + 1 - cur;
 		struct page *page;
+		struct folio *folio;
 		int nr = 0;
 
 		page = find_get_page(mapping, cur >> PAGE_SHIFT);
 		ASSERT(PageLocked(page));
+		folio = page_folio(page);
 		if (pages_dirty && page != locked_page) {
 			ASSERT(PageDirty(page));
-			clear_page_dirty_for_io(page);
+			btrfs_folio_clear_dirty(fs_info, folio, cur, cur_len);
 		}
 
 		ret = __extent_writepage_io(BTRFS_I(inode), page, cur, cur_len,
@@ -2302,8 +2304,8 @@  void extent_write_locked_range(struct inode *inode, struct page *locked_page,
 
 		/* Make sure the mapping tag for page dirty gets cleared. */
 		if (nr == 0) {
-			set_page_writeback(page);
-			end_page_writeback(page);
+			btrfs_folio_set_writeback(fs_info, folio, cur, cur_len);
+			btrfs_folio_clear_writeback(fs_info, folio, cur, cur_len);
 		}
 		if (ret) {
 			btrfs_mark_ordered_io_finished(BTRFS_I(inode), page,