Message ID | 7737c2e976c0bb2d36339ed0563cdbd07d846363.1709626757.git.wqu@suse.com
State      | New, archived
Series     | [RFC] btrfs: make extent_write_locked_range() to handle subpage dirty correctly
On Tue, Mar 05, 2024 at 06:49:25PM +1030, Qu Wenruo wrote:
> [BUG]
> Even with my previous subpage delalloc rework, the following workload
> can easily lead to leaked reserved space:
>
> # mkfs.btrfs -f $dev -s 4k > /dev/null
> # mount $dev $mnt
> # fsstress -v -w -n 8 -d $mnt -s 1709539240
> 0/0: fiemap - no filename
> 0/1: copyrange read - no filename
> 0/2: write - no filename
> 0/3: rename - no source filename
> 0/4: creat f0 x:0 0 0
> 0/4: creat add id=0,parent=-1
> 0/5: writev f0[259 1 0 0 0 0] [778052,113,965] 0
> 0/6: ioctl(FIEMAP) f0[259 1 0 0 224 887097] [1294220,2291618343991484791,0x10000] -1
> 0/7: dwrite - xfsctl(XFS_IOC_DIOINFO) f0[259 1 0 0 224 887097] return 25, fallback to stat()
> 0/7: dwrite f0[259 1 0 0 224 887097] [696320,102400] 0
> # umount $mnt
>
> The $dev is a tcmu-runner emulated zoned HDD with 80 zones and a max
> zone append size of 64K.
>
> [CAUSE]
> Before the dwrite(), writev() would dirty the following 3 pages:
>
> 720896    +64K           +128K    +192K
> |     |///|//////////////|/////|      |
>       +52K                     +164K
>
> Then the dwrite() would try to drop the page caches for the first two
> pages starting at 720896.
>
> Now we trigger delalloc for the above two pages.
> Firstly find_lock_delalloc_range() would return the range
> [774144, 774144 + 64K], not the full dirty range as in the non-zoned
> case.
>
> This is because find_lock_delalloc_range() is using the max zone append
> size (64K).
> The range would end in the 2nd page.
>
> Then we start writeback through extent_write_locked_range(), which, for
> the part of the range [+52K, +116K) that lands in the second page, would
> clear the dirty flag for the whole page:
>
> 720896    +64K       +128K    +192K
> |     |///|//////////|    |/////|      |
>       +52K        +116K        +164K
>
> This is due to the fact that extent_write_locked_range() is using the
> non-subpage-compatible call to clear the folio dirty flags.
>
> Now at this stage, we leak reserved space for the remaining dirty part
> inside the second page.
>
> [PREMATURE FIX]
> For now, just change all the page/folio flag operations to the subpage
> version.
>
> But this would not fully solve the problem, as we still have other
> problems, like triggering the BUG_ON() inside mm/truncate on the second
> page of the above range:
>
> ------------[ cut here ]------------
> kernel BUG at mm/truncate.c:577!
> Internal error: Oops - BUG: 00000000f2000800 [#1] SMP
> Dumping ftrace buffer:
> ---------------------------------
> <...>-2055  1d..2. 16142us : invalidate_inode_pages2_range: !!! folio=786432 !!!
> ---------------------------------
> Call trace:
>  invalidate_inode_pages2_range+0x378/0x430
>  kiocb_invalidate_pages+0x60/0x98
>  __iomap_dio_rw+0x350/0x5d8
>  btrfs_dio_write+0x50/0x88 [btrfs]
>  btrfs_direct_write+0x128/0x328 [btrfs]
>  btrfs_do_write_iter+0x174/0x1e8 [btrfs]
>  btrfs_file_write_iter+0x1c/0x30 [btrfs]
>  vfs_write+0x258/0x380
>  ksys_write+0x80/0x120
>  __arm64_sys_write+0x24/0x38
>  invoke_syscall+0x78/0x100
>  el0_svc_common.constprop.0+0x48/0xf0
>  do_el0_svc+0x24/0x38
>  el0_svc+0x3c/0x138
>  el0t_64_sync_handler+0x120/0x130
>  el0t_64_sync+0x194/0x198
> Code: 97fbf380 f9400380 f271041f 54fff7e0 (d4210000)
> ---[ end trace 0000000000000000 ]---
>
> [REASON FOR RFC]
> I really want a formal answer why zoned buffered writes need such a
> dedicated (while completely incompatible with subpage)
> run_delalloc_cow().
>
> It looks like we have already more than enough infrastructure for zoned
> devices, as find_lock_delalloc_range() would already return a range
> no larger than the max zone append size.

Yes, and basically it is to ensure the one extent = one bio rule.

But I'd like to lift that limitation. Even if one extent is handled
by multiple bios, the current extent split code (in the end_bio path)
should keep it merged if they are written sequentially. So, there is an
opportunity to have a larger extent == less metadata. However, I hit a
bug when I lifted the restriction, so that's not done yet.

> Thus it looks to me that we're fine to go without a dedicated
> run_delalloc_cow() just for zoned.
>
> And since the delalloc range would be locked during
> find_lock_delalloc_range(), the content of all the pages won't be
> changed, thus I didn't see much reason why we can not go the regular
> cow_file_range() + __extent_writepage_io() path.

While the delalloc range is locked, pages except the first page are
unlocked in cow_file_range() -> extent_clear_unlock_delalloc(). That
allows another thread to grab the unlocked pages and send a bio from
that thread. That will break the one extent == one bio rule.

Well, as I said above, the current end_bio can handle that case well
(not tested, though). So, it's more meant for optimization nowadays,
to have less extent split.

> Of course I'll try to dig deeper to fully solve all the subpage + zoned
> bugs, but if we have more shared code paths, it would make our lives
> much easier.
>
> Cc: Johannes Thumshirn <Johannes.Thumshirn@wdc.com>
> Cc: Naohiro Aota <Naohiro.Aota@wdc.com>
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
>  fs/btrfs/extent_io.c | 8 +++++---
>  1 file changed, 5 insertions(+), 3 deletions(-)
>
> [...]
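Naohiro's point, that an extent written by several sequentially completed bios can still end up recorded as one extent as long as each completion starts where the previous one ended, can be sketched with a toy model. This is not btrfs code: `struct completion` and `count_merged_extents` are invented here purely for illustration of the merge condition.

```c
#include <assert.h>
#include <stdint.h>

/* One completed bio: a byte range [start, start + len) of the file. */
struct completion {
	uint64_t start;
	uint64_t len;
};

/*
 * Count how many extent records survive after merging completions that
 * land back-to-back (next.start == previous end). Sequential zone-append
 * completions satisfy this, so one logical extent stays one record even
 * when it was submitted as multiple bios.
 */
static int count_merged_extents(const struct completion *c, int n)
{
	int extents = 0;
	uint64_t prev_end = 0;

	for (int i = 0; i < n; i++) {
		if (i == 0 || c[i].start != prev_end)
			extents++;	/* discontiguous: needs a new record */
		prev_end = c[i].start + c[i].len;
	}
	return extents;
}
```

Two 64K bios completing back-to-back collapse into one extent record; a gap between completions forces a split into two.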
On 2024/3/7 01:53, Naohiro Aota wrote:
> On Tue, Mar 05, 2024 at 06:49:25PM +1030, Qu Wenruo wrote:
>> [BUG]
>> [...]
>>
>> It looks like we have already more than enough infrastructure for zoned
>> devices, as find_lock_delalloc_range() would already return a range
>> no larger than the max zone append size.
>
> Yes, and basically it is to ensure the one extent = one bio rule.
>
> But I'd like to lift that limitation. Even if one extent is handled
> by multiple bios, the current extent split code (in the end_bio path)
> should keep it merged if they are written sequentially. So, there is an
> opportunity to have a larger extent == less metadata. However, I hit a
> bug when I lifted the restriction, so that's not done yet.

OK, so there are still some pitfalls, thanks for verifying that.

>
>> Thus it looks to me that we're fine to go without a dedicated
>> run_delalloc_cow() just for zoned.
>>
>> And since the delalloc range would be locked during
>> find_lock_delalloc_range(), the content of all the pages won't be
>> changed, thus I didn't see much reason why we can not go the regular
>> cow_file_range() + __extent_writepage_io() path.
>
> While the delalloc range is locked, pages except the first page are
> unlocked in cow_file_range() -> extent_clear_unlock_delalloc(). That
> allows another thread to grab the unlocked pages and send a bio from
> that thread. That will break the one extent == one bio rule.

Although cow_file_range() unlocked the remaining pages of the delalloc
range (except the first page), those pages are still dirty and not yet
marked writeback. We only clear the dirty flag and mark those (sub)pages
writeback during __extent_writepage_io().

The only difference is in the page lock lifespan:

- The non-zoned COW path
  Pages locked during run_delalloc_range(), then unlocked.
  Relocked in extent_write_cache_pages() for submission.

- The zoned COW path
  Pages kept locked until submission is done.

Although I'd say the zoned COW path has a much simpler lifespan
(run_delalloc and submission happen for the same range, without page
unlocking).

Anyway, I have eventually pinned down the root cause of the subpage +
zoned error, and it's related to that full clear_page_dirty_for_io()
call inside extent_write_locked_range().

Personally speaking, I'm not a big fan of how we handle delalloc and
submission, and the zoned one looks a little more natural to me.
I'll check if we can make every write path follow the same behavior.

Thanks,
Qu

> Well, as I said above, the current end_bio can handle that case well
> (not tested, though). So, it's more meant for optimization nowadays,
> to have less extent split.
>
>> Of course I'll try to dig deeper to fully solve all the subpage + zoned
>> bugs, but if we have more shared code paths, it would make our lives
>> much easier.
>>
>> Cc: Johannes Thumshirn <Johannes.Thumshirn@wdc.com>
>> Cc: Naohiro Aota <Naohiro.Aota@wdc.com>
>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>> ---
>>  fs/btrfs/extent_io.c | 8 +++++---
>>  1 file changed, 5 insertions(+), 3 deletions(-)
>>
>> [...]
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index fb63055f42f3..e9850be26bb7 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2286,13 +2286,15 @@ void extent_write_locked_range(struct inode *inode, struct page *locked_page,
 		u64 cur_end = min(round_down(cur, PAGE_SIZE) + PAGE_SIZE - 1, end);
 		u32 cur_len = cur_end + 1 - cur;
 		struct page *page;
+		struct folio *folio;
 		int nr = 0;
 
 		page = find_get_page(mapping, cur >> PAGE_SHIFT);
 		ASSERT(PageLocked(page));
+		folio = page_folio(page);
 		if (pages_dirty && page != locked_page) {
 			ASSERT(PageDirty(page));
-			clear_page_dirty_for_io(page);
+			btrfs_folio_clear_dirty(fs_info, folio, cur, cur_len);
 		}
 
 		ret = __extent_writepage_io(BTRFS_I(inode), page, cur, cur_len,
@@ -2302,8 +2304,8 @@ void extent_write_locked_range(struct inode *inode, struct page *locked_page,
 
 		/* Make sure the mapping tag for page dirty gets cleared. */
 		if (nr == 0) {
-			set_page_writeback(page);
-			end_page_writeback(page);
+			btrfs_folio_set_writeback(fs_info, folio, cur, cur_len);
+			btrfs_folio_clear_writeback(fs_info, folio, cur, cur_len);
 		}
 		if (ret) {
 			btrfs_mark_ordered_io_finished(BTRFS_I(inode), page,
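For reference, the per-page split that the loop in the hunk above performs on the problematic range can be worked out numerically. The standalone sketch below mirrors the `cur_end`/`cur_len` computation on 64K pages (the arm64 configuration from the report); `page_chunk` is a made-up helper, not a btrfs function.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 65536ULL	/* 64K pages, as on the reported arm64 setup */

static uint64_t round_down_u64(uint64_t v, uint64_t align)
{
	return v - (v % align);
}

static uint64_t min_u64(uint64_t a, uint64_t b)
{
	return a < b ? a : b;
}

/*
 * Mirror of the loop bounds in extent_write_locked_range(): for the
 * current position @cur inside [start, end], compute the last byte of
 * this iteration (@cur_end) and its length (@cur_len), clamped to the
 * page containing @cur.
 */
static void page_chunk(uint64_t cur, uint64_t end,
		       uint64_t *cur_end, uint64_t *cur_len)
{
	*cur_end = min_u64(round_down_u64(cur, SKETCH_PAGE_SIZE) +
			   SKETCH_PAGE_SIZE - 1, end);
	*cur_len = *cur_end + 1 - cur;
}
```

Feeding in the delalloc range from the report, [774144, 774144 + 64K), yields two iterations: a 12K tail of the first page and a 52K head of the second page, which is exactly why the ranged (subpage) dirty clear matters in the second iteration.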
[BUG]
Even with my previous subpage delalloc rework, the following workload
can easily lead to leaked reserved space:

# mkfs.btrfs -f $dev -s 4k > /dev/null
# mount $dev $mnt
# fsstress -v -w -n 8 -d $mnt -s 1709539240
0/0: fiemap - no filename
0/1: copyrange read - no filename
0/2: write - no filename
0/3: rename - no source filename
0/4: creat f0 x:0 0 0
0/4: creat add id=0,parent=-1
0/5: writev f0[259 1 0 0 0 0] [778052,113,965] 0
0/6: ioctl(FIEMAP) f0[259 1 0 0 224 887097] [1294220,2291618343991484791,0x10000] -1
0/7: dwrite - xfsctl(XFS_IOC_DIOINFO) f0[259 1 0 0 224 887097] return 25, fallback to stat()
0/7: dwrite f0[259 1 0 0 224 887097] [696320,102400] 0
# umount $mnt

The $dev is a tcmu-runner emulated zoned HDD with 80 zones and a max
zone append size of 64K.

[CAUSE]
Before the dwrite(), writev() would dirty the following 3 pages:

720896    +64K           +128K    +192K
|     |///|//////////////|/////|      |
      +52K                     +164K

Then the dwrite() would try to drop the page caches for the first two
pages starting at 720896.

Now we trigger delalloc for the above two pages.
Firstly find_lock_delalloc_range() would return the range
[774144, 774144 + 64K], not the full dirty range as in the non-zoned
case.

This is because find_lock_delalloc_range() is using the max zone append
size (64K).
The range would end in the 2nd page.

Then we start writeback through extent_write_locked_range(), which, for
the part of the range [+52K, +116K) that lands in the second page, would
clear the dirty flag for the whole page:

720896    +64K       +128K    +192K
|     |///|//////////|    |/////|      |
      +52K        +116K        +164K

This is due to the fact that extent_write_locked_range() is using the
non-subpage-compatible call to clear the folio dirty flags.

Now at this stage, we leak reserved space for the remaining dirty part
inside the second page.

[PREMATURE FIX]
For now, just change all the page/folio flag operations to the subpage
version.

But this would not fully solve the problem, as we still have other
problems, like triggering the BUG_ON() inside mm/truncate on the second
page of the above range:

------------[ cut here ]------------
kernel BUG at mm/truncate.c:577!
Internal error: Oops - BUG: 00000000f2000800 [#1] SMP
Dumping ftrace buffer:
---------------------------------
<...>-2055  1d..2. 16142us : invalidate_inode_pages2_range: !!! folio=786432 !!!
---------------------------------
Call trace:
 invalidate_inode_pages2_range+0x378/0x430
 kiocb_invalidate_pages+0x60/0x98
 __iomap_dio_rw+0x350/0x5d8
 btrfs_dio_write+0x50/0x88 [btrfs]
 btrfs_direct_write+0x128/0x328 [btrfs]
 btrfs_do_write_iter+0x174/0x1e8 [btrfs]
 btrfs_file_write_iter+0x1c/0x30 [btrfs]
 vfs_write+0x258/0x380
 ksys_write+0x80/0x120
 __arm64_sys_write+0x24/0x38
 invoke_syscall+0x78/0x100
 el0_svc_common.constprop.0+0x48/0xf0
 do_el0_svc+0x24/0x38
 el0_svc+0x3c/0x138
 el0t_64_sync_handler+0x120/0x130
 el0t_64_sync+0x194/0x198
Code: 97fbf380 f9400380 f271041f 54fff7e0 (d4210000)
---[ end trace 0000000000000000 ]---

[REASON FOR RFC]
I really want a formal answer why zoned buffered writes need such a
dedicated (while completely incompatible with subpage)
run_delalloc_cow().

It looks like we have already more than enough infrastructure for zoned
devices, as find_lock_delalloc_range() would already return a range
no larger than the max zone append size.

Thus it looks to me that we're fine to go without a dedicated
run_delalloc_cow() just for zoned.

And since the delalloc range would be locked during
find_lock_delalloc_range(), the content of all the pages won't be
changed, thus I didn't see much reason why we can not go the regular
cow_file_range() + __extent_writepage_io() path.

Of course I'll try to dig deeper to fully solve all the subpage + zoned
bugs, but if we have more shared code paths, it would make our lives
much easier.
Cc: Johannes Thumshirn <Johannes.Thumshirn@wdc.com>
Cc: Naohiro Aota <Naohiro.Aota@wdc.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/extent_io.c | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)
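The leak described under [CAUSE] can be made concrete with a toy model (not the actual btrfs subpage implementation): a 64K page whose 4K blocks are tracked by a 16-bit dirty bitmap. A whole-page clear, which is effectively what clear_page_dirty_for_io() does, also drops the dirty bits of blocks that are outside the range being submitted, and the delalloc reservation behind those bits is never released; a ranged clear keeps them dirty for a later writeback. Both helper names below are invented for this sketch.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_BLOCK_SIZE 4096u	/* sectorsize; 16 blocks per 64K page */

/*
 * Ranged (subpage-style) clear: only drop the bits of the blocks covered
 * by [start, start + len), offsets relative to the page start.
 */
static uint16_t subpage_clear_dirty(uint16_t bitmap, uint32_t start, uint32_t len)
{
	uint32_t first = start / SKETCH_BLOCK_SIZE;
	uint32_t last = (start + len - 1) / SKETCH_BLOCK_SIZE;

	for (uint32_t i = first; i <= last; i++)
		bitmap &= (uint16_t)~(1u << i);
	return bitmap;
}

/*
 * Whole-page clear, modelling clear_page_dirty_for_io(): every dirty bit
 * is gone, whether or not its block is part of the submitted range.
 */
static uint16_t fullpage_clear_dirty(uint16_t bitmap)
{
	(void)bitmap;
	return 0;
}
```

In the report's second page, all 16 blocks are dirty but writeback only covers the first 52K (13 blocks). The ranged clear leaves bits 13-15 set (0xE000), so their reservation can still be flushed later; the whole-page clear wipes them, which is the leaked reserved space.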