Message ID | 20200717071205.26027-1-wqu@suse.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | [v2] btrfs: qgroup: Fix data leakage caused by race between writeback and truncate | expand |
On 7/17/20 3:12 AM, Qu Wenruo wrote: > [BUG] > When running tests like generic/013 on test device with btrfs quota > enabled, it can normally lead to data leakage, detected at unmount time: > > BTRFS warning (device dm-3): qgroup 0/5 has unreleased space, type 0 rsv 4096 > ------------[ cut here ]------------ > WARNING: CPU: 11 PID: 16386 at fs/btrfs/disk-io.c:4142 close_ctree+0x1dc/0x323 [btrfs] > RIP: 0010:close_ctree+0x1dc/0x323 [btrfs] > Call Trace: > btrfs_put_super+0x15/0x17 [btrfs] > generic_shutdown_super+0x72/0x110 > kill_anon_super+0x18/0x30 > btrfs_kill_super+0x17/0x30 [btrfs] > deactivate_locked_super+0x3b/0xa0 > deactivate_super+0x40/0x50 > cleanup_mnt+0x135/0x190 > __cleanup_mnt+0x12/0x20 > task_work_run+0x64/0xb0 > __prepare_exit_to_usermode+0x1bc/0x1c0 > __syscall_return_slowpath+0x47/0x230 > do_syscall_64+0x64/0xb0 > entry_SYSCALL_64_after_hwframe+0x44/0xa9 > ---[ end trace caf08beafeca2392 ]--- > BTRFS error (device dm-3): qgroup reserved space leaked > > [CAUSE] > In the offending case, the offending operations are: > 2/6: writev f2X[269 1 0 0 0 0] [1006997,67,288] 0 > 2/7: truncate f2X[269 1 0 0 48 1026293] 18388 0 > > The following sequence of events could happen after the writev(): > CPU1 (writeback) | CPU2 (truncate) > ----------------------------------------------------------------- > btrfs_writepages() | > |- extent_write_cache_pages() | > |- Got page for 1003520 | > | 1003520 is Dirty, no writeback | > | So (!clear_page_dirty_for_io()) | > | gets called for it | > |- Now page 1003520 is Clean. | > | | btrfs_setattr() > | | |- btrfs_setsize() > | | |- truncate_setsize() > | | New i_size is 18388 > |- __extent_writepage() | > | |- page_offset() > i_size | > |- btrfs_invalidatepage() | > |- Page is clean, so no qgroup | > callback executed > > This means, the qgroup reserved data space is not properly released in > btrfs_invalidatepage() as the page is Clean. > > [FIX] > Instead of checking the dirty bit of a page, call > btrfs_qgroup_free_data() unconditionally in btrfs_invalidatepage(). > > As qgroup rsv are completely binded to the QGROUP_RESERVED bit of > io_tree, not binded to page status, thus we won't cause double freeing > anyway. > > Fixes: 0b34c261e235 ("btrfs: qgroup: Prevent qgroup->reserved from going subzero") > Signed-off-by: Qu Wenruo <wqu@suse.com> > I don't understand how this is ok. We can call invalidatepage via memory pressure, so what if we have started the write and have an ordered extent outstanding, and then we call into invalidate page and now unconditionally drop the qgroup reservation, even tho we still need it for the ordered extent. Am I missing something here? Thanks, Josef
On 2020/7/17 下午11:30, Josef Bacik wrote: > On 7/17/20 3:12 AM, Qu Wenruo wrote: >> [BUG] >> When running tests like generic/013 on test device with btrfs quota >> enabled, it can normally lead to data leakage, detected at unmount time: >> >> BTRFS warning (device dm-3): qgroup 0/5 has unreleased space, type >> 0 rsv 4096 >> ------------[ cut here ]------------ >> WARNING: CPU: 11 PID: 16386 at fs/btrfs/disk-io.c:4142 >> close_ctree+0x1dc/0x323 [btrfs] >> RIP: 0010:close_ctree+0x1dc/0x323 [btrfs] >> Call Trace: >> btrfs_put_super+0x15/0x17 [btrfs] >> generic_shutdown_super+0x72/0x110 >> kill_anon_super+0x18/0x30 >> btrfs_kill_super+0x17/0x30 [btrfs] >> deactivate_locked_super+0x3b/0xa0 >> deactivate_super+0x40/0x50 >> cleanup_mnt+0x135/0x190 >> __cleanup_mnt+0x12/0x20 >> task_work_run+0x64/0xb0 >> __prepare_exit_to_usermode+0x1bc/0x1c0 >> __syscall_return_slowpath+0x47/0x230 >> do_syscall_64+0x64/0xb0 >> entry_SYSCALL_64_after_hwframe+0x44/0xa9 >> ---[ end trace caf08beafeca2392 ]--- >> BTRFS error (device dm-3): qgroup reserved space leaked >> >> [CAUSE] >> In the offending case, the offending operations are: >> 2/6: writev f2X[269 1 0 0 0 0] [1006997,67,288] 0 >> 2/7: truncate f2X[269 1 0 0 48 1026293] 18388 0 >> >> The following sequence of events could happen after the writev(): >> CPU1 (writeback) | CPU2 (truncate) >> ----------------------------------------------------------------- >> btrfs_writepages() | >> |- extent_write_cache_pages() | >> |- Got page for 1003520 | >> | 1003520 is Dirty, no writeback | >> | So (!clear_page_dirty_for_io()) | >> | gets called for it | >> |- Now page 1003520 is Clean. | >> | | btrfs_setattr() >> | | |- btrfs_setsize() >> | | |- truncate_setsize() >> | | New i_size is 18388 >> |- __extent_writepage() | >> | |- page_offset() > i_size | >> |- btrfs_invalidatepage() | >> |- Page is clean, so no qgroup | >> callback executed >> >> This means, the qgroup reserved data space is not properly released in >> btrfs_invalidatepage() as the page is Clean. >> >> [FIX] >> Instead of checking the dirty bit of a page, call >> btrfs_qgroup_free_data() unconditionally in btrfs_invalidatepage(). >> >> As qgroup rsv are completely binded to the QGROUP_RESERVED bit of >> io_tree, not binded to page status, thus we won't cause double freeing >> anyway. >> >> Fixes: 0b34c261e235 ("btrfs: qgroup: Prevent qgroup->reserved from >> going subzero") >> Signed-off-by: Qu Wenruo <wqu@suse.com> >> > > I don't understand how this is ok. We can call invalidatepage via > memory pressure, so what if we have started the write and have an > ordered extent outstanding, and then we call into invalidate page and > now unconditionally drop the qgroup reservation, even tho we still need > it for the ordered extent. Am I missing something here? Thanks, As long as the ordered extent as been started (__btrfs_add_ordered_extent()), then the QGROUP_RESERVED bit is cleared, either freed for NODATACOW write, or released for COW writes. IIRC this recent change is suggested by you, and that paved the road for this fix. Thanks, Qu > > Josef
On 7/17/20 7:38 PM, Qu Wenruo wrote: > > > On 2020/7/17 下午11:30, Josef Bacik wrote: >> On 7/17/20 3:12 AM, Qu Wenruo wrote: >>> [BUG] >>> When running tests like generic/013 on test device with btrfs quota >>> enabled, it can normally lead to data leakage, detected at unmount time: >>> >>> BTRFS warning (device dm-3): qgroup 0/5 has unreleased space, type >>> 0 rsv 4096 >>> ------------[ cut here ]------------ >>> WARNING: CPU: 11 PID: 16386 at fs/btrfs/disk-io.c:4142 >>> close_ctree+0x1dc/0x323 [btrfs] >>> RIP: 0010:close_ctree+0x1dc/0x323 [btrfs] >>> Call Trace: >>> btrfs_put_super+0x15/0x17 [btrfs] >>> generic_shutdown_super+0x72/0x110 >>> kill_anon_super+0x18/0x30 >>> btrfs_kill_super+0x17/0x30 [btrfs] >>> deactivate_locked_super+0x3b/0xa0 >>> deactivate_super+0x40/0x50 >>> cleanup_mnt+0x135/0x190 >>> __cleanup_mnt+0x12/0x20 >>> task_work_run+0x64/0xb0 >>> __prepare_exit_to_usermode+0x1bc/0x1c0 >>> __syscall_return_slowpath+0x47/0x230 >>> do_syscall_64+0x64/0xb0 >>> entry_SYSCALL_64_after_hwframe+0x44/0xa9 >>> ---[ end trace caf08beafeca2392 ]--- >>> BTRFS error (device dm-3): qgroup reserved space leaked >>> >>> [CAUSE] >>> In the offending case, the offending operations are: >>> 2/6: writev f2X[269 1 0 0 0 0] [1006997,67,288] 0 >>> 2/7: truncate f2X[269 1 0 0 48 1026293] 18388 0 >>> >>> The following sequence of events could happen after the writev(): >>> CPU1 (writeback) | CPU2 (truncate) >>> ----------------------------------------------------------------- >>> btrfs_writepages() | >>> |- extent_write_cache_pages() | >>> |- Got page for 1003520 | >>> | 1003520 is Dirty, no writeback | >>> | So (!clear_page_dirty_for_io()) | >>> | gets called for it | >>> |- Now page 1003520 is Clean. | >>> | | btrfs_setattr() >>> | | |- btrfs_setsize() >>> | | |- truncate_setsize() >>> | | New i_size is 18388 >>> |- __extent_writepage() | >>> | |- page_offset() > i_size | >>> |- btrfs_invalidatepage() | >>> |- Page is clean, so no qgroup | >>> callback executed >>> >>> This means, the qgroup reserved data space is not properly released in >>> btrfs_invalidatepage() as the page is Clean. >>> >>> [FIX] >>> Instead of checking the dirty bit of a page, call >>> btrfs_qgroup_free_data() unconditionally in btrfs_invalidatepage(). >>> >>> As qgroup rsv are completely binded to the QGROUP_RESERVED bit of >>> io_tree, not binded to page status, thus we won't cause double freeing >>> anyway. >>> >>> Fixes: 0b34c261e235 ("btrfs: qgroup: Prevent qgroup->reserved from >>> going subzero") >>> Signed-off-by: Qu Wenruo <wqu@suse.com> >>> >> >> I don't understand how this is ok. We can call invalidatepage via >> memory pressure, so what if we have started the write and have an >> ordered extent outstanding, and then we call into invalidate page and >> now unconditionally drop the qgroup reservation, even tho we still need >> it for the ordered extent. Am I missing something here? Thanks, > > As long as the ordered extent as been started > (__btrfs_add_ordered_extent()), then the QGROUP_RESERVED bit is cleared, > either freed for NODATACOW write, or released for COW writes. > > IIRC this recent change is suggested by you, and that paved the road for > this fix. > Yeah I had it backwards in my head, this looks good to me, you can add Reviewed-by: Josef Bacik <josef@toxicpanda.com> Thanks, Josef
On Fri, Jul 17, 2020 at 03:12:05PM +0800, Qu Wenruo wrote: > [BUG] > When running tests like generic/013 on test device with btrfs quota > enabled, it can normally lead to data leakage, detected at unmount time: > > BTRFS warning (device dm-3): qgroup 0/5 has unreleased space, type 0 rsv 4096 > ------------[ cut here ]------------ > WARNING: CPU: 11 PID: 16386 at fs/btrfs/disk-io.c:4142 close_ctree+0x1dc/0x323 [btrfs] > RIP: 0010:close_ctree+0x1dc/0x323 [btrfs] > Call Trace: > btrfs_put_super+0x15/0x17 [btrfs] > generic_shutdown_super+0x72/0x110 > kill_anon_super+0x18/0x30 > btrfs_kill_super+0x17/0x30 [btrfs] > deactivate_locked_super+0x3b/0xa0 > deactivate_super+0x40/0x50 > cleanup_mnt+0x135/0x190 > __cleanup_mnt+0x12/0x20 > task_work_run+0x64/0xb0 > __prepare_exit_to_usermode+0x1bc/0x1c0 > __syscall_return_slowpath+0x47/0x230 > do_syscall_64+0x64/0xb0 > entry_SYSCALL_64_after_hwframe+0x44/0xa9 > ---[ end trace caf08beafeca2392 ]--- > BTRFS error (device dm-3): qgroup reserved space leaked > > [CAUSE] > In the offending case, the offending operations are: > 2/6: writev f2X[269 1 0 0 0 0] [1006997,67,288] 0 > 2/7: truncate f2X[269 1 0 0 48 1026293] 18388 0 > > The following sequence of events could happen after the writev(): > CPU1 (writeback) | CPU2 (truncate) > ----------------------------------------------------------------- > btrfs_writepages() | > |- extent_write_cache_pages() | > |- Got page for 1003520 | > | 1003520 is Dirty, no writeback | > | So (!clear_page_dirty_for_io()) | > | gets called for it | > |- Now page 1003520 is Clean. | > | | btrfs_setattr() > | | |- btrfs_setsize() > | | |- truncate_setsize() > | | New i_size is 18388 > |- __extent_writepage() | > | |- page_offset() > i_size | > |- btrfs_invalidatepage() | > |- Page is clean, so no qgroup | > callback executed > > This means, the qgroup reserved data space is not properly released in > btrfs_invalidatepage() as the page is Clean. > > [FIX] > Instead of checking the dirty bit of a page, call > btrfs_qgroup_free_data() unconditionally in btrfs_invalidatepage(). > > As qgroup rsv are completely binded to the QGROUP_RESERVED bit of > io_tree, not binded to page status, thus we won't cause double freeing > anyway. > > Fixes: 0b34c261e235 ("btrfs: qgroup: Prevent qgroup->reserved from going subzero") > Signed-off-by: Qu Wenruo <wqu@suse.com> Added to misc-next, thanks.
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index 7c03b402529e..0fa4f7007ff9 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -8157,21 +8157,17 @@ static void btrfs_invalidatepage(struct page *page, unsigned int offset, /* * Qgroup reserved space handler * Page here will be either - * 1) Already written to disk - * In this case, its reserved space is released from data rsv map - * and will be freed by delayed_ref handler finally. - * So even we call qgroup_free_data(), it won't decrease reserved - * space. - * 2) Not written to disk - * This means the reserved space should be freed here. However, - * if a truncate invalidates the page (by clearing PageDirty) - * and the page is accounted for while allocating extent - * in btrfs_check_data_free_space() we let delayed_ref to - * free the entire extent. + * 1) Already written to disk or ordered extent already submitted + * Then its QGROUP_RESERVED bit in io_tree is already cleaned. + * Qgroup will be handled by its qgroup_record then. + * btrfs_qgroup_free_data() call will do nothing here. + * + * 2) Not written to disk yet + * Then btrfs_qgroup_free_data() call will clear the QGROUP_RESERVED + * bit of its io_tree, and free the qgroup reserved data space. + * Since the IO will never happen for this page. */ - if (PageDirty(page)) - btrfs_qgroup_free_data(BTRFS_I(inode), NULL, page_start, - PAGE_SIZE); + btrfs_qgroup_free_data(BTRFS_I(inode), NULL, page_start, PAGE_SIZE); if (!inode_evicting) { clear_extent_bit(tree, page_start, page_end, EXTENT_LOCKED | EXTENT_DELALLOC | EXTENT_DELALLOC_NEW |
[BUG] When running tests like generic/013 on test device with btrfs quota enabled, it can normally lead to data leakage, detected at unmount time: BTRFS warning (device dm-3): qgroup 0/5 has unreleased space, type 0 rsv 4096 ------------[ cut here ]------------ WARNING: CPU: 11 PID: 16386 at fs/btrfs/disk-io.c:4142 close_ctree+0x1dc/0x323 [btrfs] RIP: 0010:close_ctree+0x1dc/0x323 [btrfs] Call Trace: btrfs_put_super+0x15/0x17 [btrfs] generic_shutdown_super+0x72/0x110 kill_anon_super+0x18/0x30 btrfs_kill_super+0x17/0x30 [btrfs] deactivate_locked_super+0x3b/0xa0 deactivate_super+0x40/0x50 cleanup_mnt+0x135/0x190 __cleanup_mnt+0x12/0x20 task_work_run+0x64/0xb0 __prepare_exit_to_usermode+0x1bc/0x1c0 __syscall_return_slowpath+0x47/0x230 do_syscall_64+0x64/0xb0 entry_SYSCALL_64_after_hwframe+0x44/0xa9 ---[ end trace caf08beafeca2392 ]--- BTRFS error (device dm-3): qgroup reserved space leaked [CAUSE] In the offending case, the offending operations are: 2/6: writev f2X[269 1 0 0 0 0] [1006997,67,288] 0 2/7: truncate f2X[269 1 0 0 48 1026293] 18388 0 The following sequence of events could happen after the writev(): CPU1 (writeback) | CPU2 (truncate) ----------------------------------------------------------------- btrfs_writepages() | |- extent_write_cache_pages() | |- Got page for 1003520 | | 1003520 is Dirty, no writeback | | So (!clear_page_dirty_for_io()) | | gets called for it | |- Now page 1003520 is Clean. | | | btrfs_setattr() | | |- btrfs_setsize() | | |- truncate_setsize() | | New i_size is 18388 |- __extent_writepage() | | |- page_offset() > i_size | |- btrfs_invalidatepage() | |- Page is clean, so no qgroup | callback executed This means, the qgroup reserved data space is not properly released in btrfs_invalidatepage() as the page is Clean. [FIX] Instead of checking the dirty bit of a page, call btrfs_qgroup_free_data() unconditionally in btrfs_invalidatepage(). As qgroup rsv are completely binded to the QGROUP_RESERVED bit of io_tree, not binded to page status, thus we won't cause double freeing anyway. Fixes: 0b34c261e235 ("btrfs: qgroup: Prevent qgroup->reserved from going subzero") Signed-off-by: Qu Wenruo <wqu@suse.com> --- Changelog: v2: - Use proper subject for the patch - Use better, more clear commit message This is a race between truncate and writeback, and the dangerous page bit dance. --- fs/btrfs/inode.c | 24 ++++++++++-------------- 1 file changed, 10 insertions(+), 14 deletions(-)