Message ID | 58dac27acbab72124549718201bec971491b5b1a.1734420572.git.wqu@suse.com (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | btrfs: handle free space tree rebuild in multiple transactions | expand |
On Tue, Dec 17, 2024 at 7:31 AM Qu Wenruo <wqu@suse.com> wrote: > > During free space tree rebuild, we're holding on transaction handler for > the whole rebuild process. > > This will cause blocked task warning for btrfs-transaction kernel > thread, as during the rebuild, btrfs-transaction kthread has to wait for > the running transaction we're holding for free space tree rebuild. Do you have a stack trace? Does that happen where exactly, in the btrfs_attach_transaction() call chain, while waiting on a wait queue? > > On a large enough btrfs, we have thousands of block groups to go > through, thus it will definitely take over 120s and trigger the blocked > task warning. > > Fix the problem by handling 32 block groups in one transaction, and end > the transaction when we hit the 32 block groups threshold. > > This will allow the btrfs-transaction kthread to commit the transaction > when needed. > > And even if during the rebuild the system lost its power, we are still > fine as we didn't set FREE_SPACE_TREE_VALID flag, thus on next RW mount > we will still rebuild the tree, without utilizing the half built one. What about if a crash happens and we already processed some block groups and the transaction got committed? On the next mount, when we call populate_free_space_tree() for the same block groups, because we always start from the first one, won't we get an -EEXIST and fail the mount? For example, add_new_free_space_info() doesn't ignore an -EEXIST when it calls btrfs_insert_empty_item(), so we will fail the mount when trying to build the tree. > > Signed-off-by: Qu Wenruo <wqu@suse.com> > --- > fs/btrfs/free-space-tree.c | 12 ++++++++++++ > 1 file changed, 12 insertions(+) > > diff --git a/fs/btrfs/free-space-tree.c b/fs/btrfs/free-space-tree.c > index 7ba50e133921..d8f334724092 100644 > --- a/fs/btrfs/free-space-tree.c > +++ b/fs/btrfs/free-space-tree.c > @@ -1312,6 +1312,8 @@ int btrfs_delete_free_space_tree(struct btrfs_fs_info *fs_info) > return btrfs_commit_transaction(trans); > } > > +/* How many block groups can be handled in one transaction. */ > +#define FREE_SPACE_TREE_REBUILD_BATCH (32) > int btrfs_rebuild_free_space_tree(struct btrfs_fs_info *fs_info) Please put a blank line after the macro declaration and place the macro at the top of the file before the C code. > { > struct btrfs_trans_handle *trans; > @@ -1322,6 +1324,7 @@ int btrfs_rebuild_free_space_tree(struct btrfs_fs_info *fs_info) > }; > struct btrfs_root *free_space_root = btrfs_global_root(fs_info, &key); > struct rb_node *node; > + unsigned int handled = 0; > int ret; > > trans = btrfs_start_transaction(free_space_root, 1); > @@ -1350,6 +1353,15 @@ int btrfs_rebuild_free_space_tree(struct btrfs_fs_info *fs_info) > btrfs_end_transaction(trans); > return ret; > } > + handled++; > + handled %= FREE_SPACE_TREE_REBUILD_BATCH; > + if (!handled) { > + btrfs_end_transaction(trans); > + trans = btrfs_start_transaction(free_space_root, > + FREE_SPACE_TREE_REBUILD_BATCH); This is a fundamental change here, we are no longer reserving 1 unit but 32 instead. Plus the first call to btrfs_start_transaction(), before entering the loop, uses 1. For consistency, both places should reserve the same amount. But I think that changing the amount of reserved space should be a separate change with its own changelog, providing a rationale for it. It's odd indeed that only 1 item is being reserved, but on the other hand the block reserve associated with the free space tree root is the delayed refs reserve (see btrfs_init_root_block_rsv()), so changing the number of units passed to btrfs_start_transaction() shouldn't make much difference because the space reserved by a transaction goes to the transaction block reserve (fs_info->trans_block_rsv). And given that free space tree build/rebuild only touches the free space tree root, I don't see how that makes any difference, or at least it's not trivial, hence the separate change only for the space reservation amount and an explanation about why we are doing it. Thanks. > + if (IS_ERR(trans)) > + return PTR_ERR(trans); > + } > node = rb_next(node); > } > > -- > 2.47.1 > >
在 2024/12/17 19:15, Filipe Manana 写道: > On Tue, Dec 17, 2024 at 7:31 AM Qu Wenruo <wqu@suse.com> wrote: >> >> During free space tree rebuild, we're holding on transaction handler for >> the whole rebuild process. >> >> This will cause blocked task warning for btrfs-transaction kernel >> thread, as during the rebuild, btrfs-transaction kthread has to wait for >> the running transaction we're holding for free space tree rebuild. > > Do you have a stack trace? > Does that happen where exactly, in the btrfs_attach_transaction() call > chain, while waiting on a wait queue? Unfortunately no, thus it's only based on code analyze. The original reporter's stack are not related to free space cache rebuild, but on metadata writeback wait: [101497.950425] INFO: task btrfs-transacti:29385 blocked for more than 122 seconds. [101497.950432] Tainted: G W O 6.12.3-gentoo-x86_64 #1 [101497.950435] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [101497.950437] task:btrfs-transacti state:D stack:0 pid:29385 tgid:29385 ppid:2 flags:0x00004000 [101497.950442] Call Trace: [101497.950444] <TASK> [101497.950449] __schedule+0x3f0/0xbd0 [101497.950458] schedule+0x27/0xf0 [101497.950461] io_schedule+0x46/0x70 [101497.950465] folio_wait_bit_common+0x123/0x340 [101497.950472] ? __pfx_wake_page_function+0x10/0x10 [101497.950477] folio_wait_writeback+0x2b/0x80 [101497.950480] __filemap_fdatawait_range+0x7d/0xd0 [101497.950489] filemap_fdatawait_range+0x12/0x20 [101497.950493] __btrfs_wait_marked_extents.isra.0+0xb8/0xf0 [btrfs] [101497.950536] btrfs_write_and_wait_transaction+0x5e/0xd0 [btrfs] [101497.950572] btrfs_commit_transaction+0x8d9/0xe80 [btrfs] [101497.950606] ? start_transaction+0xc0/0x820 [btrfs] [101497.950640] transaction_kthread+0x159/0x1c0 [btrfs] [101497.950674] ? __pfx_transaction_kthread+0x10/0x10 [btrfs] [101497.950705] kthread+0xd2/0x100 [101497.950710] ? __pfx_kthread+0x10/0x10 [101497.950714] ret_from_fork+0x34/0x50 [101497.950718] ? __pfx_kthread+0x10/0x10 [101497.950721] ret_from_fork_asm+0x1a/0x30 [101497.950727] </TASK> Furthermore, the original reporter doesn't experience real hang. The operation can finish (very large stream receive), sync and properly unmount. I believe the original report is mostly caused by the following reasons: - Too large physical RAM 96GiB, causing too huge page cache - HDD Low IOPS - Mostly random metadata writes? > >> >> On a large enough btrfs, we have thousands of block groups to go >> through, thus it will definitely take over 120s and trigger the blocked >> task warning. >> >> Fix the problem by handling 32 block groups in one transaction, and end >> the transaction when we hit the 32 block groups threshold. >> >> This will allow the btrfs-transaction kthread to commit the transaction >> when needed. >> >> And even if during the rebuild the system lost its power, we are still >> fine as we didn't set FREE_SPACE_TREE_VALID flag, thus on next RW mount >> we will still rebuild the tree, without utilizing the half built one. > > What about if a crash happens and we already processed some block > groups and the transaction got committed? > > On the next mount, when we call populate_free_space_tree() for the > same block groups, because we always start from the first one, won't > we get an -EEXIST and fail the mount? > For example, add_new_free_space_info() doesn't ignore an -EEXIST when > it calls btrfs_insert_empty_item(), so we will fail the mount when > trying to build the tree. We are still fine: btrfs_start_pre_rw_mount() |- rebuild_free_space_tree = true; | This is because our fs only has FREE_SPACE_TREE, but no | FREE_SPACE_TREE_VALID compat_ro flag. | |- btrfs_rebuild_free_space_tree() |- clear_free_space_tree() Which deletes all the free space tree items. Thanks, Qu > >> >> Signed-off-by: Qu Wenruo <wqu@suse.com> >> --- >> fs/btrfs/free-space-tree.c | 12 ++++++++++++ >> 1 file changed, 12 insertions(+) >> >> diff --git a/fs/btrfs/free-space-tree.c b/fs/btrfs/free-space-tree.c >> index 7ba50e133921..d8f334724092 100644 >> --- a/fs/btrfs/free-space-tree.c >> +++ b/fs/btrfs/free-space-tree.c >> @@ -1312,6 +1312,8 @@ int btrfs_delete_free_space_tree(struct btrfs_fs_info *fs_info) >> return btrfs_commit_transaction(trans); >> } >> >> +/* How many block groups can be handled in one transaction. */ >> +#define FREE_SPACE_TREE_REBUILD_BATCH (32) >> int btrfs_rebuild_free_space_tree(struct btrfs_fs_info *fs_info) > > Please put a blank line after the macro declaration and place the > macro at the top of the file before the C code. > >> { >> struct btrfs_trans_handle *trans; >> @@ -1322,6 +1324,7 @@ int btrfs_rebuild_free_space_tree(struct btrfs_fs_info *fs_info) >> }; >> struct btrfs_root *free_space_root = btrfs_global_root(fs_info, &key); >> struct rb_node *node; >> + unsigned int handled = 0; >> int ret; >> >> trans = btrfs_start_transaction(free_space_root, 1); >> @@ -1350,6 +1353,15 @@ int btrfs_rebuild_free_space_tree(struct btrfs_fs_info *fs_info) >> btrfs_end_transaction(trans); >> return ret; >> } >> + handled++; >> + handled %= FREE_SPACE_TREE_REBUILD_BATCH; >> + if (!handled) { >> + btrfs_end_transaction(trans); >> + trans = btrfs_start_transaction(free_space_root, >> + FREE_SPACE_TREE_REBUILD_BATCH); > > This is a fundamental change here, we are no longer reserving 1 unit > but 32 instead. > Plus the first call to btrfs_start_transaction(), before entering the > loop, uses 1. > For consistency, both places should reserve the same amount. > > But I think that changing the amount of reserved space should be a > separate change with its own changelog, > providing a rationale for it. > > It's odd indeed that only 1 item is being reserved, but on the other > hand the block reserve associated with the free space tree root is the > delayed refs reserve (see btrfs_init_root_block_rsv()), > so changing the number of units passed to btrfs_start_transaction() > shouldn't make much difference because the space reserved by a > transaction goes to the transaction block reserve > (fs_info->trans_block_rsv). > > And given that free space tree build/rebuild only touches the free > space tree root, I don't see how that makes any difference, or at > least it's not trivial, hence the separate change only for the space > reservation > amount and an explanation about why we are doing it. > > Thanks. > > >> + if (IS_ERR(trans)) >> + return PTR_ERR(trans); >> + } >> node = rb_next(node); >> } >> >> -- >> 2.47.1 >> >>
On Tue, Dec 17, 2024 at 9:02 AM Qu Wenruo <wqu@suse.com> wrote: > > > > 在 2024/12/17 19:15, Filipe Manana 写道: > > On Tue, Dec 17, 2024 at 7:31 AM Qu Wenruo <wqu@suse.com> wrote: > >> > >> During free space tree rebuild, we're holding on transaction handler for > >> the whole rebuild process. > >> > >> This will cause blocked task warning for btrfs-transaction kernel > >> thread, as during the rebuild, btrfs-transaction kthread has to wait for > >> the running transaction we're holding for free space tree rebuild. > > > > Do you have a stack trace? > > Does that happen where exactly, in the btrfs_attach_transaction() call > > chain, while waiting on a wait queue? > > Unfortunately no, thus it's only based on code analyze. > > The original reporter's stack are not related to free space cache > rebuild, but on metadata writeback wait: > > [101497.950425] INFO: task btrfs-transacti:29385 blocked for more than > 122 seconds. > [101497.950432] Tainted: G W O 6.12.3-gentoo-x86_64 #1 > [101497.950435] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" > disables this message. > [101497.950437] task:btrfs-transacti state:D stack:0 pid:29385 > tgid:29385 ppid:2 flags:0x00004000 > [101497.950442] Call Trace: > [101497.950444] <TASK> > [101497.950449] __schedule+0x3f0/0xbd0 > [101497.950458] schedule+0x27/0xf0 > [101497.950461] io_schedule+0x46/0x70 > [101497.950465] folio_wait_bit_common+0x123/0x340 > [101497.950472] ? __pfx_wake_page_function+0x10/0x10 > [101497.950477] folio_wait_writeback+0x2b/0x80 > [101497.950480] __filemap_fdatawait_range+0x7d/0xd0 > [101497.950489] filemap_fdatawait_range+0x12/0x20 > [101497.950493] __btrfs_wait_marked_extents.isra.0+0xb8/0xf0 [btrfs] > [101497.950536] btrfs_write_and_wait_transaction+0x5e/0xd0 [btrfs] > [101497.950572] btrfs_commit_transaction+0x8d9/0xe80 [btrfs] > [101497.950606] ? start_transaction+0xc0/0x820 [btrfs] > [101497.950640] transaction_kthread+0x159/0x1c0 [btrfs] > [101497.950674] ? __pfx_transaction_kthread+0x10/0x10 [btrfs] > [101497.950705] kthread+0xd2/0x100 > [101497.950710] ? __pfx_kthread+0x10/0x10 > [101497.950714] ret_from_fork+0x34/0x50 > [101497.950718] ? __pfx_kthread+0x10/0x10 > [101497.950721] ret_from_fork_asm+0x1a/0x30 > [101497.950727] </TASK> > > Furthermore, the original reporter doesn't experience real hang. > The operation can finish (very large stream receive), sync and properly > unmount. > I believe the original report is mostly caused by the following reasons: > > - Too large physical RAM > 96GiB, causing too huge page cache > > - HDD > Low IOPS > > - Mostly random metadata writes? Looking at that trace, I don't see how this change relates to it. That is, how this batching helps prevent that. That happens when committing a transaction and we're in the unblocked transaction state, starting writeback of the metadata extents and waiting for the writeback to finish. Since the free space tree build code uses the same transaction handle to build the whole free space tree, the transaction kthread would block when calling btrfs_commit_transaction(), waiting for the free space tree build task to release its handle - but that's not what is happening here. The transaction kthread is doing the commit, and already in the unblocked transaction state, writing all dirty metadata extents. So that means it's not getting blocked by free tree build task holding a transaction handle - it may be holding a transaction handle open, but it's a different transaction already, since when the current transaction is in a state >= TRANS_STATE_UNBLOCKED, other tasks can start a new transaction. And each transaction has its own io tree to keep track of the metadata extents it created/dirtied (trans->transaction->dirty_pages), so even if the task building the free space tree keeps COWing and creating free space tree nodes/leaves, it doesn't affect the io tree of the committing transaction - these extent buffers will be written by the transaction used by the free space tree build task when it commits, and not by the one being committed by the transaction kthread. We should have the stack trace in the change log, a link to the reporter's thread and an explanation of how this batching helps anything - I don't see how, as explained above. Thanks. > > > > >> > >> On a large enough btrfs, we have thousands of block groups to go > >> through, thus it will definitely take over 120s and trigger the blocked > >> task warning. > >> > >> Fix the problem by handling 32 block groups in one transaction, and end > >> the transaction when we hit the 32 block groups threshold. > >> > >> This will allow the btrfs-transaction kthread to commit the transaction > >> when needed. > >> > >> And even if during the rebuild the system lost its power, we are still > >> fine as we didn't set FREE_SPACE_TREE_VALID flag, thus on next RW mount > >> we will still rebuild the tree, without utilizing the half built one. > > > > What about if a crash happens and we already processed some block > > groups and the transaction got committed? > > > > On the next mount, when we call populate_free_space_tree() for the > > same block groups, because we always start from the first one, won't > > we get an -EEXIST and fail the mount? > > For example, add_new_free_space_info() doesn't ignore an -EEXIST when > > it calls btrfs_insert_empty_item(), so we will fail the mount when > > trying to build the tree. > > We are still fine: > > btrfs_start_pre_rw_mount() > |- rebuild_free_space_tree = true; > | This is because our fs only has FREE_SPACE_TREE, but no > | FREE_SPACE_TREE_VALID compat_ro flag. > | > |- btrfs_rebuild_free_space_tree() > |- clear_free_space_tree() > Which deletes all the free space tree items. > > Thanks, > Qu > > > > >> > >> Signed-off-by: Qu Wenruo <wqu@suse.com> > >> --- > >> fs/btrfs/free-space-tree.c | 12 ++++++++++++ > >> 1 file changed, 12 insertions(+) > >> > >> diff --git a/fs/btrfs/free-space-tree.c b/fs/btrfs/free-space-tree.c > >> index 7ba50e133921..d8f334724092 100644 > >> --- a/fs/btrfs/free-space-tree.c > >> +++ b/fs/btrfs/free-space-tree.c > >> @@ -1312,6 +1312,8 @@ int btrfs_delete_free_space_tree(struct btrfs_fs_info *fs_info) > >> return btrfs_commit_transaction(trans); > >> } > >> > >> +/* How many block groups can be handled in one transaction. */ > >> +#define FREE_SPACE_TREE_REBUILD_BATCH (32) > >> int btrfs_rebuild_free_space_tree(struct btrfs_fs_info *fs_info) > > > > Please put a blank line after the macro declaration and place the > > macro at the top of the file before the C code. > > > >> { > >> struct btrfs_trans_handle *trans; > >> @@ -1322,6 +1324,7 @@ int btrfs_rebuild_free_space_tree(struct btrfs_fs_info *fs_info) > >> }; > >> struct btrfs_root *free_space_root = btrfs_global_root(fs_info, &key); > >> struct rb_node *node; > >> + unsigned int handled = 0; > >> int ret; > >> > >> trans = btrfs_start_transaction(free_space_root, 1); > >> @@ -1350,6 +1353,15 @@ int btrfs_rebuild_free_space_tree(struct btrfs_fs_info *fs_info) > >> btrfs_end_transaction(trans); > >> return ret; > >> } > >> + handled++; > >> + handled %= FREE_SPACE_TREE_REBUILD_BATCH; > >> + if (!handled) { > >> + btrfs_end_transaction(trans); > >> + trans = btrfs_start_transaction(free_space_root, > >> + FREE_SPACE_TREE_REBUILD_BATCH); > > > > This is a fundamental change here, we are no longer reserving 1 unit > > but 32 instead. > > Plus the first call to btrfs_start_transaction(), before entering the > > loop, uses 1. > > For consistency, both places should reserve the same amount. > > > > But I think that changing the amount of reserved space should be a > > separate change with its own changelog, > > providing a rationale for it. > > > > It's odd indeed that only 1 item is being reserved, but on the other > > hand the block reserve associated with the free space tree root is the > > delayed refs reserve (see btrfs_init_root_block_rsv()), > > so changing the number of units passed to btrfs_start_transaction() > > shouldn't make much difference because the space reserved by a > > transaction goes to the transaction block reserve > > (fs_info->trans_block_rsv). > > > > And given that free space tree build/rebuild only touches the free > > space tree root, I don't see how that makes any difference, or at > > least it's not trivial, hence the separate change only for the space > > reservation > > amount and an explanation about why we are doing it. > > > > Thanks. > > > > > >> + if (IS_ERR(trans)) > >> + return PTR_ERR(trans); > >> + } > >> node = rb_next(node); > >> } > >> > >> -- > >> 2.47.1 > >> > >> >
在 2024/12/17 21:06, Filipe Manana 写道: > On Tue, Dec 17, 2024 at 9:02 AM Qu Wenruo <wqu@suse.com> wrote: >> >> >> >> 在 2024/12/17 19:15, Filipe Manana 写道: >>> On Tue, Dec 17, 2024 at 7:31 AM Qu Wenruo <wqu@suse.com> wrote: >>>> >>>> During free space tree rebuild, we're holding on transaction handler for >>>> the whole rebuild process. >>>> >>>> This will cause blocked task warning for btrfs-transaction kernel >>>> thread, as during the rebuild, btrfs-transaction kthread has to wait for >>>> the running transaction we're holding for free space tree rebuild. >>> >>> Do you have a stack trace? >>> Does that happen where exactly, in the btrfs_attach_transaction() call >>> chain, while waiting on a wait queue? >> >> Unfortunately no, thus it's only based on code analyze. >> >> The original reporter's stack are not related to free space cache >> rebuild, but on metadata writeback wait: >> >> [101497.950425] INFO: task btrfs-transacti:29385 blocked for more than >> 122 seconds. >> [101497.950432] Tainted: G W O 6.12.3-gentoo-x86_64 #1 >> [101497.950435] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" >> disables this message. >> [101497.950437] task:btrfs-transacti state:D stack:0 pid:29385 >> tgid:29385 ppid:2 flags:0x00004000 >> [101497.950442] Call Trace: >> [101497.950444] <TASK> >> [101497.950449] __schedule+0x3f0/0xbd0 >> [101497.950458] schedule+0x27/0xf0 >> [101497.950461] io_schedule+0x46/0x70 >> [101497.950465] folio_wait_bit_common+0x123/0x340 >> [101497.950472] ? __pfx_wake_page_function+0x10/0x10 >> [101497.950477] folio_wait_writeback+0x2b/0x80 >> [101497.950480] __filemap_fdatawait_range+0x7d/0xd0 >> [101497.950489] filemap_fdatawait_range+0x12/0x20 >> [101497.950493] __btrfs_wait_marked_extents.isra.0+0xb8/0xf0 [btrfs] >> [101497.950536] btrfs_write_and_wait_transaction+0x5e/0xd0 [btrfs] >> [101497.950572] btrfs_commit_transaction+0x8d9/0xe80 [btrfs] >> [101497.950606] ? start_transaction+0xc0/0x820 [btrfs] >> [101497.950640] transaction_kthread+0x159/0x1c0 [btrfs] >> [101497.950674] ? __pfx_transaction_kthread+0x10/0x10 [btrfs] >> [101497.950705] kthread+0xd2/0x100 >> [101497.950710] ? __pfx_kthread+0x10/0x10 >> [101497.950714] ret_from_fork+0x34/0x50 >> [101497.950718] ? __pfx_kthread+0x10/0x10 >> [101497.950721] ret_from_fork_asm+0x1a/0x30 >> [101497.950727] </TASK> >> >> Furthermore, the original reporter doesn't experience real hang. >> The operation can finish (very large stream receive), sync and properly >> unmount. >> I believe the original report is mostly caused by the following reasons: >> >> - Too large physical RAM >> 96GiB, causing too huge page cache >> >> - HDD >> Low IOPS >> >> - Mostly random metadata writes? > > Looking at that trace, I don't see how this change relates to it. > That is, how this batching helps prevent that. Exactly, this batching fix is unrelated to the report, just as I said. Thanks, Qu > > That happens when committing a transaction and we're in the unblocked > transaction state, starting writeback of the metadata extents and > waiting for the writeback to finish. > > Since the free space tree build code uses the same transaction handle > to build the whole free space tree, > the transaction kthread would block when calling > btrfs_commit_transaction(), waiting for the free space tree build task > to release its handle - but that's not what is happening here. > > The transaction kthread is doing the commit, and already in the > unblocked transaction state, writing all dirty metadata extents. > So that means it's not getting blocked by free tree build task holding > a transaction handle - it may be holding a transaction handle open, > but it's a different transaction already, since when the current > transaction is in a state >= TRANS_STATE_UNBLOCKED, other tasks can > start a new transaction. > > And each transaction has its own io tree to keep track of the metadata > extents it created/dirtied (trans->transaction->dirty_pages), > so even if the task building the free space tree keeps COWing and > creating free space tree nodes/leaves, it doesn't affect the io tree > of the > committing transaction - these extent buffers will be written by the > transaction used by the free space tree build task when it commits, > and not by the one being committed by the transaction kthread. > > We should have the stack trace in the change log, a link to the > reporter's thread and an explanation of how this batching helps > anything - I don't see how, as explained above. > > Thanks. > > >> >>> >>>> >>>> On a large enough btrfs, we have thousands of block groups to go >>>> through, thus it will definitely take over 120s and trigger the blocked >>>> task warning. >>>> >>>> Fix the problem by handling 32 block groups in one transaction, and end >>>> the transaction when we hit the 32 block groups threshold. >>>> >>>> This will allow the btrfs-transaction kthread to commit the transaction >>>> when needed. >>>> >>>> And even if during the rebuild the system lost its power, we are still >>>> fine as we didn't set FREE_SPACE_TREE_VALID flag, thus on next RW mount >>>> we will still rebuild the tree, without utilizing the half built one. >>> >>> What about if a crash happens and we already processed some block >>> groups and the transaction got committed? >>> >>> On the next mount, when we call populate_free_space_tree() for the >>> same block groups, because we always start from the first one, won't >>> we get an -EEXIST and fail the mount? >>> For example, add_new_free_space_info() doesn't ignore an -EEXIST when >>> it calls btrfs_insert_empty_item(), so we will fail the mount when >>> trying to build the tree. >> >> We are still fine: >> >> btrfs_start_pre_rw_mount() >> |- rebuild_free_space_tree = true; >> | This is because our fs only has FREE_SPACE_TREE, but no >> | FREE_SPACE_TREE_VALID compat_ro flag. >> | >> |- btrfs_rebuild_free_space_tree() >> |- clear_free_space_tree() >> Which deletes all the free space tree items. >> >> Thanks, >> Qu >> >>> >>>> >>>> Signed-off-by: Qu Wenruo <wqu@suse.com> >>>> --- >>>> fs/btrfs/free-space-tree.c | 12 ++++++++++++ >>>> 1 file changed, 12 insertions(+) >>>> >>>> diff --git a/fs/btrfs/free-space-tree.c b/fs/btrfs/free-space-tree.c >>>> index 7ba50e133921..d8f334724092 100644 >>>> --- a/fs/btrfs/free-space-tree.c >>>> +++ b/fs/btrfs/free-space-tree.c >>>> @@ -1312,6 +1312,8 @@ int btrfs_delete_free_space_tree(struct btrfs_fs_info *fs_info) >>>> return btrfs_commit_transaction(trans); >>>> } >>>> >>>> +/* How many block groups can be handled in one transaction. */ >>>> +#define FREE_SPACE_TREE_REBUILD_BATCH (32) >>>> int btrfs_rebuild_free_space_tree(struct btrfs_fs_info *fs_info) >>> >>> Please put a blank line after the macro declaration and place the >>> macro at the top of the file before the C code. >>> >>>> { >>>> struct btrfs_trans_handle *trans; >>>> @@ -1322,6 +1324,7 @@ int btrfs_rebuild_free_space_tree(struct btrfs_fs_info *fs_info) >>>> }; >>>> struct btrfs_root *free_space_root = btrfs_global_root(fs_info, &key); >>>> struct rb_node *node; >>>> + unsigned int handled = 0; >>>> int ret; >>>> >>>> trans = btrfs_start_transaction(free_space_root, 1); >>>> @@ -1350,6 +1353,15 @@ int btrfs_rebuild_free_space_tree(struct btrfs_fs_info *fs_info) >>>> btrfs_end_transaction(trans); >>>> return ret; >>>> } >>>> + handled++; >>>> + handled %= FREE_SPACE_TREE_REBUILD_BATCH; >>>> + if (!handled) { >>>> + btrfs_end_transaction(trans); >>>> + trans = btrfs_start_transaction(free_space_root, >>>> + FREE_SPACE_TREE_REBUILD_BATCH); >>> >>> This is a fundamental change here, we are no longer reserving 1 unit >>> but 32 instead. >>> Plus the first call to btrfs_start_transaction(), before entering the >>> loop, uses 1. >>> For consistency, both places should reserve the same amount. >>> >>> But I think that changing the amount of reserved space should be a >>> separate change with its own changelog, >>> providing a rationale for it. >>> >>> It's odd indeed that only 1 item is being reserved, but on the other >>> hand the block reserve associated with the free space tree root is the >>> delayed refs reserve (see btrfs_init_root_block_rsv()), >>> so changing the number of units passed to btrfs_start_transaction() >>> shouldn't make much difference because the space reserved by a >>> transaction goes to the transaction block reserve >>> (fs_info->trans_block_rsv). >>> >>> And given that free space tree build/rebuild only touches the free >>> space tree root, I don't see how that makes any difference, or at >>> least it's not trivial, hence the separate change only for the space >>> reservation >>> amount and an explanation about why we are doing it. >>> >>> Thanks. >>> >>> >>>> + if (IS_ERR(trans)) >>>> + return PTR_ERR(trans); >>>> + } >>>> node = rb_next(node); >>>> } >>>> >>>> -- >>>> 2.47.1 >>>> >>>> >> >
On Tue, Dec 17, 2024 at 8:38 PM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote: > > > > 在 2024/12/17 21:06, Filipe Manana 写道: > > On Tue, Dec 17, 2024 at 9:02 AM Qu Wenruo <wqu@suse.com> wrote: > >> > >> > >> > >> 在 2024/12/17 19:15, Filipe Manana 写道: > >>> On Tue, Dec 17, 2024 at 7:31 AM Qu Wenruo <wqu@suse.com> wrote: > >>>> > >>>> During free space tree rebuild, we're holding on transaction handler for > >>>> the whole rebuild process. > >>>> > >>>> This will cause blocked task warning for btrfs-transaction kernel > >>>> thread, as during the rebuild, btrfs-transaction kthread has to wait for > >>>> the running transaction we're holding for free space tree rebuild. > >>> > >>> Do you have a stack trace? > >>> Does that happen where exactly, in the btrfs_attach_transaction() call > >>> chain, while waiting on a wait queue? > >> > >> Unfortunately no, thus it's only based on code analyze. > >> > >> The original reporter's stack are not related to free space cache > >> rebuild, but on metadata writeback wait: > >> > >> [101497.950425] INFO: task btrfs-transacti:29385 blocked for more than > >> 122 seconds. > >> [101497.950432] Tainted: G W O 6.12.3-gentoo-x86_64 #1 > >> [101497.950435] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" > >> disables this message. > >> [101497.950437] task:btrfs-transacti state:D stack:0 pid:29385 > >> tgid:29385 ppid:2 flags:0x00004000 > >> [101497.950442] Call Trace: > >> [101497.950444] <TASK> > >> [101497.950449] __schedule+0x3f0/0xbd0 > >> [101497.950458] schedule+0x27/0xf0 > >> [101497.950461] io_schedule+0x46/0x70 > >> [101497.950465] folio_wait_bit_common+0x123/0x340 > >> [101497.950472] ? __pfx_wake_page_function+0x10/0x10 > >> [101497.950477] folio_wait_writeback+0x2b/0x80 > >> [101497.950480] __filemap_fdatawait_range+0x7d/0xd0 > >> [101497.950489] filemap_fdatawait_range+0x12/0x20 > >> [101497.950493] __btrfs_wait_marked_extents.isra.0+0xb8/0xf0 [btrfs] > >> [101497.950536] btrfs_write_and_wait_transaction+0x5e/0xd0 [btrfs] > >> [101497.950572] btrfs_commit_transaction+0x8d9/0xe80 [btrfs] > >> [101497.950606] ? start_transaction+0xc0/0x820 [btrfs] > >> [101497.950640] transaction_kthread+0x159/0x1c0 [btrfs] > >> [101497.950674] ? __pfx_transaction_kthread+0x10/0x10 [btrfs] > >> [101497.950705] kthread+0xd2/0x100 > >> [101497.950710] ? __pfx_kthread+0x10/0x10 > >> [101497.950714] ret_from_fork+0x34/0x50 > >> [101497.950718] ? __pfx_kthread+0x10/0x10 > >> [101497.950721] ret_from_fork_asm+0x1a/0x30 > >> [101497.950727] </TASK> > >> > >> Furthermore, the original reporter doesn't experience real hang. > >> The operation can finish (very large stream receive), sync and properly > >> unmount. > >> I believe the original report is mostly caused by the following reasons: > >> > >> - Too large physical RAM > >> 96GiB, causing too huge page cache > >> > >> - HDD > >> Low IOPS > >> > >> - Mostly random metadata writes? > > > > Looking at that trace, I don't see how this change relates to it. > > That is, how this batching helps prevent that. > > Exactly, this batching fix is unrelated to the report, just as I said. So I think this batch thing is not really appropriate: 1) There's the thing about the space reservation mentioned earlier; 2) Picking 32 or any other number may result in unnecessary churn by calling end_transaction() / star_transaction() too often, as we're likely to be able to process much more than 32 block groups before the transaction kthread attempts to commit the transaction. Instead of that, just periodically call btrfs_should_end_transaction(), and release the handle and start a new one if it returns true. This is all we need since it returns true when the transaction kthread attempts to start the commit (state set to TRANS_STATE_COMMIT_START). This way the kthread won't wait for potentially too long in the transaction's wait queue and trigger any warning about being blocked for too long. Thanks. > > Thanks, > Qu > > > > That happens when committing a transaction and we're in the unblocked > > transaction state, starting writeback of the metadata extents and > > waiting for the writeback to finish. > > > > Since the free space tree build code uses the same transaction handle > > to build the whole free space tree, > > the transaction kthread would block when calling > > btrfs_commit_transaction(), waiting for the free space tree build task > > to release its handle - but that's not what is happening here. > > > > The transaction kthread is doing the commit, and already in the > > unblocked transaction state, writing all dirty metadata extents. > > So that means it's not getting blocked by free tree build task holding > > a transaction handle - it may be holding a transaction handle open, > > but it's a different transaction already, since when the current > > transaction is in a state >= TRANS_STATE_UNBLOCKED, other tasks can > > start a new transaction. > > > > And each transaction has its own io tree to keep track of the metadata > > extents it created/dirtied (trans->transaction->dirty_pages), > > so even if the task building the free space tree keeps COWing and > > creating free space tree nodes/leaves, it doesn't affect the io tree > > of the > > committing transaction - these extent buffers will be written by the > > transaction used by the free space tree build task when it commits, > > and not by the one being committed by the transaction kthread. > > > > We should have the stack trace in the change log, a link to the > > reporter's thread and an explanation of how this batching helps > > anything - I don't see how, as explained above. > > > > Thanks. > > > > > >> > >>> > >>>> > >>>> On a large enough btrfs, we have thousands of block groups to go > >>>> through, thus it will definitely take over 120s and trigger the blocked > >>>> task warning. > >>>> > >>>> Fix the problem by handling 32 block groups in one transaction, and end > >>>> the transaction when we hit the 32 block groups threshold. > >>>> > >>>> This will allow the btrfs-transaction kthread to commit the transaction > >>>> when needed. > >>>> > >>>> And even if during the rebuild the system lost its power, we are still > >>>> fine as we didn't set FREE_SPACE_TREE_VALID flag, thus on next RW mount > >>>> we will still rebuild the tree, without utilizing the half built one. > >>> > >>> What about if a crash happens and we already processed some block > >>> groups and the transaction got committed? > >>> > >>> On the next mount, when we call populate_free_space_tree() for the > >>> same block groups, because we always start from the first one, won't > >>> we get an -EEXIST and fail the mount? > >>> For example, add_new_free_space_info() doesn't ignore an -EEXIST when > >>> it calls btrfs_insert_empty_item(), so we will fail the mount when > >>> trying to build the tree. > >> > >> We are still fine: > >> > >> btrfs_start_pre_rw_mount() > >> |- rebuild_free_space_tree = true; > >> | This is because our fs only has FREE_SPACE_TREE, but no > >> | FREE_SPACE_TREE_VALID compat_ro flag. > >> | > >> |- btrfs_rebuild_free_space_tree() > >> |- clear_free_space_tree() > >> Which deletes all the free space tree items. > >> > >> Thanks, > >> Qu > >> > >>> > >>>> > >>>> Signed-off-by: Qu Wenruo <wqu@suse.com> > >>>> --- > >>>> fs/btrfs/free-space-tree.c | 12 ++++++++++++ > >>>> 1 file changed, 12 insertions(+) > >>>> > >>>> diff --git a/fs/btrfs/free-space-tree.c b/fs/btrfs/free-space-tree.c > >>>> index 7ba50e133921..d8f334724092 100644 > >>>> --- a/fs/btrfs/free-space-tree.c > >>>> +++ b/fs/btrfs/free-space-tree.c > >>>> @@ -1312,6 +1312,8 @@ int btrfs_delete_free_space_tree(struct btrfs_fs_info *fs_info) > >>>> return btrfs_commit_transaction(trans); > >>>> } > >>>> > >>>> +/* How many block groups can be handled in one transaction. */ > >>>> +#define FREE_SPACE_TREE_REBUILD_BATCH (32) > >>>> int btrfs_rebuild_free_space_tree(struct btrfs_fs_info *fs_info) > >>> > >>> Please put a blank line after the macro declaration and place the > >>> macro at the top of the file before the C code. > >>> > >>>> { > >>>> struct btrfs_trans_handle *trans; > >>>> @@ -1322,6 +1324,7 @@ int btrfs_rebuild_free_space_tree(struct btrfs_fs_info *fs_info) > >>>> }; > >>>> struct btrfs_root *free_space_root = btrfs_global_root(fs_info, &key); > >>>> struct rb_node *node; > >>>> + unsigned int handled = 0; > >>>> int ret; > >>>> > >>>> trans = btrfs_start_transaction(free_space_root, 1); > >>>> @@ -1350,6 +1353,15 @@ int btrfs_rebuild_free_space_tree(struct btrfs_fs_info *fs_info) > >>>> btrfs_end_transaction(trans); > >>>> return ret; > >>>> } > >>>> + handled++; > >>>> + handled %= FREE_SPACE_TREE_REBUILD_BATCH; > >>>> + if (!handled) { > >>>> + btrfs_end_transaction(trans); > >>>> + trans = btrfs_start_transaction(free_space_root, > >>>> + FREE_SPACE_TREE_REBUILD_BATCH); > >>> > >>> This is a fundamental change here, we are no longer reserving 1 unit > >>> but 32 instead. > >>> Plus the first call to btrfs_start_transaction(), before entering the > >>> loop, uses 1. > >>> For consistency, both places should reserve the same amount. > >>> > >>> But I think that changing the amount of reserved space should be a > >>> separate change with its own changelog, > >>> providing a rationale for it. > >>> > >>> It's odd indeed that only 1 item is being reserved, but on the other > >>> hand the block reserve associated with the free space tree root is the > >>> delayed refs reserve (see btrfs_init_root_block_rsv()), > >>> so changing the number of units passed to btrfs_start_transaction() > >>> shouldn't make much difference because the space reserved by a > >>> transaction goes to the transaction block reserve > >>> (fs_info->trans_block_rsv). > >>> > >>> And given that free space tree build/rebuild only touches the free > >>> space tree root, I don't see how that makes any difference, or at > >>> least it's not trivial, hence the separate change only for the space > >>> reservation > >>> amount and an explanation about why we are doing it. > >>> > >>> Thanks. > >>> > >>> > >>>> + if (IS_ERR(trans)) > >>>> + return PTR_ERR(trans); > >>>> + } > >>>> node = rb_next(node); > >>>> } > >>>> > >>>> -- > >>>> 2.47.1 > >>>> > >>>> > >> > > >
在 2024/12/18 22:15, Filipe Manana 写道: > On Tue, Dec 17, 2024 at 8:38 PM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote: >> >> >> >> 在 2024/12/17 21:06, Filipe Manana 写道: >>> On Tue, Dec 17, 2024 at 9:02 AM Qu Wenruo <wqu@suse.com> wrote: >>>> >>>> >>>> >>>> 在 2024/12/17 19:15, Filipe Manana 写道: >>>>> On Tue, Dec 17, 2024 at 7:31 AM Qu Wenruo <wqu@suse.com> wrote: >>>>>> >>>>>> During free space tree rebuild, we're holding on transaction handler for >>>>>> the whole rebuild process. >>>>>> >>>>>> This will cause blocked task warning for btrfs-transaction kernel >>>>>> thread, as during the rebuild, btrfs-transaction kthread has to wait for >>>>>> the running transaction we're holding for free space tree rebuild. >>>>> >>>>> Do you have a stack trace? >>>>> Does that happen where exactly, in the btrfs_attach_transaction() call >>>>> chain, while waiting on a wait queue? >>>> >>>> Unfortunately no, thus it's only based on code analyze. >>>> >>>> The original reporter's stack are not related to free space cache >>>> rebuild, but on metadata writeback wait: >>>> >>>> [101497.950425] INFO: task btrfs-transacti:29385 blocked for more than >>>> 122 seconds. >>>> [101497.950432] Tainted: G W O 6.12.3-gentoo-x86_64 #1 >>>> [101497.950435] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" >>>> disables this message. >>>> [101497.950437] task:btrfs-transacti state:D stack:0 pid:29385 >>>> tgid:29385 ppid:2 flags:0x00004000 >>>> [101497.950442] Call Trace: >>>> [101497.950444] <TASK> >>>> [101497.950449] __schedule+0x3f0/0xbd0 >>>> [101497.950458] schedule+0x27/0xf0 >>>> [101497.950461] io_schedule+0x46/0x70 >>>> [101497.950465] folio_wait_bit_common+0x123/0x340 >>>> [101497.950472] ? __pfx_wake_page_function+0x10/0x10 >>>> [101497.950477] folio_wait_writeback+0x2b/0x80 >>>> [101497.950480] __filemap_fdatawait_range+0x7d/0xd0 >>>> [101497.950489] filemap_fdatawait_range+0x12/0x20 >>>> [101497.950493] __btrfs_wait_marked_extents.isra.0+0xb8/0xf0 [btrfs] >>>> [101497.950536] btrfs_write_and_wait_transaction+0x5e/0xd0 [btrfs] >>>> [101497.950572] btrfs_commit_transaction+0x8d9/0xe80 [btrfs] >>>> [101497.950606] ? start_transaction+0xc0/0x820 [btrfs] >>>> [101497.950640] transaction_kthread+0x159/0x1c0 [btrfs] >>>> [101497.950674] ? __pfx_transaction_kthread+0x10/0x10 [btrfs] >>>> [101497.950705] kthread+0xd2/0x100 >>>> [101497.950710] ? __pfx_kthread+0x10/0x10 >>>> [101497.950714] ret_from_fork+0x34/0x50 >>>> [101497.950718] ? __pfx_kthread+0x10/0x10 >>>> [101497.950721] ret_from_fork_asm+0x1a/0x30 >>>> [101497.950727] </TASK> >>>> >>>> Furthermore, the original reporter doesn't experience real hang. >>>> The operation can finish (very large stream receive), sync and properly >>>> unmount. >>>> I believe the original report is mostly caused by the following reasons: >>>> >>>> - Too large physical RAM >>>> 96GiB, causing too huge page cache >>>> >>>> - HDD >>>> Low IOPS >>>> >>>> - Mostly random metadata writes? >>> >>> Looking at that trace, I don't see how this change relates to it. >>> That is, how this batching helps prevent that. >> >> Exactly, this batching fix is unrelated to the report, just as I said. > > So I think this batch thing is not really appropriate: > > 1) There's the thing about the space reservation mentioned earlier; > > 2) Picking 32 or any other number may result in unnecessary churn by > calling end_transaction() / star_transaction() too often, > as we're likely to be able to process much more than 32 block > groups before the transaction kthread attempts to > commit the transaction. > > Instead of that, just periodically call > btrfs_should_end_transaction(), and release the handle and start a new > one if it returns true. That sounds much better indeed. Will go that path. Thanks, Qu > > This is all we need since it returns true when the transaction kthread > attempts to start the commit (state set to TRANS_STATE_COMMIT_START). > This way the kthread won't wait for potentially too long in the > transaction's wait queue and trigger any warning about being blocked > for too long. > > Thanks. > > >> >> Thanks, >> Qu >>> >>> That happens when committing a transaction and we're in the unblocked >>> transaction state, starting writeback of the metadata extents and >>> waiting for the writeback to finish. >>> >>> Since the free space tree build code uses the same transaction handle >>> to build the whole free space tree, >>> the transaction kthread would block when calling >>> btrfs_commit_transaction(), waiting for the free space tree build task >>> to release its handle - but that's not what is happening here. >>> >>> The transaction kthread is doing the commit, and already in the >>> unblocked transaction state, writing all dirty metadata extents. >>> So that means it's not getting blocked by free tree build task holding >>> a transaction handle - it may be holding a transaction handle open, >>> but it's a different transaction already, since when the current >>> transaction is in a state >= TRANS_STATE_UNBLOCKED, other tasks can >>> start a new transaction. >>> >>> And each transaction has its own io tree to keep track of the metadata >>> extents it created/dirtied (trans->transaction->dirty_pages), >>> so even if the task building the free space tree keeps COWing and >>> creating free space tree nodes/leaves, it doesn't affect the io tree >>> of the >>> committing transaction - these extent buffers will be written by the >>> transaction used by the free space tree build task when it commits, >>> and not by the one being committed by the transaction kthread. >>> >>> We should have the stack trace in the change log, a link to the >>> reporter's thread and an explanation of how this batching helps >>> anything - I don't see how, as explained above. >>> >>> Thanks. >>> >>> >>>> >>>>> >>>>>> >>>>>> On a large enough btrfs, we have thousands of block groups to go >>>>>> through, thus it will definitely take over 120s and trigger the blocked >>>>>> task warning. >>>>>> >>>>>> Fix the problem by handling 32 block groups in one transaction, and end >>>>>> the transaction when we hit the 32 block groups threshold. >>>>>> >>>>>> This will allow the btrfs-transaction kthread to commit the transaction >>>>>> when needed. >>>>>> >>>>>> And even if during the rebuild the system lost its power, we are still >>>>>> fine as we didn't set FREE_SPACE_TREE_VALID flag, thus on next RW mount >>>>>> we will still rebuild the tree, without utilizing the half built one. >>>>> >>>>> What about if a crash happens and we already processed some block >>>>> groups and the transaction got committed? >>>>> >>>>> On the next mount, when we call populate_free_space_tree() for the >>>>> same block groups, because we always start from the first one, won't >>>>> we get an -EEXIST and fail the mount? >>>>> For example, add_new_free_space_info() doesn't ignore an -EEXIST when >>>>> it calls btrfs_insert_empty_item(), so we will fail the mount when >>>>> trying to build the tree. >>>> >>>> We are still fine: >>>> >>>> btrfs_start_pre_rw_mount() >>>> |- rebuild_free_space_tree = true; >>>> | This is because our fs only has FREE_SPACE_TREE, but no >>>> | FREE_SPACE_TREE_VALID compat_ro flag. >>>> | >>>> |- btrfs_rebuild_free_space_tree() >>>> |- clear_free_space_tree() >>>> Which deletes all the free space tree items. >>>> >>>> Thanks, >>>> Qu >>>> >>>>> >>>>>> >>>>>> Signed-off-by: Qu Wenruo <wqu@suse.com> >>>>>> --- >>>>>> fs/btrfs/free-space-tree.c | 12 ++++++++++++ >>>>>> 1 file changed, 12 insertions(+) >>>>>> >>>>>> diff --git a/fs/btrfs/free-space-tree.c b/fs/btrfs/free-space-tree.c >>>>>> index 7ba50e133921..d8f334724092 100644 >>>>>> --- a/fs/btrfs/free-space-tree.c >>>>>> +++ b/fs/btrfs/free-space-tree.c >>>>>> @@ -1312,6 +1312,8 @@ int btrfs_delete_free_space_tree(struct btrfs_fs_info *fs_info) >>>>>> return btrfs_commit_transaction(trans); >>>>>> } >>>>>> >>>>>> +/* How many block groups can be handled in one transaction. */ >>>>>> +#define FREE_SPACE_TREE_REBUILD_BATCH (32) >>>>>> int btrfs_rebuild_free_space_tree(struct btrfs_fs_info *fs_info) >>>>> >>>>> Please put a blank line after the macro declaration and place the >>>>> macro at the top of the file before the C code. >>>>> >>>>>> { >>>>>> struct btrfs_trans_handle *trans; >>>>>> @@ -1322,6 +1324,7 @@ int btrfs_rebuild_free_space_tree(struct btrfs_fs_info *fs_info) >>>>>> }; >>>>>> struct btrfs_root *free_space_root = btrfs_global_root(fs_info, &key); >>>>>> struct rb_node *node; >>>>>> + unsigned int handled = 0; >>>>>> int ret; >>>>>> >>>>>> trans = btrfs_start_transaction(free_space_root, 1); >>>>>> @@ -1350,6 +1353,15 @@ int btrfs_rebuild_free_space_tree(struct btrfs_fs_info *fs_info) >>>>>> btrfs_end_transaction(trans); >>>>>> return ret; >>>>>> } >>>>>> + handled++; >>>>>> + handled %= FREE_SPACE_TREE_REBUILD_BATCH; >>>>>> + if (!handled) { >>>>>> + btrfs_end_transaction(trans); >>>>>> + trans = btrfs_start_transaction(free_space_root, >>>>>> + FREE_SPACE_TREE_REBUILD_BATCH); >>>>> >>>>> This is a fundamental change here, we are no longer reserving 1 unit >>>>> but 32 instead. >>>>> Plus the first call to btrfs_start_transaction(), before entering the >>>>> loop, uses 1. >>>>> For consistency, both places should reserve the same amount. >>>>> >>>>> But I think that changing the amount of reserved space should be a >>>>> separate change with its own changelog, >>>>> providing a rationale for it. >>>>> >>>>> It's odd indeed that only 1 item is being reserved, but on the other >>>>> hand the block reserve associated with the free space tree root is the >>>>> delayed refs reserve (see btrfs_init_root_block_rsv()), >>>>> so changing the number of units passed to btrfs_start_transaction() >>>>> shouldn't make much difference because the space reserved by a >>>>> transaction goes to the transaction block reserve >>>>> (fs_info->trans_block_rsv). >>>>> >>>>> And given that free space tree build/rebuild only touches the free >>>>> space tree root, I don't see how that makes any difference, or at >>>>> least it's not trivial, hence the separate change only for the space >>>>> reservation >>>>> amount and an explanation about why we are doing it. >>>>> >>>>> Thanks. >>>>> >>>>> >>>>>> + if (IS_ERR(trans)) >>>>>> + return PTR_ERR(trans); >>>>>> + } >>>>>> node = rb_next(node); >>>>>> } >>>>>> >>>>>> -- >>>>>> 2.47.1 >>>>>> >>>>>> >>>> >>> >>
diff --git a/fs/btrfs/free-space-tree.c b/fs/btrfs/free-space-tree.c index 7ba50e133921..d8f334724092 100644 --- a/fs/btrfs/free-space-tree.c +++ b/fs/btrfs/free-space-tree.c @@ -1312,6 +1312,8 @@ int btrfs_delete_free_space_tree(struct btrfs_fs_info *fs_info) return btrfs_commit_transaction(trans); } +/* How many block groups can be handled in one transaction. */ +#define FREE_SPACE_TREE_REBUILD_BATCH (32) int btrfs_rebuild_free_space_tree(struct btrfs_fs_info *fs_info) { struct btrfs_trans_handle *trans; @@ -1322,6 +1324,7 @@ int btrfs_rebuild_free_space_tree(struct btrfs_fs_info *fs_info) }; struct btrfs_root *free_space_root = btrfs_global_root(fs_info, &key); struct rb_node *node; + unsigned int handled = 0; int ret; trans = btrfs_start_transaction(free_space_root, 1); @@ -1350,6 +1353,15 @@ int btrfs_rebuild_free_space_tree(struct btrfs_fs_info *fs_info) btrfs_end_transaction(trans); return ret; } + handled++; + handled %= FREE_SPACE_TREE_REBUILD_BATCH; + if (!handled) { + btrfs_end_transaction(trans); + trans = btrfs_start_transaction(free_space_root, + FREE_SPACE_TREE_REBUILD_BATCH); + if (IS_ERR(trans)) + return PTR_ERR(trans); + } node = rb_next(node); }
During free space tree rebuild, we're holding on transaction handler for the whole rebuild process. This will cause blocked task warning for btrfs-transaction kernel thread, as during the rebuild, btrfs-transaction kthread has to wait for the running transaction we're holding for free space tree rebuild. On a large enough btrfs, we have thousands of block groups to go through, thus it will definitely take over 120s and trigger the blocked task warning. Fix the problem by handling 32 block groups in one transaction, and end the transaction when we hit the 32 block groups threshold. This will allow the btrfs-transaction kthread to commit the transaction when needed. And even if during the rebuild the system lost its power, we are still fine as we didn't set FREE_SPACE_TREE_VALID flag, thus on next RW mount we will still rebuild the tree, without utilizing the half built one. Signed-off-by: Qu Wenruo <wqu@suse.com> --- fs/btrfs/free-space-tree.c | 12 ++++++++++++ 1 file changed, 12 insertions(+)