Message ID | e4ad6d8e46f084001770fb9dad6ac8df38cd3e2e.1724795624.git.dsterba@suse.com (mailing list archive) |
---|---|
State | New, archived |
Series | Renames and defrag cleanups |
On 2024/8/28 07:25, David Sterba wrote:
> btrfs_cleanup_defrag_inodes() is not called frequently, only in remount
> or unmount, but the way it frees the inodes in fs_info->defrag_inodes
> is inefficient. Each time it needs to locate first node, remove it,
> potentially rebalance tree until it's done. This allows to do a
> conditional reschedule.
>
> For cleanups the rbtree_postorder_for_each_entry_safe() iterator is
> convenient but if the reschedule happens and unlocks fs_info->defrag_inodes_lock
> we can't be sure that the tree is in the same state. If that happens,
> restart the iteration from the beginning.

In that case, isn't the rbtree itself in an inconsistent state, and
restarting will only cause invalid memory access?

So in this particular case, since we can be interrupted, the full tree
balance looks like the only safe way we can go?

Thanks,
Qu

>
> The cleanup operation is kmem_cache_free() which will likely take the
> fast path for most objects.
>
> Signed-off-by: David Sterba <dsterba@suse.com>
> ---
>  fs/btrfs/defrag.c | 17 ++++++++---------
>  1 file changed, 8 insertions(+), 9 deletions(-)
>
> diff --git a/fs/btrfs/defrag.c b/fs/btrfs/defrag.c
> index 41d67065d02b..bd1769dd134c 100644
> --- a/fs/btrfs/defrag.c
> +++ b/fs/btrfs/defrag.c
> @@ -212,19 +212,18 @@ static struct inode_defrag *btrfs_pick_defrag_inode(
>
>  void btrfs_cleanup_defrag_inodes(struct btrfs_fs_info *fs_info)
>  {
> -	struct inode_defrag *defrag;
> -	struct rb_node *node;
> +	struct inode_defrag *defrag, *next;
>
>  	spin_lock(&fs_info->defrag_inodes_lock);
> -	node = rb_first(&fs_info->defrag_inodes);
> -	while (node) {
> -		rb_erase(node, &fs_info->defrag_inodes);
> -		defrag = rb_entry(node, struct inode_defrag, rb_node);
> +again:
> +	rbtree_postorder_for_each_entry_safe(defrag, next, &fs_info->defrag_inodes,
> +					     rb_node) {
>  		kmem_cache_free(btrfs_inode_defrag_cachep, defrag);
>
> -		cond_resched_lock(&fs_info->defrag_inodes_lock);
> -
> -		node = rb_first(&fs_info->defrag_inodes);
> +		if (cond_resched_lock(&fs_info->defrag_inodes_lock)) {
> +			/* Rescheduled and unlocked. */
> +			goto again;
> +		}
>  	}
>  	spin_unlock(&fs_info->defrag_inodes_lock);
>  }
On Wed, Aug 28, 2024 at 08:29:23AM +0930, Qu Wenruo wrote:
>
>
> On 2024/8/28 07:25, David Sterba wrote:
> > btrfs_cleanup_defrag_inodes() is not called frequently, only in remount
> > or unmount, but the way it frees the inodes in fs_info->defrag_inodes
> > is inefficient. Each time it needs to locate first node, remove it,
> > potentially rebalance tree until it's done. This allows to do a
> > conditional reschedule.
> >
> > For cleanups the rbtree_postorder_for_each_entry_safe() iterator is
> > convenient but if the reschedule happens and unlocks fs_info->defrag_inodes_lock
> > we can't be sure that the tree is in the same state. If that happens,
> > restart the iteration from the beginning.
>
> In that case, isn't the rbtree itself in an inconsistent state, and
> restarting will only cause invalid memory access?
>
> So in this particular case, since we can be interrupted, the full tree
> balance looks like the only safe way we can go?

You're right, the nodes get freed so even if the iteration is restarted
it would touch freed memory, IOW rbtree_postorder_for_each_entry_safe()
can't be interrupted. I can drop the reschedule, with the same argument
that it should be relatively fast even for thousands of entries, this
should not hurt for remount/umount context.
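To make the use-after-free concrete, below is a minimal userspace model of the postorder-free pattern being discussed. It is a sketch, not the kernel rbtree API: the tree type, the helper names (leftmost_leaf(), next_postorder(), insert()) and the keys are made up for illustration. It demonstrates exactly the point above: the successor is computed before the current node is freed, so freeing during the walk is safe, but nothing re-links the tree, so restarting the walk after some nodes were freed would dereference freed memory.

```c
#include <stdio.h>
#include <stdlib.h>

struct node {
	int key;
	struct node *left, *right, *parent;
};

/* First node in postorder: descend as deep as possible, preferring left. */
static struct node *leftmost_leaf(struct node *n)
{
	while (n) {
		if (n->left)
			n = n->left;
		else if (n->right)
			n = n->right;
		else
			return n;
	}
	return NULL;
}

/* Postorder successor computed from parent pointers (nodes not yet visited). */
static struct node *next_postorder(struct node *n)
{
	struct node *parent = n->parent;

	if (parent && n == parent->left && parent->right)
		return leftmost_leaf(parent->right);
	return parent;
}

/* Plain unbalanced BST insert, enough for the illustration. */
static void insert(struct node **root, int key)
{
	struct node **link = root, *parent = NULL;
	struct node *n = calloc(1, sizeof(*n));

	n->key = key;
	while (*link) {
		parent = *link;
		link = key < parent->key ? &parent->left : &parent->right;
	}
	n->parent = parent;
	*link = n;
}

int main(void)
{
	struct node *root = NULL, *pos, *next;
	int keys[] = { 50, 30, 70, 20, 40, 60, 80 };

	for (unsigned int i = 0; i < sizeof(keys) / sizeof(keys[0]); i++)
		insert(&root, keys[i]);

	/*
	 * Children are always visited before their parent, and 'next' is
	 * saved before 'pos' is freed, so the teardown needs no erase and
	 * no rebalancing.
	 */
	for (pos = leftmost_leaf(root); pos; pos = next) {
		next = next_postorder(pos);
		printf("freeing %d\n", pos->key);
		free(pos);
	}

	/*
	 * 'root' still points at freed memory here: restarting the walk
	 * from it would be a use-after-free, which is why the reschedule
	 * and restart idea is dropped in the final version below.
	 */
	root = NULL;
	return 0;
}
```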
On 2024/8/28 08:48, David Sterba wrote:
> On Wed, Aug 28, 2024 at 08:29:23AM +0930, Qu Wenruo wrote:
>>
>>
>> On 2024/8/28 07:25, David Sterba wrote:
>>> btrfs_cleanup_defrag_inodes() is not called frequently, only in remount
>>> or unmount, but the way it frees the inodes in fs_info->defrag_inodes
>>> is inefficient. Each time it needs to locate first node, remove it,
>>> potentially rebalance tree until it's done. This allows to do a
>>> conditional reschedule.
>>>
>>> For cleanups the rbtree_postorder_for_each_entry_safe() iterator is
>>> convenient but if the reschedule happens and unlocks fs_info->defrag_inodes_lock
>>> we can't be sure that the tree is in the same state. If that happens,
>>> restart the iteration from the beginning.
>>
>> In that case, isn't the rbtree itself in an inconsistent state, and
>> restarting will only cause invalid memory access?
>>
>> So in this particular case, since we can be interrupted, the full tree
>> balance looks like the only safe way we can go?
>
> You're right, the nodes get freed so even if the iteration is restarted
> it would touch freed memory, IOW rbtree_postorder_for_each_entry_safe()
> can't be interrupted. I can drop the reschedule, with the same argument
> that it should be relatively fast even for thousands of entries, this
> should not hurt for remount/umount context.
>

Considering the autodefrag is only triggered for certain writes, and at
remount (to RO) or unmount time, there should be no more writes, the
solution looks fine.

Thanks,
Qu
On Wed, Aug 28, 2024 at 09:18:11AM +0930, Qu Wenruo wrote:
>
>
> On 2024/8/28 08:48, David Sterba wrote:
> > On Wed, Aug 28, 2024 at 08:29:23AM +0930, Qu Wenruo wrote:
> >>
> >>
> >> On 2024/8/28 07:25, David Sterba wrote:
> >>> btrfs_cleanup_defrag_inodes() is not called frequently, only in remount
> >>> or unmount, but the way it frees the inodes in fs_info->defrag_inodes
> >>> is inefficient. Each time it needs to locate first node, remove it,
> >>> potentially rebalance tree until it's done. This allows to do a
> >>> conditional reschedule.
> >>>
> >>> For cleanups the rbtree_postorder_for_each_entry_safe() iterator is
> >>> convenient but if the reschedule happens and unlocks fs_info->defrag_inodes_lock
> >>> we can't be sure that the tree is in the same state. If that happens,
> >>> restart the iteration from the beginning.
> >>
> >> In that case, isn't the rbtree itself in an inconsistent state, and
> >> restarting will only cause invalid memory access?
> >>
> >> So in this particular case, since we can be interrupted, the full tree
> >> balance looks like the only safe way we can go?
> >
> > You're right, the nodes get freed so even if the iteration is restarted
> > it would touch freed memory, IOW rbtree_postorder_for_each_entry_safe()
> > can't be interrupted. I can drop the reschedule, with the same argument
> > that it should be relatively fast even for thousands of entries, this
> > should not hurt for remount/umount context.
> >
>
> Considering the autodefrag is only triggered for certain writes, and at
> remount (to RO) or unmount time, there should be no more writes, the
> solution looks fine.

Ok, thanks. I'll commit the following updated version:

btrfs: clear defragmented inodes using postorder in btrfs_cleanup_defrag_inodes()

btrfs_cleanup_defrag_inodes() is not called frequently, only in remount
or unmount, but the way it frees the inodes in fs_info->defrag_inodes
is inefficient. Each time it needs to locate first node, remove it,
potentially rebalance tree until it's done. This allows to do a
conditional reschedule.

For cleanups the rbtree_postorder_for_each_entry_safe() iterator is
convenient but we can't reschedule and restart iteration because some of
the tree nodes would be already freed.

The cleanup operation is kmem_cache_free() which will likely take the
fast path for most objects so rescheduling should not be necessary.

Signed-off-by: David Sterba <dsterba@suse.com>

--- a/fs/btrfs/defrag.c
+++ b/fs/btrfs/defrag.c
@@ -212,19 +212,12 @@ static struct inode_defrag *btrfs_pick_defrag_inode(
 
 void btrfs_cleanup_defrag_inodes(struct btrfs_fs_info *fs_info)
 {
-	struct inode_defrag *defrag;
-	struct rb_node *node;
+	struct inode_defrag *defrag, *next;
 
 	spin_lock(&fs_info->defrag_inodes_lock);
-	node = rb_first(&fs_info->defrag_inodes);
-	while (node) {
-		rb_erase(node, &fs_info->defrag_inodes);
-		defrag = rb_entry(node, struct inode_defrag, rb_node);
+	rbtree_postorder_for_each_entry_safe(defrag, next, &fs_info->defrag_inodes,
+					     rb_node) {
 		kmem_cache_free(btrfs_inode_defrag_cachep, defrag);
-
-		cond_resched_lock(&fs_info->defrag_inodes_lock);
-
-		node = rb_first(&fs_info->defrag_inodes);
 	}
 	spin_unlock(&fs_info->defrag_inodes_lock);
 }
On 2024/8/28 09:41, David Sterba wrote:
> On Wed, Aug 28, 2024 at 09:18:11AM +0930, Qu Wenruo wrote:
>>
>>
>> On 2024/8/28 08:48, David Sterba wrote:
>>> On Wed, Aug 28, 2024 at 08:29:23AM +0930, Qu Wenruo wrote:
>>>>
>>>>
>>>> On 2024/8/28 07:25, David Sterba wrote:
>>>>> btrfs_cleanup_defrag_inodes() is not called frequently, only in remount
>>>>> or unmount, but the way it frees the inodes in fs_info->defrag_inodes
>>>>> is inefficient. Each time it needs to locate first node, remove it,
>>>>> potentially rebalance tree until it's done. This allows to do a
>>>>> conditional reschedule.
>>>>>
>>>>> For cleanups the rbtree_postorder_for_each_entry_safe() iterator is
>>>>> convenient but if the reschedule happens and unlocks fs_info->defrag_inodes_lock
>>>>> we can't be sure that the tree is in the same state. If that happens,
>>>>> restart the iteration from the beginning.
>>>>
>>>> In that case, isn't the rbtree itself in an inconsistent state, and
>>>> restarting will only cause invalid memory access?
>>>>
>>>> So in this particular case, since we can be interrupted, the full tree
>>>> balance looks like the only safe way we can go?
>>>
>>> You're right, the nodes get freed so even if the iteration is restarted
>>> it would touch freed memory, IOW rbtree_postorder_for_each_entry_safe()
>>> can't be interrupted. I can drop the reschedule, with the same argument
>>> that it should be relatively fast even for thousands of entries, this
>>> should not hurt for remount/umount context.
>>>
>>
>> Considering the autodefrag is only triggered for certain writes, and at
>> remount (to RO) or unmount time, there should be no more writes, the
>> solution looks fine.
>
> Ok, thanks. I'll commit the following updated version:
>
> btrfs: clear defragmented inodes using postorder in btrfs_cleanup_defrag_inodes()
>
> btrfs_cleanup_defrag_inodes() is not called frequently, only in remount
> or unmount, but the way it frees the inodes in fs_info->defrag_inodes
> is inefficient. Each time it needs to locate first node, remove it,
> potentially rebalance tree until it's done. This allows to do a
> conditional reschedule.
>
> For cleanups the rbtree_postorder_for_each_entry_safe() iterator is
> convenient but we can't reschedule and restart iteration because some of
> the tree nodes would be already freed.
>
> The cleanup operation is kmem_cache_free() which will likely take the
> fast path for most objects so rescheduling should not be necessary.
>
> Signed-off-by: David Sterba <dsterba@suse.com>
>
> --- a/fs/btrfs/defrag.c
> +++ b/fs/btrfs/defrag.c
> @@ -212,19 +212,12 @@ static struct inode_defrag *btrfs_pick_defrag_inode(
>
>  void btrfs_cleanup_defrag_inodes(struct btrfs_fs_info *fs_info)
>  {
> -	struct inode_defrag *defrag;
> -	struct rb_node *node;
> +	struct inode_defrag *defrag, *next;
>
>  	spin_lock(&fs_info->defrag_inodes_lock);
> -	node = rb_first(&fs_info->defrag_inodes);
> -	while (node) {
> -		rb_erase(node, &fs_info->defrag_inodes);
> -		defrag = rb_entry(node, struct inode_defrag, rb_node);
> +	rbtree_postorder_for_each_entry_safe(defrag, next, &fs_info->defrag_inodes,
> +					     rb_node) {
>  		kmem_cache_free(btrfs_inode_defrag_cachep, defrag);
> -
> -		cond_resched_lock(&fs_info->defrag_inodes_lock);
> -
> -		node = rb_first(&fs_info->defrag_inodes);
>  	}

Since it's just a simple kmem_cache_free() line, there is no need for
the brackets.

Otherwise the whole series looks good to me.

Reviewed-by: Qu Wenruo <wqu@suse.com>

Thanks,
Qu

>  	spin_unlock(&fs_info->defrag_inodes_lock);
>  }
>
diff --git a/fs/btrfs/defrag.c b/fs/btrfs/defrag.c
index 41d67065d02b..bd1769dd134c 100644
--- a/fs/btrfs/defrag.c
+++ b/fs/btrfs/defrag.c
@@ -212,19 +212,18 @@ static struct inode_defrag *btrfs_pick_defrag_inode(
 
 void btrfs_cleanup_defrag_inodes(struct btrfs_fs_info *fs_info)
 {
-	struct inode_defrag *defrag;
-	struct rb_node *node;
+	struct inode_defrag *defrag, *next;
 
 	spin_lock(&fs_info->defrag_inodes_lock);
-	node = rb_first(&fs_info->defrag_inodes);
-	while (node) {
-		rb_erase(node, &fs_info->defrag_inodes);
-		defrag = rb_entry(node, struct inode_defrag, rb_node);
+again:
+	rbtree_postorder_for_each_entry_safe(defrag, next, &fs_info->defrag_inodes,
+					     rb_node) {
 		kmem_cache_free(btrfs_inode_defrag_cachep, defrag);
 
-		cond_resched_lock(&fs_info->defrag_inodes_lock);
-
-		node = rb_first(&fs_info->defrag_inodes);
+		if (cond_resched_lock(&fs_info->defrag_inodes_lock)) {
+			/* Rescheduled and unlocked. */
+			goto again;
+		}
 	}
 	spin_unlock(&fs_info->defrag_inodes_lock);
 }
btrfs_cleanup_defrag_inodes() is not called frequently, only in remount
or unmount, but the way it frees the inodes in fs_info->defrag_inodes
is inefficient. Each time it needs to locate first node, remove it,
potentially rebalance tree until it's done. This allows to do a
conditional reschedule.

For cleanups the rbtree_postorder_for_each_entry_safe() iterator is
convenient but if the reschedule happens and unlocks fs_info->defrag_inodes_lock
we can't be sure that the tree is in the same state. If that happens,
restart the iteration from the beginning.

The cleanup operation is kmem_cache_free() which will likely take the
fast path for most objects.

Signed-off-by: David Sterba <dsterba@suse.com>
---
 fs/btrfs/defrag.c | 17 ++++++++---------
 1 file changed, 8 insertions(+), 9 deletions(-)
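For reference, the iterator this patch relies on is defined in include/linux/rbtree.h; its shape is roughly as follows (paraphrased from memory, so treat it as an approximation and check the exact definition in your tree). The property the whole discussion hinges on is visible here: the successor n is computed from pos before the loop body runs, which is what makes kmem_cache_free(pos) safe inside the body, and also why the walk cannot be resumed once nodes have been freed.

```c
/*
 * Approximate shape of the kernel iterator (not verbatim): 'n' is
 * loaded from 'pos' before the body executes, so the body may free
 * 'pos', but the tree is never re-linked and a restart is unsafe.
 */
#define rbtree_postorder_for_each_entry_safe(pos, n, root, field)	\
	for (pos = rb_entry_safe(rb_first_postorder(root),		\
				 typeof(*pos), field);			\
	     pos && ({ n = rb_entry_safe(rb_next_postorder(&pos->field), \
				 typeof(*pos), field); 1; });		\
	     pos = n)
```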