
[09/12] btrfs: clear defragmented inodes using postorder in btrfs_cleanup_defrag_inodes()

Message ID e4ad6d8e46f084001770fb9dad6ac8df38cd3e2e.1724795624.git.dsterba@suse.com (mailing list archive)
State New
Series: Renames and defrag cleanups

Commit Message

David Sterba Aug. 27, 2024, 9:55 p.m. UTC
btrfs_cleanup_defrag_inodes() is not called frequently, only on remount
or unmount, but the way it frees the inodes in fs_info->defrag_inodes
is inefficient. Each time it needs to locate the first node, remove it
and potentially rebalance the tree, until the tree is empty. The upside
is that this allows a conditional reschedule between iterations.

For cleanups, the rbtree_postorder_for_each_entry_safe() iterator is
convenient, but if the reschedule happens and fs_info->defrag_inodes_lock
is unlocked, we can't be sure that the tree is still in the same state.
If that happens, restart the iteration from the beginning.

The cleanup operation is kmem_cache_free(), which will likely take the
fast path for most objects.

Signed-off-by: David Sterba <dsterba@suse.com>
---
 fs/btrfs/defrag.c | 17 ++++++++---------
 1 file changed, 8 insertions(+), 9 deletions(-)
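
For context on the loop being replaced: cond_resched_lock() drops the given
spinlock, reschedules if needed, reacquires the lock, and returns nonzero
only when it actually did so. Below is an annotated restatement of the
pre-patch loop (a sketch for illustration only, with a hypothetical function
name, as it would sit in fs/btrfs/defrag.c), showing why that pattern
tolerates the lock drop:

/*
 * Annotated sketch of the pre-patch cleanup loop.  Because each pass
 * erases one node and then re-reads rb_first() after the possible lock
 * drop inside cond_resched_lock(), it never holds a stale pointer into
 * the tree, at the cost of an rb_erase() and rebalance per entry.
 */
static void cleanup_defrag_inodes_old(struct btrfs_fs_info *fs_info)
{
	struct inode_defrag *defrag;
	struct rb_node *node;

	spin_lock(&fs_info->defrag_inodes_lock);
	node = rb_first(&fs_info->defrag_inodes);
	while (node) {
		rb_erase(node, &fs_info->defrag_inodes);
		defrag = rb_entry(node, struct inode_defrag, rb_node);
		kmem_cache_free(btrfs_inode_defrag_cachep, defrag);

		/* Nonzero return means the lock was dropped and reacquired. */
		cond_resched_lock(&fs_info->defrag_inodes_lock);

		/* Safe regardless: the next node is looked up fresh. */
		node = rb_first(&fs_info->defrag_inodes);
	}
	spin_unlock(&fs_info->defrag_inodes_lock);
}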

Comments

Qu Wenruo Aug. 27, 2024, 10:59 p.m. UTC | #1
On 2024/8/28 07:25, David Sterba wrote:
> btrfs_cleanup_defrag_inodes() is not called frequently, only in remount
> or unmount, but the way it frees the inodes in fs_info->defrag_inodes
> is inefficient. Each time it needs to locate first node, remove it,
> potentially rebalance tree until it's done. This allows to do a
> conditional reschedule.
> 
> For cleanups the rbtree_postorder_for_each_entry_safe() iterator is
> convenient but if the reschedule happens and unlocks fs_info->defrag_inodes_lock
> we can't be sure that the tree is in the same state. If that happens,
> restart the iteration from the beginning.

In that case, isn't the rbtree itself in an inconsistent state, and
wouldn't restarting only cause invalid memory accesses?

So in this particular case, since we can be interrupted, the full tree
rebalance looks like the only safe way to go?

Thanks,
Qu

> 
> The cleanup operation is kmem_cache_free() which will likely take the
> fast path for most objects.
> 
> Signed-off-by: David Sterba <dsterba@suse.com>
> ---
>   fs/btrfs/defrag.c | 17 ++++++++---------
>   1 file changed, 8 insertions(+), 9 deletions(-)
> 
> diff --git a/fs/btrfs/defrag.c b/fs/btrfs/defrag.c
> index 41d67065d02b..bd1769dd134c 100644
> --- a/fs/btrfs/defrag.c
> +++ b/fs/btrfs/defrag.c
> @@ -212,19 +212,18 @@ static struct inode_defrag *btrfs_pick_defrag_inode(
>   
>   void btrfs_cleanup_defrag_inodes(struct btrfs_fs_info *fs_info)
>   {
> -	struct inode_defrag *defrag;
> -	struct rb_node *node;
> +	struct inode_defrag *defrag, *next;
>   
>   	spin_lock(&fs_info->defrag_inodes_lock);
> -	node = rb_first(&fs_info->defrag_inodes);
> -	while (node) {
> -		rb_erase(node, &fs_info->defrag_inodes);
> -		defrag = rb_entry(node, struct inode_defrag, rb_node);
> +again:
> +	rbtree_postorder_for_each_entry_safe(defrag, next, &fs_info->defrag_inodes,
> +					     rb_node) {
>   		kmem_cache_free(btrfs_inode_defrag_cachep, defrag);
>   
> -		cond_resched_lock(&fs_info->defrag_inodes_lock);
> -
> -		node = rb_first(&fs_info->defrag_inodes);
> +		if (cond_resched_lock(&fs_info->defrag_inodes_lock)) {
> +			/* Rescheduled and unlocked. */
> +			goto again;
> +		}
>   	}
>   	spin_unlock(&fs_info->defrag_inodes_lock);
>   }
David Sterba Aug. 27, 2024, 11:18 p.m. UTC | #2
On Wed, Aug 28, 2024 at 08:29:23AM +0930, Qu Wenruo wrote:
> 
> 
> On 2024/8/28 07:25, David Sterba wrote:
> > btrfs_cleanup_defrag_inodes() is not called frequently, only in remount
> > or unmount, but the way it frees the inodes in fs_info->defrag_inodes
> > is inefficient. Each time it needs to locate first node, remove it,
> > potentially rebalance tree until it's done. This allows to do a
> > conditional reschedule.
> > 
> > For cleanups the rbtree_postorder_for_each_entry_safe() iterator is
> > convenient but if the reschedule happens and unlocks fs_info->defrag_inodes_lock
> > we can't be sure that the tree is in the same state. If that happens,
> > restart the iteration from the beginning.
> 
> In that case, isn't the rbtree itself in an inconsistent state, and 
> restarting will only cause invalid memory access?
> 
> So in this particular case, since we can be interrupted, the full tree 
> balance looks like the only safe way we can go?

You're right, the nodes get freed, so even if the iteration were restarted
it would touch freed memory; IOW rbtree_postorder_for_each_entry_safe()
can't be interrupted. I can drop the reschedule, with the argument that it
should be relatively fast even for thousands of entries; this should not
hurt in the remount/umount context.
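
For reference, the iterator is defined in include/linux/rbtree.h
approximately as follows (quoted from memory of recent kernels, so treat it
as a sketch): the next entry n is computed from pos before the loop body
runs, which is why freeing the current entry during the walk is fine, while
dropping the lock and restarting from the first postorder node would revisit
entries that were already freed, since nothing is erased from the tree along
the way.

/*
 * From include/linux/rbtree.h (approximate): n is fetched before the body
 * runs, so the body may free pos; a restart would walk freed entries again.
 */
#define rbtree_postorder_for_each_entry_safe(pos, n, root, field) \
	for (pos = rb_entry_safe(rb_first_postorder(root), typeof(*pos), field); \
	     pos && ({ n = rb_entry_safe(rb_next_postorder(&pos->field), \
			typeof(*pos), field); 1; }); \
	     pos = n)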
Qu Wenruo Aug. 27, 2024, 11:48 p.m. UTC | #3
On 2024/8/28 08:48, David Sterba wrote:
> On Wed, Aug 28, 2024 at 08:29:23AM +0930, Qu Wenruo wrote:
>>
>>
>> On 2024/8/28 07:25, David Sterba wrote:
>>> btrfs_cleanup_defrag_inodes() is not called frequently, only in remount
>>> or unmount, but the way it frees the inodes in fs_info->defrag_inodes
>>> is inefficient. Each time it needs to locate first node, remove it,
>>> potentially rebalance tree until it's done. This allows to do a
>>> conditional reschedule.
>>>
>>> For cleanups the rbtree_postorder_for_each_entry_safe() iterator is
>>> convenient but if the reschedule happens and unlocks fs_info->defrag_inodes_lock
>>> we can't be sure that the tree is in the same state. If that happens,
>>> restart the iteration from the beginning.
>>
>> In that case, isn't the rbtree itself in an inconsistent state, and
>> restarting will only cause invalid memory access?
>>
>> So in this particular case, since we can be interrupted, the full tree
>> balance looks like the only safe way we can go?
>
> You're right, the nodes get freed so even if the iteration is restarted
> it would touch freed memory, IOW rbtree_postorder_for_each_entry_safe()
> can't be interrupted. I can drop the reschedule, with the same argument
> that it should be relatively fast even for thousands of entries, this
> should not hurt for remount/umount context.
>

Considering that autodefrag is only triggered for certain writes, and that
at remount (to RO) or unmount time there should be no more writes, the
solution looks fine.

Thanks,
Qu
David Sterba Aug. 28, 2024, 12:11 a.m. UTC | #4
On Wed, Aug 28, 2024 at 09:18:11AM +0930, Qu Wenruo wrote:
> 
> 
> On 2024/8/28 08:48, David Sterba wrote:
> > On Wed, Aug 28, 2024 at 08:29:23AM +0930, Qu Wenruo wrote:
> >>
> >>
> >> On 2024/8/28 07:25, David Sterba wrote:
> >>> btrfs_cleanup_defrag_inodes() is not called frequently, only in remount
> >>> or unmount, but the way it frees the inodes in fs_info->defrag_inodes
> >>> is inefficient. Each time it needs to locate first node, remove it,
> >>> potentially rebalance tree until it's done. This allows to do a
> >>> conditional reschedule.
> >>>
> >>> For cleanups the rbtree_postorder_for_each_entry_safe() iterator is
> >>> convenient but if the reschedule happens and unlocks fs_info->defrag_inodes_lock
> >>> we can't be sure that the tree is in the same state. If that happens,
> >>> restart the iteration from the beginning.
> >>
> >> In that case, isn't the rbtree itself in an inconsistent state, and
> >> restarting will only cause invalid memory access?
> >>
> >> So in this particular case, since we can be interrupted, the full tree
> >> balance looks like the only safe way we can go?
> >
> > You're right, the nodes get freed so even if the iteration is restarted
> > it would touch freed memory, IOW rbtree_postorder_for_each_entry_safe()
> > can't be interrupted. I can drop the reschedule, with the same argument
> > that it should be relatively fast even for thousands of entries, this
> > should not hurt for remount/umount context.
> >
> 
> Considering the autodefrag is only triggered for certain writes, and at
> remount (to RO) or unmount time, there should be no more writes, the
> solution looks fine.

Ok, thanks. I'll commit the following updated version:

btrfs: clear defragmented inodes using postorder in btrfs_cleanup_defrag_inodes()

btrfs_cleanup_defrag_inodes() is not called frequently, only on remount
or unmount, but the way it frees the inodes in fs_info->defrag_inodes
is inefficient. Each time it needs to locate the first node, remove it
and potentially rebalance the tree, until the tree is empty. The upside
is that this allows a conditional reschedule between iterations.

For cleanups, the rbtree_postorder_for_each_entry_safe() iterator is
convenient, but we can't reschedule and restart the iteration because some
of the tree nodes would already be freed.

The cleanup operation is kmem_cache_free(), which will likely take the
fast path for most objects, so rescheduling should not be necessary.

Signed-off-by: David Sterba <dsterba@suse.com>

--- a/fs/btrfs/defrag.c
+++ b/fs/btrfs/defrag.c
@@ -212,19 +212,12 @@ static struct inode_defrag *btrfs_pick_defrag_inode(
 
 void btrfs_cleanup_defrag_inodes(struct btrfs_fs_info *fs_info)
 {
-       struct inode_defrag *defrag;
-       struct rb_node *node;
+       struct inode_defrag *defrag, *next;
 
        spin_lock(&fs_info->defrag_inodes_lock);
-       node = rb_first(&fs_info->defrag_inodes);
-       while (node) {
-               rb_erase(node, &fs_info->defrag_inodes);
-               defrag = rb_entry(node, struct inode_defrag, rb_node);
+       rbtree_postorder_for_each_entry_safe(defrag, next, &fs_info->defrag_inodes,
+                                            rb_node) {
                kmem_cache_free(btrfs_inode_defrag_cachep, defrag);
-
-               cond_resched_lock(&fs_info->defrag_inodes_lock);
-
-               node = rb_first(&fs_info->defrag_inodes);
        }
        spin_unlock(&fs_info->defrag_inodes_lock);
 }
Qu Wenruo Aug. 28, 2024, 12:13 a.m. UTC | #5
On 2024/8/28 09:41, David Sterba wrote:
> On Wed, Aug 28, 2024 at 09:18:11AM +0930, Qu Wenruo wrote:
>>
>>
>> On 2024/8/28 08:48, David Sterba wrote:
>>> On Wed, Aug 28, 2024 at 08:29:23AM +0930, Qu Wenruo wrote:
>>>>
>>>>
>>>> On 2024/8/28 07:25, David Sterba wrote:
>>>>> btrfs_cleanup_defrag_inodes() is not called frequently, only in remount
>>>>> or unmount, but the way it frees the inodes in fs_info->defrag_inodes
>>>>> is inefficient. Each time it needs to locate first node, remove it,
>>>>> potentially rebalance tree until it's done. This allows to do a
>>>>> conditional reschedule.
>>>>>
>>>>> For cleanups the rbtree_postorder_for_each_entry_safe() iterator is
>>>>> convenient but if the reschedule happens and unlocks fs_info->defrag_inodes_lock
>>>>> we can't be sure that the tree is in the same state. If that happens,
>>>>> restart the iteration from the beginning.
>>>>
>>>> In that case, isn't the rbtree itself in an inconsistent state, and
>>>> restarting will only cause invalid memory access?
>>>>
>>>> So in this particular case, since we can be interrupted, the full tree
>>>> balance looks like the only safe way we can go?
>>>
>>> You're right, the nodes get freed so even if the iteration is restarted
>>> it would touch freed memory, IOW rbtree_postorder_for_each_entry_safe()
>>> can't be interrupted. I can drop the reschedule, with the same argument
>>> that it should be relatively fast even for thousands of entries, this
>>> should not hurt for remount/umount context.
>>>
>>
>> Considering the autodefrag is only triggered for certain writes, and at
>> remount (to RO) or unmount time, there should be no more writes, the
>> solution looks fine.
>
> Ok, thanks. I'll commit the following updated version:
>
> btrfs: clear defragmented inodes using postorder in btrfs_cleanup_defrag_inodes()
>
> btrfs_cleanup_defrag_inodes() is not called frequently, only in remount
> or unmount, but the way it frees the inodes in fs_info->defrag_inodes
> is inefficient. Each time it needs to locate first node, remove it,
> potentially rebalance tree until it's done. This allows to do a
> conditional reschedule.
>
> For cleanups the rbtree_postorder_for_each_entry_safe() iterator is
> convenient but we can't reschedule and restart iteration because some of
> the tree nodes would be already freed.
>
> The cleanup operation is kmem_cache_free() which will likely take the
> fast path for most objects so rescheduling should not be necessary.
>
> Signed-off-by: David Sterba <dsterba@suse.com>
>
> --- a/fs/btrfs/defrag.c
> +++ b/fs/btrfs/defrag.c
> @@ -212,19 +212,12 @@ static struct inode_defrag *btrfs_pick_defrag_inode(
>
>   void btrfs_cleanup_defrag_inodes(struct btrfs_fs_info *fs_info)
>   {
> -       struct inode_defrag *defrag;
> -       struct rb_node *node;
> +       struct inode_defrag *defrag, *next;
>
>          spin_lock(&fs_info->defrag_inodes_lock);
> -       node = rb_first(&fs_info->defrag_inodes);
> -       while (node) {
> -               rb_erase(node, &fs_info->defrag_inodes);
> -               defrag = rb_entry(node, struct inode_defrag, rb_node);
> +       rbtree_postorder_for_each_entry_safe(defrag, next, &fs_info->defrag_inodes,
> +                                            rb_node) {
>                  kmem_cache_free(btrfs_inode_defrag_cachep, defrag);
> -
> -               cond_resched_lock(&fs_info->defrag_inodes_lock);
> -
> -               node = rb_first(&fs_info->defrag_inodes);
>          }

Since it's just a simple kmem_cache_free() line, there is no need for
the braces.

Otherwise the whole series looks good to me.

Reviewed-by: Qu Wenruo <wqu@suse.com>

Thanks,
Qu
>          spin_unlock(&fs_info->defrag_inodes_lock);
>   }
>
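
Following Qu's suggestion above to drop the braces around the single
kmem_cache_free() call, the version that presumably landed looks roughly
like this (a sketch based on the v2 posted in this thread, not copied from
the final commit):

void btrfs_cleanup_defrag_inodes(struct btrfs_fs_info *fs_info)
{
	struct inode_defrag *defrag, *next;

	spin_lock(&fs_info->defrag_inodes_lock);
	/* Free every entry in one postorder pass, without dropping the lock. */
	rbtree_postorder_for_each_entry_safe(defrag, next, &fs_info->defrag_inodes,
					     rb_node)
		kmem_cache_free(btrfs_inode_defrag_cachep, defrag);
	spin_unlock(&fs_info->defrag_inodes_lock);
}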

Patch

diff --git a/fs/btrfs/defrag.c b/fs/btrfs/defrag.c
index 41d67065d02b..bd1769dd134c 100644
--- a/fs/btrfs/defrag.c
+++ b/fs/btrfs/defrag.c
@@ -212,19 +212,18 @@  static struct inode_defrag *btrfs_pick_defrag_inode(
 
 void btrfs_cleanup_defrag_inodes(struct btrfs_fs_info *fs_info)
 {
-	struct inode_defrag *defrag;
-	struct rb_node *node;
+	struct inode_defrag *defrag, *next;
 
 	spin_lock(&fs_info->defrag_inodes_lock);
-	node = rb_first(&fs_info->defrag_inodes);
-	while (node) {
-		rb_erase(node, &fs_info->defrag_inodes);
-		defrag = rb_entry(node, struct inode_defrag, rb_node);
+again:
+	rbtree_postorder_for_each_entry_safe(defrag, next, &fs_info->defrag_inodes,
+					     rb_node) {
 		kmem_cache_free(btrfs_inode_defrag_cachep, defrag);
 
-		cond_resched_lock(&fs_info->defrag_inodes_lock);
-
-		node = rb_first(&fs_info->defrag_inodes);
+		if (cond_resched_lock(&fs_info->defrag_inodes_lock)) {
+			/* Rescheduled and unlocked. */
+			goto again;
+		}
 	}
 	spin_unlock(&fs_info->defrag_inodes_lock);
 }