diff mbox series

[1/2] btrfs: fix leak of source device allocation state after device replace

Message ID e695f292be17c831e6024a743365737372a7a7aa.1682528751.git.fdmanana@suse.com (mailing list archive)
State New, archived
Headers show
Series btrfs: fix extent state leaks after device replace | expand

Commit Message

Filipe Manana April 26, 2023, 5:13 p.m. UTC
From: Filipe Manana <fdmanana@suse.com>

When a device replace finishes, the source device is freed by calling
btrfs_free_device() at btrfs_rm_dev_replace_free_srcdev(), but the
allocation state, tracked in the device's alloc_state io tree, is never
freed.

This is a regression recently introduced by commit f0bb5474cff0 ("btrfs:
remove redundant release of btrfs_device::alloc_state"), which removed a
call to extent_io_tree_release() from btrfs_free_device(), with the
rationale that btrfs_close_one_device() already releases the allocation
state from a device and btrfs_close_one_device() is always called before
a device is freed with btrfs_free_device(). However that is not true for
the device replace case, as btrfs_free_device() is called without any
previous call to btrfs_close_one_device().

The issue is trivial to reproduce, for example, by running test btrfs/027
from fstests:

  $ ./check btrfs/027
  $ rmmod btrfs
  $ dmesg
  (...)
  [84519.395485] BTRFS info (device sdc): dev_replace from <missing disk> (devid 2) to /dev/sdg started
  [84519.466224] BTRFS info (device sdc): dev_replace from <missing disk> (devid 2) to /dev/sdg finished
  [84519.552251] BTRFS info (device sdc): scrub: started on devid 1
  [84519.552277] BTRFS info (device sdc): scrub: started on devid 2
  [84519.552332] BTRFS info (device sdc): scrub: started on devid 3
  [84519.552705] BTRFS info (device sdc): scrub: started on devid 4
  [84519.604261] BTRFS info (device sdc): scrub: finished on devid 4 with status: 0
  [84519.609374] BTRFS info (device sdc): scrub: finished on devid 3 with status: 0
  [84519.610818] BTRFS info (device sdc): scrub: finished on devid 1 with status: 0
  [84519.610927] BTRFS info (device sdc): scrub: finished on devid 2 with status: 0
  [84559.503795] BTRFS: state leak: start 1048576 end 1351614463 state 1 in tree 1 refs 1
  [84559.506764] BTRFS: state leak: start 1048576 end 1347420159 state 1 in tree 1 refs 1
  [84559.510294] BTRFS: state leak: start 1048576 end 1351614463 state 1 in tree 1 refs 1

So fix this by adding back the call to extent_io_tree_release() at
btrfs_free_device().

Fixes: f0bb5474cff0 ("btrfs: remove redundant release of btrfs_device::alloc_state")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/volumes.c | 1 +
 1 file changed, 1 insertion(+)

Comments

Qu Wenruo April 27, 2023, 3:15 a.m. UTC | #1
On 2023/4/27 01:13, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
> 
> When a device replace finishes, the source device is freed by calling
> btrfs_free_device() at btrfs_rm_dev_replace_free_srcdev(), but the
> allocation state, tracked in the device's alloc_state io tree, is never
> freed.
> 
> This is a regression recently introduced by commit f0bb5474cff0 ("btrfs:
> remove redundant release of btrfs_device::alloc_state"), which removed a
> call to extent_io_tree_release() from btrfs_free_device(), with the
> rationale that btrfs_close_one_device() already releases the allocation
> state from a device and btrfs_close_one_device() is always called before
> a device is freed with btrfs_free_device(). However that is not true for
> the device replace case, as btrfs_free_device() is called without any
> previous call to btrfs_close_one_device().
> 
> The issue is trivial to reproduce, for example, by running test btrfs/027
> from fstests:
> 
>    $ ./check btrfs/027
>    $ rmmod btrfs
>    $ dmesg
>    (...)
>    [84519.395485] BTRFS info (device sdc): dev_replace from <missing disk> (devid 2) to /dev/sdg started
>    [84519.466224] BTRFS info (device sdc): dev_replace from <missing disk> (devid 2) to /dev/sdg finished
>    [84519.552251] BTRFS info (device sdc): scrub: started on devid 1
>    [84519.552277] BTRFS info (device sdc): scrub: started on devid 2
>    [84519.552332] BTRFS info (device sdc): scrub: started on devid 3
>    [84519.552705] BTRFS info (device sdc): scrub: started on devid 4
>    [84519.604261] BTRFS info (device sdc): scrub: finished on devid 4 with status: 0
>    [84519.609374] BTRFS info (device sdc): scrub: finished on devid 3 with status: 0
>    [84519.610818] BTRFS info (device sdc): scrub: finished on devid 1 with status: 0
>    [84519.610927] BTRFS info (device sdc): scrub: finished on devid 2 with status: 0
>    [84559.503795] BTRFS: state leak: start 1048576 end 1351614463 state 1 in tree 1 refs 1
>    [84559.506764] BTRFS: state leak: start 1048576 end 1347420159 state 1 in tree 1 refs 1
>    [84559.510294] BTRFS: state leak: start 1048576 end 1351614463 state 1 in tree 1 refs 1
> 
> So fix this by adding back the call to extent_io_tree_release() at
> btrfs_free_device().
> 
> Fixes: f0bb5474cff0 ("btrfs: remove redundant release of btrfs_device::alloc_state")
> Signed-off-by: Filipe Manana <fdmanana@suse.com>

Reviewed-by: Qu Wenruo <wqu@suse.com>

Thanks,
Qu
> ---
>   fs/btrfs/volumes.c | 1 +
>   1 file changed, 1 insertion(+)
> 
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index 03f52e4a20aa..841e799dece5 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -395,6 +395,7 @@ void btrfs_free_device(struct btrfs_device *device)
>   {
>   	WARN_ON(!list_empty(&device->post_commit_list));
>   	rcu_string_free(device->name);
> +	extent_io_tree_release(&device->alloc_state);
>   	btrfs_destroy_dev_zone_info(device);
>   	kfree(device);
>   }
Anand Jain April 27, 2023, 3:16 p.m. UTC | #2
On 27/04/2023 01:13, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
> 
> When a device replace finishes, the source device is freed by calling
> btrfs_free_device() at btrfs_rm_dev_replace_free_srcdev(), but the
> allocation state, tracked in the device's alloc_state io tree, is never
> freed.
> 
> This is a regression recently introduced by commit f0bb5474cff0 ("btrfs:
> remove redundant release of btrfs_device::alloc_state"), which removed a
> call to extent_io_tree_release() from btrfs_free_device(), with the
> rationale that btrfs_close_one_device() already releases the allocation
> state from a device and btrfs_close_one_device() is always called before
> a device is freed with btrfs_free_device(). However that is not true for
> the device replace case, as btrfs_free_device() is called without any
> previous call to btrfs_close_one_device().
> 
> The issue is trivial to reproduce, for example, by running test btrfs/027
> from fstests:
> 
>    $ ./check btrfs/027
>    $ rmmod btrfs

Ah, module reload is a useful way to verify. I have now enabled
it in the fstests by setting TEST_FS_MODULE_RELOAD=1, and I am
able to reproduce the issue.

>    $ dmesg
>    (...)
>    [84519.395485] BTRFS info (device sdc): dev_replace from <missing disk> (devid 2) to /dev/sdg started
>    [84519.466224] BTRFS info (device sdc): dev_replace from <missing disk> (devid 2) to /dev/sdg finished
>    [84519.552251] BTRFS info (device sdc): scrub: started on devid 1
>    [84519.552277] BTRFS info (device sdc): scrub: started on devid 2
>    [84519.552332] BTRFS info (device sdc): scrub: started on devid 3
>    [84519.552705] BTRFS info (device sdc): scrub: started on devid 4
>    [84519.604261] BTRFS info (device sdc): scrub: finished on devid 4 with status: 0
>    [84519.609374] BTRFS info (device sdc): scrub: finished on devid 3 with status: 0
>    [84519.610818] BTRFS info (device sdc): scrub: finished on devid 1 with status: 0
>    [84519.610927] BTRFS info (device sdc): scrub: finished on devid 2 with status: 0
>    [84559.503795] BTRFS: state leak: start 1048576 end 1351614463 state 1 in tree 1 refs 1
>    [84559.506764] BTRFS: state leak: start 1048576 end 1347420159 state 1 in tree 1 refs 1
>    [84559.510294] BTRFS: state leak: start 1048576 end 1351614463 state 1 in tree 1 refs 1
> 
> So fix this by adding back the call to extent_io_tree_release() at
> btrfs_free_device().
> 
> Fixes: f0bb5474cff0 ("btrfs: remove redundant release of btrfs_device::alloc_state")
> Signed-off-by: Filipe Manana <fdmanana@suse.com>
> ---
>   fs/btrfs/volumes.c | 1 +
>   1 file changed, 1 insertion(+)
> 
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index 03f52e4a20aa..841e799dece5 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -395,6 +395,7 @@ void btrfs_free_device(struct btrfs_device *device)
>   {
>   	WARN_ON(!list_empty(&device->post_commit_list));
>   	rcu_string_free(device->name);
> +	extent_io_tree_release(&device->alloc_state);
>   	btrfs_destroy_dev_zone_info(device);
>   	kfree(device);
>   }


Hmm. Is this fix incomplete? It does not address the concern raised in
f0bb5474cff0. Why don't we also do this...

-----
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 40ef429d10a5..e8c26856426e 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -1133,7 +1133,6 @@ static void btrfs_close_one_device(struct 
btrfs_device *device)

         device->fs_info = NULL;
         atomic_set(&device->dev_stats_ccnt, 0);
-       extent_io_tree_release(&device->alloc_state);

         /*
          * Reset the flush error record. We might have a transient 
flush error
-----

Thanks, Anand
Filipe Manana April 27, 2023, 4:34 p.m. UTC | #3
On Thu, Apr 27, 2023 at 4:16 PM Anand Jain <anand.jain@oracle.com> wrote:
>
> On 27/04/2023 01:13, fdmanana@kernel.org wrote:
> > From: Filipe Manana <fdmanana@suse.com>
> >
> > When a device replace finishes, the source device is freed by calling
> > btrfs_free_device() at btrfs_rm_dev_replace_free_srcdev(), but the
> > allocation state, tracked in the device's alloc_state io tree, is never
> > freed.
> >
> > This is a regression recently introduced by commit f0bb5474cff0 ("btrfs:
> > remove redundant release of btrfs_device::alloc_state"), which removed a
> > call to extent_io_tree_release() from btrfs_free_device(), with the
> > rationale that btrfs_close_one_device() already releases the allocation
> > state from a device and btrfs_close_one_device() is always called before
> > a device is freed with btrfs_free_device(). However that is not true for
> > the device replace case, as btrfs_free_device() is called without any
> > previous call to btrfs_close_one_device().
> >
> > The issue is trivial to reproduce, for example, by running test btrfs/027
> > from fstests:
> >
> >    $ ./check btrfs/027
> >    $ rmmod btrfs
>
> Ah, module reload is a useful way to verify. I have now enabled
> it in the fstests by setting TEST_FS_MODULE_RELOAD=1, and I am
> able to reproduce the issue.
>
> >    $ dmesg
> >    (...)
> >    [84519.395485] BTRFS info (device sdc): dev_replace from <missing disk> (devid 2) to /dev/sdg started
> >    [84519.466224] BTRFS info (device sdc): dev_replace from <missing disk> (devid 2) to /dev/sdg finished
> >    [84519.552251] BTRFS info (device sdc): scrub: started on devid 1
> >    [84519.552277] BTRFS info (device sdc): scrub: started on devid 2
> >    [84519.552332] BTRFS info (device sdc): scrub: started on devid 3
> >    [84519.552705] BTRFS info (device sdc): scrub: started on devid 4
> >    [84519.604261] BTRFS info (device sdc): scrub: finished on devid 4 with status: 0
> >    [84519.609374] BTRFS info (device sdc): scrub: finished on devid 3 with status: 0
> >    [84519.610818] BTRFS info (device sdc): scrub: finished on devid 1 with status: 0
> >    [84519.610927] BTRFS info (device sdc): scrub: finished on devid 2 with status: 0
> >    [84559.503795] BTRFS: state leak: start 1048576 end 1351614463 state 1 in tree 1 refs 1
> >    [84559.506764] BTRFS: state leak: start 1048576 end 1347420159 state 1 in tree 1 refs 1
> >    [84559.510294] BTRFS: state leak: start 1048576 end 1351614463 state 1 in tree 1 refs 1
> >
> > So fix this by adding back the call to extent_io_tree_release() at
> > btrfs_free_device().
> >
> > Fixes: f0bb5474cff0 ("btrfs: remove redundant release of btrfs_device::alloc_state")
> > Signed-off-by: Filipe Manana <fdmanana@suse.com>
> > ---
> >   fs/btrfs/volumes.c | 1 +
> >   1 file changed, 1 insertion(+)
> >
> > diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> > index 03f52e4a20aa..841e799dece5 100644
> > --- a/fs/btrfs/volumes.c
> > +++ b/fs/btrfs/volumes.c
> > @@ -395,6 +395,7 @@ void btrfs_free_device(struct btrfs_device *device)
> >   {
> >       WARN_ON(!list_empty(&device->post_commit_list));
> >       rcu_string_free(device->name);
> > +     extent_io_tree_release(&device->alloc_state);
> >       btrfs_destroy_dev_zone_info(device);
> >       kfree(device);
> >   }
>
>
> Hmm. Is this fix incomplete? It does not address the concern raised in
> f0bb5474cff0. Why don't we also do this...

What's the concern exactly?
One extra call to extent_io_tree_release() that's actually needed?

>
> -----
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index 40ef429d10a5..e8c26856426e 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -1133,7 +1133,6 @@ static void btrfs_close_one_device(struct
> btrfs_device *device)
>
>          device->fs_info = NULL;
>          atomic_set(&device->dev_stats_ccnt, 0);
> -       extent_io_tree_release(&device->alloc_state);

This will change semantics a bit.
Removing this, will make us hold on to more memory for a longer period
when fs_devices->opened > 1, so btrfs_free_device() /
free_fs_devices() will be called potentially much later.

I'd rather keep the behaviour we had unless there's a stronger reason
other than removing 1 line of code.

Thanks.



>
>          /*
>           * Reset the flush error record. We might have a transient
> flush error
> -----
>
> Thanks, Anand
>
>
Anand Jain April 27, 2023, 11:39 p.m. UTC | #4
On 28/4/23 00:34, Filipe Manana wrote:
> On Thu, Apr 27, 2023 at 4:16 PM Anand Jain <anand.jain@oracle.com> wrote:
>>
>> On 27/04/2023 01:13, fdmanana@kernel.org wrote:
>>> From: Filipe Manana <fdmanana@suse.com>
>>>
>>> When a device replace finishes, the source device is freed by calling
>>> btrfs_free_device() at btrfs_rm_dev_replace_free_srcdev(), but the
>>> allocation state, tracked in the device's alloc_state io tree, is never
>>> freed.
>>>
>>> This is a regression recently introduced by commit f0bb5474cff0 ("btrfs:
>>> remove redundant release of btrfs_device::alloc_state"), which removed a
>>> call to extent_io_tree_release() from btrfs_free_device(), with the
>>> rationale that btrfs_close_one_device() already releases the allocation
>>> state from a device and btrfs_close_one_device() is always called before
>>> a device is freed with btrfs_free_device(). However that is not true for
>>> the device replace case, as btrfs_free_device() is called without any
>>> previous call to btrfs_close_one_device().
>>>
>>> The issue is trivial to reproduce, for example, by running test btrfs/027
>>> from fstests:
>>>
>>>     $ ./check btrfs/027
>>>     $ rmmod btrfs
>>
>> Ah, module reload is a useful way to verify. I have now enabled
>> it in the fstests by setting TEST_FS_MODULE_RELOAD=1, and I am
>> able to reproduce the issue.
>>
>>>     $ dmesg
>>>     (...)
>>>     [84519.395485] BTRFS info (device sdc): dev_replace from <missing disk> (devid 2) to /dev/sdg started
>>>     [84519.466224] BTRFS info (device sdc): dev_replace from <missing disk> (devid 2) to /dev/sdg finished
>>>     [84519.552251] BTRFS info (device sdc): scrub: started on devid 1
>>>     [84519.552277] BTRFS info (device sdc): scrub: started on devid 2
>>>     [84519.552332] BTRFS info (device sdc): scrub: started on devid 3
>>>     [84519.552705] BTRFS info (device sdc): scrub: started on devid 4
>>>     [84519.604261] BTRFS info (device sdc): scrub: finished on devid 4 with status: 0
>>>     [84519.609374] BTRFS info (device sdc): scrub: finished on devid 3 with status: 0
>>>     [84519.610818] BTRFS info (device sdc): scrub: finished on devid 1 with status: 0
>>>     [84519.610927] BTRFS info (device sdc): scrub: finished on devid 2 with status: 0
>>>     [84559.503795] BTRFS: state leak: start 1048576 end 1351614463 state 1 in tree 1 refs 1
>>>     [84559.506764] BTRFS: state leak: start 1048576 end 1347420159 state 1 in tree 1 refs 1
>>>     [84559.510294] BTRFS: state leak: start 1048576 end 1351614463 state 1 in tree 1 refs 1
>>>
>>> So fix this by adding back the call to extent_io_tree_release() at
>>> btrfs_free_device().
>>>
>>> Fixes: f0bb5474cff0 ("btrfs: remove redundant release of btrfs_device::alloc_state")
>>> Signed-off-by: Filipe Manana <fdmanana@suse.com>
>>> ---
>>>    fs/btrfs/volumes.c | 1 +
>>>    1 file changed, 1 insertion(+)
>>>
>>> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
>>> index 03f52e4a20aa..841e799dece5 100644
>>> --- a/fs/btrfs/volumes.c
>>> +++ b/fs/btrfs/volumes.c
>>> @@ -395,6 +395,7 @@ void btrfs_free_device(struct btrfs_device *device)
>>>    {
>>>        WARN_ON(!list_empty(&device->post_commit_list));
>>>        rcu_string_free(device->name);
>>> +     extent_io_tree_release(&device->alloc_state);
>>>        btrfs_destroy_dev_zone_info(device);
>>>        kfree(device);
>>>    }
>>
>>
>> Hmm. Is this fix incomplete? It does not address the concern raised in
>> f0bb5474cff0. Why don't we also do this...
> 
> What's the concern exactly?
> One extra call to extent_io_tree_release() that's actually needed?
I'm looking for a solution that meets the need without leading to a
situation where extent_io_tree_release() is invoked twice elsewhere.

> 
>>
>> -----
>> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
>> index 40ef429d10a5..e8c26856426e 100644
>> --- a/fs/btrfs/volumes.c
>> +++ b/fs/btrfs/volumes.c
>> @@ -1133,7 +1133,6 @@ static void btrfs_close_one_device(struct
>> btrfs_device *device)
>>
>>           device->fs_info = NULL;
>>           atomic_set(&device->dev_stats_ccnt, 0);
>> -       extent_io_tree_release(&device->alloc_state);
> 
> This will change semantics a bit.
> Removing this, will make us hold on to more memory for a longer period
> when fs_devices->opened > 1, so btrfs_free_device() /
> free_fs_devices() will be called potentially much later.
> 
> I'd rather keep the behaviour we had unless there's a stronger reason
> other than removing 1 line of code.

I'm trying to simplify by merging steps in either 
btrfs_rm_dev_replace_remove_srcdev() or 
btrfs_rm_dev_replace_free_srcdev() with btrfs_close_one_device().

This could beautifully address the issue. Let's see if it works.

Thanks, Anand

> 
> Thanks.
> 
> 
> 
>>
>>           /*
>>            * Reset the flush error record. We might have a transient
>> flush error
>> -----
>>
>> Thanks, Anand
>>
>>
diff mbox series

Patch

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 03f52e4a20aa..841e799dece5 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -395,6 +395,7 @@  void btrfs_free_device(struct btrfs_device *device)
 {
 	WARN_ON(!list_empty(&device->post_commit_list));
 	rcu_string_free(device->name);
+	extent_io_tree_release(&device->alloc_state);
 	btrfs_destroy_dev_zone_info(device);
 	kfree(device);
 }