Message ID | e695f292be17c831e6024a743365737372a7a7aa.1682528751.git.fdmanana@suse.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | btrfs: fix extent state leaks after device replace | expand |
On 2023/4/27 01:13, fdmanana@kernel.org wrote: > From: Filipe Manana <fdmanana@suse.com> > > When a device replace finishes, the source device is freed by calling > btrfs_free_device() at btrfs_rm_dev_replace_free_srcdev(), but the > allocation state, tracked in the device's alloc_state io tree, is never > freed. > > This is a regression recently introduced by commit f0bb5474cff0 ("btrfs: > remove redundant release of btrfs_device::alloc_state"), which removed a > call to extent_io_tree_release() from btrfs_free_device(), with the > rationale that btrfs_close_one_device() already releases the allocation > state from a device and btrfs_close_one_device() is always called before > a device is freed with btrfs_free_device(). However that is not true for > the device replace case, as btrfs_free_device() is called without any > previous call to btrfs_close_one_device(). > > The issue is trivial to reproduce, for example, by running test btrfs/027 > from fstests: > > $ ./check btrfs/027 > $ rmmod btrfs > $ dmesg > (...) > [84519.395485] BTRFS info (device sdc): dev_replace from <missing disk> (devid 2) to /dev/sdg started > [84519.466224] BTRFS info (device sdc): dev_replace from <missing disk> (devid 2) to /dev/sdg finished > [84519.552251] BTRFS info (device sdc): scrub: started on devid 1 > [84519.552277] BTRFS info (device sdc): scrub: started on devid 2 > [84519.552332] BTRFS info (device sdc): scrub: started on devid 3 > [84519.552705] BTRFS info (device sdc): scrub: started on devid 4 > [84519.604261] BTRFS info (device sdc): scrub: finished on devid 4 with status: 0 > [84519.609374] BTRFS info (device sdc): scrub: finished on devid 3 with status: 0 > [84519.610818] BTRFS info (device sdc): scrub: finished on devid 1 with status: 0 > [84519.610927] BTRFS info (device sdc): scrub: finished on devid 2 with status: 0 > [84559.503795] BTRFS: state leak: start 1048576 end 1351614463 state 1 in tree 1 refs 1 > [84559.506764] BTRFS: state leak: start 1048576 end 1347420159 state 1 in tree 1 refs 1 > [84559.510294] BTRFS: state leak: start 1048576 end 1351614463 state 1 in tree 1 refs 1 > > So fix this by adding back the call to extent_io_tree_release() at > btrfs_free_device(). > > Fixes: f0bb5474cff0 ("btrfs: remove redundant release of btrfs_device::alloc_state") > Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Thanks, Qu > --- > fs/btrfs/volumes.c | 1 + > 1 file changed, 1 insertion(+) > > diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c > index 03f52e4a20aa..841e799dece5 100644 > --- a/fs/btrfs/volumes.c > +++ b/fs/btrfs/volumes.c > @@ -395,6 +395,7 @@ void btrfs_free_device(struct btrfs_device *device) > { > WARN_ON(!list_empty(&device->post_commit_list)); > rcu_string_free(device->name); > + extent_io_tree_release(&device->alloc_state); > btrfs_destroy_dev_zone_info(device); > kfree(device); > }
On 27/04/2023 01:13, fdmanana@kernel.org wrote: > From: Filipe Manana <fdmanana@suse.com> > > When a device replace finishes, the source device is freed by calling > btrfs_free_device() at btrfs_rm_dev_replace_free_srcdev(), but the > allocation state, tracked in the device's alloc_state io tree, is never > freed. > > This is a regression recently introduced by commit f0bb5474cff0 ("btrfs: > remove redundant release of btrfs_device::alloc_state"), which removed a > call to extent_io_tree_release() from btrfs_free_device(), with the > rationale that btrfs_close_one_device() already releases the allocation > state from a device and btrfs_close_one_device() is always called before > a device is freed with btrfs_free_device(). However that is not true for > the device replace case, as btrfs_free_device() is called without any > previous call to btrfs_close_one_device(). > > The issue is trivial to reproduce, for example, by running test btrfs/027 > from fstests: > > $ ./check btrfs/027 > $ rmmod btrfs Ah, module reload is a useful way to verify. I have now enabled it in the fstests by setting TEST_FS_MODULE_RELOAD=1, and I am able to reproduce the issue. > $ dmesg > (...) > [84519.395485] BTRFS info (device sdc): dev_replace from <missing disk> (devid 2) to /dev/sdg started > [84519.466224] BTRFS info (device sdc): dev_replace from <missing disk> (devid 2) to /dev/sdg finished > [84519.552251] BTRFS info (device sdc): scrub: started on devid 1 > [84519.552277] BTRFS info (device sdc): scrub: started on devid 2 > [84519.552332] BTRFS info (device sdc): scrub: started on devid 3 > [84519.552705] BTRFS info (device sdc): scrub: started on devid 4 > [84519.604261] BTRFS info (device sdc): scrub: finished on devid 4 with status: 0 > [84519.609374] BTRFS info (device sdc): scrub: finished on devid 3 with status: 0 > [84519.610818] BTRFS info (device sdc): scrub: finished on devid 1 with status: 0 > [84519.610927] BTRFS info (device sdc): scrub: finished on devid 2 with status: 0 > [84559.503795] BTRFS: state leak: start 1048576 end 1351614463 state 1 in tree 1 refs 1 > [84559.506764] BTRFS: state leak: start 1048576 end 1347420159 state 1 in tree 1 refs 1 > [84559.510294] BTRFS: state leak: start 1048576 end 1351614463 state 1 in tree 1 refs 1 > > So fix this by adding back the call to extent_io_tree_release() at > btrfs_free_device(). > > Fixes: f0bb5474cff0 ("btrfs: remove redundant release of btrfs_device::alloc_state") > Signed-off-by: Filipe Manana <fdmanana@suse.com> > --- > fs/btrfs/volumes.c | 1 + > 1 file changed, 1 insertion(+) > > diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c > index 03f52e4a20aa..841e799dece5 100644 > --- a/fs/btrfs/volumes.c > +++ b/fs/btrfs/volumes.c > @@ -395,6 +395,7 @@ void btrfs_free_device(struct btrfs_device *device) > { > WARN_ON(!list_empty(&device->post_commit_list)); > rcu_string_free(device->name); > + extent_io_tree_release(&device->alloc_state); > btrfs_destroy_dev_zone_info(device); > kfree(device); > } Hmm. Is this fix incomplete? It does not address the concern raised in f0bb5474cff0. Why don't we also do this... ----- diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c index 40ef429d10a5..e8c26856426e 100644 --- a/fs/btrfs/volumes.c +++ b/fs/btrfs/volumes.c @@ -1133,7 +1133,6 @@ static void btrfs_close_one_device(struct btrfs_device *device) device->fs_info = NULL; atomic_set(&device->dev_stats_ccnt, 0); - extent_io_tree_release(&device->alloc_state); /* * Reset the flush error record. We might have a transient flush error ----- Thanks, Anand
On Thu, Apr 27, 2023 at 4:16 PM Anand Jain <anand.jain@oracle.com> wrote: > > On 27/04/2023 01:13, fdmanana@kernel.org wrote: > > From: Filipe Manana <fdmanana@suse.com> > > > > When a device replace finishes, the source device is freed by calling > > btrfs_free_device() at btrfs_rm_dev_replace_free_srcdev(), but the > > allocation state, tracked in the device's alloc_state io tree, is never > > freed. > > > > This is a regression recently introduced by commit f0bb5474cff0 ("btrfs: > > remove redundant release of btrfs_device::alloc_state"), which removed a > > call to extent_io_tree_release() from btrfs_free_device(), with the > > rationale that btrfs_close_one_device() already releases the allocation > > state from a device and btrfs_close_one_device() is always called before > > a device is freed with btrfs_free_device(). However that is not true for > > the device replace case, as btrfs_free_device() is called without any > > previous call to btrfs_close_one_device(). > > > > The issue is trivial to reproduce, for example, by running test btrfs/027 > > from fstests: > > > > $ ./check btrfs/027 > > $ rmmod btrfs > > Ah, module reload is a useful way to verify. I have now enabled > it in the fstests by setting TEST_FS_MODULE_RELOAD=1, and I am > able to reproduce the issue. > > > $ dmesg > > (...) > > [84519.395485] BTRFS info (device sdc): dev_replace from <missing disk> (devid 2) to /dev/sdg started > > [84519.466224] BTRFS info (device sdc): dev_replace from <missing disk> (devid 2) to /dev/sdg finished > > [84519.552251] BTRFS info (device sdc): scrub: started on devid 1 > > [84519.552277] BTRFS info (device sdc): scrub: started on devid 2 > > [84519.552332] BTRFS info (device sdc): scrub: started on devid 3 > > [84519.552705] BTRFS info (device sdc): scrub: started on devid 4 > > [84519.604261] BTRFS info (device sdc): scrub: finished on devid 4 with status: 0 > > [84519.609374] BTRFS info (device sdc): scrub: finished on devid 3 with status: 0 > > [84519.610818] BTRFS info (device sdc): scrub: finished on devid 1 with status: 0 > > [84519.610927] BTRFS info (device sdc): scrub: finished on devid 2 with status: 0 > > [84559.503795] BTRFS: state leak: start 1048576 end 1351614463 state 1 in tree 1 refs 1 > > [84559.506764] BTRFS: state leak: start 1048576 end 1347420159 state 1 in tree 1 refs 1 > > [84559.510294] BTRFS: state leak: start 1048576 end 1351614463 state 1 in tree 1 refs 1 > > > > So fix this by adding back the call to extent_io_tree_release() at > > btrfs_free_device(). > > > > Fixes: f0bb5474cff0 ("btrfs: remove redundant release of btrfs_device::alloc_state") > > Signed-off-by: Filipe Manana <fdmanana@suse.com> > > --- > > fs/btrfs/volumes.c | 1 + > > 1 file changed, 1 insertion(+) > > > > diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c > > index 03f52e4a20aa..841e799dece5 100644 > > --- a/fs/btrfs/volumes.c > > +++ b/fs/btrfs/volumes.c > > @@ -395,6 +395,7 @@ void btrfs_free_device(struct btrfs_device *device) > > { > > WARN_ON(!list_empty(&device->post_commit_list)); > > rcu_string_free(device->name); > > + extent_io_tree_release(&device->alloc_state); > > btrfs_destroy_dev_zone_info(device); > > kfree(device); > > } > > > Hmm. Is this fix incomplete? It does not address the concern raised in > f0bb5474cff0. Why don't we also do this... What's the concern exactly? One extra call to extent_io_tree_release() that's actually needed? > > ----- > diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c > index 40ef429d10a5..e8c26856426e 100644 > --- a/fs/btrfs/volumes.c > +++ b/fs/btrfs/volumes.c > @@ -1133,7 +1133,6 @@ static void btrfs_close_one_device(struct > btrfs_device *device) > > device->fs_info = NULL; > atomic_set(&device->dev_stats_ccnt, 0); > - extent_io_tree_release(&device->alloc_state); This will change semantics a bit. Removing this, will make us hold on to more memory for a longer period when fs_devices->opened > 1, so btrfs_free_device() / free_fs_devices() will be called potentially much later. I'd rather keep the behaviour we had unless there's a stronger reason other than removing 1 line of code. Thanks. > > /* > * Reset the flush error record. We might have a transient > flush error > ----- > > Thanks, Anand > >
On 28/4/23 00:34, Filipe Manana wrote: > On Thu, Apr 27, 2023 at 4:16 PM Anand Jain <anand.jain@oracle.com> wrote: >> >> On 27/04/2023 01:13, fdmanana@kernel.org wrote: >>> From: Filipe Manana <fdmanana@suse.com> >>> >>> When a device replace finishes, the source device is freed by calling >>> btrfs_free_device() at btrfs_rm_dev_replace_free_srcdev(), but the >>> allocation state, tracked in the device's alloc_state io tree, is never >>> freed. >>> >>> This is a regression recently introduced by commit f0bb5474cff0 ("btrfs: >>> remove redundant release of btrfs_device::alloc_state"), which removed a >>> call to extent_io_tree_release() from btrfs_free_device(), with the >>> rationale that btrfs_close_one_device() already releases the allocation >>> state from a device and btrfs_close_one_device() is always called before >>> a device is freed with btrfs_free_device(). However that is not true for >>> the device replace case, as btrfs_free_device() is called without any >>> previous call to btrfs_close_one_device(). >>> >>> The issue is trivial to reproduce, for example, by running test btrfs/027 >>> from fstests: >>> >>> $ ./check btrfs/027 >>> $ rmmod btrfs >> >> Ah, module reload is a useful way to verify. I have now enabled >> it in the fstests by setting TEST_FS_MODULE_RELOAD=1, and I am >> able to reproduce the issue. >> >>> $ dmesg >>> (...) >>> [84519.395485] BTRFS info (device sdc): dev_replace from <missing disk> (devid 2) to /dev/sdg started >>> [84519.466224] BTRFS info (device sdc): dev_replace from <missing disk> (devid 2) to /dev/sdg finished >>> [84519.552251] BTRFS info (device sdc): scrub: started on devid 1 >>> [84519.552277] BTRFS info (device sdc): scrub: started on devid 2 >>> [84519.552332] BTRFS info (device sdc): scrub: started on devid 3 >>> [84519.552705] BTRFS info (device sdc): scrub: started on devid 4 >>> [84519.604261] BTRFS info (device sdc): scrub: finished on devid 4 with status: 0 >>> [84519.609374] BTRFS info (device sdc): scrub: finished on devid 3 with status: 0 >>> [84519.610818] BTRFS info (device sdc): scrub: finished on devid 1 with status: 0 >>> [84519.610927] BTRFS info (device sdc): scrub: finished on devid 2 with status: 0 >>> [84559.503795] BTRFS: state leak: start 1048576 end 1351614463 state 1 in tree 1 refs 1 >>> [84559.506764] BTRFS: state leak: start 1048576 end 1347420159 state 1 in tree 1 refs 1 >>> [84559.510294] BTRFS: state leak: start 1048576 end 1351614463 state 1 in tree 1 refs 1 >>> >>> So fix this by adding back the call to extent_io_tree_release() at >>> btrfs_free_device(). >>> >>> Fixes: f0bb5474cff0 ("btrfs: remove redundant release of btrfs_device::alloc_state") >>> Signed-off-by: Filipe Manana <fdmanana@suse.com> >>> --- >>> fs/btrfs/volumes.c | 1 + >>> 1 file changed, 1 insertion(+) >>> >>> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c >>> index 03f52e4a20aa..841e799dece5 100644 >>> --- a/fs/btrfs/volumes.c >>> +++ b/fs/btrfs/volumes.c >>> @@ -395,6 +395,7 @@ void btrfs_free_device(struct btrfs_device *device) >>> { >>> WARN_ON(!list_empty(&device->post_commit_list)); >>> rcu_string_free(device->name); >>> + extent_io_tree_release(&device->alloc_state); >>> btrfs_destroy_dev_zone_info(device); >>> kfree(device); >>> } >> >> >> Hmm. Is this fix incomplete? It does not address the concern raised in >> f0bb5474cff0. Why don't we also do this... > > What's the concern exactly? > One extra call to extent_io_tree_release() that's actually needed? I'm looking for a solution that meets the need without leading to a situation where extent_io_tree_release() is invoked twice elsewhere. > >> >> ----- >> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c >> index 40ef429d10a5..e8c26856426e 100644 >> --- a/fs/btrfs/volumes.c >> +++ b/fs/btrfs/volumes.c >> @@ -1133,7 +1133,6 @@ static void btrfs_close_one_device(struct >> btrfs_device *device) >> >> device->fs_info = NULL; >> atomic_set(&device->dev_stats_ccnt, 0); >> - extent_io_tree_release(&device->alloc_state); > > This will change semantics a bit. > Removing this, will make us hold on to more memory for a longer period > when fs_devices->opened > 1, so btrfs_free_device() / > free_fs_devices() will be called potentially much later. > > I'd rather keep the behaviour we had unless there's a stronger reason > other than removing 1 line of code. I'm trying to simplify by merging steps in either btrfs_rm_dev_replace_remove_srcdev() or btrfs_rm_dev_replace_free_srcdev() with btrfs_close_one_device(). This could beautifully address the issue. Let's see if it works. Thanks, Anand > > Thanks. > > > >> >> /* >> * Reset the flush error record. We might have a transient >> flush error >> ----- >> >> Thanks, Anand >> >>
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c index 03f52e4a20aa..841e799dece5 100644 --- a/fs/btrfs/volumes.c +++ b/fs/btrfs/volumes.c @@ -395,6 +395,7 @@ void btrfs_free_device(struct btrfs_device *device) { WARN_ON(!list_empty(&device->post_commit_list)); rcu_string_free(device->name); + extent_io_tree_release(&device->alloc_state); btrfs_destroy_dev_zone_info(device); kfree(device); }