diff mbox series

btrfs: do not clear read-only when adding sprout device

Message ID 16c05d39566858bb8bc1e03bd19947cf2b601b98.1647906815.git.boris@bur.io (mailing list archive)
State New, archived
Headers show
Series btrfs: do not clear read-only when adding sprout device | expand

Commit Message

Boris Burkov March 21, 2022, 11:56 p.m. UTC
If you follow the seed/sprout wiki, it suggests the following workflow:

btrfstune -S 1 seed_dev
mount seed_dev mnt
btrfs device add sprout_dev
mount -o remount,rw mnt

The first mount mounts the FS readonly, which results in not setting
BTRFS_FS_OPEN, and setting the readonly bit on the sb. The device add
somewhat surprisingly clears the readonly bit on the sb (though the
mount is still practically readonly, from the users perspective...).
Finally, the remount checks the readonly bit on the sb against the flag
and sees no change, so it does not run the code intended to run on
ro->rw transitions, leaving BTRFS_FS_OPEN unset.

As a result, when the cleaner_kthread runs, it sees no BTRFS_FS_OPEN and
does no work. This results in leaking deleted snapshots until we run out
of space.

I propose fixing it at the first departure from what feels reasonable:
when we clear the readonly bit on the sb during device add. I have a
reproducer of the issue here:
https://raw.githubusercontent.com/boryas/scripts/main/sh/seed/mkseed.sh
and confirm that this patch fixes it, and seems to work OK, otherwise. I
will admit that I couldn't dig up the original rationale for clearing
the bit here (it dates back to the original seed/sprout commit without
explicit explanation) so it's hard to imagine all the ramifications of
the change.

Signed-off-by: Boris Burkov <boris@bur.io>
---
 fs/btrfs/volumes.c | 4 ----
 1 file changed, 4 deletions(-)

Comments

Josef Bacik March 22, 2022, 9:46 p.m. UTC | #1
On Mon, Mar 21, 2022 at 04:56:17PM -0700, Boris Burkov wrote:
> If you follow the seed/sprout wiki, it suggests the following workflow:
> 
> btrfstune -S 1 seed_dev
> mount seed_dev mnt
> btrfs device add sprout_dev
> mount -o remount,rw mnt
> 
> The first mount mounts the FS readonly, which results in not setting
> BTRFS_FS_OPEN, and setting the readonly bit on the sb. The device add
> somewhat surprisingly clears the readonly bit on the sb (though the
> mount is still practically readonly, from the users perspective...).
> Finally, the remount checks the readonly bit on the sb against the flag
> and sees no change, so it does not run the code intended to run on
> ro->rw transitions, leaving BTRFS_FS_OPEN unset.
> 
> As a result, when the cleaner_kthread runs, it sees no BTRFS_FS_OPEN and
> does no work. This results in leaking deleted snapshots until we run out
> of space.
> 
> I propose fixing it at the first departure from what feels reasonable:
> when we clear the readonly bit on the sb during device add. I have a
> reproducer of the issue here:
> https://raw.githubusercontent.com/boryas/scripts/main/sh/seed/mkseed.sh
> and confirm that this patch fixes it, and seems to work OK, otherwise. I
> will admit that I couldn't dig up the original rationale for clearing
> the bit here (it dates back to the original seed/sprout commit without
> explicit explanation) so it's hard to imagine all the ramifications of
> the change.
> 
> Signed-off-by: Boris Burkov <boris@bur.io>

Can we get this turned into a fstest please?

Reviewed-by: Josef Bacik <josef@toxicpanda.com>

Thanks,

Josef
Naohiro Aota March 23, 2022, 12:52 a.m. UTC | #2
On Mon, Mar 21, 2022 at 04:56:17PM -0700, Boris Burkov wrote:
> If you follow the seed/sprout wiki, it suggests the following workflow:
> 
> btrfstune -S 1 seed_dev
> mount seed_dev mnt
> btrfs device add sprout_dev
> mount -o remount,rw mnt
> 
> The first mount mounts the FS readonly, which results in not setting
> BTRFS_FS_OPEN, and setting the readonly bit on the sb. The device add
> somewhat surprisingly clears the readonly bit on the sb (though the
> mount is still practically readonly, from the users perspective...).
> Finally, the remount checks the readonly bit on the sb against the flag
> and sees no change, so it does not run the code intended to run on
> ro->rw transitions, leaving BTRFS_FS_OPEN unset.
> 
> As a result, when the cleaner_kthread runs, it sees no BTRFS_FS_OPEN and
> does no work. This results in leaking deleted snapshots until we run out
> of space.
> 
> I propose fixing it at the first departure from what feels reasonable:
> when we clear the readonly bit on the sb during device add. I have a
> reproducer of the issue here:
> https://raw.githubusercontent.com/boryas/scripts/main/sh/seed/mkseed.sh
> and confirm that this patch fixes it, and seems to work OK, otherwise. I
> will admit that I couldn't dig up the original rationale for clearing
> the bit here (it dates back to the original seed/sprout commit without
> explicit explanation) so it's hard to imagine all the ramifications of
> the change.
> 
> Signed-off-by: Boris Burkov <boris@bur.io>
> ---
>  fs/btrfs/volumes.c | 4 ----
>  1 file changed, 4 deletions(-)
> 
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index 3fd17e87815a..75d7eeb26fe6 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -2675,8 +2675,6 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path
>  	set_blocksize(device->bdev, BTRFS_BDEV_BLOCKSIZE);
>  
>  	if (seeding_dev) {
> -		btrfs_clear_sb_rdonly(sb);
> -

After this line, it updates the metadata e.g, with
init_first_rw_device() and writes them out with
btrfs_commit_transaction(). Is that OK to do so with the SB_RDONLY
flag set?

>  		/* GFP_KERNEL allocation must not be under device_list_mutex */
>  		seed_devices = btrfs_init_sprout(fs_info);
>  		if (IS_ERR(seed_devices)) {
> @@ -2819,8 +2817,6 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path
>  	mutex_unlock(&fs_info->chunk_mutex);
>  	mutex_unlock(&fs_info->fs_devices->device_list_mutex);
>  error_trans:
> -	if (seeding_dev)
> -		btrfs_set_sb_rdonly(sb);
>  	if (trans)
>  		btrfs_end_transaction(trans);
>  error_free_zone:
> -- 
> 2.30.2
>
Anand Jain March 23, 2022, 10:44 a.m. UTC | #3
On 22/03/2022 07:56, Boris Burkov wrote:
> If you follow the seed/sprout wiki, it suggests the following workflow:
> 
> btrfstune -S 1 seed_dev
> mount seed_dev mnt
> btrfs device add sprout_dev

> mount -o remount,rw mnt
or
  umount mnt
  mount sprout mnt

> The first mount mounts the FS readonly, which results in not setting
> BTRFS_FS_OPEN, and setting the readonly bit on the sb.

  Why not set the BTRFS_FS_OPEN?

@@ -3904,8 +3904,11 @@ int __cold open_ctree(struct super_block *sb, 
struct btrfs_fs_devices *fs_device
                 goto fail_qgroup;
         }

-       if (sb_rdonly(sb))
+       if (sb_rdonly(sb)) {
+               btrfs_set_sb_rdonly(sb);
+               set_bit(BTRFS_FS_OPEN, &fs_info->flags);
                 goto clear_oneshot;
+       }

         ret = btrfs_start_pre_rw_mount(fs_info);
         if (ret) {

> The device add
> somewhat surprisingly clears the readonly bit on the sb (though the
> mount is still practically readonly, from the users perspective...).
> Finally, the remount checks the readonly bit on the sb against the flag
> and sees no change, so it does not run the code intended to run on
> ro->rw transitions, leaving BTRFS_FS_OPEN unset.

  Originally, the step 'btrfs device add sprout_dev' provided seed
  fs writeable without a remount.

  I think the btrfs_clear_sb_rdonly(sb) in btrfs_init_new_device()
  was part of it.

  Removing it doesn't seem to affect the seed-sprout functionality
  (did I miss anything?) either the -o remount,rw
  or mount recycle will get it writeable.

> As a result, when the cleaner_kthread runs, it sees no BTRFS_FS_OPEN and
> does no work. This results in leaking deleted snapshots until we run out
> of space.


> I propose fixing it at the first departure from what feels reasonable:
> when we clear the readonly bit on the sb during device add. I have a
> reproducer of the issue here:
> https://raw.githubusercontent.com/boryas/scripts/main/sh/seed/mkseed.sh
> and confirm that this patch fixes it, and seems to work OK, otherwise. I
> will admit that I couldn't dig up the original rationale for clearing
> the bit here (it dates back to the original seed/sprout commit without
> explicit explanation) so it's hard to imagine all the ramifications of
> the change.

  We got fstests -g seed to test the seed-sprout stuff. Your test case
  here fits in it. IMO.

Thanks, Anand


> Signed-off-by: Boris Burkov <boris@bur.io>
> ---
>   fs/btrfs/volumes.c | 4 ----
>   1 file changed, 4 deletions(-)
> 
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index 3fd17e87815a..75d7eeb26fe6 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -2675,8 +2675,6 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path
>   	set_blocksize(device->bdev, BTRFS_BDEV_BLOCKSIZE);
>   
>   	if (seeding_dev) {
> -		btrfs_clear_sb_rdonly(sb);
> -
>   		/* GFP_KERNEL allocation must not be under device_list_mutex */
>   		seed_devices = btrfs_init_sprout(fs_info);
>   		if (IS_ERR(seed_devices)) {
> @@ -2819,8 +2817,6 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path
>   	mutex_unlock(&fs_info->chunk_mutex);
>   	mutex_unlock(&fs_info->fs_devices->device_list_mutex);
>   error_trans:
> -	if (seeding_dev)
> -		btrfs_set_sb_rdonly(sb);
>   	if (trans)
>   		btrfs_end_transaction(trans);
>   error_free_zone:
Boris Burkov March 23, 2022, 6:16 p.m. UTC | #4
On Wed, Mar 23, 2022 at 12:52:15AM +0000, Naohiro Aota wrote:
> On Mon, Mar 21, 2022 at 04:56:17PM -0700, Boris Burkov wrote:
> > If you follow the seed/sprout wiki, it suggests the following workflow:
> > 
> > btrfstune -S 1 seed_dev > > mount seed_dev mnt
> > btrfs device add sprout_dev
> > mount -o remount,rw mnt
> > 
> > The first mount mounts the FS readonly, which results in not setting
> > BTRFS_FS_OPEN, and setting the readonly bit on the sb. The device add
> > somewhat surprisingly clears the readonly bit on the sb (though the
> > mount is still practically readonly, from the users perspective...).
> > Finally, the remount checks the readonly bit on the sb against the flag
> > and sees no change, so it does not run the code intended to run on
> > ro->rw transitions, leaving BTRFS_FS_OPEN unset.
> > 
> > As a result, when the cleaner_kthread runs, it sees no BTRFS_FS_OPEN and
> > does no work. This results in leaking deleted snapshots until we run out
> > of space.
> > 
> > I propose fixing it at the first departure from what feels reasonable:
> > when we clear the readonly bit on the sb during device add. I have a
> > reproducer of the issue here:
> > https://raw.githubusercontent.com/boryas/scripts/main/sh/seed/mkseed.sh
> > and confirm that this patch fixes it, and seems to work OK, otherwise. I
> > will admit that I couldn't dig up the original rationale for clearing
> > the bit here (it dates back to the original seed/sprout commit without
> > explicit explanation) so it's hard to imagine all the ramifications of
> > the change.
> > 
> > Signed-off-by: Boris Burkov <boris@bur.io>
> > ---
> >  fs/btrfs/volumes.c | 4 ----
> >  1 file changed, 4 deletions(-)
> > 
> > diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> > index 3fd17e87815a..75d7eeb26fe6 100644
> > --- a/fs/btrfs/volumes.c
> > +++ b/fs/btrfs/volumes.c
> > @@ -2675,8 +2675,6 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path
> >  	set_blocksize(device->bdev, BTRFS_BDEV_BLOCKSIZE);
> >  
> >  	if (seeding_dev) {
> > -		btrfs_clear_sb_rdonly(sb);
> > -
> 
> After this line, it updates the metadata e.g, with
> init_first_rw_device() and writes them out with
> btrfs_commit_transaction(). Is that OK to do so with the SB_RDONLY
> flag set?

Good question. As far as I can tell, the functions don't explicitly
check sb_rdonly, though that could be because they expect that to be
checked before you ever try to commit a transaction, for example..

If there is an issue, it's probably somewhat subtle, because the basic
behavior does work.

> 
> >  		/* GFP_KERNEL allocation must not be under device_list_mutex */
> >  		seed_devices = btrfs_init_sprout(fs_info);
> >  		if (IS_ERR(seed_devices)) {
> > @@ -2819,8 +2817,6 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path
> >  	mutex_unlock(&fs_info->chunk_mutex);
> >  	mutex_unlock(&fs_info->fs_devices->device_list_mutex);
> >  error_trans:
> > -	if (seeding_dev)
> > -		btrfs_set_sb_rdonly(sb);
> >  	if (trans)
> >  		btrfs_end_transaction(trans);
> >  error_free_zone:
> > -- 
> > 2.30.2
> >
Boris Burkov March 23, 2022, 6:25 p.m. UTC | #5
On Wed, Mar 23, 2022 at 06:44:42PM +0800, Anand Jain wrote:
> On 22/03/2022 07:56, Boris Burkov wrote:
> > If you follow the seed/sprout wiki, it suggests the following workflow:
> > 
> > btrfstune -S 1 seed_dev
> > mount seed_dev mnt
> > btrfs device add sprout_dev
> 
> > mount -o remount,rw mnt
> or
>  umount mnt
>  mount sprout mnt

Agreed. FWIW, I tested that umount/mount is not vulnerable to the bug.
However, a user might be using one of the documented workflows anyway,
and would need it for an fs they add a sprout to while it is in use.

> 
> > The first mount mounts the FS readonly, which results in not setting
> > BTRFS_FS_OPEN, and setting the readonly bit on the sb.
> 
>  Why not set the BTRFS_FS_OPEN?
> 

One reason is that there is other logic that runs when transitioning
from ro->rw in remount besides just setting BTRFS_FS_OPEN. By not
improperly clearing ro, we let that logic do its thing naturally.

> @@ -3904,8 +3904,11 @@ int __cold open_ctree(struct super_block *sb, struct
> btrfs_fs_devices *fs_device
>                 goto fail_qgroup;
>         }
> 
> -       if (sb_rdonly(sb))
> +       if (sb_rdonly(sb)) {
> +               btrfs_set_sb_rdonly(sb);
> +               set_bit(BTRFS_FS_OPEN, &fs_info->flags);
>                 goto clear_oneshot;
> +       }
> 
>         ret = btrfs_start_pre_rw_mount(fs_info);
>         if (ret) {
> 
> > The device add
> > somewhat surprisingly clears the readonly bit on the sb (though the
> > mount is still practically readonly, from the users perspective...).
> > Finally, the remount checks the readonly bit on the sb against the flag
> > and sees no change, so it does not run the code intended to run on
> > ro->rw transitions, leaving BTRFS_FS_OPEN unset.
> 
>  Originally, the step 'btrfs device add sprout_dev' provided seed
>  fs writeable without a remount.

That is very interesting. Do you remember why we changed this behavior
to the current behavior of leaving the seed readonly until the
remount/mount cycle?

> 
>  I think the btrfs_clear_sb_rdonly(sb) in btrfs_init_new_device()
>  was part of it.
> 
>  Removing it doesn't seem to affect the seed-sprout functionality
>  (did I miss anything?) either the -o remount,rw
>  or mount recycle will get it writeable.

My current understanding (probably flawed..):
before this patch: we clear the rdonly bit, but the fs is still readonly
until remount (should figure out exactly how)
after this patch: behavior is the same, except the rdonly bit gets
cleared along with the mounting, not the device add.

> 
> > As a result, when the cleaner_kthread runs, it sees no BTRFS_FS_OPEN and
> > does no work. This results in leaking deleted snapshots until we run out
> > of space.
> 
> 
> > I propose fixing it at the first departure from what feels reasonable:
> > when we clear the readonly bit on the sb during device add. I have a
> > reproducer of the issue here:
> > https://raw.githubusercontent.com/boryas/scripts/main/sh/seed/mkseed.sh
> > and confirm that this patch fixes it, and seems to work OK, otherwise. I
> > will admit that I couldn't dig up the original rationale for clearing
> > the bit here (it dates back to the original seed/sprout commit without
> > explicit explanation) so it's hard to imagine all the ramifications of
> > the change.
> 
>  We got fstests -g seed to test the seed-sprout stuff. Your test case
>  here fits in it. IMO.

Thanks for the tip, I'll add it there, regardless of how we fix it.

> 
> Thanks, Anand
> 
> 
> > Signed-off-by: Boris Burkov <boris@bur.io>
> > ---
> >   fs/btrfs/volumes.c | 4 ----
> >   1 file changed, 4 deletions(-)
> > 
> > diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> > index 3fd17e87815a..75d7eeb26fe6 100644
> > --- a/fs/btrfs/volumes.c
> > +++ b/fs/btrfs/volumes.c
> > @@ -2675,8 +2675,6 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path
> >   	set_blocksize(device->bdev, BTRFS_BDEV_BLOCKSIZE);
> >   	if (seeding_dev) {
> > -		btrfs_clear_sb_rdonly(sb);
> > -
> >   		/* GFP_KERNEL allocation must not be under device_list_mutex */
> >   		seed_devices = btrfs_init_sprout(fs_info);
> >   		if (IS_ERR(seed_devices)) {
> > @@ -2819,8 +2817,6 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path
> >   	mutex_unlock(&fs_info->chunk_mutex);
> >   	mutex_unlock(&fs_info->fs_devices->device_list_mutex);
> >   error_trans:
> > -	if (seeding_dev)
> > -		btrfs_set_sb_rdonly(sb);
> >   	if (trans)
> >   		btrfs_end_transaction(trans);
> >   error_free_zone:
>
Josef Bacik March 23, 2022, 8:17 p.m. UTC | #6
On Wed, Mar 23, 2022 at 06:44:42PM +0800, Anand Jain wrote:
> On 22/03/2022 07:56, Boris Burkov wrote:
> > If you follow the seed/sprout wiki, it suggests the following workflow:
> > 
> > btrfstune -S 1 seed_dev
> > mount seed_dev mnt
> > btrfs device add sprout_dev
> 
> > mount -o remount,rw mnt
> or
>  umount mnt
>  mount sprout mnt
> 
> > The first mount mounts the FS readonly, which results in not setting
> > BTRFS_FS_OPEN, and setting the readonly bit on the sb.
> 
>  Why not set the BTRFS_FS_OPEN?
> 
> @@ -3904,8 +3904,11 @@ int __cold open_ctree(struct super_block *sb, struct
> btrfs_fs_devices *fs_device
>                 goto fail_qgroup;
>         }
> 
> -       if (sb_rdonly(sb))
> +       if (sb_rdonly(sb)) {
> +               btrfs_set_sb_rdonly(sb);
> +               set_bit(BTRFS_FS_OPEN, &fs_info->flags);
>                 goto clear_oneshot;
> +       }
> 
>         ret = btrfs_start_pre_rw_mount(fs_info);
>         if (ret) {
> 
> > The device add
> > somewhat surprisingly clears the readonly bit on the sb (though the
> > mount is still practically readonly, from the users perspective...).
> > Finally, the remount checks the readonly bit on the sb against the flag
> > and sees no change, so it does not run the code intended to run on
> > ro->rw transitions, leaving BTRFS_FS_OPEN unset.
> 
>  Originally, the step 'btrfs device add sprout_dev' provided seed
>  fs writeable without a remount.
> 
>  I think the btrfs_clear_sb_rdonly(sb) in btrfs_init_new_device()
>  was part of it.
> 

Yeah this was a bad idea, I don't want to randomly flip a mount from ro->rw
without the user telling us to.  Thanks,

Josef
Anand Jain March 24, 2022, 11:16 a.m. UTC | #7
On 24/03/2022 02:25, Boris Burkov wrote:
> On Wed, Mar 23, 2022 at 06:44:42PM +0800, Anand Jain wrote:
>> On 22/03/2022 07:56, Boris Burkov wrote:
>>> If you follow the seed/sprout wiki, it suggests the following workflow:
>>>
>>> btrfstune -S 1 seed_dev
>>> mount seed_dev mnt
>>> btrfs device add sprout_dev
>>
>>> mount -o remount,rw mnt
>> or
>>   umount mnt
>>   mount sprout mnt
> 
> Agreed. FWIW, I tested that umount/mount is not vulnerable to the bug.
> However, a user might be using one of the documented workflows anyway,
> and would need it for an fs they add a sprout to while it is in use.

Yep.

>>
>>> The first mount mounts the FS readonly, which results in not setting
>>> BTRFS_FS_OPEN, and setting the readonly bit on the sb.
>>
>>   Why not set the BTRFS_FS_OPEN?
>>
> 
> One reason is that there is other logic that runs when transitioning
> from ro->rw in remount besides just setting BTRFS_FS_OPEN. By not
> improperly clearing ro, we let that logic do its thing naturally.
> 

>> @@ -3904,8 +3904,11 @@ int __cold open_ctree(struct super_block *sb, struct
>> btrfs_fs_devices *fs_device
>>                  goto fail_qgroup;
>>          }
>>
>> -       if (sb_rdonly(sb))
>> +       if (sb_rdonly(sb)) {
>> +               btrfs_set_sb_rdonly(sb);
>> +               set_bit(BTRFS_FS_OPEN, &fs_info->flags);

  Looked more deep. No. I misunderstood BTRFS_FS_OPEN, it implies
  open_ctree() is complete and read-writeable. So this won't work.

>>                  goto clear_oneshot;
>> +       }
>>
>>          ret = btrfs_start_pre_rw_mount(fs_info);
>>          if (ret) {
>>
>>> The device add
>>> somewhat surprisingly clears the readonly bit on the sb (though the
>>> mount is still practically readonly, from the users perspective...).
>>> Finally, the remount checks the readonly bit on the sb against the flag
>>> and sees no change, so it does not run the code intended to run on
>>> ro->rw transitions, leaving BTRFS_FS_OPEN unset.
>>
>>   Originally, the step 'btrfs device add sprout_dev' provided seed
>>   fs writeable without a remount.
> 
> That is very interesting. Do you remember why we changed this behavior
> to the current behavior of leaving the seed readonly until the
> remount/mount cycle?

  So far, I couldn't find the commit that changed it.
  Perhaps Josef may have an idea?

  More below.

>>   I think the btrfs_clear_sb_rdonly(sb) in btrfs_init_new_device()
>>   was part of it.
>>
>>   Removing it doesn't seem to affect the seed-sprout functionality
>>   (did I miss anything?) either the -o remount,rw
>>   or mount recycle will get it writeable.
> 
> My current understanding (probably flawed..):
> before this patch: we clear the rdonly bit, but the fs is still readonly
> until remount (should figure out exactly how)
> after this patch: behavior is the same, except the rdonly bit gets
> cleared along with the mounting, not the device add.
> 
>>
>>> As a result, when the cleaner_kthread runs, it sees no BTRFS_FS_OPEN and
>>> does no work. This results in leaking deleted snapshots until we run out
>>> of space.
>>
>>
>>> I propose fixing it at the first departure from what feels reasonable:
>>> when we clear the readonly bit on the sb during device add. I have a
>>> reproducer of the issue here:
>>> https://raw.githubusercontent.com/boryas/scripts/main/sh/seed/mkseed.sh
>>> and confirm that this patch fixes it, and seems to work OK, otherwise. I
>>> will admit that I couldn't dig up the original rationale for clearing
>>> the bit here (it dates back to the original seed/sprout commit without
>>> explicit explanation) so it's hard to imagine all the ramifications of
>>> the change.
>>
>>   We got fstests -g seed to test the seed-sprout stuff. Your test case
>>   here fits in it. IMO.
> 
> Thanks for the tip, I'll add it there, regardless of how we fix it.
> 
>>
>> Thanks, Anand
>>
>>


>>> Signed-off-by: Boris Burkov <boris@bur.io>
>>> ---
>>>    fs/btrfs/volumes.c | 4 ----
>>>    1 file changed, 4 deletions(-)
>>>
>>> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
>>> index 3fd17e87815a..75d7eeb26fe6 100644
>>> --- a/fs/btrfs/volumes.c
>>> +++ b/fs/btrfs/volumes.c
>>> @@ -2675,8 +2675,6 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path
>>>    	set_blocksize(device->bdev, BTRFS_BDEV_BLOCKSIZE);
>>>    	if (seeding_dev) {
>>> -		btrfs_clear_sb_rdonly(sb);
>>> -
>>>    		/* GFP_KERNEL allocation must not be under device_list_mutex */
>>>    		seed_devices = btrfs_init_sprout(fs_info);
>>>    		if (IS_ERR(seed_devices)) {
>>> @@ -2819,8 +2817,6 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path
>>>    	mutex_unlock(&fs_info->chunk_mutex);
>>>    	mutex_unlock(&fs_info->fs_devices->device_list_mutex);
>>>    error_trans:
>>> -	if (seeding_dev)
>>> -		btrfs_set_sb_rdonly(sb);
>>>    	if (trans)
>>>    		btrfs_end_transaction(trans);
>>>    error_free_zone:
>>

This looks good to me. You may add.

Reviewed-by: Anand Jain <anand.jain@oracle.com>

Thanks, Anand
Anand Jain March 28, 2022, 11:11 a.m. UTC | #8
On 24/03/2022 02:16, Boris Burkov wrote:
> On Wed, Mar 23, 2022 at 12:52:15AM +0000, Naohiro Aota wrote:
>> On Mon, Mar 21, 2022 at 04:56:17PM -0700, Boris Burkov wrote:
>>> If you follow the seed/sprout wiki, it suggests the following workflow:
>>>
>>> btrfstune -S 1 seed_dev > > mount seed_dev mnt
>>> btrfs device add sprout_dev
>>> mount -o remount,rw mnt
>>>
>>> The first mount mounts the FS readonly, which results in not setting
>>> BTRFS_FS_OPEN, and setting the readonly bit on the sb. The device add
>>> somewhat surprisingly clears the readonly bit on the sb (though the
>>> mount is still practically readonly, from the users perspective...).
>>> Finally, the remount checks the readonly bit on the sb against the flag
>>> and sees no change, so it does not run the code intended to run on
>>> ro->rw transitions, leaving BTRFS_FS_OPEN unset.
>>>
>>> As a result, when the cleaner_kthread runs, it sees no BTRFS_FS_OPEN and
>>> does no work. This results in leaking deleted snapshots until we run out
>>> of space.
>>>
>>> I propose fixing it at the first departure from what feels reasonable:
>>> when we clear the readonly bit on the sb during device add. I have a
>>> reproducer of the issue here:
>>> https://raw.githubusercontent.com/boryas/scripts/main/sh/seed/mkseed.sh
>>> and confirm that this patch fixes it, and seems to work OK, otherwise. I
>>> will admit that I couldn't dig up the original rationale for clearing
>>> the bit here (it dates back to the original seed/sprout commit without
>>> explicit explanation) so it's hard to imagine all the ramifications of
>>> the change.
>>>
>>> Signed-off-by: Boris Burkov <boris@bur.io>
>>> ---
>>>   fs/btrfs/volumes.c | 4 ----
>>>   1 file changed, 4 deletions(-)
>>>
>>> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
>>> index 3fd17e87815a..75d7eeb26fe6 100644
>>> --- a/fs/btrfs/volumes.c
>>> +++ b/fs/btrfs/volumes.c
>>> @@ -2675,8 +2675,6 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path
>>>   	set_blocksize(device->bdev, BTRFS_BDEV_BLOCKSIZE);
>>>   
>>>   	if (seeding_dev) {
>>> -		btrfs_clear_sb_rdonly(sb);
>>> -
>>
>> After this line, it updates the metadata e.g, with
>> init_first_rw_device() and writes them out with
>> btrfs_commit_transaction(). Is that OK to do so with the SB_RDONLY
>> flag set?
> 

It is ok as the device-add step creates a _new_ sprout filesystem which 
is RW-able. btrfs_setup_sprout() resets the seeding flag.

  super_flags = btrfs_super_flags(disk_super) &
  ~BTRFS_SUPER_FLAG_SEEDING;
  btrfs_set_super_flags(disk_super, super_flags);

Thanks, Anand

> Good question. As far as I can tell, the functions don't explicitly
> check sb_rdonly, though that could be because they expect that to be
> checked before you ever try to commit a transaction, for example..
> 
> If there is an issue, it's probably somewhat subtle, because the basic
> behavior does work.
> 
>>
>>>   		/* GFP_KERNEL allocation must not be under device_list_mutex */
>>>   		seed_devices = btrfs_init_sprout(fs_info);
>>>   		if (IS_ERR(seed_devices)) {
>>> @@ -2819,8 +2817,6 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path
>>>   	mutex_unlock(&fs_info->chunk_mutex);
>>>   	mutex_unlock(&fs_info->fs_devices->device_list_mutex);
>>>   error_trans:
>>> -	if (seeding_dev)
>>> -		btrfs_set_sb_rdonly(sb);
>>>   	if (trans)
>>>   		btrfs_end_transaction(trans);
>>>   error_free_zone:
>>> -- 
>>> 2.30.2
>>>
Naohiro Aota March 29, 2022, 4:33 a.m. UTC | #9
On Mon, Mar 28, 2022 at 07:11:39PM +0800, Anand Jain wrote:
> On 24/03/2022 02:16, Boris Burkov wrote:
> > On Wed, Mar 23, 2022 at 12:52:15AM +0000, Naohiro Aota wrote:
> > > On Mon, Mar 21, 2022 at 04:56:17PM -0700, Boris Burkov wrote:
> > > > If you follow the seed/sprout wiki, it suggests the following workflow:
> > > > 
> > > > btrfstune -S 1 seed_dev > > mount seed_dev mnt
> > > > btrfs device add sprout_dev
> > > > mount -o remount,rw mnt
> > > > 
> > > > The first mount mounts the FS readonly, which results in not setting
> > > > BTRFS_FS_OPEN, and setting the readonly bit on the sb. The device add
> > > > somewhat surprisingly clears the readonly bit on the sb (though the
> > > > mount is still practically readonly, from the users perspective...).
> > > > Finally, the remount checks the readonly bit on the sb against the flag
> > > > and sees no change, so it does not run the code intended to run on
> > > > ro->rw transitions, leaving BTRFS_FS_OPEN unset.
> > > > 
> > > > As a result, when the cleaner_kthread runs, it sees no BTRFS_FS_OPEN and
> > > > does no work. This results in leaking deleted snapshots until we run out
> > > > of space.
> > > > 
> > > > I propose fixing it at the first departure from what feels reasonable:
> > > > when we clear the readonly bit on the sb during device add. I have a
> > > > reproducer of the issue here:
> > > > https://raw.githubusercontent.com/boryas/scripts/main/sh/seed/mkseed.sh
> > > > and confirm that this patch fixes it, and seems to work OK, otherwise. I
> > > > will admit that I couldn't dig up the original rationale for clearing
> > > > the bit here (it dates back to the original seed/sprout commit without
> > > > explicit explanation) so it's hard to imagine all the ramifications of
> > > > the change.
> > > > 
> > > > Signed-off-by: Boris Burkov <boris@bur.io>
> > > > ---
> > > >   fs/btrfs/volumes.c | 4 ----
> > > >   1 file changed, 4 deletions(-)
> > > > 
> > > > diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> > > > index 3fd17e87815a..75d7eeb26fe6 100644
> > > > --- a/fs/btrfs/volumes.c
> > > > +++ b/fs/btrfs/volumes.c
> > > > @@ -2675,8 +2675,6 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path
> > > >   	set_blocksize(device->bdev, BTRFS_BDEV_BLOCKSIZE);
> > > >   	if (seeding_dev) {
> > > > -		btrfs_clear_sb_rdonly(sb);
> > > > -
> > > 
> > > After this line, it updates the metadata e.g, with
> > > init_first_rw_device() and writes them out with
> > > btrfs_commit_transaction(). Is that OK to do so with the SB_RDONLY
> > > flag set?
> > 
> 
> It is ok as the device-add step creates a _new_ sprout filesystem which is
> RW-able. btrfs_setup_sprout() resets the seeding flag.
> 
>  super_flags = btrfs_super_flags(disk_super) &
>  ~BTRFS_SUPER_FLAG_SEEDING;
>  btrfs_set_super_flags(disk_super, super_flags);
> 
> Thanks, Anand

Yeah, I see that point. I'm concerned about an interaction with the
VFS code. With this patch applied, it is going to do file-system
internal writes (updating the trees and transaction commit) with the
SB_RDONLY flag set. Doesn't it break the current and future assumptions
of the VFS side?

But, it's just a concern. It might not be a bid deal.

> > Good question. As far as I can tell, the functions don't explicitly
> > check sb_rdonly, though that could be because they expect that to be
> > checked before you ever try to commit a transaction, for example..
> > 
> > If there is an issue, it's probably somewhat subtle, because the basic
> > behavior does work.
> > 
> > > 
> > > >   		/* GFP_KERNEL allocation must not be under device_list_mutex */
> > > >   		seed_devices = btrfs_init_sprout(fs_info);
> > > >   		if (IS_ERR(seed_devices)) {
> > > > @@ -2819,8 +2817,6 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path
> > > >   	mutex_unlock(&fs_info->chunk_mutex);
> > > >   	mutex_unlock(&fs_info->fs_devices->device_list_mutex);
> > > >   error_trans:
> > > > -	if (seeding_dev)
> > > > -		btrfs_set_sb_rdonly(sb);
> > > >   	if (trans)
> > > >   		btrfs_end_transaction(trans);
> > > >   error_free_zone:
> > > > -- 
> > > > 2.30.2
> > > > 
>
Boris Burkov March 29, 2022, 7:45 p.m. UTC | #10
On Tue, Mar 29, 2022 at 04:33:21AM +0000, Naohiro Aota wrote:
> On Mon, Mar 28, 2022 at 07:11:39PM +0800, Anand Jain wrote:
> > On 24/03/2022 02:16, Boris Burkov wrote:
> > > On Wed, Mar 23, 2022 at 12:52:15AM +0000, Naohiro Aota wrote:
> > > > On Mon, Mar 21, 2022 at 04:56:17PM -0700, Boris Burkov wrote:
> > > > > If you follow the seed/sprout wiki, it suggests the following workflow:
> > > > > 
> > > > > btrfstune -S 1 seed_dev > > mount seed_dev mnt
> > > > > btrfs device add sprout_dev
> > > > > mount -o remount,rw mnt
> > > > > 
> > > > > The first mount mounts the FS readonly, which results in not setting
> > > > > BTRFS_FS_OPEN, and setting the readonly bit on the sb. The device add
> > > > > somewhat surprisingly clears the readonly bit on the sb (though the
> > > > > mount is still practically readonly, from the users perspective...).
> > > > > Finally, the remount checks the readonly bit on the sb against the flag
> > > > > and sees no change, so it does not run the code intended to run on
> > > > > ro->rw transitions, leaving BTRFS_FS_OPEN unset.
> > > > > 
> > > > > As a result, when the cleaner_kthread runs, it sees no BTRFS_FS_OPEN and
> > > > > does no work. This results in leaking deleted snapshots until we run out
> > > > > of space.
> > > > > 
> > > > > I propose fixing it at the first departure from what feels reasonable:
> > > > > when we clear the readonly bit on the sb during device add. I have a
> > > > > reproducer of the issue here:
> > > > > https://raw.githubusercontent.com/boryas/scripts/main/sh/seed/mkseed.sh
> > > > > and confirm that this patch fixes it, and seems to work OK, otherwise. I
> > > > > will admit that I couldn't dig up the original rationale for clearing
> > > > > the bit here (it dates back to the original seed/sprout commit without
> > > > > explicit explanation) so it's hard to imagine all the ramifications of
> > > > > the change.
> > > > > 
> > > > > Signed-off-by: Boris Burkov <boris@bur.io>
> > > > > ---
> > > > >   fs/btrfs/volumes.c | 4 ----
> > > > >   1 file changed, 4 deletions(-)
> > > > > 
> > > > > diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> > > > > index 3fd17e87815a..75d7eeb26fe6 100644
> > > > > --- a/fs/btrfs/volumes.c
> > > > > +++ b/fs/btrfs/volumes.c
> > > > > @@ -2675,8 +2675,6 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path
> > > > >   	set_blocksize(device->bdev, BTRFS_BDEV_BLOCKSIZE);
> > > > >   	if (seeding_dev) {
> > > > > -		btrfs_clear_sb_rdonly(sb);
> > > > > -
> > > > 
> > > > After this line, it updates the metadata e.g, with
> > > > init_first_rw_device() and writes them out with
> > > > btrfs_commit_transaction(). Is that OK to do so with the SB_RDONLY
> > > > flag set?
> > > 
> > 
> > It is ok as the device-add step creates a _new_ sprout filesystem which is
> > RW-able. btrfs_setup_sprout() resets the seeding flag.
> > 
> >  super_flags = btrfs_super_flags(disk_super) &
> >  ~BTRFS_SUPER_FLAG_SEEDING;
> >  btrfs_set_super_flags(disk_super, super_flags);
> > 
> > Thanks, Anand
> 
> Yeah, I see that point. I'm concerned about an interaction with the
> VFS code. With this patch applied, it is going to do file-system
> internal writes (updating the trees and transaction commit) with the
> SB_RDONLY flag set. Doesn't it break the current and future assumptions
> of the VFS side?
> 
> But, it's just a concern. It might not be a bid deal.

I discussed this with Chris and he pointed out an existing case:
log-replay. Log replay runs in mount before the sb_rdonly check and
checks rw_devices, but not the fs readonly bit. It uses transactions and
such to replay the log. This seems similar enough here: the device is rw,
but the fs is ro.

> 
> > > Good question. As far as I can tell, the functions don't explicitly
> > > check sb_rdonly, though that could be because they expect that to be
> > > checked before you ever try to commit a transaction, for example..
> > > 
> > > If there is an issue, it's probably somewhat subtle, because the basic
> > > behavior does work.
> > > 
> > > > 
> > > > >   		/* GFP_KERNEL allocation must not be under device_list_mutex */
> > > > >   		seed_devices = btrfs_init_sprout(fs_info);
> > > > >   		if (IS_ERR(seed_devices)) {
> > > > > @@ -2819,8 +2817,6 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path
> > > > >   	mutex_unlock(&fs_info->chunk_mutex);
> > > > >   	mutex_unlock(&fs_info->fs_devices->device_list_mutex);
> > > > >   error_trans:
> > > > > -	if (seeding_dev)
> > > > > -		btrfs_set_sb_rdonly(sb);
> > > > >   	if (trans)
> > > > >   		btrfs_end_transaction(trans);
> > > > >   error_free_zone:
> > > > > -- 
> > > > > 2.30.2
> > > > > 
> >
diff mbox series

Patch

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 3fd17e87815a..75d7eeb26fe6 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -2675,8 +2675,6 @@  int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path
 	set_blocksize(device->bdev, BTRFS_BDEV_BLOCKSIZE);
 
 	if (seeding_dev) {
-		btrfs_clear_sb_rdonly(sb);
-
 		/* GFP_KERNEL allocation must not be under device_list_mutex */
 		seed_devices = btrfs_init_sprout(fs_info);
 		if (IS_ERR(seed_devices)) {
@@ -2819,8 +2817,6 @@  int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path
 	mutex_unlock(&fs_info->chunk_mutex);
 	mutex_unlock(&fs_info->fs_devices->device_list_mutex);
 error_trans:
-	if (seeding_dev)
-		btrfs_set_sb_rdonly(sb);
 	if (trans)
 		btrfs_end_transaction(trans);
 error_free_zone: