diff mbox series

btrfs: do not clear read-only when adding sprout device

Message ID a9aa42f6bc2739ab46ce015f749e15177f8730d6.1729028033.git.boris@bur.io (mailing list archive)
State New
Headers show
Series btrfs: do not clear read-only when adding sprout device | expand

Commit Message

Boris Burkov Oct. 15, 2024, 9:38 p.m. UTC
If you follow the seed/sprout wiki, it suggests the following workflow:

btrfstune -S 1 seed_dev
mount seed_dev mnt
btrfs device add sprout_dev
mount -o remount,rw mnt

The first mount mounts the FS readonly, which results in not setting
BTRFS_FS_OPEN, and setting the readonly bit on the sb. The device add
somewhat surprisingly clears the readonly bit on the sb (though the
mount is still practically readonly, from the users perspective...).
Finally, the remount checks the readonly bit on the sb against the flag
and sees no change, so it does not run the code intended to run on
ro->rw transitions, leaving BTRFS_FS_OPEN unset.

As a result, when the cleaner_kthread runs, it sees no BTRFS_FS_OPEN and
does no work. This results in leaking deleted snapshots until we run out
of space.

I propose fixing it at the first departure from what feels reasonable:
when we clear the readonly bit on the sb during device add.

A new fstest I have written reproduces the bug and confirms the fix.

Signed-off-by: Boris Burkov <boris@bur.io>
---
Note that this is a resend of an old unmerged fix:
https://lore.kernel.org/linux-btrfs/16c05d39566858bb8bc1e03bd19947cf2b601b98.1647906815.git.boris@bur.io/T/#u
Some other ideas for fixing it by modifying how we set BTRFS_FS_OPEN
were also explored but not merged around that time:
https://lore.kernel.org/linux-btrfs/cover.1654216941.git.anand.jain@oracle.com/

I don't have a strong preference, but I would really like to see this
trivial bug fixed. For what it is worth, we have been carrying this
patch internally at Meta since I first sent it with no incident.
---
 fs/btrfs/volumes.c | 4 ----
 1 file changed, 4 deletions(-)

Comments

Qu Wenruo Oct. 15, 2024, 10 p.m. UTC | #1
在 2024/10/16 08:08, Boris Burkov 写道:
> If you follow the seed/sprout wiki, it suggests the following workflow:
>
> btrfstune -S 1 seed_dev
> mount seed_dev mnt
> btrfs device add sprout_dev
> mount -o remount,rw mnt
>
> The first mount mounts the FS readonly, which results in not setting
> BTRFS_FS_OPEN, and setting the readonly bit on the sb. The device add
> somewhat surprisingly clears the readonly bit on the sb (though the
> mount is still practically readonly, from the users perspective...).
> Finally, the remount checks the readonly bit on the sb against the flag
> and sees no change, so it does not run the code intended to run on
> ro->rw transitions, leaving BTRFS_FS_OPEN unset.
>
> As a result, when the cleaner_kthread runs, it sees no BTRFS_FS_OPEN and
> does no work. This results in leaking deleted snapshots until we run out
> of space.
>
> I propose fixing it at the first departure from what feels reasonable:
> when we clear the readonly bit on the sb during device add.
>
> A new fstest I have written reproduces the bug and confirms the fix.
>
> Signed-off-by: Boris Burkov <boris@bur.io>

The fix looks good to me, small and keeps the super block ro flag
consistent.

IIRC the old behavior of sprout is, adding device will immediately mark
the fs RW, which is a big surprise changing the RO/RW status.

So the extra Rw remount requirement looks very reasonable to me.

Reviewed-by: Qu Wenruo <wqu@suse.com>

Thanks,
Qu

> ---
> Note that this is a resend of an old unmerged fix:
> https://lore.kernel.org/linux-btrfs/16c05d39566858bb8bc1e03bd19947cf2b601b98.1647906815.git.boris@bur.io/T/#u
> Some other ideas for fixing it by modifying how we set BTRFS_FS_OPEN
> were also explored but not merged around that time:
> https://lore.kernel.org/linux-btrfs/cover.1654216941.git.anand.jain@oracle.com/
>
> I don't have a strong preference, but I would really like to see this
> trivial bug fixed. For what it is worth, we have been carrying this
> patch internally at Meta since I first sent it with no incident.
> ---
>   fs/btrfs/volumes.c | 4 ----
>   1 file changed, 4 deletions(-)
>
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index dc9f54849f39..84e861dcb350 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -2841,8 +2841,6 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path
>   	set_blocksize(device->bdev_file, BTRFS_BDEV_BLOCKSIZE);
>
>   	if (seeding_dev) {
> -		btrfs_clear_sb_rdonly(sb);
> -
>   		/* GFP_KERNEL allocation must not be under device_list_mutex */
>   		seed_devices = btrfs_init_sprout(fs_info);
>   		if (IS_ERR(seed_devices)) {
> @@ -2985,8 +2983,6 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path
>   	mutex_unlock(&fs_info->chunk_mutex);
>   	mutex_unlock(&fs_info->fs_devices->device_list_mutex);
>   error_trans:
> -	if (seeding_dev)
> -		btrfs_set_sb_rdonly(sb);
>   	if (trans)
>   		btrfs_end_transaction(trans);
>   error_free_zone:
Qu Wenruo Oct. 15, 2024, 10:12 p.m. UTC | #2
在 2024/10/16 08:30, Qu Wenruo 写道:
>
>
> 在 2024/10/16 08:08, Boris Burkov 写道:
>> If you follow the seed/sprout wiki, it suggests the following workflow:
>>
>> btrfstune -S 1 seed_dev
>> mount seed_dev mnt
>> btrfs device add sprout_dev
>> mount -o remount,rw mnt
>>
>> The first mount mounts the FS readonly, which results in not setting
>> BTRFS_FS_OPEN, and setting the readonly bit on the sb. The device add
>> somewhat surprisingly clears the readonly bit on the sb (though the
>> mount is still practically readonly, from the users perspective...).
>> Finally, the remount checks the readonly bit on the sb against the flag
>> and sees no change, so it does not run the code intended to run on
>> ro->rw transitions, leaving BTRFS_FS_OPEN unset.
>>
>> As a result, when the cleaner_kthread runs, it sees no BTRFS_FS_OPEN and
>> does no work. This results in leaking deleted snapshots until we run out
>> of space.
>>
>> I propose fixing it at the first departure from what feels reasonable:
>> when we clear the readonly bit on the sb during device add.
>>
>> A new fstest I have written reproduces the bug and confirms the fix.
>>
>> Signed-off-by: Boris Burkov <boris@bur.io>
>
> The fix looks good to me, small and keeps the super block ro flag
> consistent.
>
> IIRC the old behavior of sprout is, adding device will immediately mark
> the fs RW, which is a big surprise changing the RO/RW status.
>
> So the extra Rw remount requirement looks very reasonable to me.

Forgot to mention, although it's a trivial change in the behavior, if we
are determined to go this path, the man page of the "SEEDING DEVICE"
chapter also need to be updated.

Thanks,
Qu
>
> Reviewed-by: Qu Wenruo <wqu@suse.com>
>
> Thanks,
> Qu
>
>> ---
>> Note that this is a resend of an old unmerged fix:
>> https://lore.kernel.org/linux-
>> btrfs/16c05d39566858bb8bc1e03bd19947cf2b601b98.1647906815.git.boris@bur.io/T/#u
>> Some other ideas for fixing it by modifying how we set BTRFS_FS_OPEN
>> were also explored but not merged around that time:
>> https://lore.kernel.org/linux-btrfs/
>> cover.1654216941.git.anand.jain@oracle.com/
>>
>> I don't have a strong preference, but I would really like to see this
>> trivial bug fixed. For what it is worth, we have been carrying this
>> patch internally at Meta since I first sent it with no incident.
>> ---
>>   fs/btrfs/volumes.c | 4 ----
>>   1 file changed, 4 deletions(-)
>>
>> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
>> index dc9f54849f39..84e861dcb350 100644
>> --- a/fs/btrfs/volumes.c
>> +++ b/fs/btrfs/volumes.c
>> @@ -2841,8 +2841,6 @@ int btrfs_init_new_device(struct btrfs_fs_info
>> *fs_info, const char *device_path
>>       set_blocksize(device->bdev_file, BTRFS_BDEV_BLOCKSIZE);
>>
>>       if (seeding_dev) {
>> -        btrfs_clear_sb_rdonly(sb);
>> -
>>           /* GFP_KERNEL allocation must not be under device_list_mutex */
>>           seed_devices = btrfs_init_sprout(fs_info);
>>           if (IS_ERR(seed_devices)) {
>> @@ -2985,8 +2983,6 @@ int btrfs_init_new_device(struct btrfs_fs_info
>> *fs_info, const char *device_path
>>       mutex_unlock(&fs_info->chunk_mutex);
>>       mutex_unlock(&fs_info->fs_devices->device_list_mutex);
>>   error_trans:
>> -    if (seeding_dev)
>> -        btrfs_set_sb_rdonly(sb);
>>       if (trans)
>>           btrfs_end_transaction(trans);
>>   error_free_zone:
>
>
Boris Burkov Oct. 15, 2024, 11:23 p.m. UTC | #3
On Wed, Oct 16, 2024 at 08:42:14AM +1030, Qu Wenruo wrote:
> 
> 
> 在 2024/10/16 08:30, Qu Wenruo 写道:
> > 
> > 
> > 在 2024/10/16 08:08, Boris Burkov 写道:
> > > If you follow the seed/sprout wiki, it suggests the following workflow:
> > > 
> > > btrfstune -S 1 seed_dev
> > > mount seed_dev mnt
> > > btrfs device add sprout_dev
> > > mount -o remount,rw mnt
> > > 
> > > The first mount mounts the FS readonly, which results in not setting
> > > BTRFS_FS_OPEN, and setting the readonly bit on the sb. The device add
> > > somewhat surprisingly clears the readonly bit on the sb (though the
> > > mount is still practically readonly, from the users perspective...).
> > > Finally, the remount checks the readonly bit on the sb against the flag
> > > and sees no change, so it does not run the code intended to run on
> > > ro->rw transitions, leaving BTRFS_FS_OPEN unset.
> > > 
> > > As a result, when the cleaner_kthread runs, it sees no BTRFS_FS_OPEN and
> > > does no work. This results in leaking deleted snapshots until we run out
> > > of space.
> > > 
> > > I propose fixing it at the first departure from what feels reasonable:
> > > when we clear the readonly bit on the sb during device add.
> > > 
> > > A new fstest I have written reproduces the bug and confirms the fix.
> > > 
> > > Signed-off-by: Boris Burkov <boris@bur.io>
> > 
> > The fix looks good to me, small and keeps the super block ro flag
> > consistent.
> > 
> > IIRC the old behavior of sprout is, adding device will immediately mark
> > the fs RW, which is a big surprise changing the RO/RW status.
> > 
> > So the extra Rw remount requirement looks very reasonable to me.
> 
> Forgot to mention, although it's a trivial change in the behavior, if we
> are determined to go this path, the man page of the "SEEDING DEVICE"
> chapter also need to be updated.

I just checked the man page and everything there looks correct, at least
to me. It quite clearly states that after the 'device add' the fs is
ready to be remounted read-write (via cycle mount or remount).

Please let me know if there is some specific update you want me to make
that I missed, though!

BTW, this patch does change the rdonly flag behavior, so I will update
the test to look at that, as you suggested.

> 
> Thanks,
> Qu
> > 
> > Reviewed-by: Qu Wenruo <wqu@suse.com>
> > 
> > Thanks,
> > Qu
> > 
> > > ---
> > > Note that this is a resend of an old unmerged fix:
> > > https://lore.kernel.org/linux-
> > > btrfs/16c05d39566858bb8bc1e03bd19947cf2b601b98.1647906815.git.boris@bur.io/T/#u
> > > Some other ideas for fixing it by modifying how we set BTRFS_FS_OPEN
> > > were also explored but not merged around that time:
> > > https://lore.kernel.org/linux-btrfs/
> > > cover.1654216941.git.anand.jain@oracle.com/
> > > 
> > > I don't have a strong preference, but I would really like to see this
> > > trivial bug fixed. For what it is worth, we have been carrying this
> > > patch internally at Meta since I first sent it with no incident.
> > > ---
> > >   fs/btrfs/volumes.c | 4 ----
> > >   1 file changed, 4 deletions(-)
> > > 
> > > diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> > > index dc9f54849f39..84e861dcb350 100644
> > > --- a/fs/btrfs/volumes.c
> > > +++ b/fs/btrfs/volumes.c
> > > @@ -2841,8 +2841,6 @@ int btrfs_init_new_device(struct btrfs_fs_info
> > > *fs_info, const char *device_path
> > >       set_blocksize(device->bdev_file, BTRFS_BDEV_BLOCKSIZE);
> > > 
> > >       if (seeding_dev) {
> > > -        btrfs_clear_sb_rdonly(sb);
> > > -
> > >           /* GFP_KERNEL allocation must not be under device_list_mutex */
> > >           seed_devices = btrfs_init_sprout(fs_info);
> > >           if (IS_ERR(seed_devices)) {
> > > @@ -2985,8 +2983,6 @@ int btrfs_init_new_device(struct btrfs_fs_info
> > > *fs_info, const char *device_path
> > >       mutex_unlock(&fs_info->chunk_mutex);
> > >       mutex_unlock(&fs_info->fs_devices->device_list_mutex);
> > >   error_trans:
> > > -    if (seeding_dev)
> > > -        btrfs_set_sb_rdonly(sb);
> > >       if (trans)
> > >           btrfs_end_transaction(trans);
> > >   error_free_zone:
> > 
> > 
>
Anand Jain Oct. 16, 2024, 5:14 p.m. UTC | #4
On 16/10/24 05:38, Boris Burkov wrote:
> If you follow the seed/sprout wiki, it suggests the following workflow:
> 
> btrfstune -S 1 seed_dev
> mount seed_dev mnt
> btrfs device add sprout_dev
> mount -o remount,rw mnt
> 



> The first mount mounts the FS readonly, which results in not setting
> BTRFS_FS_OPEN, and setting the readonly bit on the sb. The device add
> somewhat surprisingly clears the readonly bit on the sb (though the
> mount is still practically readonly, from the users perspective...).
> Finally, the remount checks the readonly bit on the sb against the flag
> and sees no change, so it does not run the code intended to run on
> ro->rw transitions, leaving BTRFS_FS_OPEN unset.
> 
> As a result, when the cleaner_kthread runs, it sees no BTRFS_FS_OPEN and
> does no work. This results in leaking deleted snapshots until we run out
> of space.
> 
> I propose fixing it at the first departure from what feels reasonable:
> when we clear the readonly bit on the sb during device add.
> 
> A new fstest I have written reproduces the bug and confirms the fix.
> 
> Signed-off-by: Boris Burkov <boris@bur.io>
> ---
> Note that this is a resend of an old unmerged fix:
> https://lore.kernel.org/linux-btrfs/16c05d39566858bb8bc1e03bd19947cf2b601b98.1647906815.git.boris@bur.io/T/#u
> Some other ideas for fixing it by modifying how we set BTRFS_FS_OPEN
> were also explored but not merged around that time:
> https://lore.kernel.org/linux-btrfs/cover.1654216941.git.anand.jain@oracle.com/
> 
> I don't have a strong preference, but I would really like to see this
> trivial bug fixed. For what it is worth, we have been carrying this
> patch internally at Meta since I first sent it with no incident.
> ---


I remember fixing this before. I tested on 5.15, and the bug isn't
there, but it’s back in 6.10, so something broke in between.
We need to track it down.

The original design (kernel 4.x and below) makes the filesystem switch
to read-write mode after adding a sprout because:

    You can’t add a device to a normal read-only filesystem
  so with seed read-only mount is different.
    With a seed device, adding a writable device transforms
  it into a new read-write filesystem with a _new_ FSID and
  fs_devices. Logically, read-write at this stage makes sense,
  but I’m okay without it and in fact we had fixed this before,
  but a patch somewhere seems to have broken it again.


(Demo below. :<x> is the return code from the 'run' command at
  https://github.com/asj/run.git)


----- 5.15.0-208.159.3.2.el9uek.x86_64 ----
$ mkfs.btrfs -fq /dev/loop0 :0
$ btrfstune -S1 /dev/loop0 :0
$ mount /dev/loop0 /btrfs :0
mount: /btrfs: WARNING: source write-protected, mounted read-only.

$ cat /proc/self/mounts | grep btrfs :0
/dev/loop0 /btrfs btrfs ro,relatime,space_cache=v2,subvolid=5,subvol=/ 0 0

$ findmnt -o SOURCE,UUID /btrfs :0
SOURCE     UUID
/dev/loop0 64f21b87-4e4c-4786-b2cd-c09a5ccd2afa

$ btrfs fi show -m :0
Label: none  uuid: 64f21b87-4e4c-4786-b2cd-c09a5ccd2afa
	Total devices 1 FS bytes used 144.00KiB
	devid    1 size 3.00GiB used 536.00MiB path /dev/loop0

$ ls /sys/fs/btrfs :0
64f21b87-4e4c-4786-b2cd-c09a5ccd2afa
features

$ btrfs dev add -f /dev/loop1 /btrfs :0

# After adding the device, the path and UUID are different,
# so it’s a new filesystem. (But, as I said, I’m fine with
# keeping it read-only and needing remount,rw.

$ cat /proc/self/mounts | grep btrfs :0
/dev/loop1 /btrfs btrfs ro,relatime,space_cache=v2,subvolid=5,subvol=/ 0 0

$ findmnt -o SOURCE,UUID /btrfs :0
SOURCE     UUID
/dev/loop1 948cea35-18db-45da-9ec8-3d46cb5f0413

$ btrfs fi show -m :0
Label: none  uuid: 948cea35-18db-45da-9ec8-3d46cb5f0413
	Total devices 2 FS bytes used 144.00KiB
	devid    1 size 3.00GiB used 520.00MiB path /dev/loop0
	devid    2 size 3.00GiB used 576.00MiB path /dev/loop1


$ ls /sys/fs/btrfs :0
948cea35-18db-45da-9ec8-3d46cb5f0413
features
---------


Thanks, Anand

>   fs/btrfs/volumes.c | 4 ----
>   1 file changed, 4 deletions(-)
> 
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index dc9f54849f39..84e861dcb350 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -2841,8 +2841,6 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path
>   	set_blocksize(device->bdev_file, BTRFS_BDEV_BLOCKSIZE);
>   
>   	if (seeding_dev) {
> -		btrfs_clear_sb_rdonly(sb);
> -
>   		/* GFP_KERNEL allocation must not be under device_list_mutex */
>   		seed_devices = btrfs_init_sprout(fs_info);
>   		if (IS_ERR(seed_devices)) {
> @@ -2985,8 +2983,6 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path
>   	mutex_unlock(&fs_info->chunk_mutex);
>   	mutex_unlock(&fs_info->fs_devices->device_list_mutex);
>   error_trans:
> -	if (seeding_dev)
> -		btrfs_set_sb_rdonly(sb);
>   	if (trans)
>   		btrfs_end_transaction(trans);
>   error_free_zone:
Boris Burkov Oct. 16, 2024, 5:24 p.m. UTC | #5
On Thu, Oct 17, 2024 at 01:14:16AM +0800, Anand Jain wrote:
> On 16/10/24 05:38, Boris Burkov wrote:
> > If you follow the seed/sprout wiki, it suggests the following workflow:
> > 
> > btrfstune -S 1 seed_dev
> > mount seed_dev mnt
> > btrfs device add sprout_dev
> > mount -o remount,rw mnt
> > 
> 
> 
> 
> > The first mount mounts the FS readonly, which results in not setting
> > BTRFS_FS_OPEN, and setting the readonly bit on the sb. The device add
> > somewhat surprisingly clears the readonly bit on the sb (though the
> > mount is still practically readonly, from the users perspective...).
> > Finally, the remount checks the readonly bit on the sb against the flag
> > and sees no change, so it does not run the code intended to run on
> > ro->rw transitions, leaving BTRFS_FS_OPEN unset.
> > 
> > As a result, when the cleaner_kthread runs, it sees no BTRFS_FS_OPEN and
> > does no work. This results in leaking deleted snapshots until we run out
> > of space.
> > 
> > I propose fixing it at the first departure from what feels reasonable:
> > when we clear the readonly bit on the sb during device add.
> > 
> > A new fstest I have written reproduces the bug and confirms the fix.
> > 
> > Signed-off-by: Boris Burkov <boris@bur.io>
> > ---
> > Note that this is a resend of an old unmerged fix:
> > https://lore.kernel.org/linux-btrfs/16c05d39566858bb8bc1e03bd19947cf2b601b98.1647906815.git.boris@bur.io/T/#u
> > Some other ideas for fixing it by modifying how we set BTRFS_FS_OPEN
> > were also explored but not merged around that time:
> > https://lore.kernel.org/linux-btrfs/cover.1654216941.git.anand.jain@oracle.com/
> > 
> > I don't have a strong preference, but I would really like to see this
> > trivial bug fixed. For what it is worth, we have been carrying this
> > patch internally at Meta since I first sent it with no incident.
> > ---
> 
> 
> I remember fixing this before. I tested on 5.15, and the bug isn't
> there, but it’s back in 6.10, so something broke in between.
> We need to track it down.

Thanks for weighing in again, and for re-testing on 5.15. That's
interesting that it broke again. And sorry if I didn't follow the rdonly
check patches properly and those did end up getting merged. Poor code
archaeology on my part :)

At any rate, I have pushed this patch into for-next for now, as I
think it constitutes an improvement without breaking any documented
behavior. If you look into what happened between 5.15 and 6.11 and want
to back it out with a different fix, I will not be offended.

If we also land the fstest I submitted, then hopefully future kernels
will *not* be breaking this again!

Thanks,
Boris

> 
> The original design (kernel 4.x and below) makes the filesystem switch
> to read-write mode after adding a sprout because:
> 
>    You can’t add a device to a normal read-only filesystem
>  so with seed read-only mount is different.
>    With a seed device, adding a writable device transforms
>  it into a new read-write filesystem with a _new_ FSID and
>  fs_devices. Logically, read-write at this stage makes sense,
>  but I’m okay without it and in fact we had fixed this before,
>  but a patch somewhere seems to have broken it again.
> 
> 
> (Demo below. :<x> is the return code from the 'run' command at
>  https://github.com/asj/run.git)
> 
> 
> ----- 5.15.0-208.159.3.2.el9uek.x86_64 ----
> $ mkfs.btrfs -fq /dev/loop0 :0
> $ btrfstune -S1 /dev/loop0 :0
> $ mount /dev/loop0 /btrfs :0
> mount: /btrfs: WARNING: source write-protected, mounted read-only.
> 
> $ cat /proc/self/mounts | grep btrfs :0
> /dev/loop0 /btrfs btrfs ro,relatime,space_cache=v2,subvolid=5,subvol=/ 0 0
> 
> $ findmnt -o SOURCE,UUID /btrfs :0
> SOURCE     UUID
> /dev/loop0 64f21b87-4e4c-4786-b2cd-c09a5ccd2afa
> 
> $ btrfs fi show -m :0
> Label: none  uuid: 64f21b87-4e4c-4786-b2cd-c09a5ccd2afa
> 	Total devices 1 FS bytes used 144.00KiB
> 	devid    1 size 3.00GiB used 536.00MiB path /dev/loop0
> 
> $ ls /sys/fs/btrfs :0
> 64f21b87-4e4c-4786-b2cd-c09a5ccd2afa
> features
> 
> $ btrfs dev add -f /dev/loop1 /btrfs :0
> 
> # After adding the device, the path and UUID are different,
> # so it’s a new filesystem. (But, as I said, I’m fine with
> # keeping it read-only and needing remount,rw.
> 
> $ cat /proc/self/mounts | grep btrfs :0
> /dev/loop1 /btrfs btrfs ro,relatime,space_cache=v2,subvolid=5,subvol=/ 0 0
> 
> $ findmnt -o SOURCE,UUID /btrfs :0
> SOURCE     UUID
> /dev/loop1 948cea35-18db-45da-9ec8-3d46cb5f0413
> 
> $ btrfs fi show -m :0
> Label: none  uuid: 948cea35-18db-45da-9ec8-3d46cb5f0413
> 	Total devices 2 FS bytes used 144.00KiB
> 	devid    1 size 3.00GiB used 520.00MiB path /dev/loop0
> 	devid    2 size 3.00GiB used 576.00MiB path /dev/loop1
> 
> 
> $ ls /sys/fs/btrfs :0
> 948cea35-18db-45da-9ec8-3d46cb5f0413
> features
> ---------
> 
> 
> Thanks, Anand
> 
> >   fs/btrfs/volumes.c | 4 ----
> >   1 file changed, 4 deletions(-)
> > 
> > diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> > index dc9f54849f39..84e861dcb350 100644
> > --- a/fs/btrfs/volumes.c
> > +++ b/fs/btrfs/volumes.c
> > @@ -2841,8 +2841,6 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path
> >   	set_blocksize(device->bdev_file, BTRFS_BDEV_BLOCKSIZE);
> >   	if (seeding_dev) {
> > -		btrfs_clear_sb_rdonly(sb);
> > -
> >   		/* GFP_KERNEL allocation must not be under device_list_mutex */
> >   		seed_devices = btrfs_init_sprout(fs_info);
> >   		if (IS_ERR(seed_devices)) {
> > @@ -2985,8 +2983,6 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path
> >   	mutex_unlock(&fs_info->chunk_mutex);
> >   	mutex_unlock(&fs_info->fs_devices->device_list_mutex);
> >   error_trans:
> > -	if (seeding_dev)
> > -		btrfs_set_sb_rdonly(sb);
> >   	if (trans)
> >   		btrfs_end_transaction(trans);
> >   error_free_zone:
>
David Sterba Oct. 17, 2024, 2:01 p.m. UTC | #6
On Tue, Oct 15, 2024 at 02:38:34PM -0700, Boris Burkov wrote:
> If you follow the seed/sprout wiki, it suggests the following workflow:
> 
> btrfstune -S 1 seed_dev
> mount seed_dev mnt
> btrfs device add sprout_dev
> mount -o remount,rw mnt
> 
> The first mount mounts the FS readonly, which results in not setting
> BTRFS_FS_OPEN, and setting the readonly bit on the sb. The device add
> somewhat surprisingly clears the readonly bit on the sb (though the
> mount is still practically readonly, from the users perspective...).
> Finally, the remount checks the readonly bit on the sb against the flag
> and sees no change, so it does not run the code intended to run on
> ro->rw transitions, leaving BTRFS_FS_OPEN unset.
> 
> As a result, when the cleaner_kthread runs, it sees no BTRFS_FS_OPEN and
> does no work. This results in leaking deleted snapshots until we run out
> of space.
> 
> I propose fixing it at the first departure from what feels reasonable:
> when we clear the readonly bit on the sb during device add.
> 
> A new fstest I have written reproduces the bug and confirms the fix.
> 
> Signed-off-by: Boris Burkov <boris@bur.io>
> ---
> Note that this is a resend of an old unmerged fix:
> https://lore.kernel.org/linux-btrfs/16c05d39566858bb8bc1e03bd19947cf2b601b98.1647906815.git.boris@bur.io/T/#u
> Some other ideas for fixing it by modifying how we set BTRFS_FS_OPEN
> were also explored but not merged around that time:
> https://lore.kernel.org/linux-btrfs/cover.1654216941.git.anand.jain@oracle.com/
> 
> I don't have a strong preference, but I would really like to see this
> trivial bug fixed. For what it is worth, we have been carrying this
> patch internally at Meta since I first sent it with no incident.

We have an unresolved dispute about the fix and now the patch got to
for-next within a few days because it got two reviews but not mine.
The way you use it in Meta works for you but the fix is changing
behaviour so we can either ignore everybody else relying on the old
way or say that seeding is so niche that we don't care and see what we
can do once we get some report.
Boris Burkov Oct. 17, 2024, 4:41 p.m. UTC | #7
On Thu, Oct 17, 2024 at 04:01:12PM +0200, David Sterba wrote:
> On Tue, Oct 15, 2024 at 02:38:34PM -0700, Boris Burkov wrote:
> > If you follow the seed/sprout wiki, it suggests the following workflow:
> > 
> > btrfstune -S 1 seed_dev
> > mount seed_dev mnt
> > btrfs device add sprout_dev
> > mount -o remount,rw mnt
> > 
> > The first mount mounts the FS readonly, which results in not setting
> > BTRFS_FS_OPEN, and setting the readonly bit on the sb. The device add
> > somewhat surprisingly clears the readonly bit on the sb (though the
> > mount is still practically readonly, from the users perspective...).
> > Finally, the remount checks the readonly bit on the sb against the flag
> > and sees no change, so it does not run the code intended to run on
> > ro->rw transitions, leaving BTRFS_FS_OPEN unset.
> > 
> > As a result, when the cleaner_kthread runs, it sees no BTRFS_FS_OPEN and
> > does no work. This results in leaking deleted snapshots until we run out
> > of space.
> > 
> > I propose fixing it at the first departure from what feels reasonable:
> > when we clear the readonly bit on the sb during device add.
> > 
> > A new fstest I have written reproduces the bug and confirms the fix.
> > 
> > Signed-off-by: Boris Burkov <boris@bur.io>
> > ---
> > Note that this is a resend of an old unmerged fix:
> > https://lore.kernel.org/linux-btrfs/16c05d39566858bb8bc1e03bd19947cf2b601b98.1647906815.git.boris@bur.io/T/#u
> > Some other ideas for fixing it by modifying how we set BTRFS_FS_OPEN
> > were also explored but not merged around that time:
> > https://lore.kernel.org/linux-btrfs/cover.1654216941.git.anand.jain@oracle.com/
> > 
> > I don't have a strong preference, but I would really like to see this
> > trivial bug fixed. For what it is worth, we have been carrying this
> > patch internally at Meta since I first sent it with no incident.
> 
> We have an unresolved dispute about the fix and now the patch got to
> for-next within a few days because it got two reviews but not mine.
> The way you use it in Meta works for you but the fix is changing
> behaviour so we can either ignore everybody else relying on the old
> way or say that seeding is so niche that we don't care and see what we
> can do once we get some report.

Please feel free to remove it, and I am happy to review whatever other
fix you think is best. Sorry for rushing, I just wanted to get it done
and out of my head so I could move on to other issues.

I assume your concern is that before this fix, the filesystem is actually
read-write after the device add, which this patch breaks?

My only argument against this is that the documentation has always said
you needed to remount/cycle-mount after adding the sprout, so there is
no fair reason to assume this would work. (In fact, it *doesn't*, the fs
is once again in a unexpected, degraded, state...)

But if existing LiveCD users are relying on this undocumented behavior,
then I think you are right and it's a bad idea to break them.

As long as my test (and presumably some fix) goes in and I don't have to
carry this patch internally for two more years, then I am happy.

Thanks,
Boris
Qu Wenruo Oct. 17, 2024, 8:47 p.m. UTC | #8
在 2024/10/17 03:44, Anand Jain 写道:
> On 16/10/24 05:38, Boris Burkov wrote:
>> If you follow the seed/sprout wiki, it suggests the following workflow:
>>
>> btrfstune -S 1 seed_dev
>> mount seed_dev mnt
>> btrfs device add sprout_dev
>> mount -o remount,rw mnt
>>
>
>
>
>> The first mount mounts the FS readonly, which results in not setting
>> BTRFS_FS_OPEN, and setting the readonly bit on the sb. The device add
>> somewhat surprisingly clears the readonly bit on the sb (though the
>> mount is still practically readonly, from the users perspective...).
>> Finally, the remount checks the readonly bit on the sb against the flag
>> and sees no change, so it does not run the code intended to run on
>> ro->rw transitions, leaving BTRFS_FS_OPEN unset.
>>
>> As a result, when the cleaner_kthread runs, it sees no BTRFS_FS_OPEN and
>> does no work. This results in leaking deleted snapshots until we run out
>> of space.
>>
>> I propose fixing it at the first departure from what feels reasonable:
>> when we clear the readonly bit on the sb during device add.
>>
>> A new fstest I have written reproduces the bug and confirms the fix.
>>
>> Signed-off-by: Boris Burkov <boris@bur.io>
>> ---
>> Note that this is a resend of an old unmerged fix:
>> https://lore.kernel.org/linux-
>> btrfs/16c05d39566858bb8bc1e03bd19947cf2b601b98.1647906815.git.boris@bur.io/T/#u
>> Some other ideas for fixing it by modifying how we set BTRFS_FS_OPEN
>> were also explored but not merged around that time:
>> https://lore.kernel.org/linux-btrfs/
>> cover.1654216941.git.anand.jain@oracle.com/
>>
>> I don't have a strong preference, but I would really like to see this
>> trivial bug fixed. For what it is worth, we have been carrying this
>> patch internally at Meta since I first sent it with no incident.
>> ---
>
>
> I remember fixing this before. I tested on 5.15, and the bug isn't
> there, but it’s back in 6.10, so something broke in between.
> We need to track it down.
>
> The original design (kernel 4.x and below) makes the filesystem switch
> to read-write mode after adding a sprout because:
>
>     You can’t add a device to a normal read-only filesystem
>   so with seed read-only mount is different.
>     With a seed device, adding a writable device transforms
>   it into a new read-write filesystem with a _new_ FSID and
>   fs_devices. Logically, read-write at this stage makes sense,
>   but I’m okay without it and in fact we had fixed this before,
>   but a patch somewhere seems to have broken it again.
>
>
> (Demo below. :<x> is the return code from the 'run' command at
>   https://github.com/asj/run.git)
>
>
> ----- 5.15.0-208.159.3.2.el9uek.x86_64 ----

I also tried it on upstream kernel v5.15.94, the behavior is still the
old changed to RW immediately after device add:

[adam@btrfs-vm ~]$ uname -a
Linux btrfs-vm 5.15.94-1-lts #1 SMP Wed, 15 Feb 2023 07:09:02 +0000
x86_64 GNU/Linux
[adam@btrfs-vm ~]$ sudo mkfs.btrfs  -f /dev/test/scratch1 > /dev/null
[adam@btrfs-vm ~]$ sudo btrfstune -S 1 /dev/test/scratch1
[adam@btrfs-vm ~]$ sudo mount /dev/test/scratch1  /mnt/btrfs/
mount: /mnt/btrfs: WARNING: source write-protected, mounted read-only.
[adam@btrfs-vm ~]$ sudo btrfs device add -f /dev/test/scratch2  /mnt/btrfs/
Performing full device TRIM /dev/test/scratch2 (10.00GiB) ...
[adam@btrfs-vm ~]$ sudo touch /mnt/btrfs/file
[adam@btrfs-vm ~]$ mount | grep mnt/btrfs
/dev/mapper/test-scratch2 on /mnt/btrfs type btrfs
(rw,relatime,space_cache=v2,subvolid=5,subvol=/)

So it looks like it's some extra backports causing the behavior change.

But I still strongly prefer to keep it RO.
Even if it's a different fs under the hood, it still suddenly changes
the RO/RW status of a mount point without letting the user to know.

Thanks,
Qu

> $ mkfs.btrfs -fq /dev/loop0 :0
> $ btrfstune -S1 /dev/loop0 :0
> $ mount /dev/loop0 /btrfs :0
> mount: /btrfs: WARNING: source write-protected, mounted read-only.
>
> $ cat /proc/self/mounts | grep btrfs :0
> /dev/loop0 /btrfs btrfs ro,relatime,space_cache=v2,subvolid=5,subvol=/ 0 0
>
> $ findmnt -o SOURCE,UUID /btrfs :0
> SOURCE     UUID
> /dev/loop0 64f21b87-4e4c-4786-b2cd-c09a5ccd2afa
>
> $ btrfs fi show -m :0
> Label: none  uuid: 64f21b87-4e4c-4786-b2cd-c09a5ccd2afa
>      Total devices 1 FS bytes used 144.00KiB
>      devid    1 size 3.00GiB used 536.00MiB path /dev/loop0
>
> $ ls /sys/fs/btrfs :0
> 64f21b87-4e4c-4786-b2cd-c09a5ccd2afa
> features
>
> $ btrfs dev add -f /dev/loop1 /btrfs :0
>
> # After adding the device, the path and UUID are different,
> # so it’s a new filesystem. (But, as I said, I’m fine with
> # keeping it read-only and needing remount,rw.
>
> $ cat /proc/self/mounts | grep btrfs :0
> /dev/loop1 /btrfs btrfs ro,relatime,space_cache=v2,subvolid=5,subvol=/ 0 0
>
> $ findmnt -o SOURCE,UUID /btrfs :0
> SOURCE     UUID
> /dev/loop1 948cea35-18db-45da-9ec8-3d46cb5f0413
>
> $ btrfs fi show -m :0
> Label: none  uuid: 948cea35-18db-45da-9ec8-3d46cb5f0413
>      Total devices 2 FS bytes used 144.00KiB
>      devid    1 size 3.00GiB used 520.00MiB path /dev/loop0
>      devid    2 size 3.00GiB used 576.00MiB path /dev/loop1
>
>
> $ ls /sys/fs/btrfs :0
> 948cea35-18db-45da-9ec8-3d46cb5f0413
> features
> ---------
>
>
> Thanks, Anand
>
>>   fs/btrfs/volumes.c | 4 ----
>>   1 file changed, 4 deletions(-)
>>
>> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
>> index dc9f54849f39..84e861dcb350 100644
>> --- a/fs/btrfs/volumes.c
>> +++ b/fs/btrfs/volumes.c
>> @@ -2841,8 +2841,6 @@ int btrfs_init_new_device(struct btrfs_fs_info
>> *fs_info, const char *device_path
>>       set_blocksize(device->bdev_file, BTRFS_BDEV_BLOCKSIZE);
>>       if (seeding_dev) {
>> -        btrfs_clear_sb_rdonly(sb);
>> -
>>           /* GFP_KERNEL allocation must not be under device_list_mutex */
>>           seed_devices = btrfs_init_sprout(fs_info);
>>           if (IS_ERR(seed_devices)) {
>> @@ -2985,8 +2983,6 @@ int btrfs_init_new_device(struct btrfs_fs_info
>> *fs_info, const char *device_path
>>       mutex_unlock(&fs_info->chunk_mutex);
>>       mutex_unlock(&fs_info->fs_devices->device_list_mutex);
>>   error_trans:
>> -    if (seeding_dev)
>> -        btrfs_set_sb_rdonly(sb);
>>       if (trans)
>>           btrfs_end_transaction(trans);
>>   error_free_zone:
>
>
diff mbox series

Patch

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index dc9f54849f39..84e861dcb350 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -2841,8 +2841,6 @@  int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path
 	set_blocksize(device->bdev_file, BTRFS_BDEV_BLOCKSIZE);
 
 	if (seeding_dev) {
-		btrfs_clear_sb_rdonly(sb);
-
 		/* GFP_KERNEL allocation must not be under device_list_mutex */
 		seed_devices = btrfs_init_sprout(fs_info);
 		if (IS_ERR(seed_devices)) {
@@ -2985,8 +2983,6 @@  int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path
 	mutex_unlock(&fs_info->chunk_mutex);
 	mutex_unlock(&fs_info->fs_devices->device_list_mutex);
 error_trans:
-	if (seeding_dev)
-		btrfs_set_sb_rdonly(sb);
 	if (trans)
 		btrfs_end_transaction(trans);
 error_free_zone: