Message ID | 16c05d39566858bb8bc1e03bd19947cf2b601b98.1647906815.git.boris@bur.io (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | btrfs: do not clear read-only when adding sprout device | expand |
On Mon, Mar 21, 2022 at 04:56:17PM -0700, Boris Burkov wrote: > If you follow the seed/sprout wiki, it suggests the following workflow: > > btrfstune -S 1 seed_dev > mount seed_dev mnt > btrfs device add sprout_dev > mount -o remount,rw mnt > > The first mount mounts the FS readonly, which results in not setting > BTRFS_FS_OPEN, and setting the readonly bit on the sb. The device add > somewhat surprisingly clears the readonly bit on the sb (though the > mount is still practically readonly, from the users perspective...). > Finally, the remount checks the readonly bit on the sb against the flag > and sees no change, so it does not run the code intended to run on > ro->rw transitions, leaving BTRFS_FS_OPEN unset. > > As a result, when the cleaner_kthread runs, it sees no BTRFS_FS_OPEN and > does no work. This results in leaking deleted snapshots until we run out > of space. > > I propose fixing it at the first departure from what feels reasonable: > when we clear the readonly bit on the sb during device add. I have a > reproducer of the issue here: > https://raw.githubusercontent.com/boryas/scripts/main/sh/seed/mkseed.sh > and confirm that this patch fixes it, and seems to work OK, otherwise. I > will admit that I couldn't dig up the original rationale for clearing > the bit here (it dates back to the original seed/sprout commit without > explicit explanation) so it's hard to imagine all the ramifications of > the change. > > Signed-off-by: Boris Burkov <boris@bur.io> Can we get this turned into a fstest please? Reviewed-by: Josef Bacik <josef@toxicpanda.com> Thanks, Josef
On Mon, Mar 21, 2022 at 04:56:17PM -0700, Boris Burkov wrote: > If you follow the seed/sprout wiki, it suggests the following workflow: > > btrfstune -S 1 seed_dev > mount seed_dev mnt > btrfs device add sprout_dev > mount -o remount,rw mnt > > The first mount mounts the FS readonly, which results in not setting > BTRFS_FS_OPEN, and setting the readonly bit on the sb. The device add > somewhat surprisingly clears the readonly bit on the sb (though the > mount is still practically readonly, from the users perspective...). > Finally, the remount checks the readonly bit on the sb against the flag > and sees no change, so it does not run the code intended to run on > ro->rw transitions, leaving BTRFS_FS_OPEN unset. > > As a result, when the cleaner_kthread runs, it sees no BTRFS_FS_OPEN and > does no work. This results in leaking deleted snapshots until we run out > of space. > > I propose fixing it at the first departure from what feels reasonable: > when we clear the readonly bit on the sb during device add. I have a > reproducer of the issue here: > https://raw.githubusercontent.com/boryas/scripts/main/sh/seed/mkseed.sh > and confirm that this patch fixes it, and seems to work OK, otherwise. I > will admit that I couldn't dig up the original rationale for clearing > the bit here (it dates back to the original seed/sprout commit without > explicit explanation) so it's hard to imagine all the ramifications of > the change. > > Signed-off-by: Boris Burkov <boris@bur.io> > --- > fs/btrfs/volumes.c | 4 ---- > 1 file changed, 4 deletions(-) > > diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c > index 3fd17e87815a..75d7eeb26fe6 100644 > --- a/fs/btrfs/volumes.c > +++ b/fs/btrfs/volumes.c > @@ -2675,8 +2675,6 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path > set_blocksize(device->bdev, BTRFS_BDEV_BLOCKSIZE); > > if (seeding_dev) { > - btrfs_clear_sb_rdonly(sb); > - After this line, it updates the metadata e.g, with init_first_rw_device() and writes them out with btrfs_commit_transaction(). Is that OK to do so with the SB_RDONLY flag set? > /* GFP_KERNEL allocation must not be under device_list_mutex */ > seed_devices = btrfs_init_sprout(fs_info); > if (IS_ERR(seed_devices)) { > @@ -2819,8 +2817,6 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path > mutex_unlock(&fs_info->chunk_mutex); > mutex_unlock(&fs_info->fs_devices->device_list_mutex); > error_trans: > - if (seeding_dev) > - btrfs_set_sb_rdonly(sb); > if (trans) > btrfs_end_transaction(trans); > error_free_zone: > -- > 2.30.2 >
On 22/03/2022 07:56, Boris Burkov wrote: > If you follow the seed/sprout wiki, it suggests the following workflow: > > btrfstune -S 1 seed_dev > mount seed_dev mnt > btrfs device add sprout_dev > mount -o remount,rw mnt or umount mnt mount sprout mnt > The first mount mounts the FS readonly, which results in not setting > BTRFS_FS_OPEN, and setting the readonly bit on the sb. Why not set the BTRFS_FS_OPEN? @@ -3904,8 +3904,11 @@ int __cold open_ctree(struct super_block *sb, struct btrfs_fs_devices *fs_device goto fail_qgroup; } - if (sb_rdonly(sb)) + if (sb_rdonly(sb)) { + btrfs_set_sb_rdonly(sb); + set_bit(BTRFS_FS_OPEN, &fs_info->flags); goto clear_oneshot; + } ret = btrfs_start_pre_rw_mount(fs_info); if (ret) { > The device add > somewhat surprisingly clears the readonly bit on the sb (though the > mount is still practically readonly, from the users perspective...). > Finally, the remount checks the readonly bit on the sb against the flag > and sees no change, so it does not run the code intended to run on > ro->rw transitions, leaving BTRFS_FS_OPEN unset. Originally, the step 'btrfs device add sprout_dev' provided seed fs writeable without a remount. I think the btrfs_clear_sb_rdonly(sb) in btrfs_init_new_device() was part of it. Removing it doesn't seem to affect the seed-sprout functionality (did I miss anything?) either the -o remount,rw or mount recycle will get it writeable. > As a result, when the cleaner_kthread runs, it sees no BTRFS_FS_OPEN and > does no work. This results in leaking deleted snapshots until we run out > of space. > I propose fixing it at the first departure from what feels reasonable: > when we clear the readonly bit on the sb during device add. I have a > reproducer of the issue here: > https://raw.githubusercontent.com/boryas/scripts/main/sh/seed/mkseed.sh > and confirm that this patch fixes it, and seems to work OK, otherwise. I > will admit that I couldn't dig up the original rationale for clearing > the bit here (it dates back to the original seed/sprout commit without > explicit explanation) so it's hard to imagine all the ramifications of > the change. We got fstests -g seed to test the seed-sprout stuff. Your test case here fits in it. IMO. Thanks, Anand > Signed-off-by: Boris Burkov <boris@bur.io> > --- > fs/btrfs/volumes.c | 4 ---- > 1 file changed, 4 deletions(-) > > diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c > index 3fd17e87815a..75d7eeb26fe6 100644 > --- a/fs/btrfs/volumes.c > +++ b/fs/btrfs/volumes.c > @@ -2675,8 +2675,6 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path > set_blocksize(device->bdev, BTRFS_BDEV_BLOCKSIZE); > > if (seeding_dev) { > - btrfs_clear_sb_rdonly(sb); > - > /* GFP_KERNEL allocation must not be under device_list_mutex */ > seed_devices = btrfs_init_sprout(fs_info); > if (IS_ERR(seed_devices)) { > @@ -2819,8 +2817,6 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path > mutex_unlock(&fs_info->chunk_mutex); > mutex_unlock(&fs_info->fs_devices->device_list_mutex); > error_trans: > - if (seeding_dev) > - btrfs_set_sb_rdonly(sb); > if (trans) > btrfs_end_transaction(trans); > error_free_zone:
On Wed, Mar 23, 2022 at 12:52:15AM +0000, Naohiro Aota wrote: > On Mon, Mar 21, 2022 at 04:56:17PM -0700, Boris Burkov wrote: > > If you follow the seed/sprout wiki, it suggests the following workflow: > > > > btrfstune -S 1 seed_dev > > mount seed_dev mnt > > btrfs device add sprout_dev > > mount -o remount,rw mnt > > > > The first mount mounts the FS readonly, which results in not setting > > BTRFS_FS_OPEN, and setting the readonly bit on the sb. The device add > > somewhat surprisingly clears the readonly bit on the sb (though the > > mount is still practically readonly, from the users perspective...). > > Finally, the remount checks the readonly bit on the sb against the flag > > and sees no change, so it does not run the code intended to run on > > ro->rw transitions, leaving BTRFS_FS_OPEN unset. > > > > As a result, when the cleaner_kthread runs, it sees no BTRFS_FS_OPEN and > > does no work. This results in leaking deleted snapshots until we run out > > of space. > > > > I propose fixing it at the first departure from what feels reasonable: > > when we clear the readonly bit on the sb during device add. I have a > > reproducer of the issue here: > > https://raw.githubusercontent.com/boryas/scripts/main/sh/seed/mkseed.sh > > and confirm that this patch fixes it, and seems to work OK, otherwise. I > > will admit that I couldn't dig up the original rationale for clearing > > the bit here (it dates back to the original seed/sprout commit without > > explicit explanation) so it's hard to imagine all the ramifications of > > the change. > > > > Signed-off-by: Boris Burkov <boris@bur.io> > > --- > > fs/btrfs/volumes.c | 4 ---- > > 1 file changed, 4 deletions(-) > > > > diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c > > index 3fd17e87815a..75d7eeb26fe6 100644 > > --- a/fs/btrfs/volumes.c > > +++ b/fs/btrfs/volumes.c > > @@ -2675,8 +2675,6 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path > > set_blocksize(device->bdev, BTRFS_BDEV_BLOCKSIZE); > > > > if (seeding_dev) { > > - btrfs_clear_sb_rdonly(sb); > > - > > After this line, it updates the metadata e.g, with > init_first_rw_device() and writes them out with > btrfs_commit_transaction(). Is that OK to do so with the SB_RDONLY > flag set? Good question. As far as I can tell, the functions don't explicitly check sb_rdonly, though that could be because they expect that to be checked before you ever try to commit a transaction, for example.. If there is an issue, it's probably somewhat subtle, because the basic behavior does work. > > > /* GFP_KERNEL allocation must not be under device_list_mutex */ > > seed_devices = btrfs_init_sprout(fs_info); > > if (IS_ERR(seed_devices)) { > > @@ -2819,8 +2817,6 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path > > mutex_unlock(&fs_info->chunk_mutex); > > mutex_unlock(&fs_info->fs_devices->device_list_mutex); > > error_trans: > > - if (seeding_dev) > > - btrfs_set_sb_rdonly(sb); > > if (trans) > > btrfs_end_transaction(trans); > > error_free_zone: > > -- > > 2.30.2 > >
On Wed, Mar 23, 2022 at 06:44:42PM +0800, Anand Jain wrote: > On 22/03/2022 07:56, Boris Burkov wrote: > > If you follow the seed/sprout wiki, it suggests the following workflow: > > > > btrfstune -S 1 seed_dev > > mount seed_dev mnt > > btrfs device add sprout_dev > > > mount -o remount,rw mnt > or > umount mnt > mount sprout mnt Agreed. FWIW, I tested that umount/mount is not vulnerable to the bug. However, a user might be using one of the documented workflows anyway, and would need it for an fs they add a sprout to while it is in use. > > > The first mount mounts the FS readonly, which results in not setting > > BTRFS_FS_OPEN, and setting the readonly bit on the sb. > > Why not set the BTRFS_FS_OPEN? > One reason is that there is other logic that runs when transitioning from ro->rw in remount besides just setting BTRFS_FS_OPEN. By not improperly clearing ro, we let that logic do its thing naturally. > @@ -3904,8 +3904,11 @@ int __cold open_ctree(struct super_block *sb, struct > btrfs_fs_devices *fs_device > goto fail_qgroup; > } > > - if (sb_rdonly(sb)) > + if (sb_rdonly(sb)) { > + btrfs_set_sb_rdonly(sb); > + set_bit(BTRFS_FS_OPEN, &fs_info->flags); > goto clear_oneshot; > + } > > ret = btrfs_start_pre_rw_mount(fs_info); > if (ret) { > > > The device add > > somewhat surprisingly clears the readonly bit on the sb (though the > > mount is still practically readonly, from the users perspective...). > > Finally, the remount checks the readonly bit on the sb against the flag > > and sees no change, so it does not run the code intended to run on > > ro->rw transitions, leaving BTRFS_FS_OPEN unset. > > Originally, the step 'btrfs device add sprout_dev' provided seed > fs writeable without a remount. That is very interesting. Do you remember why we changed this behavior to the current behavior of leaving the seed readonly until the remount/mount cycle? > > I think the btrfs_clear_sb_rdonly(sb) in btrfs_init_new_device() > was part of it. > > Removing it doesn't seem to affect the seed-sprout functionality > (did I miss anything?) either the -o remount,rw > or mount recycle will get it writeable. My current understanding (probably flawed..): before this patch: we clear the rdonly bit, but the fs is still readonly until remount (should figure out exactly how) after this patch: behavior is the same, except the rdonly bit gets cleared along with the mounting, not the device add. > > > As a result, when the cleaner_kthread runs, it sees no BTRFS_FS_OPEN and > > does no work. This results in leaking deleted snapshots until we run out > > of space. > > > > I propose fixing it at the first departure from what feels reasonable: > > when we clear the readonly bit on the sb during device add. I have a > > reproducer of the issue here: > > https://raw.githubusercontent.com/boryas/scripts/main/sh/seed/mkseed.sh > > and confirm that this patch fixes it, and seems to work OK, otherwise. I > > will admit that I couldn't dig up the original rationale for clearing > > the bit here (it dates back to the original seed/sprout commit without > > explicit explanation) so it's hard to imagine all the ramifications of > > the change. > > We got fstests -g seed to test the seed-sprout stuff. Your test case > here fits in it. IMO. Thanks for the tip, I'll add it there, regardless of how we fix it. > > Thanks, Anand > > > > Signed-off-by: Boris Burkov <boris@bur.io> > > --- > > fs/btrfs/volumes.c | 4 ---- > > 1 file changed, 4 deletions(-) > > > > diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c > > index 3fd17e87815a..75d7eeb26fe6 100644 > > --- a/fs/btrfs/volumes.c > > +++ b/fs/btrfs/volumes.c > > @@ -2675,8 +2675,6 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path > > set_blocksize(device->bdev, BTRFS_BDEV_BLOCKSIZE); > > if (seeding_dev) { > > - btrfs_clear_sb_rdonly(sb); > > - > > /* GFP_KERNEL allocation must not be under device_list_mutex */ > > seed_devices = btrfs_init_sprout(fs_info); > > if (IS_ERR(seed_devices)) { > > @@ -2819,8 +2817,6 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path > > mutex_unlock(&fs_info->chunk_mutex); > > mutex_unlock(&fs_info->fs_devices->device_list_mutex); > > error_trans: > > - if (seeding_dev) > > - btrfs_set_sb_rdonly(sb); > > if (trans) > > btrfs_end_transaction(trans); > > error_free_zone: >
On Wed, Mar 23, 2022 at 06:44:42PM +0800, Anand Jain wrote: > On 22/03/2022 07:56, Boris Burkov wrote: > > If you follow the seed/sprout wiki, it suggests the following workflow: > > > > btrfstune -S 1 seed_dev > > mount seed_dev mnt > > btrfs device add sprout_dev > > > mount -o remount,rw mnt > or > umount mnt > mount sprout mnt > > > The first mount mounts the FS readonly, which results in not setting > > BTRFS_FS_OPEN, and setting the readonly bit on the sb. > > Why not set the BTRFS_FS_OPEN? > > @@ -3904,8 +3904,11 @@ int __cold open_ctree(struct super_block *sb, struct > btrfs_fs_devices *fs_device > goto fail_qgroup; > } > > - if (sb_rdonly(sb)) > + if (sb_rdonly(sb)) { > + btrfs_set_sb_rdonly(sb); > + set_bit(BTRFS_FS_OPEN, &fs_info->flags); > goto clear_oneshot; > + } > > ret = btrfs_start_pre_rw_mount(fs_info); > if (ret) { > > > The device add > > somewhat surprisingly clears the readonly bit on the sb (though the > > mount is still practically readonly, from the users perspective...). > > Finally, the remount checks the readonly bit on the sb against the flag > > and sees no change, so it does not run the code intended to run on > > ro->rw transitions, leaving BTRFS_FS_OPEN unset. > > Originally, the step 'btrfs device add sprout_dev' provided seed > fs writeable without a remount. > > I think the btrfs_clear_sb_rdonly(sb) in btrfs_init_new_device() > was part of it. > Yeah this was a bad idea, I don't want to randomly flip a mount from ro->rw without the user telling us to. Thanks, Josef
On 24/03/2022 02:25, Boris Burkov wrote: > On Wed, Mar 23, 2022 at 06:44:42PM +0800, Anand Jain wrote: >> On 22/03/2022 07:56, Boris Burkov wrote: >>> If you follow the seed/sprout wiki, it suggests the following workflow: >>> >>> btrfstune -S 1 seed_dev >>> mount seed_dev mnt >>> btrfs device add sprout_dev >> >>> mount -o remount,rw mnt >> or >> umount mnt >> mount sprout mnt > > Agreed. FWIW, I tested that umount/mount is not vulnerable to the bug. > However, a user might be using one of the documented workflows anyway, > and would need it for an fs they add a sprout to while it is in use. Yep. >> >>> The first mount mounts the FS readonly, which results in not setting >>> BTRFS_FS_OPEN, and setting the readonly bit on the sb. >> >> Why not set the BTRFS_FS_OPEN? >> > > One reason is that there is other logic that runs when transitioning > from ro->rw in remount besides just setting BTRFS_FS_OPEN. By not > improperly clearing ro, we let that logic do its thing naturally. > >> @@ -3904,8 +3904,11 @@ int __cold open_ctree(struct super_block *sb, struct >> btrfs_fs_devices *fs_device >> goto fail_qgroup; >> } >> >> - if (sb_rdonly(sb)) >> + if (sb_rdonly(sb)) { >> + btrfs_set_sb_rdonly(sb); >> + set_bit(BTRFS_FS_OPEN, &fs_info->flags); Looked more deep. No. I misunderstood BTRFS_FS_OPEN, it implies open_ctree() is complete and read-writeable. So this won't work. >> goto clear_oneshot; >> + } >> >> ret = btrfs_start_pre_rw_mount(fs_info); >> if (ret) { >> >>> The device add >>> somewhat surprisingly clears the readonly bit on the sb (though the >>> mount is still practically readonly, from the users perspective...). >>> Finally, the remount checks the readonly bit on the sb against the flag >>> and sees no change, so it does not run the code intended to run on >>> ro->rw transitions, leaving BTRFS_FS_OPEN unset. >> >> Originally, the step 'btrfs device add sprout_dev' provided seed >> fs writeable without a remount. > > That is very interesting. Do you remember why we changed this behavior > to the current behavior of leaving the seed readonly until the > remount/mount cycle? So far, I couldn't find the commit that changed it. Perhaps Josef may have an idea? More below. >> I think the btrfs_clear_sb_rdonly(sb) in btrfs_init_new_device() >> was part of it. >> >> Removing it doesn't seem to affect the seed-sprout functionality >> (did I miss anything?) either the -o remount,rw >> or mount recycle will get it writeable. > > My current understanding (probably flawed..): > before this patch: we clear the rdonly bit, but the fs is still readonly > until remount (should figure out exactly how) > after this patch: behavior is the same, except the rdonly bit gets > cleared along with the mounting, not the device add. > >> >>> As a result, when the cleaner_kthread runs, it sees no BTRFS_FS_OPEN and >>> does no work. This results in leaking deleted snapshots until we run out >>> of space. >> >> >>> I propose fixing it at the first departure from what feels reasonable: >>> when we clear the readonly bit on the sb during device add. I have a >>> reproducer of the issue here: >>> https://raw.githubusercontent.com/boryas/scripts/main/sh/seed/mkseed.sh >>> and confirm that this patch fixes it, and seems to work OK, otherwise. I >>> will admit that I couldn't dig up the original rationale for clearing >>> the bit here (it dates back to the original seed/sprout commit without >>> explicit explanation) so it's hard to imagine all the ramifications of >>> the change. >> >> We got fstests -g seed to test the seed-sprout stuff. Your test case >> here fits in it. IMO. > > Thanks for the tip, I'll add it there, regardless of how we fix it. > >> >> Thanks, Anand >> >> >>> Signed-off-by: Boris Burkov <boris@bur.io> >>> --- >>> fs/btrfs/volumes.c | 4 ---- >>> 1 file changed, 4 deletions(-) >>> >>> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c >>> index 3fd17e87815a..75d7eeb26fe6 100644 >>> --- a/fs/btrfs/volumes.c >>> +++ b/fs/btrfs/volumes.c >>> @@ -2675,8 +2675,6 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path >>> set_blocksize(device->bdev, BTRFS_BDEV_BLOCKSIZE); >>> if (seeding_dev) { >>> - btrfs_clear_sb_rdonly(sb); >>> - >>> /* GFP_KERNEL allocation must not be under device_list_mutex */ >>> seed_devices = btrfs_init_sprout(fs_info); >>> if (IS_ERR(seed_devices)) { >>> @@ -2819,8 +2817,6 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path >>> mutex_unlock(&fs_info->chunk_mutex); >>> mutex_unlock(&fs_info->fs_devices->device_list_mutex); >>> error_trans: >>> - if (seeding_dev) >>> - btrfs_set_sb_rdonly(sb); >>> if (trans) >>> btrfs_end_transaction(trans); >>> error_free_zone: >> This looks good to me. You may add. Reviewed-by: Anand Jain <anand.jain@oracle.com> Thanks, Anand
On 24/03/2022 02:16, Boris Burkov wrote: > On Wed, Mar 23, 2022 at 12:52:15AM +0000, Naohiro Aota wrote: >> On Mon, Mar 21, 2022 at 04:56:17PM -0700, Boris Burkov wrote: >>> If you follow the seed/sprout wiki, it suggests the following workflow: >>> >>> btrfstune -S 1 seed_dev > > mount seed_dev mnt >>> btrfs device add sprout_dev >>> mount -o remount,rw mnt >>> >>> The first mount mounts the FS readonly, which results in not setting >>> BTRFS_FS_OPEN, and setting the readonly bit on the sb. The device add >>> somewhat surprisingly clears the readonly bit on the sb (though the >>> mount is still practically readonly, from the users perspective...). >>> Finally, the remount checks the readonly bit on the sb against the flag >>> and sees no change, so it does not run the code intended to run on >>> ro->rw transitions, leaving BTRFS_FS_OPEN unset. >>> >>> As a result, when the cleaner_kthread runs, it sees no BTRFS_FS_OPEN and >>> does no work. This results in leaking deleted snapshots until we run out >>> of space. >>> >>> I propose fixing it at the first departure from what feels reasonable: >>> when we clear the readonly bit on the sb during device add. I have a >>> reproducer of the issue here: >>> https://raw.githubusercontent.com/boryas/scripts/main/sh/seed/mkseed.sh >>> and confirm that this patch fixes it, and seems to work OK, otherwise. I >>> will admit that I couldn't dig up the original rationale for clearing >>> the bit here (it dates back to the original seed/sprout commit without >>> explicit explanation) so it's hard to imagine all the ramifications of >>> the change. >>> >>> Signed-off-by: Boris Burkov <boris@bur.io> >>> --- >>> fs/btrfs/volumes.c | 4 ---- >>> 1 file changed, 4 deletions(-) >>> >>> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c >>> index 3fd17e87815a..75d7eeb26fe6 100644 >>> --- a/fs/btrfs/volumes.c >>> +++ b/fs/btrfs/volumes.c >>> @@ -2675,8 +2675,6 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path >>> set_blocksize(device->bdev, BTRFS_BDEV_BLOCKSIZE); >>> >>> if (seeding_dev) { >>> - btrfs_clear_sb_rdonly(sb); >>> - >> >> After this line, it updates the metadata e.g, with >> init_first_rw_device() and writes them out with >> btrfs_commit_transaction(). Is that OK to do so with the SB_RDONLY >> flag set? > It is ok as the device-add step creates a _new_ sprout filesystem which is RW-able. btrfs_setup_sprout() resets the seeding flag. super_flags = btrfs_super_flags(disk_super) & ~BTRFS_SUPER_FLAG_SEEDING; btrfs_set_super_flags(disk_super, super_flags); Thanks, Anand > Good question. As far as I can tell, the functions don't explicitly > check sb_rdonly, though that could be because they expect that to be > checked before you ever try to commit a transaction, for example.. > > If there is an issue, it's probably somewhat subtle, because the basic > behavior does work. > >> >>> /* GFP_KERNEL allocation must not be under device_list_mutex */ >>> seed_devices = btrfs_init_sprout(fs_info); >>> if (IS_ERR(seed_devices)) { >>> @@ -2819,8 +2817,6 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path >>> mutex_unlock(&fs_info->chunk_mutex); >>> mutex_unlock(&fs_info->fs_devices->device_list_mutex); >>> error_trans: >>> - if (seeding_dev) >>> - btrfs_set_sb_rdonly(sb); >>> if (trans) >>> btrfs_end_transaction(trans); >>> error_free_zone: >>> -- >>> 2.30.2 >>>
On Mon, Mar 28, 2022 at 07:11:39PM +0800, Anand Jain wrote: > On 24/03/2022 02:16, Boris Burkov wrote: > > On Wed, Mar 23, 2022 at 12:52:15AM +0000, Naohiro Aota wrote: > > > On Mon, Mar 21, 2022 at 04:56:17PM -0700, Boris Burkov wrote: > > > > If you follow the seed/sprout wiki, it suggests the following workflow: > > > > > > > > btrfstune -S 1 seed_dev > > mount seed_dev mnt > > > > btrfs device add sprout_dev > > > > mount -o remount,rw mnt > > > > > > > > The first mount mounts the FS readonly, which results in not setting > > > > BTRFS_FS_OPEN, and setting the readonly bit on the sb. The device add > > > > somewhat surprisingly clears the readonly bit on the sb (though the > > > > mount is still practically readonly, from the users perspective...). > > > > Finally, the remount checks the readonly bit on the sb against the flag > > > > and sees no change, so it does not run the code intended to run on > > > > ro->rw transitions, leaving BTRFS_FS_OPEN unset. > > > > > > > > As a result, when the cleaner_kthread runs, it sees no BTRFS_FS_OPEN and > > > > does no work. This results in leaking deleted snapshots until we run out > > > > of space. > > > > > > > > I propose fixing it at the first departure from what feels reasonable: > > > > when we clear the readonly bit on the sb during device add. I have a > > > > reproducer of the issue here: > > > > https://raw.githubusercontent.com/boryas/scripts/main/sh/seed/mkseed.sh > > > > and confirm that this patch fixes it, and seems to work OK, otherwise. I > > > > will admit that I couldn't dig up the original rationale for clearing > > > > the bit here (it dates back to the original seed/sprout commit without > > > > explicit explanation) so it's hard to imagine all the ramifications of > > > > the change. > > > > > > > > Signed-off-by: Boris Burkov <boris@bur.io> > > > > --- > > > > fs/btrfs/volumes.c | 4 ---- > > > > 1 file changed, 4 deletions(-) > > > > > > > > diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c > > > > index 3fd17e87815a..75d7eeb26fe6 100644 > > > > --- a/fs/btrfs/volumes.c > > > > +++ b/fs/btrfs/volumes.c > > > > @@ -2675,8 +2675,6 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path > > > > set_blocksize(device->bdev, BTRFS_BDEV_BLOCKSIZE); > > > > if (seeding_dev) { > > > > - btrfs_clear_sb_rdonly(sb); > > > > - > > > > > > After this line, it updates the metadata e.g, with > > > init_first_rw_device() and writes them out with > > > btrfs_commit_transaction(). Is that OK to do so with the SB_RDONLY > > > flag set? > > > > It is ok as the device-add step creates a _new_ sprout filesystem which is > RW-able. btrfs_setup_sprout() resets the seeding flag. > > super_flags = btrfs_super_flags(disk_super) & > ~BTRFS_SUPER_FLAG_SEEDING; > btrfs_set_super_flags(disk_super, super_flags); > > Thanks, Anand Yeah, I see that point. I'm concerned about an interaction with the VFS code. With this patch applied, it is going to do file-system internal writes (updating the trees and transaction commit) with the SB_RDONLY flag set. Doesn't it break the current and future assumptions of the VFS side? But, it's just a concern. It might not be a bid deal. > > Good question. As far as I can tell, the functions don't explicitly > > check sb_rdonly, though that could be because they expect that to be > > checked before you ever try to commit a transaction, for example.. > > > > If there is an issue, it's probably somewhat subtle, because the basic > > behavior does work. > > > > > > > > > /* GFP_KERNEL allocation must not be under device_list_mutex */ > > > > seed_devices = btrfs_init_sprout(fs_info); > > > > if (IS_ERR(seed_devices)) { > > > > @@ -2819,8 +2817,6 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path > > > > mutex_unlock(&fs_info->chunk_mutex); > > > > mutex_unlock(&fs_info->fs_devices->device_list_mutex); > > > > error_trans: > > > > - if (seeding_dev) > > > > - btrfs_set_sb_rdonly(sb); > > > > if (trans) > > > > btrfs_end_transaction(trans); > > > > error_free_zone: > > > > -- > > > > 2.30.2 > > > > >
On Tue, Mar 29, 2022 at 04:33:21AM +0000, Naohiro Aota wrote: > On Mon, Mar 28, 2022 at 07:11:39PM +0800, Anand Jain wrote: > > On 24/03/2022 02:16, Boris Burkov wrote: > > > On Wed, Mar 23, 2022 at 12:52:15AM +0000, Naohiro Aota wrote: > > > > On Mon, Mar 21, 2022 at 04:56:17PM -0700, Boris Burkov wrote: > > > > > If you follow the seed/sprout wiki, it suggests the following workflow: > > > > > > > > > > btrfstune -S 1 seed_dev > > mount seed_dev mnt > > > > > btrfs device add sprout_dev > > > > > mount -o remount,rw mnt > > > > > > > > > > The first mount mounts the FS readonly, which results in not setting > > > > > BTRFS_FS_OPEN, and setting the readonly bit on the sb. The device add > > > > > somewhat surprisingly clears the readonly bit on the sb (though the > > > > > mount is still practically readonly, from the users perspective...). > > > > > Finally, the remount checks the readonly bit on the sb against the flag > > > > > and sees no change, so it does not run the code intended to run on > > > > > ro->rw transitions, leaving BTRFS_FS_OPEN unset. > > > > > > > > > > As a result, when the cleaner_kthread runs, it sees no BTRFS_FS_OPEN and > > > > > does no work. This results in leaking deleted snapshots until we run out > > > > > of space. > > > > > > > > > > I propose fixing it at the first departure from what feels reasonable: > > > > > when we clear the readonly bit on the sb during device add. I have a > > > > > reproducer of the issue here: > > > > > https://raw.githubusercontent.com/boryas/scripts/main/sh/seed/mkseed.sh > > > > > and confirm that this patch fixes it, and seems to work OK, otherwise. I > > > > > will admit that I couldn't dig up the original rationale for clearing > > > > > the bit here (it dates back to the original seed/sprout commit without > > > > > explicit explanation) so it's hard to imagine all the ramifications of > > > > > the change. > > > > > > > > > > Signed-off-by: Boris Burkov <boris@bur.io> > > > > > --- > > > > > fs/btrfs/volumes.c | 4 ---- > > > > > 1 file changed, 4 deletions(-) > > > > > > > > > > diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c > > > > > index 3fd17e87815a..75d7eeb26fe6 100644 > > > > > --- a/fs/btrfs/volumes.c > > > > > +++ b/fs/btrfs/volumes.c > > > > > @@ -2675,8 +2675,6 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path > > > > > set_blocksize(device->bdev, BTRFS_BDEV_BLOCKSIZE); > > > > > if (seeding_dev) { > > > > > - btrfs_clear_sb_rdonly(sb); > > > > > - > > > > > > > > After this line, it updates the metadata e.g, with > > > > init_first_rw_device() and writes them out with > > > > btrfs_commit_transaction(). Is that OK to do so with the SB_RDONLY > > > > flag set? > > > > > > > It is ok as the device-add step creates a _new_ sprout filesystem which is > > RW-able. btrfs_setup_sprout() resets the seeding flag. > > > > super_flags = btrfs_super_flags(disk_super) & > > ~BTRFS_SUPER_FLAG_SEEDING; > > btrfs_set_super_flags(disk_super, super_flags); > > > > Thanks, Anand > > Yeah, I see that point. I'm concerned about an interaction with the > VFS code. With this patch applied, it is going to do file-system > internal writes (updating the trees and transaction commit) with the > SB_RDONLY flag set. Doesn't it break the current and future assumptions > of the VFS side? > > But, it's just a concern. It might not be a bid deal. I discussed this with Chris and he pointed out an existing case: log-replay. Log replay runs in mount before the sb_rdonly check and checks rw_devices, but not the fs readonly bit. It uses transactions and such to replay the log. This seems similar enough here: the device is rw, but the fs is ro. > > > > Good question. As far as I can tell, the functions don't explicitly > > > check sb_rdonly, though that could be because they expect that to be > > > checked before you ever try to commit a transaction, for example.. > > > > > > If there is an issue, it's probably somewhat subtle, because the basic > > > behavior does work. > > > > > > > > > > > > /* GFP_KERNEL allocation must not be under device_list_mutex */ > > > > > seed_devices = btrfs_init_sprout(fs_info); > > > > > if (IS_ERR(seed_devices)) { > > > > > @@ -2819,8 +2817,6 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path > > > > > mutex_unlock(&fs_info->chunk_mutex); > > > > > mutex_unlock(&fs_info->fs_devices->device_list_mutex); > > > > > error_trans: > > > > > - if (seeding_dev) > > > > > - btrfs_set_sb_rdonly(sb); > > > > > if (trans) > > > > > btrfs_end_transaction(trans); > > > > > error_free_zone: > > > > > -- > > > > > 2.30.2 > > > > > > >
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c index 3fd17e87815a..75d7eeb26fe6 100644 --- a/fs/btrfs/volumes.c +++ b/fs/btrfs/volumes.c @@ -2675,8 +2675,6 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path set_blocksize(device->bdev, BTRFS_BDEV_BLOCKSIZE); if (seeding_dev) { - btrfs_clear_sb_rdonly(sb); - /* GFP_KERNEL allocation must not be under device_list_mutex */ seed_devices = btrfs_init_sprout(fs_info); if (IS_ERR(seed_devices)) { @@ -2819,8 +2817,6 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path mutex_unlock(&fs_info->chunk_mutex); mutex_unlock(&fs_info->fs_devices->device_list_mutex); error_trans: - if (seeding_dev) - btrfs_set_sb_rdonly(sb); if (trans) btrfs_end_transaction(trans); error_free_zone:
If you follow the seed/sprout wiki, it suggests the following workflow: btrfstune -S 1 seed_dev mount seed_dev mnt btrfs device add sprout_dev mount -o remount,rw mnt The first mount mounts the FS readonly, which results in not setting BTRFS_FS_OPEN, and setting the readonly bit on the sb. The device add somewhat surprisingly clears the readonly bit on the sb (though the mount is still practically readonly, from the users perspective...). Finally, the remount checks the readonly bit on the sb against the flag and sees no change, so it does not run the code intended to run on ro->rw transitions, leaving BTRFS_FS_OPEN unset. As a result, when the cleaner_kthread runs, it sees no BTRFS_FS_OPEN and does no work. This results in leaking deleted snapshots until we run out of space. I propose fixing it at the first departure from what feels reasonable: when we clear the readonly bit on the sb during device add. I have a reproducer of the issue here: https://raw.githubusercontent.com/boryas/scripts/main/sh/seed/mkseed.sh and confirm that this patch fixes it, and seems to work OK, otherwise. I will admit that I couldn't dig up the original rationale for clearing the bit here (it dates back to the original seed/sprout commit without explicit explanation) so it's hard to imagine all the ramifications of the change. Signed-off-by: Boris Burkov <boris@bur.io> --- fs/btrfs/volumes.c | 4 ---- 1 file changed, 4 deletions(-)