Message ID | a9aa42f6bc2739ab46ce015f749e15177f8730d6.1729028033.git.boris@bur.io (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | btrfs: do not clear read-only when adding sprout device | expand |
在 2024/10/16 08:08, Boris Burkov 写道: > If you follow the seed/sprout wiki, it suggests the following workflow: > > btrfstune -S 1 seed_dev > mount seed_dev mnt > btrfs device add sprout_dev > mount -o remount,rw mnt > > The first mount mounts the FS readonly, which results in not setting > BTRFS_FS_OPEN, and setting the readonly bit on the sb. The device add > somewhat surprisingly clears the readonly bit on the sb (though the > mount is still practically readonly, from the users perspective...). > Finally, the remount checks the readonly bit on the sb against the flag > and sees no change, so it does not run the code intended to run on > ro->rw transitions, leaving BTRFS_FS_OPEN unset. > > As a result, when the cleaner_kthread runs, it sees no BTRFS_FS_OPEN and > does no work. This results in leaking deleted snapshots until we run out > of space. > > I propose fixing it at the first departure from what feels reasonable: > when we clear the readonly bit on the sb during device add. > > A new fstest I have written reproduces the bug and confirms the fix. > > Signed-off-by: Boris Burkov <boris@bur.io> The fix looks good to me, small and keeps the super block ro flag consistent. IIRC the old behavior of sprout is, adding device will immediately mark the fs RW, which is a big surprise changing the RO/RW status. So the extra Rw remount requirement looks very reasonable to me. Reviewed-by: Qu Wenruo <wqu@suse.com> Thanks, Qu > --- > Note that this is a resend of an old unmerged fix: > https://lore.kernel.org/linux-btrfs/16c05d39566858bb8bc1e03bd19947cf2b601b98.1647906815.git.boris@bur.io/T/#u > Some other ideas for fixing it by modifying how we set BTRFS_FS_OPEN > were also explored but not merged around that time: > https://lore.kernel.org/linux-btrfs/cover.1654216941.git.anand.jain@oracle.com/ > > I don't have a strong preference, but I would really like to see this > trivial bug fixed. For what it is worth, we have been carrying this > patch internally at Meta since I first sent it with no incident. > --- > fs/btrfs/volumes.c | 4 ---- > 1 file changed, 4 deletions(-) > > diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c > index dc9f54849f39..84e861dcb350 100644 > --- a/fs/btrfs/volumes.c > +++ b/fs/btrfs/volumes.c > @@ -2841,8 +2841,6 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path > set_blocksize(device->bdev_file, BTRFS_BDEV_BLOCKSIZE); > > if (seeding_dev) { > - btrfs_clear_sb_rdonly(sb); > - > /* GFP_KERNEL allocation must not be under device_list_mutex */ > seed_devices = btrfs_init_sprout(fs_info); > if (IS_ERR(seed_devices)) { > @@ -2985,8 +2983,6 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path > mutex_unlock(&fs_info->chunk_mutex); > mutex_unlock(&fs_info->fs_devices->device_list_mutex); > error_trans: > - if (seeding_dev) > - btrfs_set_sb_rdonly(sb); > if (trans) > btrfs_end_transaction(trans); > error_free_zone:
在 2024/10/16 08:30, Qu Wenruo 写道: > > > 在 2024/10/16 08:08, Boris Burkov 写道: >> If you follow the seed/sprout wiki, it suggests the following workflow: >> >> btrfstune -S 1 seed_dev >> mount seed_dev mnt >> btrfs device add sprout_dev >> mount -o remount,rw mnt >> >> The first mount mounts the FS readonly, which results in not setting >> BTRFS_FS_OPEN, and setting the readonly bit on the sb. The device add >> somewhat surprisingly clears the readonly bit on the sb (though the >> mount is still practically readonly, from the users perspective...). >> Finally, the remount checks the readonly bit on the sb against the flag >> and sees no change, so it does not run the code intended to run on >> ro->rw transitions, leaving BTRFS_FS_OPEN unset. >> >> As a result, when the cleaner_kthread runs, it sees no BTRFS_FS_OPEN and >> does no work. This results in leaking deleted snapshots until we run out >> of space. >> >> I propose fixing it at the first departure from what feels reasonable: >> when we clear the readonly bit on the sb during device add. >> >> A new fstest I have written reproduces the bug and confirms the fix. >> >> Signed-off-by: Boris Burkov <boris@bur.io> > > The fix looks good to me, small and keeps the super block ro flag > consistent. > > IIRC the old behavior of sprout is, adding device will immediately mark > the fs RW, which is a big surprise changing the RO/RW status. > > So the extra Rw remount requirement looks very reasonable to me. Forgot to mention, although it's a trivial change in the behavior, if we are determined to go this path, the man page of the "SEEDING DEVICE" chapter also need to be updated. Thanks, Qu > > Reviewed-by: Qu Wenruo <wqu@suse.com> > > Thanks, > Qu > >> --- >> Note that this is a resend of an old unmerged fix: >> https://lore.kernel.org/linux- >> btrfs/16c05d39566858bb8bc1e03bd19947cf2b601b98.1647906815.git.boris@bur.io/T/#u >> Some other ideas for fixing it by modifying how we set BTRFS_FS_OPEN >> were also explored but not merged around that time: >> https://lore.kernel.org/linux-btrfs/ >> cover.1654216941.git.anand.jain@oracle.com/ >> >> I don't have a strong preference, but I would really like to see this >> trivial bug fixed. For what it is worth, we have been carrying this >> patch internally at Meta since I first sent it with no incident. >> --- >> fs/btrfs/volumes.c | 4 ---- >> 1 file changed, 4 deletions(-) >> >> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c >> index dc9f54849f39..84e861dcb350 100644 >> --- a/fs/btrfs/volumes.c >> +++ b/fs/btrfs/volumes.c >> @@ -2841,8 +2841,6 @@ int btrfs_init_new_device(struct btrfs_fs_info >> *fs_info, const char *device_path >> set_blocksize(device->bdev_file, BTRFS_BDEV_BLOCKSIZE); >> >> if (seeding_dev) { >> - btrfs_clear_sb_rdonly(sb); >> - >> /* GFP_KERNEL allocation must not be under device_list_mutex */ >> seed_devices = btrfs_init_sprout(fs_info); >> if (IS_ERR(seed_devices)) { >> @@ -2985,8 +2983,6 @@ int btrfs_init_new_device(struct btrfs_fs_info >> *fs_info, const char *device_path >> mutex_unlock(&fs_info->chunk_mutex); >> mutex_unlock(&fs_info->fs_devices->device_list_mutex); >> error_trans: >> - if (seeding_dev) >> - btrfs_set_sb_rdonly(sb); >> if (trans) >> btrfs_end_transaction(trans); >> error_free_zone: > >
On Wed, Oct 16, 2024 at 08:42:14AM +1030, Qu Wenruo wrote: > > > 在 2024/10/16 08:30, Qu Wenruo 写道: > > > > > > 在 2024/10/16 08:08, Boris Burkov 写道: > > > If you follow the seed/sprout wiki, it suggests the following workflow: > > > > > > btrfstune -S 1 seed_dev > > > mount seed_dev mnt > > > btrfs device add sprout_dev > > > mount -o remount,rw mnt > > > > > > The first mount mounts the FS readonly, which results in not setting > > > BTRFS_FS_OPEN, and setting the readonly bit on the sb. The device add > > > somewhat surprisingly clears the readonly bit on the sb (though the > > > mount is still practically readonly, from the users perspective...). > > > Finally, the remount checks the readonly bit on the sb against the flag > > > and sees no change, so it does not run the code intended to run on > > > ro->rw transitions, leaving BTRFS_FS_OPEN unset. > > > > > > As a result, when the cleaner_kthread runs, it sees no BTRFS_FS_OPEN and > > > does no work. This results in leaking deleted snapshots until we run out > > > of space. > > > > > > I propose fixing it at the first departure from what feels reasonable: > > > when we clear the readonly bit on the sb during device add. > > > > > > A new fstest I have written reproduces the bug and confirms the fix. > > > > > > Signed-off-by: Boris Burkov <boris@bur.io> > > > > The fix looks good to me, small and keeps the super block ro flag > > consistent. > > > > IIRC the old behavior of sprout is, adding device will immediately mark > > the fs RW, which is a big surprise changing the RO/RW status. > > > > So the extra Rw remount requirement looks very reasonable to me. > > Forgot to mention, although it's a trivial change in the behavior, if we > are determined to go this path, the man page of the "SEEDING DEVICE" > chapter also need to be updated. I just checked the man page and everything there looks correct, at least to me. It quite clearly states that after the 'device add' the fs is ready to be remounted read-write (via cycle mount or remount). Please let me know if there is some specific update you want me to make that I missed, though! BTW, this patch does change the rdonly flag behavior, so I will update the test to look at that, as you suggested. > > Thanks, > Qu > > > > Reviewed-by: Qu Wenruo <wqu@suse.com> > > > > Thanks, > > Qu > > > > > --- > > > Note that this is a resend of an old unmerged fix: > > > https://lore.kernel.org/linux- > > > btrfs/16c05d39566858bb8bc1e03bd19947cf2b601b98.1647906815.git.boris@bur.io/T/#u > > > Some other ideas for fixing it by modifying how we set BTRFS_FS_OPEN > > > were also explored but not merged around that time: > > > https://lore.kernel.org/linux-btrfs/ > > > cover.1654216941.git.anand.jain@oracle.com/ > > > > > > I don't have a strong preference, but I would really like to see this > > > trivial bug fixed. For what it is worth, we have been carrying this > > > patch internally at Meta since I first sent it with no incident. > > > --- > > > fs/btrfs/volumes.c | 4 ---- > > > 1 file changed, 4 deletions(-) > > > > > > diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c > > > index dc9f54849f39..84e861dcb350 100644 > > > --- a/fs/btrfs/volumes.c > > > +++ b/fs/btrfs/volumes.c > > > @@ -2841,8 +2841,6 @@ int btrfs_init_new_device(struct btrfs_fs_info > > > *fs_info, const char *device_path > > > set_blocksize(device->bdev_file, BTRFS_BDEV_BLOCKSIZE); > > > > > > if (seeding_dev) { > > > - btrfs_clear_sb_rdonly(sb); > > > - > > > /* GFP_KERNEL allocation must not be under device_list_mutex */ > > > seed_devices = btrfs_init_sprout(fs_info); > > > if (IS_ERR(seed_devices)) { > > > @@ -2985,8 +2983,6 @@ int btrfs_init_new_device(struct btrfs_fs_info > > > *fs_info, const char *device_path > > > mutex_unlock(&fs_info->chunk_mutex); > > > mutex_unlock(&fs_info->fs_devices->device_list_mutex); > > > error_trans: > > > - if (seeding_dev) > > > - btrfs_set_sb_rdonly(sb); > > > if (trans) > > > btrfs_end_transaction(trans); > > > error_free_zone: > > > > >
On 16/10/24 05:38, Boris Burkov wrote: > If you follow the seed/sprout wiki, it suggests the following workflow: > > btrfstune -S 1 seed_dev > mount seed_dev mnt > btrfs device add sprout_dev > mount -o remount,rw mnt > > The first mount mounts the FS readonly, which results in not setting > BTRFS_FS_OPEN, and setting the readonly bit on the sb. The device add > somewhat surprisingly clears the readonly bit on the sb (though the > mount is still practically readonly, from the users perspective...). > Finally, the remount checks the readonly bit on the sb against the flag > and sees no change, so it does not run the code intended to run on > ro->rw transitions, leaving BTRFS_FS_OPEN unset. > > As a result, when the cleaner_kthread runs, it sees no BTRFS_FS_OPEN and > does no work. This results in leaking deleted snapshots until we run out > of space. > > I propose fixing it at the first departure from what feels reasonable: > when we clear the readonly bit on the sb during device add. > > A new fstest I have written reproduces the bug and confirms the fix. > > Signed-off-by: Boris Burkov <boris@bur.io> > --- > Note that this is a resend of an old unmerged fix: > https://lore.kernel.org/linux-btrfs/16c05d39566858bb8bc1e03bd19947cf2b601b98.1647906815.git.boris@bur.io/T/#u > Some other ideas for fixing it by modifying how we set BTRFS_FS_OPEN > were also explored but not merged around that time: > https://lore.kernel.org/linux-btrfs/cover.1654216941.git.anand.jain@oracle.com/ > > I don't have a strong preference, but I would really like to see this > trivial bug fixed. For what it is worth, we have been carrying this > patch internally at Meta since I first sent it with no incident. > --- I remember fixing this before. I tested on 5.15, and the bug isn't there, but it’s back in 6.10, so something broke in between. We need to track it down. The original design (kernel 4.x and below) makes the filesystem switch to read-write mode after adding a sprout because: You can’t add a device to a normal read-only filesystem so with seed read-only mount is different. With a seed device, adding a writable device transforms it into a new read-write filesystem with a _new_ FSID and fs_devices. Logically, read-write at this stage makes sense, but I’m okay without it and in fact we had fixed this before, but a patch somewhere seems to have broken it again. (Demo below. :<x> is the return code from the 'run' command at https://github.com/asj/run.git) ----- 5.15.0-208.159.3.2.el9uek.x86_64 ---- $ mkfs.btrfs -fq /dev/loop0 :0 $ btrfstune -S1 /dev/loop0 :0 $ mount /dev/loop0 /btrfs :0 mount: /btrfs: WARNING: source write-protected, mounted read-only. $ cat /proc/self/mounts | grep btrfs :0 /dev/loop0 /btrfs btrfs ro,relatime,space_cache=v2,subvolid=5,subvol=/ 0 0 $ findmnt -o SOURCE,UUID /btrfs :0 SOURCE UUID /dev/loop0 64f21b87-4e4c-4786-b2cd-c09a5ccd2afa $ btrfs fi show -m :0 Label: none uuid: 64f21b87-4e4c-4786-b2cd-c09a5ccd2afa Total devices 1 FS bytes used 144.00KiB devid 1 size 3.00GiB used 536.00MiB path /dev/loop0 $ ls /sys/fs/btrfs :0 64f21b87-4e4c-4786-b2cd-c09a5ccd2afa features $ btrfs dev add -f /dev/loop1 /btrfs :0 # After adding the device, the path and UUID are different, # so it’s a new filesystem. (But, as I said, I’m fine with # keeping it read-only and needing remount,rw. $ cat /proc/self/mounts | grep btrfs :0 /dev/loop1 /btrfs btrfs ro,relatime,space_cache=v2,subvolid=5,subvol=/ 0 0 $ findmnt -o SOURCE,UUID /btrfs :0 SOURCE UUID /dev/loop1 948cea35-18db-45da-9ec8-3d46cb5f0413 $ btrfs fi show -m :0 Label: none uuid: 948cea35-18db-45da-9ec8-3d46cb5f0413 Total devices 2 FS bytes used 144.00KiB devid 1 size 3.00GiB used 520.00MiB path /dev/loop0 devid 2 size 3.00GiB used 576.00MiB path /dev/loop1 $ ls /sys/fs/btrfs :0 948cea35-18db-45da-9ec8-3d46cb5f0413 features --------- Thanks, Anand > fs/btrfs/volumes.c | 4 ---- > 1 file changed, 4 deletions(-) > > diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c > index dc9f54849f39..84e861dcb350 100644 > --- a/fs/btrfs/volumes.c > +++ b/fs/btrfs/volumes.c > @@ -2841,8 +2841,6 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path > set_blocksize(device->bdev_file, BTRFS_BDEV_BLOCKSIZE); > > if (seeding_dev) { > - btrfs_clear_sb_rdonly(sb); > - > /* GFP_KERNEL allocation must not be under device_list_mutex */ > seed_devices = btrfs_init_sprout(fs_info); > if (IS_ERR(seed_devices)) { > @@ -2985,8 +2983,6 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path > mutex_unlock(&fs_info->chunk_mutex); > mutex_unlock(&fs_info->fs_devices->device_list_mutex); > error_trans: > - if (seeding_dev) > - btrfs_set_sb_rdonly(sb); > if (trans) > btrfs_end_transaction(trans); > error_free_zone:
On Thu, Oct 17, 2024 at 01:14:16AM +0800, Anand Jain wrote: > On 16/10/24 05:38, Boris Burkov wrote: > > If you follow the seed/sprout wiki, it suggests the following workflow: > > > > btrfstune -S 1 seed_dev > > mount seed_dev mnt > > btrfs device add sprout_dev > > mount -o remount,rw mnt > > > > > > > The first mount mounts the FS readonly, which results in not setting > > BTRFS_FS_OPEN, and setting the readonly bit on the sb. The device add > > somewhat surprisingly clears the readonly bit on the sb (though the > > mount is still practically readonly, from the users perspective...). > > Finally, the remount checks the readonly bit on the sb against the flag > > and sees no change, so it does not run the code intended to run on > > ro->rw transitions, leaving BTRFS_FS_OPEN unset. > > > > As a result, when the cleaner_kthread runs, it sees no BTRFS_FS_OPEN and > > does no work. This results in leaking deleted snapshots until we run out > > of space. > > > > I propose fixing it at the first departure from what feels reasonable: > > when we clear the readonly bit on the sb during device add. > > > > A new fstest I have written reproduces the bug and confirms the fix. > > > > Signed-off-by: Boris Burkov <boris@bur.io> > > --- > > Note that this is a resend of an old unmerged fix: > > https://lore.kernel.org/linux-btrfs/16c05d39566858bb8bc1e03bd19947cf2b601b98.1647906815.git.boris@bur.io/T/#u > > Some other ideas for fixing it by modifying how we set BTRFS_FS_OPEN > > were also explored but not merged around that time: > > https://lore.kernel.org/linux-btrfs/cover.1654216941.git.anand.jain@oracle.com/ > > > > I don't have a strong preference, but I would really like to see this > > trivial bug fixed. For what it is worth, we have been carrying this > > patch internally at Meta since I first sent it with no incident. > > --- > > > I remember fixing this before. I tested on 5.15, and the bug isn't > there, but it’s back in 6.10, so something broke in between. > We need to track it down. Thanks for weighing in again, and for re-testing on 5.15. That's interesting that it broke again. And sorry if I didn't follow the rdonly check patches properly and those did end up getting merged. Poor code archaeology on my part :) At any rate, I have pushed this patch into for-next for now, as I think it constitutes an improvement without breaking any documented behavior. If you look into what happened between 5.15 and 6.11 and want to back it out with a different fix, I will not be offended. If we also land the fstest I submitted, then hopefully future kernels will *not* be breaking this again! Thanks, Boris > > The original design (kernel 4.x and below) makes the filesystem switch > to read-write mode after adding a sprout because: > > You can’t add a device to a normal read-only filesystem > so with seed read-only mount is different. > With a seed device, adding a writable device transforms > it into a new read-write filesystem with a _new_ FSID and > fs_devices. Logically, read-write at this stage makes sense, > but I’m okay without it and in fact we had fixed this before, > but a patch somewhere seems to have broken it again. > > > (Demo below. :<x> is the return code from the 'run' command at > https://github.com/asj/run.git) > > > ----- 5.15.0-208.159.3.2.el9uek.x86_64 ---- > $ mkfs.btrfs -fq /dev/loop0 :0 > $ btrfstune -S1 /dev/loop0 :0 > $ mount /dev/loop0 /btrfs :0 > mount: /btrfs: WARNING: source write-protected, mounted read-only. > > $ cat /proc/self/mounts | grep btrfs :0 > /dev/loop0 /btrfs btrfs ro,relatime,space_cache=v2,subvolid=5,subvol=/ 0 0 > > $ findmnt -o SOURCE,UUID /btrfs :0 > SOURCE UUID > /dev/loop0 64f21b87-4e4c-4786-b2cd-c09a5ccd2afa > > $ btrfs fi show -m :0 > Label: none uuid: 64f21b87-4e4c-4786-b2cd-c09a5ccd2afa > Total devices 1 FS bytes used 144.00KiB > devid 1 size 3.00GiB used 536.00MiB path /dev/loop0 > > $ ls /sys/fs/btrfs :0 > 64f21b87-4e4c-4786-b2cd-c09a5ccd2afa > features > > $ btrfs dev add -f /dev/loop1 /btrfs :0 > > # After adding the device, the path and UUID are different, > # so it’s a new filesystem. (But, as I said, I’m fine with > # keeping it read-only and needing remount,rw. > > $ cat /proc/self/mounts | grep btrfs :0 > /dev/loop1 /btrfs btrfs ro,relatime,space_cache=v2,subvolid=5,subvol=/ 0 0 > > $ findmnt -o SOURCE,UUID /btrfs :0 > SOURCE UUID > /dev/loop1 948cea35-18db-45da-9ec8-3d46cb5f0413 > > $ btrfs fi show -m :0 > Label: none uuid: 948cea35-18db-45da-9ec8-3d46cb5f0413 > Total devices 2 FS bytes used 144.00KiB > devid 1 size 3.00GiB used 520.00MiB path /dev/loop0 > devid 2 size 3.00GiB used 576.00MiB path /dev/loop1 > > > $ ls /sys/fs/btrfs :0 > 948cea35-18db-45da-9ec8-3d46cb5f0413 > features > --------- > > > Thanks, Anand > > > fs/btrfs/volumes.c | 4 ---- > > 1 file changed, 4 deletions(-) > > > > diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c > > index dc9f54849f39..84e861dcb350 100644 > > --- a/fs/btrfs/volumes.c > > +++ b/fs/btrfs/volumes.c > > @@ -2841,8 +2841,6 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path > > set_blocksize(device->bdev_file, BTRFS_BDEV_BLOCKSIZE); > > if (seeding_dev) { > > - btrfs_clear_sb_rdonly(sb); > > - > > /* GFP_KERNEL allocation must not be under device_list_mutex */ > > seed_devices = btrfs_init_sprout(fs_info); > > if (IS_ERR(seed_devices)) { > > @@ -2985,8 +2983,6 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path > > mutex_unlock(&fs_info->chunk_mutex); > > mutex_unlock(&fs_info->fs_devices->device_list_mutex); > > error_trans: > > - if (seeding_dev) > > - btrfs_set_sb_rdonly(sb); > > if (trans) > > btrfs_end_transaction(trans); > > error_free_zone: >
On Tue, Oct 15, 2024 at 02:38:34PM -0700, Boris Burkov wrote: > If you follow the seed/sprout wiki, it suggests the following workflow: > > btrfstune -S 1 seed_dev > mount seed_dev mnt > btrfs device add sprout_dev > mount -o remount,rw mnt > > The first mount mounts the FS readonly, which results in not setting > BTRFS_FS_OPEN, and setting the readonly bit on the sb. The device add > somewhat surprisingly clears the readonly bit on the sb (though the > mount is still practically readonly, from the users perspective...). > Finally, the remount checks the readonly bit on the sb against the flag > and sees no change, so it does not run the code intended to run on > ro->rw transitions, leaving BTRFS_FS_OPEN unset. > > As a result, when the cleaner_kthread runs, it sees no BTRFS_FS_OPEN and > does no work. This results in leaking deleted snapshots until we run out > of space. > > I propose fixing it at the first departure from what feels reasonable: > when we clear the readonly bit on the sb during device add. > > A new fstest I have written reproduces the bug and confirms the fix. > > Signed-off-by: Boris Burkov <boris@bur.io> > --- > Note that this is a resend of an old unmerged fix: > https://lore.kernel.org/linux-btrfs/16c05d39566858bb8bc1e03bd19947cf2b601b98.1647906815.git.boris@bur.io/T/#u > Some other ideas for fixing it by modifying how we set BTRFS_FS_OPEN > were also explored but not merged around that time: > https://lore.kernel.org/linux-btrfs/cover.1654216941.git.anand.jain@oracle.com/ > > I don't have a strong preference, but I would really like to see this > trivial bug fixed. For what it is worth, we have been carrying this > patch internally at Meta since I first sent it with no incident. We have an unresolved dispute about the fix and now the patch got to for-next within a few days because it got two reviews but not mine. The way you use it in Meta works for you but the fix is changing behaviour so we can either ignore everybody else relying on the old way or say that seeding is so niche that we don't care and see what we can do once we get some report.
On Thu, Oct 17, 2024 at 04:01:12PM +0200, David Sterba wrote: > On Tue, Oct 15, 2024 at 02:38:34PM -0700, Boris Burkov wrote: > > If you follow the seed/sprout wiki, it suggests the following workflow: > > > > btrfstune -S 1 seed_dev > > mount seed_dev mnt > > btrfs device add sprout_dev > > mount -o remount,rw mnt > > > > The first mount mounts the FS readonly, which results in not setting > > BTRFS_FS_OPEN, and setting the readonly bit on the sb. The device add > > somewhat surprisingly clears the readonly bit on the sb (though the > > mount is still practically readonly, from the users perspective...). > > Finally, the remount checks the readonly bit on the sb against the flag > > and sees no change, so it does not run the code intended to run on > > ro->rw transitions, leaving BTRFS_FS_OPEN unset. > > > > As a result, when the cleaner_kthread runs, it sees no BTRFS_FS_OPEN and > > does no work. This results in leaking deleted snapshots until we run out > > of space. > > > > I propose fixing it at the first departure from what feels reasonable: > > when we clear the readonly bit on the sb during device add. > > > > A new fstest I have written reproduces the bug and confirms the fix. > > > > Signed-off-by: Boris Burkov <boris@bur.io> > > --- > > Note that this is a resend of an old unmerged fix: > > https://lore.kernel.org/linux-btrfs/16c05d39566858bb8bc1e03bd19947cf2b601b98.1647906815.git.boris@bur.io/T/#u > > Some other ideas for fixing it by modifying how we set BTRFS_FS_OPEN > > were also explored but not merged around that time: > > https://lore.kernel.org/linux-btrfs/cover.1654216941.git.anand.jain@oracle.com/ > > > > I don't have a strong preference, but I would really like to see this > > trivial bug fixed. For what it is worth, we have been carrying this > > patch internally at Meta since I first sent it with no incident. > > We have an unresolved dispute about the fix and now the patch got to > for-next within a few days because it got two reviews but not mine. > The way you use it in Meta works for you but the fix is changing > behaviour so we can either ignore everybody else relying on the old > way or say that seeding is so niche that we don't care and see what we > can do once we get some report. Please feel free to remove it, and I am happy to review whatever other fix you think is best. Sorry for rushing, I just wanted to get it done and out of my head so I could move on to other issues. I assume your concern is that before this fix, the filesystem is actually read-write after the device add, which this patch breaks? My only argument against this is that the documentation has always said you needed to remount/cycle-mount after adding the sprout, so there is no fair reason to assume this would work. (In fact, it *doesn't*, the fs is once again in a unexpected, degraded, state...) But if existing LiveCD users are relying on this undocumented behavior, then I think you are right and it's a bad idea to break them. As long as my test (and presumably some fix) goes in and I don't have to carry this patch internally for two more years, then I am happy. Thanks, Boris
在 2024/10/17 03:44, Anand Jain 写道: > On 16/10/24 05:38, Boris Burkov wrote: >> If you follow the seed/sprout wiki, it suggests the following workflow: >> >> btrfstune -S 1 seed_dev >> mount seed_dev mnt >> btrfs device add sprout_dev >> mount -o remount,rw mnt >> > > > >> The first mount mounts the FS readonly, which results in not setting >> BTRFS_FS_OPEN, and setting the readonly bit on the sb. The device add >> somewhat surprisingly clears the readonly bit on the sb (though the >> mount is still practically readonly, from the users perspective...). >> Finally, the remount checks the readonly bit on the sb against the flag >> and sees no change, so it does not run the code intended to run on >> ro->rw transitions, leaving BTRFS_FS_OPEN unset. >> >> As a result, when the cleaner_kthread runs, it sees no BTRFS_FS_OPEN and >> does no work. This results in leaking deleted snapshots until we run out >> of space. >> >> I propose fixing it at the first departure from what feels reasonable: >> when we clear the readonly bit on the sb during device add. >> >> A new fstest I have written reproduces the bug and confirms the fix. >> >> Signed-off-by: Boris Burkov <boris@bur.io> >> --- >> Note that this is a resend of an old unmerged fix: >> https://lore.kernel.org/linux- >> btrfs/16c05d39566858bb8bc1e03bd19947cf2b601b98.1647906815.git.boris@bur.io/T/#u >> Some other ideas for fixing it by modifying how we set BTRFS_FS_OPEN >> were also explored but not merged around that time: >> https://lore.kernel.org/linux-btrfs/ >> cover.1654216941.git.anand.jain@oracle.com/ >> >> I don't have a strong preference, but I would really like to see this >> trivial bug fixed. For what it is worth, we have been carrying this >> patch internally at Meta since I first sent it with no incident. >> --- > > > I remember fixing this before. I tested on 5.15, and the bug isn't > there, but it’s back in 6.10, so something broke in between. > We need to track it down. > > The original design (kernel 4.x and below) makes the filesystem switch > to read-write mode after adding a sprout because: > > You can’t add a device to a normal read-only filesystem > so with seed read-only mount is different. > With a seed device, adding a writable device transforms > it into a new read-write filesystem with a _new_ FSID and > fs_devices. Logically, read-write at this stage makes sense, > but I’m okay without it and in fact we had fixed this before, > but a patch somewhere seems to have broken it again. > > > (Demo below. :<x> is the return code from the 'run' command at > https://github.com/asj/run.git) > > > ----- 5.15.0-208.159.3.2.el9uek.x86_64 ---- I also tried it on upstream kernel v5.15.94, the behavior is still the old changed to RW immediately after device add: [adam@btrfs-vm ~]$ uname -a Linux btrfs-vm 5.15.94-1-lts #1 SMP Wed, 15 Feb 2023 07:09:02 +0000 x86_64 GNU/Linux [adam@btrfs-vm ~]$ sudo mkfs.btrfs -f /dev/test/scratch1 > /dev/null [adam@btrfs-vm ~]$ sudo btrfstune -S 1 /dev/test/scratch1 [adam@btrfs-vm ~]$ sudo mount /dev/test/scratch1 /mnt/btrfs/ mount: /mnt/btrfs: WARNING: source write-protected, mounted read-only. [adam@btrfs-vm ~]$ sudo btrfs device add -f /dev/test/scratch2 /mnt/btrfs/ Performing full device TRIM /dev/test/scratch2 (10.00GiB) ... [adam@btrfs-vm ~]$ sudo touch /mnt/btrfs/file [adam@btrfs-vm ~]$ mount | grep mnt/btrfs /dev/mapper/test-scratch2 on /mnt/btrfs type btrfs (rw,relatime,space_cache=v2,subvolid=5,subvol=/) So it looks like it's some extra backports causing the behavior change. But I still strongly prefer to keep it RO. Even if it's a different fs under the hood, it still suddenly changes the RO/RW status of a mount point without letting the user to know. Thanks, Qu > $ mkfs.btrfs -fq /dev/loop0 :0 > $ btrfstune -S1 /dev/loop0 :0 > $ mount /dev/loop0 /btrfs :0 > mount: /btrfs: WARNING: source write-protected, mounted read-only. > > $ cat /proc/self/mounts | grep btrfs :0 > /dev/loop0 /btrfs btrfs ro,relatime,space_cache=v2,subvolid=5,subvol=/ 0 0 > > $ findmnt -o SOURCE,UUID /btrfs :0 > SOURCE UUID > /dev/loop0 64f21b87-4e4c-4786-b2cd-c09a5ccd2afa > > $ btrfs fi show -m :0 > Label: none uuid: 64f21b87-4e4c-4786-b2cd-c09a5ccd2afa > Total devices 1 FS bytes used 144.00KiB > devid 1 size 3.00GiB used 536.00MiB path /dev/loop0 > > $ ls /sys/fs/btrfs :0 > 64f21b87-4e4c-4786-b2cd-c09a5ccd2afa > features > > $ btrfs dev add -f /dev/loop1 /btrfs :0 > > # After adding the device, the path and UUID are different, > # so it’s a new filesystem. (But, as I said, I’m fine with > # keeping it read-only and needing remount,rw. > > $ cat /proc/self/mounts | grep btrfs :0 > /dev/loop1 /btrfs btrfs ro,relatime,space_cache=v2,subvolid=5,subvol=/ 0 0 > > $ findmnt -o SOURCE,UUID /btrfs :0 > SOURCE UUID > /dev/loop1 948cea35-18db-45da-9ec8-3d46cb5f0413 > > $ btrfs fi show -m :0 > Label: none uuid: 948cea35-18db-45da-9ec8-3d46cb5f0413 > Total devices 2 FS bytes used 144.00KiB > devid 1 size 3.00GiB used 520.00MiB path /dev/loop0 > devid 2 size 3.00GiB used 576.00MiB path /dev/loop1 > > > $ ls /sys/fs/btrfs :0 > 948cea35-18db-45da-9ec8-3d46cb5f0413 > features > --------- > > > Thanks, Anand > >> fs/btrfs/volumes.c | 4 ---- >> 1 file changed, 4 deletions(-) >> >> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c >> index dc9f54849f39..84e861dcb350 100644 >> --- a/fs/btrfs/volumes.c >> +++ b/fs/btrfs/volumes.c >> @@ -2841,8 +2841,6 @@ int btrfs_init_new_device(struct btrfs_fs_info >> *fs_info, const char *device_path >> set_blocksize(device->bdev_file, BTRFS_BDEV_BLOCKSIZE); >> if (seeding_dev) { >> - btrfs_clear_sb_rdonly(sb); >> - >> /* GFP_KERNEL allocation must not be under device_list_mutex */ >> seed_devices = btrfs_init_sprout(fs_info); >> if (IS_ERR(seed_devices)) { >> @@ -2985,8 +2983,6 @@ int btrfs_init_new_device(struct btrfs_fs_info >> *fs_info, const char *device_path >> mutex_unlock(&fs_info->chunk_mutex); >> mutex_unlock(&fs_info->fs_devices->device_list_mutex); >> error_trans: >> - if (seeding_dev) >> - btrfs_set_sb_rdonly(sb); >> if (trans) >> btrfs_end_transaction(trans); >> error_free_zone: > >
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c index dc9f54849f39..84e861dcb350 100644 --- a/fs/btrfs/volumes.c +++ b/fs/btrfs/volumes.c @@ -2841,8 +2841,6 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path set_blocksize(device->bdev_file, BTRFS_BDEV_BLOCKSIZE); if (seeding_dev) { - btrfs_clear_sb_rdonly(sb); - /* GFP_KERNEL allocation must not be under device_list_mutex */ seed_devices = btrfs_init_sprout(fs_info); if (IS_ERR(seed_devices)) { @@ -2985,8 +2983,6 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path mutex_unlock(&fs_info->chunk_mutex); mutex_unlock(&fs_info->fs_devices->device_list_mutex); error_trans: - if (seeding_dev) - btrfs_set_sb_rdonly(sb); if (trans) btrfs_end_transaction(trans); error_free_zone:
If you follow the seed/sprout wiki, it suggests the following workflow: btrfstune -S 1 seed_dev mount seed_dev mnt btrfs device add sprout_dev mount -o remount,rw mnt The first mount mounts the FS readonly, which results in not setting BTRFS_FS_OPEN, and setting the readonly bit on the sb. The device add somewhat surprisingly clears the readonly bit on the sb (though the mount is still practically readonly, from the users perspective...). Finally, the remount checks the readonly bit on the sb against the flag and sees no change, so it does not run the code intended to run on ro->rw transitions, leaving BTRFS_FS_OPEN unset. As a result, when the cleaner_kthread runs, it sees no BTRFS_FS_OPEN and does no work. This results in leaking deleted snapshots until we run out of space. I propose fixing it at the first departure from what feels reasonable: when we clear the readonly bit on the sb during device add. A new fstest I have written reproduces the bug and confirms the fix. Signed-off-by: Boris Burkov <boris@bur.io> --- Note that this is a resend of an old unmerged fix: https://lore.kernel.org/linux-btrfs/16c05d39566858bb8bc1e03bd19947cf2b601b98.1647906815.git.boris@bur.io/T/#u Some other ideas for fixing it by modifying how we set BTRFS_FS_OPEN were also explored but not merged around that time: https://lore.kernel.org/linux-btrfs/cover.1654216941.git.anand.jain@oracle.com/ I don't have a strong preference, but I would really like to see this trivial bug fixed. For what it is worth, we have been carrying this patch internally at Meta since I first sent it with no incident. --- fs/btrfs/volumes.c | 4 ---- 1 file changed, 4 deletions(-)