| Message ID | 20190212070319.30619-1-wqu@suse.com (mailing list archive) |
|---|---|
| State | New, archived |
| Series | [RFC] btrfs: Don't create SINGLE or DUP chunks for degraded rw mount |
On 2019-02-12 2:03 a.m., Qu Wenruo wrote:

> So we only need to consider missing devices as writable, and calculate
> our chunk allocation profile with missing devices too.
>
> Then everything should work as expected, without annoying SINGLE/DUP
> chunks blocking later degraded mounts.

Does this mean you would rely on scrub/CSUM to repair the missing data
if the device is restored?
On 2019/2/12 3:20 PM, Remi Gauvin wrote:
> On 2019-02-12 2:03 a.m., Qu Wenruo wrote:
>
>> So we only need to consider missing devices as writable, and calculate
>> our chunk allocation profile with missing devices too.
>>
>> Then everything should work as expected, without annoying SINGLE/DUP
>> chunks blocking later degraded mounts.
>
> Does this mean you would rely on scrub/CSUM to repair the missing data
> if the device is restored?

Yes, just as btrfs usually does.

Thanks,
Qu
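The repair path referred to above is an ordinary scrub run after the missing
device comes back. A minimal sketch, assuming the filesystem is mounted at
/mnt (the mount point is illustrative); on RAID1 profiles, scrub verifies
checksums on every device and rewrites bad or missing copies from the good
mirror:

  # Read and verify everything; -B waits for completion, -d prints
  # per-device statistics so repaired sectors show up per disk.
  btrfs scrub start -Bd /mnt

  # Cumulative per-device error counters, useful as a sanity check.
  btrfs device stats /mnt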
On 2019-02-12 2:22 a.m., Qu Wenruo wrote:

>> Does this mean you would rely on scrub/CSUM to repair the missing data
>> if the device is restored?
>
> Yes, just as btrfs usually does.

I don't really understand the implications of the problems with mounting
the fs when single/dup data chunks are allocated on raid1, but I would think
that would actually be preferable to filling a drive with 'data' we know is
completely bogus... Converting single/dup data to raid should be much faster
than tripping on CSUM errors, and less prone to missed errors?
On 2019/2/12 3:43 PM, Remi Gauvin wrote:
> On 2019-02-12 2:22 a.m., Qu Wenruo wrote:
>
>>> Does this mean you would rely on scrub/CSUM to repair the missing data
>>> if the device is restored?
>>
>> Yes, just as btrfs usually does.
>
> I don't really understand the implications of the problems with mounting
> the fs when single/dup data chunks are allocated on raid1,

Consider this use case:

One btrfs with 2 devices, RAID1 for data and metadata.

One day devid 2 fails, and before the replacement arrives, the user can
only use devid 1 alone. (Maybe that's the root fs.)

Then the new disk arrives and the user replaces the missing device, which
causes SINGLE or DUP chunks on devid 1; more importantly, some metadata/data
is already in those DUP/SINGLE chunks.

Then some days later, devid 1 fails too. Now the user is unable to mount the
fs degraded RW any more, since the SINGLE/DUP chunks are all on devid 1, and
there is no way to replace devid 1.

Thanks,
Qu

> but I would think
> that would actually be preferable to filling a drive with 'data' we know
> is completely bogus... Converting single/dup data to raid should be much
> faster than tripping on CSUM errors, and less prone to missed errors?
On 2019-02-12 2:47 a.m., Qu Wenruo wrote:
>
> Consider this use case:
>
> One btrfs with 2 devices, RAID1 for data and metadata.
>
> One day devid 2 fails, and before the replacement arrives, the user can
> only use devid 1 alone. (Maybe that's the root fs.)
>
> Then the new disk arrives and the user replaces the missing device, which
> causes SINGLE or DUP chunks on devid 1; more importantly, some
> metadata/data is already in those DUP/SINGLE chunks.

Maybe the btrfs-replace command could have some magic logic attached that
checks for single/dup chunks after completion and either launches an
automatic convert, or prompts the user that one is probably needed?
On 2019/2/12 3:55 PM, Remi Gauvin wrote:
> On 2019-02-12 2:47 a.m., Qu Wenruo wrote:
>>
>> Consider this use case:
>>
>> One btrfs with 2 devices, RAID1 for data and metadata.
>>
>> One day devid 2 fails, and before the replacement arrives, the user can
>> only use devid 1 alone. (Maybe that's the root fs.)
>>
>> Then the new disk arrives and the user replaces the missing device, which
>> causes SINGLE or DUP chunks on devid 1; more importantly, some
>> metadata/data is already in those DUP/SINGLE chunks.
>
> Maybe the btrfs-replace command could have some magic logic attached that
> checks for single/dup chunks after completion and either launches an
> automatic convert, or prompts the user that one is probably needed?

The problem is, how would btrfs-progs know which chunks are old existing
chunks and which are new chunks created by balance?

I considered this method, but a kernel fix that avoids creating the
unnecessary chunks in the first place is much cleaner.

Thanks,
Qu
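For comparison, the manual workaround a user needs today after such a
replace is a filtered convert balance. A minimal sketch, assuming the
repaired two-device RAID1 filesystem is mounted at /mnt (the mount point is
illustrative); the 'soft' filter skips chunks that already have the target
profile, so only the stray SINGLE/DUP chunks get rewritten:

  # Convert any non-RAID1 data and metadata chunks back to RAID1.
  btrfs balance start -dconvert=raid1,soft -mconvert=raid1,soft /mnt

  # Confirm no 'single' or 'DUP' block groups remain.
  btrfs filesystem df /mnt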
On 12.02.2019 10:47, Qu Wenruo wrote:
>
> On 2019/2/12 3:43 PM, Remi Gauvin wrote:
>> On 2019-02-12 2:22 a.m., Qu Wenruo wrote:
>>
>>>> Does this mean you would rely on scrub/CSUM to repair the missing data
>>>> if the device is restored?
>>>
>>> Yes, just as btrfs usually does.
>>
>> I don't really understand the implications of the problems with mounting
>> the fs when single/dup data chunks are allocated on raid1,
>
> Consider this use case:
>
> One btrfs with 2 devices, RAID1 for data and metadata.
>
> One day devid 2 fails, and before the replacement arrives, the user can
> only use devid 1 alone. (Maybe that's the root fs.)
>
> Then the new disk arrives and the user replaces the missing device, which
> causes SINGLE or DUP chunks on devid 1; more importantly, some
> metadata/data is already in those DUP/SINGLE chunks.
>
> Then some days later, devid 1 fails too. Now the user is unable to mount
> the fs degraded RW any more, since the SINGLE/DUP chunks are all on
> devid 1, and there is no way to replace devid 1.

But if I understand what happens after your patch correctly, the replacement
device still does not contain valid data until someone does a scrub. So in
either case a manual step is required to restore full redundancy.

Or does "btrfs replace" restore content on the replacement device
automatically?

> Thanks,
> Qu
>
>> but I would think
>> that would actually be preferable to filling a drive with 'data' we know
>> is completely bogus... Converting single/dup data to raid should be much
>> faster than tripping on CSUM errors, and less prone to missed errors?
On 2019-02-12 1:42 p.m., Andrei Borzenkov wrote:
>
> But if I understand what happens after your patch correctly, the
> replacement device still does not contain valid data until someone does a
> scrub. So in either case a manual step is required to restore full
> redundancy.
>
> Or does "btrfs replace" restore content on the replacement device
> automatically?

It should... Qu's example assumes the drive being replaced stays missing,
in which case replace copies all the data that should be on the missing
drive from its mirror source. There should be no difference between the data
that used to be there and the data that never even got written there in the
first place.

For some reason, my mind last night got stuck on the hypothetical situation
where a volume is mounted and used degraded, but then the missing device,
presumably not faulty, gets re-added. Then things would go bad quickly. But
I concede that scenario falls squarely in the "that's stupid, don't do that"
category.
On 2019/2/13 2:42 AM, Andrei Borzenkov wrote:
> On 12.02.2019 10:47, Qu Wenruo wrote:
>>
>> On 2019/2/12 3:43 PM, Remi Gauvin wrote:
>>> On 2019-02-12 2:22 a.m., Qu Wenruo wrote:
>>>
>>>>> Does this mean you would rely on scrub/CSUM to repair the missing data
>>>>> if the device is restored?
>>>>
>>>> Yes, just as btrfs usually does.
>>>
>>> I don't really understand the implications of the problems with mounting
>>> the fs when single/dup data chunks are allocated on raid1,
>>
>> Consider this use case:
>>
>> One btrfs with 2 devices, RAID1 for data and metadata.
>>
>> One day devid 2 fails, and before the replacement arrives, the user can
>> only use devid 1 alone. (Maybe that's the root fs.)
>>
>> Then the new disk arrives and the user replaces the missing device, which
>> causes SINGLE or DUP chunks on devid 1; more importantly, some
>> metadata/data is already in those DUP/SINGLE chunks.
>>
>> Then some days later, devid 1 fails too. Now the user is unable to mount
>> the fs degraded RW any more, since the SINGLE/DUP chunks are all on
>> devid 1, and there is no way to replace devid 1.
>
> But if I understand what happens after your patch correctly, the
> replacement device still does not contain valid data until someone does a
> scrub.

Nope, that's incorrect.

We still go through the normal device replace routine, which copies data
from the degraded chunks to the new device.

Thanks,
Qu

> So in either case a manual step is required to restore full redundancy.
>
> Or does "btrfs replace" restore content on the replacement device
> automatically?
>
>> Thanks,
>> Qu
>>
>>> but I would think
>>> that would actually be preferable to filling a drive with 'data' we know
>>> is completely bogus... Converting single/dup data to raid should be much
>>> faster than tripping on CSUM errors, and less prone to missed errors?
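The copy performed by the replace routine can be watched while it runs. A
minimal sketch, assuming missing devid 2 is being replaced by /dev/sdc and
the filesystem is mounted at /mnt (both names are illustrative, mirroring
the reproducer in the cover letter):

  # Replace missing devid 2 with the new disk; -B waits for completion,
  # -f allows overwriting a target that carries a stale filesystem.
  btrfs replace start -Bf 2 /dev/sdc /mnt

  # From another shell: print the copy progress once (-1) rather than
  # refreshing continuously.
  btrfs replace status -1 /mnt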
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 0dde0cbc1622..bf691ecb6c70 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -4081,6 +4081,13 @@ static u64 btrfs_reduce_alloc_profile(struct btrfs_fs_info *fs_info, u64 flags)
         u64 raid_type;
         u64 allowed = 0;
 
+        /*
+         * For degraded mount, still count missing devices as rw devices
+         * to avoid alloc SINGLE/DUP chunks
+         */
+        if (btrfs_test_opt(fs_info, DEGRADED))
+                num_devices += fs_info->fs_devices->missing_devices;
+
         /*
          * see if restripe for this chunk_type is in progress, if so
          * try to reduce to the target profile
@@ -9626,6 +9633,12 @@ static u64 update_block_group_flags(struct btrfs_fs_info *fs_info, u64 flags)
                 return extended_to_chunk(stripped);
 
         num_devices = fs_info->fs_devices->rw_devices;
+        /*
+         * For degraded mount, still count missing devices as rw devices
+         * to avoid alloc SINGLE/DUP chunks
+         */
+        if (btrfs_test_opt(fs_info, DEGRADED))
+                num_devices += fs_info->fs_devices->missing_devices;
 
         stripped = BTRFS_BLOCK_GROUP_RAID0 | BTRFS_BLOCK_GROUP_RAID5 |
                 BTRFS_BLOCK_GROUP_RAID6 |
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 03f223aa7194..8e8b3581877f 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -6660,6 +6660,13 @@ static struct btrfs_device *add_missing_dev(struct btrfs_fs_devices *fs_devices,
         set_bit(BTRFS_DEV_STATE_MISSING, &device->dev_state);
         fs_devices->missing_devices++;
 
+        /*
+         * For degraded mount, still count missing devices as writable to
+         * avoid unnecessary SINGLE/DUP chunks
+         */
+        if (btrfs_test_opt(fs_devices->fs_info, DEGRADED))
+                set_bit(BTRFS_DEV_STATE_WRITEABLE, &device->dev_state);
+
         return device;
 }
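A quick way to exercise the changed allocation path is to write to a
degraded RW mount and then check which profiles the writes landed in. A
minimal sketch, assuming a two-device RAID1 filesystem on /dev/sda whose
second device is missing (device names are illustrative); with the patch
applied, the Data and Metadata lines should stay RAID1 instead of gaining
single/DUP entries:

  mount -o degraded /dev/sda /mnt
  dd if=/dev/zero of=/mnt/testfile bs=1M count=256
  sync
  # Show allocated space broken down by block group profile.
  btrfs filesystem df /mnt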
[PROBLEM]
The following script can easily create unnecessary SINGLE or DUP chunks:

  #!/bin/bash
  dev1="/dev/test/scratch1"
  dev2="/dev/test/scratch2"
  dev3="/dev/test/scratch3"
  mnt="/mnt/btrfs"

  umount $dev1 $dev2 $dev3 $mnt &> /dev/null

  mkfs.btrfs -f $dev1 $dev2 -d raid1 -m raid1
  mount $dev1 $mnt
  umount $dev1

  wipefs -fa $dev2
  mount $dev1 -o degraded $mnt
  btrfs replace start -Bf 2 $dev3 $mnt
  umount $dev1

  btrfs ins dump-tree -t chunk $dev1

This results in the following chunks in the chunk tree:

  leaf 3016753152 items 11 free space 14900 generation 9 owner CHUNK_TREE
  leaf 3016753152 flags 0x1(WRITTEN) backref revision 1
  fs uuid 7c5fc730-5c16-4a2b-ad39-c26e85951426
  chunk uuid 1c64265b-253e-411e-b164-b935a45d474b
        item 0 key (DEV_ITEMS DEV_ITEM 1) itemoff 16185 itemsize 98
        item 1 key (DEV_ITEMS DEV_ITEM 2) itemoff 16087 itemsize 98
        item 2 key (FIRST_CHUNK_TREE CHUNK_ITEM 22020096) itemoff 15975 itemsize 112
                length 8388608 owner 2 stripe_len 65536 type SYSTEM|RAID1
                ...
        item 3 key (FIRST_CHUNK_TREE CHUNK_ITEM 30408704) itemoff 15863 itemsize 112
                length 1073741824 owner 2 stripe_len 65536 type METADATA|RAID1
        item 4 key (FIRST_CHUNK_TREE CHUNK_ITEM 1104150528) itemoff 15751 itemsize 112
                length 1073741824 owner 2 stripe_len 65536 type DATA|RAID1
        item 5 key (FIRST_CHUNK_TREE CHUNK_ITEM 2177892352) itemoff 15671 itemsize 80
                length 268435456 owner 2 stripe_len 65536 type METADATA
                                                          ^^^ SINGLE
        item 6 key (FIRST_CHUNK_TREE CHUNK_ITEM 2446327808) itemoff 15591 itemsize 80
                length 33554432 owner 2 stripe_len 65536 type SYSTEM
                                                          ^^^ SINGLE
        item 7 key (FIRST_CHUNK_TREE CHUNK_ITEM 2479882240) itemoff 15511 itemsize 80
                length 536870912 owner 2 stripe_len 65536 type DATA
                                                          ^^^ SINGLE
        item 8 key (FIRST_CHUNK_TREE CHUNK_ITEM 3016753152) itemoff 15399 itemsize 112
                length 33554432 owner 2 stripe_len 65536 type SYSTEM|DUP
                                                          ^^^ DUP
        item 9 key (FIRST_CHUNK_TREE CHUNK_ITEM 3050307584) itemoff 15287 itemsize 112
                length 268435456 owner 2 stripe_len 65536 type METADATA|DUP
                                                          ^^^ DUP
        item 10 key (FIRST_CHUNK_TREE CHUNK_ITEM 3318743040) itemoff 15175 itemsize 112
                length 536870912 owner 2 stripe_len 65536 type DATA|DUP
                                                          ^^^ DUP

[CAUSE]
When mounted degraded, no matter whether we're mounting RW or RO, missing
devices are never considered RW, so we act as if we only have one rw device.

Thus any write to the degraded fs will cause btrfs to create new SINGLE or
DUP chunks to store the newly written data.

[FIX]
At mount time, btrfs has already done the chunk level degradation check, so
we can write to degraded chunks without problem.

So we only need to consider missing devices as writable, and calculate our
chunk allocation profile with missing devices too.

Then everything should work as expected, without annoying SINGLE/DUP chunks
blocking later degraded mounts.
With the fix applied, the above replace results in the following chunk
layout instead:

  leaf 22036480 items 5 free space 15626 generation 5 owner CHUNK_TREE
  leaf 22036480 flags 0x1(WRITTEN) backref revision 1
  fs uuid 7b825e77-e694-4474-9bfe-7bd7565fde0e
  chunk uuid 2c2d9e94-a819-4479-8f16-ab529c0a4f62
        item 0 key (DEV_ITEMS DEV_ITEM 1) itemoff 16185 itemsize 98
        item 1 key (DEV_ITEMS DEV_ITEM 2) itemoff 16087 itemsize 98
        item 2 key (FIRST_CHUNK_TREE CHUNK_ITEM 22020096) itemoff 15975 itemsize 112
                length 8388608 owner 2 stripe_len 65536 type SYSTEM|RAID1
        item 3 key (FIRST_CHUNK_TREE CHUNK_ITEM 30408704) itemoff 15863 itemsize 112
                length 1073741824 owner 2 stripe_len 65536 type METADATA|RAID1
        item 4 key (FIRST_CHUNK_TREE CHUNK_ITEM 1104150528) itemoff 15751 itemsize 112
                length 1073741824 owner 2 stripe_len 65536 type DATA|RAID1

Reported-by: Jakob Schöttl <jschoett@gmail.com>
Cc: Jakob Schöttl <jschoett@gmail.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/extent-tree.c | 13 +++++++++++++
 fs/btrfs/volumes.c     |  7 +++++++
 2 files changed, 20 insertions(+)
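The point of the all-RAID1 layout above is that a later failure of the
remaining original device no longer blocks a degraded mount. A minimal
sketch continuing the reproducer script from the cover letter (same $dev1,
$dev3 and $mnt variables):

  # Simulate devid 1 dying after the replace has completed.
  wipefs -fa $dev1

  # With every chunk on RAID1, the replacement device can still be
  # mounted degraded read-write; before the fix this failed because the
  # SINGLE/DUP chunks lived only on devid 1.
  mount -o degraded $dev3 $mnt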