
[RFC] btrfs: Don't create SINGLE or DUP chunks for degraded rw mount

Message ID 20190212070319.30619-1-wqu@suse.com (mailing list archive)
State New, archived
Series [RFC] btrfs: Don't create SINGLE or DUP chunks for degraded rw mount

Commit Message

Qu Wenruo Feb. 12, 2019, 7:03 a.m. UTC
[PROBLEM]
The following script can easily create unnecessary SINGLE or DUP chunks:
  #!/bin/bash

  dev1="/dev/test/scratch1"
  dev2="/dev/test/scratch2"
  dev3="/dev/test/scratch3"
  mnt="/mnt/btrfs"

  umount $dev1 $dev2 $dev3 $mnt &> /dev/null

  mkfs.btrfs -f $dev1 $dev2 -d raid1 -m raid1

  mount $dev1 $mnt
  umount $dev1

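  # wipe $dev2 to simulate losing that device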
  wipefs -fa $dev2

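  # mount degraded and replace the now-missing devid 2 with $dev3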
  mount $dev1 -o degraded $mnt
  btrfs replace start -Bf 2 $dev3 $mnt
  umount $dev1
  btrfs ins dump-tree -t chunk $dev1

This leaves the following chunks in the chunk tree:
  leaf 3016753152 items 11 free space 14900 generation 9 owner CHUNK_TREE
  leaf 3016753152 flags 0x1(WRITTEN) backref revision 1
  fs uuid 7c5fc730-5c16-4a2b-ad39-c26e85951426
  chunk uuid 1c64265b-253e-411e-b164-b935a45d474b
  	item 0 key (DEV_ITEMS DEV_ITEM 1) itemoff 16185 itemsize 98
  	item 1 key (DEV_ITEMS DEV_ITEM 2) itemoff 16087 itemsize 98
  	item 2 key (FIRST_CHUNK_TREE CHUNK_ITEM 22020096) itemoff 15975 itemsize 112
  		length 8388608 owner 2 stripe_len 65536 type SYSTEM|RAID1
  		...
  	item 3 key (FIRST_CHUNK_TREE CHUNK_ITEM 30408704) itemoff 15863 itemsize 112
  		length 1073741824 owner 2 stripe_len 65536 type METADATA|RAID1
  	item 4 key (FIRST_CHUNK_TREE CHUNK_ITEM 1104150528) itemoff 15751 itemsize 112
  		length 1073741824 owner 2 stripe_len 65536 type DATA|RAID1
  	item 5 key (FIRST_CHUNK_TREE CHUNK_ITEM 2177892352) itemoff 15671 itemsize 80
  		length 268435456 owner 2 stripe_len 65536 type METADATA
  							       ^^^ SINGLE
  	item 6 key (FIRST_CHUNK_TREE CHUNK_ITEM 2446327808) itemoff 15591 itemsize 80
  		length 33554432 owner 2 stripe_len 65536 type SYSTEM
  							      ^^^ SINGLE
  	item 7 key (FIRST_CHUNK_TREE CHUNK_ITEM 2479882240) itemoff 15511 itemsize 80
  		length 536870912 owner 2 stripe_len 65536 type DATA
  							       ^^^ SINGLE
  	item 8 key (FIRST_CHUNK_TREE CHUNK_ITEM 3016753152) itemoff 15399 itemsize 112
  		length 33554432 owner 2 stripe_len 65536 type SYSTEM|DUP
  							      ^^^ DUP
  	item 9 key (FIRST_CHUNK_TREE CHUNK_ITEM 3050307584) itemoff 15287 itemsize 112
  		length 268435456 owner 2 stripe_len 65536 type METADATA|DUP
  							       ^^^ DUP
  	item 10 key (FIRST_CHUNK_TREE CHUNK_ITEM 3318743040) itemoff 15175 itemsize 112
  		length 536870912 owner 2 stripe_len 65536 type DATA|DUP
  							       ^^^ DUP

[CAUSE]
When mounted degraded, no matter whether the mount is RW or RO, missing
devices are never considered writable, so btrfs behaves as if it only
had one rw device.

So any write to the degraded fs will cause btrfs to create new SINGLE or
DUP chunks to store the newly written data.
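
A quick way to spot those unwanted profiles on an unpatched kernel,
without dumping the chunk tree, is the per-profile usage summary (mount
point as in the reproducer above):

  # after new chunks have been allocated during a degraded RW mount,
  # extra "single"/"DUP" lines show up next to the original RAID1 ones
  btrfs filesystem df /mnt/btrfs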

[FIX]
At mount time, btrfs has already done the chunk-level degradation check,
thus we can write to degraded chunks without problems.

So we only need to consider missing devices as writable, and calculate
our chunk allocation profile with missing devices included.

Then everything should work as expected, without annoying SINGLE/DUP
chunks blocking a later degraded mount.

With the fix applied, the above replace results in the following chunk
layout instead:
leaf 22036480 items 5 free space 15626 generation 5 owner CHUNK_TREE
leaf 22036480 flags 0x1(WRITTEN) backref revision 1
fs uuid 7b825e77-e694-4474-9bfe-7bd7565fde0e
chunk uuid 2c2d9e94-a819-4479-8f16-ab529c0a4f62
	item 0 key (DEV_ITEMS DEV_ITEM 1) itemoff 16185 itemsize 98
	item 1 key (DEV_ITEMS DEV_ITEM 2) itemoff 16087 itemsize 98
	item 2 key (FIRST_CHUNK_TREE CHUNK_ITEM 22020096) itemoff 15975 itemsize 112
		length 8388608 owner 2 stripe_len 65536 type SYSTEM|RAID1
	item 3 key (FIRST_CHUNK_TREE CHUNK_ITEM 30408704) itemoff 15863 itemsize 112
		length 1073741824 owner 2 stripe_len 65536 type METADATA|RAID1
	item 4 key (FIRST_CHUNK_TREE CHUNK_ITEM 1104150528) itemoff 15751 itemsize 112
		length 1073741824 owner 2 stripe_len 65536 type DATA|RAID1

Reported-by: Jakob Schöttl <jschoett@gmail.com>
Cc: Jakob Schöttl <jschoett@gmail.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/extent-tree.c | 13 +++++++++++++
 fs/btrfs/volumes.c     |  7 +++++++
 2 files changed, 20 insertions(+)

Comments

Remi Gauvin Feb. 12, 2019, 7:20 a.m. UTC | #1
On 2019-02-12 2:03 a.m., Qu Wenruo wrote:

> So we only need to consider missing devices as writable, and calculate
> our chunk allocation profile with missing devices included.
> 
> Then everything should work as expected, without annoying SINGLE/DUP
> chunks blocking a later degraded mount.
> 
>

Does this mean you would rely on scrub/CSUM to repair the missing data
if device is restored?
Qu Wenruo Feb. 12, 2019, 7:22 a.m. UTC | #2
On 2019/2/12 3:20 PM, Remi Gauvin wrote:
> On 2019-02-12 2:03 a.m., Qu Wenruo wrote:
> 
>> So we only need to consider missing devices as writable, and calculate
>> our chunk allocation profile with missing devices included.
>>
>> Then everything should work as expected, without annoying SINGLE/DUP
>> chunks blocking a later degraded mount.
>>
>>
> 
> Does this mean you would rely on scrub/CSUM to repair the missing data
> if device is restored?

Yes, just as btrfs usually does.
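
For reference, the repair in that case is an ordinary scrub once the
device is back (mount point is only an example):

  # re-read everything and rewrite bad or missing copies from the good
  # mirror
  btrfs scrub start -B /mnt/btrfs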

Thanks,
Qu
Remi Gauvin Feb. 12, 2019, 7:43 a.m. UTC | #3
On 2019-02-12 2:22 a.m., Qu Wenruo wrote:

>> Does this mean you would rely on scrub/CSUM to repair the missing data
>> if device is restored?
> 
> Yes, just as btrfs usually does.
> 

I don't really understand the implications of the problems with mounting
fs when single/dup data chunk are allocated on raid1, but I would think
that would actually be a preferable situation than filling a drive with
'data' we know is completely bogus... converting single/dup data to raid
should be much faster than tripping on CSUM errors, and less prone to
missed errors?
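
The conversion in question is an ordinary balance with convert filters,
roughly (filter syntax as in current btrfs-progs, run once both devices
are healthy again; mount point is only an example):

  # rewrite any single/DUP chunks as RAID1; "soft" skips chunks that are
  # already in the target profile
  btrfs balance start -dconvert=raid1,soft -mconvert=raid1,soft /mnt/btrfs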
Qu Wenruo Feb. 12, 2019, 7:47 a.m. UTC | #4
On 2019/2/12 3:43 PM, Remi Gauvin wrote:
> On 2019-02-12 2:22 a.m., Qu Wenruo wrote:
> 
>>> Does this mean you would rely on scrub/CSUM to repair the missing data
>>> if device is restored?
>>
>> Yes, just as btrfs usually does.
>>
> 
> I don't really understand the implications of the problems with mounting
> fs when single/dup data chunk are allocated on raid1,

Consider this use case:

One btrfs with 2 devices, RAID1 for data and metadata.

One day devid 2 got failure, and before replacement arrives, user can
only use devid 1 alone. (Maybe that's the root fs).

Then new disk arrived, user replaced the missing device, caused SINGLE
or DUP chunks on devid 1, and more importantly, some metadata/data is
already in DUP/SINGLE chunks.

Then some days later, devid 1 get failure too, now user is unable to
mount the fs degraded RW any more, since SINGLE/DUP chunks are all on
devid 1, and no way to replace devid 1.
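
A condensed sketch of that sequence (device names are placeholders):

  mkfs.btrfs -f -d raid1 -m raid1 /dev/sda /dev/sdb
  # /dev/sdb dies; the fs keeps running degraded for a while, and any
  # chunks allocated in that period are SINGLE/DUP on /dev/sda
  mount -o degraded /dev/sda /mnt
  # the replacement arrives and takes over devid 2
  btrfs replace start -B 2 /dev/sdc /mnt
  # later /dev/sda fails as well; the SINGLE/DUP chunks lived only on
  # /dev/sda, so a degraded RW mount from /dev/sdc is refused
  mount -o degraded /dev/sdc /mnt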

Thanks,
Qu

> but I would think
> that would actually be a preferable situation than filling a drive with
> 'data' we know is completely bogus... converting single/dup data to raid
> should be much faster than tripping on CSUM errors, and less prone to
> missed errors?
> 
>
Remi Gauvin Feb. 12, 2019, 7:55 a.m. UTC | #5
On 2019-02-12 2:47 a.m., Qu Wenruo wrote:
>
> 
> Consider this use case:
> 
> One btrfs with 2 devices, RAID1 for data and metadata.
> 
> One day devid 2 got failure, and before replacement arrives, user can
> only use devid 1 alone. (Maybe that's the root fs).
> 
> Then new disk arrived, user replaced the missing device, caused SINGLE
> or DUP chunks on devid 1, and more importantly, some metadata/data is
> already in DUP/SINGLE chunks.
> 


Maybe the btrfs-replace command can have some magic logic attached that
checks for single/dup chunks after completion and either launches an
automatic convert, or prompts the user that one is probably needed?
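
A rough userspace sketch of such a check (purely hypothetical, not an
existing btrfs-progs feature; mount point is only an example):

  # after the replace finishes, warn if any data/metadata chunks are
  # still not redundant so the user can run a convert balance
  if btrfs filesystem df /mnt/btrfs | grep -Eiq '^(data|metadata), (single|dup)'; then
      echo "WARNING: single/DUP chunks remain, consider converting to raid1" >&2
  fi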
Qu Wenruo Feb. 12, 2019, 7:57 a.m. UTC | #6
On 2019/2/12 3:55 PM, Remi Gauvin wrote:
> On 2019-02-12 2:47 a.m., Qu Wenruo wrote:
>>
>>
>> Consider this use case:
>>
>> One btrfs with 2 devices, RAID1 for data and metadata.
>>
>> One day devid 2 got failure, and before replacement arrives, user can
>> only use devid 1 alone. (Maybe that's the root fs).
>>
>> Then new disk arrived, user replaced the missing device, caused SINGLE
>> or DUP chunks on devid 1, and more importantly, some metadata/data is
>> already in DUP/SINGLE chunks.
>>
> 
> 
> Maybe the btrfs-replace command can have some magic logic attached that
> checks for single/dup chunks after completion and either launches an
> automatic convert, or prompts the user that one is probably needed?

The problem is, how would btrfs-progs know which chunks are old existing
chunks and which are new chunks created while the fs was mounted degraded?

I considered this method, but a kernel fix that avoids creating the
unnecessary chunks in the first place is much cleaner.

Thanks,
Qu
> 
> 
>
Andrei Borzenkov Feb. 12, 2019, 6:42 p.m. UTC | #7
12.02.2019 10:47, Qu Wenruo wrote:
> 
> 
> On 2019/2/12 下午3:43, Remi Gauvin wrote:
>> On 2019-02-12 2:22 a.m., Qu Wenruo wrote:
>>
>>>> Does this mean you would rely on scrub/CSUM to repair the missing data
>>>> if device is restored?
>>>
>>> Yes, just as btrfs usually does.
>>>
>>
>> I don't really understand the implications of the problems with mounting
>> fs when single/dup data chunk are allocated on raid1,
> 
> Consider this use case:
> 
> One btrfs with 2 devices, RAID1 for data and metadata.
> 
> One day devid 2 got failure, and before replacement arrives, user can
> only use devid 1 alone. (Maybe that's the root fs).
> 
> Then new disk arrived, user replaced the missing device, caused SINGLE
> or DUP chunks on devid 1, and more importantly, some metadata/data is
> already in DUP/SINGLE chunks.
> 
> Then some days later, devid 1 get failure too, now user is unable to
> mount the fs degraded RW any more, since SINGLE/DUP chunks are all on
> devid 1, and no way to replace devid 1.
> 

But if I understand what happens after your patch correctly, replacement
device still does not contain valid data until someone does scrub. So in
either case manual step is required to restore full redundancy.

Or does "btrfs replace" restore content on replacement device automatically?

> Thanks,
> Qu
> 
>> but I would think
>> that would actually be a preferable situation than filling a drive with
>> 'data' we know is completely bogus... converting single/dup data to raid
>> should be much faster than tripping on CSUM errors, and less prone to
>> missed errors?
>>
>>
>
Remi Gauvin Feb. 12, 2019, 7:09 p.m. UTC | #8
On 2019-02-12 1:42 p.m., Andrei Borzenkov wrote:

>>
> 
> But if I understand what happens after your patch correctly, replacement
> device still does not contain valid data until someone does scrub. So in
> either case manual step is required to restore full redundancy.
> 
> Or does "btrfs replace" restore content on replacement device automatically?
> 

It should.... Qu's example assumes the drive being replaced stays
missing, in which case replace is copying all the data that should be on
the missing drive from its mirror source.  There should be no
difference between the data that used to be there and the data that
didn't ever even get written there in the first place.

For some reason, my mind last night got stuck on the hypothetical
situation where a volume is mounted and used degraded, but then the
missing device, presumably not faulty, gets re-added... then things
would go bad quickly.  But I concede that scenario falls squarely in the
"That's stupid don't do that" category.
Qu Wenruo Feb. 13, 2019, 12:44 a.m. UTC | #9
On 2019/2/13 2:42 AM, Andrei Borzenkov wrote:
> 12.02.2019 10:47, Qu Wenruo wrote:
>>
>>
>> On 2019/2/12 下午3:43, Remi Gauvin wrote:
>>> On 2019-02-12 2:22 a.m., Qu Wenruo wrote:
>>>
>>>>> Does this mean you would rely on scrub/CSUM to repair the missing data
>>>>> if device is restored?
>>>>
>>>> Yes, just as btrfs usually does.
>>>>
>>>
>>> I don't really understand the implications of the problems with mounting
>>> fs when single/dup data chunk are allocated on raid1,
>>
>> Consider this use case:
>>
>> One btrfs with 2 devices, RAID1 for data and metadata.
>>
>> One day devid 2 got failure, and before replacement arrives, user can
>> only use devid 1 alone. (Maybe that's the root fs).
>>
>> Then new disk arrived, user replaced the missing device, caused SINGLE
>> or DUP chunks on devid 1, and more importantly, some metadata/data is
>> already in DUP/SINGLE chunks.
>>
>> Then some days later, devid 1 get failure too, now user is unable to
>> mount the fs degraded RW any more, since SINGLE/DUP chunks are all on
>> devid 1, and no way to replace devid 1.
>>
> 
> But if I understand what happens after your patch correctly, replacement
> device still does not contain valid data until someone does scrub.

Nope. That's incorrect.

We still go through the normal device replace routine, which copies the
data from the degraded chunks to the new device.

Thanks,
Qu

> So in
> either case manual step is required to restore full redundancy.
> 
> Or does "btrfs replace" restore content on replacement device automatically?
> 
>> Thanks,
>> Qu
>>
>>> but I would think
>>> that would actually be a preferable situation than filling a drive with
>>> 'data' we know is completely bogus... converting single/dup data to raid
>>> should be much faster than tripping on CSUM errors, and less prone to
>>> missed errors?
>>>
>>>
>>
> 
>

Patch

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 0dde0cbc1622..bf691ecb6c70 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -4081,6 +4081,13 @@  static u64 btrfs_reduce_alloc_profile(struct btrfs_fs_info *fs_info, u64 flags)
 	u64 raid_type;
 	u64 allowed = 0;
 
+	/*
+	 * For degraded mount, still count missing devices as rw devices
+	 * to avoid alloc SINGLE/DUP chunks
+	 */
+	if (btrfs_test_opt(fs_info, DEGRADED))
+		num_devices += fs_info->fs_devices->missing_devices;
+
 	/*
 	 * see if restripe for this chunk_type is in progress, if so
 	 * try to reduce to the target profile
@@ -9626,6 +9633,12 @@  static u64 update_block_group_flags(struct btrfs_fs_info *fs_info, u64 flags)
 		return extended_to_chunk(stripped);
 
 	num_devices = fs_info->fs_devices->rw_devices;
+	/*
+	 * For degraded mount, still count missing devices as rw devices
+	 * to avoid alloc SINGLE/DUP chunks
+	 */
+	if (btrfs_test_opt(fs_info, DEGRADED))
+		num_devices += fs_info->fs_devices->missing_devices;
 
 	stripped = BTRFS_BLOCK_GROUP_RAID0 |
 		BTRFS_BLOCK_GROUP_RAID5 | BTRFS_BLOCK_GROUP_RAID6 |
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 03f223aa7194..8e8b3581877f 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -6660,6 +6660,13 @@  static struct btrfs_device *add_missing_dev(struct btrfs_fs_devices *fs_devices,
 	set_bit(BTRFS_DEV_STATE_MISSING, &device->dev_state);
 	fs_devices->missing_devices++;
 
+	/*
+	 * For degraded mount, still count missing devices as writable to
+	 * avoid unnecessary SINGLE/DUP chunks
+	 */
+	if (btrfs_test_opt(fs_devices->fs_info, DEGRADED))
+		set_bit(BTRFS_DEV_STATE_WRITEABLE, &device->dev_state);
+
 	return device;
 }