btrfs_progs: mkfs: match devid order to the stripe index

Message ID 20190628022611.2844-1-anand.jain@oracle.com
State New

Commit Message

Anand Jain June 28, 2019, 2:26 a.m. UTC
At mkfs.btrfs time, the device ID and stripe index end up reversed, as
shown in [1]. This patch keeps them in order at mkfs.btrfs time, which
makes debugging easier.

Before:
Stripe 0 is on devid 2; Stripe 1 is on devid 1;

./mkfs.btrfs -fq -draid1 -mraid1 /dev/sdb /dev/sdc && btrfs in dump-tree -d /dev/sdb | grep -A 10000 "chunk tree" | grep -B 10000 "device tree" | grep -A 13  "FIRST_CHUNK_TREE CHUNK_ITEM"
	item 2 key (FIRST_CHUNK_TREE CHUNK_ITEM 22020096) itemoff 15975 itemsize 112
		length 8388608 owner 2 stripe_len 65536 type SYSTEM|RAID1
		io_align 65536 io_width 65536 sector_size 4096
		num_stripes 2 sub_stripes 0
			stripe 0 devid 2 offset 1048576
			dev_uuid d9fe51c4-6e79-446d-87ee-5be3184798cd
			stripe 1 devid 1 offset 22020096
			dev_uuid 16f626ca-1a54-469b-ac7e-25623af884ab
	item 3 key (FIRST_CHUNK_TREE CHUNK_ITEM 30408704) itemoff 15863 itemsize 112
		length 268435456 owner 2 stripe_len 65536 type METADATA|RAID1
		io_align 65536 io_width 65536 sector_size 4096
		num_stripes 2 sub_stripes 0
			stripe 0 devid 2 offset 9437184
			dev_uuid d9fe51c4-6e79-446d-87ee-5be3184798cd
			stripe 1 devid 1 offset 30408704
			dev_uuid 16f626ca-1a54-469b-ac7e-25623af884ab
	item 4 key (FIRST_CHUNK_TREE CHUNK_ITEM 298844160) itemoff 15751 itemsize 112
		length 314572800 owner 2 stripe_len 65536 type DATA|RAID1
		io_align 65536 io_width 65536 sector_size 4096
		num_stripes 2 sub_stripes 0
			stripe 0 devid 2 offset 277872640
			dev_uuid d9fe51c4-6e79-446d-87ee-5be3184798cd
			stripe 1 devid 1 offset 298844160
			dev_uuid 16f626ca-1a54-469b-ac7e-25623af884ab

After:
Stripe 0 is on devid 1; Stripe 1 is on devid 2

./mkfs.btrfs -fq -draid1 -mraid1 /dev/sdb /dev/sdc && btrfs in dump-tree -d /dev/sdb | grep -A 10000 "chunk tree" | grep -B 10000 "device tree" | grep -A 13  "FIRST_CHUNK_TREE CHUNK_ITEM"
/dev/sdb: 8 bytes were erased at offset 0x00010040 (btrfs): 5f 42 48 52 66 53 5f 4d
/dev/sdc: 8 bytes were erased at offset 0x00010040 (btrfs): 5f 42 48 52 66 53 5f 4d
	item 2 key (FIRST_CHUNK_TREE CHUNK_ITEM 22020096) itemoff 15975 itemsize 112
		length 8388608 owner 2 stripe_len 65536 type SYSTEM|RAID1
		io_align 65536 io_width 65536 sector_size 4096
		num_stripes 2 sub_stripes 0
			stripe 0 devid 1 offset 22020096
			dev_uuid 6abc88fa-f42e-4f0c-9bc3-2225735e51d1
			stripe 1 devid 2 offset 1048576
			dev_uuid 73746d27-13a6-4d58-ac6b-48c90c31d94d
	item 3 key (FIRST_CHUNK_TREE CHUNK_ITEM 30408704) itemoff 15863 itemsize 112
		length 268435456 owner 2 stripe_len 65536 type METADATA|RAID1
		io_align 65536 io_width 65536 sector_size 4096
		num_stripes 2 sub_stripes 0
			stripe 0 devid 1 offset 30408704
			dev_uuid 6abc88fa-f42e-4f0c-9bc3-2225735e51d1
			stripe 1 devid 2 offset 9437184
			dev_uuid 73746d27-13a6-4d58-ac6b-48c90c31d94d
	item 4 key (FIRST_CHUNK_TREE CHUNK_ITEM 298844160) itemoff 15751 itemsize 112
		length 314572800 owner 2 stripe_len 65536 type DATA|RAID1
		io_align 65536 io_width 65536 sector_size 4096
		num_stripes 2 sub_stripes 0
			stripe 0 devid 1 offset 298844160
			dev_uuid 6abc88fa-f42e-4f0c-9bc3-2225735e51d1
			stripe 1 devid 2 offset 277872640
			dev_uuid 73746d27-13a6-4d58-ac6b-48c90c31d94d

Signed-off-by: Anand Jain <anand.jain@oracle.com>
---
 volumes.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

Comments

Qu Wenruo June 28, 2019, 2:44 a.m. UTC | #1
On 2019/6/28 10:26 AM, Anand Jain wrote:
> At mkfs.btrfs time, the device ID and stripe index end up reversed, as
> shown in [1]. This patch keeps them in order at mkfs.btrfs time, which
> makes debugging easier.
> 
> Before:
> Stripe 0 is on devid 2; Stripe 1 is on devid 1;
> 
> ./mkfs.btrfs -fq -draid1 -mraid1 /dev/sdb /dev/sdc && btrfs in dump-tree -d /dev/sdb | grep -A 10000 "chunk tree" | grep -B 10000 "device tree" | grep -A 13  "FIRST_CHUNK_TREE CHUNK_ITEM"
> 	item 2 key (FIRST_CHUNK_TREE CHUNK_ITEM 22020096) itemoff 15975 itemsize 112
> 		length 8388608 owner 2 stripe_len 65536 type SYSTEM|RAID1
> 		io_align 65536 io_width 65536 sector_size 4096
> 		num_stripes 2 sub_stripes 0
> 			stripe 0 devid 2 offset 1048576
> 			dev_uuid d9fe51c4-6e79-446d-87ee-5be3184798cd
> 			stripe 1 devid 1 offset 22020096
> 			dev_uuid 16f626ca-1a54-469b-ac7e-25623af884ab
> 	item 3 key (FIRST_CHUNK_TREE CHUNK_ITEM 30408704) itemoff 15863 itemsize 112
> 		length 268435456 owner 2 stripe_len 65536 type METADATA|RAID1
> 		io_align 65536 io_width 65536 sector_size 4096
> 		num_stripes 2 sub_stripes 0
> 			stripe 0 devid 2 offset 9437184
> 			dev_uuid d9fe51c4-6e79-446d-87ee-5be3184798cd
> 			stripe 1 devid 1 offset 30408704
> 			dev_uuid 16f626ca-1a54-469b-ac7e-25623af884ab
> 	item 4 key (FIRST_CHUNK_TREE CHUNK_ITEM 298844160) itemoff 15751 itemsize 112
> 		length 314572800 owner 2 stripe_len 65536 type DATA|RAID1
> 		io_align 65536 io_width 65536 sector_size 4096
> 		num_stripes 2 sub_stripes 0
> 			stripe 0 devid 2 offset 277872640
> 			dev_uuid d9fe51c4-6e79-446d-87ee-5be3184798cd
> 			stripe 1 devid 1 offset 298844160
> 			dev_uuid 16f626ca-1a54-469b-ac7e-25623af884ab
> 
> After:
> Stripe 0 is on devid 1; Stripe 1 is on devid 2
> 
> ./mkfs.btrfs -fq -draid1 -mraid1 /dev/sdb /dev/sdc && btrfs in dump-tree -d /dev/sdb | grep -A 10000 "chunk tree" | grep -B 10000 "device tree" | grep -A 13  "FIRST_CHUNK_TREE CHUNK_ITEM"
> /dev/sdb: 8 bytes were erased at offset 0x00010040 (btrfs): 5f 42 48 52 66 53 5f 4d
> /dev/sdc: 8 bytes were erased at offset 0x00010040 (btrfs): 5f 42 48 52 66 53 5f 4d
> 	item 2 key (FIRST_CHUNK_TREE CHUNK_ITEM 22020096) itemoff 15975 itemsize 112
> 		length 8388608 owner 2 stripe_len 65536 type SYSTEM|RAID1
> 		io_align 65536 io_width 65536 sector_size 4096
> 		num_stripes 2 sub_stripes 0
> 			stripe 0 devid 1 offset 22020096
> 			dev_uuid 6abc88fa-f42e-4f0c-9bc3-2225735e51d1
> 			stripe 1 devid 2 offset 1048576
> 			dev_uuid 73746d27-13a6-4d58-ac6b-48c90c31d94d
> 	item 3 key (FIRST_CHUNK_TREE CHUNK_ITEM 30408704) itemoff 15863 itemsize 112
> 		length 268435456 owner 2 stripe_len 65536 type METADATA|RAID1
> 		io_align 65536 io_width 65536 sector_size 4096
> 		num_stripes 2 sub_stripes 0
> 			stripe 0 devid 1 offset 30408704
> 			dev_uuid 6abc88fa-f42e-4f0c-9bc3-2225735e51d1
> 			stripe 1 devid 2 offset 9437184
> 			dev_uuid 73746d27-13a6-4d58-ac6b-48c90c31d94d
> 	item 4 key (FIRST_CHUNK_TREE CHUNK_ITEM 298844160) itemoff 15751 itemsize 112
> 		length 314572800 owner 2 stripe_len 65536 type DATA|RAID1
> 		io_align 65536 io_width 65536 sector_size 4096
> 		num_stripes 2 sub_stripes 0
> 			stripe 0 devid 1 offset 298844160
> 			dev_uuid 6abc88fa-f42e-4f0c-9bc3-2225735e51d1
> 			stripe 1 devid 2 offset 277872640
> 			dev_uuid 73746d27-13a6-4d58-ac6b-48c90c31d94d
> 
> Signed-off-by: Anand Jain <anand.jain@oracle.com>

Reviewed-by: Qu Wenruo <wqu@suse.com>

But please also check the comment inlined below.
> ---
>  volumes.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/volumes.c b/volumes.c
> index 79d1d6a07fb7..8c8b17e814b8 100644
> --- a/volumes.c
> +++ b/volumes.c
> @@ -1109,7 +1109,7 @@ again:
>  			return ret;
>  		cur = cur->next;
>  		if (avail >= min_free) {
> -			list_move_tail(&device->dev_list, &private_devs);
> +			list_move(&device->dev_list, &private_devs);

This is OK because the current btrfs-progs chunk allocator, unlike the
kernel's, does not sort devices by their unallocated space.
So it's completely devid based.

But please keep in mind that if we're going to unify the chunk
allocator behavior of the kernel and btrfs-progs, this behavior will change.

The initial temporary chunk is always allocated on devid 1, which reduces
its unallocated space and therefore its priority in the chunk allocator,
making the devid sequence less reliable.

Thanks,
Qu

>  			index++;
>  			if (type & BTRFS_BLOCK_GROUP_DUP)
>  				index++;
> @@ -1166,7 +1166,7 @@ again:
>  		/* loop over this device again if we're doing a dup group */
>  		if (!(type & BTRFS_BLOCK_GROUP_DUP) ||
>  		    (index == num_stripes - 1))
> -			list_move_tail(&device->dev_list, dev_list);
> +			list_move(&device->dev_list, dev_list);
>  
>  		ret = btrfs_alloc_dev_extent(trans, device, key.offset,
>  			     calc_size, &dev_offset);
>
Anand Jain June 28, 2019, 3:28 a.m. UTC | #2
On 28/6/19 10:44 AM, Qu Wenruo wrote:
> 
> 
> On 2019/6/28 10:26 AM, Anand Jain wrote:
>> At mkfs.btrfs time, the device ID and stripe index end up reversed, as
>> shown in [1]. This patch keeps them in order at mkfs.btrfs time, which
>> makes debugging easier.
>>
>> Before:
>> Stripe 0 is on devid 2; Stripe 1 is on devid 1;
>>
>> ./mkfs.btrfs -fq -draid1 -mraid1 /dev/sdb /dev/sdc && btrfs in dump-tree -d /dev/sdb | grep -A 10000 "chunk tree" | grep -B 10000 "device tree" | grep -A 13  "FIRST_CHUNK_TREE CHUNK_ITEM"
>> 	item 2 key (FIRST_CHUNK_TREE CHUNK_ITEM 22020096) itemoff 15975 itemsize 112
>> 		length 8388608 owner 2 stripe_len 65536 type SYSTEM|RAID1
>> 		io_align 65536 io_width 65536 sector_size 4096
>> 		num_stripes 2 sub_stripes 0
>> 			stripe 0 devid 2 offset 1048576
>> 			dev_uuid d9fe51c4-6e79-446d-87ee-5be3184798cd
>> 			stripe 1 devid 1 offset 22020096
>> 			dev_uuid 16f626ca-1a54-469b-ac7e-25623af884ab
>> 	item 3 key (FIRST_CHUNK_TREE CHUNK_ITEM 30408704) itemoff 15863 itemsize 112
>> 		length 268435456 owner 2 stripe_len 65536 type METADATA|RAID1
>> 		io_align 65536 io_width 65536 sector_size 4096
>> 		num_stripes 2 sub_stripes 0
>> 			stripe 0 devid 2 offset 9437184
>> 			dev_uuid d9fe51c4-6e79-446d-87ee-5be3184798cd
>> 			stripe 1 devid 1 offset 30408704
>> 			dev_uuid 16f626ca-1a54-469b-ac7e-25623af884ab
>> 	item 4 key (FIRST_CHUNK_TREE CHUNK_ITEM 298844160) itemoff 15751 itemsize 112
>> 		length 314572800 owner 2 stripe_len 65536 type DATA|RAID1
>> 		io_align 65536 io_width 65536 sector_size 4096
>> 		num_stripes 2 sub_stripes 0
>> 			stripe 0 devid 2 offset 277872640
>> 			dev_uuid d9fe51c4-6e79-446d-87ee-5be3184798cd
>> 			stripe 1 devid 1 offset 298844160
>> 			dev_uuid 16f626ca-1a54-469b-ac7e-25623af884ab
>>
>> After:
>> Stripe 0 is on devid 1; Stripe 1 is on devid 2
>>
>> ./mkfs.btrfs -fq -draid1 -mraid1 /dev/sdb /dev/sdc && btrfs in dump-tree -d /dev/sdb | grep -A 10000 "chunk tree" | grep -B 10000 "device tree" | grep -A 13  "FIRST_CHUNK_TREE CHUNK_ITEM"
>> /dev/sdb: 8 bytes were erased at offset 0x00010040 (btrfs): 5f 42 48 52 66 53 5f 4d
>> /dev/sdc: 8 bytes were erased at offset 0x00010040 (btrfs): 5f 42 48 52 66 53 5f 4d
>> 	item 2 key (FIRST_CHUNK_TREE CHUNK_ITEM 22020096) itemoff 15975 itemsize 112
>> 		length 8388608 owner 2 stripe_len 65536 type SYSTEM|RAID1
>> 		io_align 65536 io_width 65536 sector_size 4096
>> 		num_stripes 2 sub_stripes 0
>> 			stripe 0 devid 1 offset 22020096
>> 			dev_uuid 6abc88fa-f42e-4f0c-9bc3-2225735e51d1
>> 			stripe 1 devid 2 offset 1048576
>> 			dev_uuid 73746d27-13a6-4d58-ac6b-48c90c31d94d
>> 	item 3 key (FIRST_CHUNK_TREE CHUNK_ITEM 30408704) itemoff 15863 itemsize 112
>> 		length 268435456 owner 2 stripe_len 65536 type METADATA|RAID1
>> 		io_align 65536 io_width 65536 sector_size 4096
>> 		num_stripes 2 sub_stripes 0
>> 			stripe 0 devid 1 offset 30408704
>> 			dev_uuid 6abc88fa-f42e-4f0c-9bc3-2225735e51d1
>> 			stripe 1 devid 2 offset 9437184
>> 			dev_uuid 73746d27-13a6-4d58-ac6b-48c90c31d94d
>> 	item 4 key (FIRST_CHUNK_TREE CHUNK_ITEM 298844160) itemoff 15751 itemsize 112
>> 		length 314572800 owner 2 stripe_len 65536 type DATA|RAID1
>> 		io_align 65536 io_width 65536 sector_size 4096
>> 		num_stripes 2 sub_stripes 0
>> 			stripe 0 devid 1 offset 298844160
>> 			dev_uuid 6abc88fa-f42e-4f0c-9bc3-2225735e51d1
>> 			stripe 1 devid 2 offset 277872640
>> 			dev_uuid 73746d27-13a6-4d58-ac6b-48c90c31d94d
>>
>> Signed-off-by: Anand Jain <anand.jain@oracle.com>
> 
> Reviewed-by: Qu Wenruo <wqu@suse.com>
> 
> But please also check the comment inlined below.
>> ---
>>   volumes.c | 4 ++--
>>   1 file changed, 2 insertions(+), 2 deletions(-)
>>
>> diff --git a/volumes.c b/volumes.c
>> index 79d1d6a07fb7..8c8b17e814b8 100644
>> --- a/volumes.c
>> +++ b/volumes.c
>> @@ -1109,7 +1109,7 @@ again:
>>   			return ret;
>>   		cur = cur->next;
>>   		if (avail >= min_free) {
>> -			list_move_tail(&device->dev_list, &private_devs);
>> +			list_move(&device->dev_list, &private_devs);
> 
> This is OK because the current btrfs-progs chunk allocator, unlike the
> kernel's, does not sort devices by their unallocated space.
> So it's completely devid based.
> 
> But please keep in mind that if we're going to unify the chunk
> allocator behavior of the kernel and btrfs-progs, this behavior will change.
> 
> The initial temporary chunk is always allocated on devid 1, which reduces
> its unallocated space and therefore its priority in the chunk allocator,
> making the devid sequence less reliable.

  Right. For debugging this, I have experimental code that disables
  the unallocated-space sort in the kernel. I don't have a strong reason
  to disable the sort in the kernel, so I didn't send the patch.

Thanks, Anand

> Thanks,
> Qu
> 
>>   			index++;
>>   			if (type & BTRFS_BLOCK_GROUP_DUP)
>>   				index++;
>> @@ -1166,7 +1166,7 @@ again:
>>   		/* loop over this device again if we're doing a dup group */
>>   		if (!(type & BTRFS_BLOCK_GROUP_DUP) ||
>>   		    (index == num_stripes - 1))
>> -			list_move_tail(&device->dev_list, dev_list);
>> +			list_move(&device->dev_list, dev_list);
>>   
>>   		ret = btrfs_alloc_dev_extent(trans, device, key.offset,
>>   			     calc_size, &dev_offset);
>>
>
Qu Wenruo June 28, 2019, 6:01 a.m. UTC | #3
On 2019/6/28 11:28 AM, Anand Jain wrote:
> On 28/6/19 10:44 AM, Qu Wenruo wrote:
>>
>>
>> On 2019/6/28 10:26 AM, Anand Jain wrote:
>>> At mkfs.btrfs time, the device ID and stripe index end up reversed, as
>>> shown in [1]. This patch keeps them in order at mkfs.btrfs time, which
>>> makes debugging easier.
>>>
>>> Before:
>>> Stripe 0 is on devid 2; Stripe 1 is on devid 1;
>>>
>>> ./mkfs.btrfs -fq -draid1 -mraid1 /dev/sdb /dev/sdc && btrfs in
>>> dump-tree -d /dev/sdb | grep -A 10000 "chunk tree" | grep -B 10000
>>> "device tree" | grep -A 13  "FIRST_CHUNK_TREE CHUNK_ITEM"
>>>     item 2 key (FIRST_CHUNK_TREE CHUNK_ITEM 22020096) itemoff 15975
>>> itemsize 112
>>>         length 8388608 owner 2 stripe_len 65536 type SYSTEM|RAID1
>>>         io_align 65536 io_width 65536 sector_size 4096
>>>         num_stripes 2 sub_stripes 0
>>>             stripe 0 devid 2 offset 1048576
>>>             dev_uuid d9fe51c4-6e79-446d-87ee-5be3184798cd
>>>             stripe 1 devid 1 offset 22020096
>>>             dev_uuid 16f626ca-1a54-469b-ac7e-25623af884ab
>>>     item 3 key (FIRST_CHUNK_TREE CHUNK_ITEM 30408704) itemoff 15863
>>> itemsize 112
>>>         length 268435456 owner 2 stripe_len 65536 type METADATA|RAID1
>>>         io_align 65536 io_width 65536 sector_size 4096
>>>         num_stripes 2 sub_stripes 0
>>>             stripe 0 devid 2 offset 9437184
>>>             dev_uuid d9fe51c4-6e79-446d-87ee-5be3184798cd
>>>             stripe 1 devid 1 offset 30408704
>>>             dev_uuid 16f626ca-1a54-469b-ac7e-25623af884ab
>>>     item 4 key (FIRST_CHUNK_TREE CHUNK_ITEM 298844160) itemoff 15751
>>> itemsize 112
>>>         length 314572800 owner 2 stripe_len 65536 type DATA|RAID1
>>>         io_align 65536 io_width 65536 sector_size 4096
>>>         num_stripes 2 sub_stripes 0
>>>             stripe 0 devid 2 offset 277872640
>>>             dev_uuid d9fe51c4-6e79-446d-87ee-5be3184798cd
>>>             stripe 1 devid 1 offset 298844160
>>>             dev_uuid 16f626ca-1a54-469b-ac7e-25623af884ab
>>>
>>> After:
>>> Stripe 0 is on devid 1; Stripe 1 is on devid 2
>>>
>>> ./mkfs.btrfs -fq -draid1 -mraid1 /dev/sdb /dev/sdc && btrfs in
>>> dump-tree -d /dev/sdb | grep -A 10000 "chunk tree" | grep -B 10000
>>> "device tree" | grep -A 13  "FIRST_CHUNK_TREE CHUNK_ITEM"
>>> /dev/sdb: 8 bytes were erased at offset 0x00010040 (btrfs): 5f 42 48
>>> 52 66 53 5f 4d
>>> /dev/sdc: 8 bytes were erased at offset 0x00010040 (btrfs): 5f 42 48
>>> 52 66 53 5f 4d
>>>     item 2 key (FIRST_CHUNK_TREE CHUNK_ITEM 22020096) itemoff 15975
>>> itemsize 112
>>>         length 8388608 owner 2 stripe_len 65536 type SYSTEM|RAID1
>>>         io_align 65536 io_width 65536 sector_size 4096
>>>         num_stripes 2 sub_stripes 0
>>>             stripe 0 devid 1 offset 22020096
>>>             dev_uuid 6abc88fa-f42e-4f0c-9bc3-2225735e51d1
>>>             stripe 1 devid 2 offset 1048576
>>>             dev_uuid 73746d27-13a6-4d58-ac6b-48c90c31d94d
>>>     item 3 key (FIRST_CHUNK_TREE CHUNK_ITEM 30408704) itemoff 15863
>>> itemsize 112
>>>         length 268435456 owner 2 stripe_len 65536 type METADATA|RAID1
>>>         io_align 65536 io_width 65536 sector_size 4096
>>>         num_stripes 2 sub_stripes 0
>>>             stripe 0 devid 1 offset 30408704
>>>             dev_uuid 6abc88fa-f42e-4f0c-9bc3-2225735e51d1
>>>             stripe 1 devid 2 offset 9437184
>>>             dev_uuid 73746d27-13a6-4d58-ac6b-48c90c31d94d
>>>     item 4 key (FIRST_CHUNK_TREE CHUNK_ITEM 298844160) itemoff 15751
>>> itemsize 112
>>>         length 314572800 owner 2 stripe_len 65536 type DATA|RAID1
>>>         io_align 65536 io_width 65536 sector_size 4096
>>>         num_stripes 2 sub_stripes 0
>>>             stripe 0 devid 1 offset 298844160
>>>             dev_uuid 6abc88fa-f42e-4f0c-9bc3-2225735e51d1
>>>             stripe 1 devid 2 offset 277872640
>>>             dev_uuid 73746d27-13a6-4d58-ac6b-48c90c31d94d
>>>
>>> Signed-off-by: Anand Jain <anand.jain@oracle.com>
>>
>> Reviewed-by: Qu Wenruo <wqu@suse.com>
>>
>> But please also check the comment inlined below.
>>> ---
>>>   volumes.c | 4 ++--
>>>   1 file changed, 2 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/volumes.c b/volumes.c
>>> index 79d1d6a07fb7..8c8b17e814b8 100644
>>> --- a/volumes.c
>>> +++ b/volumes.c
>>> @@ -1109,7 +1109,7 @@ again:
>>>               return ret;
>>>           cur = cur->next;
>>>           if (avail >= min_free) {
>>> -            list_move_tail(&device->dev_list, &private_devs);
>>> +            list_move(&device->dev_list, &private_devs);
>>
>> This is OK because the current btrfs-progs chunk allocator, unlike the
>> kernel's, does not sort devices by their unallocated space.
>> So it's completely devid based.
>>
>> But please keep in mind that if we're going to unify the chunk
>> allocator behavior of the kernel and btrfs-progs, this behavior will change.
>>
>> The initial temporary chunk is always allocated on devid 1, which reduces
>> its unallocated space and therefore its priority in the chunk allocator,
>> making the devid sequence less reliable.
> 
>  Right. For debugging this, I have experimental code that disables
>  the unallocated-space sort in the kernel. I don't have a strong reason
>  to disable the sort in the kernel, so I didn't send the patch.

I'd say the unallocated-space sort is a hidden way to prevent starvation.

The most common case is 3-disk RAID1 (1024G x 3).
With the unallocated-space sort, we can make full use of 1.5T.

Without it, we can only use 1T, as all allocations will happen on
the first (or last) 2 devices and the remaining disk is not utilized at all.

So that kernel part is very helpful in preventing starvation.
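
The arithmetic above can be checked with a small model. This is only a sketch under stated assumptions, not the kernel allocator: it assumes three equal devices, a fixed 1G chunk size, and that each RAID1 chunk takes 1G from exactly two different devices. The function name raid1_chunks() and the DEMO_MAIN guard are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

#define NDEVS 3

/*
 * Count how many 1G RAID1 chunks fit on NDEVS devices of dev_gib each.
 * sort_by_free != 0: pick the two devices with the most free space.
 * sort_by_free == 0: pick the first two devices that still have space.
 */
static unsigned long raid1_chunks(unsigned long dev_gib, int sort_by_free)
{
	unsigned long free_gib[NDEVS];
	unsigned long chunks = 0;
	size_t i;

	for (i = 0; i < NDEVS; i++)
		free_gib[i] = dev_gib;

	for (;;) {
		size_t a = NDEVS, b = NDEVS;	/* NDEVS means "not chosen" */

		for (i = 0; i < NDEVS; i++) {
			if (free_gib[i] == 0)
				continue;
			if (!sort_by_free) {
				/* Fixed order: first two devices with space. */
				if (a == NDEVS)
					a = i;
				else if (b == NDEVS)
					b = i;
			} else if (a == NDEVS || free_gib[i] > free_gib[a]) {
				/* New largest; old largest becomes second. */
				b = a;
				a = i;
			} else if (b == NDEVS || free_gib[i] > free_gib[b]) {
				b = i;
			}
		}
		if (a == NDEVS || b == NDEVS)
			break;	/* fewer than two devices have space left */
		assert(a != b);
		free_gib[a]--;
		free_gib[b]--;
		chunks++;
	}
	return chunks;
}

#ifdef DEMO_MAIN /* build with -DDEMO_MAIN to run standalone */
int main(void)
{
	printf("with sort:    %luG usable\n", raid1_chunks(1024, 1));
	printf("without sort: %luG usable\n", raid1_chunks(1024, 0));
	return 0;
}
#endif
```

In this model the sorted variant keeps the three devices within 1G of each other, so all of the raw space is consumed, while the fixed-order variant stops as soon as the first two devices are full.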

Thanks,
Qu

> 
> Thanks, Anand
> 
>> Thanks,
>> Qu
>>
>>>               index++;
>>>               if (type & BTRFS_BLOCK_GROUP_DUP)
>>>                   index++;
>>> @@ -1166,7 +1166,7 @@ again:
>>>           /* loop over this device again if we're doing a dup group */
>>>           if (!(type & BTRFS_BLOCK_GROUP_DUP) ||
>>>               (index == num_stripes - 1))
>>> -            list_move_tail(&device->dev_list, dev_list);
>>> +            list_move(&device->dev_list, dev_list);
>>>             ret = btrfs_alloc_dev_extent(trans, device, key.offset,
>>>                    calc_size, &dev_offset);
>>>
>>
>
David Sterba July 3, 2019, 1:21 p.m. UTC | #4
On Fri, Jun 28, 2019 at 10:26:11AM +0800, Anand Jain wrote:
> At mkfs.btrfs time, the device ID and stripe index end up reversed, as
> shown in [1]. This patch keeps them in order at mkfs.btrfs time, which
> makes debugging easier.
> 
> Before:
> Stripe 0 is on devid 2; Stripe 1 is on devid 1;
> 
> ./mkfs.btrfs -fq -draid1 -mraid1 /dev/sdb /dev/sdc && btrfs in dump-tree -d /dev/sdb | grep -A 10000 "chunk tree" | grep -B 10000 "device tree" | grep -A 13  "FIRST_CHUNK_TREE CHUNK_ITEM"

I've reformatted that so it's not an overly long line. For dumps that's OK,
but a command can be split at && or | .

> 	item 2 key (FIRST_CHUNK_TREE CHUNK_ITEM 22020096) itemoff 15975 itemsize 112
> 		length 8388608 owner 2 stripe_len 65536 type SYSTEM|RAID1
> 		io_align 65536 io_width 65536 sector_size 4096
> 		num_stripes 2 sub_stripes 0
> 			stripe 0 devid 2 offset 1048576
> 			dev_uuid d9fe51c4-6e79-446d-87ee-5be3184798cd
> 			stripe 1 devid 1 offset 22020096
> 			dev_uuid 16f626ca-1a54-469b-ac7e-25623af884ab
> 	item 3 key (FIRST_CHUNK_TREE CHUNK_ITEM 30408704) itemoff 15863 itemsize 112
> 		length 268435456 owner 2 stripe_len 65536 type METADATA|RAID1
> 		io_align 65536 io_width 65536 sector_size 4096
> 		num_stripes 2 sub_stripes 0
> 			stripe 0 devid 2 offset 9437184
> 			dev_uuid d9fe51c4-6e79-446d-87ee-5be3184798cd
> 			stripe 1 devid 1 offset 30408704
> 			dev_uuid 16f626ca-1a54-469b-ac7e-25623af884ab
> 	item 4 key (FIRST_CHUNK_TREE CHUNK_ITEM 298844160) itemoff 15751 itemsize 112
> 		length 314572800 owner 2 stripe_len 65536 type DATA|RAID1
> 		io_align 65536 io_width 65536 sector_size 4096
> 		num_stripes 2 sub_stripes 0
> 			stripe 0 devid 2 offset 277872640
> 			dev_uuid d9fe51c4-6e79-446d-87ee-5be3184798cd
> 			stripe 1 devid 1 offset 298844160
> 			dev_uuid 16f626ca-1a54-469b-ac7e-25623af884ab
> 
> After:
> Stripe 0 is on devid 1; Stripe 1 is on devid 2
> 
> ./mkfs.btrfs -fq -draid1 -mraid1 /dev/sdb /dev/sdc && btrfs in dump-tree -d /dev/sdb | grep -A 10000 "chunk tree" | grep -B 10000 "device tree" | grep -A 13  "FIRST_CHUNK_TREE CHUNK_ITEM"
> /dev/sdb: 8 bytes were erased at offset 0x00010040 (btrfs): 5f 42 48 52 66 53 5f 4d
> /dev/sdc: 8 bytes were erased at offset 0x00010040 (btrfs): 5f 42 48 52 66 53 5f 4d
> 	item 2 key (FIRST_CHUNK_TREE CHUNK_ITEM 22020096) itemoff 15975 itemsize 112
> 		length 8388608 owner 2 stripe_len 65536 type SYSTEM|RAID1
> 		io_align 65536 io_width 65536 sector_size 4096
> 		num_stripes 2 sub_stripes 0
> 			stripe 0 devid 1 offset 22020096
> 			dev_uuid 6abc88fa-f42e-4f0c-9bc3-2225735e51d1
> 			stripe 1 devid 2 offset 1048576
> 			dev_uuid 73746d27-13a6-4d58-ac6b-48c90c31d94d
> 	item 3 key (FIRST_CHUNK_TREE CHUNK_ITEM 30408704) itemoff 15863 itemsize 112
> 		length 268435456 owner 2 stripe_len 65536 type METADATA|RAID1
> 		io_align 65536 io_width 65536 sector_size 4096
> 		num_stripes 2 sub_stripes 0
> 			stripe 0 devid 1 offset 30408704
> 			dev_uuid 6abc88fa-f42e-4f0c-9bc3-2225735e51d1
> 			stripe 1 devid 2 offset 9437184
> 			dev_uuid 73746d27-13a6-4d58-ac6b-48c90c31d94d
> 	item 4 key (FIRST_CHUNK_TREE CHUNK_ITEM 298844160) itemoff 15751 itemsize 112
> 		length 314572800 owner 2 stripe_len 65536 type DATA|RAID1
> 		io_align 65536 io_width 65536 sector_size 4096
> 		num_stripes 2 sub_stripes 0
> 			stripe 0 devid 1 offset 298844160
> 			dev_uuid 6abc88fa-f42e-4f0c-9bc3-2225735e51d1
> 			stripe 1 devid 2 offset 277872640
> 			dev_uuid 73746d27-13a6-4d58-ac6b-48c90c31d94d
> 
> Signed-off-by: Anand Jain <anand.jain@oracle.com>

Added to devel, thanks.

Patch

diff --git a/volumes.c b/volumes.c
index 79d1d6a07fb7..8c8b17e814b8 100644
--- a/volumes.c
+++ b/volumes.c
@@ -1109,7 +1109,7 @@  again:
 			return ret;
 		cur = cur->next;
 		if (avail >= min_free) {
-			list_move_tail(&device->dev_list, &private_devs);
+			list_move(&device->dev_list, &private_devs);
 			index++;
 			if (type & BTRFS_BLOCK_GROUP_DUP)
 				index++;
@@ -1166,7 +1166,7 @@  again:
 		/* loop over this device again if we're doing a dup group */
 		if (!(type & BTRFS_BLOCK_GROUP_DUP) ||
 		    (index == num_stripes - 1))
-			list_move_tail(&device->dev_list, dev_list);
+			list_move(&device->dev_list, dev_list);
 
 		ret = btrfs_alloc_dev_extent(trans, device, key.offset,
 			     calc_size, &dev_offset);
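
Why swapping list_move_tail() for list_move() flips the stripe order can be seen in a stripped-down model of the two loops touched by the patch. This is a sketch, not the btrfs-progs code: the list helpers are minimal re-implementations with the usual kernel-list semantics (list_move() inserts at the front, list_move_tail() at the end), the free-space and DUP checks are left out, and the starting order [devid 2, devid 1] of the device list is an assumption chosen to match the "Before" dump. The names stripe_devids(), dev_of() and DEMO_MAIN are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Minimal stand-in for the kernel-style circular doubly linked list. */
struct list_head { struct list_head *next, *prev; };

static void list_init(struct list_head *h) { h->next = h->prev = h; }

static void list_unlink(struct list_head *e)
{
	e->prev->next = e->next;
	e->next->prev = e->prev;
}

/* Unlink e and re-insert it right after head (front of the list). */
static void list_move(struct list_head *e, struct list_head *head)
{
	list_unlink(e);
	e->next = head->next;
	e->prev = head;
	head->next->prev = e;
	head->next = e;
}

/* Unlink e and re-insert it right before head (end of the list). */
static void list_move_tail(struct list_head *e, struct list_head *head)
{
	list_unlink(e);
	e->prev = head->prev;
	e->next = head;
	head->prev->next = e;
	head->prev = e;
}

struct dev {
	struct list_head dev_list;
	int devid;
};

#define dev_of(ptr) \
	((struct dev *)((char *)(ptr) - offsetof(struct dev, dev_list)))

typedef void (*move_fn)(struct list_head *, struct list_head *);

/* Fill out[i] with the devid stripe i lands on when both loops use move. */
static void stripe_devids(move_fn move, int out[2])
{
	struct list_head dev_list, private_devs, *cur, *next;
	struct dev d1 = { .devid = 1 }, d2 = { .devid = 2 };
	int index = 0;

	list_init(&dev_list);
	list_init(&private_devs);
	list_init(&d1.dev_list);
	list_init(&d2.dev_list);
	/* ASSUMPTION: the device list starts as [devid 2, devid 1]. */
	list_move_tail(&d2.dev_list, &dev_list);
	list_move_tail(&d1.dev_list, &dev_list);

	/* First loop: collect candidate devices on a private list. */
	for (cur = dev_list.next; cur != &dev_list; cur = next) {
		next = cur->next;
		move(cur, &private_devs);
	}

	/* Second loop: stripe N takes the device at the front. */
	while (private_devs.next != &private_devs) {
		cur = private_devs.next;
		assert(index < 2);
		out[index++] = dev_of(cur)->devid;
		move(cur, &dev_list);
	}
}

#ifdef DEMO_MAIN /* build with -DDEMO_MAIN to run standalone */
int main(void)
{
	int tail[2], head[2];

	stripe_devids(list_move_tail, tail);
	stripe_devids(list_move, head);
	printf("list_move_tail: stripe 0 devid %d, stripe 1 devid %d\n",
	       tail[0], tail[1]);
	printf("list_move:      stripe 0 devid %d, stripe 1 devid %d\n",
	       head[0], head[1]);
	return 0;
}
#endif
```

Under that assumed starting order, the list_move_tail() variant reproduces the "Before" layout (stripe 0 on devid 2) and the list_move() variant reproduces the "After" layout (stripe 0 on devid 1).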