
Btrfs: avoid allocating too many data chunks on massive concurrent write

Message ID 20190426110922.21888-1-robbieko@synology.com (mailing list archive)
State New, archived
Series Btrfs: avoid allocating too many data chunks on massive concurrent write

Commit Message

robbieko April 26, 2019, 11:09 a.m. UTC
From: Robbie Ko <robbieko@synology.com>

I found an issue where btrfs allocates much more data space than it actually
needs under massive concurrent writes. This can consume all of the free space,
and when more space is later needed for metadata the result is ENOSPC.

I reproduced the issue with a write stress test of 5000 concurrent dd
processes.

The space info after ENOSPC:
Overall:
Device size: 926.91GiB
Device allocated: 926.91GiB
Device unallocated: 1.00MiB
Device missing: 0.00B
Used: 211.59GiB
Free (estimated): 714.18GiB (min: 714.18GiB)
Data ratio: 1.00
Metadata ratio: 2.00
Global reserve: 512.00MiB (used: 0.00B)

Data,single: Size:923.77GiB, Used:209.59GiB
/dev/devname 923.77GiB

Metadata,DUP: Size:1.50GiB, Used:1022.50MiB
/dev/devname 3.00GiB

System,DUP: Size:72.00MiB, Used:128.00KiB
/dev/devname 144.00MiB

We can see that the Metadata space (1022.50MiB + 512.00MiB) is almost full,
while Data has allocated much more space (923.77GiB) than it actually uses
(209.59GiB).
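
The "Free (estimated)" figure already shows the problem: 923.77GiB - 209.59GiB
= 714.18GiB of data space is allocated but unused, yet only 1.00MiB of the
device remains unallocated, so nothing is left for new metadata chunks.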

When the data space runs out, each of these 5000 dd processes calls
do_chunk_alloc() to allocate more space.

In the while loop of do_chunk_alloc(), the variable force is changed to
CHUNK_ALLOC_FORCE on the second iteration, and should_alloc_chunk() always
returns true when force is CHUNK_ALLOC_FORCE. That means every concurrent dd
process ends up allocating its own chunk in do_chunk_alloc().

Fix this by keeping the original value of force and re-assigning it to force
at the end of the loop, so that waiters re-check with the caller's original
value.
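
For reference, a simplified sketch of the loop in question (illustrative only,
not the exact kernel source: should_alloc_chunk(), force/orig_force,
wait_for_alloc and the lock/mutex calls appear in the patch context below; the
remaining conditions are paraphrased and some handling is omitted):

	do {
		spin_lock(&space_info->lock);
		/*
		 * force can be raised here to the space_info's forced level;
		 * once it is CHUNK_ALLOC_FORCE, should_alloc_chunk() below
		 * always returns true.
		 */
		should_alloc = should_alloc_chunk(fs_info, space_info, force);
		if (!should_alloc) {
			/* Nothing to do, someone else already got us space. */
			spin_unlock(&space_info->lock);
			return 0;
		} else if (space_info->chunk_alloc) {
			/*
			 * Someone else is already allocating: wait on
			 * chunk_mutex, then loop back and re-check.
			 */
			wait_for_alloc = true;
			spin_unlock(&space_info->lock);
			force = orig_force;	/* the fix: re-check with the caller's value */
			mutex_lock(&fs_info->chunk_mutex);
			mutex_unlock(&fs_info->chunk_mutex);
		} else {
			/* Proceed with the allocation ourselves. */
			wait_for_alloc = false;
			spin_unlock(&space_info->lock);
		}
	} while (wait_for_alloc);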

Signed-off-by: Robbie Ko <robbieko@synology.com>
---
 fs/btrfs/extent-tree.c | 2 ++
 1 file changed, 2 insertions(+)

Comments

Filipe Manana May 3, 2019, 10:53 a.m. UTC | #1
On Fri, Apr 26, 2019 at 12:10 PM robbieko <robbieko@synology.com> wrote:
>
> From: Robbie Ko <robbieko@synology.com>
>
> I found an issue where btrfs allocates much more data space than it actually
> needs under massive concurrent writes. This can consume all of the free
> space, and when more space is later needed for metadata the result is ENOSPC.
>
> I reproduced the issue with a write stress test of 5000 concurrent dd
> processes.
>
> The space info after ENOSPC:
> Overall:
> Device size: 926.91GiB
> Device allocated: 926.91GiB
> Device unallocated: 1.00MiB
> Device missing: 0.00B
> Used: 211.59GiB
> Free (estimated): 714.18GiB (min: 714.18GiB)
> Data ratio: 1.00
> Metadata ratio: 2.00
> Global reserve: 512.00MiB (used: 0.00B)
>
> Data,single: Size:923.77GiB, Used:209.59GiB
> /dev/devname 923.77GiB
>
> Metadata,DUP: Size:1.50GiB, Used:1022.50MiB
> /dev/devname 3.00GiB
>
> System,DUP: Size:72.00MiB, Used:128.00KiB
> /dev/devname 144.00MiB

So can you provide more details? What parameters did you give to the dd processes?

I tried to reproduce that like this:

for ((i = 0; i < 5000; i++)); do
  dd if=/dev/zero of=/mnt/sdi/dd$i bs=2M oflag=dsync &
  dd_pids[$i]=$!
done

wait ${dd_pids[@]}

But after it finished, the used data space was about the same as the
allocated data space (it was on a 200G fs).
So I didn't observe the problem you mention, even though at first
glance the patch looks ok.

Thanks.

>
> We can see that the Metadata space (1022.50MiB + 512.00MiB) is almost full,
> while Data has allocated much more space (923.77GiB) than it actually uses
> (209.59GiB).
>
> When the data space runs out, each of these 5000 dd processes calls
> do_chunk_alloc() to allocate more space.
>
> In the while loop of do_chunk_alloc(), the variable force is changed to
> CHUNK_ALLOC_FORCE on the second iteration, and should_alloc_chunk() always
> returns true when force is CHUNK_ALLOC_FORCE. That means every concurrent dd
> process ends up allocating its own chunk in do_chunk_alloc().
>
> Fix this by keeping the original value of force and re-assigning it to force
> at the end of the loop, so that waiters re-check with the caller's original
> value.
>
> Signed-off-by: Robbie Ko <robbieko@synology.com>
> ---
>  fs/btrfs/extent-tree.c | 2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index c5880329ae37..73856f70db31 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -4511,6 +4511,7 @@ static int do_chunk_alloc(struct btrfs_trans_handle *trans, u64 flags,
>         bool wait_for_alloc = false;
>         bool should_alloc = false;
>         int ret = 0;
> +       int orig_force = force;
>
>         /* Don't re-enter if we're already allocating a chunk */
>         if (trans->allocating_chunk)
> @@ -4544,6 +4545,7 @@ static int do_chunk_alloc(struct btrfs_trans_handle *trans, u64 flags,
>                          */
>                         wait_for_alloc = true;
>                         spin_unlock(&space_info->lock);
> +                       force = orig_force;
>                         mutex_lock(&fs_info->chunk_mutex);
>                         mutex_unlock(&fs_info->chunk_mutex);
>                 } else {
> --
> 2.17.1
>
robbieko May 6, 2019, 2:54 a.m. UTC | #2
Filipe Manana wrote on 2019-05-03 18:53:
> On Fri, Apr 26, 2019 at 12:10 PM robbieko <robbieko@synology.com> 
> wrote:
>> 
>> From: Robbie Ko <robbieko@synology.com>
>> 
>> I found an issue where btrfs allocates much more data space than it
>> actually needs under massive concurrent writes. This can consume all of
>> the free space, and when more space is later needed for metadata the
>> result is ENOSPC.
>>
>> I reproduced the issue with a write stress test of 5000 concurrent dd
>> processes.
>> 
>> The space info after ENOSPC:
>> Overall:
>> Device size: 926.91GiB
>> Device allocated: 926.91GiB
>> Device unallocated: 1.00MiB
>> Device missing: 0.00B
>> Used: 211.59GiB
>> Free (estimated): 714.18GiB (min: 714.18GiB)
>> Data ratio: 1.00
>> Metadata ratio: 2.00
>> Global reserve: 512.00MiB (used: 0.00B)
>> 
>> Data,single: Size:923.77GiB, Used:209.59GiB
>> /dev/devname 923.77GiB
>> 
>> Metadata,DUP: Size:1.50GiB, Used:1022.50MiB
>> /dev/devname 3.00GiB
>> 
>> System,DUP: Size:72.00MiB, Used:128.00KiB
>> /dev/devname 144.00MiB
> 
> So can you provide more details? What parameters did you give to the dd
> processes?
> 
> I tried to reproduce that like this:
> 
> for ((i = 0; i < 5000; i++)); do
>   dd if=/dev/zero of=/mnt/sdi/dd$i bs=2M oflag=dsync &
>   dd_pids[$i]=$!
> done
> 
> wait ${dd_pids[@]}
> 
> But after it finished, the used data space was about the same as the
> allocated data space (it was on a 200G fs).
> So I didn't observe the problem you mention, even though at first
> glance the patch looks ok.
> 
> Thanks.
> 

I used the following script:

#!/bin/bash
for (( i=1; i<=1000000; i++ ))
do
	dd if=/dev/urandom of="/mnt/dir/sa${i}" bs=1k count=1024 status=none &
	remain=$(( $i % 5000))
	if [ $remain -eq 0 ]; then
		echo "i = ${i}"
		wait
	fi
done

The problem occurred after the script had run for about one day.

Thanks.


>> 
>> We can see that the Metadata space (1022.50MiB + 512.00MiB) is almost
>> full, while Data has allocated much more space (923.77GiB) than it
>> actually uses (209.59GiB).
>>
>> When the data space runs out, each of these 5000 dd processes calls
>> do_chunk_alloc() to allocate more space.
>>
>> In the while loop of do_chunk_alloc(), the variable force is changed to
>> CHUNK_ALLOC_FORCE on the second iteration, and should_alloc_chunk() always
>> returns true when force is CHUNK_ALLOC_FORCE. That means every concurrent
>> dd process ends up allocating its own chunk in do_chunk_alloc().
>>
>> Fix this by keeping the original value of force and re-assigning it to
>> force at the end of the loop, so that waiters re-check with the caller's
>> original value.
>> 
>> Signed-off-by: Robbie Ko <robbieko@synology.com>
>> ---
>>  fs/btrfs/extent-tree.c | 2 ++
>>  1 file changed, 2 insertions(+)
>> 
>> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
>> index c5880329ae37..73856f70db31 100644
>> --- a/fs/btrfs/extent-tree.c
>> +++ b/fs/btrfs/extent-tree.c
>> @@ -4511,6 +4511,7 @@ static int do_chunk_alloc(struct 
>> btrfs_trans_handle *trans, u64 flags,
>>         bool wait_for_alloc = false;
>>         bool should_alloc = false;
>>         int ret = 0;
>> +       int orig_force = force;
>> 
>>         /* Don't re-enter if we're already allocating a chunk */
>>         if (trans->allocating_chunk)
>> @@ -4544,6 +4545,7 @@ static int do_chunk_alloc(struct 
>> btrfs_trans_handle *trans, u64 flags,
>>                          */
>>                         wait_for_alloc = true;
>>                         spin_unlock(&space_info->lock);
>> +                       force = orig_force;
>>                         mutex_lock(&fs_info->chunk_mutex);
>>                         mutex_unlock(&fs_info->chunk_mutex);
>>                 } else {
>> --
>> 2.17.1
>>

Patch

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index c5880329ae37..73856f70db31 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -4511,6 +4511,7 @@  static int do_chunk_alloc(struct btrfs_trans_handle *trans, u64 flags,
 	bool wait_for_alloc = false;
 	bool should_alloc = false;
 	int ret = 0;
+	int orig_force = force;
 
 	/* Don't re-enter if we're already allocating a chunk */
 	if (trans->allocating_chunk)
@@ -4544,6 +4545,7 @@  static int do_chunk_alloc(struct btrfs_trans_handle *trans, u64 flags,
 			 */
 			wait_for_alloc = true;
 			spin_unlock(&space_info->lock);
+			force = orig_force;
 			mutex_lock(&fs_info->chunk_mutex);
 			mutex_unlock(&fs_info->chunk_mutex);
 		} else {