
[v8,19/27] btrfs: try more times to alloc metadata reserve space

Message ID 1458610552-9845-20-git-send-email-quwenruo@cn.fujitsu.com (mailing list archive)
State New, archived

Commit Message

Qu Wenruo March 22, 2016, 1:35 a.m. UTC
From: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>

In btrfs_delalloc_reserve_metadata(), the number of metadata bytes we try
to reserve is calculated by the difference between outstanding_extents and
reserved_extents.

When reserve_metadata_bytes() fails to reserve the desired metadata space,
it has already done some reclaim work, such as writing out ordered extents.

In that case, outstanding_extents and reserved_extents may have already
changed, and we may be able to reserve enough metadata space on a retry.

So this patch tries calling reserve_metadata_bytes() at most 3 times to
ensure we have really run out of space.

Such false ENOSPC is mainly caused by small file extents and time-consuming
delalloc functions, which mainly affects in-band de-duplication.
(Compression should also be affected, but LZO/zlib is faster than SHA256,
so it is still harder to trigger than with dedupe.)

Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
---
 fs/btrfs/extent-tree.c | 25 ++++++++++++++++++++++---
 1 file changed, 22 insertions(+), 3 deletions(-)
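
To make the mechanism concrete, here is a rough sketch of the flow the
commit message describes, heavily simplified from
btrfs_delalloc_reserve_metadata(). The BTRFS_I fields and the helpers named
below exist in the 4.x-era code, but the error handling is elided (the real
out_fail path drops the extents a failed attempt added before looping), so
treat this as an illustration, not the kernel source:
------
	int loops = 0;

again:
	spin_lock(&BTRFS_I(inode)->lock);
	nr_extents = (unsigned)div64_u64(num_bytes + BTRFS_MAX_EXTENT_SIZE - 1,
					 BTRFS_MAX_EXTENT_SIZE);
	BTRFS_I(inode)->outstanding_extents += nr_extents;

	/* We only ask for the disparity, not the full extent count. */
	nr_extents = 0;
	if (BTRFS_I(inode)->outstanding_extents >
	    BTRFS_I(inode)->reserved_extents)
		nr_extents = BTRFS_I(inode)->outstanding_extents -
			     BTRFS_I(inode)->reserved_extents;
	spin_unlock(&BTRFS_I(inode)->lock);

	to_reserve = btrfs_calc_trans_metadata_size(root, nr_extents);
	ret = reserve_metadata_bytes(root, block_rsv, to_reserve, flush);
	/*
	 * reserve_metadata_bytes() may have completed ordered extents while
	 * reclaiming, so the disparity (and thus to_reserve) can be smaller
	 * on the next pass.
	 */
	if (ret == -ENOSPC && loops++ < 3)
		goto again;
------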

Comments

Josef Bacik April 22, 2016, 6:06 p.m. UTC | #1
On 03/21/2016 09:35 PM, Qu Wenruo wrote:
> [snip]

NAK, we aren't going to just arbitrarily retry to make our metadata
reservation.  Dropping reserved metadata space by completing ordered
extents should free enough to make our current reservation, and in fact
this only accounts for the disparity, so it should be an accurate count
most of the time.  I can see a case for detecting that the disparity no
longer exists (we freed enough ordered extents that we are no longer
trying to reserve ours + overflow but now only ours) and retrying in
_that specific case_, but we need to limit it to that case only.  Thanks,

Josef
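
A minimal sketch of the narrower retry the NAK describes might look like
the following; the retried flag and the exact condition are illustrative
guesses at the suggestion, not actual btrfs code (only the BTRFS_I fields
are real):
------
	if (ret == -ENOSPC && !retried) {
		bool only_ours;

		/*
		 * Retry only if enough ordered extents completed that the
		 * overflow part of the request is gone, i.e. we would now
		 * be reserving just our own nr_extents instead of
		 * ours + overflow.
		 */
		spin_lock(&BTRFS_I(inode)->lock);
		only_ours = (BTRFS_I(inode)->outstanding_extents -
			     BTRFS_I(inode)->reserved_extents) <= nr_extents;
		spin_unlock(&BTRFS_I(inode)->lock);

		if (only_ours) {
			retried = true;
			goto again;
		}
	}
------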
Qu Wenruo April 25, 2016, 12:54 a.m. UTC | #2
Josef Bacik wrote on 2016/04/22 14:06 -0400:
> On 03/21/2016 09:35 PM, Qu Wenruo wrote:
>> [snip]
>
> NAK, we aren't going to just arbitrarily retry to make our metadata
> reservation.  Dropping reserved metadata space by completing ordered
> extents should free enough to make our current reservation, and in fact
> this only accounts for the disparity, so it should be an accurate count
> most of the time.  I can see a case for detecting that the disparity no
> longer exists (we freed enough ordered extents that we are no longer
> trying to reserve ours + overflow but now only ours) and retrying in
> _that specific case_, but we need to limit it to that case only.  Thanks,

Would it be OK to retry only for the dedupe-enabled case?

Currently it's only a workaround and we are still digging for the root
cause, but as a workaround I assume it is good enough for the
dedupe-enabled case.

Thanks,
Qu


Josef Bacik April 25, 2016, 2:05 p.m. UTC | #3
On 04/24/2016 08:54 PM, Qu Wenruo wrote:
> [snip]
>
> Would it be OK to retry only for the dedupe-enabled case?
>
> Currently it's only a workaround and we are still digging for the root
> cause, but as a workaround I assume it is good enough for the
> dedupe-enabled case.
>

No, we're not going to leave things in a known broken state to come back
to later; that just makes it so we forget stuff and it sits there
forever.  Thanks,

Josef

Qu Wenruo April 26, 2016, 12:50 a.m. UTC | #4
Josef Bacik wrote on 2016/04/25 10:05 -0400:
> On 04/24/2016 08:54 PM, Qu Wenruo wrote:
>> [snip]
>>
>> Would it be OK to retry only for the dedupe-enabled case?
>>
>> Currently it's only a workaround and we are still digging for the root
>> cause, but as a workaround I assume it is good enough for the
>> dedupe-enabled case.
>>
>
> No, we're not going to leave things in a known broken state to come back
> to later; that just makes it so we forget stuff and it sits there
> forever.  Thanks,
>
> Josef

OK, we'll investigate it and find the best fix.

BTW, we also found that extent-tree.c uses the same 3-loop pattern
(which is why we chose the same method):
------
         loops = 0;
         while (delalloc_bytes && loops < 3) {
                 max_reclaim = min(delalloc_bytes, to_reclaim);
                 nr_pages = max_reclaim >> PAGE_CACHE_SHIFT;
                 btrfs_writeback_inodes_sb_nr(root, nr_pages, items);
                 /*
                  * We need to wait for the async pages to actually start
                  * before we do anything.
                  */
                 max_reclaim = atomic_read(&root->fs_info->async_delalloc_pages);
                 if (!max_reclaim)
                         goto skip_async;
------

Any idea why it's still there?

Thanks,
Qu
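
The two loops are not quite equivalent, though: shrink_delalloc() sits
inside the flushing machinery driven by a single call to
reserve_metadata_bytes(), so its "loops < 3" bounds how many times
writeback is re-kicked during reclaim, rather than retrying a reservation
that has already failed. Roughly, in the 4.x-era code:
------
btrfs_delalloc_reserve_metadata()
  reserve_metadata_bytes()             /* one reservation attempt */
    flush_space(..., FLUSH_DELALLOC)
      shrink_delalloc()                /* the "loops < 3" above lives here,
                                          re-kicking writeback while
                                          reclaiming */
------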



Patch

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index dabd721..016d2ec 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -2421,7 +2421,7 @@  static int run_one_delayed_ref(struct btrfs_trans_handle *trans,
 				 * a new extent is revered, then deleted
 				 * in one tran, and inc/dec get merged to 0.
 				 *
-				 * In this case, we need to remove its dedup
+				 * In this case, we need to remove its dedupe
 				 * hash.
 				 */
 				btrfs_dedupe_del(trans, fs_info, node->bytenr);
@@ -5675,6 +5675,7 @@  int btrfs_delalloc_reserve_metadata(struct inode *inode, u64 num_bytes)
 	bool delalloc_lock = true;
 	u64 to_free = 0;
 	unsigned dropped;
+	int loops = 0;
 
 	/* If we are a free space inode we need to not flush since we will be in
 	 * the middle of a transaction commit.  We also don't need the delalloc
@@ -5690,11 +5691,12 @@  int btrfs_delalloc_reserve_metadata(struct inode *inode, u64 num_bytes)
 	    btrfs_transaction_in_commit(root->fs_info))
 		schedule_timeout(1);
 
+	num_bytes = ALIGN(num_bytes, root->sectorsize);
+
+again:
 	if (delalloc_lock)
 		mutex_lock(&BTRFS_I(inode)->delalloc_mutex);
 
-	num_bytes = ALIGN(num_bytes, root->sectorsize);
-
 	spin_lock(&BTRFS_I(inode)->lock);
 	nr_extents = (unsigned)div64_u64(num_bytes +
 					 BTRFS_MAX_EXTENT_SIZE - 1,
@@ -5815,6 +5817,23 @@  out_fail:
 	}
 	if (delalloc_lock)
 		mutex_unlock(&BTRFS_I(inode)->delalloc_mutex);
+	/*
+	 * The number of metadata bytes is calculated from the difference
+	 * between outstanding_extents and reserved_extents. Sometimes
+	 * reserve_metadata_bytes() fails to reserve the wanted bytes even
+	 * though it has already done some work to reclaim metadata space;
+	 * both outstanding_extents and reserved_extents may then have
+	 * changed, and the amount we try to reserve may be smaller now.
+	 * So here we try to reserve again. This is most useful for online
+	 * dedupe, which can easily eat almost all metadata space.
+	 *
+	 * XXX: The value 3 is chosen arbitrarily; it is a good enough
+	 * workaround for online dedupe, but we should later find a better
+	 * method to avoid the dedupe ENOSPC issue.
+	 */
+	if (unlikely(ret == -ENOSPC && loops++ < 3))
+		goto again;
+
 	return ret;
 }