[4/4] btrfs: update btrfs_space_info's bytes_may_use timely

Message ID 20160720055637.7275-5-wangxg.fnst@cn.fujitsu.com (mailing list archive)
State Superseded
Commit Message

Xiaoguang Wang July 20, 2016, 5:56 a.m. UTC
This patch fixes some false ENOSPC errors. The test script below
reproduces one of them:
	#!/bin/bash
	dd if=/dev/zero of=fs.img bs=$((1024*1024)) count=128
	dev=$(losetup --show -f fs.img)
	mkfs.btrfs -f -M $dev
	mkdir /tmp/mntpoint
	mount $dev /tmp/mntpoint
	cd /tmp/mntpoint
	xfs_io -f -c "falloc 0 $((64*1024*1024))" testfile

The script above fails with ENOSPC even though the fs still has enough
free space to satisfy the request. See the call graph:
btrfs_fallocate()
|-> btrfs_alloc_data_chunk_ondemand()
|   bytes_may_use += 64M
|-> btrfs_prealloc_file_range()
    |-> btrfs_reserve_extent()
        |-> btrfs_add_reserved_bytes()
        |   alloc_type is RESERVE_ALLOC_NO_ACCOUNT, so it does not
        |   change bytes_may_use, and bytes_reserved += 64M. Now
        |   bytes_may_use + bytes_reserved == 128M, which is greater
        |   than btrfs_space_info's total_bytes, so a false ENOSPC
        |   occurs. Note that bytes_may_use is only decreased at the
        |   end of btrfs_fallocate(), which is too late.
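
The arithmetic in the call graph can be replayed with a small user-space
model. This is only a sketch and not kernel code: struct space_info_model,
the model_*() helpers and the 100M total_bytes value are assumptions made
for illustration (the text only says that total_bytes is below 128M). Only
the counter names are taken from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MiB (1024ULL * 1024ULL)

/* Hypothetical stand-in for the three counters discussed above. */
struct space_info_model {
	uint64_t total_bytes;
	uint64_t bytes_reserved;
	uint64_t bytes_may_use;
};

/* Step 1 of the call graph: the request is counted in bytes_may_use. */
static void model_may_use(struct space_info_model *s, uint64_t bytes)
{
	s->bytes_may_use += bytes;
}

/*
 * Step 2: the extent is reserved. update_may_use == false mimics
 * RESERVE_ALLOC_NO_ACCOUNT (old behaviour), true mimics the patch.
 */
static void model_add_reserved(struct space_info_model *s, uint64_t bytes,
			       bool update_may_use)
{
	s->bytes_reserved += bytes;
	if (update_may_use) {
		assert(s->bytes_may_use >= bytes);
		s->bytes_may_use -= bytes;
	}
}

/* True when the counters claim more space than total_bytes. */
static bool model_overcommitted(const struct space_info_model *s)
{
	return s->bytes_reserved + s->bytes_may_use > s->total_bytes;
}

/* Replays the 64M fallocate and reports whether it ends over-committed. */
static bool model_fallocate_overcommits(bool update_may_use)
{
	/* 100M is an assumed value, chosen to be below 128M. */
	struct space_info_model s = { .total_bytes = 100 * MiB };

	model_may_use(&s, 64 * MiB);
	model_add_reserved(&s, 64 * MiB, update_may_use);
	return model_overcommitted(&s);
}
```

In this model, model_fallocate_overcommits(false) counts the same 64M
twice and ends over-committed, while model_fallocate_overcommits(true)
moves the 64M from bytes_may_use to bytes_reserved and does not.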

Here is another simple case for buffered write:
                    CPU 1              |              CPU 2
                                       |
|-> cow_file_range()                   |-> __btrfs_buffered_write()
    |-> btrfs_reserve_extent()         |   |
    |                                  |   |
    |                                  |   |
    |    .....                         |   |-> btrfs_check_data_free_space()
    |                                  |
    |                                  |
    |-> extent_clear_unlock_delalloc() |

On CPU 1, btrfs_reserve_extent()->find_free_extent()->
btrfs_add_reserved_bytes() does not decrease bytes_may_use; the decrease
is delayed until extent_clear_unlock_delalloc(). Assume that in this
case btrfs_reserve_extent() reserved 128MB of data space and CPU 2's
btrfs_check_data_free_space() tries to reserve 100MB of data space.
If
	100MB > data_sinfo->total_bytes - data_sinfo->bytes_used -
		data_sinfo->bytes_reserved - data_sinfo->bytes_pinned -
		data_sinfo->bytes_readonly - data_sinfo->bytes_may_use
btrfs_check_data_free_space() will try to allocate a new data chunk,
call btrfs_start_delalloc_roots(), or commit the current transaction in
order to reserve some free space, which is a lot of work. None of that
is necessary: if bytes_may_use had been decreased by 128M in time, we
would still have free space.
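
The free-space check above can be sketched in user space as follows. The
struct, the helper names and the 400M/100M starting values are
hypothetical and chosen only to make the numbers work out. The 128M
reservation and the 100M request come from the description, and the
formula mirrors the inequality quoted above. This is not the kernel
implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MiB (1024ULL * 1024ULL)

/* Hypothetical mirror of the data_sinfo fields used in the inequality. */
struct data_sinfo_model {
	uint64_t total_bytes;
	uint64_t bytes_used;
	uint64_t bytes_reserved;
	uint64_t bytes_pinned;
	uint64_t bytes_readonly;
	uint64_t bytes_may_use;
};

/* total_bytes minus everything already claimed, clamped at zero. */
static uint64_t model_free_bytes(const struct data_sinfo_model *s)
{
	uint64_t claimed = s->bytes_used + s->bytes_reserved +
			   s->bytes_pinned + s->bytes_readonly +
			   s->bytes_may_use;

	return claimed >= s->total_bytes ? 0 : s->total_bytes - claimed;
}

/* True when the request would trigger chunk allocation, flushing or commit. */
static bool model_needs_slow_path(const struct data_sinfo_model *s,
				  uint64_t want)
{
	return want > model_free_bytes(s);
}

/*
 * CPU 1 reserves 128M that was previously counted in bytes_may_use, then
 * CPU 2 asks for 100M. 'timely' selects whether bytes_may_use is dropped
 * at reservation time (the patch) or left for later (old behaviour).
 */
static bool model_cpu2_needs_slow_path(bool timely)
{
	struct data_sinfo_model s = {
		.total_bytes = 400 * MiB,	/* assumed */
		.bytes_used = 100 * MiB,	/* assumed */
		.bytes_may_use = 128 * MiB,
	};

	s.bytes_reserved += 128 * MiB;
	if (timely) {
		assert(s.bytes_may_use >= 128 * MiB);
		s.bytes_may_use -= 128 * MiB;
	}
	return model_needs_slow_path(&s, 100 * MiB);
}
```

With these assumed numbers the stale counter leaves 44M apparently free,
so the 100M request takes the slow path. With the timely decrease 172M is
free and the request fits directly.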

To fix this issue, this patch updates bytes_may_use for both data and
metadata in btrfs_add_reserved_bytes(). In the compress path the real
extent length may differ from the file content length, and bytes_may_use
was increased by the file content length, so introduce a ram_bytes
argument for btrfs_reserve_extent(), find_free_extent() and
btrfs_add_reserved_bytes(). The compress path can then update
bytes_may_use correctly. RESERVE_ALLOC_NO_ACCOUNT, RESERVE_ALLOC and
RESERVE_FREE are no longer needed and are removed.
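
Why the compress path needs ram_bytes can be shown with a short sketch.
The struct and the model_*() helpers are hypothetical, and the 128K
content length and 32K compressed extent are made-up example sizes. Only
the two counter updates follow the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define KiB 1024ULL

/* Hypothetical stand-in for the two counters touched on reservation. */
struct counters_model {
	uint64_t bytes_reserved;
	uint64_t bytes_may_use;
};

/*
 * Mirrors the two counter updates from the patch: bytes_reserved grows by
 * the on-disk extent length, bytes_may_use shrinks by the file content
 * length.
 */
static void model_add_reserved_bytes(struct counters_model *c,
				     uint64_t ram_bytes, uint64_t num_bytes)
{
	c->bytes_reserved += num_bytes;
	assert(c->bytes_may_use >= ram_bytes);
	c->bytes_may_use -= ram_bytes;
}

/*
 * Returns the bytes_may_use left over after reserving one extent whose
 * content was accounted as content_len. pass_ram_bytes == false models a
 * caller that only knows the extent length.
 */
static uint64_t model_stale_may_use(uint64_t content_len, uint64_t extent_len,
				    bool pass_ram_bytes)
{
	struct counters_model c = { .bytes_may_use = content_len };

	model_add_reserved_bytes(&c, pass_ram_bytes ? content_len : extent_len,
				 extent_len);
	return c.bytes_may_use;
}
```

For a 128K range compressed to 32K, subtracting only the extent length
would leave 96K stuck in bytes_may_use, while passing the content length
as ram_bytes brings it back to zero.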

For an inode marked NODATACOW or an extent marked PREALLOC, we can
directly call btrfs_free_reserved_data_space() to adjust bytes_may_use.

Meanwhile, __btrfs_prealloc_file_range() calls
btrfs_free_reserved_data_space() internally on both the successful and
the failed path, so btrfs_prealloc_file_range()'s callers no longer need
to call btrfs_free_reserved_data_space().

Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
---
 fs/btrfs/ctree.h       |  2 +-
 fs/btrfs/extent-tree.c | 56 +++++++++++++++++---------------------------------
 fs/btrfs/file.c        | 26 +++++++++++++----------
 fs/btrfs/inode-map.c   |  3 +--
 fs/btrfs/inode.c       | 37 ++++++++++++++++++++++++---------
 fs/btrfs/relocation.c  | 11 ++++++++--
 6 files changed, 72 insertions(+), 63 deletions(-)

Comments

Josef Bacik July 20, 2016, 1:35 p.m. UTC | #1
On 07/20/2016 01:56 AM, Wang Xiaoguang wrote:
> @@ -6328,13 +6313,11 @@ static int btrfs_add_reserved_bytes(struct btrfs_block_group_cache *cache,
>  	} else {
>  		cache->reserved += num_bytes;
>  		space_info->bytes_reserved += num_bytes;
> -		if (reserve == RESERVE_ALLOC) {
> -			trace_btrfs_space_reservation(cache->fs_info,
> -					"space_info", space_info->flags,
> -					num_bytes, 0);
> -			space_info->bytes_may_use -= num_bytes;
> -		}
>
> +		trace_btrfs_space_reservation(cache->fs_info,
> +				"space_info", space_info->flags,
> +				num_bytes, 0);

This needs to be ram_bytes to keep the accounting consistent for tools that use 
these tracepoints.  Thanks,

Josef
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Xiaoguang Wang July 21, 2016, 1:18 a.m. UTC | #2
Hello,

On 07/20/2016 09:35 PM, Josef Bacik wrote:
> On 07/20/2016 01:56 AM, Wang Xiaoguang wrote:
>> @@ -6328,13 +6313,11 @@ static int btrfs_add_reserved_bytes(struct btrfs_block_group_cache *cache,
>>      } else {
>>          cache->reserved += num_bytes;
>>          space_info->bytes_reserved += num_bytes;
>> -        if (reserve == RESERVE_ALLOC) {
>> -            trace_btrfs_space_reservation(cache->fs_info,
>> -                    "space_info", space_info->flags,
>> -                    num_bytes, 0);
>> -            space_info->bytes_may_use -= num_bytes;
>> -        }
>>
>> +        trace_btrfs_space_reservation(cache->fs_info,
>> +                "space_info", space_info->flags,
>> +                num_bytes, 0);
>
> This needs to be ram_bytes to keep the accounting consistent for tools 
> that use these tracepoints.  Thanks,
OK, I'll fix this in a later version.

Regards,
Xiaoguang Wang
>
> Josef
>
>



Patch

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 4274a7b..7eb2913 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -2556,7 +2556,7 @@  int btrfs_alloc_logged_file_extent(struct btrfs_trans_handle *trans,
 				   struct btrfs_root *root,
 				   u64 root_objectid, u64 owner, u64 offset,
 				   struct btrfs_key *ins);
-int btrfs_reserve_extent(struct btrfs_root *root, u64 num_bytes,
+int btrfs_reserve_extent(struct btrfs_root *root, u64 ram_bytes, u64 num_bytes,
 			 u64 min_alloc_size, u64 empty_size, u64 hint_byte,
 			 struct btrfs_key *ins, int is_data, int delalloc);
 int btrfs_inc_ref(struct btrfs_trans_handle *trans, struct btrfs_root *root,
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 8eaac39..5447973 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -60,21 +60,6 @@  enum {
 	CHUNK_ALLOC_FORCE = 2,
 };
 
-/*
- * Control how reservations are dealt with.
- *
- * RESERVE_FREE - freeing a reservation.
- * RESERVE_ALLOC - allocating space and we need to update bytes_may_use for
- *   ENOSPC accounting
- * RESERVE_ALLOC_NO_ACCOUNT - allocating space and we should not update
- *   bytes_may_use as the ENOSPC accounting is done elsewhere
- */
-enum {
-	RESERVE_FREE = 0,
-	RESERVE_ALLOC = 1,
-	RESERVE_ALLOC_NO_ACCOUNT = 2,
-};
-
 static int update_block_group(struct btrfs_trans_handle *trans,
 			      struct btrfs_root *root, u64 bytenr,
 			      u64 num_bytes, int alloc);
@@ -105,7 +90,7 @@  static int find_next_key(struct btrfs_path *path, int level,
 static void dump_space_info(struct btrfs_space_info *info, u64 bytes,
 			    int dump_block_groups);
 static int btrfs_add_reserved_bytes(struct btrfs_block_group_cache *cache,
-				    u64 num_bytes, int reserve, int delalloc);
+				    u64 ram_bytes, u64 num_bytes, int delalloc);
 static int btrfs_free_reserved_bytes(struct btrfs_block_group_cache *cache,
 				     u64 num_bytes, int delalloc);
 static int block_rsv_use_bytes(struct btrfs_block_rsv *block_rsv,
@@ -3491,7 +3476,6 @@  again:
 		dcs = BTRFS_DC_SETUP;
 	else if (ret == -ENOSPC)
 		set_bit(BTRFS_TRANS_CACHE_ENOSPC, &trans->transaction->flags);
-	btrfs_free_reserved_data_space(inode, 0, num_pages);
 
 out_put:
 	iput(inode);
@@ -6300,8 +6284,9 @@  void btrfs_wait_block_group_reservations(struct btrfs_block_group_cache *bg)
 /**
  * btrfs_add_reserved_bytes - update the block_group and space info counters
  * @cache:	The cache we are manipulating
+ * @ram_bytes:  The number of bytes of file content, and will be same to
+ *              @num_bytes except for the compress path.
  * @num_bytes:	The number of bytes in question
- * @reserve:	One of the reservation enums
  * @delalloc:   The blocks are allocated for the delalloc write
  *
  * This is called by the allocator when it reserves space. Metadata
@@ -6316,7 +6301,7 @@  void btrfs_wait_block_group_reservations(struct btrfs_block_group_cache *bg)
  * succeeds.
  */
 static int btrfs_add_reserved_bytes(struct btrfs_block_group_cache *cache,
-				    u64 num_bytes, int reserve, int delalloc)
+				    u64 ram_bytes, u64 num_bytes, int delalloc)
 {
 	struct btrfs_space_info *space_info = cache->space_info;
 	int ret = 0;
@@ -6328,13 +6313,11 @@  static int btrfs_add_reserved_bytes(struct btrfs_block_group_cache *cache,
 	} else {
 		cache->reserved += num_bytes;
 		space_info->bytes_reserved += num_bytes;
-		if (reserve == RESERVE_ALLOC) {
-			trace_btrfs_space_reservation(cache->fs_info,
-					"space_info", space_info->flags,
-					num_bytes, 0);
-			space_info->bytes_may_use -= num_bytes;
-		}
 
+		trace_btrfs_space_reservation(cache->fs_info,
+				"space_info", space_info->flags,
+				num_bytes, 0);
+		space_info->bytes_may_use -= ram_bytes;
 		if (delalloc)
 			cache->delalloc_bytes += num_bytes;
 	}
@@ -7218,9 +7201,9 @@  btrfs_release_block_group(struct btrfs_block_group_cache *cache,
  * the free space extent currently.
  */
 static noinline int find_free_extent(struct btrfs_root *orig_root,
-				     u64 num_bytes, u64 empty_size,
-				     u64 hint_byte, struct btrfs_key *ins,
-				     u64 flags, int delalloc)
+				u64 ram_bytes, u64 num_bytes, u64 empty_size,
+				u64 hint_byte, struct btrfs_key *ins,
+				u64 flags, int delalloc)
 {
 	int ret = 0;
 	struct btrfs_root *root = orig_root->fs_info->extent_root;
@@ -7232,8 +7215,6 @@  static noinline int find_free_extent(struct btrfs_root *orig_root,
 	struct btrfs_space_info *space_info;
 	int loop = 0;
 	int index = __get_raid_index(flags);
-	int alloc_type = (flags & BTRFS_BLOCK_GROUP_DATA) ?
-		RESERVE_ALLOC_NO_ACCOUNT : RESERVE_ALLOC;
 	bool failed_cluster_refill = false;
 	bool failed_alloc = false;
 	bool use_cluster = true;
@@ -7565,8 +7546,8 @@  checks:
 					     search_start - offset);
 		BUG_ON(offset > search_start);
 
-		ret = btrfs_add_reserved_bytes(block_group, num_bytes,
-				alloc_type, delalloc);
+		ret = btrfs_add_reserved_bytes(block_group, ram_bytes,
+				num_bytes, delalloc);
 		if (ret == -EAGAIN) {
 			btrfs_add_free_space(block_group, offset, num_bytes);
 			goto loop;
@@ -7739,7 +7720,7 @@  again:
 	up_read(&info->groups_sem);
 }
 
-int btrfs_reserve_extent(struct btrfs_root *root,
+int btrfs_reserve_extent(struct btrfs_root *root, u64 ram_bytes,
 			 u64 num_bytes, u64 min_alloc_size,
 			 u64 empty_size, u64 hint_byte,
 			 struct btrfs_key *ins, int is_data, int delalloc)
@@ -7751,8 +7732,8 @@  int btrfs_reserve_extent(struct btrfs_root *root,
 	flags = btrfs_get_alloc_profile(root, is_data);
 again:
 	WARN_ON(num_bytes < root->sectorsize);
-	ret = find_free_extent(root, num_bytes, empty_size, hint_byte, ins,
-			       flags, delalloc);
+	ret = find_free_extent(root, ram_bytes, num_bytes, empty_size,
+			       hint_byte, ins, flags, delalloc);
 	if (!ret && !is_data) {
 		btrfs_dec_block_group_reservations(root->fs_info,
 						   ins->objectid);
@@ -7761,6 +7742,7 @@  again:
 			num_bytes = min(num_bytes >> 1, ins->offset);
 			num_bytes = round_down(num_bytes, root->sectorsize);
 			num_bytes = max(num_bytes, min_alloc_size);
+			ram_bytes = num_bytes;
 			if (num_bytes == min_alloc_size)
 				final_tried = true;
 			goto again;
@@ -8029,7 +8011,7 @@  int btrfs_alloc_logged_file_extent(struct btrfs_trans_handle *trans,
 		return -EINVAL;
 
 	ret = btrfs_add_reserved_bytes(block_group, ins->offset,
-			RESERVE_ALLOC_NO_ACCOUNT, 0);
+				       ins->offset, 0);
 	BUG_ON(ret); /* logic error */
 	ret = alloc_reserved_file_extent(trans, root, 0, root_objectid,
 					 0, owner, offset, ins, 1);
@@ -8171,7 +8153,7 @@  struct extent_buffer *btrfs_alloc_tree_block(struct btrfs_trans_handle *trans,
 	if (IS_ERR(block_rsv))
 		return ERR_CAST(block_rsv);
 
-	ret = btrfs_reserve_extent(root, blocksize, blocksize,
+	ret = btrfs_reserve_extent(root, blocksize, blocksize, blocksize,
 				   empty_size, hint, &ins, 0, 0);
 	if (ret)
 		goto out_unuse;
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 2234e88..b4d9258 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -2669,6 +2669,7 @@  static long btrfs_fallocate(struct file *file, int mode,
 
 	alloc_start = round_down(offset, blocksize);
 	alloc_end = round_up(offset + len, blocksize);
+	cur_offset = alloc_start;
 
 	/* Make sure we aren't being give some crap mode */
 	if (mode & ~(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE))
@@ -2761,7 +2762,6 @@  static long btrfs_fallocate(struct file *file, int mode,
 
 	/* First, check if we exceed the qgroup limit */
 	INIT_LIST_HEAD(&reserve_list);
-	cur_offset = alloc_start;
 	while (1) {
 		em = btrfs_get_extent(inode, NULL, 0, cur_offset,
 				      alloc_end - cur_offset, 0);
@@ -2788,6 +2788,14 @@  static long btrfs_fallocate(struct file *file, int mode,
 					last_byte - cur_offset);
 			if (ret < 0)
 				break;
+		} else {
+			/*
+			 * Do not need to reserve unwritten extent for this
+			 * range, free reserved data space first, otherwise
+			 * it'll result in false ENOSPC error.
+			 */
+			btrfs_free_reserved_data_space(inode, cur_offset,
+				last_byte - cur_offset);
 		}
 		free_extent_map(em);
 		cur_offset = last_byte;
@@ -2805,6 +2813,9 @@  static long btrfs_fallocate(struct file *file, int mode,
 					range->start,
 					range->len, 1 << inode->i_blkbits,
 					offset + len, &alloc_hint);
+		else
+			btrfs_free_reserved_data_space(inode, range->start,
+						       range->len);
 		list_del(&range->list);
 		kfree(range);
 	}
@@ -2839,18 +2850,11 @@  out_unlock:
 	unlock_extent_cached(&BTRFS_I(inode)->io_tree, alloc_start, locked_end,
 			     &cached_state, GFP_KERNEL);
 out:
-	/*
-	 * As we waited the extent range, the data_rsv_map must be empty
-	 * in the range, as written data range will be released from it.
-	 * And for prealloacted extent, it will also be released when
-	 * its metadata is written.
-	 * So this is completely used as cleanup.
-	 */
-	btrfs_qgroup_free_data(inode, alloc_start, alloc_end - alloc_start);
 	inode_unlock(inode);
 	/* Let go of our reservation. */
-	btrfs_free_reserved_data_space(inode, alloc_start,
-				       alloc_end - alloc_start);
+	if (ret != 0)
+		btrfs_free_reserved_data_space(inode, alloc_start,
+				       alloc_end - cur_offset);
 	return ret;
 }
 
diff --git a/fs/btrfs/inode-map.c b/fs/btrfs/inode-map.c
index 70107f7..e59e7d6 100644
--- a/fs/btrfs/inode-map.c
+++ b/fs/btrfs/inode-map.c
@@ -495,10 +495,9 @@  again:
 	ret = btrfs_prealloc_file_range_trans(inode, trans, 0, 0, prealloc,
 					      prealloc, prealloc, &alloc_hint);
 	if (ret) {
-		btrfs_delalloc_release_space(inode, 0, prealloc);
+		btrfs_delalloc_release_metadata(inode, prealloc);
 		goto out_put;
 	}
-	btrfs_free_reserved_data_space(inode, 0, prealloc);
 
 	ret = btrfs_write_out_ino_cache(root, trans, path, inode);
 out_put:
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 4421954..e0cee59 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -564,6 +564,8 @@  cont:
 						     PAGE_SET_WRITEBACK |
 						     page_error_op |
 						     PAGE_END_WRITEBACK);
+			btrfs_free_reserved_data_space_noquota(inode, start,
+						end - start + 1);
 			goto free_pages_out;
 		}
 	}
@@ -739,7 +741,7 @@  retry:
 		lock_extent(io_tree, async_extent->start,
 			    async_extent->start + async_extent->ram_size - 1);
 
-		ret = btrfs_reserve_extent(root,
+		ret = btrfs_reserve_extent(root, async_extent->ram_size,
 					   async_extent->compressed_size,
 					   async_extent->compressed_size,
 					   0, alloc_hint, &ins, 1, 1);
@@ -966,7 +968,8 @@  static noinline int cow_file_range(struct inode *inode,
 				     EXTENT_DEFRAG, PAGE_UNLOCK |
 				     PAGE_CLEAR_DIRTY | PAGE_SET_WRITEBACK |
 				     PAGE_END_WRITEBACK);
-
+			btrfs_free_reserved_data_space_noquota(inode, start,
+						end - start + 1);
 			*nr_written = *nr_written +
 			     (end - start + PAGE_SIZE) / PAGE_SIZE;
 			*page_started = 1;
@@ -986,7 +989,7 @@  static noinline int cow_file_range(struct inode *inode,
 		unsigned long op;
 
 		cur_alloc_size = disk_num_bytes;
-		ret = btrfs_reserve_extent(root, cur_alloc_size,
+		ret = btrfs_reserve_extent(root, cur_alloc_size, cur_alloc_size,
 					   root->sectorsize, 0, alloc_hint,
 					   &ins, 1, 1);
 		if (ret < 0)
@@ -1485,8 +1488,10 @@  out_check:
 		extent_clear_unlock_delalloc(inode, cur_offset,
 					     cur_offset + num_bytes - 1,
 					     locked_page, EXTENT_LOCKED |
-					     EXTENT_DELALLOC, PAGE_UNLOCK |
-					     PAGE_SET_PRIVATE2);
+					     EXTENT_DELALLOC |
+					     EXTENT_CLEAR_DATA_RESV,
+					     PAGE_UNLOCK | PAGE_SET_PRIVATE2);
+
 		if (!nolock && nocow)
 			btrfs_end_write_no_snapshoting(root);
 		cur_offset = extent_end;
@@ -1803,7 +1808,9 @@  static void btrfs_clear_bit_hook(struct inode *inode,
 			return;
 
 		if (root->root_key.objectid != BTRFS_DATA_RELOC_TREE_OBJECTID
-		    && do_list && !(state->state & EXTENT_NORESERVE))
+		    && do_list && !(state->state & EXTENT_NORESERVE)
+		    && (*bits & (EXTENT_DO_ACCOUNTING |
+		    EXTENT_CLEAR_DATA_RESV)))
 			btrfs_free_reserved_data_space_noquota(inode,
 					state->start, len);
 
@@ -7214,7 +7221,7 @@  static struct extent_map *btrfs_new_extent_direct(struct inode *inode,
 	int ret;
 
 	alloc_hint = get_extent_allocation_hint(inode, start, len);
-	ret = btrfs_reserve_extent(root, len, root->sectorsize, 0,
+	ret = btrfs_reserve_extent(root, len, len, root->sectorsize, 0,
 				   alloc_hint, &ins, 1, 1);
 	if (ret)
 		return ERR_PTR(ret);
@@ -7714,6 +7721,13 @@  static int btrfs_get_blocks_direct(struct inode *inode, sector_t iblock,
 				ret = PTR_ERR(em2);
 				goto unlock_err;
 			}
+			/*
+			 * A NODATACOW inode or PREALLOC extent reuses existing
+			 * space, so no new extent is reserved; release the
+			 * bytes_may_use reservation taken for this range.
+			 */
+			btrfs_free_reserved_data_space_noquota(inode,
+					start, len);
 			goto unlock;
 		}
 	}
@@ -7748,7 +7762,6 @@  unlock:
 			i_size_write(inode, start + len);
 
 		adjust_dio_outstanding_extents(inode, dio_data, len);
-		btrfs_free_reserved_data_space(inode, start, len);
 		WARN_ON(dio_data->reserve < len);
 		dio_data->reserve -= len;
 		dio_data->unsubmitted_oe_range_end = start + len;
@@ -10269,6 +10282,7 @@  static int __btrfs_prealloc_file_range(struct inode *inode, int mode,
 	u64 last_alloc = (u64)-1;
 	int ret = 0;
 	bool own_trans = true;
+	u64 end = start + num_bytes - 1;
 
 	if (trans)
 		own_trans = false;
@@ -10290,8 +10304,8 @@  static int __btrfs_prealloc_file_range(struct inode *inode, int mode,
 		 * sized chunks.
 		 */
 		cur_bytes = min(cur_bytes, last_alloc);
-		ret = btrfs_reserve_extent(root, cur_bytes, min_size, 0,
-					   *alloc_hint, &ins, 1, 0);
+		ret = btrfs_reserve_extent(root, cur_bytes, cur_bytes,
+				min_size, 0, *alloc_hint, &ins, 1, 0);
 		if (ret) {
 			if (own_trans)
 				btrfs_end_transaction(trans, root);
@@ -10377,6 +10391,9 @@  next:
 		if (own_trans)
 			btrfs_end_transaction(trans, root);
 	}
+	if (cur_offset < end)
+		btrfs_free_reserved_data_space(inode, cur_offset,
+			end - cur_offset + 1);
 	return ret;
 }
 
diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
index a0de885..f39c4db 100644
--- a/fs/btrfs/relocation.c
+++ b/fs/btrfs/relocation.c
@@ -3032,6 +3032,7 @@  int prealloc_file_extent_cluster(struct inode *inode,
 	int ret = 0;
 	u64 prealloc_start = cluster->start - offset;
 	u64 prealloc_end = cluster->end - offset;
+	u64 cur_offset;
 
 	BUG_ON(cluster->start != cluster->boundary[0]);
 	inode_lock(inode);
@@ -3041,6 +3042,7 @@  int prealloc_file_extent_cluster(struct inode *inode,
 	if (ret)
 		goto out;
 
+	cur_offset = prealloc_start;
 	while (nr < cluster->nr) {
 		start = cluster->boundary[nr] - offset;
 		if (nr + 1 < cluster->nr)
@@ -3050,16 +3052,21 @@  int prealloc_file_extent_cluster(struct inode *inode,
 
 		lock_extent(&BTRFS_I(inode)->io_tree, start, end);
 		num_bytes = end + 1 - start;
+		if (cur_offset < start)
+			btrfs_free_reserved_data_space(inode, cur_offset,
+					start - cur_offset);
 		ret = btrfs_prealloc_file_range(inode, 0, start,
 						num_bytes, num_bytes,
 						end + 1, &alloc_hint);
+		cur_offset = end + 1;
 		unlock_extent(&BTRFS_I(inode)->io_tree, start, end);
 		if (ret)
 			break;
 		nr++;
 	}
-	btrfs_free_reserved_data_space(inode, prealloc_start,
-				       prealloc_end + 1 - prealloc_start);
+	if (cur_offset < prealloc_end)
+		btrfs_free_reserved_data_space(inode, cur_offset,
+				       prealloc_end + 1 - cur_offset);
 out:
 	inode_unlock(inode);
 	return ret;