[17/36] btrfs: loop in inode_rsv_refill

Message ID

20180911175807.26181-18-josef@toxicpanda.com

State

New, archived

Headers

Return-Path: <linux-btrfs-owner@kernel.org>
From: Josef Bacik <josef@toxicpanda.com>
To: kernel-team@fb.com, linux-btrfs@vger.kernel.org
Subject: [PATCH 17/36] btrfs: loop in inode_rsv_refill
Date: Tue, 11 Sep 2018 13:57:48 -0400
Message-Id: <20180911175807.26181-18-josef@toxicpanda.com>
In-Reply-To: <20180911175807.26181-1-josef@toxicpanda.com>
References: <20180911175807.26181-1-josef@toxicpanda.com>
Sender: linux-btrfs-owner@vger.kernel.org
Precedence: bulk

Series

My current patch queue

[00/35,v2] My current patch queue
[01/36] btrfs: add btrfs_delete_ref_head helper
[02/36] btrfs: add cleanup_ref_head_accounting helper
[03/36] btrfs: cleanup extent_op handling
[04/36] btrfs: only track ref_heads in delayed_ref_updates
[05/36] btrfs: only count ref heads run in __btrfs_run_delayed_refs
[06/36] btrfs: introduce delayed_refs_rsv
[07/36] btrfs: check if free bgs for commit
[08/36] btrfs: dump block_rsv whe dumping space info
[09/36] btrfs: release metadata before running delayed refs
[10/36] btrfs: protect space cache inode alloc with nofs
[11/36] btrfs: fix truncate throttling
[12/36] btrfs: don't use global rsv for chunk allocation
[13/36] btrfs: add ALLOC_CHUNK_FORCE to the flushing code
[14/36] btrfs: reset max_extent_size properly
[15/36] btrfs: don't enospc all tickets on flush failure
[16/36] btrfs: run delayed iputs before committing
[17/36] btrfs: loop in inode_rsv_refill
[18/36] btrfs: move the dio_sem higher up the callchain
[19/36] btrfs: set max_extent_size properly
[20/36] btrfs: don't use ctl->free_space for max_extent_size
[21/36] btrfs: reset max_extent_size on clear in a bitmap
[22/36] btrfs: only run delayed refs if we're committing
[23/36] btrfs: make sure we create all new bgs
[24/36] btrfs: assert on non-empty delayed iputs
[25/36] btrfs: pass delayed_refs_root to btrfs_delayed_ref_lock
[26/36] btrfs: make btrfs_destroy_delayed_refs use btrfs_delayed_ref_lock
[27/36] btrfs: make btrfs_destroy_delayed_refs use btrfs_delete_ref_head
[28/36] btrfs: handle delayed ref head accounting cleanup in abort
[29/36] btrfs: call btrfs_create_pending_block_groups unconditionally
[30/36] btrfs: just delete pending bgs if we are aborted
[31/36] btrfs: cleanup pending bgs on transaction abort
[32/36] btrfs: clear delayed_refs_rsv for dirty bg cleanup
[33/36] btrfs: only free reserved extent if we didn't insert it
[34/36] btrfs: fix insert_reserved error handling
[35/36] btrfs: wait on ordered extents on abort cleanup
[36/36] MAINTAINERS: update my email address for btrfs

Commit Message

Josef Bacik Sept. 11, 2018, 5:57 p.m. UTC

With severe fragmentation we can end up with our inode rsv size being
huge during writeout, which would cause us to need to make very large
metadata reservations.  However, we may not actually need that much once
writeout is complete.  So instead, try to make our reservation, and if we
couldn't make it, recalculate our new reservation size and try again.
If our reservation size doesn't change between tries, then we know we are
actually out of space and can error out.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/btrfs/extent-tree.c | 19 +++++++++++++++++--
 1 file changed, 17 insertions(+), 2 deletions(-)
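
The retry rule in the commit message can be illustrated with a small userspace model. This is a hypothetical sketch, not kernel code: `struct model_rsv`, `model_needed()`, `model_reserve()` and `model_refill()` are invented names, `size`/`reserved` only stand in for the block_rsv fields of the same name, and each failed attempt lowers `size` by an assumed `shrink_step` to imitate delalloc writeout reducing the outstanding extents. The loop follows the patch's ordering: on failure, and only when flushing everything is allowed, go around again unless the shortfall equals the one from the previous failed attempt.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for an inode block reservation (not a btrfs type). */
struct model_rsv {
	uint64_t size;        /* bytes we want reserved (models block_rsv->size) */
	uint64_t reserved;    /* bytes already held (models block_rsv->reserved) */
	uint64_t free_bytes;  /* metadata space a new reservation can take */
	uint64_t shrink_step; /* assumed drop in size per failed attempt */
	int attempts;         /* number of model_reserve() calls made */
};

/* Shortfall between wanted and held bytes, like num_bytes in the patch. */
static uint64_t model_needed(const struct model_rsv *rsv)
{
	return rsv->reserved < rsv->size ? rsv->size - rsv->reserved : 0;
}

/*
 * Hypothetical reservation call: succeed only if the request fits.
 * On failure, pretend some writeout finished, lowering size by
 * shrink_step (never below what is already reserved).
 */
static int model_reserve(struct model_rsv *rsv, uint64_t num_bytes)
{
	rsv->attempts++;
	if (num_bytes > rsv->free_bytes) {
		if (rsv->size - rsv->reserved > rsv->shrink_step)
			rsv->size -= rsv->shrink_step;
		else
			rsv->size = rsv->reserved;
		return -ENOSPC;
	}
	rsv->free_bytes -= num_bytes;
	rsv->reserved += num_bytes;
	return 0;
}

/* Patch-style ordering: retry after a failure unless num_bytes == last. */
static int model_refill(struct model_rsv *rsv, bool flush_all)
{
	uint64_t num_bytes, last = 0;
	int ret;

again:
	num_bytes = model_needed(rsv);
	if (num_bytes == 0)
		return 0;

	ret = model_reserve(rsv, num_bytes);
	if (ret && flush_all && last != num_bytes) {
		last = num_bytes;
		goto again;
	}
	return ret;
}
```

With `size = 100`, `free_bytes = 40` and `shrink_step = 30`, the model fails at 100 and 70 bytes and succeeds at 40 on the third attempt. With `shrink_step = 0` it returns `-ENOSPC` after two attempts, the second one repeating a request of the same size that has just failed; this is the redundancy raised in the review comment.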

Comments

Omar Sandoval Sept. 19, 2018, 12:17 a.m. UTC

On Tue, Sep 11, 2018 at 01:57:48PM -0400, Josef Bacik wrote:
> With severe fragmentation we can end up with our inode rsv size being
> huge during writeout, which would cause us to need to make very large
> metadata reservations.  However we may not actually need that much once
> writeout is complete.  So instead try to make our reservation, and if we
> couldn't make it re-calculate our new reservation size and try again.
> If our reservation size doesn't change between tries then we know we are
> actually out of space and can error out.
> 
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> ---
>  fs/btrfs/extent-tree.c | 19 +++++++++++++++++--
>  1 file changed, 17 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 57567d013447..e43834380ce6 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -5790,10 +5790,11 @@ static int btrfs_inode_rsv_refill(struct btrfs_inode *inode,
>  {
>  	struct btrfs_root *root = inode->root;
>  	struct btrfs_block_rsv *block_rsv = &inode->block_rsv;
> -	u64 num_bytes = 0;
> +	u64 num_bytes = 0, last = 0;
>  	u64 qgroup_num_bytes = 0;
>  	int ret = -ENOSPC;
>  
> +again:
>  	spin_lock(&block_rsv->lock);
>  	if (block_rsv->reserved < block_rsv->size)
>  		num_bytes = block_rsv->size - block_rsv->reserved;
> @@ -5818,8 +5819,22 @@ static int btrfs_inode_rsv_refill(struct btrfs_inode *inode,
>  		spin_lock(&block_rsv->lock);
>  		block_rsv->qgroup_rsv_reserved += qgroup_num_bytes;
>  		spin_unlock(&block_rsv->lock);
> -	} else
> +	} else {
>  		btrfs_qgroup_free_meta_prealloc(root, qgroup_num_bytes);
> +
> +		/*
> +		 * If we are fragmented we can end up with a lot of outstanding
> +		 * extents which will make our size be much larger than our
> +		 * reserved amount.  If we happen to try to do a reservation
> +		 * here that may result in us trying to do a pretty hefty
> +		 * reservation, which we may not need once delalloc flushing
> +		 * happens.  If this is the case try and do the reserve again.
> +		 */
> +		if (flush == BTRFS_RESERVE_FLUSH_ALL && last != num_bytes) {

Is there any point in retrying the reservation if num_bytes didn't
change? As this is written, we will:

1. Calculate num_bytes
2. Try reservation, say it fails
3. Recalculate num_bytes, say it doesn't change
4. Retry the reservation anyways, and it fails again

Maybe we should check if it changed before we retry the reservation? So
then we'd have

1. Calculate num_bytes
2. Try reservation, fails
3. Recalculate num_bytes, it doesn't change, bail out

Also, is it possible that num_bytes > last because of other operations
happening at the same time, and should we still retry in that case?

> +			last = num_bytes;
> +			goto again;
> +		}
> +	}
>  	return ret;
>  }
>  
> -- 
> 2.14.3
>
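
The ordering suggested in the review can be sketched with a similar hypothetical userspace model (all names are invented for illustration and are not btrfs APIs): recompute the shortfall immediately after a failed attempt and call the reservation again only if it changed. The `retry_only_if_smaller` flag is one possible answer to the second question, retrying only when the shortfall shrank; it is an assumption made for illustration, not something the thread settles. The assumed `grow_step` lets a failed attempt raise `size`, imitating other writers adding outstanding extents at the same time.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of an inode reservation; not a btrfs structure. */
struct model_rsv {
	uint64_t size;        /* bytes we want reserved */
	uint64_t reserved;    /* bytes already reserved */
	uint64_t free_bytes;  /* space a new reservation can take */
	uint64_t shrink_step; /* assumed size drop per failure (writeout) */
	uint64_t grow_step;   /* assumed size rise per failure (other writers) */
	int attempts;         /* number of model_reserve() calls made */
};

static uint64_t model_needed(const struct model_rsv *rsv)
{
	return rsv->reserved < rsv->size ? rsv->size - rsv->reserved : 0;
}

/* Succeed only if the request fits; on failure, shrink then grow size. */
static int model_reserve(struct model_rsv *rsv, uint64_t num_bytes)
{
	rsv->attempts++;
	if (num_bytes > rsv->free_bytes) {
		if (rsv->size - rsv->reserved > rsv->shrink_step)
			rsv->size -= rsv->shrink_step;
		else
			rsv->size = rsv->reserved;
		rsv->size += rsv->grow_step;
		return -ENOSPC;
	}
	rsv->free_bytes -= num_bytes;
	rsv->reserved += num_bytes;
	return 0;
}

/* Recalculate after a failure and bail out before retrying if unchanged. */
static int model_refill_check_first(struct model_rsv *rsv, bool flush_all,
				    bool retry_only_if_smaller)
{
	uint64_t num_bytes = model_needed(rsv);
	uint64_t last;
	int ret;

	for (;;) {
		if (num_bytes == 0)
			return 0;	/* nothing (left) to reserve */
		ret = model_reserve(rsv, num_bytes);
		if (!ret || !flush_all)
			return ret;

		last = num_bytes;
		num_bytes = model_needed(rsv);
		if (num_bytes == last)
			return ret;	/* no change: treat as out of space */
		if (retry_only_if_smaller && num_bytes > last)
			return ret;	/* grew meanwhile: do not chase it */
	}
}
```

With `size = 100`, `free_bytes = 40` and no drift, this version gives up after a single attempt instead of two. With `shrink_step = 30` it still succeeds on the third attempt. In this model, a size that grows on every failure keeps an any-change retry loop going indefinitely, which is why the sketch offers the `retry_only_if_smaller` cut-off.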

Patch

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 57567d013447..e43834380ce6 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -5790,10 +5790,11 @@ static int btrfs_inode_rsv_refill(struct btrfs_inode *inode,
 {
 	struct btrfs_root *root = inode->root;
 	struct btrfs_block_rsv *block_rsv = &inode->block_rsv;
-	u64 num_bytes = 0;
+	u64 num_bytes = 0, last = 0;
 	u64 qgroup_num_bytes = 0;
 	int ret = -ENOSPC;
 
+again:
 	spin_lock(&block_rsv->lock);
 	if (block_rsv->reserved < block_rsv->size)
 		num_bytes = block_rsv->size - block_rsv->reserved;
@@ -5818,8 +5819,22 @@ static int btrfs_inode_rsv_refill(struct btrfs_inode *inode,
 		spin_lock(&block_rsv->lock);
 		block_rsv->qgroup_rsv_reserved += qgroup_num_bytes;
 		spin_unlock(&block_rsv->lock);
-	} else
+	} else {
 		btrfs_qgroup_free_meta_prealloc(root, qgroup_num_bytes);
+
+		/*
+		 * If we are fragmented we can end up with a lot of outstanding
+		 * extents which will make our size be much larger than our
+		 * reserved amount.  If we happen to try to do a reservation
+		 * here that may result in us trying to do a pretty hefty
+		 * reservation, which we may not need once delalloc flushing
+		 * happens.  If this is the case try and do the reserve again.
+		 */
+		if (flush == BTRFS_RESERVE_FLUSH_ALL && last != num_bytes) {
+			last = num_bytes;
+			goto again;
+		}
+	}
 	return ret;
 }
