diff mbox series

[v3,3/4] btrfs: remove free space items when creating free space tree

Message ID e8c4e0e500f1f19787c84cf8fb7a54063f0fedf0.1600282812.git.boris@bur.io (mailing list archive)
State New, archived
Headers show
Series btrfs: free space tree mounting fixes | expand

Commit Message

Boris Burkov Sept. 17, 2020, 6:13 p.m. UTC
When the file system transitions from space cache v1 to v2 it removes
the old cached data, but does not remove the FREE_SPACE items nor the
free space inodes they point to. This doesn't cause any issues besides
being a bit inefficient, since these items no longer do anything useful.

To fix it, as part of populating the free space tree, destroy each block
group's free space item and free space inode. This code is lifted from
the existing code for removing them when removing the block group.

References: https://github.com/btrfs/btrfs-todo/issues/5
Signed-off-by: Boris Burkov <boris@bur.io>
---
v3:
- pass in optional inode to btrfs_remove_free_space_inode, which fixes
  the bug of not issuing an iput for it in the bg delete case.
- fix bug where the orphan generated by fst creation could not be
  cleaned up, because delayed_iput had an outstanding reference
v2:
- remove_free_space_inode -> btrfs_remove_free_space_inode
- undo sinful whitespace change

 fs/btrfs/block-group.c      | 39 ++-----------------------
 fs/btrfs/disk-io.c          |  9 ++++++
 fs/btrfs/free-space-cache.c | 58 +++++++++++++++++++++++++++++++++++++
 fs/btrfs/free-space-cache.h |  3 ++
 fs/btrfs/free-space-tree.c  |  3 ++
 5 files changed, 75 insertions(+), 37 deletions(-)

Comments

Josef Bacik Sept. 21, 2020, 2:54 p.m. UTC | #1
On 9/17/20 2:13 PM, Boris Burkov wrote:
> When the file system transitions from space cache v1 to v2 it removes
> the old cached data, but does not remove the FREE_SPACE items nor the
> free space inodes they point to. This doesn't cause any issues besides
> being a bit inefficient, since these items no longer do anything useful.
> 
> To fix it, as part of populating the free space tree, destroy each block
> group's free space item and free space inode. This code is lifted from
> the existing code for removing them when removing the block group.
> 
> References: https://github.com/btrfs/btrfs-todo/issues/5
> Signed-off-by: Boris Burkov <boris@bur.io>

Reviewed-by: Josef Bacik <josef@toxicpanda.com>

Thanks,

Josef
David Sterba Sept. 21, 2020, 5:13 p.m. UTC | #2
On Thu, Sep 17, 2020 at 11:13:40AM -0700, Boris Burkov wrote:
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -3333,6 +3333,15 @@ int __cold open_ctree(struct super_block *sb, struct btrfs_fs_devices *fs_device
>  			close_ctree(fs_info);
>  			return ret;
>  		}
> +		/*
> +		 * Creating the free space tree creates inode orphan items and
> +		 * delayed iputs when it deletes the free space inodes. Later in
> +		 * open_ctree, we run btrfs_orphan_cleanup which tries to clean
> +		 * up the orphan items. However, the outstanding references on
> +		 * the inodes from the delayed iputs causes the cleanup to fail.
> +		 * To fix it, force going through the delayed iputs here.
> +		 */
> +		btrfs_run_delayed_iputs(fs_info);

This is called from open_ctree, so this is mount context and the free
space tree creation is called before that. That will schedule all free
space inodes for deletion and waits here. This takes time proportional
to the filesystem size.

We've had reports that this takes a lot of time already, so I wonder if
the delayed iputs can be avoided here.
Boris Burkov Sept. 21, 2020, 6:22 p.m. UTC | #3
On Mon, Sep 21, 2020 at 07:13:04PM +0200, David Sterba wrote:
> On Thu, Sep 17, 2020 at 11:13:40AM -0700, Boris Burkov wrote:
> > --- a/fs/btrfs/disk-io.c
> > +++ b/fs/btrfs/disk-io.c
> > @@ -3333,6 +3333,15 @@ int __cold open_ctree(struct super_block *sb, struct btrfs_fs_devices *fs_device
> >  			close_ctree(fs_info);
> >  			return ret;
> >  		}
> > +		/*
> > +		 * Creating the free space tree creates inode orphan items and
> > +		 * delayed iputs when it deletes the free space inodes. Later in
> > +		 * open_ctree, we run btrfs_orphan_cleanup which tries to clean
> > +		 * up the orphan items. However, the outstanding references on
> > +		 * the inodes from the delayed iputs causes the cleanup to fail.
> > +		 * To fix it, force going through the delayed iputs here.
> > +		 */
> > +		btrfs_run_delayed_iputs(fs_info);
> 
> This is called from open_ctree, so this is mount context and the free
> space tree creation is called before that. That will schedule all free
> space inodes for deletion and waits here. This takes time proportional
> to the filesystem size.
> 
> We've had reports that this takes a lot of time already, so I wonder if
> the delayed iputs can be avoided here.

Good point. The way I see it, deleting the inodes creates legitimate
orphan items, and the delayed_iput doesn't go down until after
open_ctree, I assume for the reason you give. So to delete the free
space inodes without incurring the iput cost during mount, we could:

1. make orphan cleanup smart enough to tell the difference between
orphan items left over from an old mount, vs. orphan items created while
mounting.

2. move orphan cleanup to before free space tree creation/free space
inode deletion in open_ctree.

I'll try to think of a way to do 1. or think of some option 3, as 2.
seems a bit hacky/fragile.

While thinking about a legitimate case for orphans, I realized that
unconditionally failing on free space inode delete errors in
create_free_space_tree without handling ENOENT is almost certainly
wrong, so I will likely need to resend this patch anyway.
Josef Bacik Sept. 21, 2020, 7:01 p.m. UTC | #4
On 9/21/20 1:13 PM, David Sterba wrote:
> On Thu, Sep 17, 2020 at 11:13:40AM -0700, Boris Burkov wrote:
>> --- a/fs/btrfs/disk-io.c
>> +++ b/fs/btrfs/disk-io.c
>> @@ -3333,6 +3333,15 @@ int __cold open_ctree(struct super_block *sb, struct btrfs_fs_devices *fs_device
>>   			close_ctree(fs_info);
>>   			return ret;
>>   		}
>> +		/*
>> +		 * Creating the free space tree creates inode orphan items and
>> +		 * delayed iputs when it deletes the free space inodes. Later in
>> +		 * open_ctree, we run btrfs_orphan_cleanup which tries to clean
>> +		 * up the orphan items. However, the outstanding references on
>> +		 * the inodes from the delayed iputs causes the cleanup to fail.
>> +		 * To fix it, force going through the delayed iputs here.
>> +		 */
>> +		btrfs_run_delayed_iputs(fs_info);
> 
> This is called from open_ctree, so this is mount context and the free
> space tree creation is called before that. That will schedule all free
> space inodes for deletion and waits here. This takes time proportional
> to the filesystem size.
> 
> We've had reports that this takes a lot of time already, so I wonder if
> the delayed iputs can be avoided here.
> 

Chris and I told him to do it this way.  If you have a giant FS the time is 
mostly going to be spent doing the free space tree, the iputs won't add much 
more time to that.  We have to do this so the orphan cleanup that follows 
doesn't screw up because we haven't put our inodes yet.  This is one time pain 
and avoids us having to figure out what to do about orphans we generate while 
mounting the file system.  Thanks,

Josef
David Sterba Sept. 24, 2020, 5:07 p.m. UTC | #5
On Thu, Sep 17, 2020 at 11:13:40AM -0700, Boris Burkov wrote:
> --- a/fs/btrfs/free-space-cache.c
> +++ b/fs/btrfs/free-space-cache.c
> @@ -207,6 +207,64 @@ int create_free_space_inode(struct btrfs_trans_handle *trans,
>  					 ino, block_group->start);
>  }
>  
> +/*
> + * inode is an optional sink: if it is NULL, btrfs_remove_free_space_inode
> + * handles lookup, otherwise it takes ownership and iputs the inode.
> + * Don't reuse an inode pointer after passing it into this function.
> + */
> +int btrfs_remove_free_space_inode(struct btrfs_trans_handle *trans,
> +				  struct inode *inode,
> +				  struct btrfs_block_group *block_group)
> +{
> +	struct btrfs_path *path;
> +	struct btrfs_key key;
> +	int ret = 0;
> +	path = btrfs_alloc_path();
> +	if (!path)
> +		return -ENOMEM;
> +
> +	if (!inode) {
> +		inode = lookup_free_space_inode(block_group, path);
> +	}

No { } around single statement

> +	if (IS_ERR(inode)) {
> +		if (PTR_ERR(inode) != -ENOENT)
> +			ret = PTR_ERR(inode);
> +		goto out;
> +	}
> +	ret = btrfs_orphan_add(trans, BTRFS_I(inode));
> +	if (ret) {
> +		btrfs_add_delayed_iput(inode);
> +		goto out;
> +	}
> +	clear_nlink(inode);
> +	/* One for the block groups ref */
> +	spin_lock(&block_group->lock);
> +	if (block_group->iref) {
> +		block_group->iref = 0;
> +		block_group->inode = NULL;
> +		spin_unlock(&block_group->lock);
> +		iput(inode);
> +	} else {
> +		spin_unlock(&block_group->lock);
> +	}
> +	/* One for the lookup ref */
> +	btrfs_add_delayed_iput(inode);
> +
> +	key.objectid = BTRFS_FREE_SPACE_OBJECTID;
> +	key.type = 0;
> +	key.offset = block_group->start;
> +	ret = btrfs_search_slot(trans, trans->fs_info->tree_root, &key, path, -1, 1);

This slightly overflows 80 but it's full one parameter so in this case
I'd rather wrap it (from -1). Otherwise, more than 80 is ok-ish if it's
the ); or end of an identifier that's half visible.


> +	if (ret) {
> +		if (ret > 0)
> +			ret = 0;
> +		goto out;
> +	}
> +	ret = btrfs_del_item(trans, trans->fs_info->tree_root, path);
> +out:
> +	btrfs_free_path(path);
> +	return ret;
> +}
> +
>  int btrfs_check_trunc_cache_free_space(struct btrfs_fs_info *fs_info,
>  				       struct btrfs_block_rsv *rsv)
>  {
diff mbox series

Patch

diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
index 01e8ba1da1d3..717b3435c88e 100644
--- a/fs/btrfs/block-group.c
+++ b/fs/btrfs/block-group.c
@@ -892,8 +892,6 @@  int btrfs_remove_block_group(struct btrfs_trans_handle *trans,
 	struct btrfs_path *path;
 	struct btrfs_block_group *block_group;
 	struct btrfs_free_cluster *cluster;
-	struct btrfs_root *tree_root = fs_info->tree_root;
-	struct btrfs_key key;
 	struct inode *inode;
 	struct kobject *kobj = NULL;
 	int ret;
@@ -971,42 +969,9 @@  int btrfs_remove_block_group(struct btrfs_trans_handle *trans,
 	spin_unlock(&trans->transaction->dirty_bgs_lock);
 	mutex_unlock(&trans->transaction->cache_write_mutex);
 
-	if (!IS_ERR(inode)) {
-		ret = btrfs_orphan_add(trans, BTRFS_I(inode));
-		if (ret) {
-			btrfs_add_delayed_iput(inode);
-			goto out;
-		}
-		clear_nlink(inode);
-		/* One for the block groups ref */
-		spin_lock(&block_group->lock);
-		if (block_group->iref) {
-			block_group->iref = 0;
-			block_group->inode = NULL;
-			spin_unlock(&block_group->lock);
-			iput(inode);
-		} else {
-			spin_unlock(&block_group->lock);
-		}
-		/* One for our lookup ref */
-		btrfs_add_delayed_iput(inode);
-	}
-
-	key.objectid = BTRFS_FREE_SPACE_OBJECTID;
-	key.type = 0;
-	key.offset = block_group->start;
-
-	ret = btrfs_search_slot(trans, tree_root, &key, path, -1, 1);
-	if (ret < 0)
+	ret = btrfs_remove_free_space_inode(trans, inode, block_group);
+	if (ret)
 		goto out;
-	if (ret > 0)
-		btrfs_release_path(path);
-	if (ret == 0) {
-		ret = btrfs_del_item(trans, tree_root, path);
-		if (ret)
-			goto out;
-		btrfs_release_path(path);
-	}
 
 	spin_lock(&fs_info->block_group_cache_lock);
 	rb_erase(&block_group->cache_node,
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index ade92e93e63f..775d29565665 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -3333,6 +3333,15 @@  int __cold open_ctree(struct super_block *sb, struct btrfs_fs_devices *fs_device
 			close_ctree(fs_info);
 			return ret;
 		}
+		/*
+		 * Creating the free space tree creates inode orphan items and
+		 * delayed iputs when it deletes the free space inodes. Later in
+		 * open_ctree, we run btrfs_orphan_cleanup which tries to clean
+		 * up the orphan items. However, the outstanding references on
+		 * the inodes from the delayed iputs causes the cleanup to fail.
+		 * To fix it, force going through the delayed iputs here.
+		 */
+		btrfs_run_delayed_iputs(fs_info);
 	}
 
 	if ((bool)btrfs_test_opt(fs_info, SPACE_CACHE) !=
diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
index 25420d51039c..6e1bbe87d734 100644
--- a/fs/btrfs/free-space-cache.c
+++ b/fs/btrfs/free-space-cache.c
@@ -207,6 +207,64 @@  int create_free_space_inode(struct btrfs_trans_handle *trans,
 					 ino, block_group->start);
 }
 
+/*
+ * inode is an optional sink: if it is NULL, btrfs_remove_free_space_inode
+ * handles lookup, otherwise it takes ownership and iputs the inode.
+ * Don't reuse an inode pointer after passing it into this function.
+ */
+int btrfs_remove_free_space_inode(struct btrfs_trans_handle *trans,
+				  struct inode *inode,
+				  struct btrfs_block_group *block_group)
+{
+	struct btrfs_path *path;
+	struct btrfs_key key;
+	int ret = 0;
+	path = btrfs_alloc_path();
+	if (!path)
+		return -ENOMEM;
+
+	if (!inode) {
+		inode = lookup_free_space_inode(block_group, path);
+	}
+	if (IS_ERR(inode)) {
+		if (PTR_ERR(inode) != -ENOENT)
+			ret = PTR_ERR(inode);
+		goto out;
+	}
+	ret = btrfs_orphan_add(trans, BTRFS_I(inode));
+	if (ret) {
+		btrfs_add_delayed_iput(inode);
+		goto out;
+	}
+	clear_nlink(inode);
+	/* One for the block groups ref */
+	spin_lock(&block_group->lock);
+	if (block_group->iref) {
+		block_group->iref = 0;
+		block_group->inode = NULL;
+		spin_unlock(&block_group->lock);
+		iput(inode);
+	} else {
+		spin_unlock(&block_group->lock);
+	}
+	/* One for the lookup ref */
+	btrfs_add_delayed_iput(inode);
+
+	key.objectid = BTRFS_FREE_SPACE_OBJECTID;
+	key.type = 0;
+	key.offset = block_group->start;
+	ret = btrfs_search_slot(trans, trans->fs_info->tree_root, &key, path, -1, 1);
+	if (ret) {
+		if (ret > 0)
+			ret = 0;
+		goto out;
+	}
+	ret = btrfs_del_item(trans, trans->fs_info->tree_root, path);
+out:
+	btrfs_free_path(path);
+	return ret;
+}
+
 int btrfs_check_trunc_cache_free_space(struct btrfs_fs_info *fs_info,
 				       struct btrfs_block_rsv *rsv)
 {
diff --git a/fs/btrfs/free-space-cache.h b/fs/btrfs/free-space-cache.h
index 5fbdbd2fe740..0c01c53fce82 100644
--- a/fs/btrfs/free-space-cache.h
+++ b/fs/btrfs/free-space-cache.h
@@ -84,6 +84,9 @@  struct inode *lookup_free_space_inode(struct btrfs_block_group *block_group,
 int create_free_space_inode(struct btrfs_trans_handle *trans,
 			    struct btrfs_block_group *block_group,
 			    struct btrfs_path *path);
+int btrfs_remove_free_space_inode(struct btrfs_trans_handle *trans,
+				  struct inode *inode,
+				  struct btrfs_block_group *block_group);
 
 int btrfs_check_trunc_cache_free_space(struct btrfs_fs_info *fs_info,
 				       struct btrfs_block_rsv *rsv);
diff --git a/fs/btrfs/free-space-tree.c b/fs/btrfs/free-space-tree.c
index 6b9faf3b0e96..d5926d36bf73 100644
--- a/fs/btrfs/free-space-tree.c
+++ b/fs/btrfs/free-space-tree.c
@@ -1165,6 +1165,9 @@  int btrfs_create_free_space_tree(struct btrfs_fs_info *fs_info)
 		block_group = rb_entry(node, struct btrfs_block_group,
 				       cache_node);
 		ret = populate_free_space_tree(trans, block_group);
+		if (ret)
+			goto abort;
+		ret = btrfs_remove_free_space_inode(trans, NULL, block_group);
 		if (ret)
 			goto abort;
 		node = rb_next(node);