Message ID | e8c4e0e500f1f19787c84cf8fb7a54063f0fedf0.1600282812.git.boris@bur.io (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | btrfs: free space tree mounting fixes | expand |
On 9/17/20 2:13 PM, Boris Burkov wrote: > When the file system transitions from space cache v1 to v2 it removes > the old cached data, but does not remove the FREE_SPACE items nor the > free space inodes they point to. This doesn't cause any issues besides > being a bit inefficient, since these items no longer do anything useful. > > To fix it, as part of populating the free space tree, destroy each block > group's free space item and free space inode. This code is lifted from > the existing code for removing them when removing the block group. > > References: https://github.com/btrfs/btrfs-todo/issues/5 > Signed-off-by: Boris Burkov <boris@bur.io> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Thanks, Josef
On Thu, Sep 17, 2020 at 11:13:40AM -0700, Boris Burkov wrote: > --- a/fs/btrfs/disk-io.c > +++ b/fs/btrfs/disk-io.c > @@ -3333,6 +3333,15 @@ int __cold open_ctree(struct super_block *sb, struct btrfs_fs_devices *fs_device > close_ctree(fs_info); > return ret; > } > + /* > + * Creating the free space tree creates inode orphan items and > + * delayed iputs when it deletes the free space inodes. Later in > + * open_ctree, we run btrfs_orphan_cleanup which tries to clean > + * up the orphan items. However, the outstanding references on > + * the inodes from the delayed iputs causes the cleanup to fail. > + * To fix it, force going through the delayed iputs here. > + */ > + btrfs_run_delayed_iputs(fs_info); This is called from open_ctree, so this is mount context and the free space tree creation is called before that. That will schedule all free space inodes for deletion and waits here. This takes time proportional to the filesystem size. We've had reports that this takes a lot of time already, so I wonder if the delayed iputs can be avoided here.
On Mon, Sep 21, 2020 at 07:13:04PM +0200, David Sterba wrote: > On Thu, Sep 17, 2020 at 11:13:40AM -0700, Boris Burkov wrote: > > --- a/fs/btrfs/disk-io.c > > +++ b/fs/btrfs/disk-io.c > > @@ -3333,6 +3333,15 @@ int __cold open_ctree(struct super_block *sb, struct btrfs_fs_devices *fs_device > > close_ctree(fs_info); > > return ret; > > } > > + /* > > + * Creating the free space tree creates inode orphan items and > > + * delayed iputs when it deletes the free space inodes. Later in > > + * open_ctree, we run btrfs_orphan_cleanup which tries to clean > > + * up the orphan items. However, the outstanding references on > > + * the inodes from the delayed iputs causes the cleanup to fail. > > + * To fix it, force going through the delayed iputs here. > > + */ > > + btrfs_run_delayed_iputs(fs_info); > > This is called from open_ctree, so this is mount context and the free > space tree creation is called before that. That will schedule all free > space inodes for deletion and waits here. This takes time proportional > to the filesystem size. > > We've had reports that this takes a lot of time already, so I wonder if > the delayed iputs can be avoided here. Good point. The way I see it, deleting the inodes creates legitimate orphan items, and the delayed_iput doesn't go down until after open_ctree, I assume for the reason you give. So to delete the free space inodes without incurring the iput cost during mount, we could: 1. make orphan cleanup smart enough to tell the difference between orphan items left over from an old mount, vs. orphan items created while mounting. 2. move orphan cleanup to before free space tree creation/free space inode deletion in open_ctree. I'll try to think of a way to do 1. or think of some option 3, as 2. seems a bit hacky/fragile. While thinking about a legitimate case for orphans, I realized that unconditionally failing on free space inode delete errors in create_free_space_tree without handling ENOENT is almost certainly wrong, so I will likely need to resend this patch anyway.
On 9/21/20 1:13 PM, David Sterba wrote: > On Thu, Sep 17, 2020 at 11:13:40AM -0700, Boris Burkov wrote: >> --- a/fs/btrfs/disk-io.c >> +++ b/fs/btrfs/disk-io.c >> @@ -3333,6 +3333,15 @@ int __cold open_ctree(struct super_block *sb, struct btrfs_fs_devices *fs_device >> close_ctree(fs_info); >> return ret; >> } >> + /* >> + * Creating the free space tree creates inode orphan items and >> + * delayed iputs when it deletes the free space inodes. Later in >> + * open_ctree, we run btrfs_orphan_cleanup which tries to clean >> + * up the orphan items. However, the outstanding references on >> + * the inodes from the delayed iputs causes the cleanup to fail. >> + * To fix it, force going through the delayed iputs here. >> + */ >> + btrfs_run_delayed_iputs(fs_info); > > This is called from open_ctree, so this is mount context and the free > space tree creation is called before that. That will schedule all free > space inodes for deletion and waits here. This takes time proportional > to the filesystem size. > > We've had reports that this takes a lot of time already, so I wonder if > the delayed iputs can be avoided here. > Chris and I told him to do it this way. If you have a giant FS the time is mostly going to be spent doing the free space tree, the iputs won't add much more time to that. We have to do this so the orphan cleanup that follows doesn't screw up because we haven't put our inodes yet. This is one time pain and avoids us having to figure out what to do about orphans we generate while mounting the file system. Thanks, Josef
On Thu, Sep 17, 2020 at 11:13:40AM -0700, Boris Burkov wrote: > --- a/fs/btrfs/free-space-cache.c > +++ b/fs/btrfs/free-space-cache.c > @@ -207,6 +207,64 @@ int create_free_space_inode(struct btrfs_trans_handle *trans, > ino, block_group->start); > } > > +/* > + * inode is an optional sink: if it is NULL, btrfs_remove_free_space_inode > + * handles lookup, otherwise it takes ownership and iputs the inode. > + * Don't reuse an inode pointer after passing it into this function. > + */ > +int btrfs_remove_free_space_inode(struct btrfs_trans_handle *trans, > + struct inode *inode, > + struct btrfs_block_group *block_group) > +{ > + struct btrfs_path *path; > + struct btrfs_key key; > + int ret = 0; > + path = btrfs_alloc_path(); > + if (!path) > + return -ENOMEM; > + > + if (!inode) { > + inode = lookup_free_space_inode(block_group, path); > + } No { } around single statement > + if (IS_ERR(inode)) { > + if (PTR_ERR(inode) != -ENOENT) > + ret = PTR_ERR(inode); > + goto out; > + } > + ret = btrfs_orphan_add(trans, BTRFS_I(inode)); > + if (ret) { > + btrfs_add_delayed_iput(inode); > + goto out; > + } > + clear_nlink(inode); > + /* One for the block groups ref */ > + spin_lock(&block_group->lock); > + if (block_group->iref) { > + block_group->iref = 0; > + block_group->inode = NULL; > + spin_unlock(&block_group->lock); > + iput(inode); > + } else { > + spin_unlock(&block_group->lock); > + } > + /* One for the lookup ref */ > + btrfs_add_delayed_iput(inode); > + > + key.objectid = BTRFS_FREE_SPACE_OBJECTID; > + key.type = 0; > + key.offset = block_group->start; > + ret = btrfs_search_slot(trans, trans->fs_info->tree_root, &key, path, -1, 1); This slightly overflows 80 but it's full one parameter so in this case I'd rather wrap it (from -1). Otherwise, more than 80 is ok-ish if it's the ); or end of an identifier that's half visible. > + if (ret) { > + if (ret > 0) > + ret = 0; > + goto out; > + } > + ret = btrfs_del_item(trans, trans->fs_info->tree_root, path); > +out: > + btrfs_free_path(path); > + return ret; > +} > + > int btrfs_check_trunc_cache_free_space(struct btrfs_fs_info *fs_info, > struct btrfs_block_rsv *rsv) > {
diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c index 01e8ba1da1d3..717b3435c88e 100644 --- a/fs/btrfs/block-group.c +++ b/fs/btrfs/block-group.c @@ -892,8 +892,6 @@ int btrfs_remove_block_group(struct btrfs_trans_handle *trans, struct btrfs_path *path; struct btrfs_block_group *block_group; struct btrfs_free_cluster *cluster; - struct btrfs_root *tree_root = fs_info->tree_root; - struct btrfs_key key; struct inode *inode; struct kobject *kobj = NULL; int ret; @@ -971,42 +969,9 @@ int btrfs_remove_block_group(struct btrfs_trans_handle *trans, spin_unlock(&trans->transaction->dirty_bgs_lock); mutex_unlock(&trans->transaction->cache_write_mutex); - if (!IS_ERR(inode)) { - ret = btrfs_orphan_add(trans, BTRFS_I(inode)); - if (ret) { - btrfs_add_delayed_iput(inode); - goto out; - } - clear_nlink(inode); - /* One for the block groups ref */ - spin_lock(&block_group->lock); - if (block_group->iref) { - block_group->iref = 0; - block_group->inode = NULL; - spin_unlock(&block_group->lock); - iput(inode); - } else { - spin_unlock(&block_group->lock); - } - /* One for our lookup ref */ - btrfs_add_delayed_iput(inode); - } - - key.objectid = BTRFS_FREE_SPACE_OBJECTID; - key.type = 0; - key.offset = block_group->start; - - ret = btrfs_search_slot(trans, tree_root, &key, path, -1, 1); - if (ret < 0) + ret = btrfs_remove_free_space_inode(trans, inode, block_group); + if (ret) goto out; - if (ret > 0) - btrfs_release_path(path); - if (ret == 0) { - ret = btrfs_del_item(trans, tree_root, path); - if (ret) - goto out; - btrfs_release_path(path); - } spin_lock(&fs_info->block_group_cache_lock); rb_erase(&block_group->cache_node, diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c index ade92e93e63f..775d29565665 100644 --- a/fs/btrfs/disk-io.c +++ b/fs/btrfs/disk-io.c @@ -3333,6 +3333,15 @@ int __cold open_ctree(struct super_block *sb, struct btrfs_fs_devices *fs_device close_ctree(fs_info); return ret; } + /* + * Creating the free space tree creates inode orphan items and + * delayed iputs when it deletes the free space inodes. Later in + * open_ctree, we run btrfs_orphan_cleanup which tries to clean + * up the orphan items. However, the outstanding references on + * the inodes from the delayed iputs causes the cleanup to fail. + * To fix it, force going through the delayed iputs here. + */ + btrfs_run_delayed_iputs(fs_info); } if ((bool)btrfs_test_opt(fs_info, SPACE_CACHE) != diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c index 25420d51039c..6e1bbe87d734 100644 --- a/fs/btrfs/free-space-cache.c +++ b/fs/btrfs/free-space-cache.c @@ -207,6 +207,64 @@ int create_free_space_inode(struct btrfs_trans_handle *trans, ino, block_group->start); } +/* + * inode is an optional sink: if it is NULL, btrfs_remove_free_space_inode + * handles lookup, otherwise it takes ownership and iputs the inode. + * Don't reuse an inode pointer after passing it into this function. + */ +int btrfs_remove_free_space_inode(struct btrfs_trans_handle *trans, + struct inode *inode, + struct btrfs_block_group *block_group) +{ + struct btrfs_path *path; + struct btrfs_key key; + int ret = 0; + path = btrfs_alloc_path(); + if (!path) + return -ENOMEM; + + if (!inode) { + inode = lookup_free_space_inode(block_group, path); + } + if (IS_ERR(inode)) { + if (PTR_ERR(inode) != -ENOENT) + ret = PTR_ERR(inode); + goto out; + } + ret = btrfs_orphan_add(trans, BTRFS_I(inode)); + if (ret) { + btrfs_add_delayed_iput(inode); + goto out; + } + clear_nlink(inode); + /* One for the block groups ref */ + spin_lock(&block_group->lock); + if (block_group->iref) { + block_group->iref = 0; + block_group->inode = NULL; + spin_unlock(&block_group->lock); + iput(inode); + } else { + spin_unlock(&block_group->lock); + } + /* One for the lookup ref */ + btrfs_add_delayed_iput(inode); + + key.objectid = BTRFS_FREE_SPACE_OBJECTID; + key.type = 0; + key.offset = block_group->start; + ret = btrfs_search_slot(trans, trans->fs_info->tree_root, &key, path, -1, 1); + if (ret) { + if (ret > 0) + ret = 0; + goto out; + } + ret = btrfs_del_item(trans, trans->fs_info->tree_root, path); +out: + btrfs_free_path(path); + return ret; +} + int btrfs_check_trunc_cache_free_space(struct btrfs_fs_info *fs_info, struct btrfs_block_rsv *rsv) { diff --git a/fs/btrfs/free-space-cache.h b/fs/btrfs/free-space-cache.h index 5fbdbd2fe740..0c01c53fce82 100644 --- a/fs/btrfs/free-space-cache.h +++ b/fs/btrfs/free-space-cache.h @@ -84,6 +84,9 @@ struct inode *lookup_free_space_inode(struct btrfs_block_group *block_group, int create_free_space_inode(struct btrfs_trans_handle *trans, struct btrfs_block_group *block_group, struct btrfs_path *path); +int btrfs_remove_free_space_inode(struct btrfs_trans_handle *trans, + struct inode *inode, + struct btrfs_block_group *block_group); int btrfs_check_trunc_cache_free_space(struct btrfs_fs_info *fs_info, struct btrfs_block_rsv *rsv); diff --git a/fs/btrfs/free-space-tree.c b/fs/btrfs/free-space-tree.c index 6b9faf3b0e96..d5926d36bf73 100644 --- a/fs/btrfs/free-space-tree.c +++ b/fs/btrfs/free-space-tree.c @@ -1165,6 +1165,9 @@ int btrfs_create_free_space_tree(struct btrfs_fs_info *fs_info) block_group = rb_entry(node, struct btrfs_block_group, cache_node); ret = populate_free_space_tree(trans, block_group); + if (ret) + goto abort; + ret = btrfs_remove_free_space_inode(trans, NULL, block_group); if (ret) goto abort; node = rb_next(node);
When the file system transitions from space cache v1 to v2 it removes the old cached data, but does not remove the FREE_SPACE items nor the free space inodes they point to. This doesn't cause any issues besides being a bit inefficient, since these items no longer do anything useful. To fix it, as part of populating the free space tree, destroy each block group's free space item and free space inode. This code is lifted from the existing code for removing them when removing the block group. References: https://github.com/btrfs/btrfs-todo/issues/5 Signed-off-by: Boris Burkov <boris@bur.io> --- v3: - pass in optional inode to btrfs_remove_free_space_inode, which fixes the bug of not issuing an iput for it in the bg delete case. - fix bug where the orphan generated by fst creation could not be cleaned up, because delayed_iput had an outstanding reference v2: - remove_free_space_inode -> btrfs_remove_free_space_inode - undo sinful whitespace change fs/btrfs/block-group.c | 39 ++----------------------- fs/btrfs/disk-io.c | 9 ++++++ fs/btrfs/free-space-cache.c | 58 +++++++++++++++++++++++++++++++++++++ fs/btrfs/free-space-cache.h | 3 ++ fs/btrfs/free-space-tree.c | 3 ++ 5 files changed, 75 insertions(+), 37 deletions(-)