diff mbox series

[2/2] btrfs: save drop_progress if we drop refs at all

Message ID 20190206204615.5862-3-josef@toxicpanda.com (mailing list archive)
State New, archived
Headers show
Series Fix missing reference aborts when resuming snapshot delete | expand

Commit Message

Josef Bacik Feb. 6, 2019, 8:46 p.m. UTC
Previously we only updated the drop_progress key if we were in the
DROP_REFERENCE stage of snapshot deletion.  This is because the
UPDATE_BACKREF stage checks the flags of the blocks it's converting to
FULL_BACKREF, so if we go over a block we processed before it doesn't
matter, we just don't do anything.

The problem is in do_walk_down() we will go ahead and drop the roots
reference to any blocks that we know we won't need to walk into.

Given subvolume A and snapshot B.  The root of B points to all of the
nodes that belong to A, so all of those nodes have a refcnt > 1.  If B
did not modify those blocks it'll hit this condition in do_walk_down

if (!wc->update_ref ||
    generation <= root->root_key.offset)
	goto skip;

and in "goto skip" we simply do a btrfs_free_extent() for that bytenr
that we point at.

Now assume we modified some data in B, and then took a snapshot of B and
call it C.  C points to all the nodes in B, making every node the root
of B points to have a refcnt > 1.  This assumes the root level is 2 or
higher.

We delete snapshot B, which does the above work in do_walk_down,
free'ing our ref for nodes we share with A that we didn't modify.  Now
we hit a node we _did_ modify, thus we own.  We need to walk down into
this node and we set wc->stage == UPDATE_BACKREF.  We walk down to level
0 which we also own because we modified data.  We can't walk any further
down and thus now need to walk up and start the next part of the
deletion.  Now walk_up_proc is supposed to put us back into
DROP_REFERENCE, but there's an exception to this

if (level < wc->shared_level)
	goto out;

we are at level == 0, and our shared_level == 1.  We skip out of this
one and go up to level 1.  Since path->slots[1] < nritems we
path->slots[1]++ and break out of walk_up_tree to stop our transaction
and loop back around.  Now in btrfs_drop_snapshot we have this snippet

if (wc->stage == DROP_REFERENCE) {
	level = wc->level;
	btrfs_node_key(path->nodes[level],
		       &root_item->drop_progress,
		       path->slots[level]);
	root_item->drop_level = level;
}

our stage == UPDATE_BACKREF still, so we don't update the drop_progress
key.  This is a problem because we would have done btrfs_free_extent()
for the nodes leading up to our current position.  If we crash or
unmount here and go to remount we'll start over where we were before and
try to free our ref for blocks we've already freed, and thus abort()
out.

Fix this by keeping track of the last place we dropped a reference for
our block in do_walk_down.  Then if wc->stage == UPDATE_BACKREF we know
we'll start over from a place we meant to, and otherwise things continue
to work as they did before.

I have a complicated reproducer for this problem, without this patch
we'll fail to fsck the fs when replaying the log writes log.  With this
patch we can replay the whole log without any fsck or mount failures.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/btrfs/extent-tree.c | 26 ++++++++++++++++++++------
 1 file changed, 20 insertions(+), 6 deletions(-)

Comments

Filipe Manana Feb. 7, 2019, 12:07 p.m. UTC | #1
On Wed, Feb 6, 2019 at 8:47 PM Josef Bacik <josef@toxicpanda.com> wrote:
>
> Previously we only updated the drop_progress key if we were in the
> DROP_REFERENCE stage of snapshot deletion.  This is because the
> UPDATE_BACKREF stage checks the flags of the blocks it's converting to
> FULL_BACKREF, so if we go over a block we processed before it doesn't
> matter, we just don't do anything.
>
> The problem is in do_walk_down() we will go ahead and drop the roots
> reference to any blocks that we know we won't need to walk into.
>
> Given subvolume A and snapshot B.  The root of B points to all of the
> nodes that belong to A, so all of those nodes have a refcnt > 1.  If B
> did not modify those blocks it'll hit this condition in do_walk_down
>
> if (!wc->update_ref ||
>     generation <= root->root_key.offset)
>         goto skip;
>
> and in "goto skip" we simply do a btrfs_free_extent() for that bytenr
> that we point at.
>
> Now assume we modified some data in B, and then took a snapshot of B and
> call it C.  C points to all the nodes in B, making every node the root
> of B points to have a refcnt > 1.  This assumes the root level is 2 or
> higher.
>
> We delete snapshot B, which does the above work in do_walk_down,
> free'ing our ref for nodes we share with A that we didn't modify.  Now
> we hit a node we _did_ modify, thus we own.  We need to walk down into
> this node and we set wc->stage == UPDATE_BACKREF.  We walk down to level
> 0 which we also own because we modified data.  We can't walk any further
> down and thus now need to walk up and start the next part of the
> deletion.  Now walk_up_proc is supposed to put us back into
> DROP_REFERENCE, but there's an exception to this
>
> if (level < wc->shared_level)
>         goto out;
>
> we are at level == 0, and our shared_level == 1.  We skip out of this
> one and go up to level 1.  Since path->slots[1] < nritems we
> path->slots[1]++ and break out of walk_up_tree to stop our transaction
> and loop back around.  Now in btrfs_drop_snapshot we have this snippet
>
> if (wc->stage == DROP_REFERENCE) {
>         level = wc->level;
>         btrfs_node_key(path->nodes[level],
>                        &root_item->drop_progress,
>                        path->slots[level]);
>         root_item->drop_level = level;
> }
>
> our stage == UPDATE_BACKREF still, so we don't update the drop_progress
> key.  This is a problem because we would have done btrfs_free_extent()
> for the nodes leading up to our current position.  If we crash or
> unmount here and go to remount we'll start over where we were before and
> try to free our ref for blocks we've already freed, and thus abort()
> out.
>
> Fix this by keeping track of the last place we dropped a reference for
> our block in do_walk_down.  Then if wc->stage == UPDATE_BACKREF we know
> we'll start over from a place we meant to, and otherwise things continue
> to work as they did before.
>
> I have a complicated reproducer for this problem, without this patch
> we'll fail to fsck the fs when replaying the log writes log.  With this
> patch we can replay the whole log without any fsck or mount failures.
>
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>

Reviewed-by: Filipe Manana <fdmanana@suse.com>

Looks good, great catch!

> ---
>  fs/btrfs/extent-tree.c | 26 ++++++++++++++++++++------
>  1 file changed, 20 insertions(+), 6 deletions(-)
>
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index f40d6086c947..53fd4626660c 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -8765,6 +8765,8 @@ struct walk_control {
>         u64 refs[BTRFS_MAX_LEVEL];
>         u64 flags[BTRFS_MAX_LEVEL];
>         struct btrfs_key update_progress;
> +       struct btrfs_key drop_progress;
> +       int drop_level;
>         int stage;
>         int level;
>         int shared_level;
> @@ -9148,6 +9150,16 @@ static noinline int do_walk_down(struct btrfs_trans_handle *trans,
>                                              ret);
>                         }
>                 }
> +
> +               /*
> +                * We need to update the next key in our walk control so we can
> +                * update the drop_progress key accordingly.  We don't care if
> +                * find_next_key doesn't find a key because that means we're at
> +                * the end and are going to clean up now.
> +                */
> +               wc->drop_level = level;
> +               find_next_key(path, level, &wc->drop_progress);
> +
>                 ret = btrfs_free_extent(trans, root, bytenr, fs_info->nodesize,
>                                         parent, root->root_key.objectid,
>                                         level - 1, 0);
> @@ -9499,12 +9511,14 @@ int btrfs_drop_snapshot(struct btrfs_root *root,
>                 }
>
>                 if (wc->stage == DROP_REFERENCE) {
> -                       level = wc->level;
> -                       btrfs_node_key(path->nodes[level],
> -                                      &root_item->drop_progress,
> -                                      path->slots[level]);
> -                       root_item->drop_level = level;
> -               }
> +                       wc->drop_level = wc->level;
> +                       btrfs_node_key_to_cpu(path->nodes[wc->drop_level],
> +                                             &wc->drop_progress,
> +                                             path->slots[wc->drop_level]);
> +               }
> +               btrfs_cpu_key_to_disk(&root_item->drop_progress,
> +                                     &wc->drop_progress);
> +               root_item->drop_level = wc->drop_level;
>
>                 BUG_ON(wc->level == 0);
>                 if (btrfs_should_end_transaction(trans) ||
> --
> 2.14.3
>
diff mbox series

Patch

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index f40d6086c947..53fd4626660c 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -8765,6 +8765,8 @@  struct walk_control {
 	u64 refs[BTRFS_MAX_LEVEL];
 	u64 flags[BTRFS_MAX_LEVEL];
 	struct btrfs_key update_progress;
+	struct btrfs_key drop_progress;
+	int drop_level;
 	int stage;
 	int level;
 	int shared_level;
@@ -9148,6 +9150,16 @@  static noinline int do_walk_down(struct btrfs_trans_handle *trans,
 					     ret);
 			}
 		}
+
+		/*
+		 * We need to update the next key in our walk control so we can
+		 * update the drop_progress key accordingly.  We don't care if
+		 * find_next_key doesn't find a key because that means we're at
+		 * the end and are going to clean up now.
+		 */
+		wc->drop_level = level;
+		find_next_key(path, level, &wc->drop_progress);
+
 		ret = btrfs_free_extent(trans, root, bytenr, fs_info->nodesize,
 					parent, root->root_key.objectid,
 					level - 1, 0);
@@ -9499,12 +9511,14 @@  int btrfs_drop_snapshot(struct btrfs_root *root,
 		}
 
 		if (wc->stage == DROP_REFERENCE) {
-			level = wc->level;
-			btrfs_node_key(path->nodes[level],
-				       &root_item->drop_progress,
-				       path->slots[level]);
-			root_item->drop_level = level;
-		}
+			wc->drop_level = wc->level;
+			btrfs_node_key_to_cpu(path->nodes[wc->drop_level],
+					      &wc->drop_progress,
+					      path->slots[wc->drop_level]);
+		}
+		btrfs_cpu_key_to_disk(&root_item->drop_progress,
+				      &wc->drop_progress);
+		root_item->drop_level = wc->drop_level;
 
 		BUG_ON(wc->level == 0);
 		if (btrfs_should_end_transaction(trans) ||