diff mbox series

btrfs: fix improper generation check in snapshot delete

Message ID b1b8f27cad83060a4157af8f7514681a85956549.1731515508.git.josef@toxicpanda.com (mailing list archive)
State New
Headers show
Series btrfs: fix improper generation check in snapshot delete | expand

Commit Message

Josef Bacik Nov. 13, 2024, 4:31 p.m. UTC
We have been using the following check

if (generation <= root->root_key.offset)

to make decisions about whether or not to visit a node during snapshot
delete.  This is because for normal subvolumes this is set to 0, and for
snapshots it's set to the creation generation.  The idea being that if
the generation of the node is less than or equal to our creation
generation then we don't need to visit that node, because it doesn't
belong to us, we can simply drop our reference and move on.

However reloc roots don't have their generation stored in
root->root_key.offset, instead that is the objectid of their
corresponding fs root.  This means we can incorrectly not walk into
nodes that need to be dropped when deleting a reloc root.

There are a variety of consequences to making the wrong choice in two
distinct areas.

visit_node_for_delete()

1. False positive.  We think we are newer than the block when we really
   aren't.  We don't visit the node and drop our reference to the node
   and carry on.  This would result in leaked space.
2. False negative.  We do decide to walk down into a block that we
   should have just dropped our reference to.  However this means that
   the child node will have refs > 1, so we will switch to
   UPDATE_BACKREF, and then the subsequent walk_down_proc will notice
   that btrfs_header_owner(node) != root->root_key.objectid and it'll
   break out of the loop, and then walk_up_proc will drop our reference,
   so this appears to be ok.

do_walk_down()

1. False positive.  We are in UPDATE_BACKREF and incorrectly decide that
   we are done and don't need to update the backref for our lower nodes.
   This is another case that simply won't happen with relocation, as we
   only have to do UPDATE_BACKREF if the node below us was shared and
   didn't have FULL_BACKREF set, and since we don't own that node
   because we're a reloc root we actually won't end up in this case.
2. False negative.  Again this is tricky because as described above, we
   simply wouldn't be here from relocation, because we don't own any of
   the nodes because we never set btrfs_header_owner() to the reloc root
   objectid, and we always use FULL_BACKREF, we never actually need to
   set FULL_BACKREF on any children.

Having spent a lot of time stressing relocation/snapshot delete recently
I've not seen this pop in practice.  But this is objectively incorrect,
so fix this to get the correct starting generation based on the root
we're dropping to keep me from thinking there's a problem here.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/btrfs/extent-tree.c |  6 +++---
 fs/btrfs/root-tree.c   | 20 ++++++++++++++++++++
 fs/btrfs/root-tree.h   |  1 +
 3 files changed, 24 insertions(+), 3 deletions(-)

Comments

Filipe Manana Nov. 13, 2024, 4:42 p.m. UTC | #1
On Wed, Nov 13, 2024 at 4:32 PM Josef Bacik <josef@toxicpanda.com> wrote:
>
> We have been using the following check
>
> if (generation <= root->root_key.offset)
>
> to make decisions about whether or not to visit a node during snapshot
> delete.  This is because for normal subvolumes this is set to 0, and for
> snapshots it's set to the creation generation.  The idea being that if
> the generation of the node is less than or equal to our creation
> generation then we don't need to visit that node, because it doesn't
> belong to us, we can simply drop our reference and move on.
>
> However reloc roots don't have their generation stored in
> root->root_key.offset, instead that is the objectid of their
> corresponding fs root.  This means we can incorrectly not walk into
> nodes that need to be dropped when deleting a reloc root.
>
> There are a variety of consequences to making the wrong choice in two
> distinct areas.
>
> visit_node_for_delete()
>
> 1. False positive.  We think we are newer than the block when we really
>    aren't.  We don't visit the node and drop our reference to the node
>    and carry on.  This would result in leaked space.
> 2. False negative.  We do decide to walk down into a block that we
>    should have just dropped our reference to.  However this means that
>    the child node will have refs > 1, so we will switch to
>    UPDATE_BACKREF, and then the subsequent walk_down_proc will notice
>    that btrfs_header_owner(node) != root->root_key.objectid and it'll
>    break out of the loop, and then walk_up_proc will drop our reference,
>    so this appears to be ok.
>
> do_walk_down()
>
> 1. False positive.  We are in UPDATE_BACKREF and incorrectly decide that
>    we are done and don't need to update the backref for our lower nodes.
>    This is another case that simply won't happen with relocation, as we
>    only have to do UPDATE_BACKREF if the node below us was shared and
>    didn't have FULL_BACKREF set, and since we don't own that node
>    because we're a reloc root we actually won't end up in this case.
> 2. False negative.  Again this is tricky because as described above, we
>    simply wouldn't be here from relocation, because we don't own any of
>    the nodes because we never set btrfs_header_owner() to the reloc root
>    objectid, and we always use FULL_BACKREF, we never actually need to
>    set FULL_BACKREF on any children.
>
> Having spent a lot of time stressing relocation/snapshot delete recently
> I've not seen this pop in practice.  But this is objectively incorrect,
> so fix this to get the correct starting generation based on the root
> we're dropping to keep me from thinking there's a problem here.
>
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> ---
>  fs/btrfs/extent-tree.c |  6 +++---
>  fs/btrfs/root-tree.c   | 20 ++++++++++++++++++++
>  fs/btrfs/root-tree.h   |  1 +
>  3 files changed, 24 insertions(+), 3 deletions(-)
>
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 412e318e4a22..43a771f7bd7a 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -5285,7 +5285,7 @@ static bool visit_node_for_delete(struct btrfs_root *root, struct walk_control *
>          * reference to it.
>          */
>         generation = btrfs_node_ptr_generation(eb, slot);
> -       if (!wc->update_ref || generation <= root->root_key.offset)
> +       if (!wc->update_ref || generation <= btrfs_root_origin_generation(root))
>                 return false;
>
>         /*
> @@ -5340,7 +5340,7 @@ static noinline void reada_walk_down(struct btrfs_trans_handle *trans,
>                         goto reada;
>
>                 if (wc->stage == UPDATE_BACKREF &&
> -                   generation <= root->root_key.offset)
> +                   generation <= btrfs_root_origin_generation(root))
>                         continue;
>
>                 /* We don't lock the tree block, it's OK to be racy here */
> @@ -5683,7 +5683,7 @@ static noinline int do_walk_down(struct btrfs_trans_handle *trans,
>          * for the subtree
>          */
>         if (wc->stage == UPDATE_BACKREF &&
> -           generation <= root->root_key.offset) {
> +           generation <= btrfs_root_origin_generation(root)) {
>                 wc->lookup_info = 1;
>                 return 1;
>         }
> diff --git a/fs/btrfs/root-tree.c b/fs/btrfs/root-tree.c
> index 33962671a96c..017a155ffd5e 100644
> --- a/fs/btrfs/root-tree.c
> +++ b/fs/btrfs/root-tree.c
> @@ -547,3 +547,23 @@ int btrfs_subvolume_reserve_metadata(struct btrfs_root *root,
>         }
>         return ret;
>  }
> +
> +/*
> + * btrfs_root_start_generation - return the generation this root started with.
> + * @root - the root we're chcking
> + *
> + * Every normal root that is created with root->root_key.offset set to it's
> + * originating generation.  If it is a snapshot it is the generation when the
> + * snapshot was created.
> + *
> + * However for TREE_RELOC roots root_key.offset is the objectid of the owning
> + * tree root.  Thankfully we copy the root item of the owning tree root, which
> + * has it's last_snapshot set to what we would have root_key.offset set to, so
> + * return that if we are a TREE_RELOC root.
> + */
> +u64 btrfs_root_origin_generation(struct btrfs_root *root)

The root argument can be made const.

> +{
> +       if (btrfs_root_id(root) == BTRFS_TREE_RELOC_OBJECTID)
> +               return btrfs_root_last_snapshot(&root->root_item);
> +       return root->root_key.offset;

This is so small and trivial that it could be instead a static inline
function at fs.h.

Also having this in root-tree.c/h is a bit odd since this is generic
and actually never used against the root tree but subvolume/snapshot
and relocation roots.
So one more reason to make it static inline and place it at fs.h,
where we have btrfs_root_id() and other generic root related
functions.

Otherwise it looks fine, thanks.

> +}
> diff --git a/fs/btrfs/root-tree.h b/fs/btrfs/root-tree.h
> index 8f5739e732b9..030b74e821e4 100644
> --- a/fs/btrfs/root-tree.h
> +++ b/fs/btrfs/root-tree.h
> @@ -38,5 +38,6 @@ void btrfs_set_root_node(struct btrfs_root_item *item,
>                          struct extent_buffer *node);
>  void btrfs_check_and_init_root_item(struct btrfs_root_item *item);
>  void btrfs_update_root_times(struct btrfs_trans_handle *trans, struct btrfs_root *root);
> +u64 btrfs_root_origin_generation(struct btrfs_root *root);
>
>  #endif
> --
> 2.43.0
>
>
Johannes Thumshirn Nov. 13, 2024, 4:46 p.m. UTC | #2
On 13.11.24 17:32, Josef Bacik wrote:
> + * @root - the root we're chcking

Checking

Otherwise,
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Filipe Manana Nov. 13, 2024, 4:49 p.m. UTC | #3
On Wed, Nov 13, 2024 at 4:42 PM Filipe Manana <fdmanana@kernel.org> wrote:
>
> On Wed, Nov 13, 2024 at 4:32 PM Josef Bacik <josef@toxicpanda.com> wrote:
> >
> > We have been using the following check
> >
> > if (generation <= root->root_key.offset)
> >
> > to make decisions about whether or not to visit a node during snapshot
> > delete.  This is because for normal subvolumes this is set to 0, and for
> > snapshots it's set to the creation generation.  The idea being that if
> > the generation of the node is less than or equal to our creation
> > generation then we don't need to visit that node, because it doesn't
> > belong to us, we can simply drop our reference and move on.
> >
> > However reloc roots don't have their generation stored in
> > root->root_key.offset, instead that is the objectid of their
> > corresponding fs root.  This means we can incorrectly not walk into
> > nodes that need to be dropped when deleting a reloc root.
> >
> > There are a variety of consequences to making the wrong choice in two
> > distinct areas.
> >
> > visit_node_for_delete()
> >
> > 1. False positive.  We think we are newer than the block when we really
> >    aren't.  We don't visit the node and drop our reference to the node
> >    and carry on.  This would result in leaked space.
> > 2. False negative.  We do decide to walk down into a block that we
> >    should have just dropped our reference to.  However this means that
> >    the child node will have refs > 1, so we will switch to
> >    UPDATE_BACKREF, and then the subsequent walk_down_proc will notice
> >    that btrfs_header_owner(node) != root->root_key.objectid and it'll
> >    break out of the loop, and then walk_up_proc will drop our reference,
> >    so this appears to be ok.
> >
> > do_walk_down()
> >
> > 1. False positive.  We are in UPDATE_BACKREF and incorrectly decide that
> >    we are done and don't need to update the backref for our lower nodes.
> >    This is another case that simply won't happen with relocation, as we
> >    only have to do UPDATE_BACKREF if the node below us was shared and
> >    didn't have FULL_BACKREF set, and since we don't own that node
> >    because we're a reloc root we actually won't end up in this case.
> > 2. False negative.  Again this is tricky because as described above, we
> >    simply wouldn't be here from relocation, because we don't own any of
> >    the nodes because we never set btrfs_header_owner() to the reloc root
> >    objectid, and we always use FULL_BACKREF, we never actually need to
> >    set FULL_BACKREF on any children.
> >
> > Having spent a lot of time stressing relocation/snapshot delete recently
> > I've not seen this pop in practice.  But this is objectively incorrect,
> > so fix this to get the correct starting generation based on the root
> > we're dropping to keep me from thinking there's a problem here.
> >
> > Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> > ---
> >  fs/btrfs/extent-tree.c |  6 +++---
> >  fs/btrfs/root-tree.c   | 20 ++++++++++++++++++++
> >  fs/btrfs/root-tree.h   |  1 +
> >  3 files changed, 24 insertions(+), 3 deletions(-)
> >
> > diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> > index 412e318e4a22..43a771f7bd7a 100644
> > --- a/fs/btrfs/extent-tree.c
> > +++ b/fs/btrfs/extent-tree.c
> > @@ -5285,7 +5285,7 @@ static bool visit_node_for_delete(struct btrfs_root *root, struct walk_control *
> >          * reference to it.
> >          */
> >         generation = btrfs_node_ptr_generation(eb, slot);
> > -       if (!wc->update_ref || generation <= root->root_key.offset)
> > +       if (!wc->update_ref || generation <= btrfs_root_origin_generation(root))
> >                 return false;
> >
> >         /*
> > @@ -5340,7 +5340,7 @@ static noinline void reada_walk_down(struct btrfs_trans_handle *trans,
> >                         goto reada;
> >
> >                 if (wc->stage == UPDATE_BACKREF &&
> > -                   generation <= root->root_key.offset)
> > +                   generation <= btrfs_root_origin_generation(root))
> >                         continue;
> >
> >                 /* We don't lock the tree block, it's OK to be racy here */
> > @@ -5683,7 +5683,7 @@ static noinline int do_walk_down(struct btrfs_trans_handle *trans,
> >          * for the subtree
> >          */
> >         if (wc->stage == UPDATE_BACKREF &&
> > -           generation <= root->root_key.offset) {
> > +           generation <= btrfs_root_origin_generation(root)) {
> >                 wc->lookup_info = 1;
> >                 return 1;
> >         }
> > diff --git a/fs/btrfs/root-tree.c b/fs/btrfs/root-tree.c
> > index 33962671a96c..017a155ffd5e 100644
> > --- a/fs/btrfs/root-tree.c
> > +++ b/fs/btrfs/root-tree.c
> > @@ -547,3 +547,23 @@ int btrfs_subvolume_reserve_metadata(struct btrfs_root *root,
> >         }
> >         return ret;
> >  }
> > +
> > +/*
> > + * btrfs_root_start_generation - return the generation this root started with.
> > + * @root - the root we're chcking
> > + *
> > + * Every normal root that is created with root->root_key.offset set to it's
> > + * originating generation.  If it is a snapshot it is the generation when the
> > + * snapshot was created.
> > + *
> > + * However for TREE_RELOC roots root_key.offset is the objectid of the owning
> > + * tree root.  Thankfully we copy the root item of the owning tree root, which
> > + * has it's last_snapshot set to what we would have root_key.offset set to, so
> > + * return that if we are a TREE_RELOC root.
> > + */
> > +u64 btrfs_root_origin_generation(struct btrfs_root *root)
>
> The root argument can be made const.
>
> > +{
> > +       if (btrfs_root_id(root) == BTRFS_TREE_RELOC_OBJECTID)
> > +               return btrfs_root_last_snapshot(&root->root_item);
> > +       return root->root_key.offset;
>
> This is so small and trivial that it could be instead a static inline
> function at fs.h.
>
> Also having this in root-tree.c/h is a bit odd since this is generic
> and actually never used against the root tree but subvolume/snapshot
> and relocation roots.
> So one more reason to make it static inline and place it at fs.h,

It's actually ctree.h and not fs.sh where we have the existing root
related helpers (btrfs_root_id, btrfs_root_dead, etc).

> where we have btrfs_root_id() and other generic root related
> functions.
>
> Otherwise it looks fine, thanks.
>
> > +}
> > diff --git a/fs/btrfs/root-tree.h b/fs/btrfs/root-tree.h
> > index 8f5739e732b9..030b74e821e4 100644
> > --- a/fs/btrfs/root-tree.h
> > +++ b/fs/btrfs/root-tree.h
> > @@ -38,5 +38,6 @@ void btrfs_set_root_node(struct btrfs_root_item *item,
> >                          struct extent_buffer *node);
> >  void btrfs_check_and_init_root_item(struct btrfs_root_item *item);
> >  void btrfs_update_root_times(struct btrfs_trans_handle *trans, struct btrfs_root *root);
> > +u64 btrfs_root_origin_generation(struct btrfs_root *root);
> >
> >  #endif
> > --
> > 2.43.0
> >
> >
diff mbox series

Patch

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 412e318e4a22..43a771f7bd7a 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -5285,7 +5285,7 @@  static bool visit_node_for_delete(struct btrfs_root *root, struct walk_control *
 	 * reference to it.
 	 */
 	generation = btrfs_node_ptr_generation(eb, slot);
-	if (!wc->update_ref || generation <= root->root_key.offset)
+	if (!wc->update_ref || generation <= btrfs_root_origin_generation(root))
 		return false;
 
 	/*
@@ -5340,7 +5340,7 @@  static noinline void reada_walk_down(struct btrfs_trans_handle *trans,
 			goto reada;
 
 		if (wc->stage == UPDATE_BACKREF &&
-		    generation <= root->root_key.offset)
+		    generation <= btrfs_root_origin_generation(root))
 			continue;
 
 		/* We don't lock the tree block, it's OK to be racy here */
@@ -5683,7 +5683,7 @@  static noinline int do_walk_down(struct btrfs_trans_handle *trans,
 	 * for the subtree
 	 */
 	if (wc->stage == UPDATE_BACKREF &&
-	    generation <= root->root_key.offset) {
+	    generation <= btrfs_root_origin_generation(root)) {
 		wc->lookup_info = 1;
 		return 1;
 	}
diff --git a/fs/btrfs/root-tree.c b/fs/btrfs/root-tree.c
index 33962671a96c..017a155ffd5e 100644
--- a/fs/btrfs/root-tree.c
+++ b/fs/btrfs/root-tree.c
@@ -547,3 +547,23 @@  int btrfs_subvolume_reserve_metadata(struct btrfs_root *root,
 	}
 	return ret;
 }
+
+/*
+ * btrfs_root_start_generation - return the generation this root started with.
+ * @root - the root we're chcking
+ *
+ * Every normal root that is created with root->root_key.offset set to it's
+ * originating generation.  If it is a snapshot it is the generation when the
+ * snapshot was created.
+ *
+ * However for TREE_RELOC roots root_key.offset is the objectid of the owning
+ * tree root.  Thankfully we copy the root item of the owning tree root, which
+ * has it's last_snapshot set to what we would have root_key.offset set to, so
+ * return that if we are a TREE_RELOC root.
+ */
+u64 btrfs_root_origin_generation(struct btrfs_root *root)
+{
+	if (btrfs_root_id(root) == BTRFS_TREE_RELOC_OBJECTID)
+		return btrfs_root_last_snapshot(&root->root_item);
+	return root->root_key.offset;
+}
diff --git a/fs/btrfs/root-tree.h b/fs/btrfs/root-tree.h
index 8f5739e732b9..030b74e821e4 100644
--- a/fs/btrfs/root-tree.h
+++ b/fs/btrfs/root-tree.h
@@ -38,5 +38,6 @@  void btrfs_set_root_node(struct btrfs_root_item *item,
 			 struct extent_buffer *node);
 void btrfs_check_and_init_root_item(struct btrfs_root_item *item);
 void btrfs_update_root_times(struct btrfs_trans_handle *trans, struct btrfs_root *root);
+u64 btrfs_root_origin_generation(struct btrfs_root *root);
 
 #endif