diff mbox series

[1/2] btrfs: fix btrfs_prev_leaf() to not return the same key twice

Message ID e6eaa082d536fc5223eae4627bc7dc6369e257d9.1681295111.git.fdmanana@suse.com (mailing list archive)
State New, archived
Headers show
Series btrfs: fix for btrfs_prev_leaf() and unexport it | expand

Commit Message

Filipe Manana April 12, 2023, 10:33 a.m. UTC
From: Filipe Manana <fdmanana@suse.com>

A call to btrfs_prev_leaf() may end up returning a path that points to the
same item (key) again. This happens if while btrfs_prev_leaf(), after we
release the path, a concurrent insertion happens, which moves items off
from a sibbling into the front of the previous leaf, and an item with the
computed previous key does not exists.

For example, suppose we have the two following leaves:

  Leaf A

  -------------------------------------------------------------
  | ...   key (300 96 10)   key (300 96 15)   key (300 96 16) |
  -------------------------------------------------------------
              slot 20             slot 21             slot 22

  Leaf B

  -------------------------------------------------------------
  | key (300 96 20)   key (300 96 21)   key (300 96 22)   ... |
  -------------------------------------------------------------
      slot 0             slot 1             slot 2

If we call btrfs_prev_leaf(), from btrfs_previous_item() for example, with
a path pointing to leaf B and slot 0 and the following happens:

1) At btrfs_prev_leaf() we compute the previous key to search as:
   (300 96 19), which is a key that does not exists in the tree;

2) Then we call btrfs_release_path() at btrfs_prev_leaf();

3) Some other task inserts a key at leaf A, that sorts before the key at
   slot 20, for example it has an objectid of 299. In order to make room
   for the new key, the key at slot 22 is moved to the front of leaf B.
   This happens at push_leaf_right(), called from split_leaf().

   After this leaf B now looks like:

  --------------------------------------------------------------------------------
  | key (300 96 16)    key (300 96 20)   key (300 96 21)   key (300 96 22)   ... |
  --------------------------------------------------------------------------------
       slot 0              slot 1             slot 2             slot 3

4) At btrfs_prev_leaf() we call btrfs_search_slot() for the computed
   previous key: (300 96 19). Since the key does not exists,
   btrfs_search_slot() returns 1 and with a path pointing to leaf B
   and slot 1, the item with key (300 96 20);

5) This makes btrfs_prev_leaf() return a path that points to slot 1 of
   leaf B, the same key as before it was called, since the key at slot 0
   of leaf B (300 96 16) is less than the computed previous key, which is
   (300 96 19);

6) As a consequence btrfs_previous_item() returns a path that points again
   to the item with key (300 96 20).

For some users of btrfs_prev_leaf() or btrfs_previous_item() this may not
be functional a problem, despite not making sense to return a new path
pointing again to the same item/key. However for a caller such as
tree-log.c:log_dir_items(), this has a bad consequence, as it can result
in not logging some dir index deletions in case the directory is being
logged without holding the inode's VFS lock (logging triggered while
logging a child inode for example) - for the example scenario above, in
case the dir index keys 17, 18 and 19 were deleted in the current
transaction.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/ctree.c | 32 +++++++++++++++++++++++++++++++-
 1 file changed, 31 insertions(+), 1 deletion(-)

Comments

Wang Yugui April 13, 2023, 7:57 p.m. UTC | #1
Hi,

> From: Filipe Manana <fdmanana@suse.com>
> 
> A call to btrfs_prev_leaf() may end up returning a path that points to the
> same item (key) again. This happens if while btrfs_prev_leaf(), after we
> release the path, a concurrent insertion happens, which moves items off
> from a sibbling into the front of the previous leaf, and an item with the
> computed previous key does not exists.
> 
> For example, suppose we have the two following leaves:
> 
>   Leaf A
> 
>   -------------------------------------------------------------
>   | ...   key (300 96 10)   key (300 96 15)   key (300 96 16) |
>   -------------------------------------------------------------
>               slot 20             slot 21             slot 22
> 
>   Leaf B
> 
>   -------------------------------------------------------------
>   | key (300 96 20)   key (300 96 21)   key (300 96 22)   ... |
>   -------------------------------------------------------------
>       slot 0             slot 1             slot 2
> 
> If we call btrfs_prev_leaf(), from btrfs_previous_item() for example, with
> a path pointing to leaf B and slot 0 and the following happens:
> 
> 1) At btrfs_prev_leaf() we compute the previous key to search as:
>    (300 96 19), which is a key that does not exists in the tree;
> 
> 2) Then we call btrfs_release_path() at btrfs_prev_leaf();
> 
> 3) Some other task inserts a key at leaf A, that sorts before the key at
>    slot 20, for example it has an objectid of 299. In order to make room
>    for the new key, the key at slot 22 is moved to the front of leaf B.
>    This happens at push_leaf_right(), called from split_leaf().
> 
>    After this leaf B now looks like:
> 
>   --------------------------------------------------------------------------------
>   | key (300 96 16)    key (300 96 20)   key (300 96 21)   key (300 96 22)   ... |
>   --------------------------------------------------------------------------------
>        slot 0              slot 1             slot 2             slot 3
> 
> 4) At btrfs_prev_leaf() we call btrfs_search_slot() for the computed
>    previous key: (300 96 19). Since the key does not exists,
>    btrfs_search_slot() returns 1 and with a path pointing to leaf B
>    and slot 1, the item with key (300 96 20);
> 
> 5) This makes btrfs_prev_leaf() return a path that points to slot 1 of
>    leaf B, the same key as before it was called, since the key at slot 0
>    of leaf B (300 96 16) is less than the computed previous key, which is
>    (300 96 19);
> 
> 6) As a consequence btrfs_previous_item() returns a path that points again
>    to the item with key (300 96 20).
> 
> For some users of btrfs_prev_leaf() or btrfs_previous_item() this may not
> be functional a problem, despite not making sense to return a new path
> pointing again to the same item/key. However for a caller such as
> tree-log.c:log_dir_items(), this has a bad consequence, as it can result
> in not logging some dir index deletions in case the directory is being
> logged without holding the inode's VFS lock (logging triggered while
> logging a child inode for example) - for the example scenario above, in
> case the dir index keys 17, 18 and 19 were deleted in the current
> transaction.

fstests(btrfs/080) had few frequency(<1/10) to fail after this pacth is applied.
but it is yet not clear enough.

# cat results//btrfs/080.out.bad
QA output created by 080
Unexpected digest for file /mnt/scratch/20_40_30_986878191_snap/foobar_63
(see /usr/hpc-bio/xfstests/results//btrfs/080.full for details)

Best Regards
Wang Yugui (wangyugui@e16-tech.com)
2023/04/14
Wang Yugui April 14, 2023, 3:43 a.m. UTC | #2
Hi,

> Hi,
> 
> > From: Filipe Manana <fdmanana@suse.com>
> > 
> > A call to btrfs_prev_leaf() may end up returning a path that points to the
> > same item (key) again. This happens if while btrfs_prev_leaf(), after we
> > release the path, a concurrent insertion happens, which moves items off
> > from a sibbling into the front of the previous leaf, and an item with the
> > computed previous key does not exists.
> > 
> > For example, suppose we have the two following leaves:
> > 
> >   Leaf A
> > 
> >   -------------------------------------------------------------
> >   | ...   key (300 96 10)   key (300 96 15)   key (300 96 16) |
> >   -------------------------------------------------------------
> >               slot 20             slot 21             slot 22
> > 
> >   Leaf B
> > 
> >   -------------------------------------------------------------
> >   | key (300 96 20)   key (300 96 21)   key (300 96 22)   ... |
> >   -------------------------------------------------------------
> >       slot 0             slot 1             slot 2
> > 
> > If we call btrfs_prev_leaf(), from btrfs_previous_item() for example, with
> > a path pointing to leaf B and slot 0 and the following happens:
> > 
> > 1) At btrfs_prev_leaf() we compute the previous key to search as:
> >    (300 96 19), which is a key that does not exists in the tree;
> > 
> > 2) Then we call btrfs_release_path() at btrfs_prev_leaf();
> > 
> > 3) Some other task inserts a key at leaf A, that sorts before the key at
> >    slot 20, for example it has an objectid of 299. In order to make room
> >    for the new key, the key at slot 22 is moved to the front of leaf B.
> >    This happens at push_leaf_right(), called from split_leaf().
> > 
> >    After this leaf B now looks like:
> > 
> >   --------------------------------------------------------------------------------
> >   | key (300 96 16)    key (300 96 20)   key (300 96 21)   key (300 96 22)   ... |
> >   --------------------------------------------------------------------------------
> >        slot 0              slot 1             slot 2             slot 3
> > 
> > 4) At btrfs_prev_leaf() we call btrfs_search_slot() for the computed
> >    previous key: (300 96 19). Since the key does not exists,
> >    btrfs_search_slot() returns 1 and with a path pointing to leaf B
> >    and slot 1, the item with key (300 96 20);
> > 
> > 5) This makes btrfs_prev_leaf() return a path that points to slot 1 of
> >    leaf B, the same key as before it was called, since the key at slot 0
> >    of leaf B (300 96 16) is less than the computed previous key, which is
> >    (300 96 19);
> > 
> > 6) As a consequence btrfs_previous_item() returns a path that points again
> >    to the item with key (300 96 20).
> > 
> > For some users of btrfs_prev_leaf() or btrfs_previous_item() this may not
> > be functional a problem, despite not making sense to return a new path
> > pointing again to the same item/key. However for a caller such as
> > tree-log.c:log_dir_items(), this has a bad consequence, as it can result
> > in not logging some dir index deletions in case the directory is being
> > logged without holding the inode's VFS lock (logging triggered while
> > logging a child inode for example) - for the example scenario above, in
> > case the dir index keys 17, 18 and 19 were deleted in the current
> > transaction.
> 
> fstests(btrfs/080) had few frequency(<1/10) to fail after this pacth is applied.
> but it is yet not clear enough.
> 
> # cat results//btrfs/080.out.bad
> QA output created by 080
> Unexpected digest for file /mnt/scratch/20_40_30_986878191_snap/foobar_63
> (see /usr/hpc-bio/xfstests/results//btrfs/080.full for details)

fstests(btrfs/080) failure happened without this patch too.

the next info update may take some long time, because of the low reproduce
frequency.

Best Regards
Wang Yugui (wangyugui@e16-tech.com)
2023/04/14
Filipe Manana April 17, 2023, 10:18 a.m. UTC | #3
On Fri, Apr 14, 2023 at 4:44 AM Wang Yugui <wangyugui@e16-tech.com> wrote:
>
> Hi,
>
> > Hi,
> >
> > > From: Filipe Manana <fdmanana@suse.com>
> > >
> > > A call to btrfs_prev_leaf() may end up returning a path that points to the
> > > same item (key) again. This happens if while btrfs_prev_leaf(), after we
> > > release the path, a concurrent insertion happens, which moves items off
> > > from a sibbling into the front of the previous leaf, and an item with the
> > > computed previous key does not exists.
> > >
> > > For example, suppose we have the two following leaves:
> > >
> > >   Leaf A
> > >
> > >   -------------------------------------------------------------
> > >   | ...   key (300 96 10)   key (300 96 15)   key (300 96 16) |
> > >   -------------------------------------------------------------
> > >               slot 20             slot 21             slot 22
> > >
> > >   Leaf B
> > >
> > >   -------------------------------------------------------------
> > >   | key (300 96 20)   key (300 96 21)   key (300 96 22)   ... |
> > >   -------------------------------------------------------------
> > >       slot 0             slot 1             slot 2
> > >
> > > If we call btrfs_prev_leaf(), from btrfs_previous_item() for example, with
> > > a path pointing to leaf B and slot 0 and the following happens:
> > >
> > > 1) At btrfs_prev_leaf() we compute the previous key to search as:
> > >    (300 96 19), which is a key that does not exists in the tree;
> > >
> > > 2) Then we call btrfs_release_path() at btrfs_prev_leaf();
> > >
> > > 3) Some other task inserts a key at leaf A, that sorts before the key at
> > >    slot 20, for example it has an objectid of 299. In order to make room
> > >    for the new key, the key at slot 22 is moved to the front of leaf B.
> > >    This happens at push_leaf_right(), called from split_leaf().
> > >
> > >    After this leaf B now looks like:
> > >
> > >   --------------------------------------------------------------------------------
> > >   | key (300 96 16)    key (300 96 20)   key (300 96 21)   key (300 96 22)   ... |
> > >   --------------------------------------------------------------------------------
> > >        slot 0              slot 1             slot 2             slot 3
> > >
> > > 4) At btrfs_prev_leaf() we call btrfs_search_slot() for the computed
> > >    previous key: (300 96 19). Since the key does not exists,
> > >    btrfs_search_slot() returns 1 and with a path pointing to leaf B
> > >    and slot 1, the item with key (300 96 20);
> > >
> > > 5) This makes btrfs_prev_leaf() return a path that points to slot 1 of
> > >    leaf B, the same key as before it was called, since the key at slot 0
> > >    of leaf B (300 96 16) is less than the computed previous key, which is
> > >    (300 96 19);
> > >
> > > 6) As a consequence btrfs_previous_item() returns a path that points again
> > >    to the item with key (300 96 20).
> > >
> > > For some users of btrfs_prev_leaf() or btrfs_previous_item() this may not
> > > be functional a problem, despite not making sense to return a new path
> > > pointing again to the same item/key. However for a caller such as
> > > tree-log.c:log_dir_items(), this has a bad consequence, as it can result
> > > in not logging some dir index deletions in case the directory is being
> > > logged without holding the inode's VFS lock (logging triggered while
> > > logging a child inode for example) - for the example scenario above, in
> > > case the dir index keys 17, 18 and 19 were deleted in the current
> > > transaction.
> >
> > fstests(btrfs/080) had few frequency(<1/10) to fail after this pacth is applied.
> > but it is yet not clear enough.
> >
> > # cat results//btrfs/080.out.bad
> > QA output created by 080
> > Unexpected digest for file /mnt/scratch/20_40_30_986878191_snap/foobar_63
> > (see /usr/hpc-bio/xfstests/results//btrfs/080.full for details)
>
> fstests(btrfs/080) failure happened without this patch too.
>
> the next info update may take some long time, because of the low reproduce
> frequency.

That test sporadically fails without the patch yes.

Also, more important: the test does not exercise at all the function
changed by this patch.
So you can save time and skip doing that pointless info update on
frequency failure...

Thanks.

>
> Best Regards
> Wang Yugui (wangyugui@e16-tech.com)
> 2023/04/14
>
>
diff mbox series

Patch

diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
index 3c983c70028a..6bdd8e42c95e 100644
--- a/fs/btrfs/ctree.c
+++ b/fs/btrfs/ctree.c
@@ -4478,10 +4478,12 @@  int btrfs_del_items(struct btrfs_trans_handle *trans, struct btrfs_root *root,
 int btrfs_prev_leaf(struct btrfs_root *root, struct btrfs_path *path)
 {
 	struct btrfs_key key;
+	struct btrfs_key orig_key;
 	struct btrfs_disk_key found_key;
 	int ret;
 
 	btrfs_item_key_to_cpu(path->nodes[0], &key, 0);
+	orig_key = key;
 
 	if (key.offset > 0) {
 		key.offset--;
@@ -4498,8 +4500,36 @@  int btrfs_prev_leaf(struct btrfs_root *root, struct btrfs_path *path)
 
 	btrfs_release_path(path);
 	ret = btrfs_search_slot(NULL, root, &key, path, 0, 0);
-	if (ret < 0)
+	if (ret <= 0)
 		return ret;
+
+	/*
+	 * Previous key not found. Even if we were at slot 0 of the leaf we had
+	 * before releasing the path and calling btrfs_search_slot(), we now may
+	 * be in a slot pointing to the same original key - this can happen if
+	 * after we released the path, one of more items were moved from a
+	 * sibbling leaf into the front of the leaf we had due to an insertion
+	 * (see push_leaf_right()).
+	 * If we hit this case and our slot is > 0 and just decrement the slot
+	 * so that the caller does not process the same key again, which may or
+	 * may not break the caller, depending on its logic.
+	 */
+	if (path->slots[0] < btrfs_header_nritems(path->nodes[0])) {
+		btrfs_item_key(path->nodes[0], &found_key, path->slots[0]);
+		ret = comp_keys(&found_key, &orig_key);
+		if (ret == 0) {
+			if (path->slots[0] > 0) {
+				path->slots[0]--;
+				return 0;
+			}
+			/*
+			 * At slot 0, same key as before, it means orig_key is
+			 * the lowest, leftmost, key in the tree. We're done.
+			 */
+			return 1;
+		}
+	}
+
 	btrfs_item_key(path->nodes[0], &found_key, 0);
 	ret = comp_keys(&found_key, &key);
 	/*