Message ID | e6eaa082d536fc5223eae4627bc7dc6369e257d9.1681295111.git.fdmanana@suse.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | btrfs: fix for btrfs_prev_leaf() and unexport it | expand |
Hi, > From: Filipe Manana <fdmanana@suse.com> > > A call to btrfs_prev_leaf() may end up returning a path that points to the > same item (key) again. This happens if while btrfs_prev_leaf(), after we > release the path, a concurrent insertion happens, which moves items off > from a sibbling into the front of the previous leaf, and an item with the > computed previous key does not exists. > > For example, suppose we have the two following leaves: > > Leaf A > > ------------------------------------------------------------- > | ... key (300 96 10) key (300 96 15) key (300 96 16) | > ------------------------------------------------------------- > slot 20 slot 21 slot 22 > > Leaf B > > ------------------------------------------------------------- > | key (300 96 20) key (300 96 21) key (300 96 22) ... | > ------------------------------------------------------------- > slot 0 slot 1 slot 2 > > If we call btrfs_prev_leaf(), from btrfs_previous_item() for example, with > a path pointing to leaf B and slot 0 and the following happens: > > 1) At btrfs_prev_leaf() we compute the previous key to search as: > (300 96 19), which is a key that does not exists in the tree; > > 2) Then we call btrfs_release_path() at btrfs_prev_leaf(); > > 3) Some other task inserts a key at leaf A, that sorts before the key at > slot 20, for example it has an objectid of 299. In order to make room > for the new key, the key at slot 22 is moved to the front of leaf B. > This happens at push_leaf_right(), called from split_leaf(). > > After this leaf B now looks like: > > -------------------------------------------------------------------------------- > | key (300 96 16) key (300 96 20) key (300 96 21) key (300 96 22) ... | > -------------------------------------------------------------------------------- > slot 0 slot 1 slot 2 slot 3 > > 4) At btrfs_prev_leaf() we call btrfs_search_slot() for the computed > previous key: (300 96 19). Since the key does not exists, > btrfs_search_slot() returns 1 and with a path pointing to leaf B > and slot 1, the item with key (300 96 20); > > 5) This makes btrfs_prev_leaf() return a path that points to slot 1 of > leaf B, the same key as before it was called, since the key at slot 0 > of leaf B (300 96 16) is less than the computed previous key, which is > (300 96 19); > > 6) As a consequence btrfs_previous_item() returns a path that points again > to the item with key (300 96 20). > > For some users of btrfs_prev_leaf() or btrfs_previous_item() this may not > be functional a problem, despite not making sense to return a new path > pointing again to the same item/key. However for a caller such as > tree-log.c:log_dir_items(), this has a bad consequence, as it can result > in not logging some dir index deletions in case the directory is being > logged without holding the inode's VFS lock (logging triggered while > logging a child inode for example) - for the example scenario above, in > case the dir index keys 17, 18 and 19 were deleted in the current > transaction. fstests(btrfs/080) had few frequency(<1/10) to fail after this pacth is applied. but it is yet not clear enough. # cat results//btrfs/080.out.bad QA output created by 080 Unexpected digest for file /mnt/scratch/20_40_30_986878191_snap/foobar_63 (see /usr/hpc-bio/xfstests/results//btrfs/080.full for details) Best Regards Wang Yugui (wangyugui@e16-tech.com) 2023/04/14
Hi, > Hi, > > > From: Filipe Manana <fdmanana@suse.com> > > > > A call to btrfs_prev_leaf() may end up returning a path that points to the > > same item (key) again. This happens if while btrfs_prev_leaf(), after we > > release the path, a concurrent insertion happens, which moves items off > > from a sibbling into the front of the previous leaf, and an item with the > > computed previous key does not exists. > > > > For example, suppose we have the two following leaves: > > > > Leaf A > > > > ------------------------------------------------------------- > > | ... key (300 96 10) key (300 96 15) key (300 96 16) | > > ------------------------------------------------------------- > > slot 20 slot 21 slot 22 > > > > Leaf B > > > > ------------------------------------------------------------- > > | key (300 96 20) key (300 96 21) key (300 96 22) ... | > > ------------------------------------------------------------- > > slot 0 slot 1 slot 2 > > > > If we call btrfs_prev_leaf(), from btrfs_previous_item() for example, with > > a path pointing to leaf B and slot 0 and the following happens: > > > > 1) At btrfs_prev_leaf() we compute the previous key to search as: > > (300 96 19), which is a key that does not exists in the tree; > > > > 2) Then we call btrfs_release_path() at btrfs_prev_leaf(); > > > > 3) Some other task inserts a key at leaf A, that sorts before the key at > > slot 20, for example it has an objectid of 299. In order to make room > > for the new key, the key at slot 22 is moved to the front of leaf B. > > This happens at push_leaf_right(), called from split_leaf(). > > > > After this leaf B now looks like: > > > > -------------------------------------------------------------------------------- > > | key (300 96 16) key (300 96 20) key (300 96 21) key (300 96 22) ... | > > -------------------------------------------------------------------------------- > > slot 0 slot 1 slot 2 slot 3 > > > > 4) At btrfs_prev_leaf() we call btrfs_search_slot() for the computed > > previous key: (300 96 19). Since the key does not exists, > > btrfs_search_slot() returns 1 and with a path pointing to leaf B > > and slot 1, the item with key (300 96 20); > > > > 5) This makes btrfs_prev_leaf() return a path that points to slot 1 of > > leaf B, the same key as before it was called, since the key at slot 0 > > of leaf B (300 96 16) is less than the computed previous key, which is > > (300 96 19); > > > > 6) As a consequence btrfs_previous_item() returns a path that points again > > to the item with key (300 96 20). > > > > For some users of btrfs_prev_leaf() or btrfs_previous_item() this may not > > be functional a problem, despite not making sense to return a new path > > pointing again to the same item/key. However for a caller such as > > tree-log.c:log_dir_items(), this has a bad consequence, as it can result > > in not logging some dir index deletions in case the directory is being > > logged without holding the inode's VFS lock (logging triggered while > > logging a child inode for example) - for the example scenario above, in > > case the dir index keys 17, 18 and 19 were deleted in the current > > transaction. > > fstests(btrfs/080) had few frequency(<1/10) to fail after this pacth is applied. > but it is yet not clear enough. > > # cat results//btrfs/080.out.bad > QA output created by 080 > Unexpected digest for file /mnt/scratch/20_40_30_986878191_snap/foobar_63 > (see /usr/hpc-bio/xfstests/results//btrfs/080.full for details) fstests(btrfs/080) failure happened without this patch too. the next info update may take some long time, because of the low reproduce frequency. Best Regards Wang Yugui (wangyugui@e16-tech.com) 2023/04/14
On Fri, Apr 14, 2023 at 4:44 AM Wang Yugui <wangyugui@e16-tech.com> wrote: > > Hi, > > > Hi, > > > > > From: Filipe Manana <fdmanana@suse.com> > > > > > > A call to btrfs_prev_leaf() may end up returning a path that points to the > > > same item (key) again. This happens if while btrfs_prev_leaf(), after we > > > release the path, a concurrent insertion happens, which moves items off > > > from a sibbling into the front of the previous leaf, and an item with the > > > computed previous key does not exists. > > > > > > For example, suppose we have the two following leaves: > > > > > > Leaf A > > > > > > ------------------------------------------------------------- > > > | ... key (300 96 10) key (300 96 15) key (300 96 16) | > > > ------------------------------------------------------------- > > > slot 20 slot 21 slot 22 > > > > > > Leaf B > > > > > > ------------------------------------------------------------- > > > | key (300 96 20) key (300 96 21) key (300 96 22) ... | > > > ------------------------------------------------------------- > > > slot 0 slot 1 slot 2 > > > > > > If we call btrfs_prev_leaf(), from btrfs_previous_item() for example, with > > > a path pointing to leaf B and slot 0 and the following happens: > > > > > > 1) At btrfs_prev_leaf() we compute the previous key to search as: > > > (300 96 19), which is a key that does not exists in the tree; > > > > > > 2) Then we call btrfs_release_path() at btrfs_prev_leaf(); > > > > > > 3) Some other task inserts a key at leaf A, that sorts before the key at > > > slot 20, for example it has an objectid of 299. In order to make room > > > for the new key, the key at slot 22 is moved to the front of leaf B. > > > This happens at push_leaf_right(), called from split_leaf(). > > > > > > After this leaf B now looks like: > > > > > > -------------------------------------------------------------------------------- > > > | key (300 96 16) key (300 96 20) key (300 96 21) key (300 96 22) ... | > > > -------------------------------------------------------------------------------- > > > slot 0 slot 1 slot 2 slot 3 > > > > > > 4) At btrfs_prev_leaf() we call btrfs_search_slot() for the computed > > > previous key: (300 96 19). Since the key does not exists, > > > btrfs_search_slot() returns 1 and with a path pointing to leaf B > > > and slot 1, the item with key (300 96 20); > > > > > > 5) This makes btrfs_prev_leaf() return a path that points to slot 1 of > > > leaf B, the same key as before it was called, since the key at slot 0 > > > of leaf B (300 96 16) is less than the computed previous key, which is > > > (300 96 19); > > > > > > 6) As a consequence btrfs_previous_item() returns a path that points again > > > to the item with key (300 96 20). > > > > > > For some users of btrfs_prev_leaf() or btrfs_previous_item() this may not > > > be functional a problem, despite not making sense to return a new path > > > pointing again to the same item/key. However for a caller such as > > > tree-log.c:log_dir_items(), this has a bad consequence, as it can result > > > in not logging some dir index deletions in case the directory is being > > > logged without holding the inode's VFS lock (logging triggered while > > > logging a child inode for example) - for the example scenario above, in > > > case the dir index keys 17, 18 and 19 were deleted in the current > > > transaction. > > > > fstests(btrfs/080) had few frequency(<1/10) to fail after this pacth is applied. > > but it is yet not clear enough. > > > > # cat results//btrfs/080.out.bad > > QA output created by 080 > > Unexpected digest for file /mnt/scratch/20_40_30_986878191_snap/foobar_63 > > (see /usr/hpc-bio/xfstests/results//btrfs/080.full for details) > > fstests(btrfs/080) failure happened without this patch too. > > the next info update may take some long time, because of the low reproduce > frequency. That test sporadically fails without the patch yes. Also, more important: the test does not exercise at all the function changed by this patch. So you can save time and skip doing that pointless info update on frequency failure... Thanks. > > Best Regards > Wang Yugui (wangyugui@e16-tech.com) > 2023/04/14 > >
diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c index 3c983c70028a..6bdd8e42c95e 100644 --- a/fs/btrfs/ctree.c +++ b/fs/btrfs/ctree.c @@ -4478,10 +4478,12 @@ int btrfs_del_items(struct btrfs_trans_handle *trans, struct btrfs_root *root, int btrfs_prev_leaf(struct btrfs_root *root, struct btrfs_path *path) { struct btrfs_key key; + struct btrfs_key orig_key; struct btrfs_disk_key found_key; int ret; btrfs_item_key_to_cpu(path->nodes[0], &key, 0); + orig_key = key; if (key.offset > 0) { key.offset--; @@ -4498,8 +4500,36 @@ int btrfs_prev_leaf(struct btrfs_root *root, struct btrfs_path *path) btrfs_release_path(path); ret = btrfs_search_slot(NULL, root, &key, path, 0, 0); - if (ret < 0) + if (ret <= 0) return ret; + + /* + * Previous key not found. Even if we were at slot 0 of the leaf we had + * before releasing the path and calling btrfs_search_slot(), we now may + * be in a slot pointing to the same original key - this can happen if + * after we released the path, one of more items were moved from a + * sibbling leaf into the front of the leaf we had due to an insertion + * (see push_leaf_right()). + * If we hit this case and our slot is > 0 and just decrement the slot + * so that the caller does not process the same key again, which may or + * may not break the caller, depending on its logic. + */ + if (path->slots[0] < btrfs_header_nritems(path->nodes[0])) { + btrfs_item_key(path->nodes[0], &found_key, path->slots[0]); + ret = comp_keys(&found_key, &orig_key); + if (ret == 0) { + if (path->slots[0] > 0) { + path->slots[0]--; + return 0; + } + /* + * At slot 0, same key as before, it means orig_key is + * the lowest, leftmost, key in the tree. We're done. + */ + return 1; + } + } + btrfs_item_key(path->nodes[0], &found_key, 0); ret = comp_keys(&found_key, &key); /*