diff mbox series

[v2] btrfs: do not batch insert non-consecutive dir indexes during log replay

Message ID a69adbc22b4b4340a7289d8b9bbb9878d6c00192.1658411151.git.josef@toxicpanda.com (mailing list archive)
State New, archived
Headers show
Series [v2] btrfs: do not batch insert non-consecutive dir indexes during log replay | expand

Commit Message

Josef Bacik July 21, 2022, 1:47 p.m. UTC
While running generic/475 in a loop I got the following error

BTRFS critical (device dm-11): corrupt leaf: root=5 block=31096832 slot=69, bad key order, prev (263 96 531) current (263 96 524)
<snip>
 item 65 key (263 96 517) itemoff 14132 itemsize 33
 item 66 key (263 96 523) itemoff 14099 itemsize 33
 item 67 key (263 96 525) itemoff 14066 itemsize 33
 item 68 key (263 96 531) itemoff 14033 itemsize 33
 item 69 key (263 96 524) itemoff 14000 itemsize 33

As you can see here we have 3 dir index keys with the dir index value of
523, 524, and 525 inserted between 517 and 524.  This occurs because our
dir index insertion code will bulk insert all dir index items on the
node regardless of their actual key value.

This makes sense on a normally running system, because if there's a gap
in between the items there was a deletion before the item was inserted,
so there's not going to be an overlap of the dir index items that need
to be inserted and what exists on disk.

However during log replay this isn't necessarily true, we could have any
number of dir indexes in the tree already.

Fix this by seeing if we're replaying the log, and if we are simply skip
batching if there's a gap in the key space.

This file system was left broken from the fstest, I tested this patch
against the broken fs to make sure it replayed the log properly, and
then btrfs check'ed the file system after the log replay to verify
everything was ok.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/delayed-inode.c | 35 +++++++++++++++++++++++++++++++++--
 1 file changed, 33 insertions(+), 2 deletions(-)

Comments

Sweet Tea Dorminy July 21, 2022, 2:32 p.m. UTC | #1
On 7/21/22 09:47, Josef Bacik wrote:
> While running generic/475 in a loop I got the following error
> 
> BTRFS critical (device dm-11): corrupt leaf: root=5 block=31096832 slot=69, bad key order, prev (263 96 531) current (263 96 524)
> <snip>
>   item 65 key (263 96 517) itemoff 14132 itemsize 33
>   item 66 key (263 96 523) itemoff 14099 itemsize 33
>   item 67 key (263 96 525) itemoff 14066 itemsize 33
>   item 68 key (263 96 531) itemoff 14033 itemsize 33
>   item 69 key (263 96 524) itemoff 14000 itemsize 33
> 
> As you can see here we have 3 dir index keys with the dir index value of
> 523, 524, and 525 inserted between 517 and 524.  This occurs because our
> dir index insertion code will bulk insert all dir index items on the
> node regardless of their actual key value.
> 
> This makes sense on a normally running system, because if there's a gap
> in between the items there was a deletion before the item was inserted,
> so there's not going to be an overlap of the dir index items that need
> to be inserted and what exists on disk.
> 
> However during log replay this isn't necessarily true, we could have any
> number of dir indexes in the tree already.
> 
> Fix this by seeing if we're replaying the log, and if we are simply skip
> batching if there's a gap in the key space.
> 
> This file system was left broken from the fstest, I tested this patch
> against the broken fs to make sure it replayed the log properly, and
> then btrfs check'ed the file system after the log replay to verify
> everything was ok.
> 
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
> Reviewed-by: Filipe Manana <fdmanana@suse.com>
> ---
>   fs/btrfs/delayed-inode.c | 35 +++++++++++++++++++++++++++++++++--
>   1 file changed, 33 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/btrfs/delayed-inode.c b/fs/btrfs/delayed-inode.c
> index 823aa05b3e38..e7f34871a132 100644
> --- a/fs/btrfs/delayed-inode.c
> +++ b/fs/btrfs/delayed-inode.c
> @@ -691,9 +691,22 @@ static int btrfs_insert_delayed_item(struct btrfs_trans_handle *trans,
>   	int total_size;
>   	char *ins_data = NULL;
>   	int ret;
> +	bool continuous_keys_only = false;
>   
>   	lockdep_assert_held(&node->mutex);
>   
> +	/*
> +	 * During normal operation the delayed index offset is continuously
> +	 * increasing, so we can batch insert all items as there will not be any
> +	 * overlapping keys in the tree.
> +	 *
> +	 * The exception to this is log replay, where we may have interleaved
> +	 * offsets in the tree, so our batch needs to be continuous keys only in
> +	 * order to ensure we do not end up with out of order items in our leaf.
> +	 */
> +	if (test_bit(BTRFS_FS_LOG_RECOVERING, &fs_info->flags))
> +		continuous_keys_only = true;
> +
>   	/*
>   	 * For delayed items to insert, we track reserved metadata bytes based
>   	 * on the number of leaves that we will use.
> @@ -715,6 +728,14 @@ static int btrfs_insert_delayed_item(struct btrfs_trans_handle *trans,
>   		if (!next)
>   			break;
>   
> +		/*
> +		 * We cannot allow gaps in the key space if we're doing log
> +		 * replay.
> +		 */
> +		if (continuous_keys_only &&
> +		    (next->key.offset != curr->key.offset + 1))
> +			break;
> +
>   		ASSERT(next->bytes_reserved == 0);
>   
>   		next_size = next->data_len + sizeof(struct btrfs_item);
> @@ -775,7 +796,17 @@ static int btrfs_insert_delayed_item(struct btrfs_trans_handle *trans,
>   
>   	ASSERT(node->index_item_leaves > 0);
>   
> -	if (next) {
> +	/*
> +	 * For normal operations we will batch an entire leaf's worth of delayed
> +	 * items, so if there are more items to process we can decrement
> +	 * index_item_leaves by 1 as we inserted 1 leaf's worth of items.
> +	 *
> +	 * However for log replay we may not have inserted an entire leaf's
> +	 * worth of items, we may have not had continuous items, so decrementing
> +	 * here would mess up the index_item_leaves accounting.  For this case
> +	 * only clean up the accounting when there are no items left.
> +	 */
> +	if (next && !continuous_keys_only) {
>   		/*
>   		 * We inserted one batch of items into a leaf a there are more
>   		 * items to flush in a future batch, now release one unit of
> @@ -784,7 +815,7 @@ static int btrfs_insert_delayed_item(struct btrfs_trans_handle *trans,
>   		 */
>   		btrfs_delayed_item_release_leaves(node, 1);
>   		node->index_item_leaves--;
> -	} else {
> +	} else if (!next) {
>   		/*
>   		 * There are no more items to insert. We can have a number of
>   		 * reserved leaves > 1 here - this happens when many dir index
Looks great, thank you!
David Sterba July 25, 2022, 3:44 p.m. UTC | #2
On Thu, Jul 21, 2022 at 09:47:39AM -0400, Josef Bacik wrote:
> While running generic/475 in a loop I got the following error
> 
> BTRFS critical (device dm-11): corrupt leaf: root=5 block=31096832 slot=69, bad key order, prev (263 96 531) current (263 96 524)
> <snip>
>  item 65 key (263 96 517) itemoff 14132 itemsize 33
>  item 66 key (263 96 523) itemoff 14099 itemsize 33
>  item 67 key (263 96 525) itemoff 14066 itemsize 33
>  item 68 key (263 96 531) itemoff 14033 itemsize 33
>  item 69 key (263 96 524) itemoff 14000 itemsize 33
> 
> As you can see here we have 3 dir index keys with the dir index value of
> 523, 524, and 525 inserted between 517 and 524.  This occurs because our
> dir index insertion code will bulk insert all dir index items on the
> node regardless of their actual key value.
> 
> This makes sense on a normally running system, because if there's a gap
> in between the items there was a deletion before the item was inserted,
> so there's not going to be an overlap of the dir index items that need
> to be inserted and what exists on disk.
> 
> However during log replay this isn't necessarily true, we could have any
> number of dir indexes in the tree already.
> 
> Fix this by seeing if we're replaying the log, and if we are simply skip
> batching if there's a gap in the key space.
> 
> This file system was left broken from the fstest, I tested this patch
> against the broken fs to make sure it replayed the log properly, and
> then btrfs check'ed the file system after the log replay to verify
> everything was ok.
> 
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
> Reviewed-by: Filipe Manana <fdmanana@suse.com>

Thanks, add to misc-next.
diff mbox series

Patch

diff --git a/fs/btrfs/delayed-inode.c b/fs/btrfs/delayed-inode.c
index 823aa05b3e38..e7f34871a132 100644
--- a/fs/btrfs/delayed-inode.c
+++ b/fs/btrfs/delayed-inode.c
@@ -691,9 +691,22 @@  static int btrfs_insert_delayed_item(struct btrfs_trans_handle *trans,
 	int total_size;
 	char *ins_data = NULL;
 	int ret;
+	bool continuous_keys_only = false;
 
 	lockdep_assert_held(&node->mutex);
 
+	/*
+	 * During normal operation the delayed index offset is continuously
+	 * increasing, so we can batch insert all items as there will not be any
+	 * overlapping keys in the tree.
+	 *
+	 * The exception to this is log replay, where we may have interleaved
+	 * offsets in the tree, so our batch needs to be continuous keys only in
+	 * order to ensure we do not end up with out of order items in our leaf.
+	 */
+	if (test_bit(BTRFS_FS_LOG_RECOVERING, &fs_info->flags))
+		continuous_keys_only = true;
+
 	/*
 	 * For delayed items to insert, we track reserved metadata bytes based
 	 * on the number of leaves that we will use.
@@ -715,6 +728,14 @@  static int btrfs_insert_delayed_item(struct btrfs_trans_handle *trans,
 		if (!next)
 			break;
 
+		/*
+		 * We cannot allow gaps in the key space if we're doing log
+		 * replay.
+		 */
+		if (continuous_keys_only &&
+		    (next->key.offset != curr->key.offset + 1))
+			break;
+
 		ASSERT(next->bytes_reserved == 0);
 
 		next_size = next->data_len + sizeof(struct btrfs_item);
@@ -775,7 +796,17 @@  static int btrfs_insert_delayed_item(struct btrfs_trans_handle *trans,
 
 	ASSERT(node->index_item_leaves > 0);
 
-	if (next) {
+	/*
+	 * For normal operations we will batch an entire leaf's worth of delayed
+	 * items, so if there are more items to process we can decrement
+	 * index_item_leaves by 1 as we inserted 1 leaf's worth of items.
+	 *
+	 * However for log replay we may not have inserted an entire leaf's
+	 * worth of items, we may have not had continuous items, so decrementing
+	 * here would mess up the index_item_leaves accounting.  For this case
+	 * only clean up the accounting when there are no items left.
+	 */
+	if (next && !continuous_keys_only) {
 		/*
 		 * We inserted one batch of items into a leaf a there are more
 		 * items to flush in a future batch, now release one unit of
@@ -784,7 +815,7 @@  static int btrfs_insert_delayed_item(struct btrfs_trans_handle *trans,
 		 */
 		btrfs_delayed_item_release_leaves(node, 1);
 		node->index_item_leaves--;
-	} else {
+	} else if (!next) {
 		/*
 		 * There are no more items to insert. We can have a number of
 		 * reserved leaves > 1 here - this happens when many dir index